Natural Language Processing - Predicting Fake News

SALOME SONYA LOMSADZE
8 min read · Jul 18, 2022

A- NLP Basics

NLP is a field of study focused on making sense of text and audio using statistics and computation. This workbook includes:

  • Topic identification
  • Text Classification
  • Chatbots
  • Translation
  • Sentiment Analysis

1- RegEx

What are regular expressions for?

  • Find all web links in documents or texts
  • Remove/replace unwanted characters
  • Match patterns in strings
  • Parse email addresses

Methods: re.match(), re.search(), re.split(), re.findall()

import re
string = 'This task takes a lot of hard work.\nSo, be ready about it!'
re.match('ready', string)

The code returns None.

re.search('ready', string)
<re.Match object; span=(43, 48), match='ready'>

re.search() returns a match, but re.match() returns None because 'ready' appears on the second line of the string, and re.match() only matches at the beginning of the string.

pattern = r'\w'   # match the first character
print(re.match(pattern, string))
pattern = r'\w+'  # match the first word
print(re.match(pattern, string))
pattern = r'\w.*' # match everything up to the end of the first line (. does not match \n)
print(re.match(pattern, string))
<re.Match object; span=(0, 1), match='T'>
<re.Match object; span=(0, 4), match='This'>
<re.Match object; span=(0, 35), match='This task takes a lot of hard work.'>
re.split(r'\s+', 'Split on spaces.')
['Split', 'on', 'spaces.']
pattern = r'\w+'  # find all words
print(re.findall(pattern, string))
['This', 'task', 'takes', 'a', 'lot', 'of', 'hard', 'work', 'So', 'be', 'ready', 'about', 'it']

'\n' and r'\n' are different: print('\n') outputs a blank line, whereas print(r'\n') prints the literal characters \n.
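
A quick check makes the difference visible:

print('\n')   # the escape sequence: print adds its own newline, so you get an extra blank line
print(r'\n')  # the raw string: the two characters \n are printed literally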

2- Word & Sentence Tokenization

  • word_tokenize() splits text into word and punctuation tokens.
  • sent_tokenize() splits documents into sentences.
  • Tokenization is the first step in NLP processing.

I'll be using Monty Python's Holy Grail text. It can be obtained with the from nltk.book import * command, which also loads eight other books.

scene_one:

“SCENE 1: [wind] [clop clop clop] \nKING ARTHUR: Whoa there! [clop clop clop] \nSOLDIER #1: Halt! Who goes there?\nARTHUR: It is I, Arthur, son of Uther Pendragon, from the castle of Camelot. King of the Britons, defeator of the Saxons, sovereign of all England!\nSOLDIER #1: Pull the other one!\nARTHUR: I am, … and this is my trusty servant Patsy. We have ridden the length and breadth of the land in search of knights who will join me in my court at Camelot. I must speak with your lord and master.\nSOLDIER #1: What? Ridden on a horse?\nARTHUR: Yes!\nSOLDIER #1: You’re using coconuts!\nARTHUR: What?\nSOLDIER #1: You’ve got two empty halves of coconut and you’re bangin’ and so on.
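
The raw script isn't loaded in the snippets of this post; one way to obtain it (an assumption on my part) is NLTK's webtext corpus, which ships the full Holy Grail script as grail.txt. This also defines the hg_raw variable used further below:

import nltk
nltk.download('webtext')
nltk.download('punkt')
from nltk.corpus import webtext
from nltk.tokenize import sent_tokenize, word_tokenize

hg_raw = webtext.raw('grail.txt')        # the full Monty Python and the Holy Grail script
scene_one = hg_raw.split('SCENE 2:')[0]  # assumption: everything before SCENE 2 is scene one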

# split scene_one into sentences: sentences
sentences=sent_tokenize(scene_one)
# make a set of unique tokens in the entire scene: unique_tokens
unique_tokens=set(word_tokenize(scene_one))
print(unique_tokens)
{'swallows', 'go', 'Pull', 'in', '...', 'lord', 'knights', ']', 'times', 'trusty', 'covered', 'husk', 'weight', "'em", 'SCENE', 'temperate', 'martin', 'beat', 'They', 'right', 'my', 'yet', 'forty-three', 'ounce', 'have', 'minute', 'together', 'by', 'there', 'migrate', "'", 'King', 'The', 'carried', 'use', 'sun', 'an', 'land', 'kingdom', 'under', 'back', 'You', 'Mercea', 'That', 'Wait', 'not', 'grip', ':', 'matter', 'plover', 'all', 'carry', 'every', 'second', 'Pendragon', 'will', "'m", 'castle', 'ask', 'agree', 'to', 'got', 'be', 'with', 'defeator', 'strangers', 'five', 'a', 'do', 'European', 'England', 'bird', 'African', 'house', 'But', 'Who', '?', 'them', 'ARTHUR', 'coconuts', 'just', '[', 'horse', 'non-migratory', 'sovereign', 'warmer', 'is', 'Are', 'grips', 'here', 'our', 'wings', 'Camelot', 'this', 'simple', '#', 'held', 'Whoa', '!', 'climes', 'Britons', 'maybe', '1', 'Well', 'Where', 'or', 'he', 'strand', 'of', 'search', 'length', 'mean', 'the', 'Court', 'at', 'interested', 'since', 'then', 'bangin', 'Listen', 'ridden', 'your', "n't", 'course', 'I', 'breadth', 'tropical', 'creeper', 'could', 'clop', 'you', 'dorsal', 'it', 'No', "'s", 'Patsy', 'It', 'seek', 'master', 'speak', 'bring', 'two', 'Not', 'carrying', 'these', 'other', 'SOLDIER', 'son', 'that', 'zone', 'Oh', 'Ridden', 'So', 'Saxons', 'fly', 'needs', "'d", 'does', 'its', 'found', 'may', 'wind', 'snows', 'Found', "'re", '--', 'on', 'me', 'A', 'and', 'yeah', 'point', 'south', 'Supposing', 'Halt', 'line', 'feathers', 'using', 'swallow', '2', 'why', ',', 'wants', 'goes', 'Uther', 'What', 'one', 'guiding', 'pound', 'Please', 'Yes', 'empty', 'from', 'where', 'maintain', 'who', 'court', 'coconut', 'but', 'through', 'get', 'anyway', 'Am', 'halves', 'In', 'velocity', 'suggesting', 'winter', 'Will', 'We', 'must', 'tell', 'if', 'they', 'question', 'KING', 'am', 'ratios', 'Arthur', 'are', 'servant', "'ve", 'join', 'order', '.', 'air-speed'}

Tokenization is the fundamental step toward information retrieval and text mining projects!

# search for anything in square brackets: pattern
pattern = r'\[.*]'
re.search(pattern, scene_one)
<re.Match object; span=(9, 32), match='[wind] [clop clop clop]'>
# search for script notation (e.g. Character:)
pattern=r"[\w\s]+:"
re.search(pattern, sentences[3])
<re.Match object; span=(0, 7), match='ARTHUR:'>
  • r'[sth]': a character class, i.e. a group of characters any one of which can match
  • r'\[sth]': the escaped bracket matches a literal square bracket (see the quick check below)
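
A quick check of the difference (the example strings here are made up):

print(re.findall(r'[sth]', 'shot'))           # ['s', 'h', 't'] -- any one of s, t or h
print(re.findall(r'\[sth]', 'a [sth] note'))  # ['[sth]'] -- the literal bracketed text
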
import matplotlib.pyplot as plt
from nltk.tokenize import regexp_tokenize
word_lengths = [len(w) for w in unique_tokens]
plt.hist(word_lengths)
# Split the text into lines: lines
lines = re.split(r'\n', hg_raw)  # or you can just use hg_raw.split('\n')
# Remove speakers from each line
pattern = r"[A-Z]{2,}(\s)?(#\d)?([A-Z]{2,})?:"
lines = [re.sub(pattern, '', l) for l in lines]
# Tokenize each line: tokenized_lines
tokenized_lines = [regexp_tokenize(l, r"\w+") for l in lines]
# Create a frequency list of words per line: line_num_words
line_num_words = [len(t) for t in tokenized_lines]
# Just plot it
plt.hist(line_num_words)
plt.show()

Parsing Tweets

from nltk.tokenize import regexp_tokenize
from nltk.tokenize import TweetTokenizer
tweet = 'Continuing to move the game forward @KingJames. Much respect my brother Flexed biceps #33644' # RIP Kobe
pattern = r'(#\d+|@\w+)'  # similar to r'([#@]\w+)'
re.findall(pattern, tweet)
['@KingJames', '#33644']

Try it with regexp_tokenize(); unlike the re functions, regexp_tokenize() takes the text first and the pattern as its second argument.

hashtags_and_mentions=regexp_tokenize(tweet, pattern)
hashtags_and_mentions
['@KingJames', '#33644']

Use TweetTokenizer to tokenize the whole tweet into one list of tokens:

tknzr = TweetTokenizer()
all_tokens = tknzr.tokenize(tweet)
print(all_tokens)
['Continuing', 'to', 'move', 'the', 'game', 'forward', '@KingJames', '.', 'Much', 'respect', 'my', 'brother', 'Flexed', 'biceps', '#33644']

3- Topic Identification: Bag of Words, Lemmatization, isalpha(), Counter(), Wikipedia

Bag of Words (BOW) is a basic method for topic classification. It is a decent way to find the significant words in a text. The logic is simple: the more frequent a word, the more important it might be.
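
As a minimal toy illustration (the sentence is made up), counting words is all there is to it:

from collections import Counter

# count word frequencies in a tiny made-up text
Counter("the cat sat on the mat because the cat was tired".split()).most_common(3)
[('the', 3), ('cat', 2), ('sat', 1)]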

import wikipedia
turkey = wikipedia.page("Turkey").content
turkey

'Turkey (Turkish: Türkiye [ˈtyɾcije]), officially the Republic of Turkey, is a country straddling Southeastern Europe and Western Asia. It shares borders with Greece and Bulgaria to the northwest; the Black Sea to the north; Georgia to the northeast; Armenia, Azerbaijan, and Iran to the east; Iraq to the southeast; Syria and the Mediterranean Sea to the south; and the Aegean Sea to the west. Istanbul, the largest city, is the financial centre, and Ankara is the capital. Turks form the vast majority of the nation\'s population, and Kurds...

import nltk
from collections import Counter
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
alpha_only = [t.lower() for t in word_tokenize(turkey) if t.isalpha()]
no_stops = [t for t in alpha_only if t not in stopwords.words('english')]
wordnet_lemmatizer = WordNetLemmatizer()
lemmatized = [wordnet_lemmatizer.lemmatize(t) for t in no_stops]
Counter(lemmatized).most_common(10)
[('turkey', 236),
('turkish', 127),
('ottoman', 74),
('world', 49),
('century', 49),
('empire', 46),
('country', 45),
('anatolia', 41),
('war', 40),
('first', 38)]

It is obvious that the article is about Turkey. The isalpha() method checks whether all characters in a string are letters. For example:

txt = "isn't"
x = txt.isalpha()
print(x)
False

Lemmatization is the process of reducing a word to its lemma, a meaningful base form. It differs from stemming, which simply strips word endings and may leave something that is not a real word.
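
A quick comparison, using NLTK's PorterStemmer for the stemming side (my own example, not from the original workbook):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(stemmer.stem('studies'))                  # 'studi' -- chopped ending, not a real word
print(lemmatizer.lemmatize('studies'))          # 'study' -- a meaningful base form
print(lemmatizer.lemmatize('better', pos='a'))  # 'good'  -- lemmatization can use a POS hint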

4- Introduction to Gensim

Why gensim?

  • Building document or word vectors.
  • Performing topic identification and document comparison.

Word vectors are multi-dimensional mathematical representations of words, built so that similar words end up with similar vectors.
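
For instance, pretrained GloVe vectors can be pulled through gensim's downloader API; the model name below is one of the publicly available options and is a sizeable download on the first run:

import gensim.downloader as api

glove = api.load('glove-wiki-gigaword-50')   # 50-dimensional GloVe vectors
print(glove['king'][:5])                     # the first 5 of the 50 dimensions
print(glove.most_similar('king', topn=3))    # nearest neighbours in the vector space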

I will create a list of processed document tokens from the Turkey, Georgia, and Iran Wikipedia articles.

georgia = wikipedia.page("Georgia (country)").content
iran = wikipedia.page("Iran").content
countries = [georgia, turkey, iran]
all_countries = []
wordnet_lm = WordNetLemmatizer()
for i in countries:
    alpha = [t.lower() for t in word_tokenize(i) if t.isalpha()]
    no_stops = [t for t in alpha if t not in stopwords.words('english')]
    lemmatized = [wordnet_lm.lemmatize(t) for t in no_stops]
    all_countries.append(lemmatized)
len(all_countries)
3

The length of the list of lists is 3. Our list consists of word tokens of three different Wikipedia articles.

Creating a Gensim corpus

import gensim.downloader as api
from gensim.models import TfidfModel
from gensim.corpora import Dictionary
dictionary=Dictionary(all_countries)
corpus=[dictionary.doc2bow(doc) for doc in all_countries]
# first 10 word ids with their frequency counts from the third document (index 2, the Iran article)
corpus[2][:10]
[(1, 5),
(6, 1),
(15, 1),
(17, 6),
(19, 19),
(20, 2),
(24, 17),
(25, 2),
(28, 6),
(29, 2)]

Let’s analyze the first tuple of the corpus, (1, 5):

  • Each tuple holds a word id and its frequency, respectively.
  • Ids are unique for each word.

Let’s check this out.

dictionary.get(2)
'abbasid'
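
To see the same mechanics on something smaller, here is a toy dictionary built from one made-up token list (the exact ids can differ):

from gensim.corpora import Dictionary

toy = Dictionary([['ankara', 'is', 'the', 'capital', 'of', 'turkey']])
print(toy.token2id)                                 # e.g. {'ankara': 0, 'capital': 1, ...}
print(toy.doc2bow(['turkey', 'turkey', 'ankara']))  # (id, count) pairs, e.g. [(0, 1), (5, 2)]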

Now, I’ll use the gensim corpus and dictionary to see the most common terms per document and across all documents.

from collections import defaultdict
import itertools
doc = corpus[0]  # Select the Georgia article
# Sort the doc by frequency: bow_doc
bow_doc = sorted(doc, key=lambda w: w[1], reverse=True)
# Print the top 5 words of the document alongside the count
for word_id, word_count in bow_doc[:5]:
    print(dictionary.get(word_id), word_count)

georgia 252
georgian 150
country 49
russian 45
region 38

At this stage, I’ll apply the TF-IDF algorithm:

TF-IDF (term frequency-inverse document frequency) weights a word by how often it appears in a document relative to how common it is across all documents, which lets us determine the most significant words in each document.
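
For reference, the classic weighting is sketched below; gensim's TfidfModel applies a smoothed and normalized variant, so its numbers will not match this toy function exactly (the counts in the example call are made up):

import math

def tfidf_weight(term_count, doc_length, num_docs, docs_with_term):
    tf = term_count / doc_length               # term frequency within the document
    idf = math.log(num_docs / docs_with_term)  # rarer across the corpus -> higher weight
    return tf * idf

# a term used 50 times in a 5000-token article and found in only 1 of 3 documents
print(tfidf_weight(term_count=50, doc_length=5000, num_docs=3, docs_with_term=1))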

# Create a new TfidfModel using the corpus: tfidf
tfidf = TfidfModel(corpus)
# Calculate the tfidf weights of the doc at index 2 (the Iran article): tfidf_weights
doc = corpus[2]
tfidf_weights = tfidf[doc]
# Print the first five weights
print(tfidf_weights[:5])
# Sort the weights from highest to lowest: sorted_tfidf_weights
sorted_tfidf_weights = sorted(tfidf_weights, key=lambda w: w[1], reverse=True)
# Print the top 5 weighted words
for term_id, weight in sorted_tfidf_weights[:5]:
    print(dictionary.get(term_id), weight)
[(1, 0.01848832043200545), (6, 0.00369766408640109), (15, 0.00369766408640109), (31, 0.01479065634560436), (41, 0.00369766408640109)]
iranian 0.5139753080097516
tehran 0.28052815262439845
reza 0.14026407631219923
festival 0.1302452137184707
pronunciation 0.1302452137184707

Consider these two documents:

- U.S. President speech in public
- Donald Trump presentation to people

If you apply tf-idf and compute cosine similarity, the score is 0: the two documents share no terms, even though they are clearly about the same topic. That does not make sense, right? It is one of the major drawbacks of tf-idf (see the sketch after the list below). How can we overcome it?

  • LSA (Latent Semantic Analysis)
  • Word Embeddings (Word2Vec, GloVe)
  • ConceptNet
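
Here is the sketch mentioned above, using scikit-learn rather than gensim purely for brevity (my own reproduction, not the post's code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["U.S. President speech in public",
        "Donald Trump presentation to people"]
vectors = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # [[0.]] -- no shared vocabulary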

I’ll focus on these concepts later on.

5- Named Entity Recognition

NER answers questions such as:

  • Who?
  • What?
  • When?
  • Where?

import nltk
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
article = wikipedia.page("Uber").content
# Tokenize the article into sentences: sentences
sentences = nltk.sent_tokenize(article)
# Tokenize each sentence into words: token_sentences
token_sentences = [nltk.word_tokenize(sent) for sent in sentences]
# Tag each tokenized sentence into parts of speech: pos_sentences
pos_sentences = [nltk.pos_tag(sent) for sent in token_sentences]
# Create the named entity chunks: chunked_sentences
chunked_sentences = nltk.ne_chunk_sents(pos_sentences, binary=True)
# Print the chunks of the tree that carry the 'NE' label
for sent in chunked_sentences:
    for chunk in sent:
        if hasattr(chunk, "label") and chunk.label() == "NE":
            print(chunk)
(NE Uber/NNP Technologies/NNPS)
(NE American/JJ)
(NE Lime/NNP)
(NE San/NNP Francisco/NNP)
(NE United/NNP States/NNPS)
(NE Uber/NNP)
(NE Uber/NNP)
(NE Uber/NNP)
...

6- spaCy

  • It focuses on creating NLP pipelines to generate models and corpora (a quick pipeline check follows the example below).

import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp("""Berlin is the capital of German;
and the residence of Chancellor Angela Merkel.""")
doc.ents
(Berlin, German, Angela Merkel)
for ent in doc.ents:
    print(ent.label_, ent.text)
GPE Berlin
NORP German
PERSON Angela Merkel
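
Since the bullet above mentions pipelines, one quick way to inspect what the loaded model contains (component names depend on the spaCy and model versions):

import spacy

nlp = spacy.load('en_core_web_sm')
print(nlp.pipe_names)  # e.g. ['tok2vec', 'tagger', 'parser', 'attribute_ruler', 'lemmatizer', 'ner']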

To be continued…

SALOME SONYA LOMSADZE

Sr. Customer Analytics, BI Developer. Experienced in SQL, Python, Qlik. B.Sc. in Chemistry at Bogazici University. https://github.com/sonyalomsadze