Sentiment Analysis and Topic Modeling for Twitter Data Collected Using the Keywords “COVID-19” and “Africa”

Daniel Zelalem Zewdie
9 min read · Jun 29, 2021

Introduction

The Coronavirus (COVID-19) pandemic has impacted and changed lives on a global scale since its emergence in late 2019, and the response to it remains challenging in many African countries. This work can be used to gain insight into how COVID-19 has affected African people’s livelihoods: what people know, how they feel, and what they perceive about COVID-19, including common misconceptions. It can also shed light on the social and economic impacts of COVID-19 on Africa. Having that information can help governments devise effective prevention strategies to control COVID-19 in Africa.

The main objective of this project is to analyze Twitter data extracted using the keywords “covid19” and “Africa” and to build a fully automated MLOps pipeline for identifying the sentiment and topic of tweets.

GitHub Link: https://github.com/daniEL2371/Twitter-Data-Analysis

Data Extraction And Building Data Frame

The raw data were collected from Twitter using the keywords “covid19” and “Africa” and are in JSON format. To build a data frame for our analysis, we first read the JSON file, then choose 15 columns to include in our data frame and extract them.
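The post does not include the collection script itself, but a minimal sketch of how such tweets could be gathered with Tweepy might look like the following (Tweepy itself, the credentials, the query, and the output file name are all illustrative assumptions, not the pipeline’s actual collection code):

import json
import tweepy  # assumption: Tweepy >= 4.x

# Hypothetical credentials; replace with your own Twitter API keys.
auth = tweepy.OAuth1UserHandler("CONSUMER_KEY", "CONSUMER_SECRET",
                                "ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Search recent tweets matching both keywords and append them to the raw
# data file, one JSON object per line (the format read_json expects below).
with open("covid19_africa_tweets.json", "a") as out:
    for tweet in tweepy.Cursor(api.search_tweets,
                               q="covid19 Africa",
                               tweet_mode="extended",
                               lang="en").items(1000):
        out.write(json.dumps(tweet._json) + "\n")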

1. Reading JSON file

2. Extracting Columns

We chose the following 15 columns to extract from the raw JSON data:

A. created_at: The date the tweet was created (date-time)

B. source: Tag for source link (string data)

C. original_text: The tweet Text (string data)

D. polarity: Indicates the positive or negative sentiment of the tweet (numeric data). The value is continuous and ranges from -1 to 1: values closer to -1 indicate more negative sentiment, values closer to 1 indicate more positive sentiment, and a value of 0 means the tweet is neutral.

E. subjectivity: Indicates how subjective the tweet is (numeric data). The value is continuous and ranges from 0 (fully objective) to 1 (fully subjective). A sketch of how polarity and subjectivity are typically computed follows this list.

F. lang: Language of the tweet (string data), e.g. “en” stands for English

G. hashtags: Hashtags used in the tweet (Space separated strings, each string represents a hashtag)

H. user_mentions: Username of users mentioned in a tweet (Space separated strings, each string represents a username)

I. place: Location where the tweet is tweeted (string data)

J. original_author: Author of the tweet (string Data)

Other columns with numeric data are favorite_count, retweet_count, followers_count, friends_count, and possibly_sensitive.
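The polarity and subjectivity columns above are typically produced with a sentiment library such as TextBlob; the snippet below is a minimal sketch of that approach (TextBlob is an assumption here, not necessarily what produced the values in the raw data):

from textblob import TextBlob  # assumption: TextBlob is used for scoring

def get_sentiment(text: str) -> tuple:
    """Return (polarity, subjectivity) for a piece of text.
    polarity is in [-1, 1]; subjectivity is in [0, 1]."""
    blob = TextBlob(text)
    return blob.sentiment.polarity, blob.sentiment.subjectivity

polarity, subjectivity = get_sentiment("COVID-19 cases are finally dropping across Africa")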

To extract the columns and construct the data frame, we first read the JSON file using the read_json helper function.

import json

def read_json(json_file: str) -> tuple:
    """Read a file of line-delimited JSON tweets and return (count, tweets)."""
    tweets_data = []
    for tweet in open(json_file, 'r'):
        tweets_data.append(json.loads(tweet))
    return len(tweets_data), tweets_data

Extracting the columns

Using the TweetDfExtractor helper class we wrote, we extract the 15 columns and create a data frame.

The link for TweetDfExtractor can be found here.

https://github.com/daniEL2371/Twitter-Data-Analysis/blob/main/extract_dataframe.py

Creating the data frame and saving it

We have now extracted the columns we need. The next step is to zip the extracted columns, build a data frame using pandas, and save it to a file called ‘processed_tweet_data.csv’.

CSV stands for Comma-Separated Values, a popular format for saving tabular data.

data = zip(created_at, source, text, polarity, subjectivity, lang,
           fav_count, retweet_count, screen_name, follower_count,
           friends_count, sensitivity, hashtags, mentions, location)
df = pd.DataFrame(data=data, columns=columns)

if save:
    df.to_csv('processed_tweet_data.csv', index=False)
    print('File Successfully Saved.!!!')

return df

Data Preprocessing and Cleaning

Before we analyze the data and use it for our sentiment analysis and topic modeling, we must first clean it up.

Let’s look at the data frame we have.

## Getting the number of rows, the number of columns, and column information
def get_data_info(tweet_df: pd.DataFrame):
    row_count, col_count = tweet_df.shape
    print(f"Number of rows: {row_count}")
    print(f"Number of columns: {col_count}")
    return tweet_df.info()

tweet_df = read_proccessed_data(CSV_PATH)
get_data_info(tweet_df)

As we can see from above, our data frame has 6532 rows and 15 columns.

Now let’s look at the first five rows using the pandas head function.

tweet_df.head()

As we can see from the above, our data needs cleaning. We wrote a helper class called Clean_Tweets in clean_tweets_dataframe.py. Clean_Tweets accepts Twitter data and provides methods to clean it.

The link to clean_tweets_dataframe.py can be found in the link below.

https://github.com/daniEL2371/Twitter-Data-Analysis/blob/main/clean_tweets_dataframe.py

We initialize our Clean_Tweets class:

Tweet_cleaner = Clean_Tweets(df)

First, the data frame has an unnamed index column as its first column, which we need to remove. We do this by calling the drop_unwanted_column method:

df = Tweet_cleaner.drop_unwanted_column(df)

Second, we drop duplicated rows and convert the created_at column into a date-time type. We also convert polarity, subjectivity, retweet_count, and favorite_count into numeric data types.

df = Tweet_cleaner.drop_duplicate(df)
df = Tweet_cleaner.convert_to_datetime(df)
df = Tweet_cleaner.convert_to_numbers(df)
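The implementations of these conversion helpers live in clean_tweets_dataframe.py; roughly, they wrap pandas’ to_datetime and to_numeric. The bodies below are an illustrative sketch, not a copy of the repository code:

def convert_to_datetime(self, df: pd.DataFrame) -> pd.DataFrame:
    # Coerce unparsable dates to NaT instead of raising an error
    df['created_at'] = pd.to_datetime(df['created_at'], errors='coerce')
    return df

def convert_to_numbers(self, df: pd.DataFrame) -> pd.DataFrame:
    # Coerce non-numeric entries to NaN; they are treated as missing values later
    for col in ['polarity', 'subjectivity', 'retweet_count', 'favorite_count']:
        df[col] = pd.to_numeric(df[col], errors='coerce')
    return df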

Then we handle missing values using Clean_Tweets’ handle_missing_values method. When inspecting the data frame, the columns with missing values are the following (a sketch of how they are filled follows below).

A. polarity: handled by changing missing values to 0

B. retweet_count: handled by changing missing values to 0

C. place: handled by changing missing values to empty string

D. hashtags: handled by changing missing values to empty string

E. user_mentions: handled by changing missing values to an empty string

df = Tweet_cleaner.handle_missing_values(df)
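A minimal sketch of what handle_missing_values might do with pandas (illustrative only; the actual implementation is in the repository):

def handle_missing_values(self, df: pd.DataFrame) -> pd.DataFrame:
    # Numeric columns: missing values become 0
    df[['polarity', 'retweet_count']] = df[['polarity', 'retweet_count']].fillna(0)
    # Text-like columns: missing values become empty strings
    text_cols = ['place', 'hashtags', 'user_mentions']
    df[text_cols] = df[text_cols].fillna('')
    return df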

Finally, we need to further clean up the original_text column and add a column called clean_text to our data frame to hold the cleaned text.

We first converted the text to lower case, then removed common punctuation using a regular expression.

import re

# Changing to lower case
df['clean_text'] = df['original_text'].apply(lambda x: x.lower())

# Removing common punctuation
df['clean_text'] = df['clean_text'].map(lambda x: re.sub('[,\.!?]', '', x))

Then we removed hashtags starting with the # symbol, user mentions (usernames of users mentioned in the text) starting with the @ symbol, and links found in the text. Finally, we removed any non-ASCII characters by keeping only characters whose “ord” value is less than 128.

NOTE: all ASCII characters have an ord less than 128.

def clean_text(text):
    # Remove hashtags, then user mentions, then links, chaining each step
    hash_tag_removed = re.sub('(#[A-Za-z]+[A-Za-z0-9-_]+)', '', text)
    mentions_removed = re.sub('(@[A-Za-z]+[A-Za-z0-9-_]+)', '', hash_tag_removed)
    removed_links = re.sub('http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', mentions_removed, flags=re.MULTILINE)

    # Keep only ASCII characters (ord < 128) and strip leading/trailing whitespace
    cleaned = ''.join([i if ord(i) < 128 else ' ' for i in removed_links])
    cleaned = cleaned.strip()
    return cleaned

df['clean_text'] = df['clean_text'].apply(clean_text)

One last thing to do is to clean up the source column. But first, let’s look at it. The source column holds the source in the form of an anchor tag like this:

<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>

So we need to extract the link only. We do this by using a regular expression.

def extract_source_link(tag: str) -> str:
    link_match = re.search(r'href=[\'"]?([^\'" >]+)', tag)
    return link_match.group(1)

df['source'] = df['source'].apply(extract_source_link)

Finally, we save the cleaned, preprocessed data into a file called cleaned_tweet_data.csv.

The whole code for data preprocessing can be found in the repository linked above.

Exploratory Analysis

We have built a Streamlit dashboard to visualize and explore the cleaned data. Let’s look at some insights.

As we can see from the polarity distribution above, 52.5% of the tweets have a positive polarity, 19.6% have a negative polarity, and 28% have a neutral polarity.
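A chart like that can be produced directly from the cleaned data frame. The sketch below maps polarity to sentiment labels and plots their shares with matplotlib (the plotting library and the inline mapping are assumptions made here for illustration; the dashboard itself uses Streamlit):

import matplotlib.pyplot as plt  # assumption: matplotlib for the chart

# Map continuous polarity scores to categorical sentiment labels
sentiment = df['polarity'].apply(
    lambda p: 'positive' if p > 0 else ('negative' if p < 0 else 'neutral'))

# Plot the share of each sentiment class as a pie chart
sentiment.value_counts().plot.pie(autopct='%1.1f%%')
plt.ylabel('')
plt.title('Sentiment distribution of COVID-19 Africa tweets')
plt.show()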

Word cloud of the clean_text column in our cleaned data.
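The word cloud itself can be generated from the clean_text column with the wordcloud package; a minimal sketch (the figure size and background color are illustrative choices):

from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt

# Join all cleaned tweets into one string and build the word cloud
all_text = ' '.join(df['clean_text'].astype(str))
wc = WordCloud(stopwords=STOPWORDS, width=800, height=400,
               background_color='white').generate(all_text)

plt.figure(figsize=(12, 6))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()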

Modeling

https://github.com/daniEL2371/Twitter-Data-Analysis/blob/main/notebooks/modelGeneration.ipynb

Sentiment Analysis

Sentiment analysis (or opinion mining) is a natural language processing technique used to determine whether data is positive, negative or neutral. Sentiment analysis is often performed on textual data to help businesses monitor brand and product sentiment in customer feedback, and understand customer needs.

Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to fitting linear classifiers and regressors under convex loss functions such as (linear) Support Vector Machines and Logistic Regression

TF-IDF is a statistical measure that evaluates how relevant a word is to a document in a collection of documents. This is done by multiplying two metrics: how many times a word appears in a document, and the inverse document frequency of the word across a set of documents

TfidfVectorizer transforms text into feature vectors that can be used as input to an estimator. Its vocabulary_ attribute is a dictionary that maps each token (word) to a feature index in the matrix; each unique token gets its own feature index.
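As a tiny example of TfidfVectorizer and its vocabulary_ attribute (the toy sentences are illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["covid cases rise in africa",
        "vaccines arrive in africa",
        "covid vaccines rollout begins"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse TF-IDF matrix, shape (3, n_unique_tokens)

print(vectorizer.vocabulary_)       # maps each token to its column index in X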

Data preparation for sentiment analysis: to prepare the data for sentiment analysis, we took the following steps.

First, we select only columns that we are interested in. These are clean_text and polarity.

Second, we add a categorical column called ‘score’ that represents the sentiment, and a binary column called ‘score_map’ derived from it.

‘score’ is “positive” if polarity > 0

‘score’ is “negative” if polarity < 0

We have ignored neutral polarity (polarity == 0)

def text_category(p: float) -> str:
    if p > 0:
        return "positive"
    elif p == 0:
        return "neutral"
    else:
        return "negative"

def remove_neutral(value):
    return value != "neutral"

cleanedTweet = pd.DataFrame(columns=['clean_text', 'polarity'])
cleanedTweet['clean_text'] = df['clean_text']
cleanedTweet['polarity'] = df['polarity']
cleanedTweet['score'] = cleanedTweet['polarity'].apply(text_category)
cleanedTweet = cleanedTweet[cleanedTweet['score'].map(remove_neutral)]
cleanedTweet['score_map'] = cleanedTweet["score"].map(
    lambda score: 1 if score == "positive" else 0)

Third, we separated the input data (clean_text) and the output data (score_map) as (X, y).

(X, y) = cleanedTweet['clean_text'], cleanedTweet['score_map']

Fourth, to feed our texts to the training model, we need to represent each sentence as a vector. We used a CountVectorizer with n-grams of size 1 to 3 followed by a TfidfTransformer to build TF-IDF feature vectors.

trigram_vectorizer = CountVectorizer(ngram_range=(1, 3))
trigram_vectorizer.fit(X.values)
X_trigram = trigram_vectorizer.transform(X.values)

trigram_tf_idf_transformer = TfidfTransformer()
trigram_tf_idf_transformer.fit(X_trigram)
X_train_tf_idf = trigram_tf_idf_transformer.transform(X_trigram)

Fifth, we divide our data into training data and test data.

Finally, we train using the SGD classifier.

def train_and_show_scores(X, y, title: str) -> None:
    X_train, X_valid, y_train, y_valid = train_test_split(
        X, y, train_size=0.75, stratify=y)

    clf = SGDClassifier()
    clf.fit(X_train, y_train)
    train_score = clf.score(X_train, y_train)
    valid_score = clf.score(X_valid, y_valid)

    print(f'{title}\nTrain score: {round(train_score, 2)} ; '
          f'Validation score: {round(valid_score, 2)}\n')

train_and_show_scores(X_train_tf_idf, y, title="trigram_tf_idf")

Results: the training accuracy is 1.0, which suggests the model is overfitting. The validation accuracy, on the other hand, is 0.96.
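Once trained, the same vectorizer and TF-IDF transformer can be reused to score new tweets. A small sketch (keeping a fitted clf outside the helper function, and the example tweet itself, are assumptions made here for illustration):

# Assume clf is an SGDClassifier fitted on X_train_tf_idf as above
new_tweets = ["vaccination campaigns are expanding across africa"]

# Apply the exact same feature pipeline used at training time
counts = trigram_vectorizer.transform(new_tweets)
features = trigram_tf_idf_transformer.transform(counts)

print(clf.predict(features))  # 1 -> positive, 0 -> negative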

Topic Modeling

Topic models are a type of statistical language model used for uncovering hidden structure in a collection of texts.

LDA is a generative probabilistic model that assumes each topic is a mixture over an underlying set of words, and each document is a mixture over a set of topic probabilities.

The Porter stemming algorithm is a process for removing the commoner morphological and inflexional endings from words in English.

First, we extract only the column we are interested in for topic modeling.

Second, we remove stop words using ‘wordcloud.STOPWORDS’.

Third, using ‘nltk.stem.PorterStemmer’, we apply Porter stemming to our text vocabulary.

Fourth, we tokenize our text and convert the tokenized documents into a corpus and a dictionary.

Finally, we train an LDA model using 8 topics.

# Reading the cleaned data
df = read_proccessed_data(CLEANED_SAVE_PATH)

# Creating an instance of the topic model generator with 8 topics
tm = TopicModel(df, 8)

# Building the model
lda_model, lda_prepared = tm.build(show_print=True)
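Under the hood, the preprocessing and training steps listed above might look roughly like the following with gensim (a minimal sketch under the assumption that TopicModel wraps gensim; the actual implementation is in the repository):

from gensim import corpora
from gensim.models import LdaModel
from nltk.stem import PorterStemmer
from wordcloud import STOPWORDS

stemmer = PorterStemmer()

def preprocess(text: str) -> list:
    # Remove stop words and apply Porter stemming to each remaining token
    return [stemmer.stem(tok) for tok in text.split() if tok not in STOPWORDS]

docs = [preprocess(t) for t in df['clean_text'].astype(str)]

# Build the dictionary (token -> id) and the bag-of-words corpus
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train an LDA model with 8 topics
lda_model = LdaModel(corpus=corpus, id2word=dictionary, num_topics=8, passes=5)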

Results:

To visualize the topics, we used a popular visualization package called pyLDAvis.

You can manually select each topic to view its top most frequent and/or “relevant” terms, using different values of the λ parameter. This can help when you’re trying to assign a human interpretable name or “meaning” to each topic.
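For reference, preparing and rendering the visualization in a notebook might look like this (a sketch assuming a gensim LDA model and pyLDAvis’s gensim_models module; the output file name is illustrative):

import pyLDAvis
import pyLDAvis.gensim_models

# Prepare the interactive visualization from the trained LDA model
lda_prepared = pyLDAvis.gensim_models.prepare(lda_model, corpus, dictionary)

# Render inline in a Jupyter notebook, or save as a standalone HTML page
pyLDAvis.display(lda_prepared)
pyLDAvis.save_html(lda_prepared, 'lda_topics.html')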
