Twitter can be a useful tool for understanding how people feel about the coronavirus (COVID-19) over time. We performed sentiment analysis on tweets related to COVID-19 from January 4th, 2020 to April 12th, 2020 and observed trends and frequencies for the most positive and negative tweets during this period.
Be advised that strong and potentially offensive language is present in the sample tweets. In the hope that this analysis helps someone in some small way, we also made it public on GitHub. If you have feedback or suggestions, feel free to reach out to us at email@example.com.
The purpose & questions we were trying to answer with this analysis…
1. Observe sentiment of tweets over different time periods from January through April. What trends are associated with the sentiments over time?
2. Identify the frequency of words, hashtags, mentions, and emojis in the most positive and negative tweets (top ~20% compound scores → most positive, bottom ~20% compound scores → most negative), using tokenization at both the sentence and word level.
3. What are the most retweeted positive and negative tweets and what are the top 10 most positive and negative tweets?
The Twitter dataset is Version 4.0 released by Panacea Labs supplemented by additional daily data from April 5th to April 12th. The dataset is sampled from the original ~31 million tweets down to 1,384,905 tweets collected from January 4th, 2020 to April 12th, 2020. A time series dataset from Johns Hopkins University that covers confirmed cases and deaths for the U.S. and globally from January 22nd to April 12th is also used.
Starting around mid-March, coronavirus cases in the U.S. take off and grow exponentially.
VADER (Valence Aware Dictionary and sEntiment Reasoner) is a lexicon and rule-based sentiment analysis tool that is specifically attuned to sentiments expressed in social media, especially tweets.
VADER works well because it incorporates word-order sensitive relationships between terms and is able to determine the magnitude of intensity through punctuation, capitalization, degree modifiers, negations, slang, and others. It works best when analysis is performed at the sentence level (but it can work on single words or paragraphs).
VADER outputs the positive, negative, and neutral ratios of sentiment. The most useful metric is the compound score, which is computed by summing the valence scores of each word in the lexicon and then normalizing the sum to lie between -1 (most extreme negative) and +1 (most extreme positive). This can be considered a ‘normalized, weighted composite score’.
The threshold values for the compound score are as follows:
* Positive sentiment : (compound score >= 0.05).
* Neutral sentiment : (compound score > -0.05) and (compound score < 0.05).
* Negative sentiment : (compound score <= -0.05)
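The thresholds above translate directly into a small helper. This is a minimal sketch; the function name `label_sentiment` is ours for illustration and is not part of VADER itself.

```python
def label_sentiment(compound: float) -> str:
    """Map a compound score in [-1, 1] to a sentiment label.

    Uses the thresholds listed above: >= 0.05 is positive,
    <= -0.05 is negative, anything in between is neutral.
    """
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"
```

Scores exactly at 0.05 or -0.05 fall into the positive and negative classes, respectively, matching the inclusive bounds in the list.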
Since we want to calculate the sentiment of an entire tweet, we weight scores at the sentence and word level to get the weighted compound scores.
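One way such a tweet-level score could be built is sketched below. The exact weighting scheme is not spelled out here, so this example assumes a word-count-weighted mean of per-sentence compound scores. The naive regex sentence splitter and the names `split_sentences`, `tweet_compound`, and `score_fn` are all hypothetical stand-ins.

```python
import re
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tweet_compound(text: str, score_fn: Callable[[str], float]) -> float:
    """Word-count-weighted mean of per-sentence compound scores.

    `score_fn` is any callable that returns a compound score in [-1, 1]
    for a piece of text. The weighting is an illustrative assumption.
    """
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    weights = [len(s.split()) for s in sentences]
    scores = [score_fn(s) for s in sentences]
    return sum(w * x for w, x in zip(weights, scores)) / sum(weights)
```

With the `vaderSentiment` package installed, `score_fn` could be something like `lambda s: analyzer.polarity_scores(s)["compound"]`, where `analyzer` is a `SentimentIntensityAnalyzer` instance.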
The frequencies of words, hashtags, mentions, and emojis are displayed for 01-04-2020 to 04-12-2020.
The most popular words are related to descriptions of coronavirus such as the origins (china, wuhan, new) and effects (cases, outbreak, pandemic, world, spread, health).
The most popular hashtags include several different variations of coronavirus (#covid-19, #covid_19, #covid–19, #coronaviruspandemic, #coronavirusoutbreak, #covid2019, #covid, #corona, #coronavirusupdate, #virus, #pandemic, #outbreak, #flu). #news and #trump also show up in the top 20 for both words and hashtags.
The most mentioned users fall into three main categories: politicians, news/media agencies, and government agencies/political parties. Politicians include Donald Trump (@realdonaldtrump, @potus; U.S. President), Mike Pence (@VP; Vice President), Narendra Modi (@narendramodi; PM of India), Boris Johnson (@borisjohnson; PM of the U.K.), and Nancy Pelosi (@speakerpelosi; Speaker of the House). Notable news/media agencies include YouTube (@youtube), CNN (@CNN), NY Times (@nytimes) and Mail Online (@mailonline). Notable government agencies/political parties include the White House (@whitehouse), WHO (@who), CDC (@cdcgov), and GOP (@GOP). The top emojis indicate varying sentiment, ranging from crying, tears of joy, and laughing to health-related emojis (hearts, microbe, masked face).
1. 1.58 sentences per tweet
2. 10.45 words per tweet after text preprocessing and word tokenization
3. 0.85 hashtags per tweet
4. 0.51 mentions per tweet
5. 0.17 emojis per tweet
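Hashtag and mention frequencies, and per-tweet averages like the ones above, could be computed along these lines. This is a rough sketch using only the standard library. The regular expressions are simplifying assumptions (for instance, `#\w+` stops at a hyphen, so `#covid-19` would be counted as `#covid`), and emoji extraction is left out because it needs a dedicated emoji list.

```python
import re
from collections import Counter

# Simplified patterns (assumption): a '#' or '@' followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")


def token_frequencies(tweets, pattern):
    """Count lowercased matches of `pattern` across all tweets."""
    counts = Counter()
    for text in tweets:
        counts.update(match.lower() for match in pattern.findall(text))
    return counts


def mean_per_tweet(tweets, pattern):
    """Average number of matches of `pattern` per tweet."""
    if not tweets:
        return 0.0
    return sum(len(pattern.findall(t)) for t in tweets) / len(tweets)
```

Calling `token_frequencies(tweets, HASHTAG_RE).most_common(20)` on a list of tweet texts would then give a top-20 hashtag list.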
01-04-2020 to 04-12-2020 : 1,384,905 tweets
1. 01-04-2020 to 01-26-2020 : 3,694 tweets
2. 01-27-2020 to 04-12-2020 : 1,381,211 tweets
The period of 01-04-2020 to 01-26-2020 is analyzed separately because there weren’t enough daily tweets to calculate general sentiment. From 01-27-2020 to 04-12-2020, there’s an average of ~18,000 tweets per day.
At the end of January there are confirmed cases in the U.S., but no deaths. Confirmed cases and fatalities are beginning to grow globally at this time.
Common words among positive and negative sentiments are: coronavirus, virus, china, chinese, new, corona, outbreak, and wuhan. Positive tweets include words such as like, novel, please, hope, and case, while negative tweets contain words such as death, infected, toll, rises, dead, fears, and f*ck.
Most of the hashtags include different word variations of coronavirus and its origins: #china, #wuhan, #wuhancoronavirus, #coronavirusoutbreak. In January, there are a lot of hashtags related to Wuhan and China since the outbreak is still localized to that region.
From February to March, there is an increase in positive sentiment: the average compound score rises above the negative threshold of -0.05 into the neutral range and trends positively. The orange dotted line indicates 0.02 and acts as a marker. Sentiment increases from 02-29-2020 to 03-09-2020, and on 03-14-2020 it goes above zero for the first time. The ratio of negative scores decreases a little, but most of the positive sentiment comes from a shift of neutral to positive tweets.
Sentiment analysis from word tokenization shows a similar trend, but the tweets are much more neutral, as the compound scores are primarily between -0.015 and 0.005. The compound score starts trending up from 03-12-2020 to above zero, where it stays until 04-11-2020. Sentiment analysis on tokenized words shows that the ratio of positive words increases while the ratio of neutral words in tweets declines.
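A daily trend like the one described above comes from averaging compound scores per calendar day. The sketch below assumes the scored tweets are available as `(date, compound)` pairs. The function names and the record layout are hypothetical.

```python
from collections import defaultdict
from datetime import date


def daily_mean_compound(records):
    """Average compound score per calendar day.

    `records` is an iterable of (date, compound) pairs.
    Returns a list of (date, mean_compound) tuples sorted by date.
    """
    totals = defaultdict(lambda: [0.0, 0])  # day -> [running sum, count]
    for day, compound in records:
        totals[day][0] += compound
        totals[day][1] += 1
    return [(day, s / n) for day, (s, n) in sorted(totals.items())]


def first_day_above(daily, threshold=0.0):
    """First date whose daily mean exceeds `threshold`, or None if none does."""
    for day, mean in daily:
        if mean > threshold:
            return day
    return None
```

Passing the result of `daily_mean_compound` to `first_day_above` gives the first day on which the average score crosses a chosen level, such as zero.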
If we take the highest ~20% of compound scores and lowest ~20% of compound scores of tokenized tweets during this period, we can identify word/hashtags/mentions/emoji frequencies related to positive and negative sentiments. We can also compare the differences between tweets that are tokenized by sentences versus words.
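Selecting the highest and lowest ~20% of compound scores can be sketched as a sort followed by two slices. The exact cutoff method is not stated here, so the rounding rule below (truncating to a whole number of tweets) is an assumption, as are the function name and the `(text, compound)` pair layout.

```python
def split_extremes(scored_tweets, fraction=0.2):
    """Return (most_negative, most_positive) subsets of scored tweets.

    `scored_tweets` is a list of (text, compound) pairs. Each subset holds
    the lowest or highest `fraction` of tweets by compound score, in
    ascending score order.
    """
    ranked = sorted(scored_tweets, key=lambda pair: pair[1])
    k = int(len(ranked) * fraction)  # assumption: truncate to whole tweets
    if k == 0:
        return [], []
    return ranked[:k], ranked[-k:]
```

Word, hashtag, mention, and emoji frequencies can then be counted separately on each of the two subsets.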
1. Word Frequency:
There are common words between the positive and negative tweets such as coronavirus, covid-19, people, and china. Positive tweets include words like help, positive, please, good, health, thanks, safe, great, and get, while negative tweets include the words death, trump, virus, flu, infected, and news.
2. Hashtags Frequency:
Common hashtags include different word variations of coronavirus (#covid19, #covid_19, #covid–19, #covid, #coronavirusoutbreak, #coronaviruspandemic) and location (#china, #wuhan). Hashtags associated with more positive tweets include #stayhome, #socialdistancing, and #stayathome, while hashtags associated with more negative tweets include #trump, #news, #virus, and #flu.
3. Mentions Frequency:
Donald Trump tops the list of mentions for both positive and negative tweets, but he’s mentioned more often negatively than positively. Common mentions include the popular news/media agencies, politicians, and political parties. The more positive tweets include mentions of @narendramodi, @pmoindia, @borisjohnson, @drtedros, and @vp while negative tweets mention @breitbartnews, @gatewaypundit, and @mailonline.
4. Emojis Frequency:
The face with tears of joy emoji tops both the negative and positive tweet lists. This may indicate that Twitter users are using humor to deal with the negative news. The most common emojis related to positive tweets include folded hands, heart, laughing, and clapping hands. The most common emojis related to negative tweets include loudly crying face, pouting face, crying face, flushed face, and cursing face.
The most retweeted positive tweets:
These tweets are not the most positive or negative tweets during the time period. They’re only the most retweeted “positive” or “negative” tweets in the highest 20% of compound scores and the bottom 20% of compound scores, respectively. You can see the difference between the compound score for word (w_) versus sentence (s_) tokenization. For the same tweet, the word compound score can differ substantially from the sentence compound score.
The most retweeted tweet here is a sarcastic tweet whose sarcasm VADER doesn’t pick up on. Overall, many of the retweeted statuses are about coronavirus prevention, public service announcements, generous donations, and support.
The most retweeted negative tweets:
Top 10 most positive tweets:
The most positive or negative tweets from word tokenization are often very short, just a few words. In this case, “love”, “sweet” and “best” are common words among the most positive tweets from word tokenization.
Top 10 most negative tweets:
“Kill” and “murder” are common words among negative tweets using word tokenization. From sentence tokenization, some of the most negative tweets here are troll tweets, people lashing out, or news updates about the status of confirmed cases and deaths related to coronavirus.
When coronavirus cases grew in the U.S. starting in early March, U.S. President Donald Trump began showing up in more negative tweets than positive ones. Trump (@realDonaldTrump) was the most mentioned user in March and April, and began appearing in the word frequencies for negative tweets. Other politicians such as P.M. of India Narendra Modi (@narendramodi), P.M. of the U.K. Boris Johnson (@borisjohnson), and Director General of the WHO Tedros Adhanom (@drtedros) were mentioned in relatively positive tweets. From mid-March to April, NY Governor Andrew Cuomo (@nygovcuomo) appears among the positive tweet mentions, possibly due to Cuomo’s response to the crisis in New York.
There is a shift in hashtags from being localized to Wuhan, China to reflecting a global pandemic. In February, common hashtags were #coronaviruschina, #wuhancoronavirus, and #wuhanvirus. Going into March, these hashtags disappeared and more global ones popped up, such as #coronaviruspandemic and #coronavirusupdates. From mid-March to April, Wuhan doesn’t even appear among the most frequent words for positive or negative tweets. During this time, #Covid-19 also surpasses #coronavirus in positive tweets for the first time after receiving its official name.
The emoji 😂 was by far the most popular for both positive and negative tweets, which may indicate that Twitter users are also using humor to deal with negative news. For the first time, in mid-March to April, the folded hands emoji, 🙏, appears at the top of the emoji list for positive tweets, followed by the red heart, ❤️. The clapping hands emoji, 👏, increased in rank, possibly related to the health worker appreciation events that have been going on across the world. A blue heart also appears, and people have stopped using the microbe emoji, 🦠.
There is an especially interesting trend from February to mid-March: as the death rate grows exponentially in mid-March in the U.S., the sentiments get more positive. Tweets shift from all negative sentiments in February to neutral in mid-March and trend positively toward a neutral/positive sentiment in mid-April. Positive tweets about coronavirus prevention, public service announcements, generous donations, and support were prevalent at this time.
Sentiment analysis on tweets using word tokenization identifies trends, but mostly stays in the neutral zone without going beyond even -0.02 or 0.02. Sentence tokenization was able to identify negative daily average sentiment from February to March. VADER has some shortcomings, such as not being able to detect sarcasm or interpret certain words in the context of a tweet, but for the most part it identifies sentiment in tweets well.
Hutto, C.J. & Gilbert, E.E. (2014). VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text. Eighth International Conference on Weblogs and Social Media (ICWSM-14). Ann Arbor, MI, June 2014.
Banda, Juan M., Tekumalla, Ramya, Wang, Guanyu, Yu, Jingyuan, Liu, Tuo, Ding, Yuning, & Chowell, Gerardo. (2020). A Twitter Dataset of 150+ million tweets related to COVID-19 for open research (Version 4.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.3738018
Aslam, Salman. “Twitter by the Numbers: Stats, Demographics & Fun Facts.” Omnicore Agency, 10 Feb. 2020, www.omnicoreagency.com/twitter-statistics/.
Wojcik, Stefan, and Adam Hughes. “How Twitter Users Compare to the General Public.” Pew Research Center: Internet, Science & Tech, Pew Research Center, 2 Jan. 2020, www.pewresearch.org/internet/2019/04/24/sizing-up-twitter-users/.
Taylor, Derrick Bryson. “A Timeline of the Coronavirus Pandemic.” The New York Times, The New York Times, 13 Feb. 2020, www.nytimes.com/article/coronavirus-timeline.html.