Now Trending on Twitter …
In mid-2006, the world as we know it changed. Prior to that time, the words “twitter” and “tweet” referred solely to the little chirping noises that birds make. Since then, however, most of us also know Twitter as an online social-media platform that allows you to share your thoughts—in short, 140-character bursts called tweets—with the world at large, or at least whatever segment of the world chooses to listen to you.
Twitter is very popular, with 500 million users as of December 2014, according to one site, and more than 640 million users, according to another. That is a healthy proportion (roughly 7% to 9%) of the Earth’s population of more than 7 billion people. Approximately 300 million Twitter users are active, defined as those who have interacted with the community in the last 30 days while logged in. According to a list of virtual communities with more than 100 million active users, this makes Twitter the eighth most active virtual community in the world.
To me, Twitter is a prototypical example of what people call “Big Data.” It is modern, based on technological advances that wouldn’t even have been imaginable 20 years ago; it is widespread, generating masses of information every day—every second, in fact; and it is unstructured, with strings of text that have to be mined for meaning. And as I’ve pointed out with other examples of Big Data in this column, new statistical tools and approaches for analysis are also required to handle this rich content.
There are at least two statistical aspects of interest about Twitter. The first involves the statistics of Twitter itself: such things as the number of users, tweets over a given time span, and distribution of users around the world. The second involves the statistics of Twitter data: subjects of tweets, trends in those subjects, and so on.
To gain some insight into the first type of question, I found a number of sites that provide summaries of Twitter usage. According to the site Statistic Brain, which I accessed in mid-January 2015, 135,000 new users sign up for Twitter every day. The site reports an average of 58 million tweets per day and an average of 9,100 tweets per second (though the Twitter blog reports an average of 5,700 tweets per second). These figures do not agree with one another: 58 million tweets per day works out to only about 670 per second. Only about 60% of users tweet; the rest just follow the tweets of others.
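The per-day and per-second figures quoted above can be put on a common scale with a little arithmetic. The following is a minimal Python sketch that uses only the numbers reported in the text; the function and variable names are my own, chosen for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds in a day


def per_second(tweets_per_day):
    """Convert a daily tweet count to an average per-second rate."""
    return tweets_per_day / SECONDS_PER_DAY


def per_day(tweets_per_second):
    """Convert an average per-second rate to a daily tweet count."""
    return tweets_per_second * SECONDS_PER_DAY


# Figures as quoted in the column
daily_reported = 58_000_000     # Statistic Brain: tweets per day
rate_statistic_brain = 9_100    # Statistic Brain: tweets per second
rate_twitter_blog = 5_700       # Twitter blog: tweets per second

print(round(per_second(daily_reported)))  # about 671 tweets per second
print(per_day(rate_statistic_brain))      # 786,240,000 tweets per day
print(per_day(rate_twitter_blog))         # 492,480,000 tweets per day
```

The daily figure and the two per-second figures differ by roughly an order of magnitude, so they presumably were measured at different times or count different things. The sources quoted here do not say which.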
The record for most tweets in a second was apparently reached on August 3, 2013, at 11:21:50 Japan Standard Time, during an airing of Hayao Miyazaki’s “Castle in the Sky”—143,199 tweets per second, according to the Twitter blog. At a critical point in the movie, more than 100,000 viewers took to Twitter to tweet a magic spell to destroy the city of Laputa, at the same time that the protagonists of the film were uttering the spell on-screen. I found one site that claimed 618,000 tweets per second were generated during this past summer’s World Cup when Germany defeated Argentina in the final, but could not find this corroborated anywhere.
The website beevolve (ignore the awful pie charts!) claims to provide “An exhaustive study of Twitter users around the world,” although it may not be updated very frequently. The page I saw was dated October 2012, and in general I found it difficult to locate current information on this topic, although the general trends are consistent on different sites. According to the beevolve site, the sex distribution of Twitter users slightly favors women: 53% female and 47% male. Women also tweet more than men do. Not surprisingly, the overwhelming majority of users are young—about 74% in the age category 15-25! The “slice” of the distribution decreases steadily by decade of age, and only about 6% are 46 and older. As noted on the site, though, this information is gleaned from publicly available biographical information, which younger people in general are more comfortable giving. So there is a skew toward younger ages. And in fact, also according to the site, just 0.45% of Twitter users disclose their age.
The English-speaking world dominates the Twitterverse, with the top three countries for Twitter users being the United States, the United Kingdom, and Australia. Other countries in the top 10 (remember, these are 2012 numbers) include Brazil, India, and Iran. The sex distribution is relatively even in the developed countries, but in the developing world, Twitter users are predominantly male. Perhaps this reflects different access to technology and education. I also found it interesting that among younger users, women tend to outnumber men, a trend that reverses among older users.
One can also find data on various websites about the most followed accounts. In December 2014, Katy Perry, Justin Bieber, and Barack Obama were the top three accounts worldwide, each with more than 50 million followers. In June 2014, President Obama led the pack of world leaders, with more than 40 million followers; Pope Francis was next at just over 14 million.
A challenge in all of this, aside from the fact that the data can be hard to find in the first place, is that these figures change rapidly because Twitter is so widespread. Thus it’s hard to pin down the number of users and their demographic characteristics, or the average number of tweets per second, or the most followed celebrities. As just one example, I learned that Lady Gaga had the most followers on Twitter for much of 2014, but Katy Perry ended the year beating her. Furthermore, Gaga wasn’t even in the top three for December. Does it matter much? The answer isn’t clear to me, but it should depend on the goals of your analysis. And if you, the reader, look these things up after reading this column, no doubt you will find different information still.
The other side of the statistics of Twitter, of course, is the analysis of Twitter data, and for this a plethora of tools have been developed. Some basic patterns are evident. People use Twitter to keep in touch with friends, and to find news; some also use it for research. In terms of content, it is common to use Twitter to post personal or work updates, to share links to news stories, to post general observations about life, and to “re-tweet” or forward tweets from others. Most tweets, however, generate no reaction.
Visualization of data is powerful; massive unstructured data like tweets can yield unexpected patterns that might be difficult to detect otherwise. Tools for the analysis of Twitter data break down further into tools to analyze and visualize your own data, and tools to analyze and visualize the Twitterverse more generally. Many of the tools and programs for the analysis of one’s own data are aimed at improving a user’s following. Naturally enough, many of these focus on social marketing and businesses. And here, as well, there is a wide variety of sites and tools available to choose from.
For fun, I decided to try some of them to see what I could find. The first thing I realized is that, like Twitter itself, sites that aim to provide content for users change rapidly, so much so that many of the websites and services I found mentioned didn’t work when I attempted to access them. Furthermore, I’m not a Twitter user, which limited what I could do, and I wasn’t sure what to search for! I fell back on “American Statistical Association.” First I plugged that search term into a website called “Tweet Archivist.” Without paying for the service, I was able to get some simple summaries for the past week but couldn’t download the full report. Our society’s handle wasn’t very active during the last week of January 2015, with only 24 tweets coming up in the report. Among the top words were “American” and “Association” (no surprise there!), but also “collaboration,” “data,” “digging,” “math,” and, intriguingly, “vam,” which refers to the value-added model of education, a somewhat controversial concept in K–12 testing, on which the ASA has recently weighed in. The highest number of tweets during the week was on Monday, Jan. 26, with 10. Tweet Archivist seems to limit the number of tweets it will present; leading up to the Super Bowl on Feb. 1, 2015, I was able to retrieve just 100 tweets at a time from this engine using the search term “Super Bowl” or “Superbowl,” which is clearly nowhere near the amount of activity there must have been on Twitter surrounding the game.
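A “top words” summary like the one described above comes down to counting word frequencies across a set of tweets. The sketch below shows one simple way to do that in Python. The sample tweets and the stop-word list are invented for illustration; they are not the data Tweet Archivist returned, and I do not know how that service actually computes its summaries.

```python
import re
from collections import Counter

# Hypothetical stand-in tweets; not real output from Tweet Archivist.
sample_tweets = [
    "American Statistical Association statement on VAM",
    "Digging into data with the American Statistical Association",
    "Math and data collaboration at the American Statistical Association",
]

# A small hand-picked stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "on", "at", "and", "into", "with"}


def top_words(tweets, n=5):
    """Return the n most common non-stop-words across a list of tweets."""
    counts = Counter()
    for tweet in tweets:
        # Lowercase the text and pull out runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", tweet.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


print(top_words(sample_tweets, n=3))
```

As in the report described above, the words of the search term itself dominate the counts, which is why an analysis would usually drop them before looking for anything interesting.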
Another interesting website is twistori, which scrolls through tweets that use certain emotive words—love, hate, think, believe, feel, and wish. When you click on one of these words, the tool starts pulling up tweets in which that word appears. For instance, clicking on “love” generated “i love good morning texts :),” “i love aloe vera drink :D,” “i love this time of year. Seriously, I do, no honestly,” and lots of “i love you.” “think” generated “i think i’m ok,” “i think too much,” “i think HTML needs a sarcasm tag,” and “i think a flying giraffe would be awesome.” I found it rather poignant to let these tweets scroll by on the screen, seeing a small slice both of the Twitterverse and of the lives and preoccupations of others. I didn’t understand everything, of course, since much of it was clearly personal and idiosyncratic, but it was definitely an interesting experience to sit back and absorb the scroll-through.
Beyond visualization, there are also tools available for the statistical analysis of Twitter data, for example using R. Much of this work, not unexpectedly, still focuses on visualization (such as networks of users) and on text mining. There is an R package called twitteR that provides an interface to the Twitter web API. A big first step is to download the data, or to scrape them from Twitter feeds. As with the visualization tools, for text mining one would usually need to start with search terms of interest. The twitteR package has a lot of functions for these purposes; you can search for trends on Twitter, find followers, and manage users. Once the data are in R, you can of course use any other R packages (such as those for text mining) or functions to analyze and visualize the results. From the research I did for this column, I got the impression that this area is still in its infancy; most tutorials focused on how to plot the Twitter data in R. There was some discussion of sentiment analysis (that is, determining the emotional valence of tweets, and subjecting that to statistical analysis), but this was also ultimately visual.
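To make the idea of sentiment analysis concrete, here is a minimal lexicon-based sketch. It is written in Python rather than R, does not use the twitteR package, and does not connect to Twitter at all. The word scores are invented for illustration, and real sentiment lexicons are far larger and more nuanced. The first two tweets are ones quoted earlier in this column; the third is hypothetical.

```python
import re
from statistics import mean

# Toy lexicon: these words and scores are illustrative assumptions.
LEXICON = {"love": 1, "awesome": 1, "good": 1, "ok": 0, "hate": -1, "awful": -1}


def sentiment_score(text):
    """Sum the lexicon scores of the words in text; unknown words count as 0."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(LEXICON.get(w, 0) for w in words)


tweets = [
    "i love good morning texts :)",
    "i think a flying giraffe would be awesome",
    "i hate mondays",  # hypothetical tweet, added for contrast
]

scores = [sentiment_score(t) for t in tweets]
print(scores)        # one score per tweet
print(mean(scores))  # average valence across the sample
```

Once each tweet has a numeric score, the scores can be treated like any other variable: summarized, compared across groups, or tracked over time, which is where the statistical analysis proper would begin.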
Twitter data are in large part network data: those who tweet have followers, messages get re-tweeted, and so on. One possibly fruitful direction for statistical analysis, then, would be based on social-network analysis, with the added twists of high temporal resolution (data are constantly generated) and high volume (many users). Other complications include the facts that most people don’t have many followers and most users don’t actually tweet that much, so the data are sparse in some respects. In my research, I found slides from what looked like a very interesting talk titled “A Statistical Analysis of a Time Series of Twitter Graphs,” given by David Marchette. Many of these points, and a wealth of others, are addressed there. I’d urge you to take a look at the talk for a window into the kinds of statistical analysis that can be carried out on Twitter data.
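One way to picture a “time series of graphs” is to slice retweet activity into time windows and build one directed graph per window. The Python sketch below does this for a handful of invented retweet events and reports, for each hour, how many distinct users retweeted each author. It is only an illustration of the data structure and is not the method used in the talk mentioned above; the user names and events are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical retweet events: (hour, retweeter, original_author).
events = [
    (0, "bob", "alice"),
    (0, "carol", "alice"),
    (1, "dave", "alice"),
    (1, "alice", "erin"),
    (2, "bob", "erin"),
]


def graphs_by_hour(events):
    """Group directed edges (retweeter -> author) into one edge set per hour."""
    snapshots = defaultdict(set)
    for hour, retweeter, author in events:
        snapshots[hour].add((retweeter, author))
    return dict(snapshots)


def in_degree(edges):
    """Count how many distinct users retweeted each author in one snapshot."""
    return Counter(author for _, author in edges)


for hour, edges in sorted(graphs_by_hour(events).items()):
    print(hour, dict(in_degree(edges)))
```

Even in this tiny example, most users have an in-degree of zero in any given hour, which hints at the sparsity that makes real Twitter graphs challenging to model.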
Researching this column did not make me want to join Twitter, but it did make me think more about the data that users generate. It’s a fascinating look into our modern world—the issues that people are thinking about and talking about; the things that they (we) think are important; how people feel; how we connect to each other. As statisticians, we can think about the Twitterverse as a gigantic source of rich, complicated data, which it certainly is. But this view also allows us to detach ourselves from the overwhelming emotionality and exuberance captured by the spontaneity of Twitter. It behooves us to balance both aspects!
In The Big Picture, Nicole Lazar discusses the many facets of handling, analyzing, and making sense of complex and large data sets. If you have questions or comments about the column, please contact Lazar at email@example.com.