Accessing Twitter using Spark Streaming: A minimal example application

In recent years, Twitter data has been used in a variety of studies. An often-cited example is the prediction of stock market changes based on the Twitter mood [1]. Bollen et al. employed mood-tracking tools to measure the mood along six dimensions and found that, using this data, they were able to predict the daily up and down changes of the Dow Jones with an accuracy of 87.6% [2].

Another interesting predictive use of live Twitter data is earthquake warning. Researchers at the University of Tokyo [3] developed a classifier that scans tweets for earthquake-related posts based on features such as keywords, and were able to detect earthquakes in Japan with a probability of 96%. Based on this system, they run a notification service and claim to be faster on average than the Japan Meteorological Agency.

Long story short, examples like these show that interesting knowledge can be extracted from a large pool of seemingly banal short messages. In the remainder of this post, I’m going to demonstrate with a minimal runnable example how live Twitter data can be accessed using Apache Spark, an open-source data processing engine that is steadily gaining popularity.

Apache Spark comes with a component called Spark Streaming for, guess what, performing streaming analytics. To make that work, it splits the data from the stream source into small batches and hands them to the nodes for processing. A nice consequence is that you can use the same operations in batch and streaming mode, which makes it easy to switch between the two.
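To make the micro-batch idea a bit more concrete, here is a minimal plain-Java sketch that does not use Spark at all. The class `MicroBatchSketch`, the `Event` type, and both helper methods are made up for this illustration. It only shows the principle: the same function (`firstTexts`) is applied once to a whole data set in "batch mode" and once per one-second slice in "streaming mode".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Plain-Java illustration of micro-batching; no Spark involved. */
final class MicroBatchSketch {

    /** A timestamped message standing in for one stream element (hypothetical type). */
    static final class Event {
        final long timestampMillis;
        final String text;

        Event(long timestampMillis, String text) {
            this.timestampMillis = timestampMillis;
            this.text = text;
        }
    }

    /** The processing logic: identical for a full data set and for a single batch. */
    static List<String> firstTexts(List<Event> events, int limit) {
        return events.stream()
                .map(e -> e.text)
                .limit(limit)
                .collect(Collectors.toList());
    }

    /** Groups events into consecutive intervals, keyed by the interval's start time. */
    static Map<Long, List<Event>> toBatches(List<Event> events, long intervalMillis) {
        Map<Long, List<Event>> batches = new TreeMap<>();
        for (Event e : events) {
            long batchStart = (e.timestampMillis / intervalMillis) * intervalMillis;
            batches.computeIfAbsent(batchStart, k -> new ArrayList<>()).add(e);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(100, "first"),
                new Event(900, "second"),
                new Event(1200, "third"));

        // Batch mode: one call over all the data.
        System.out.println(firstTexts(events, 10));

        // Streaming mode: the same call, once per one-second batch.
        toBatches(events, 1000).forEach((start, batch) ->
                System.out.println(start + " ms: " + firstTexts(batch, 10)));
    }
}
```

Spark of course distributes the batches across nodes and builds them from a live source; the sketch leaves all of that out.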

Setup

First, you will need to have Spark installed. The application works in standalone mode, so no server configuration is required. Second, to connect to Twitter, you have to create an application on their developer portal at apps.twitter.com (if this step is unclear to you, please refer to e.g. this guide). Third and last, you can download the complete source code from: github.com/mgoettsche/SparkTwitterHelloWorldExample

The Application

Below is the code for a minimal runnable Spark application using the Twitter Streaming API. All the application does is print the first ten tweets (or fewer, if Twitter provided fewer) of each one-second interval in an infinite loop. For the application to work, copy and paste the relevant parameters of your Twitter application into lines 14-17.

As you can see from the names of the access properties, Spark uses the Twitter4J library, and the stream provides objects of class twitter4j.Status, which, besides getText(), offers methods for accessing data beyond the status message itself.

For example, you can also access the location from which the user posted the status using getGeoLocation(), which returns a GeoLocation object. Posting with the current location is an opt-in feature, so not all tweets have a location attached. Luckily, Spark makes it easy to filter streams [4]. Replacing lines 29-33 with:

will only keep the tweets with the user’s location attached and output their location and text to the standard output.
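The filtering idea itself can be sketched independently of Spark and Twitter4J. The following plain-Java example is not the code from the repository: `GeoFilterSketch` and its `Tweet` class are hypothetical stand-ins for `twitter4j.Status`, and it assumes that a missing location is represented as `null`. It keeps only the tweets that carry a location and formats location and text for output, which is the same filter-then-map pattern you would apply to the stream.

```java
import java.util.List;
import java.util.stream.Collectors;

/** Plain-Java illustration of "keep only tweets with a location"; no Spark involved. */
final class GeoFilterSketch {

    /** Minimal stand-in for a tweet (hypothetical; the real stream yields twitter4j.Status). */
    static final class Tweet {
        final String text;
        final Double latitude;   // null when no location is attached (assumption)
        final Double longitude;  // null when no location is attached (assumption)

        Tweet(String text, Double latitude, Double longitude) {
            this.text = text;
            this.latitude = latitude;
            this.longitude = longitude;
        }

        boolean hasLocation() {
            return latitude != null && longitude != null;
        }
    }

    /** Drops tweets without a location and formats the rest as "lat,lon: text". */
    static List<String> describeLocated(List<Tweet> tweets) {
        return tweets.stream()
                .filter(Tweet::hasLocation)
                .map(t -> t.latitude + "," + t.longitude + ": " + t.text)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Tweet> batch = List.of(
                new Tweet("Hello from Hamburg", 53.55, 9.99),
                new Tweet("No location here", null, null));

        // Only the first tweet survives the filter.
        describeLocated(batch).forEach(System.out::println);
    }
}
```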

Building and Running

I have included a Maven pom.xml for building the application. There is nothing special about it except perhaps that it uses the maven-assembly-plugin to build a JAR that includes the Spark Streaming Twitter library, which avoids ClassNotFoundExceptions when deploying the app. To build, simply execute

And to execute

Example output:

[Image: SparkStreamingTwitterExampleOutput]

Wrapping up

The application I presented here does not yet do anything with the received tweets; it merely demonstrates how to establish the Spark/Twitter connection. As a next step, one could experiment with processing the tweets, e.g. by extracting and counting hashtags or by creating usage statistics based on the geolocation. Be aware, though, that the public streaming API only delivers a small sample of all tweets. If I find the time, I will write a post with another sample application that does some processing of the statuses.
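To give a rough idea of what hashtag counting could look like, here is a small plain-Java sketch that works on a list of tweet texts, for instance the texts of one batch. It is not part of the example application; the class `HashtagCountSketch` and its method are invented for this illustration, and the hashtag pattern is a deliberate simplification of Twitter's actual rules.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Plain-Java illustration of extracting and counting hashtags; no Spark involved. */
final class HashtagCountSketch {

    // Simplified rule: '#' followed by ASCII letters, digits, or underscores.
    private static final Pattern HASHTAG = Pattern.compile("#(\\w+)");

    /** Counts hashtags case-insensitively; the result is sorted by tag name. */
    static Map<String, Integer> countHashtags(List<String> texts) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String text : texts) {
            Matcher matcher = HASHTAG.matcher(text);
            while (matcher.find()) {
                String tag = matcher.group(1).toLowerCase(Locale.ROOT);
                counts.merge(tag, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> batch = List.of(
                "Learning #Spark and #Scala today",
                "#spark streaming is fun",
                "no tags in this one");

        System.out.println(countHashtags(batch)); // prints {scala=1, spark=2}
    }
}
```

In a streaming setting, the same counting would run once per batch, possibly with the per-batch counts merged over a longer window.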

Notes:

  1. Bollen J, Mao H, Zeng XJ (2011), Twitter mood predicts the stock market. Journal of Computational Science 2: 1–8.
  2. This may at first sound like the holy grail of becoming rich using data analytics. However, the study was performed on an ex-post basis and thus can’t serve well as a trading guide.
  3. Sakaki T, Okazaki M, Matsuo Y (2010), Earthquake shakes Twitter users: real-time event detection by social sensors. WWW ’10: Proceedings of the 19th International Conference on World Wide Web, pages 851-860.
  4. This is a local filter, i.e. it filters the tweets after they have been received. Twitter also lets you pass filter parameters directly with the API call for server-side filtering.