Mining Twitter Data with Python Part 1: Collecting Data
http://www.kdnuggets.com/2016/06/mining-twitter-data-python-part-1.html
Part 1 of a 7-part series focusing on mining Twitter data for a variety of use cases. This first post lays the groundwork and focuses on data collection.
By Marco Bonzanini, Independent Data Science Consultant.
Twitter is a popular social network where users share short, SMS-like messages called tweets. Users share thoughts, links and pictures on Twitter, journalists comment on live events, and companies promote products and engage with customers. The list of different ways to use Twitter could be really long, and with 500 million tweets per day, there’s a lot of data to analyse and to play with.
This is the first in a series of articles dedicated to mining data on Twitter using Python. In this first part, we’ll see different options to collect data from Twitter. Once we have built a data set, in the next episodes we’ll discuss some interesting data
applications.
![](http://www.kdnuggets.com/wp-content/uploads/twitter-banner.jpg)
Register Your App
In order to have access to Twitter data programmatically, we need to create an app that interacts with the Twitter API.
The first step is the registration of your app. In particular, you need to point your browser to http://apps.twitter.com, log in to Twitter (if you’re not already logged in) and register a new application.
You can now choose a name and a description for your app (for example “Mining Demo” or similar). You will receive a consumer key and a consumer secret: these are application settings that should always be kept private. From the configuration page of your app, you can also request an access token and an access token secret. Like the consumer keys, these strings must also be kept private: they provide the application access to Twitter on behalf of your account. The default permissions are read-only, which is all we need in our case, but if you decide to change your permissions to provide writing features in your app, you must negotiate a new access token.
Important Note: there are rate limits on the use of the Twitter API, as well as limitations in case you want to provide a downloadable data set; see:
https://dev.twitter.com/overview/terms/agreement-and-policy
https://dev.twitter.com/rest/public/rate-limiting
Accessing the Data
Twitter provides REST APIs you can use to interact with their service. There are also several Python clients out there that we can use without re-inventing the wheel. In particular, Tweepy is one of the most interesting and straightforward to use, so let’s install it:
```
pip install tweepy==3.3.0
```
Update: release 3.4.0 of Tweepy introduced a problem with Python 3, which is currently fixed on GitHub but not yet available via pip; for this reason we’re using version 3.3.0 until a new release is available.
More updates: release 3.5.0 of Tweepy, already available via pip, seems to solve the problem with Python 3 mentioned above.
In order to authorise our app to access Twitter on our behalf, we need to use the OAuth interface:
```python
import tweepy
from tweepy import OAuthHandler

consumer_key = 'YOUR-CONSUMER-KEY'
consumer_secret = 'YOUR-CONSUMER-SECRET'
access_token = 'YOUR-ACCESS-TOKEN'
access_secret = 'YOUR-ACCESS-SECRET'

auth = OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_secret)

api = tweepy.API(auth)
```
The api variable is now our entry point for most of the operations we can perform with Twitter.
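As a side note on the rate limits mentioned earlier: Tweepy can handle them for us, sleeping until the limit window resets instead of raising an error. This is a minimal sketch, assuming the option names of Tweepy 3.x:

```python
# Let Tweepy wait out rate limits instead of failing (Tweepy 3.x options)
api = tweepy.API(auth,
                 wait_on_rate_limit=True,         # sleep when a rate limit is reached
                 wait_on_rate_limit_notify=True)  # print a notice while waiting
```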
For example, we can read our own timeline (i.e. our Twitter homepage) with:
```python
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    print(status.text)
```
Tweepy provides the convenient Cursor interface to iterate through different types of objects. In the example above we’re using 10 to limit the number of tweets we’re reading, but we can of course access more. The status variable is an
instance of the Status() class, a nice wrapper to access the data. The JSON response from the Twitter API is available in the attribute _json (with a leading underscore), which is not the raw JSON string, but a dictionary.
So the code above can be re-written to process/store the JSON:
```python
for status in tweepy.Cursor(api.home_timeline).items(10):
    # Process a single status
    process_or_store(status._json)
```
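To get a feeling for the structure of this dictionary, we can print a few of the standard fields of a tweet, e.g. its creation time, the author’s screen name and the text (field names as defined by the Twitter REST API):

```python
for status in tweepy.Cursor(api.home_timeline).items(3):
    tweet = status._json
    # 'created_at', 'user' and 'text' are standard fields of a tweet object
    print(tweet['created_at'], tweet['user']['screen_name'], tweet['text'])
```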
What if we want a list of all the users we follow (our “friends”, in Twitter terminology)? There you go:
```python
for friend in tweepy.Cursor(api.friends).items():
    process_or_store(friend._json)
```
And how about a list of all our tweets? Simple:
```python
for tweet in tweepy.Cursor(api.user_timeline).items():
    process_or_store(tweet._json)
```
In this way we can easily collect tweets (and more) and store them in the original JSON format, fairly easy to convert into different data models depending on our storage (many NoSQL technologies provide some bulk import feature).
The function process_or_store() is a placeholder for your custom implementation. In its simplest form, you could just print out the JSON, one tweet per line:
```python
import json

def process_or_store(tweet):
    print(json.dumps(tweet))
```
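If we wanted to push the tweets straight into a database instead of a file, a variant of process_or_store() could insert each dictionary as a document. A sketch using MongoDB, assuming pymongo is installed and a MongoDB server is running locally on the default port:

```python
from pymongo import MongoClient

client = MongoClient('localhost', 27017)  # assumes a local MongoDB instance
db = client['twitter']

def process_or_store(tweet):
    # Each tweet is already a dictionary, so it maps directly to a document
    db.tweets.insert_one(tweet)
```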
Streaming
In case we want to “keep the connection open” and gather all the upcoming tweets about a particular event, the Streaming API is what we need. We have to extend StreamListener() to customise the way we process the incoming data. Here is a working example that gathers all the new tweets with the #python hashtag:
```python
from tweepy import Stream
from tweepy.streaming import StreamListener

class MyListener(StreamListener):

    def on_data(self, data):
        try:
            with open('python.json', 'a') as f:
                f.write(data)
                return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    def on_error(self, status):
        print(status)
        return True

twitter_stream = Stream(auth, MyListener())
twitter_stream.filter(track=['#python'])
```
Depending on the search term, we can gather tons of tweets within a few minutes. This is especially true for live events with a world-wide coverage (World Cups, Super Bowls, Academy Awards, you name it), so keep an eye on the JSON file to understand how
fast it grows and consider how many tweets you might need for your tests. The above script will save each tweet on a new line, so you can use the command wc -l python.json from a Unix shell to know how many tweets you’ve gathered.
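The Python equivalent, which also gives us the tweets back as dictionaries ready for further processing, is just as short (a minimal sketch, assuming the whole file fits in memory):

```python
import json

with open('python.json', 'r') as f:
    tweets = [json.loads(line) for line in f]  # one JSON document per line
print(len(tweets))  # same count as wc -l python.json
```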
You can see a minimal working example of the Twitter Stream API in the following Gist:
twitter_stream_downloader.py
Summary
We have introduced Tweepy as a tool to access Twitter data in a fairly easy way with Python. There are different types of data we can collect, with the obvious focus on the “tweet” object.
Once we have collected some data, the possibilities in terms of analytics applications are endless. In the next episodes, we’ll discuss some options.
Bio: Marco Bonzanini is a Data Scientist based in London, UK. Active in the PyData community, he enjoys working in text analytics and data mining applications. He's the author
of "Mastering Social Media Mining with Python" (Packt Publishing, July 2016).
Original. Reposted with permission.