
streamR's Introduction

streamR: Access to Twitter Streaming API via R

This package includes a series of functions that give R users access to Twitter's Streaming API, as well as a tool that parses the captured tweets and transforms them into R data frames, which can then be used in subsequent analyses. streamR requires authentication via OAuth and the ROAuth package.

The current CRAN release is 0.2.1. To install the most up-to-date version (0.4.0) from GitHub, type:

library(devtools)
devtools::install_github("pablobarbera/streamR/streamR")

See the documentation and the vignette for more details.

Installation and authentication

streamR can be installed directly from CRAN, but the most up-to-date version will always be on GitHub. The code below shows how to install from both sources.

install.packages("streamR")  # from CRAN
devtools::install_github("pablobarbera/streamR/streamR") # from GitHub

streamR requires authentication via OAuth. The same OAuth token can be used for both twitteR and streamR. After creating an application on Twitter's developer site and obtaining the consumer key and consumer secret, it is easy to create your own OAuth credentials using the ROAuth package, which can be saved to disk for future sessions:

library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "xxxxxyyyyyzzzzzz"
consumerSecret <- "xxxxxxyyyyyzzzzzzz111111222222"
my_oauth <- OAuthFactory$new(consumerKey = consumerKey, consumerSecret = consumerSecret, 
    requestURL = requestURL, accessURL = accessURL, authURL = authURL)
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
save(my_oauth, file = "my_oauth.Rdata")

Alternatively, you can create an access token as a list, and streamR will automatically do the handshake:

my_oauth <- list(consumer_key = "CONSUMER_KEY",
                 consumer_secret = "CONSUMER_SECRET",
                 access_token = "ACCESS_TOKEN",
                 access_token_secret = "ACCESS_TOKEN_SECRET")
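
This list can then be passed directly to any of the streaming functions; a minimal sketch (the keyword and file name are just examples):

## streamR does the handshake automatically with a list-based token
filterStream("tweets.json", track = "rstats", timeout = 60, oauth = my_oauth)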

filterStream

filterStream is probably the most useful function. It opens a connection to the Streaming API that will return all tweets that contain one or more of the keywords given in the track argument. We can use this function to, for instance, capture public statuses that mention Obama or Biden:

library(streamR)
## Loading required package: RCurl
## Loading required package: bitops
## Loading required package: rjson
load("my_oauth.Rdata")
filterStream("tweets.json", track = c("Obama", "Biden"), timeout = 120, 
  oauth = my_oauth)
## Loading required package: ROAuth
## Loading required package: digest
## Capturing tweets...
## Connection to Twitter stream was closed after 120 seconds with up to 350 tweets downloaded.
tweets.df <- parseTweets("tweets.json", simplify = TRUE)
## 350 tweets have been parsed.

Note that here I'm connecting to the stream for just two minutes, but ideally the connection would be kept continuously open, with some method to handle exceptions and reconnect when there's an error (a sketch of one approach follows the next code block). I'm also using OAuth authentication (see above), and storing the tweets in a data frame using the parseTweets function. As I expected, Obama is mentioned more often than Biden at the moment I created this post:

c( length(grep("obama", tweets.df$text, ignore.case = TRUE)),
   length(grep("biden", tweets.df$text, ignore.case = TRUE)) )
## [1] 347  2
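
As noted above, here is a sketch of one way to keep the connection open indefinitely, reconnecting after errors. This is not a built-in feature of the package; the file names, keywords, timeout, and wait time are just examples:

library(streamR)
load("my_oauth.Rdata")
i <- 0
while (TRUE) {
    i <- i + 1
    current.file <- paste0("tweets_", i, ".json")  # a new file per connection
    tryCatch(
        filterStream(current.file, track = c("Obama", "Biden"),
            timeout = 3600, oauth = my_oauth),
        error = function(e) message("Connection error: ", e$message)
    )
    Sys.sleep(60)  # pause before reconnecting to avoid hammering the API
}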

Tweets can also be filtered by two additional parameters: follow, which can be used to include tweets published only by a subset of Twitter users, and locations, which will return geo-located tweets sent within bounding boxes defined by a set of coordinates. Using these two options involves some additional complications – for example, the Twitter users need to be specified as a vector of user IDs and not just screen names, and the locations filter is incremental to any keyword in the track argument. For more information, I suggest checking Twitter's documentation for each parameter.
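
For instance, a call using follow might look like the sketch below; the numeric IDs are placeholders, not real accounts:

## follow takes numeric user IDs (as strings), not screen names
filterStream("tweetsUsers.json", follow = c("123456789", "987654321"),
    timeout = 60, oauth = my_oauth)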

Here's a quick example of how one would capture and visualize tweets sent from the United States:

filterStream("tweetsUS.json", locations = c(-125, 25, -66, 50), timeout = 300, 
    oauth = my_oauth)
tweets.df <- parseTweets("tweetsUS.json", verbose = FALSE)
library(ggplot2)
library(grid)
map.data <- map_data("state")
points <- data.frame(x = as.numeric(tweets.df$lon), y = as.numeric(tweets.df$lat))
points <- points[points$y > 25, ]
ggplot(map.data) +
    geom_map(aes(map_id = region), map = map.data, fill = "white",
        color = "grey20", size = 0.25) +
    expand_limits(x = map.data$long, y = map.data$lat) +
    theme(axis.line = element_blank(), axis.text = element_blank(),
        axis.ticks = element_blank(), axis.title = element_blank(),
        panel.background = element_blank(), panel.border = element_blank(),
        panel.grid.major = element_blank(), plot.background = element_blank(),
        plot.margin = unit(0 * c(-1.5, -1.5, -1.5, -1.5), "lines")) +
    geom_point(data = points, aes(x = x, y = y), size = 1, alpha = 1/5,
        color = "darkblue")

[Figure: map of tweets sent from the United States]

sampleStream

The function sampleStream allows the user to capture a small random sample (around 1%) of all tweets that are being sent at each moment. This can be useful for different purposes, such as estimating variations in “global sentiment” or describing the average Twitter user. A quick analysis of the public statuses captured with this method shows, for example, that the average (active) Twitter user follows around 500 other accounts, that a very small proportion of tweets are geo-located, and that Spanish is the second most common language in which Twitter users set up their interface.

sampleStream("tweetsSample.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("tweetsSample.json", verbose = FALSE)
mean(as.numeric(tweets.df$friends_count))
## [1] 543.5
table(is.na(tweets.df$lat))
## 
## FALSE  TRUE 
##   228 13503
round(sort(table(tweets.df$lang), decreasing = T)[1:5]/sum(table(tweets.df$lang)), 2)
## 
##   en   es   ja   pt   ar 
## 0.57 0.16 0.09 0.07 0.03

userStream

Finally, I have also included the function userStream, which allows the user to capture the tweets they would see in their timeline on twitter.com. As was the case with filterStream, this function allows the user to subset tweets by keyword and location, and to exclude replies between users who are not followed. An example is shown below; a sketch of these subsetting options follows it. Perhaps not surprisingly, many of the accounts I follow use Twitter in Spanish.

userStream("mytweets.json", timeout = 120, oauth = my_oauth, verbose = FALSE)
tweets.df <- parseTweets("mytweets.json", verbose = FALSE)
round(sort(table(tweets.df$lang), decreasing = T)[1:3]/sum(table(tweets.df$lang)), 2)
## 
##   en   es   ca 
## 0.62 0.30 0.08
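
To illustrate the subsetting options mentioned above, here is a sketch; the with and replies arguments reflect my reading of the package documentation, and the keyword is just an example:

## keep tweets from followed accounts that mention a keyword, and
## include all @replies sent by accounts the user follows
userStream("mytweets2.json", with = "followings", replies = "all",
    track = "Obama", timeout = 120, oauth = my_oauth)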

More

In these examples I have used parseTweets to read the captured tweets from the text file where they were saved on disk and store them in a data frame in memory. The tweets can also be stored directly in memory by leaving the file.name argument empty (see the sketch after the next code block), but my personal preference is to save the raw text, usually in different files, one for each hour or day. Having the files means I can run UNIX commands to quickly compute the number of tweets in each period, since each tweet is saved on a separate line:

system("wc -l 'tweetsSample.json'", intern = TRUE)
## [1] "   15086 tweetsSample.json"
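
For completeness, a sketch of capturing tweets directly into memory; this assumes, per the note above, that filterStream returns the raw tweets as a character vector when file.name is left empty:

## capture to memory instead of a file; keyword and timeout are examples
tweets <- filterStream(file.name = "", track = "Obama",
    timeout = 30, oauth = my_oauth)
tweets.df <- parseTweets(tweets)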

Concluding...

I hope this package is useful for R users who want to at least play around with this type of data. Future releases of the package will include additional functions to analyze captured tweets, and will improve the existing ones so that they handle errors better. My plan is to keep the GitHub version up to date, fixing any possible bugs, and to release only major versions to CRAN.

You can contact me at pablo.barbera[at]nyu.edu or via Twitter (@p_barbera) with any questions or suggestions you might have, or to report any bugs in the code.

streamR's People

Contributors

expectopatronum, kronosapiens, pablobarbera


streamR's Issues

Unknown SSL protocol error

Dear community,
I ran the code to get started with streamR just as in the description, inserting my consumer key/secret through the ROAuth package. However, when doing the handshake I get the following error:

Error in function (type, msg, asError = TRUE) : Unknown SSL protocol error in connection to api.twitter.com:443

Does anyone have an idea how to solve this?

Couldn't convert json string to dataframe using parseTweets() ; Error in iconv(lines, "ASCII", "UTF-8", sub = "")

I tried converting .json file to dataframe in R.
But it ended up showing:

> thanioruvan.tweets <- parseTweets("thanioruvan_tweets.json")
Error in iconv(lines, "ASCII", "UTF-8", sub = "") :
  embedded nul in string: '{"created_at":["Mon Sep 21 01:48:19 +0000 2015"],"id":[6.45776539955532e+017],"id_str":["645776539955531776"],"text":["RT @SathieshDhas: Trvld 150km 4hrs up dwn, spnt 30\0 2 watch #ThaniOruvan Worth doin mre.....

thanioruvan_tweets.json has nearly 980 tweets downloaded using getTimeline()

Kindly fix up this issue.

Thanks in advance!

Writing on file too slow?

I'm using the filterStream function, but I realized that there is a discrepancy between the number of tweets the function reports as downloaded in its output message and the actual number of tweets in the JSON file... why?!

Reconciliation with twitteR

It would be great if this package stored tweets in the same form as the classes in the twitteR package, so that you could use the same code to process the outputs of both.

StreamR with mongoDB

Hi Pablo,

This is Cyrille, alias @Soc_Net_Intel on Twitter, from France. Dad of 3, working, and going back to university to get a PhD... trying to, at least!

I need your help with your streamR-with-MongoDB repository, which I think could be an alternative solution for me.

For a few months I worked with Java code in an IDE to stream Twitter into a MongoDB database, then used the rmongodb package to extract the tweets from the MongoDB collections. My main problem was that my Java code, based on the Twitter4j library, was not complete, and thus the data from Twitter was never complete either.

Since discovering R, especially for data analytics and machine learning, I would like to explore this possibility based on your streamR-with-MongoDB repository.

I have tested your filterStream() function, adding the format.twitter.date() function inside, but I have some problems:

  • only collecting tweets to a file works properly for me (some minor tests with a few tweets using the track argument);
  • collecting tweets to a MongoDB database/collection does not work properly (it stops before even one tweet is collected entirely); mongod and mongo are configured (mongod listening and established, mongo established), a specific port is also specified in the filterStream() call, and there is no alert from MongoDB through R;
  • collecting tweets to a file while also collecting to a MongoDB database/collection does not work properly either (it stops before one tweet is collected entirely);
  • after a few test requests in RStudio, I get the "Exceeded connection limit for user" alert in the file; I wonder how I can check how many connection requests are in progress with my Twitter account through R.

I may have made a mistake or done something wrong with MongoDB, which was working well with my Java code. Maybe you could help me solve this.

Cyrille.

Filtered stream closes before getting as many tweets as specified, and not possible to reconnect

I'm using the version on CRAN.

The first time I ran

filterStream( file.name="jeremy_corbyn.json",
              track='"jeremy corbyn"', tweets=10, oauth=my_oauth )

tweets.df <- parseTweets("jeremy_corbyn.json")

the stream closed after 62 seconds and downloaded only 2 tweets even though I specified 10.

On subsequent attempts to connect, it disconnects after 1 second and adds "Exceeded connection limit for user" to jeremy_corbyn.json

[edit]

I restarted my computer.
Now when I run the above, I get

Warning message:
In readLines(tweets, encoding = "UTF-8") :
  incomplete final line found on 'jeremy_corbyn.json'

tryCatch not firing

Admittedly I am not 100% sure if this could be handled better on my end, but I am attempting to handle errors with tryCatch.

The code below does not successfully catch the error.

  tweets = tryCatch(readTweets(FILE), 
                    error = function(e) e)
  if (inherits(tweets, "error")) {
    cat("file could not be parsed\n")
    next
  }

I have this code in a loop, where FILE is represented by this file.

Here is the error that I get:

> tweets = tryCatch(readTweets(FILE), 
+                   error = function(e) e)
0 tweets have been parsed. 
Warning message:
In readLines(tweets, encoding = "UTF-8") :
  incomplete final line found on '/home/brock/github/uga-twitter-bot/tweets//uga_06-27-2014-17-24.json'
> tweets
list()

As you can see in the FILE's contents, it appears that the service was disrupted temporarily.

I raise this issue because I am wondering if my tryCatch doesn't fire because the error occurs during the readLines call within readTweets.
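
One possible explanation, sketched below: "incomplete final line" is raised as a warning rather than an error, so an error-only handler never fires; trapping warnings as well would catch it. This is a sketch, not a confirmed fix:

tweets = tryCatch(readTweets(FILE),
                  error = function(e) e,
                  warning = function(w) w)
if (inherits(tweets, "condition")) {  # catches both errors and warnings
  cat("file could not be parsed\n")
  next
}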

Error: Authorization Required

Hi,

I'm attempting to use streamR but when running this part of my code:

requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "mykey"
consumerSecret <- "mysecret"

my_oauth <- OAuthFactory$new(consumerKey = consumerKey,
consumerSecret = consumerSecret,
requestURL = requestURL,
accessURL = accessURL,
authURL = authURL)

my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))

save(my_oauth, file = "my_oauth.Rdata")

I get "Error: Authorization Required"

and the following page shows up:

[screenshot of the error page]

Using the same consumerKey and consumerSecret with twitteR, I don't seem to have any issues with authentication.

I googled this and it seems as though others have had this same issue, but there doesn't seem to be a clear solution.

Could somebody advise me on what the issue might be due to?

I've checked both the key and secret multiple times and am 100% sure they are correct.

Thanks

Non-status entities

I would like an option in filterStream to accept non-status entities such as delete notices, scrub_geo, etc. Thanks!

parseTweets. Error encoding

Hi team! First of all, congratulations for this package.

I have problems with the parseTweets function: only a few tweets are correctly parsed. I think it is due to some encoding or charset problem. In my case, I have problems when I am capturing Spanish tweets, but not when capturing English tweets.

I would like to help, but I don't know where the problem is. The error message is:

Error in $<-.data.frame(*tmp*, "country_code", value = NA) :
In stream_in_int(path.expand(path)) : Parsing error on line 0

thank you.

parseTweets() not capturing complete text of tweet if it is a retweet.

@pablobarbera

The text field in the output of parseTweets() is not capturing the whole content of a tweet, especially if it is a retweet. I ended up with almost half of the tweets abruptly ending in "...".

I can get the complete content of the text only if I parse the file using "rjson" and then extract it via tweet$retweeted_status$text. Is this because of the new extended tweets feature in Twitter?
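
A sketch of that workaround, assuming the file contains one JSON tweet per line (the file name is an example):

library(rjson)
lines <- readLines("tweets.json")
full_text <- sapply(lines, function(l) {
    tweet <- fromJSON(l)
    ## retweets store the untruncated text in retweeted_status
    if (!is.null(tweet$retweeted_status)) tweet$retweeted_status$text
    else tweet$text
})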

authorization not working

This issue falls under the broader reconciliation-with-twitteR umbrella.

The only handling of authorization in streamR is through an OAuth object, such as the one created by ROAuth, but that kind of workflow is deprecated (if I'm not mistaken) and possibly broken (I get an Error: Forbidden whenever I try to use it), and I'm not even sure the package is actively maintained anymore (ROAuth is no longer used in favor of httr; please see ?setup_twitter_oauth).

So, is streamR broken?

feature request "favorite_count"

Hello Pablo,
please add "favorite_count" and every data point you can think of to parseTweets :)
favourites_count is how many tweets someone has favorited.
Regards Markus @msgbi

How to not miss any tweets?

Great program, thanks for making it open source. I got everything working just fine.

I used birdy before in Python and it allows you to create a stream that runs forever like so:

response = client.stream.statuses.filter.post(track='twitter')

for data in response.stream():
    print data

Is this possible to do with streamR as well? I can run it for a certain length of time and then parse the results, but if I run it again I could have missed something in the meantime. How do I avoid this? Is this a limitation of R? Thanks
