---
title: "Scraping data from Twitter's Streaming API"
author: "Pablo Barbera"
date: "January 31, 2017"
output: html_document
---

### Scraping web data from Twitter

#### Authenticating

Follow these steps to create your token:

1. Go to apps.twitter.com and sign in.
2. Click on "Create New App". You will need a phone number associated with your account to be able to create a token.
3. Fill in the name, description, and website (it can be anything, even http://www.google.com). Make sure you leave 'Callback URL' empty.
4. Agree to the user conditions.
5. From the "Keys and Access Tokens" tab, copy the consumer key and consumer secret and paste them below.

```{r, eval=FALSE}
#install.packages("ROAuth")
library(ROAuth)
requestURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"
consumerKey <- "8rhE418Fd7AXgzi0XKJzmGmL5"
consumerSecret <- "N7ZqM2sLyPCmChbuj6iSElNS0EqrXt1dcM3DOYPtm4K78ACAoe"

my_oauth <- OAuthFactory$new(consumerKey=consumerKey,
  consumerSecret=consumerSecret, requestURL=requestURL,
  accessURL=accessURL, authURL=authURL)
```

Run the line below and go to the URL that appears on screen. Then, type the PIN into the console (RStudio sometimes doesn't show what you're typing, but it's there!)

```{r, eval=FALSE}
my_oauth$handshake(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl"))
```

Now you can save the OAuth token for use in future sessions with netdemR or streamR. Make sure you save it in a folder where it is the only file.

```{r, eval=FALSE}
save(my_oauth, file="credentials/twitter-token-2.Rdata")
```
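
If you plan to share your script, you may prefer not to paste the consumer key and secret into it. One option is to read them from environment variables. The sketch below is only an illustration: the variable names `TWITTER_CONSUMER_KEY` and `TWITTER_CONSUMER_SECRET` and the helper `read_twitter_keys` are my own assumptions, not something that ROAuth or streamR requires.

```{r}
# Hypothetical helper: read the keys from environment variables
# (set them in ~/.Renviron or in your shell, not in the script itself)
read_twitter_keys <- function() {
  key <- Sys.getenv("TWITTER_CONSUMER_KEY")
  secret <- Sys.getenv("TWITTER_CONSUMER_SECRET")
  # Sys.getenv returns "" when a variable is not set
  if (!nzchar(key) || !nzchar(secret)) {
    stop("Set TWITTER_CONSUMER_KEY and TWITTER_CONSUMER_SECRET first.")
  }
  list(consumerKey = key, consumerSecret = secret)
}

# Placeholder values, only so that this chunk runs on its own
Sys.setenv(TWITTER_CONSUMER_KEY = "my-key",
    TWITTER_CONSUMER_SECRET = "my-secret")
keys <- read_twitter_keys()
names(keys)
```

The two elements of `keys` could then be passed to `OAuthFactory$new` in place of the hard-coded strings.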

#### Collecting data from Twitter's Streaming API

Collecting tweets filtering by keyword:

```{r}
#install.packages("streamR")
library(streamR)
load("credentials/twitter-token-2.Rdata")
filterStream(file.name="trump-tweets.json", track="trump",
    timeout=20, oauth=my_oauth)
```

Note the options:

- `file.name` indicates the file on your disk where the tweets will be saved
- `track` is the keyword(s) that must be mentioned in the tweets we want to capture
- `timeout` is the number of seconds that the connection will remain open
- `oauth` is the OAuth token we are using

Once it has finished, we can open the file in R as a data frame with the `parseTweets` function:
```{r}
tweets <- parseTweets("trump-tweets.json")
str(tweets)
tweets[1,]
```

If we want, we can also export it to a CSV file to be opened later with Excel:
```{r}
write.csv(tweets, file="trump-tweets.csv", row.names=FALSE)
```

And this is how we would capture tweets mentioning multiple keywords:
```{r, eval=FALSE}
filterStream(file.name="politics-tweets.json",
    track=c("graham", "sessions", "trump", "clinton"),
    tweets=20, oauth=my_oauth)
```

Note that here I choose a different option, `tweets`, which indicates how many tweets (approximately) the function should capture before we close the connection to the Twitter API.
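
As far as I understand it, when several keywords are given, a tweet is captured if it mentions any one of them. To get a feeling for what that means, here is a small local simulation on made-up texts (the `toy_texts` vector is invented for illustration). It uses simple case-insensitive substring matching, which is only an approximation of how Twitter matches keywords.

```{r}
keywords <- c("graham", "sessions", "trump", "clinton")
# Made-up example texts, not real tweets
toy_texts <- c("Trump speaks tonight",
    "Nothing political here",
    "Clinton and Graham on TV")

# A text matches if it contains ANY of the keywords
pattern <- paste(keywords, collapse = "|")
matched <- grepl(pattern, toy_texts, ignore.case = TRUE)
matched

# Number of different keywords found in each text
hits <- sapply(keywords, function(k) grepl(k, toy_texts, ignore.case = TRUE))
n_hits <- rowSums(hits)
n_hits
```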

This second example shows how to collect tweets filtering by location instead. In other words, we can set a geographical box and collect only the tweets that are coming from that area.

For example, imagine we want to collect tweets from the United States. The way to do it is to find two pairs of coordinates (longitude and latitude) that indicate the southwest corner AND the northeast corner. Note the reverse order: it's not (lat, long), but (long, lat).

In the case of the US, it would be approx. (-125,25) and (-66,50). To find these coordinates, I use: `http://itouchmap.com/latlong.html`

```{r}
filterStream(file.name="tweets_geo.json", locations=c(-125, 25, -66, 50),
    timeout=30, oauth=my_oauth)
```
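
Because the order of the four numbers is easy to get wrong, it can help to build the box from its two corners and check it with a couple of known points before opening the stream. This is just a sketch: `make_bbox` and `in_bbox` are hypothetical helpers of my own, not functions from streamR, and they do not handle boxes that cross the 180th meridian.

```{r}
# Hypothetical helper: build a bounding box in (long, lat) order
# from the southwest and northeast corners
make_bbox <- function(sw, ne) {
  stopifnot(length(sw) == 2, length(ne) == 2)
  stopifnot(abs(sw[1]) <= 180, abs(ne[1]) <= 180)  # longitudes
  stopifnot(abs(sw[2]) <= 90, abs(ne[2]) <= 90)    # latitudes
  stopifnot(sw[1] < ne[1], sw[2] < ne[2])          # SW is below and left of NE
  c(sw, ne)
}

# Hypothetical helper: is each (lon, lat) point inside the box?
in_bbox <- function(lon, lat, bbox) {
  lon >= bbox[1] & lon <= bbox[3] & lat >= bbox[2] & lat <= bbox[4]
}

us_box <- make_bbox(sw = c(-125, 25), ne = c(-66, 50))
us_box

# Los Angeles should be inside the box, Paris should not
inside <- in_bbox(lon = c(-118.24, 2.35), lat = c(34.05, 48.86), bbox = us_box)
inside
```

The resulting `us_box` vector has the same form as the value passed to `locations` above.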

We can do as before and open the tweets in R
```{r}
tweets <- parseTweets("tweets_geo.json")
```

And use the maps library to see where most tweets are coming from. Note that there are two types of geographic information on tweets: `lat`/`lon` (from geolocated tweets) and `place_lat` and `place_lon` (from tweets with place information). We will work with whatever is available.
```{r}
library(maps)
tweets$lat <- ifelse(is.na(tweets$lat), tweets$place_lat, tweets$lat)
tweets$lon <- ifelse(is.na(tweets$lon), tweets$place_lon, tweets$lon)
states <- map.where("state", tweets$lon, tweets$lat)
head(sort(table(states), decreasing=TRUE))
```
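
To see what the two `ifelse` lines are doing, here is the same fallback applied to a tiny made-up data frame (the values in `toy` are invented for illustration): the exact coordinates are kept when they exist, the place coordinates are used otherwise, and a tweet with neither stays as `NA`.

```{r}
# Made-up data: one geolocated tweet, one with place information only,
# and one with no geographic information at all
toy <- data.frame(
  lat = c(40.7, NA, NA),
  lon = c(-74.0, NA, NA),
  place_lat = c(40.6, 34.0, NA),
  place_lon = c(-73.9, -118.2, NA)
)

# Same fallback as above: prefer lat/lon, otherwise use place_lat/place_lon
toy$lat <- ifelse(is.na(toy$lat), toy$place_lat, toy$lat)
toy$lon <- ifelse(is.na(toy$lon), toy$place_lon, toy$lon)
toy[, c("lat", "lon")]

# How many tweets end up with usable coordinates?
n_located <- sum(!is.na(toy$lat) & !is.na(toy$lon))
n_located
```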

We can also prepare a map of the exact locations of the tweets.

```{r, fig.height=6, fig.width=10}
#install.packages("ggplot2")
library(ggplot2)

## First create a data frame with the map data
map.data <- map_data("state")

# And we use ggplot2 to draw the map:
# 1) map base
ggplot(map.data) + geom_map(aes(map_id = region), map = map.data, fill = "grey90",
    color = "grey50", size = 0.25) + expand_limits(x = map.data$long, y = map.data$lat) +
    # 2) limits for x and y axis
    scale_x_continuous(limits=c(-125,-66)) + scale_y_continuous(limits=c(25,50)) +
    # 3) adding the dot for each tweet
    geom_point(data = tweets,
    aes(x = lon, y = lat), size = 1, alpha = 1/5, color = "darkblue") +
    # 4) removing unnecessary graph elements
    theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        axis.title = element_blank(),
        panel.background = element_blank(),
        panel.border = element_blank(),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        plot.background = element_blank())
```


Finally, it's also possible to collect a random sample of tweets. That's what the `sampleStream` function does:

```{r}
sampleStream(file.name="tweets_random.json", timeout=30, oauth=my_oauth)
```

Here I'm collecting 30 seconds of tweets. And once again, to open the tweets in R...
```{r}
tweets <- parseTweets("tweets_random.json")
```

What is the most retweeted tweet?
```{r}
tweets[which.max(tweets$retweet_count),]
```

What are the most popular hashtags at the moment? We'll use regular expressions to extract hashtags.
```{r}
library(stringr)
ht <- str_extract_all(tweets$text, "#(\\d|\\w)+")
ht <- unlist(ht)
head(sort(table(ht), decreasing = TRUE))
```
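
If the regular expression looks cryptic, it may help to try it on a few made-up strings first. The sketch below uses base R (`gregexpr` and `regmatches`) instead of stringr so that it runs without extra packages, and the simpler pattern `#\\w+`, which should match the same strings because `\\w` already includes digits. The `toy_texts` vector is invented for illustration.

```{r}
# Made-up example texts, not real tweets
toy_texts <- c("Loving #rstats and #DataScience",
    "No tags here",
    "#rstats again, plus #2017")

# Extract every run of word characters that follows a '#'
toy_ht <- unlist(regmatches(toy_texts, gregexpr("#\\w+", toy_texts)))
toy_ht

# Count them, lower-casing first so that #RStats and #rstats are merged
toy_counts <- sort(table(tolower(toy_ht)), decreasing = TRUE)
toy_counts
```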

How many tweets mention Justin Bieber?
```{r}
length(grep("bieber", tweets$text, ignore.case=TRUE))
```