avax-tweets-dataset's Introduction

Avax tweets dataset

This repository contains two collections of tweets related to vaccine hesitancy on Twitter. The "streaming collection" contains tweets collected through the Twitter streaming API by tracking a set of anti-vaccine keywords; the full list of keywords is in keywords.txt. The "account collection" contains historical tweets from accounts that are susceptible to anti-vaccine narratives. To comply with Twitter's Terms of Service, only tweet IDs are released. The data is for non-commercial research purposes only. We hope it will help those who study and track anti-vaccine misinformation on social media and enable a better understanding of vaccine hesitancy.

The paper associated with this repository is available at: https://publichealth.jmir.org/2021/11/e30642

Data Organization

The "streaming-tweetids" folder holds the streaming collection and the "account-tweetids" folder holds the account collection. All files are in .txt format, each containing a list of tweet IDs. Account collection files are named from 0 to 387. Streaming collection files are organized into 7 folders, each corresponding to one month of 2020 or 2021.
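
As a rough illustration of this layout, the sketch below walks one collection folder and yields every tweet ID it finds. It assumes one ID per line in each .txt file (the description above only says each file contains a list of IDs), and the function name iter_tweet_ids is hypothetical, not part of the repository.

```python
from pathlib import Path
from typing import Iterator


def iter_tweet_ids(collection_dir: str) -> Iterator[str]:
    """Yield tweet IDs from every .txt file under collection_dir.

    Assumes one tweet ID per line (an assumption, not confirmed by the source).
    """
    # rglob also descends into sub-folders, such as the monthly folders
    # of the streaming collection.
    for path in sorted(Path(collection_dir).rglob("*.txt")):
        with path.open() as handle:
            for line in handle:
                tweet_id = line.strip()
                if tweet_id:  # skip blank lines
                    yield tweet_id
```

For example, sum(1 for _ in iter_tweet_ids("account-tweetids")) would count the IDs in the account collection without loading them all into memory.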

Notes about the data

  1. Only English tweets are considered.
  2. Our data collections are summarized below:

                                 Streaming collection   Account collection
     Number of tweets            53,598,237             135,949,773
     Number of accounts          8,120,945              78,954
     Verified accounts           -                      239
     Average tweets per account  6.6                    1721.8
     Accounts with location      -                      363
     Oldest tweet                2010-10-19             2007-03-06
     Most recent tweet           2022-04-08             2021-02-02

  3. Tools such as Hydrator, Twarc, and Tweepy can rehydrate the tweet IDs. For detailed instructions, see the next section.

  4. If you have difficulty accessing some of the data, please contact the authors: [email protected]

How to Hydrate

Hydrating using Hydrator (GUI)

Navigate to the Hydrator GitHub repository and follow the installation instructions in its README. Because this repository contains many separate tweet ID files, consider merging the files from the timeframes of interest into one larger file before hydrating the tweets through the GUI.
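
The merging step could be sketched as follows. This is a minimal example, not a script shipped with the repository: merge_id_files is a hypothetical helper, it assumes one tweet ID per line, and it drops duplicate IDs so the same tweet is not hydrated twice.

```python
from pathlib import Path
from typing import Iterable


def merge_id_files(id_files: Iterable[str], merged_path: str) -> int:
    """Concatenate several tweet-ID files into one, dropping duplicate IDs.

    Returns the number of unique IDs written. Assumes one ID per line.
    """
    seen = set()
    with open(merged_path, "w") as out:
        for name in id_files:
            for line in Path(name).read_text().splitlines():
                tweet_id = line.strip()
                # Keep only the first occurrence of each ID, in input order.
                if tweet_id and tweet_id not in seen:
                    seen.add(tweet_id)
                    out.write(tweet_id + "\n")
    return len(seen)
```

The resulting merged file can then be loaded into Hydrator as a single dataset. All IDs are held in a set while merging, so for very large selections it may be better to merge in smaller groups.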

Hydrating using Twarc (CLI)

Many thanks to Ed Summers (edsu) for writing this script that uses Twarc to hydrate all Tweet-IDs stored in their corresponding folders.

First install Twarc and tqdm

pip3 install twarc
pip3 install tqdm

Configure Twarc with your Twitter API tokens (note you must apply for a Twitter developer account first in order to obtain the needed tokens). You can also configure the API tokens in the script, if unable to configure through CLI.

twarc configure

Run the script. The hydrated tweets are saved as a compressed jsonl file in the same folder as the corresponding tweet ID file.

python3 hydrate.py -streaming

for hydrating the streaming collection or

python3 hydrate.py -account

for hydrating the account collection
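
Once hydration finishes, the output can be read back line by line. The sketch below is only an illustration under two assumptions that the instructions above do not confirm: that "compressed" means gzip, and that each line is a v1.1-style tweet object with a user.id_str field. The function names are hypothetical.

```python
import gzip
import json
from typing import Iterator


def read_hydrated(path: str) -> Iterator[dict]:
    """Yield one tweet object per line from a gzip-compressed JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():  # ignore empty lines
                yield json.loads(line)


def count_unique_authors(path: str) -> int:
    """Count distinct author IDs, assuming v1.1-style tweets with user.id_str."""
    return len({tweet["user"]["id_str"] for tweet in read_hydrated(path)})
```

Counting authors by ID rather than by screen name avoids counting the same account twice when it has changed its handle.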

Hydrating using Tweepy:

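import tweepy

# consumer_key and consumer_secret come from your Twitter developer app
auth = tweepy.AppAuthHandler(consumer_key, consumer_secret)
# Written for Tweepy 3.x; in Tweepy 4.x, wait_on_rate_limit_notify was removed and statuses_lookup was renamed lookup_statuses
api = tweepy.API(auth, retry_count=5, retry_delay=2, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
# The lookup accepts at most 100 IDs per call, so pass list_of_ids in batches
tweets = api.statuses_lookup(list_of_ids)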
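
One limitation worth noting is that the underlying statuses lookup endpoint accepts at most 100 IDs per request, so a long list has to be split up before calling api.statuses_lookup. A minimal sketch of that batching, with batches as a hypothetical helper name:

```python
from typing import Iterator, List, Sequence


def batches(ids: Sequence[str], size: int = 100) -> Iterator[List[str]]:
    """Split ids into consecutive lists of at most `size` items."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


# Possible usage with the api object created above (not executed here):
#   tweets = []
#   for batch in batches(list_of_ids):
#       tweets.extend(api.statuses_lookup(batch))
```

Deleted or protected tweets are simply absent from the response, so the number of hydrated tweets can be lower than the number of IDs submitted.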

Data Usage Agreement / How to Cite

By using this dataset, you agree to remain in compliance with Twitter's Terms of Service and to cite the following manuscript: Muric G, Wu Y, Ferrara E. COVID-19 Vaccine Hesitancy on Social Media: Building a Public Twitter Data Set of Antivaccine Content, Vaccine Misinformation, and Conspiracies. JMIR Public Health Surveill 2021;7(11):e30642

@article{Muric2021,
author = {Muric, Goran and Wu, Yusong and Ferrara, Emilio},
doi = {10.2196/30642},
eprint = {2105.05134},
issn = {2369-2960},
journal = {JMIR Public Health and Surveillance},
keywords = {COVID-19,COVID-19 vaccines,SARS-CoV-2,Twitter,conspiracy,dataset,hesitancy,misinformation,network analysis,public health,social media,trust,utilization,vaccine,vaccine hesitancy},
month = {nov},
number = {11},
pages = {e30642},
publisher = {JMIR Public Health and Surveillance},
title = {{COVID-19 Vaccine Hesitancy on Social Media: Building a Public Twitter Data Set of Antivaccine Content, Vaccine Misinformation, and Conspiracies}},
url = {https://publichealth.jmir.org/2021/11/e30642},
volume = {7},
year = {2021}
}

avax-tweets-dataset's People

Contributors

epicfaace, gmuric, lishichengyan

avax-tweets-dataset's Issues

More users in account collection than listed?

I rehydrated about 5 million tweets from the account collection using Twarc, and am a little confused about the number of unique accounts I am getting. After running users <- unique(as.data.frame(tweets$user_screen_name)) in R, I get 759,548 unique accounts, which is obviously way more than the ~70k you list in the summary statistics for the dataset, and I only hydrated a fraction of the tweet IDs. I checked for parsing errors and everything seems OK on that front. Any ideas what may be going on?

Missing Tweets in the streaming collection

Hi, I used Twitter Academic Research resources and Twarc to rehydrate streaming collection tweets. But only managed to get 1.2M tweets instead of 1.8M. There were no tweets for some tweet ids in this list. Any reasons?

Hydrate.py missing?

Hi! Thanks for this dataset. I'm eager to use it for a research project. Did you maybe forget to push the hydrate.py? Thanks a lot.
