
FAU.de course S(A)WI - My files for creating a Twitter dataset for students to work with

Home Page: http://www.wi2.fau.de/teaching/master/master-courses/sawi/


2016 USA Presidential debate - Preparation of a sample of tweets

Goal:

As a student research assistant for the SAWI course at FAU during autumn 2017/2018, my task was to prepare a dataset from the 2016 USA Presidential debate that could be handed to the students for their analysis.

Data Source in question: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FPDI7IN

Specifically, the data for the first presidential debate held in the USA in 2016: https://dataverse.harvard.edu/file.xhtml?persistentId=doi:10.7910/DVN/PDI7IN/AGYMSC&version=3.0. See its Readme file as well.

Note to self

See that MEGA folder for the whole Twitter dataset (~15 GB; log file ~200 MB). You have to ask me for the key to access the files!

Tasks:

The final idea was to have 5 groups and provide each of them with a slightly different dataset according to when people tweeted, i.e. whether it was before, during or after the debate.

In summary, the objective would be to gather 5 samples, each of 5000 tweets:

  • 1 sample before the debate began

  • 3 samples during the debate

  • 1 sample after the debate

http://rpubs.com/F789GH/USAPresidentialTweets shows some statistics for the final CSV samples. See the RPubs folder for more.

1. Step - Get data

The tweets have to be downloaded ("hydrated") because that .txt file contains only the Tweet IDs, not the whole content of each tweet.

We have used twarc from https://github.com/DocNow/twarc.

1.1 Configure

First, get Twitter developer API keys from https://developer.twitter.com/en/apply-for-access. Then enter them when prompted by:

twarc configure

After that,

twarc hydrate first-debate.txt > all_first_tweets.jsonl

It takes hours...
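Because hydration of millions of IDs is slow and rate-limited, it can help to run it in the background and watch the output grow. A minimal sketch using standard tools (nohup, wc) and the file names from the command above:

nohup twarc hydrate first-debate.txt > all_first_tweets.jsonl 2> hydrate.log &
# one JSON line is written per hydrated tweet, so the line count shows progress
wc -l all_first_tweets.jsonl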

1.5 Step - Split data into manageable chunks

Why? Well, because the original 13.5 GB jsonl file will be hard to read in any program, so don't try R with limited RAM! You could use https://stedolan.github.io/jq/ for that.
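For instance, jq can stream a single field out of the large file without loading all of it into memory (a sketch; created_at is a standard field of the Twitter JSON):

jq -r '.created_at' all_first_tweets.jsonl | head -n 5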

Alternatively, you could split the .txt file into multiple files first and only then apply the previous twarc command to each part.

Something like the following; splitting on line boundaries (rather than raw bytes with -b) ensures no Tweet ID is cut in half, and the default x prefix produces the xaa, xab, ... file names used below:

split -C 1M first-debate.txt

2. Step - Convert to CSV

Why? Because jsonl files are hard to work with directly, whereas CSV loads easily into R and similar tools.

Execute for each file, using python or python3 depending on your PC and twarc installation:

python 2jsonl.py xaa.jsonl -o xaa.csv

OR

python3 2jsonl.py xaa.jsonl -o xaa.csv

You can also try:

python 2csv_original.py xae -o xae.csv

The CSV delimiter will be ";".

Overall, this will create very large CSV files of around 250 MB each, and we still need to draw samples from them.

Again, alternatively, you can take one large JSONL file and convert it to one large CSV file which you can later split as well.
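To convert all the parts in one go, a small shell loop may help (a sketch; it assumes the hydrated parts are named xaa.jsonl, xab.jsonl, and so on):

for f in xa?.jsonl; do
    python 2jsonl.py "$f" -o "${f%.jsonl}.csv"   # strip .jsonl, write .csv
done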

3. Step - Analyse the tweet timestamps

Analyse the data in order to understand when the tweets were published ;)

E.g.

head -n 5 xae.csv

The outcome:

xaa -> before debate: from 12:00 EST till 18:30 EST

xab -> before debate: from 18:30 EST till 21:00 EST

(the first debate ran from 21:00 EST till 22:35 EST)

xac -> during debate: from 20:47 EST till 22:40 EST

xad -> after  debate: from 22:40 EST till 01:20 EST

xae -> after  debate: from 01:20 EST till 06:20 EST

xaf -> after  debate: from 06:20 EST till 09:40 EST

xag -> .... (rest)
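One way to find these boundaries is to look at the first and last data row of each part (a sketch; it assumes the first CSV line is a header and that rows are in chronological order):

for f in xa?.csv; do
    echo "== $f =="
    head -n 2 "$f" | tail -n 1   # first data row (skipping the header)
    tail -n 1 "$f"               # last data row
done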

4. Step - Split large CSVs into smaller samples

Use the R script process_data.R to apply proper formatting.

You can also go faster (though less reliably, since shuf treats the header and every other line as just another row): in the case of the 250 MB xa{a,b}.csv files, you could execute via bash:

shuf -n 2500 xaa.csv > xaa_sample_2500.csv
shuf -n 2500 xab.csv > xab_sample_2500.csv
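If the CSV files carry a header row, a slightly safer variant keeps it intact (a sketch; it still assumes one tweet per line):

head -n 1 xaa.csv > xaa_sample_2500.csv
tail -n +2 xaa.csv | shuf -n 2500 >> xaa_sample_2500.csv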

Having those, only then would you use process_data.R, which contains something along these lines:

library(data.table)          # provides fread/fwrite
mt <- fread("xaa.csv")       # or xaa_sample_2500.csv directly - depending on previous steps
mt <- mt[sample(.N, 2500)]   # draw a random sample of 2500 rows
fwrite(mt, "xaa_sample_2500.csv", sep = ";")
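The script can then be run non-interactively from the shell (assuming Rscript is on the PATH):

Rscript process_data.R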

5. Step - In any case, you need to combine the files from the previous R script

On Windows, use type commands like these:

type xad_sample_1500.csv xae_sample_2500.csv xag_sample_1000.csv > after_sample_5000.csv

type xaa_sample_2500.csv xab_sample_2500.csv > before_random_sample_5000.csv

type xae_sample_1000.csv xaf_sample_1500.csv xag_sample_2500.csv > after_random_sample_5000.csv

# combine 2 x 2500 tweets from the time before the debate into one file of 5000
type before_sample_2500_a.csv before_sample_2500_b.csv > before_sample_5000.csv
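On Linux/macOS, cat does the same job (note that, just like type, this is plain concatenation; if each sample file carries its own header row, the extra headers have to be stripped afterwards):

cat before_sample_2500_a.csv before_sample_2500_b.csv > before_sample_5000.csv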

Some research done beforehand

Download tweets manually, looking for tweets that include specific #hashtags. Then store them in a database from which you can query them. See that folder and the Python notebooks.
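A minimal sketch of such a manual search with the twarc CLI (the hashtag is only an example; the search endpoint uses the same API keys as configured above):

twarc search '#debates2016' > manual_search.jsonl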
