Coder Social home page Coder Social logo

trec_bs0x's Introduction

TREC 2011 Twitter Collection Downloader

  • Downloads approximately 16 million tweets which are gathered by TREC in 2011.
  • Enables download from multiple machine and save your time (in a single machine, it may last +30 days!)
  • Retry failed downloads
  • Write downloads into log file
  • Combines all failed/succeed download statistics
  • Uncompress all zips into one folder

For more information about collection https://github.com/lintool/twitter-tools/wiki/Tweets2011-Collection and TREC http://trec.nist.gov/data/tweets/

Prerequests

  • Python 3.5.*

  • Download https://github.com/lintool/twitter-tools to your local. This is main tool to fetch the tweets. Installing steps of the tool is as decribed in its page:

    • The main Maven artifact for the TREC Microblog API is twitter-tools-core. The latest releases of Maven artifacts are available at Maven Central.

    • You can clone the repo with the following command:

      $ git clone git://github.com/lintool/twitter-tools.git
      
    • Once you've cloned the repository, change directory into twitter-tools-core and build the package with Maven:

      $ cd twitter-tools-core
      $ mvn clean package appassembler:assemble
      
    • For more information, see the project wiki.

  • Than clone this repo with the following command into the twitter-tools-core directory that you have cd into in previous step:

    https://github.com/cmlonder/trec-collection-downloader.git
    
  • Install required libraries by running this command:

    pip3.5 install -r requirements.txt
    

Or if your default python version is 3.5 than you can run this command with using directly pip command instead of pip3.5

pip install -r requirements.txt
  • If you haven't pip installed (which is must be installed default when you install Python 3.5) than follow those steps:

    • To install pip, securely download get-pip.py

    • Then run the following:

      python3.5 get-pip.py
      
    • or if your default python version is 3.5 you can run this command with using directly python instead of python3.5

      python get-pip.py
      
    • For the rest of the tutorial, it is assumed that your default python version is 3.5 and python command will be used in examples.

  • Get your URL, username and password via http://trec.nist.gov/data/tweets/ .Pay attention to this section in the link:

Email the signed agreement, as a PDF file, to Angela Ellis [email protected]. In the body of your email, Be clear that you are requesting the Tweets2011 dataset Include your name, your email address, and the name of your organization. We will respond to your request with a URL, a username, and a password with which you can download the tweet lists. Please allow seven business days for a respons

Folder Structure

Here is the folder structure which this script will create.

tarFiles: Compressed files which is firstly downloaded from the site you have received from the TREC authorities.

openedTarFiles: Uncompressed files which are structured like DAY/DAY-NUMBER.dat. There are 17 days and 100 dat files in each day

jsonFiles: Compressed JSON files that fetched by using dat files. There are approximately 10.000 tweets in each file

repairJsonFiles: JSON files which has the information that failed fetches. Those will be downloaded automatically by the script into jsonFiles folder with the extension .repair.json.gz

logFiles: Log files for each json file.

Usage

  1. Update url, user and password with your own ones
url = 'YOUR_URL_HERE'
user, password = 'YOUR_USER_NAME_HERE', 'YOUR_PASSWORD_HERE'
  1. Run the script for default usage. This will fetch tweets(10.000 per .dat file) from each .dat file (100 per folder) from each folder (17 folder). This is not good scenario cause it will cost you +30 more days, but if you have acceptable time and connection go with this one.
python trec.py
  1. Download the corpus with reverse order.
python trec.py --reverse
  1. Download the corpus from folder number 5 to 10
python trec.py 5 10
  1. Download the corpus from folder number 0 to 5 with reverse order
python trec.py --inverse 0 5
  1. Download the corpus from 0 (default beginning number if only 1 number is given) to 5
python trec.py 5

Example Scenario

You have 4 machine to download the corpus. Or you can start 4 different command line in a single machine if you have good bandwith and processor. It will be faster than single connection. So run the script in each machine with given ranges and than merge your jsonFiles folders. Use any number of machine like this scenario to download corpus with minimum amount of time.

python trec.py 0 3
python trec.py 4 7
python trec.py 8 11
python trec.py 12 17

trec_bs0x's People

Contributors

trellixvulnteam avatar vivekreddyd avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.