whsheng / tweetscraper

This project forked from jonbakerfish/tweetscraper

TweetScraper is a simple crawler/spider for Twitter Search without using API

License: GNU General Public License v2.0

Python 93.42% Shell 6.58%

Introduction

TweetScraper collects tweets from Twitter Search. It is built on Scrapy and does not use Twitter's APIs. The crawled data is not as clean as data obtained through the APIs, but in exchange you avoid the APIs' rate limits and restrictions. In principle, you can get all the data available through Twitter Search.

WARNING: please be polite and follow the crawler's politeness policy.

Installation

  1. Install conda; you can get it from miniconda. The tested Python version is 3.7.

  2. Install the Selenium Python bindings: https://selenium-python.readthedocs.io/installation.html. (Note: a KeyError: 'driver' error is caused by an incorrect setup.)

  3. On Ubuntu or Debian, run:

    $ bash install.sh
    $ conda activate tweetscraper
    $ scrapy list
    # If the output is 'TweetScraper', you are ready to go.
    

    install.sh creates a new conda environment named tweetscraper and installs all the dependencies (e.g., firefox-geckodriver and firefox).
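The `scrapy list` check above can also be scripted. The following is a minimal sketch, assuming `scrapy` is on the PATH of the active environment and that it is run from the project root; the function names `spider_is_listed` and `check_installation` are hypothetical helpers, not part of TweetScraper.

```python
import subprocess


def spider_is_listed(scrapy_list_output, name="TweetScraper"):
    """Return True if `name` appears as its own line in `scrapy list` output."""
    return name in (line.strip() for line in scrapy_list_output.splitlines())


def check_installation():
    """Run `scrapy list` in the current directory and look for the spider.

    Assumes the tweetscraper conda environment is already activated.
    """
    result = subprocess.run(
        ["scrapy", "list"], capture_output=True, text=True, check=True
    )
    return spider_is_listed(result.stdout)
```

Calling `check_installation()` from the project root should return True when the setup succeeded; a `FileNotFoundError` means `scrapy` was not found in the current environment.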

Usage

  1. Change the USER_AGENT in TweetScraper/settings.py to identify who you are:

     USER_AGENT = 'your website/e-mail'
    
  2. In the root folder of this project, run a command like:

     scrapy crawl TweetScraper -a query="foo,#bar"
    

    where query is a list of keywords separated by commas and enclosed in double quotes ("). The query can be anything (keyword, hashtag, etc.) you want to search for in Twitter Search. TweetScraper will crawl the search results for the query and save the tweet content and user information.

  3. By default, tweets are saved to ./Data/tweet/ and user data to ./Data/user/. The file format is JSON. Change SAVE_TWEET_PATH and SAVE_USER_PATH in TweetScraper/settings.py if you want another location.
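The usage steps above can be sketched in Python: one helper builds the crawl command from a list of keywords, and another reads the saved output back. This is an illustration under stated assumptions, not part of TweetScraper: the helper names `build_crawl_command` and `load_records` are hypothetical, and the loader assumes that each file in the output directory holds exactly one JSON document, which the README does not confirm.

```python
import json
from pathlib import Path


def build_crawl_command(keywords):
    """Build the `scrapy crawl` argument list for a list of keywords."""
    # The README describes `query` as a comma-separated list of keywords.
    query = ",".join(keywords)
    return ["scrapy", "crawl", "TweetScraper", "-a", f"query={query}"]


def load_records(directory):
    """Parse every file in `directory` as one JSON document.

    Assumption: one JSON document per file; adjust if the layout differs.
    """
    records = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            with path.open(encoding="utf-8") as handle:
                records.append(json.load(handle))
    return records
```

The list returned by `build_crawl_command(["foo", "#bar"])` can be passed to `subprocess.run` from the project root, which avoids shell quoting issues with `#`. After the crawl, `load_records("./Data/tweet/")` reads the tweets from the default location.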

Acknowledgement

Keeping the crawler up to date requires continuous effort. Please support our work via opencollective.com/tweetscraper.

License

TweetScraper is released under the GNU General Public License, Version 2.

Contributors

jonbakerfish, jeromecc, hamzamogni, bdjohnson529, eilgnaw, michaelachmann
