beardeddonut / ir-twitter-crawler

A Twitter application that crawls Twitter for tweets about restaurants, extracts the restaurant names, and rates each restaurant according to the tweets.

License: Apache License 2.0

Java 100.00%
information-retrieval twitter-bot crwaler restaurant-reviews twitter4j nlp

ir-twitter-crawler's Introduction

IR-Twitter-Crawler

This application was implemented as the project for an Information Retrieval course, so it won't receive regular updates and is provided as is. :D

Project Description:

This is a crawler that searches Twitter for restaurants mentioned in tweets. It also assigns each restaurant a star rating using NLP analysis.

Technologies Used:

  • Java 1.8
  • Lucene v6.6.1, for indexing
  • Twitter4j 4, for fetching tweets and querying twitter
  • Stanford NLP 3.9, for Sentiment Analysis, POS Tagging, Named Entity analysis
  • Maven, for package management

How to Run:

  • For indexing, run the Indexer class. NOTE: make sure there are tweets in the 'tweets' folder.

  • For fetching and analyzing tweets, run the App class. NOTE: make sure to put proper credentials in the ProjectConstants class.

  • For more configuration options, check the ProjectConstant class.

  • NOTE: I know it is not appropriate to store constants and configuration settings in a class, but due to lack of time ... I did!
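One common alternative to keeping settings in a class is a properties file read at startup. The following is a minimal sketch of that idea, not code from this project: the class name CrawlerConfig, the property keys (crawler.keywords, crawler.cities, crawler.since), and the default values are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

/** Hypothetical replacement for hard-coded constants: settings read from a .properties source. */
class CrawlerConfig {
    final List<String> keywords;
    final List<String> cities;
    final String since;

    private CrawlerConfig(List<String> keywords, List<String> cities, String since) {
        this.keywords = keywords;
        this.cities = cities;
        this.since = since;
    }

    /** Loads settings from any Reader; missing keys fall back to assumed defaults. */
    static CrawlerConfig load(Reader source) {
        Properties props = new Properties();
        try {
            props.load(source);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new CrawlerConfig(
                splitList(props.getProperty("crawler.keywords", "restaurant")),
                splitList(props.getProperty("crawler.cities", "chicago")),
                props.getProperty("crawler.since", "2017-01-01"));
    }

    /** Splits a comma-separated value into trimmed, non-empty entries. */
    private static List<String> splitList(String raw) {
        List<String> out = new ArrayList<>();
        for (String part : raw.split(",")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }
}
```

In practice the Reader would typically be a FileReader over a file kept outside version control, which also keeps credentials out of the source tree.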

Problems I Encountered in this Project:

  • Most of the tweets we fetched were not related to a specific restaurant.
  • I couldn't find a way to extract menu items from tweets.
  • My proposed heuristic for identifying restaurant names in tweets may achieve good precision, but it has poor recall.
  • I should have run the text processing on multiple threads to improve performance, but due to lack of time I couldn't.
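For the last point, a fixed-size thread pool is one straightforward way to spread per-tweet analysis over several threads. This is a sketch under assumptions rather than the project's code: ParallelTweetProcessor and processAll are hypothetical names, and the analyzer function stands in for whatever per-tweet NLP step is used, which must be safe to call from multiple threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper that runs a per-tweet analysis step on a thread pool. */
class ParallelTweetProcessor {

    /** Applies the analyzer to every tweet; results are returned in input order. */
    static <R> List<R> processAll(List<String> tweets, Function<String, R> analyzer, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            // Submit one task per tweet and remember the futures in order.
            List<Future<R>> futures = new ArrayList<>();
            for (String tweet : tweets) {
                Callable<R> task = () -> analyzer.apply(tweet);
                futures.add(pool.submit(task));
            }
            // Collect results; get() blocks until each task has finished.
            List<R> results = new ArrayList<>();
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while processing tweets", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("tweet analysis failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

A call such as processAll(tweets, text -> analyze(text), 4) would then replace a sequential loop, where analyze is a placeholder for the real analysis method.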

System Architecture:

[Figure: system diagram]

Stage Number 1:

In this stage the system fetches related tweets posted since 2017-01-01, based on keywords set in the ProjectConstant class (such as restaurant) and restricted to specific locations, which are also set in the ProjectConstant class (such as chicago).

The system then passes the tweets on for further analysis and indexing. It also writes each tweet's text to disk, in the tweets directory.
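The on-disk step can be sketched as follows. The three-line record layout (city and creation date, then the user name, then the text) mirrors the sample Stage #1 output in this README; the class name TweetFileWriter, the method names, and the one-file-per-tweet naming scheme are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Date;

/** Hypothetical writer that stores one fetched tweet per file. */
class TweetFileWriter {

    /** Builds a record shaped like the sample output: "city|date", "@user", then the tweet text. */
    static String format(String city, Date createdAt, String userName, String text) {
        return city + "|" + createdAt + "\n"
                + "@" + userName + "\n"
                + text + "\n";
    }

    /** Writes the record into the tweets directory, creating the directory if needed. */
    static Path save(Path tweetsDir, long tweetId, String record) {
        try {
            Files.createDirectories(tweetsDir);
            // File naming by tweet id is an assumption, not the project's scheme.
            Path target = tweetsDir.resolve(tweetId + ".txt");
            Files.write(target, record.getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```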

Stage Number 2:

In Stage #2 the system uses the Stanford NLP library to run text analysis: Named Entity Recognition, Part of Speech Tagging, and Sentiment Analysis.

The system uses a heuristic to extract restaurant names from tweets: if a token is a noun (according to POS tagging) and is tagged LOCATION (according to NER), it is probably a restaurant, since every fetched tweet came from a restaurant-related query.
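The heuristic can be sketched without the NLP library by treating each token as a (word, POS tag, NER label) triple. This is an illustration, not the project's implementation: the Token class is a stand-in for Stanford NLP's annotated tokens, the check for tags starting with "NN" assumes Penn Treebank noun tags, and joining consecutive matching tokens into one name is an added assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the noun + LOCATION heuristic. */
class RestaurantNameHeuristic {

    /** Minimal stand-in for an annotated token; real values would come from the NLP pipeline. */
    static final class Token {
        final String word;
        final String pos;
        final String ner;

        Token(String word, String pos, String ner) {
            this.word = word;
            this.pos = pos;
            this.ner = ner;
        }
    }

    /** A token qualifies if it is a noun (NN, NNS, NNP, NNPS) and labelled LOCATION. */
    static boolean isCandidate(Token token) {
        return token.pos.startsWith("NN") && "LOCATION".equals(token.ner);
    }

    /** Joins runs of consecutive candidate tokens into a single name (an assumption). */
    static List<String> extractNames(List<Token> tokens) {
        List<String> names = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token token : tokens) {
            if (isCandidate(token)) {
                if (current.length() > 0) {
                    current.append(' ');
                }
                current.append(token.word);
            } else if (current.length() > 0) {
                names.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            names.add(current.toString());
        }
        return names;
    }
}
```

Because only tokens that pass both checks are kept, any restaurant name the NER model does not label LOCATION is missed, which matches the low-recall problem noted earlier.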

After finding the restaurants, the system analyzes the text of each tweet with sentiment analysis to determine each restaurant's rating.
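This README does not say how sentiment results are turned into stars, so the following is only one plausible mapping. It assumes the sentiment step yields an integer class per tweet from 0 (very negative) to 4 (very positive), averages the classes for a restaurant, and shifts the result to a 1 to 5 star scale; StarRating and its methods are hypothetical names.

```java
import java.util.List;

/** Hypothetical mapping from sentiment classes to a star rating. */
class StarRating {

    /** Averages classes in the range 0..4 and maps the rounded mean to 1..5 stars. */
    static int stars(List<Integer> sentimentClasses) {
        if (sentimentClasses.isEmpty()) {
            throw new IllegalArgumentException("no sentiment classes given");
        }
        double sum = 0;
        for (int sentimentClass : sentimentClasses) {
            if (sentimentClass < 0 || sentimentClass > 4) {
                throw new IllegalArgumentException("class out of range: " + sentimentClass);
            }
            sum += sentimentClass;
        }
        return (int) Math.round(sum / sentimentClasses.size()) + 1;
    }

    /** Renders the rating as asterisks, as in the sample output's "rating" field. */
    static String render(int stars) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < stars; i++) {
            sb.append('*');
        }
        return sb.toString();
    }
}
```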

Finally, the system stores the results of this stage in a text file named RestaurantsList.txt in the finalOutputList directory.

Stage Number 3:

In this stage the system indexes each tweet's text along with some other information, such as the created date. It then saves the index files in the indexes folder, so that restaurants can easily be searched for later if needed.

The system uses EnglishAnalyzer for indexing, which handles lowercasing, stop word removal, and stemming.

To search the created indexes, use the QueryParser and IndexSearcher classes from the Lucene packages.

NOTE:

I excluded some cities from the Cities list so that the program terminates much more quickly. Uncomment the FIXME section to run the complete analysis.

I also excluded some fetched tweets from the final project package to reduce the size of the project and the final zip file.

Sample Result:

Tweets (Stage #1 output):

chicago|Mon Jul 09 22:20:35 IRDT 2018
@Parker Molloy
Okay, which of my musician friends wants to write the Trump administration version of uncomfortable restaurant "Happy Birthday"? https://t.co/20go0LpteB

Found Restaurants List (Stage #2 output):

    {
    	"name":  "Vero International Cuisine",
    	"city":  "Racine",
    	"rating": "**",
    	"tweet-id": "119"
    }

Detailed Description of Each Technology Used:

  • Lucene: Apache Lucene is an open-source, high-performance search engine library written in Java and distributed by the Apache Software Foundation. It is used for full-text indexing and search.

  • Stanford NLP: the Stanford NLP API is an open-source Java library developed by the Stanford NLP Group. It provides a wide set of Natural Language Processing tools, including Named Entity Recognition, Part of Speech Tagging, and Sentiment Analysis. It is available for 6 languages, including English, Chinese, and French.

  • Twitter4j: Twitter4j is an unofficial open-source Java library for the Twitter API, which makes it easy to integrate Twitter into Java applications.

  • Maven: Apache Maven is a dependency management and build automation tool for Java projects.

Author:

Navid Alipour - Simple Twitter Restaurant Crawler

Thanks...