
Out of the box Twitter pipeline using the Elastic stack (ELK)

Contributing

This repository is free and fully open source. The license is Apache 2.0, meaning you are free to use it however you want.

All contributions are welcome: ideas, pull requests, issues, documentation improvement, complaints.


Introduction

This repository aims to provide a fully working, "out-of-the-box" data pipeline for doing machine learning on Twitter data using the ELK (Elasticsearch, Logstash, and Kibana) stack, version 6.1.

If you are not familiar with Logstash you may want to follow this tutorial first.

After installing ELK, you should be able to visualize dashboards like those shipped with this repository within about 5 minutes.

The pipeline can be modeled by the following flow chart:

(flow chart image)

Here are some slides that present the Logstash part of the pipeline: https://www.slideshare.net/hypto/machine-learning-in-a-twitter-etl-using-elk

Let's have a look at the different parts covered by this pipeline:

Concerning the Logstash part


Input

The input used is Twitter; you can use it to track users, keywords, or tweets in a specific location.
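As a rough illustration, a minimal twitter input could look like the following sketch (the keys are placeholders and the keywords/full_tweet settings are assumptions; the repository's actual configuration lives in config/twitter-pipeline.conf):

  input {
    twitter {
      consumer_key => "<YOUR-KEY>"
      consumer_secret => "<YOUR-KEY>"
      oauth_token => "<YOUR-KEY>"
      oauth_token_secret => "<YOUR-KEY>"
      # Track keywords; `follows` or `locations` can be used instead
      keywords => ["elasticsearch", "kibana"]
      # Keep the complete tweet object instead of a trimmed version
      full_tweet => true
    }
  }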

Filter

Several filters are applied; they are in charge of the following tasks (a sketch follows the list):

  • Remove deprecated fields
  • Divide the tweet into two or three events (user and tweet)
  • Flatten the JSON
  • Remove unused fields
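As a sketch only (the clone name and field names below are illustrative, not the repository's exact filter chain), splitting events and dropping fields can be done with the clone and mutate filters:

  filter {
    # Duplicate each event so user and tweet can be indexed separately
    clone {
      clones => ["user"]
    }
    # Drop deprecated or unused fields (illustrative names)
    mutate {
      remove_field => ["contributors", "geo"]
    }
  }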

Output

Two outputs are defined (see the sketch after the list):

  • Elasticsearch: To allow better search over your data
  • MongoDB: To store your data
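A minimal sketch of the two outputs (the hosts, database, and collection names are assumptions):

  output {
    elasticsearch {
      hosts => ["localhost:9200"]
      index => "twitter"
    }
    # Requires the logstash-output-mongodb plugin (see "Setting up Logstash")
    mongodb {
      uri => "mongodb://localhost:27017"
      database => "twitter"
      collection => "tweets"
    }
  }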

Concerning the Elasticsearch part


Mapping

A mapping is provided and offers the following:

  • A parent/child relationship between the tweet author and their tweets
  • On text fields (Tweet content, User description, User location):
    • 3 Analyzers
    • Storing of the term vectors (For the 3 analyzers)
    • Storing of the token numbers (For the 3 analyzers)
  • One geofield to locate the origin of the tweet (if available)
  • Many "keyword" and "integer" fields to allow data filtering

The 3 analyzers are:

  1. Standard
  2. English
  3. A custom analyzer that keeps emoticons and punctuation, which is useful for sentiment and emotion analysis

The mapping is not dynamic: Twitter exposes many fields that are undocumented (or poorly documented), so a static mapping avoids data pollution and keeps only the wanted fields.
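To illustrate the multi-analyzer setup, here is a sketch of how a text field can be indexed with several analyzers, stored term vectors, and a token count using multi-fields (the field and analyzer names are assumptions, not the repository's exact mapping):

  "text": {
    "type": "text",
    "analyzer": "standard",
    "term_vector": "with_positions_offsets",
    "fields": {
      "english": {
        "type": "text",
        "analyzer": "english",
        "term_vector": "with_positions_offsets"
      },
      "emoticon": {
        "type": "text",
        "analyzer": "emoticon_analyzer",
        "term_vector": "with_positions_offsets"
      },
      "length": {
        "type": "token_count",
        "analyzer": "standard"
      }
    }
  }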

Concerning the Kibana part


On the Kibana side, the repository offers:

  • A dashboard for general data visualization
  • A dashboard for comparing positive and negative tweets
  • Different kinds of visualizations

Machine learning


Logstash makes it simple to integrate a machine learning model directly into your pipeline using the rest filter. A small "API" has been created to give you an idea of how you can use the rest filter to "label" your tweets on the fly before indexation. You can find this toy API here:

https://github.com/melvynator/toy_sentiment_API

The model is a dummy model, but you can easily introduce your own, more complex model in the form of such an API.
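For reference, a sketch of how the rest filter can call such an API (the URL, port, and field names are assumptions; adapt them to your own endpoint):

  filter {
    rest {
      request => {
        url => "http://localhost:5000/sentiment"
        method => "post"
        headers => { "Content-Type" => "application/json" }
        params => { "text" => "%{text}" }
      }
      # Parse the response as JSON and store it under the `sentiment` key
      json => true
      target => "sentiment"
    }
  }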

Requirements

For the pipeline to work, you need a Twitter developer account, which you can obtain here: https://dev.twitter.com/resources/signup

Linux users

This guide assumes that you have already installed Elasticsearch, Logstash and Kibana. All three need to be installed properly in order to use this pipeline.

Once ELK is installed, here are some instructions to configure Elasticsearch to start automatically when the system boots:

  sudo /bin/systemctl daemon-reload
  sudo /bin/systemctl enable elasticsearch.service

Elasticsearch can be started and stopped as follows:

  sudo systemctl start elasticsearch.service
  sudo systemctl stop elasticsearch.service

(Note that the same steps can be used for Kibana and Logstash)

Mac users

brew install elasticsearch
brew install logstash
brew install kibana

Getting started

Clone the repository:

git clone https://github.com/melvynator/ELK_twitter.git

Setting up Elasticsearch


Make sure that you don't already have an index named twitter.
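Assuming Elasticsearch runs on localhost:9200, you can check for (and, if necessary, delete) an existing twitter index with:

  curl -XGET 'localhost:9200/_cat/indices/twitter?v'
  curl -XDELETE 'localhost:9200/twitter'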

Setting up your Machine Learning API


⚠️ If you don't need to make any API calls, you can skip this part ⚠️

⚠️ If you have your own API, you can skip this part ⚠️

Download the toy API:

git clone https://github.com/melvynator/toy_sentiment_API

Go into the cloned repository and create a virtual environment:

cd toy_sentiment_API
virtualenv -p python3 venv
source venv/bin/activate

Then install Flask and Scikit-Learn (for the machine learning):

pip install -r requirements.txt

Then you can launch your local server:

python sentiment_server.py

Setting up Logstash


To start configuring Logstash, open the configuration file:

ELK_twitter/src/twitter-pipeline/config/twitter-pipeline.conf

Replace each <YOUR-KEY> with your corresponding Twitter key:

  consumer_key => "<YOUR-KEY>"
  consumer_secret => "<YOUR-KEY>"
  oauth_token => "<YOUR-KEY>"
  oauth_token_secret => "<YOUR-KEY>"

Now go into twitter-pipeline:

cd ../src/twitter-pipeline

Make sure that Elasticsearch is started and running on port 9200.

In addition, you also have to manually install the following plugins for Logstash:

⚠️ If you don't need to make any API calls, you don't have to install the REST plugin ⚠️

⚠️ If you don't want to use MongoDB, you don't have to install the MongoDB plugin ⚠️

  1. MongoDB for Logstash (allows you to store your data in MongoDB):

     sudo /usr/share/logstash/bin/logstash-plugin install logstash-output-mongodb

  2. REST for Logstash (allows you to make API calls):

     sudo /usr/share/logstash/bin/logstash-plugin install logstash-filter-rest

⚠️ By default, the pipeline is only configured to output to Elasticsearch, but if you have MongoDB installed, then you can uncomment the mongo output in the config file: ELK_twitter/src/twitter-pipeline/config/twitter-pipeline.conf

⚠️ By default, the pipeline is not configured to make API calls; if you have an API, you can uncomment the rest filter in the config file: ELK_twitter/src/twitter-pipeline/config/twitter-pipeline.conf

Don't forget to specify your own endpoint and data.

Then, you can run the pipeline using:

sudo /usr/share/logstash/bin/logstash -f config/twitter-pipeline.conf

Or define logstash in your SYSTEM_PATH and run the following:

logstash -f config/twitter-pipeline.conf

You should see some logs that end with:

Successfully started Logstash sentiment_service endpoint {:port=>9600}

Setting up Kibana


Now go to Kibana: http://localhost:5601/

Management => Index Patterns => Create Index Pattern

In the text box Index name or pattern, type: twitter

In the drop-down box Time Filter field name, choose: inserted_in_es_at

Click on Create.

Now go to:

Management => Saved Objects => import

And select the file in:

ELK_twitter/src/twitter-pipeline/kibana-visualization/kibana_charts.json

You can now go to Dashboard

This GIF summarizes the different steps if you are lost.

(GIF: Kibana setup steps)

Resources

Thanks to the Stack Overflow and Elastic communities for the answers provided.

https://www.elastic.co/guide/en/logstash/current/introduction.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html

elk_twitter's People

Contributors

melvynator, omarsar, yenhao


elk_twitter's Issues

No results found in Kibana

Hi melvynator

Thanks for an awesome project and for sharing!

I am getting "No results found" in Kibana; I tried the mapping you suggested in #9.

I tried on both ELK stack 6.1.4 and the latest 6.5.4.

On 6.5.4 I can see the document count increasing, but on Timelion I don't see any data; it looks like Kibana can't see the data in Elasticsearch.

Please help

Thanks
Jaco

Toy_sentiment_API

Hi, I have installed the toy sentiment API because I would like to use it with another Twitter app I have, but I don't know how it works. I run it and this appears:

(screenshot)

I went to localhost, and it didn't work.

Thanks

Error in dashboard "Could not locate"

Morning,

First of all, thank you for this wonderful tutorial for those of us who are starting out in this world; it is fantastic. I followed your tutorial step by step without getting any errors, but after importing the data and going to the dashboard I get a large number of errors, like the following:

Could not locate that visualization (id: da589000-8f8c-11e7-abbf-fd2b008dbf85)
Could not locate that search (id: 254f2590-8fc6-11e7-abbf-fd2b008dbf85)
Could not locate that visualization (id: Retweet-number)

I can't visualize any information.

Thank you very much for your help, sorry for the inconvenience.

Add time based indices

The goal of this repo is to allow the collection of tweets over time.

But let's say you are collecting tweets for 4 months non-stop; you may end up with a shard that is quite big. To avoid this problem in Elasticsearch, a good practice is to create new indices over time, for example one index per day/week/month/year.

The pros of doing this are:

  • Search can be made on multiple indices
  • You can use the time to filter out some indices
  • You can delete/archive/compress old indices, for example if your app is real-time.
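For reference, a minimal sketch of what time-based indices could look like in the Elasticsearch output (the index name uses standard Logstash date formatting; the rest of the block is assumed):

  output {
    elasticsearch {
      hosts => ["localhost:9200"]
      # One index per day, e.g. twitter-2018.01.31
      index => "twitter-%{+YYYY.MM.dd}"
    }
  }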

Mapping doesn't work with new versions of Kibana?

I am trying to set up the whole project from your source. Everything seems to work fine until the Elasticsearch part, where after the indexing there are no results in Kibana.
So do you believe it has to do with the new versions of Kibana and Elasticsearch, or with something else?

Logstash problem

Running Logstash manually:

[ERROR] 2022-04-22 14:40:19.149 [Ruby-0-Thread-10: :1] elasticsearch - Failed to install template {:message=>"Malformed escape pair at index 25: /_index_template/twitter-%{+YYYY.MM.dd}", :exception=>Java::JavaNet::URISyntaxException, :backtrace=>["java.net.URI$Parser.fail(java/net/URI.java:2913)

Importing the template through the API:

{"error":{"root_cause":[{"type":"parse_exception","reason":"unknown key [template] in the template "}],"type":"parse_exception","reason":"unknown key [template] in the template "},"status":400}root@ubuntu:/home/ubuntu#
