Chinese-tweet-analysis-with-topic-model

An experiment in applying a topic model to a tweet archive downloaded from Twitter (Settings -> Account -> Request Twitter Archive). The code was originally adapted from Tan He, who demonstrated how to analyze and visualize Chinese text using a topic model. For more information about topic models, see the links below (or search for "Topic Models"; there are plenty of good texts out there):

A topic model for movie reviews

Topic Model - Wikipedia

Topic Modeling: A Basic Introduction

What tools/packages do I need?

  • Tools:
    • Python 2.7.9
    • R 3.2.2
    • Notepad++ v 6.7.8 (or above)
  • Packages:
    • in Python: Pandas, Numpy
    • in R: JiebaR, LDA, LDAvis, servr

How to use the files here?

The repository contains three files: LDATweets.R, stopword.txt, and tweetExtract.py.

  • LDATweets.R: the core file that runs the LDA model.
  • stopword.txt (optional): a plain text file containing common English and Chinese stopwords. It is currently UTF-8 encoded and was downloaded from ibook360. Feel free to substitute your favorite stopword list.
  • tweetExtract.py: preprocesses the tweets before the topic model is run.
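
As a rough illustration of what tweet preprocessing can involve, here is a minimal sketch. It is not the contents of tweetExtract.py. It assumes that tweets.csv has a `text` column, that retweet markers, URLs, and @mentions are noise for topic modeling, and it uses the standard-library `csv` module rather than Pandas to stay dependency-free. The function names and the output file name `tweets_clean.txt` are hypothetical.

```python
import csv
import io
import re

# Assumed noise patterns: a leading "RT @user:" marker, URLs, and @mentions.
RT_RE = re.compile(r"^\s*RT\s+@[A-Za-z0-9_]+:?\s*")
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[A-Za-z0-9_]+")


def clean_tweet(text):
    """Remove the retweet marker, URLs and @mentions, then collapse whitespace."""
    text = RT_RE.sub("", text)
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return " ".join(text.split())


def extract_texts(csv_file, text_column="text"):
    """Return the cleaned, non-empty tweet texts from an open CSV file object."""
    reader = csv.DictReader(csv_file)
    cleaned = (clean_tweet(row.get(text_column) or "") for row in reader)
    return [t for t in cleaned if t]


def run(csv_path="tweets.csv", out_path="tweets_clean.txt"):
    """Read the archive CSV and write one cleaned tweet per line (hypothetical output file)."""
    with io.open(csv_path, encoding="utf-8", newline="") as src:
        texts = extract_texts(src)
    with io.open(out_path, "w", encoding="utf-8") as dst:
        dst.write(u"\n".join(texts))
    return len(texts)
```

Calling `run()` from the working directory would write the cleaned tweets to `tweets_clean.txt`. Whether LDATweets.R expects that file name or format is not something this sketch can confirm.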

It is recommended to run the files in the following order:

  1. Request your Twitter Archive from twitter.com.
  2. Unzip your Twitter Archive to the local disk. You should find tweets.csv under the root.
  3. Ideally, your tweets should be mostly in Chinese. If they are in English, refer to A topic model for movie reviews, linked above.
  4. Make sure that tweets.csv, LDATweets.R, tweetExtract.py, and (optional) stopword.txt are all in your working directory.
  5. Run tweetExtract.py.
  6. Run LDATweets.R. It may take a while for R to execute the script, depending on the size of your tweet archive.
  7. Upload the whole vis folder to a server, or open the index.html in Firefox.
  8. Do the following ONLY IF Chinese characters do not display correctly in your browser:
    • Open the vis folder, found under the root of your working directory.
    • Create an empty file with any name you like, as long as its extension is .json.
    • Open that file in Notepad++ and click "Encoding" -> "Encode in UTF-8".
    • Save the file.
    • Open lda.json, copy everything in it, and paste it into the new .json file.
    • Save the new .json file again.
    • Delete the original lda.json and rename the new .json file to lda.json.
  9. Enjoy!
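
The manual re-encoding in step 8 could also be scripted. The sketch below is one possible approach, not part of the repository. It assumes that lda.json was written in a legacy Chinese code page; the default `gbk` is a guess that fits Simplified Chinese Windows systems and may need to be changed. The function name `reencode_to_utf8` is hypothetical.

```python
import io


def reencode_to_utf8(path, source_encoding="gbk"):
    """Rewrite the file at `path` as UTF-8.

    Returns False and leaves the file untouched if it is already valid UTF-8;
    otherwise decodes it with `source_encoding` (an assumption) and returns True.
    """
    with io.open(path, "rb") as f:
        raw = f.read()

    try:
        raw.decode("utf-8")
        return False  # already valid UTF-8, nothing to do
    except UnicodeDecodeError:
        pass

    text = raw.decode(source_encoding)
    with io.open(path, "wb") as f:
        f.write(text.encode("utf-8"))
    return True
```

For example, `reencode_to_utf8("vis/lda.json")` would convert the file in place. Keep a copy of the original first, since a wrong `source_encoding` can produce garbled text.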

Questions?

Submit a pull request here!! I will reply ASAP.

I will update the README as I find out more about topic models.

2018/03/13 update: will try to convert the code to Python and rewrite the code base. Stay tuned! :)
