phanduc / text-mining-training-course

This project forked from minhpqn/text-mining-training-course

Short Text Mining Training Course at FPT

Learning outcomes

  • Understand and practice NLP techniques for English.
  • Be able to apply Web technologies to build a website.
  • Learn and apply data visualization techniques to report mining results.
  • Be able to apply security techniques to simulate Web attacks and defenses.

Project

Task description

In this project, we try to detect technology research trends by mining a large collection of articles on selected research topics. Candidate topics include:

  • Natural Language Processing
  • Big Data
  • Internet of Things
  • Machine Learning

Motivations

A large number of articles and tweets about hot technologies such as Big Data, Machine Learning, the Internet of Things, and NLP are now available in mass media, journals, conference proceedings, and social media. Information in these media often reflects trends in a field. However, it is difficult for a person to read all of that data and gain insight into a field's hot trends, so automatic methods are needed for the task.

Project objectives

  • Automatically mine a large collection of articles related to NLP, Big Data, Internet of Things, and Machine Learning to gain insights into hot trends and keywords in these fields.
  • Generate and visualize a summary or report on the hot trends and keywords in the above fields.
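
One simple way to approximate a "hot trend" is to count, per year, how many articles mention a keyword. The sketch below assumes each article is available as a (year, text) pair; the sample articles and the `keyword_trend` helper are hypothetical and only illustrate the idea, not the method used in the course.

```python
from collections import Counter


def keyword_trend(articles, keyword):
    """Count, per year, the articles whose text mentions `keyword` (case-insensitive)."""
    counts = Counter()
    needle = keyword.lower()
    for year, text in articles:
        if needle in text.lower():
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical sample data, not taken from the course corpus.
articles = [
    (2014, "Big data platforms are maturing."),
    (2015, "Internet of Things devices generate big data."),
    (2015, "Machine learning for the Internet of Things."),
    (2016, "Deep learning drives machine learning research."),
]

print(keyword_trend(articles, "big data"))
print(keyword_trend(articles, "machine learning"))
```

Plain substring matching is crude (it ignores word boundaries and synonyms), which is one reason the course moves on to topic modeling.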

Collecting data

We would like to collect research papers and news articles related to two topics:

  • Big data
  • Internet of things (IoT)

Necessary tools for the course project

Schedule

(In English)

  • Day 1:
    • Introduce the training course
    • Describe the project's requirements and task list
    • Assign/Discuss tasks for internship students
  • Day 2:
    • Using topic modeling for mining topics/trends in raw text corpora
    • Recommended readings for topic modeling & applications for mining topics
    • Topic modeling tools
  • Day 3:
    • Collecting data for the project (tech news articles about big data and internet of things)
    • Using Scrapy (Python package) to crawl data on the internet
    • Introduce some data resources for crawling raw data
    • How to process crawled text data
  • Day 4:
    • Using nltk for processing crawled text data
    • Transform text data into the input format expected by LDA tools
  • Day 5:
    • Run LDA to train topic models
    • Observe outputs
  • Day 6:
    • Data visualization and building a simple website
  • Day 7:
    • Review the product
  • Day 8: Project demo & defense
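
The processing steps of Days 3 and 4 (cleaning crawled pages, tokenizing, and converting to an LDA input format) can be sketched with the standard library alone. This is a minimal illustration, not the course pipeline: it uses `html.parser` and a regular expression as stand-ins for Scrapy and nltk, a tiny hand-written stopword list, and it assumes the target is the sparse `N id:count ...` format read by tools such as lda-c (the course does not say which LDA tool it uses). The sample pages are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stopword list; NLTK ships a much fuller one.
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to"}


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip markup and collapse runs of whitespace."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def to_ldac(docs_tokens):
    """Return (vocabulary, lines); each line is 'N id:count id:count ...'."""
    vocab = {}
    lines = []
    for tokens in docs_tokens:
        counts = Counter(tokens)
        for term in counts:
            vocab.setdefault(term, len(vocab))
        pairs = sorted((vocab[t], c) for t, c in counts.items())
        lines.append(" ".join([str(len(pairs))] + [f"{i}:{c}" for i, c in pairs]))
    return vocab, lines


# Hypothetical crawled pages.
pages = [
    "<html><body><h1>Big data</h1><p>Big data is growing.</p>"
    "<script>x=1</script></body></html>",
    "<p>The Internet of Things produces data.</p>",
]
docs = [tokenize(html_to_text(p)) for p in pages]
vocab, lines = to_ldac(docs)
for line in lines:
    print(line)
```

In each output line, the first number is the count of distinct terms in the document, followed by `term_id:frequency` pairs. The `vocab` mapping has to be saved alongside the data so that topic outputs can be mapped back to words.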

(Translated from Vietnamese)

  • Session 1: Introduce the course and the required product, the work to prepare, and discuss the assignment of tasks to students
  • Session 2: Introduce the basic ideas of text mining for solving the problem; point out resources for self-study and for learning the related concepts
  • Session 3: Guidance on collecting / crawling text (in real time) and on organizing text data
  • Session 4: Guidance on tools for processing the crawled text
  • Session 5: Guidance on extracting the necessary information
  • Session 6: Examine and review how the information is presented as charts
  • Session 7: Examine and review the nearly finished product
  • Session 8: Product demo and product defense
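
Day 5 of the schedule trains topic models with LDA and inspects the outputs. The course's LDA tool is not named here, so the following is only a minimal sketch of what such a tool does internally: a collapsed Gibbs sampler over documents given as lists of word ids. The function names, the hyperparameter values (`alpha`, `beta`, `iters`), and the toy corpus are assumptions chosen for illustration; a real project would use an established library.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA; docs are lists of integer word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                # n(d, k)
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # n(k, w)
    topic_total = [0] * num_topics                              # n(k)
    assign = []
    # Give every token a random initial topic.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # p(z = t | rest) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta)
                weights = [
                    (doc_topic[d][t] + alpha) * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1
    return doc_topic, topic_word


def top_words(topic_word, id_to_word, n=3):
    """Return the n most frequent words of each topic."""
    return [
        [id_to_word[w] for w in sorted(range(len(row)), key=lambda w: -row[w])[:n]]
        for row in topic_word
    ]


# Hypothetical toy corpus: two "big data" documents and two "IoT" documents.
vocab = ["data", "hadoop", "spark", "sensor", "device", "network"]
docs = [
    [0, 1, 2, 0, 1], [2, 0, 1, 2, 0],
    [3, 4, 5, 3, 4], [5, 3, 4, 5, 3],
]
doc_topic, topic_word = lda_gibbs(docs, num_topics=2, vocab_size=len(vocab))
for k, words in enumerate(top_words(topic_word, vocab)):
    print(f"topic {k}: {words}")
```

When observing outputs, the per-topic top words are the usual starting point: a topic is interpretable if its top words clearly belong together. Results depend on the random seed and the number of topics, so several runs are worth comparing.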

References

  • David Hall, Daniel Jurafsky, and Christopher D. Manning. 2008. Studying the history of ideas using topic models. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP '08). Association for Computational Linguistics, Stroudsburg, PA, USA, 363-371.
  • Anton Barua, Stephen W. Thomas, and Ahmed E. Hassan. 2014. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empirical Software Engineering 19, 3 (June 2014), 619-654. DOI: http://dx.doi.org/10.1007/s10664-012-9231-y.
  • Bird, Steven et al. “The ACL Anthology Reference Corpus: A Reference Dataset for Bibliographic Research in Computational Linguistics.” LREC (2008).
  • Priva, Uriel Cohen and Joseph L Austerweil. “Analyzing the history of Cognition using Topic Models.” Cognition 135 (2015): 4-9.
  • Gollapalli, Sujatha Das and Xiaoli Li. “EMNLP versus ACL: Analyzing NLP research over time.” EMNLP (2015).
  • Grant, Christan Earl et al. “A Topic-Based Search, Visualization, and Exploration System.” FLAIRS Conference (2015).
  • Paul, Michael J. and Roxana Girju. “Topic Modeling of Research Fields: An Interdisciplinary Perspective.” RANLP (2009).
  • Radev, Dragomir R. et al. “The ACL anthology network corpus.” Language Resources and Evaluation 47 (2013): 919-944.
  • Yang, Tze-I. et al. “Topic Modeling on Historical Newspapers.” LaTeCH@ACL (2011).
  • Building an automatic keyphrase extraction system using NLTK in Python
  • Text Processing: Keyword-extraction

Topic modeling reading list

Web Crawling

Python environments

LDA on Quora:

Word Cloud Tools
