____ ____ ____ ____ _________ ____ ____ ____ ____ ____ ____
||t |||e |||x |||t ||| |||m |||i |||n |||i |||n |||g ||
||__|||__|||__|||__|||_______|||__|||__|||__|||__|||__|||__||
|/__\|/__\|/__\|/__\|/_______\|/__\|/__\|/__\|/__\|/__\|/__\|
A curated list of resources for learning about natural language processing, text mining, text analytics, and unstructured data.
- Books
- Blogs
- Blog articles, Papers, Case Studies
- Online Courses
- APIs and Libraries
- Products
- Online Demos and Tools
- Datasets
- Misc
- Meta
- Other Curated Lists
- Natural Language Processing with Python
- Natural Language Processing with PyTorch
- Python Natural Language Processing
- Mastering Natural Language Processing with Python
- Natural Language Processing: Python and NLTK
- Applied Text Analysis with Python: Enabling Language-Aware Data Products with Machine Learning
- Deep Learning with Text
- Applied Natural Language Processing With Python 2018.
- Taming Text
- Speech and Language Processing
- Foundations of Statistical Natural Language Processing
- Language Processing with Perl and Prolog: Theories, Implementation, and Application (Cognitive Technologies)
- An Introduction to Information Retrieval
- Handbook of Natural Language Processing
- Practical Text Mining and Statistical Analysis for Non-structured Text Data Applications
- Fundamentals of Predictive Text Mining
- Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
- Neural Network Methods for Natural Language Processing
- Text Mining: A Guidebook for the Social Sciences
- Practical Text Analytics: Interpreting Text and Unstructured Data for Business Intelligence
- Neural Network Methods in Natural Language Processing
- Machine Learning for Text (2018)
- Probably Approximately a Scientific Blog
- Sebastian Ruder
- NLP-progress
- natural language processing blog
- Text Mining, Analytics & More
- FriendlyData blog
- Salmon Run
- Lekta Blog
- NLP News
- WEF Live Campaign - Twitter fed Global News Topics & Sentiment Tracker - Live Jan 2019
- Modern Deep Learning Techniques Applied to Natural Language Processing
- From Natural Language to Calendar Entries, with Clojure. March 2015. NLP, Clojure
- Ask HN: How Can I Get into NLP (Natural Language Processing)?
- Ask HN: What are the best tools for analyzing large bodies of text?
- Quora: How do I learn Natural Language Processing?
- Quora Topic: Natural Language Processing
- The Definitive Guide to Natural Language Processing October 2015.
- Futures of text Feb 2015.
- R or Python on Text Mining Aug 2015.
- Where to start in Text Mining Aug 2012.
- Text Mining in R and Python: 8 Tips To Get Started. Oct 2016
- An introduction to text analysis with Python, Part 1 April 2012.
- Mining Twitter Data with Python (Part 1: Collecting Data)
- a gentle introduction to historical data analysis
- Why Text Mining May Be The Next Big Thing. March 2012.
- SAS CEO offers analytics over BI, reveals use cases for text analytics June 2011.
- Value and benefits of text mining. Sep 2015.
- Text Mining South Park Feb 2016
- Natural Language Processing: An Introduction
- Natural Language Processing Tutorial. June 2013.
- Natural Language Processing blog.
- An Introduction to Text Mining using Twitter Streaming API and Python
- GitHub repo with code: https://github.com/adilmoujahid/Twitter_Analytics
- How To Get Into Natural Language Processing
- Betty: a friendly English-like interface for your command line.
- Creating machine learning models to analyze startup news - Part 1. Part 2. Part 3.
- A Tidytext Analysis of the Weinstein Effect Dec 2013.
- Comparison of the Most Useful Text Processing APIs
- 100 Must-Read NLP Papers
- VentureBeat blog post - Gender biases in datasets - Based on UCLA research paper "Learning Gender Neutral Word Embeddings" Aug 2018.
- https://blog.scrapinghub.com/2016/01/19/scrapy-tips-from-the-pros-part-1/
- Extract text from any document; no muss, no fuss. July 2014.
- Using Scrapy to Build your Own Dataset Sep 2017.
- Taming Text with the SVD. SAS. Jan 2004.
- Automatic Sarcasm Detection: A Survey. ACM Computing Surveys, Sep 2017.
- Naive Bayes and Text Classification. Oct 2014
- Bag of Tricks for Efficient Text Classification
- Text Classifier Algorithms in Machine Learning July 2017
- Classifying Documents in the Reuters-21578 R8 Dataset. August 2016.
- Tidy Text Mining Beer Reviews Jan 2018.
- Implementing a CNN for Text Classification in TensorFlow
- Using fastText and Comet.ml to classify relationships in Knowledge Graphs
- Multi-Class Text Classification with Scikit-Learn
- Machine Learning with Text in scikit-learn (PyCon 2016)
- How to solve 90% of NLP problems: a step-by-step guide
- Entity Extraction and Network Analysis. Python, StanfordCoreNLP
- NLP Techniques for Extracting Information
- Text Clustering: Get quick insights from Unstructured Data. July 2017.
- Document Clustering. MSc Thesis.
- Document Clustering: A Detailed Review. Shah and Mahajan. IJAIS 2012.
- Document Clustering with Python: A GitHub repository that clusters IMDB movie descriptions. Based on this original tutorial, whose GitHub repo is here.
- Text mining and sentiment analysis on video game user reviews using SAS® Enterprise Miner
- Who wrote the anti-Trump New York Times op-ed? Using tidytext to find document similarity
- Topic models: Past, present, and future
- Word vectors using LSA, Part - 2
- Probabilistic Topic Models
- LEGO color themes as topic models Sep 2017.
- How our startup switched from Unsupervised LDA to Semi-Supervised GuidedLDA
- Topic Modeling with LSA, PLSA, LDA & lda2Vec Aug 2018.
- text2vec's Description of Topic Models
- Topic Modelling Portal
- Applications of Topic Models 2017.
- MACS 30500: Text analysis: topic modeling
- Sentiment analysis on Trump's tweets using Python. Sep 2017.
- Donald Trump vs Hillary Clinton: sentiment analysis on Twitter mentions
- Does sentiment analysis work? A tidy analysis of Yelp reviews
- CACM: Techniques and Applications for Sentiment Analysis
- Twitter mood predicts the stock market
- A nonlinear impact: evidences of causal effects of social media on market prices
- Stock Sentiment Data: Measuring the Mood of the Market
- Stock Prediction Using Twitter Sentiment Analysis. Stanford course project report.
- Forbes: How Quant Traders Use Sentiment To Get An Edge On The Market
- News Sentiment Analysis Using R to Predict Stock Market Trends. SMU lecture.
- On the Predictability of Stock Market Behavior using StockTwits Sentiment and Posting Volume
- Sentdex: Quantifying the Qualitative
- Leveraging international market sentiment for trading strategies
- From tweets to polls: Linking text sentiment to public opinion time series
- Lexicon-Based Methods for Sentiment Analysis
- On the negativity of negation
- Blog Post: That Sentimental Feeling
- Trump2Cash: A stock trading bot powered by Trump tweets
- Unsupervised Sentiment Neuron. April 2017.
- Current State of Text Sentiment Analysis from Opinion to Emotion Mining Feb 2017
- Lost at Sea: How Social Media is Helping Cruise Lines Attract Millennials
- Harry Plotter: Celebrating the 20 year anniversary with tidytext and the tidyverse in R August 2015.
- Data Science 101: Sentiment Analysis in R Tutorial. October 2017.
- Cannes Lions 2017: Hungerithm, Mars Chocolate Australia (Clemenger BBDO, Melbourne). Snickers price goes down as anger goes up.
- A survey on sentiment analysis challenges. April 2016.
- Challenges in Sentiment Analysis. 2015.
- Sentiment Analysis Tools Overview, Part 1. Positive and Negative Words Databases. July 2017
- Emotion and Sentiment Analysis: A Practitioner’s Guide to NLP
- Sentiment analysis: 10 applications and 4 services. June 2018.
- Twitter Sentiment Analysis Using Combined LSTM-CNN Models
- Breakthrough Research Papers and Models for Sentiment Analysis. An article covering different types of models for sentiment analysis.
- Blog Post: Found in translation: More accurate, fluent sentences in Google Translate Nov 2016
- NYTimes: The Great A.I. Awakening Dec 2016
- Machine Learning Translation and the Google Translate Algorithm
- Neural Machine Translation (seq2seq) Tutorial
- Paper Dissected: “Attention is All You Need” Explained. An explanation of the influential 2017 paper that introduced the Transformer architecture, which is built entirely on attention mechanisms.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. A language representation model published in 2018.
- Meet Lucy: Creating a Chatbot Prototype
- Microsoft Bot Framework. A YouTube video describing the product.
- Training Millions of Personalized Dialogue Agents
- Ultimate Guide to Leveraging NLP & Machine Learning for your Chatbot. 2016.
- A Survey on Dialogue Systems: Recent Advances and New Frontiers Jan 2018.
- agrep method in R. Approximate String Matching (Fuzzy Matching)
- fuzzywuzzy package in R. Example usage.
- Fuzzy String Matching – a survival skill to tackle unstructured information
- The RecordLinkage Package: Detecting Errors in Data
- R package fastLink: Fast Probabilistic Record Linkage
- Fuzzy merge in R
- Learning Text Similarity with Siamese Recurrent Networks
- Dedupe: A Python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution
- An Intuitive Understanding of Word Embeddings: From Count Vectors to Word2Vec
- An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation 2016. From IBM.
- Document Embedding with Paragraph Vectors 2015. From Google.
- GloVe Word Embeddings Demo 2017. From fastai.
- Text Classification With Word2Vec 2016.
- Document Embedding 2017
- From Word Embeddings To Document Distances 2015.
- Word Embeddings, Bias in ML, Why You Don't Like Math, & Why AI Needs You 2017. Rachel Thomas (fastai)
- Word Vectors in Natural Language Processing: Global Vectors (GloVe). Aug 2018.
- Doc2Vec Tutorial on the Lee Dataset
- Word Embeddings in Python with Spacy and Gensim
- The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning). Dec 2018.
- Deep Contextualized Word Representations. ELMo. PyTorch implementation. TF implementation.
- Universal Language Model Fine-tuning for Text Classification.
- Supervised Learning of Universal Sentence Representations from Natural Language Inference Data.
- Learned in Translation: Contextualized Word Vectors. CoVe.
- Distributed Representations of Sentences and Documents. Paragraph vectors. See doc2vec tutorial at gensim
- sense2vec. Word sense disambiguation.
- Skip Thought Vectors. Word representation method.
- Sequence to Sequence Learning with Neural Networks
- The Amazing Power of Word Vectors. 2016.
- Contextual String Embeddings for Sequence Labeling. 2018.
- Understanding Convolutional Neural Networks for NLP
- Keras LSTM tutorial – How to easily build a powerful deep learning language model
- Deep Learning for Natural Language Processing: Tutorials with Jupyter Notebooks
- A Survey of the Usages of Deep Learning in Natural Language Processing
- Udemy: Deep Learning and NLP A-Z™: How to create a ChatBot
- Udemy: Natural Language Processing with Deep Learning in Python
- Stanford CS 224N / Ling 284
- Deep Learning for NLP. DeepMind and University of Oxford Department of Computer Science.
- CMU CS 11-747: Neural Networks for NLP
- YSDA NLP course. Yandex School of Data Analysis.
- Stanford course on NLP: Dan Jurafsky and Chris Manning
- Stanford Deep Learning NLP Course
- CMU Language and Statistics II: (More) Empirical Methods in Natural Language Processing
- UT CS 388: Natural Language Processing
- Coursera: Applied Text Mining in Python
- Big Data University: Text Analytics – Getting Results with SystemT
- Big Data University: Advanced Text Analytics – Getting Results with SystemT
- Big Data University: Text mining in action: Analyzing Twitter data for Democratic General Elections (BETA Version)
- Columbia: COMS W4705: Natural Language Processing
- Columbia: COMS E6998: Machine Learning for Natural Language Processing (Spring 2012)
- Machine Translation: Spring 2016
- DataCamp: Natural Language Processing Fundamentals in Python
- DataCamp: Sentiment Analysis in R: The Tidy Way
- DataCamp: Text Mining: Bag of Words
- DataCamp: Building Chatbots in Python
- Coursera: Introduction to Natural Language Processing
- Coursera: Natural Language Processing
- Commonlounge: Learn Natural Language Processing: From Beginner to Expert
- Udacity: Natural Language Processing Nanodegree
- Udemy: NLP - Natural Language Processing with Python
- Udemy: Deep Learning: Advanced NLP and RNNs
- Udemy: Natural Language Processing and Text Mining Without Coding
- R packages
- tm: Text Mining.
- lsa: Latent Semantic Analysis.
- lda: Collapsed Gibbs Sampling Methods for Topic Models.
- textir: Inverse Regression for Text Analysis.
- corpora: Statistics and data sets for corpus frequency data.
- tau: Text Analysis Utilities.
- tidytext: Text mining using dplyr, ggplot2, and other tidy tools.
- Sentiment140: Sentiment text analysis
- sentimentr: Lexicon-based sentiment analysis.
- cleanNLP: ML-based sentiment analysis.
- RSentiment: Lexicon-based sentiment analysis. Contains support for negation detection and sarcasm.
- text2vec: Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities.
- fastTextR: Interface to the fastText library.
- LDAvis: Interactive visualization of topic models.
- keras: Interface to Keras, a high-level neural networks 'API'. (RStudio Blog: TensorFlow for R)
- rtweet: Client for accessing Twitter’s REST and stream APIs. (21 Recipes for Mining Twitter Data with rtweet)
- topicmodels: Interface to the C code for Latent Dirichlet Allocation (LDA).
- textmineR: Aid for text mining in R, with a syntax that should be familiar to experienced R users.
- wordVectors: Creating and exploring word2vec and other word embedding models.
- gtrendsR: Interface for retrieving and displaying the information returned online by Google Trends.
- textstem: Tools that stem and lemmatize text.
- Python modules
- NLTK: Natural Language Toolkit.
- scikit-learn: Machine Learning in Python
- spaCy: Industrial-Strength Natural Language Processing in Python.
- textblob: Simplified Text processing.
- Gensim: Topic Modeling for humans.
- textmining: Python Text Mining utilities.
- Scrapy: Open source and collaborative framework for extracting the data you need from websites.
- lda2vec: Tools for interpreting natural language.
- PyText: A deep-learning-based NLP modeling framework built on PyTorch.
- sent2vec: General purpose unsupervised sentence representations.
- flair: A very simple framework for state-of-the-art Natural Language Processing (NLP)
- word_forms: Accurately generate all possible forms of an English word, e.g. "election" --> "elect", "electoral", "electorate" etc.
- AllenNLP: Open-source NLP research library, built on PyTorch.
- BigARTM: Fast topic modeling platform.
- Scattertext: Beautiful visualizations of how language differs among document types.
- embeddings: Pretrained word embeddings in Python.
- fastText: Library for efficient learning of word representations and sentence classification.
- Google Seq2Seq: A general-purpose encoder-decoder framework for TensorFlow that can be used for Machine Translation, Text Summarization, Conversational Modeling, Image Captioning, and more.
- polyglot: A natural language pipeline that supports multilingual applications.
- Apache Tika: a content analysis toolkit.
- Stanford CoreNLP: a suite of core NLP tools
- Also check out http://corenlp.run for a hosted version of the CoreNLP server.
- Stanford Parser
- Stanford POS Tagger
- Stanford Named Entity Recognizer
- Stanford Classifier
- Stanford OpenIE
- Stanford Topic Modeling Toolbox
- MALLET: MAchine Learning for LanguagE Toolkit
- GitHub: https://github.com/mimno/Mallet
- Apache OpenNLP: Machine learning based toolkit for text NLP.
- Streamcrab: Real-time Twitter sentiment analyzer engine http://www.streamcrab.com
- TextRazor API: Extract Meaning from your Text.
- fastText. Library for fast text representation and classification. Facebook.
- Comparison of Top 6 Python NLP Libraries.
- Systran - Enterprise Translation Products
- SAS Text Miner (Part of SAS Enterprise Miner)
- SAS Sentiment Analysis
- STATISTICA
- KNIME
- RapidMiner
- GATE
- IBM Watson
- Crimson Hexagon
- Stocktwits: Tap into the Pulse of Markets
- Meltwater
- CrowdFlower: AI for your business.
- Lexalytics Semantria: API and Excel plugin.
- Rosette Text Analytics: AI for Human Language
- Google's Natural Language API: Derive insights from unstructured text using Google machine learning
- Alchemy API
- Monkey Learn
- LightTag Annotation Tool. Hosted annotation tool for teams.
- Anafora: Free and open source web-based raw text annotation tool
- brat: Rapid annotation tool.
- Amazon Lex: A service for building conversational interfaces into any application using voice and text.
- Apache PDFBox
- Tabula: A tool for liberating data tables locked inside PDF files.
- PDFLayoutTextStripper: Converts a pdf file into a text file while keeping the layout of the original pdf.
- pdftabextract: A set of tools for extracting tables from PDF files helping to do data mining on (OCR-processed) scanned documents.
- SO: How to extract text from a PDF?
- Tools for Extracting Data and Text from PDFs - A Review
- How I used NLP (Spacy) to screen Data Science Resumes
- MIT OpenNMT for neural machine translation and neural sequence modeling
- Stemming & Lemmatization with Python NLTK
- Stanford Parser
- Stanford CoreNLP
- word2vec demo
- Another word2vec demo
- UCI's Text Datasets
- data.world's Text Datasets
- Awesome Public Datasets' Natural Language
- Insight Resources Datasets
- Bing Sentiment Analysis
- Consumer Complaint Database. From the Consumer Financial Protection Bureau.
- Sentiment Labelled Sentences Data Set. Contains sentences labelled as "positive" or "negative", from imdb.com, amazon.com, and yelp.com.
- Amazon product data
- Data is Plural
- FiveThirtyEight's datasets
- r/datasets
- Awesome public datasets
- R's datasets package
- 200,000 Russian Troll Tweets
- Wikipedia: List of datasets for ML research
- Google Dataset Search
- Kaggle: UMICH SI650 - Sentiment Classification
- Lee's Similarity Data Sets
- Corpus of Presidential Speeches (CoPS) and a Clinton/Trump Corpus
- 15 Best Chatbot Datasets for Machine Learning
- A Survey of Available Corpora for Building Data-Driven Dialogue Systems
- nlp-datasets
- Hate-speech-and-offensive-language
- First Quora Dataset Release: Question Pairs
- The Best 25 Datasets for Natural Language Processing
- AskReddit: People with a mother tongue that isn't English, what are the most annoying things about the English language when you are trying to learn it?
- Funny Video: Emotional Spell Check
- How to win Kaggle competition based on NLP task, if you are not an NLP expert
- awesome-nlp
- Deep Learning for NLP resources
- Speech and Natural Language Processing
- Opinion Mining, Sentiment Analysis, and Opinion Spam Detection
- awesome-machine-learning
- Sentiment140
- Awesome Deep Learning for Natural Language Processing (NLP)
Contributions are more than welcome! Please read the contribution guidelines first.
To the extent possible under law, @stepthom has waived all copyright and related or neighboring rights to this work.