mltext's Introduction

Machine Learning with Text in Python

This repo contains solutions to the assessments from the following course:

Course Schedule


Before the Course

  • Make sure that scikit-learn, pandas, and matplotlib (and their dependencies) are installed on your system. The easiest way to accomplish this is by downloading the Anaconda distribution of Python. Both Python 2 and 3 are welcome.
  • If you are not familiar with Git and GitHub, watch my quick introduction to Git and GitHub (8 minutes). Note that the repository shown in the video is from a previous iteration of the course, and the GitHub interface has also changed slightly.
    • For a longer introduction to Git and GitHub, watch my 11-video series (36 minutes).
  • If you are not familiar with the Jupyter notebook, watch my introductory video (8-minute segment). Note that the Jupyter notebook was previously called the "IPython notebook", and the interface has also changed slightly. (Here is the notebook shown in the video.)
  • If you are not yet comfortable with scikit-learn, review the notebooks and/or videos from my scikit-learn video series, focusing specifically on the following topics:
  • If you are not yet comfortable with pandas, review the notebook and/or videos from my pandas video series. Alternatively, review another one of my recommended pandas resources.

Week 1: Working with Text Data in scikit-learn

Topics covered:

  • Model building in scikit-learn (refresher)
  • Representing text as numerical data
  • Reading the SMS data
  • Vectorizing the SMS data
  • Building a Naive Bayes model
  • Comparing Naive Bayes with logistic regression
  • Calculating the "spamminess" of each token
  • Creating a DataFrame from individual text files
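
As a rough illustration of the vectorizing and Naive Bayes topics above, the sketch below turns a few short messages into a document-term matrix and fits a classifier on it. The four messages and their labels are invented for this example and are not the course's SMS dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy data (not the course's SMS dataset): 1 = spam, 0 = ham
messages = [
    "win a free prize now",
    "free cash prize call now",
    "are we still meeting for lunch",
    "see you at lunch tomorrow",
]
labels = [1, 1, 0, 0]

# Learn the vocabulary and build a sparse document-term matrix of token counts
vect = CountVectorizer()
dtm = vect.fit_transform(messages)

# Fit a Naive Bayes model on the token counts
nb = MultinomialNB()
nb.fit(dtm, labels)

# New text must be transformed with the vocabulary that was already learned
new_dtm = vect.transform(["free prize for you", "lunch tomorrow"])
predictions = nb.predict(new_dtm)
print(predictions)
```

Note that `fit_transform` is used only on the training messages, while new messages go through `transform`, so tokens never seen during training are simply ignored.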

Week 2: Basic Natural Language Processing (NLP)

Topics covered:

  • What is NLP?
  • Reading in the Yelp reviews corpus
  • Tokenizing the text
  • Comparing the accuracy of different approaches
  • Removing frequent terms (stop words)
  • Removing infrequent terms
  • Handling Unicode errors
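
The two term-removal topics above can be sketched with `CountVectorizer` options. The three reviews below are made up for illustration and stand in for the Yelp corpus; `stop_words="english"` uses scikit-learn's built-in English stop word list, and `min_df=2` is an arbitrary threshold chosen for this tiny example.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus standing in for the Yelp reviews
reviews = [
    "the food was great and the service was great",
    "the food was terrible",
    "great coffee and friendly staff",
]

# Default settings: every token of two or more characters becomes a feature
vect_all = CountVectorizer()
vect_all.fit(reviews)

# Removing frequent terms: drop words on the built-in English stop word list
vect_stop = CountVectorizer(stop_words="english")
vect_stop.fit(reviews)

# Removing infrequent terms: keep only tokens found in at least 2 documents
vect_min = CountVectorizer(min_df=2)
vect_min.fit(reviews)

print(sorted(vect_all.vocabulary_))
print(sorted(vect_stop.vocabulary_))
print(sorted(vect_min.vocabulary_))
```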

Week 3: Intermediate NLP and Basic Regular Expressions

Topics covered:

  • Intermediate NLP:
    • Reading in the Yelp reviews corpus
    • Term Frequency-Inverse Document Frequency (TF-IDF)
    • Using TF-IDF to summarize a Yelp review
    • Sentiment analysis using TextBlob
  • Basic Regular Expressions:
    • Why learn regular expressions?
    • Rules for searching
    • Metacharacters
    • Quantifiers
    • Using regular expressions in Python
    • Match groups
    • Character classes
    • Finding multiple matches
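
For the regular expression topics above, here is a minimal sketch using Python's standard `re` module. The sample string is invented for illustration.

```python
import re

text = "Call 555-1234 or 555-9876 before 5pm on 2016-03-14."

# Metacharacters and quantifiers: \d matches a digit, {3} means exactly three
first_phone = re.search(r"\d{3}-\d{4}", text).group()

# Match groups: parentheses capture parts of the overall match
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)
year, month, day = date.groups()

# Character classes: [aeiou] matches any single lowercase vowel
vowels = re.findall(r"[aeiou]", "regular expressions")

# Finding multiple matches: findall returns every non-overlapping match
phones = re.findall(r"\d{3}-\d{4}", text)

print(first_phone, (year, month, day), len(vowels), phones)
```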

Week 4: Intermediate Regular Expressions

Topics covered:

  • Week 3 homework review
  • Greedy or lazy quantifiers
  • Alternatives
  • Substitution
  • Anchors
  • Option flags
  • Assorted functionality
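
The following sketch touches on several of the topics above (greedy versus lazy quantifiers, alternatives, substitution, anchors, and option flags) using the standard `re` module. All input strings are invented for illustration.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy: .* takes as much as possible, so the match runs to the last '>'
greedy = re.search(r"<.*>", html).group()
# Lazy: .*? takes as little as possible, so the match stops at the first '>'
lazy = re.search(r"<.*?>", html).group()

# Alternatives: | matches either the pattern on its left or on its right
pets = re.findall(r"cat|dog", "cat, dog, bird, catalog")

# Substitution: \1 and \2 in the replacement refer back to captured groups
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Lovelace, Ada")

# Anchors: ^ and $ tie the pattern to the start and end of the string
is_zip = bool(re.search(r"^\d{5}$", "90210"))
not_zip = bool(re.search(r"^\d{5}$", "zip 90210"))

# Option flags: re.IGNORECASE makes the match case-insensitive
hits = re.findall(r"free", "FREE entry, Free gift, free call", flags=re.IGNORECASE)

print(greedy, lazy, pets, swapped, is_zip, not_zip, hits)
```

The `pets` result includes a match inside "catalog", which is a reminder that an unanchored pattern matches anywhere in the string.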

Week 5: Working a Text-Based Data Science Problem

Topics covered:

  • Reading in and exploring the data
  • Feature engineering
  • Model evaluation using train_test_split and cross_val_score
  • Making predictions for new data
  • Searching for optimal tuning parameters using GridSearchCV
  • Extracting features from text using CountVectorizer
  • Chaining steps into a Pipeline
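
One possible way to combine the topics above is to chain `CountVectorizer` and a model into a `Pipeline`, evaluate it with `cross_val_score`, and tune it with `GridSearchCV`. The eight documents, the labels, and the parameter grid below are assumptions made for this sketch, not the course's data or settings, and the imports assume a scikit-learn version that provides `sklearn.model_selection`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = spam, 0 = ham
X = [
    "win a free prize now",
    "free cash prize call now",
    "claim your free prize today",
    "free entry win cash now",
    "are we still meeting for lunch",
    "see you at lunch tomorrow",
    "can you call me after the meeting",
    "lunch was great see you soon",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

# Chaining the steps means the vocabulary is re-learned inside each training
# fold, so the held-out fold never influences the features
pipe = make_pipeline(CountVectorizer(), MultinomialNB())
scores = cross_val_score(pipe, X, y, cv=2, scoring="accuracy")

# Step names are the lowercased class names; parameters use step__parameter
param_grid = {
    "countvectorizer__ngram_range": [(1, 1), (1, 2)],
    "multinomialnb__alpha": [0.5, 1.0],
}
grid = GridSearchCV(pipe, param_grid, cv=2, scoring="accuracy")
grid.fit(X, y)

# After fitting, the grid object predicts with the best pipeline it found,
# refit on all of the data
prediction = grid.predict(["free prize"])
print(scores, grid.best_params_, grid.best_score_, prediction)
```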

Week 6: Advanced Machine Learning Techniques

Topics covered:

  • Reading in the Kaggle data and adding features
  • Using a Pipeline for proper cross-validation
  • Combining GridSearchCV with Pipeline
  • Efficiently searching for tuning parameters using RandomizedSearchCV
  • Adding features to a document-term matrix (using SciPy)
  • Adding features to a document-term matrix (using FeatureUnion)
  • Ensembling models
  • Locating groups of similar cuisines
  • Model stacking
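
As a small sketch of adding features to a document-term matrix with SciPy, the example below appends one extra numeric column to the sparse matrix produced by `CountVectorizer`. The documents and the word-count feature are hypothetical and are not taken from the Kaggle data used in the course.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents (not the course's Kaggle data)
docs = [
    "salt pepper olive oil",
    "soy sauce ginger garlic",
    "olive oil garlic basil tomato",
]

# Sparse document-term matrix: one row per document, one column per token
vect = CountVectorizer()
dtm = vect.fit_transform(docs)

# Hypothetical extra feature: number of words per document, as a 2D column
word_counts = np.array([[len(doc.split())] for doc in docs])

# Convert the new column to sparse and stack it to the right of the matrix
extra = sp.csr_matrix(word_counts)
combined = sp.hstack([dtm, extra]).tocsr()

print(dtm.shape, combined.shape)
```

Keeping the result sparse avoids converting the full document-term matrix to a dense array, which matters once the vocabulary grows large.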

mltext's People

Contributors

jnavarro86
