nlp-practicum2021's Introduction

Machine Learning Guild - NLP Practicum

Textbook:

Setup and Installation

  • You'll be able to follow along without installing Python on your computer by viewing the Jupyter notebooks in the repo. You should also be able to run all code except for lesson one using the Google Colab buttons.
  • The repository has been built for Windows computers. Some packages may be incompatible with Macs.
  • Environment setup instructions: https://www.youtube.com/watch?v=sUUWLBmj7Xc&feature=youtu.be
  • For lesson one, you will want to use environment_nlp-course-small.yml for a fast installation.
  • Config.ini: You will need to change two lines in the config.ini file under [USER]. Change the text following "USERNAME:" to your username and "RAW_DATA:" to the file path to the raw data folder within the repository.
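
As a minimal sketch of how the two [USER] values might be read back with configparser (listed under lesson 0), the snippet below parses an in-memory sample rather than the real file. The username and path are placeholders, and the repository's actual config.ini may contain other sections or keys.

```python
import configparser

# Hypothetical config.ini contents; only the [USER] section and its two
# keys come from the setup notes. The values are placeholders.
SAMPLE_CONFIG = """
[USER]
USERNAME: jdoe
RAW_DATA: C:/Users/jdoe/nlp-practicum/raw_data
"""


def load_user_settings(text):
    """Parse config text and return the two [USER] values as a dict."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    user = parser["USER"]
    # configparser treats option names case-insensitively by default.
    return {"username": user["USERNAME"], "raw_data": user["RAW_DATA"]}


settings = load_user_settings(SAMPLE_CONFIG)
print(settings["username"])
print(settings["raw_data"])
```

To read the file on disk instead, parser.read("config.ini") could replace read_string, assuming the script runs from the folder that contains the file.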

LESSONS

0. Configuration (Pre-work)

  • Topics: course overview, git bash, python config.ini files, conda virtual environments
  • Technology: git bash, configparser, conda
  • Homework: use the command line to search for data across thousands of server configuration files

1. Text Extraction

  • Topics: Extract text from docx, pdf, and image files
  • Technology: docx, PyPDF2, pdfminer.six, subprocess, pytesseract
  • Homework: structure the annual reports into sections
  • Supplementary Material: watch lesson_databases videos

2. Text Preprocessing

  • Topics: POS tagging, dependency parsing, rule-based matching, phrase detection
  • Technology: spaCy, gensim
  • Prework: Read sections 2.1-2.4 of SLP and/or watch the SLP videos for 2.1-2.5, then read sections 8.1-8.3 of SLP and chapter 5, Collocations
  • Supplementary Material: watch lesson_automation videos

3. Text Vectorization (count-based methods)

  • Topics: vector space model, TF-IDF, BM25, co-occurrence matrix
  • Technology: scikit-learn
  • Prework: Read section 6.1-6.6 SLP
  • Supplementary Material: watch lesson_object_oriented_python
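
To make the TF-IDF topic concrete, here is a small pure-Python sketch of one common textbook variant: raw term counts multiplied by log10(N / df). It is an illustration only, not the course's implementation, and it differs from scikit-learn's TfidfVectorizer, which applies smoothing and normalization by default. The function name and sample sentences are made up for this example.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    tf is the raw count of the term in the document;
    idf is log10(N / df), where df is the number of documents
    that contain the term.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Count each term once per document to get document frequency.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append(
            {term: count * math.log10(n_docs / df[term]) for term, count in tf.items()}
        )
    return weights


docs = ["the cat sat", "the dog sat", "the cat ran"]
w = tfidf(docs)
print(w[1])
```

A term that occurs in every document, such as "the" here, gets weight zero, while a term found in a single document, such as "dog", gets the largest idf.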

4. Dimensionality Reduction

  • Topics: PCA, latent semantic indexing (LSI), latent Dirichlet allocation (LDA), topic coherence metrics
  • Technology: scikit-learn, gensim
  • Prework: Read Taming Text with the SVD

5. Word Embeddings

6. Deep Learning for NLP 1

7. Deep Learning for NLP 2

8. Text Similarity

  • Topics: cosine similarity, distance metrics, L1 and L2 norms, recommendation engines
  • Technology: scikit-learn, spaCy, gensim
  • Prework: Read section 2.5 SLP and/or 2.1-2.5 SLP videos
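
The norms and cosine similarity named in the topics can be sketched in a few lines of plain Python. The helper names below are illustrative and are not taken from the course notebooks, which list scikit-learn, spaCy, and gensim for this lesson.

```python
import math


def l1_norm(v):
    """Sum of absolute values (Manhattan length)."""
    return sum(abs(x) for x in v)


def l2_norm(v):
    """Square root of the sum of squares (Euclidean length)."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Dot product divided by the product of the L2 norms."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    denom = l2_norm(a) * l2_norm(b)
    if denom == 0:
        # Convention chosen for this sketch: a zero vector is similar to nothing.
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


print(l1_norm([3, -4]))
print(l2_norm([3, -4]))
print(cosine_similarity([1, 0], [0, 1]))
```

Cosine similarity depends only on direction, so scaling a document vector (for example, by repeating the document twice) leaves the score unchanged.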

SUPPLEMENTARY MATERIAL

Automation

  • Topics: automate the process to collect data from https://www.annualreports.com
  • Technology: requests, Jupyter Notebooks, BeautifulSoup, Scrapy
  • Homework: automate the process to identify and download company 10-K annual reports

Databases

  • Topics: use SQLAlchemy to create and populate a database, locally and on AWS
  • Technology: SQLAlchemy, SQLite, AWS RDS (MySQL)
  • Homework: create and populate a database with sqlalchemy

Object Oriented Python

  • Topics: reconstruct scikit-learn's CountVectorizer codebase
  • Technology: scikit-learn, object oriented Python
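
As a rough idea of what reconstructing CountVectorizer can look like, the toy class below follows the fit/transform pattern with a vocabulary_ attribute. It is a simplified sketch, not the lesson's code or scikit-learn's implementation: the class name is hypothetical, it returns plain lists instead of a sparse matrix, and it leaves out options such as n-grams and stop words.

```python
import re
from collections import Counter


class MiniCountVectorizer:
    """A toy bag-of-words vectorizer with a fit/transform interface."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase
        self.vocabulary_ = {}

    def _tokenize(self, text):
        if self.lowercase:
            text = text.lower()
        # Keep tokens of two or more word characters.
        return re.findall(r"\b\w\w+\b", text)

    def fit(self, docs):
        """Learn a term -> column index mapping, sorted alphabetically."""
        terms = sorted({tok for doc in docs for tok in self._tokenize(doc)})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, docs):
        """Return one row of term counts per document."""
        rows = []
        for doc in docs:
            row = [0] * len(self.vocabulary_)
            for term, count in Counter(self._tokenize(doc)).items():
                idx = self.vocabulary_.get(term)
                if idx is not None:  # terms unseen during fit are ignored
                    row[idx] = count
            rows.append(row)
        return rows

    def fit_transform(self, docs):
        return self.fit(docs).transform(docs)


vec = MiniCountVectorizer()
matrix = vec.fit_transform(["The cat sat", "the cat saw the dog"])
print(vec.vocabulary_)
print(matrix)
```

Separating fit from transform lets the vocabulary learned on one set of documents be reused on new text, where unknown words are simply dropped.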
