practical-nlp / practical-nlp-code

Official Repository for Code associated with 'Practical Natural Language Processing' book by O'Reilly Media

Home Page: http://www.practicalnlp.ai/

License: MIT License

Jupyter Notebook 99.70% Python 0.30%
natural-language-processing natural-language-understanding oreilly-books

practical-nlp-code's Introduction

Practical Natural Language Processing

A Comprehensive Guide to Building Real-World NLP Systems

Sowmya Vajjala, Bodhisattwa P. Majumder, Anuj Gupta, Harshit Surana

Endorsed by: Zachary Lipton, Sebastian Ruder, Marc Najork, Monojit Choudhury, Vinayak Hegde, Mengting Wan, Siddharth Sharma, & Ed Harris. Foreword by: Julian McAuley.

Homepage: www.practicalnlp.ai. Published by O'Reilly Media, 2020.


Book Structure


Please note that the code repository is still under development and review.

All the notebooks will be finalized in the coming months.

The notebooks have been tested on an Ubuntu machine running Python 3.6.

Currently, we are using TF1.x. We will migrate to TF2.x in the coming months.  

✨ A newer version compatible with Ubuntu 23 is being added in the pnlp-refactor-ubuntu23 branch.

We thank everyone, especially the educators & universities, for their feedback in pointing out the issues & improving the accessibility of the notebooks.


🚩 Details of the repository roadmap can be found here


Open the repository in Google Colab: Open In Colab

Open the repository in Jupyter nbviewer: Open in nbviewer

Chapterwise folders:

Contributors to Codebase:

Found a Bug?

If the bug is in the book text, please submit an erratum here: https://www.oreilly.com/catalog/errata.csp?isbn=0636920262329

If the bug is in the codebase: Great! Please read this guide on how you can help us improve the codebase.

practical-nlp-code's People

Contributors

anujgupta82, dependabot[bot], devanshu125, faisito, jatinpapreja, kartikay-bagla, kumar-apurva, majumderb, mbodhisattwa, nishkalavallabhi, samyak2, shawn-jung, sony-au, sukeeratsg, suranah, varunp2k


practical-nlp-code's Issues

Fix notebook dying due to inactivity on Google Colab (ch 4, nb 7)

The notebook dies due to inactivity on Google Colab when running cells that take 30 minutes or more. Suspected causes include an unstable internet connection on the user's side or an issue on Google Colab's end.

Screenshot: progress stops and the notebook needs to be rerun.

[BUG] Mistakes in Ch4/01_OnePipeline_ManyClassifiers.ipynb

In: Ch4/01_OnePipeline_ManyClassifiers.ipynb

1. In cell 14, related to SVM: there is a copy/paste error in the comment, which refers to logistic regression even though the code in that cell concerns only the SVM.

2. The AUC metric reported for the SVM does calculate something, but it is the AUC for the prior cell (logistic regression), as can be seen from the duplicated values.
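
A possible correction for the SVM cell, as a sketch (the names svm_clf, X_test_dtm, and y_test are assumptions, not necessarily the notebook's exact variable names): compute the AUC from the SVM's own decision scores instead of reusing the logistic-regression values.

    # Compute the ROC AUC from the SVM's own scores (LinearSVC/SVC expose
    # decision_function rather than predict_proba by default). Assumes a binary
    # target; encode the labels numerically first if they are strings.
    from sklearn.metrics import roc_auc_score

    svm_scores = svm_clf.decision_function(X_test_dtm)
    print("SVM ROC AUC:", roc_auc_score(y_test, svm_scores))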

Stanford Core NLP on Google Colab

This template is ONLY to be used for reporting bugs in the code.

ISSUE: Unable to connect to StanfordCoreNLPServer on Google Colab

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch9/01_Aspect_Based_Sentiment_analysis.ipynb

Current Behavior

unable to connect to StanfordCoreNLPServer

Expected Behavior

connected to StanfordCoreNLPServer

Possible Solution

Stanza can be used, but it gives slightly different results (sentiment analysis on words instead of sentences).

Steps to Reproduce

  1. download and install java:
    import os
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  2. download stanford-corenlp:
    !wget "https://nlp.stanford.edu/software/stanford-corenlp-latest.zip"
    !unzip "stanford-corenlp-latest.zip"
  3. change directory:
    cd stanford-corenlp-4.2.0/
  4. connect to server:
    !java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 10000
    if using port 9000 (even though nothing else is using this port):
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - Server default properties:
    (Note: unspecified annotator properties are English defaults)
    inputFormat = text
    outputFormat = json
    prettyPrint = false
    [main] INFO CoreNLP - Threads: 2
    [main] INFO CoreNLP - Starting server...
    [main] WARN CoreNLP - java.net.BindException: Address already in use
    java.base/sun.nio.ch.Net.bind0(Native Method)
    java.base/sun.nio.ch.Net.bind(Net.java:455)
    java.base/sun.nio.ch.Net.bind(Net.java:447)
    java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    jdk.httpserver/sun.net.httpserver.ServerImpl.<init>(ServerImpl.java:101)
    jdk.httpserver/sun.net.httpserver.HttpServerImpl.<init>(HttpServerImpl.java:50)
    jdk.httpserver/sun.net.httpserver.DefaultHttpServerProvider.createHttpServer(DefaultHttpServerProvider.java:35)
    jdk.httpserver/com.sun.net.httpserver.HttpServer.create(HttpServer.java:137)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.run(StanfordCoreNLPServer.java:1527)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.launchServer(StanfordCoreNLPServer.java:1624)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.main(StanfordCoreNLPServer.java:1631)
    [Thread-0] INFO CoreNLP - CoreNLP Server is shutting down.

if using port 9001 (it freezes at this point):
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
(Note: unspecified annotator properties are English defaults)
inputFormat = text
outputFormat = json
prettyPrint = false
[main] INFO CoreNLP - Threads: 2
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:9001

Context (Environment)

running the notebook on colab (on Windows 10)
python 3.6.9
pycorenlp-0.3.0

Possible Implementation

this is an option: https://colab.research.google.com/github/stanfordnlp/stanza/blob/master/demo/Stanza_CoreNLP_Interface.ipynb
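
A minimal sketch of that Stanza-based alternative: let Stanza's CoreNLPClient launch and manage the CoreNLP server itself instead of starting it manually with java. It assumes CoreNLP was unzipped as in steps 2-3 above; the port, memory setting, and example sentence are arbitrary choices.

    import os
    from stanza.server import CoreNLPClient

    # Point Stanza at the unzipped CoreNLP distribution from step 3.
    os.environ["CORENLP_HOME"] = "./stanford-corenlp-4.2.0"

    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse", "sentiment"],
                       memory="5G", endpoint="http://localhost:9001",
                       timeout=30000, be_quiet=True) as client:
        ann = client.annotate("The food was great but the service was slow.")
        for sentence in ann.sentence:
            # Sentence-level sentiment labels from the CoreNLP sentiment annotator.
            print(sentence.sentiment)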

Jupyter Notebook kernel died

Hi,

I was wondering if anyone knows why the Jupyter Notebook kernel kept dying when I ran the code below on Ubuntu:

    # model_type: word2vec, glove or fasttext
    aug = naw.WordEmbsAug(
        model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
        action="insert")

I was wondering whether it's because GoogleNews-vectors-negative300.bin is so large that the kernel runs out of memory?

Thank you.


Fix Dependency Issues and Update code with latest versions of python libraries for Notebooks of Chapter 5

I'm working on updating the code for Chapter 5 to ensure compatibility with newer versions of Python and updated libraries. This issue will track the progress and changes made for these notebooks.

Notebooks:

  • 01_KPE.ipynb
  • 02_NERTraining.ipynb
  • 03_NERIssues.ipynb
  • 04_NER_using_spaCy - CoNLL.ipynb
  • 05_BERT_CONLL_NER.ipynb
  • 06_EntityLinking-AzureTextAnalytics.ipynb
  • 07_REWatson.ipynb
  • 08_Duckling.ipynb

Tasks:

  • Update Python version from 3.6 to 3.9
  • Update Ubuntu version to 22.04
  • Update all Python libraries to their latest versions
  • Test the notebook to ensure it runs without errors

Context:

  • Current Python Version: 3.6
  • Current Ubuntu Version: 16.04
  • Current Library Versions:

Proposed Changes:

  • Update notebooks to run with Ubuntu 22.04
  • Update notebooks to run with Python 3.10
  • Libraries in each notebook will be updated to the latest version
  • Markdown Text will be improved

Related Issues:

  • Will list them here

Let's work together to update these notebooks effectively and ensure they align with the latest Python and library versions.

[BUG]

This template is ONLY to be used for reporting bugs in the code.

The flat_classification_report function is not working (in the train_seq function); there is a strange problem with it.

Location

https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch5/02_NERTraining.ipynb

Current Behavior

Training a Sequence classification model with CRF
0.9369945384072719

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-21-5212b1f7685b>](https://localhost:8080/#) in <module>()
     15 
     16 if __name__=="__main__":
---> 17     main()

3 frames

[/usr/local/lib/python3.7/dist-packages/sklearn_crfsuite/metrics.py](https://localhost:8080/#) in flat_classification_report(y_true, y_pred, labels, **kwargs)
     66     """
     67     from sklearn import metrics
---> 68     return metrics.classification_report(y_true, y_pred, labels, **kwargs)
     69 
     70 

TypeError: classification_report() takes 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given

Expected Behavior

It should have worked as it does in your notebook.

Possible Solution

I think the labels argument is now being handled differently, or the function's signature changed upstream.
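
A possible workaround, as a sketch (assuming y_true and y_pred are the per-sentence label sequences produced in the notebook): flatten the sequences and call scikit-learn's classification_report directly, passing labels as a keyword argument, which newer scikit-learn versions require.

    from itertools import chain
    from sklearn.metrics import classification_report

    # Flatten the per-sentence tag sequences into flat lists of tags.
    flat_true = list(chain.from_iterable(y_true))
    flat_pred = list(chain.from_iterable(y_pred))

    # `labels` must now be passed as a keyword argument, not positionally.
    print(classification_report(flat_true, flat_pred, labels=sorted(set(flat_true))))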

🚩 Roadmap & Milestones for 2020

🚩 Overview

Based on the feedback received from readers, we have decided to focus on a few things to make this repo as accessible and useful as possible. This issue contains a high-level view of the updates planned for 2020. It will continue to be updated based on reader feedback.

๐Ÿ›ฃ๏ธ Roadmap

We think the following features, enhancements & fixes are most relevant for 2020:

๐Ÿ–ฅ๏ธ Platform Accessibility

  • Easy data download for Windows
  • Python & dependencies issues resolved for Windows
  • Full support for running on Colab

📚 Code & Pedagogical Improvements

  • RASA chatbot notebook
  • Input sample & stats demonstrated for all notebooks
  • Output sample demonstrated for all notebooks
  • Output visualization demonstrated where possible
  • Color figures are needed for print buyers; they will be made available for all.

🚘 Milestones

The roadmap is broken into the following milestones.

🚧 Version 1.5

  • All platform accessibility-related issues to be resolved
  • The first draft of pedagogical enhancements

๐Ÿ Version 1.9

  • Input & output to be demonstrated well for all notebooks
  • Other user requests handled

08_Duckling.ipynb - TypeError

Hi, I've been trying to run the below code on Jupyter Notebook and Colab, but both threw the same error:


TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>()
----> 1 pprint(d.parse_time(u"Let's meet at 11:45am"))
      2 #pprint(d.parse_time(u'You owe me twenty bucks, please call me today'))

4 frames

/usr/local/lib/python3.6/dist-packages/jpype/_jstring.py in __getitem__(self, i)
     46
     47     def __getitem__(self, i):
---> 48         if i < 0:
     49             i += len(self)
     50             if i < 0:

TypeError: '<' not supported between instances of 'slice' and 'int'

I tried duckling 1.8.0, and 1.7.1, without success. I was wondering if anyone could point me in the right direction? Thank you.

Regards,
Sony

Refactor the repo for compatibility with latest ubuntu & ML libraries

In response to feedback from students and instructors actively using this repo in formal coursework, we recognize the need to address compatibility issues with the latest OS and libraries. We're initiating a refactor to ensure seamless use in classrooms & by self-learners. This is part of our continuous effort to keep the codebase relevant and functional.

Until the new branch is stable, all changes related to this refactor will be available in the branch pnlp-refactor-ubuntu23.

[BUG] Wikimedia link not working

This template is ONLY to be used for reporting bugs in the code.

Location

Chapter 3 Notebook 6

Current Behavior

Link not working
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2

--2021-03-31 18:04:43--  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-03-31 18:04:43 ERROR 404: Not Found.

Expected Behavior

Possible Solution

Steps to Reproduce

1. Run the notebook

Context (Environment)

google colab

Possible Implementation

Data not available for Doc2Vec (Chapter 4)

Hello Folks,

I was trying to replicate the code in Google Colab for Chapter 4. While running 02_Doc2Vec_Example.ipynb, I was not able to get the data from Kaggle as it had been removed. Then I referred to a previous issue and found that the data was in the Data folder.

While using that data, I found it is different from what is indicated in the notebook.

Can you please indicate which data we should use, as the accuracy from this data is pretty low compared to what is showcased in the notebook?

Following is the output:


#Load the dataset and explore.
filepath = "data/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()
sentiment content
empty @tiffanylue i know i was listenin to bad habi...
sadness Layin n bed with a headache ughhhh...waitin o...
sadness Funeral ceremony...gloomy friday...
enthusiasm wants to hang out with friends SOON!
neutral @dannycastillo We want to trade with someone w...
df['sentiment'].value_counts()
category value
worry 7433
neutral 6340
sadness 4828
happiness 2986
love 2068
surprise 1613
hate 1187
fun 1088
relief 1021
empty 659
enthusiasm 522
boredom 157
anger 98
Name: sentiment, dtype: int64
#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(16759, 2)

preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))
              precision    recall  f1-score   support

   happiness       0.34      0.54      0.42       713
     neutral       0.48      0.56      0.52      1595
       worry       0.62      0.40      0.48      1882

    accuracy                           0.48      4190
   macro avg       0.48      0.50      0.47      4190
weighted avg       0.52      0.48      0.49      4190

Issue with code sample in book from chapter 3 "PRE-TRAINED WORD EMBEDDINGS"

Hi - apologies if this is the wrong place to report this, but I have been reading the online version of this book, and when I try to run the following code sample from chapter 3 with the path to the model updated:

from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = "NLPBookTut/GoogleNews-vectors-negative300.bin"
w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
print('done loading Word2Vec')
print(len(w2v_model.vocab)) #Number of words in the vocabulary.
print(w2v_model.most_similar['beautiful'])
W2v_model['beautiful']

It fails with the following:

$ python3 word2vec.py                                                                                                                                                        
done loading Word2Vec
Traceback (most recent call last):
  File "word2vec.py", line 5, in <module>
    print(len(w2v_model.vocab)) #Number of words in the vocabulary.
  File "/home/sits/.local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 645, in vocab
    raise AttributeError(
AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

I can see the code for Ch3 has been changed to take this into account, e.g., removing the len() call and using code like:

print(w2v_model.most_similar('beautiful'))

Can the online book be updated with the correct code?
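
For reference, a minimal Gensim 4.x sketch of the same example (assuming the GoogleNews vectors are available at the path used above):

    from gensim.models import KeyedVectors

    pretrainedpath = "NLPBookTut/GoogleNews-vectors-negative300.bin"
    w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
    print('done loading Word2Vec')
    print(len(w2v_model.key_to_index))          # vocabulary size in Gensim 4.x
    print(w2v_model.most_similar('beautiful'))  # most_similar is a method, not a dict
    print(w2v_model['beautiful'])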

Chapter 3 TF - IDF

I was going through your book to learn NLP. A great effort in writing such an amazing book.

I found that the example for TF-IDF is a bit unclear in the way TF scores are calculated for the toy corpus.

Ex: dog = 1/3 and bites = 1/6; I am unable to understand how these ratios are arrived at. Please include that in the example description if possible.

macOS 'Buggy Accelerate Backend when using numpy 1.19'

Error Message:

Polyfit sanity test emitted a warning, most likely due to using a buggy Accelerate backend. If you compiled yourself, see site.cfg.example for information. Otherwise report this to the vendor that provided NumPy.

As per the linked issue in numpy, I wanted to raise and document the solution here (or link it to a TROUBLESHOOTING.md) for others trying to use the source code for this book.

numpy/numpy#15947

Current Behavior

git clone https://github.com/practical-nlp/practical-nlp
cd practical-nlp
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel

# This step is now problematic on macOS
python3 -m pip install -r requirements.txt

This will spit errors in scikit-image and gensim about buggy Accelerate backend in numpy 1.19.

Possible Solution

There are two proposed solutions:

  1. Fallback to numpy==1.18.0
  2. Install openblas

Combining both:

brew install openblas
OPENBLAS="$(brew --prefix openblas)" python3 -m pip install numpy==1.18.0

Updating the pinned version in requirements.txt to numpy==1.18.0 also allows this option:

brew install openblas
OPENBLAS="$(brew --prefix openblas)" python3 -m pip install -r requirements.txt

Context

Why is this a problem?

BLAS stands for Basic Linear Algebra Subprograms. The core math operations in numpy that actually make it fast can lean on platform-specific implementations of BLAS.

Shipping compiled programs that support every combination of CPU architecture, OS platform, and OS version is a dauntingly difficult task.

So numpy lets the OS provide its own BLAS backend and delegates this responsibility. What makes this harder is that GPU-accelerated BLAS variants blow out the number of combinations, making them unwieldy for numpy to support on its own.

On macOS, the Accelerate framework is meant to provide that native BLAS support.

numpy, being a good citizen of the scientific community, occasionally needs to flag certain versions of these frameworks as bad builds because they produce incorrect results.

That is why we need to swap it out for an alternative in this situation. It isn't the end of the world, just a sharp edge to be aware of in this space.

Conclusion

Hopefully future versions of numpy and Accelerate play nicely again and this stops being an issue. For now, the error shows up because a calculation known to trigger the fault in these backends runs during install.

terminal command lines not working on Windows 10 Anaconda

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/05_Pre_Trained_Word_Embeddings.ipynb
https://github.com/practical-nlp/practical-nlp/blob/master/Ch4/01_OnePipeline_ManyClassifiers.ipynb

Current Behavior

'wget' is not recognized as an internal or external command, operable program or batch file.
'apt-get' is not recognized as an internal or external command, operable program or batch file.

Expected Behavior

Possible Solution

Steps to Reproduce

  1. !pip install wget
  2. Run the code; the error appears.
  3. !pip install apt-get, then "ERROR: Could not find a version that satisfies the requirement apt-get (from versions: none)
    ERROR: No matching distribution found for apt-get"
  4. Run the code, then: "'apt-get' is not recognized as an internal or external command, operable program or batch file."
  5. More detailed explanation in the original message below.

Context (Environment)

running on Jupyter Notebook
Python 3.8.3
Windows 10 64, Conda, both base and virtual env (python=3.6)
Clean installed Windows and Anaconda.

Possible Implementation

Found the template and added the above. Below is the original message as written.
FYI, I am not a coder or a data scientist, so the error could be due to my lack of knowledge.

Hello,

I have been enjoying this book since the moment I bought this last week and trying to walk through the commands on Jupyter.

I am on Windows 10 64-bit, not Mac, and am having trouble executing lines starting with !. They do not work on my desktop (e.g., Ch. 3-5 wget, Ch. 4-1 apt-get). I first googled and saw that ! does not work on Windows, but others say bash commands are now okay. So I installed wget-3.2 (!pip install wget) in Jupyter but am still unable to execute it. I get the following message: "'wget' is not recognized as an internal or external command, operable program or batch file." I even installed a separate wget program on Windows because others on the internet say it has to be added to the PATH variable.
Also, I could not even install apt-get (using !pip install apt-get); it fails with the message "ERROR: Could not find a version that satisfies the requirement apt-get (from versions: none) ERROR: No matching distribution found for apt-get".

Could you help me with these and possibly other terminal commands in the following chapters? Given the growing popularity of NLP and the excellent quality of this book, I am sure more people on Windows 10 will walk through the commands here and might face the same issues.

P.S. I also think the TF-IDF table in Ch. 3 is hard to understand based on the definition you provided.

The calculation of TF-IDF scores in Chap 3 is unclear

Hi, I came across this book recently and really love it. However, I found that the calculation of the TF-IDF scores in Chapter 3 is kind of confusing.

More specifically, in Table 3-2, each word has a TF score. But since we have 4 sentences (treated as 4 docs in the example), I think we should have one TF score for each word in each sentence?

Accordingly, the vector representation for D1 should be: [1/3*IDF(dog), 1/3*IDF(bites), 1/3*IDF(man), 0, 0, 0], since each word in D1 appears only once and TF(w, D1) should be equal to 1/3 for all of them.
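
A small sketch of the standard TF-IDF computation being described, using what is assumed to be the book's toy corpus; the exact IDF formula (log base, smoothing) may differ from the one used in the book.

    import math

    corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
    docs = [d.split() for d in corpus]
    vocab = sorted({w for d in docs for w in d})

    def tf(term, doc):
        # term frequency relative to the document's length
        return doc.count(term) / len(doc)

    def idf(term):
        # inverse document frequency over the 4-document corpus
        df = sum(term in doc for doc in docs)
        return math.log(len(docs) / df)

    d1 = docs[0]  # "dog bites man": TF is 1/3 for each of its words
    print({w: round(tf(w, d1) * idf(w), 3) for w in vocab})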

Color figures & outline

Print readers of the book have found some images to be not legible in black & white. As a result, we decided to update the repo with color figures & outline for all chapters. This should aid our readers.

Address LexNLP library issue: installing a newer version of LexNLP gives an error (Ch 10, nb 2)

I am trying to install a newer version of LexNLP. During installation of LexNLP's dependencies, it gives an error:

ERROR: Failed building wheel for scikit-learn
and
ERROR: Could not build wheels for scikit-learn, which is required to install pyproject.toml-based projects

The issue has been reported earlier this year on the original repository here.

Screenshot of the error

Fix Dependency Issues and Update code with latest versions of python libraries for Notebooks of Chapter 4

I'm working on updating the code for Chapter 4 to ensure compatibility with newer versions of Python and updated libraries. This issue will track the progress and changes made for these notebooks.

Notebooks:

  • 01_OnePipeline_ManyClassifiers.ipynb
  • 02_Doc2Vec_Example.ipynb
  • 03_Word2Vec_Example.ipynb
  • 04_FastText_Example.ipynb
  • 05_DeepNN_Example.ipynb
  • 06_BERT_IMDB_Sentiment_Classification.ipynb
  • 07_BERT_Sentiment_Classification_IMDB_ktrain.ipynb
  • #146
  • #147
  • 10_ShapDemo.ipynb
  • 11_SpamClassification.ipynb

Tasks:

  • Update Python version from 3.6 to 3.9
  • Update Ubuntu version to 22.04
  • Update all Python libraries to their latest versions
  • Test the notebook to ensure it runs without errors

Context:

  • Current Python Version: 3.6
  • Current Ubuntu Version: 16.04
  • Current Library Versions:

Proposed Changes:

  • Update notebooks to run with Ubuntu 22.04
  • Update notebooks to run with Python 3.10
  • Libraries in each notebook will be updated to the latest version
  • Markdown Text will be improved

Related Issues:

  • Will list them here

Let's work together to update these notebooks effectively and ensure they align with the latest Python and library versions.

Fix data download issues for Windows

Data downloads for the training & test datasets are currently geared towards UNIX operating systems. We need to bring that up to par for Windows as well and ensure this is done for each chapter's notebooks (a platform-independent sketch follows the list below):

  • Ch 2
  • Ch 3
  • Ch 4
  • Ch 5
  • Ch 6
  • Ch 7
  • Ch 8
  • Ch 9
  • Ch 10
  • Ch 11
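
As a sketch of what a platform-independent download cell could look like: fetch and unpack the archive from Python instead of shelling out to wget/tar, so the same cell also works on Windows. The URL is only an illustration, reusing the DBpedia archive referenced elsewhere in this repo.

    import tarfile
    import urllib.request

    # Download the archive without relying on wget being installed.
    url = "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"
    urllib.request.urlretrieve(url, "dbpedia_csv.tar.gz")

    # Extract it without relying on a tar binary.
    with tarfile.open("dbpedia_csv.tar.gz") as tar:
        tar.extractall()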

Ch7 Book Summary Dataset Missing

Thank you for your reply Harshit. Good to hear that the book has been adopted at several colleges.
One way to get issues around is WSL. I was able to avoid most of the issues by installing WSL on my Windows and Anaconda on WSL. Your scripts work way better on Linux thanks to the similarity between Mac Terminal and Linux Bash.
Still, there are quite a few issues like dependency. Many modules are imported without being first installed. It would be no problem on your Mac, but first-time nlp reader will not be able to run the scripts. Don't know exactly why but it is harder to use terminal command in Jupyter on Windows (even with admin status), so I had to get out of Jupyter and installed modules on Terminal and got back. Also, there are issues of 'pip install' and 'anaconda install'. I suggest you clean install Anaconda or clean Python and run the codes in Jupyter notebook on Windows,

Another issue I found is about topic modeling in Ch. 7. The data used for the demonstration is loaded from a hard-coded local path: data_path = "/home/etherealenvy/Downloads/booksummaries/booksummaries.txt". There is no way for others to follow the notebook.

Originally posted by @knslee07 in #13 (comment)
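
A possible way around the hard-coded path, as a sketch: download the CMU Book Summary corpus and point data_path at the extracted file. Treat the URL as an assumption about where the dataset is currently hosted.

    import tarfile
    import urllib.request

    # Fetch and unpack the CMU Book Summary corpus.
    url = "http://www.cs.cmu.edu/~dbamman/booksummaries/booksummaries.tar.gz"
    urllib.request.urlretrieve(url, "booksummaries.tar.gz")

    with tarfile.open("booksummaries.tar.gz") as tar:
        tar.extractall()

    # Path inside the extracted archive, replacing the author's local path.
    data_path = "booksummaries/booksummaries.txt"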

[BUG] spelling_en.txt link not working

This template is ONLY to be used for reporting bugs in the code.

Location

chapter 2 notebook 5

Current Behavior

The link (https://github.com/makcedward/nlpaug/blob/master/model/spelling_en.txt) is not working:

Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-03-23 16:21:46 ERROR 404: Not Found.

Expected Behavior

Possible Solution

Steps to Reproduce

  1. run the notebook

Context (Environment)

using colab

Possible Implementation

Chapter 3 one-hot encoding

  1. I found a small typo in the one-hot representation.

For the toy corpus, the textbook gives the D4 representation as [ [0 0 1 0 0] [0 0 0 0 1 0] [0 0 0 0 0 1] ].

The one-hot representation for the first word, "man", is only of length 5; one zero is missing from that one-hot vector.

  2. In the CBOW method for learning embeddings:

One of the sentences in the text is: "Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, …, wn) to the whole sentence."

Shouldn't the probability be over the words w1, w2, …, wm instead of wn?

  3. The dimensions used for CBOW in the explanation are (V, d), whereas in the CBOW figure (3.9) the dimensions used are (V, N).

practical-nlp/Ch4/04_FastText_Example.ipynb Link Formatting

In the first cell, the link to the DBpedia dataset has a # in front of the URL, making it not open when clicked.

original:
[here](#https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz)

corrected:
[here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz)

[BUG] for Ch4/08_LimeDemo.ipynb

Problem

In Part 2: Using Lime to interpret predictions,

mystring = list(X_test)[221]
print(c.predict_proba([mystring]))

there is an extra list and it gives this error:

AttributeError: 'list' object has no attribute 'replace'

and this error refers to this line of the code in the clean function:
doc = doc.replace("</br>", " ")

Solution:

This problem can easily be solved by removing the list() call and changing the code to this:

mystring = X_test[221]
print(c.predict_proba([mystring]))

practical-nlp/Ch4/03_Word2Vec_Example.ipynb

In the above Jupyter notebook, in the function:

# Creating a feature vector by averaging all embeddings for all sentences
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        feats.append(feat_for_this)
    return feats

The feature vectors are not averaged; the sum of the word embeddings is appended directly as the feature, whereas the function description indicates that the vectors will be averaged.
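
A possible fix matching the docstring, as a sketch (it assumes np and w2v_model are already defined as in the notebook): average the embeddings instead of summing them, keeping the zero vector when none of the tokens are in the vocabulary.

    # Creating a feature vector by averaging all embeddings for all sentences
    def embedding_feats(list_of_lists):
        DIMENSION = 300
        feats = []
        for tokens in list_of_lists:
            feat_for_this = np.zeros(DIMENSION)
            count_for_this = 0
            for token in tokens:
                if token in w2v_model:
                    feat_for_this += w2v_model[token]
                    count_for_this += 1
            if count_for_this > 0:
                feat_for_this /= count_for_this  # average rather than sum
            feats.append(feat_for_this)
        return feats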

Shap Demo not working on tensorflow >= 2.0.0

ISSUE: TypeError: 'NoneType' object cannot be interpreted as an integer, when running shap on LSTM model

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch4/10_ShapDemo.ipynb

Current Behavior

running the code below:
shap_values = explainer.shap_values(x_val[:5])

gives you the below error:
TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>()
      1 # explain the first 10 predictions
      2 # explaining each prediction requires 2 * background dataset size runs
----> 3 shap_values = explainer.shap_values(x_val[:5])
      4
      5 import numpy as np

4 frames

/usr/local/lib/python3.6/dist-packages/shap/explainers/deep/deep_tf.py in anon()
352 shape = list(self.model_inputs[i].shape)
353 shape[0] = -1
--> 354 data = X[i].reshape(shape)
355 v = tf.constant(data, dtype=self.model_inputs[i].dtype)
356 inputs.append(v)

TypeError: 'NoneType' object cannot be interpreted as an integer

Expected Behavior

according to your page, we should get the below output:
Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step

Possible Solution

It's likely due to shap not working properly on TensorFlow 2.0.0 and above; but if we use TensorFlow 1.x, then we get a message that Keras does not work with TF 1.x.

Steps to Reproduce

Steps are the same as in the provided notebook, but I added the lines below to install TF 2.3.0:
%tensorflow_version 2.x
!pip uninstall -y tensorflow
!pip install tensorflow==2.3.0

Context (Environment)

running the notebook on colab (on Windows 10)
python 3.6.9
shap 0.35.0 (also tried the latest version)
tf 2.3.0 (also tried 2.4.0 and 1.14)

Possible Implementation

no working solutions identified yet

NOTE:

Also, in the code below, batch_size and max_features were defined but not used (though using them doesn't make any difference):

batch_size = 64
max_features = vocab_size + 1

# Training an LSTM with embedding on the fly

print("Defining and training an LSTM model, training embedding layer on the fly")
# modified from:
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(2, activation='sigmoid'))
rnnmodel.compile(loss='binary_crossentropy',
                 optimizer='adam',
                 metrics=['accuracy'])
print('Training the RNN')
rnnmodel.fit(x_train, y_train,
             batch_size=32,
             epochs=2,
             validation_data=(x_val, y_val))
score, acc = rnnmodel.evaluate(test_data, test_labels,
                               batch_size=32)
print('Test accuracy with RNN:', acc)

Update for current versions of spaCy, gensim, etc.?

Great book! I read it cover to cover AND tried to run nearly all of the code, which I almost never do.

Could the code in this repo for spaCy and gensim be updated to the current versions? As one example, Ch5/01_KPE.ipynb does not run with the current version of spaCy. I am just learning but I assume changes might include, for Ch5/01_KPE.ipynb,

Book version:

!pip install textacy==0.9.1
!pip install spacy==2.2.4

import spacy
import textacy.ke
from textacy import *

print(f'Using textacy {textacy.__version__} and spaCy {spacy.__version__}')

# Worked with 2.2.4:
textacy.ke.textrank(doc, topn=10)

# Worked with 2.2.4:
print("Textrank output: ", [kps for kps, weights in textacy.ke.textrank(doc, normalize="lemma", topn=5)])

What appears to run okay as of this writing, December 2021, using spaCy 3.2.0...

# Lines of Ch5/01_KPE.ipynb revised for spaCy 3.2.0:
!pip install textacy==0.11.0   # or 0.12.0 but I haven't tried that
!pip install spacy==3.2.0

import spacy
import textacy
from textacy import extract
from textacy.extract import keyterms as kt

print(f'Using textacy {textacy.__version__}')
print(f'Using spaCy {spacy.__version__}')

# Works with 3.2.0:
import spacy
import textacy
from textacy import extract
from textacy.extract import keyterms as kt

print(f'Using textacy {textacy.__version__} and spaCy {spacy.__version__}')

# Works with 3.2.0:
kt.textrank(doc, normalize="lemma", topn=10) # I'm not sure the role of normalize

# Works with 3.2.0:
print("Textrank output: ", [kps for kps, weights in extract.keyterms.textrank(doc, normalize="lemma", topn=5)])

Would be great if someone smarter than me could update the book's spaCy- and gensim-related code to run current versions for 2022...

04_FastText_Example.ipynb - code outdated

Hi, supervised is no longer supported in fasttext. I used the recommended train_supervised, but was unable to get the 97% precision/recall you reported in the book. Below is my code:

Using fastText for feature extraction and training

#from fasttext import supervised
from fasttext import train_supervised
"""fastText expects and training file (csv), a model name as input arguments.
label_prefix refers to the prefix before label string in the dataset.
default is label. In our dataset, it is class.
There are several other parameters which can be seen in:
https://pypi.org/project/fasttext/
"""
#%time model = supervised(train_file, 'temp', label_prefix="class")
%time model = train_supervised(train_file, label_prefix="class", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50)
results = model.test(test_file)
#print(results.nexamples, results.precision, results.recall)
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*results)

below is the result:
CPU times: user 6min 15s, sys: 2.64 s, total: 6min 18s
Wall time: 6min 18s
N 70000
P@1 0.481
R@1 0.481

any comments would be appreciated. thanks
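
For comparison, a sketch against the current fasttext API (the label argument replaces the old label_prefix). train_file and test_file follow the notebook's naming, and the label value is an assumption that must match the prefix actually used in the training file.

    import fasttext

    model = fasttext.train_supervised(
        input=train_file,
        label="class",              # must match the label prefix used in train_file
        lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50)

    n, p, r = model.test(test_file)  # returns (num examples, precision@1, recall@1)
    print("N\t{}".format(n))
    print("P@1\t{:.3f}".format(p))
    print("R@1\t{:.3f}".format(r))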

Question on a new version of the book

Hi,

Thanks for making available the companion code for the book on Practical Natural Language Processing.

Since you are planning to move the code from tf 1 to tf 2, are you also planning to print a new version of the companion book as well?

Many thanks,

Ivan
