practical-nlp / practical-nlp-code

Official Repository for Code associated with 'Practical Natural Language Processing' book by O'Reilly Media

Home Page: http://www.practicalnlp.ai/

License: MIT License

Jupyter Notebook 99.70% Python 0.30%
natural-language-processing natural-language-understanding oreilly-books

practical-nlp-code's Introduction

Practical Natural Language Processing

A Comprehensive Guide to Building Real-World NLP Systems

Sowmya Vajjala, Bodhisattwa P. Majumder, Anuj Gupta, Harshit Surana

Endorsed by: Zachary Lipton, Sebastian Ruder, Marc Najork, Monojit Choudhury, Vinayak Hegde, Mengting Wan, Siddharth Sharma, & Ed Harris. Foreword by: Julian McAuley.

Homepage: www.practicalnlp.ai. Published by O'Reilly Media, 2020.


Book Structure


Please note that the code repository is still under development and review.

All the notebooks will be finalized in the coming months.

The notebooks have been tested on an Ubuntu machine running Python 3.6.

Currently, we are using TF1.x. We will migrate to TF2.x in the coming months.  

✨ A newer version compatible with Ubuntu 23 is being added in the pnlp-refactor-ubuntu23 branch.

We thank everyone, especially the educators & universities, for their feedback in pointing out the issues & improving the accessibility of the notebooks.


🚩 Details of the repository roadmap can be found here


Open the repository in Google Colab: Open In Colab

Open the repository in Jupyter nbviewer: Open in nbviewer

Chapterwise folders:

Contributors to Codebase:

Found a Bug?

If the bug is in the book text, please submit an erratum here: https://www.oreilly.com/catalog/errata.csp?isbn=0636920262329

If the bug is in the codebase: Great! Please read this guide on how you can help us improve the codebase.

practical-nlp-code's People

Contributors

anujgupta82, dependabot[bot], devanshu125, faisito, jatinpapreja, kartikay-bagla, kumar-apurva, majumderb, mbodhisattwa, nishkalavallabhi, samyak2, shawn-jung, sony-au, sukeeratsg, suranah, varunp2k


practical-nlp-code's Issues

Fix notebook dying due to inactivity on Google Colab (ch 4, nb 7)

The notebook dies due to inactivity on Google Colab when running cells that take 30 minutes or more. Suspected causes include an unstable internet connection on the user's side or an issue on Google Colab's end.

Screenshot: progress stops and the notebook needs to be rerun.

[BUG] Mistakes in Ch4/01_OnePipeline_ManyClassifiers.ipynb

In: Ch4/01_OnePipeline_ManyClassifiers.ipynb

1. In cell 14, related to SVM: there is a copy/paste error in the comment, which refers to logistic regression even though the code in that cell concerns only the SVM.

2. The AUC metric reported for the SVM does calculate something, but it is the AUC for the prior cell (logistic regression), as can be seen from the duplicated values.
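
A possible correction for the SVM cell, as a sketch (the names svm_clf, X_test_dtm, and y_test are assumptions, not necessarily the notebook's exact variable names): compute the AUC from the SVM's own decision scores instead of reusing the logistic-regression values.

    # Compute the ROC AUC from the SVM's own scores (LinearSVC/SVC expose
    # decision_function rather than predict_proba by default). Assumes a binary
    # target; encode the labels numerically first if they are strings.
    from sklearn.metrics import roc_auc_score

    svm_scores = svm_clf.decision_function(X_test_dtm)
    print("SVM ROC AUC:", roc_auc_score(y_test, svm_scores))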

Stanford Core NLP on Google Colab

This template is ONLY to be used for reporting bugs in the code.

ISSUE: Unable to connect to StanfordCoreNLPServer on Google Colab

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch9/01_Aspect_Based_Sentiment_analysis.ipynb

Current Behavior

unable to connect to StanfordCoreNLPServer

Expected Behavior

connected to StanfordCoreNLPServer

Possible Solution

Stanza can be used, but it gives slightly different results (sentiment analysis on words instead of sentences).

Steps to Reproduce

  1. download and install java:
    import os
    !apt-get install openjdk-8-jdk-headless -qq > /dev/null
    os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
  2. download stanford-corenlp:
    !wget "https://nlp.stanford.edu/software/stanford-corenlp-latest.zip"
    !unzip "stanford-corenlp-latest.zip"
  3. change directory:
    cd stanford-corenlp-4.2.0/
  4. connect to server:
    !java -mx5g -cp "./*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000 -timeout 10000
    if using port 9000 (even though nothing else is using this port):
    [main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
    [main] INFO CoreNLP - Server default properties:
    (Note: unspecified annotator properties are English defaults)
    inputFormat = text
    outputFormat = json
    prettyPrint = false
    [main] INFO CoreNLP - Threads: 2
    [main] INFO CoreNLP - Starting server...
    [main] WARN CoreNLP - java.net.BindException: Address already in use
    java.base/sun.nio.ch.Net.bind0(Native Method)
    java.base/sun.nio.ch.Net.bind(Net.java:455)
    java.base/sun.nio.ch.Net.bind(Net.java:447)
    java.base/sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:227)
    java.base/sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:80)
    jdk.httpserver/sun.net.httpserver.ServerImpl.<init>(ServerImpl.java:101)
    jdk.httpserver/sun.net.httpserver.HttpServerImpl.<init>(HttpServerImpl.java:50)
    jdk.httpserver/sun.net.httpserver.DefaultHttpServerProvider.createHttpServer(DefaultHttpServerProvider.java:35)
    jdk.httpserver/com.sun.net.httpserver.HttpServer.create(HttpServer.java:137)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.run(StanfordCoreNLPServer.java:1527)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.launchServer(StanfordCoreNLPServer.java:1624)
    edu.stanford.nlp.pipeline.StanfordCoreNLPServer.main(StanfordCoreNLPServer.java:1631)
    [Thread-0] INFO CoreNLP - CoreNLP Server is shutting down.

if using port 9001 (it freezes at this point):
[main] INFO CoreNLP - --- StanfordCoreNLPServer#main() called ---
[main] INFO CoreNLP - Server default properties:
(Note: unspecified annotator properties are English defaults)
inputFormat = text
outputFormat = json
prettyPrint = false
[main] INFO CoreNLP - Threads: 2
[main] INFO CoreNLP - Starting server...
[main] INFO CoreNLP - StanfordCoreNLPServer listening at /0.0.0.0:9001

Context (Environment)

running the notebook on colab (on Windows 10)
python 3.6.9
pycorenlp-0.3.0

Possible Implementation

this is an option: https://colab.research.google.com/github/stanfordnlp/stanza/blob/master/demo/Stanza_CoreNLP_Interface.ipynb
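
A minimal sketch of that Stanza-based alternative: let Stanza's CoreNLPClient launch and manage the CoreNLP server itself instead of starting it manually with java. It assumes CoreNLP was unzipped as in steps 2-3 above; the port, memory setting, and example sentence are arbitrary choices.

    import os
    from stanza.server import CoreNLPClient

    # Point Stanza at the unzipped CoreNLP distribution from step 3.
    os.environ["CORENLP_HOME"] = "./stanford-corenlp-4.2.0"

    with CoreNLPClient(annotators=["tokenize", "ssplit", "pos", "parse", "sentiment"],
                       memory="5G", endpoint="http://localhost:9001",
                       timeout=30000, be_quiet=True) as client:
        ann = client.annotate("The food was great but the service was slow.")
        for sentence in ann.sentence:
            # Sentence-level sentiment labels from the CoreNLP sentiment annotator.
            print(sentence.sentiment)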

Jupyter Notebook kernel died

Hi,

I was wondering if anyone knows why the Jupyter Notebook kernel kept dying when I ran the code below on Ubuntu:

    # model_type: word2vec, glove or fasttext
    aug = naw.WordEmbsAug(
        model_type='word2vec', model_path='GoogleNews-vectors-negative300.bin',
        action="insert")

I was wondering whether it's because GoogleNews-vectors-negative300.bin is so large that the kernel runs out of memory?

Thank you.


Fix Dependency Issues and Update code with latest versions of python libraries for Notebooks of Chapter 5

I'm working on updating the code for Chapter 5 to ensure compatibility with newer versions of Python and updated libraries. This issue will track the progress and changes made for these notebooks.

Notebooks:

  • 01_KPE.ipynb
  • 02_NERTraining.ipynb
  • 03_NERIssues.ipynb
  • 04_NER_using_spaCy - CoNLL.ipynb
  • 05_BERT_CONLL_NER.ipynb
  • 06_EntityLinking-AzureTextAnalytics.ipynb
  • 07_REWatson.ipynb
  • 08_Duckling.ipynb

Tasks:

  • Update Python version from 3.6 to 3.9
  • Update Ubuntu version to 22.04
  • Update all Python libraries to their latest versions
  • Test the notebook to ensure it runs without errors

Context:

  • Current Python Version: 3.6
  • Current Ubuntu Version: 16.04
  • Current Library Versions:

Proposed Changes:

  • Update notebooks to run with Ubuntu 22.04
  • Update notebooks to run with Python 3.10
  • Libraries in each notebook will be updated to the latest version
  • Markdown Text will be improved

Related Issues:

  • Will list them here

Let's work together to update these notebooks effectively and ensure they align with the latest Python and library versions.

[BUG]

This template is ONLY to be used for reporting bugs in the code.

The flat_classification_report function is not working (in the train_seq function); there is a strange problem with it.

Location

https://github.com/practical-nlp/practical-nlp-code/blob/master/Ch5/02_NERTraining.ipynb

Current Behavior

Training a Sequence classification model with CRF
0.9369945384072719

---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

[<ipython-input-21-5212b1f7685b>](https://localhost:8080/#) in <module>()
     15 
     16 if __name__=="__main__":
---> 17     main()

3 frames

[/usr/local/lib/python3.7/dist-packages/sklearn_crfsuite/metrics.py](https://localhost:8080/#) in flat_classification_report(y_true, y_pred, labels, **kwargs)
     66     """
     67     from sklearn import metrics
---> 68     return metrics.classification_report(y_true, y_pred, labels, **kwargs)
     69 
     70 

TypeError: classification_report() takes 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given

Expected Behavior

It should have worked as it does in your notebook.

Possible Solution

I think the labels argument is now being handled differently, or the function's signature changed upstream.
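
A possible workaround, as a sketch (assuming y_true and y_pred are the per-sentence label sequences produced in the notebook): flatten the sequences and call scikit-learn's classification_report directly, passing labels as a keyword argument, which newer scikit-learn versions require.

    from itertools import chain
    from sklearn.metrics import classification_report

    # Flatten the per-sentence tag sequences into flat lists of tags.
    flat_true = list(chain.from_iterable(y_true))
    flat_pred = list(chain.from_iterable(y_pred))

    # `labels` must now be passed as a keyword argument, not positionally.
    print(classification_report(flat_true, flat_pred, labels=sorted(set(flat_true))))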

🚩 Roadmap & Milestones for 2020

🚩 Overview

Based on the feedback received from readers, we have decided to focus on a few things to make this repo as accessible and useful as possible. This issue contains a high-level view of the updates planned for 2020. It will continue to be updated based on reader feedback.

๐Ÿ›ฃ๏ธ Roadmap

We think the following features, enhancements & fixes are most relevant for 2020:

๐Ÿ–ฅ๏ธ Platform Accessibility

  • Easy data download for Windows
  • Python & dependencies issues resolved for Windows
  • Full support for running on Colab

📚 Code & Pedagogical Improvements

  • RASA chatbot notebook
  • Input sample & stats demonstrated for all notebooks
  • Output sample demonstrated for all notebooks
  • Output visualization demonstrated where possible
  • Color figures are needed for print buyers; they will be made available for all.

🚘 Milestones

The roadmap is broken into the following milestones.

🚧 Version 1.5

  • All platform accessibility-related issues to be resolved
  • The first draft of pedagogical enhancements

๐Ÿ Version 1.9

  • Input & output to be demonstrated well for all notebooks
  • Other user requests handled

08_Duckling.ipynb - TypeError

Hi, I've been trying to run the below code on Jupyter Notebook and Colab, but both threw the same error:


TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>()
----> 1 pprint(d.parse_time(u"Let's meet at 11:45am"))
      2 #pprint(d.parse_time(u'You owe me twenty bucks, please call me today'))

4 frames

/usr/local/lib/python3.6/dist-packages/jpype/_jstring.py in __getitem__(self, i)
     46
     47     def __getitem__(self, i):
---> 48         if i < 0:
     49             i += len(self)
     50             if i < 0:

TypeError: '<' not supported between instances of 'slice' and 'int'

I tried duckling 1.8.0, and 1.7.1, without success. I was wondering if anyone could point me in the right direction? Thank you.

Regards,
Sony

Refactor the repo for compatibility with latest ubuntu & ML libraries

In response to feedback from students and instructors actively using this repo in formal coursework, we recognize the need to address compatibility issues with the latest OS and libraries. We're initiating a refactor to ensure seamless use in classrooms & by self-learners. This is part of our continuous effort to keep the codebase relevant and functional.

Until the new branch is stable, all changes related to this refactor will be available in the branch pnlp-refactor-ubuntu23.

[BUG] Wikimedia link not working

This template is ONLY to be used for reporting bugs in the code.

Location

Chapter 3 Notebook 6

Current Behavior

Link not working
https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2

--2021-03-31 18:04:43--  https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles-multistream14.xml-p6197595p7697594.bz2
Resolving dumps.wikimedia.org (dumps.wikimedia.org)... 208.80.154.7, 2620:0:861:1:208:80:154:7
Connecting to dumps.wikimedia.org (dumps.wikimedia.org)|208.80.154.7|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-03-31 18:04:43 ERROR 404: Not Found.

Expected Behavior

Possible Solution

Steps to Reproduce

1. Run the notebook

Context (Environment)

google colab

Possible Implementation

Data not available for Doc2Vec (Chapter 4)

Hello Folks,

I was trying to replicate the code in Google Colab for Chapter 4. While running 02_Doc2Vec_Example.ipynb, I was not able to get the data from Kaggle as it had been removed. Then I referred to a previous issue and found that the data was in the Data folder.

While using that data, I found it is different from what is indicated in the notebook.

Can you please indicate which data we should use, as the accuracy from this data is pretty low compared to what is showcased in the notebook?

Following is the output:


#Load the dataset and explore.
filepath = "data/train_data.csv"
df = pd.read_csv(filepath)
print(df.shape)
df.head()
sentiment content
empty @tiffanylue i know i was listenin to bad habi...
sadness Layin n bed with a headache ughhhh...waitin o...
sadness Funeral ceremony...gloomy friday...
enthusiasm wants to hang out with friends SOON!
neutral @dannycastillo We want to trade with someone w...
df['sentiment'].value_counts()
category value
worry 7433
neutral 6340
sadness 4828
happiness 2986
love 2068
surprise 1613
hate 1187
fun 1088
relief 1021
empty 659
enthusiasm 522
boredom 157
anger 98
Name: sentiment, dtype: int64
#Let us take the top 3 categories and leave out the rest.
shortlist = ['neutral', "happiness", "worry"]
df_subset = df[df['sentiment'].isin(shortlist)]
df_subset.shape

(16759, 2)

preds = myclass.predict(test_vectors)
from sklearn.metrics import classification_report, confusion_matrix
print(classification_report(test_cats, preds))
              precision    recall  f1-score   support

   happiness       0.34      0.54      0.42       713
     neutral       0.48      0.56      0.52      1595
       worry       0.62      0.40      0.48      1882

    accuracy                           0.48      4190
   macro avg       0.48      0.50      0.47      4190
weighted avg       0.52      0.48      0.49      4190

Issue with code sample in book from chapter 3 "PRE-TRAINED WORD EMBEDDINGS"

Hi - apologies if this is the wrong place to report this, but I have been reading the online version of this book, and when I try to run the following code sample from chapter 3 with the path to the model updated:

from gensim.models import Word2Vec, KeyedVectors
pretrainedpath = "NLPBookTut/GoogleNews-vectors-negative300.bin"
w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
print('done loading Word2Vec')
print(len(w2v_model.vocab)) #Number of words in the vocabulary.
print(w2v_model.most_similar['beautiful'])
W2v_model['beautiful']

It fails with the following:

$ python3 word2vec.py                                                                                                                                                        
done loading Word2Vec
Traceback (most recent call last):
  File "word2vec.py", line 5, in <module>
    print(len(w2v_model.vocab)) #Number of words in the vocabulary.
  File "/home/sits/.local/lib/python3.8/site-packages/gensim/models/keyedvectors.py", line 645, in vocab
    raise AttributeError(
AttributeError: The vocab attribute was removed from KeyedVector in Gensim 4.0.0.
Use KeyedVector's .key_to_index dict, .index_to_key list, and methods .get_vecattr(key, attr) and .set_vecattr(key, attr, new_val) instead.
See https://github.com/RaRe-Technologies/gensim/wiki/Migrating-from-Gensim-3.x-to-4

I can see the code for Ch3 has been changed to take this into account, e.g., removing the len() call and using code like:

print(w2v_model.most_similar('beautiful'))

Can the online book be updated with the correct code?
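
For reference, a minimal Gensim 4.x sketch of the same example (assuming the GoogleNews vectors are available at the path used above):

    from gensim.models import KeyedVectors

    pretrainedpath = "NLPBookTut/GoogleNews-vectors-negative300.bin"
    w2v_model = KeyedVectors.load_word2vec_format(pretrainedpath, binary=True)
    print('done loading Word2Vec')
    print(len(w2v_model.key_to_index))          # vocabulary size in Gensim 4.x
    print(w2v_model.most_similar('beautiful'))  # most_similar is a method, not a dict
    print(w2v_model['beautiful'])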

Chapter 3 TF - IDF

I was going through your book to learn NLP. A great effort in writing such an amazing book.

I found that the example for TF-IDF is a bit unclear in the way TF scores are calculated for the toy corpus.

Ex: dog = 1/3 and bites = 1/6; I am unable to understand how these ratios are arrived at. Please include that in the example description if possible.

macOS 'Buggy Accelerate Backend when using numpy 1.19'

Error Message:

Polyfit sanity test emitted a warning, most likely due to using a buggy Accelerate backend. If you compiled yourself, see site.cfg.example for information. Otherwise report this to the vendor that provided NumPy.

As per the linked issue in numpy, I wanted to raise and document the solution here (or link it to a TROUBLESHOOTING.md) for others trying to use the source code for this book.

numpy/numpy#15947

Current Behavior

git clone https://github.com/practical-nlp/practical-nlp
cd practical-nlp
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install --upgrade pip setuptools wheel

# This step is now problematic on macOS
python3 -m pip install -r requirements.txt

This will spit errors in scikit-image and gensim about buggy Accelerate backend in numpy 1.19.

Possible Solution

There are two proposed solutions:

  1. Fallback to numpy==1.18.0
  2. Install openblas

Combining both:

brew install openblas
OPENBLAS="$(brew --prefix openblas)" python3 -m pip install numpy==1.18.0

Updating the pinned version in requirements.txt to numpy==1.18.0 also allows this option:

brew install openblas
OPENBLAS="$(brew --prefix openblas)" python3 -m pip install -r requirements.txt

Context

Why is this a problem?

BLAS stands for Basic Linear Algebra Subprograms. The core math operations in numpy that actually make it fast can lean on platform-specific implementations of BLAS.

Shipping compiled programs that support every combination of CPU architecture, OS platform, and OS version is a dauntingly difficult task.

So numpy lets the OS provide its own BLAS backend and delegates this responsibility. What makes this harder is that GPU-accelerated BLAS variants blow out the number of combinations, making them unwieldy for numpy to support on its own.

On macOS, the Accelerate framework is meant to provide that native BLAS support.

numpy, being a good citizen of the scientific community, occasionally needs to flag certain versions of these frameworks as bad builds because they produce incorrect results.

That is why we need to swap it out for an alternative in this situation. It isn't the end of the world, just a sharp edge to be aware of in this space.

Conclusion

Hopefully future versions of numpy and Accelerate play nicely again and this stops being an issue. For now, the error shows up because a calculation known to trigger the fault in these backends runs during install.

terminal command lines not working on Windows 10 Anaconda

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch3/05_Pre_Trained_Word_Embeddings.ipynb
https://github.com/practical-nlp/practical-nlp/blob/master/Ch4/01_OnePipeline_ManyClassifiers.ipynb

Current Behavior

'wget' is not recognized as an internal or external command, operable program or batch file.
'apt-get' is not recognized as an internal or external command, operable program or batch file.

Expected Behavior

Possible Solution

Steps to Reproduce

  1. !pip install wget
  2. Run the code; the error appears.
  3. !pip install apt-get, then "ERROR: Could not find a version that satisfies the requirement apt-get (from versions: none)
    ERROR: No matching distribution found for apt-get"
  4. Run the code, then: "'apt-get' is not recognized as an internal or external command, operable program or batch file."
  5. More detailed explanation in the original message below.

Context (Environment)

running on Jupyter Notebook
Python 3.8.3
Windows 10 64, Conda, both base and virtual env (python=3.6)
Clean installed Windows and Anaconda.

Possible Implementation

Found the template and added the above. Below is the original message as written.
FYI, I am not a coder or a data scientist, so the error could be due to my lack of knowledge.

Hello,

I have been enjoying this book since the moment I bought this last week and trying to walk through the commands on Jupyter.

I am on Windows 10 64-bit, not Mac, and am having trouble executing lines starting with !. They do not work on my desktop (e.g., Ch. 3-5 wget, Ch. 4-1 apt-get). I first googled and saw that ! does not work on Windows, but others say bash commands are now okay. So I installed wget-3.2 (!pip install wget) in Jupyter but am still unable to execute it. I get the following message: "'wget' is not recognized as an internal or external command, operable program or batch file." I even installed a separate wget program on Windows because others on the internet say it has to be added to the PATH variable.
Also, I could not even install apt-get (using !pip install apt-get); it fails with the message "ERROR: Could not find a version that satisfies the requirement apt-get (from versions: none) ERROR: No matching distribution found for apt-get".

Could you help me with these and possibly other terminal commands in the following chapters? Given the growing popularity of NLP and the excellent quality of this book, I am sure more people on Windows 10 will walk through the commands here and might face the same issues.

P.S. I also think the TF-IDF table in Ch. 3 is hard to understand based on the definition you provided.

The calculation of TF-IDF scores in Chap 3 is unclear

Hi, I came across this book recently and really love it. However, I found that the calculation of the TF-IDF scores in Chapter 3 is kind of confusing.

More specifically, in Table 3-2, each word has a TF score. But since we have 4 sentences (treated as 4 docs in the example), I think we should have one TF score for each word in each sentence?

Accordingly, the vector representation for D1 should be: [1/3*IDF(dog), 1/3*IDF(bites), 1/3*IDF(man), 0, 0, 0], since each word in D1 appears only once and TF(w, D1) should be equal to 1/3 for all of them.
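
A small sketch of the standard TF-IDF computation being described, using what is assumed to be the book's toy corpus; the exact IDF formula (log base, smoothing) may differ from the one used in the book.

    import math

    corpus = ["dog bites man", "man bites dog", "dog eats meat", "man eats food"]
    docs = [d.split() for d in corpus]
    vocab = sorted({w for d in docs for w in d})

    def tf(term, doc):
        # term frequency relative to the document's length
        return doc.count(term) / len(doc)

    def idf(term):
        # inverse document frequency over the 4-document corpus
        df = sum(term in doc for doc in docs)
        return math.log(len(docs) / df)

    d1 = docs[0]  # "dog bites man": TF is 1/3 for each of its words
    print({w: round(tf(w, d1) * idf(w), 3) for w in vocab})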

Color figures & outline

Print readers of the book have found some images to be not legible in black & white. As a result, we decided to update the repo with color figures & outline for all chapters. This should aid our readers.

Address LexNLP library issue: installing a newer version of LexNLP gives an error (Ch 10, nb 2)

I am trying to install a newer version of LexNLP. During installation of LexNLP's dependencies, it gives an error:

ERROR: Failed building wheel for scikit-learn
and
ERROR: Could not build wheels for scikit-learn, which is required to install pyproject.toml-based projects

The issue has been reported earlier this year on the original repository here.

Screenshot of the error

Fix Dependency Issues and Update code with latest versions of python libraries for Notebooks of Chapter 4

I'm working on updating the code for Chapter 4 to ensure compatibility with newer versions of Python and updated libraries. This issue will track the progress and changes made for these notebooks.

Notebooks:

  • 01_OnePipeline_ManyClassifiers.ipynb
  • 02_Doc2Vec_Example.ipynb
  • 03_Word2Vec_Example.ipynb
  • 04_FastText_Example.ipynb
  • 05_DeepNN_Example.ipynb
  • 06_BERT_IMDB_Sentiment_Classification.ipynb
  • 07_BERT_Sentiment_Classification_IMDB_ktrain.ipynb
  • #146
  • #147
  • 10_ShapDemo.ipynb
  • 11_SpamClassification.ipynb

Tasks:

  • Update Python version from 3.6 to 3.9
  • Update Ubuntu version to 22.04
  • Update all Python libraries to their latest versions
  • Test the notebook to ensure it runs without errors

Context:

  • Current Python Version: 3.6
  • Current Ubuntu Version: 16.04
  • Current Library Versions:

Proposed Changes:

  • Update notebooks to run with Ubuntu 22.04
  • Update notebooks to run with Python 3.10
  • Libraries in each notebook will be updated to the latest version
  • Markdown Text will be improved

Related Issues:

  • Will list them here

Let's work together to update these notebooks effectively and ensure they align with the latest Python and library versions.

Fix data download issues for Windows

Data downloads for the training & test datasets are currently geared towards UNIX operating systems. We need to bring that up to par for Windows as well and ensure this is done for each chapter's notebooks (a platform-independent sketch follows the list below):

  • Ch 2
  • Ch 3
  • Ch 4
  • Ch 5
  • Ch 6
  • Ch 7
  • Ch 8
  • Ch 9
  • Ch 10
  • Ch 11
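
As a sketch of what a platform-independent download cell could look like: fetch and unpack the archive from Python instead of shelling out to wget/tar, so the same cell also works on Windows. The URL is only an illustration, reusing the DBpedia archive referenced elsewhere in this repo.

    import tarfile
    import urllib.request

    # Download the archive without relying on wget being installed.
    url = "https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz"
    urllib.request.urlretrieve(url, "dbpedia_csv.tar.gz")

    # Extract it without relying on a tar binary.
    with tarfile.open("dbpedia_csv.tar.gz") as tar:
        tar.extractall()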

Ch7 Book Summary Dataset Missing

Thank you for your reply Harshit. Good to hear that the book has been adopted at several colleges.
One way to get issues around is WSL. I was able to avoid most of the issues by installing WSL on my Windows and Anaconda on WSL. Your scripts work way better on Linux thanks to the similarity between Mac Terminal and Linux Bash.
Still, there are quite a few issues like dependency. Many modules are imported without being first installed. It would be no problem on your Mac, but first-time nlp reader will not be able to run the scripts. Don't know exactly why but it is harder to use terminal command in Jupyter on Windows (even with admin status), so I had to get out of Jupyter and installed modules on Terminal and got back. Also, there are issues of 'pip install' and 'anaconda install'. I suggest you clean install Anaconda or clean Python and run the codes in Jupyter notebook on Windows,

Another issue I found is about topic modeling in Ch. 7. The data used for the demonstration is loaded from a hard-coded local path: data_path = "/home/etherealenvy/Downloads/booksummaries/booksummaries.txt". There is no way for others to follow the notebook.

Originally posted by @knslee07 in #13 (comment)
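
A possible way around the hard-coded path, as a sketch: download the CMU Book Summary corpus and point data_path at the extracted file. Treat the URL as an assumption about where the dataset is currently hosted.

    import tarfile
    import urllib.request

    # Fetch and unpack the CMU Book Summary corpus.
    url = "http://www.cs.cmu.edu/~dbamman/booksummaries/booksummaries.tar.gz"
    urllib.request.urlretrieve(url, "booksummaries.tar.gz")

    with tarfile.open("booksummaries.tar.gz") as tar:
        tar.extractall()

    # Path inside the extracted archive, replacing the author's local path.
    data_path = "booksummaries/booksummaries.txt"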

[BUG] spelling_en.txt link not working

This template is ONLY to be used for reporting bugs in the code.

Location

chapter 2 notebook 5

Current Behavior

The link (https://github.com/makcedward/nlpaug/blob/master/model/spelling_en.txt) is not working:

Resolving github.com (github.com)... 52.69.186.44
Connecting to github.com (github.com)|52.69.186.44|:443... connected.
HTTP request sent, awaiting response... 404 Not Found
2021-03-23 16:21:46 ERROR 404: Not Found.

Expected Behavior

Possible Solution

Steps to Reproduce

  1. run the notebook

Context (Environment)

using colab

Possible Implementation

Chapter 3 one-hot encoding

  1. I found a small typo in the one-hot representation.

For the toy corpus, the textbook gives the D4 representation as [ [0 0 1 0 0] [0 0 0 0 1 0] [0 0 0 0 0 1] ].

The one-hot representation for the first word, "man", is only of length 5; one zero is missing from that one-hot vector.

  2. In the CBOW method for learning embeddings:

One of the sentences in the text is: "Given a sentence of, say, m words, it assigns a probability Pr(w1, w2, …, wn) to the whole sentence."

Shouldn't the probability be over the words w1, w2, …, wm instead of wn?

  3. The dimensions used for CBOW in the explanation are (V, d), whereas in the CBOW figure (3.9) the dimensions used are (V, N).

practical-nlp/Ch4/04_FastText_Example.ipynb Link Formatting

In the first cell, the link to the DBpedia dataset has a # in front of the URL, making it not open when clicked.

original:
[here](#https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz)

corrected:
[here](https://github.com/le-scientifique/torchDatasets/raw/master/dbpedia_csv.tar.gz)

[BUG] for Ch4/08_LimeDemo.ipynb

Problem

In Part 2: Using Lime to interpret predictions,

mystring = list(X_test)[221]
print(c.predict_proba([mystring]))

there is an extra list and it gives this error:

AttributeError: 'list' object has no attribute 'replace'

and this error refers to this line of the code in the clean function:
doc = doc.replace("</br>", " ")

Solution:

This problem can easily be solved by removing the list() call and changing the code to this:

mystring = X_test[221]
print(c.predict_proba([mystring]))

practical-nlp/Ch4/03_Word2Vec_Example.ipynb

In the above Jupyter notebook, in the function:

# Creating a feature vector by averaging all embeddings for all sentences
def embedding_feats(list_of_lists):
    DIMENSION = 300
    zero_vector = np.zeros(DIMENSION)
    feats = []
    for tokens in list_of_lists:
        feat_for_this =  np.zeros(DIMENSION)
        count_for_this = 0
        for token in tokens:
            if token in w2v_model:
                feat_for_this += w2v_model[token]
                count_for_this += 1
        feats.append(feat_for_this)
    return feats

The feature vectors are not averaged; the sum of the word embeddings is appended directly as the feature, whereas the function description indicates that the vectors will be averaged.
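
A possible fix matching the docstring, as a sketch (it assumes np and w2v_model are already defined as in the notebook): average the embeddings instead of summing them, keeping the zero vector when none of the tokens are in the vocabulary.

    # Creating a feature vector by averaging all embeddings for all sentences
    def embedding_feats(list_of_lists):
        DIMENSION = 300
        feats = []
        for tokens in list_of_lists:
            feat_for_this = np.zeros(DIMENSION)
            count_for_this = 0
            for token in tokens:
                if token in w2v_model:
                    feat_for_this += w2v_model[token]
                    count_for_this += 1
            if count_for_this > 0:
                feat_for_this /= count_for_this  # average rather than sum
            feats.append(feat_for_this)
        return feats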

Shap Demo not working on tensorflow >= 2.0.0

ISSUE: TypeError: 'NoneType' object cannot be interpreted as an integer, when running shap on LSTM model

Location

https://github.com/practical-nlp/practical-nlp/blob/master/Ch4/10_ShapDemo.ipynb

Current Behavior

running the code below:
shap_values = explainer.shap_values(x_val[:5])

gives you the below error:
TypeError                                 Traceback (most recent call last)

<ipython-input> in <module>()
      1 # explain the first 10 predictions
      2 # explaining each prediction requires 2 * background dataset size runs
----> 3 shap_values = explainer.shap_values(x_val[:5])
      4
      5 import numpy as np

4 frames

/usr/local/lib/python3.6/dist-packages/shap/explainers/deep/deep_tf.py in anon()
352 shape = list(self.model_inputs[i].shape)
353 shape[0] = -1
--> 354 data = X[i].reshape(shape)
355 v = tf.constant(data, dtype=self.model_inputs[i].dtype)
356 inputs.append(v)

TypeError: 'NoneType' object cannot be interpreted as an integer

Expected Behavior

according to your page, we should get the below output:
Downloading data from https://s3.amazonaws.com/text-datasets/imdb_word_index.json
1646592/1641221 [==============================] - 0s 0us/step

Possible Solution

It's likely due to shap not working properly on TensorFlow 2.0.0 and above; but if we use TensorFlow 1.x, then we get a message that Keras does not work with TF 1.x.

Steps to Reproduce

Steps are the same as in the provided notebook, but I added the lines below to install TF 2.3.0:
%tensorflow_version 2.x
!pip uninstall -y tensorflow
!pip install tensorflow==2.3.0

Context (Environment)

running the notebook on colab (on Windows 10)
python 3.6.9
shap 0.35.0 (also tried the latest version)
tf 2.3.0 (also tried 2.4.0 and 1.14)

Possible Implementation

no working solutions identified yet

NOTE:

Also, in the code below, batch_size and max_features were defined but not used (though using them doesn't make any difference):

batch_size = 64
max_features = vocab_size + 1

# Training an LSTM with embedding on the fly

print("Defining and training an LSTM model, training embedding layer on the fly")
# modified from:
rnnmodel = Sequential()
rnnmodel.add(Embedding(MAX_NUM_WORDS, 128))
rnnmodel.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
rnnmodel.add(Dense(2, activation='sigmoid'))
rnnmodel.compile(loss='binary_crossentropy',
                 optimizer='adam',
                 metrics=['accuracy'])
print('Training the RNN')
rnnmodel.fit(x_train, y_train,
             batch_size=32,
             epochs=2,
             validation_data=(x_val, y_val))
score, acc = rnnmodel.evaluate(test_data, test_labels,
                               batch_size=32)
print('Test accuracy with RNN:', acc)

Update for current versions of spaCy, gensim, etc.?

Great book! I read it cover to cover AND tried to run nearly all of the code, which I almost never do.

Could the code in this repo for spaCy and gensim be updated to the current versions? As one example, Ch5/01_KPE.ipynb does not run with the current version of spaCy. I am just learning but I assume changes might include, for Ch5/01_KPE.ipynb,

Book version:

!pip install textacy==0.9.1
!pip install spacy==2.2.4

import spacy
import textacy.ke
from textacy import *

print(f'Using textacy {textacy.__version__} and spaCy {spacy.__version__}')

# Worked with 2.2.4:
textacy.ke.textrank(doc, topn=10)

# Worked with 2.2.4:
print("Textrank output: ", [kps for kps, weights in textacy.ke.textrank(doc, normalize="lemma", topn=5)])

What appears to run okay as of this writing, December 2021, using spaCy 3.2.0...

# Lines of Ch5/01_KPE.ipynb revised for spaCy 3.2.0:
!pip install textacy==0.11.0   # or 0.12.0 but I haven't tried that
!pip install spacy==3.2.0

import spacy
import textacy
from textacy import extract
from textacy.extract import keyterms as kt

print(f'Using textacy {textacy.__version__}')
print(f'Using spaCy {spacy.__version__}')

# Works with 3.2.0:
import spacy
import textacy
from textacy import extract
from textacy.extract import keyterms as kt

print(f'Using textacy {textacy.__version__} and spaCy {spacy.__version__}')

# Works with 3.2.0:
kt.textrank(doc, normalize="lemma", topn=10) # I'm not sure the role of normalize

# Works with 3.2.0:
print("Textrank output: ", [kps for kps, weights in extract.keyterms.textrank(doc, normalize="lemma", topn=5)])

Would be great if someone smarter than me could update the book's spaCy- and gensim-related code to run current versions for 2022...

04_FastText_Example.ipynb - code outdated

Hi, supervised is no longer supported in fasttext. I used the recommended train_supervised, but was unable to get the 97% precision/recall you reported in the book. Below is my code:

Using fastText for feature extraction and training

#from fasttext import supervised
from fasttext import train_supervised
"""fastText expects and training file (csv), a model name as input arguments.
label_prefix refers to the prefix before label string in the dataset.
default is label. In our dataset, it is class.
There are several other parameters which can be seen in:
https://pypi.org/project/fasttext/
"""
#%time model = supervised(train_file, 'temp', label_prefix="class")
%time model = train_supervised(train_file, label_prefix="class", lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50)
results = model.test(test_file)
#print(results.nexamples, results.precision, results.recall)
def print_results(N, p, r):
    print("N\t" + str(N))
    print("P@{}\t{:.3f}".format(1, p))
    print("R@{}\t{:.3f}".format(1, r))

print_results(*results)

below is the result:
CPU times: user 6min 15s, sys: 2.64 s, total: 6min 18s
Wall time: 6min 18s
N 70000
P@1 0.481
R@1 0.481

any comments would be appreciated. thanks
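
For comparison, a sketch against the current fasttext API (the label argument replaces the old label_prefix). train_file and test_file follow the notebook's naming, and the label value is an assumption that must match the prefix actually used in the training file.

    import fasttext

    model = fasttext.train_supervised(
        input=train_file,
        label="class",              # must match the label prefix used in train_file
        lr=1.0, epoch=25, wordNgrams=2, bucket=200000, dim=50)

    n, p, r = model.test(test_file)  # returns (num examples, precision@1, recall@1)
    print("N\t{}".format(n))
    print("P@1\t{:.3f}".format(p))
    print("R@1\t{:.3f}".format(r))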

Question on a new version of the book

Hi,

Thanks for making available the companion code for the book on Practical Natural Language Processing.

Since you are planning to move the code from tf 1 to tf 2, are you also planning to print a new version of the companion book as well?

Many thanks,

Ivan
