hse-aml / natural-language-processing
Resources for the "Natural Language Processing" Coursera course.
Home Page: https://www.coursera.org/learn/language-processing
In the Week 1 module, during the multiclass training step, scikit-learn raises this kind of exception:
"Scikit Learn Multilabel Classification: ValueError: You appear to be using a legacy multi-label data representation..."
So I've found out that we should use MultiLabelBinarizer to preprocess the labels; done.
But when we evaluate the "val" dataset on the trained classifiers, a variable "mlb" is referenced that was never instantiated. I assume it refers to a MultiLabelBinarizer instance. As you can see, there is an inconsistency here that currently has to be fixed manually.
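A minimal sketch of the fix, assuming tag lists like the assignment's y_train/y_val (the tag values below are illustrative): fit the binarizer once and reuse the same `mlb` instance when evaluating on the validation set, which resolves the uninstantiated-variable inconsistency.

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Illustrative tag lists standing in for y_train / y_val:
y_train = [['python', 'pandas'], ['c++']]
y_val = [['python']]

# Fix the class order up front so train and val share one encoding:
mlb = MultiLabelBinarizer(classes=sorted({t for tags in y_train + y_val for t in tags}))
y_train_bin = mlb.fit_transform(y_train)
y_val_bin = mlb.transform(y_val)  # reuse the same `mlb` instance here

print(mlb.classes_)   # ['c++' 'pandas' 'python']
print(y_train_bin)
```

The key point is that the object fitted on the training tags is the same one referenced later; re-creating it for "val" would silently change the column order.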
Can you please add week 5 support?
You need to sign into a Google account first, or you won't see the GitHub tab mentioned in the README instructions.
Hello
tf.nn.rnn_cell.BasicLSTMCell is deprecated and has been replaced by tf.nn.rnn_cell.LSTMCell.
There is also a note about optimization on GPU:
"Note that this cell is not optimized for performance. Please use tf.contrib.cudnn_rnn.CudnnLSTM for better performance on GPU"
Hi,
I'm having issues getting access working with Docker and would really appreciate some straightforward advice, as I have never used Docker before. My installation is the Docker Toolbox version on Windows 10 (not Pro).
I have reached the point where I have the Docker Quickstart Terminal and a Jupyter Notebook session running. Note that the Docker tutorial on GitHub fails at this point:
David@DESKTOP-TLE6KHC MINGW64 /c/Program Files/Docker Toolbox
$ docker run -it -p 8080:8080 --name coursera-aml-nlp --user root -v /C:/Users/David/natural-language-processing-master
/week3/data:/root/coursera
"docker run" requires at least 1 argument.
See 'docker run --help'.
Usage: docker run [OPTIONS] IMAGE [COMMAND] [ARG...]
Run a command in a new container
So, as I am using the Toolbox, I followed the instructions from Dr Shahin Rostami at:
https://shahinrostami.com/posts/tools/docker/docker-toolbox-windows-7-and-shared-volumes/
Which brings me to the point in the instructions:
Sharing Folders with a Docker Container
To create a Docker container from the jupyter/scipy-notebook image, type the following command and wait for it to complete execution: docker run --name="scipy" --user root -v /h/work:/home/jovyan -d -e GRANT_SUDO=yes -p 8888:8888 jupyter/scipy-notebook start-notebook.sh --NotebookApp.token=''
This may take some time, as it will need to download and extract the image. Once it's finished, you should be able to access the Jupyter notebook using 127.0.0.1:8888. I hope this helps you get up and running with Docker Toolbox and shared folders. Of course, the process is typically easier when using the non-legacy Docker solutions.
I'm really not sure what the token or password represent here. Access to the folder is all I'm after, but I don't know what to try next.
Thanks,
David
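For what it's worth, the `"docker run" requires at least 1 argument` error means the command is missing its IMAGE argument: the line ends after the `-v` volume mapping. A hypothetical corrected command, assuming the course image is akashin/coursera-aml-nlp (the image pulled elsewhere in this thread) and that Docker Toolbox expects Windows drive paths in the `/c/...` form:

```shell
# Sketch only: image name and host path are assumptions based on this thread.
docker run -it -p 8080:8080 --name coursera-aml-nlp --user root \
  -v /c/Users/David/natural-language-processing-master/week3/data:/root/coursera \
  akashin/coursera-aml-nlp
```

The trailing image name is the required argument; everything before it is options.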
tf.nn.softmax_cross_entropy_with_logits is deprecated and has been replaced by tf.nn.softmax_cross_entropy_with_logits_v2.
I guess the task description should say the 12th row (index 11) instead of the 11th row (index 10) as the row whose number of non-zero elements has to be determined. Or am I misunderstanding something about indexing CSR matrices?
Currently my code works when I use
row = X_train_mybag[11].toarray()[0]
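A quick sketch of the indexing point, with a stand-in matrix `X` in place of X_train_mybag (names are illustrative): with 0-based indexing, `X[11]` is the 12th row, and CSR rows expose their non-zero count directly via `.nnz` without densifying.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Stand-in for X_train_mybag: identity matrix, one non-zero per row.
X = csr_matrix(np.eye(20))

row = X[11].toarray()[0]   # the dense workaround from this issue (12th row)
print((row > 0).sum())     # count via densifying
print(X[11].nnz)           # same count, straight from the CSR structure
```

Both counts agree; `.nnz` just avoids materializing the dense row.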
If your name contains any unusual characters in Telegram, the bot will crash.
Ready to talk!
An update received.
Traceback (most recent call last):
File "main_bot.py", line 111, in <module>
main()
File "main_bot.py", line 103, in main
print("Update content: {}".format(update))
UnicodeEncodeError: 'ascii' codec can't encode character '\xf8' in position 153: ordinal not in range(128)
Although it adds some computational overhead, adding the following function
def cast_to_utf_8(old_dict):
    """
    Encodes the string content of a dict to UTF-8.

    Parameters
    ----------
    old_dict : dict
        The dict to encode

    Returns
    -------
    new_dict : dict
        The encoded dict
    """
    def walk(node):
        """
        Recursively traverses a node and encodes all strings to UTF-8.

        Parameters
        ----------
        node : dict
            The node to traverse

        Returns
        -------
        node : dict
            The node with its strings encoded to UTF-8
        """
        for key, item in node.items():
            if type(item) == dict:
                walk(item)
            elif type(item) == list:
                for i, elem in enumerate(item):
                    if type(elem) == str:
                        node[key][i] = elem.encode('utf-8')
            elif type(item) == str:
                node[key] = item.encode('utf-8')
        return node

    new_dict = walk(old_dict)
    return new_dict
and calling it like this in main():
if is_unicode(text):
    update = cast_to_utf_8(update)
    print("Update content: {}".format(update))
    bot.send_message(chat_id, bot.get_answer(update["message"]["text"]))
else:
    bot.send_message(chat_id, "Hmm, you are sending some weird characters to me...")
was a remedy for me
Hi,
The module common is not found. I tried on Colab and in the Docker container as well.
Here is the traceback:
`ImportError Traceback (most recent call last)
in ()
1 import sys
2 sys.path.append("..")
----> 3 from common.download_utils import download_week1_resources
4
5 download_week1_resources()
ImportError: No module named 'common'`
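A possible workaround, assuming the problem is only that the repo root (the directory containing common/) is not on sys.path: in Colab that means cloning the repo and appending its root. The snippet below fakes the repo layout with a throwaway package just to show why appending the root directory makes the import work.

```python
import os
import sys
import tempfile

# In Colab the real fix would be (hypothetical commands):
#   !git clone https://github.com/hse-aml/natural-language-processing
#   sys.path.append("natural-language-processing")
# Here we simulate that layout to demonstrate the mechanism:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "common"))
open(os.path.join(root, "common", "__init__.py"), "w").close()
with open(os.path.join(root, "common", "download_utils.py"), "w") as f:
    f.write("def download_week1_resources():\n    return 'ok'\n")

sys.path.append(root)  # once the parent of common/ is importable...
from common.download_utils import download_week1_resources  # ...this works
print(download_week1_resources())
```

The notebooks' own `sys.path.append("..")` only works when the notebook is launched from inside a weekN/ folder, which is not the case on Colab.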
In the function test_my_bag_of_words, answers is defined as a list of lists while it should be just a list.
Original:
answers = [[1, 1, 0, 1]]
Should be:
answers = [1, 1, 0, 1]
since it is compared against the return value of the my_bag_of_words function, which takes a text as input and returns a NumPy array.
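A minimal sketch of the contract this assumes (an illustrative reimplementation of my_bag_of_words with a toy vocabulary, not the course's exact code): the function returns a flat NumPy array, so the expected answer must be a flat list for an element-wise comparison to make sense.

```python
import numpy as np

def my_bag_of_words(text, words_to_index, dict_size):
    """Illustrative bag-of-words: count vector over a fixed vocabulary."""
    result = np.zeros(dict_size)
    for word in text.split():
        if word in words_to_index:
            result[words_to_index[word]] += 1
    return result

words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
vec = my_bag_of_words('hi how are you', words_to_index, 4)
print(vec.tolist())  # [1.0, 1.0, 0.0, 1.0] -- a flat vector, like answers = [1, 1, 0, 1]
```

Comparing that array against the nested `[[1, 1, 0, 1]]` would broadcast to a 2-D comparison instead of the intended element-wise one.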
Hi,
Can we have Colab environment support?
Thanks
I am not able to download the dependencies using Google Colab. Please help me solve this issue.
https://github.com/hse-aml/natural-language-processing/blob/master/week1/lemmatization_demo.ipynb
If you look at cell 5, the string is text = "operates operative operating", but in cell 6, where the Porter stemmer is applied to the string from cell 5, the stored output is unrelated to the input: there are no common characters between the input and output words, most likely because the output was cached from a previously run string: u'feet cat wolv talk'. The same goes for the lemmatized string in cell 7.
The intent of the code is clear, but the stored outputs should be regenerated in the future.
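For reference, a fresh run (assuming NLTK's PorterStemmer, as the notebook uses) stems all three words of the cell-5 string to one common root, which is clearly unrelated to the cached 'feet cat wolv talk' output:

```python
from nltk.stem import PorterStemmer

text = "operates operative operating"
stemmer = PorterStemmer()
stems = [stemmer.stem(w) for w in text.split()]
# All three inflections collapse to the same stem:
print(" ".join(stems))
```

Re-running cells 6 and 7 on the current `text` would regenerate consistent outputs.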
Running the Week 3 notebook on Google Colab (after previously encountering #33), I see
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-7-e70e92d32c6e> in <module>()
----> 1 import gensim
3 frames
/usr/local/lib/python3.6/dist-packages/gensim/models/ldamodel.py in <module>()
49
50 # log(sum(exp(x))) that tries to avoid overflow
---> 51 from scipy.misc import logsumexp
52
53
ImportError: cannot import name 'logsumexp'
---------------------------------------------------------------------------
NOTE: If your import is failing due to a missing package, you can
manually install dependencies using either !pip or !apt.
To view examples of installing some common dependencies, click the
"Open Examples" button below.
---------------------------------------------------------------------------
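The import fails because `logsumexp` was removed from `scipy.misc` in newer SciPy releases; it now lives in `scipy.special`, and recent gensim versions import it from there. Upgrading gensim (e.g. `!pip install --upgrade gensim` in Colab) is one likely fix; the relocated function itself:

```python
import math

from scipy.special import logsumexp  # new home of the old scipy.misc name

# log(sum(exp(x))) computed stably, as gensim's ldamodel uses it:
val = logsumexp([0.0, 0.0])
print(val)  # log(2), about 0.6931
```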
Hello,
This code returns an "Invalid Syntax" error.
`REPLACE_BY_SPACE_RE = re.compile('[/(){}[]|@,;]')
BAD_SYMBOLS_RE = re.compile('[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))
def text_prepare(text):
"""
text: a string
return: modified initial string
"""
text = # lowercase text
text = # replace REPLACE_BY_SPACE_RE symbols by space in text
text = # delete symbols which are in BAD_SYMBOLS_RE from text
text = # delete stopwords from text
return text`
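The SyntaxError is not about the regexes: the placeholder lines like `text = # lowercase text` leave the assignment with no right-hand side, so Python rejects the file until you fill them in. A filled-in sketch (note that the unescaped brackets in the first character class also need escaping; the stopword set below is a small stand-in for NLTK's English stopwords):

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')  # brackets escaped
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = {'a', 'the', 'is'}  # stand-in for set(stopwords.words('english'))

def text_prepare(text):
    """
    text: a string
    return: modified initial string
    """
    text = text.lower()                          # lowercase text
    text = REPLACE_BY_SPACE_RE.sub(' ', text)    # replace listed symbols by space
    text = BAD_SYMBOLS_RE.sub('', text)          # delete bad symbols
    text = ' '.join(w for w in text.split() if w not in STOPWORDS)  # drop stopwords
    return text

print(text_prepare("SQL Server - how?"))  # sql server how
```

Each `text = ...` line must assign an actual expression; the comments alone are what triggers "Invalid Syntax".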
Kindly take a look at the Docker tutorial link provided in the README; it isn't working: https://github.com/hse-aml/natural-language-processing/blob/master/(Docker-tutorial.md)
The following link in the Week 2 assignment notebook appears to be broken:
I believe it should point instead to https://www.tensorflow.org/api_docs/python/tf/compat/v1/placeholder
Hello, this piece of code returns the following error:
print(test_text_prepare())
`---------------------------------------------------------------------------
NameError Traceback (most recent call last)
in ()
----> 1 print(test_text_prepare())
in test_text_prepare()
5 "free c++ memory vectorint arr"]
6 for ex, ans in zip(examples, answers):
----> 7 if text_prepare(ex) != ans:
8 return "Wrong answer for the case: '%s'" % ex
9 return 'Basic tests are passed.'
NameError: name 'text_prepare' is not defined`
Sorry, stupid question, but how do I open week 1 in Colab? Usually for my own files there is always an "Open in Colab" button, but there is none for the week 1 task?
The link to "Clipping" in the perform_optimization function documentation is broken.
I am using Google Colab for the week 3 assignment, and at the end, when I finally submit it, the grader is not recognizing my e-mail ID. Please tell me the solution.
----> 1 STUDENT_EMAIL = [email protected]# EMAIL
2 STUDENT_TOKEN = AT5ZyzLxuQfnEhNg# TOKEN
3 grader.status()
NameError: name 'mandloi19faraday96' is not defined
This is the error i am getting how should i submit my assignment please help.
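The `NameError: name 'mandloi19faraday96' is not defined` suggests the address and token were pasted without quotes, so Python tries to evaluate them as variable names. Both must be string literals (the values below are placeholders, not real credentials):

```python
# Placeholders only -- substitute your own Coursera e-mail and token,
# keeping the quotation marks:
STUDENT_EMAIL = "student@example.com"
STUDENT_TOKEN = "your-token-here"
print(type(STUDENT_EMAIL).__name__)  # str
```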
Would it be possible to migrate it to TF 2.0?
Using the Docker container environment, I am getting a UnicodeDecodeError. More specifically:
prepared_questions = []
for line in open('data/text_prepare_tests.tsv'):
line = text_prepare(line.strip())
prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)
This gives the following error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 79: ordinal not in range(128)
In order to run it I had to change it to:
prepared_questions = []
for line in open('data/text_prepare_tests.tsv', encoding='utf-8'):
line = text_prepare(line.strip())
prepared_questions.append(line)
text_prepare_results = '\n'.join(prepared_questions)
grader.submit_tag('TextPrepare', text_prepare_results)
It can also be solved by using pd.read_csv.
Is this error reproducible for anyone else?
When I open one of the notebooks in Colab, specifically week1-MultilabelClassification.ipynb:
https://colab.research.google.com/github/hse-aml/natural-language-processing/blob/master/week1/week1-MultilabelClassification.ipynb
and try to run it, I get a ModuleNotFoundError on this line:
from common.download_utils import download_week1_resources
I have never run code from GitHub in Colab and am not sure whether I need to do something so that it can find the common module.
I retrieved the Docker image like so:
# docker pull akashin/coursera-aml-nlp
# python3 --version
shows that this image has Python 3.5 installed.
Unfortunately, Python 3.5 has this bug: dict insertion order is not maintained, so test_my_bag_of_words() fails. It works correctly in Colab because Colab's Python version is 3.6.
I tried upgrading Python 3.5 on the Docker image to Python 3.7 using this post. The upgrade seems to work, and I also upgraded the Jupyter notebook, but then the notebook doesn't work properly.
Is it possible to provide a Docker image with an upgraded Python version?
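For context on the assumed failure mode: CPython made dict insertion order an implementation detail in 3.6 and a language-level guarantee only in 3.7, so a vocabulary built as a dict can come out in a different order on Python 3.5, shuffling the bag-of-words vector layout. On 3.7+:

```python
# Insertion order of dicts is guaranteed on Python 3.7+ (and holds in
# CPython 3.6), but not on 3.5 -- which is why a dict-built vocabulary
# can reorder there and make the word->index mapping unstable.
vocab = {}
for word in ["free", "c++", "memory"]:
    vocab[word] = len(vocab)
print(list(vocab))  # ['free', 'c++', 'memory']
```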