
transformers: Introduction

Learn Transformers

Repo for the Natural Language Processing: NLP With Transformers in Python course. You can get 70% off the price with this link! If the link stops working, raise an issue or drop me a message somewhere:

YouTube | Discord

I also have plenty of free material on YouTube 😊

transformers: People

Contributors

captntardigrade, dependabot[bot], ilangurudev, jamescalam


transformers: Issues

Can you give an example of contrastive learning?

Now that contrastive learning is heavily researched and used in NLP, could you do an in-depth analysis of ContrastiveLoss, TripletLoss, Multiple Negatives Ranking Loss, etc., without resorting to the existing sentence-transformers implementations?
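For anyone curious before a full answer lands: the core of Multiple Negatives Ranking loss fits in a few lines. This is a dependency-free sketch of the idea, not the sentence-transformers implementation; in practice you would do this over torch tensors with `torch.nn.functional.cross_entropy`, and the scale of 20 is an assumption borrowed from the common sentence-transformers default.

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mnr_loss(anchors, positives, scale=20.0):
    """Multiple Negatives Ranking loss on plain Python vectors.

    For each anchor i, pair i is its positive; every other positive in the
    batch serves as an in-batch negative. The loss is cross-entropy over
    each row of scaled cosine similarities, with the true class at index i.
    """
    n = len(anchors)
    total = 0.0
    for i, a in enumerate(anchors):
        scores = [scale * cosine(a, p) for p in positives]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[i]   # -log softmax at the true index
    return total / n
```

When anchors and positives line up, the loss is near zero; shuffle the positives and it grows, which is exactly the gradient signal that pulls matched pairs together and pushes in-batch negatives apart.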

Environment setup issue

Local conda version: 4.11.0. I used https://github.com/jamescalam/transformers/blob/main/environment.yaml to set up the environment.

I am getting the following error:

Collecting package metadata (repodata.json): done

Solving environment: failed

ResolvePackageNotFound:

  • mkl_fft==1.2.0=py38h45dec08_0

  • pandocfilters==1.4.3=py38haa95532_1

  • icc_rt==2019.0.0=h0cc432a_1

  • ninja==1.10.2=h5362a0b_0

  • winpty==0.4.3=4

  • pip==20.2.4=py38haa95532_0

  • vc==14.1=h0510ff6_4

  • lz4-c==1.9.2=hf4a77e7_3

  • pyzmq==19.0.2=py38ha925a31_1

  • lzo==2.10=he774522_2

  • scikit-image==0.17.2=py38h1e1f486_0

  • libpng==1.6.37=h2a8f88b_0

  • scipy==1.5.2=py38h14eb087_0

  • pycosat==0.6.3=py38he774522_0

  • wrapt==1.11.2=py38he774522_0

  • blosc==1.20.1=h7bd577a_0

  • sympy==1.6.2=py38haa95532_1

  • pillow==8.0.1=py38h4fa10fc_0

  • h5py==2.10.0=py38h5e291fa_0

  • tornado==6.0.4=py38he774522_1

  • markupsafe==1.1.1=py38he774522_0

  • hdf5==1.10.4=h7ebc959_0

  • setuptools==50.3.1=py38haa95532_1

  • powershell_shortcut==0.0.1=3

  • get_terminal_size==1.0.0=h38e98db_0

  • msgpack-python==1.0.0=py38h74a9793_1

  • lazy-object-proxy==1.4.3=py38he774522_0

  • numexpr==2.7.1=py38h25d0782_0

  • libsodium==1.0.18=h62dcd97_0

  • pyyaml==5.3.1=py38he774522_1

  • yaml==0.2.5=he774522_0

  • python==3.8.5=h5fd99cc_1

  • curl==7.71.1=h2a8f88b_1

  • krb5==1.18.2=hc04afaa_0

  • m2w64-gcc-libs==5.3.0=7

  • libxslt==1.1.34=he774522_0

  • numpy-base==1.19.2=py38ha3acd2a_0

  • rtree==0.9.4=py38h21ff451_1

  • argon2-cffi==20.1.0=py38he774522_1

  • mkl==2020.2=256

  • py-lief==0.10.1=py38ha925a31_0

  • libxml2==2.9.10=hb89e7f3_3

  • libiconv==1.15=h1df5818_7

  • freetype==2.10.4=hd328e21_0

  • jpeg==9b=hb83a4c4_2

  • regex==2020.10.15=py38he774522_0

  • libspatialindex==1.9.3=h33f27b4_0

  • pkginfo==1.6.1=py38haa95532_0

  • qt==5.9.7=vc14h73c81de_0

  • fastcache==1.1.0=py38he774522_0

  • pyreadline==2.1=py38_1

  • sip==4.19.13=py38ha925a31_0

  • mkl-service==2.3.0=py38hb782905_0

  • bottleneck==1.3.2=py38h2a96729_1

  • console_shortcut==0.1.1=4

  • m2w64-gcc-libs-core==5.3.0=7

  • mkl_random==1.1.1=py38h47e9c7a_0

  • m2w64-gcc-libgfortran==5.3.0=6

  • pywin32==227=py38he774522_1

  • icu==58.2=ha925a31_3

  • tk==8.6.10=he774522_0

  • pywin32-ctypes==0.2.0=py38_1000

  • intel-openmp==2020.2=254

  • distributed==2.30.1=py38haa95532_0

  • bcrypt==3.2.0=py38he774522_0

  • sqlalchemy==1.3.20=py38h2bbff1b_0

  • bitarray==1.6.1=py38h2bbff1b_0

  • sqlite==3.33.0=h2a8f88b_0

  • statsmodels==0.12.0=py38he774522_0

  • ipython==7.19.0=py38hd4e2768_0

  • pytables==3.6.1=py38ha5be198_0

  • libarchive==3.4.2=h5e25573_0

  • liblief==0.10.1=ha925a31_0

  • tifffile==2020.10.1=py38h8c2d366_2

  • scikit-learn==0.23.2=py38h47e9c7a_0

  • menuinst==1.4.16=py38he774522_1

  • win_unicode_console==0.5=py38_0

  • zeromq==4.3.2=ha925a31_3

  • gevent==20.9.0=py38he774522_0

  • pywavelets==1.1.1=py38he774522_2

  • watchdog==0.10.3=py38_0

  • cytoolz==0.11.0=py38he774522_0

  • mistune==0.8.4=py38he774522_1000

  • six==1.15.0=py38haa95532_0

  • zlib==1.2.11=h62dcd97_4

  • cudatoolkit==11.1.1=heb2d755_7

  • lxml==4.6.1=py38h1350720_0

  • brotlipy==0.7.0=py38he774522_1000

  • m2w64-gmp==6.1.0=2

  • libcurl==7.71.1=h2a8f88b_1

  • numba==0.51.2=py38hf9181ef_1

  • psutil==5.7.2=py38he774522_0

  • libtiff==4.1.0=h56a325e_1

  • kiwisolver==1.3.0=py38hd77b12b_0

  • astropy==4.0.2=py38he774522_0

  • pynacl==1.4.0=py38h62dcd97_1

  • ruamel_yaml==0.15.87=py38he774522_1

  • pyrsistent==0.17.3=py38he774522_0

  • win_inet_pton==1.1.0=py38_0

  • pandoc==2.11=h9490d1a_0

  • libuv==1.41.0=h8ffe710_0

  • pyqt==5.9.2=py38ha925a31_4

  • wincertstore==0.2=py38_0

  • cryptography==3.1.1=py38h7a1dbc1_0

  • libssh2==1.9.0=h7a1dbc1_1

  • zope.interface==5.1.2=py38he774522_0

  • pyodbc==4.0.30=py38ha925a31_0

  • vs2015_runtime==14.16.27012=hf0eaf9b_3

  • numpy==1.19.2=py38hadc3359_0

  • openssl==1.1.1h=he774522_0

  • matplotlib-base==3.3.2=py38hba9282a_0

  • pycurl==7.43.0.6=py38h7a1dbc1_0

  • zstd==1.4.5=h04227a9_0

  • llvmlite==0.34.0=py38h1a82afc_4

  • comtypes==1.1.7=py38_1001

  • pywinpty==0.5.7=py38_0

  • xz==5.2.5=h62dcd97_0

  • msys2-conda-epoch==20160418=1

  • m2w64-libwinpthread-git==5.0.0.4634.697f757=2

  • bzip2==1.0.8=he774522_0

  • cffi==1.14.3=py38h7a1dbc1_0

  • greenlet==0.4.17=py38he774522_0

  • ujson==4.0.1=py38ha925a31_0

I ran `conda update conda` and `conda update conda -c conda-forge`, but the issue persists.
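For anyone hitting this: a likely cause (an assumption, since the export platform isn't stated) is that environment.yaml was exported on Windows, so every spec pins a platform- and channel-specific build string (the `py38haa95532_0`-style suffix above) that conda cannot find elsewhere. Stripping the build suffix and letting conda re-solve versions usually clears ResolvePackageNotFound. A minimal sketch:

```python
import re

def strip_build(spec):
    """Drop a trailing conda build string from one environment.yaml line.

    Handles both 'name==version=build' (as printed in the error above) and
    the 'name=version=build' form conda writes into environment files;
    lines without a build suffix pass through unchanged.
    """
    return re.sub(r"^(\s*-?\s*[\w.\-]+==?[^=\s]+)=\S+$", r"\1", spec)
```

Applied line by line over the file, this leaves `- mkl_fft==1.2.0` and so on for conda to re-solve. On the exporting side, `conda env export --no-builds` avoids the problem in the first place.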

TypeError: forward() got an unexpected keyword argument 'labels'

In the NSP training program, when training starts, a TypeError occurs at the following line of code:

outputs = model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids, labels=labels)

The error is as follows:
/...../site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
TypeError: forward() got an unexpected keyword argument 'labels'

I reinstalled the latest version of torch, but I still get this error. I need help to get training started.
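Reinstalling torch won't change this: the error comes from the model class, not the library version. The usual cause (my assumption, since the model-loading line isn't shown) is loading the bare encoder, e.g. `AutoModel`/`BertModel`, whose `forward()` has no `labels` parameter; NSP training needs a head class, i.e. `model = BertForNextSentencePrediction.from_pretrained('bert-base-uncased')`. A dependency-free repro of the mechanics:

```python
# Mimics how torch dispatches Module.__call__ to forward(): the `labels`
# keyword simply has to exist in the forward() signature of the class used.
class HeadlessModel:                       # like BertModel: no `labels` kwarg
    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        return input_ids

class ModelWithHead(HeadlessModel):        # like BertForNextSentencePrediction
    def forward(self, input_ids, attention_mask=None,
                token_type_ids=None, labels=None):
        return (input_ids, labels)

try:
    HeadlessModel()(1, labels=0)
except TypeError as e:
    print(e)   # ...forward() got an unexpected keyword argument 'labels'

ModelWithHead()(1, labels=0)   # accepted: the head computes a loss from labels
```

If you really do want a bare encoder, drop `labels=labels` from the call and compute the loss yourself from the returned logits/hidden states.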

Dtype error in model.fit()

I am having an issue when I run through the code provided in the project_build_tf_sentiment_model folder. When I try to run this code:

history = model.fit(
    train_ds,
    validation_data=val_ds,
    epochs=2
)

in the 02_build_and_train_lstm_example.ipynb file, I get the following error:

---------------------------------------------------------------------------
InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-7-9a34c6a1ecff> in <module>
      2     train_ds,
      3     validation_data=val_ds,
----> 4     epochs=2
      5 )

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\keras\engine\training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
   1181                 _r=1):
   1182               callbacks.on_train_batch_begin(step)
-> 1183               tmp_logs = self.train_function(iterator)
   1184               if data_handler.should_sync:
   1185                 context.async_wait()

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\def_function.py in __call__(self, *args, **kwds)
    887 
    888       with OptionalXlaContext(self._jit_compile):
--> 889         result = self._call(*args, **kwds)
    890 
    891       new_tracing_count = self.experimental_get_tracing_count()

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\def_function.py in _call(self, *args, **kwds)
    948         # Lifting succeeded, so variables are initialized and we can run the
    949         # stateless function.
--> 950         return self._stateless_fn(*args, **kwds)
    951     else:
    952       _, _, _, filtered_flat_args = \

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\function.py in __call__(self, *args, **kwargs)
   3022        filtered_flat_args) = self._maybe_define_function(args, kwargs)
   3023     return graph_function._call_flat(
-> 3024         filtered_flat_args, captured_inputs=graph_function.captured_inputs)  # pylint: disable=protected-access
   3025 
   3026   @property

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1959       # No tape is watching; skip to running the function.
   1960       return self._build_call_outputs(self._inference_function.call(
-> 1961           ctx, args, cancellation_manager=cancellation_manager))
   1962     forward_backward = self._select_forward_and_backward_functions(
   1963         args,

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\function.py in call(self, ctx, args, cancellation_manager)
    594               inputs=args,
    595               attrs=attrs,
--> 596               ctx=ctx)
    597         else:
    598           outputs = execute.execute_with_cancellation(

~\AppData\Roaming\Python\Python37\site-packages\tensorflow\python\eager\execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     58     ctx.ensure_initialized()
     59     tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
---> 60                                         inputs, attrs, num_outputs)
     61   except core._NotOkStatusException as e:
     62     if name is not None:

InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument:  Data type mismatch at component 0: expected double but got int32.
	 [[node IteratorGetNext (defined at <ipython-input-7-9a34c6a1ecff>:4) ]]
  (1) Invalid argument:  Data type mismatch at component 0: expected double but got int32.
	 [[node IteratorGetNext (defined at <ipython-input-7-9a34c6a1ecff>:4) ]]
	 [[GroupCrossDeviceControlEdges_0/Identity_2/_41]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_26737]

Function call stack:
train_function -> train_function

I created a new environment from the .yml and requirements.txt files provided and got the same error.
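The traceback points at the dataset pipeline, not the model: component 0 of the `tf.data` iterator is yielding int32 where the graph expects double (float64). A common fix, assuming the mismatch is in the features (the dataset-building cell isn't shown, so this is a guess), is to cast the offending component before training, e.g. `train_ds = train_ds.map(lambda x, y: (tf.cast(x, tf.float64), y))`, and apply the same cast to `val_ds`. The same cast-up-front idea in plain Python:

```python
# The tf.cast call above does exactly this, per batch, inside the pipeline:
# make sure paired dataset components carry the dtype the graph was traced with.
features = [[1, 0, 3], [2, 2, 0]]                  # int values, like int32 tensors
features_f = [[float(v) for v in row] for row in features]
print(features_f)   # [[1.0, 0.0, 3.0], [2.0, 2.0, 0.0]]
```

The key point is that both dataset components must match the dtypes the compiled `train_function` expects; casting at dataset-construction time is cheaper than debugging it inside `model.fit`.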

[Calculate Similarity] Can we avoid using a sentence-transformers pre-trained model?

Hi there,

I really appreciate your nice work; it gave me a much better understanding of BERT tokenization.

In my case, training on Vietnamese (with a bunch of new, domain-specific vocabulary), can I load the model from something other than a sentence-transformers pre-trained model (here is the model I intend to use)?

I gave it a shot; below is my pseudo-code:

import torch
from transformers import AutoModel, AutoTokenizer

model_name = "vinai/phobert-large"
model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
sentences = [
    "tôi thích quán này dữ lắm lắm luôn á trời",
    "trời hôm nay đẹp lắm",
]
# the first sentence means "I really love this restaurant",
# while the second means "today's sky is so nice"
#_______________

# initialize dictionary to store tokenized sentences
tokens = {'input_ids': [], 'attention_mask': []}

for sentence in sentences:
    # encode each sentence and append to dictionary
    # (clean_doc is my own text-preprocessing helper)
    new_tokens = tokenizer.encode_plus(clean_doc(sentence), max_length=128,
                                       truncation=True, padding='max_length',
                                       return_tensors='pt')
    tokens['input_ids'].append(new_tokens['input_ids'][0])
    tokens['attention_mask'].append(new_tokens['attention_mask'][0])

# reformat list of tensors into single tensor
tokens['input_ids'] = torch.stack(tokens['input_ids'])
tokens['attention_mask'] = torch.stack(tokens['attention_mask'])
outputs = model(**tokens)

embeddings = outputs.last_hidden_state

attention_mask = tokens['attention_mask']
mask = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()

masked_embeddings = embeddings * mask
summed = torch.sum(masked_embeddings, 1)
summed_mask = torch.clamp(mask.sum(1), min=1e-9)
mean_pooled = summed / summed_mask

from sklearn.metrics.pairwise import cosine_similarity
mean_pooled = mean_pooled.detach().numpy()
cosine_similarity(
    [mean_pooled[0]],
    mean_pooled[1:]
)

These two sentences are quite different in meaning, yet their cosine similarity comes out at about 0.91.

Is there a more suitable approach for my case?

I do appreciate your time and sharing.
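A note on the ~0.91 score (hedging, since I haven't run phobert-large myself): encoders that were never fine-tuned for similarity tend to produce anisotropic embeddings, where every sentence pair scores high because all vectors share a few dominant directions. Mean-centering the embeddings before cosine similarity is a cheap diagnostic; if the scores collapse after centering, the raw similarities were mostly shared-direction artefacts, and fine-tuning with a contrastive objective (as sentence-transformers models are) is the usual remedy. A stdlib sketch:

```python
import math

def cosine(u, v):
    """Cosine similarity between two plain-list vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def center(vectors):
    """Subtract the component-wise mean vector from every embedding."""
    n = len(vectors)
    mean = [sum(col) / n for col in zip(*vectors)]
    return [[a - m for a, m in zip(v, mean)] for v in vectors]

# Two toy embeddings sharing one dominant direction, as outputs of a
# non-fine-tuned encoder tend to:
emb = [[10.0, 1.0], [10.0, -1.0]]
print(cosine(emb[0], emb[1]))   # ≈ 0.98: looks "similar"
c = center(emb)
print(cosine(c[0], c[1]))       # -1.0: actually opposite once centered
```

In practice you would compute the mean over a larger sample of sentence embeddings, not just the pair being compared.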

Error on import flair in "Sentiment with Flair"

I am getting the following error:

OSError                                   Traceback (most recent call last)
<ipython-input-1-911909092dfc> in <module>
----> 1 import flair
      2 model = flair.models.TextClassifier.load('en-sentiment')

C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\flair\__init__.py in <module>
      1 import os
----> 2 import torch
      3 from pathlib import Path
      4 from transformers import set_seed as hf_set_seed
      5 

C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\torch\__init__.py in <module>
    126                 err = ctypes.WinError(last_error)
    127                 err.strerror += f' Error loading "{dll}" or one of its dependencies.'
--> 128                 raise err
    129             elif res is not None:
    130                 is_loaded = True

OSError: [WinError 127] The specified procedure could not be found. Error loading "C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\torch\lib\caffe2_detectron_ops.dll" or one of its dependencies.

The file C:\ProgramData\Anaconda3\envs\ml\lib\site-packages\torch\lib\caffe2_detectron_ops.dll is present in that directory.
I have flair version 0.8.

Elasticsearch with Haystack - connection failed

Hi, I am a student in your Udemy course, trying to run the 09_elastic_in_haystack.ipynb code from the QA chapter.
I am using JupyterLab connected to my GCP VM (Tesla 100 GPU, Debian 9 Linux). I installed the Debian version of Elasticsearch. The sample code and a screenshot of the error are below. I appreciate your support on this.
Shabnam

!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.1.2-linux-x86_64.tar.gz
!wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-8.1.2-linux-x86_64.tar.gz.sha512
!shasum -a 512 -c elasticsearch-8.1.2-linux-x86_64.tar.gz.sha512 
!tar -xzf elasticsearch-8.1.2-linux-x86_64.tar.gz
!cd elasticsearch-8.1.2/ 


url = """https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json"""
!wget -nc -q {url}
url = """https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json"""
!wget -nc -q {url}
import json

with open('dev-v2.0.json', 'r') as f:
    squad = json.load(f)

!pip install pymilvus
import pymilvus

from haystack.document_store.elasticsearch import ElasticsearchDocumentStore


document_store = ElasticsearchDocumentStore(host='localhost', username='', password='', index='squad_docs')



ConnectionError: Initial connection to Elasticsearch failed. Make sure you run an Elasticsearch instance at [{'host': 'localhost', 'port': 9200}] and that it has finished the initial ramp up (can take > 30s).
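Two things stand out in the snippet (both assumptions about the setup). First, each `!` command in Jupyter runs in its own subshell, so `!cd elasticsearch-8.1.2/` has no lasting effect, and nothing in the cell ever actually starts Elasticsearch (`bin/elasticsearch`); the ConnectionError simply reports that nothing is listening on localhost:9200. Second, Elasticsearch 8.x enables TLS and authentication by default, while the course was written against 7.x, so even a running 8.1.2 node may reject the plain-HTTP connection haystack attempts; downloading a 7.x release is the path of least resistance. A sketch for launching the node from the notebook and waiting for it to come up:

```python
import subprocess
import time
import urllib.error
import urllib.request

def wait_for_es(url="http://localhost:9200", timeout=60):
    """Poll `url` until Elasticsearch answers, or return False after `timeout` s."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            urllib.request.urlopen(url, timeout=2)
            return True
        except (urllib.error.URLError, OSError):
            time.sleep(1)   # not up yet (or refusing plain HTTP); retry
    return False

# Assumption: the tarball was extracted into the notebook's working directory.
# Start the node as a background process so it outlives the cell; note that
# Elasticsearch refuses to run as root, so on a root notebook kernel you would
# need to launch it as a non-root user instead.
# es = subprocess.Popen(["elasticsearch-8.1.2/bin/elasticsearch"],
#                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
# assert wait_for_es(), "Elasticsearch did not come up within 60 s"
```

Only once `wait_for_es()` returns True is it worth constructing the `ElasticsearchDocumentStore`.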

Need some ordering in the /course folders

This is just a suggestion: add a prefix to the folders inside /course so that it is easier to match them with the sections in the Udemy course. Mirroring the course, it would be very useful to name them Section_{num}_{folder_name}, for instance Section_1_introduction.

Thanks
