nlp-uoregon / trankit

Trankit is a Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

License: Apache License 2.0

nlp natural-language-processing pytorch language-model xlm-roberta machine-learning deeplearning artificial-intelligence universal-dependencies multilingual adapters sentence-segmentation tokenization part-of-speech-tagging morphological-tagging dependency-parsing lemmatization

trankit's Introduction

Trankit: A Light-Weight Transformer-based Python Toolkit for Multilingual Natural Language Processing

Our technical paper for Trankit won the Outstanding Demo Paper Award at EACL 2021. Please cite the paper if you use Trankit in your research.

@inproceedings{nguyen2021trankit,
      title={Trankit: A Light-Weight Transformer-based Toolkit for Multilingual Natural Language Processing}, 
      author={Nguyen, Minh Van and Lai, Viet Dac and Veyseh, Amir Pouran Ben and Nguyen, Thien Huu},
      booktitle="Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations",
      year={2021}
}

💥 💥 💥 Trankit v1.0.0 is out:

  • 90 new pretrained transformer-based pipelines for 56 languages. The new pipelines are trained with XLM-Roberta large, which further boosts performance significantly across the 90 treebanks of the Universal Dependencies v2.5 corpus. Check out the new performance here. This page shows you how to use the new pipelines.

  • Auto Mode for multilingual pipelines. In Auto Mode, the language of the input is automatically detected, enabling the multilingual pipelines to process the input without being told its language. Check out how to turn on Auto Mode here. Thanks to loretoparisi for suggesting this feature.

  • A command-line interface is now available. This helps users who are not familiar with the Python programming language use Trankit easily. Check out the tutorials on this page.

Trankit is a light-weight Transformer-based Python Toolkit for multilingual Natural Language Processing (NLP). It provides a trainable pipeline for fundamental NLP tasks over 100 languages, and 90 downloadable pretrained pipelines for 56 languages.

Trankit outperforms the current state-of-the-art multilingual toolkit Stanza (StanfordNLP) on many tasks over 90 Universal Dependencies v2.5 treebanks of 56 different languages, while remaining efficient in memory usage and speed, making it usable for general users.

In particular, for English, Trankit is significantly better than Stanza on sentence segmentation (+9.36%) and dependency parsing (+5.07% UAS and +5.81% LAS). For Arabic, our toolkit substantially improves sentence segmentation performance by 16.36%, while for Chinese we observe improvements of 14.50% (UAS) and 15.00% (LAS) for dependency parsing. A detailed comparison between Trankit, Stanza, and other popular NLP toolkits (i.e., spaCy, UDPipe) on other languages can be found here on our documentation page.

We also created a Demo Website for Trankit, which is hosted at: http://nlp.uoregon.edu/trankit

Installation

Trankit can be easily installed via one of the following methods:

Using pip

pip install trankit

This command installs Trankit and all dependent packages automatically.

From source

git clone https://github.com/nlp-uoregon/trankit.git
cd trankit
pip install -e .

This first clones our GitHub repo and then installs Trankit in editable mode.

Fixing the compatibility issue of Trankit with Transformers

Previous versions of Trankit encounter a compatibility issue with recent versions of transformers. To fix this issue, please install the new version of Trankit as follows:

pip install trankit==1.1.0

If you encounter any other problem with the installation, please raise an issue here to let us know. Thanks.

Usage

Trankit can process inputs that are either untokenized (raw) strings or pretokenized token lists, at both the sentence and document level. Currently, Trankit supports the following tasks:

  • Sentence segmentation.
  • Tokenization.
  • Multi-word token expansion.
  • Part-of-speech tagging.
  • Morphological feature tagging.
  • Dependency parsing.
  • Named entity recognition.

Initialize a pretrained pipeline

The following code shows how to initialize a pretrained pipeline for English; it is instructed to run on a GPU, automatically download pretrained models, and store them in the specified cache directory. Trankit will not download pretrained models if they already exist.

from trankit import Pipeline

# initialize a pretrained pipeline for English
p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

Perform all tasks on the input

After initializing a pretrained pipeline, it can be used to process input for all tasks, as shown below. If the input is a single sentence, the flag is_sent must be set to True.

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

######## document-level processing ########
untokenized_doc = '''Hello! This is Trankit.'''
pretokenized_doc = [['Hello', '!'], ['This', 'is', 'Trankit', '.']]

# perform all tasks on the input
processed_doc1 = p(untokenized_doc)
processed_doc2 = p(pretokenized_doc)

######## sentence-level processing ####### 
untokenized_sent = '''This is Trankit.'''
pretokenized_sent = ['This', 'is', 'Trankit', '.']

# perform all tasks on the input
processed_sent1 = p(untokenized_sent, is_sent=True)
processed_sent2 = p(pretokenized_sent, is_sent=True)

Note that, although pretokenized inputs can always be processed, using them for languages that require multi-word token expansion, such as Arabic or French, may not be the correct approach. Please check the column Requires MWT expansion? of this table to see whether a particular language requires multi-word token expansion.
For more detailed examples, please check out our documentation page.
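
Besides running the full task stack with p(...), individual tasks can be invoked through dedicated methods. The calls below are all taken from examples elsewhere on this page; see the documentation for the complete list:

from trankit import Pipeline

p = Pipeline(lang='english', gpu=True, cache_dir='./cache')

tokens = p.tokenize('Hello! This is Trankit.')                      # sentence segmentation + tokenization
tagged = p.posdep('Hello! This is Trankit.')                        # POS/morphological tagging + dependency parsing
entities = p.ner('Trankit was built at the University of Oregon.')  # named entity recognition
lemmas = p.lemmatize('This is Trankit.', is_sent=True)              # lemmatization of a single sentence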

Multilingual usage

Starting from version v1.0.0, Trankit supports a handy Auto Mode in which users do not have to set a particular language active before processing the input. In the Auto Mode, Trankit will automatically detect the language of the input and use the corresponding language-specific models, thus avoiding switching back and forth between languages in a multilingual pipeline.

from trankit import Pipeline

p = Pipeline('auto')

# Tokenizing an English input
en_output = p.tokenize('''I figured I would put it out there anyways.''') 

# POS, Morphological tagging and Dependency parsing a French input
fr_output = p.posdep('''On pourra toujours parler à propos d'Averroès de "décentrement du Sujet".''')

# NER tagging a Vietnamese input
vi_output = p.ner('''Cuộc tiêm thử nghiệm tiến hành tại Học viện Quân y, Hà Nội''')

In this example, the code name 'auto' is used to initialize a multilingual pipeline in the Auto Mode. For more information, please visit this page. Note that, besides the new Auto Mode, the manual mode can still be used as before.

Building a customized pipeline

Training customized pipelines is easy with Trankit via the class TPipeline. Below we show how we can train a token and sentence splitter on customized data.

from trankit import TPipeline

tp = TPipeline(training_config={
    'task': 'tokenize',
    'save_dir': './saved_model',
    'train_txt_fpath': './train.txt',
    'train_conllu_fpath': './train.conllu',
    'dev_txt_fpath': './dev.txt',
    'dev_conllu_fpath': './dev.conllu'
    }
)

tp.train()

Detailed guidelines for training and loading a customized pipeline can be found here.
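
Once training is done, the pipeline must be verified before loading. A minimal sketch, mirroring the verification and loading snippets shown in the issues further down this page; the category name 'customized' is an assumption for a tokenizer-only pipeline:

import trankit
from trankit import Pipeline

# verify the trained models (category and save_dir must match the training setup)
trankit.verify_customized_pipeline(
    category='customized',      # assumed category for a tokenizer-only pipeline
    save_dir='./saved_model'    # directory used for saving models during training
)

# load the customized pipeline like any pretrained one
p = Pipeline(lang='customized', cache_dir='./saved_model')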

Sharing your customized pipelines

In case you want to share your customized pipelines with other users, please create an issue here and provide us with the following information:

  • Training data that you used to train your models, e.g., data license, data source, and some data statistics (i.e., sizes of training, development, and test data).
  • Performance of your pipelines on your test data using the official evaluation script.
  • A downloadable link to your trained model files (a Google Drive link would be great). After we receive your request, we will check and test your pipelines. Once everything is done, we will make the pipelines accessible to other users via new language codes.

Acknowledgements

This project has been supported by the Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via IARPA Contract No. 2019-19051600006 under the Better Extraction from Text Towards Enhanced Retrieval (BETTER) Program.

We use XLM-Roberta and Adapters as our shared multilingual encoder for different tasks and languages. The AdapterHub is used to implement our plug-and-play mechanism with Adapters. To speed up the development process, the implementations for the MWT expander and the lemmatizer are adapted from Stanza. To implement the language detection module, we leverage the langid library.

trankit's People

Contributors

alxshine, b-mcdowell, calpt, harshcasper, laiviet, minhhdvn


trankit's Issues

API Docs

Great library!

Can you provide detailed documentation of the API? For example, what are the inputs/fields of the Pipeline class/function, and what do they mean?

Pipeline(lang='english', gpu=True, cache_dir='./cache')

Activation of DeepSource

Hi 👋

One of my Pull Requests around fixing Code Quality Issues with DeepSource was merged here: #9

I'd just like to inform you that the issues fixed here were detected by running DeepSource analysis on the repo. If you like, you can activate analysis for your repository to detect such code quality issues/bug risks on the fly for every change made. You can also use the Autofix feature to fix them with one click.

The .deepsource.toml file you merged will only take effect if you activate analysis for this repo.

Here's what you can do if you wish to activate DeepSource to continuously analyze your repository:

  • Sign up on DeepSource and activate analysis for this repository.
  • Create .deepsource.toml configuration which you can use to configure your analysis settings (My PR already added that, but feel free to edit it anytime).
  • Track/Check analysis here.

If you have any doubts or questions, you can check out the docs, or feel free to reach out :)

Output probabilities

Hi - thanks for releasing this great toolkit! Is there any way to get probabilities output alongside each component's predictions? It would be really nice if we could get, say, POS tag probabilities next to each prediction, etc.

Hardware recommendations

Hello. :)
I'm running multiple versions of trankit in docker containers (each container is assigned an RTX 3090 or an RTX 2080 Ti), with 3 python instances/workers per GPU/container.

I'm seeing throughput drop off beyond about 3 GPUs on a dual 2697 v3 machine (dual 16-core processors, single-thread PassMark about 2000, multi about 20k per CPU), and for a single GPU, performance is about 15% lower than on a 5950X machine (16 cores, single-thread PassMark about 3500).

I'm still doing some tests, but it seems like trankit needs fast CPU cores (around 4-5 per GPU) to run well?

Format for training custom NER classifiers

First of all, thanks for open-sourcing trankit -- it looks very interesting!

I would be interested in training a custom NER model as described in the docs. Could you please comment a bit on what format the .bio files should be stored in?

Thanks!

cc @minhvannguyen
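
For reference, a common convention for .bio files, an assumption here since the docs are not quoted in this thread, is one token per line followed by its BIO tag, with a blank line separating sentences:

John B-PER
Smith I-PER
works O
at O
Google B-ORG
. O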

Single shared configuration for all Pipeline instances

Hi,
What I did:

  • create two Pipeline instances for processing two languages (file global_config_demo.py, attached as a text file because of GitHub limitations: global_config_demo.py.txt)
  • try to split sentence with one instance

What I get:
An error mentioning the language from the other instance (file console.log)

What I expected to get:
Split sentence without error.

Root of the problem
All instances share the same configuration, so the only workable usage scenario is single-threaded processing of requests one by one:
https://github.com/nlp-uoregon/trankit/blob/master/trankit/pipeline.py#L166

How to fix
Do not mix global and instance options; handle them separately in the code.

Working solution
Make an isolated copy of config for each Pipeline instance (global_config_demo_fixed.py.txt)

Is there any reason for the global singleton Pipeline configuration?
Thank you.
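
A generic sketch of the bug pattern and the proposed per-instance fix (illustrative only; this is not trankit's actual code):

import copy

GLOBAL_CONFIG = {'active_lang': None}  # stands in for the module-level config

class BuggyPipeline:
    def __init__(self, lang):
        self.config = GLOBAL_CONFIG        # every instance shares one dict...
        self.config['active_lang'] = lang  # ...so the second instance clobbers the first

class FixedPipeline:
    def __init__(self, lang):
        self.config = copy.deepcopy(GLOBAL_CONFIG)  # isolated per-instance copy
        self.config['active_lang'] = lang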

Training a model for pre-tagged data

Hi and thanks for making this toolkit available!

I wanted to ask whether it's possible to train a model to predict morphology (FEATS) and/or lemmatization based on pre-tagged text (for example, POS tags already present in a CoNLL-U input file)?

Prevent from splitting on hyphen when doing tokenization for POS?

Dear community,

Is it possible to prevent Trankit from splitting words on hyphens for POS tagging?
For example, it splits "out-of-print materials" into "out", "-", "of", "-", "print", "materials", and then does POS tagging on each item separately. Sometimes the word as a whole should have one POS tag, but once Trankit splits it, every piece gets its own tag. I then need to find a way to combine them back and choose only one POS tag, which makes the whole code quite cumbersome.

Is it possible to just prevent Trankit from doing that?
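
A minimal post-processing sketch of the "combine them back" workaround described above, assuming Trankit's documented 'text' and 'span' token fields (other fields, including the POS tag, are kept from the first part; illustrative only):

def merge_hyphenated(tokens):
    # merge adjacent triples like ('out', '-', 'of') back into one token,
    # using sentence-level spans to verify there are no spaces in between
    merged, i = [], 0
    while i < len(tokens):
        tok = dict(tokens[i])
        while (i + 2 < len(tokens)
               and tokens[i + 1]['text'] == '-'
               and tok['span'][1] == tokens[i + 1]['span'][0]
               and tokens[i + 1]['span'][1] == tokens[i + 2]['span'][0]):
            tok['text'] += '-' + tokens[i + 2]['text']
            tok['span'] = (tok['span'][0], tokens[i + 2]['span'][1])
            i += 2
        merged.append(tok)
        i += 1
    return merged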

Issue with lemmatization of indonesian

Input:

from trankit import Pipeline

p = Pipeline('indonesian', embedding='xlm-roberta-large')

print(p('Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.'))

Output:

Loading pretrained XLM-Roberta, this may take a while...
Loading tokenizer for indonesian
Loading tagger for indonesian
Loading lemmatizer for indonesian

Active language: indonesian

{'text': 'Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.', 'sentences': [{'id': 1, 'text': 'Ia menjadi Gubernur Bali menggantikan Anak Agung Bagus Sutedja.', 'tokens': [{'id': 1, 'text': 'Ia', 'upos': 'PRON', 'xpos': 'PS3', 'feats': 'Number=Sing|Person=3|PronType=Prs', 'head': 2, 'deprel': 'nsubj', 'dspan': (0, 2), 'span': (0, 2), 'lemma': 'ia'}, {'id': 2, 'text': 'menjadi', 'upos': 'VERB', 'xpos': 'VSA', 'feats': 'Number=Sing|Voice=Act', 'head': 0, 'deprel': 'root', 'dspan': (3, 10), 'span': (3, 10), 'lemma': 'menjadi'}, {'id': 3, 'text': 'Gubernur', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 2, 'deprel': 'obj', 'dspan': (11, 19), 'span': (11, 19), 'lemma': 'gubernur'}, {'id': 4, 'text': 'Bali', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 3, 'deprel': 'flat', 'dspan': (20, 24), 'span': (20, 24), 'lemma': 'bali'}, {'id': 5, 'text': 'menggantikan', 'upos': 'VERB', 'xpos': 'VSA', 'feats': 'Number=Sing|Voice=Act', 'head': 2, 'deprel': 'xcomp', 'dspan': (25, 37), 'span': (25, 37), 'lemma': 'mengantikan'}, {'id': 6, 'text': 'Anak', 'upos': 'PROPN', 'xpos': 'NSD', 'feats': 'Number=Sing', 'head': 5, 'deprel': 'obj', 'dspan': (38, 42), 'span': (38, 42), 'lemma': 'anak'}, {'id': 7, 'text': 'Agung', 'upos': 'PROPN', 'xpos': 'ASP', 'feats': 'Degree=Pos|Number=Sing', 'head': 6, 'deprel': 'flat', 'dspan': (43, 48), 'span': (43, 48), 'lemma': 'agung'}, {'id': 8, 'text': 'Bagus', 'upos': 'PROPN', 'xpos': 'ASP', 'feats': 'Degree=Pos|Number=Sing', 'head': 7, 'deprel': 'flat', 'dspan': (49, 54), 'span': (49, 54), 'lemma': 'bagus'}, {'id': 9, 'text': 'Sutedja', 'upos': 'PROPN', 'xpos': 'X--', 'head': 8, 'deprel': 'flat', 'dspan': (55, 62), 'span': (55, 62), 'lemma': 'sutedja'}, {'id': 10, 'text': '.', 'upos': 'PUNCT', 'xpos': 'Z--', 'head': 2, 'deprel': 'punct', 'dspan': (62, 63), 'span': (62, 63), 'lemma': '.'}], 'dspan': (0, 63)}], 'lang': 'indonesian'}

Expected output:

Words like 'menggantikan' should be reduced to their lemma.

For example, here is the lemmatization from aksara:
1 Ia ia PRON _ Number=Sing|Person=3|PronType=Prs _ _ _ _
2 menjadi jadi VERB _ Voice=Act _ _ _ _
3 Gubernur Gubernur PROPN _ _ _ _ _ _
4 Bali Bali PROPN _ _ _ _ _ _
5 menggantikan ganti VERB _ Voice=Act _ _ _ _
6 Anak Anak PROPN _ _ _ _ _ _
7 Agung Agung PROPN _ _ _ _ _ _
8 Bagus Bagus PROPN _ _ _ _ _ _
9 Sutedja Sutedja PROPN _ _ _ _ _ SpaceAfter=No
10 . . PUNCT _ _ _ _ _ _

Note how menjadi becomes jadi and menggantikan becomes ganti.

I think this issue is related to the UD GSD data.

I have seen a similar issue in stanza: stanfordnlp/stanza#1003

Thanks for releasing this great tool.

(Ubuntu, python3.9, trankit 1.1.0)

KeyError: 'lemma'

Following the code from https://trankit.readthedocs.io/en/latest/training.html#training-a-lemmatizer I get a KeyError: 'lemma':

Setting up training config...
Initialized lemmatizer trainer
Training dictionary-based lemmatizer

---------------------------------------------------------------------------

KeyError                                  Traceback (most recent call last)

[<ipython-input-9-a90867cc5ef3>](https://localhost:8080/#) in <module>()
     11 
     12 # start training
---> 13 trainer.train()

3 frames

[/content/trankit/trankit/tpipeline.py](https://localhost:8080/#) in train(self)
    680             self._train_posdep()
    681         elif self._task == 'lemmatize':
--> 682             self._train_lemma()
    683         elif self._task == 'ner':
    684             self._train_ner()

[/content/trankit/trankit/tpipeline.py](https://localhost:8080/#) in _train_lemma(self)
    581 
    582     def _train_lemma(self):
--> 583         self._lemma_model.train()
    584 
    585     def _train_ner(self):

[/content/trankit/trankit/models/lemma_model.py](https://localhost:8080/#) in train(self)
    379             self.config.logger.info("Training dictionary-based lemmatizer")
    380             self.trainer.train_dict(
--> 381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
    383                          type(token[ID]) == tuple and len(token[ID]) == 2)])

[/content/trankit/trankit/models/lemma_model.py](https://localhost:8080/#) in <listcomp>(.0)
    381                 [[token[TEXT], token[UPOS], token[LEMMA]] for sentence in self.train_batch.doc for token in sentence if
    382                  not (
--> 383                          type(token[ID]) == tuple and len(token[ID]) == 2)])
    384             dev_preds = self.trainer.predict_dict(
    385                 [[token[TEXT], token[UPOS]] for sentence in self.dev_batch.doc for token in sentence if

KeyError: 'lemma'

The recent version from https://github.com/UniversalDependencies/UD_Thai-PUD is used as training and development data.

Performance issue on cpu only machines

Hi all,
I'm trying to run Trankit on a server which doesn't have a GPU available, and it seems about 10x slower than, for example, the Stanford tools for NER and tokenization (roughly 0.5 s vs. 5 s on an English NER task; the corpus is about 2,000 tokens).

I was wondering if

  1. loading a different model other than RoBERTa-base, such as DistilRoBERTa-base, could help?

  2. running Trankit in manual mode could help too?

Do you have any other suggestions on how to enhance performance on CPU-only servers? Thanks!

Also, at the start of model loading, this message appears twice:
Loading pretrained XLM-Roberta, this may take a while...
Loading pretrained XLM-Roberta, this may take a while...
Is something wrong with my implementation, or does it work as intended? Thanks!

Installation failing for Python 3.9 on Windows

I have trankit running on an old laptop with Anaconda (required me to downgrade hdf5, but it works), but I cannot get it working on my new laptop for the life of me. It's a brand-new system, so there are no old versions of anything interfering. I have tried both pip and installing from source, and neither works.

System Info:
Windows 10
Python 3.9.1
VisualStudio Build Tools C++ 2019

What I did:

  • attempted to install with pip, ran into dependency error with rust.
  • did some brief reading, installed Visual Studio Build Tools for C++ so that I could install rust. Also installed rust.
  • am now running into the error below:
...
Building wheels for collected packages: sentencepiece, tokenizers
  Building wheel for sentencepiece (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: 'c:\users\tbehr\appdata\local\programs\python\python39\python.exe' -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\tbehr\\AppData\\Local\\Temp\\pip-install-rxqny9hb\\sentencepiece_327f06df01b54e21a430f9036c18b342\\setup.py'"'"'; __file__='"'"'C:\\Users\\tbehr\\AppData\\Local\\Temp\\pip-install-rxqny9hb\\sentencepiece_327f06df01b54e21a430f9036c18b342\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d 'C:\Users\tbehr\AppData\Local\Temp\pip-wheel-k4b2fb39'
       cwd: C:\Users\tbehr\AppData\Local\Temp\pip-install-rxqny9hb\sentencepiece_327f06df01b54e21a430f9036c18b342\
  Complete output (15 lines):
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build\lib.win-amd64-3.9
  copying sentencepiece.py -> build\lib.win-amd64-3.9
  running build_ext
  building '_sentencepiece' extension
  creating build\temp.win-amd64-3.9
  creating build\temp.win-amd64-3.9\Release
  C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29333\bin\HostX86\x64\cl.exe /c /nologo /Ox /W3 /GL /DNDEBUG /MD -Ic:\users\tbehr\appdata\local\programs\python\python39\include -Ic:\users\tbehr\appdata\local\programs\python\python39\include -IC:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Tools\MSVC\14.28.29333\include -IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\ucrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\shared -IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\um -IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\winrt -IC:\Program Files (x86)\Windows Kits\10\include\10.0.18362.0\cppwinrt /EHsc /Tpsentencepiece_wrap.cxx /Fobuild\temp.win-amd64-3.9\Release\sentencepiece_wrap.obj /MT /I..\build\root\include
  cl : Command line warning D9025 : overriding '/MD' with '/MT'
  sentencepiece_wrap.cxx
  sentencepiece_wrap.cxx(2777): fatal error C1083: Cannot open include file: 'sentencepiece_processor.h': No such file or directory
  error: command 'C:\\Program Files (x86)\\Microsoft Visual Studio\\2019\\BuildTools\\VC\\Tools\\MSVC\\14.28.29333\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2
  ----------------------------------------
  ERROR: Failed building wheel for sentencepiece 

Tried again in a new, empty virtualenv and got a slightly different error, also when building sentencepiece. (Also tried cloning the repo and installing from source... got an encoding error, but I suspect that's a Microsoft/different issue, so I'm not going to worry about that right now.)

Based on this issue, I think this is because trankit forces installation of a lower version of sentencepiece (my old laptop has v0.1.91), which doesn't work for Python 3.9. v0.1.95 of sentencepiece does, and I can install that separately, but when pip-installing trankit, it uninstalls v0.1.95 and tries to install the lower version that doesn't work.

Is there a workaround for this other than using a lower version of Python? (I guess that's not a major crisis, but it's easier not to have to switch back and forth between Python 3.7 and 3.9.) Or is there a way to update the setup files to reflect the newer version of sentencepiece?

ETA: I have this working successfully in a venv with Python 3.7, so this does appear to be an issue with 3.9 only.
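
One possible workaround, untested and based purely on the diagnosis above: install a compatible sentencepiece first, then install trankit without letting pip downgrade it:

pip install sentencepiece==0.1.95
pip install trankit --no-deps
# then pip-install the remaining dependencies from trankit's setup.py manually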

Can we load model with transformer library?

Hi @minhvannguyen ,
Can I load the model using this coding style?
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("trankit")
tokenizer = AutoTokenizer.from_pretrained("trankit", use_fast=False)

Thank you.

switch base model from xlm-roberta-base

Hi, I'm working with historical Swedish text and want to train a sentence segmenter for it (historical Swedish text is very inconsistent with punctuation, capital letters, etc.). I've finetuned a Swedish BERT model on a historical text corpus, and now I want to use this model instead of xlm-roberta-base when training the sentence segmenter. I've tried changing the base model so that it loads the BERT model instead, but I get a mismatch in parameters. Can you give me some tips on what I have to do to change the base_model from xlm-roberta-base to my finetuned historical BERT model?

Best regards, and thanks for a great repo!

Difficulties in reproducing the GermEval14 NER model

Hi again @minhvannguyen,

I am sorry to bother you once again but I was wondering whether you could provide a bit more information on how might one reproduce the trankit results on GermEval14, which are presented in the trankit paper.

Based on your suggestion in #6, I tried to train a trankit-based NER model on the GermEval14 data by directly passing it to trankit.TPipeline. You can find the (very simple) code that sets up the environment, prepares the data, and trains the model in the following Colab.

In the paper, Table 3 reports the test F1 score on this dataset at 86.9, but even after running over 80 training epochs, the best dev F1 score I managed to reach was 1.74, and it does not seem like the evaluation on the test set would produce vastly different results.

Hence, my preliminary conclusion is that I must be doing something wrong. One of the first suspects would be random seeds, but those seem to be fixed, as we can see in the snippet below:

os.environ['PYTHONHASHSEED'] = str(1234)
random.seed(1234)
np.random.seed(1234)
torch.manual_seed(1234)
torch.cuda.manual_seed(1234)
torch.cuda.manual_seed_all(1234)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

I was therefore wondering whether you could immediately see what I am doing wrong here, or generally provide some pointers that could be helpful in reproducing the results listed in the paper.

Thanks!

Parse error of Italian

I used the Italian model to predict the dependency tree and obtained the following result:

1	Il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	2	det	_
2	termine	termine	NOUN	S	Gender=Masc|Number=Sing	8	nsubj:pass	_	_
3	"	"	PUNCT	FB	_	4	punct	_	_
4	Tathāgata	Tathāgata	PROPN	SP	_	2	nmod	_	_
5	"	"	PUNCT	FB	_	4	punct	_	_
6	può	potere	AUX	VM	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	8	aux	_
7	essere	essere	AUX	VA	VerbForm=Inf	8	aux:pass	_	_
8	letto	leggere	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_
9	come	come	ADP	E	_	11	case	_	_
10	"	"	PUNCT	FB	_	11	punct	_	_
11	tathā-gata	tathā-gata	NOUN	S	Gender=Fem|Number=Sing	8	obl	_	_
12	"	"	PUNCT	FB	_	11	punct	_	_
13	o	o	CCONJ	CC	_	16	cc	_	_
14	come	come	ADP	E	_	16	case	_	_
15	"	"	PUNCT	FB	_	16	punct	_	_
16	Tathā-āgata	Tathā-āgata	PROPN	SP	_	11	conj	_	_
17	"	"	PUNCT	FB	_	16	punct	_	_
18	,	,	PUNCT	FF	_	16	punct	_	_
19	dove	dove	ADV	B	_	22	advmod	_	_
20	il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	21	det	_
21	primo	primo	ADJ	NO	Gender=Masc|Number=Sing|NumType=Ord	22	nsubj	_	_
22	significa	significare	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	16	acl:relcl	_	_
23	"	"	PUNCT	FB	_	25	punct	_	_
24	così	così	ADV	B	_	25	advmod	_	_
25	andato	andare	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	22	xcomp	_
26	"	"	PUNCT	FB	_	25	punct	_	_
27	mentre	mentre	CCONJ	CC	_	30	cc	_	_
28	il	il	DET	RD	Definite=Def|Gender=Masc|Number=Sing|PronType=Art	29	det	_
29	secondo	secondo	ADJ	NO	Gender=Masc|Number=Sing|NumType=Ord	30	nsubj	_	_
30	significa	significare	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin	22	conj	_	_
31	"	"	PUNCT	FB	_	32	punct	_	_
32	così venuto	così venuto	ADV	B	_	30	advmod	_	_
33	"	"	PUNCT	FB	_	32	punct	_	_
34	.	.	PUNCT	FS	_	8	punct	_	_

I think line 32 is invalid because it contains a space within one token.

What is curious is that in another sentence containing 'così venuto', these two words are treated as separate tokens:

1	Così	così	ADV	B	_	2	advmod	_	_
2	venuto	venire	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	0	root	_	_
3	/	/	PUNCT	FF	_	2	punct	_	_
4	Così	così	ADV	B	_	5	advmod	_	_
5	andato	andare	VERB	V	Gender=Masc|Number=Sing|Tense=Past|VerbForm=Part	2	conj	_	_
6	.	.	PUNCT	FS	_	2	punct	_	_

Is this a bug? I'd appreciate it if you could investigate this issue.

CoNLLU output

Hi,

Very useful tool.
Is there a way to get the output in CoNLL-U format, or do you have a script to convert from JSON to CoNLL-U?

Sarves
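
A minimal conversion sketch, assuming the token fields shown in Trankit's documented output (id, text, lemma, upos, xpos, feats, head, deprel); missing fields fall back to '_', and the DEPS/MISC columns are left empty:

def to_conllu(doc):
    fields = ('id', 'text', 'lemma', 'upos', 'xpos', 'feats', 'head', 'deprel')
    lines = []
    for sent in doc['sentences']:
        for tok in sent['tokens']:
            # 8 annotated columns plus empty DEPS and MISC columns
            lines.append('\t'.join(str(tok.get(f, '_')) for f in fields) + '\t_\t_')
        lines.append('')  # blank line terminates each sentence
    return '\n'.join(lines)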

em dash character crashes French pipeline

I tested trankit with the base and large models using the French pipeline, and the em dash (Unicode character 8212) causes the model to crash. The online demo seems to have the same problem. A quick replace on the input string to change it to a hyphen avoids this issue. I did not test the three other types of dashes, nor other languages.
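
A sketch of the replace workaround mentioned above (U+2014 is the em dash):

from trankit import Pipeline

p = Pipeline(lang='french')
text = 'Un exemple \u2014 avec un tiret cadratin.'
safe = text.replace('\u2014', '-')  # swap em dashes for plain hyphens before processing
output = p(safe)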

Understanding the token fields

The documentation mentions some information about each of the token fields:

{
  'text': 'Hello! This is Trankit.',  # input string
  'sentences': [ # list of sentences
    {
      'id': 2,  # sentence index
      'text': 'This is Trankit.',  'dspan': (7, 23), # sentence span
      'tokens': [ # list of tokens
        {
          'id': 1, # token index
          'text': 'This',  # text form of the token
          'upos': 'PRON',  # UPOS tag of the token
          'xpos': 'DT',    # XPOS tag of the token
          'feats': 'Number=Sing|PronType=Dem', # morphological feature of the token
          'head': 3,  # index of the head token
          'deprel': 'nsubj', # dependency relation for the token
          'dspan': (7, 11), # document-level span of the token
          'span': (0, 4) # sentence-level span of the token
        },
       ...

But finding information about what the xpos and deprel fields mean is difficult. upos is a bit easier to find online. What are the valid values? Is there documentation that could be linked to that describes them?

Thanks!

Edit: I just want to add that I found the UPOS info here: https://universaldependencies.org/u/pos/index.html, but it specifies that XPOS is specific to each language. How should I go about finding more on that?

Torch version issue

I was having an issue using trankit with my GPU since I had an incompatible version of pytorch (1.9.0+cu111).

trankit currently requires torch<1.8.0,>=1.6.0.

Is there a reason for this dependency lock or could it be expanded to include torch==1.9.0? I've built from source with 1.9.0 and everything seems to be working. I'd be happy to make a PR with the version bump.

with version 1.1.0 I get ModuleNotFoundError: No module named 'transformers'

Hi,
I just have a fresh installation (version 1.1.0) and when running it (I'm testing the sentence segmentation) I get the following error:

[...]
    from trankit import Pipeline
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/__init__.py", line 1, in <module>
    from .pipeline import Pipeline
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/pipeline.py", line 2, in <module>
    from .models.base_models import Multilingual_Embedding
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/models/__init__.py", line 1, in <module>
    from .classifiers import *
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/models/classifiers.py", line 2, in <module>
    from .base_models import *
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/models/base_models.py", line 1, in <module>
    from ..adapter_transformers import AdapterType, XLMRobertaModel
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/adapter_transformers/__init__.py", line 562, in <module>
    from .modeling_tf_electra import (
  File "/home/Luca/.conda/envs/ss/lib/python3.7/site-packages/trankit/adapter_transformers/modeling_tf_electra.py", line 5, in <module>
    from transformers import ElectraConfig
ModuleNotFoundError: No module named 'transformers'

These are the libraries that I have installed:

Package                Version
---------------------- -------------------
absl-py                0.13.0
astunparse             1.6.3
blingfire              0.1.3
blis                   0.7.4
cachetools             4.2.2
catalogue              1.0.0
certifi                2021.5.30
chardet                4.0.0
click                  8.0.1
cymem                  2.0.5
de-core-news-lg        2.3.0
de-core-news-sm        2.3.0
deepsegment            2.3.1
en-core-sci-lg         0.3.0
en-core-sci-sm         0.3.0
en-core-web-lg         2.3.1
en-core-web-sm         2.3.1
filelock               3.3.1
fr-core-news-lg        2.3.0
fr-core-news-sm        2.3.0
gast                   0.3.3
gensim                 3.8.3
google-auth            1.32.0
google-auth-oauthlib   0.4.4
google-pasta           0.2.0
grpcio                 1.38.1
h5py                   2.10.0
idna                   2.10
importlib-metadata     4.5.0
joblib                 1.0.1
JPype1                 1.3.0
Keras                  2.3.1
Keras-Applications     1.0.8
Keras-Preprocessing    1.1.2
langid                 1.1.6
Markdown               3.3.4
murmurhash             1.0.5
nltk                   3.5
nnsplit                0.5.7.post0
numpy                  1.21.0
oauthlib               3.1.1
onnxruntime            1.7.0
opt-einsum             3.3.0
packaging              21.0
pip                    21.1.2
plac                   1.1.3
preshed                3.0.5
progressbar2           3.53.1
protobuf               3.17.3
pyasn1                 0.4.8
pyasn1-modules         0.2.8
pydload                1.0.9
pyparsing              3.0.3
pysbd                  0.3.3
python-utils           2.5.6
PyYAML                 5.4.1
regex                  2021.8.28
requests               2.25.1
requests-oauthlib      1.3.0
rsa                    4.7.2
sacremoses             0.0.46
scikit-learn           0.24.2
scipy                  1.4.1
sentence-splitter      1.4
sentencepiece          0.1.96
seqeval                0.0.3
seqtag-keras           1.0.6
setuptools             52.0.0.post20210125
six                    1.16.0
smart-open             5.1.0
spacy                  2.3.4
srsly                  1.0.5
stanza                 1.1.1
syntok                 1.3.1
tensorboard            2.2.2
tensorboard-plugin-wit 1.8.0
tensorflow             2.2.0
tensorflow-estimator   2.2.0
termcolor              1.1.0
thinc                  7.4.5
threadpoolctl          2.1.0
tokenizers             0.10.3
torch                  1.7.1
tqdm                   4.61.1
trankit                1.1.0
typing-extensions      3.10.0.0
urllib3                1.26.5
wasabi                 0.8.2
Werkzeug               2.0.1
wheel                  0.36.2
wrapt                  1.12.1
zipp                   3.4.1

I installed transformers manually (#32) but it didn't seem to help:

Traceback (most recent call last):
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2150, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/modeling_tf_utils.py", line 30, in <module>
    from tensorflow.python.keras.engine.keras_tensor import KerasTensor
ModuleNotFoundError: No module named 'tensorflow.python.keras.engine.keras_tensor'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2150, in _get_module
    return importlib.import_module("." + module_name, self.__name__)
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 953, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "<frozen importlib._bootstrap>", line 1006, in _gcd_import
  File "<frozen importlib._bootstrap>", line 983, in _find_and_load
  File "<frozen importlib._bootstrap>", line 967, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 677, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 728, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/models/__init__.py", line 19, in <module>
    from . import (
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/models/layoutlm/__init__.py", line 22, in <module>
    from .configuration_layoutlm import LAYOUTLM_PRETRAINED_CONFIG_ARCHIVE_MAP, LayoutLMConfig
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/models/layoutlm/configuration_layoutlm.py", line 22, in <module>
    from ...onnx import OnnxConfig, PatchingSpec
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/onnx/__init__.py", line 17, in <module>
    from .convert import export, validate_model_outputs
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/onnx/convert.py", line 23, in <module>
    from .. import PreTrainedModel, PreTrainedTokenizer, TensorType, TFPreTrainedModel, is_torch_available
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2140, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2154, in _get_module
    ) from e
RuntimeError: Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
No module named 'tensorflow.python.keras.engine.keras_tensor'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "segmentation_study/metrics.py", line 6, in <module>
    from segmenters import *
  File "/Users/lfoppiano/development/projects/sentence-segmentation/sentence-segmentation/segmentation_study/segmenters.py", line 10, in <module>
    from trankit import Pipeline
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/__init__.py", line 1, in <module>
    from .pipeline import Pipeline
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/pipeline.py", line 2, in <module>
    from .models.base_models import Multilingual_Embedding
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/models/__init__.py", line 1, in <module>
    from .classifiers import *
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/models/classifiers.py", line 2, in <module>
    from .base_models import *
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/models/base_models.py", line 1, in <module>
    from ..adapter_transformers import AdapterType, XLMRobertaModel
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/adapter_transformers/__init__.py", line 562, in <module>
    from .modeling_tf_electra import (
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/trankit/adapter_transformers/modeling_tf_electra.py", line 5, in <module>
    from transformers import ElectraConfig
  File "<frozen importlib._bootstrap>", line 1032, in _handle_fromlist
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2140, in __getattr__
    module = self._get_module(self._class_to_module[name])
  File "/Users/lfoppiano/opt/anaconda3/envs/ss/lib/python3.7/site-packages/transformers/file_utils.py", line 2154, in _get_module
    ) from e
RuntimeError: Failed to import transformers.models.electra because of the following error (look up to see its traceback):
Failed to import transformers.modeling_tf_utils because of the following error (look up to see its traceback):
No module named 'tensorflow.python.keras.engine.keras_tensor'

Update:

After I uninstalled transformers and installed adapter-transformers, it seems to work fine...

Feature request: langID in multilingual pipelines

Thanks for this framework!
It could be worthwhile to add a language identification task to avoid having to call p.set_active(lang). For langID, a very robust, fast, and tiny option could be fastText, or a BERT model (better integration, but computationally intensive). So the multilingual pipeline would become:

from trankit import Pipeline
p = Pipeline(lang='auto', gpu=True, cache_dir='./cache') # auto means language identification active
p.tokenize('Rich was here before the scheduled time.')
p.ner('وكان كنعان قبل ذلك رئيس جهاز الامن والاستطلاع للقوات السورية العاملة في لبنان.')

Finetuning Trankit for POS on a new language

Hi,
I've looked through the documentation on building a customized pipeline and through the example for NER for German. However, I do not understand whether it is finetuning, that is, starting from an already pretrained Trankit model and adding a new language, or training from scratch.

In particular, I am interested in adding a new language, say Thai, for part-of-speech tagging. My plan is to download th_pud-ud-test.conllu from https://github.com/UniversalDependencies/UD_Thai-PUD and finetune the existing model to accept this language as well.

1 - I am not sure where to start from
2 - whether I will keep the quality of POS for other languages
3 - how to make this retrained/finetuned model with Thai language be available in "Auto Mode for multilingual pipelines"?

Thank you!

Memory leak in Pipeline() on a CPU

Hello,

I've initialized the model like so:

nlp = Pipeline('english', gpu=False, cache_dir='./cache')

Then I call it with:

with torch.no_grad():
    for idx in range(10000):
        nlp.lemmatize('Hello World', is_sent=True)

When running the code, RAM slowly fills up.

I attached a graph of the memory filling up.

I'm using python3.7, trankit=1.1.0, torch=1.7.1.

Thank you!

Running Evals on Sentence Segmentation

Very neat project! I am trying to understand the quality of trankit's sentence segmentation. I see the evals here: https://trankit.readthedocs.io/en/latest/performance.html#universal-dependencies-v2-5, but the results aren't very clear to me. Each column is simply a percentage; are these accuracy scores? F1 scores?

Additionally, I'd like to run an eval against NLTK's Punkt sentence segmentation model to see which I should use. Is the code that generates your evals public?

Thanks.

Problem in long sentences?

Hi,

we occasionally have a problem with long sentences.

Traceback (most recent call last):
  File "test_trankit.py", line 25, in <module>
    parsed = p(parse_me)
  File "/home/jesse/.local/lib/python3.7/site-packages/trankit/pipeline.py", line 916, in __call__
    out = self._ner_doc(out)
  File "/home/jesse/.local/lib/python3.7/site-packages/trankit/pipeline.py", line 873, in _ner_doc
    word_reprs, cls_reprs = self._embedding_layers.get_tagger_inputs(batch)
  File "/home/jesse/.local/lib/python3.7/site-packages/trankit/models/base_models.py", line 68, in get_tagger_inputs
    word_lens=batch.word_lens
  File "/home/jesse/.local/lib/python3.7/site-packages/trankit/models/base_models.py", line 43, in encode_words
    idxs) * masks  # this might cause non-deterministic results during training, consider using `compute_word_reps_avg` in that case

Example code:

from trankit import Pipeline
import json


p = Pipeline(lang='french', gpu=False, cache_dir='./cache')

######## document-level processing ########

sentences = [
['Bacquelaine', 'Daniel', ',', 'Battheu', 'Sabien', ',', 'Bauchau', 'Marie', ',', 'Becq', 'Sonja', ',', 'Ben', 'Hamou', 'Nawal', ',', 'Blanchart', 'Philippe', ',', 'Bogaert', 'Hendrik', ',', 'Bonte', 'Hans', ',', 'Brotcorne', 'Christian', ',', 'Burton', 'Emmanuel', ',', 'Caprasse', 'Véronique', ',', 'Ceysens', 'Patricia', ',', 'Clarinval', 'David', ',', 'Daerden', 'Frédéric', ',', 'De', 'Block', 'Maggie', ',', 'De', 'Coninck', 'Monica', ',', 'De', 'Crem', 'Pieter', ',', 'De', 'Croo', 'Alexander', ',', 'Delannois', 'Paul-Olivier', ',', 'Delizée', 'Jean-Marc', ',', 'Delpérée', 'Francis', ',', 'Demeyer', 'Willy', ',', 'Demon', 'Franky', ',', 'Deseyn', 'Roel', ',', 'Detiège', 'Maya', ',', 'Dewael', 'Patrick', ',', 'Dierick', 'Leen', ',', 'Di', 'Rupo', 'Elio', ',', 'Dispa', 'Benoît', ',', 'Ducarme', 'Denis', ',', 'Fernandez', 'Fernandez', 'Julia', ',', 'Flahaut', 'André', ',', 'Flahaux', 'Jean-Jacques', ',', 'Fonck', 'Catherine', ',', 'Foret', 'Gilles', ',', 'Frédéric', 'André', ',', 'Fremault', 'Céline', ',', 'Friart', 'Benoît', ',', 'Geens', 'Koenraad', ',', 'Geerts', 'David', ',', 'Goffin', 'Philippe', ',', 'Grovonius', 'Gwenaelle', ',', 'Heeren', 'Veerle', ',', 'Jadin', 'Kattrin', ',', 'Jiroflée', 'Karin', ',', 'Kir', 'Emir', ',', 'Kitir', 'Meryame', ',', 'Laaouej', 'Ahmed', ',', 'Lachaert', 'Egbert', ',', 'Lalieux', 'Karine', ',', 'Lanjri', 'Nahima', ',', 'Lijnen', 'Nele', ',', 'Lutgen', 'Benoît', ',', 'Mailleux', 'Caroline', ',', 'Maingain', 'Olivier', ',', 'Marghem', 'Marie-Christine', ',', 'Massin', 'Eric', ',', 'Mathot', 'Alain', ',', 'Matz', 'Vanessa', ',', 'Michel', 'Charles', ',', 'Muylle', 'Nathalie', ',', 'Onkelinx', 'Laurette', ',', 'Özen', 'Özlem', ',', 'Pehlivan', 'Fatma', ',', 'Piedboeuf', 'Benoit', ',', 'Pirlot', 'Sébastian', ',', 'Pivin', 'Philippe', ',', 'Poncelet', 'Isabelle', ',', 'Reynders', 'Didier', ',', 'Schepmans', 'Françoise', ',', 'Senesael', 'Daniel', ',', 'Smaers', 'Griet', ',', 'Somers', 'Ine', ',', 'Temmerman', 'Karin', ',', 'Terwingen', 'Raf', ',', 'Thiébaut', 'Eric', ',', 'Thiéry', 'Damien', ',', 'Thoron', 'Stéphanie', ',', 'Top', 'Alain', ',', 'Turtelboom', 'Annemie', ',', 'Van', 'Biesen', 'Luk', ',', 'Van', 'Cauter', 'Carina', ',', 'Vande', 'Lanotte', 'Johan', ',', 'Van', 'den', 'Bergh', 'Jef', ',', 'Vandenput', 'Tim', ',', 'Van', 'der', 'Maelen', 'Dirk', ',', 'Vanheste', 'Ann', ',', 'Van', 'Mechelen', 'Dirk', ',', 'Van', 'Quickenborne', 'Vincent', ',', 'Van', 'Rompuy', 'Eric', ',', 'Vanvelthoven', 'Peter', ',', 'Vercamer', 'Stefaan', ',', 'Verherstraeten', 'Servais', ',', 'Wathelet', 'Melchior', ',', 'Winckel', 'Fabienne', ',', 'Yüksel', 'Veli'],

['HR', 'Rail', 'organise', 'des', 'actions', 'pour', 'attirer', 'un', 'maximum', 'de', 'candidats', 'vers', 'le', 'métier', 'du', 'rail.', 'À', 'ce', 'titre', ',', 'elle', 'organise', 'des', 'dizaines', 'de', 'job', 'days', ',', 'participe', 'à', 'plusieurs', 'dizaines', 'de', 'salons', 'de', "l'", 'emploi', ',', 'organise', 'énormément', 'de', 'visites', "d'", 'écoles', 'et', 'amène', 'un', 'grand', 'nombre', "d'", 'étudiants', 'à', 'visiter', 'les', 'ateliers', 'de', 'la', 'SNCB', 'et', "d'", 'Infrabel', ',', 'met', 'sur', 'pied', 'des', 'concours', ',', 'est', 'présente', 'dans', 'les', 'médias', 'sociaux', '(', 'LinkedIn', ',', 'Facebook', ',', 'etc', '.)', 'ainsi', 'que', 'dans', 'les', 'médias', 'classiques', '(', 'à', 'la', 'télévision', 'et', 'dans', 'les', 'cinémas', 'en', 'Flandre', '),', 'lance', 'des', 'actions', 'telles', 'que', 'Refer', 'a', 'friend', ',', 'a', 'lancé', 'début', '2016', ',', 'en', 'collaboration', 'avec', 'les', 'services', 'Communication', 'de', 'la', 'SNCB', 'et', "d'", 'Infrabel', ',', 'une', 'toute', 'nouvelle', 'campagne', "d'", 'image', '"', 'Hier', 'ton', 'rêve', ',', "aujourd'", 'hui', 'ton', 'job', '",', 'réactualise', 'son', 'site', 'internet', 'dédié', 'au', 'recrutement', '(', 'www.lescheminsdeferengagent.be', '),', 'a', 'développé', 'un', 'simulateur', 'de', 'train', 'et', 'de', 'train', 'technique', 'utilisé', 'lors', 'des', 'job', 'events', 'et', 'disponible', 'sur', 'le', 'site', 'internet', 'en', 'tant', "qu'", 'application', 'Android', 'et', 'IOS', ',', 'participe', 'à', 'différents', 'projets', 'de', 'formation', 'avec', 'le', 'VDAB', 'et', 'le', 'FOREM', ',', 'a', 'organisé', 'différentes', 'actions', "d'", 'été', 'dans', 'les', 'gares', 'pour', 'sensibiliser', 'le', 'public', 'aux', 'métiers', 'ferroviaires', ',', 'développe', 'des', 'actions', 'en', 'faveur', 'de', 'la', 'diversité', ',', 'a', 'lancé', 'le', 'pelliculage', 'de', 'certains', 'trains', 'en', 'faveur', 'de', 'son', 'site', 'internet', 'et', 'de', 'son', 'recrutement', ',', 'organisera', 'début', '2017', 'le', 'train', 'de', "l'", 'emploi', '.'],

['Les', 'données', 'de', 'la', 'banque', 'transmises', 'aux', 'équipes', 'de', 'recherche', 'sont', 'le', 'numéro', 'du', 'dossier', ',', 'la', 'langue', ',', "l'", 'âge', 'du', 'patient', ',', 'le', 'sexe', 'du', 'patient', ',', 'le', 'lieu', 'du', 'décès', '(', 'à', 'domicile', ',', 'à', "l'", 'hôpital', ',', 'dans', 'une', 'maison', 'de', 'repos', 'et', 'de', 'soins', 'ou', 'autre', '),', 'la', 'base', 'de', "l'", 'euthanasie', '(', 'demande', 'actuelle', 'ou', 'déclaration', 'anticipée', '),', 'la', 'catégorie', "d'", 'affection', 'selon', 'la', 'classification', 'de', "l'", 'OMS', ',', 'le', 'code', 'ICD-10', '(', 'par', 'exemple', ',', 'tumeur', '),', 'la', 'sous-catégorie', "d'", 'affection', 'à', 'la', 'base', 'de', 'la', 'demande', "d'", 'euthanasie', ',', 'selon', 'la', 'classification', 'de', "l'", 'OMS', '(', 'par', 'exemple', ',', 'tumeur', 'maligne', 'du', 'sein', '),', "l'", 'information', 'complémentaire', '(', 'présence', 'de', 'métastases', ',', 'de', 'dépression', ',', 'de', 'cancer', '),', "l'", 'échéance', 'de', 'décès', '(', 'bref', 'ou', 'non', 'bref', '),', 'la', 'qualification', 'du', 'premier', 'médecin', 'consulté', 'dans', 'tous', 'les', 'cas', '(', 'un', 'généraliste', ',', 'un', 'spécialiste', ',', 'un', 'médecin', 'palliatif', '),', 'la', 'qualification', 'du', 'second', 'médecin', 'consulté', 'en', 'cas', 'de', 'décès', ',', 'non', 'prévu', 'à', 'brève', 'échéance', '(', 'psychiatre', 'ou', 'spécialiste', '),', "l'", 'autre', 'personne', 'ou', "l'", 'instance', 'consultée', '(', 'médecin', 'ou', 'psychologue', ',', "l'", 'équipe', 'palliative', 'ou', 'autre', '),', 'le', 'type', 'de', 'souffrance', '(', 'psychique', 'ou', 'physique', '),', 'la', 'méthode', 'et', 'les', 'produits', 'utilisés', '(', 'le', 'thiopental', 'seul', ',', 'le', 'thiopental', 'avec', 'le', 'curare', ',', 'des', 'barbituriques', 'ou', 'autres', 'médicaments', '),', 'la', 'décision', 'de', 'la', 'Commission', '(', 'ouverture', 'pour', 'remarques', ',', 'ou', 'pour', 'plus', "d'", 'informations', 'sur', 'les', 'conditions', 'ou', 'la', 'procédure', 'suivie', '),', 'la', 'transmission', 'ou', 'non', 'à', 'la', 'justice', '.'],

['Monsieur', 'le', 'ministre', ',', 'l’', 'article', '207', ',', 'alinéa', '7', 'du', 'Code', 'des', 'impôts', 'sur', 'les', 'revenus', '(', 'CIR', ')', 'mentionne', 'qu’', 'aucune', 'de', 'ces', 'déductions', 'ou', 'compensations', 'avec', 'la', 'perte', 'de', 'la', 'période', 'imposable', 'ne', 'peut', 'être', 'opérée', 'sur', 'la', 'partie', 'du', 'résultat', 'qui', 'provient', "d'", 'avantages', 'anormaux', 'ou', 'bénévoles', 'visés', 'à', "l'", 'article', '79', ',', 'ni', 'sur', 'les', 'avantages', 'financiers', 'ou', 'de', 'toute', 'nature', 'reçus', 'visés', 'à', "l'", 'article', '53', ',', '24°', ',', 'ni', 'sur', "l'", 'assiette', 'de', 'la', 'cotisation', 'distincte', 'spéciale', 'établie', 'sur', 'les', 'dépenses', 'ou', 'les', 'avantages', 'de', 'toute', 'nature', 'non', 'justifiés', ',', 'conformément', 'à', "l'", 'article', '219', ',', 'ni', 'sur', 'la', 'partie', 'des', 'bénéfices', 'qui', 'sont', 'affectés', 'aux', 'dépenses', 'visées', 'à', "l'", 'article', '198', ',', '§', '1er', ',', '9°', ',', '9°', 'bis', 'et', '12°', ',', 'ni', 'sur', 'la', 'partie', 'des', 'bénéfices', 'provenant', 'du', 'non-respect', 'de', "l'", 'article', '194quater', ',', '§', '2', ',', 'alinéa', '4', 'et', 'de', "l'", 'application', 'de', "l'", 'article', '194quater', ',', '§', '4', ',', 'ni', 'sur', 'les', 'dividendes', 'visés', 'à', "l'", 'article', '219ter', ',', 'ni', 'sur', 'la', 'partie', 'du', 'résultat', 'qui', 'fait', "l'", 'objet', "d'", 'une', 'rectification', 'de', 'la', 'déclaration', 'visée', 'à', "l'", 'article', '346', 'ou', "d'", 'une', 'imposition', "d'", 'office', 'visée', 'à', "l'", 'article', '351', 'pour', 'laquelle', 'des', 'accroissements', "d'", 'un', 'pourcentage', 'égal', 'ou', 'supérieur', 'à', '10', '%', 'visés', 'à', "l'", 'article', '444', 'sont', 'effectivement', 'appliqués', ',', 'à', "l'", 'exception', 'dans', 'ce', 'dernier', 'cas', 'des', 'revenus', 'déductibles', 'conformément', 'à', "l'", 'article', '205', ',', '§', '2', '.'],
]

for s in sentences:
  print(" ".join(s))
  parse_me = [s]
  parsed = p(parse_me)

ImportError: cannot import name 'Pipeline' from partially initialized module 'trankit'

Hi. After installing trankit using pip, I ran this simple code to test trankit:

from trankit import Pipeline

p = Pipeline('auto')
output = p.tokenize('xxxxx')

And I see this error:
Traceback (most recent call last):
File "/Users/jiluyang/Desktop/trankit.py", line 1, in
from trankit import Pipeline
File "/Users/jiluyang/Desktop/trankit.py", line 1, in
from trankit import Pipeline
ImportError: cannot import name 'Pipeline' from partially initialized module 'trankit' (most likely due to a circular import) (/Users/jiluyang/Desktop/trankit.py)
So why does this occur, and how can I fix it? I have tried trankit 1.1.0 and 1.1.1, and the error occurs in both versions. Thanks!

Vietnamese word tokenization seems broken at certain situations

This is not too common a case, but it sometimes happens.

Original sentence:
"nhắn vào số của ngõ vạn phúc là gửi lịch sử thanh toán không"
Tokenized sentence (for convenience, phonetics assigned to one word are linked with underscores):
"nhắn vào số của ngo ~_vạn_ph ú c là gửi lịc h_sử_thanh_t oán_kh ông"

As you can see, the phonetics themselves were oddly split apart, and even the tone marks got split off.

Also, when I tried to recreate this on the demo, it just popped up "Connection error!" (as in the screenshot), but the default example still works, indicating that the demo system still functions properly on normal cases, just not this one.

Please look into this problem. Thank you.


Can't load customized pipeline

Hi! I trained a customized pipeline for customized-mwt on my data, and I get this message when verifying it:

Training done
Customized pipeline is ready to use!
It can be initialized as follows:
-----------------------------------
from trankit import Pipeline
p = Pipeline(lang='customized-mwt', cache_dir='./save_dir')

However, when I try to use it, I receive the following error:

  File "aug_trankit.py", line 70, in <module>
    p = Pipeline(lang='customized-mwt', cache_dir='./save_dir')
  File "/home/uaparsers/trankit/trankit/pipeline.py", line 78, in __init__
    self._load_vocabs()
  File "/home/uaparsers/trankit/trankit/pipeline.py", line 276, in _load_vocabs
    '{}/{}.vocabs.json'.format(lang, lang))) as f:
FileNotFoundError: [Errno 2] No such file or directory: './save_dir/xlm-roberta-base/customized-mwt/customized-mwt.vocabs.json'

Why do I get this error if the pipeline is verified?
This is my code: https://pastebin.com/2BRmrHYL

"File is not a zip file" error when loading the pretrained model

Hi, I was trying the customized NER tutorial notebook.

When I ran this code:

trankit.verify_customized_pipeline(
    category='customized-ner', # pipeline category
    save_dir='./save_dir_filtered' # directory used for saving models in previous steps
)

It printed "Customized pipeline is ready to use". However, when I loaded the pipeline as instructed, it kept reporting this error message:
/usr/local/lib/python3.7/dist-packages/trankit/utils/base_utils.py in download(cache_dir, language, saved_model_version, embedding_name)
BadZipFile: File is not a zip file.

Can you help me figure out what I missed and how to fix this?

RuntimeError: CUDA error: no kernel image is available for execution on the device

When I run:

p = Pipeline('auto')

>>> from trankit import Pipeline
2022-05-31 18:01:41.938559: I tensorflow/core/util/util.cc:169] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
>>> from trankit import Pipeline
>>> p = Pipeline('auto')
Loading pretrained XLM-Roberta, this may take a while...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.8/dist-packages/trankit/pipeline.py", line 85, in __init__
    self._embedding_layers.half()
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 765, in half
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 578, in _apply
    module._apply(fn)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 601, in _apply
    param_applied = fn(param)
  File "/usr/local/lib/python3.8/dist-packages/torch/nn/modules/module.py", line 765, in <lambda>
    return self._apply(lambda t: t.half() if t.is_floating_point() else t)
RuntimeError: CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

The docker image is nvidia/cuda:11.4.2-cudnn8-runtime-ubuntu18.04.

$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

$ nvidia-smi
NVIDIA-SMI 470.103.01 Driver Version: 470.103.01 CUDA Version: 11.4

$ pip list | grep torch
torch 1.11.0
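A hedged guess at the cause: the installed torch 1.11.0 wheel may have been built without kernels for this GPU's compute capability. Reinstalling a build published for the container's CUDA line sometimes resolves this; whether it applies to this particular GPU is an assumption:

$ pip install --force-reinstall torch==1.11.0+cu113 --extra-index-url https://download.pytorch.org/whl/cu113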

Hebrew link is broken

Hi, thanks for the great resource!

The link to the Hebrew model seems to be broken:

BadZipFile                                Traceback (most recent call last)
Input In [59], in <cell line: 3>()
      1 from trankit import Pipeline
----> 3 p = Pipeline(lang='hebrew', gpu=False, cache_dir='./cache')

File ~/opt/anaconda3/envs/ayalon/lib/python3.9/site-packages/trankit/pipeline.py:70, in Pipeline.__init__(self, lang, cache_dir, gpu, embedding)
     66     assert lang in lang2treebank, '{} has not been supported. Currently supported languages: {}'.format(lang,
     67                                                                                                         list(
     68                                                                                                             lang2treebank.keys()))
     69 # download saved model for initial language
---> 70 download(
     71     cache_dir=self._config._cache_dir,
     72     language=lang,
     73     saved_model_version=saved_model_version,  # manually set this to avoid duplicated storage
     74     embedding_name=master_config.embedding_name
     75 )
     77 # load ALL vocabs
     78 self._load_vocabs()

File ~/opt/anaconda3/envs/ayalon/lib/python3.9/site-packages/trankit/utils/base_utils.py:114, in download(cache_dir, language, saved_model_version, embedding_name)
    112         file.write(data)
    113 progress_bar.close()
--> 114 unzip(lang_dir, '{}.zip'.format(language))
    115 if total_size_in_bytes != 0 and progress_bar.n != total_size_in_bytes:
    116     print("Failed to download saved models for {}!".format(language))

File ~/opt/anaconda3/envs/ayalon/lib/python3.9/site-packages/trankit/utils/base_utils.py:89, in unzip(dir, filename)
     88 def unzip(dir, filename):
---> 89     with zipfile.ZipFile(os.path.join(dir, filename)) as f:
     90         f.extractall(dir)
     91     os.remove(os.path.join(dir, filename))

File ~/opt/anaconda3/envs/ayalon/lib/python3.9/zipfile.py:1266, in ZipFile.__init__(self, file, mode, compression, allowZip64, compresslevel, strict_timestamps)
   1264 try:
   1265     if mode == 'r':
-> 1266         self._RealGetContents()
   1267     elif mode in ('w', 'x'):
   1268         # set the modified flag so central directory gets written
   1269         # even if no files are added to the archive
   1270         self._didModify = True

File ~/opt/anaconda3/envs/ayalon/lib/python3.9/zipfile.py:1333, in ZipFile._RealGetContents(self)
   1331     raise BadZipFile("File is not a zip file")
   1332 if not endrec:
-> 1333     raise BadZipFile("File is not a zip file")
   1334 if self.debug > 1:
   1335     print(endrec)

BadZipFile: File is not a zip file
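A possible workaround, assuming the cached archive is a corrupt or interrupted download rather than a server-side problem. The cache layout below is inferred from the paths in the tracebacks on this page, so treat it as an assumption:

# Remove the possibly corrupt cached download and let trankit fetch it again.
import shutil

shutil.rmtree('./cache/xlm-roberta-base/hebrew', ignore_errors=True)

from trankit import Pipeline

p = Pipeline(lang='hebrew', gpu=False, cache_dir='./cache')  # re-downloads the archive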

Import error after fresh install

I'm having some trouble installing trankit.

I created a new venv on Python 3.8:

$ pip install trankit
$ python
Python 3.8.6 (default, Oct 21 2020, 08:28:24)
[Clang 11.0.0 (clang-1100.0.33.12)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from trankit import Pipeline
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "vtrankit/lib/python3.8/site-packages/trankit/__init__.py", line 1, in <module>
    from .pipeline import Pipeline
  File "vtrankit/lib/python3.8/site-packages/trankit/pipeline.py", line 2, in <module>
    from .models.base_models import Multilingual_Embedding
  File "vtrankit/lib/python3.8/site-packages/trankit/models/__init__.py", line 1, in <module>
    from .classifiers import *
  File "vtrankit/lib/python3.8/site-packages/trankit/models/classifiers.py", line 2, in <module>
    from .base_models import *
  File "vtrankit/lib/python3.8/site-packages/trankit/models/base_models.py", line 1, in <module>
    from transformers import AdapterType, XLMRobertaModel
  File "vtrankit/lib/python3.8/site-packages/transformers/__init__.py", line 672, in <module>
    from .trainer import Trainer
  File "vtrankit/lib/python3.8/site-packages/transformers/trainer.py", line 69, in <module>
    from .trainer_pt_utils import (
  File "vtrankit/lib/python3.8/site-packages/transformers/trainer_pt_utils.py", line 40, in <module>
    from torch.optim.lr_scheduler import SAVE_STATE_WARNING
ImportError: cannot import name 'SAVE_STATE_WARNING' from 'torch.optim.lr_scheduler' (vtrankit/lib/python3.8/site-packages/torch/optim/lr_scheduler.py)

$ pip freeze
adapter-transformers==1.1.1
certifi==2020.12.5
chardet==4.0.0
click==7.1.2
filelock==3.0.12
idna==2.10
joblib==1.0.1
numpy==1.20.1
packaging==20.9
protobuf==3.15.5
pyparsing==2.4.7
regex==2020.11.13
requests==2.25.1
sacremoses==0.0.43
sentencepiece==0.1.91
six==1.15.0
tokenizers==0.9.3
torch==1.8.0
tqdm==4.58.0
trankit==0.3.5
typing-extensions==3.7.4.3
urllib3==1.26.3

I've been looking around, and the same error has happened elsewhere. I'm not sure what is going on, but it seems my PyTorch version is too new? The setup.py for trankit specifies torch>=1.6.1.
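A hedged reading: SAVE_STATE_WARNING was removed from torch.optim.lr_scheduler in PyTorch 1.8, and the transformers code pulled in by trankit 0.3.5 still imports it, so torch 1.8.0 satisfies the declared torch>=1.6.1 yet breaks at import time. A sketch of a workaround, assuming that diagnosis (the upper bound is mine, not the project's):

$ pip install "torch>=1.6.1,<1.8"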

How can I handle proper nouns that are parsed incorrectly?

When I use trankit to compute dependency parses of chemical-hazard sentences, there are many proper nouns, and trankit sometimes cannot identify them correctly.
e.g.:
[screenshots of dependency parses omitted]
Actually, the word '工扶' does not exist in any Chinese dictionary.
So, is there a way to solve this problem? Thank you very much!

How to download resources manually

Because my network is poor, when I run this code:

from trankit import Pipeline
p = Pipeline('english')

the download is very slow and often breaks off: 41.6M/1.12G [07:26<3:17:34, 90.6kB/s]
So I wonder whether there is a way to download the resources manually.
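A sketch of an offline workaround, assuming you can fetch the model archive on a machine with a stable connection and copy it over. The exact download URL lives in trankit/utils/base_utils.py, and the cache layout used here is inferred from tracebacks elsewhere on this page, so treat both as assumptions:

# 1. On a machine with a good connection, download english.zip using the
#    URL found in trankit/utils/base_utils.py.
# 2. Copy it into the expected cache location and unzip it:
import os
import zipfile

lang_dir = './cache/xlm-roberta-base/english'
os.makedirs(lang_dir, exist_ok=True)

with zipfile.ZipFile(os.path.join(lang_dir, 'english.zip')) as f:
    f.extractall(lang_dir)

# 3. Point the pipeline at the pre-populated cache; with the files in
#    place, the download step should be skipped.
from trankit import Pipeline

p = Pipeline('english', cache_dir='./cache')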

CUDA error: device-side assert triggered

Hi!
When I try to parse with

tagged_sent = p.posdep(data, is_sent=True)

I get the following error. My input is a list of strings; the environment is Google Colab.

RuntimeError                              Traceback (most recent call last)
<ipython-input-17-776a5ac68296> in <module>()
----> 1 tagged_sent = p.posdep(data, is_sent=True)

2 frames
/usr/local/lib/python3.7/dist-packages/trankit/models/classifiers.py in predict(self, batch, word_reprs, cls_reprs)
    170         unlabeled_scores = self.unlabeled(dep_reprs, dep_reprs).squeeze(3)
    171 
--> 172         diag = torch.eye(batch.head_idxs.size(-1) + 1, dtype=torch.bool).unsqueeze(0).to(self.config.device)
    173         unlabeled_scores.masked_fill_(diag, -float('inf'))
    174 
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
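For reference, a sketch of the input shapes posdep accepts, based on trankit's documented API; whether data above matches one of them is an assumption. Note that passing a list of raw sentence strings with is_sent=True makes trankit treat each whole string as a single word, which could plausibly be related to errors like this:

from trankit import Pipeline

p = Pipeline('english')

# an untokenized document: a plain string
doc = p.posdep('Hello! This is Trankit.')

# one pretokenized sentence: a list of words, flagged with is_sent=True
sent = p.posdep(['Hello', '!'], is_sent=True)

# a pretokenized document: a list of word lists, without is_sent
sents = p.posdep([['Hello', '!'], ['This', 'is', 'Trankit', '.']])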

Question on pre-tokenized input

In my case, I need to use BERT to tokenize sentences and then run trankit on the tokenized sentences to compute dependency relations. I want to know whether trankit suffers a performance loss on pre-tokenized sentences.

Length of input

Hello,
Could you please tell me about the length of the input? What is the maximum size?

Cannot train a new NER model on cpu-only machine

Hello

I'm trying to train a TPipeline initialized with a training_config that has training_config['gpu'] set to False, on a CPU-only machine (no NVIDIA driver).
But the call to the train method fails with the following stack trace:

  File "/opt/conda/lib/python3.6/site-packages/trankit/tpipeline.py", line 684, in train
    self._train_ner()
  File "/opt/conda/lib/python3.6/site-packages/trankit/tpipeline.py", line 599, in _train_ner
    shuffle=True, collate_fn=self.train_set.collate_fn)):
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 475, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/_utils/fetch.py", line 47, in fetch
    return self.collate_fn(data)
  File "/opt/conda/lib/python3.6/site-packages/trankit/iterators/ner_iterators.py", line 259, in collate_fn
    batch_piece_idxs = torch.cuda.LongTensor(batch_piece_idxs)
  File "/opt/conda/lib/python3.6/site-packages/torch/cuda/__init__.py", line 172, in _lazy_init
    torch._C._cuda_init()
RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
ERROR:root:20210907-080829-978232 model training failed 

When I open ner_iterators.py, in the implementation of the collate_fn method it looks like the tensors are forced to their torch.cuda versions:

        batch_piece_idxs = torch.cuda.LongTensor(batch_piece_idxs)
        batch_attention_masks = torch.cuda.FloatTensor(
            batch_attention_masks)
        batch_entity_label_idxs = torch.cuda.LongTensor(batch_entity_label_idxs)
        batch_word_num = torch.cuda.LongTensor(batch_word_num)
        batch_word_mask = torch.cuda.LongTensor(batch_word_mask).eq(0)

Does this mean that it is not possible to train a new NER model on a non-GPU machine? (A device-aware rewrite is sketched below.)

Best regards

Olivier Terrier
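A minimal device-aware rewrite of the constructors quoted above, as a sketch rather than the library's actual fix: build CPU tensors and move them to the configured device instead of hard-coding the torch.cuda variants.

import torch

def collate_tensors(batch_piece_idxs, batch_attention_masks,
                    batch_entity_label_idxs, batch_word_num,
                    batch_word_mask, use_gpu=False):
    # Pick the device from the config instead of assuming CUDA exists.
    device = torch.device('cuda' if use_gpu and torch.cuda.is_available() else 'cpu')
    return (
        torch.tensor(batch_piece_idxs, dtype=torch.long, device=device),
        torch.tensor(batch_attention_masks, dtype=torch.float, device=device),
        torch.tensor(batch_entity_label_idxs, dtype=torch.long, device=device),
        torch.tensor(batch_word_num, dtype=torch.long, device=device),
        torch.tensor(batch_word_mask, dtype=torch.long, device=device).eq(0),
    )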

Question: running in parallel

Hey guys,

Starting to use your library, which is pretty cool! Thanks a lot !
However, I'm trying to process a lot of documents, ~400k, and as you can guess it will take quite some time 😅.
I'm working with pandas dataframes and tried pandarallel to run things in parallel, but I didn't manage to get it to work; it seemed stuck forever.

Do you have any idea how I could leverage parallelisation (or anything else other than a GPU) to reduce computation time?

Thanks in advance!
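A sketch of one possible approach, not an official recommendation: give each worker process its own Pipeline (the model is large, so memory is the real constraint) and map chunks of text across a process pool. Each worker builds its pipeline lazily on first use:

from multiprocessing import Pool

from trankit import Pipeline

_pipeline = None  # one pipeline per worker process

def parse(text):
    global _pipeline
    if _pipeline is None:
        _pipeline = Pipeline('english', gpu=False)
    return _pipeline(text)

if __name__ == '__main__':
    texts = ['First document.', 'Second document.']
    with Pool(processes=2) as pool:
        results = pool.map(parse, texts)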

error on "from trankit import Pipeline "

Thanks for providing this great toolkit. However, I cannot import Pipeline and get the following error:
ImportError: cannot import name '_BaseLazyModule' from 'transformers.file_utils'

It could be because of a version conflict. When I ran pip install trankit, I got this error at the end:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
transformers 4.2.1 requires tokenizers==0.9.4, but you have tokenizers 0.9.3 which is incompatible.
Successfully installed tokenizers-0.9.3

I really appreciate your help on this.
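A hedged guess at the cause: trankit depends on the adapter-transformers fork, which clashes with a separately installed transformers 4.2.1 in the same environment. Installing trankit into a fresh virtual environment, so only one of the two is present, may avoid the import error (an assumption, not a confirmed fix):

$ python -m venv trankit-env
$ source trankit-env/bin/activate
$ pip install trankit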
