License: GNU General Public License v3.0

Python 71.09% Jupyter Notebook 28.91%

nlp_pipeline

This repo wraps several NLP libraries in a uniform pipeline object, which makes it easy to use them all in one project or to compare them with each other.

You can use this pipeline to do the following:

Tokenization
Stemming
Lemmatization
POS tagging
Morphological analysis
Building embeddings on corpora

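The libraries currently supported, going by the pipeline classes listed under NLP data, are Stanza, spaCy, Hugging Face, and CoreNLP (through Stanza).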

Installation

Clone the repo

git clone https://github.com/binbin83/nlp_pipeline.git

Create a virtual environment

python3 -m venv path/to/venv/nlp_pipeline

Install the requirements (if you want to use a GPU, install requirements_gpu.txt)

pip install -r requirements.txt

Read the example notebooks in the notebooks folder
Update the config file with your own paths and parameters
Run the pipeline

python3 main_nlp.py

python3 main_embeddings.py

NLP data

The pipelines:

StanzaNlpPipeline
SpacyNlpPipeline
HuggingfaceNlpPipeline
StanzaCoreNlpPipeline

These pipelines have nearly the same structure and the same methods. They return results in the same format: a dictionary with the following keys:

'tokens': list of tokens
'lemmas': list of lemmas
'pos': list of pos tags
'morph': list of morphological analysis
'doc': the original doc object of the library
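
Because every pipeline returns the same keys, downstream code can stay independent of the underlying library. The sketch below is illustrative only: the `result` dictionary is hand-written rather than produced by one of the pipelines, the helper names `lemmas_by_pos` and `pos_agreement` are hypothetical and not part of this repo, and it assumes the lists are index-aligned (one lemma, POS tag, and morphology entry per token).

```python
from collections import Counter

# Hand-written stand-in for a pipeline result, using only the documented keys.
# A real pipeline would put the library's own document object under 'doc'.
result = {
    "tokens": ["The", "cats", "were", "sleeping"],
    "lemmas": ["the", "cat", "be", "sleep"],
    "pos": ["DET", "NOUN", "AUX", "VERB"],
    "morph": ["PronType=Art", "Number=Plur", "Tense=Past", "VerbForm=Part"],
    "doc": None,
}


def lemmas_by_pos(result, keep=("NOUN", "VERB")):
    """Count lemmas whose POS tag is in `keep` (assumes aligned lists)."""
    pairs = zip(result["lemmas"], result["pos"])
    return Counter(lemma for lemma, tag in pairs if tag in keep)


def pos_agreement(result_a, result_b):
    """Fraction of tokens on which two results carry the same POS tag.

    Only meaningful when both pipelines produced the same tokenization.
    """
    if result_a["tokens"] != result_b["tokens"]:
        raise ValueError("tokenizations differ; align tokens before comparing")
    total = len(result_a["pos"])
    same = sum(a == b for a, b in zip(result_a["pos"], result_b["pos"]))
    return same / total if total else 1.0


# Pretend a second library tagged "were" as VERB instead of AUX.
other = dict(result, pos=["DET", "NOUN", "VERB", "VERB"])

print(lemmas_by_pos(result))
print(pos_agreement(result, other))
```

The same two helpers would work unchanged on the output of any of the four pipelines, which is the main benefit of a shared result format when comparing libraries.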

Speed

On a 10-million-word corpus, applying the NLP pipeline took approximately (GPU runs used an RTX A4000 with 8 GB of memory):

~70 minutes for Stanza (GPU)
~20 minutes for spaCy trf (GPU)
~14 minutes for spaCy lg (CPU: 11th Gen Intel® Core™ i7-11850H @ 2.50GHz × 16)
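
To compare these timings with a corpus of a different size, it can help to convert them to throughput. The short calculation below uses only the figures quoted above; the variable and function names are illustrative, and the timings themselves are approximate.

```python
CORPUS_WORDS = 10_000_000  # corpus size quoted above

# Approximate wall-clock minutes quoted above.
minutes = {
    "Stanza (GPU)": 70,
    "spaCy trf (GPU)": 20,
    "spaCy lg (CPU)": 14,
}


def words_per_second(total_words, minutes_taken):
    """Average throughput over the whole run."""
    return total_words / (minutes_taken * 60)


throughput = {
    name: words_per_second(CORPUS_WORDS, m) for name, m in minutes.items()
}

for name, wps in throughput.items():
    print(f"{name}: ~{wps:,.0f} words/s")
```

This works out to roughly 2,400 words/s for Stanza, 8,300 words/s for spaCy trf, and 11,900 words/s for spaCy lg. These are averages from single runs on the hardware listed, so they are a rough guide at best.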

Embeddings

The embeddings can be built with the following models: Word2vec, FastText, Doc2vec, LDA, LSA, ELDA, and HDP

Todo

[ ] Make Hugging Face models available in the embeddings pipeline, i.e. make it possible to fine-tune CamemBERT embeddings on the data

[ ] Add hops parser to the options: https://github.com/hopsparser/hopsparser

[ ] Add unit tests
