muda's People

Contributors

coderpat, kayoyin, nightingal3

muda's Issues

Coref error

I am having some problems when running the script. I created my environment using the muda_env.yml file. When I test it on a small test document, I run into a "coref error". It seems to me that it is related to multi-word token expansion (for example, the Spanish word "al" is tokenised as "a" and "el"). See below for an example.

I'd be grateful for your thoughts on this!
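
For reference, the multi-word token expansion itself is easy to reproduce with Stanza alone. A minimal snippet (the sentence is just an illustration):

    import stanza

    # Spanish pipeline with multi-word token (MWT) expansion enabled
    nlp = stanza.Pipeline("es", processors="tokenize,mwt")

    doc = nlp("Fuimos al estudio.")
    for token in doc.sentences[0].tokens:
        print(token.text, "->", [word.text for word in token.words])
    # "al" is expanded into the syntactic words "a" and "el", so character
    # offsets no longer map one-to-one onto the original tokens.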

This is the command I used: PYTHONPATH=/home/getalp/nakhlem/MuDA python muda/main.py --src my_data/text.en --tgt my_data/text.es --docids my_data/text.docids --dump-tags my_data/test_enes_muda-env-yaml.tags --tgt-lang "es"

And this is the full message:

2024-01-10 16:53:32 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 22.0MB/s]                                                 
2024-01-10 16:53:33 INFO: Loading these models for language: en (English):
=================================
| Processor | Package           |
---------------------------------
| tokenize  | combined          |
| pos       | combined_charlm   |
| lemma     | combined_nocharlm |
| depparse  | combined_charlm   |
=================================

2024-01-10 16:53:33 INFO: Using device: cuda
2024-01-10 16:53:33 INFO: Loading: tokenize
2024-01-10 16:53:35 INFO: Loading: pos
2024-01-10 16:53:36 INFO: Loading: lemma
2024-01-10 16:53:36 INFO: Loading: depparse
2024-01-10 16:53:36 INFO: Done loading processors!
2024-01-10 16:53:36 INFO: Checking for updates to resources.json in case models have been updated.  Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 21.0MB/s]                                                 
2024-01-10 16:53:37 WARNING: Language es package default expects mwt, which has been added
2024-01-10 16:53:38 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package         |
-------------------------------
| tokenize  | ancora          |
| mwt       | ancora          |
| pos       | ancora_charlm   |
| lemma     | ancora_nocharlm |
| depparse  | ancora_charlm   |
===============================

2024-01-10 16:53:38 INFO: Using device: cuda
2024-01-10 16:53:38 INFO: Loading: tokenize
2024-01-10 16:53:38 INFO: Loading: mwt
2024-01-10 16:53:38 INFO: Loading: pos
2024-01-10 16:53:38 INFO: Loading: lemma
2024-01-10 16:53:38 INFO: Loading: depparse
2024-01-10 16:53:38 INFO: Done loading processors!
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Varios', 'enmascarados', 'irrumpieron', 'en', 'el', 'estudio', 'de', 'el', 'canal', 'público', 'TC', 'durante', 'una', 'emisión', ',', 'obligando', 'a', 'el', 'personal', 'a', 'tirar', 'se', 'a', 'el', 'suelo', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['A', 'el', 'menos', '10', 'personas', 'han', 'muerto', 'desde', 'que', 'el', 'lunes', 'se', 'declarara', 'el', 'estado', 'de', 'excepción', 'en', 'Ecuador', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Este', 'se', 'declaró', 'después', 'de', 'que', 'un', 'famoso', 'gánster', 'desapareciera', 'de', 'su', 'celda', 'en', 'prisión', '.', 'No', 'está', 'claro', 'si', 'el', 'incidente', 'en', 'el', 'estudio', 'de', 'televisión', 'de', 'Guayaquil', 'está', 'relacionado', 'con', 'la', 'desaparición', 'de', 'una', 'prisión', 'de', 'la', 'misma', 'ciudad', 'de', 'el', 'jefe', 'de', 'la', 'banda', 'de', 'los', 'Choneros', ',', 'Adolfo', 'Macías', 'Villamar', ',', 'o', 'Fito', ',', 'como', 'es', 'más', 'conocido', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['En', 'el', 'vecino', 'Perú', ',', 'el', 'gobierno', 'ordenó', 'el', 'despliegue', 'inmediato', 'de', 'una', 'fuerza', 'policial', 'en', 'la', 'frontera', 'para', 'evitar', 'que', 'la', 'inestabilidad', 'se', 'extienda', 'a', 'el', 'país', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Ecuador', 'es', 'uno', 'de', 'los', 'principales', 'exportadores', 'de', 'plátano', 'de', 'el', 'mundo', ',', 'pero', 'también', 'exporta', 'petróleo', ',', 'café', ',', 'cacao', ',', 'camarones', 'y', 'productos', 'pesqueros', '.', 'El', 'aumento', 'de', 'la', 'violencia', 'en', 'el', 'país', 'andino', ',', 'dentro', 'y', 'fuera', 'de', 'sus', 'prisiones', ',', 'se', 'ha', 'vinculado', 'a', 'los', 'enfrentamientos', 'entre', 'cárteles', 'de', 'la', 'droga', ',', 'tanto', 'extranjeros', 'como', 'locales', ',', 'por', 'el', 'control', 'de', 'las', 'rutas', 'de', 'la', 'cocaína', 'hacia', 'Estados', 'Unidos', 'y', 'Europa', '.']
Entities: []
  docs = (self._ensure_doc(text) for text in texts)
Loading the dataset...
Extracting: 9it [00:00, 23.15it/s]
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
coref error
coref error

Originally posted by @MariamNakhle in #16 (comment)

Separate language-specific taggers into different files

Currently, the taggers for all supported languages live in the same file, muda/tagger.py. However, each language's tagger should live in a separate file, since this adds encapsulation and makes it easier to add taggers for new languages.

I suggest adding them in a new folder, e.g. muda/langs/pt_tagger.py.
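
A rough sketch of the idea (all names below are hypothetical, not the current API): a shared base class defines the interface, and each language module carries only its own rules.

    # muda/langs/base.py (hypothetical): shared interface for all taggers
    class BaseTagger:
        lang: str = ""

        def tag_sentence(self, sent, context):
            """Return one set of discourse tags per token."""
            raise NotImplementedError

    # muda/langs/pt_tagger.py (hypothetical): Portuguese-specific rules only
    class PortugueseTagger(BaseTagger):
        lang = "pt"
        formal_pronouns = {"você", "vocês"}      # illustrative word lists
        informal_pronouns = {"tu", "vós"}

        def tag_sentence(self, sent, context):
            tags = [set() for _ in sent]
            for i, tok in enumerate(sent):
                if tok.lower() in self.formal_pronouns | self.informal_pronouns:
                    tags[i].add("formality")
            return tags

Adding support for a new language would then just mean dropping a new module into muda/langs/.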

Fix _verb_formality tagging

Currently, due to the refactor, verb-formality tagging is disabled for the languages that support it.
We should revisit these functions and make them compatible with the new structure.
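
For context, a sketch of the rough logic such a function can implement (hypothetical helper; it assumes Stanza-style feature strings like "Mood=Ind|Person=3|VerbForm=Fin" on each verb):

    from typing import Optional

    def verb_formality(word_feats: str, has_formal_pronoun: bool) -> Optional[str]:
        # Parse "Key=Value|Key=Value" morphological features into a dict
        feats = dict(f.split("=") for f in word_feats.split("|") if "=" in f)
        if feats.get("VerbForm") != "Fin":
            return None  # only finite verbs carry person agreement
        if feats.get("Person") == "2":
            return "informal"  # e.g. Spanish tú/vos agree in 2nd person
        if feats.get("Person") == "3" and has_formal_pronoun:
            return "formal"    # e.g. Spanish usted takes 3rd-person agreement
        return None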

Keep track of reference chain in MuDA

Currently, we use booleans to track whether a sentence has any links to the previous sentence. If we could keep track of these references explicitly, we could automatically change them to create contrastive datasets for certain phenomena to measure a model's context sensitivity. This could be difficult to do, but it would remove the dependence on non-contextual baselines.
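
A sketch of what explicit tracking might look like (hypothetical structure, not current code): store the actual mention links instead of a per-sentence boolean, so antecedents can later be perturbed to build contrastive examples.

    from dataclasses import dataclass, field

    @dataclass
    class Mention:
        sent_idx: int    # index of the sentence within the document
        span: tuple      # (start_token, end_token) of the mention
        text: str

    @dataclass
    class ReferenceChain:
        """All mentions of one entity across a document, in order."""
        mentions: list = field(default_factory=list)

        def links_to_context(self, sent_idx: int) -> bool:
            # recovers the current boolean behaviour from the chain itself
            return any(m.sent_idx < sent_idx for m in self.mentions)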

Add basic unit tests for each phenomenon

Currently, the tagger works but can be a bit flaky, and might fail on weird edge cases.

To ensure both that the code isn't broken and that it produces the expected tags, we could have some unit tests that assert the output of the tagger for specific input sentences we design.

These tests would almost certainly have to be language-specific, which could be a pain in the ass, so I think this is only mid-priority.
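
Something like the following could be a starting point (pytest sketch; build_tagger, its arguments, and the expected tag are assumptions, not the actual API):

    import pytest

    from muda.tagger import build_tagger  # hypothetical entry point

    @pytest.mark.parametrize(
        "src, tgt, expected_tag, token_idx",
        [
            # "Tu" should be tagged as a formality phenomenon in Portuguese
            ("You know the answer.", "Tu sabes a resposta.", "formality", 0),
        ],
    )
    def test_tagger_tags_token(src, tgt, expected_tag, token_idx):
        tagger = build_tagger(tgt_lang="pt")
        tags = tagger.tag(src=[src], tgt=[tgt], docids=[0])
        assert expected_tag in tags[0][token_idx]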

Add new UX to interpret MuDA results

Currently, MuDA relies on compare-mt to analyse the performance of contextual models. However, this delegation both makes MuDA harder to use and brings a lot of extra information to users who just want MuDA scores.

Ideally, MuDA should have an interface that takes the tags (for both the reference and the MT system) and outputs a score.
This is connected to #2.
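
Concretely, such an interface could reduce the two tag files to per-phenomenon precision/recall/F1. A sketch (the tag format below, one set of tags per token, is an assumption; adapt to the actual dump-tags output):

    from collections import Counter

    def muda_score(ref_tags, sys_tags, phenomenon):
        """Token-level precision/recall/F1 for one phenomenon.

        ref_tags / sys_tags: list of sentences, each a list of
        per-token tag sets.
        """
        counts = Counter()
        for ref_sent, sys_sent in zip(ref_tags, sys_tags):
            for ref, sys in zip(ref_sent, sys_sent):
                counts["ref"] += phenomenon in ref
                counts["sys"] += phenomenon in sys
                counts["hit"] += phenomenon in ref and phenomenon in sys
        p = counts["hit"] / counts["sys"] if counts["sys"] else 0.0
        r = counts["hit"] / counts["ref"] if counts["ref"] else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        return {"precision": p, "recall": r, "f1": f1}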

Installation issues

I had a lot of issues with installation; noting them here in case you want to fix them, or in case others run into similar ones.

  • I started by creating a Python environment: python3 -m venv muda; . muda/bin/activate
  • There are two requirements files, one in the root dir and one under muda/. I installed both, starting with the one under muda/.
  • I had to manually install overrides==3.1.0 to fix one set of errors
  • I had to manually downgrade to pydantic==1.7.4 to fix another set of errors

After that, I was able to get the program to run.
