muda's Issues
Coref error
I am having some problems when running the script. I created my environment using the muda_env.yml file. When I test it on a small test document, I run into a "coref error". It seems to me that it is related to multi-token tokenisation (for example the Spanish word "al" is tokenized as "a" and "el"). See below for an example.
I'd be grateful for your thoughts on this!
This is the command I used: PYTHONPATH=/home/getalp/nakhlem/MuDA python muda/main.py --src my_data/text.en --tgt my_data/text.es --docids my_data/text.docids --dump-tags my_data/test_enes_muda-env-yaml.tags --tgt-lang "es"
And this is the full message:
2024-01-10 16:53:32 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 22.0MB/s]
2024-01-10 16:53:33 INFO: Loading these models for language: en (English):
=================================
| Processor | Package |
---------------------------------
| tokenize | combined |
| pos | combined_charlm |
| lemma | combined_nocharlm |
| depparse | combined_charlm |
=================================
2024-01-10 16:53:33 INFO: Using device: cuda
2024-01-10 16:53:33 INFO: Loading: tokenize
2024-01-10 16:53:35 INFO: Loading: pos
2024-01-10 16:53:36 INFO: Loading: lemma
2024-01-10 16:53:36 INFO: Loading: depparse
2024-01-10 16:53:36 INFO: Done loading processors!
2024-01-10 16:53:36 INFO: Checking for updates to resources.json in case models have been updated. Note: this behavior can be turned off with download_method=None or download_method=DownloadMethod.REUSE_RESOURCES
Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.6.0.json: 367kB [00:00, 21.0MB/s]
2024-01-10 16:53:37 WARNING: Language es package default expects mwt, which has been added
2024-01-10 16:53:38 INFO: Loading these models for language: es (Spanish):
===============================
| Processor | Package |
-------------------------------
| tokenize | ancora |
| mwt | ancora |
| pos | ancora_charlm |
| lemma | ancora_nocharlm |
| depparse | ancora_charlm |
===============================
2024-01-10 16:53:38 INFO: Using device: cuda
2024-01-10 16:53:38 INFO: Loading: tokenize
2024-01-10 16:53:38 INFO: Loading: mwt
2024-01-10 16:53:38 INFO: Loading: pos
2024-01-10 16:53:38 INFO: Loading: lemma
2024-01-10 16:53:38 INFO: Loading: depparse
2024-01-10 16:53:38 INFO: Done loading processors!
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Due to multiword token expansion or an alignment issue, the original text has been replaced by space-separated expanded tokens.
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Varios', 'enmascarados', 'irrumpieron', 'en', 'el', 'estudio', 'de', 'el', 'canal', 'público', 'TC', 'durante', 'una', 'emisión', ',', 'obligando', 'a', 'el', 'personal', 'a', 'tirar', 'se', 'a', 'el', 'suelo', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['A', 'el', 'menos', '10', 'personas', 'han', 'muerto', 'desde', 'que', 'el', 'lunes', 'se', 'declarara', 'el', 'estado', 'de', 'excepción', 'en', 'Ecuador', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Este', 'se', 'declaró', 'después', 'de', 'que', 'un', 'famoso', 'gánster', 'desapareciera', 'de', 'su', 'celda', 'en', 'prisión', '.', 'No', 'está', 'claro', 'si', 'el', 'incidente', 'en', 'el', 'estudio', 'de', 'televisión', 'de', 'Guayaquil', 'está', 'relacionado', 'con', 'la', 'desaparición', 'de', 'una', 'prisión', 'de', 'la', 'misma', 'ciudad', 'de', 'el', 'jefe', 'de', 'la', 'banda', 'de', 'los', 'Choneros', ',', 'Adolfo', 'Macías', 'Villamar', ',', 'o', 'Fito', ',', 'como', 'es', 'más', 'conocido', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['En', 'el', 'vecino', 'Perú', ',', 'el', 'gobierno', 'ordenó', 'el', 'despliegue', 'inmediato', 'de', 'una', 'fuerza', 'policial', 'en', 'la', 'frontera', 'para', 'evitar', 'que', 'la', 'inestabilidad', 'se', 'extienda', 'a', 'el', 'país', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
/home/getalp/nakhlem/miniconda3/envs/muda_yml/lib/python3.9/site-packages/spacy/language.py:1580: UserWarning: Can't set named entities because of multi-word token expansion or because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Ecuador', 'es', 'uno', 'de', 'los', 'principales', 'exportadores', 'de', 'plátano', 'de', 'el', 'mundo', ',', 'pero', 'también', 'exporta', 'petróleo', ',', 'café', ',', 'cacao', ',', 'camarones', 'y', 'productos', 'pesqueros', '.', 'El', 'aumento', 'de', 'la', 'violencia', 'en', 'el', 'país', 'andino', ',', 'dentro', 'y', 'fuera', 'de', 'sus', 'prisiones', ',', 'se', 'ha', 'vinculado', 'a', 'los', 'enfrentamientos', 'entre', 'cárteles', 'de', 'la', 'droga', ',', 'tanto', 'extranjeros', 'como', 'locales', ',', 'por', 'el', 'control', 'de', 'las', 'rutas', 'de', 'la', 'cocaína', 'hacia', 'Estados', 'Unidos', 'y', 'Europa', '.']
Entities: []
docs = (self._ensure_doc(text) for text in texts)
Loading the dataset...
Extracting: 9it [00:00, 23.15it/s]
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
coref error
coref error
Originally posted by @MariamNakhle in #16 (comment)
Create a virtual environment or requirements file for MuDA
We should create this so that people don't have dependency issues and can easily use the library.
Separate specific language tagger in different files
Currently, the taggers for all supported languages live in the same file, muda/tagger.py. The code for each language's tagger should instead live in a separate file, since this adds encapsulation and makes it easier to add taggers for new languages.
I suggest adding them in a new folder, e.g. muda/langs/pt_tagger.py
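A minimal sketch of how per-language tagger files could plug into a shared registry. Everything here is an assumption made for illustration: `BaseTagger`, `register_tagger`, `build_tagger`, and the toy pronoun lists are hypothetical names, not MuDA's actual API.

```python
from typing import Callable, Dict, List, Type


class BaseTagger:
    """Hypothetical base class holding language-independent behaviour."""

    # Each language file overrides this with its own word list (toy data).
    formal_pronouns: frozenset = frozenset()

    def tag_formality(self, tokens: List[str]) -> List[bool]:
        """Return one flag per token: True if it is a formal pronoun."""
        return [tok.lower() in self.formal_pronouns for tok in tokens]


# Maps a language code to the tagger class that handles it.
TAGGER_REGISTRY: Dict[str, Type[BaseTagger]] = {}


def register_tagger(lang: str) -> Callable[[Type[BaseTagger]], Type[BaseTagger]]:
    """Class decorator that records a tagger under its language code."""

    def decorator(cls: Type[BaseTagger]) -> Type[BaseTagger]:
        TAGGER_REGISTRY[lang] = cls
        return cls

    return decorator


# In the proposed layout this class would live in muda/langs/pt_tagger.py.
@register_tagger("pt")
class PortugueseTagger(BaseTagger):
    formal_pronouns = frozenset({"senhor", "senhora"})


# ... and this one in its own file as well (file name is hypothetical).
@register_tagger("es")
class SpanishTagger(BaseTagger):
    formal_pronouns = frozenset({"usted", "ustedes"})


def build_tagger(lang: str) -> BaseTagger:
    """Instantiate the tagger registered for `lang`."""
    try:
        return TAGGER_REGISTRY[lang]()
    except KeyError:
        raise ValueError(f"No tagger registered for language {lang!r}") from None
```

With a layout like this, adding a language would only mean adding one file that defines and registers a subclass; the shared code would not need to change.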
Fix _verb_formality tagging
Currently, due to the refactor, verb formality tagging is disabled for the languages that support it.
We should revisit these functions and make them compatible with the refactored code.
Keep track of reference chain in MuDA
Currently, we track whether a sentence has any links to the previous sentence with booleans. If we could keep track of these references explicitly, we could automatically change them to create contrastive datasets for certain phenomena to measure a model's context sensitivity. This could be difficult to do, but would remove the dependence on non-contextual baselines.
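One possible shape for an explicit reference chain, as a rough sketch only. `Mention` and `ReferenceChain` are hypothetical names, and the example assumes that a coreference step has already grouped mentions into chains; it does not reflect how MuDA stores this today.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(frozen=True)
class Mention:
    """One mention of an entity: where it occurs and its surface text."""
    sent_idx: int
    token_idx: int
    text: str


@dataclass
class ReferenceChain:
    """All mentions that refer to the same entity within a document."""
    mentions: List[Mention] = field(default_factory=list)

    def add(self, mention: Mention) -> None:
        self.mentions.append(mention)

    def cross_sentence_links(self) -> List[Tuple[Mention, Mention]]:
        """Return (antecedent, mention) pairs of adjacent mentions in
        document order where the mention is in a later sentence."""
        ordered = sorted(self.mentions, key=lambda m: (m.sent_idx, m.token_idx))
        return [(a, b) for a, b in zip(ordered, ordered[1:])
                if b.sent_idx > a.sent_idx]

    def needs_context(self, sent_idx: int) -> bool:
        """Equivalent of the current boolean: does this sentence contain a
        mention whose antecedent is in an earlier sentence?"""
        return any(b.sent_idx == sent_idx
                   for _, b in self.cross_sentence_links())
```

Keeping the links themselves, rather than only the boolean, would leave the antecedent available for rewriting when building contrastive examples.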
Add basic unit tests for each phenomenon
Currently, the tagger works but can be a bit flaky, and might fail for weird edge cases.
To ensure both that the code isn't broken and that it produces the expected tags, we could add unit tests that assert the tagger's output for specific input sentences we design.
These tests would almost certainly have to be language-specific, which could be a pain in the ass, so I think this is only mid-priority
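A sketch of what one such language-specific test could look like, using the standard `unittest` module. The `tag_formality` function below is a toy stand-in written only for this example (an assumption, not MuDA's tagger); a real test would call the actual tagger instead.

```python
import unittest
from typing import List


def tag_formality(tokens: List[str]) -> List[List[str]]:
    """Toy stand-in for the real tagger (hypothetical): returns one list
    of tags per token, tagging Spanish formal pronouns with 'formality'."""
    formal = {"usted", "ustedes"}
    return [["formality"] if tok.lower() in formal else [] for tok in tokens]


class SpanishFormalityTest(unittest.TestCase):
    def test_formal_pronoun_is_tagged(self):
        tags = tag_formality(["¿", "Cómo", "está", "usted", "?"])
        self.assertEqual(tags[3], ["formality"])

    def test_plain_sentence_has_no_tags(self):
        tags = tag_formality(["El", "gobierno", "ordenó", "el", "despliegue", "."])
        self.assertTrue(all(tag_list == [] for tag_list in tags))
```

Saved as a test module, this could be run with `python -m unittest`. One small file per language would keep the language-specific sentences separate.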
Get rid of subprocess call in MuDA
Add new UX to interpret MuDA results
Currently, MuDA relies on compare-mt to analyse the performance of contextual models. However, this delegation both makes MuDA harder to use and gives a lot of extra information to users who just want MuDA scores.
Ideally, MuDA should have an interface that takes the tags (for both the reference and the MT system output) and outputs a score.
This is connected to #2
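A minimal sketch of such an interface, assuming the tags arrive as per-token tag lists for the reference and for the system output. The function name `tag_f1`, the input layout, and the choice of a per-tag word F1 are all assumptions for illustration; this is not necessarily the score compare-mt reports, and the tag names in the usage are illustrative.

```python
from collections import Counter
from typing import Dict, List, Tuple

# One sentence = list of (word, tags-for-that-word) pairs.
TaggedSentence = List[Tuple[str, List[str]]]


def tag_f1(ref: List[TaggedSentence], hyp: List[TaggedSentence]) -> Dict[str, float]:
    """Per-tag F1 over tagged words, matching (word, tag) pairs
    sentence by sentence between reference and hypothesis."""
    matched, ref_total, hyp_total = Counter(), Counter(), Counter()
    for ref_sent, hyp_sent in zip(ref, hyp):
        ref_counts = Counter((w.lower(), t) for w, tags in ref_sent for t in tags)
        hyp_counts = Counter((w.lower(), t) for w, tags in hyp_sent for t in tags)
        for (_, tag), n in ref_counts.items():
            ref_total[tag] += n
        for (_, tag), n in hyp_counts.items():
            hyp_total[tag] += n
        # Multiset intersection: tagged words present in both sides.
        for (_, tag), n in (ref_counts & hyp_counts).items():
            matched[tag] += n

    scores: Dict[str, float] = {}
    for tag in set(ref_total) | set(hyp_total):
        precision = matched[tag] / hyp_total[tag] if hyp_total[tag] else 0.0
        recall = matched[tag] / ref_total[tag] if ref_total[tag] else 0.0
        denom = precision + recall
        scores[tag] = 2 * precision * recall / denom if denom else 0.0
    return scores


# Usage with toy data: the formal pronoun matches, the gendered one does not.
reference = [[("usted", ["formality"]), ("habla", [])], [("ella", ["pronouns"])]]
system = [[("usted", ["formality"]), ("hablas", [])], [("él", ["pronouns"])]]
scores = tag_f1(reference, system)
```

Returning a plain dictionary keyed by tag would make it easy to print a small table or to feed the numbers into other tooling.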
Allow passing pre-existing alignments
Installation issues
I had a lot of issues with installation, noting them here in case you want to fix them, or others have similar ones.
- I started by creating a python environment:
  python3 -m venv muda; . muda/bin/activate
- There are two requirements files, in the root dir and under muda/. I installed both, starting with the one under muda.
- I had to manually install overrides==3.1.0 to fix one set of errors
- I had to manually downgrade pydantic==1.7.4 to fix another set of errors
After that, I was able to get the program to run.
Encapsulate all of MuDA's operations in a single command.
Currently, to extract tags from MuDA, one has to run a series of commands (format, align, tag).
Ideally, all these operations should be encapsulated in a single command.