luismond / tm2tb
Bilingual term extractor
Home Page: https://www.tm2tb.com
License: GNU General Public License v3.0
Hi! In the Google Colab notebook, running the biterm extractor on a file above a certain size fails as follows:
The test_bitext_en_es.tmx test file you supplied works fine. If I truncate my own test TMX to fewer than about 300 lines, it also works fine.
I also tested this on Windows with Python 3.10, 3.11, and 3.12, and the result is the same: fewer than roughly 300 lines works, more than that fails.
Thanks, and love your work!
For those who cannot send data to a third-party API for data-confidentiality reasons, it would be nice to use local dictionaries such as Hunspell (which is complete enough to power OS/browser spellchecking).
Hey :) I love this project, but I'm not really skilled with Python, so this might sound like a beginner question.
If I set the print slice to a high number, e.g. print(biterms[:200]), the output seems to be truncated, skipping a bunch of lines:
src_term src_tags src_rank trg_term trg_tags trg_rank similarity frequency biterm_rank
0 enemy Clan [NOUN, PROPN] 0.6490 clan ennemi [NOUN, NOUN] 0.5855 0.9662 16 0.6448
1 Archers [PROPN] 0.4446 archers [NOUN] 0.4693 0.9096 28 0.6024
2 hunt [NOUN] 0.4017 chasse [NOUN] 0.4042 0.9168 6 0.5911
3 Warriors [NOUN] 0.2542 guerriers [NOUN] 0.5313 0.9256 32 0.5899
4 attacks [NOUN] 0.2252 attaques [NOUN] 0.5065 0.9658 6 0.5872
.. ... ... ... ... ... ... ... ... ...
195 bonus rewards [NOUN, NOUN] 0.1408 récompenses bonus [NOUN, NOUN] 0.3058 0.9771 1 0.5398
196 player [NOUN] 0.1561 joueur [NOUN] 0.1726 0.9684 17 0.5397
197 level [PROPN] 0.1620 niveau [NOUN] 0.1600 0.9875 21 0.5397
[200 rows x 9 columns]
Also, is there any way to save all lines to a file, or even better, directly to a CSV? At the moment I'm just saving the print output to a txt file via:
from contextlib import redirect_stdout

with open('out.txt', 'w') as f:
    with redirect_stdout(f):
        print(biterms[:30])
Will there be a PyPI package, so that other applications can integrate it easily with pip install tm2tb?
LaBSE is good, but also very big (almost 2 GB).
Suggestion: add a model-selection feature so tm2tb can also be used with smaller language models.
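A minimal sketch of what such a selection hook could look like. The registry and function names are hypothetical, not tm2tb's API; the model IDs are real sentence-transformers models:

```python
# Hypothetical registry mapping short names to Hugging Face model IDs,
# so callers can trade accuracy for download size.
MODEL_REGISTRY = {
    "labse": "sentence-transformers/LaBSE",  # multilingual, ~1.8 GB
    "minilm": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",  # much smaller
}

def resolve_model(name: str = "labse") -> str:
    """Return the full model ID for a short name, or raise for unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown model '{name}'; choose from {sorted(MODEL_REGISTRY)}")
```

The resolved ID could then be passed to `SentenceTransformer(...)` instead of a hard-coded LaBSE model.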
-- .xlsx: use Office account to create it
-- .mqxliff: reinstall memoQ, import data, export .mqxliff, .tmx
-- .mxliff: use account, load data, export
Hi, thank you for this cool tool! We are trying to extend it with more spaCy languages. Adding the new language models was not an issue. However, when running the extractor, no stop_vectors are found. Any hint on where we can download some or how to create our own?
Thank you!
Traceback (most recent call last):
File "/Users/devi/PycharmProjects/tm2tb/de_it_test.py", line 42, in <module>
terms = extractor.extract_terms(span_range=(1, 3), incl_pos=['ADJ', 'NOUN', 'PROPN', 'ADP']) # Extract terms
File "/Users/devi/PycharmProjects/tm2tb/tm2tb/term_extractor.py", line 114, in extract_terms
stops_embeddings_avg = np.load(os.path.join(file_dir, '..', 'stops_vectors', str(self.emb_dims), f'{self.lang}.npy'))
File "/Users/devi/PycharmProjects/tm2tb/venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 390, in load
fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/Users/devi/PycharmProjects/tm2tb/tm2tb/../stops_vectors/768/it.npy'
Process finished with exit code 1
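The missing file is a NumPy array of averaged stopword embeddings. The exact pipeline tm2tb used to produce these files isn't documented here, but a plausible reconstruction, assuming the same encoder tm2tb uses (LaBSE, 768 dimensions), is to embed a stopword list for the language, average the vectors, and save the result:

```python
import numpy as np

def build_stops_vector(stopword_embeddings: np.ndarray) -> np.ndarray:
    """Average per-stopword embeddings into a single (dims,) vector."""
    return stopword_embeddings.mean(axis=0)

# Stand-in data: in practice, embed each Italian stopword with the same
# model tm2tb uses (768-dim LaBSE vectors), one row per stopword.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(120, 768))  # 120 stopwords x 768 dims

avg = build_stops_vector(fake_embeddings)
np.save("it.npy", avg)  # place under stops_vectors/768/it.npy
```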
Write detailed instructions on how to get an Azure Cognitive Services key
Hi there, I was trying to use your tool in Jupyter Lab and got this error, despite using a virtual environment and installing all the requirements. Any ideas about what's happening here?
What is the basic approach used?
(Would be ideal to describe it in the README.)
Problem:
Due to segmentation issues or configuration, many bilingual documents have long paragraphs in each segment. It would be nice if we could split the paragraphs and align the sentences within them. Simple rules like splitting on newlines or using regexes wouldn't work; it would be necessary to align the sentences using similarity. Could tm2tb do this?
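A minimal sketch of similarity-based alignment, assuming per-sentence embeddings are already available (e.g. from the same LaBSE model tm2tb uses). This is a greedy best-match, not a full monotone alignment algorithm:

```python
import numpy as np

def align_sentences(src_emb: np.ndarray, trg_emb: np.ndarray):
    """Pair each source sentence with its most similar target sentence.

    Returns (src_idx, trg_idx, cosine_score) triples.
    """
    # L2-normalize so the dot product is cosine similarity.
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    trg = trg_emb / np.linalg.norm(trg_emb, axis=1, keepdims=True)
    sim = src @ trg.T  # similarity matrix, one row per source sentence
    return [(i, int(sim[i].argmax()), float(sim[i].max())) for i in range(len(src))]
```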
When the max_stopword_similarity value passed to the extract_terms method is too low, e.g. 0.10, no terms might be found at all. This results in the following error being raised in term_extractor.py, line 124:
raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Suggestion:
Check if top_spans actually contains any term candidates by wrapping lines 124-132 in an if condition:

if len(top_spans) > 0:
    if collapse_similarity is True:
        top_spans = self._collapse_similarity(top_spans)
    for i, span in enumerate(top_spans):
        span._.span_id = i
    top_spans = sorted(top_spans, key=lambda span: span._.span_id)
    if return_as_table is True:
        top_spans = self._return_as_table(top_spans)
    return top_spans
Does this make sense to you?
When loading the spaCy models, all models are loaded even if they are not used; see spacy_models.py, line 19 and following.
When adding more languages or using lg models, this might become a bottleneck and slow down the extraction process significantly.
Suggestion:
Check which language is requested and only load the required model, e.g. by changing line 58 (removing spacy_model = spacy_models[lang]) to:

if lang == 'de':
    spacy_model = de_core_news_md.load()
elif lang == 'en':
    spacy_model = en_core_web_md.load()
...

Then you would also be able to remove lines 19-26.
What do you think?
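A variant of the same idea that stays table-driven and caches each model after its first load. This is a sketch: the MODEL_NAMES mapping and function are hypothetical, though the package names are spaCy models tm2tb already uses:

```python
import importlib
from functools import lru_cache

# Hypothetical registry of installed spaCy model packages.
MODEL_NAMES = {"en": "en_core_web_md", "de": "de_core_news_md"}

@lru_cache(maxsize=None)
def get_spacy_model(lang: str):
    """Import and load the requested model only on first use, then cache it."""
    try:
        module = importlib.import_module(MODEL_NAMES[lang])
    except KeyError:
        raise ValueError(f"No spaCy model registered for language '{lang}'")
    return module.load()
```

Because the import happens inside the function, startup no longer pays for models that are never requested.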
Hi @luismond ,
Thank you so much for making this available.
I tried to run the app, and it was complaining about missing 'uploads' directory. I created it, but it throws an error.
Any idea how to fix this?
By the way, I was testing it using the de-en csv file you provided.
This is what the browser shows:
Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.
Here is the console output:
127.0.0.1 - - [25/Mar/2021 05:41:39] "GET /favicon.ico HTTP/1.1" 500 -
tm len: 100
fn to df 0.052005767822265625
preproc 0.15700340270996094
detect 2.0500078201293945
detected src lang: en
detected trg lang: de
get tokens 2.051008701324463
grams 2.4290056228637695
remove stops 2.686998128890991
fn to iterzip 2.804997444152832
prepare 1116 tb cands 2.975860595703125
[2021-03-25 05:51:26,829] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "myproject.py", line 47, in post_file
return prev(filename)
File "myproject.py", line 59, in prev
result_html = Markup(tm2tb_main(filename))
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\tm2tb.py", line 145, in tm2tb_main
tb = tb_to_azure(tb, srcdet, trgdet)
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\tb_to_azure.py", line 33, in tb_to_azure
sst_batches_lu = [get_azure_dict_lookup(src_det, trgdet, l) for l in sst_batches]
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\tb_to_azure.py", line 33, in <listcomp>
sst_batches_lu = [get_azure_dict_lookup(src_det, trgdet, l) for l in sst_batches]
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 36, in get_azure_dict_lookup
targets = [get_normalizedTargets(d) for d in response]
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 36, in <listcomp>
targets = [get_normalizedTargets(d) for d in response]
File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 12, in get_normalizedTargets
targets = [t['normalizedTarget'] for t in d['translations']]
TypeError: string indices must be integers
127.0.0.1 - - [25/Mar/2021 05:51:26] "POST / HTTP/1.1" 500 -
[2021-03-25 05:51:27,556] ERROR in app: Exception on /favicon.ico [GET]
Traceback (most recent call last):
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2447, in wsgi_app
response = self.full_dispatch_request()
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
rv = self.handle_user_exception(e)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
reraise(exc_type, exc_value, tb)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
rv = self.dispatch_request()
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1936, in dispatch_request
return self.view_functions[rule.endpoint](**req.view_args)
File "myproject.py", line 69, in get_file
redirect(url_for('uploaded_file', filename=filename))
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\helpers.py", line 370, in url_for
return appctx.app.handle_url_build_error(error, endpoint, values)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2216, in handle_url_build_error
reraise(exc_type, exc_value, tb)
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
raise value
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\helpers.py", line 358, in url_for
endpoint, values, method=method, force_external=external
File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\werkzeug\routing.py", line 2179, in build
raise BuildError(endpoint, values, method, self)
werkzeug.routing.BuildError: Could not build url for endpoint 'uploaded_file' with values ['filename']. Did you forget to specify values ['filename_tb']?
Problem:
So far, the module can extract terms from parallel, aligned data.
However, there is a lot of multilingual data out there that is not aligned or not in a translation file format.
For example, a Wikipedia page about panda bears in English and the same page in Spanish: the content is similar, but the sentences of the two pages are not aligned.
Solution:
Make it work for non-aligned data: extract all source n-grams and all target n-grams, then compare them. Write special filtering and ranking functions if necessary.
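The extract-and-compare step could be sketched like this (illustrative names, not tm2tb's API; `embed` is a stand-in for a real multilingual encoder such as LaBSE):

```python
import numpy as np

def ngrams(tokens, n_max=3):
    """All word n-grams of length 1..n_max."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(tokens) - n + 1)]

def match_terms(src_terms, trg_terms, embed, min_sim=0.9):
    """Pair each source candidate with its closest target candidate by cosine.

    Keeps only pairs above min_sim; a real implementation would add the
    filtering and ranking functions the issue mentions.
    """
    src = embed(src_terms)
    trg = embed(trg_terms)
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    trg = trg / np.linalg.norm(trg, axis=1, keepdims=True)
    sim = src @ trg.T
    return [(s, trg_terms[int(sim[i].argmax())], float(sim[i].max()))
            for i, s in enumerate(src_terms)
            if sim[i].max() >= min_sim]
```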
I want to add Chinese language support. It is not difficult to add the spaCy models for Chinese.
But the problem is: I cannot find the stops_embeddings_avg data needed.
How do I create the stops_embeddings data?