artitw / text2text
Text2Text: Crosslingual NLP/G toolkit
Home Page: https://discord.gg/eHaaUuWpTc
License: Other
Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages without any heuristics-based preprocessing.
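As a rough illustration of the idea (not the paper's implementation), subword-level TF-IDF can be sketched with character n-grams standing in for a trained subword tokenizer; `subwords` and `stf_idf_scores` are hypothetical names introduced here:

```python
import math
from collections import Counter

def subwords(text, n=3):
    # Character n-grams as a crude stand-in for a trained subword tokenizer
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def stf_idf_scores(query, docs, n=3):
    # Score each document against the query with TF-IDF over subword units;
    # no stop words or stemming rules are needed at any point.
    doc_tokens = [Counter(subwords(d, n)) for d in docs]
    N = len(docs)

    def idf(tok):
        df = sum(1 for c in doc_tokens if tok in c)
        return math.log((N + 1) / (df + 1)) + 1

    q = subwords(query, n)
    scores = []
    for c in doc_tokens:
        total = sum(c.values()) or 1
        scores.append(sum((c[t] / total) * idf(t) for t in set(q)))
    return scores

docs = ["the cat sat on the mat", "dogs chase cats", "stock prices fell today"]
print(stf_idf_scores("cat", docs))  # the two cat documents score highest
```

Because matching happens at the subword level, "cats" still matches the query "cat" without any language-specific stemmer, which is the property that carries over to the multilingual setting.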
from text2text.text_generator import TextGenerator
ModuleNotFoundError: No module named 'text2text.text_generator'
Kindly help me with this error.
Hi!
Great work, it's working very well for me. I have one question:
Is it possible, for question generation, to output the confidence of the generated question?
If so, how would it be done?
Thanks in advance!
Perform a similar study to https://arxiv.org/pdf/1907.04307.pdf
but expanding to support 100 languages using the embeddings from the translator.
Possibly start with the paper's code sample.
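The cited paper evaluates cross-lingual retrieval by comparing sentence embeddings across languages; the core metric is cosine similarity. A minimal sketch with toy vectors, assuming the translator's encoder outputs can be treated as plain float vectors (the embeddings below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    # Core metric for cross-lingual retrieval: align a query embedding
    # in one language with candidate embeddings in another
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional "embeddings"; a real study would use the
# translator's encoder outputs for 100 languages
en = [0.9, 0.1, 0.0, 0.4]
ru = [0.8, 0.2, 0.1, 0.5]        # hypothetical embedding of a parallel sentence
unrelated = [0.0, 0.9, 0.9, 0.0]

print(cosine_similarity(en, ru))         # high for parallel sentences
print(cosine_similarity(en, unrelated))  # low for unrelated text
```

A study in the style of the paper would compute this similarity between each query and all candidates, then report retrieval accuracy per language pair.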
I have a saved copy of t2t, but now it doesn't work anymore. Please help?
Hello! Thanks for the development of this text generation framework. Awesome work! I am trying to pass a string queried from a SQL database into the model as a string variable. However, when I do, there is a type error in the clean-text function of the PyTorch BERT model's tokenization file. The model is unable to take one char at a time even though there is a for loop. Have you come across this, and how have you dealt with it?
Error: /pytorch_pretrained_bert/tokenization.py", line 283, in _clean_text
cp = ord(char)
TypeError: ord() expected a character, but string of length 3619 found
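One guess at the cause, sketched with the stdlib only: the error shape ("string of length N found") typically means the character loop received a sequence whose elements are whole strings, e.g. a database cursor row (a tuple) passed where a plain `str` was expected. The `clean_text` function below is a simplified stand-in for the library's `_clean_text`, not its actual code:

```python
def clean_text(text):
    # Simplified stand-in for the character loop in
    # pytorch_pretrained_bert's _clean_text
    return "".join(char for char in text if ord(char) != 0)

row = ("Some long document fetched from SQL...",)  # e.g. a DB cursor row (tuple)

# clean_text(row)
# TypeError: ord() expected a character, but string of length 38 found
# Iterating a tuple/list yields whole strings, so ord() receives more
# than one character at a time.

text = row[0]  # unwrap to a plain str before passing it to the model
print(clean_text(text))
```

If the variable comes from a query result, checking `type(value)` and unwrapping the row before tokenization may resolve the error.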
Training and inference performance could be better. Need to update and test https://github.com/artitw/apex
Hey,
This project is exactly what I was searching for. I have tried many projects on GitHub for question generation but found none satisfactory. This has helped me a lot. I am trying to understand the logic behind question generation from text. I understand that you are using a BERT model for generation. Can you please help me understand this better?
Can anyone let me know how to serialize <text2text.text_generator.TextGenerator object at 0x00000165BFF72940>?
I am loading the following when Flask loads.
from text2text.text_generator import TextGenerator
qg = TextGenerator(output_type="question")
Now I need to pass the qg object to another server, so I need to serialize it.
Any relevant pointers would be highly appreciated.
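One possible approach, assuming TextGenerator wraps large PyTorch models (which generally do not pickle cleanly across machines): serialize only a small "recipe" of constructor arguments and rebuild the object on the receiving server. The recipe format below is invented for illustration; only `TextGenerator` and `output_type` come from the question above:

```python
import pickle

# Ship a small reconstruction recipe instead of the model-holding object
recipe = {"class": "TextGenerator", "kwargs": {"output_type": "question"}}
payload = pickle.dumps(recipe)

# --- on the receiving server ---
spec = pickle.loads(payload)
# from text2text.text_generator import TextGenerator
# qg = TextGenerator(**spec["kwargs"])   # reconstruct locally (assumption)
print(spec["kwargs"])  # {'output_type': 'question'}
```

The receiving server pays the model-loading cost once at startup, and the payload stays tiny; sending the pickled model weights themselves over the wire is usually both fragile and unnecessary.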
Code
text2text.Handler(["Marco ate a red apple at the restaurant yesterday with his friends."], src_lang="en").translate(tgt_lang='ru')
Warning:
/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
return torch.floor_divide(self, other)
Running on Google Colab.
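The warning above is about rounding direction for negative operands; plain Python shows the difference PyTorch is asking you to choose between (this is a stdlib illustration, not a change to the library's code):

```python
import math

a, b = -7, 2

trunc_div = int(a / b)         # rounds toward zero, like the old floor_divide
floor_div = math.floor(a / b)  # true floor division, like Python's a // b

print(trunc_div)  # -3
print(floor_div)  # -4
# torch.div(a, b, rounding_mode='trunc') and rounding_mode='floor'
# mirror these two behaviors; the warning asks callers to pick one explicitly.
```

For positive operands the two agree, which is why the deprecation is harmless in most code paths and surfaces only as a warning.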
Hi,
I am getting RuntimeError: expected scalar type Half but found Float while using output_type="question". Any leads on how to fix this?
Thanks
Fine-tune cross-lingual translator for text2text generation tasks, e.g. question generation, question answering, summarization, etc. to demonstrate cross-lingual alignment, zero-shot generation, etc.
For example, can we demonstrate question generation or question answering using the existing API? If not, what needs to get fixed?
I successfully installed Apex with CUDA 10.2 onto my device. When I run without Apex, it runs well. However, when I generate questions with Apex, I keep on getting this error. I was wondering if you knew how to go about fixing this.
AttributeError: 'FusedLayerNorm' object has no attribute 'normalized_shape'
example paragraph:
Hitler was born in [Q/TF/B/MCQ4] Austria—then part of Austria-Hungary—and was raised near Linz. He moved to Germany in 1913 and was decorated during his service in the German Army in World War I. In 1919, he joined the German Workers' Party (DAP), the precursor of the NSDAP, and was appointed leader of the NSDAP in 1921.
here: Q = What/When... questions, TF = true/false questions, B = blanks, MCQ = auto-generate 4 options of a similar type (in this case, countries). Combinations are possible; generated questions would be grouped by type.
Having some issues with the CUDA memory allocation. Is there a way around this or maybe I can just use CPU to train instead? What do I need to comment out?
RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 3.95 GiB total capacity; 2.56 GiB already allocated; 10.88 MiB free; 2.57 GiB reserved in total by PyTorch)
I'm using a GeForce GTX 1050 Mobile card so I understand that it's not exactly built for high end processing
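One common workaround, rather than commenting code out: hide the GPU from PyTorch before it loads, so everything falls back to CPU. This is a general PyTorch pattern, not something specific to this repo; the `text2text` import line is an assumption about where it would go:

```python
import os

# Hide all CUDA devices *before* torch (or text2text) is imported, so
# torch.cuda.is_available() returns False and everything runs on CPU.
# Slower, but avoids out-of-memory errors on a 4 GB GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = ""

# import torch                  # must come after the env change
# assert not torch.cuda.is_available()
# from text2text.text_generator import TextGenerator   # now CPU-only
print(os.environ["CUDA_VISIBLE_DEVICES"])
```

The key detail is ordering: the environment variable must be set before the first `import torch`, since CUDA device discovery happens at import time.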
Could you please tell me the maximum input size for tokenization?
Hello,
I hope you are doing fine. Firstly, thank you for your contributions on question generation. I have a question, if I may ask.
I'm trying to build a question generation system for a non-English language. I was planning to use UniLM (the multilingual MiniLM version), because BERT is not really built for text generation. Since you have experience with this, how do you suggest I go about it, and am I on the right path?
Thank you in advance for your appreciated help!
Currently, the documentation consists of the README, which is very brief. There is much more functionality in the text2text API that is not described. Documenting that functionality would benefit all users.
Hi, I last used T2T on 5 May 2021. Today I wanted to produce more work, and it wasn't working. I then downloaded the latest version, but it isn't working either. Please help. Thanks in advance for a quick and positive response.
Hi, I tried to generate questions for the Arabic and Urdu languages, and it seems even the small model cannot fit into memory to generate questions.
It runs for a long time and then the runtime crashes most of the time, though it has worked a few times.
import text2text as t2t
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M" #Remove this line for the larger model
# Urdu news sentence, roughly: "The government and the banned Tehreek-e-Taliban Pakistan are not sharing any progress on the talks, and despite inquiries the relevant ministers remain silent."
h = t2t.Handler(["حکومت اور کالعدم تحریک طالبان پاکستان کی جانب سے مذاکرات میں کسی بھی پیش رفت کے بارے میں آگاہ نہیں کیا جا رہا اور استفسار کے باوجود متعلقہ وزرا خاموشی اختیار کیے ہوئے ہیں۔"], src_lang="ur")
h.tokenize()
h.question()
Here is the crash log:
Dec 4, 2021, 1:19:02 PM | WARNING | WARNING:root:kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344 restarted
-- | -- | --
Dec 4, 2021, 1:19:02 PM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 4, 2021, 1:08:56 PM | WARNING | tcmalloc: large alloc 1242218496 bytes == 0x556f760fc000 @ 0x7f07e19221e7 0x556f0fd30f98 0x556f0fcfbe27 0x556f0fcfde20 0x556f0fcff2ed 0x556f0fdf0e1d 0x556f0fd72e99 0x556f0fc3fd14 0x556f0fdf0f31 0x556f0fe1e849 0x556f0fd6ea7d 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fdf1c66 0x556f0fd6edaf 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6e915 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6ec0d
Dec 4, 2021, 1:08:01 PM | INFO | Adapting to protocol v5.1 for kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:07:59 PM | INFO | Kernel started: b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:03:31 PM | INFO | Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Dec 4, 2021, 1:03:31 PM | INFO | http://172.28.0.12:9000/
Dec 4, 2021, 1:03:31 PM | INFO | The Jupyter Notebook is running at:
Dec 4, 2021, 1:03:31 PM | INFO | 0 active kernels
Dec 4, 2021, 1:03:31 PM | INFO | Serving notebooks from local directory: /
QG takes almost 1 minute to generate a question when it doesn't crash.
Does the model generate questions only of the "What" type?
Every time we run the model on an input sentence, one question is generated; the next time, a different question is generated for the same input. Is there any way to generate all possible questions for a given sentence in one go?
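The run-to-run variation suggests sampling-based decoding. A common pattern is to fix the random seed for repeatability and sweep several seeds to collect distinct questions. The generator below is a toy stand-in with invented template logic, since the real model call is too heavy to sketch here:

```python
import random

def generate_question(sentence, seed):
    # Toy stand-in for a sampling-based question generator:
    # the same seed always produces the same question
    rng = random.Random(seed)
    templates = ["What {}?", "When {}?", "Who {}?", "Where {}?"]
    return rng.choice(templates).format(sentence.rstrip(".").lower())

sentence = "Marco ate a red apple."
questions = {generate_question(sentence, seed) for seed in range(8)}
for q in sorted(questions):
    print(q)
```

Collecting outputs into a set across seeds approximates "all possible questions" when the sampler's variety is limited; whether the real API exposes a seed or a num-return-sequences knob would need to be checked against its code.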
I get this error when using Handler():
json.decoder.JSONDecodeError: Unterminated string starting at: line 98979 column 3 (char 2833058)
Here is a screenshot of the error:
Here's the test code that generated the error:
I'm using the text2text module directly; I did not pip install it. Could this be the reason for the error?
Hi!
I would like to know the process of fine-tuning UniLM with inverted SQUAD (hardware, training time, number of steps, parameters, etc.)
Would that be possible?
Thanks in advance!
Hi there, Is it possible to expand many similar meaning questions based on the given question?
There is currently no type checking, so we can follow practices from https://docs.python.org/3/library/typing.html
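A minimal sketch of what adding hints could look like; the function name, signature, and whitespace-split body are hypothetical placeholders, not the library's actual API:

```python
from typing import List

def tokenize(texts: List[str], src_lang: str = "en") -> List[List[str]]:
    # Hypothetical signature sketch; whitespace splitting stands in for
    # the real subword tokenizer so the hints have something to annotate
    return [t.split() for t in texts]

# A static checker such as mypy can then flag misuse (e.g. passing a bare
# str instead of a list) before runtime:
tokens = tokenize(["hello world"], src_lang="en")
print(tokens)  # [['hello', 'world']]
```

Annotating the public Handler entry points first would give users the most benefit, since those are the signatures most often misused.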
Follow guidelines from official Python documentation for unit testing: https://docs.python.org/3/library/unittest.html
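A minimal unittest sketch for the translation API; the `translate` function here is a lightweight stand-in (real tests could patch the model call the same way, since loading the actual model in CI is heavy):

```python
import unittest

def translate(texts, src_lang, tgt_lang):
    # Stand-in for t2t.Handler(texts, src_lang=...).translate(tgt_lang=...)
    return [f"[{tgt_lang}] {t}" for t in texts]

class TestTranslate(unittest.TestCase):
    def test_returns_one_output_per_input(self):
        out = translate(["hello", "world"], src_lang="en", tgt_lang="ru")
        self.assertEqual(len(out), 2)

    def test_marks_target_language(self):
        out = translate(["hello"], src_lang="en", tgt_lang="ru")
        self.assertTrue(out[0].startswith("[ru]"))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTranslate)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun)  # 2
```

Once real tests exist, the colab demo flows mentioned elsewhere in the issues are natural candidates to convert into cases like these.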
Hi, lovely repo! Can I ask where the pretrained models came from? I.e., were they from Microsoft, and whose Google Drive is the code downloading the model from? Appreciate it. Thanks!
Two approaches to try:
not sure what that means?
This looks like it's doing abstractive summarization, but occasionally it produces purely extractive summaries.
Can you confirm, and explain the exact methodology used in the summarization module?
Turn colab demo notebook into integration tests:
https://colab.research.google.com/drive/1LE_ifTpOGO5QJCKNQYtZe6c_tjbwnulR
https://github.com/artitw/text2text/blob/master/text2text_demo.ipynb