artitw / text2text

278 10 33 702 KB

Text2Text: Crosslingual NLP/G toolkit

Home Page: https://discord.gg/eHaaUuWpTc

License: Other

Python 100.00%
nlp question-generation natural-language-processing natural-language-generation data-augmentation translator cross-lingual multi-lingual question-answering transformers

text2text's Issues

Multilingual Search with Subword TF-IDF

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of training the subword tokenization model. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing.
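A minimal sketch of the idea, using overlapping character n-grams as a stand-in for a learned subword vocabulary (the actual STF-IDF work trains a subword model such as SentencePiece; the tokenizer and scoring here are simplified assumptions):

```python
import math
from collections import Counter

def subwords(text, n=3):
    # Crude subword tokenizer: overlapping character n-grams.
    # (The real approach trains a subword model, e.g. SentencePiece.)
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def stf_idf_search(query, docs, n=3):
    # Token counts per document, plus document frequencies per token.
    doc_tokens = [Counter(subwords(d, n)) for d in docs]
    df = Counter()
    for toks in doc_tokens:
        df.update(toks.keys())
    N = len(docs)

    def score(q_toks, d_toks):
        # Sum tf * idf over subword tokens shared with the query.
        s = 0.0
        for t in q_toks:
            if t in d_toks:
                idf = math.log((N + 1) / (df[t] + 1))
                s += d_toks[t] * idf
        return s

    q = subwords(query, n)
    return max(range(N), key=lambda i: score(q, doc_tokens[i]))

docs = ["the cat sat on the mat", "der Hund bellt laut", "le chat dort"]
print(stf_idf_search("chatte", docs))  # matches "le chat dort" via shared subwords
```

Because the query "chatte" shares subwords like "cha" and "hat" with the French document, no language-specific stemming or stop-word list is needed to retrieve it.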

Output confidence

Hi!

Great work, it's working very well for me. I have one question:
Is it possible, for question generation, to output the confidence of the generated question?
If so, how would it be done?

Thanks in advance!
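As far as I know text2text does not document a confidence API, but one common proxy is the sequence score: the length-normalized product of the per-token probabilities the decoder assigned while generating. A hand-rolled sketch of that normalization (the token probabilities below are made up for illustration):

```python
import math

def sequence_confidence(token_probs):
    # Length-normalized confidence: the geometric mean of per-token
    # probabilities, i.e. exp(mean log-prob). Normalizing by length
    # avoids penalizing longer questions relative to shorter ones.
    log_probs = [math.log(p) for p in token_probs]
    return math.exp(sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities for one generated question:
probs = [0.9, 0.8, 0.95, 0.7]
print(round(sequence_confidence(probs), 3))
```

With access to the decoder's per-step token probabilities, the same computation would give a comparable confidence score for each generated question.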

TypeError: ord(char) with string input

Hello! Thanks for developing this text generation framework. Awesome work! I am trying to query a string from a SQL database and pass it into the model as a string variable. However, when I do, there is a type error in the tokenization file of the PyTorch BERT model, inside the clean-text function. The model is unable to take one char at a time even though there is a for loop. Have you come across this, and how did you deal with it?

Error: /pytorch_pretrained_bert/tokenization.py", line 283, in _clean_text
cp = ord(char)
TypeError: ord() expected a character, but string of length 3619 found
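The traceback suggests that `_clean_text` iterated over a list of strings rather than a single string, so each "char" in its for loop was a whole 3619-character document. A defensive normalization before calling the model (the cause here is a guess from the error message, and this helper is hypothetical, not part of the library):

```python
def normalize_input(text):
    # The tokenizer iterates characters, so it must receive a str.
    # If a list of strings slips through (e.g. a SQL fetch returning
    # rows), each "char" becomes a whole document and ord() fails.
    if isinstance(text, (list, tuple)):
        text = " ".join(str(t) for t in text)
    elif not isinstance(text, str):
        text = str(text)
    return text

print(normalize_input(["row one", "row two"]))  # "row one row two"
```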

Understanding logic

Hey,
This project is exactly what I was searching for. I have tried many question generation projects on GitHub but found none satisfactory; this one has helped me a lot. I am trying to understand the logic behind question generation from text. I understand that you are using a BERT model for generation. Can you please help me understand this better?

Serializing <text2text.text_generator.TextGenerator object at 0x00000165BFF72940>

Can anyone let me know how to serialize <text2text.text_generator.TextGenerator object at 0x00000165BFF72940>?

I load the following when Flask starts:

from text2text.text_generator import TextGenerator
qg = TextGenerator(output_type="question")

Now I need to pass the qg object to another server, so I need to serialize it.
Any relevant pointers would be highly appreciated.
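Objects holding loaded PyTorch weights generally do not pickle well across processes or machines. A common workaround is to serialize only the constructor arguments and rebuild the object on the receiving server (the pattern below is generic; that `TextGenerator` itself resists pickling is an assumption based on the question):

```python
import pickle

class GeneratorSpec:
    # Lightweight, picklable description of how to build the generator.
    def __init__(self, output_type):
        self.output_type = output_type

    def build(self, factory):
        # factory would be TextGenerator on the receiving server,
        # which rebuilds (and re-downloads/loads) the model locally.
        return factory(output_type=self.output_type)

spec = GeneratorSpec(output_type="question")
payload = pickle.dumps(spec)   # small and safe to send over the wire

# On the other server:
restored = pickle.loads(payload)
# qg = restored.build(TextGenerator)
```

Alternatively, keep the model in one process and expose it over HTTP (e.g. a Flask endpoint), so no model object ever needs to cross a process boundary.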

pytorch warning using translate method

Code

text2text.Handler(["Marco ate a red apple at the restaurant yesterday with his friends."], src_lang="en").translate(tgt_lang='ru')

Warning:

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)

Running on Google Colab.
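The warning is harmless for now, but it flags a real semantic difference: `torch.floor_divide` historically truncated toward zero, while true floor division rounds toward negative infinity. The two behaviors can be illustrated in plain Python (the eventual fix in library code is to call `torch.div` with an explicit `rounding_mode`):

```python
# Truncation (what torch.floor_divide actually did) vs. floor division.
def trunc_div(a, b):
    return int(a / b)   # rounds toward zero

def floor_div(a, b):
    return a // b       # rounds toward negative infinity

print(trunc_div(-7, 2))  # -3
print(floor_div(-7, 2))  # -4
# For positive operands the two agree, so most code never notices:
print(trunc_div(7, 2), floor_div(7, 2))  # 3 3
```

Since token indices in translation models are non-negative, the deprecated behavior almost certainly gives the same result here, which is why it is only a warning.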

Apex Attribute Error

I successfully installed Apex with CUDA 10.2 on my device. When I run without Apex, it works well. However, when I generate questions with Apex, I keep getting this error. I was wondering if you know how to fix it.

AttributeError: 'FusedLayerNorm' object has no attribute 'normalized_shape'

generating different types of question depending on token combination

example paragraph:
Hitler was born in [Q/TF/B/MCQ4] Austria—then part of Austria-Hungary—and was raised near Linz. He moved to Germany in 1913 and was decorated during his service in the German Army in World War I. In 1919, he joined the German Workers' Party (DAP), the precursor of the NSDAP, and was appointed leader of the NSDAP in 1921.

Here, Q = what/when/etc. questions, TF = true/false questions, B = fill-in-the-blank, and MCQ4 = auto-generate 4 options of a similar type (in this case, countries). Combinations would be possible, and generated questions would be grouped by type.
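A small sketch of parsing such inline tags out of a paragraph (the tag syntax comes from the example above; the parser itself is hypothetical, not an existing text2text feature):

```python
import re

TAG_PATTERN = re.compile(r"\[([A-Z0-9/]+)\]")

def extract_question_types(paragraph):
    # Find each [Q/TF/B/MCQ4]-style marker, collect the requested
    # question types, and return the paragraph with markers removed.
    types = []
    for match in TAG_PATTERN.finditer(paragraph):
        types.extend(match.group(1).split("/"))
    cleaned = TAG_PATTERN.sub("", paragraph)
    return cleaned, types

text = "Hitler was born in [Q/TF/B/MCQ4] Austria."
cleaned, types = extract_question_types(text)
print(types)  # ['Q', 'TF', 'B', 'MCQ4']
```

Each collected type could then dispatch to a different generation routine, with results grouped per type as the request describes.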

RuntimeError: CUDA out of memory

Having some issues with CUDA memory allocation. Is there a way around this, or can I just use the CPU to train instead? What do I need to comment out?

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 3.95 GiB total capacity; 2.56 GiB already allocated; 10.88 MiB free; 2.57 GiB reserved in total by PyTorch)

I'm using a GeForce GTX 1050 Mobile card, so I understand it's not exactly built for high-end processing.

Question generation for non English languages

Hello
I hope you are doing fine. Firstly, thank you for your contributions on question generation. I have a question, if I may ask.
I'm trying to build a question generation system for a non-English language. I was planning to use UniLM (the multilingual MiniLM version), because BERT is not really built for text generation. Since you have experience with this, how do you suggest I go about it, and am I on the right path?

Thank you in advance for your appreciated help !

qg not working

Hi, I last used T2T on 5 May 2021. Today I wanted to produce more work, but it wasn't working. I then downloaded the latest version, but it isn't working either. Please help. Thanks in advance for a quick and positive response.

QG not working properly for other languages.

Hi, I tried to generate questions for Arabic and Urdu, and it seems even the small model cannot fit into memory.
It runs for a long time and then the runtime crashes most of the time, though it worked a few times.

import text2text as t2t
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M" #Remove this line for the larger model
h = t2t.Handler(["حکومت اور کالعدم تحریک طالبان پاکستان کی جانب سے مذاکرات میں کسی بھی پیش رفت کے بارے میں آگاہ نہیں کیا جا رہا اور استفسار کے باوجود متعلقہ وزرا خاموشی اختیار کیے ہوئے ہیں۔"], src_lang="ur")
h.tokenize()
h.question()

Here is the log of crash

Dec 4, 2021, 1:19:02 PM | WARNING | WARNING:root:kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344 restarted
-- | -- | --
Dec 4, 2021, 1:19:02 PM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 4, 2021, 1:08:56 PM | WARNING | tcmalloc: large alloc 1242218496 bytes == 0x556f760fc000 @ 0x7f07e19221e7 0x556f0fd30f98 0x556f0fcfbe27 0x556f0fcfde20 0x556f0fcff2ed 0x556f0fdf0e1d 0x556f0fd72e99 0x556f0fc3fd14 0x556f0fdf0f31 0x556f0fe1e849 0x556f0fd6ea7d 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fdf1c66 0x556f0fd6edaf 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6e915 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6ec0d
Dec 4, 2021, 1:08:01 PM | INFO | Adapting to protocol v5.1 for kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:07:59 PM | INFO | Kernel started: b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:03:31 PM | INFO | Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Dec 4, 2021, 1:03:31 PM | INFO | http://172.28.0.12:9000/
Dec 4, 2021, 1:03:31 PM | INFO | The Jupyter Notebook is running at:
Dec 4, 2021, 1:03:31 PM | INFO | 0 active kernels
Dec 4, 2021, 1:03:31 PM | INFO | Serving notebooks from local directory: /

When it does not crash, QG takes almost a minute to generate a question.

Question generation

Does the model generate only "What"-type questions?
Every time we run the model with an input sentence, one question is generated; the next time, a different question is generated for the same input. Is there any way to generate all possible questions for a given sentence in one go?

JSONDecodeError when using the Handler Object

I get this error when using Handler():
json.decoder.JSONDecodeError: Unterminated string starting at: line 98979 column 3 (char 2833058)

Here is a screenshot of the error:

[screenshot attached]

Here's the test code that generated the error:

[screenshot attached]

I'm using the text2text module directly from source; I did not pip install it. Could this be the reason for the error?
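An unterminated string at a large character offset usually means a truncated download or a partially written data file rather than a bug in the calling code. A quick way to check where the JSON breaks (this diagnostic helper is illustrative, not part of text2text):

```python
import json

def check_json_text(raw):
    # Report whether the text parses as JSON, and where it breaks.
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as e:
        # A position near the end of the text suggests truncation:
        # re-download the file rather than debugging the caller.
        return f"broken at char {e.pos} of {len(raw)}"

print(check_json_text('{"a": 1}'))           # ok
print(check_json_text('{"a": "unterminat'))  # broken near the end
```

Running this over the file named in the traceback would show whether the failure position sits at the very end of the file, which would point to an incomplete model/data download.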

Fine-Tuning process

Hi!
I would like to know the process of fine-tuning UniLM with inverted SQUAD (hardware, training time, number of steps, parameters, etc.)
Would that be possible?
Thanks in advance!

Question expansion

Hi there, is it possible to expand a given question into many similar-meaning questions?

Enquiry: Source of pretrained model

Hi, lovely repo! Can I ask where the pretrained models came from? I.e., were they from Microsoft, and whose Google Drive is the code downloading the model from? Appreciate it. Thanks!

How does the summarization work?

This looks like it's doing abstractive summarization, but occasionally it produces purely extractive summaries.

Can you confirm, and explain the exact methodology used in the summarization module?
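For contrast with the abstractive approach, here is a bare-bones extractive baseline (frequency-scored sentence selection); this illustrates the extractive idea only, and is not text2text's actual method:

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    # Split into sentences, score each by the summed corpus frequency
    # of its words, and return the top-k sentences in original order.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(
            freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())
        ),
    )
    keep = sorted(scored[:k])  # restore document order
    return " ".join(sentences[i] for i in keep)

text = "Cats sleep a lot. Cats also hunt mice. Dogs bark."
print(extractive_summary(text, k=1))  # "Cats sleep a lot."
```

An extractive method like this copies sentences verbatim, whereas an abstractive model generates new wording; a model trained abstractively can still emit near-verbatim sentences when they score highly, which may explain the mixed behavior observed.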
