artitw / text2text

278 10 33 702 KB

Text2Text: Crosslingual NLP/G toolkit

Home Page: https://discord.gg/eHaaUuWpTc

License: Other

Python 100.00%
nlp question-generation natural-language-processing natural-language-generation data-augmentation translator cross-lingual multi-lingual question-answering transformers

text2text's Issues

Multilingual Search with Subword TF-IDF

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of training the subword tokenization model. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing.
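A minimal sketch of the idea, using overlapping character n-grams as a stand-in for a learned subword vocabulary (the actual STF-IDF work trains a subword model such as SentencePiece; the tokenizer and scoring here are simplified assumptions):

```python
import math
from collections import Counter

def subwords(text, n=3):
    # Crude subword tokenizer: overlapping character n-grams.
    # (The real approach trains a subword model, e.g. SentencePiece.)
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def stf_idf_search(query, docs, n=3):
    # Token counts per document, plus document frequencies per token.
    doc_tokens = [Counter(subwords(d, n)) for d in docs]
    df = Counter()
    for toks in doc_tokens:
        df.update(toks.keys())
    N = len(docs)

    def score(q_toks, d_toks):
        # Sum tf * idf over subword tokens shared with the query.
        s = 0.0
        for t in q_toks:
            if t in d_toks:
                idf = math.log((N + 1) / (df[t] + 1))
                s += d_toks[t] * idf
        return s

    q = subwords(query, n)
    return max(range(N), key=lambda i: score(q, doc_tokens[i]))

docs = ["the cat sat on the mat", "der Hund bellt laut", "le chat dort"]
print(stf_idf_search("chatte", docs))  # matches "le chat dort" via shared subwords
```

Because the query "chatte" shares subwords like "cha" and "hat" with the French document, no language-specific stemming or stop-word list is needed to retrieve it.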

Output confidence

Hi!

Great work, it's working very well for me. I have one question:
Is it possible, for question generation, to output the confidence of the generated question?
If so, how would it be done?

Thanks in advance!
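As far as I know text2text does not document a confidence API, but one common proxy is the sequence score: the length-normalized product of the per-token probabilities the decoder assigned while generating. A hand-rolled sketch of that normalization (the token probabilities below are made up for illustration):

```python
import math

def sequence_confidence(token_probs):
    # Length-normalized confidence: the geometric mean of per-token
    # probabilities, i.e. exp(mean log-prob). Normalizing by length
    # avoids penalizing longer questions relative to shorter ones.
    log_probs = [math.log(p) for p in token_probs]
    return math.exp(sum(log_probs) / len(log_probs))

# Hypothetical per-token probabilities for one generated question:
probs = [0.9, 0.8, 0.95, 0.7]
print(round(sequence_confidence(probs), 3))
```

With access to the decoder's per-step token probabilities, the same computation would give a comparable confidence score for each generated question.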

TypeError: ord(char) with string input

Hello! Thanks for developing this text generation framework. Awesome work! I am trying to query a string from a SQL database and pass it into the model as a string variable. However, when I do, there is a type error in the tokenization file of the PyTorch BERT model, inside the clean-text function. The model is unable to take one char at a time even though there is a for loop. Have you come across this, and how did you deal with it?

Error: /pytorch_pretrained_bert/tokenization.py", line 283, in _clean_text
cp = ord(char)
TypeError: ord() expected a character, but string of length 3619 found
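The traceback suggests that `_clean_text` iterated over a list of strings rather than a single string, so each "char" in its for loop was a whole 3619-character document. A defensive normalization before calling the model (the cause here is a guess from the error message, and this helper is hypothetical, not part of the library):

```python
def normalize_input(text):
    # The tokenizer iterates characters, so it must receive a str.
    # If a list of strings slips through (e.g. a SQL fetch returning
    # rows), each "char" becomes a whole document and ord() fails.
    if isinstance(text, (list, tuple)):
        text = " ".join(str(t) for t in text)
    elif not isinstance(text, str):
        text = str(text)
    return text

print(normalize_input(["row one", "row two"]))  # "row one row two"
```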

Understanding logic

Hey,
This project is exactly what I was searching for. I have tried many question generation projects on GitHub but found none satisfactory; this one has helped me a lot. I am trying to understand the logic behind question generation from text. I understand that you are using a BERT model for generation. Can you please help me understand this better?

Serializing <text2text.text_generator.TextGenerator object at 0x00000165BFF72940>

Can anyone let me know how to serialize <text2text.text_generator.TextGenerator object at 0x00000165BFF72940>?

I load the following when Flask starts:

from text2text.text_generator import TextGenerator
qg = TextGenerator(output_type="question")

Now I need to pass the qg object to another server, so I need to serialize it.
Any relevant pointers would be highly appreciated.
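Objects holding loaded PyTorch weights generally do not pickle well across processes or machines. A common workaround is to serialize only the constructor arguments and rebuild the object on the receiving server (the pattern below is generic; that `TextGenerator` itself resists pickling is an assumption based on the question):

```python
import pickle

class GeneratorSpec:
    # Lightweight, picklable description of how to build the generator.
    def __init__(self, output_type):
        self.output_type = output_type

    def build(self, factory):
        # factory would be TextGenerator on the receiving server,
        # which rebuilds (and re-downloads/loads) the model locally.
        return factory(output_type=self.output_type)

spec = GeneratorSpec(output_type="question")
payload = pickle.dumps(spec)   # small and safe to send over the wire

# On the other server:
restored = pickle.loads(payload)
# qg = restored.build(TextGenerator)
```

Alternatively, keep the model in one process and expose it over HTTP (e.g. a Flask endpoint), so no model object ever needs to cross a process boundary.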

pytorch warning using translate method

Code

text2text.Handler(["Marco ate a red apple at the restaurant yesterday with his friends."], src_lang="en").translate(tgt_lang='ru')

Warning:

/usr/local/lib/python3.7/dist-packages/torch/_tensor.py:575: UserWarning: floor_divide is deprecated, and will be removed in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values.
To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor'). (Triggered internally at  /pytorch/aten/src/ATen/native/BinaryOps.cpp:467.)
  return torch.floor_divide(self, other)

Running on Google Colab.
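The warning is harmless for now, but it flags a real semantic difference: `torch.floor_divide` historically truncated toward zero, while true floor division rounds toward negative infinity. The two behaviors can be illustrated in plain Python (the eventual fix in library code is to call `torch.div` with an explicit `rounding_mode`):

```python
# Truncation (what torch.floor_divide actually did) vs. floor division.
def trunc_div(a, b):
    return int(a / b)   # rounds toward zero

def floor_div(a, b):
    return a // b       # rounds toward negative infinity

print(trunc_div(-7, 2))  # -3
print(floor_div(-7, 2))  # -4
# For positive operands the two agree, so most code never notices:
print(trunc_div(7, 2), floor_div(7, 2))  # 3 3
```

Since token indices in translation models are non-negative, the deprecated behavior almost certainly gives the same result here, which is why it is only a warning.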

Apex Attribute Error

I successfully installed Apex with CUDA 10.2 on my device. When I run without Apex, it works well. However, when I generate questions with Apex, I keep getting this error. I was wondering if you know how to fix it.

AttributeError: 'FusedLayerNorm' object has no attribute 'normalized_shape'

generating different types of question depending on token combination

example paragraph:
Hitler was born in [Q/TF/B/MCQ4] Austria—then part of Austria-Hungary—and was raised near Linz. He moved to Germany in 1913 and was decorated during his service in the German Army in World War I. In 1919, he joined the German Workers' Party (DAP), the precursor of the NSDAP, and was appointed leader of the NSDAP in 1921.

Here, Q = what/when/etc. questions, TF = true/false questions, B = fill-in-the-blank, and MCQ4 = auto-generate 4 options of a similar type (in this case, countries). Combinations would be possible, and generated questions would be grouped by type.
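A small sketch of parsing such inline tags out of a paragraph (the tag syntax comes from the example above; the parser itself is hypothetical, not an existing text2text feature):

```python
import re

TAG_PATTERN = re.compile(r"\[([A-Z0-9/]+)\]")

def extract_question_types(paragraph):
    # Find each [Q/TF/B/MCQ4]-style marker, collect the requested
    # question types, and return the paragraph with markers removed.
    types = []
    for match in TAG_PATTERN.finditer(paragraph):
        types.extend(match.group(1).split("/"))
    cleaned = TAG_PATTERN.sub("", paragraph)
    return cleaned, types

text = "Hitler was born in [Q/TF/B/MCQ4] Austria."
cleaned, types = extract_question_types(text)
print(types)  # ['Q', 'TF', 'B', 'MCQ4']
```

Each collected type could then dispatch to a different generation routine, with results grouped per type as the request describes.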

RuntimeError: CUDA out of memory

Having some issues with CUDA memory allocation. Is there a way around this, or can I just use the CPU to train instead? What do I need to comment out?

RuntimeError: CUDA out of memory. Tried to allocate 14.00 MiB (GPU 0; 3.95 GiB total capacity; 2.56 GiB already allocated; 10.88 MiB free; 2.57 GiB reserved in total by PyTorch)

I'm using a GeForce GTX 1050 Mobile card, so I understand it's not exactly built for high-end processing.

Question generation for non English languages

Hello
I hope you are doing fine. Firstly, thank you for your contributions on question generation. I have a question, if I may ask.
I'm trying to build a question generation system for a non-English language. I was planning to use UniLM (the multilingual MiniLM version), because BERT is not really built for text generation. Since you have experience with this, how do you suggest I go about it, and am I on the right path?

Thank you in advance for your appreciated help !

qg not working

Hi, I last used T2T on 5 May 2021. Today I wanted to produce more work, but it wasn't working. I then downloaded the latest version, but it isn't working either. Please help. Thanks in advance for a quick and positive response.

QG not working properly for other languages.

Hi, I tried to generate questions for Arabic and Urdu, and it seems even the small model cannot fit into memory.
It runs for a long time and then the runtime crashes most of the time, though it worked a few times.

import text2text as t2t
t2t.Transformer.PRETRAINED_TRANSLATOR = "facebook/m2m100_418M" #Remove this line for the larger model
h = t2t.Handler(["حکومت اور کالعدم تحریک طالبان پاکستان کی جانب سے مذاکرات میں کسی بھی پیش رفت کے بارے میں آگاہ نہیں کیا جا رہا اور استفسار کے باوجود متعلقہ وزرا خاموشی اختیار کیے ہوئے ہیں۔"], src_lang="ur")
h.tokenize()
h.question()

Here is the log of crash

Dec 4, 2021, 1:19:02 PM | WARNING | WARNING:root:kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344 restarted
-- | -- | --
Dec 4, 2021, 1:19:02 PM | INFO | KernelRestarter: restarting kernel (1/5), keep random ports
Dec 4, 2021, 1:08:56 PM | WARNING | tcmalloc: large alloc 1242218496 bytes == 0x556f760fc000 @ 0x7f07e19221e7 0x556f0fd30f98 0x556f0fcfbe27 0x556f0fcfde20 0x556f0fcff2ed 0x556f0fdf0e1d 0x556f0fd72e99 0x556f0fc3fd14 0x556f0fdf0f31 0x556f0fe1e849 0x556f0fd6ea7d 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fdf1c66 0x556f0fd6edaf 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6e915 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd0148c 0x556f0fd01698 0x556f0fd6ffe4 0x556f0fd6d9ee 0x556f0fd00bda 0x556f0fd6ec0d
Dec 4, 2021, 1:08:01 PM | INFO | Adapting to protocol v5.1 for kernel b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:07:59 PM | INFO | Kernel started: b089d5ac-c179-45fc-aae7-a1cd3fc13344
Dec 4, 2021, 1:03:31 PM | INFO | Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
Dec 4, 2021, 1:03:31 PM | INFO | http://172.28.0.12:9000/
Dec 4, 2021, 1:03:31 PM | INFO | The Jupyter Notebook is running at:
Dec 4, 2021, 1:03:31 PM | INFO | 0 active kernels
Dec 4, 2021, 1:03:31 PM | INFO | Serving notebooks from local directory: /

When it does not crash, QG takes almost a minute to generate a question.

Question generation

Does the model generate only "What"-type questions?
Every time we run the model with an input sentence, one question is generated; the next time, a different question is generated for the same input. Is there any way to generate all possible questions for a given sentence in one go?

JSONDecodeError when using the Handler Object

I get this error when using Handler():
json.decoder.JSONDecodeError: Unterminated string starting at: line 98979 column 3 (char 2833058)

Here is a screenshot of the error:

[screenshot attached]

Here's the test code that generated the error:

[screenshot attached]

I'm using the text2text module directly from source; I did not pip install it. Could this be the reason for the error?
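An unterminated string at a large character offset usually means a truncated download or a partially written data file rather than a bug in the calling code. A quick way to check where the JSON breaks (this diagnostic helper is illustrative, not part of text2text):

```python
import json

def check_json_text(raw):
    # Report whether the text parses as JSON, and where it breaks.
    try:
        json.loads(raw)
        return "ok"
    except json.JSONDecodeError as e:
        # A position near the end of the text suggests truncation:
        # re-download the file rather than debugging the caller.
        return f"broken at char {e.pos} of {len(raw)}"

print(check_json_text('{"a": 1}'))           # ok
print(check_json_text('{"a": "unterminat'))  # broken near the end
```

Running this over the file named in the traceback would show whether the failure position sits at the very end of the file, which would point to an incomplete model/data download.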

Fine-Tuning process

Hi!
I would like to know the process of fine-tuning UniLM with inverted SQUAD (hardware, training time, number of steps, parameters, etc.)
Would that be possible?
Thanks in advance!

Question expansion

Hi there, is it possible to expand a given question into many similar-meaning questions?

Enquiry: Source of pretrained model

Hi, lovely repo! Can I ask where the pretrained models came from? I.e., were they from Microsoft, and whose Google Drive is the code downloading the model from? Appreciate it. Thanks!

How does the summarization work?

This looks like it's doing abstractive summarization, but occasionally it produces purely extractive summaries.

Can you confirm, and explain the exact methodology used in the summarization module?
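For contrast with the abstractive approach, here is a bare-bones extractive baseline (frequency-scored sentence selection); this illustrates the extractive idea only, and is not text2text's actual method:

```python
import re
from collections import Counter

def extractive_summary(text, k=1):
    # Split into sentences, score each by the summed corpus frequency
    # of its words, and return the top-k sentences in original order.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    scored = sorted(
        range(len(sentences)),
        key=lambda i: -sum(
            freq[w] for w in re.findall(r"[a-z']+", sentences[i].lower())
        ),
    )
    keep = sorted(scored[:k])  # restore document order
    return " ".join(sentences[i] for i in keep)

text = "Cats sleep a lot. Cats also hunt mice. Dogs bark."
print(extractive_summary(text, k=1))  # "Cats sleep a lot."
```

An extractive method like this copies sentences verbatim, whereas an abstractive model generates new wording; a model trained abstractively can still emit near-verbatim sentences when they score highly, which may explain the mixed behavior observed.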
