
paulbricman / autocards


This project is forked from patil-suraj/question_generation


Accelerating learning through machine-generated flashcards.

Home Page: https://paulbricman.com/thoughtware/autocards

License: MIT License

Python 100.00%
anki flashcards spaced-repetition tools-for-thought cognitive-enhancement

autocards's Introduction

Autocards

  • Automatically create questions and answers from various input formats (PDF files, webpages, Wikipedia, EPUB files, etc.) for your favorite flashcard software (Anki, SuperMemo, and others).
  • Can handle virtually any language thanks to built-in translation (though usually at the cost of lower-quality cards).
  • To see a real-world example, the complete output of this article can be found in this folder. It's the direct output, with no post-processing whatsoever.
  • The code is PEP 8 compliant and documented with docstrings. Contributions and PRs are greatly appreciated.
  • Learn more by reading the official write-up.

How to:

  • This has been tested on Python 3.9 but will probably work on earlier versions as well.
  • git clone https://github.com/paulbricman/autocards
  • cd autocards
  • pip install -r ./requirements.txt
  • install the punkt tokenizer data by running python -m nltk.downloader punkt
  • open a Python console: ipython3
  • read the usage guide below

Usage:

All arguments are shown with their default values; you don't have to supply them every time.

  • initialization:

    • from autocards import Autocards

    • a = Autocards(in_lang="en", out_lang="en")

      translation modules sometimes need to be downloaded and can be rather large

  • consuming input text is done in one of the following ways:

    • a.consume_var(my_text, per_paragraph=True)

    • a.consume_user_input(title="")

    • a.consume_textfile(path_to_file, per_paragraph=True)

    • a.consume_pdf(path_to_file, per_paragraph=True)

    • a.consume_web(link_or_path, mode="url", element="p")

      mode can be "url" or "local"

      element is the html element, like p for paragraph

  • different ways to get the results back:

    • out = a.string_output(prefix='', jeopardy=False)

      prefix is text that will be prepended to the question & answer

      jeopardy is for swapping question and answer

    • a.print(prefix='', jeopardy=False)

    • a.pprint(prefix='', jeopardy=False)

      pprint stands for pretty printing

    • a.to_anki(deckname="autocards_export", tags=["some_tag"])

    • df = a.pandas_df(prefix='')

    • a.to_csv("output.csv", prefix="")

    • a.to_json("output.json", prefix="")

    Also note that a user provided their own scripts that you can take inspiration from; they are a bit outdated but can be found in the examples_script folder.
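
A minimal end-to-end sketch tying the calls above together (sample.txt is a hypothetical input file; everything else is the documented API):

from autocards import Autocards

a = Autocards(in_lang="en", out_lang="en")
a.consume_textfile("sample.txt", per_paragraph=True)  # one paragraph at a time
a.pprint()                                            # inspect the generated cards
a.to_csv("output.csv")                                # then export them
a.to_anki(deckname="autocards_export", tags=["demo"])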

autocards's People

Contributors

deklanw, henry-pulver, mrm8488, nautman, nrjvarshney, patil-suraj, paulbricman, priyanksonis, thiswillbeyourgithub, tomwilde


autocards's Issues

Limited number of flashcards generated despite large input

Ya, I did the 'open demo' and I put in a good couple of paragraphs of content. It never generated more than 3 questions. 😬

It might be useful to feed individual paragraphs into the model. This way, the question-answering check has a better chance of succeeding, because it won't find answers in other places and invalidate the flashcard candidate.

Add the option to consume text paragraph after paragraph

When a large body of text is consumed, the program sometimes doesn't know where to search for the answer to a question, because it may look for the answer in a different location in the text than where it really is.
If the text is consumed one paragraph at a time, this may solve the issue,
i.e. with a for loop, consuming one paragraph then moving on to the next (a sketch follows below).

Note that this problem doesn't happen all the time, so the change may break things; hence it would be optional.
This is more of an open-ended idea than an issue.
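
A minimal sketch of that loop, assuming my_text holds the full input and paragraphs are separated by blank lines (this is essentially what the per_paragraph=True argument in the usage section now does):

from autocards import Autocards

a = Autocards()
paragraphs = [p.strip() for p in my_text.split("\n\n") if p.strip()]
for paragraph in paragraphs:
    # consuming each paragraph separately keeps the answer search local
    a.consume_var(paragraph, per_paragraph=False)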

My solution to "list index out of range" in tokenization_utils_fast.py

Hi,

I had this exception happen to me; apparently it occurs when a sentence is tokenized into an empty list (i.e. no useful words are in the sentence).

Updating transformers to 4.6.1 did not solve the issue, so I decided to read the text line by line with a try/except, and it worked very well (a sketch follows below).

I copied my script here: #6

I figured I might as well help people who have the same issue.

Edit: added in the latest revision.
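
A minimal sketch of that workaround (the line-by-line loop and the except clause reflect the comment above, not official behavior; path_to_file is a placeholder):

from autocards import Autocards

a = Autocards()
with open(path_to_file) as f:
    for line in f:
        if not line.strip():
            continue
        try:
            a.consume_var(line, per_paragraph=False)
        except IndexError:
            # skip sentences that tokenize to an empty list
            print("skipped line:", repr(line))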

ERROR - You need to have sentencepiece installed to convert a slow tokenizer to a fast one

Facing an issue after executing a = Autocards(in_lang="en", out_lang="en"):

ValueError: Couldn't instantiate the backend tokenizer from one of:

(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

I tried this solution: https://stackoverflow.com/questions/65431837/transformers-v4-x-convert-slow-tokenizer-to-fast-tokenizer, but it still did not work.

According to the Transformers v4.0.0 release notes, sentencepiece was removed as a required dependency. This means that "the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation", including the XLMRobertaTokenizer. However, sentencepiece can be installed as an extra dependency.
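
As the error message itself suggests, installing the missing library should fix this:

pip install sentencepiece

or, to pull it in as a transformers extra, pip install transformers[sentencepiece].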

Ignore references while consuming text

I've been trying out random wiki pages... Having references like [1] seems to mess it up, as well as the line breaks. I've been pasting the text into Word and replacing "^p" (the paragraph mark) with ". ". For the bracketed references, you can search for "[^#]", where ^# matches any number.

Sounds like a nice possible option: ignore_references as an argument of the consume functions, or something like that (a sketch follows below).
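
A pre-processing sketch for stripping such references before consumption (a plain regex approach; ignore_references itself is only a proposal, not an existing argument, and raw_text is a placeholder):

import re

def strip_references(text):
    # remove bracketed citation markers like [1] or [23]
    text = re.sub(r"\[\d+\]", "", text)
    text = re.sub(r"\[citation needed\]", "", text, flags=re.IGNORECASE)
    # collapse line breaks so sentences aren't split mid-way
    return re.sub(r"\s*\n+\s*", " ", text).strip()

# assuming a is an Autocards instance, as in the usage section
a.consume_var(strip_references(raw_text))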

currently uses only 4 cores instead of 8

I noticed while creating cards from a large document that only 4 out of my 8 cores seem to be used. Maybe investigating this could speed up autocards a lot.

(And yes, I don't have a GPU).
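
If the bottleneck is PyTorch's intra-op threading, raising the thread count is worth a try (a guess at the cause, not a confirmed fix; torch.set_num_threads is standard PyTorch):

import os
import torch

# PyTorch often defaults its thread pool to the number of physical
# cores, which can be half the logical core count
torch.set_num_threads(os.cpu_count() or 1)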

Add PDF handling

Add a way for users to feed in a PDF, let them select which pages they want the program to create questions on, and automatically extract, consume, and create questions from those pages.
Useful for textbooks and papers.

See #8 for the same idea, but with web scraping.
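
The usage section above now lists a.consume_pdf(...); a page-selection sketch along these lines, with pdfplumber as an assumed extraction library and illustrative file name and page numbers:

import pdfplumber
from autocards import Autocards

pages_wanted = range(10, 15)  # 0-indexed pages to extract
with pdfplumber.open("textbook.pdf") as pdf:
    text = "\n\n".join(pdf.pages[i].extract_text() or "" for i in pages_wanted)

a = Autocards()
a.consume_var(text, per_paragraph=True)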

Add web-scraping capabilities

Add a way for users to give a URL, have the URL scraped, and consume and create questions from the content.

Useful for articles, online textbooks, and online non-PDF papers.
See #7 for the same idea, but with PDF handling.
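
The usage section above now lists a.consume_web(...); a standalone sketch in the same spirit, with requests and BeautifulSoup as the assumed tools and a placeholder URL:

import requests
from bs4 import BeautifulSoup
from autocards import Autocards

html = requests.get("https://example.com/article").text
soup = BeautifulSoup(html, "html.parser")

a = Autocards()
for p in soup.find_all("p"):
    # feed each paragraph element separately, as suggested above
    a.consume_var(p.get_text(), per_paragraph=False)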

PyPI package

Hi @paulbricman, love the tool and all the interesting applications it can be used for. I was wondering, do you have any plans to make it into a PyPI package for people to install easily? If so, I'd be happy to open a PR that sets up a minimal one and helps you get it onto PyPI (it is extremely easy).

do a demo on a Wikipedia outline

Hi,

just had an interesting idea that I thought was worth sharing.

Wikipedia has these very cool pages called "outlines"; for example, here's the one for genetics.

They make it easy to quickly find lots of pages on a given subject.

I think it should be rather easy to write a Python script using BeautifulSoup to gather the first paragraph of every page listed in a given outline, then feed it to Autocards (a sketch follows below).

The resulting Anki deck would be a very interesting demo of an "in the field" use of Autocards.

Actually, it would be a very quick and low-cost way of proving the value of Autocards. And the decks might be useful in themselves!

You could then share them on AnkiWeb, and that should be nice publicity!

(sort of related to #8 )

Thoughts?

Edit: a script showing how to do this can be found in the latest revision.
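
A rough sketch of the idea, using Wikipedia's REST summary endpoint for the first paragraph of each linked page (the outline URL and the link-filtering heuristic are assumptions; the script in the repo may differ):

import requests
from bs4 import BeautifulSoup

OUTLINE = "https://en.wikipedia.org/wiki/Outline_of_genetics"
API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

soup = BeautifulSoup(requests.get(OUTLINE).text, "html.parser")
titles = {
    link["href"].removeprefix("/wiki/")
    for link in soup.select("a[href^='/wiki/']")
    if ":" not in link["href"]  # skip File:, Category:, etc. namespaces
}

first_paragraphs = []
for title in sorted(titles):
    summary = requests.get(API + title).json()
    if "extract" in summary:
        first_paragraphs.append(summary["extract"])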

Not working

>>> a = Autocards()
Loading backend, this can take some time...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\USER\autocards\autocards.py", line 79, in __init__
    self.qg = qg_pipeline('question-generation',
  File "C:\Users\USER\autocards\pipelines.py", line 360, in qg_pipeline
    tokenizer = AutoTokenizer.from_pretrained(tokenizer)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\models\auto\tokenization_auto.py", line 435, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_base.py", line 1719, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_base.py", line 1791, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\models\t5\tokenization_t5_fast.py", line 128, in __init__
    super().__init__(
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_fast.py", line 105, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
>>> a.consume_var("a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined

Quotes cause trouble in the demo

A good addition to this would be a box at the top to parse text for characters that cause errors (like quotes) and replace bullets and line-breaks with periods, etc. That approach has mostly worked.

Those characters should somehow be escaped; a sanitizer sketch is below.
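
A sanitizer sketch along those lines (which characters actually trip the demo isn't documented, so the replacement set here is a guess):

def sanitize(text):
    # straighten curly quotes, which reportedly cause errors
    for bad, good in {"“": '"', "”": '"', "‘": "'", "’": "'"}.items():
        text = text.replace(bad, good)
    # turn bullets and line breaks into sentence boundaries
    for sep in ("\r\n", "\n", "•"):
        text = text.replace(sep, ". ")
    return text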
