
paulbricman / autocards


This project is forked from patil-suraj/question_generation


Accelerating learning through machine-generated flashcards.

Home Page: https://paulbricman.com/thoughtware/autocards

License: MIT License

Python 100.00%
anki flashcards spaced-repetition tools-for-thought cognitive-enhancement

autocards's Introduction

Autocards

  • Automatically create questions and answers from various input formats (PDF files, webpages, Wikipedia, EPUB files, etc.) for your favorite flashcard software (Anki, SuperMemo, and others).
  • Can handle virtually any language thanks to built-in translation (though usually at the cost of lower-quality cards).
  • To see a real-world example, the complete output of this article can be found in this folder. It's the direct output, with no post-processing whatsoever.
  • The code is PEP 8 compliant and documented with docstrings. Contributions and PRs are greatly appreciated.
  • Learn more by reading the official write-up.

How to:

  • This has been tested on Python 3.9 but will probably work on earlier versions as well.
  • git clone https://github.com/paulbricman/autocards
  • cd autocards
  • pip install -r ./requirements.txt
  • install the punkt tokenizer data by running python -m nltk.downloader punkt
  • open a Python console: ipython3
  • read the usage guide below

Usage:

All arguments are shown with their default values; you don't have to supply them every time.

  • initialization:

    • from autocards import Autocards

    • a = Autocards(in_lang="en", out_lang="en")

      translation modules sometimes need to be downloaded and can be rather large

  • consuming input text is done in one of the following ways:

    • a.consume_var(my_text, per_paragraph=True)

    • a.consume_user_input(title="")

    • a.consume_textfile(path_to_file, per_paragraph=True)

    • a.consume_pdf(path_to_file, per_paragraph=True)

    • a.consume_web(link_or_path, mode="url", element="p")

      mode can be "url" or "local"

      element is the html element, like p for paragraph

  • different ways to get the results back:

    • out = a.string_output(prefix='', jeopardy=False)

      prefix is text that will be prepended to the question & answer

      jeopardy is for swapping question and answer

    • a.print(prefix='', jeopardy=False)

    • a.pprint(prefix='', jeopardy=False)

      pprint stands for pretty printing

    • a.to_anki(deckname="autocards_export", tags=["some_tag"])

    • df = a.pandas_df(prefix='')

    • a.to_csv("output.csv", prefix="")

    • a.to_json("output.json", prefix="")

    Also note that a user provided their own scripts that you can take inspiration from; they are a bit outdated but can be found in the examples_script folder.
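
A minimal end-to-end sketch tying the calls above together (sample.txt is a hypothetical input file; everything else is the documented API):

from autocards import Autocards

a = Autocards(in_lang="en", out_lang="en")
a.consume_textfile("sample.txt", per_paragraph=True)  # one paragraph at a time
a.pprint()                                            # inspect the generated cards
a.to_csv("output.csv")                                # then export them
a.to_anki(deckname="autocards_export", tags=["demo"])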

autocards's People

Contributors

deklanw, henry-pulver, mrm8488, nautman, nrjvarshney, patil-suraj, paulbricman, priyanksonis, thiswillbeyourgithub, tomwilde


autocards's Issues

Limited number of flashcards generated despite large input

Ya, I did the 'open demo' and I put in a good couple of paragraphs of content. It never generated more than 3 questions. 😬

It might be useful to feed individual paragraphs into the model. This way, the question-answering check has a better chance of succeeding, because it won't find answers in other places and invalidate the flashcard candidate.

Add the option to consume text paragraph after paragraph

When a large body of text is consumed, the program sometimes doesn't know where to search for the answer to a question, because it may look for the answer in a different location in the text than where it really is.
If the text is consumed one paragraph at a time, this may solve the issue,
i.e. with a for loop, consuming one paragraph then moving on to the next (a sketch follows below).

Note that this problem doesn't happen all the time, so the change may break things; hence it would be optional.
This is more of an open-ended idea than an issue.
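
A minimal sketch of that loop, assuming my_text holds the full input and paragraphs are separated by blank lines (this is essentially what the per_paragraph=True argument in the usage section now does):

from autocards import Autocards

a = Autocards()
paragraphs = [p.strip() for p in my_text.split("\n\n") if p.strip()]
for paragraph in paragraphs:
    # consuming each paragraph separately keeps the answer search local
    a.consume_var(paragraph, per_paragraph=False)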

My solution to "list index out of range" in tokenization_utils_fast.py

Hi,

I had this exception happen to me; apparently it occurs when a sentence is tokenized into an empty list (i.e. no useful words are in the sentence).

Updating transformers to 4.6.1 did not solve the issue, so I decided to read the text line by line with a try/except, and it worked very well (a sketch follows below).

I copied my script here: #6

I figured I might as well help people who have the same issue.

Edit: added in the latest revision.
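
A minimal sketch of that workaround (the line-by-line loop and the except clause reflect the comment above, not official behavior; path_to_file is a placeholder):

from autocards import Autocards

a = Autocards()
with open(path_to_file) as f:
    for line in f:
        if not line.strip():
            continue
        try:
            a.consume_var(line, per_paragraph=False)
        except IndexError:
            # skip sentences that tokenize to an empty list
            print("skipped line:", repr(line))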

ERROR - You need to have sentencepiece installed to convert a slow tokenizer to a fast one

Facing an issue after executing a = Autocards(in_lang="en", out_lang="en"):

ValueError: Couldn't instantiate the backend tokenizer from one of:

(1) a tokenizers library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.

I tried this solution: https://stackoverflow.com/questions/65431837/transformers-v4-x-convert-slow-tokenizer-to-fast-tokenizer, but it still did not work.

According to the Transformers v4.0.0 release notes, sentencepiece was removed as a required dependency. This means that "the tokenizers that depend on the SentencePiece library will not be available with a standard transformers installation", including the XLMRobertaTokenizer. However, sentencepiece can be installed as an extra dependency.
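
As the error message itself suggests, installing the missing library should fix this:

pip install sentencepiece

or, to pull it in as a transformers extra, pip install transformers[sentencepiece].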

Ignore references while consuming text

I've been trying out random wiki pages... Having references like [1] seems to mess it up, as well as the line breaks. I've been pasting the text into Word and replacing "^p" (the paragraph mark) with ". ". For the bracketed references, you can search for "[^#]", where ^# matches any number.

Sounds like a nice possible option: ignore_references as an argument of the consume functions, or something like that (a sketch follows below).
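
A pre-processing sketch for stripping such references before consumption (a plain regex approach; ignore_references itself is only a proposal, not an existing argument, and raw_text is a placeholder):

import re

def strip_references(text):
    # remove bracketed citation markers like [1] or [23]
    text = re.sub(r"\[\d+\]", "", text)
    text = re.sub(r"\[citation needed\]", "", text, flags=re.IGNORECASE)
    # collapse line breaks so sentences aren't split mid-way
    return re.sub(r"\s*\n+\s*", " ", text).strip()

# assuming a is an Autocards instance, as in the usage section
a.consume_var(strip_references(raw_text))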

currently uses only 4 cores instead of 8

I noticed while creating cards from a large document that only 4 out of my 8 cores seem to be used. Maybe investigating this could speed up autocards a lot.

(And yes, I don't have a GPU).
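
If the bottleneck is PyTorch's intra-op threading, raising the thread count is worth a try (a guess at the cause, not a confirmed fix; torch.set_num_threads is standard PyTorch):

import os
import torch

# PyTorch often defaults its thread pool to the number of physical
# cores, which can be half the logical core count
torch.set_num_threads(os.cpu_count() or 1)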

Add PDF handling

Add a way for users to feed in a PDF, let them select which pages they want the program to create questions on, and automatically extract, consume, and create questions from those pages.
Useful for textbooks and papers.

See #8 for the same idea, but with web scraping.
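
The usage section above now lists a.consume_pdf(...); a page-selection sketch along these lines, with pdfplumber as an assumed extraction library and illustrative file name and page numbers:

import pdfplumber
from autocards import Autocards

pages_wanted = range(10, 15)  # 0-indexed pages to extract
with pdfplumber.open("textbook.pdf") as pdf:
    text = "\n\n".join(pdf.pages[i].extract_text() or "" for i in pages_wanted)

a = Autocards()
a.consume_var(text, per_paragraph=True)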

Add web-scraping capabilities

Add a way for users to give a URL, have the URL scraped, and consume and create questions from the content.

Useful for articles, online textbooks, and online non-PDF papers.
See #7 for the same idea, but with PDF handling.
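
The usage section above now lists a.consume_web(...); a standalone sketch in the same spirit, with requests and BeautifulSoup as the assumed tools and a placeholder URL:

import requests
from bs4 import BeautifulSoup
from autocards import Autocards

html = requests.get("https://example.com/article").text
soup = BeautifulSoup(html, "html.parser")

a = Autocards()
for p in soup.find_all("p"):
    # feed each paragraph element separately, as suggested above
    a.consume_var(p.get_text(), per_paragraph=False)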

PyPI package

Hi @paulbricman, love the tool and all the interesting applications it can be used for. I was wondering, do you have any plans to make it into a PyPI package for people to install easily? If so, I'd be happy to open a PR that sets up a minimal one and helps you get it onto PyPI (it is extremely easy).

do a demo on a Wikipedia outline

Hi,

just had an interesting idea that I thought was worth sharing.

Wikipedia has these very cool pages called "outlines"; for example, here's the one for genetics.

They make it easy to quickly find lots of pages on a given subject.

I think it should be rather easy to write a Python script using BeautifulSoup to gather the first paragraph of every page listed in a given outline, then feed it to Autocards (a sketch follows below).

The resulting Anki deck would be a very interesting demo of an "in the field" use of Autocards.

Actually, it would be a very quick and low-cost way of proving the value of Autocards. And the decks might be useful in themselves!

You could then share them on AnkiWeb, and that should be nice publicity!

(sort of related to #8 )

Thoughts?

Edit: a script showing how to do this can be found in the latest revision.
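
A rough sketch of the idea, using Wikipedia's REST summary endpoint for the first paragraph of each linked page (the outline URL and the link-filtering heuristic are assumptions; the script in the repo may differ):

import requests
from bs4 import BeautifulSoup

OUTLINE = "https://en.wikipedia.org/wiki/Outline_of_genetics"
API = "https://en.wikipedia.org/api/rest_v1/page/summary/"

soup = BeautifulSoup(requests.get(OUTLINE).text, "html.parser")
titles = {
    link["href"].removeprefix("/wiki/")
    for link in soup.select("a[href^='/wiki/']")
    if ":" not in link["href"]  # skip File:, Category:, etc. namespaces
}

first_paragraphs = []
for title in sorted(titles):
    summary = requests.get(API + title).json()
    if "extract" in summary:
        first_paragraphs.append(summary["extract"])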

Not working

>>> a = Autocards()
Loading backend, this can take some time...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\USER\autocards\autocards.py", line 79, in __init__
    self.qg = qg_pipeline('question-generation',
  File "C:\Users\USER\autocards\pipelines.py", line 360, in qg_pipeline
    tokenizer = AutoTokenizer.from_pretrained(tokenizer)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\models\auto\tokenization_auto.py", line 435, in from_pretrained
    return tokenizer_class_fast.from_pretrained(pretrained_model_name_or_path, *inputs, **kwargs)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_base.py", line 1719, in from_pretrained
    return cls._from_pretrained(
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_base.py", line 1791, in _from_pretrained
    tokenizer = cls(*init_inputs, **init_kwargs)
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\models\t5\tokenization_t5_fast.py", line 128, in __init__
    super().__init__(
  File "C:\Users\USER\AppData\Roaming\Python\Python39\site-packages\transformers\tokenization_utils_fast.py", line 105, in __init__
    raise ValueError(
ValueError: Couldn't instantiate the backend tokenizer from one of:
(1) a `tokenizers` library serialization file,
(2) a slow tokenizer instance to convert or
(3) an equivalent slow tokenizer class to instantiate and convert.
You need to have sentencepiece installed to convert a slow tokenizer to a fast one.
>>> a.consume_var("a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
NameError: name 'a' is not defined

Quotes cause trouble in the demo

A good addition to this would be a box at the top to parse text for characters that cause errors (like quotes) and replace bullets and line-breaks with periods, etc. That approach has mostly worked.

Those characters should somehow be escaped; a sanitizer sketch is below.
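
A sanitizer sketch along those lines (which characters actually trip the demo isn't documented, so the replacement set here is a guess):

def sanitize(text):
    # straighten curly quotes, which reportedly cause errors
    for bad, good in {"“": '"', "”": '"', "‘": "'", "’": "'"}.items():
        text = text.replace(bad, good)
    # turn bullets and line breaks into sentence boundaries
    for sep in ("\r\n", "\n", "•"):
        text = text.replace(sep, ". ")
    return text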
