vuizur / add-stress-to-epub

A program that sets the stress and the letter ё of Russian text and ebooks using Wiktionary data and grammar analysis.

License: GNU Affero General Public License v3.0

Python 100.00%
epub python dictionary russian russian-stress language-learning russian-accent

add-stress-to-epub's Introduction

Russian word stresser for ebooks

Short instructions in Russian

(Screenshots of the GUI and a created ebook with stress marks)

(If you want to know how to use this with a dictionary for reading ebooks, check out my tutorial!)

This program adds stress marks to entire Russian ebooks and restores the dots over the ё. Not only is it the most sophisticated open source stress detection tool (that I am aware of), it also lets you convert entire ebooks!

To reach the best results, it analyzes the case and part of speech of every word in order to find the correct stress. So it will stress the word "слова" differently in these two sentences: В стро́гом смы́сле сло́ва? vs. Твои́ слова́ ничего́ не зна́чат.

Also check out my dictionaries that are compatible with stressed text: Russian-English and Russian-Russian (both are in Stardict format and work well with programs such as KOReader).

Installation

On Windows, download the executable release, unpack the .zip file into a folder and start the program called #Stress marker.exe. This opens a GUI in which you can select the ebook you want to convert. Click the "Start stress marking" button to begin. Txt files are also supported.

(If your ebook is not in the epub or txt format, you need to install Calibre. If you have it installed, the script will automatically convert the book, for example from FB2, to epub.)
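The conversion can also be done by hand before running the program. A minimal sketch, assuming Calibre's `ebook-convert` command line tool is installed and on the PATH. The helper names `build_convert_command` and `to_epub` are hypothetical, and this is not necessarily how the program itself calls Calibre:

```python
import shutil
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(source: Path) -> List[str]:
    """Return the ebook-convert call that writes an .epub next to `source`."""
    target = source.with_suffix(".epub")
    return ["ebook-convert", str(source), str(target)]


def to_epub(source: Path) -> Path:
    """Convert `source` (for example an FB2 file) to epub and return the new path."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("Calibre's ebook-convert was not found on PATH")
    # check=True raises CalledProcessError if Calibre reports a failure.
    subprocess.run(build_convert_command(source), check=True)
    return source.with_suffix(".epub")
```

Calling `to_epub(Path("book.fb2"))` would then leave a `book.epub` in the same folder, ready to be selected in the GUI.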

I did not create executables for Linux or Mac. On these systems, follow the steps under "Command line use".

Command line use

You can install the package by running pip install git+https://github.com/Vuizur/add-stress-to-epub and then use the library programmatically.

Alternatively, download the GitHub repository (click the Code -> Download ZIP button) and put the dictionary zip file into the unpacked folder.

Next, install Python 3 (and check the installer option to add it to PATH). Then open a command line in the repository folder (in Windows Explorer, use the "File" button at the top left and select "Open Windows Powershell"), install poetry and install the required libraries by running:

poetry install

Then you can simply execute the GUI by calling:

poetry shell
pip install PyQt6
poetry run python russian_text_stresser/gui.py

If you want to use the command line utility, put your ebook in this folder and start the program with the following command (change input.epub to the file name of the ebook you want to convert):

poetry run python edit_epub.py -input "input.epub" -output "output.epub"

That's it!

You can also convert entire folders filled with epub files:

poetry run python edit_epub.py -input_folder "to-convert" -output_folder "was-converted"
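If a folder is converted in several runs, the single-file command can be scripted so that finished books are skipped. A minimal sketch, assuming the `-input` and `-output` flags shown above and that it is run from the repository folder. The helper name `conversion_commands` is hypothetical and not part of the project:

```python
from pathlib import Path
from typing import List


def conversion_commands(input_folder: Path, output_folder: Path) -> List[List[str]]:
    """Build one edit_epub.py call per .epub file that has no output yet."""
    commands = []
    for book in sorted(input_folder.glob("*.epub")):
        target = output_folder / book.name
        if target.exists():
            continue  # already converted in an earlier run
        commands.append([
            "poetry", "run", "python", "edit_epub.py",
            "-input", str(book),
            "-output", str(target),
        ])
    return commands
```

Each returned command can be passed to `subprocess.run(command, check=True)`.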

Programmatic usage

If you want to stress text in Python, you can simply write:

from russian_text_stresser.text_stresser import RussianTextStresser

ts = RussianTextStresser()
print(ts.stress_text("Твои слова ничего не значат."))

This will print Твои́ слова́ ничего́ не зна́чат.
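The same call can be applied to a whole text file. A minimal sketch that takes the stressing function as a parameter, so it only relies on `stress_text` accepting and returning a string as shown above. The helper name `stress_file` is hypothetical:

```python
from pathlib import Path
from typing import Callable


def stress_file(source: Path, target: Path, stress: Callable[[str], str]) -> None:
    """Write a stressed copy of a UTF-8 text file, processing it line by line."""
    with source.open(encoding="utf-8") as src, target.open("w", encoding="utf-8") as dst:
        for line in src:
            text = line.rstrip("\n")
            # Blank lines are copied unchanged so paragraph breaks survive.
            dst.write((stress(text) if text.strip() else text) + "\n")


# Possible usage with the library (assumes the package is installed):
# from russian_text_stresser.text_stresser import RussianTextStresser
# stress_file(Path("input.txt"), Path("output.txt"), RussianTextStresser().stress_text)
```

Processing line by line keeps memory use low for long books, at the cost of hiding context that spans line breaks from the stresser.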

Feedback

If you have feedback or suggestions, please tell me. I have only tested the program on a few ebooks, so there may still be bugs. If you find a word that is stressed incorrectly, or a word that is on the (English) Wiktionary but still not stressed, please open an issue. I may maintain a list of words that confuse the algorithm (rare, but it happens), so that such words get no stress mark rather than a wrong one.

If you are interested in modifying the program: The database used in this project has been created using my other project here: https://github.com/Vuizur/ebook_dictionary_creator

Limitations

In some cases the stress is omitted. This happens when a word can be stressed in several ways depending on the context (as with замок, or все vs. всё), when a word does not appear in my current data source (which can be the case for very rare words), or, in rare cases, when the grammatical analysis delivers a wrong result.
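To check which words were left without a mark in a converted text, the output can be scanned for multi-syllable words that carry neither an accent nor a ё. A minimal sketch, assuming the stress mark is the combining acute accent U+0301 and that one-syllable words are intentionally left unmarked (as in the examples above). The helper name `unstressed_words` is hypothetical:

```python
import re
from typing import List

ACUTE = "\u0301"  # combining acute accent, assumed to be the stress mark
VOWELS = set("аеёиоуыэюяАЕЁИОУЫЭЮЯ")
WORD = re.compile("[а-яёА-ЯЁ\u0301]+")


def unstressed_words(text: str) -> List[str]:
    """Return multi-syllable Cyrillic words that have no stress mark and no ё."""
    missing = []
    for word in WORD.findall(text):
        syllables = sum(ch in VOWELS for ch in word)
        if syllables < 2:
            continue  # one-syllable words need no mark
        if ACUTE in word or "ё" in word.lower():
            continue  # marked explicitly, or ё (almost always stressed)
        missing.append(word)
    return missing
```

Running it over the output for a sentence containing замок would list that word if the program could not decide between за́мок and замо́к.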

FAQ

Q: The accents don't get correctly displayed in KOReader, the accented letters are way too small or the accent mark is slightly misplaced. What can I do?

A: It's best to disable "Embedded Fonts" in this case. Open the menu by tapping somewhere at the top, then in the menu at the bottom of the page select the settings wheel symbol on the far right. There, set "Embedded Fonts" to "off".

Q: Can I use this with WordDumb so that I can see short definitions directly in the ebook?

A: Yes! You only need to disable the setting "Use POS". With it enabled it currently does not work, unfortunately. Also make sure to add the stress marks first and then use the WordDumb plugin, not the other way around.

Benchmark results

Setting stress:

System                          % Correct  % Unstressed  % Incorrect  Correct / incorrect
Reynolds                            90.14          8.82         1.04                86.94
Russtress                           83.84         11.71         4.45                18.83
Russtress-f                         93.39          1.13         5.48                17.03
Random                              58.66          0.00        41.34                 1.42
RussianGram                         94.59          3.71         1.71                55.40
Russ                                85.52          0.00        14.48                 5.90
Our system                          93.11          5.85         1.04                89.80
Our system + Russtress-f            95.91          2.53         1.57                61.23
Our system + WSD                    94.62          4.09         1.29                73.19
Our system + WSD + Russtress-f      95.83          2.51         1.65                57.92

Yofication + setting stress:

System                          % Correct  % Unstressed  % Incorrect  Correct / incorrect
Our system                          92.64          5.85         1.52                61.03
Reynolds                            89.67          8.83         1.50                59.60
RussianGram                         94.36          3.62         2.02                46.80
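The last column appears to be the share of correctly stressed words divided by the share of incorrectly stressed ones. This is an assumption, since the column is not defined here. A short sketch that recomputes it from the rounded percentages gives values close to the printed ones but not identical, presumably because the table was computed from unrounded counts:

```python
def correct_to_incorrect(pct_correct: float, pct_incorrect: float) -> float:
    """Ratio of correctly to incorrectly stressed words (assumed definition)."""
    if pct_incorrect == 0:
        raise ValueError("ratio is undefined when nothing is stressed incorrectly")
    return pct_correct / pct_incorrect


# (% correct, % incorrect) copied from the "Setting stress" table above.
rows = {
    "Reynolds": (90.14, 1.04),
    "Random": (58.66, 41.34),
    "Our system": (93.11, 1.04),
}

for name, (correct, incorrect) in rows.items():
    print(f"{name}: {correct_to_incorrect(correct, incorrect):.2f}")
```

A high ratio rewards systems that leave a word unstressed rather than guess, which is why a system can have a lower % Correct and still a higher ratio.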

Thesis

You can read my thesis about Russian text stressing here

Cite as:

@mastersthesis{krumbiegel2023,
  author = {Krumbiegel, Hannes},
  title = {Automated detection of word stress in Russian texts},
  school = {TU Bergakademie Freiberg},
  year = {2023},
  month = {September},
  type = {Master's thesis}
}

Acknowledgements

The data is sourced from the English Wiktionary. The SQLite database containing it was built on the basis of Tatu Ylonen's parsed Wiktionary, which can be found at kaikki.org. Additional data sources are the OpenRussian project, the Russian Wiktionary and Wikipedia.

Similar projects

add-stress-to-epub's Issues

Official™ embedded font

Hello,

I'm very impressed by the project and found it a few minutes ago.

I've noticed that the display of accented characters has a lot of variety depending on whatever fonts are embedded in the book or whatever. It would be nice for users to be able to add a font known to be good with calibre (maybe a wiki page)?

(edit: looks like this is a bug in KOReader, where it calls up a "fallback font" for the accented characters)

Finish benchmarks

  • Put everything together and get the basic results
  • Perform detailed benchmark
  • Benchmark of old versions

Publishing a PyPI module

Hello!
I am the maintainer of VocabSieve. It would be great if you could publish this to PyPI for programmatic use. All the existing ones (like russtress) do not consider context and make mistakes quite often.

On another note, is it really necessary to have spacy for this? It is a rather large dependency and in my experience works somewhat slowly. Have you tried pymorphy2? It seems to be able to tag words too.

Error when importing

Hi there!
Cool concept and exactly what I've been looking for to add stress marks to sentences from the SMARTool database to make Anki decks.

I installed on my Linux Mint 20 machine using
pip3 install git+https://github.com/Vuizur/add-stress-to-epub

When I run your example in my Python script, I get the following error at the import line:

File "/home/user_name/.local/lib/python3.8/site-packages/russian_text_stresser/russian_dictionary.py", line 45, in <module>
    class RussianDictionary:
  File "/home/user_name/.local/lib/python3.8/site-packages/russian_text_stresser/russian_dictionary.py", line 46, in RussianDictionary
    def __init__(self, db_file: str, simple_cases_file: str | None) -> None:
TypeError: unsupported operand type(s) for |: 'type' and 'NoneType'
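For context: writing a union as `str | None` in an annotation only evaluates successfully on Python 3.10 and later, which fits this traceback coming from Python 3.8. A minimal sketch of an equivalent signature that also imports on 3.8, using `typing.Optional`. The class body is a hypothetical stand-in, not the project's actual code:

```python
from typing import Optional


class RussianDictionary:
    """Stand-in with the same constructor signature as in the traceback."""

    def __init__(self, db_file: str, simple_cases_file: Optional[str]) -> None:
        # Optional[str] means "str or None" and is accepted by Python 3.8.
        self.db_file = db_file
        self.simple_cases_file = simple_cases_file
```

Running the package on Python 3.10 or newer avoids the error without any change to the code.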

KeyError: 'AUX'

I'm running your tool on Windows 11. I get a bunch of "Apparently wrong POS detected" messages, and then the tool fails with KeyError: 'AUX', all within about 30 seconds.

Here is the file on which I'm trying to add stress marks. Any feedback is appreciated!

Use a CSS hack to add accents to preserve selected words

Hello
This software seems to physically add the accents on the words themselves. This requires that the user has special dictionary programs or files which can handle the words in their accented forms. I propose a possible way to add accents without needing special dictionaries.
We use a CSS ::before pseudo-element to provide the accent character in a zero-width inline-block span. The text generated by CSS is not selectable, at least in all the browsers I tested. It also works in Calibre and Foliate, and partially in KOReader (it does not select the whole word at once, but you can just drag a bit). It might not work with all reader software, but it still seems useful.
Example html:

<html>
  <head>
    <style>
    [data-content]::before {
      content: attr(data-content); 
    }
    </style>
  </head>
  <body>
    нали<span data-content='&#x301;'></span>чный
  </body>
</html>

This should display нали́чный, but if you try to select it, it will select наличный (without the accent)
It would be great if this is an option for this tool!
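A minimal sketch of how already stressed text could be turned into this markup, assuming the stress mark is the combining acute accent U+0301 and that the input is a plain text node that has already been HTML-escaped. The helper name `accents_to_spans` is hypothetical and not part of the tool:

```python
ACUTE = "\u0301"  # combining acute accent inserted by the stresser (assumed)
SPAN = "<span data-content='&#x301;'></span>"


def accents_to_spans(stressed_text: str) -> str:
    """Replace each combining accent with an empty span that CSS fills in."""
    return stressed_text.replace(ACUTE, SPAN)
```

The stylesheet rule from the example above still has to be added to the book so that the spans actually render the accent.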
