kindlewick's Introduction

Kindlewick

This is a Go program to fetch Wiktionary page content from their API, (optionally) intersect it with a frequency wordlist (as the database is probably too big otherwise), and then produce an HTML file that, together with an .opf file, can be converted to mobi and used on your Kindle for in-book lookups.

Note: My target language is Finnish, so that’s what I had in mind when writing this program. Hopefully it’ll work out of the box for your TL too, but there’s always the possibility that it does something wonky in the inflection table. Fear not though, goquery is easy to work with!

Instructions

  1. Download the necessities

    1. Install Go. Probably 1.12.

    2. Download a frequency wordlist for your language from here if possible.

      Otherwise the file might be too big for kindlegen to handle, as it’s a 32-bit program. Finnish, with its 98184 lemmata, proved too big to process without a freq list, but your obscure language might be fine.
      If you can’t find one, just omit the -freqlist flag below.

    3. Download kindlegen for your platform. You’ll use this to convert the .opf + .html files into .mobi.

  2. Edit the metadata in dict.opf.

    1. Don’t forget to modify <DictionaryInLanguage>!
      Set it to the ISO 639-1 code from here.

    2. You can replace cover.png too, but it doesn’t matter much as the dictionary won’t show up as a book by default.

  3. To generate dict.html, which dict.opf references, run this, with the name of the frequency list you downloaded instead of fi.txt:

    go run kindlewick.go -freqlist fi.txt

    If it’s still too big, you can just take the first 50k lines or whatever from the file (in bash/zsh/etc) like so:

    go run kindlewick.go -freqlist <(head -n 50000 fi.txt)

  4. Finally, generate the .mobi file and put it on your Kindle!

    kindlegen dict.opf -verbose -c2 -o my_dict.mobi

Q&A

How are inflections acquired?

Basically it just takes every span inside a table cell and, if it consists of multiple words, keeps only the last one (olen odottanut → odottanut, since you can only look up a single word at a time on Kindle), then filters out duplicates.
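
The "last word, then dedupe" step can be sketched in plain Go as below. This is an illustration only, not the code from kindlewick.go: the function name lastWords is hypothetical, and it assumes the span texts have already been pulled out of the inflection table (by goquery or otherwise) into a slice of strings.

```go
package main

import (
	"fmt"
	"strings"
)

// lastWords is a hypothetical helper: it reduces each cell text to its final
// whitespace-separated word and drops duplicates, keeping first-seen order.
func lastWords(cells []string) []string {
	seen := make(map[string]bool)
	var forms []string
	for _, cell := range cells {
		words := strings.Fields(cell)
		if len(words) == 0 {
			continue // skip empty cells
		}
		last := words[len(words)-1]
		if !seen[last] {
			seen[last] = true
			forms = append(forms, last)
		}
	}
	return forms
}

func main() {
	// Illustrative cell texts; a real inflection table has many more.
	cells := []string{"olen", "olen odottanut", "odottanut", "et ole odottanut"}
	fmt.Println(lastWords(cells)) // prints [olen odottanut]
}
```

Keeping first-seen order is a choice made for this sketch; the original only says that duplicates are filtered out.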

Why not consult the frequency list before downloading every single page?

Because frequency lists usually have inflected forms of words, and if you only see olen in the list you won’t know you have to download the lemma form olla. Ergo, download everything and keep entries where at least some form of the word shows up in the frequency list.
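
A minimal sketch of that filtering step, under a few assumptions: the entry type and the function names are hypothetical, the frequency list is already loaded into a set, and the lemma itself counts as one of the word's forms. The real program may structure this differently.

```go
package main

import "fmt"

// entry is a hypothetical record for one downloaded Wiktionary page.
type entry struct {
	Lemma string
	Forms []string // inflected forms collected from the inflection table
}

// inFreqList reports whether the lemma or any of its forms is in the set.
func inFreqList(e entry, freq map[string]bool) bool {
	if freq[e.Lemma] {
		return true
	}
	for _, f := range e.Forms {
		if freq[f] {
			return true
		}
	}
	return false
}

// filterEntries keeps only entries with at least one form in the list.
func filterEntries(entries []entry, freq map[string]bool) []entry {
	var kept []entry
	for _, e := range entries {
		if inFreqList(e, freq) {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	// The list contains only the inflected form "olen", not the lemma "olla".
	freq := map[string]bool{"olen": true}
	entries := []entry{
		{Lemma: "olla", Forms: []string{"olen", "olet"}},
		{Lemma: "odottaa", Forms: []string{"odotan"}},
	}
	for _, e := range filterEntries(entries, freq) {
		fmt.Println(e.Lemma) // prints olla
	}
}
```

The example shows why the lookup has to happen after downloading: olla is kept only because its form olen is known once the page has been fetched.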

kindlewick's Issues

Error decoding https://en.wiktionary.org/w/api...

Hello Dima,

I keep getting a recurring error when I run this code:

go run kindlewick.go -freqlist fi.txt

The error is:

2022/06/01 13:10:19 Getting https://en.wiktionary.org/w/api.php?action=query&cmlimit=500&format=json&list=categorymembers&formatversion=2&cmtitle=Category:Finnish_lemmas&cmprop=title&cmcontinue=
2022/06/01 13:10:19 Error decoding https://en.wiktionary.org/w/api.php?action=query&cmlimit=500&format=json&list=categorymembers&formatversion=2&cmtitle=Category:Finnish_lemmas&cmprop=title&cmcontinue=
2022/06/01 13:10:19 invalid character '<' looking for beginning of value
2022/06/01 13:10:19 invalid character '<' looking for beginning of value
1 Getting https://en.wiktionary.org/w/api.php?action=parse&formatversion=2&format=json&page=mutta&prop=text
2 Getting https://en.wiktionary.org/w/api.php?action=parse&formatversion=2&format=json&page=olla&prop=text
2022/06/01 13:10:21 Trying again.
...

What do you think the problem might be?
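
A note on the message itself: "invalid character '<' looking for beginning of value" is what Go's encoding/json reports when the text it is asked to parse starts with '<'. That usually means the response body was HTML or XML (for example an error page) rather than JSON. The sketch below only reproduces the message; it is not taken from kindlewick.go, the helper name decodeBody is hypothetical, and it does not establish why the API returned a non-JSON body in this case.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// decodeBody is a hypothetical helper that tries to parse body as a JSON
// object and returns the decoder's error, if any.
func decodeBody(body string) error {
	var v map[string]interface{}
	return json.NewDecoder(strings.NewReader(body)).Decode(&v)
}

func main() {
	// A JSON body decodes without error.
	fmt.Println(decodeBody(`{"batchcomplete":true}`)) // prints <nil>

	// An HTML body fails on its very first character.
	fmt.Println(decodeBody("<!DOCTYPE html><html></html>"))
	// prints invalid character '<' looking for beginning of value
}
```

Logging the HTTP status code and the first few bytes of the body before decoding would show what the server actually sent.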