Comments (8)
Making a dictionary for Tesseract is still too complicated.
Tesseract should do this automatically, on the basis of a file containing the words of the language, each optionally followed by its meaning in one or more other languages or by example phrases (see below).
Please implement this in Tesseract, so that it builds a dictionary from an input file of words, or parts of words, like the following. A future version of Tesseract should also offer a --dictionary option, so that when OCR-ing a printed dictionary the program itself rearranges the result into such a word list / dictionary file, which can then serve as input for a better traineddata file (including a dictionary) for Tesseract itself.
wulundu Katze/deu
sondu Vogel/deu
rawaandu Hund/deu "Mi gattino rawaandu ndu" "Ich habe den Hund gebissen"/deu
galle Haus/deu house/eng
"Mi danyaani galle" "Ich habe kein Haus"/deu "Eu não tenho casa"/por
jullere Stuhl/deu
kuriire Küche/deu
laawol Weg/deu
lukujaderre Eidechse/deu lizard/eng
danki Bett/deu bed/eng
kogol Garten/deu
julirde Moschee/deu mesquita/por
yahde gehen/deu
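As a rough illustration of what such preprocessing could look like outside of Tesseract, here is a minimal Python sketch that reduces a file in the format above to a plain list of source-language words. It is not an existing Tesseract feature. The function name `source_words` and the parsing rules are assumptions: a token ending in `/xx` or `/xxx` is treated as a translation and dropped, a quoted phrase without a language tag is treated as an example sentence in the source language, and all words are lowercased.

```python
import re

# A token is either a quoted phrase with an optional /lang suffix,
# or a bare run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"(?:/([a-z]{2,3}))?|(\S+)')
# A bare token such as Katze/deu is assumed to be a translation.
GLOSS = re.compile(r'^(.*)/([a-z]{2,3})$')


def source_words(lines):
    """Collect the untagged (source-language) words from dictionary lines."""
    words = set()
    for line in lines:
        for phrase, phrase_lang, bare in TOKEN.findall(line):
            if phrase_lang:
                continue  # quoted translation, e.g. "..."/deu
            if bare:
                if GLOSS.match(bare):
                    continue  # single-word translation, e.g. Hund/deu
                words.add(bare.lower())
            else:
                # untagged quoted phrase: example sentence in the source language
                words.update(w.lower() for w in phrase.split())
    return sorted(words)


sample = [
    'wulundu Katze/deu',
    'rawaandu Hund/deu "Mi gattino rawaandu ndu" "Ich habe den Hund gebissen"/deu',
    'galle haus/deu house/eng',
]
print(source_words(sample))
# ['galle', 'gattino', 'mi', 'ndu', 'rawaandu', 'wulundu']
```

The resulting list, written one word per line, is the kind of plain word list that Tesseract's training tools expect as dictionary input; how it is then packed into a traineddata file is outside the scope of this sketch.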
from langdata.
Is this the same as the Pulaar language?
Or is it the Fula language? There seem to exist different scripts for that (based on Latin, Arabic or other scripts).
from langdata.
The data you provided is not sufficient for the current Tesseract; it was made for the old Tesseract 3 recognizer.
There is also no training text. The included word lists are empty.
from langdata.
ful.traineddata.zip
ful.daten.zip
I have now trained it with Tesseract 5. The file ful.traineddata.zip is not actually zipped; just remove the trailing .zip. The files used to produce it are in the ful.daten.zip file. I have not made a dictionary yet.
The language is spread over a large area and has many names, such as tukulor, peulle, pulaar, fula, fulfulde, bolle fulbe, ...
from langdata.
When I produce new box files from another text for training Tesseract, I want to know whether, after the first run, the relevant information from the earlier data is already "embedded" in the traineddata file, so that for further training I do not need to use that data again (only the new data), or whether I have to keep all accumulated data in the folder and add the new data to it.
And, in the first case, whether it is possible to later "remove" some training data that was added before.
With Tesseract 5, I have the problem that after accumulating a lot of data / box files from new texts and running everything together, the program crashes during training with some matrix error. If I add only the new data (not including the previously used data again), the problem does not occur.
from langdata.
Enclosed is an updated traineddata file.
from langdata.
ful.frequent_words_list.zip
ful.words_list.zip
ful.traineddata.zip
Here are further box and jpg files, word lists, and a new ful.traineddata for the Fula / Tukölör / Pulaar language. Except for one, the files are not in zip format; they were only renamed to .zip because uploading does not work otherwise (i.e. rename them to drop the .zip).
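For anyone downloading these attachments, the renaming can be scripted. The following is a minimal sketch, assuming the downloads sit together in one folder; the function name `strip_fake_zip_suffix` is made up for illustration. It uses the standard library's `zipfile.is_zipfile` to tell the one genuine archive from the merely renamed files.

```python
import zipfile
from pathlib import Path


def strip_fake_zip_suffix(folder):
    """Drop the '.zip' suffix from files that are not real ZIP archives."""
    renamed = []
    for path in sorted(Path(folder).glob("*.zip")):
        if zipfile.is_zipfile(path):
            continue  # a genuine archive: leave it to be unzipped normally
        target = path.with_suffix("")  # ful.traineddata.zip -> ful.traineddata
        if target.exists():
            continue  # never overwrite an existing file
        path.rename(target)
        renamed.append(target.name)
    return renamed
```

Calling `strip_fake_zip_suffix(".")` in the download folder returns the names of the files it renamed and leaves real archives untouched.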
from langdata.
From now on I am putting all further files and improvements to ful.traineddata at:
https://github.com/tukulor/ful.traineddata
from langdata.