shreeshrii / tess5train-fonts
Files and Scripts to run Tesseract 5 LSTM Training using fonts
License: Apache License 2.0
Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16
I have some articles, containing both English and Greek letters, that need to be OCRed, so I turned to Tesseract.
After installing Tesseract successfully, I opened a terminal, ran the command tesseract detector_sample_1.png result -l eng+grc, and got result.txt as the output.
The original image, named "detector_sample_1.png", is shown below.
The resulting result.txt is shown below as well.
I found that Tesseract works quite well if the content in the red block(s) is disregarded.
Greek letters do not actually appear very frequently in these articles, so I came up with the idea of retraining/fine-tuning the existing eng.traineddata.
Therefore, I resorted to your code.
After reading your README.md, I think I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course!).
Since I need to fine-tune eng.traineddata with Greek letters, I prepared a training text,
eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.) I only ran cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text > ../langdata/eng/eng.layer.training_text (in 8-makedata_layernew.sh). In addition, I prepared a new test file,
eng.layertest.training_text.txt.
Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, and obtained eng_layer.traineddata.
It is disappointing that the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
In short: I tried to extend the existing eng.traineddata model with Greek letters using your code, but the result is disappointing. I hope you can help me.
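Before generating training data, it is worth sanity-checking that the merged training text actually contains the Greek characters to be added. A minimal sketch of the cat step above, using a scratch directory and tiny placeholder contents (file names mirror the real ones, but the text is illustrative):

```shell
# Merge the base English training text with the Greek additions,
# then confirm that Greek letters made it into the combined file.
workdir=$(mktemp -d)
printf 'The quick brown fox jumps over the lazy dog.\n' > "$workdir/eng.training_text"
printf 'Sample line with Greek: alpha is α, beta is β.\n' > "$workdir/eng.anhao.training_text"
cat "$workdir/eng.training_text" "$workdir/eng.anhao.training_text" \
    > "$workdir/eng.layer.training_text"
# Count lines of the merged file that contain a Greek letter.
greek_lines=$(grep -c 'α' "$workdir/eng.layer.training_text")
echo "$greek_lines"
```

If the count is zero, the new characters never reached the merged training text, and no amount of layer training will add them.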
Dear @Shreeshrii,
This is the first time I am trying to train Tesseract with Bangla and Arabic fonts. I've checked the past documentation on fine-tuning Tesseract for new fonts, but none of it works after the Tesseract 5 release. While working with this repository I've faced a few issues. I am not sure where to start; the README of this repo is not helpful for newcomers like me, especially someone who is not very comfortable with Linux commands. A Pythonic way of fine-tuning Tesseract would be great, but I believe even this repository is good enough for me to understand; all I need is a step-by-step procedure (detailed documentation).
I am trying to train Tesseract with Bangla fonts. If you could kindly provide a step-by-step procedure for Bangla font training, it would be highly appreciated. Thank you in advance.
Dear Shree,
I executed 1-makedata.sh.
The final result seems OK, except that an error was thrown: Failed to read data from: ../langdata/eng/eng.config
There is no eng.config; it is also not mentioned in the instructions.
Is this error a small thing that we can ignore?
Thank you!
../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ../tesstutorial/engeval --lang eng
Loaded unicharset of size 111 from file /tmp/eng-2019-04-03.nuH/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
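As the log itself notes ("Config file is optional, continuing..."), the missing file is not fatal. Creating an empty eng.config at the expected path silences the message; a minimal sketch using a scratch directory in place of the script's ../langdata path:

```shell
# The language config file is optional; an empty one is enough to
# make the "Failed to read data from: .../eng.config" message go away.
langdir=$(mktemp -d)/langdata/eng
mkdir -p "$langdir"
touch "$langdir/eng.config"
ls "$langdir"
```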
Hi Shree, I'm getting this error when I run 6-plusminus.sh (after 5-makedata_plusminus.sh).
My plusminus training adds Arabic-Indic characters.
I think it's not training at all, because my screen is full of this error:
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
But the strange thing is that it does correctly start to recognize some of the new characters:
Truth:ةيمسر ةيلود ةارابم لوأ يف ٢٤٥٦± ادنلتكسا دض ٧٨± بعلت ارتلجنإ
OCR :ةيمسر ةيلود ةارابم لوأ يف ،(٥٦± ادنلتكسا دض ،٩± بعلت ارتلجنإ
Truth:ىلع نماثلا نرقلا ١٤± ىلإ ارتلجنإ يف مدقلا ةرك خيرات
OCR :ىلع نماثلا نرقلا ١٤± ىلإ ارتلجنإ يف مدقلا ةرك خيرات
A sample of my plusminus training text is:
الان هناك زيادة قدرها فى المائة واشار وزير التجارة البحرينى الى
ن النُّمو يعبر، عن الزيادة ±٤٨ الحاصلة في الإنتاج، فإنه يأخذ بعين
إحصائية مقاطع و المزيد نشيط عماء، هذا نغمات 7 ومن التسجيل:
في » في حتى إرسال البيانات؟ = , معلومات اسم برامج أحمد
"النمو في الدخل"، لأن توزيع الدخل إذا كان حاداً(حتى بوجود النمو)
أن لها أسماء أخرى غير عربية عند شعوب مسلمة أخرى
وذو القعدة وذو الحجة ومحرم. ولأن الله نعتها بالدين القيم، فقد
ونبه الى ان هدف الحكومة ±٣٣ البحرينية فى المرحلة الحالية هو
ويمثل فى المائة من الاقتصاد العالمي واكد الوزير البحرينى ان فتح
تماعات السابقة، الأولى هي حمل الكرة باليد والجري بها
تشهد نسبة عالية من الجريمة ±٢٢ لكن هذه السماء، تبقى ماء، هامة
والبضائع التى تأتى للبحرين يمكن اعادة تصديرها لدول المنطقة
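The "Image too small to scale!! (1x48 vs min width of 3)" messages point to degenerate, 1-pixel-wide character boxes in the generated training data. Assuming the standard Tesseract box format (char left bottom right top page), a quick awk pass can locate such boxes; the sample file below is fabricated for illustration:

```shell
# Find box entries narrower than 3 pixels (right - left < 3),
# the ones training rejects as "Image too small to scale".
boxfile=$(mktemp)
cat > "$boxfile" <<'EOF'
a 10 10 11 58 0
b 20 10 40 58 0
c 50 10 51 58 0
EOF
narrow=$(awk '($4 - $2) < 3' "$boxfile" | wc -l)
echo "$narrow narrow boxes"
```

Running this over the generated .box files shows which characters (often thin punctuation or combining marks) are producing the unusable boxes.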
Dear Shree,
I executed the command below and got an error.
I installed tesseract from a package, and also compiled it from https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation.
Which one should be removed? Or what is the best way to fix this?
Thank you!
./1-makedata.sh
ERROR: shared library version mismatch (was 4.1.0-rc1-223-g3e71, expected 4.1.0-rc1-184-g497d)
Did you use a wrong shared tesseract library?
$ tesseract --version
tesseract 4.1.0-rc1-223-g3e71
leptonica-1.78.0
libpng 1.6.34 : zlib 1.2.11
Found SSE
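A hedged sketch of one common remedy (an assumption about this setup, not a confirmed diagnosis): the training binaries were built against a different libtesseract than the one currently installed, so rebuilding and reinstalling both the library and the training tools from the same checkout, then refreshing the linker cache, brings them back in sync. Written to a file here rather than executed, since it needs the tesseract source tree:

```shell
# Sketch: rebuild library + training tools together so their
# versions match, then refresh the dynamic linker cache.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
set -e
cd tesseract            # the git checkout used for compiling
make clean
./autogen.sh && ./configure
make && sudo make install && sudo ldconfig
make training && sudo make training-install
EOF
grep -c 'ldconfig' "$sketch"
```

If a distro-packaged tesseract is also installed, removing it (or ensuring /usr/local comes first in the library search path) avoids mixing the two.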
Hey @Shreeshrii, I'm using your Makefile in a Docker container to train Tesseract 5 on an English font, just to see if my setup works.
I've been encountering this issue for a while now:
Loaded file data/eng/eng.lstm, unpacking...
Failed to continue from: data/eng/eng.lstm
I have tried traineddata from both tessdata_best and tessdata; the exact same error!
This is the output of combine_tessdata -e data/eng.traineddata data/eng/eng.lstm with tessdata_best:
Extracting tessdata components from data/eng.traineddata
Wrote data/eng/eng.lstm
Version:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
The command that fails is the following:
lstmtraining \
--continue_from data/eng/eng.lstm --old_traineddata data//eng.traineddata \
--traineddata data/engDejavu/engDejavu-proto.traineddata \
--train_listfile data/engDejavu/list.train \
--eval_listfile data/engDejavu/list.eval \
--max_iterations 100 \
--debug_interval -1 \
--learning_rate 0.0001 \
--target_error_rate 0.01 \
--model_output data/engDejavu/checkpoints/engDejavu
Dockerfile:
# Set docker image
FROM ubuntu:18.04
# Skip the configuration part
ENV DEBIAN_FRONTEND noninteractive
# Update and install dependencies
RUN apt-get update && \
apt-get install -y wget unzip bc vim python3-pip libleptonica-dev git htop
# Packages to compile Tesseract
RUN apt-get install -y --reinstall make && \
apt-get install -y g++ autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev libicu-dev \
libpango1.0-dev libcairo2-dev autoconf-archive rename ttf-mscorefonts-installer && fc-cache -f
# Set working directory
WORKDIR /app
RUN mkdir /app/src && cd /app/src
# # Set the locale
RUN apt-get install -y locales && locale-gen en_GB.UTF-8
ENV LC_ALL=en_GB.UTF-8
ENV LANG=en_GB.UTF-8
ENV LANGUAGE=en_GB.UTF-8
# # Copy requirements into the container at /app
COPY requirements.txt ./
RUN pip3 install -r requirements.txt
# # Compile Tesseract with training options (also feel free to update Tesseract versions and such!)
RUN mkdir src && cd /app/src && \
git clone https://github.com/tesseract-ocr/tesseract.git && \
cd /app/src/tesseract && \
./autogen.sh && ./configure --disable-graphics && make && make install && ldconfig && \
make training && make training-install
Any help or guidance is appreciated! thanks
Hi,
I want to read handwritten digits from an image. There is the MNIST handwritten digits dataset, but I do not understand how to train Tesseract on it. Do you have any idea?
Thanks in advance.
Hi.
I want to train 10 new fonts for the fas language.
I downloaded the data and created tiff and box files with:
https://github.com/tesseract-ocr/langdata_lstm/tree/master/fas
Now what should I do for training? What is the best lstmtraining command line to get the best output accuracy?
I'm very confused by lstmtraining.
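A hedged sketch of a plain fine-tuning run (flag names follow the official Tesseract LSTM training documentation; paths are placeholders, and the list file comes from the data-generation step). It is written to a file here rather than executed, since it needs tesseract's training tools and the generated .lstmf data on disk:

```shell
# Sketch: continue training the existing best (float) fas model
# on newly generated data. Iteration count is illustrative.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
# Extract the LSTM model from the float traineddata...
combine_tessdata -e tessdata_best/fas.traineddata fas.lstm
# ...and continue training it on the new fonts' data.
lstmtraining \
  --model_output output/fas \
  --continue_from fas.lstm \
  --old_traineddata tessdata_best/fas.traineddata \
  --traineddata data/fas/fas.traineddata \
  --train_listfile data/fas.training_files.txt \
  --max_iterations 3000
EOF
grep -c 'continue_from' "$sketch"
```

Note that --continue_from needs a model extracted from tessdata_best (the float models); the fast integer models cannot be used to continue training.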
Hi Shree,
Forgive me, as I'm new to LSTM and Tesseract.
I ran the Tesseract API on Windows. It works best with the default tessdata_best/eng.traineddata (sometimes it performs better on your trained weights, /tessdata_shreetest/engrestrict_best.traineddata).
However, the results are not very satisfactory: about 55% decoded correctly (images fail due to low contrast and unknown reasons).
Therefore, I am seeking to fine-tune it with some training. Then I realized the intended training datasets aren't built from images; rather, images are generated from text (I'm not very sure, as the documentation is extremely confusing).
Q1: Can I train it with my own images (contrast varies), in the hope that it performs better on my dataset?
Example:
aaa_1_2_0
Default weights: ASA
Ground Truth: A8A
I am getting an invalid syntax error when running the training:
mv -v data/ground-truth/engImpact-eval/eng.training_files.txt data/engImpact/list.eval
renamed 'data/ground-truth/engImpact-eval/eng.training_files.txt' -> 'data/engImpact/list.eval'
sed -i -e '$a\' data/engImpact/list.eval
mv -v data/ground-truth/engImpact-eval/eng/eng.* data/engImpact/
renamed 'data/ground-truth/engImpact-eval/eng/eng.charset_size=103.txt' -> 'data/engImpact/eng.charset_size=103.txt'
renamed 'data/ground-truth/engImpact-eval/eng/eng.traineddata' -> 'data/engImpact/eng.traineddata'
renamed 'data/ground-truth/engImpact-eval/eng/eng.unicharset' -> 'data/engImpact/eng.unicharset'
rename "s/eng\./engImpact-eval\./g" data/engImpact/*.*
removed directory 'data/ground-truth/engImpact-eval/eng'
bash box2gt.sh data/ground-truth/engImpact-eval
creating gt from box files for data/ground-truth/engImpact-eval
File "generate_gt_from_box.py", line 30
print(''.join(line.replace(" ", "\u001f ").split(' ', 1)[0] for line in boxfile if line), file=gtstring)
SyntaxError: invalid syntax
Python version
root@80c3756f06b2:~/tess5train-fonts# python --version
Python 2.7.18
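The SyntaxError comes from running the script under Python 2.7.18: print(..., file=...) is Python-3-only syntax, so invoking the script with python3 instead of python avoids it. A minimal reproduction of the idiom (simplified, not the exact line from generate_gt_from_box.py):

```shell
# print() with a file= keyword works under Python 3; under Python 2
# the identical line fails to even parse.
out=$(python3 - <<'EOF'
import io
gtstring = io.StringIO()
boxlines = ["A 12 4 18 20 0\n", "B 20 4 26 20 0\n"]
# Take the character column of each box line, as the script does:
print(''.join(line.split(' ', 1)[0] for line in boxlines if line), file=gtstring)
print(gtstring.getvalue().strip())
EOF
)
echo "$out"
```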
Is there a way to extract the neural network parameters from the traineddata?
I would like to recreate a tensorflow version of the model with the trained weights.
Kind Regards.
Hi!
I ran into an issue while using tess5train on Arch Linux. The rename command as used in the Makefile (example below) is, at least on Arch Linux, from the util-linux package, which is the same on Debian/Ubuntu distributions. The default rename command does not support PCRE regular expressions, which causes your Makefile to fail. I'm not sure about Debian-based distros, but on Arch the package that provides the Perl version is perl-rename, and the associated command is perl-rename.
See the Makefile, line 456, at commit 3d98a34.
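Until the Makefile is adjusted, a portable workaround is to install perl-rename, or to do the substitution with plain bash parameter expansion, which needs no rename command at all. A sketch of the latter on throwaway files (the substitution mirrors the Makefile's s/eng\./engImpact-eval\./g):

```shell
# Rename eng.* files to engImpact-eval.* with bash's ${var/pat/repl},
# avoiding the PCRE-capable Perl rename entirely. (Requires bash.)
tmp=$(mktemp -d)
touch "$tmp/eng.unicharset" "$tmp/eng.traineddata" "$tmp/eng.charset_size=103.txt"
for f in "$tmp"/eng.*; do
  mv "$f" "${f/eng./engImpact-eval.}"
done
ls "$tmp"
```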
Hi,
Which traineddata file is best for Arabic with version 4?
many thanks
Hi @Shreeshrii, I have a question: I can't understand the difference between the two. I need your explanation, please.
For more information: I will add the new characters (Arabic-Indic numbers) to the training_text file and fine-tune the best ara.traineddata file.
The fonts shown by language-specific.sh differ from the result of the command text2image --fonts_dir /usr/share/fonts --list_available_fonts. Why?
Hello @Shreeshrii,
I have a question about the results of Tesseract training. Could you please tell me what these results are and what each value means?
I'm a beginner at Tesseract training.
Mean rms=1.559%, delta=7.841%, char train=15.7%, word train=35.68%, skip ratio=0.1%, New best char error = 15.7
Eval Char error rate=6.9447893, Word error rate=27.039255
wrote best model:./SANLAYER/LAYER15.7_61423.checkpoint wrote checkpoint.
Thank you, @Shreeshrii.
Hi,
Sorry for the disturbance :(
I executed 1-makedata.sh for lang eng successfully. Then I also tried lang "vie" with the same preparations, but I received the error "Error: Call PrepareToWrite before WriteTesseractBoxFile!!".
Please instruct me on how to check and resolve this issue.
Thank you very much!
***** Making training data for vietrain set for scratch and impact training.
***** This uses the fontlist for LATIN script fonts from src/training/language-specific.sh
=== Starting training for language 'vie'
[Thứ tư, 03 Tháng 4 năm 2019 17:15:55 +07] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts --font=Arial Unicode MS Bold --outputbase=/tmp/font_tmp.WoOd0JRJpF/sample_text.txt --text=/tmp/font_tmp.WoOd0JRJpF/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.WoOd0JRJpF
Could not find font named 'Arial Unicode MS Bold'.
Pango suggested font 'FreeMono'.
Please correct --font arg.
***** Making evaluation data for vieeval set for scratch and impact training using Impact font.
=== Starting training for language 'vie'
[Thứ tư, 03 Tháng 4 năm 2019 17:17:21 +07] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts --font=Impact Condensed --outputbase=/tmp/font_tmp.S9eg3FMb6E/sample_text.txt --text=/tmp/font_tmp.S9eg3FMb6E/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.S9eg3FMb6E
Rendered page 0 to file /tmp/font_tmp.S9eg3FMb6E/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Impact Condensed
[Thứ tư, 03 Tháng 4 năm 2019 17:18:27 +07] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.S9eg3FMb6E --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0 --max_pages=0 --font=Impact Condensed --text=../langdata/vie/vie.training_text
Stripped 361 unrenderable words
Rendered page 0 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 343 unrenderable words
Rendered page 1 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 331 unrenderable words
Rendered page 2 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 294 unrenderable words
Rendered page 3 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
=== Phase UP: Generating unicharset and unichar properties files ===
[Thứ tư, 03 Tháng 4 năm 2019 17:18:38 +07] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/vie-2019-04-03.oVj/vie.unicharset --norm_mode 1 /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.box
Failed to read data from: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.box
Wrote unicharset file /tmp/vie-2019-04-03.oVj/vie.unicharset
[Thứ tư, 03 Tháng 4 năm 2019 17:18:39 +07] /usr/local/bin/set_unicharset_properties -U /tmp/vie-2019-04-03.oVj/vie.unicharset -O /tmp/vie-2019-04-03.oVj/vie.unicharset -X /tmp/vie-2019-04-03.oVj/vie.xheights --script_dir=../langdata
Loaded unicharset of size 3 from file /tmp/vie-2019-04-03.oVj/vie.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/vie-2019-04-03.oVj/vie.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Thứ tư, 03 Tháng 4 năm 2019 17:18:39 +07] /usr/local/bin/tesseract /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-223-g3e71 with Leptonica
Page 1
Failed to read boxes from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Page 2
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
Page 3
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
Page 4
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
ERROR: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf does not exist or is not readable
Hi @Shreeshrii ,
I am trying to train Tesseract 4.0 on Windows, but when I installed Tesseract 4.0 I didn't find the lstmbox file or many of the other training tools. After doing some research, I found that they need to be built. I couldn't find any documentation on how to build the training tools for Tesseract 4.0 on Windows. Can you please guide me?
Thanks in advance!
Harathi
engLayer
Replace the top layer of the network of tessdata_best/eng.traineddata to add multiple characters, such as superscripts and fraction symbols, using multiple fonts which support those characters. Evaluation is done on data using the same fonts. For replacing the top layer, we cut off the last LSTM layer and the softmax, and replace them with a smaller LSTM layer and a new softmax.
For a new language, it is possible to cut off the top layers of an existing network and train, as if from scratch, but a fairly large amount of training data is still required to avoid over-fitting.
That is a very interesting note. I was not able to find any documentation on the impact of top-layer training on the base model.
Still, the question is: how much is the base model affected during top-layer training?
I know the rule of thumb for fine-tuning is to keep the iterations below 400, to avoid tampering with the base model. Does the same kind of recommendation apply to training after removing the top layer?
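The cut-and-replace described above maps onto lstmtraining's --append_index and --net_spec options (flag names from the official Tesseract LSTM training guide; paths and layer sizes below are placeholders). A hedged sketch, written to a file rather than executed since it needs the training tools and generated data on disk:

```shell
# Sketch: keep the network up to layer index 5 of the extracted
# eng.lstm, then append a smaller LSTM plus a fresh softmax.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
# In O1c1, set the final count to the new unicharset size (e.g. O1c111).
lstmtraining \
  --continue_from eng.lstm \
  --old_traineddata tessdata_best/eng.traineddata \
  --traineddata data/engLayer/engLayer.traineddata \
  --append_index 5 --net_spec '[Lfx192 O1c1]' \
  --model_output output/engLayer \
  --train_listfile data/engLayer/list.train \
  --eval_listfile data/engLayer/list.eval \
  --max_iterations 3000
EOF
grep -c 'append_index' "$sketch"
```

Because everything below the appended layers keeps its trained weights, only the new top layers start from scratch; how far the lower layers then drift during continued training is exactly the open question raised above.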