shreeshrii / tess5train-fonts
Files and Scripts to run Tesseract 5 LSTM Training using fonts
License: Apache License 2.0
Tesseract Version: v4.0.0.20181030
Platform: Ubuntu16
I have some articles, containing both English and Greek letters, that need to be OCRed, so I turned to Tesseract.
After installing Tesseract successfully, I opened a terminal, ran the command tesseract detector_sample_1.png result -l eng+grc, and got result.txt as the output.
The original image, named "detector_sample_1.png", is shown below.
The resulting result.txt is shown below as well.
I found that Tesseract works quite well if the content in the red block(s) is disregarded.
Greek letters do not actually appear very frequently in these articles, so I came up with the idea of retraining/fine-tuning the existing eng.traineddata.
Therefore, I resorted to your code.
After reading your README.md, I think I should first run 8-makedata_layernew.sh and then 9-layernew.sh (with some modifications, of course!).
Since I need to fine-tune eng.traineddata with Greek letters, I prepared a training text,
eng.anhao.training_text.txt. (I had to change the extension to .txt because I cannot upload a file with the extension .training_text.) I only ran cat ../langdata/eng/eng.training_text ../langdata/eng/eng.anhao.training_text > ../langdata/eng/eng.layer.training_text (in 8-makedata_layernew.sh). In addition, I prepared a new test file,
eng.layertest.training_text.txt.
Then I ran ./8-makedata_layernew.sh and 9-layernew.sh, and obtained eng_layer.traineddata.
It is disappointing that the performance degraded, although eng_layer.traineddata can recognize some Greek letters.
In short: I tried to extend the existing eng.traineddata model with Greek letters using your code, but the result is disappointing. I hope you can help me.
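Before generating training data, it is worth sanity-checking that the merged training text actually contains the Greek characters to be added. A minimal sketch of the cat step above, using a scratch directory and tiny placeholder contents (file names mirror the real ones, but the text is illustrative):

```shell
# Merge the base English training text with the Greek additions,
# then confirm that Greek letters made it into the combined file.
workdir=$(mktemp -d)
printf 'The quick brown fox jumps over the lazy dog.\n' > "$workdir/eng.training_text"
printf 'Sample line with Greek: alpha is α, beta is β.\n' > "$workdir/eng.anhao.training_text"
cat "$workdir/eng.training_text" "$workdir/eng.anhao.training_text" \
    > "$workdir/eng.layer.training_text"
# Count lines of the merged file that contain a Greek letter.
greek_lines=$(grep -c 'α' "$workdir/eng.layer.training_text")
echo "$greek_lines"
```

If the count is zero, the new characters never reached the merged training text, and no amount of layer training will add them.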
Dear @Shreeshrii,
This is the first time I am trying to train Tesseract with Bangla and Arabic fonts. I've checked the past documentation on fine-tuning Tesseract for new fonts, but none of it works after the Tesseract 5 release. While working with this repository I've faced a few issues. I am not sure where to start; the README of this repo is not helpful for newcomers like me, especially someone who is not very comfortable with Linux commands. A Pythonic way of fine-tuning Tesseract would be great, but I believe even this repository is good enough for me to understand; all I need is a step-by-step procedure (detailed documentation).
I am trying to train Tesseract with Bangla fonts. If you could kindly provide a step-by-step procedure for Bangla font training, it would be highly appreciated. Thank you in advance.
Dear Shree,
I executed 1-makedata.sh.
The final result seems OK, except that an error was thrown: Failed to read data from: ../langdata/eng/eng.config
There is no eng.config; it is also not mentioned in the instructions.
Is this error a small thing that we can ignore?
Thank you!
../langdata/eng/eng.wordlist --numbers ../langdata/eng/eng.numbers --puncs ../langdata/eng/eng.punc --output_dir ../tesstutorial/engeval --lang eng
Loaded unicharset of size 111 from file /tmp/eng-2019-04-03.nuH/eng.unicharset
Setting unichar properties
Other case É of é is not in unicharset
Setting script properties
Warning: properties incomplete for index 25 = ~
Config file is optional, continuing...
Failed to read data from: ../langdata/eng/eng.config
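As the log itself notes ("Config file is optional, continuing..."), the missing file is not fatal. Creating an empty eng.config at the expected path silences the message; a minimal sketch using a scratch directory in place of the script's ../langdata path:

```shell
# The language config file is optional; an empty one is enough to
# make the "Failed to read data from: .../eng.config" message go away.
langdir=$(mktemp -d)/langdata/eng
mkdir -p "$langdir"
touch "$langdir/eng.config"
ls "$langdir"
```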
Hi Shree, I'm getting this error when I run 6-plusminus.sh (after 5-makedata_plusminus.sh).
My plusminus training adds Arabic-Indic characters.
I think it's not training at all, because my screen is full of this error:
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
Image not trainable
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Compute CTC targets failed!
Image too small to scale!! (1x48 vs min width of 3)
Line cannot be recognized!!
But the strange thing is that it does correctly start to recognize some of the new characters:
Truth:ةيمسر ةيلود ةارابم لوأ يف ٢٤٥٦± ادنلتكسا دض ٧٨± بعلت ارتلجنإ
OCR :ةيمسر ةيلود ةارابم لوأ يف ،(٥٦± ادنلتكسا دض ،٩± بعلت ارتلجنإ
Truth:ىلع نماثلا نرقلا ١٤± ىلإ ارتلجنإ يف مدقلا ةرك خيرات
OCR :ىلع نماثلا نرقلا ١٤± ىلإ ارتلجنإ يف مدقلا ةرك خيرات
A sample of my plusminus training text is:
الان هناك زيادة قدرها فى المائة واشار وزير التجارة البحرينى الى
ن النُّمو يعبر، عن الزيادة ±٤٨ الحاصلة في الإنتاج، فإنه يأخذ بعين
إحصائية مقاطع و المزيد نشيط عماء، هذا نغمات 7 ومن التسجيل:
في » في حتى إرسال البيانات؟ = , معلومات اسم برامج أحمد
"النمو في الدخل"، لأن توزيع الدخل إذا كان حاداً(حتى بوجود النمو)
أن لها أسماء أخرى غير عربية عند شعوب مسلمة أخرى
وذو القعدة وذو الحجة ومحرم. ولأن الله نعتها بالدين القيم، فقد
ونبه الى ان هدف الحكومة ±٣٣ البحرينية فى المرحلة الحالية هو
ويمثل فى المائة من الاقتصاد العالمي واكد الوزير البحرينى ان فتح
تماعات السابقة، الأولى هي حمل الكرة باليد والجري بها
تشهد نسبة عالية من الجريمة ±٢٢ لكن هذه السماء، تبقى ماء، هامة
والبضائع التى تأتى للبحرين يمكن اعادة تصديرها لدول المنطقة
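The "Image too small to scale!! (1x48 vs min width of 3)" messages point to degenerate, 1-pixel-wide character boxes in the generated training data. Assuming the standard Tesseract box format (char left bottom right top page), a quick awk pass can locate such boxes; the sample file below is fabricated for illustration:

```shell
# Find box entries narrower than 3 pixels (right - left < 3),
# the ones training rejects as "Image too small to scale".
boxfile=$(mktemp)
cat > "$boxfile" <<'EOF'
a 10 10 11 58 0
b 20 10 40 58 0
c 50 10 51 58 0
EOF
narrow=$(awk '($4 - $2) < 3' "$boxfile" | wc -l)
echo "$narrow narrow boxes"
```

Running this over the generated .box files shows which characters (often thin punctuation or combining marks) are producing the unusable boxes.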
Dear Shree,
I executed the command below and got an error.
I installed tesseract from a package, and also compiled it from https://github.com/tesseract-ocr/tesseract/wiki/Compiling-%E2%80%93-GitInstallation.
Which one should be removed? Or what is the best way to fix this?
Thank you!
./1-makedata.sh
ERROR: shared library version mismatch (was 4.1.0-rc1-223-g3e71, expected 4.1.0-rc1-184-g497d)
Did you use a wrong shared tesseract library?
$ tesseract --version
tesseract 4.1.0-rc1-223-g3e71
leptonica-1.78.0
libpng 1.6.34 : zlib 1.2.11
Found SSE
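A hedged sketch of one common remedy (an assumption about this setup, not a confirmed diagnosis): the training binaries were built against a different libtesseract than the one currently installed, so rebuilding and reinstalling both the library and the training tools from the same checkout, then refreshing the linker cache, brings them back in sync. Written to a file here rather than executed, since it needs the tesseract source tree:

```shell
# Sketch: rebuild library + training tools together so their
# versions match, then refresh the dynamic linker cache.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
set -e
cd tesseract            # the git checkout used for compiling
make clean
./autogen.sh && ./configure
make && sudo make install && sudo ldconfig
make training && sudo make training-install
EOF
grep -c 'ldconfig' "$sketch"
```

If a distro-packaged tesseract is also installed, removing it (or ensuring /usr/local comes first in the library search path) avoids mixing the two.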
Hey @Shreeshrii, I'm using your Makefile in a Docker container to train Tesseract 5 on an English font, just to see if my setup works.
I've been encountering this issue for a while now:
Loaded file data/eng/eng.lstm, unpacking...
Failed to continue from: data/eng/eng.lstm
I have tried traineddata from both tessdata_best and tessdata; the exact same error!
This is the output of combine_tessdata -e data/eng.traineddata data/eng/eng.lstm with tessdata_best:
Extracting tessdata components from data/eng.traineddata
Wrote data/eng/eng.lstm
Version:4.00.00alpha:eng:synth20170629:[1,36,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
17:lstm:size=11689099, offset=192
18:lstm-punc-dawg:size=4322, offset=11689291
19:lstm-word-dawg:size=3694794, offset=11693613
20:lstm-number-dawg:size=4738, offset=15388407
21:lstm-unicharset:size=6360, offset=15393145
22:lstm-recoder:size=1012, offset=15399505
23:version:size=80, offset=15400517
The command that fails is the following:
lstmtraining \
--continue_from data/eng/eng.lstm --old_traineddata data//eng.traineddata \
--traineddata data/engDejavu/engDejavu-proto.traineddata \
--train_listfile data/engDejavu/list.train \
--eval_listfile data/engDejavu/list.eval \
--max_iterations 100 \
--debug_interval -1 \
--learning_rate 0.0001 \
--target_error_rate 0.01 \
--model_output data/engDejavu/checkpoints/engDejavu
Dockerfile:
# Set docker image
FROM ubuntu:18.04
# Skip the configuration part
ENV DEBIAN_FRONTEND noninteractive
# Update and install dependencies
RUN apt-get update && \
apt-get install -y wget unzip bc vim python3-pip libleptonica-dev git htop
# Packages to compile Tesseract
RUN apt-get install -y --reinstall make && \
apt-get install -y g++ autoconf automake libtool pkg-config libpng-dev libjpeg8-dev libtiff5-dev libicu-dev \
libpango1.0-dev libcairo2-dev autoconf-archive rename ttf-mscorefonts-installer && fc-cache -f
# Set working directory
WORKDIR /app
RUN mkdir /app/src && cd /app/src
# # Set the locale
RUN apt-get install -y locales && locale-gen en_GB.UTF-8
ENV LC_ALL=en_GB.UTF-8
ENV LANG=en_GB.UTF-8
ENV LANGUAGE=en_GB.UTF-8
# # Copy requirements into the container at /app
COPY requirements.txt ./
RUN pip3 install -r requirements.txt
# # Compile Tesseract with training options (also feel free to update Tesseract versions and such!)
RUN mkdir src && cd /app/src && \
git clone https://github.com/tesseract-ocr/tesseract.git && \
cd /app/src/tesseract && \
./autogen.sh && ./configure --disable-graphics && make && make install && ldconfig && \
make training && make training-install
Any help or guidance is appreciated! thanks
Hi,
I want to read handwritten digits from an image. There is the MNIST handwritten digits dataset, but I do not understand how to train Tesseract on it. Do you have any idea?
Thanks in advance.
Hi.
I want to train 10 new fonts for the fas language.
I downloaded the data and created tiff and box files with:
https://github.com/tesseract-ocr/langdata_lstm/tree/master/fas
Now what should I do for training? What is the best lstmtraining command line to get the best output accuracy?
I'm very confused by lstmtraining.
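A hedged sketch of a plain fine-tuning run (flag names follow the official Tesseract LSTM training documentation; paths are placeholders, and the list file comes from the data-generation step). It is written to a file here rather than executed, since it needs tesseract's training tools and the generated .lstmf data on disk:

```shell
# Sketch: continue training the existing best (float) fas model
# on newly generated data. Iteration count is illustrative.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
# Extract the LSTM model from the float traineddata...
combine_tessdata -e tessdata_best/fas.traineddata fas.lstm
# ...and continue training it on the new fonts' data.
lstmtraining \
  --model_output output/fas \
  --continue_from fas.lstm \
  --old_traineddata tessdata_best/fas.traineddata \
  --traineddata data/fas/fas.traineddata \
  --train_listfile data/fas.training_files.txt \
  --max_iterations 3000
EOF
grep -c 'continue_from' "$sketch"
```

Note that --continue_from needs a model extracted from tessdata_best (the float models); the fast integer models cannot be used to continue training.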
Hi Shree,
Forgive me, as I'm new to LSTM and Tesseract.
I ran the Tesseract API on Windows. It works best with the default tessdata_best/eng.traineddata (sometimes it performs better on your trained weights, /tessdata_shreetest/engrestrict_best.traineddata).
However, the results are not very satisfactory: about 55% decoded correctly (images fail due to low contrast and unknown reasons).
Therefore, I am seeking to fine-tune it with some training. Then I realized the intended training datasets aren't built from images; rather, images are generated from text (I'm not very sure, as the documentation is extremely confusing).
Q1: Can I train it with my own images (contrast varies), in the hope that it performs better on my dataset?
Example:
aaa_1_2_0
Default weights: ASA
Ground Truth: A8A
I am getting an invalid syntax error when running the training:
mv -v data/ground-truth/engImpact-eval/eng.training_files.txt data/engImpact/list.eval
renamed 'data/ground-truth/engImpact-eval/eng.training_files.txt' -> 'data/engImpact/list.eval'
sed -i -e '$a\' data/engImpact/list.eval
mv -v data/ground-truth/engImpact-eval/eng/eng.* data/engImpact/
renamed 'data/ground-truth/engImpact-eval/eng/eng.charset_size=103.txt' -> 'data/engImpact/eng.charset_size=103.txt'
renamed 'data/ground-truth/engImpact-eval/eng/eng.traineddata' -> 'data/engImpact/eng.traineddata'
renamed 'data/ground-truth/engImpact-eval/eng/eng.unicharset' -> 'data/engImpact/eng.unicharset'
rename "s/eng\./engImpact-eval\./g" data/engImpact/*.*
removed directory 'data/ground-truth/engImpact-eval/eng'
bash box2gt.sh data/ground-truth/engImpact-eval
creating gt from box files for data/ground-truth/engImpact-eval
File "generate_gt_from_box.py", line 30
print(''.join(line.replace(" ", "\u001f ").split(' ', 1)[0] for line in boxfile if line), file=gtstring)
SyntaxError: invalid syntax
Python version
root@80c3756f06b2:~/tess5train-fonts# python --version
Python 2.7.18
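The SyntaxError comes from running the script under Python 2.7.18: print(..., file=...) is Python-3-only syntax, so invoking the script with python3 instead of python avoids it. A minimal reproduction of the idiom (simplified, not the exact line from generate_gt_from_box.py):

```shell
# print() with a file= keyword works under Python 3; under Python 2
# the identical line fails to even parse.
out=$(python3 - <<'EOF'
import io
gtstring = io.StringIO()
boxlines = ["A 12 4 18 20 0\n", "B 20 4 26 20 0\n"]
# Take the character column of each box line, as the script does:
print(''.join(line.split(' ', 1)[0] for line in boxlines if line), file=gtstring)
print(gtstring.getvalue().strip())
EOF
)
echo "$out"
```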
Is there a way to extract the neural network parameters from the traineddata?
I would like to recreate a tensorflow version of the model with the trained weights.
Kind Regards.
Hi!
I ran into an issue while using tess5train on Arch Linux. The rename command as used in the Makefile (example below) is, at least on Arch Linux, from the util-linux package, which is the same on Debian/Ubuntu distributions. The default rename command does not support PCRE regular expressions, which causes your Makefile to fail. I'm not sure about Debian-based distros, but on Arch the package that provides the Perl version is perl-rename, and the associated command is perl-rename.
See the Makefile, line 456, at commit 3d98a34.
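Until the Makefile is adjusted, a portable workaround is to install perl-rename, or to do the substitution with plain bash parameter expansion, which needs no rename command at all. A sketch of the latter on throwaway files (the substitution mirrors the Makefile's s/eng\./engImpact-eval\./g):

```shell
# Rename eng.* files to engImpact-eval.* with bash's ${var/pat/repl},
# avoiding the PCRE-capable Perl rename entirely. (Requires bash.)
tmp=$(mktemp -d)
touch "$tmp/eng.unicharset" "$tmp/eng.traineddata" "$tmp/eng.charset_size=103.txt"
for f in "$tmp"/eng.*; do
  mv "$f" "${f/eng./engImpact-eval.}"
done
ls "$tmp"
```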
Hi,
Which traineddata file is best for Arabic with version 4?
many thanks
Hi @Shreeshrii, I have a question: I can't understand the difference between the two. I need your explanation, please.
For more information: I will add the new characters (Arabic-Indic numbers) to the training_text file and fine-tune the best ara.traineddata file.
The fonts shown by language-specific.sh differ from the result of the command text2image --fonts_dir /usr/share/fonts --list_available_fonts. Why?
Hello @Shreeshrii,
I have a question about the results of Tesseract training. Could you please tell me what these results are and what each value means?
I'm a beginner at Tesseract training.
Mean rms=1.559%, delta=7.841%, char train=15.7%, word train=35.68%, skip ratio=0.1%, New best char error = 15.7
Eval Char error rate=6.9447893, Word error rate=27.039255
wrote best model:./SANLAYER/LAYER15.7_61423.checkpoint wrote checkpoint.
Thank you, @Shreeshrii.
Hi,
Sorry for the disturbance :(
I executed 1-makedata.sh for lang eng successfully. Then I also tried lang "vie" with the same preparations, but I received the error "Error: Call PrepareToWrite before WriteTesseractBoxFile!!".
Please instruct me on how to check and resolve this issue.
Thank you very much!
***** Making training data for vietrain set for scratch and impact training.
***** This uses the fontlist for LATIN script fonts from src/training/language-specific.sh
=== Starting training for language 'vie'
[Thứ tư, 03 Tháng 4 năm 2019 17:15:55 +07] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts --font=Arial Unicode MS Bold --outputbase=/tmp/font_tmp.WoOd0JRJpF/sample_text.txt --text=/tmp/font_tmp.WoOd0JRJpF/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.WoOd0JRJpF
Could not find font named 'Arial Unicode MS Bold'.
Pango suggested font 'FreeMono'.
Please correct --font arg.
***** Making evaluation data for vieeval set for scratch and impact training using Impact font.
=== Starting training for language 'vie'
[Thứ tư, 03 Tháng 4 năm 2019 17:17:21 +07] /usr/local/bin/text2image --fonts_dir=/usr/share/fonts --font=Impact Condensed --outputbase=/tmp/font_tmp.S9eg3FMb6E/sample_text.txt --text=/tmp/font_tmp.S9eg3FMb6E/sample_text.txt --fontconfig_tmpdir=/tmp/font_tmp.S9eg3FMb6E
Rendered page 0 to file /tmp/font_tmp.S9eg3FMb6E/sample_text.txt.tif
=== Phase I: Generating training images ===
Rendering using Impact Condensed
[Thứ tư, 03 Tháng 4 năm 2019 17:18:27 +07] /usr/local/bin/text2image --fontconfig_tmpdir=/tmp/font_tmp.S9eg3FMb6E --fonts_dir=/usr/share/fonts --strip_unrenderable_words --leading=32 --xsize=3600 --char_spacing=0.0 --exposure=0 --outputbase=/tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0 --max_pages=0 --font=Impact Condensed --text=../langdata/vie/vie.training_text
Stripped 361 unrenderable words
Rendered page 0 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 343 unrenderable words
Rendered page 1 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 331 unrenderable words
Rendered page 2 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Stripped 294 unrenderable words
Rendered page 3 to file /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Null box at index 0
Error: Call PrepareToWrite before WriteTesseractBoxFile!!
=== Phase UP: Generating unicharset and unichar properties files ===
[Thứ tư, 03 Tháng 4 năm 2019 17:18:38 +07] /usr/local/bin/unicharset_extractor --output_unicharset /tmp/vie-2019-04-03.oVj/vie.unicharset --norm_mode 1 /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.box
Failed to read data from: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.box
Wrote unicharset file /tmp/vie-2019-04-03.oVj/vie.unicharset
[Thứ tư, 03 Tháng 4 năm 2019 17:18:39 +07] /usr/local/bin/set_unicharset_properties -U /tmp/vie-2019-04-03.oVj/vie.unicharset -O /tmp/vie-2019-04-03.oVj/vie.unicharset -X /tmp/vie-2019-04-03.oVj/vie.xheights --script_dir=../langdata
Loaded unicharset of size 3 from file /tmp/vie-2019-04-03.oVj/vie.unicharset
Setting unichar properties
Setting script properties
Writing unicharset to file /tmp/vie-2019-04-03.oVj/vie.unicharset
=== Phase E: Generating lstmf files ===
Using TESSDATA_PREFIX=./tessdata
[Thứ tư, 03 Tháng 4 năm 2019 17:18:39 +07] /usr/local/bin/tesseract /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0 lstm.train
Tesseract Open Source OCR Engine v4.1.0-rc1-223-g3e71 with Leptonica
Page 1
Failed to read boxes from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.tif
Page 2
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
Page 3
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
Page 4
Deserialize header failed: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf
Failed to read training data from /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf!
ERROR: /tmp/vie-2019-04-03.oVj/vie.Impact_Condensed.exp0.lstmf does not exist or is not readable
Hi @Shreeshrii ,
I am trying to train Tesseract 4.0 on Windows, but when I installed Tesseract 4.0 I didn't find the lstmbox file or many of the other training tools. After doing some research, I found that they need to be built. I couldn't find any documentation on how to build the training tools for Tesseract 4.0 on Windows. Can you please guide me?
Thanks in advance!
Harathi
engLayer
Replace the top layer of the network of tessdata_best/eng.traineddata to add multiple characters, such as superscripts and fraction symbols, using multiple fonts which support those characters. Evaluation is done on data using the same fonts. For replacing the top layer, we cut off the last LSTM layer and the softmax, and replace them with a smaller LSTM layer and a new softmax.
For a new language, it is possible to cut off the top layers of an existing network and train, as if from scratch, but a fairly large amount of training data is still required to avoid over-fitting.
That is a very interesting note. I was not able to find any documentation on the impact of top-layer training on the base model.
Still, the question is: how much is the base model affected during top-layer training?
I know the rule of thumb for fine-tuning is to keep the iterations below 400, to avoid tampering with the base model. Does the same kind of recommendation apply to training after removing the top layer?
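The cut-and-replace described above maps onto lstmtraining's --append_index and --net_spec options (flag names from the official Tesseract LSTM training guide; paths and layer sizes below are placeholders). A hedged sketch, written to a file rather than executed since it needs the training tools and generated data on disk:

```shell
# Sketch: keep the network up to layer index 5 of the extracted
# eng.lstm, then append a smaller LSTM plus a fresh softmax.
sketch=$(mktemp)
cat > "$sketch" <<'EOF'
#!/bin/sh
# In O1c1, set the final count to the new unicharset size (e.g. O1c111).
lstmtraining \
  --continue_from eng.lstm \
  --old_traineddata tessdata_best/eng.traineddata \
  --traineddata data/engLayer/engLayer.traineddata \
  --append_index 5 --net_spec '[Lfx192 O1c1]' \
  --model_output output/engLayer \
  --train_listfile data/engLayer/list.train \
  --eval_listfile data/engLayer/list.eval \
  --max_iterations 3000
EOF
grep -c 'append_index' "$sketch"
```

Because everything below the appended layers keeps its trained weights, only the new top layers start from scratch; how far the lower layers then drift during continued training is exactly the open question raised above.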