Comments (4)
Thanks for this very important information. Our dataset is missing the book dirs 0000 to 0315 in the training set (maybe the extractor crashed or the download failed), which accounts exactly for the missing count...
The single missing line in the evaluation set is most likely a line whose width becomes 0 when rescaled to a height of 48px, so it is ignored by Calamari. Most probably this is a line that was rotated by 90 degrees.
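To illustrate how a rotated line can end up with a width of 0, here is a minimal sketch of aspect-preserving rescaling to a height of 48px. The function name `scaled_width`, the example image sizes, and the truncation to an integer are assumptions for illustration, not Calamari's actual code.

```python
def scaled_width(width, height, target_height=48):
    """Width after scaling a line image to target_height, keeping the aspect ratio.

    Truncating to an integer is an assumption of this sketch.
    """
    return int(width * target_height / height)


# Hypothetical image sizes, chosen only to show the effect.
normal_line = scaled_width(1200, 60)   # wide, short image: a regular text line
rotated_line = scaled_width(10, 2000)  # narrow, tall image: a line rotated by 90 degrees

print(normal_line, rotated_line)
```

A very tall and narrow image shrinks to less than one pixel in width, so there is nothing left to recognize.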
from calamari.
Thanks for your prompt response Christoph!
Following your information, I counted the GT lines again using the following code:
import glob
from pathlib import Path

uw_train = Path('./uw-lines/train')
uw_train_gt = glob.glob(str(uw_train / '*/*.gt.txt'))
# Book dirs 0000 to 0315 are skipped to match the reference dataset
discarded_data = set(range(0, 316))
num_gt_lines = 0
num_train_chars = 0
train_codes = set()
for gt in uw_train_gt:
    # The parent directory name is the book number
    if int(Path(gt).parent.name) not in discarded_data:
        with Path(gt).open() as f:
            # Read the first line of the GT file without its trailing newline
            gt_codes = list(f.readline().rstrip('\n'))
        num_gt_lines += 1
        num_train_chars += len(gt_codes)
        train_codes.update(gt_codes)
but my results are still different compared to your numbers:
- GT lines for training (num_gt_lines): 72,907
- Chars for training (num_train_chars): 3,423,754
- Codes for training (len(train_codes)): 89
Did I do something wrong with my code?
This should be correct! I extracted everything anew and got the counts you posted above for training and evaluation.
Apparently even the 316 dir was not fully extracted (100 lines with 70K characters are missing). The codec discrepancy of 2 is possibly due to our postprocessing of chars (Unicode Roman numerals to Latin chars, quotes, ...)
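As a rough illustration of how such postprocessing can shrink the codec, here is a minimal sketch. The names `QUOTE_MAP` and `normalize_text` and the sample string are hypothetical, and the use of NFKC normalization is an assumption; the actual postprocessing applied to the dataset is not shown in this thread.

```python
import unicodedata

# Map typographic quotes to their plain ASCII counterparts (assumed mapping).
QUOTE_MAP = str.maketrans({
    '\u2018': "'", '\u2019': "'",
    '\u201c': '"', '\u201d': '"',
})


def normalize_text(text):
    """Fold compatibility characters (e.g. Roman numeral code points) and quotes."""
    return unicodedata.normalize('NFKC', text).translate(QUOTE_MAP)


# Sample text with a Roman numeral code point (U+2165) and curly quotes.
raw = 'VI \u2165 "a" \u201ca\u201d'
clean = normalize_text(raw)

# The codec is the set of distinct characters in the text.
print(len(set(raw)), len(set(clean)))
```

Counting codes on raw versus normalized text can therefore give slightly different codec sizes for the same data.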
I see, thanks for the info.