clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022

Home Page: https://arxiv.org/abs/2111.15664

License: MIT License

Python 100.00%
document-ai eccv-2022 multimodal-pre-trained-model ocr nlp computer-vision

donut's People

Contributors

dotneet, eltociear, gwkrsrch, mingosnake, moonbings, napatswift, samsamhuns


donut's Issues

Dataset for pre-training

First of all, thank you for open-sourcing the codebase and pre-trained models for tinkering. I am really excited to try new ideas to extend the project. Specifically, I want to train the model in a slightly different way. As mentioned in section 3.4, the CLOVA-based results are awe-inspiring compared to the others. I would be happy if you could share the preprocessed dataset for training purposes. :)

Inference results differ from test.py results

I trained a parser model using Donut on the SROIE dataset. After training, I ran test.py and got a Tree Edit Distance (TED) based accuracy score of 0.9960054721345021 and an F1 accuracy score of 0.9548872180451128. I also checked output.json, and it had predicted well. But when running inference on the same image, I am unable to get the same result; it is missing some of the keys.

Example: {'predictions': [{'date': '25/12/2018', 'address': 'NO.53 55,57 & 59, JALAN SAGU 18, TAMAN DAYA, 81100 JOHOR BAHRU, JOHOR.'}]}

In the above output, it missed "company" and "total".

Any reasons or suggestions here?
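
For reference, here is roughly how I run inference, mirroring test.py (a sketch; the task token "<s_sroie>" is my assumption for an SROIE-style fine-tune, and the paths are hypothetical):

from PIL import Image
from donut import DonutModel

model = DonutModel.from_pretrained("./result/train_sroie")  # hypothetical checkpoint path
model.eval()
image = Image.open("receipt.jpg")  # hypothetical test image
# test.py prompts with f"<s_{task_name}>"; "<s_sroie>" is assumed here.
print(model.inference(image=image, prompt="<s_sroie>")["predictions"][0])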

Thanks and Regards

Different input resolution throws error

Are different input resolutions/sizes not supported currently? The following is the error we get when we try to pass an input size of 512*2, 512*3:
Traceback (most recent call last):
  File "train.py", line 149, in <module>
    train(config)
  File "train.py", line 57, in train
    model_module = DonutModelPLModule(config)
  File "/home/souvic/Desktop/upwork1/donut/donut/lightning_module.py", line 35, in __init__
    ignore_mismatched_sizes=True,
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 595, in from_pretrained
    model = super(DonutModel, cls).from_pretrained(pretrained_model_name_or_path, revision="official", *model_args, **kwargs)
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/transformers/modeling_utils.py", line 2113, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 387, in __init__
    name_or_path=self.config.name_or_path,
  File "/home/souvic/Desktop/upwork1/donut/donut/donut/model.py", line 70, in __init__
    num_classes=0,
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 500, in __init__
    downsample=PatchMerging if (i < self.num_layers - 1) else None
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 408, in __init__
    for i in range(depth)])
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 408, in <listcomp>
    for i in range(depth)])
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 281, in __init__
    mask_windows = window_partition(img_mask, self.window_size)  # num_win, window_size, window_size, 1
  File "/home/souvic/anaconda3/envs/donut_official/lib/python3.7/site-packages/timm/models/swin_transformer.py", line 111, in window_partition
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
RuntimeError: shape '[1, 25, 10, 38, 10, 1]' is invalid for input of size 98304
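
For what it's worth, the traceback suggests the Swin encoder needs the feature map to be divisible by the window size at every stage. A rough sanity check under that assumption (patch size 4, three patch-merging steps, window size 10, i.e., each side a multiple of 320):

def is_valid_input_size(height, width, patch_size=4, num_merges=3, window_size=10):
    """Rough check (my assumption from the traceback): after patch embedding and
    each patch-merging stage, the feature map must still divide by window_size."""
    factor = patch_size * (2 ** num_merges) * window_size  # 4 * 8 * 10 = 320
    return height % factor == 0 and width % factor == 0

print(is_valid_input_size(1024, 1536))  # False -> the reported error
print(is_valid_input_size(1280, 960))   # True  -> the default fine-tuning size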

Fine Tuning with Arabic

First, I would like to thank you for this repo.
I want to work with the Arabic language, which is RTL.
Could you give me a brief overview of the changes I would need to make to add Arabic to SynthDoG to create an Arabic dataset, and to the model creation?

Finetuning Donut on FUNSD dataset

Hi,

Thank you for open-sourcing Donut and SynthDoG. I have two requests.

  1. After pre-training (the "how to read"/pseudo-OCR task), is there documentation about how to fine-tune ("how to understand") on a different dataset like FUNSD?
  2. Can we generate synthetic documents resembling forms/invoices using SynthDoG? If yes, can you provide hints on whether we need a template or something similar?

How to calculate field-level F1 score of CORD test set

Hi, @gwkrsrch ,

Donut is an excellent work for the VDU community! We can reproduce the tree-based edit-distance results on the CORD test set, but it is tricky to calculate the field-level F1 score based on the tree-based prediction. Could you please explain how to calculate the F1 score on the CORD test set?

Many thanks for your effort!

How to train and annotate on custom dataset

@gwkrsrch
It's a great project, but I do have a couple of questions about how to annotate my custom dataset.
I have 10K images with text on them, and I want to extract different categories from them, like price, object count, product name, and product description. Is there any tool to do so? If not, how can it be done?

donut processing on PDF Documents

Hello,

I have a few certificate documents in PDF format. I want to extract metadata from those documents as you suggest.
Could you please clarify the following points:

1. Can I use your model directly, without pre-training on the certificate data?
2. How do I train your model on my certificates, as they are confidential, and what folder structure do you expect for the training data?
3. How do I convert my dataset into your (SynthDoG) format? It was not very clear to me.

Thank you and looking forward to your response.

Best Regards,
Arun

How to use Donut Model encoders as embeddings?

Hi,
First of all amazing work!
I wanted to use a pre-trained Donut model to generate embeddings for my documents. Is there an easy way to do this, or would I need to make some changes to the forward function?
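
In case a sketch helps clarify what I'm after, something like the following, assuming DonutModel exposes its Swin encoder as model.encoder with a prepare_input preprocessing helper:

import torch
from PIL import Image
from donut import DonutModel

model = DonutModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()

image = Image.open("document.png")  # hypothetical input document
pixels = model.encoder.prepare_input(image).unsqueeze(0)  # (1, 3, H, W)
with torch.no_grad():
    patch_embeddings = model.encoder(pixels)  # (1, num_patches, hidden_dim)
doc_embedding = patch_embeddings.mean(dim=1)  # naive mean-pool into a single vector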
Thank you!

Using base model to OCR text

Hello,
Given that the pre-training method seems to consist of asking Donut to OCR the text, I was wondering whether it is possible to use the pre-trained model (https://huggingface.co/naver-clova-ix/donut-base) for OCR. If so, what prompt can we use to do that? And does anything else need to be done?
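
For context on what I've tried: my guess (an assumption, please correct me) is that the pre-training task token is "<s_synthdog>", so something like:

from PIL import Image
from donut import DonutModel

model = DonutModel.from_pretrained("naver-clova-ix/donut-base")
model.eval()
# "<s_synthdog>" is assumed to be the pre-training (pseudo text reading) task token.
output = model.inference(image=Image.open("page.png"), prompt="<s_synthdog>")
print(output["predictions"][0])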

Btw, this is amazing work, congratulations! :)

Incorrect F-1 implementation

Thanks for the great work. However, I noticed the current field-level F-1 implementation might be erroneous.

donut/donut/util.py

Lines 239 to 253 in d2fd95a

def cal_f1(self, preds: List[dict], answers: List[dict]):
    """
    Calculate global F1 accuracy score (field-level, micro-averaged) by counting all true positives, false negatives and false positives
    """
    total_tp, total_fn_or_fp = 0, 0
    for pred, answer in zip(preds, answers):
        pred, answer = self.flatten(self.normalize_dict(pred)), self.flatten(self.normalize_dict(answer))
        for pred_key, pred_values in pred.items():
            for pred_value in pred_values:
                if pred_key in answer and pred_value in answer[pred_key]:
                    answer[pred_key].remove(pred_value)
                    total_tp += 1
                else:
                    total_fn_or_fp += 1
    return total_tp / (total_tp + (total_fn_or_fp) / 2)

In line 252, predictions not matched with the ground truth are accumulated into total_fn_or_fp; these are in fact false positive samples. Meanwhile, the leftover entities in answer.values() after removing matched predictions (in L249) are never added to total_fn_or_fp, which means the implementation is ruling out false negatives in the F-1 calculation.

Can you confirm if it is an error or specific design choice?
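
For comparison, here is a sketch of an F1 computation that also counts leftover ground-truth values as false negatives (my own hypothetical fix, not the repository's implementation):

def cal_f1_with_fn(self, preds, answers):
    """Sketch: field-level micro F1 that also counts unmatched ground-truth values as false negatives."""
    total_tp, total_fp, total_fn = 0, 0, 0
    for pred, answer in zip(preds, answers):
        pred = self.flatten(self.normalize_dict(pred))
        answer = self.flatten(self.normalize_dict(answer))
        for pred_key, pred_values in pred.items():
            for pred_value in pred_values:
                if pred_key in answer and pred_value in answer[pred_key]:
                    answer[pred_key].remove(pred_value)  # consume the matched ground-truth value
                    total_tp += 1
                else:
                    total_fp += 1  # prediction with no matching ground truth
        # anything left in `answer` was never matched by a prediction
        total_fn += sum(len(values) for values in answer.values())
    return total_tp / (total_tp + (total_fp + total_fn) / 2)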

Training For Document Information Extraction

It's a great project, and I want to try out the approach without OCR.
I have 3 questions related to training:

  1. We need to create ground truth for training, test, and validation; is there any tool to perform the annotations so that the input matches the training requirements?

  2. For training, I think you need to use OCR to create the ground-truth data; how is the text extracted during inference then?

  3. I see we need to provide a dictionary hierarchy for the classes in the ground truth. Can I use my own classes and a custom hierarchy for the ground truth? For example:
    {
        "gt_parse": {
            "Item": [
                {
                    "Description": "SPGTHY BOLOGNASE",
                    "Quantity": "1",
                    "Price": "58,000"
                },
                {
                    "Description": "SPGTHY BOLOGNASE",
                    "Quantity": "1",
                    "Price": "58,000"
                }
            ],
            "Total": {"value": "20"},
            "Sub_Total": {"value": "50"},
            "Number": {"value": "80"}
        }
    }

Could you please guide me?
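
For context, my understanding from the paper is that such a gt_parse is flattened into a token sequence with one opening/closing token pair per key, so custom keys become custom tokens. A rough sketch of that conversion (my own simplification, not the repository's exact util):

def json2token(obj):
    """Simplified sketch of Donut's JSON-to-token-sequence conversion."""
    if isinstance(obj, dict):
        return "".join(f"<s_{k}>{json2token(v)}</s_{k}>" for k, v in obj.items())
    if isinstance(obj, list):
        return "<sep/>".join(json2token(item) for item in obj)  # items separated by <sep/>
    return str(obj)

# json2token({"Total": {"value": "20"}}) -> "<s_Total><s_value>20</s_value></s_Total>"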

Release yaml files

Hi,

Thank you for sharing your interesting work. I was wondering whether there is an expected date for releasing the YAML files for anything other than CORD? I want to reproduce the experimental results in my environment.

What is the minimum number of images required for training?

Hello @SamSamhuns, @gwkrsrch, @VictorAtPL,
I have around 60 images and 8 custom tokens; each image contains 3-4 of the same keys but with different values, and the annotation format is like SROIE. I have followed this link, converted my data to the structure below, and followed the converter script as mentioned in the blog.

{
    "gt_parse": [
        {
            "Name": "Tom",
            "Buyer": "Conda",
            "contact_number": "989898989898",
            "alt_number": "55555555",
            "Buyer_id": "9856321023"
        },
        {
            "Name": "Hanks",
            "Buyer": "Conda",
            "contact_number": "99999999999",
            "alt_number": "25823102",
            "Buyer_id": "9856321024"
        },
        {
            "Name": "Lita",
            "Buyer": "Conda",
            "contact_number": "4545858402",
            "alt_number": "12121212121",
            "Buyer_id": "9856321022"
        }
    ]
}

My metadata.jsonl

{"file_name": "1.png", "ground_truth": "{\"gt_parse\": [{\"Name\": \"Tom\", \"Buyer\": \"Conda\", \"contact_number\": \"989898989898\", \"alt_number\": \"55555555\", \"Buyer_id\": \"9856321023\"}, {\"Name\": \"Hanks\", \"Buyer\": \"Conda\", \"contact_number\": \"99999999999\", \"alt_number\": \"25823102\", \"Buyer_id\": \"9856321024\"}, {\"Name\": \"Lita\", \"Buyer\": \"Conda\", \"contact_number\": \"4545858402\", \"alt_number\": \"12121212121\", \"Buyer_id\": \"9856321022\"}]}"}

This is my config; my image sizes are variable, max (2205 × 1693), min (1755 × 779):

resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "/content/drive/MyDrive/results"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from model hub or path)
dataset_name_or_paths: ["/content/drive/MyDrive/my_VDU"] # loading datasets (from model hub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 300 # 800/8*30/10, 10%
num_training_samples_per_epoch: 800
max_epochs: 80
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 10
gradient_clip_val: 1.0
verbose: True

I have trained using this configuration for 300, 200, 120, 80, 40, and 20 epochs, but all the results were misspelled and the numbers were wrong.
I don't know if I am doing something wrong, whether I should make some tweaks, or whether I should increase my training data.
I even tried adding the 200-image SynthDoG data, but no luck; the results were still misspelled.

How to get confidence score for predictions?

Hi, thank you for this outstanding work. Could you point me to how one could generate confidence scores along with the JSON predictions from the models, especially the models for the Document Parsing Task?
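
One possible approach I've been considering (a sketch, not an official API of this repo) is to ask the underlying Hugging Face generate() call for per-step scores and aggregate token probabilities; prompt_ids and encoder_outputs below are assumed to be prepared the same way model.inference() prepares them:

import torch

gen = model.decoder.model.generate(
    input_ids=prompt_ids,
    encoder_outputs=encoder_outputs,
    max_length=model.config.max_length,
    return_dict_in_generate=True,
    output_scores=True,  # per-step logits of the generated tokens
)
# Probability assigned to each generated token, then a mean per sequence.
token_probs = [s.softmax(dim=-1).max(dim=-1).values for s in gen.scores]
confidence = torch.stack(token_probs, dim=-1).mean(dim=-1)
print(float(confidence[0]))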

Thanks

Sample of metadata of DocVQA

Great work. Could you please share a sample of the ground-truth part/metadata of the Document VQA data? For example, in the ground truth (metadata) of the CORD data, there are gt_parse, meta, and valid_line, and the valid_line has each word along with quad information. I am curious about the ground-truth part of the VQA data: what will the structure of the valid_line be? Will it be the full answer with quad information, or will it be the answer split into words, each with quad information?

Optimizer settings for DONUT pre-training on Synthdog

Hi, @gwkrsrch ,

Many thanks for your effort in unblocking the issues! I am trying to reproduce pre-training DONUT-proto on SynthDoG, but I cannot get reasonable results. Could you please reveal the optimizer settings (i.e., the settings of torch.optim.Adam and the scheduler) of the DONUT-proto pre-training? It would be a great help for reproducing the pre-training!

DistributedDataParallel error in large dataset size

Hi,

I am running Donut to pre-train on my custom data. However, when I scaled the data size up (~2M images), I got this error.
(I have verified that Donut runs successfully on small datasets such as DocVQA and CORD.)

    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 771, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1171, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 152, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 205, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 595, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 232, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/rendezvous.py", line 161, in _create_c10d_store
    hostname, port, world_size, start_daemon, timeout, multi_tenant=True
TimeoutError: The client socket has timed out after 1800s while trying to connect to (127.0.0.1, 41890).

Could you tell me how to solve this error?

Performance gap of baseline methods

Thanks for the inspiring work. When I checked Table 2 in the main paper, I noticed that the field-level F1 scores for baseline methods such as LayoutLM, LayoutLMv2, and BROS are much lower than those in their papers: they report 90+ F1 on CORD, whereas in your paper they score ~80. Could you please provide an explanation?

[screenshot: Table 2 from the Donut paper]

[screenshot: table from the LayoutLMv2 paper]

Tips for training the base model from scratch on a smaller amount of data

Hello @gwkrsrch ,

I am very excited about this model and the e2e approach it implements.

For my master's thesis, I'd like to run an experiment comparing your method of generating synthetic documents with mine. I am only interested in evaluating the model on the Document Information Extraction downstream task with the CORD dataset and my proprietary one (let's call it PolCORD).

I'd like to train the Donut model on the (Pseudo) Text Reading Task with:
1/ naver-clova-ix/synthdog-en; synthdog-id; synthdog-pl (total 1.5M examples)
2/ my-method-en, my-method-id, my-method-pl (total 1.2M examples)

Could you give me a hand and share your experience:

  1. How can I generate/prepare a corpus for the Indonesian and Polish languages in the same way you prepared the ones here: https://github.com/clovaai/donut/tree/master/synthdog/resources/corpus?
  2. If I am going to train the model on 1.2-1.5M examples instead of 13M, do you have any gut feeling about whether I need to downsize the model defined here, and to what values: https://huggingface.co/naver-clova-ix/donut-base/blob/main/config.json?
  3. How many examples were you able to fit on a single A100 GPU? I have the 40 GB version and I'm going to use 16 of them.

Input size parameter clarification

I'm trying to run my own fine-tuning for document parsing. When building the train configuration I wondered: is the input_size parameter related to the size of the images in the dataset, or is it only used by the Swin transformer to create the embedding windows?

If it's the latter: when should it be customized, and what constraints apply to the values provided?

Thank you!

Task | Understanding Paragraphs & Document Layout Analysis

Thanks for publishing this interesting work.

Would I be able to extend the Document Understanding task to learn hierarchies over paragraphs of text within a page? Or is the 512 token limit going to prohibit the OCR of paragraphs?

Would the input look like so?

{
    "file_name": "{image_path1}",
    "ground_truth": {
        "item": [{"title": "item-title", "text": "insert some paragraph of text"}],
        "table": ["column 1 column 2 column 3 0 0 3"],
        "title": ["title page"]
    }
}

Further, would it be possible to alter the task objective to Document Layout Analysis and train on PubLayNet, as per LayoutLMv3?

Finetuning Epochs on DocVQA and RVLCDIP

Hi, @SamSamhuns @gwkrsrch . Many thanks for your efforts!

I tried fine-tuning Donut on RVL-CDIP and DocVQA with 8 V100 GPUs, but the fine-tuning process is too long (up to weeks for RVL-CDIP). May I know whether 100 epochs for RVL-CDIP and 300 for DocVQA are really necessary, and how you fine-tuned the model (e.g., epochs and batch size)? The fine-tuning overhead is very large with such long schedules according to the provided configs.

How to perform text reading task

Hi, thanks for the great project!
I am excited to integrate the model into my document understanding project, and I want to implement the text reading task.
I have one question:

  • According to my understanding, I should download the pre-trained model from "naver-clova-ix/donut-base", but what would be the prompt token fed into the decoder?

Finetuning on DONUT-proto

Hi, @gwkrsrch ,

It works well in the case of DONUT-base, but DONUT-proto does not. Could you please provide the fine-tuning YAML configuration file for DONUT-proto? Many thanks for your effort!

How to train and annotate on custom dataset

Hello @gwkrsrch, first I want to thank you all for open-sourcing this amazing project. Maybe my questions are very common and silly, but the answers would help me and others get more clarity. I am trying to train a custom Document Information Extraction model, but I don't know which annotation tool to use. In a comment by @VictorAtPL, I saw that they use the Label Studio OCR template to annotate the images; this is an exported example from Label Studio:

[
  {
    "ocr": "/data/upload/1/fe00.png",
    "id": 2,
    "bbox": [
      {
        "x": 20.62937062937063,
        "y": 23.60248447204969,
        "width": 18.88111888111888,
        "height": 8.695652173913043,
        "rotation": 0,
        "original_width": 1920,
        "original_height": 1080
      }
    ],
    "transcription": "Definitions",
    "annotator": 1,
    "annotation_id": 2,
    "created_at": "2022-09-06T23:23:49.284150Z",
    "updated_at": "2022-09-06T23:23:49.284176Z",
    "lead_time": 265.562
  }
]

My questions are:

  1. Which is the best tool for annotating for a Donut custom Document Information Extraction task?
  2. Should we annotate the text box and write the text, as in the example? If yes, what is the most efficient way to do it?
  3. Is there any converter script that converts the Label Studio format to the Donut format? (See the sketch after this list.)
  4. Is there any document that covers start-to-end training of custom data with annotation?
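
Regarding question 3, here is a hedged converter sketch. It assumes each exported region also carries a field label under a "label" key, alongside the "ocr" and "transcription" keys visible in the export above; the function name is hypothetical:

import json
from collections import defaultdict

def labelstudio_to_donut(export_path, out_path="metadata.jsonl"):
    """Sketch: group Label Studio region transcriptions per image into a gt_parse dict."""
    with open(export_path) as f:
        regions = json.load(f)
    per_image = defaultdict(dict)
    for r in regions:
        file_name = r["ocr"].split("/")[-1]
        per_image[file_name].setdefault(r["label"], []).append(r["transcription"])
    with open(out_path, "w") as out:
        for file_name, gt_parse in per_image.items():
            line = {"file_name": file_name,
                    "ground_truth": json.dumps({"gt_parse": gt_parse})}
            out.write(json.dumps(line) + "\n")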

Is this available for other languages?

Hi, thank you for sharing this nice work.
Is this available for other languages (like Korean, Japanese, ...)?
If so, could you please give some tips for preparing the data?

How much GPU memory does a single 1280×960 photo need?

I tried to run donut-base on a 2080 Ti with the batch size set to 1, but it didn't work, and it looks like the cause is insufficient GPU memory.
So I want to ask: has anyone tried to run it on a 2080 Ti, and how much GPU memory does a single 1280×960 photo need?

Are "valid_line" and "meta" keys required for training?

I noticed that in the cord-v2 dataset there are "valid_line", "meta", and other keys in the jsonl dictionary.
Are these used/required during training for document parsing, or are they ignored by the system since they are not strictly gt_parse?

Answer bounding box

Hi,

I appreciate very much this simple and effective approach to information extraction. My question is - can the model produce the bounding box for the extracted text?

As a workaround I am thinking of fuzzy matching the text against an OCR output with bounding boxes, but if the data is replicated in multiple locations on the page, then it becomes difficult to know where the answer was copied from.

Thanks

Add bounding boxes coordinates in predictions

It could be useful to get bounding boxes coordinates from Document Information Extraction task predictions.

On a conventional pipeline:

[screenshot of a conventional pipeline's bounding-box output]

On Donut, it could be something like:

{
    'predictions': [{
        'menu': [{
                'cnt': '2',
                'nm': 'ICE BLAOKCOFFE',
                'price': '82,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
            {
                'cnt': '1',
                'nm': 'AVOCADO COFFEE',
                'price': '61,000',
                'bbox': [xmin, ymin, xmax, ymax]
            },
        ],
        'total': {
            'cashprice': '200,000',
            'changeprice': '25,400',
            'total_price': '174,600',
            'bbox': [xmin, ymin, xmax, ymax]
        }
    }]
}

A possible solution (I did not succeed with it):
#16 (comment)

Problem Finetuning with Provided Pretrained Model

Hi, I am encountering the following errors recently when I try to finetune using the provided pretrained models.

  1. When I cloned the original repo and tried fine-tuning on CORD per the instructions, like below:

    python train.py --config config/train_cord.yaml \
        --pretrained_model_name_or_path "naver-clova-ix/donut-base" \
        --dataset_name_or_paths '["naver-clova-ix/cord-v2"]' \
        --exp_version "test_experiment"

    the following error pops up:

    Traceback (most recent call last):
      File "train.py", line 149, in <module>
        train(config)
      File "train.py", line 130, in train
        callbacks=[lr_callback, checkpoint_callback],
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/utilities/argparse.py", line 345, in insert_env_defaults
        return fn(self, **kwargs)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/trainer.py", line 459, in __init__
        training_epoch_loop = TrainingEpochLoop(min_steps=min_steps, max_steps=max_steps)
      File "/usr/local/lib/python3.7/dist-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 51, in __init__
        if max_steps < -1:
    TypeError: '<' not supported between instances of 'NoneType' and 'int'

  2. When I tried with my locally modified repo, the following error pops up:

    Traceback (most recent call last):
      File "train.py", line 146, in <module>
        train(config)
      File "train.py", line 57, in train
        model_module = DonutModelPLModule(config)
      File "/data/project/users/xingjianzhao/visual-information-extraction/code/Donut/donut/donut/lightning_module.py", line 94, in __init__
        self.model = DonutModel.from_pretrained(
      File "/data/project/users/xingjianzhao/visual-information-extraction/code/Donut/donut/donut/donut/model.py", line 642, in from_pretrained
        model = super(DonutModel, cls).from_pretrained(pretrained_model_name_or_path, *model_args, **kwargs)
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2155, in from_pretrained
        model, missing_keys, unexpected_keys, mismatched_keys, error_msgs = cls._load_pretrained_model(
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 2282, in _load_pretrained_model
        model._init_weights(module)
      File "/usr/local/lib/python3.8/dist-packages/transformers/modeling_utils.py", line 1050, in _init_weights
        raise NotImplementedError(f"Make sure `_init_weights` is implemented for {self.__class__}")
    NotImplementedError: Make sure `_init_weights` is implemented for <class 'donut.model.DonutModel'>

    While I did make some modifications, I tried previous versions of my repo that worked perfectly fine, and this error still pops up. However, when I use my previously fine-tuned models (trained with the exact same code), it works fine. I'm wondering if you have an idea of what the problem could be. Thanks!

Performance with CPU

I noticed you put the model on a Gradio demo, and it seems to run nicely. However, when I attempt to "dockerize" the model and run it in the cloud with the following configuration, 4 vCPUs and 16 GB RAM, it remains frozen or is extremely sluggish (5 minutes per picture).

Could you please share the infrastructure configuration behind the Gradio demo? Is there anything I did wrong?

Where do classes get added as special tokens?

Hi,

I've implemented Donut as a fork of HuggingFace Transformers, and soon I'll add it to the library. The model is implemented as an instance of VisionEncoderDecoderModel, which allows combining any vision Transformer encoder (like ViT, Swin) with any text Transformer as decoder (like BERT, GPT-2, etc.). As Donut did exactly that, it was straightforward to implement it this way.

Here's a notebook that shows inference with it.

I do have 2 questions though:

  • I prepared a toy dataset of RVL-CDIP, in order to illustrate how to fine-tune the model on document image classification. However, I wonder where the different classes get added to the special tokens of the tokenizer + decoder. The toy dataset can be loaded as follows:
from datasets import load_dataset

dataset = load_dataset("nielsr/rvl_cdip_10_examples_per_class_donut")

When using this dataset to create an instance of DonutDataset, it seems only "<s_class>", "</s_class>" and "<s_rvlcdip>" are added as special tokens. But looking at this file, it seems that one also defines special tokens for each class. Looking at the code, it seems only keys are added, not the values of the dictionaries (see the sketch after this list).

  • I've uploaded all weights to the hub; currently they are all hosted under my own name (nielsr). I wonder whether we can transfer them to the naver-clova-ix organization. Of course, the names are already taken by the PyPI package of this repository, so we can either use branches within the GitHub repos to specify a particular revision, or give priority to either HuggingFace Transformers or this PyPI package for the names.
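
As a possible workaround for the first question, the class tokens could be registered manually. A sketch, assuming the decoder exposes the same add_special_tokens helper that DonutDataset itself uses, and that class tokens follow the "<classname/>" pattern (the class list below is hypothetical):

# Hypothetical subset of RVL-CDIP class tokens; extend to all 16 classes.
class_tokens = ["<advertisement/>", "<budget/>", "<email/>", "<form/>"]
model.decoder.add_special_tokens(class_tokens)  # also resizes the token embeddings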

Let me know what you think!

Kind regards,

Niels
ML Engineer @ HuggingFace

Question on fine-tuning document form parsing labeling requirement

My goal is to read a specific field (say, box 30) from a nationally standardized insurance claim form. The form has 40 boxes/fields in fixed locations, and each box is labeled clearly with its box number and title.

To save annotation time, I would like our labeling team to annotate the text from box 30 only (ignoring all other boxes in the form). If I fine-tune on such annotations, is Donut expected to give good results or not?

If we have to annotate the entire form box-by-box, the time it takes will be over 10x longer.

Local custom dataset & Potential typo in test.py

Hi, thanks for this interesting work!
I tried to use this model on a local custom dataset and followed the dataset structure as specified, but it failed to load correctly. I ended up having to hard-code some data loading logic to make it work. It would be greatly appreciated if you could provide a demo or example of a local dataset. Thanks!

PS: I think there may be a typo in test.py: '--pretrained_path' should probably be '--pretrained_model_name_or_path'?

Erroneous Text output for IE task

Hi,
I tried fine-tuning the model with a custom receipt dataset for the IE task and noticed issues with the output text extracted for a given set of keys. It either misses or adds an extra 1-2 characters relative to the actual text present in the document, and this pattern is very frequent. I am using the default input_size: [1280, 960]. The images are really clear; any other off-the-shelf OCR model is able to extract the text with no errors. I fine-tuned the model with 400 images and 15 keys and tested it on 100 samples. Has anyone encountered such an issue?

Which OCR is used internally for inference?

I wanted to know which OCR it uses internally for training or inference. It claims to be OCR-free VDU, but then how does it understand the coordinates and the text while running inference on an image? Knowing what text is written where during inference seems necessary (according to my assumption).

Training Never Starts in Single GPU machine (Solution)

Hi, I don't use pytorch-lightning much and haven't done distributed training (noob here), but I found that training never starts on a single-GPU, single-node configuration.

The solution I found was to set the num_nodes parameter in the train configuration to 1. If the number is greater than 1, PyTorch Lightning waits for the other nodes, I presume.

It took a lot of time for me to get it right, so I'm putting it out there for fellow noobs. :)

Thanks for sharing such incredible work with the community!!!

Error on validation

I tried training using the guide provided in this repo, but it failed due to the following errors:

Validation:   0%|                                       | 0/100 [00:00<?, ?it/s]
Validation DataLoader 0:   0%|                          | 0/100 [00:00<?, ?it/s]Traceback (most recent call last):
  File "train.py", line 150, in <module>
    train(config)
  File "train.py", line 134, in train
    trainer.fit(model_module, data_module)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 697, in fit
    self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1283, in _run_train
    self.fit_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/fit_loop.py", line 271, in advance
    self._outputs = self.epoch_loop.run(self._data_fetcher)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 201, in run
    self.on_advance_end()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 241, in on_advance_end
    self._run_validation()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/training_epoch_loop.py", line 299, in _run_validation
    self.val_loop.run()
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/strategies/ddp.py", line 355, in validation_step
    return self.model(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 1008, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 969, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])
  File "/opt/conda/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/pytorch_lightning/overrides/base.py", line 90, in forward
    return self.module.validation_step(*inputs, **kwargs)
  File "/home/jupyter/src/donut/donut/lightning_module.py", line 72, in validation_step
    return_attentions=False,
  File "/home/jupyter/src/donut/donut/donut/model.py", line 477, in inference
    output_attentions=return_attentions,
  File "/opt/conda/lib/python3.7/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/transformers/generation_utils.py", line 1147, in generate
    self._validate_model_kwargs(model_kwargs.copy())
  File "/opt/conda/lib/python3.7/site-packages/transformers/generation_utils.py", line 863, in _validate_model_kwargs
    f"The following `model_kwargs` are not used by the model: {unused_model_args} (note: typos in the"
ValueError: The following `model_kwargs` are not used by the model: ['encoder_outputs'] (note: typos in the generate arguments will also show up in this list)

Config

resume_from_checkpoint_path: null # only used for resume_from_checkpoint option in PL
result_path: "./result4"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from model hub or path)
dataset_name_or_paths: ["naver-clova-ix/cord-v2"] # loading datasets (from model hub or path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at https://huggingface.co/datasets/naver-clova-ix/cord-v2
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
# warmup_steps: 300 # 800/8*30/10, 10%
warmup_steps: 10 # 800/8*30/10, 10%
num_training_samples_per_epoch: -1
max_epochs: 3
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 1
gradient_clip_val: 1.0
verbose: True
data_dir: ''

About Paper Photos dataset

Thanks for the great work! Can you share the paper-photo datasets that you use for SynthDoG augmentation, or am I missing something?

For (Psuedo) Text Reading Task

Hi, for the text reading task, the instructions say:

You can use our SynthDoG 🐶 to generate synthetic images for the text reading task with proper gt_parse. See ./synthdog/README.md for details.

But there are no details there about it.

test.py seems broken

In test.py, the f-string formatting with double quotation marks around ground_truth["gt_parses"][0]['question'].lower()
causes some parsing issues:

[screenshot of the error]

Extracting the question prompt first, i.e.

        if args.task_name == "docvqa":
            question = ground_truth["gt_parses"][0]['question'].lower()
            output = pretrained_model.inference(
                image=sample["image"],
                prompt=f"<s_{args.task_name}><s_question>{question}</s_question><s_answer>",
            )["predictions"][0]

solves the issue.

model checkpoint did not match

I got the error
"Some weights of DonutModel were not initialized from the model checkpoint at naver-clova-ix/donut-base and are newly initialized because the shapes did not match"
when using the training code in README.md.

Maybe some layers' shapes do not match.

[screenshots of the shape-mismatch warnings]

Is it not a serious problem?
