tboenig / gt_corpus_benchmark

This repo provides a collection of ground truth data. The collection was compiled according to two criteria: the complexity of the layouts and the fonts used. Each item is also described by metadata, which is based on the labeling scheme of OCR-D/PrimaLab.

Home Page: https://tboenig.github.io/gt_corpus_benchmark/

corp ground-truth ocr-d pagexml

gt_corpus_benchmark's Introduction

📚 Corpus

This corpus includes Ground Truth (GT) data compiled according to the following features:

  1. Classification into font groups: Gothic/Blackletter, Antiqua and FontMix (Antiqua and Blackletter)
    distinction of the selected print type or combinations
  2. Classification into simple and complex
    complexity of the layout (columns, footnotes, ...)

The data are also divided according to the time of creation or production.

🖉 Creation

The data were created according to the OCR-D Ground Truth Guideline (https://ocr-d.de/en/gt-guidelines/trans/).

💻 Repositories

Analyzed collection

The GT data has been labeled. The labeling is based on an ontology defined by the Pattern Recognition and Image Analysis Research Lab (PRImA-Research-Lab) at the University of Salford. The labeling metadata is created for each available page. The following labeling metadata is available for the different collections.

see: gt-labelling : semantic-labelling OCR ground truth data (https://github.com/OCR-D/gt-labelling)
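
The labels listed in the sections below are slash-separated paths, so they can be grouped by their first segment (for example `condition` or `granularity`). A minimal sketch in Python, assuming the labels are available as plain strings; the function name `group_by_facet` and the `sample` list are illustrative and not part of gt-labelling:

```python
from collections import defaultdict


def group_by_facet(labels):
    """Group slash-separated label paths by their first path segment."""
    groups = defaultdict(list)
    for label in labels:
        # Split once at the first "/": the head is the facet, the tail is the rest of the path.
        facet, _, rest = label.partition("/")
        groups[facet].append(rest)
    return dict(groups)


# Sample labels copied from the lists in this README (hypothetical selection).
sample = [
    "condition/acquisition/geometric/page-curl",
    "condition/wear/medium-damage/stains",
    "granularity/physical/document-related/page",
    "content-encoding/structured",
]

for facet, rest in group_by_facet(sample).items():
    print(facet, rest)
```

Grouping on the first segment only keeps the sketch short; a nested dictionary keyed by every path segment would preserve the full hierarchy if that is needed.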

FontMix (Antiqua and Blackletter)

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnote-continued

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.
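
The simple and complex lists above share most labels and differ in only a few. One way to make that difference explicit is a set comparison. The sketch below is illustrative only: it uses just the `condition/...` labels copied from the two FontMix lists, and the names `simple`, `complex_` and `compare` are hypothetical.

```python
# Condition labels copied from the FontMix "simple" list above.
simple = {
    "condition/acquisition/method-flaws/imaging/uneven-illumination",
    "condition/production-related/document-characteristics/low-contrast",
    "condition/production-related/document-faults/ink-from-facing",
    "condition/wear/additions/informative/annotations",
}

# Condition labels copied from the FontMix "complex" list above.
complex_ = {
    "condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding",
    "condition/acquisition/geometric/page-curl",
    "condition/acquisition/geometric/perspective-distortions",
    "condition/acquisition/method-flaws/imaging/uneven-illumination",
    "condition/production-related/document-characteristics/low-contrast",
    "condition/production-related/document-faults/ink-from-facing",
}


def compare(a, b):
    """Return (only in a, only in b, in both) as sorted lists."""
    return sorted(a - b), sorted(b - a), sorted(a & b)


only_simple, only_complex, shared = compare(simple, complex_)
print("only in simple: ", only_simple)
print("only in complex:", only_complex)
print("shared:         ", shared)
```

The same comparison applies to any pair of label lists in this README, for example Antiqua simple against Antiqua complex.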

Gothic/Blackletter

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/acquisition/content-or-background/included-objects/preceeding-or-proceeding

    Part of preceding or succeeding object included (e.g. other page)

  • condition/acquisition/geometric/page-curl

    Visible page curl (e.g. book scanning)

  • condition/acquisition/geometric/perspective-distortions

    Perspective distortions (e.g. due to camera-based acquisition)

  • condition/acquisition/method-flaws/imaging/uneven-illumination

    Uneven illumination leading to brightness or contrast variations

  • condition/ageing/warping

    Arbitrary warping (e.g. due to moisture)

  • condition/production-related/document-characteristics/low-contrast

    The contrast between the paper and the page content is very low

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/additions/informative/stamps

    The medium was stamped

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/composite/music

    Description coming soon.

  • contentOfInterest/visual/graphical

    Description coming soon.

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page

  • data-attributes/document-related/visual/decorations

    Decorations of some kind

  • data-attributes/document-related/visual/illustrations

    Illustrations in content

  • data-attributes/document-related/visual/illustrations/multi-colour

    Multi-colour illustrations in content

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

Antiqua

simple
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/graphical/separator

    Description coming soon.

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.

complex
  • activityDomain/computing/visual/analysisRecognition/layoutAnalysis

    In computer vision, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order.

    Examples: Page layout analysis (segmentation into regions, classification into text, graphic, table etc.)

    Related: "OCR": Often used as a synonym for layout analysis and text recognition, but strictly only the text recognition component.

  • activityDomain/computing/visual/analysisRecognition/ocr

  • activityDomain/computing/visual/analysisRecognition/text

    Translation of any kind of depicted symbols to machine readable format

    Examples: OCR, mathematical equation recognition

    Related: Text processing (separate category), table recognition, map reading

  • condition/production-related/document-faults/ink-from-facing

    Ink from facing page was transferred to this page

  • condition/wear/additions/informative/annotations

    Annotations regarding the content

  • condition/wear/medium-damage/stains

    Noticeable stains on medium

  • content-encoding/structured

    E.g. XML

  • content-type/corpus

    Corpus: a collection of written texts, especially the entire works of a particular author or a body of writing on a particular subject.

    Examples: A text corpus, An image database

  • contentOfInterest/visual/text

    Description coming soon.

  • data-attributes/document-related/structural/footnote-continued

  • data-attributes/document-related/structural/footnotes

    Footnotes at bottom of page

  • data-attributes/document-related/structural/running-titles

    Titles repeated on each page

  • data-attributes/document-related/visual/text/drop-caps

    Drop capitals (large capitals at beginning of paragraph)

  • data-attributes/document-related/visual/text/font/multi-font/font-sizes

    More than one font size used

  • data-attributes/document-related/visual/text/font/multi-font/typefaces

    More than one typeface used

  • data-attributes/document-related/visual/text/font/typeface/antiqua

    Antiqua font (more modern)

  • data-attributes/document-related/visual/text/font/typeface/blackletter

    Blackletter, gothic, Fraktur

  • data-attributes/language/mixed

    More than one language used

  • granularity/logical/document-related/paragraph

    Description coming soon.

  • granularity/physical/document-related/page

    Description coming soon.

  • granularity/physical/document-related/region

    Region, zone, block

  • granularity/physical/document-related/text-line

    Description coming soon.

  • granularity/physical/document-related/word

    Word or partial word, if separated by line break, for example

  • platform/platform-independent

    Description coming soon.
