biostor-jats

Experiments in marking up BioStor articles using the Journal Article Tag Suite (JATS; formerly the NLM DTD).

The idea is to create JATS XML for BioStor articles. Initially this contains only article-level metadata and links to page scans; content is then added based on analysis of the page scans and the associated DjVu and ABBYY XML.

BioStor provides the starting point: an archive containing the article metadata in JATS XML format, page images (B&W and original), and DjVu and ABBYY OCR XML for the article pages.
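As a rough illustration of what "article-level metadata in JATS XML" looks like in practice, the sketch below reads a few standard JATS elements (`article-title`, `volume`, `fpage`, `lpage`) with Python's standard library. The sample document and the `article_summary` helper are hypothetical; the exact elements present in a BioStor archive are not shown in this README and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical, minimal JATS record for illustration only.
SAMPLE_JATS = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An example article</article-title></title-group>
      <volume>12</volume>
      <fpage>101</fpage>
      <lpage>110</lpage>
    </article-meta>
  </front>
</article>"""


def article_summary(jats_xml):
    """Pull a few article-level fields out of a JATS document."""
    root = ET.fromstring(jats_xml)
    meta = root.find("front/article-meta")

    def text(path):
        el = meta.find(path)
        # itertext() keeps text inside inline markup such as <italic>
        return "".join(el.itertext()) if el is not None else None

    return {
        "title": text("title-group/article-title"),
        "volume": text("volume"),
        "pages": (text("fpage"), text("lpage")),
    }


summary = article_summary(SAMPLE_JATS)
print(summary)
```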

The scripts here extract text to create hOCR files, which can then be used to generate a PDF with searchable text. They also extract figures. Here is a live example.

Scripts

PHP scripts in the tools directory are used to extract and add content to the base provided by BioStor.

djvu2html converts DjVu XML to HTML that includes hOCR tags (see The hOCR Embedded OCR Workflow and Output Format).

php tools/djvu2html.php examples/65706
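The conversion can be sketched as follows. This is a simplified Python illustration, not the repository's PHP code. It assumes DjVu XML `WORD` elements carry a `coords` attribute ordered left, bottom, right, top (with any fifth value ignored), and it emits hOCR `ocrx_word` spans whose `bbox` is ordered left, top, right, bottom. The function names and sample input are hypothetical, and line-level bounding boxes are omitted for brevity.

```python
import xml.etree.ElementTree as ET
from html import escape


def word_to_hocr(word):
    """Turn one DjVu <WORD> element into an hOCR word span."""
    # Assumption: coords are "left,bottom,right,top[,baseline]"
    left, bottom, right, top = [int(v) for v in word.get("coords").split(",")[:4]]
    text = escape(word.text or "")
    return (f'<span class="ocrx_word" '
            f'title="bbox {left} {top} {right} {bottom}">{text}</span>')


def djvu_page_to_hocr(xml_text):
    """Convert the LINE/WORD structure of one DjVu XML page to hOCR markup."""
    root = ET.fromstring(xml_text)
    lines = []
    for line in root.iter("LINE"):
        words = " ".join(word_to_hocr(w) for w in line.iter("WORD"))
        lines.append(f'<span class="ocr_line">{words}</span>')
    return '<div class="ocr_page">\n' + "\n".join(lines) + "\n</div>"


# Hypothetical two-word page.
SAMPLE = """<OBJECT><HIDDENTEXT><PAGECOLUMN><REGION><PARAGRAPH>
<LINE><WORD coords="100,250,180,200">Hello</WORD><WORD coords="190,250,300,200">world</WORD></LINE>
</PARAGRAPH></REGION></PAGECOLUMN></HIDDENTEXT></OBJECT>"""

hocr = djvu_page_to_hocr(SAMPLE)
print(hocr)
```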

abbyy_pictures extracts picture and table blocks from the ABBYY OCR, crops the corresponding part of the page image, and puts the results in the "figures" folder. It analyses the colours in the images to decide whether to use the B&W or original image for each figure, then adds links to the JATS XML.

php tools/abbyy_pictures.php examples/65706
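The two steps described above, finding figure blocks and choosing between the B&W and original scan, might look roughly like this in Python. This is a sketch under assumptions, not the script's actual logic: it assumes ABBYY `block` elements with `blockType` and `l`, `t`, `r`, `b` attributes, and the colour test (fraction of pixels whose RGB channels differ noticeably) with its thresholds `channel_gap` and `min_fraction` is a hypothetical heuristic. The README does not say how the real script analyses colours.

```python
import xml.etree.ElementTree as ET


def figure_blocks(abbyy_xml):
    """Return (block type, (l, t, r, b)) for every Picture or Table block."""
    root = ET.fromstring(abbyy_xml)
    found = []
    for el in root.iter():
        # Strip any XML namespace so the tag comparison works either way
        tag = el.tag.rsplit("}", 1)[-1]
        if tag == "block" and el.get("blockType") in ("Picture", "Table"):
            box = tuple(int(el.get(k)) for k in ("l", "t", "r", "b"))
            found.append((el.get("blockType"), box))
    return found


def is_colour(pixels, channel_gap=30, min_fraction=0.05):
    """Hypothetical test: are enough pixels clearly not grey?"""
    pixels = list(pixels)
    if not pixels:
        return False
    coloured = sum(1 for r, g, b in pixels if max(r, g, b) - min(r, g, b) > channel_gap)
    return coloured / len(pixels) >= min_fraction


def choose_source(pixels):
    """Pick which scan to crop the figure from."""
    return "original" if is_colour(pixels) else "bw"


# Hypothetical page with one text block and one picture block.
SAMPLE_ABBYY = """<document>
  <page width="2000" height="3000">
    <block blockType="Text" l="100" t="100" r="1900" b="400"/>
    <block blockType="Picture" l="200" t="500" r="1800" b="1500"/>
  </page>
</document>"""

blocks = figure_blocks(SAMPLE_ABBYY)
grey_crop = [(120, 120, 120)] * 100
tinted_crop = [(200, 40, 40)] * 10 + [(120, 120, 120)] * 90
print(blocks, choose_source(grey_crop), choose_source(tinted_crop))
```

In a real pipeline the pixel lists would come from the cropped region of the page image, and the block coordinates may need rescaling if the B&W and original scans have different dimensions.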

hocr2pdf extracts text from the HTML files, combines it with the page scans, and uses FPDF to generate a PDF with searchable text.

php tools/hocr2pdf.php examples/65706
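To make the text-positioning part concrete, here is a minimal Python sketch (not the repository's PHP) that reads word boxes from hOCR markup and converts them from scan pixels to PDF points (1/72 inch). The `HocrWords` class, the `to_points` helper, and the 300 dpi scan resolution are assumptions made for illustration. Actually drawing the page image and the invisible text is left to a PDF library.

```python
import re
from html.parser import HTMLParser


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from hOCR ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            self._bbox = tuple(int(g) for g in m.groups()) if m else None

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))

    def handle_endtag(self, tag):
        self._bbox = None


def to_points(bbox, dpi):
    """Scale a pixel bbox to PDF points, keeping the top-left origin."""
    return tuple(round(v * 72.0 / dpi, 2) for v in bbox)


SAMPLE_HOCR = ('<span class="ocr_line">'
               '<span class="ocrx_word" title="bbox 300 600 600 660">Hello</span> '
               '<span class="ocrx_word" title="bbox 630 600 900 660">world</span>'
               '</span>')

parser = HocrWords()
parser.feed(SAMPLE_HOCR)
placed = [(text, to_points(bbox, dpi=300)) for text, bbox in parser.words]
print(placed)
```

The coordinates keep hOCR's top-left origin. A library that writes raw PDF coordinates, which are measured from the bottom-left of the page, would also need the y values flipped against the page height.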

Stylesheet

XSLT style sheets are used to display the article.

Contributors

rdmpage
