scirt-archive

Overview

This repository contains an abandoned UIMA-based corpus pipeline for annotation of scientific publications in the domain of spinal cord injury and regeneration. There are annotations for genes, species, and gene and protein interactions. It is a legacy solution based on plain vanilla UIMA only (as opposed to uimaFIT) and was used to create the corpus SCIRT 0.0.1. SCIRT stands for Spinal Cord Injury and Regeneration Texts. The repository is meant to be a frozen archive. This line of development is no longer pursued.

Details

  1. Originally published and open sourced here.
  2. The original PDFs as well as the processed data are here.
  3. The pipeline description file is XmlToTextCPE.xml, which shows the collection reader, annotator, and writer stages that are employed. Details can be found in the corresponding XML properties files in the descriptors directory. Summary:
  • XML Reader Detagger (UIMA): A multi-sofa annotator that does XML detagging. Reads XML data from the input Sofa (named "originalDoc"). This data can be stored in the CAS as a string or array, or it can be a URI to a remote file. The XML is parsed using the JVM's default parser, and the plain-text content is written to a new sofa called "convertedDoc".
  • PDF To Text Converter (PDFBox): Uses org.apache.pdfbox.util.PDFTextStripper to extract the text contents from a PDF file. Sets the sofa data of the plainTextDocument view with the extracted text.
  • Sentence Detector (OpenNLP): Detects whether or not a punctuation character marks the end of a sentence.
  • Tokenizer (OpenNLP): Segments a string into its tokens.
  • NER Annotator (BANNER): Gene name recognition.
  • NER Annotator (LINNAEUS): Species name recognition and normalization.
  • Interaction Keyword Annotator (CCP): See the original publication and browse the keywords in this file.
  • Interaction Annotator (CCP): Naive interaction implementation: co-occurrence of one interaction keyword and two gene/protein names in the same sentence.
  • File System Writer (CCP): A simple CAS consumer that takes the text document view and writes its sofa text to a text file. The output directory is placed in the parent directory of the source file, and a text file extension is appended to the source file name.
  • XMI Writer (UIMA): Writes the CAS to XMI format.
  • Brat Annotation Writer (CCP): Outputs annotations in the standoff format expected by the brat annotation and visualization tool.
  4. Sources can be found under src.
  5. There are various configuration files and resources here and there, where third-party annotator packages put and expect them.
  6. The pipeline was implemented in base UIMA, not in uimaFIT. The implementation roughly follows Chapter 8 of Kumar and Tipney's "Biomedical Literature Mining" ("Mining Biological Networks from Full-Text Articles", Jan Czarnecki and Adrian J. Shepherd).
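
The naive co-occurrence rule used by the Interaction Annotator can be illustrated with a minimal, dependency-free sketch. This is not the code from the repository: the `Span` class, the type labels "Sentence", "Gene", and "Keyword", and the method name `findInteractions` are assumptions made for illustration, standing in for the UIMA annotations the real pipeline works on. For every sentence, it pairs each interaction keyword with every unordered pair of gene/protein mentions found in that sentence.

```java
import java.util.ArrayList;
import java.util.List;

class CooccurrenceSketch {

    /** Hypothetical typed character span; a stand-in for a UIMA annotation. */
    static final class Span {
        final String type; // "Sentence", "Gene", or "Keyword" (labels assumed here)
        final int begin;   // inclusive character offset
        final int end;     // exclusive character offset

        Span(String type, int begin, int end) {
            this.type = type;
            this.begin = begin;
            this.end = end;
        }

        /** True if this span lies entirely within the given outer span. */
        boolean isInside(Span outer) {
            return begin >= outer.begin && end <= outer.end;
        }
    }

    /** Returns {geneA, keyword, geneB} triples whose members share one sentence. */
    static List<Span[]> findInteractions(List<Span> spans) {
        List<Span[]> hits = new ArrayList<>();
        for (Span sentence : spans) {
            if (!sentence.type.equals("Sentence")) {
                continue;
            }
            // Collect the gene mentions and keywords covered by this sentence.
            List<Span> genes = new ArrayList<>();
            List<Span> keywords = new ArrayList<>();
            for (Span s : spans) {
                if (!s.isInside(sentence)) {
                    continue;
                }
                if (s.type.equals("Gene")) {
                    genes.add(s);
                } else if (s.type.equals("Keyword")) {
                    keywords.add(s);
                }
            }
            // One keyword plus any two distinct gene mentions counts as an interaction.
            for (Span keyword : keywords) {
                for (int i = 0; i < genes.size(); i++) {
                    for (int j = i + 1; j < genes.size(); j++) {
                        hits.add(new Span[] {genes.get(i), keyword, genes.get(j)});
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "BDNF activates TrkB in rat neurons. GAP43 is expressed.";
        List<Span> spans = new ArrayList<>();
        spans.add(new Span("Sentence", 0, 35));
        spans.add(new Span("Gene", 0, 4));      // BDNF
        spans.add(new Span("Keyword", 5, 14));  // activates
        spans.add(new Span("Gene", 15, 19));    // TrkB
        spans.add(new Span("Sentence", 36, 55));
        spans.add(new Span("Gene", 36, 41));    // GAP43
        spans.add(new Span("Keyword", 45, 54)); // expressed
        for (Span[] hit : findInteractions(spans)) {
            System.out.println(text.substring(hit[1].begin, hit[1].end) + "("
                    + text.substring(hit[0].begin, hit[0].end) + ", "
                    + text.substring(hit[2].begin, hit[2].end) + ")");
        }
        // prints: activates(BDNF, TrkB)
    }
}
```

The second sentence yields nothing because it contains a keyword but only one gene mention. The rule is deliberately simple: it ignores word order, negation, and syntax, so it can be expected to over-generate on sentences that mention many genes.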

Platform

This project was developed on Windows 8 64-bit with Eclipse Kepler 64-bit and Java 7. The Eclipse workspace settings contain Windows paths. The project has not been run on Mac as of this version.

Corpus

The corpus can be found under data. The files can be visualized in brat.
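
For readers unfamiliar with brat, its standoff format keeps annotations in a separate .ann file next to the plain-text document. A text-bound annotation is one line: an ID such as T1, a tab, the type with begin and end character offsets separated by spaces, a tab, and the covered text. The following is a minimal sketch of producing such lines. The class and method names and the type label "Gene" are assumptions for illustration, not taken from the repository's Brat Annotation Writer, which may emit other annotation kinds as well.

```java
class BratStandoffSketch {

    /** Formats one brat text-bound annotation line (hypothetical helper). */
    static String textBoundLine(int id, String type, int begin, int end, String docText) {
        // Layout: T<id> TAB <type> <begin> <end> TAB <covered text>
        return "T" + id + "\t" + type + " " + begin + " " + end + "\t"
                + docText.substring(begin, end);
    }

    public static void main(String[] args) {
        String text = "BDNF activates TrkB in rat neurons.";
        System.out.println(textBoundLine(1, "Gene", 0, 4, text));
        System.out.println(textBoundLine(2, "Gene", 15, 19, text));
    }
}
```

Offsets are zero-based character positions into the document text, with the end offset exclusive, so they line up with `String.substring` in Java.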
