gigaword_conversion's Introduction

These are postprocessing scripts for Annotated English Gigaword v5 (called "AGW" here). The corpus is LDC2012T21, and the paper describing it is Napoles, Gormley, van Durme, Proc. of AKBC-WEKEX 2012. These scripts are by Brendan O'Connor. Please contact me with any concerns; I'm still trying to figure out the correct way to use this data.

AGW seems to have three possible fields for each article where text data can live:

  1. Headline
  2. Dateline
  3. Article body

Not every article has a headline, and most articles don't have datelines.

In AGW's data, the article body has a full XML structure from CoreNLP with all the annotation layers. The headline and dateline annotations seem to be more minimal, though they always have constituent parses. One oddity is that the constituent parse s-expressions are sometimes encoded a little differently from the other annotations. I suspect this is because they use a customized variant of CoreNLP in which the Stanford Parser is replaced with a different (faster) one.

Anyway, I tried to normalize these things a little bit.

These scripts output a few different formats.

  • jdoc: a JSON rendering of the document with all annotations. It is just a JSON translation of the XML and is intended to preserve all information. I find it much faster to process than the XML (I've found the Python ujson and Java Jackson JSON libraries to be pretty quick).

    Format: one line per document, with three tab-separated fields per line:

    DocID  \t  MetaInfo  \t  BodyFullInfo

    where

      • DocID is just a string

      • MetaInfo is a JSON object with the metadata, plus the headline and/or dateline text data if they exist

      • BodyFullInfo contains info for all sentences in the document, as well as its coref-identified entities

  • justsent: a JSON representation of just the sentences and raw word tokens from the body text. Vastly smaller than jdoc. One line per document, three tab-separated fields per line:

    DocID \t MetaInfo \t BodySentencesTokens

  • sentxml: an XML version of justsent. This format adds a pubdate field, but that's derived just from a regex on the document ID.

  • various report-like data derivations, like docid (all document IDs for a month) or meta (just the metadata).
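As a rough illustration of the tab-separated layouts above, here is a minimal Python sketch for reading one jdoc or justsent line. It is not taken from the scripts themselves. It assumes the JSON fields contain no raw tab characters (standard JSON encoders escape them), and the function names, the example line, and the shape of the example body are all made up for illustration. The date helper assumes document IDs look like `SOURCE_LANG_YYYYMMDD.NNNN`; the actual regex used for sentxml's pubdate field may differ.

```python
import json
import re


def parse_line(line):
    """Split one jdoc/justsent line into (docid, meta, body).

    Raises ValueError if the line does not have exactly three
    tab-separated fields.
    """
    docid, meta_json, body_json = line.rstrip("\n").split("\t")
    return docid, json.loads(meta_json), json.loads(body_json)


# Assumption: the ID embeds the date as _YYYYMMDD. before the sequence number.
DOCID_DATE = re.compile(r"_(\d{4})(\d{2})(\d{2})\.")


def pubdate_from_docid(docid):
    """Return 'YYYY-MM-DD' parsed from a document ID, or None if absent."""
    match = DOCID_DATE.search(docid)
    return "-".join(match.groups()) if match else None


# Hypothetical line; the real MetaInfo and body structures are richer.
example = "\t".join([
    "AFP_ENG_19940512.0003",
    json.dumps({"headline": "Example"}),
    json.dumps([["Hello", "world", "."]]),
]) + "\n"

docid, meta, body = parse_line(example)
print(docid, pubdate_from_docid(docid), meta, len(body))
```

Because each document sits on its own line, the output files can be streamed one line at a time without loading a whole file into memory.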

Everything is designed to take all the original .xml.gz files from the LDC release in one big directory, and output derived data with new suffixes. Edit the Makefile to point to that directory, then use it to produce the format you want. It takes hundreds of CPU-hours to convert the .xml.gz files into anything else.

