
wikipedia-biography-dataset's Introduction

WikiBio (wikipedia biography dataset)

This dataset gathers 728,321 biographies from Wikipedia. It aims at evaluating text generation algorithms: for each article, we provide the first paragraph and the infobox (both tokenized). It was used in our work,

Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 16,
http://arxiv.org/abs/1603.07771

This publication provides further information about the data, and we kindly ask you to cite this paper when using it. The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referenced by WikiProject Biography [1].

For each article, we extracted the first paragraph (text) and the infobox (structured data). Each infobox is encoded as a list of (field name, field value) pairs. We used Stanford CoreNLP [2] to preprocess the data, i.e. we broke the text into sentences and tokenized both the text and the field values. The dataset was randomly split into three subsets: train (80%), valid (10%) and test (10%). We strongly recommend using test only for the final evaluation.

The data is organised in three subdirectories for train, valid and test. Each directory contains 7 files (SET below stands for train, valid or test):

SET.id contains the list of wikipedia ids, one article per line.
SET.url contains the url of the wikipedia articles, one article per line.
SET.box contains the infobox data, one article per line.
SET.nb contains the number of sentences per article, one article per line.
SET.sent contains the sentences, one sentence per line.
SET.title contains the title of the wikipedia article, one per line.
SET.contributors contains the url of the wikipedia article history, which lists the authors of the article.
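These files are line-aligned, so article i occupies line i of every file in a split. A minimal Python sketch of reading them side by side (the prefix path, e.g. valid/valid, is an assumption about how the archive extracts; adjust it to your layout):

```python
# Read the line-aligned per-article files of one split into a list of dicts.
# The prefix (e.g. "valid/valid") is a hypothetical path; only the
# .id/.url/.title extensions shown here are loaded in this sketch.

def read_split(prefix):
    """Return one dict per article, built from the line-aligned files."""
    columns = {}
    for ext in ("id", "url", "title"):
        with open(f"{prefix}.{ext}", encoding="utf-8") as f:
            columns[ext] = [line.rstrip("\n") for line in f]
    n = len(columns["id"])
    # Every file must have exactly one line per article.
    assert all(len(v) == n for v in columns.values()), "files are not line-aligned"
    return [{ext: columns[ext][i] for ext in columns} for i in range(n)]
```

The same pattern extends to SET.box and SET.contributors; only SET.sent needs the extra grouping step via SET.nb described below.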

Hence all files allow accessing the information for one article by line number. It is necessary to use SET.nb to split the sentences (SET.sent) per article. The infobox data in SET.box is encoded as follows: each line encodes one box; each box is a list of tab-separated tokens; each token has the form fieldname_position:wordtype. We also indicate that a field is empty or contains no readable tokens with fieldname:. For instance, the first box of the valid set starts with

type_1:pope name_1:michael name_2:iii name_3:of name_4:alexandria title_1:56th title_2:pope title_3:of title_4:alexandria title_5:& title_6:patriarch title_7:of title_8:the title_9:see title_10:of title_11:st. title_12:mark image:

which indicates that the field "type" contains 1 token "pope", the field "name" contains 4 tokens "michael iii of alexandria", the field "title" contains 12 tokens "56th pope of alexandria & patriarch of the see of st. mark", and the field "image" is empty.
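The token scheme above can be decoded mechanically: strip the trailing _position suffix to recover the field name, and collect the word types in order. A minimal sketch (the function name is ours, not part of the dataset):

```python
# Parse one line of SET.box into a dict mapping field name -> token list.
# Tokens look like fieldname_position:wordtype; an empty field appears
# as "fieldname:" with no position suffix and no word type.

def parse_box(line):
    fields = {}
    for token in line.rstrip("\n").split("\t"):
        key, _, word = token.partition(":")
        # Split off the trailing _<position> suffix, if present.
        name, sep, pos = key.rpartition("_")
        field = name if (sep and pos.isdigit()) else key
        tokens = fields.setdefault(field, [])
        if word:  # empty fields contribute no word type
            tokens.append(word)
    return fields
```

Applied to the example box, this yields {"type": ["pope"], "name": ["michael", "iii", "of", "alexandria"], ...} with "image" mapped to an empty list. Using rpartition keeps field names that themselves contain underscores (e.g. birth_date) intact.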

[1] https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Biography
[2] http://stanfordnlp.github.io/CoreNLP/

Version Information

v1.0 (this version) Initial Release.

License

License information is provided in License.txt

Decompressing zip files

We split the archive into multiple files. To extract, run
cat wikipedia-biography-dataset.z?? > tmp.zip
unzip tmp.zip
rm tmp.zip

wikipedia-biography-dataset's People

Contributors

davidgrangier


wikipedia-biography-dataset's Issues

About the data decompression

Hello,
I tried to decompress the .z files following the instructions in the README, but the system always raises the error
unzip: cannot find zipfile directory in one of tmp.zip or tmp.zip.zip, and cannot find tmp.zip.ZIP, period.
Have you ever encountered this problem?

How to analyze data in python?

My question is: when I extract the files, it gives me outputs with extensions ".box", ".contributors", ".id", ".nb", ".sent" & ".title".

I would like to learn how I can use these files in my Python code.
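One way to get started, sketched under the file layout documented above: read SET.nb to know how many sentences belong to each article, then slice SET.sent accordingly (the prefix path is an assumption about where the split was extracted):

```python
# Group the flat SET.sent file into per-article sentence lists,
# using the per-article sentence counts in SET.nb.

def load_sentences(prefix):
    """Return a list of sentence lists, one inner list per article."""
    with open(prefix + ".nb", encoding="utf-8") as f:
        counts = [int(line) for line in f]
    with open(prefix + ".sent", encoding="utf-8") as f:
        sentences = [line.rstrip("\n") for line in f]
    articles, start = [], 0
    for n in counts:
        articles.append(sentences[start:start + n])
        start += n
    # Every sentence must be assigned to exactly one article.
    assert start == len(sentences), "SET.nb does not account for every sentence"
    return articles
```

The other files (SET.id, SET.url, SET.title, SET.box, SET.contributors) already have one line per article, so they can be read with plain line iteration and matched to these sentence lists by index.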
