thiagocf05 / webnlg

The enriched version of the WebNLG described at INLG 2018

Python 99.01% Shell 0.99%

webnlg's Issues

Delexicalized Test Set

The full test set including references was released a few months ago.
I think it would be beneficial, for completeness' sake, to delexicalize that data as well.
Reasoning:

It would be a better test set for evaluating referring expressions, for both seen and unseen entities.
For systems that decouple referring expression generation from sentence realization, it would make testing individual components simpler.

Coverage tests

Thanks for doing all this.

I have a question about coverage: did you test the coverage of your manual work?

Looking at the first entry of the first file I opened (train, 7triples), I see:

AGENT-1 was born in PATIENT-4 and is from the U.S. . AGENT-1 graduated in 1955 from PATIENT-3 . AGENT-1 worked as PATIENT-2 and for NASA in PATIENT-6 . AGENT-1 spent PATIENT-5 in space and is now retired .

"U.S." should be replaced with PATIENT-1, the entire span "in 1955 from UT Austin with a B.S" with PATIENT-3, and "retired" with PATIENT-7.

Would you say that these kinds of problems are to be expected? Did you run any coverage test to make sure nothing was left out? (For two of the cases above, an automated test could catch them.)
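The kind of automated check mentioned above could look roughly like the sketch below. It is only an illustration, not the authors' tooling: the function name `find_uncovered_entities` and the `entity_map` layout (tag mapped to a list of surface forms) are assumptions made for this example, and the "Example City" entry is a hypothetical placeholder. The check flags any tag whose surface form still appears verbatim in a delexicalized template.

```python
import re

def find_uncovered_entities(template, entity_map):
    """Return (tag, surface_form) pairs whose text still appears in the template.

    `entity_map` is a hypothetical structure: tag -> list of known surface forms.
    """
    leftovers = []
    for tag, surface_forms in entity_map.items():
        for form in surface_forms:
            # Require non-word characters (or string edges) around the form,
            # so "retired" does not match inside a longer word.
            pattern = r"(?<!\w)" + re.escape(form) + r"(?!\w)"
            if re.search(pattern, template, flags=re.IGNORECASE):
                leftovers.append((tag, form))
    return leftovers

# Shortened version of the template quoted in this issue.
template = ("AGENT-1 was born in PATIENT-4 and is from the U.S. . "
            "AGENT-1 spent PATIENT-5 in space and is now retired .")

# Surface forms for PATIENT-1 and PATIENT-7 come from the issue text;
# the PATIENT-4 entry is a hypothetical placeholder.
entity_map = {
    "PATIENT-1": ["U.S."],
    "PATIENT-7": ["retired"],
    "PATIENT-4": ["Example City"],
}

leftovers = find_uncovered_entities(template, entity_map)
for tag, form in leftovers:
    print(f"{tag}: '{form}' was not delexicalized")
```

A verbatim match like this would catch the "U.S." and "retired" cases; a paraphrased span such as the graduation phrase would still need manual review.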

Lack of Templates

Found on

Category="WrittenWork" lid="Id1", size="5"
https://github.com/ThiagoCF05/webnlg/blob/master/data/v1.5/en/dev/5triples/WrittenWork.xml#L857
Category="SportsTeam" lid="Id1", size="2"
https://github.com/ThiagoCF05/webnlg/blob/master/data/v1.5/en/dev/2triples/SportsTeam.xml#L510

Nice Work! And here I made a Python Reader :).

Dear authors,

I really like your enriched WebNLG, and admire your efforts in updating it to the newest v1.5! The XML format can make it hard for people who want to get into the dataset quickly. Both transforming the format into a more user-friendly Python dictionary and cleaning the dataset take meticulous effort.

I made this data reader for my own research project: WebNLG Reader. I would like to share it with you to help spread your work. Future Python programmers who want to use your dataset can adapt my code and kick off their projects more easily.

All in all, great work :D!

how to generate template text

<otriple> instead of <mtriple> on test

In the test files, each triple is contained inside an <otriple> instead of an <mtriple>.
This becomes problematic when parsing the files.
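Until the files are made consistent, a reader can work around the difference by accepting either tag. The sketch below uses Python's standard `xml.etree.ElementTree`; the function name `read_triples` and the `<tripleset>` wrapper in the sample strings are assumptions made for illustration, and the lookup itself does not depend on the wrapper's name. It prefers `<mtriple>` and falls back to `<otriple>`, so entries that happen to contain both are not counted twice.

```python
import xml.etree.ElementTree as ET

def read_triples(entry):
    """Return (subject, predicate, object) tuples from an entry element.

    Prefers <mtriple>; falls back to <otriple> when no <mtriple> is present.
    """
    for tag in ("mtriple", "otriple"):
        texts = [elem.text for elem in entry.iter(tag) if elem.text]
        if texts:
            return [tuple(part.strip() for part in text.split("|"))
                    for text in texts]
    return []

# Hypothetical minimal entries; the wrapper element name is a placeholder.
TRIPLE = "11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania"
train_entry = ET.fromstring(
    f"<entry><tripleset><mtriple>{TRIPLE}</mtriple></tripleset></entry>")
test_entry = ET.fromstring(
    f"<entry><tripleset><otriple>{TRIPLE}</otriple></tripleset></entry>")

print(read_triples(train_entry))
print(read_triples(test_entry))
```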

Nice dataset! Question regarding: Segmentation sentences in the alignment between "sortedtripleset" and original text

Thank you for making the WebNLG dataset with the alignment available!

We would like to align sentences in the original text and the triples in sortedtripleset.

Is there a function/procedure which replicates the segmentation perfectly?

Here is the example from the README to ground what I mean by the original text and sortedtripleset.

...
<lex comment="good" lid="Id1">
        <!-- ordered tripleset segmented in sentences -->
        <sortedtripleset>
            <sentence ID="1">
                <striple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</striple>
            </sentence>
            <sentence ID="2">
                <striple>11th_Mississippi_Infantry_Monument | established | 2000</striple>
                <striple>11th_Mississippi_Infantry_Monument | category | Contributing_property</striple>
            </sentence>
        </sortedtripleset>
        <!-- extracted referring expressions -->
        <references>
            <reference entity="11th_Mississippi_Infantry_Monument" number="1" tag="AGENT-1" type="description">The 11th Mississippi Infantry Monument</reference>
            <reference entity="Adams_County,_Pennsylvania" number="2" tag="PATIENT-1" type="name">Adams County , Pennsylvania</reference>
            <reference entity="11th_Mississippi_Infantry_Monument" number="3" tag="AGENT-1" type="pronoun">It</reference>
            <reference entity="2000" number="4" tag="PATIENT-2" type="name">2000</reference>
            <reference entity="Contributing_property" number="5" tag="PATIENT-3" type="name">contributing property</reference>
        </references>
        <!-- original text -->
        <text>
            The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.
        </text>
...
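For reference, the stored alignment itself can be read back with the standard library, as in the sketch below. This only recovers which triples were assigned to which sentence ID; it does not reproduce the segmentation of the original text, which is the open question here. The function name `triples_by_sentence` is an assumption for this example, and the excerpt is a shortened, closed copy of the README fragment above.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the README excerpt, with the <lex> element closed
# so that it parses on its own.
LEX_XML = """
<lex comment="good" lid="Id1">
    <sortedtripleset>
        <sentence ID="1">
            <striple>11th_Mississippi_Infantry_Monument | location | Adams_County,_Pennsylvania</striple>
        </sentence>
        <sentence ID="2">
            <striple>11th_Mississippi_Infantry_Monument | established | 2000</striple>
            <striple>11th_Mississippi_Infantry_Monument | category | Contributing_property</striple>
        </sentence>
    </sortedtripleset>
    <text>The 11th Mississippi Infantry Monument which is located in Adams County, Pennsylvania. It was established in 2000 and falls under the category of contributing property.</text>
</lex>
"""

def triples_by_sentence(lex):
    """Map each sentence ID to the list of triples aligned with it."""
    groups = {}
    for sentence in lex.iter("sentence"):
        sentence_id = int(sentence.get("ID"))
        groups[sentence_id] = [
            tuple(part.strip() for part in striple.text.split("|"))
            for striple in sentence.iter("striple")
        ]
    return groups

lex = ET.fromstring(LEX_XML.strip())
groups = triples_by_sentence(lex)
for sentence_id, triples in sorted(groups.items()):
    print(sentence_id, [predicate for _, predicate, _ in triples])
```

Pairing these groups with sentences of the text would additionally require a sentence splitter that matches the one used during annotation, which this sketch does not attempt.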

Lack of Text on category="SportsTeam" eid="Id4" size="3"

Link: https://github.com/ThiagoCF05/webnlg/blob/master/data/v1.5/en/dev/3triples/SportsTeam.xml#L234

Lack of Apostrophe in some entities

Some entities contain apostrophes in their original form (e.g., "Hook_'em_(mascot)"), but are represented without this symbol in the tags and .

Example:
https://github.com/ThiagoCF05/webnlg/blob/master/data/v1.5/en/dev/1triples/Astronaut.xml#L460

Reported by @abevieiramota.
