amir-zeldes / gum

Repository for the Georgetown University Multilayer Corpus (GUM)

Home Page: https://gucorpling.org/gum/

License: Other

Python 87.49% Batchfile 0.29% Shell 0.29% CSS 1.39% HTML 1.72% JavaScript 1.98% XSLT 4.78% Cython 2.06%
corpus annotations pos-tagging annis treebank rhetorical-structure-theory coreference universal-dependencies

gum's People

Contributors

amir-zeldes, esmanning, lgessler, logan-siyao-peng, marrowe, nitinviag, nitinvwaran, nschneid, reckart, xiulinyang, yilunzhu


gum's Issues

Inconsistent annotation of quotes

There seem to be some inconsistencies in the annotation of quotes, e.g.

Double quotes

"   _   ``  ``   <= this seems ok, others not
"   _   ''  ''   <= also ok as Amir pointed out
"   _   '   '
"   _   0   0

Single quotes

'   _   ``  ``   <= this seems ok, others not
'   _   "   ''
'   _   "   "
'   _   '   '
'   _   POS POS

Guide for where to correct what

For public contributions to error correction, we need a wiki page explaining which format is primary for which annotations (e.g. POS tags are updated in the .xml files and automatically propagate to the syntax files and PAULA, but not the other way around).

SyntaxError in propagate.py

Trying to build the corpus with build_gum.py including Reddit data, I run into a SyntaxError in propagate.py which is caused by a byte which is neither ASCII (which is a problem for Python 2) nor UTF-8 (which is a problem for Python 3):

$ python3 _build/utils/propagate.py
  File "_build/utils/propagate.py", line 107
SyntaxError: Non-UTF-8 code starting with '\x91' in file _build/utils/propagate.py on line 107, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
$ python _build/utils/propagate.py
  File "_build/utils/propagate.py", line 107
SyntaxError: Non-ASCII character '\x91' in file _build/utils/propagate.py on line 107, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Inspecting the file in Vim, there are more bytes like this beyond this point. Perhaps a coding declaration is missing at the top of the file?
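
A possible fix, assuming the stray bytes really are Windows-1252 characters in comments or string literals (\x91 is the cp1252 left single quotation mark), would be to declare that encoding at the top of propagate.py per PEP 263, or else to re-save the file as UTF-8:

#!/usr/bin/python
# -*- coding: cp1252 -*-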

Syntax for "you guys"

Attested analyses:

  • EWT
    • you <=dep= guys
  • GUM:
    • you <=det= guys
    • you <=compound= guys
    • you =appos=> guys
    • you =compound=> guys

Coref issues to do

  • "bio_emperor" line 12 fix tsv file for "Emperor Norton Bridge" (as both person=Emperor Norton and place=...Bridge)

Varying analyses of "including"

Instances of "including" in GUM vary in whether they're treated as case(Nchild, including) or acl(Nparent, including), or something else.

I'm not sure whether this is better analyzed as clausal or prepositional. A clausal analysis seems best wherever there's a prototypical agent as its parent or as the subject of its parent:

  1. [The man [including examples in his GitHub issue]] wrote a few paragraphs
  2. The man [wrote a few paragraphs [including examples in his GitHub issue]]

It seems odder to me somehow when this is not the case, since the basic sense of "including" would seem to require a volitional agent. The clausal analysis seems questionable to me in (3–5):

  3. Three birds were visible on the fence including a robin, sparrow, and crow.
  4. Three birds including a robin, sparrow, and crow were visible on the fence.
  5. This pattern holds true for many workers including assembly line laborers who build cars, stylists who cut hair, and doctors who perform heart surgery.

Is this a case of genuine ambiguity, or should these analyses be unified? Note that the PTB POS guidelines say anything ending in -ing or -ed is never IN, so strict consistency with XPOS would always require a clausal analysis.

"along with"

In e.g.

  • He arrived along with a few friends.
  • Along with the cake mix, I bought some candles.

EWT would treat "along" and "with" as sister ADPs, both attaching as case. Since "along" by itself can be a preposition (walk along the river), I think this makes sense.

Note that the verbal idioms "go along (with)" 'follow', "get along (with)", etc. are different.

Fix foreign names with right branching nn

Foreign names used in an English context:

  • Tag as (N)NP, not FW:
    • Paseo de Montejo Information Module (NP, NP, NP, NN, NN)
  • If used in foreign embedded syntax, tag as FW:
    • Módulo de Información Turística Paseo de Montejo (all FW)
  • For dependencies:
    • Respect entity borders
    • Within a foreign name entity, right-most token is the head, so "Montejo" is head of Paseo de Montejo
    • Note that Módulo de Información Turística Paseo de Montejo contains two entity heads:
      • Montejo is the head of Paseo de Montejo
      • Turística is the head of Módulo de Información Turística
    • All dependents of the head are left branching nn; relations between entity heads are based on syntax/semantics

Fix stray 'cannot' tokens

There are a few untokenized instances of 'cannot' (a possible split is sketched after this list):

  • Tokenize
  • Add a <w> element
  • Do this just before release to avoid branches out of sync
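
A rough sketch of what the split could look like (not the build code itself; the token TAB pos TAB lemma column layout of the source XML lines is an assumption here):

def split_cannot(line):
    """Split an untokenized 'cannot' line into 'can' + 'not' wrapped in <w>."""
    token = line.rstrip("\n").split("\t")[0]
    if token.lower() != "cannot":
        return [line.rstrip("\n")]
    can = token[:3]                      # keeps capitalization: "Cannot" -> "Can"
    return ["<w>",
            can + "\tMD\tcan",           # modal
            "not\tRB\tnot",              # negation adverb
            "</w>"]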

Sentence initial w tag

<w> tags can't mark a sentence-initial word that is fused to the preceding sentence's period, because the span would have to cross the <s> boundary and break XML nesting. Currently this is marked by a single-token <w> tag on the first word of the sentence that is not separated by a space:

<!-- text = events.I -->
<s>
...
events
.
</s>
<s>
<w>
I
</w>

The build bot should interpret such single-token <w> spans correctly for NoSpace annotation and text generation.
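
A rough sketch of how a script could detect these cases (assuming the _build/src/xml layout with one tab-separated token line per word; nested elements inside <w> are not handled, and the function name is invented):

from xml.etree import ElementTree as ET

def sentence_initial_single_w(xml_path):
    """Yield word forms that open a sentence inside a single-token <w> span,
    i.e. words that should attach to the preceding token with no space."""
    tree = ET.parse(xml_path)
    for s in tree.iter("s"):
        children = list(s)
        # the sentence must start directly with a <w>, not with plain token lines
        if not children or children[0].tag != "w" or (s.text or "").strip():
            continue
        w_lines = [l for l in (children[0].text or "").splitlines() if l.strip()]
        if len(w_lines) == 1:
            yield w_lines[0].split("\t")[0]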

Adjunct relative that-clause

Per UniversalDependencies/UD_English-EWT#173, the sentence

  • You know what I think, I think the first time that it does the card mode, it takes a long time.

correctly tags "that" as SCONJ, but it also has the XPOS WDT, PronType=Rel and an enhanced ref dependency; these should be reserved for heads that fill a role in the embedded predicate.

POS tagging errors for directional preposition + here/there

Example query results:

I imagined a man down/RB there/RB in the dark
Someone's hung it up/RB here/RB
've really got something stuck up/IN there/RB
# 4 is a phrasal verb
a few new girls over/IN there/PP
is there really someone in a craft up/RB there/RB

RB RB is the most common analysis, but PTB has majority IN RB, which I think is the correct analysis. There are other reasons not to like a word like down as an adverb:

  A downwardly oriented arrow
* A down oriented arrow

License of academic and bio files

The website says that the academic and biography files are provided under "a CC-BY" license, but it doesn't state the version. Could you provide that information?

Better constituent parsing

Replace CoreNLP with a state-of-the-art neural constituency parser.

  • Get rid of hardwired Windows dependency (look for lexparser_eng_const_plus.bat)
  • Need to choose a parser that accepts gold POS tag input (since we have gold POS)

POSTAG column empty in GUM_news_warming.conll10

Column 5 (POSTAG) seems to be missing for some sentence(s) in these files:

  • dep/GUM_news_warming.conll10
  • dep/GUM_voyage_fortlee.conll10
  • dep/GUM_voyage_vavau.conll10
  • dep/GUM_whow_cupcakes.conll10
  • dep/GUM_whow_flirt.conll10
  • dep/GUM_whow_joke.conll10
  • dep/GUM_whow_languages.conll10
  • dep/GUM_whow_skittles.conll10

best_JJS with well lemma

In general "___ is best" is labeled as follows:

# sent_id = GUM_whow_elevator-22
# text = You could also keep adjusting your shirt or your hair, keeping a running dialogue about what look is best for you.
# s_type = sub
19      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   20      cop     _       _
20      best    good    ADJ     JJS     Degree=Sup      15      acl     _       _

There is an exception, and I think this may be mislabeled:

# sent_id = GUM_voyage_fortlee-27
# text = Most people choose cars, but if headed to a central area, such as Main Street, it is best to drive there, park in a municipal parking lot, and walk around from there.
# s_type = decl
20      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   21      cop     _       _
21      best    well    ADJ     JJS     Degree=Sup      3       conj    _       _

UD conversion errors

UD conversion error list, examples added as comments:

  • Subject relative nsubj with 'which' becomes 'obj' (turns out to be related to xcomp rule)
  • Subject nsubj -> obj in copula environment (same issue as the above)
  • Adverbial copula predicates should be treated like PP predicates (currently 'be' is left as the root in "it is here" but not in "it is in the house", see real example below)
  • Consider adding a rule: "and" with func dep -> reparandum (only occurs once)
  • Bug affecting coordinated prepositions (conj still comes out of the prep -> case; should come from lexical item)
  • Handle PP root with modifiers

Use coref information for UD dislocated

Use the TSV data collected here and also harvest the coref information:

https://github.com/amir-zeldes/gum/blob/dev/_build/utils/stanford2ud.py#L110

Based on the coref and dependency data:

  • Find tokens labeled dep
  • Find the borders of the NP that they are the head of
  • Find the borders of all other NPs whose heads depend on the same parent as dep
  • Check if any of them are coreferent

Whenever a coreferent 'co-child' of the 'dep' entity is found, save these 'deps' and change 'dep' to 'dislocated' in the UD conversion output.
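
A hedged sketch of this promotion step; tokens are assumed to be dicts with id, head and deprel, and np_span / coreferent are hypothetical stand-ins for the NP-border and coreference lookups harvested in stanford2ud.py:

def promote_dislocated(tokens, np_span, coreferent):
    """Relabel 'dep' as 'dislocated' when a coreferent co-child NP is found."""
    for tok in tokens:
        if tok["deprel"] != "dep":
            continue
        dep_np = np_span(tok["id"])              # borders of the NP headed by this token
        for sib in tokens:
            if sib["id"] == tok["id"] or sib["head"] != tok["head"]:
                continue                         # only co-children of the same parent
            if coreferent(dep_np, np_span(sib["id"])):
                tok["deprel"] = "dislocated"
                break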

ModuleNotFoundError: No module named 'utils.feature_extraction'

When I ran process_reddit.py, I was able to get the Reddit data fine. But when I ran build_gum.py, I got the following:

====================
Validating files...

Found reddit source data
Including reddit data in build
o Found 148 documents
o File names match
o Token counts match across directories
o 148 documents pass XSD validation
Traceback (most recent call last):
  File "_build/build_gum.py", line 95, in <module>
    from utils.repair_rst import fix_rst
  File "/Users/katherineatwell/cs-programs/research/resources/gum/_build/utils/repair_rst.py", line 6, in <module>
    from .rst2dep import make_rsd
  File "/Users/katherineatwell/cs-programs/research/resources/gum/_build/utils/rst2dep.py", line 14, in <module>
    from .feature_extraction import ParsedToken
ModuleNotFoundError: No module named 'utils.feature_extraction'

I looked for a feature_extraction file in utils and couldn't find it, so I'm not sure what to do.

Thanks!

Unable to download 4.1.0 ZIP

When I try to download https://github.com/amir-zeldes/gum/archive/V4.1.0.zip, the download always gets stuck at 73400320 bytes. I tried from two different machines with different ISPs.

However, https://github.com/amir-zeldes/gum/archive/V4.1.0.tar.gz seems to download ok.

Plain text paragraphs

Is there any way to access the plain text paragraphs for the corpus data? Or would it be best to reconstruct this data using the XML files?
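
If the XML route is taken, a rough sketch could look like the following (assuming the _build/src/xml format with one tab-separated token line per word and <p> elements marking paragraphs; note that fused tokens inside <w> would still come out space-separated here):

from xml.etree import ElementTree as ET

def paragraphs(xml_path):
    """Yield one whitespace-joined token string per <p> element."""
    tree = ET.parse(xml_path)
    for p in tree.iter("p"):
        tokens = []
        for chunk in p.itertext():               # walks nested <s>, <w>, etc. in order
            for line in chunk.splitlines():
                line = line.strip()
                if line:
                    tokens.append(line.split("\t")[0])   # first column is the word form
        yield " ".join(tokens)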

Possibly inconsistent lemmatization of longer

Someone's life isn't much longer, lemma is "long":

# text = Alternatively, if life has not been good to you so far, (and considering your current situation this seems more likely) consider how lucky you are that it will not be troubling you much longer.
38      longer  long    ADV     RBR     Degree=Cmp      35      advmod  _       SpaceAfter=No

but cleaning things takes longer, lemma is "longer":

# text = The gingerbread batter bowl could wait a day, even though it would take longer to clean tomorrow.
15      longer  longer  ADV     RBR     Degree=Cmp      14      advmod  _       _

People's rides are longer, lemma is "longer":

# text = This makes everyone’s ride on the elevator longer, if only for a few seconds.
9       longer  longer  ADJ     JJR     Degree=Cmp      2       xcomp   _       SpaceAfter=No

but someone's limbs are longer, lemma is "long":

# text = Longer limbs are expected to experience larger bending and torsional moments, so the fact that experimental animals had longer femora suggests that limb verticalization reduces these moments by orienting the bone more parallel to the GRF line of action.
1       Longer  long    ADJ     JJR     Degree=Cmp      2       amod    _       Discourse=cause:65->69|Entity=(object-112
20      longer  long    ADJ     JJR     Degree=Cmp      21      amod    _       Entity=(object-115

Fix RST EDUs spanning more than one sentence (based on s spans)

@loganpeng1992 found these cases as of GUM4 (ideally, finding these should be built into the validation; a rough sketch of such a check follows the table):

doc_id	line_id	rst_id	parent_id	relname	text
GUM_news_defector	23-24	28	54	sequence	Rathbun stated , " Daniel was picked up by an investigator in a black car with blacked out windows .
GUM_news_iodine	32-33	40	86	span	" Pregnant women in Australia are getting about half as much as what they require on a daily basis . So that alarms me , because there 's quite serious potential for adverse effects and brain damage in the next generation of children born in this country , " he said .
GUM_interview_peres	44-45	55	144	joint	No , no . Forget memory .
GUM_interview_dungeon	40-41	44	109	span	So it changes , so one year we got to be in the Press , but they have now stopped adult ads . Completely .
GUM_interview_dungeon	54-55	58	132	span	Because I want someone who 's going to treat people nicely and well . We have --
GUM_interview_dungeon	70-71	74	121	span	Be careful . He 'll make you fill one out .
GUM_interview_dungeon	72-73	75	74	concession	You 'd be scared . You would n't hire me .
GUM_interview_dungeon	91-92	94	95	solutionhood	Do you ask for references ?
GUM_news_sensitive	26-27	30	49	joint	Peter Zimonjic . " MP calls for probe into misplaced documents " — Edmonton Sun , August 16 , 2008
GUM_interview_gaming	15-16	25	82	span	And one of our friends , our common friends , he introduced us during study hall ,
GUM_interview_gaming	17-18	26	25	result	and we just kind of hit it off from there .
GUM_voyage_cleveland	5-6	5	47	joint	Plus , this region ranks fifth in the nation in number of major cultural resources per one million residents . Understand
GUM_academic_thrones	1-2	1	76	preparation	Re(a)d Wedding : A Comparative Discourse Analysis of Fan Responses to Game of Thrones
GUM_interview_herrick	32-33	37	36	elaboration	( http://www.wikihow.com/wikiHow:Carbon-Neutral ) Our community culture is focused on wikiLove and civility .
GUM_interview_libertarian	8-9	8	60	elaboration	With Wikinews , Sarvis discusses his background , views on McDonnell 's tenure , keys to campaign success , plan to implement his agenda , and the former Virginia governor he most admires . Interview
GUM_news_korea	5-7	5	26	joint	" War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon . They perpetrated such reckless action as firing 36 shells at KPA civil police posts under the absurd pretext that the KPA fired one shell at the south side . Six shells of them hit the area near KPA civil police posts 542 and 543 and other 15 shells fell near KPA civil police posts 250 and 251 " , said KCNA .
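
A possible shape for that validation check (a sketch only; edu_spans and sent_spans are assumed to be lists of inclusive (start, end) token-index pairs built from the RST and XML layers):

def cross_sentence_edus(edu_spans, sent_spans):
    """Return ids of EDUs whose token span is not contained in any single sentence."""
    bad = []
    for edu_id, (e_start, e_end) in enumerate(edu_spans, start=1):
        if not any(s_start <= e_start and e_end <= s_end
                   for s_start, s_end in sent_spans):
            bad.append(edu_id)
    return bad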

GUM_vlog_pregnant: trinary RST branch in lisp binary

If I'm reading lisp_binary/GUM_vlog_pregnant.dis right, there's a trinary branch here:

                      ( Nucleus (span 102 103) (rel2par joint)
                        ( Nucleus (leaf 102) (rel2par span) (text _!and you just feel awful_!) )
                        ( Satellite (leaf 103) (rel2par circumstance) (text _!thinking ._!) )
                        ( Satellite (span 104 107) (rel2par elaboration)
                          ( Nucleus (span 104 106) (rel2par span)
                            ( Nucleus (span 104 105) (rel2par same-unit)
                              ( Nucleus (leaf 104) (rel2par span) (text _!The thought_!) )
                              ( Satellite (leaf 105) (rel2par elaboration) (text _!of telling someone_!) )
                            )
                            ( Nucleus (leaf 106) (rel2par same-unit) (text _!makes me want to burst into tears ._!) )
                          )
                          ( Satellite (leaf 107) (rel2par elaboration) (text _!All these things going onto my head ._!) )
                        )
                      )

Maybe there's something I'm misunderstanding? Janet and I came across this while doing something unrelated.

error thrown by build_gum.py

Hi, thanks for this great resource!
A couple of comments on running python build_gum.py.
First, the user needs to cd into the _build directory, otherwise _build/src is not found from the root dir; maybe this could be mentioned in the README?
Second, there is an encoding error from utils/stanford2ud.py (using Python 2 on Mac 10.13). The fix is to add the following two lines at the top of the file:

#!/usr/bin/python
# -*- coding: utf-8 -*-

As in https://www.python.org/dev/peps/pep-0263
best, Andrew

Add metadata to _build/src/xml

Key-value pairs should be added to the <text> tag (a sketch of attaching them follows the list). The names and values to use are:

  • sourceURL: (a URL)
  • dateCreated: (ISO date, e.g. 2017-06-13)
  • dateCollected: {2014-09-15, 2015-09-21, 2016-09-19}
  • dateModified: (from history, in ISO format; not newer than the date collected)
  • speakerCount: e.g. 2 (0 if no sp tags)
  • speakerList: e.g. #Wikinews, #WilliamEvans, or none (if no sp tags)
  • author:
    • Wikinews for interview and news.
    • For voyage and whow, first 3 listed authors, then 'and others (see URL)':
      • Jdlrobson, PsamatheM, Ikan Kekek and others (see URL)
      • Khadijah, Nick Geisler, Mohil Khare and others (see URL)
  • title: (usually in first element)
    • How to Make a Glowstick
    • York
  • shortTitle: (last part of the text id, after the last underscore)
    • glowstick
    • york
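
A rough sketch of attaching such key-value pairs as attributes on <text> (illustrative only; the file name and values below are placeholders, and a rewritten file will not necessarily preserve the repository's original formatting):

from xml.etree import ElementTree as ET

def add_metadata(xml_path, meta):
    """Set the given key-value pairs as attributes on the <text> element."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    text_el = root if root.tag == "text" else root.find(".//text")
    for key, value in meta.items():
        text_el.set(key, value)
    tree.write(xml_path, encoding="utf-8")

add_metadata("GUM_whow_glowstick.xml", {     # hypothetical document and values
    "dateCollected": "2014-09-15",
    "title": "How to Make a Glowstick",
    "shortTitle": "glowstick",
})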

ratchet not tagged correctly

sentence: Basically, we do not give a fuck we gonna talk about shit what now, but we go keep it ratchet.
start: 383   end: 389   term_id: 98   token_id: 22
token: ratchet   lemma: ratchet   upos: NOUN   xpos: NN   feats: Number=Sing
(head_token_id, dep_rel, deps and misc columns were not included)

Here, ratchet is tagged as a noun but is actually used as an adjective. Is there a way to update tagging for slang terms? Thanks very much!

Revise date dependencies

Previous date guidelines for dependencies had:

  • The month is a dependent of the day, attached as nn
  • The year is a dependent of the month, attached as tmod

Revise all dates in dependency files to:

  • The month is still a dependent of the day, but now attached as tmod
  • The year is a dependent of the day, still as tmod
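
For example, under the revised scheme a date like "January 5, 2017" has the day "5" as its head, with both "January" and "2017" attached to "5" as tmod.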

Malnested entity spans

Usually the result of punctuation slipping into or out of a span. About 13 cases, findable with the following ANNIS query (it matches two overlapping entities that each extend past the other's edge, i.e. crossing spans):

e1#entity _o_ e2#entity & ltok#tok & rtok#tok &
#ltok . #e2 & #e1 . #rtok & #e1 _i_ #ltok & #e2 _i_ #rtok

Build error: No module named 'utils.rst2dep'

When building GUM6 from a fresh clone of master, process_reddit.py works fine but I encounter the following error on running _build/build_gum.py:

» python _build/build_gum.py
====================
Validating files...
====================

Found reddit source data
Including reddit data in build
o Found 148 documents
o File names match
o Token counts match across directories
o 148 documents pass XSD validation
Traceback (most recent call last):
  File "_build/build_gum.py", line 95, in <module>
    from utils.repair_rst import fix_rst
  File "/[...]/gum/_build/utils/repair_rst.py", line 6, in <module>
    from .rst2dep import make_rsd
ModuleNotFoundError: No module named 'utils.rst2dep'
