amir-zeldes / gum

Repository for the Georgetown University Multilayer Corpus (GUM)

Home Page: https://gucorpling.org/gum/

License: Other

Python 87.49% Batchfile 0.29% Shell 0.29% CSS 1.39% HTML 1.72% JavaScript 1.98% XSLT 4.78% Cython 2.06%
corpus annotations pos-tagging annis treebank rhetorical-structure-theory coreference universal-dependencies

gum's People

Contributors

amir-zeldes, esmanning, lgessler, logan-siyao-peng, marrowe, nitinviag, nitinvwaran, nschneid, reckart, xiulinyang, yilunzhu


gum's Issues

Inconsistent annotation of quotes

There seem to be some inconsistencies in the annotation of quotes, e.g.

Double quotes

"   _   ``  ``   <= this seems ok, others not
"   _   ''  ''   <= also ok as Amir pointed out
"   _   '   '
"   _   0   0

Single quotes

'   _   ``  ``   <= this seems ok, others not
'   _   "   ''
'   _   "   "
'   _   '   '
'   _   POS POS

Guide for where to correct what

For public contributions to error correction, we need a wiki page explaining which format is primary for which annotations (e.g. POS tags are updated in the .xml files and automatically propagate to the syntax files and PAULA, but not the other way around).

SyntaxError in propagate.py

Trying to build the corpus with build_gum.py including Reddit data, I run into a SyntaxError in propagate.py which is caused by a byte which is neither ASCII (which is a problem for Python 2) nor UTF-8 (which is a problem for Python 3):

$ python3 _build/utils/propagate.py
  File "_build/utils/propagate.py", line 107
SyntaxError: Non-UTF-8 code starting with '\x91' in file _build/utils/propagate.py on line 107, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
$ python _build/utils/propagate.py
  File "_build/utils/propagate.py", line 107
SyntaxError: Non-ASCII character '\x91' in file _build/utils/propagate.py on line 107, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details

Inspecting the file in Vim, there are more bytes like this beyond this point. Perhaps a coding declaration is missing at the top of the file?
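
A possible fix, assuming the stray bytes really are Windows-1252 characters in comments or string literals (\x91 is the cp1252 left single quotation mark), would be to declare that encoding at the top of propagate.py per PEP 263, or else to re-save the file as UTF-8:

#!/usr/bin/python
# -*- coding: cp1252 -*-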

Syntax for "you guys"

Attested analyses:

  • EWT
    • you <=dep= guys
  • GUM:
    • you <=det= guys
    • you <=compound= guys
    • you =appos=> guys
    • you =compound=> guys

Coref issues to do

  • "bio_emperor" line 12 fix tsv file for "Emperor Norton Bridge" (as both person=Emperor Norton and place=...Bridge)

Varying analyses of "including"

Instances of "including" in GUM vary in whether they're treated as case(Nchild, including) or acl(Nparent, including), or something else.

I'm not sure whether this is better analyzed as clausal or prepositional. A clausal analysis seems best wherever there's a prototypical agent as its parent or as the subject of its parent:

  1. [The man [including examples in his GitHub issue]] wrote a few paragraphs
  2. The man [wrote a few paragraphs [including examples in his GitHub issue]]

It seems odder to me somehow when this is not the case, since the basic sense of "including" would seem to require a volitional agent. The clausal analysis seems questionable to me in (3–5):

  3. Three birds were visible on the fence including a robin, sparrow, and crow.
  4. Three birds including a robin, sparrow, and crow were visible on the fence.
  5. This pattern holds true for many workers including assembly line laborers who build cars, stylists who cut hair, and doctors who perform heart surgery.

Is this a case of genuine ambiguity, or should these analyses be unified? Note that the PTB POS guidelines say anything ending in -ing or -ed is never IN, so strict consistency with XPOS would always require a clausal analysis.

"along with"

In e.g.

  • He arrived along with a few friends.
  • Along with the cake mix, I bought some candles.

EWT would treat "along" and "with" as sister ADPs, both attaching as case. Since "along" by itself can be a preposition (walk along the river), I think this makes sense.

Note that the verbal idioms "go along (with)" 'follow', "get along (with)", etc. are different.

Fix foreign names with right branching nn

Foreign names used in an English context:

  • Tag as (N)NP, not FW:
    • Paseo de Montejo Information Module (NP, NP, NP, NN, NN)
  • If used in foreign embedded syntax, tag as FW:
    • Módulo de Información Turística Paseo de Montejo (all FW)
  • For dependencies:
    • Respect entity borders
    • Within a foreign name entity, right-most token is the head, so "Montejo" is head of Paseo de Montejo
    • Note that Módulo de Información Turística Paseo de Montejo contains two entity heads:
      • Montejo is the head of Paseo de Montejo
      • Turística is the head of Módulo de Información Turística
    • All dependents of the head are left branching nn; relations between entity heads are based on syntax/semantics

Fix stray 'cannot' tokens

There are a few untokenized instances of 'cannot' (a possible split is sketched after this list):

  • Tokenize
  • Add a <w> element
  • Do this just before release to avoid branches out of sync
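
A rough sketch of what the split could look like (not the build code itself; the token TAB pos TAB lemma column layout of the source XML lines is an assumption here):

def split_cannot(line):
    """Split an untokenized 'cannot' line into 'can' + 'not' wrapped in <w>."""
    token = line.rstrip("\n").split("\t")[0]
    if token.lower() != "cannot":
        return [line.rstrip("\n")]
    can = token[:3]                      # keeps capitalization: "Cannot" -> "Can"
    return ["<w>",
            can + "\tMD\tcan",           # modal
            "not\tRB\tnot",              # negation adverb
            "</w>"]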

Sentence initial w tag

<w> tags can't mark a sentence-initial word that is fused to the preceding sentence's period, because the span would have to cross the <s> boundary and break XML nesting. Currently this is marked by a single-token <w> tag on the first word of the sentence that is not separated by a space:

<!-- text = events.I -->
<s>
...
events
.
</s>
<s>
<w>
I
</w>

The build bot should interpret such single-token <w> spans correctly for NoSpace annotation and text generation.
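
A rough sketch of how a script could detect these cases (assuming the _build/src/xml layout with one tab-separated token line per word; nested elements inside <w> are not handled, and the function name is invented):

from xml.etree import ElementTree as ET

def sentence_initial_single_w(xml_path):
    """Yield word forms that open a sentence inside a single-token <w> span,
    i.e. words that should attach to the preceding token with no space."""
    tree = ET.parse(xml_path)
    for s in tree.iter("s"):
        children = list(s)
        # the sentence must start directly with a <w>, not with plain token lines
        if not children or children[0].tag != "w" or (s.text or "").strip():
            continue
        w_lines = [l for l in (children[0].text or "").splitlines() if l.strip()]
        if len(w_lines) == 1:
            yield w_lines[0].split("\t")[0]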

Adjunct relative that-clause

Per UniversalDependencies/UD_English-EWT#173, the sentence

  • You know what I think, I think the first time that it does the card mode, it takes a long time.

correctly tags "that" as SCONJ, but it also has the XPOS WDT, PronType=Rel and an enhanced ref dependency; these should be reserved for heads that fill a role in the embedded predicate.

POS tagging errors for directional preposition + here/there

Example query results:

I imagined a man down/RB there/RB in the dark
Someone's hung it up/RB here/RB
've really got something stuck up/IN there/RB
# 4 is a phrasal verb
a few new girls over/IN there/PP
is there really someone in a craft up/RB there/RB

RB RB is the most common analysis, but PTB has majority IN RB, which I think is the correct analysis. There are other reasons not to like a word like down as an adverb:

  A downwardly oriented arrow
* A down oriented arrow

License of academic and bio files

The website says that the academic and biography files are provided under "a CC-BY" license, but it doesn't state the version. Could you provide that information?

Better constituent parsing

Replace CoreNLP with a state-of-the-art neural constituency parser.

  • Get rid of hardwired Windows dependency (look for lexparser_eng_const_plus.bat)
  • Need to choose a parser that accepts gold POS tag input (since we have gold POS)

POSTAG column empty in GUM_news_warming.conll10

Column 5 (POSTAG) seems to be missing for some sentence(s) in these files:

  • dep/GUM_news_warming.conll10
  • dep/GUM_voyage_fortlee.conll10
  • dep/GUM_voyage_vavau.conll10
  • dep/GUM_whow_cupcakes.conll10
  • dep/GUM_whow_flirt.conll10
  • dep/GUM_whow_joke.conll10
  • dep/GUM_whow_languages.conll10
  • dep/GUM_whow_skittles.conll10

best_JJS with well lemma

In general "___ is best" is labeled as follows:

# sent_id = GUM_whow_elevator-22
# text = You could also keep adjusting your shirt or your hair, keeping a running dialogue about what look is best for you.
# s_type = sub
19      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   20      cop     _       _
20      best    good    ADJ     JJS     Degree=Sup      15      acl     _       _

There is an exception, and I think this may be mislabeled:

# sent_id = GUM_voyage_fortlee-27
# text = Most people choose cars, but if headed to a central area, such as Main Street, it is best to drive there, park in a municipal parking lot, and walk around from there.
# s_type = decl
20      is      be      AUX     VBZ     Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin   21      cop     _       _
21      best    well    ADJ     JJS     Degree=Sup      3       conj    _       _

UD conversion errors

UD conversion error list, examples added as comments:

  • Subject relative nsubj with 'which' becomes 'obj' (turns out to be related to xcomp rule)
  • Subject nsubj -> obj in copula environment (same issue as the above)
  • Adverbial copula predicates should be treated like PP predicates (currently 'be' is left as the root in "it is here" but not in "it is in the house", see real example below)
  • Consider adding a rule: "and" with func dep -> reparandum (only occurs once)
  • Bug affecting coordinated prepositions (conj still comes out of the prep -> case; should come from lexical item)
  • Handle PP root with modifiers

Use coref information for UD dislocated

Use the TSV data collected here and also harvest the coref information:

https://github.com/amir-zeldes/gum/blob/dev/_build/utils/stanford2ud.py#L110

Based on the coref and dependency data:

  • Find tokens labeled dep
  • Find the borders of the NP that they are the head of
  • Find the borders of all other NPs whose heads depend on the same parent as dep
  • Check if any of them are coreferent

Whenever a coreferent 'co-child' of the 'dep' entity is found, save these 'deps' and change 'dep' to 'dislocated' in the UD conversion output.
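
A hedged sketch of this promotion step; tokens are assumed to be dicts with id, head and deprel, and np_span / coreferent are hypothetical stand-ins for the NP-border and coreference lookups harvested in stanford2ud.py:

def promote_dislocated(tokens, np_span, coreferent):
    """Relabel 'dep' as 'dislocated' when a coreferent co-child NP is found."""
    for tok in tokens:
        if tok["deprel"] != "dep":
            continue
        dep_np = np_span(tok["id"])              # borders of the NP headed by this token
        for sib in tokens:
            if sib["id"] == tok["id"] or sib["head"] != tok["head"]:
                continue                         # only co-children of the same parent
            if coreferent(dep_np, np_span(sib["id"])):
                tok["deprel"] = "dislocated"
                break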

ModuleNotFoundError: No module named 'utils.feature_extraction'

When I ran process_reddit.py, I was able to get the Reddit data fine. But when I ran build_gum.py, I got the following:

====================
Validating files...

Found reddit source data
Including reddit data in build
o Found 148 documents
o File names match
o Token counts match across directories
o 148 documents pass XSD validation
Traceback (most recent call last):
  File "_build/build_gum.py", line 95, in <module>
    from utils.repair_rst import fix_rst
  File "/Users/katherineatwell/cs-programs/research/resources/gum/_build/utils/repair_rst.py", line 6, in <module>
    from .rst2dep import make_rsd
  File "/Users/katherineatwell/cs-programs/research/resources/gum/_build/utils/rst2dep.py", line 14, in <module>
    from .feature_extraction import ParsedToken
ModuleNotFoundError: No module named 'utils.feature_extraction'

I looked for a feature_extraction file in utils and couldn't find it, so I'm not sure what to do.

Thanks!

Unable to download 4.1.0 ZIP

When I try to download https://github.com/amir-zeldes/gum/archive/V4.1.0.zip, the download always gets stuck at 73400320 bytes. I tried from two different machines with different ISPs.

However, https://github.com/amir-zeldes/gum/archive/V4.1.0.tar.gz seems to download ok.

Plain text paragraphs

Is there any way to access the plain text paragraphs for the corpus data? Or would it be best to reconstruct this data using the XML files?
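
If the XML route is taken, a rough sketch could look like the following (assuming the _build/src/xml format with one tab-separated token line per word and <p> elements marking paragraphs; note that fused tokens inside <w> would still come out space-separated here):

from xml.etree import ElementTree as ET

def paragraphs(xml_path):
    """Yield one whitespace-joined token string per <p> element."""
    tree = ET.parse(xml_path)
    for p in tree.iter("p"):
        tokens = []
        for chunk in p.itertext():               # walks nested <s>, <w>, etc. in order
            for line in chunk.splitlines():
                line = line.strip()
                if line:
                    tokens.append(line.split("\t")[0])   # first column is the word form
        yield " ".join(tokens)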

Possibly inconsistent lemmatization of longer

Someone's life isn't much longer, lemma is "long":

# text = Alternatively, if life has not been good to you so far, (and considering your current situation this seems more likely) consider how lucky you are that it will not be troubling you much longer.
38      longer  long    ADV     RBR     Degree=Cmp      35      advmod  _       SpaceAfter=No

but cleaning things takes longer, lemma is "longer":

# text = The gingerbread batter bowl could wait a day, even though it would take longer to clean tomorrow.
15      longer  longer  ADV     RBR     Degree=Cmp      14      advmod  _       _

People's rides are longer, lemma is "longer":

# text = This makes everyone’s ride on the elevator longer, if only for a few seconds.
9       longer  longer  ADJ     JJR     Degree=Cmp      2       xcomp   _       SpaceAfter=No

but someone's limbs are longer, lemma is "long":

# text = Longer limbs are expected to experience larger bending and torsional moments, so the fact that experimental animals had longer femora suggests that limb verticalization reduces these moments by orienting the bone more parallel to the GRF line of action.
1       Longer  long    ADJ     JJR     Degree=Cmp      2       amod    _       Discourse=cause:65->69|Entity=(object-112
20      longer  long    ADJ     JJR     Degree=Cmp      21      amod    _       Entity=(object-115

Fix RST EDUs spanning more than one sentence (based on s spans)

@loganpeng1992 found these cases as of GUM4 (ideally, finding these should be built into the validation; a rough sketch of such a check follows the table):

doc_id	line_id	rst_id	parent_id	relname	text
GUM_news_defector	23-24	28	54	sequence	Rathbun stated , " Daniel was picked up by an investigator in a black car with blacked out windows .
GUM_news_iodine	32-33	40	86	span	" Pregnant women in Australia are getting about half as much as what they require on a daily basis . So that alarms me , because there 's quite serious potential for adverse effects and brain damage in the next generation of children born in this country , " he said .
GUM_interview_peres	44-45	55	144	joint	No , no . Forget memory .
GUM_interview_dungeon	40-41	44	109	span	So it changes , so one year we got to be in the Press , but they have now stopped adult ads . Completely .
GUM_interview_dungeon	54-55	58	132	span	Because I want someone who 's going to treat people nicely and well . We have --
GUM_interview_dungeon	70-71	74	121	span	Be careful . He 'll make you fill one out .
GUM_interview_dungeon	72-73	75	74	concession	You 'd be scared . You would n't hire me .
GUM_interview_dungeon	91-92	94	95	solutionhood	Do you ask for references ?
GUM_news_sensitive	26-27	30	49	joint	Peter Zimonjic . " MP calls for probe into misplaced documents " — Edmonton Sun , August 16 , 2008
GUM_interview_gaming	15-16	25	82	span	And one of our friends , our common friends , he introduced us during study hall ,
GUM_interview_gaming	17-18	26	25	result	and we just kind of hit it off from there .
GUM_voyage_cleveland	5-6	5	47	joint	Plus , this region ranks fifth in the nation in number of major cultural resources per one million residents . Understand
GUM_academic_thrones	1-2	1	76	preparation	Re(a)d Wedding : A Comparative Discourse Analysis of Fan Responses to Game of Thrones
GUM_interview_herrick	32-33	37	36	elaboration	( http://www.wikihow.com/wikiHow:Carbon-Neutral ) Our community culture is focused on wikiLove and civility .
GUM_interview_libertarian	8-9	8	60	elaboration	With Wikinews , Sarvis discusses his background , views on McDonnell 's tenure , keys to campaign success , plan to implement his agenda , and the former Virginia governor he most admires . Interview
GUM_news_korea	5-7	5	26	joint	" War maniacs of the South Korean puppet military made another grave provocation to the DPRK in the central western sector of the front on Thursday afternoon . They perpetrated such reckless action as firing 36 shells at KPA civil police posts under the absurd pretext that the KPA fired one shell at the south side . Six shells of them hit the area near KPA civil police posts 542 and 543 and other 15 shells fell near KPA civil police posts 250 and 251 " , said KCNA .
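
A possible shape for that validation check (a sketch only; edu_spans and sent_spans are assumed to be lists of inclusive (start, end) token-index pairs built from the RST and XML layers):

def cross_sentence_edus(edu_spans, sent_spans):
    """Return ids of EDUs whose token span is not contained in any single sentence."""
    bad = []
    for edu_id, (e_start, e_end) in enumerate(edu_spans, start=1):
        if not any(s_start <= e_start and e_end <= s_end
                   for s_start, s_end in sent_spans):
            bad.append(edu_id)
    return bad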

GUM_vlog_pregnant: trinary RST branch in lisp binary

If I'm reading lisp_binary/GUM_vlog_pregnant.dis right, there's a trinary branch here:

                      ( Nucleus (span 102 103) (rel2par joint)
                        ( Nucleus (leaf 102) (rel2par span) (text _!and you just feel awful_!) )
                        ( Satellite (leaf 103) (rel2par circumstance) (text _!thinking ._!) )
                        ( Satellite (span 104 107) (rel2par elaboration)
                          ( Nucleus (span 104 106) (rel2par span)
                            ( Nucleus (span 104 105) (rel2par same-unit)
                              ( Nucleus (leaf 104) (rel2par span) (text _!The thought_!) )
                              ( Satellite (leaf 105) (rel2par elaboration) (text _!of telling someone_!) )
                            )
                            ( Nucleus (leaf 106) (rel2par same-unit) (text _!makes me want to burst into tears ._!) )
                          )
                          ( Satellite (leaf 107) (rel2par elaboration) (text _!All these things going onto my head ._!) )
                        )
                      )

Maybe there's something I'm misunderstanding? Janet and I came across this while doing something unrelated.

error thrown by build_gum.py

Hi, thanks for this great resource!
A couple of comments on running python build_gum.py.
First, the user needs to cd into the _build directory, otherwise _build/src is not found from the root dir; maybe this could be mentioned in the README?
Second, there is an encoding error from utils/stanford2ud.py (using Python 2 on Mac 10.13). The fix is to add the following two lines at the top of the file:

#!/usr/bin/python
# -*- coding: utf-8 -*-

As in https://www.python.org/dev/peps/pep-0263
best, Andrew

Add metadata to _build/src/xml

Key-value pairs should be added to the <text> tag (a sketch of attaching them follows the list). The names and values to use are:

  • sourceURL: (a URL)
  • dateCreated: (ISO date, e.g. 2017-06-13)
  • dateCollected: {2014-09-15, 2015-09-21, 2016-09-19}
  • dateModified: (from history, in ISO format; not newer than the date collected)
  • speakerCount: e.g. 2 (0 if no sp tags)
  • speakerList: e.g. #Wikinews, #WilliamEvans, or none (if no sp tags)
  • author:
    • Wikinews for interview and news.
    • For voyage and whow, first 3 listed authors, then 'and others (see URL)':
      • Jdlrobson, PsamatheM, Ikan Kekek and others (see URL)
      • Khadijah, Nick Geisler, Mohil Khare and others (see URL)
  • title: (usually in first element)
    • How to Make a Glowstick
    • York
  • shortTitle: (last part of the text id, after the last underscore)
    • glowstick
    • york
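
A rough sketch of attaching such key-value pairs as attributes on <text> (illustrative only; the file name and values below are placeholders, and a rewritten file will not necessarily preserve the repository's original formatting):

from xml.etree import ElementTree as ET

def add_metadata(xml_path, meta):
    """Set the given key-value pairs as attributes on the <text> element."""
    tree = ET.parse(xml_path)
    root = tree.getroot()
    text_el = root if root.tag == "text" else root.find(".//text")
    for key, value in meta.items():
        text_el.set(key, value)
    tree.write(xml_path, encoding="utf-8")

add_metadata("GUM_whow_glowstick.xml", {     # hypothetical document and values
    "dateCollected": "2014-09-15",
    "title": "How to Make a Glowstick",
    "shortTitle": "glowstick",
})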

ratchet not tagged correctly

sentence: Basically, we do not give a fuck we gonna talk about shit what now, but we go keep it ratchet.
start: 383   end: 389   term_id: 98   token_id: 22
token: ratchet   lemma: ratchet   upos: NOUN   xpos: NN   feats: Number=Sing
(head_token_id, dep_rel, deps and misc columns were not included)

Here, ratchet is tagged as a noun but is actually used as an adjective. Is there a way to update tagging for slang terms? Thanks very much!

Revise date dependencies

Previous date guidelines for dependencies had:

  • The month is a dependent of the day, attached as nn
  • The year is a dependent of the month, attached as tmod

Revise all dates in dependency files to:

  • The month is still a dependent of the day, but now attached as tmod
  • The year is a dependent of the day, still as tmod
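
For example, under the revised scheme a date like "January 5, 2017" has the day "5" as its head, with both "January" and "2017" attached to "5" as tmod.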

Malnested entity spans

Usually the result of punctuation slipping into or out of a span. About 13 cases, findable with the following ANNIS query (it matches two overlapping entities that each extend past the other's edge, i.e. crossing spans):

e1#entity _o_ e2#entity & ltok#tok & rtok#tok &
#ltok . #e2 & #e1 . #rtok & #e1 _i_ #ltok & #e2 _i_ #rtok

Build error: No module named 'utils.rst2dep'

When building GUM6 from a fresh clone of master, process_reddit.py works fine but I encounter the following error on running _build/build_gum.py:

» python _build/build_gum.py
====================
Validating files...
====================

Found reddit source data
Including reddit data in build
o Found 148 documents
o File names match
o Token counts match across directories
o 148 documents pass XSD validation
Traceback (most recent call last):
  File "_build/build_gum.py", line 95, in <module>
    from utils.repair_rst import fix_rst
  File "/[...]/gum/_build/utils/repair_rst.py", line 6, in <module>
    from .rst2dep import make_rsd
ModuleNotFoundError: No module named 'utils.rst2dep'
