allenai / pybart
Converter from UD-trees to BART representation
Home Page: https://allenai.github.io/pybart/
License: Apache License 2.0
Hello,
I found your paper very interesting and awesome, so thank you for this work.
I launched the PyBART demo on the sentence: I like tea and you coffee.
But instead of adding an nsubj edge between like and you, it added two obj edges from like, one to you and one to coffee.
I'm not sure, but it seems you don't handle predicate ellipsis (missing copula or aux) in parallel structures, which is mentioned in the paper.
Or is it because the conjunction propagation rule is applied before the ellipsis rule?
Thanks
Hi, I am trying to run PyBART in Google Colab like this:
import spacy
from pybart.api import *

# Load a UD-based English model
nlp = spacy.load("en_ud_model_sm")  # you can change it to sm/md/lg as you prefer

# Add the BART converter to spaCy's pipeline
nlp.add_pipe("pybart_spacy_pipe", last=True, config={'remove_extra_info': True})  # an empty config gives the default behavior; this is just an example

# Test the new converter component
doc = nlp("He saw me while driving")
for i, sent in enumerate(doc._.get_pybart()):
    print(f"Sentence {i}")
    for edge in sent:
        print(f'{edge["head"]} --{edge["label"]}--> {edge["tail"]}')
And I get this error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-7-45a9ba268fe4> in <module>()
1 import spacy
----> 2 from pybart.api import *
3
4 # Load a UD-based english model
5 nlp = spacy.load("en_ud_model_sm") # here you can change it to md/sm/lg as you preffer
2 frames
/usr/local/lib/python3.7/dist-packages/pybart/__init__.py in <module>()
1 name = "pybart"
2
----> 3 from . import api
/usr/local/lib/python3.7/dist-packages/pybart/api.py in <module>()
4 from .converter import Convert, get_conversion_names as inner_get_conversion_names, init_conversions
5 from spacy.language import Language
----> 6 from .spacy_wrapper import parse_spacy_sent, enhance_to_spacy_doc
7
8
/usr/local/lib/python3.7/dist-packages/pybart/spacy_wrapper.py in <module>()
5
6 from spacy.tokens import Doc, Token as SpacyToken
----> 7 from spacy.tokens.graph import Graph
8
9 from .graph_token import Token, add_basic_edges, TokenId
ModuleNotFoundError: No module named 'spacy.tokens.graph'
---------------------------------------------------------------------------
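For what it's worth, the missing module spacy.tokens.graph (with its Graph class) only exists in spaCy 3.1 and later, so this error suggests the Colab runtime has an older spaCy. A minimal check (my own sketch, not part of pybart):

```python
import importlib.util

def has_spacy_graph() -> bool:
    """Return True if spacy.tokens.graph is importable (spaCy >= 3.1)."""
    try:
        return importlib.util.find_spec("spacy.tokens.graph") is not None
    except ModuleNotFoundError:
        # spaCy itself is not installed
        return False
```

If this returns False, upgrading spaCy to 3.1 or later (and reinstalling the UD model) may resolve the import error.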
Searching pubmed for the [$ risk] [$ factors] for [$ stroke] are [s something] yields the result below.
The parser in pubmed is a home-trained version of scispacy, and the converter is the most recent version (2.2.0) with remove_node_adding_conversions=True, remove_extra_info=True, and conv_iterations=1 (other parameters have default values).
As you can see, only one of the risk factors is captured, instead of all 3. The reason is that the nsubj edge which connects factors and stenosis doesn't exist for the brother nodes of stenosis: fibrillation and hypertension.
@aryehgigi tells me that the relevant propagation would occur if we were using the STATE node, but since we're using remove_node_adding_conversions (and given some other characteristic of the sentence), it fails to occur.
My expectation is that in copula constructions, any propagation that happens when STATE is added should happen also when it's not (e.g. by rearranging the tree to use the copula word instead of STATE, or through other means). This is quite critical since copula constructions are very common and basic in many relations.
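To illustrate the expectation, here is a minimal sketch (hypothetical post-processing over (head, label, dependent) triplets, not pybart's actual code) of copying an nsubj edge from a conjunction head to its conj siblings:

```python
def propagate_subject(edges):
    """Copy each nsubj edge from a conjunction head to its conj siblings.

    edges: list of (head, label, dependent) triplets.
    """
    # Collect the conj siblings of each conjunction head
    siblings = {}
    for head, label, dep in edges:
        if label.startswith("conj"):
            siblings.setdefault(head, []).append(dep)
    out = list(edges)
    for head, label, dep in edges:
        if label == "nsubj":
            for sib in siblings.get(head, []):
                if (sib, "nsubj", dep) not in out:
                    out.append((sib, "nsubj", dep))
    return out

# stenosis, fibrillation and hypertension are conjoined predicates;
# only stenosis carries the nsubj edge to "factors" in the parse.
edges = [
    ("stenosis", "nsubj", "factors"),
    ("stenosis", "conj", "fibrillation"),
    ("stenosis", "conj", "hypertension"),
]
```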
taken from closed issue: #16
Interaction of elaboration clauses with adjectival modifiers causes two modifications to form a (probably) wrong relation.
e.g.: "The infection causes a clinical syndrome, like infertility",
@yoavg, wdyt is the best way to handle it?
I don't want to be too specific (adding an "if" statement to the elaboration-modification to prevent propagating things that came specifically from an amod modification), because then we will miss other possible wrong interactions;
and not too general (adding an "if" statement to prevent propagating things that came from any previous modification done by us), because then we will miss possible correct interactions.
In constructions such as x helped to create y, x helped creating y, and x was demonstrated to create y, pyBART makes x the nsubj of create/creating (which is great!). However, there is one (apparently common) construction where this fails: x was created using y, where pybart marks x as the subject of using.
My proposal is to special-case the word using out of this rule.
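The proposed special-casing could be as simple as a lexical guard. A hypothetical sketch (BLOCKED_VERBS and should_propagate_subject are my own names, not pybart's):

```python
# Verbs for which subject propagation into the embedded clause should be
# skipped, per the proposal above.
BLOCKED_VERBS = {"using"}

def should_propagate_subject(embedded_verb: str) -> bool:
    """Return False for verbs like 'using', where x is not the true subject."""
    return embedded_verb.lower() not in BLOCKED_VERBS
```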
In a group of activists including Jack and John, pyBART adds several nmod:of edges (as it should), but doesn't add the corresponding case edges to of.
In it was founded by her brother, John, pyBART adds an nmod:agent edge to John (as it should), but doesn't add the case edge from John to by.
Hi
Nice work!
I encountered an issue when I tried to use your conllu-to-odin format converter on pybart parses.
I added the BART pipe to the spaCy parser's pipeline, after creating an instance of your converter:
converter = Converter()
nlp.add_pipe(converter, name="BART")
Then, I got the parsing of my sentence.
doc = nlp(sentence)
Finally, I used the conllu_to_odin function, like so:
odin_formated_doc = cw.conllu_to_odin(doc, is_basic=True, push_new_to_end=False)
This resulted in an error, which I couldn't resolve.
Maybe doc isn't the right input for conllu_to_odin?
reproduce: "The sheriff was shot by Bob." (in no-extra-label-info mode)
passive-alteration creates a new subject relation between "shot" and "Bob".
But subject correction mistakenly corrects it to a passive subject.
Options for solutions:
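One possible direction (a hypothetical sketch, not pybart's code): tag each edge with the conversion that produced it, and let subject correction rewrite only edges that came from the original parse, leaving the new nsubj added by passive-alteration alone. (The check for the verb actually being passive is omitted here for brevity.)

```python
def correct_subject(label: str, source: str) -> str:
    """Rewrite nsubj to nsubj:pass only for edges from the original parse,
    skipping edges added by the passive-alteration conversion."""
    if label == "nsubj" and source == "parse":
        return "nsubj:pass"
    return label
```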
Hi,
Thanks for sharing this valuable package.
Could you provide some hints how pybart representation can be indexed and retrieved via odinson please?
Is there any tool to convert pybart representation to odinson documents?
Hi,
Some sentences seem to crash PyBart when config={'enhanced_extra': True, 'enhanced_plus_plus': True}.
For instance, running:
"London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role".
Will result in the error:
London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/process_data.py", line 198, in
squad()
File "/Users/royi/missing_elements_exploration/process_data.py", line 105, in squad
doc = pybart(sent_)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/language.py", line 995, in __call__
error_handler(name, proc, [doc], e)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/util.py", line 1498, in raise_error
raise e
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/language.py", line 990, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/api.py", line 52, in __call__
converted_sents, convs_done = convert_spacy_doc(doc, *self.config, self.conversions)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/api.py", line 41, in convert_spacy_doc
enhance_to_spacy_doc(doc, converted)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/spacy_wrapper.py", line 94, in enhance_to_spacy_doc
orig_doc._.parent_graphs_per_sent.append(Graph(orig_doc, name="pybart", nodes=nodes, edges=edges, labels=labels))
File "spacy/tokens/graph.pyx", line 402, in spacy.tokens.graph.Graph.__init__
File "spacy/tokens/graph.pyx", line 71, in spacy.tokens.graph.Node.__init__
IndexError: Node index 34 out of bounds (34)
I am using the Spacy pipeline and "en_model_trf". If you try different models, different sentences will crash PyBart.
Also, for some sentences, I get different dependency graphs in the demo and my system. Is it possible to provide the configuration and model you use for the demo version?
Thanks
As part of an org-wide move from CircleCI to GitHub Actions (see https://github.com/allenai/beaker/issues/1866), we should start using GitHub Actions for our linting and tests, and retire CircleCI.
In conjunctions where the head of the conjunction has an amod parent, the conjunction's children should have a direct connection to that amod parent as well.
So in A gastrointestinal origin of chest pain is not infrequent and may be due to oesophageal, gastric or biliary disease, we should have a direct connection between gastric and disease, and between biliary and disease.
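The desired behavior can be sketched as follows (hypothetical post-processing over (head, label, dependent) triplets, not pybart's actual code): copy the amod edge from the first conjunct to its conj siblings.

```python
def propagate_amod(edges):
    """Copy an amod edge from the first conjunct to its conj siblings.

    edges: list of (head, label, dependent) triplets.
    """
    conj = {}
    for head, label, dep in edges:
        if label.startswith("conj"):
            conj.setdefault(head, []).append(dep)
    out = list(edges)
    for head, label, dep in edges:
        if label == "amod":
            # The modified head should also govern each conj sibling
            for sib in conj.get(dep, []):
                if (head, "amod", sib) not in out:
                    out.append((head, "amod", sib))
    return out

# "oesophageal, gastric or biliary disease": only oesophageal carries amod
edges = [
    ("disease", "amod", "oesophageal"),
    ("oesophageal", "conj", "gastric"),
    ("oesophageal", "conj", "biliary"),
]
```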
Consider this sentence, focusing on the word including:
PPV infection mainly causes reproductive clinical syndrome , including infertility , abortion , stillbirth , neonatal death , and reduced neonatal vitality
If it were like, we would get infertility, abortion, etc. as dobj of causes. I think we should extend this behavior also to including, wdyt?
Another thing is that in the case of like, we get, in addition to the dobj links, also nsubj links. Why do we get these?
Hey!
When using the model "en_ud_model_trf" and the following sentence:
She was convicted of selling unregistered securities in Florida, and of unlawful phone calls in Ohio.
PyBart produces the following error:
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/test.py", line 41, in
print([doc[t.i].text for t in edge.head.tokens], f" --{edge.label}-> ",
File "spacy/tokens/graph.pyx", line 119, in spacy.tokens.graph.Node.tokens.__get__
File "spacy/tokens/doc.pyx", line 461, in spacy.tokens.doc.Doc.__getitem__
File "spacy/tokens/token.pxd", line 23, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 19, max length 18.
The error can also be reproduced with the sentence mentioned in a previous issue:
London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role.
With the traceback:
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/test.py", line 42, in
print([doc[t.i].text for t in edge.head.tokens], f" --{edge.label}-> ",
File "spacy/tokens/graph.pyx", line 119, in spacy.tokens.graph.Node.tokens.__get__
File "spacy/tokens/doc.pyx", line 460, in spacy.tokens.doc.Doc.__getitem__
File "spacy/tokens/doc.pyx", line 45, in spacy.tokens.doc.bounds_check
IndexError: [E026] Error accessing token at position 39: out of bounds in Doc of length 34.
I am running version 3.2.2.
Also, the error does not occur if I use "en_core_web_trf"; however, in that case I get a different graph from what the PyBart demo produces.
Thanks
Should test the functionality of the new Matcher (#23) and include:
[37/37] For each conversion-function:
- Simplify the 'change-graph' code, and try to move part of it to the 'match-code' (restrictions)
- Change from my current restriction to the new matcher constraint (see #23), and extract it out of the functions
- Add as much documentation as possible
- Mark English-specific and UD-v1-specific code segments
The doc._.parent_graphs_per_sent parameter is not clear, and the code snippet using it in the readme is voodoo. It should be replaced with a method that user code can call to retrieve something more meaningful, such as triplets of words and the label connecting them (note: don't return the spaCy Tokens themselves, because we might add tokens that can't be referenced in the original Doc).
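For illustration, the replacement could look something like this (get_triplets is a hypothetical helper, not an existing pybart API), built on the edge dicts that doc._.get_pybart() already yields in the readme example:

```python
def get_triplets(sent_edges):
    """Turn pybart edge dicts into plain (head, label, tail) string triplets.

    sent_edges: iterable of dicts with 'head', 'label' and 'tail' keys,
    as produced per sentence in the readme's doc._.get_pybart() loop.
    """
    return [(e["head"], e["label"], e["tail"]) for e in sent_edges]
```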
related to #35
Is issue #9 resolved?
We encounter problems related to it in the COVID dataset (and I suspect also in pubmed): many searches that should match multiple parts of a conjunction match only the first one.
The propagation of nmod actually works very well, but without also propagating the case nodes, this is just not captured.
@aryehgigi, if this is already fixed and case nodes are being propagated, then we need to verify with @hillelt why it is not applied in SPIKE, and what should be changed for this to apply there.
(And if it is not fixed, please try to fix it relatively soon; this is an important feature/bug.)
depends on #24
As @yoavg said:
In cases where we propagate nmod, we should also propagate the 'case' edge.
For example, in He saw them in the air, ground and water, we propagate nmod through the conjunction and get:
saw <nmod ground
saw <nmod water
But I think we should also propagate the case, and have:
ground <case in
water <case in
This shows up in SPIKE, where without this case propagation a pattern like:
we [v saw] them [$ in] the [loc car]
will only catch loc=air, and not loc=ground or loc=water, while the pattern:
we [v saw] them in the [loc car]
(without the case restriction) will capture also water and ground.
I will just note that this is a fix to Stanford's conj-propagation behavior and not to our own conversion (in which, when we have two verbs in a conjunction with a modifier on only one, we share that modifier between the verbs; here the conjunction is between multiple modifiers, and Stanford's conversion states that the head of the first modifier should be shared).
Anyway, I commented inside the code that this is not part of their original behavior, but our own add-on.
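The requested case propagation can be sketched like this (hypothetical post-processing over (head, label, dependent) triplets, not our actual converter code):

```python
def propagate_case(edges):
    """Copy the case child of a conjunction head to each of its conjuncts."""
    out = list(edges)
    # case dependent of each token, e.g. air -> in
    case_of = {head: dep for head, label, dep in edges if label == "case"}
    for head, label, dep in edges:
        if label.startswith("conj") and head in case_of:
            if (dep, "case", case_of[head]) not in out:
                out.append((dep, "case", case_of[head]))
    return out

# "in the air, ground and water": only air has the case edge to "in"
edges = [
    ("air", "case", "in"),
    ("air", "conj", "ground"),
    ("air", "conj", "water"),
]
```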