allenai / pybart
Converter from UD-trees to BART representation
Home Page: https://allenai.github.io/pybart/
License: Apache License 2.0
Hello,
I found your paper very interesting and awesome, so thank you for this work.
I launched the PyBART demo on the sentence: I like tea and you coffee.
But instead of adding an nsubj edge between like and you, it added two obj edges from like, one to you and one to coffee.
I'm not sure, but it seems you don't handle predicate ellipsis (missing copula or aux) in parallel structures, which is mentioned in the paper.
Or is it because the conjunction propagation rule is applied before the ellipsis rule?
Thanks
Hi, I am trying to run PyBART in Google Colab like this:
import spacy
from pybart.api import *

# Load a UD-based English model
nlp = spacy.load("en_ud_model_sm")  # you can change it to sm/md/lg as you prefer

# Add the BART converter to spaCy's pipeline
nlp.add_pipe("pybart_spacy_pipe", last=True, config={'remove_extra_info': True})  # an empty config gives the default behavior; this is just an example

# Test the new converter component
doc = nlp("He saw me while driving")
for i, sent in enumerate(doc._.get_pybart()):
    print(f"Sentence {i}")
    for edge in sent:
        print(f'{edge["head"]} --{edge["label"]}--> {edge["tail"]}')
And I get this error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-7-45a9ba268fe4> in <module>()
1 import spacy
----> 2 from pybart.api import *
3
4 # Load a UD-based english model
5 nlp = spacy.load("en_ud_model_sm") # here you can change it to md/sm/lg as you preffer
2 frames
/usr/local/lib/python3.7/dist-packages/pybart/__init__.py in <module>()
1 name = "pybart"
2
----> 3 from . import api
/usr/local/lib/python3.7/dist-packages/pybart/api.py in <module>()
4 from .converter import Convert, get_conversion_names as inner_get_conversion_names, init_conversions
5 from spacy.language import Language
----> 6 from .spacy_wrapper import parse_spacy_sent, enhance_to_spacy_doc
7
8
/usr/local/lib/python3.7/dist-packages/pybart/spacy_wrapper.py in <module>()
5
6 from spacy.tokens import Doc, Token as SpacyToken
----> 7 from spacy.tokens.graph import Graph
8
9 from .graph_token import Token, add_basic_edges, TokenId
ModuleNotFoundError: No module named 'spacy.tokens.graph'
---------------------------------------------------------------------------
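For what it's worth, the missing module spacy.tokens.graph (with its Graph class) only exists in spaCy 3.1 and later, so this error suggests the Colab runtime has an older spaCy. A minimal check (my own sketch, not part of pybart):

```python
import importlib.util

def has_spacy_graph() -> bool:
    """Return True if spacy.tokens.graph is importable (spaCy >= 3.1)."""
    try:
        return importlib.util.find_spec("spacy.tokens.graph") is not None
    except ModuleNotFoundError:
        # spaCy itself is not installed
        return False
```

If this returns False, upgrading spaCy to 3.1 or later (and reinstalling the UD model) may resolve the import error.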
Searching pubmed for the [$ risk] [$ factors] for [$ stroke] are [s something] yields the result below.
The parser in pubmed is a home-trained version of scispacy, and the converter is the most recent version (2.2.0) with remove_node_adding_conversions=True, remove_extra_info=True, and conv_iterations=1 (other parameters have default values).
As you can see, only one of the risk factors is captured, instead of all 3. The reason is that the nsubj edge which connects factors and stenosis doesn't exist for the brother nodes of stenosis: fibrillation and hypertension.
@aryehgigi tells me that the relevant propagation would occur if we were using the STATE node, but since we're using remove_node_adding_conversions (and given some other characteristic of the sentence), it fails to occur.
My expectation is that in copula constructions, any propagation that happens when STATE is added should happen also when it's not (e.g. by rearranging the tree to use the copula word instead of STATE, or through other means). This is quite critical since copula constructions are very common and basic in many relations.
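To illustrate the expectation, here is a minimal sketch (hypothetical post-processing over (head, label, dependent) triplets, not pybart's actual code) of copying an nsubj edge from a conjunction head to its conj siblings:

```python
def propagate_subject(edges):
    """Copy each nsubj edge from a conjunction head to its conj siblings.

    edges: list of (head, label, dependent) triplets.
    """
    # Collect the conj siblings of each conjunction head
    siblings = {}
    for head, label, dep in edges:
        if label.startswith("conj"):
            siblings.setdefault(head, []).append(dep)
    out = list(edges)
    for head, label, dep in edges:
        if label == "nsubj":
            for sib in siblings.get(head, []):
                if (sib, "nsubj", dep) not in out:
                    out.append((sib, "nsubj", dep))
    return out

# stenosis, fibrillation and hypertension are conjoined predicates;
# only stenosis carries the nsubj edge to "factors" in the parse.
edges = [
    ("stenosis", "nsubj", "factors"),
    ("stenosis", "conj", "fibrillation"),
    ("stenosis", "conj", "hypertension"),
]
```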
taken from closed issue: #16
Interaction of elaboration clauses with adjectival modifiers causes two modifications to form a (probably) wrong relation.
e.g.: "The infection causes a clinical syndrome, like infertility",
@yoavg, wdyt is the best way to handle it?
I don't want to be too specific (adding an "if" statement to the elaboration-modification to prevent propagating things that came specifically from an amod modification), because then we will miss other possible wrong interactions;
and not too general (adding an "if" statement to prevent propagating things that came from any previous modification done by us), because then we will miss possible correct interactions.
In constructions such as x helped to create y, x helped creating y, and x was demonstrated to create y, pyBART makes x the nsubj of create/creating (which is great!). However, there is one (apparently common) construction where this fails: x was created using y, where pybart marks x as the subject of using.
My proposal is to special-case the word using out of this rule.
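The proposed special-casing could be as simple as a lexical guard. A hypothetical sketch (BLOCKED_VERBS and should_propagate_subject are my own names, not pybart's):

```python
# Verbs for which subject propagation into the embedded clause should be
# skipped, per the proposal above.
BLOCKED_VERBS = {"using"}

def should_propagate_subject(embedded_verb: str) -> bool:
    """Return False for verbs like 'using', where x is not the true subject."""
    return embedded_verb.lower() not in BLOCKED_VERBS
```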
In a group of activists including Jack and John, pyBART adds several nmod:of edges (as it should), but doesn't add the corresponding case edges to of.
In it was founded by her brother, John, pyBART adds an nmod:agent edge to John (as it should), but doesn't add the case edge from John to by.
Hi
Nice work!
I encountered an issue when I tried to use your conllu-to-odin format converter on pybart parses.
I added the BART pipe to the spaCy parser's pipeline, after creating an instance of your converter:
converter = Converter()
nlp.add_pipe(converter, name="BART")
Then, I got the parsing of my sentence.
doc = nlp(sentence)
Finally, I used the conllu_to_odin function, like so:
odin_formated_doc = cw.conllu_to_odin(doc, is_basic=True, push_new_to_end=False)
This resulted in an error, which I couldn't resolve.
Maybe doc isn't the right input for conllu_to_odin?
reproduce: "The sheriff was shot by Bob." (in no-extra-label-info mode)
passive-alteration creates a new subject relation between "shot" and "Bob".
But subject correction mistakenly corrects it to a passive subject.
Options for solutions:
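One possible direction (a hypothetical sketch, not pybart's code): tag each edge with the conversion that produced it, and let subject correction rewrite only edges that came from the original parse, leaving the new nsubj added by passive-alteration alone. (The check for the verb actually being passive is omitted here for brevity.)

```python
def correct_subject(label: str, source: str) -> str:
    """Rewrite nsubj to nsubj:pass only for edges from the original parse,
    skipping edges added by the passive-alteration conversion."""
    if label == "nsubj" and source == "parse":
        return "nsubj:pass"
    return label
```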
Hi,
Thanks for sharing this valuable package.
Could you provide some hints how pybart representation can be indexed and retrieved via odinson please?
Is there any tool to convert pybart representation to odinson documents?
Hi,
Some sentences seem to crash PyBart when config={'enhanced_extra': True, 'enhanced_plus_plus': True}.
For instance, running:
"London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role".
Will result in the error:
London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/process_data.py", line 198, in
squad()
File "/Users/royi/missing_elements_exploration/process_data.py", line 105, in squad
doc = pybart(sent_)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/language.py", line 995, in __call__
error_handler(name, proc, [doc], e)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/util.py", line 1498, in raise_error
raise e
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/spacy/language.py", line 990, in __call__
doc = proc(doc, **component_cfg.get(name, {}))
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/api.py", line 52, in __call__
converted_sents, convs_done = convert_spacy_doc(doc, *self.config, self.conversions)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/api.py", line 41, in convert_spacy_doc
enhance_to_spacy_doc(doc, converted)
File "/Users/royi/RLProjcOURSE/venv/lib/python3.8/site-packages/pybart/spacy_wrapper.py", line 94, in enhance_to_spacy_doc
orig_doc._.parent_graphs_per_sent.append(Graph(orig_doc, name="pybart", nodes=nodes, edges=edges, labels=labels))
File "spacy/tokens/graph.pyx", line 402, in spacy.tokens.graph.Graph.__init__
File "spacy/tokens/graph.pyx", line 71, in spacy.tokens.graph.Node.__init__
IndexError: Node index 34 out of bounds (34)
I am using the Spacy pipeline and "en_model_trf". If you try different models, different sentences will crash PyBart.
Also, for some sentences, I get different dependency graphs in the demo and my system. Is it possible to provide the configuration and model you use for the demo version?
Thanks
As part of an org-wide move from CircleCI to GitHub Actions (see https://github.com/allenai/beaker/issues/1866), we should start using GitHub Actions for our linting and tests, and retire CircleCI.
In conjunctions where the head of the conjunction has an amod parent, the conjunction's children should have a direct connection to that amod parent as well.
So in A gastrointestinal origin of chest pain is not infrequent and may be due to oesophageal, gastric or biliary disease, we should have a direct connection between gastric and disease, and between biliary and disease.
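The desired behavior can be sketched as follows (hypothetical post-processing over (head, label, dependent) triplets, not pybart's actual code): copy the amod edge from the first conjunct to its conj siblings.

```python
def propagate_amod(edges):
    """Copy an amod edge from the first conjunct to its conj siblings.

    edges: list of (head, label, dependent) triplets.
    """
    conj = {}
    for head, label, dep in edges:
        if label.startswith("conj"):
            conj.setdefault(head, []).append(dep)
    out = list(edges)
    for head, label, dep in edges:
        if label == "amod":
            # The modified head should also govern each conj sibling
            for sib in conj.get(dep, []):
                if (head, "amod", sib) not in out:
                    out.append((head, "amod", sib))
    return out

# "oesophageal, gastric or biliary disease": only oesophageal carries amod
edges = [
    ("disease", "amod", "oesophageal"),
    ("oesophageal", "conj", "gastric"),
    ("oesophageal", "conj", "biliary"),
]
```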
Consider this sentence, focusing on the word including:
PPV infection mainly causes reproductive clinical syndrome , including infertility , abortion , stillbirth , neonatal death , and reduced neonatal vitality
If it were like, we would get infertility, abortion, etc. as dobj of causes. I think we should extend this behavior also to including, wdyt?
Another thing is that in the case of like, we get, in addition to the dobj links, also nsubj links. Why do we get these?
Hey!
When using the model "en_ud_model_trf" and the following sentence:
She was convicted of selling unregistered securities in Florida, and of unlawful phone calls in Ohio.
PyBart produces the following error:
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/test.py", line 41, in
print([doc[t.i].text for t in edge.head.tokens], f" --{edge.label}-> ",
File "spacy/tokens/graph.pyx", line 119, in spacy.tokens.graph.Node.tokens.__get__
File "spacy/tokens/doc.pyx", line 461, in spacy.tokens.doc.Doc.__getitem__
File "spacy/tokens/token.pxd", line 23, in spacy.tokens.token.Token.cinit
IndexError: [E040] Attempt to access token at 19, max length 18.
The error can also be reproduced with the sentence mentioned in a previous issue:
London is home to many museums, galleries, and other institutions, many of which are free of admission charges and are major tourist attractions as well as playing a research role.
With the traceback:
Traceback (most recent call last):
File "/Users/royi/missing_elements_exploration/test.py", line 42, in
print([doc[t.i].text for t in edge.head.tokens], f" --{edge.label}-> ",
File "spacy/tokens/graph.pyx", line 119, in spacy.tokens.graph.Node.tokens.__get__
File "spacy/tokens/doc.pyx", line 460, in spacy.tokens.doc.Doc.__getitem__
File "spacy/tokens/doc.pyx", line 45, in spacy.tokens.doc.bounds_check
IndexError: [E026] Error accessing token at position 39: out of bounds in Doc of length 34.
I am running version 3.2.2.
Also, the error does not occur if I use "en_core_web_trf"; however, in that case I get a different graph from what the PyBart demo produces.
Thanks
Should test the functionality of the new Matcher (#23) and include:
[37/37] For each conversion-function:
- Simplify the 'change-graph' code, and try to move part of it to the 'match-code' (restrictions)
- Change from my current restriction to the new matcher constraint (see #23), and extract it out of the functions
- Add as much documentation as possible
- Mark English-specific and UD-v1-specific code segments
The doc._.parent_graphs_per_sent parameter is not clear, and the code snippet using it in the readme is voodoo. It should be replaced with a method that user code can call to retrieve something more meaningful, such as triplets of words and the label connecting them (note: don't return the spaCy Tokens themselves, because we might add tokens that can't be referenced in the original Doc).
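For illustration, the replacement could look something like this (get_triplets is a hypothetical helper, not an existing pybart API), built on the edge dicts that doc._.get_pybart() already yields in the readme example:

```python
def get_triplets(sent_edges):
    """Turn pybart edge dicts into plain (head, label, tail) string triplets.

    sent_edges: iterable of dicts with 'head', 'label' and 'tail' keys,
    as produced per sentence in the readme's doc._.get_pybart() loop.
    """
    return [(e["head"], e["label"], e["tail"]) for e in sent_edges]
```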
related to #35
Is issue #9 resolved?
We encounter problems related to it in the COVID dataset (and I suspect also in pubmed): many searches that should match multiple parts of a conjunction match only the first one.
The propagation of nmod actually works very well, but without also propagating the case nodes, this is just not captured.
@aryehgigi, if this is already fixed and case nodes are being propagated, then we need to verify with @hillelt why it is not applied in SPIKE, and what should be changed for this to apply there.
(And if it is not fixed, please try to fix it relatively soon; this is an important feature/bug.)
depends on #24
As @yoavg said:
In cases where we propagate nmod, we should also propagate the 'case' edge.
For example, in He saw them in the air, ground and water, we propagate nmod through the conjunction and get:
saw <nmod ground
saw <nmod water
But I think we should also propagate the case, and have:
ground <case in
water <case in
This shows up in SPIKE, where without this case propagation a pattern like:
we [v saw] them [$ in] the [loc car]
will only catch loc=air, and not loc=ground or loc=water, while the pattern:
we [v saw] them in the [loc car]
(without the case restriction) will capture also water and ground.
I will just note that this is a fix to Stanford's conj-propagation behavior and not to our own conversion (in which, when we have two verbs in a conjunction with a modifier on only one, we share that modifier between the verbs; here the conjunction is between multiple modifiers, and Stanford's conversion states that the head of the first modifier should be shared).
Anyway, I commented inside the code that this is not part of their original behavior, but our own add-on.
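The requested case propagation can be sketched like this (hypothetical post-processing over (head, label, dependent) triplets, not our actual converter code):

```python
def propagate_case(edges):
    """Copy the case child of a conjunction head to each of its conjuncts."""
    out = list(edges)
    # case dependent of each token, e.g. air -> in
    case_of = {head: dep for head, label, dep in edges if label == "case"}
    for head, label, dep in edges:
        if label.startswith("conj") and head in case_of:
            if (dep, "case", case_of[head]) not in out:
                out.append((dep, "case", case_of[head]))
    return out

# "in the air, ground and water": only air has the case edge to "in"
edges = [
    ("air", "case", "in"),
    ("air", "conj", "ground"),
    ("air", "conj", "water"),
]
```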