smatch's People

Contributors

ablodge, danielhers, flipz357, goodmami, jcyk, jonmay, ramon-astudillo, snowblink14, tahira

smatch's Issues

smatch ignores triples whose source is a constant

This issue is spun off of #10 as it's really a different issue despite also being about inverted roles.

Smatch compares triples after de-inverting them, so (a ... :ARG0-of (b ... becomes the triple (ARG0, b, a) (in smatch's (role, source, target) order). If this de-inversion results in a constant becoming the source, the triple is not considered by smatch, resulting in inflated scores. While these triples may not be valid AMR, they are nevertheless observed in the outputs of automatic systems and should be considered during evaluation.

Consider the following hypothetical AMRs as gold and as the outputs of three different systems. The triples that smatch computes (including the top triple) are in comments.

$ cat gold  # original:     (TOP, a, alpha) (instance, a, alpha) (ARG0, a, 1)
(a / alpha
   :ARG0 1)
$ cat a  # wrong value:     (TOP, a, alpha) (instance, a, alpha) (ARG0, a, 2)
(a / alpha
   :ARG0 2)
$ cat b  # missing edge:    (TOP, a, alpha) (instance, a, alpha)
(a / alpha)
$ cat c  # wrong direction: (TOP, a, alpha) (instance, a, alpha) (ARG0, 1, a)
(a / alpha
   :ARG0-of 1)

Now we compare these to the gold. The raw counts of matching triples used to compute precision and recall are in comments.

$ python smatch.py --pr -f a gold  # P=2/3  R=2/3
Precision: 0.67
Recall: 0.67
F-score: 0.67
$ python smatch.py --pr -f b gold  # P=2/2  R=2/3
Precision: 1.00
Recall: 0.67
F-score: 0.80
$ python smatch.py --pr -f c gold  # P=2/2  R=2/3  (Note: P != 2/3)
Precision: 1.00
Recall: 0.67
F-score: 0.80

Smatch considers three types of triples: instances (e.g., (instance, s, see-01)), attributes (e.g., (polarity, a, -)), and edges (e.g., (ARG0, s, b)). In all cases, the source is the variable of some node. When the source is a constant, the triple doesn't fit into any of these three categories.

The straightforward fix is to ensure that inverted triples whose sources are constants (such as the (ARG0, 1, a) triple of c) are counted in the denominators for P and R, perhaps grouped in some "extra triples" category if necessary. When the role is :domain-of or :mod-of, however, there may be different behavior (see #10), but these can perhaps be resolved when decoding the AMR and not during the smatch computation.
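For illustration, here is a rough sketch (not smatch's actual code) of the de-inversion step and the proposed check: after flipping ROLE-of edges, flag any triple whose new source is a constant so that it can still be counted in the P and R denominators.

def deinvert(role, source, target):
    # Flip (ROLE-of, source, target) into (ROLE, target, source).
    if role.endswith("-of"):
        return role[:-3], target, source
    return role, source, target

def is_constant(node, variables):
    # A node is a constant if it is not a declared variable (e.g. 1, "Hans", -).
    return node not in variables

# e.g. graph c above: (a / alpha :ARG0-of 1)
variables = {"a"}
triples = [("instance", "a", "alpha"), ("ARG0-of", "a", "1")]
for t in triples:
    role, src, tgt = deinvert(*t)
    if is_constant(src, variables):
        print("constant-source triple, currently dropped by smatch:", (role, src, tgt))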

Multi process SMATCH

Dear authors,

Why don't you support parallel computation?
Do you have any plans to extend the scorer in this direction?

Thanks
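Since corpus-level smatch scores each pair independently and only sums the raw counts at the end, the computation is embarrassingly parallel. A minimal sketch of what parallel scoring could look like, built only on smatch.get_amr_match and smatch.compute_f (the same functions used in other issues below); the gold/predicted pairing and the pool size are illustrative assumptions.

from multiprocessing import Pool

import smatch


def _match_pair(pair):
    pred, gold = pair
    counts = smatch.get_amr_match(pred, gold)  # (match, test, gold) triple counts
    smatch.match_triple_dict.clear()           # reset the per-pair cache in this worker
    return counts


def parallel_smatch(preds, golds, processes=4):
    with Pool(processes) as pool:
        counts = pool.map(_match_pair, list(zip(preds, golds)))
    match_num = sum(c[0] for c in counts)
    test_num = sum(c[1] for c in counts)
    gold_num = sum(c[2] for c in counts)
    return smatch.compute_f(match_num, test_num, gold_num)  # (precision, recall, f-score)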

Awareness of inverse roles, e.g. inverse of 'domain' is 'mod'

Currently Smatch is not aware that mod is the inverse of domain. It also treats inverse roles simplistically: ROLE-of is always converted to ROLE.
In some cases this rule produces non-existent roles: consist-of, prep-on-behalf-of, and prep-out-of are primary (non-inverted) roles, and their reduced versions are not AMR roles.
See the AMR issue
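A hedged sketch of the kind of normalization discussed here (not current smatch behavior): treat mod as the inverse of domain, and only strip -of from roles that really are inverted.

# Roles that end in "-of" but are primary roles, not inversions.
NON_INVERTIBLE = {"consist-of", "prep-on-behalf-of", "prep-out-of"}


def normalize(role, source, target):
    # mod is already the inverse of domain: x :mod y == y :domain x,
    # so map both onto one canonical edge instead of stripping anything.
    if role == "mod":
        return "domain", target, source
    if role == "domain":
        return "domain", source, target
    # Only de-invert roles that really are inverted.
    if role.endswith("-of") and role not in NON_INVERTIBLE:
        return role[:-3], target, source
    return role, source, target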

`:mod` relation ignored

I came across a difference in AMR graphs which is not detected by smatch.
Comparing these two AMR graphs with smatch yields P/R/F of 1.00/1.00/1.00 (I am aware that :mod expressive is not valid AMR, but it is nevertheless what my AMR parser produced, and smatch should detect the difference):

(d / do-02
      :ARG0 (ii / i)
      :ARG1 (a / about
            :op1 (d2 / disease
                  :name (n / name
                        :op1 "OCD")))
      :location (p / psychology)
      :time (t / today))
(d / do-02
      :ARG0 (ii / i)
      :ARG1 (a / about
            :op1 (d2 / disease
                  :name (n / name
                        :op1 "OCD")))
      :location (p / psychology)
      :time (t / today)
      :mod expressive)

I think this is due to the special treatment of :mod/:domain; any other relation is detected by the current version of smatch. For instance, changing :mod expressive into :op1 expressive or even :toto expressive makes smatch detect the error and output an F-score of 0.9677. My first guess is that :mod is replaced by :domain (with source and target reversed) in amr.sty.

There is another AMR difference not detected by smatch:

# ::id ENG_NA_020001_20161020_G0023FSVB_0001.4
# ::snt we are dyin of thirst in MARTISAN 25BIS its***medicine without frontier they brought food an water for us..
(m / multi-sentence
      :snt1 (d / die-01
            :ARG1 (w / we)
            :ARG1-of (c / cause-01
                  :ARG0 (t / thirst-01
                        :ARG0 w))
            :location (s / street-address-91
                  :ARG1 "25 bis"
                  :ARG2 (r / road
                        :name (n / name
                              :op1 "Martisan"))))
      :snt2 (b / bring-01
            :ARG0 (o / organization
                  :name (n2 / name
                        :op1 "Doctors"
                        :op2 "without"
                        :op3 "Frontiers"))
            :ARG1 (a2 / and
                  :op1 (f / food)
                  :op2 (w2 / water))
            :ARG2 (w3 / we)))

and (note the :ARG1 "25", which is :ARG1 "25 bis" in the graph above):

(m / multi-sentence
      :snt1 (d / die-01
            :ARG1 (w / we)
            :ARG1-of (c / cause-01
                  :ARG0 (t / thirst-01
                        :ARG0 w))
            :location (s / street-address-91
                  :ARG1 "25 bis"
                  :ARG2 (r / road
                        :name (n / name
                              :op1 "Martisan"))))
      :snt2 (b / bring-01
            :ARG0 (o / organization
                  :name (n2 / name
                        :op1 "Doctors"
                        :op2 "without"
                        :op3 "Frontiers"))
            :ARG1 (a2 / and
                  :op1 (f / food)
                  :op2 (w2 / water))
            :ARG2 (w3 / we)))
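In both pairs above the two graphs share their variable names, so a plain set difference over the parsed triples already exposes what smatch fails to penalize. A minimal sketch using the third-party penman library; the graph strings are abridged versions of the first pair above.

import penman

gold = "(d / do-02 :ARG0 (ii / i))"                   # first graph above, abridged
pred = "(d / do-02 :ARG0 (ii / i) :mod expressive)"   # second graph above, abridged

gold_triples = set(penman.decode(gold).triples)
pred_triples = set(penman.decode(pred).triples)
print("only in pred:", pred_triples - gold_triples)   # -> the ('d', ':mod', 'expressive') edge
print("only in gold:", gold_triples - pred_triples)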

1.11 F1 score

I tried evaluating a single sentence against itself and got a Smatch score greater than one (!!). Any idea why?
Thank you.

Details below:

python smatchnew/smatch/smatch.py -f q3.txt q3.txt
F-score: 1.11

cat q3.txt
# ::snt How many white settlers were living in Kenya in the 1950's ?
(l / live-01
      :ARG0 (p / person
            :ARG1-of (s / settle-03
                  :ARG1 p
                  :ARG4 c)
            :ARG1-of (w / white-02)
            :quant (a / amr-unknown))
      :location (c / country :name "Kenya")
      :time (d / date-entity :decade 1950))

Smatch (still) yields scores above 100%

This seems like a bug similar to the solved #15, which can cause Smatch scores to be artificially high by double-counting edges and eventually reach values above 100%. The only way of knowing whether a Smatch score below 100% is real or suffers from this bug is computing the Smatch of a file with itself (henceforth self-Smatch).

An example

# ::tok Uh ... Do you have legislative power or enforcement power ? <ROOT>
(h / have-03
      :ARG0 (y / you)
      :ARG1 (o / or
            :op1 (p / power
                  :instrument-of (l / legislate-01))
            :op2 (p2 / power
                  :instrument-of (e / enforce-01)))
      :mod (u / uh
            :mode expressive)
      :mode interrogative)

has the expected self-Smatch of 100%, while

# ::tok Uh ... Do you have legislative power or enforcement power ? <ROOT>
(h / have-03
      :ARG0 (y / you)
      :ARG1 (o / or
            :op1 (p / power
                  :instrument-of (l / legislate-01))
            :op2 (p2 / power
                  :instrument-of (e / enforce-01)))
      :mod (u / uh
            :mode expressive)
      :mode interrogative
      :mode interrogative)

has a self-Smatch of 110.5% due to the repeated :mode interrogative. The score can grow ad infinitum just by repeating it further.
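A small helper for the self-Smatch check described above, built on the get_amr_match and compute_f functions smatch exposes; anything other than 1.0 signals a problem (below 1.0 the search got stuck, above 1.0 triples were double-counted).

import smatch


def self_smatch(amr_string):
    # Score a graph against itself and return the F-score.
    match, test, gold = smatch.get_amr_match(amr_string, amr_string)
    smatch.match_triple_dict.clear()
    return smatch.compute_f(match, test, gold)[2]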

Bug(?) in string parsing

I don't need this fixed, but maybe it's of interest:

I noticed that this smatch version returns a score of 1.00 for the different graphs:

(s / see-01
   :ARG0 (p / person
         :name (n / name
              :op1 "Hans")))

and

(s / see-01
   :ARG0 (p / person
         :name (n / name
              :op1 "Hans_")))

I'm wondering whether this is a bug or whether there is some reason for it. It might be some sort of preprocessing happening here which is not obvious. Since AMRs also contain things like https links that may include characters such as "_", I don't think it is sensible to remove characters.

Edit:

It can be even more severe. The score is also 1.00 for the very different graphs:

(s / see-01
   :ARG0 (p / person
         :name (n / name
              :op1 "Hans Meier")))

and

(s / see-01
   :ARG0 (p / person
         :name (n / name
              :op1 "Hans")))

While the first pair could be explained by some pre-processing quirk, the second pair clearly looks like a bug.
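A quick way to see what the parser actually stores for the two strings above. This assumes the AMR class exposes its triples via get_triples(), the accessor smatch.py uses internally; adjust the call if it is named differently.

import amr

graphs = [
    '(s / see-01 :ARG0 (p / person :name (n / name :op1 "Hans Meier")))',
    '(s / see-01 :ARG0 (p / person :name (n / name :op1 "Hans")))',
]
for g in graphs:
    parsed = amr.AMR.parse_AMR_line(g)
    # get_triples() is assumed here; compare the stored :op1 values for the two graphs.
    instances, attributes, relations = parsed.get_triples()
    print(attributes)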

Nondeterministic Behavior

I noticed that the smatch tool sometimes produces different results on the same input. Is this a known behavior?
Here is a simple example. I have the following two AMRs:

gold.amr

(v1 / res
  :tar (v2 / rere
    :rest (v3 / rest
      :pay (v4 / cc
        :na "cname1"))
    :prt (v5 / and
      :op1 (v6 / per
        :qua "num1")
      :op2 (v7 / chi
        :qua "num2"))))

pred.amr

(v1 / se
  :tar (v2 / rest
    :ham (v3 / and
      :op1 (v4 / mq
        :qua "num1")
      :op2 (v5 / per
        :qua "num1"))))

I'm running the smatch tool like this:
./smatch.py -f gold.amr pred.amr --pr
Most of the time, the result is this:

Precision: 0.29
Recall: 0.42
F-score: 0.34

But from time to time, I also get this result:

Precision: 0.24
Recall: 0.33
F-score: 0.28
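One way to quantify the run-to-run variance is to score the same pair repeatedly through smatch's Python API and report the spread; a rough sketch (file names as in the command above):

import smatch

gold = open("gold.amr").read()
pred = open("pred.amr").read()

f_scores = []
for _ in range(20):
    match, test, gold_num = smatch.get_amr_match(pred, gold)
    smatch.match_triple_dict.clear()
    f_scores.append(smatch.compute_f(match, test, gold_num)[2])
print("min F:", min(f_scores), "max F:", max(f_scores))

Increasing the number of random restarts of the hill-climbing search (smatch's -r option) usually tightens the spread, at the cost of run time.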

How to compute smatch directly from triplets

Hi, thanks for your nice work.
For some reason, I need to compute the smatch score directly from triples such as:

[('b', ':instance', 'believe-01'),
 ('b', ':ARG1', 'c8'),
 ('c8', ':instance', 'capable-01'),
 ('c8', ':ARG2', 'i'),
 ('i', ':instance', 'innovate-01'),
 ('i', ':ARG0', 'p'),
 ('p', ':instance', 'person'),
 ('p', ':mod', 'e2'),
 ('e2', ':instance', 'each'),
 ('c8', ':ARG1', 'p'),
 ('b', ':ARG0', 'p2'),
 ('p2', ':instance', 'person')]

Could you tell me how to do this using smatch?
Thanks a lot!
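One possible route, assuming the third-party penman library is available: the triples are already in the (source, role, target) order penman.Graph expects, so they can be serialized back to a PENMAN string and passed to smatch.get_amr_match. A minimal sketch with an abridged triple list:

import penman
import smatch

# Abridged from the list above; roles keep their ":" prefix.
triples = [("b", ":instance", "believe-01"),
           ("b", ":ARG0", "p2"),
           ("p2", ":instance", "person")]

amr_string = penman.encode(penman.Graph(triples))
match, test, gold = smatch.get_amr_match(amr_string, amr_string)  # or against another graph
smatch.match_triple_dict.clear()
print(smatch.compute_f(match, test, gold))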

Issue about the 'TOP' attribute relation

Hi,

I pass in two AMR strings with the same meaning but do not get a score of 1. The only difference between the two strings is that one has ARG2 and the other has ARG2-of. I find that this results in different "TOP" attribute triples, and thus the computed smatch score is not 1. I am wondering why the "TOP" attribute triple is added and how to fix this problem.

Below are the two strings:

(e / except-01 :ARG2 (c/ change-01 :ARG1 (n/ nothing)) :ARG1 (p / pass-01 :ARG2 (l / law :name (n2 / name :op1 "Obaminationcare") :wiki "Patient_Protection_and_Affordable_Care_Act")))

(c / change-01:ARG1 (n / nothing):ARG2-of (e / except-01:ARG1 (p / pass-01:ARG2 (l / law :wiki "Patient_Protection_and_Affordable_Care_Act":name (n2 / name :op1 "Obaminationcare")))))

I use AMR.parse_AMR_line to parse the above two strings and get the following triples:

instance triple
('instance', 'e', 'except-01')
('instance', 'c', 'change-01')
('instance', 'n', 'nothing')
('instance', 'p', 'pass-01')
('instance', 'l', 'law')
('instance', 'n2', 'name')
attribute triple
('TOP', 'e', 'except-01')
('wiki', 'l', 'Patient_Protection_and_Affordable_Care_Act_')
('op1', 'n2', 'Obaminationcare_')
relation triple
('ARG2', 'e', 'c')
('ARG1', 'e', 'p')
('ARG1', 'c', 'n')
('ARG2', 'p', 'l')
('name', 'l', 'n2')

instance triple
('instance', 'c', 'change-01')
('instance', 'n', 'nothing')
('instance', 'e', 'except-01')
('instance', 'p', 'pass-01')
('instance', 'l', 'law')
('instance', 'n2', 'name')
attribute triple
('TOP', 'c', 'change-01')
('wiki', 'l', 'Patient_Protection_and_Affordable_Care_Act_')
('op1', 'n2', 'Obaminationcare_')
relation triple
('ARG1', 'c', 'n')
('ARG2', 'e', 'c')
('ARG1', 'e', 'p')
('ARG2', 'p', 'l')
('name', 'l', 'n2')

What is the latest version of Smatch?

Two related questions: What is the version of the current software? And what is considered the "official" fork (if there is one)? Some points to consider:

  • setup.py says "1.0"
  • The latest release on PyPI is 1.0.1 (the only differences between the version on PyPI and the current master branch are the changes introduced by #20 and #21)
  • SemEval 2016 refers to versions 2.0, 2.0.1, and 2.0.2, but as far as I can tell those versions predate the 1.0.0 and 1.0.1 releases on PyPI
  • There is also the ISI-NLP fork, which adds an ILP solver but is otherwise behind this repo by 7 commits. Its inclusion in the ISI-NLP group makes it look like the official version of Smatch, but it's not the version on PyPI.

@danielhers and @snowblink14 are listed as maintainers on PyPI, so what do you think? Should we bump the current version to, say, 2.0.3 so it's at least numerically consistent with the SemEval 2016 release? And perhaps the README could state whether this is the official (or at least preferred) Smatch code or if another repository should be used instead. I'm happy to do the work and provide a PR for these things but I cannot make the decision.

interpretation of TOP property

when comparing SMATCH results with the scorer from the 2019 CoNLL Shared Task on Cross-Framework Meaning Representation Parsing (MRP), we discovered that SMATCH will only consider the TOP property correct if the node labels also match. this appears to double-penalize for label mismatches and is maybe not the intended behavior? for more technical detail and a minimal test case, please see the MRP mtool issue.

Issues when parsing amr from lines

In the current implementation, parsing an AMR from a line has an issue:

>>> import amr
>>> amr.AMR.parse_AMR_line("(z0 /chapter :mod 1)")
Node 0 z0
Value: chapter
Relations:
Attribute: TOP value top

The parsing script misses the :mod 1 attribute.
This issue was fixed in Damonte's amr-evaluation script: https://github.com/mdtux89/amr-evaluation

>>> import amr
>>> amr.AMR.parse_AMR_line("(z0 /chapter :mod 1)")
Node 0 z0
Value: chapter
Relations:
Attribute: mod value 1
Attribute: TOP value chapter

Licensing AMRs with missing role names

The AMRs with missing role names are accepted by Smatch and translated into triples.

(a0 / watch                               
      : (a1 / boy)                             
      :ARG1 (a2 / tv))

Triples produced by the smatch demo:
instance(a0,boy) ^ instance(a1,tv) ^ TOP(a0,boy) ^ ARG1(a0,a1)
Note that the concepts are shifted: a0 is watch in the AMR but boy in the triples. This has several undesired consequences, such as licensing ill-formed AMRs that may nevertheless receive high scores.

Support programmatic usage

Right now the only way to use smatch is as a command-line script.
I would like to use it from my Python code, so that I could do something like:

from smatch import f1_score

print(f1_score('(b / bark-01 :ARG0 (d / dog))', '(w / walk-01 :ARG0 (m / man))'))
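Until such a helper exists, roughly the requested f1_score can be sketched on top of the functions smatch already exposes (get_amr_match and compute_f):

import smatch


def f1_score(amr1, amr2):
    # Score two single-AMR strings and return only the F-score.
    match, test, gold = smatch.get_amr_match(amr1, amr2)
    smatch.match_triple_dict.clear()
    return smatch.compute_f(match, test, gold)[2]


print(f1_score('(b / bark-01 :ARG0 (d / dog))', '(w / walk-01 :ARG0 (m / man))'))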

TOP problem

With TOP treated as an attribute, as suggested by @oepen in #25, Smatch returns high scores for completely different nodes.

$ cat x
(x / x)

$ cat y
(y / y)

python smatch.py --pr -f x y 
Precision: 0.5
Recall: 0.5
F-score: 0.5

Support for ~e.2 etc.

Hello,

I have AMRs in the following format, but smatch raises warnings and errors:

# ::tok Hallmark could make a fortune off of this guy .
# ::alignments 0-1.1.1.2.1 1-1 2-1.1 4-1.1.2 6-1.1.2.1.r 7-1.1.2.1.1 8-1.1.2.1
(p / possible-01~e.1 
      :ARG1 (m / make-05~e.2 
            :ARG0 (c / company :wiki "Hallmark_Cards" 
                  :name (n / name :op1 "Hallmark"~e.0)) 
            :ARG1 (f / fortune~e.4 
                  :source~e.6 (g / guy~e.8 
                        :mod (t / this~e.7)))))

These are the errors:

File 1 has error/warning message:
*** Line 3 - Ignoring unexpected tokens: (p / possible-01~e.1 
*** Line 4 - Ignoring unexpected token: :ARG1 
*** Line 4 - Ignoring unexpected tokens: (m / make-05~e.2 
*** Line 5 - Ignoring unexpected token: :ARG0 
*** Line 6 - Ignoring unexpected token: "Hallmark"~e.0 
*** Line 7 - Ignoring unexpected token: :ARG1 
*** Line 7 - Ignoring unexpected tokens: (f / fortune~e.4 
*** Line 8 - Ignoring unexpected token: :source~e.6 
*** Line 8 - Ignoring unexpected tokens: (g / guy~e.8 
*** Line 9 - Ignoring unexpected token: :mod 
*** Line 9 - Ignoring unexpected tokens: (t / this~e.7 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
11 errors and 5 warnings in 9 lines. 

File 2 has error/warning message:
*** Line 3 - Ignoring unexpected tokens: (p / possible-01~e.1 
*** Line 4 - Ignoring unexpected token: :ARG1 
*** Line 4 - Ignoring unexpected tokens: (m / make-05~e.2 
*** Line 5 - Ignoring unexpected token: :ARG0 
*** Line 6 - Ignoring unexpected token: "Hallmark"~e.0 
*** Line 7 - Ignoring unexpected token: :ARG1 
*** Line 7 - Ignoring unexpected tokens: (f / fortune~e.4 
*** Line 8 - Ignoring unexpected token: :source~e.6 
*** Line 8 - Ignoring unexpected tokens: (g / guy~e.8 
*** Line 9 - Ignoring unexpected token: :mod 
*** Line 9 - Ignoring unexpected tokens: (t / this~e.7 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
*** Line 9 - Non-matching close parenthesis. 
11 errors and 5 warnings in 9 lines. 

Is the AMR in the correct format? Is something missing within Smatch?
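The ~e.N markers are ISI-style token alignments, which smatch's parser does not understand. One workaround, assuming markers of the form ~e.2 or ~e.2,5, is to strip them before scoring; a minimal sketch:

import re


def strip_alignments(amr_string):
    # Remove ~e.N (or ~e.N,M) alignment markers so the remaining string
    # is plain PENMAN that smatch can parse.
    return re.sub(r"~e\.[0-9]+(?:,[0-9]+)*", "", amr_string)


print(strip_alignments(':source~e.6 (g / guy~e.8)'))  # -> :source (g / guy)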

Streamline release process

The current process for pushing a release to PyPI could be more direct. As @danielhers said in #22:

As for releases, a small detail is that they are currently actually created automatically only when a tag is created on my fork, because the PyPI token is encrypted with the public key associated with it:

secure: CVA8RIEx1IdOHUBMjF/MPd6FFE3wu0sAzJkowu0PzZ4VwbOlqSYXuYWjkNUV8plCVm8mgovMXBDjTC8q9AGYTJi8B5f92AY6YRfLVjJCpdMd8EH6VNlymxJTYg0t5W5RKpzdAyOLj5GyAPhVqY805TIE+ao2XKQ+UwlwRUq/SUhxi6gcvLsamuabfKg2OZxijKp6dPk+tqw33K+DYJVV3WOUqyI169Z3iXRhljghmLWiV6xdIs1/V34XsbdgRJDRs/kOsrhE19khOIUqZ/+to++qRFwLpgAsF71n23vGs/FwvP+ab1oMYtUNC8DHI2gmiGHO/ipE/FXqsYdYuhULnzs7nfx04YQoaZliD7Hbvze0zEczQZxHQpuHbykNND2WU20NOnckBzDgqhqzPIVjHwQTGFKMZqg7nZ8w81NrGSeGTXagtoBPtq6/RPU1alppFE4EU7/fZrAN/fXgiemcxmsK5Nl9Ps6zWoB2eFhCaLDeUI9BktIrp5nLplLQJmLfF+RfLaIYULD5hdwFzgcUTjG/nPr41XNMvwYvsM+rCGxlxBtL2Gc2xf7Kfv9T6vUr0eb0Yxp+oKNv7XN3j1rZ+0PIqMbi1dpMJu7OqxIR5lQB6bYwDcCeKxrFpnP/9drtXzjqGZnzCt/rdGnMUd9PrkZTHFXOrsq0ZtmFSNK5/m4=

This means the process for a release is

  1. @snowblink14 creates a release (and thereby automatically a tag) on https://github.com/snowblink14/smatch
  2. I replicate this tag on https://github.com/danielhers/smatch and a Travis CI job automatically deploys to PyPI

To get rid of step (2), @snowblink14 would need to enable Travis CI for https://github.com/snowblink14/smatch, and then update the encrypted token in .travis.yml. I'll be glad to advise how to do that but I don't mind doing step (2) myself so I think the current situation is fine.

While I appreciate @danielhers's willingness to do the extra step, I think it creates an unnecessary barrier. Let's make it so PyPI is updated whenever a release is made in this repo.

If @snowblink14 does not want to set up Travis CI, then GitHub Actions are pretty nice. I've had a good experience using the python-publish workflow (see here), and the Python Packaging Authority (PyPA) also has their own version (see here).

Use loggers instead of verbose and veryVerbose

Background:
๐Ÿ‘ for #5 (make it as PyPi package)
That will make smatch as an easily accessible library.
It will be nice to replace prints with log statements (so that library users can easily control)

Mappings:

Error -> logger.error
Verbose -> logger.info
VeryVerbose -> logger.debug

Then, we can set the log level from CLI args for backward compatibility.
default level = WARNING
When Verbose flag is enabled, level=INFO
When VeryVerbose flag is enabled, level=DEBUG

References:

https://docs.python.org/3/library/logging.html
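A minimal sketch of the proposed mapping using the standard logging module; the flag names and the log message are illustrative, not existing smatch code.

import argparse
import logging

logger = logging.getLogger("smatch")

parser = argparse.ArgumentParser()
parser.add_argument("-v", "--verbose", action="store_true")
parser.add_argument("--veryverbose", action="store_true")
args = parser.parse_args()

level = logging.WARNING            # default, as proposed above
if args.veryverbose:
    level = logging.DEBUG          # replaces the old veryVerbose prints
elif args.verbose:
    level = logging.INFO           # replaces the old verbose prints
logging.basicConfig(level=level)

logger.info("computed candidate mappings")   # instead of: if verbose: print(...)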

Remove Python 2 support

Python 2 was retired at the beginning of the year, and it doesn't even ship with the latest Ubuntu LTS (20.04) by default. I think smatch should remove explicit support for Python 2. Mainly this means that any workarounds for Python 2 are removed and it is no longer listed in setup.py, tox.ini, and .travis.yml as a supported version. Users who absolutely need Python 2 should pin their smatch installation to the current or a previous version.

I've already prepared a branch for a pull request, but I created this issue in case there is any discussion about what to do.

AMR graph which scores > 100%

Hi,
I'm using smatch 1.0.4 (installed via pip) and I found a case where the precision (and the F-measure) are > 100%:

gold:

# ::snt 22/02/2010 16:42
(d / date-entity
      :time "16:42"
      :day 22
      :month 2
      :year 2010)

predicted:

# ::snt 22/02/2010 16:42
(d / date-entity
      :time "16:42"
      :year 2010
      :month 2
      :day 22
      :day 22)

evaluation:

$ smatch.py --pr -f gold predicted
Precision: 1.17
Recall: 1.00
F-score: 1.08

Thanks !

Smatch is non-deterministic and does not yield score=1 for the same input/output graph

I found two quirks in the following example.

  1. even though the smatch score is calculated between a graph and itself, the F1 score is not 1 (far from it)
  2. the scores are non-deterministic. Sometimes the score is 0.9, then 0.92, then 0.87, etc.

Perhaps something is wrong with my calculate_smatch function, but I do not think so. (It is modified from score_amr_pairs.)

from typing import List

import smatch


def calculate_smatch(refs_penman: List[str], preds_penman: List[str]):
    total_match_num = total_test_num = total_gold_num = 0
    n_invalid = 0

    for sentid, (ref_penman, pred_penman) in enumerate(zip(refs_penman, preds_penman), 1):
        best_match_num, test_triple_num, gold_triple_num = smatch.get_amr_match(
            ref_penman, pred_penman, sent_num=sentid
        )

        total_match_num += best_match_num
        total_test_num += test_triple_num
        total_gold_num += gold_triple_num
        # clear the matching triple dictionary for the next AMR pair
        smatch.match_triple_dict.clear()

    score = smatch.compute_f(total_match_num, total_test_num, total_gold_num)

    return {
        "smatch_precision": score[0],
        "smatch_recall": score[1],
        "smatch_fscore": score[2],
        "ratio_invalid_amrs": n_invalid / len(preds_penman) * 100,
    }


s = """(r / result-01
   :ARG1 (c / compete-01
            :ARG0 (w / woman)
            :mod (p / preliminary)
            :time (t / today)
            :mod (p2 / polo
                     :mod (w2 / water)))
   :ARG2 (a / and
            :op1 (d / defeat-01
                    :ARG0 (t2 / team
                              :mod (c2 / country
                                       :wiki +
                                       :name (n / name
                                                :op1 "Hungary")))
                    :ARG1 (t3 / team
                              :mod (c3 / country
                                       :wiki +
                                       :name (n2 / name
                                                 :op1 "Canada")))
                    :quant (s / score-entity
                              :op1 13
                              :op2 7))
            :op2 (d2 / defeat-01
                     :ARG0 (t4 / team
                               :mod (c4 / country
                                        :wiki +
                                        :name (n3 / name
                                                  :op1 "France")))
                     :ARG1 (t5 / team
                               :mod (c5 / country
                                        :wiki +
                                        :name (n4 / name
                                                  :op1 "Brazil")))
                     :quant (s2 / score-entity
                                :op1 10
                                :op2 9))
            :op3 (d3 / defeat-01
                     :ARG0 (t6 / team
                               :mod (c6 / country
                                        :wiki +
                                        :name (n5 / name
                                                  :op1 "Australia")))
                     :ARG1 (t7 / team
                               :mod (c7 / country
                                        :wiki +
                                        :name (n6 / name
                                                  :op1 "Germany")))
                     :quant (s3 / score-entity
                                :op1 10
                                :op2 8))
            :op4 (d4 / defeat-01
                     :ARG0 (t8 / team
                               :mod (c8 / country
                                        :wiki +
                                        :name (n7 / name
                                                  :op1 "Russia")))
                     :ARG1 (t9 / team
                               :mod (c9 / country
                                        :wiki +
                                        :name (n8 / name
                                                  :op1 "Netherlands")))
                     :quant (s4 / score-entity
                                :op1 7
                                :op2 6))
            :op5 (d5 / defeat-01
                     :ARG0 (t10 / team
                                :mod (c10 / country
                                          :wiki +
                                          :name (n9 / name
                                                    :op1 "United"
                                                    :op2 "States")))
                     :ARG1 (t11 / team
                                :mod (c11 / country
                                          :wiki +
                                          :name (n10 / name
                                                     :op1 "Kazakhstan")))
                     :quant (s5 / score-entity
                                :op1 10
                                :op2 5))
            :op6 (d6 / defeat-01
                     :ARG0 (t12 / team
                                :mod (c12 / country
                                          :wiki +
                                          :name (n11 / name
                                                     :op1 "Italy")))
                     :ARG1 (t13 / team
                                :mod (c13 / country
                                          :wiki +
                                          :name (n12 / name
                                                     :op1 "New"
                                                     :op2 "Zealand")))
                     :quant (s6 / score-entity
                                :op1 12
                                :op2 2))))
"""

if __name__ == "__main__":
    for _ in range(5):
        smatch_score = calculate_smatch([s], [s])
        print(smatch_score)

Output

{'smatch_precision': 0.8866666666666667, 'smatch_recall': 0.8866666666666667, 'smatch_fscore': 0.8866666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.88, 'smatch_recall': 0.88, 'smatch_fscore': 0.88, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8666666666666667, 'smatch_recall': 0.8666666666666667, 'smatch_fscore': 0.8666666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.9266666666666666, 'smatch_recall': 0.9266666666666666, 'smatch_fscore': 0.9266666666666666, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8533333333333334, 'smatch_recall': 0.8533333333333334, 'smatch_fscore': 0.8533333333333335, 'ratio_invalid_amrs': 0.0}

The non-determinism is very worrying to me. If an evaluation metric is not deterministic, how then can we compare systems to each other in a fair way? A difference of 0.92 vs 0.87 is massive for the same input/output.
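One way to tighten the variance without touching smatch internals is to score each pair several times and keep the best match count, i.e. effectively more random restarts; a sketch built on the same get_amr_match call used in calculate_smatch above.

import smatch


def get_amr_match_best_of(ref, pred, repeats=5):
    # The test/gold triple counts are constant across repeats; only the
    # match count varies, so keep the run with the highest match count.
    best = (0, 0, 0)
    for _ in range(repeats):
        counts = smatch.get_amr_match(ref, pred)
        smatch.match_triple_dict.clear()
        if counts[0] > best[0]:
            best = counts
    return best

Even so, a self-score below 1.0 means the hill-climbing search stopped in a local optimum, so this only mitigates the symptom rather than fixing it.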
