
Repository for the paper "Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities"

License: Creative Commons Attribution Share Alike 4.0 International

acl acl2023 conllu data dataset natural-language-processing nlp semantic-role-labeling srl


Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities


About the project

This is the repository for the paper Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities, presented at ACL 2023 by Riccardo Orlando, Simone Conia and Roberto Navigli.

Abstract

Although we have witnessed impressive progress in Semantic Role Labeling (SRL), most of the research in the area is carried out assuming that the majority of predicates are verbs. Conversely, predicates can also be expressed using other parts of speech, e.g., nouns and adjectives. However, non-verbal predicates appear in the benchmarks we commonly use to measure progress in SRL less frequently than in some real-world settings -- newspaper headlines, dialogues, and tweets, among others. In this paper, we put forward a new PropBank dataset which boasts wide coverage of multiple predicate types. Thanks to it, we demonstrate empirically that standard benchmarks do not provide an accurate picture of the current situation in SRL and that state-of-the-art systems are still incapable of transferring knowledge across different predicate types. Having observed these issues, we also present a novel, manually-annotated challenge set designed to give equal importance to verbal, nominal, and adjectival predicate-argument structures. We use such dataset to investigate whether we can leverage different linguistic resources to promote knowledge transfer. In conclusion, we claim that SRL is far from ''solved'', and its integration with other semantic tasks might enable significant improvements in the future, especially for the long tail of non-verbal predicates, thereby facilitating further research on SRL for non-verbal predicates.

Datasets

Download the data

You can download a copy of all the files in this repository by cloning the git repository:

git clone https://github.com/SapienzaNLP/exploring-srl.git

or download a zip archive.

Data format

The data is available in a format that is very similar to the CoNLL-U format. Here is an example:

# sentence_id = 109
# text = The police cordoned off the crime scene from the rest of the park .
109 	 0 	The      	-    	-      	-         	(ARG0*
109 	 1 	police   	-    	-      	-         	*)
109 	 2 	cordoned 	VERB 	cordon 	cordon.01 	*
109 	 3 	off      	-    	-      	-         	(ARGM-PRD*)
109 	 4 	the      	-    	-      	-         	(ARG1*
109 	 5 	crime    	-    	-      	-         	*
109 	 6 	scene    	-    	-      	-         	*)
109 	 7 	from     	-    	-      	-         	(ARG2*
109 	 8 	the      	-    	-      	-         	*
109 	 9 	rest     	-    	-      	-         	*
109 	10 	of       	-    	-      	-         	*
109 	11 	the      	-    	-      	-         	*
109 	12 	park     	-    	-      	-         	*)
109 	13 	.        	-    	-      	-         	*

  • Column 1: sentence id in the file.
  • Column 2: token index in the sentence.
  • Column 3: tokens in the sentence.
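  • Column 4: part-of-speech (POS) tag of the token. In the example above it is filled in only for predicate tokens; all other tokens carry the placeholder -.
  • Column 5: lemma of the token, filled in the same way as column 4.
  • Column 6: frame of the predicate. If the token is not a predicate, the column holds a placeholder (- in the example above).
  • Column 7 + i: roles of the i-th predicate in the sentence, counting from i = 0. For example, column 7 lists the roles for the first predicate, i.e., cordoned.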
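The column layout described above can be read with a few lines of Python. The following is a minimal sketch, not code shipped with the repository: the names `Sentence`, `parse_sentence`, and `decode_spans` are hypothetical, and it assumes tab-separated columns (possibly padded with spaces), `-` or `_` as the non-predicate placeholder, and non-nested bracketed role spans as in the example.

```python
from dataclasses import dataclass, field

# Assumption: either "-" or "_" may mark an empty cell.
PLACEHOLDERS = {"-", "_"}


@dataclass
class Sentence:
    sentence_id: str
    tokens: list = field(default_factory=list)
    # One (token_index, pos, lemma, frame) tuple per predicate.
    predicates: list = field(default_factory=list)
    # One list of role tags per predicate, aligned with `tokens`.
    role_columns: list = field(default_factory=list)


def parse_sentence(block: str) -> Sentence:
    """Parse one sentence block: comment lines plus one row per token."""
    sentence_id = ""
    tokens, predicates, rows = [], [], []
    for line in block.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and "# key = value" comments
        cols = [c.strip() for c in line.split("\t")]
        sentence_id = cols[0]
        index, token, pos, lemma, frame = int(cols[1]), *cols[2:6]
        tokens.append(token)
        if frame not in PLACEHOLDERS:
            predicates.append((index, pos, lemma, frame))
        rows.append(cols[6:])  # columns 7, 8, ...: one per predicate
    # Transpose the per-token rows into one tag sequence per predicate.
    role_columns = [list(column) for column in zip(*rows)]
    return Sentence(sentence_id, tokens, predicates, role_columns)


def decode_spans(tags):
    """Turn tags such as "(ARG0*", "*", "*)" into (label, start, end) spans.

    `end` is inclusive. Nested spans are not handled.
    """
    spans, label, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("("):
            label, start = tag[1:].rstrip("*)"), i
        if tag.endswith(")") and label is not None:
            spans.append((label, start, i))
            label = None
    return spans
```

Applied to the example sentence, `parse_sentence` yields one predicate, `(2, "VERB", "cordon", "cordon.01")`, and `decode_spans` on its role column returns `[("ARG0", 0, 1), ("ARGM-PRD", 3, 3), ("ARG1", 4, 6), ("ARG2", 7, 12)]`.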

Datasets

We release three datasets for verbal, nominal and adjectival SRL. Within the data folder, there are several sub-folders, each representing a specific dataset or category.

  • The challenge-srl folder contains the files for the Challenge-SRL dataset. It further contains two sub-folders:

    • propbank refers to the dataset annotated with PropBank for different predicate types: adjectives, nouns, and verbs.
    • verbatlas contains the VerbAtlas annotations for the same predicate types.
  • The parallel-semlink folder contains the files for the Parallel-SemLink dataset. A train, dev and test split is provided. It is organized into one sub-folder for each linguistic inventory it includes, i.e., framenet, propbank, verbatlas, and verbnet.

  • The pb-examples folder contains the files for the PB-Examples dataset with PropBank annotations categorized by predicate types.

  • The pb-unseen folder contains the files for the PB-Unseen dataset, with PropBank annotations also categorized by predicate type.

Each dataset file is stored within its corresponding folder, following a naming convention that indicates the category and type of annotations. File names include the predicate type (adjectives, nouns, verbs, or all) or the split (train, dev, test) and, where several inventories are available, an abbreviation of the inventory: pb for PropBank, fn for FrameNet, va for VerbAtlas, or vn for VerbNet.

data
├── challenge-srl
│   ├── propbank
│   │   ├── challengesrl.adjectives.pb.conllu
│   │   ├── challengesrl.all.pb.conllu
│   │   ├── challengesrl.nouns.pb.conllu
│   │   └── challengesrl.verbs.pb.conllu
│   └── verbatlas
│       ├── challengesrl.adjectives.va.conllu
│       ├── challengesrl.all.va.conllu
│       ├── challengesrl.nouns.va.conllu
│       └── challengesrl.verbs.va.conllu
├── parallel-semlink
│   ├── framenet
│   │   ├── parallelsemlink.fn.dev.conllu
│   │   ├── parallelsemlink.fn.test.conllu
│   │   └── parallelsemlink.fn.train.conllu
│   ├── propbank
│   │   ├── parallelsemlink.pb.dev.conllu
│   │   ├── parallelsemlink.pb.test.conllu
│   │   └── parallelsemlink.pb.train.conllu
│   ├── verbatlas
│   └── verbnet
│       ├── parallelsemlink.vn.dev.conllu
│       ├── parallelsemlink.vn.test.conllu
│       └── parallelsemlink.vn.train.conllu
├── pb-examples
│   ├── pbexamples.adjectives.conllu
│   ├── pbexamples.all.conllu
│   ├── pbexamples.nouns.conllu
│   └── pbexamples.verbs.conllu
└── pb-unseen
    ├── pbunseen.adjectives.conllu
    ├── pbunseen.all.conllu
    ├── pbunseen.nouns.conllu
    └── pbunseen.verbs.conllu
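The naming convention shown in the tree can be decoded programmatically. The sketch below is illustrative only: `describe_file` is a hypothetical helper, and the sets of abbreviations, predicate types, and splits are taken from the file names listed above, so any file that deviates from them is rejected rather than guessed at.

```python
from pathlib import PurePosixPath

# Abbreviations and labels as they appear in the directory tree above.
INVENTORIES = {"pb": "PropBank", "fn": "FrameNet", "va": "VerbAtlas", "vn": "VerbNet"}
PREDICATE_TYPES = {"adjectives", "nouns", "verbs", "all"}
SPLITS = {"train", "dev", "test"}


def describe_file(path: str) -> dict:
    """Split a dataset file name into its dot-separated components."""
    parts = PurePosixPath(path).name.split(".")
    if len(parts) < 2 or parts[-1] != "conllu":
        raise ValueError(f"not a .conllu dataset file: {path}")
    info = {
        "dataset": parts[0],
        "inventory": None,       # stays None when the name has no tag
        "predicate_type": None,
        "split": None,
    }
    for part in parts[1:-1]:
        if part in INVENTORIES:
            info["inventory"] = INVENTORIES[part]
        elif part in PREDICATE_TYPES:
            info["predicate_type"] = part
        elif part in SPLITS:
            info["split"] = part
        else:
            raise ValueError(f"unrecognized name component: {part}")
    return info
```

For instance, `describe_file("data/parallel-semlink/verbnet/parallelsemlink.vn.dev.conllu")` reports the VerbNet inventory and the dev split. Files under pb-examples and pb-unseen carry no inventory tag in their names, so `inventory` is left as `None` for them even though they are PropBank-annotated.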

PB-Examples and PB-Unseen

PB-Examples is a new benchmark for comprehensively evaluating different predicate types. To build PB-Examples, we collect the example sentences provided with each predicate in PropBank 3 (Palmer et al., 2005; Pradhan et al., 2022). In the table below we report statistics on the coverage of CoNLL-2009 (Hajič et al., 2009), OntoNotes 5.0 (Pradhan et al., 2022) and PB-Examples in terms of unique framesets (rightmost column). Compared to its alternatives, PB-Examples covers 7481 unique PropBank framesets against 2490 framesets covered in the OntoNotes test set and 2427 in CoNLL-2009. Moreover, when comparing PB-Examples to OntoNotes, the number of unique framesets used in verbal predicate occurrences is more than double (5465 vs. 2215), whereas it is almost double for nominal occurrences (1384 vs. 782). Adjectival occurrences are essentially missing in OntoNotes (with 3 unique framesets only), while PB-Examples covers 1599.

We also release PB-Unseen, a more challenging subset of PB-Examples which features predicate senses that are not included in OntoNotes 5.0.

Dataset        Verbs  Nouns  Adjs  Framesets
CoNLL-2009      1090   1337     0       2427
OntoNotes 5.0   2215    782     3       2490
PB-Examples     5465   1384  1599       7481
PB-Unseen       2457    469  1389       4001
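The comparisons drawn from this table are simple ratios, which can be checked with a short script. This is only a convenience sketch: the `COVERAGE` dictionary copies the figures from the table, and `coverage_ratio` is a hypothetical helper name.

```python
# Unique framesets per dataset, copied from the table above.
COVERAGE = {
    "CoNLL-2009":    {"verbs": 1090, "nouns": 1337, "adjs": 0,    "framesets": 2427},
    "OntoNotes 5.0": {"verbs": 2215, "nouns": 782,  "adjs": 3,    "framesets": 2490},
    "PB-Examples":   {"verbs": 5465, "nouns": 1384, "adjs": 1599, "framesets": 7481},
    "PB-Unseen":     {"verbs": 2457, "nouns": 469,  "adjs": 1389, "framesets": 4001},
}


def coverage_ratio(dataset: str, baseline: str, column: str) -> float:
    """How many times more framesets `dataset` covers than `baseline`."""
    return COVERAGE[dataset][column] / COVERAGE[baseline][column]


for column in ("verbs", "nouns", "framesets"):
    ratio = coverage_ratio("PB-Examples", "OntoNotes 5.0", column)
    print(f"PB-Examples vs. OntoNotes 5.0, {column}: {ratio:.2f}x")
```

The ratios come out at roughly 2.47 for verbs and 1.77 for nouns, in line with the "more than double" and "almost double" statements above, and at about 3.0 for the overall number of framesets.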

Parallel-SemLink

Parallel-SemLink is a multi-inventory benchmark made up of the subset of OntoNotes from SemLink 2.0 (Stowe et al., 2021), whose predicates and arguments are annotated with PropBank, FrameNet (Baker et al., 1998), and VerbNet (Schuler and Palmer, 2005). It also includes VerbAtlas (Di Fabio et al., 2019) annotations thanks to the inter-resource mapping between VerbNet, WordNet, and VerbAtlas. For each of these inventories, Parallel-SemLink includes a training, a validation, and a test set with 7336, 816, and 906 sentences, respectively.

Challenge-SRL

Challenge-SRL is a manually annotated test set for verbal, nominal and adjectival predicates that features parallel labels for PropBank and VerbAtlas. This dataset only features predicate-argument structures that do not appear in OntoNotes. Therefore, Challenge-SRL is useful to measure the capability of an SRL system to generalize i) across predicate types, and ii) on the long tail of predicate senses.

To construct Challenge-SRL, we randomly selected a total of 288 sentences -- 96 sentences for each predicate type -- from PB-Unseen. We then asked three expert annotators to annotate each sentence with predicate senses and their semantic roles. The annotation was carried out in two phases: first, each annotator worked on every sentence independently, resulting in a disagreement of 32%; then, the annotators discussed their disagreements and resolved them where possible, reducing the disagreement to 6%. Overall, Challenge-SRL includes 1898 predicate-argument pairs.
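The selection step described above, an equal number of randomly chosen sentences per predicate type, amounts to stratified sampling. The sketch below illustrates the idea only and is not the procedure actually used to build Challenge-SRL: `sample_per_type`, the seed, and the input structure (a dictionary from predicate type to a list of sentences) are all assumptions.

```python
import random


def sample_per_type(sentences_by_type: dict, per_type: int = 96, seed: int = 0) -> dict:
    """Draw `per_type` sentences without replacement for each predicate type."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    sample = {}
    # Iterate in sorted order so the result does not depend on dict order.
    for predicate_type, sentences in sorted(sentences_by_type.items()):
        if len(sentences) < per_type:
            raise ValueError(
                f"only {len(sentences)} sentences available for {predicate_type}"
            )
        sample[predicate_type] = rng.sample(sentences, per_type)
    return sample


if __name__ == "__main__":
    # Synthetic stand-in for PB-Unseen: sentence ids grouped by predicate type.
    pools = {
        t: [f"{t}-{i}" for i in range(500)]
        for t in ("verbs", "nouns", "adjectives")
    }
    chosen = sample_per_type(pools)
    print(sum(len(v) for v in chosen.values()))  # 3 types x 96 sentences
```

With three predicate types and 96 sentences each, the draw totals 288 sentences, matching the size reported above.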

Cite this work

If you use any part of this work, please consider citing the paper as follows:

@inproceedings{orlando-etal-2023-exploring-srl,
    title     = "Exploring Non-Verbal Predicates in Semantic Role Labeling: Challenges and Opportunities",
    author    = "Orlando, Riccardo and Conia, Simone and Navigli, Roberto",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2023",
    month     = jul,
    year      = "2023",
    address   = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
}

License

The data is licensed under Creative Commons Attribution-ShareAlike 4.0.
