
gold_vietnamese-tatoeba's Introduction

Universal Proposition Banks

This is release 1.0 of the Universal Proposition Banks. It is built upon release 1.4 of the Universal Dependency Treebanks and inherits their licence. We use the frame and role labels from the English Proposition Bank version 3.0.

News (10/01/2019): Two domain-specific PropBanks released (Contract, Finance)!

News (02/10/2017): Initial version of Italian UP released!

News (01/31/2017): Initial versions of Finnish, Portuguese and Spanish UP released!

News (04/15/2022): We are freezing the resources in this repository.

To be consistent with the UP2.0 repository format, we have reorganized this repo and copied the data from each language-specific folder to a language-specific repository. The changes are as follows:

We introduce language- and corpus-specific repositories, similar to the Universal Dependencies project.

All the UP1.0 resources have been moved to language-specific repositories. The following folders have been copied to the corresponding repositories.

No further changes will be made to this repository (all resources are frozen). All language-specific updates will be made in the corresponding repositories, named UP_<language>-<corpus>. To keep this data available as it is, a release named "v1.0 data release" will be made. For more information, see the Universal PropBanks website: https://universalpropositions.github.io/

Languages

This release contains propbanks for the following languages:

Multilingual SRL

Using this data, we can create SRL systems that predict English PropBank labels for many different languages. See a recent demo screencast of this SRL for English, French and German here.

Introduction

This project aims to annotate text in different languages with a layer of "universal" semantic role labeling annotation. For this purpose, we use the frame and role labels of the English Proposition Bank to label shallow semantics in sentences in new target languages.

For instance, consider the German sentence "Seine Arbeit wird von ehrenamtlichen Helfern und Regionalgruppen des Vereins unterstützt" (His work is supported by volunteers and regional groupings of the association). In CoNLL format, it looks like this, with English PropBank labels in the last two columns:

Id Form POS HeadId Deprel Frame Role
1 Seine DET 2 det:poss _ _
2 Arbeit NOUN 11 nsubjpass _ A1
3 wird AUX 11 auxpass _ _
4 von ADP 6 case _ _
5 ehrenamtlichen ADJ 6 amod _ _
6 Helfern NOUN 11 nmod _ A0
7 und CONJ 6 cc _ _
8 Regionalgruppen NOUN 6 conj _ _
9 des DET 10 det _ _
10 Vereins NOUN 8 nmod _ _
11 unterstützt VERB 0 root support.01 _
12 . PUNCT 11 punct _ _

The German verb 'unterstützt' is labeled as evoking the 'support.01' frame with two roles: "Seine Arbeit" (his work) is labeled A1 (project being supported) and "ehrenamtlichen Helfern und Regionalgruppen des Vereins" (volunteers and regional groupings of the association) is labeled A0 (the helper).
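
As a minimal sketch of how the table above can be read programmatically, the Python snippet below parses the seven whitespace-separated columns and collects the frame and the head word of each role. The `parse` helper and the dictionary keys are illustrative names, not part of any UP tooling, and the seven-column layout is the simplified one shown in this example rather than the full release format.

```python
# The example rows from the table above (Id Form POS HeadId Deprel Frame Role).
ROWS = """\
1 Seine DET 2 det:poss _ _
2 Arbeit NOUN 11 nsubjpass _ A1
3 wird AUX 11 auxpass _ _
4 von ADP 6 case _ _
5 ehrenamtlichen ADJ 6 amod _ _
6 Helfern NOUN 11 nmod _ A0
7 und CONJ 6 cc _ _
8 Regionalgruppen NOUN 6 conj _ _
9 des DET 10 det _ _
10 Vereins NOUN 8 nmod _ _
11 unterstützt VERB 0 root support.01 _
12 . PUNCT 11 punct _ _
""".splitlines()

FIELDS = ("id", "form", "pos", "head", "deprel", "frame", "role")


def parse(rows):
    """Turn each whitespace-separated row into a dict keyed by column name."""
    tokens = []
    for row in rows:
        token = dict(zip(FIELDS, row.split()))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        tokens.append(token)
    return tokens


tokens = parse(ROWS)
# Tokens with a value in the Frame column are the frame-evoking predicates.
predicates = [(t["form"], t["frame"]) for t in tokens if t["frame"] != "_"]
# The Role column marks the head word of each argument.
role_heads = {t["role"]: t["form"] for t in tokens if t["role"] != "_"}

print(predicates)
print(role_heads)
```

Only the head word of each argument carries the role label in this layout; the phrases quoted in the text are built around those head words.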

Format

The universal propbank (UP) for each language consists of three files (training, dev and test data). The files use the extension .conllu but currently encode an extension of the CoNLL-U format. This extension is based on the CoNLL format produced by the Propbank conversion scripts, called .gold_conll.

Besides the original 10 columns from the CoNLL-U format, the roleset column (column 11) gives the sense actually used, and that sense provides roleset-specific meanings for each of the numbered arguments. Every column after the eleventh corresponds to one predicate, in the order in which the predicates appear in the sentence. Note that the Propbank .gold_conll files contain a "frame file" column (column 11) that tells you which ".xml" file contains the actual semantic form for the predicate in question (which is not always the same as the predicate: one must reference "lighten.xml" for lighten_up.02). Since all predicate identifiers are unique, we have not preserved this column.
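
The column layout described above could be read roughly as follows. This is a sketch under stated assumptions: the toy sentence is hypothetical (it is not taken from any UP file), the rows are assumed to be tab-separated, the predicate's own cell in its argument column is assumed to be `_`, and `read_propositions` is an illustrative name rather than an existing tool.

```python
# Hypothetical toy sentence: 10 CoNLL-U columns, a roleset column (11),
# then one argument column per predicate (12, 13, ...), in sentence order.
TOY = [
    "1 He he PRON _ _ 2 nsubj _ _ _ A0 A0",
    "2 wants want VERB _ _ 0 root _ _ want.01 _ _",
    "3 to to PART _ _ 4 mark _ _ _ _ _",
    "4 leave leave VERB _ _ 2 xcomp _ _ leave.01 A1 _",
]
# Real files are assumed to be tab-separated; convert the toy rows accordingly.
LINES = ["\t".join(row.split()) for row in TOY]

ROLESET = 10  # zero-based index of column 11


def read_propositions(lines):
    """Pair the k-th predicate of a sentence with the k-th argument column."""
    rows = [line.split("\t") for line in lines if line and not line.startswith("#")]
    predicates = [row for row in rows if row[ROLESET] != "_"]
    propositions = []
    for k, pred in enumerate(predicates):
        col = ROLESET + 1 + k  # argument column belonging to predicate k
        args = [(row[col], row[1]) for row in rows if row[col] != "_"]
        propositions.append((pred[1], pred[ROLESET], args))
    return propositions


propositions = read_propositions(LINES)
for form, roleset, args in propositions:
    print(form, roleset, args)
```

Keeping one argument column per predicate means a token can fill roles for several predicates at once, as "He" does for both verbs in the toy sentence.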

The English dataset was the only one obtained in a different manner. See the README.org file in that directory for information.

In addition, each language has a folder with verb overview files (produced from the frame files) in html format. These files can be viewed in a browser and give an overview of all English frames that each target language verb can evoke.

Scope

Our current focus is to annotate all target language verbs with appropriate English frames. This means that the scope of frame-evoking elements is currently limited to verbs. We also do not label target language auxiliary verbs. For each universal propbank, about 90% of all verbs are currently labeled. Unlabeled verbs often either convey semantics for which we could not find an appropriate English verb or are part of complex verb constructions that we currently do not handle.

A note on quality

This is an ongoing research project in which we use a combination of data-driven methods and some post-processing to generate these resources. This means that the labels in the UPs are mostly predicted by models trained on a different domain, which affects the quality. A good example is the German verb "angeben", which in our source data was mostly used in the "brag.01" sense but in the German UD data is mostly used in the "report.01" sense, and is almost never detected as such.

Current and future work

This is an ongoing project which we are improving along four lines: (1) We are working on adding new languages to the current release. (2) We are working to curate the data to improve the quality of SRL annotation. (3) We are looking into extending the scope of frame-evoking elements to other types of predicates besides verbs. (4) We will migrate the data to a newer UD standard.

Publications

Crowd-in-the-Loop: A Hybrid Approach for Annotating Semantic Roles. Chenguang Wang, Alan Akbik, Laura Chiticariu, Yunyao Li, Fei Xia and Anbang Xu. 2017 Conference on Empirical Methods in Natural Language Processing EMNLP 2017.

Active Learning for Black-Box Semantic Role Labeling with Neural Factors. Chenguang Wang, Laura Chiticariu and Yunyao Li. 2017 International Joint Conference on Artificial Intelligence IJCAI 2017.

Multilingual Aliasing for Auto-Generating Proposition Banks. Alan Akbik, Xinyu Guan and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

K-SRL: Instance-based Learning for Semantic Role Labeling. Alan Akbik and Yunyao Li. 26th International Conference on Computational Linguistics COLING 2016.

Multilingual Information Extraction with PolyglotIE. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yonas Kbrom, Yunyao Li and Huaiyu Zhu. 26th International Conference on Computational Linguistics COLING 2016.

Towards Semi-Automatic Generation of Proposition Banks for Low-Resource Languages. Alan Akbik, Vishwajeet Kumar and Yunyao Li. 2016 Conference on Empirical Methods in Natural Language Processing EMNLP 2016.

Polyglot: Multilingual Semantic Role Labeling with Unified Labels. Alan Akbik and Yunyao Li. 54th Annual Meeting of the Association for Computational Linguistics ACL 2016.

Generating High Quality Proposition Banks for Multilingual Semantic Role Labeling. Alan Akbik, Laura Chiticariu, Marina Danilevsky, Yunyao Li, Shivakumar Vaithyanathan and Huaiyu Zhu. 53rd Annual Meeting of the Association for Computational Linguistics ACL 2015.

People

Contact

Please email your questions or comments to Huaiyu Zhu.

Core Team

  • Alan Akbik
  • Laura Chiticariu
  • Marina Danilevsky
  • Yunyao Li
  • Huaiyu Zhu

Contributors

  • Xinyu Guan, Yale University
  • Tomer Mahlin, IBM Systems Division, Israel
  • Vishwajeet Kumar, IIT Bombay
  • Fei Xia, University of Washington
  • Chenguang (Ray) Wang, Amazon

gold_vietnamese-tatoeba's Issues

Some errors in Gold_Vietnamese_Tatoeba data

###1. Missing AM-NEG for many sentences
For example:
1 Tôi tôi PRON PRO _ 2 nsubj _ _
2 nghĩ nghĩ VERB V _ 0 root think.01 A0:1|A1:7
3 là là SCONJ C _ 7 mark _ _
4 hôm qua hôm qua NOUN N _ 7 obl:tmod _ _
5 Tom tom NOUN NNP _ 7 nsubj _ _
6 không không ADV ADV _ 7 advmod _ _
7 đi đi VERB V _ 2 ccomp get.03 AM-TMP:4|A0:5|A2:8
8 chạy chạy VERB V _ 7 xcomp _ _
9 bộ bộ NOUN N _ 8 obj _ _
10 , , PUNCT PUNCT _ 14 punct _ _
11 nhưng mà nhưng mà SCONJ C _ 14 cc _ _
12 tôi tôi PRON PRO _ 14 nsubj _ _
13 không không ADV ADV _ 14 advmod:neg _ _
14 chắc chắc ADJ ADJ _ 2 conj A0:12
15 . . PUNCT PUNCT _ 2 punct _ _

See tokens 6 and 13 (không), which have no AM-NEG label. Some negation words in Vietnamese: không, không thể, chưa, ...
vi_tatoeba-gold_reviseV1.conllup.txt
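
A check for this kind of omission could look roughly like the sketch below. It assumes the ten-column layout used in these examples (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, FRAME, ARGS) and an ARGS cell of the form `LABEL:headId|LABEL:headId`. The `NEG_WORDS` set and the function names are illustrative, and the excerpt is split on spaces only because it contains no multi-word tokens.

```python
# Negation words listed in the issue; extend as needed.
NEG_WORDS = {"không", "không thể", "chưa"}

# Rows 5-9 of the sentence above (single-word tokens only).
EXCERPT = [
    "5 Tom tom NOUN NNP _ 7 nsubj _ _",
    "6 không không ADV ADV _ 7 advmod _ _",
    "7 đi đi VERB V _ 2 ccomp get.03 AM-TMP:4|A0:5|A2:8",
    "8 chạy chạy VERB V _ 7 xcomp _ _",
    "9 bộ bộ NOUN N _ 8 obj _ _",
]
rows = [line.split() for line in EXCERPT]


def parse_args(cell):
    """'A0:1|AM-NEG:3' -> [('A0', 1), ('AM-NEG', 3)]; '_' -> []."""
    if cell == "_":
        return []
    pairs = []
    for item in cell.split("|"):
        label, _, head = item.rpartition(":")
        pairs.append((label, int(head)))
    return pairs


def missing_neg(rows):
    """List negation tokens whose head is a predicate lacking AM-NEG for them."""
    by_id = {int(r[0]): r for r in rows}
    flagged = []
    for r in rows:
        if r[1].lower() not in NEG_WORDS:
            continue
        head = by_id.get(int(r[6]))
        if head is None or head[8] == "_":
            continue  # head is outside the excerpt or carries no frame
        if ("AM-NEG", int(r[0])) not in parse_args(head[9]):
            flagged.append((int(r[0]), r[1], head[8]))
    return flagged


flagged = missing_neg(rows)
print(flagged)
```

This only flags negation words attached to a token that already has a frame, so a case like token 13 above, whose head has no frame, would need a separate check.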

###2. Annotate for "punct"
For example:

text = Tom nói với tôi rằng Mary nói là cô ấy không muốn làm việc đó nữa .

1 Tom tom NOUN NNP _ 2 nsubj _ _
2 nói nói VERB V _ 0 root say.01 A0:1|A2:4|A1:7
3 với với ADP PRE _ 4 case _ _
4 tôi tôi PRON PRO _ 2 obl:with _ _
5 rằng rằng SCONJ C _ 7 mark _ _
6 Mary mary NOUN NNP _ 7 nsubj _ _
7 nói nói VERB V _ 2 ccomp say.01 A0:6|A1:13
8 là là SCONJ C _ 7 mark _ _
9 cô cô NOUN N _ 12 nsubj _ _
10 ấy ấy PRON PRO _ 9 det:pmod _ _
11 không không ADV ADV _ 12 advmod:neg _ _
12 muốn muốn VERB V _ 13 aux _ _
13 làm việc làm việc VERB V _ 12 ccomp _ _
14 đó đó PRON PRO _ 13 obj _ _
15 nữa nữa ADV ADV _ 13 advmod _ _
16 . . PUNCT PUNCT _ 2 punct do.02 _

text = Ngoài Chủ Nhật ra thì ngày nào Tom cũng đi làm .

1 Ngoài ngoài NOUN N _ 9 obl _ _
2 Chủ Nhật chủ nhật NOUN NNP _ 1 nmod _ _
3 ra ra ADV ADV _ 2 advmod _ _
4 thì thì SCONJ C _ 9 mark _ _
5 ngày ngày NOUN N _ 10 obl:tmod _ _
6 nào nào PRON PRO _ 5 det _ _
7 Tom tom NOUN NNP _ 10 nsubj _ _
8 cũng cũng ADV ADV _ 9 advmod _ _
9 đi đi VERB V _ 0 root _ _
10 làm làm VERB V _ 9 compound:svc _ _
11 . . PUNCT PUNCT _ 9 punct work.01 A2:1|AM-TMP:5|A1:7|AM-TMP:8

text = Tôi tỉnh dậy và thấy mình nằm trên ghế sofa .

1 Tôi tôi PRON PRO _ 5 nsubj _ _
2 tỉnh tỉnh VERB V _ 0 root _ _
3 dậy dậy VERB V _ 2 compound:svc _ _
4 và và CCONJ CC _ 5 cc _ _
5 thấy thấy VERB V _ 2 conj see.01 A1:7
6 mình mình PRON PRO _ 7 nsubj _ _
7 nằm nằm VERB V _ 5 ccomp lie A0:6|AM-LOC:9
8 trên trên ADP PRE _ 9 case _ _
9 ghế ghế NOUN N _ 7 obl:comp _ _
10 sofa sofa Nb Nb _ 9 nmod _ _
11 . . PUNCT PUNCT _ 2 punct wake_up.02 A0:1

text = Tôi nghe thấy ai đó trên phố gọi tên tôi .

1 Tôi tôi PRON PRO _ 2 nsubj _ _
2 nghe nghe VERB V _ 0 root hear.01 A0:1|A1:8
3 thấy thấy VERB V _ 2 compound:svc _ _
4 ai ai PRON PRO _ 8 nsubj _ _
5 đó đó PRON PRO _ 4 det:pmod _ _
6 trên trên ADP PRE _ 7 det _ _
7 phố phố NOUN N _ 4 nmod _ _
8 gọi gọi VERB V _ 2 ccomp call.01 AM-LOC:7
9 tên tên NOUN N _ 8 obj _ _
10 tôi tôi PRON PRO _ 8 obj _ _
11 . . PUNCT PUNCT _ 2 punct hear.01 A0:1|C-A1:8

text = Việc đó xảy ra mười bảy năm trước khi hai anh em Wright bay thử lần đầu .

1 Việc việc NOUN N _ 3 nsubj _ _
2 đó đó PRON PRO _ 1 det:pmod _ _
3 xảy xảy VERB V _ 0 root _ _
4 ra ra ADV ADV _ 3 compound:svc _ _
5 mười bảy mười bảy NUM NUM _ 6 nummod _ _
6 năm năm NOUN N _ 3 obl:tmod _ _
7 trước trước NOUN N _ 6 appos:nmod _ _
8 khi khi NOUN N _ 6 obl:tmod _ _
9 hai hai NUM NUM _ 10 nummod _ _
10 anh em anh em NOUN N _ 12 nsubj _ _
11 Wright wright NOUN NNP _ 10 compound _ _
12 bay bay VERB V _ 8 acl:tmod fly.01 A0:10|A1:13|AM-PRP:14
13 thử thử VERB V _ 12 compound:svc try.01 _
14 lần lần NOUN N _ 12 obl _ _
15 đầu đầu NOUN N _ 14 amod _ _
16 . . PUNCT PUNCT _ 3 punct happen.01 A1:1|AM-TMP:6

text = Chúng tôi luôn đi bộ ngang qua bưu điện trên đường đi làm

1 Chúng tôi chúng tôi PRON PRO _ 3 nsubj _ _
2 luôn luôn ADV ADV _ 3 advmod _ _
3 đi đi VERB V _ 0 root _ _
4 bộ bộ NOUN N _ 3 obj _ _
5 ngang ngang VERB V _ 3 xcomp _ _
6 qua qua ADP PRE _ 7 case _ _
7 bưu điện bưu điện NOUN N _ 5 obl:comp _ _
8 trên trên ADP PRE _ 9 case _ _
9 đường đường NOUN N _ 11 obl:comp _ _
10 đi đi VERB V _ 9 acl:subj _ _
11 làm làm VERB V _ 3 conj work.01 A1:1|AM-ADV:2|AM-DIR:5
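
Rows like these can be found mechanically. The following sketch assumes the same ten-column layout (UPOS in column 4, FRAME in column 9) and uses three rows from the second example above; `frames_on_punct` is an illustrative name.

```python
# Rows 9-11 of the second example above (single-word tokens only).
EXCERPT = [
    "9 đi đi VERB V _ 0 root _ _",
    "10 làm làm VERB V _ 9 compound:svc _ _",
    "11 . . PUNCT PUNCT _ 9 punct work.01 A2:1|AM-TMP:5|A1:7|AM-TMP:8",
]
rows = [line.split() for line in EXCERPT]


def frames_on_punct(rows):
    """Return (token id, frame) for every punctuation token that carries a frame."""
    return [(int(r[0]), r[8]) for r in rows if r[3] == "PUNCT" and r[8] != "_"]


misplaced = frames_on_punct(rows)
print(misplaced)
```

The sketch only reports where a frame landed on punctuation; deciding which verb the annotation was meant for still needs a human look.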

###3. Wrong dependency
For example:

text = Tổng thống Jefferson không muốn cấm vận thương mại kéo dài .

1 Tổng thống tổng thống NOUN N _ 4 nsubj _ _
2 Jefferson jefferson NOUN NNP _ 1 flat _ _
3 không không ADV ADV _ 4 advmod:neg _ _
4 muốn muốn VERB V _ 0 root want.01 A0:1|AM-NEG:3
5 cấm vận cấm vận VERB V _ 0 root ban.01 A0:5|A1:6
6 thương mại thương mại NOUN N _ 5 obj _ _
7 kéo dài kéo dài VERB V _ 5 xcomp last.01 A1:7
8 . . PUNCT PUNCT _ 4 punct _ _

text = Ngay khi trông thấy viên cảnh sát , anh ta chạy đi .

1 Ngay ngay PART PART:G _ 2 discourse _ _
2 khi khi NOUN N _ 10 obl:tmod _ _
3 trông trông VERB V _ 2 acl:tmod (see.01 A1:5) --> or put it on token 4
4 thấy thấy VERB V _ 3 compound:svc _ _
5 viên viên NOUN Nc _ 6 det:clf _ _
6 cảnh sát cảnh sát NOUN N _ 3 obj _ _
7 , , PUNCT PUNCT _ 2 punct _ _
8 anh anh NOUN N _ 10 nsubj _ _
9 ta ta PRON PRO _ 8 det:pmod _ _
10 chạy chạy VERB V _ 0 root run.01 AM-TMP:2|A0:8|A1:11
11 đi đi ADV ADV _ 10 advmod _ _
12 . . PUNCT PUNCT _ 10 punct _ _

text = 46 triệu người Mỹ sống dưới ngưỡng nghèo trong năm 2010 .

1 46 46 NUM NUM _ 3 nummod _ _
2 triệu triệu NUM NUM _ 1 flat:number _ _
3 người người NOUN Nc _ 7 det _ _
4 Mỹ mỹ NOUN NNP _ 3 nmod _ _
5 sống sống VERB V _ 0 root live.01 A0:3|AM-TMP:10
6 dưới dưới ADP PRE _ 7 case _ _
7 ngưỡng ngưỡng NOUN N _ 7 compound _ _ --> Khuyên
8 nghèo nghèo ADJ ADJ _ 7 amod _ _
9 trong trong ADP PRE _ 10 case _ _
10 năm năm NOUN N _ 5 obl:tmod _ _
11 2010 2010 NUM NUM _ 10 flat:date _ _
12 . . PUNCT PUNCT _ 5 punct _ _

text = Anh ta miễn cưỡng đồng ý với đề nghị của tôi .

1 Anh anh NOUN N _ 3 nsubj _ _
2 ta ta PRON PRO _ 1 det:pmod _ _
3 miễn cưỡng miễn cưỡng VERB V _ 4 root A0:1
4 đồng ý đồng ý VERB V _ 3 xcomp agree.01 A1:6
5 với với ADP PRE _ 6 case _ _
6 đề nghị đề nghị NOUN N _ 4 obl:with _ _
7 của của ADP PRE _ 8 case _ _
8 tôi tôi PRON PRO _ 6 nmod:poss _ _
9 . . PUNCT PUNCT _ 3 punct _ _

text = Tôi đã muốn dành nhiều thời gian hơn tại Úc .

1 Tôi tôi PRON PRO _ 4 nsubj _ _
2 đã đã ADV ADV _ 3 advmod _ _
3 muốn muốn VERB V _ 0 root _ _
4 dành dành VERB V _ 0 root spend.02 A0:1|AM-TMP:2|A1:6|AM-LOC:9
5 nhiều nhiều ADJ ADJ _ 6 advmod:adj _ _
6 thời gian thời gian NOUN N _ 4 obj _ _
7 hơn hơn ADJ ADJ _ 6 advmod:adj _ _
8 tại tại ADP PRE _ 9 case _ _
9 Úc úc NOUN NNP _ 6 nmod _ _
10 . . PUNCT PUNCT _ 3 punct _ _

text = Tôi mong là Tom sẽ không muốn làm việc đó .

1 Tôi tôi PRON PRO _ 2 nsubj _ _
2 mong mong VERB V _ 0 root hope.01 A0:1|A1:
3 là là SCONJ C _ 8 mark _ _
4 Tom tom NOUN NNP _ 7 nsubj _ _
5 sẽ sẽ ADV ADV _ 7 advmod _ _
6 không không ADV ADV _ 7 advmod:neg _ _
7 muốn muốn VERB V _ 8 aux _ _
8 làm việc làm việc VERB V _ 2 ccomp work.01 A0:4|AM-TMP:5
9 đó đó PRON PRO _ 8 obj _ _
10 . . PUNCT PUNCT _ 2 punct _ _
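
Several of the dependency problems above (two roots, a token that is its own head, a `root` relation with a non-zero head) can be caught with simple structural checks. The sketch below works on `(id, head, deprel)` triples copied from the first and third examples; `check_tree` is an illustrative name, and the checks are a minimal subset of tree validation rather than a full validator.

```python
def check_tree(triples):
    """Report basic structural problems in a list of (id, head, deprel) triples."""
    ids = {i for i, _, _ in triples}
    problems = []
    roots = [i for i, h, _ in triples if h == 0]
    if len(roots) != 1:
        problems.append(f"expected 1 root, found {len(roots)}: {roots}")
    for i, h, rel in triples:
        if h == i:
            problems.append(f"token {i} is its own head")
        elif h != 0 and h not in ids:
            problems.append(f"token {i} has unknown head {h}")
        if (rel == "root") != (h == 0):
            problems.append(f"token {i}: deprel '{rel}' inconsistent with head {h}")
    return problems


# First example above (Tổng thống Jefferson ...): tokens 4 and 5 both have head 0.
two_roots = [
    (1, 4, "nsubj"), (2, 1, "flat"), (3, 4, "advmod:neg"), (4, 0, "root"),
    (5, 0, "root"), (6, 5, "obj"), (7, 5, "xcomp"), (8, 4, "punct"),
]
# Third example above (46 triệu người Mỹ ...): token 7 points at itself.
self_head = [
    (1, 3, "nummod"), (2, 1, "flat:number"), (3, 7, "det"), (4, 3, "nmod"),
    (5, 0, "root"), (6, 7, "case"), (7, 7, "compound"), (8, 7, "amod"),
    (9, 10, "case"), (10, 5, "obl:tmod"), (11, 10, "flat:date"), (12, 5, "punct"),
]

print(check_tree(two_roots))
print(check_tree(self_head))
```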

###4. Text in SRL, for example:

text = Tôi vừa mới dậy . Cho tôi chút thời gian để chuẩn bị đi .

1 Tôi tôi PRON PRO _ 3 nsubj _ _
2 vừa mới vừa mới ADV ADV _ 3 advmod _ _
3 dậy dậy VERB V _ 0 root wake_up.01 A0:1
4 . . PUNCT PUNCT _ 5 punct _ _
5 Cho cho VERB V _ 3 conj give.01 A2:6|A1:8|chuẩn bị:10
6 tôi tôi PRON PRO _ 5 obj _ _
7 chút chút DET DET _ 8 det _ _
8 thời gian thời gian NOUN N _ 5 obj _ _
9 để để ADP PRE _ 11 mark _ _
10 chuẩn bị chuẩn bị VERB V _ 11 xcomp prepare.02 _
11 đi đi VERB V _ 10 xcomp _ _
12 . . PUNCT PUNCT _ 3 punct _ _
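
Stray text in the argument cell can be detected by validating each `LABEL:headId` item against a pattern. The pattern below is an assumption about the label inventory (numbered arguments A0-A5 and AA, three-letter AM- modifiers, optional C- or R- prefix) and may need widening for the actual data; `bad_items` is an illustrative name.

```python
import re

# Assumed shape of one item: optional C-/R- prefix, a role label, ':' and a token id.
ITEM = re.compile(r"(?:[CR]-)?(?:A[0-5A]|AM-[A-Z]{3}):\d+")


def bad_items(cell):
    """Return the items of an ARGS cell that do not look like LABEL:headId."""
    if cell == "_":
        return []
    return [item for item in cell.split("|") if not ITEM.fullmatch(item)]


print(bad_items("A2:6|A1:8|chuẩn bị:10"))
print(bad_items("A0:1|C-A1:8"))
```

The same check also catches items with a missing token id, such as a trailing `A1:`.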

###5. Missing column, for example:

text = Anh ấy giữ bình tĩnh khi đối mặt với nguy hiểm .

1 Anh anh NOUN N _ 3 nsubj _ _
2 ấy ấy PRON PRO _ 1 det:pmod _ _
3 giữ giữ VERB V _ 0 root keep.02 A0:1|A1:4|AM-TMP:5
4 bình tĩnh bình tĩnh ADJ ADJ _ 3 acomp _ _
5 khi khi NOUN N _ 3 obl:tmod _ _
6 đối mặt đối mặt VERB V _ 5 acl:tmod A1:8
7 với với ADP PRE _ 8 case _ _
8 nguy hiểm nguy hiểm NOUN N _ 6 obl:with _ _
9 . . PUNCT PUNCT _ 3 punct _ _
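
A column-count check would catch rows like token 6 above. The sketch assumes the file is tab-separated with ten columns per token row, which is why the sample rows are built with explicit tabs (multi-word tokens such as "đối mặt" contain spaces); `malformed_rows` is an illustrative name.

```python
EXPECTED_COLUMNS = 10  # ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL FRAME ARGS (assumed)

# Rows 5-7 of the example above, joined with tabs as the file is assumed to be.
SENTENCE = [
    ["5", "khi", "khi", "NOUN", "N", "_", "3", "obl:tmod", "_", "_"],
    ["6", "đối mặt", "đối mặt", "VERB", "V", "_", "5", "acl:tmod", "A1:8"],
    ["7", "với", "với", "ADP", "PRE", "_", "8", "case", "_", "_"],
]
LINES = ["\t".join(fields) for fields in SENTENCE]


def malformed_rows(lines, expected=EXPECTED_COLUMNS):
    """Return (token id, column count) for rows with an unexpected number of columns."""
    bad = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) != expected:
            bad.append((fields[0], len(fields)))
    return bad


bad = malformed_rows(LINES)
print(bad)
```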

###6. Incorrect translation, for example:

text = Tôi ngủ dậy và thấy mình nằm trên sàn nhà .

1 Tôi tôi PRON PRO _ 5 nsubj _ _
2 ngủ ngủ VERB V _ 0 root sleep.01 A0:1|A1:3 --> wake_up
3 dậy dậy VERB V _ 2 compound:svc _ _
4 và và CCONJ CC _ 5 cc _ _
5 thấy thấy VERB V _ 2 conj see.01 A1:7
6 mình mình PRON PRO _ 7 nsubj _ _
7 nằm nằm VERB V _ 5 ccomp A0:6 --> lie.01
8 trên trên ADP PRE _ 9 case _ _
9 sàn sàn NOUN N _ 7 obl _ _
10 nhà nhà NOUN N _ 9 det _ _
11 . . PUNCT PUNCT _ 2 punct _ _

###7. Spelling errors

text = John đã lấy chỉa khoá ra túi của anh ấy .

1 John john NOUN NNP _ 3 nsubj _ _
2 đã đã ADV ADV _ 3 advmod _ _
3 lấy lấy VERB V _ 0 root take.01 A0:1|AM-TMP:2|A1:4|AM-LOC:7
4 chìa chìa NOUN N _ 3 obj _ _
5 khoá khoá VERB V _ 3 xcomp _ _
6 ra ra ADV ADV _ 5 compound:dir _ _
7 túi túi NOUN N _ 3 obj _ _
8 của của ADP PRE _ 9 case _ _
9 anh anh NOUN N _ 7 nmod:poss _ _
10 ấy ấy PRON PRO _ 9 det:pmod _ _
11 . . PUNCT PUNCT _ 3 punct _ _

###8. Missing roles in the sentence, for example:

text = Thật ra không phải là tôi quên máy ảnh . Chỉ là tôi không muốn chụp ảnh thôi .

1 Thật ra thật ra X X _ 6 advmod _ _
2 không không ADV ADV _ 3 advmod:neg _ _
3 phải phải ADJ ADJ _ 6 aux _ _
4 là là SCONJ C _ 6 mark be.01 A0:5|A1:7
5 tôi tôi PRON PRO _ 7 nmod _ _
6 quên quên VERB V _ 0 root forget.01 _
7 máy ảnh máy ảnh NOUN N _ 6 obj _ _
8 . . PUNCT PUNCT _ 6 punct _ _
9 Chỉ chỉ ADV ADV _ 13 advmod _ _
10 là là VERB V _ 11 cop _ _
11 tôi tôi PRON PRO _ 13 parataxis _ _
12 không không ADV ADV _ 13 advmod:neg _ _
13 muốn muốn VERB V _ 14 aux _ _
14 chụp chụp VERB V _ 11 acl:subj take.01 A1:15|AM-DIS:16
15 ảnh ảnh NOUN N _ 16 obj _ _
16 thôi thôi PART PART:G _ 13 discourse _ _
17 . . PUNCT PUNCT _ 6 punct _ _

Missing ARGS:SPAN

The gold annotations contain span information. The SRL conllup exported data is missing a column for the argument spans.
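
Until the export carries the gold spans, one rough approximation is to project each argument head onto its dependency subtree. The sketch below does this for the keep.02 arguments of the "Anh ấy giữ bình tĩnh ..." example above. It is only an approximation under the assumption that a span equals the token range covered by the head's subtree, so it cannot replace the gold spans and will be wrong wherever the dependency tree is wrong; `subtree_span` is an illustrative name.

```python
def subtree_span(heads, head_id):
    """Return (first, last) token id covered by the dependency subtree of head_id."""
    children = {}
    for tid, h in heads.items():
        children.setdefault(h, []).append(tid)
    stack, seen = [head_id], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guards against cycles in malformed trees
        seen.add(node)
        stack.extend(children.get(node, []))
    return min(seen), max(seen)


# Token id -> head id, copied from the example under "###5. Missing column".
HEADS = {1: 3, 2: 1, 3: 0, 4: 3, 5: 3, 6: 5, 7: 8, 8: 6, 9: 3}
# Arguments of token 3 (keep.02): A0:1|A1:4|AM-TMP:5
ARGS = [("A0", 1), ("A1", 4), ("AM-TMP", 5)]

spans = {label: subtree_span(HEADS, head) for label, head in ARGS}
print(spans)
```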
