ace2005-preprocessing's People

Contributors

bowbowbow

ace2005-preprocessing's Issues

about chinese

Hi! Thanks for your code!
I want to know whether this repository can be used to process Chinese data.

Warning

Thanks for your work. In practice, I ran into the warning

"The entity in the other sentence is mentioned. This argument will be ignored"

Can you help me out?

exact sentence which caused 'end_idx = -1' issue

Hi there!
Sorry for bothering again.
I am using ace_2005_td_v7_LDC2006T06.tgz dataset and I have downloaded the latest version of this github repo.

During the processing of the training data, an assertion error occurred:
assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,

I simply commented out the assertion and main.py finished running without exceptions.

Here is what I found in the output file:

"sentence": ""Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.",
"golden-entity-mentions": [

  {
    "text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
    "entity-type": "ORG:Non-Governmental",
    "start": 12,
    **"end": -1**
  },...]

How can I solve this end: -1 problem?
Otherwise the entity recognition is incomplete.
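
For context, here is a rough sketch of the offset lookup that produces end: -1, inferred from the assertion messages quoted above rather than copied from the repository's find_token_index: the phrase, with spaces stripped, is searched inside the concatenation of the CoreNLP token strings, so when the tokens come back mis-decoded (e.g. 'M茅decins' instead of 'Médecins') the phrase is never found and the index stays -1.

def find_phrase_in_tokens(tokens, phrase):
    # Concatenate the token surface forms and search the space-stripped phrase in them.
    chars = ''.join(t['originalText'] for t in tokens)
    target = phrase.replace(' ', '')
    pos = chars.find(target)
    if pos == -1:
        # Exactly the situation reported above: mis-decoded tokens never contain the phrase.
        return -1, -1
    # The real code maps these character positions back to token indices.
    return pos, pos + len(target)

tokens = [{'originalText': 'Doctors'}, {'originalText': 'Without'}, {'originalText': 'Borders/M茅decins'}]
print(find_phrase_in_tokens(tokens, 'Doctors Without Borders/Médecins'))  # (-1, -1)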

output

zhou@zhou-Lenovo-Legion-Y7000P-1060:~/文档/ace2005-preprocessing-master$ sudo python main.py --data =/home/zhou/文档/ACE/English/

('[preprocessing] type: ', 'dev')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

('[preprocessing] type: ', 'test')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

('[preprocessing] type: ', 'train')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

Train Preprocess Error: start_idx != -1

Hello,
With the version from your latest commit, an error occurs while processing the data. The error message is:

70%|████████████████████████████▌ | 368/529 [31:16<18:40, 6.96s/it][Warning] The entity in the other sentence is mentioned. This argument will be ignored.
File "main.py", line 162, in preprocessing
phrase=event_mention['trigger']['text'],
File "main.py", line 37, in find_token_index
assert start_idx != -1, "start_idx: {}, start_pos: {}, phrase: {}, tokens: {}".format(start_idx, start_pos, phrase, tokens)
AssertionError: start_idx: -1, start_pos: -5, phrase: die, tokens: [{'index': 1, 'characterOffsetEnd': 3, 'characterOffsetBegin': 0, 'pos': 'WRB', 'word': 'How', 'lemma': 'how', 'originalText': 'How', 'before': '', 'after': ' '}, {'index': 2, 'characterOffsetEnd': 9, 'characterOffsetBegin': 4, 'pos': 'MD', 'word': 'would', 'lemma': 'would', 'originalText': 'would', 'before': ' ', 'after': ' '}, {'index': 3, 'characterOffsetEnd': 13, 'characterOffsetBegin': 10, 'pos': 'PRP', 'word': 'you', 'lemma': 'you', 'originalText': 'you', 'before': ' ', 'after': ' '}, {'index': 4, 'characterOffsetEnd': 19, 'characterOffsetBegin': 14, 'pos': 'VB', 'word': 'react', 'lemma': 'react', 'originalText': 'react', 'before': ' ', 'after': ' '}, {'index': 5, 'characterOffsetEnd': 22, 'characterOffsetBegin': 20, 'pos': 'TO', 'word': 'to', 'lemma': 'to', 'originalText': 'to', 'before': ' ', 'after': ' '}, {'index': 6, 'characterOffsetEnd': 27, 'characterOffsetBegin': 23, 'pos': 'PDT', 'word': 'such', 'lemma': 'such', 'originalText': 'such', 'before': ' ', 'after': ' '}, {'index': 7, 'characterOffsetEnd': 29, 'characterOffsetBegin': 28, 'pos': 'DT', 'word': 'a', 'lemma': 'a', 'originalText': 'a', 'before': ' ', 'after': ' '}, {'index': 8, 'characterOffsetEnd': 34, 'characterOffsetBegin': 30, 'pos': 'NN', 'word': 'call', 'lemma': 'call', 'originalText': 'call', 'before': ' ', 'after': ''}, {'index': 9, 'characterOffsetEnd': 35, 'characterOffsetBegin': 34, 'pos': '.', 'word': '?', 'lemma': '?', 'originalText': '?', 'before': '', 'after': ''}]

The previous version did not have this problem, but in that version the entity recognition on the telephone-conversation data seemed to be incorrect.

Did not finish

The preprocessing has been taking forever and has not finished yet! It has been about one hour. Is that expected?

Some entities, event triggers and arguments not in sentence

Hi there,

We have found that some of the samples in the output contain entities, event triggers, and event arguments that do not appear in their sentence.

We made a test script to detect problematic samples and describe each inconsistency:
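
The script itself is not reproduced here; a minimal sketch of such a consistency check (not the reporters' actual script, and assuming the output keys "sentence", "golden-entity-mentions" and "golden-event-mentions" with "trigger" and "arguments", plus a hypothetical output path) could look like this:

import json

def find_inconsistent(path):
    # Flag samples whose entity / trigger / argument strings do not occur in their own sentence.
    bad = []
    with open(path, encoding='utf-8') as f:
        samples = json.load(f)
    for i, sample in enumerate(samples):
        sent = sample['sentence']
        texts = [m['text'] for m in sample.get('golden-entity-mentions', [])]
        for event in sample.get('golden-event-mentions', []):
            texts.append(event['trigger']['text'])
            texts.extend(arg['text'] for arg in event.get('arguments', []))
        missing = [t for t in texts if t not in sent]
        if missing:
            bad.append((i, missing))
    return bad

print(len(find_inconsistent('output/train.json')))  # hypothetical path to the generated file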

The results of our test are:

  • 187 samples in the train.json set have at least one entity, trigger, or argument that does not appear in their sentence
  • 2 samples in test.json
  • 5 samples in dev.json

What seems to be happening is that some samples end up with no entities, event triggers, or event arguments, because these end up in the next sample instead.

:(

A bug about processing

Hi! I appreciate your work! It is convenient to use.
But I found that every 'Headline' text appears in the output after processing, even though the 'Headline' is never labeled with entity mentions or anything else. I suggest removing headlines after processing.
Thanks!

bug in dev.json named entity

Hi,
thanks for your code!
I found that the 'golden-entity-mentions' in dev.json are a little bit strange.
Most of them are empty or just the time '2003-03-29T16:00:00-05:00'.

train and test seem fine to me. Did you encounter similar results? Thank you.

Some problems when preprocessing

Hello! Thanks for your contribution.
When I ran this code to preprocess the ACE 2005 corpus, some warnings and errors occurred, and I wonder whether they affect the result.

  • [Warning] The entity in the other sentence is mentioned. This argument will be ignored. This warning occurred multiple times during preprocessing.

  • [Warning] fail to find offset! (start_index: 3348, text: Doctors Without Borders/Médecins Sans Frontières (MSF, path: D:\Data\ace_2005_td_v7\data\English\un/timex2norm/alt.vacation.las-vegas_20050109.0133) This warning actually raises an assertion error (end_idx != -1), but I commented out the corresponding code in main.py to avoid the error. I have read the other issues and I know that simply deleting the file may solve the problem, but I want to know whether there are other solutions besides deleting it. I also wonder whether this warning means the result contains mistakes.
    Looking forward to your reply!

Support for Arabic

Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend the code in main.py to support Arabic.

In my initial trials, I tried the following:

  1. Created data_list_arabic.csv file to include train/dev/test splits. An example of the first few lines of the file looks like the following:
type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127
  2. Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
  3. Created the nlp_res_raw object as:
nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)
  4. Downloaded the Arabic models:
cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar

Now when I run the script, I keep getting the following error: Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz.

I must be making a mistake somewhere, such as not downloading the correct package or not pointing an environment variable to the correct location. Any help to add support for Arabic is greatly appreciated.
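
One possible cause, offered as an assumption rather than a confirmed diagnosis: the CoreNLP server only sees the Arabic segmenter model if the stanford-arabic-corenlp-*-models.jar sits next to the other jars in the folder the server is launched from, since the models are loaded from the Java classpath. A small sanity check (hypothetical folder name) might be:

import glob
import os

corenlp_dir = 'stanford-corenlp-full-2018-10-05'  # folder the CoreNLP server is launched from
jars = [os.path.basename(p) for p in glob.glob(os.path.join(corenlp_dir, '*.jar'))]
if not any('arabic' in name.lower() for name in jars):
    print('Arabic models jar not found next to the CoreNLP jars; '
          'the segmenter model cannot be loaded from the classpath.')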

Add download to NLTK's punkt

Please note that if the punkt resource is not available to NLTK, this code will not work.
I suggest adding a check somewhere in the code for the resource, or adding to the documentation that users should run nltk.download('punkt') before using this code.
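
A minimal guard along these lines (a sketch, not the repository's code) could be:

import nltk

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    # Download the sentence tokenizer models the first time the script runs.
    nltk.download('punkt')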

fail to find offset! start_index: 3348, text: Doctors Without Borders/Médecins Sans Frontières

Thanks for sharing! When running this code, at the step preprocessing('train', train_files), I encountered the error below:

71%|███████▏ | 377/529
Traceback (most recent call last):
File "E:/githubspace/ace2005-preprocessing/main.py", line 213, in
preprocessing('train', train_files)
File "E:/githubspace/ace2005-preprocessing/main.py", line 143, in preprocessing
phrase=entity_mention['text'],
File "E:/githubspace/ace2005-preprocessing/main.py", line 49, in find_token_index
assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0, 'characterOffsetEnd': 1, 'pos': '``', 'before': '', 'after': ''}, {'index': 2, 'word': 'Doctors', 'originalText': 'Doctors', 'lemma': 'doctor', 'characterOffsetBegin': 1, 'characterOffsetEnd': 8, 'pos': 'NNS', 'before': '', 'after': ' '}, {'index': 3, 'word': 'Without', 'originalText': 'Without', 'lemma': 'without', 'characterOffsetBegin': 9, 'characterOffsetEnd': 16, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'Borders/M茅decins', 'originalText': 'Borders/M茅decins', 'lemma': 'borders/m茅decins', 'characterOffsetBegin': 17, 'characterOffsetEnd': 33, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'Sans', 'originalText': 'Sans', 'lemma': 'san', 'characterOffsetBegin': 34, 'characterOffsetEnd': 38, 'pos': 'VBZ', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Fronti猫res', 'originalText': 'Fronti猫res', 'lemma': 'fronti猫res', 'characterOffsetBegin': 39, 'characterOffsetEnd': 49, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 7, 'word': '-LRB-', 'originalText': '(', 'lemma': '-lrb-', 'characterOffsetBegin': 50, 'characterOffsetEnd': 51, 'pos': '-LRB-', 'before': ' ', 'after': ''}, {'index': 8, 'word': 'MSF', 'originalText': 'MSF', 'lemma': 'msf', 'characterOffsetBegin': 51, 'characterOffsetEnd': 54, 'pos': 'NN', 'before': '', 'after': ''}, {'index': 9, 'word': '-RRB-', 'originalText': ')', 'lemma': '-rrb-', 'characterOffsetBegin': 54, 'characterOffsetEnd': 55, 'pos': '-RRB-', 'before': '', 'after': ' '}, {'index': 10, 'word': 'has', 'originalText': 'has', 'lemma': 'have', 'characterOffsetBegin': 56, 'characterOffsetEnd': 59, 'pos': 'VBZ', 'before': ' ', 'after': ' '}, {'index': 11, 'word': 'received', 'originalText': 'received', 'lemma': 'receive', 'characterOffsetBegin': 60, 'characterOffsetEnd': 68, 'pos': 'VBN', 'before': ' ', 'after': ' '}, {'index': 12, 'word': 'an', 'originalText': 'an', 'lemma': 'a', 'characterOffsetBegin': 69, 'characterOffsetEnd': 71, 'pos': 'DT', 'before': ' ', 'after': ' '}, {'index': 13, 'word': 'extraordinary', 'originalText': 'extraordinary', 'lemma': 'extraordinary', 'characterOffsetBegin': 72, 'characterOffsetEnd': 85, 'pos': 'JJ', 'before': ' ', 'after': ' '}, {'index': 14, 'word': 'outpouring', 'originalText': 'outpouring', 'lemma': 'outpouring', 'characterOffsetBegin': 86, 'characterOffsetEnd': 96, 'pos': 'NN', 'before': ' ', 'after': ' '}, {'index': 15, 'word': 'of', 'originalText': 'of', 'lemma': 'of', 'characterOffsetBegin': 97, 'characterOffsetEnd': 99, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 16, 'word': 'support', 'originalText': 'support', 'lemma': 'support', 'characterOffsetBegin': 100, 'characterOffsetEnd': 107, 'pos': 'NN', 'before': ' ', 'after': ' '}, {'index': 17, 'word': 'for', 'originalText': 'for', 'lemma': 'for', 'characterOffsetBegin': 108, 'characterOffsetEnd': 111, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 18, 'word': 'the', 'originalText': 'the', 'lemma': 'the', 'characterOffsetBegin': 112, 'characterOffsetEnd': 115, 'pos': 'DT', 'before': ' ', 'after': ' '}, {'index': 19, 'word': 'people', 'originalText': 'people', 'lemma': 'people', 'characterOffsetBegin': 116, 'characterOffsetEnd': 122, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 20, 'word': 'of', 'originalText': 'of', 'lemma': 'of', 'characterOffsetBegin': 123, 'characterOffsetEnd': 125, 'pos': 'IN', 'before': 
' ', 'after': ' '}, {'index': 21, 'word': 'South', 'originalText': 'South', 'lemma': 'South', 'characterOffsetBegin': 126, 'characterOffsetEnd': 131, 'pos': 'NNP', 'before': ' ', 'after': ' '}, {'index': 22, 'word': 'Asia', 'originalText': 'Asia', 'lemma': 'Asia', 'characterOffsetBegin': 132, 'characterOffsetEnd': 136, 'pos': 'NNP', 'before': ' ', 'after': ' '}, {'index': 23, 'word': 'and', 'originalText': 'and', 'lemma': 'and', 'characterOffsetBegin': 137, 'characterOffsetEnd': 140, 'pos': 'CC', 'before': ' ', 'after': ' '}, {'index': 24, 'word': 'we', 'originalText': 'we', 'lemma': 'we', 'characterOffsetBegin': 141, 'characterOffsetEnd': 143, 'pos': 'PRP', 'before': ' ', 'after': ' '}, {'index': 25, 'word': 'are', 'originalText': 'are', 'lemma': 'be', 'characterOffsetBegin': 144, 'characterOffsetEnd': 147, 'pos': 'VBP', 'before': ' ', 'after': ' '}, {'index': 26, 'word': 'extremely', 'originalText': 'extremely', 'lemma': 'extremely', 'characterOffsetBegin': 148, 'characterOffsetEnd': 157, 'pos': 'RB', 'before': ' ', 'after': ' '}, {'index': 27, 'word': 'grateful', 'originalText': 'grateful', 'lemma': 'grateful', 'characterOffsetBegin': 158, 'characterOffsetEnd': 166, 'pos': 'JJ', 'before': ' ', 'after': ''}, {'index': 28, 'word': '.', 'originalText': '.', 'lemma': '.', 'characterOffsetBegin': 166, 'characterOffsetEnd': 167, 'pos': '.', 'before': '', 'after': ''}], chars:extraordinaryoutpouringofsupportforthepeopleofSouthAsiaandweareextremelygrateful
Exception ignored in: <bound method tqdm.__del__ of 71%|███████▏ | 377/529 [13:06<11:34, 4.57s/it]>
Traceback (most recent call last):
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 931, in __del__
self.close()
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 1133, in close
self._decr_instances(self)
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 496, in _decr_instances
cls.monitor.exit()
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_monitor.py", line 52, in exit
self.join()
File "D:\ProgramData\Anaconda3\lib\threading.py", line 1053, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

Is replace() necessary?

Hi there,
Thanks for the code.
I found a small mistake in the code and want to clarify it. In the parse_sgm(sgm_path) function, at the line

converted_text = converted_text.replace('Ltd.', 'Limited')

you change the sentence with replace() (e.g. 'U.S.' to 'US'). This changes the sentence, so when you later call find() at the line
pos = sgm_text.find(sent, last_pos)

it will not return the correct position of the sentence in the actual text and will return -1 instead. This leads to wrong entity sets for the sentences.
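
A minimal illustration of the mismatch described above, using hypothetical strings rather than the repository's data:

sgm_text = "Acme Ltd. announced a merger today."
sent = "Acme Ltd. announced a merger today.".replace('Ltd.', 'Limited')

# The replaced sentence no longer occurs verbatim in the original text,
# so find() returns -1 and the entity offsets for this sentence go wrong.
pos = sgm_text.find(sent)
print(pos)  # -1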

JMEE

Thank you very much for your code. I want to ask: did you feed your output data to the JMEE model to perform event extraction, and did you get the same F1 as reported in the JMEE paper? I ask because I found that the dev/test/train split in the JMEE paper is different from yours:
This data split includes 40 newswire articles (881 sentences) for the test set, 30 other documents (1087 sentences) for the development set and 529 remaining documents (21,090 sentences) for the training set
I am looking forward to your reply. Thank you very much!

FileNotFoundError: [Errno 2] No such file or directory: '/ace_2005_td_v7/data/English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml'

Thank you for sharing. I have an issue: when I process the ACE data with this code, the following error occurs.

Traceback (most recent call last):
File "main.py", line 229, in
preprocessing('train', train_files)
File "main.py", line 100, in preprocessing
parser = Parser(path=file)
File "Code/bert-event-extraction/ace2005-preprocessing/parser.py", line 16, in init
self.entity_mentions, self.event_mentions = self.parse_xml(path + '.apf.xml')
File "Code/bert-event-extraction/ace2005-preprocessing/parser.py", line 167, in parse_xml
tree = ElementTree.parse(xml_path)
File "anaconda3/envs/pytorch_38/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "anaconda3/envs/pytorch_38/lib/python3.8/xml/etree/ElementTree.py", line 584, in parse
source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/ace_2005_td_v7/data/English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml'

preprocessing problem

StanfordCore Exception Expecting value: line 1 column 1 (char 0)
item["sentence"]: [ applause ] it is important for you all to understand and for our fellow americans to understand the tax relief that i have proposed and will push for until enacted would create -- [ applause ] will create 1.4 million new jobs by the end of 200 in two years time, this nation has experienced war, a recession and a national emergency.
nlp_text: CoreNLP request timed out. Your document may be too long.

Did you meet this problem? How can I solve it?
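
One workaround, offered as an assumption rather than a verified fix: the "Expecting value: line 1 column 1" exception appears to come from parsing the plain-text timeout message as JSON, so raising the timeout may avoid both errors. The stanfordcorenlp client accepts a timeout argument (in milliseconds) that it forwards to the CoreNLP server it launches. A sketch with a hypothetical local CoreNLP folder:

from stanfordcorenlp import StanfordCoreNLP

# timeout is in milliseconds; long documents can exceed the default.
nlp = StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', timeout=60000)

text = '[ applause ] it is important for you all to understand ...'
res = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,lemma,parse',
    'outputFormat': 'json',
})
nlp.close()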

Hello

Hello, could you please share the ACE 2005 dataset with me? Thank you very much!

any idea how to fix sentence tokenizer mismatch?

Thanks a lot for your code!

I was trying to use your code to preprocess Chinese and Arabic data for event extraction. It seems to me that for Chinese and some Arabic data the number of sentence tokenizer mismatches is huge, so I end up getting only a few sentences from the Chinese corpus.

Do you have any idea how we could fix that? I tried to replace nltk.sent_tokenize with my own sentence splitter, but the assertion in find_token_index() stopped the code. I am not sure how to deal with it.
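
For reference, a naive splitter of the kind one might swap in for Chinese (a sketch, not the repository's method; the downstream offset assertions in find_token_index() still have to be satisfied):

def split_chinese_sentences(text):
    # Split after the common full-width sentence terminators, keeping them attached.
    sentences, buf = [], ''
    for ch in text:
        buf += ch
        if ch in '。！？；':
            sentences.append(buf)
            buf = ''
    if buf.strip():
        sentences.append(buf)
    return [s.strip() for s in sentences if s.strip()]

print(split_chinese_sentences('你好。这是第二句！'))  # ['你好。', '这是第二句！']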

Java not found and Resource punkt not found

File "/Users/myname/opt/anaconda3/lib/python3.9/site-packages/stanfordcorenlp/corenlp.py", line 47, in init
raise RuntimeError('Java not found.')
RuntimeError: Java not found.

raise LookupError(resource_not_found)

LookupError:


Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')
