ace2005-preprocessing's People

Contributors

bowbowbow

ace2005-preprocessing's Issues

about chinese

Hi! Thanks for your code!
I want to know whether this repository can be used to process Chinese data.

Warning

Thanks for your work. In practice, I ran into the warning

"The entity in the other sentence is mentioned. This argument will be ignored"

Can you help me out?

exact sentence which caused 'end_idx = -1' issue

Hi there!
Sorry for bothering again.
I am using ace_2005_td_v7_LDC2006T06.tgz dataset and I have downloaded the latest version of this github repo.

During the processing of the training data, an assertion error occurred:
assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0,

I simply commented out the assertion and main.py finished running without exceptions.

Here is what I found in the output file:

"sentence": ""Doctors Without Borders/M\u8305decins Sans Fronti\u732bres (MSF) has received an extraordinary outpouring of support for the people of South Asia and we are extremely grateful.",
"golden-entity-mentions": [

  {
    "text": "Doctors Without Borders/M\u00e9decins Sans Fronti\u00e8res (MSF",
    "entity-type": "ORG:Non-Governmental",
    "start": 12,
    **"end": -1**
  },...]

How can I solve this end: -1 problem?
Otherwise the entity recognition is incomplete.
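
For context, here is a rough sketch of the offset lookup that produces end: -1, inferred from the assertion messages quoted above rather than copied from the repository's find_token_index: the phrase, with spaces stripped, is searched inside the concatenation of the CoreNLP token strings, so when the tokens come back mis-decoded (e.g. 'M茅decins' instead of 'Médecins') the phrase is never found and the index stays -1.

def find_phrase_in_tokens(tokens, phrase):
    # Concatenate the token surface forms and search the space-stripped phrase in them.
    chars = ''.join(t['originalText'] for t in tokens)
    target = phrase.replace(' ', '')
    pos = chars.find(target)
    if pos == -1:
        # Exactly the situation reported above: mis-decoded tokens never contain the phrase.
        return -1, -1
    # The real code maps these character positions back to token indices.
    return pos, pos + len(target)

tokens = [{'originalText': 'Doctors'}, {'originalText': 'Without'}, {'originalText': 'Borders/M茅decins'}]
print(find_phrase_in_tokens(tokens, 'Doctors Without Borders/Médecins'))  # (-1, -1)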

output

zhou@zhou-Lenovo-Legion-Y7000P-1060:~/文档/ace2005-preprocessing-master$ sudo python main.py --data =/home/zhou/文档/ACE/English/

('[preprocessing] type: ', 'dev')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

('[preprocessing] type: ', 'test')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

('[preprocessing] type: ', 'train')
0it [00:00, ?it/s]
======[Statistics]======
('sent :', 0)
('event :', 0)
('entity :', 0)
('argument:', 0)
Complete verification

Train Preprocess Error: start_idx != -1

Hello,
With the version from your latest commit, an error occurs while processing the data. The error message is:

70%|████████████████████████████▌ | 368/529 [31:16<18:40, 6.96s/it][Warning] The entity in the other sentence is mentioned. This argument will be ignored.
File "main.py", line 162, in preprocessing
phrase=event_mention['trigger']['text'],
File "main.py", line 37, in find_token_index
assert start_idx != -1, "start_idx: {}, start_pos: {}, phrase: {}, tokens: {}".format(start_idx, start_pos, phrase, tokens)
AssertionError: start_idx: -1, start_pos: -5, phrase: die, tokens: [{'index': 1, 'characterOffsetEnd': 3, 'characterOffsetBegin': 0, 'pos': 'WRB', 'word': 'How', 'lemma': 'how', 'originalText': 'How', 'before': '', 'after': ' '}, {'index': 2, 'characterOffsetEnd': 9, 'characterOffsetBegin': 4, 'pos': 'MD', 'word': 'would', 'lemma': 'would', 'originalText': 'would', 'before': ' ', 'after': ' '}, {'index': 3, 'characterOffsetEnd': 13, 'characterOffsetBegin': 10, 'pos': 'PRP', 'word': 'you', 'lemma': 'you', 'originalText': 'you', 'before': ' ', 'after': ' '}, {'index': 4, 'characterOffsetEnd': 19, 'characterOffsetBegin': 14, 'pos': 'VB', 'word': 'react', 'lemma': 'react', 'originalText': 'react', 'before': ' ', 'after': ' '}, {'index': 5, 'characterOffsetEnd': 22, 'characterOffsetBegin': 20, 'pos': 'TO', 'word': 'to', 'lemma': 'to', 'originalText': 'to', 'before': ' ', 'after': ' '}, {'index': 6, 'characterOffsetEnd': 27, 'characterOffsetBegin': 23, 'pos': 'PDT', 'word': 'such', 'lemma': 'such', 'originalText': 'such', 'before': ' ', 'after': ' '}, {'index': 7, 'characterOffsetEnd': 29, 'characterOffsetBegin': 28, 'pos': 'DT', 'word': 'a', 'lemma': 'a', 'originalText': 'a', 'before': ' ', 'after': ' '}, {'index': 8, 'characterOffsetEnd': 34, 'characterOffsetBegin': 30, 'pos': 'NN', 'word': 'call', 'lemma': 'call', 'originalText': 'call', 'before': ' ', 'after': ''}, {'index': 9, 'characterOffsetEnd': 35, 'characterOffsetBegin': 34, 'pos': '.', 'word': '?', 'lemma': '?', 'originalText': '?', 'before': '', 'after': ''}]

The previous version did not have this problem, but in that version the entity recognition on the telephone-conversation data seemed to be incorrect.

Did not finish

The preprocessing has been taking forever and has not finished yet! It has been about one hour. Is that expected?

Some entities, event triggers and arguments not in sentence

Hi there,

We have found that some of the samples in the output contain entities, event triggers, and event arguments that do not appear in their sentence.

We made a test script to detect problematic samples and describe each inconsistency:
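
The script itself is not reproduced here; a minimal sketch of such a consistency check (not the reporters' actual script, and assuming the output keys "sentence", "golden-entity-mentions" and "golden-event-mentions" with "trigger" and "arguments", plus a hypothetical output path) could look like this:

import json

def find_inconsistent(path):
    # Flag samples whose entity / trigger / argument strings do not occur in their own sentence.
    bad = []
    with open(path, encoding='utf-8') as f:
        samples = json.load(f)
    for i, sample in enumerate(samples):
        sent = sample['sentence']
        texts = [m['text'] for m in sample.get('golden-entity-mentions', [])]
        for event in sample.get('golden-event-mentions', []):
            texts.append(event['trigger']['text'])
            texts.extend(arg['text'] for arg in event.get('arguments', []))
        missing = [t for t in texts if t not in sent]
        if missing:
            bad.append((i, missing))
    return bad

print(len(find_inconsistent('output/train.json')))  # hypothetical path to the generated file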

The results of our test are:

  • 187 samples in the train.json set have at least one entity, trigger, or argument that does not appear in their sentence
  • 2 samples in test.json
  • 5 samples in dev.json

What seems to be happening is that some samples end up with no entities, event triggers, or event arguments, because these end up in the next sample instead.

:(

A bug about processing

Hi! I appreciate your work! It is convenient to use.
But I found that every 'Headline' text appears in the output after processing, even though the 'Headline' is never labeled with entity mentions or anything else. I suggest removing headlines after processing.
Thanks!

bug in dev.json named entity

Hi,
thanks for your code!
I found that the 'golden-entity-mentions' in dev.json are a little bit strange.
Most of them are empty or just the time '2003-03-29T16:00:00-05:00'.

train and test seem fine to me. Did you encounter similar results? Thank you.

Some problems when preprocessing

Hello! Thanks for your contribution.
When I ran this code to preprocess the ACE 2005 corpus, some warnings and errors occurred, and I wonder whether they affect the result.

  • [Warning] The entity in the other sentence is mentioned. This argument will be ignored. This warning occurred multiple times during preprocessing.

  • [Warning] fail to find offset! (start_index: 3348, text: Doctors Without Borders/Médecins Sans Frontières (MSF, path: D:\Data\ace_2005_td_v7\data\English\un/timex2norm/alt.vacation.las-vegas_20050109.0133) This warning actually raises an assertion error (end_idx != -1), but I commented out the corresponding code in main.py to avoid the error. I have read the other issues and I know that simply deleting the file may solve the problem, but I want to know whether there are other solutions besides deleting it. I also wonder whether this warning means the result contains mistakes.
    Looking forward to your reply!

Support for Arabic

Hi @bowbowbow, thanks a lot for putting this together. I was wondering whether it would be easy to extend the code in main.py to support Arabic.

In my initial trials, I tried the following:

  1. Created data_list_arabic.csv file to include train/dev/test splits. An example of the first few lines of the file looks like the following:
type,path
train,nw/adj/ALH20001201.1900.0126
train,nw/adj/ALH20001201.1300.0071
dev,nw/adj/ALH20001128.1300.0081
test,nw/adj/ALH20001125.0700.0024
test,nw/adj/ALH20001124.1900.0127
  2. Built Arabic properties following the info in https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/pipeline/StanfordCoreNLP-arabic.properties:
arabic_properties = {'annotators': 'tokenize,ssplit,pos,lemma,parse',
                         'tokenize.language': 'ar',
                         'segment.model': 'edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz',
                         'ssplit.boundaryTokenRegex': '[.]|[!?]+|[!\u061F]+',
                         'pos.model': 'edu/stanford/nlp/models/pos-tagger/arabic/arabic.tagger',
                         'parse.model': 'edu/stanford/nlp/models/lexparser/arabicFactored.ser.gz'}
  3. Created the nlp_res_raw object as:
nlp_res_raw = nlp.annotate(item['sentence'], properties=arabic_properties)
  4. Downloaded the Arabic models:
cd stanford-corenlp-full-2018-10-05
wget http://nlp.stanford.edu/software/stanford-arabic-corenlp-2018-02-27-models.jar

Now when I run the script, I keep getting the following error: Failed to load segmenter edu/stanford/nlp/models/segmenter/arabic/arabic-segmenter-atb+bn+arztrain.ser.gz.

I must be making a mistake somewhere, such as not downloading the correct package or not pointing an environment variable to the correct location. Any help to add support for Arabic is greatly appreciated.
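
One possible cause, offered as an assumption rather than a confirmed diagnosis: the CoreNLP server only sees the Arabic segmenter model if the stanford-arabic-corenlp-*-models.jar sits next to the other jars in the folder the server is launched from, since the models are loaded from the Java classpath. A small sanity check (hypothetical folder name) might be:

import glob
import os

corenlp_dir = 'stanford-corenlp-full-2018-10-05'  # folder the CoreNLP server is launched from
jars = [os.path.basename(p) for p in glob.glob(os.path.join(corenlp_dir, '*.jar'))]
if not any('arabic' in name.lower() for name in jars):
    print('Arabic models jar not found next to the CoreNLP jars; '
          'the segmenter model cannot be loaded from the classpath.')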

Add download to NLTK's punkt

Please note that if the punkt resource is not available to NLTK, this code will not work.
I suggest adding a check somewhere in the code for the resource, or adding to the documentation that users should run nltk.download('punkt') before using this code.
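
A minimal guard along these lines (a sketch, not the repository's code) could be:

import nltk

try:
    nltk.data.find('tokenizers/punkt')
except LookupError:
    # Download the sentence tokenizer models the first time the script runs.
    nltk.download('punkt')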

fail to find offset! start_index: 3348, text: Doctors Without Borders/Médecins Sans Frontières

Thanks for sharing! When running this code, at the step preprocessing('train', train_files), I encountered the error below:

71%|███████▏ | 377/529
Traceback (most recent call last):
File "E:/githubspace/ace2005-preprocessing/main.py", line 213, in
preprocessing('train', train_files)
File "E:/githubspace/ace2005-preprocessing/main.py", line 143, in preprocessing
phrase=entity_mention['text'],
File "E:/githubspace/ace2005-preprocessing/main.py", line 49, in find_token_index
assert end_idx != -1, "end_idx: {}, end_pos: {}, phrase: {}, tokens: {}, chars:{}".format(end_idx, end_pos, phrase, tokens, chars)
AssertionError: end_idx: -1, end_pos: 133, phrase: Doctors Without Borders/Médecins Sans Frontières (MSF, tokens: [{'index': 1, 'word': '', 'originalText': '"', 'lemma': '', 'characterOffsetBegin': 0, 'characterOffsetEnd': 1, 'pos': '``', 'before': '', 'after': ''}, {'index': 2, 'word': 'Doctors', 'originalText': 'Doctors', 'lemma': 'doctor', 'characterOffsetBegin': 1, 'characterOffsetEnd': 8, 'pos': 'NNS', 'before': '', 'after': ' '}, {'index': 3, 'word': 'Without', 'originalText': 'Without', 'lemma': 'without', 'characterOffsetBegin': 9, 'characterOffsetEnd': 16, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 4, 'word': 'Borders/M茅decins', 'originalText': 'Borders/M茅decins', 'lemma': 'borders/m茅decins', 'characterOffsetBegin': 17, 'characterOffsetEnd': 33, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 5, 'word': 'Sans', 'originalText': 'Sans', 'lemma': 'san', 'characterOffsetBegin': 34, 'characterOffsetEnd': 38, 'pos': 'VBZ', 'before': ' ', 'after': ' '}, {'index': 6, 'word': 'Fronti猫res', 'originalText': 'Fronti猫res', 'lemma': 'fronti猫res', 'characterOffsetBegin': 39, 'characterOffsetEnd': 49, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 7, 'word': '-LRB-', 'originalText': '(', 'lemma': '-lrb-', 'characterOffsetBegin': 50, 'characterOffsetEnd': 51, 'pos': '-LRB-', 'before': ' ', 'after': ''}, {'index': 8, 'word': 'MSF', 'originalText': 'MSF', 'lemma': 'msf', 'characterOffsetBegin': 51, 'characterOffsetEnd': 54, 'pos': 'NN', 'before': '', 'after': ''}, {'index': 9, 'word': '-RRB-', 'originalText': ')', 'lemma': '-rrb-', 'characterOffsetBegin': 54, 'characterOffsetEnd': 55, 'pos': '-RRB-', 'before': '', 'after': ' '}, {'index': 10, 'word': 'has', 'originalText': 'has', 'lemma': 'have', 'characterOffsetBegin': 56, 'characterOffsetEnd': 59, 'pos': 'VBZ', 'before': ' ', 'after': ' '}, {'index': 11, 'word': 'received', 'originalText': 'received', 'lemma': 'receive', 'characterOffsetBegin': 60, 'characterOffsetEnd': 68, 'pos': 'VBN', 'before': ' ', 'after': ' '}, {'index': 12, 'word': 'an', 'originalText': 'an', 'lemma': 'a', 'characterOffsetBegin': 69, 'characterOffsetEnd': 71, 'pos': 'DT', 'before': ' ', 'after': ' '}, {'index': 13, 'word': 'extraordinary', 'originalText': 'extraordinary', 'lemma': 'extraordinary', 'characterOffsetBegin': 72, 'characterOffsetEnd': 85, 'pos': 'JJ', 'before': ' ', 'after': ' '}, {'index': 14, 'word': 'outpouring', 'originalText': 'outpouring', 'lemma': 'outpouring', 'characterOffsetBegin': 86, 'characterOffsetEnd': 96, 'pos': 'NN', 'before': ' ', 'after': ' '}, {'index': 15, 'word': 'of', 'originalText': 'of', 'lemma': 'of', 'characterOffsetBegin': 97, 'characterOffsetEnd': 99, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 16, 'word': 'support', 'originalText': 'support', 'lemma': 'support', 'characterOffsetBegin': 100, 'characterOffsetEnd': 107, 'pos': 'NN', 'before': ' ', 'after': ' '}, {'index': 17, 'word': 'for', 'originalText': 'for', 'lemma': 'for', 'characterOffsetBegin': 108, 'characterOffsetEnd': 111, 'pos': 'IN', 'before': ' ', 'after': ' '}, {'index': 18, 'word': 'the', 'originalText': 'the', 'lemma': 'the', 'characterOffsetBegin': 112, 'characterOffsetEnd': 115, 'pos': 'DT', 'before': ' ', 'after': ' '}, {'index': 19, 'word': 'people', 'originalText': 'people', 'lemma': 'people', 'characterOffsetBegin': 116, 'characterOffsetEnd': 122, 'pos': 'NNS', 'before': ' ', 'after': ' '}, {'index': 20, 'word': 'of', 'originalText': 'of', 'lemma': 'of', 'characterOffsetBegin': 123, 'characterOffsetEnd': 125, 'pos': 'IN', 'before': 
' ', 'after': ' '}, {'index': 21, 'word': 'South', 'originalText': 'South', 'lemma': 'South', 'characterOffsetBegin': 126, 'characterOffsetEnd': 131, 'pos': 'NNP', 'before': ' ', 'after': ' '}, {'index': 22, 'word': 'Asia', 'originalText': 'Asia', 'lemma': 'Asia', 'characterOffsetBegin': 132, 'characterOffsetEnd': 136, 'pos': 'NNP', 'before': ' ', 'after': ' '}, {'index': 23, 'word': 'and', 'originalText': 'and', 'lemma': 'and', 'characterOffsetBegin': 137, 'characterOffsetEnd': 140, 'pos': 'CC', 'before': ' ', 'after': ' '}, {'index': 24, 'word': 'we', 'originalText': 'we', 'lemma': 'we', 'characterOffsetBegin': 141, 'characterOffsetEnd': 143, 'pos': 'PRP', 'before': ' ', 'after': ' '}, {'index': 25, 'word': 'are', 'originalText': 'are', 'lemma': 'be', 'characterOffsetBegin': 144, 'characterOffsetEnd': 147, 'pos': 'VBP', 'before': ' ', 'after': ' '}, {'index': 26, 'word': 'extremely', 'originalText': 'extremely', 'lemma': 'extremely', 'characterOffsetBegin': 148, 'characterOffsetEnd': 157, 'pos': 'RB', 'before': ' ', 'after': ' '}, {'index': 27, 'word': 'grateful', 'originalText': 'grateful', 'lemma': 'grateful', 'characterOffsetBegin': 158, 'characterOffsetEnd': 166, 'pos': 'JJ', 'before': ' ', 'after': ''}, {'index': 28, 'word': '.', 'originalText': '.', 'lemma': '.', 'characterOffsetBegin': 166, 'characterOffsetEnd': 167, 'pos': '.', 'before': '', 'after': ''}], chars:extraordinaryoutpouringofsupportforthepeopleofSouthAsiaandweareextremelygrateful
Exception ignored in: <bound method tqdm.__del__ of 71%|███████▏ | 377/529 [13:06<11:34, 4.57s/it]>
Traceback (most recent call last):
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 931, in __del__
self.close()
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 1133, in close
self._decr_instances(self)
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_tqdm.py", line 496, in _decr_instances
cls.monitor.exit()
File "D:\ProgramData\Anaconda3\lib\site-packages\tqdm\_monitor.py", line 52, in exit
self.join()
File "D:\ProgramData\Anaconda3\lib\threading.py", line 1053, in join
raise RuntimeError("cannot join current thread")
RuntimeError: cannot join current thread

Is replace() necessary?

Hi there,
Thanks for the code.
I found a small mistake in the code and want to clarify it. In the parse_sgm(sgm_path) function, at the line

converted_text = converted_text.replace('Ltd.', 'Limited')

you change the sentence with replace() (e.g. 'U.S.' to 'US'). This changes the sentence, so when you later call find() at the line
pos = sgm_text.find(sent, last_pos)

it will not return the correct position of the sentence in the actual text and will return -1 instead. This leads to wrong entity sets for the sentences.
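
A minimal illustration of the mismatch described above, using hypothetical strings rather than the repository's data:

sgm_text = "Acme Ltd. announced a merger today."
sent = "Acme Ltd. announced a merger today.".replace('Ltd.', 'Limited')

# The replaced sentence no longer occurs verbatim in the original text,
# so find() returns -1 and the entity offsets for this sentence go wrong.
pos = sgm_text.find(sent)
print(pos)  # -1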

JMEE

Thank you very much for your code. I want to ask: did you feed your output data to the JMEE model to perform event extraction, and did you get the same F1 as reported in the JMEE paper? I ask because I found that the dev/test/train split in the JMEE paper is different from yours:
This data split includes 40 newswire articles (881 sentences) for the test set, 30 other documents (1087 sentences) for the development set and 529 remaining documents (21,090 sentences) for the training set
I am looking forward to your reply. Thank you very much!

FileNotFoundError: [Errno 2] No such file or directory: '/ace_2005_td_v7/data/English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml'

Thank you for sharing. I have an issue: when I process the ACE data with this code, the following error occurs.

Traceback (most recent call last):
File "main.py", line 229, in
preprocessing('train', train_files)
File "main.py", line 100, in preprocessing
parser = Parser(path=file)
File "Code/bert-event-extraction/ace2005-preprocessing/parser.py", line 16, in init
self.entity_mentions, self.event_mentions = self.parse_xml(path + '.apf.xml')
File "Code/bert-event-extraction/ace2005-preprocessing/parser.py", line 167, in parse_xml
tree = ElementTree.parse(xml_path)
File "anaconda3/envs/pytorch_38/lib/python3.8/xml/etree/ElementTree.py", line 1202, in parse
tree.parse(source, parser)
File "anaconda3/envs/pytorch_38/lib/python3.8/xml/etree/ElementTree.py", line 584, in parse
source = open(source, "rb")
FileNotFoundError: [Errno 2] No such file or directory: '/ace_2005_td_v7/data/English/un/timex2norm/alt.vacation.las-vegas_20050109.0133.apf.xml'

preprocessing problem

StanfordCore Exception Expecting value: line 1 column 1 (char 0)
item["sentence"]: [ applause ] it is important for you all to understand and for our fellow americans to understand the tax relief that i have proposed and will push for until enacted would create -- [ applause ] will create 1.4 million new jobs by the end of 200 in two years time, this nation has experienced war, a recession and a national emergency.
nlp_text: CoreNLP request timed out. Your document may be too long.

Did you meet this problem? How can I solve it?
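
One workaround, offered as an assumption rather than a verified fix: the "Expecting value: line 1 column 1" exception appears to come from parsing the plain-text timeout message as JSON, so raising the timeout may avoid both errors. The stanfordcorenlp client accepts a timeout argument (in milliseconds) that it forwards to the CoreNLP server it launches. A sketch with a hypothetical local CoreNLP folder:

from stanfordcorenlp import StanfordCoreNLP

# timeout is in milliseconds; long documents can exceed the default.
nlp = StanfordCoreNLP('./stanford-corenlp-full-2018-10-05', timeout=60000)

text = '[ applause ] it is important for you all to understand ...'
res = nlp.annotate(text, properties={
    'annotators': 'tokenize,ssplit,pos,lemma,parse',
    'outputFormat': 'json',
})
nlp.close()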

Hello

Hello, could you please share the ACE 2005 dataset with me? Thank you very much!

any idea how to fix sentence tokenizer mismatch?

Thanks a lot for your code!

I was trying to use your code to preprocess Chinese and Arabic data for event extraction. It seems to me that for Chinese and some Arabic data the number of sentence tokenizer mismatches is huge, so I end up getting only a few sentences from the Chinese corpus.

Do you have any idea how we could fix that? I tried to replace nltk.sent_tokenize with my own sentence splitter, but the assertion in find_token_index() stopped the code. I am not sure how to deal with it.
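
For reference, a naive splitter of the kind one might swap in for Chinese (a sketch, not the repository's method; the downstream offset assertions in find_token_index() still have to be satisfied):

def split_chinese_sentences(text):
    # Split after the common full-width sentence terminators, keeping them attached.
    sentences, buf = [], ''
    for ch in text:
        buf += ch
        if ch in '。！？；':
            sentences.append(buf)
            buf = ''
    if buf.strip():
        sentences.append(buf)
    return [s.strip() for s in sentences if s.strip()]

print(split_chinese_sentences('你好。这是第二句！'))  # ['你好。', '这是第二句！']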

Java not found and Resource punkt not found

File "/Users/myname/opt/anaconda3/lib/python3.9/site-packages/stanfordcorenlp/corenlp.py", line 47, in init
raise RuntimeError('Java not found.')
RuntimeError: Java not found.

raise LookupError(resource_not_found)

LookupError:


Resource punkt not found.
Please use the NLTK Downloader to obtain the resource:

import nltk
nltk.download('punkt')
