juand-r / entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

License: MIT License

Python 99.16% Shell 0.84%

entity-extraction named-entity-recognition ner datasets entity-recognition nlp-resources nlp corpora natural-language-processing annotations

entity-recognition-datasets's Issues

The ritter dataset cannot be accessed

http://kimi.ml.cmu.edu/transfer/data.tar.gz

Yoruba NER

Can you please add the Yoruba NER to the list of NER datasets?

Data
https://github.com/ajesujoba/YorubaTwi-Embedding/tree/master/Yoruba/Yor%C3%B9b%C3%A1-NER

Data statement
https://drive.google.com/file/d/177xu-O2FTJ7VJQ-0ohCWjVd1qu61Tvml/view

Paper
https://arxiv.org/abs/1912.02481

How to use the scripts to convert all English datasets to the 'conll' format

Sorry, is there a demo to show how to use your scripts?

I just ran into this list - thanks for putting it up. I curate the GUM corpus included in the data folder, but it seems to be a rather old version. We now have much more data, including four more genres and bringing up the total word count to about 130,000 tokens annotated for nested, (non-)named entities. Would you like to update the data to include the latest version?

Information of the datasets

Hello,

Thanks for sharing these datasets!
I am trying to find more specific information about them; for instance, how many tweets/comments/news items are in WNUT17 and in CoNLL 2003?

Thanks,
Cheers,
Camille
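
For counts like these, a rough tally can be computed directly from a CoNLL-format file. The sketch below is not part of the repository; `conll_stats` is a hypothetical helper, and it assumes one token per line with the tag in the last column, blank lines between blocks (sentences, or tweets if the file stores one tweet per block), and BIO tags in which every mention starts with `B-`.

```python
from collections import Counter


def conll_stats(lines):
    """Count blocks, tokens and entity mentions in CoNLL-style lines.

    Assumptions (not taken from the repository): the tag is the last
    whitespace-separated column, blank lines separate blocks, and
    every entity mention begins with a B- tag.
    """
    blocks = tokens = 0
    mentions = Counter()
    in_block = False
    for line in lines:
        line = line.strip()
        # Blank lines and -DOCSTART- markers end the current block.
        if not line or line.startswith("-DOCSTART-"):
            in_block = False
            continue
        if not in_block:
            blocks += 1
            in_block = True
        tokens += 1
        tag = line.split()[-1]
        if tag.startswith("B-"):
            mentions[tag[2:]] += 1
    return blocks, tokens, mentions


# Tiny inline sample so the sketch runs without any dataset file.
sample = """\
-DOCSTART- O

EU B-ORG
rejects O
German B-MISC
call O

Peter B-PER
Blackburn I-PER
""".splitlines()

print(conll_stats(sample))
```

To run it on a real file, pass an open file handle instead of `sample`. Files tagged with a scheme where mentions can start with `I-` would need a different mention test.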

New Romanian Named Entity Corpus (RONEC)

Hello,

My colleague and I created a new named entity corpus for the Romanian language, RONEC. The paper (recently accepted at LREC 2020) can be found at https://arxiv.org/abs/1909.01247 and the repository at https://github.com/dumitrescustefan/ronec.

Could you add our corpus alongside the others listed under Romanian?

Thank you,
Avram Andrei-Marius

print() is a function in Python 3

flake8 testing of https://github.com/juand-r/entity-recognition-datasets on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./data/NIST_IEER/CONLL-format/utils/quick_comma_fix.py:41:37: E999 SyntaxError: invalid syntax
                    print annotations
                                    ^
./data/NIST_IEER/CONLL-format/utils/makeconll.py:29:30: E999 SyntaxError: invalid syntax
                print category
                             ^
./data/GUM/CONLL-format/utils/webAnnotsv_to_conll.py:31:32: E999 SyntaxError: invalid syntax
        print 'Made directory: ', newdir
                               ^
./data/re3d/CONLL-format/utils/re3d_to_bratann.py:83:15: E999 SyntaxError: invalid syntax
        print i, e['value']
              ^
./data/i2b2_2006/CONLL-format/utils/i2b2toconll.py:23:15: E999 SyntaxError: invalid syntax
	print filename
              ^
./data/BBN/CONLL-format/utils/bbn2conll.py:35:18: E999 SyntaxError: invalid syntax
    print filename
                 ^
6     E999 SyntaxError: invalid syntax
6
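
Each of the six errors is a Python 2 print statement. As a minimal sketch of the fix, the function below shows Python 3 forms of three of the flagged patterns; `show_examples` and the sample values are hypothetical and only there to make the snippet runnable. If the scripts still need to run on Python 2, adding `from __future__ import print_function` at the top of each file makes the same calls valid there.

```python
def show_examples():
    """Python 3 equivalents of three print statements flagged above."""
    # Hypothetical values, only here so the calls can run.
    filename = "example.txt"
    newdir = "CONLL-format/data"
    i, e = 0, {"value": "Acme"}

    print(filename)                    # was: print filename
    print('Made directory: ', newdir)  # was: print 'Made directory: ', newdir
    print(i, e['value'])               # was: print i, e['value']


show_examples()
```

print() separates its arguments with a single space by default, as the Python 2 statement did with comma-separated items, so the output text stays the same.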

A Knowledge Graph resource of NER datasets

Dear authors, this repository is such a great resource! Many thanks for creating it. I would like to suggest that the Open Research Knowledge Graph (https://orkg.org/) could be used to list such resources for persistence and knowledge sharing. Below are some resources I created based on the information in this repository.

Named Entity Recognition Tasks in the MUC series

https://orkg.org/comparison/R162797/

NER in the Automatic Content Extraction (ACE) Series

https://orkg.org/comparison/R162851/

Named Entity Recognition in the CoNLL Series and the OntoNotes corpus as a related resource

https://orkg.org/comparison/R166315/

Named Entity Recognition Based on Wikipedia

https://orkg.org/comparison/R166240/

A comparison of the annotated resources of software mentions in scholarly articles

https://orkg.org/comparison/R166560/

NLP Datasets for Named Entity Recognition and Relation Extraction from Biomedicine Scholarly Articles

https://orkg.org/comparison/R163265/

Comparisons and Visualizations of the CrossNER Benchmark Corpus for its Source and Target Domains

https://orkg.org/comparison/R163843/

Surveying BioNLP Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R165702/

Surveying BioCreAtIvE Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R172155/

One benefit of such machine-encoded data is that reviews can be generated from it automatically.

Surveying the BioCreAtIvE Shared Task Series

https://orkg.org/review/R172166

Surveying the BioNLP Shared Task Series

https://orkg.org/review/R165924

I would be very happy to offer support in this direction. :)

Done with  APW_19980314
Done with  APW_19980424
Done with  APW_19980429
Done with  NYT_19980315
Done with  NYT_19980403
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "makeconll.py", line 109, in write_all_to_conll
    write_conll(filename)
  File "makeconll.py", line 100, in write_conll
    sentences = parse_doc(filename, index)
  File "makeconll.py", line 67, in parse_doc
    tags = tree2conll_without_postags(dt)
  File "makeconll.py", line 35, in tree2conll_without_postags
    raise ValueError("Tree is too deeply nested to be printed in CoNLL format")
ValueError: Tree is too deeply nested to be printed in CoNLL format

I am also wondering if there is a way to reconstruct the original text from the articles
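
The ValueError in the traceback above is raised when an entity is nested inside another entity, which a single tag column cannot represent. One possible workaround, sketched here under stated assumptions, is to keep only the outermost label of each nested entity. This is not the repository's makeconll.py; `leaves` and `tree_to_bio` are hypothetical helpers, and the tree is modelled as plain Python data (a token is a string, an entity is a `(label, children)` tuple) so that the snippet runs without any parsing library.

```python
def leaves(node):
    """Return all tokens under a node, in order."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    tokens = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens


def tree_to_bio(sentence):
    """Flatten a sentence to (token, tag) pairs using BIO tags.

    Nested entities are collapsed: every token inside an entity gets
    the label of the outermost entity, and inner labels are dropped.
    """
    rows = []
    for node in sentence:
        if isinstance(node, str):
            rows.append((node, "O"))
            continue
        label, _children = node
        for k, token in enumerate(leaves(node)):
            prefix = "B-" if k == 0 else "I-"
            rows.append((token, prefix + label))
    return rows


# Hypothetical sentence with a LOCATION nested inside an ORGANIZATION.
sentence = [
    ("ORGANIZATION", ["University", "of", ("LOCATION", ["Texas"])]),
    "is", "in", ("LOCATION", ["Austin"]), ".",
]

for token, tag in tree_to_bio(sentence):
    print(token, tag)
```

The trade-off is that inner mentions (here, "Texas" as a LOCATION) are lost in the output; keeping them would require an extra tag column per nesting level.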