juand-r / entity-recognition-datasets

A collection of corpora for named entity recognition (NER) and entity recognition tasks. These annotated datasets cover a variety of languages, domains and entity types.

License: MIT License

Python 99.16% Shell 0.84%
entity-extraction named-entity-recognition ner datasets entity-recognition nlp-resources nlp corpora natural-language-processing annotations

entity-recognition-datasets's People

Contributors

abhipec, angledluffa, hvingelby, juand-r, leondz, mdredze, roman-janik, sted97, toshihikosakai, tutubalinaev

entity-recognition-datasets's Issues

GUM version

I just ran into this list; thanks for putting it up. I curate the GUM corpus included in the data folder, but the copy here seems to be a rather old version. We now have much more data, including four more genres, bringing the total to about 130,000 tokens annotated for nested entities, both named and non-named. Would you like to update the data to include the latest version?

Information of the datasets

Hello,

Thanks for sharing these datasets!
I am trying to find more specific information about them; for instance, how many tweets, comments, or news articles are in WNUT17 and in CoNLL 2003?

Thanks,
Cheers,
Camille
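
A rough way to get such counts directly from a CoNLL-format file is sketched below. It assumes sentences are separated by blank lines and that document boundaries, where present, are marked by lines starting with `-DOCSTART-` (as in CoNLL 2003). The function name `conll_stats` is made up for this illustration and is not part of the repository.

```python
def conll_stats(lines):
    """Count documents, sentences and tokens in CoNLL-style lines.

    Assumes one token per line, blank lines between sentences, and
    optional '-DOCSTART-' lines marking document boundaries.
    """
    documents = sentences = tokens = 0
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        if line.startswith("-DOCSTART-"):
            documents += 1
            in_sentence = False
            continue
        if not line:
            # A blank line closes the current sentence.
            in_sentence = False
            continue
        if not in_sentence:
            sentences += 1
            in_sentence = True
        tokens += 1
    return {"documents": documents, "sentences": sentences, "tokens": tokens}


sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "EU NNP B-NP B-ORG",
    "rejects VBZ B-VP O",
    "",
    "Peter NNP B-NP B-PER",
    "",
]
print(conll_stats(sample))
```

For a file on disk, passing the open file object (for example `conll_stats(open(path, encoding="utf-8"))`) should work the same way, since the function only iterates over lines.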

print() is a function in Python 3

flake8 testing of https://github.com/juand-r/entity-recognition-datasets on Python 3.7.0

$ flake8 . --count --select=E901,E999,F821,F822,F823 --show-source --statistics

./data/NIST_IEER/CONLL-format/utils/quick_comma_fix.py:41:37: E999 SyntaxError: invalid syntax
                    print annotations
                                    ^
./data/NIST_IEER/CONLL-format/utils/makeconll.py:29:30: E999 SyntaxError: invalid syntax
                print category
                             ^
./data/GUM/CONLL-format/utils/webAnnotsv_to_conll.py:31:32: E999 SyntaxError: invalid syntax
        print 'Made directory: ', newdir
                               ^
./data/re3d/CONLL-format/utils/re3d_to_bratann.py:83:15: E999 SyntaxError: invalid syntax
        print i, e['value']
              ^
./data/i2b2_2006/CONLL-format/utils/i2b2toconll.py:23:15: E999 SyntaxError: invalid syntax
	print filename
              ^
./data/BBN/CONLL-format/utils/bbn2conll.py:35:18: E999 SyntaxError: invalid syntax
    print filename
                 ^
6     E999 SyntaxError: invalid syntax
6
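
The fix for each flagged line is to call `print` as a function. A minimal sketch using two of the statements from the report above; the wrapper functions `report_directory` and `report_entity` are hypothetical and only exist to make the example self-contained:

```python
def report_directory(newdir):
    # Python 2 statement flagged above: print 'Made directory: ', newdir
    print('Made directory: ', newdir)


def report_entity(i, value):
    # Python 2 statement flagged above: print i, e['value']
    print(i, value)


report_directory("CONLL-format/data")
report_entity(3, "London")
```

If the scripts still need to run on Python 2, adding `from __future__ import print_function` as the first import in each file makes the same calls valid there too.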

A Knowledge Graph resource of NER datasets

Dear authors, this repository is such a great resource! Many thanks for creating it. I would like to suggest that the Open Research Knowledge Graph (https://orkg.org/) could be used to list such resources for persistence and knowledge sharing. Below are some resources I created from the information in this repository.

Named Entity Recognition Tasks in the MUC series

https://orkg.org/comparison/R162797/

NER in the Automatic Content Extraction (ACE) Series

https://orkg.org/comparison/R162851/

Named Entity Recognition in the CoNLL Series and the OntoNotes corpus as a related resource

https://orkg.org/comparison/R166315/

Named Entity Recognition Based on Wikipedia

https://orkg.org/comparison/R166240/

A comparison of the annotated resources of software mentions in scholarly articles

https://orkg.org/comparison/R166560/

NLP Datasets for Named Entity Recognition and Relation Extraction from Biomedicine Scholarly Articles

https://orkg.org/comparison/R163265/

Comparisons and Visualizations of the CrossNER Benchmark Corpus for its Source and Target Domains

https://orkg.org/comparison/R163843/

Surveying BioNLP Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R165702/

Surveying BioCreAtIvE Shared Tasks Corpora for Named Entity Recognition

https://orkg.org/comparison/R172155/


One benefit of such machine-encoded data is that reviews can be generated from it automatically.

Surveying the BioCreAtIvE Shared Task Series

https://orkg.org/review/R172166

Surveying the BioNLP Shared Task Series

https://orkg.org/review/R165924

I would be very happy to offer support in this direction. :)

Tree too deeply nested; IEER dataset

I am trying to convert the NIST IEER dataset to CoNLL format and get the error below.
It gets through the first 6 files fine but only gets partway through NYT-19980407:

Done with  APW_19980314
Done with  APW_19980424
Done with  APW_19980429
Done with  NYT_19980315
Done with  NYT_19980403
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "makeconll.py", line 109, in write_all_to_conll
    write_conll(filename)
  File "makeconll.py", line 100, in write_conll
    sentences = parse_doc(filename, index)
  File "makeconll.py", line 67, in parse_doc
    tags = tree2conll_without_postags(dt)
  File "makeconll.py", line 35, in tree2conll_without_postags
    raise ValueError("Tree is too deeply nested to be printed in CoNLL format")
ValueError: Tree is too deeply nested to be printed in CoNLL format

I am also wondering whether there is a way to reconstruct the original text of the articles.
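
On the ValueError: a flat CoNLL column can only hold one tag per token, so this kind of error typically means an entity is nested inside another entity in the parsed tree. One possible workaround, sketched below, is to flatten each top-level entity and tag all of its tokens with the outermost label. This is not the code in makeconll.py. The tree here is a plain stand-in made of strings and `(label, children)` tuples, and `leaves` and `tree_to_bio` are hypothetical names used only for illustration.

```python
def leaves(node):
    """Return all tokens under a node, ignoring inner entity labels."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    tokens = []
    for child in children:
        tokens.extend(leaves(child))
    return tokens


def tree_to_bio(sentence):
    """Convert a sentence to (token, BIO tag) pairs.

    `sentence` is a list whose items are either token strings or
    (label, children) tuples. Nested entities are collapsed into
    the outermost entity, so inner labels are lost.
    """
    tagged = []
    for child in sentence:
        if isinstance(child, str):
            tagged.append((child, "O"))
            continue
        label = child[0]
        for i, token in enumerate(leaves(child)):
            prefix = "B-" if i == 0 else "I-"
            tagged.append((token, prefix + label))
    return tagged


sentence = [
    "The",
    ("ORGANIZATION", [("LOCATION", ["New", "York"]), "Times"]),
    "reported",
]
for token, tag in tree_to_bio(sentence):
    print(token, tag)
```

Keeping the outermost label is a design choice that trades away the inner annotation (here, the LOCATION inside the ORGANIZATION); keeping the innermost label instead would be an equally defensible alternative.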
