dkpro / dkpro-core Goto Github PK

View Code? Open in Web Editor NEW

194.0 28.0 68.0 101.18 MB

Collection of software components for natural language processing (NLP) based on the Apache UIMA framework.

Home Page: https://dkpro.github.io/dkpro-core

License: Other

Java 98.62% Groovy 0.31% HTML 0.93% Rich Text Format 0.14%

uima-components uima nlp natural-language-processing java dkpro

dkpro-core's Introduction

DKPro Core

DKPro Core is a collection of software components for natural language processing (NLP) based on the Apache UIMA framework.

For more information, visit the DKPro Core website.

For usage examples, see the DKPro Core Examples project

dkpro-core's People

Contributors

Stargazers

Watchers

Forkers

gitter-badger renaud niklasme sjgrharb stennmanns helt luojianp luke1308 liangkai mjlaali wgapl mjunsilo f-zehner alainloisel laugehoyer tooa goldenzero rajdeokumarsingh anukat2015 artpar iamgr007 jakejing ab212 teacube ferschke trafikverketbth alainlompo erfans semanticbeeng alex3alex neumannm tbsflk hexiaofeng jianlins skybirdhe morlikowski rcffc izzieness wcolen jgrivolla luto65 pkluegl reckart talnupf mindis atuls2 jimsow jibun mischor shadowridgedev jkirsch parisni mileyjun qintucoding wuuugi sugunalakshmig tilmanbeck udaradg vivekjhaver maccodonaldo horsmann az79nefy yeladlouni cactuscommunications alaindesilets aggarwalpiush manolomon ajunlonglive

dkpro-core's Issues

[TreeTagger] Problem processing very long tokens

When there is a very very long token in the text, the analysis engine fails.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:14:39

[TreeTagger] Problem if TreeTagger does not generate a POS or lemma

The analysis engine cannot deal with cases where TreeTagger does not output a POS and
lemma.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:12:45

Checksums in TreeTagger resource packaging ant file are outdated

Checksums in TreeTagger resource packaging ant file are outdated.

Original issue reported on code.google.com by richard.eckart on 2011-05-07 20:14:22

Improved mechanism for loading models and mappings

Currently DKPro TreeTagger supports auto-lookup of model files. It looks up and
loads the appropriate language model automatically according to the document
language. All other DKPro analysis engines (AEs) doesn't possess this ability
yet.

Dive into DKPro TreeTagger and learn how it does such auto-lookup. Can this
mechanism be encapsulated into ExternalResource? Goal is to let AE
automatically gain this auto-lookup feature, when such an object is passed in
in the parameter for model file location.

Furthermore, specific default paths should be configurable via property files.

Lastly, can it load concrete resources lazily? Meaning to load the resource the
moment it is first used. (Good starting point: ExternalResourceFactory of
UIMAFit, line 220)

For the lazy-loeading resources, have a look at the class ParametrizedResource
in org.uimafit.factory.ExternalResourceFactoryTest.

There is one more aspect to this issue: tags produced by the TreeTagger or
other analysis components do not directly correspond to UIMA types. We usually
have a generic base type, e.g. POS for Part-of-Speech annotations and more
specific subtypes, e.g. V for verbs, N for nouns, etc. The same for parsers or
named entity recognition. The generic model resource should also have some
method getUimaType(String tag) were you pass in a tag and it retuns a UIMA type
to use for the annotation. See
de.tudarmstadt.ukp.dkpro.core.treetagger.TreeTaggerTT4JBase.getTagType(DKProModel,
String, TypeSystem) for how this is done in the TreeTagger component.

Original issue reported on code.google.com by richard.eckart on 2011-10-03 19:19:13

[TreeTagger] Add a standard mapping for non-supported language

At the moment, languages that are supported by the TreeTagger but do not yet have a
mapping to the DKPro type system cannot be used with the TreeTagger AE.

We should add a standard-mapping for non-supported languages that maps all 
POS-Tags to some general purpose annotation (I think "O" (=Other) is currently used
for non-mappable types). The original POS-Values can then be retrieved from PosValue-feature
of the O-Annotations.

This should not be seen as a replacement for a language mapping - but as a work around
for new languages until a new mapping to the DKPro-type system has been created.

Original issue reported on code.google.com by oliver.ferschke on 2011-05-09 22:12:21

[IMSCWB] Support for IMS Corpus Workbench format

Being able to read and write the IMS Corpus Workbench tab-separated format would be
useful. We could use it to export corpora for search with CQP. Also, we could read
the WaCky corpora.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:26:29

[IMSCWB] Add support for writing CQP indexes directly

Provide some support to write CQP indexes directly, e.g. by calling cwb-makeall from
within the writer and passing all data and configuration directly to it.

Original issue reported on code.google.com by richard.eckart on 2011-12-23 09:29:19

Add a TokenFilter to filter out unwanted tokens

Add a TokenFilter component to remove tokens from the CAS.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:08:14

[TreeTagger] Intern POS values and lemmas to save memory

Per default the TreeTagger wrapper should intern POS values and lemmas to save memory.
It should be an option however, as somebody may not want to incur the additional overhead.

Original issue reported on code.google.com by richard.eckart on 2011-05-29 08:53:01

SnowballStemmer may crash if a non-Lucene Snowball implementation is in the classpath.

The patched snowball from Lucene has "stem" as a method on SnowballProgram but if we
have some other snowball also in the classpath, Java might choose to use the other.
So to be safe, we should use a reflection here.

Original issue reported on code.google.com by richard.eckart on 2011-04-17 17:57:47

[TreeTagger] Add parameters for model and encoding

The model is always coupled to the language code parameter. There should be additional
parameters to override the model and the model encoding for the case that somebody
wants to specify a custom model.

Original issue reported on code.google.com by richard.eckart on 2011-01-03 12:30:51

[JWPL] Need information about the database connection a CAS was generated from

Need information about the database connection a CAS was generated from.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:21:56

TextReader fails if emtpy pattern is provided.

Added a failing test that is currently ignored.

Original issue reported on code.google.com by torsten.zesch on 2011-04-05 13:02:34

[IO.BNC] No tokens generated for punctuation

There are no Tokens generated for "c" tags.

Original issue reported on code.google.com by richard.eckart on 2012-01-29 01:21:11

Rename and enhance TokenFilter

I think TokenFilter should be renamed to AnnotationByLengthFilter and be changed to
work on any kind of annotation instead of just working on tokens. I suppose it should
accept a list of types even. Probably this list could contain Token as default.

Original issue reported on code.google.com by richard.eckart on 2011-05-07 22:43:59

getUrlAsFile() fails to preserve extension

getUrlAsFile() should take care that the temporary files have the same extension as
specified in the URL. E.g. if the URL ends in ".exe", the temporary file should also
end in ".exe", but currently it ends in "exe" only (no dot).

Original issue reported on code.google.com by richard.eckart on 2011-06-16 09:29:19

Automatically used standard stopword lists depending on document language

Snowball comes with a set of standard stopword lists. Per default the tagger should
detect which language a document has and use the standard list for that language. It
should be possible to turn that behaviour off via a parameter. Another parameter should
allow to load additional stopword lists.

Original issue reported on code.google.com by richard.eckart on 2011-01-10 13:39:04

Stem and Lemma should be defined in LexMorph API

Currently Stem and Lemma are defined in the Segmentation API. Arguably, they don't have
anything to do with that API other than being used as features in Token. The types
should be moved to the LexMorph API.

Original issue reported on code.google.com by richard.eckart on 2011-09-06 12:30:42

[MMAX2] MMAX2 writer should take a parameter pointing to the empty (but configured) MMAX2 projekt that is to be used.

As in summary.

Original issue reported on code.google.com by torsten.zesch on 2011-01-21 16:22:35

SegmenterBase wastes memory

SegmenterBase creates annotations that are never added to the indexes and thus wastes
memory, because the necessary memory is still reserved in the CAS.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:10:34

[BNC] Add reader for the BNC corpus

A reader for the BNC XML corpus format would be nice to have.

Original issue reported on code.google.com by richard.eckart on 2011-12-23 23:03:09

dkpro.core.toolbox corpora need to be refactored

The corpus hierarchy is broken at the moment.
Several corpora do not implement the Corpus interface.

Original issue reported on code.google.com by torsten.zesch on 2011-10-31 17:21:17

[MMAX2] MMAX reader will probably not work, as default project setup cannot be copied from resource folder in classpath.

As in summary.

Original issue reported on code.google.com by torsten.zesch on 2011-01-21 16:18:13

Type Anomaly should allow to store more than one suggestion.

Background: most spelling correctors return more than one suggestion.
The alternative would be to create one annotation per suggestion and merge later via
offset comparison (which sounds like too much of a hassle).

Original issue reported on code.google.com by torsten.zesch on 2011-08-08 06:51:17

Eclipse-generated META-INF causes RAT plugin to barf

When using DKPro Core in a Eclipse WTP project, Eclipse has the bad habit of creating
META-INF/MANIFEST.MF under src/main/java - of course without license. This causes builds
in Eclipse to behave strangely as the RAT plugin is executed as part of the build by
m2eclipse and RAT fails.

We could either run the RAT plugin in another phase or add an exclude.

Original issue reported on code.google.com by richard.eckart on 2011-07-14 13:04:54

BreakIteratorSegmenter does not produce tokens unless sentences are also enabled

BreakIteratorSegmenter does not produce tokens unless sentences are also enabled

Original issue reported on code.google.com by richard.eckart on 2011-04-28 22:58:57

NegraExportReader does not read Tiger Corpus files

Added a failing test (currently ignored) that tries to read a tiger corpus file.
This should be in Negra export format, but cannot be read with the current version
of the reader.

Original issue reported on code.google.com by torsten.zesch on 2011-09-29 15:44:38

Add sentence boundary markers in Web1TFormatWriter

Currently no sentence boundary markers are written which means it does not really write
the correct format.

Original issue reported on code.google.com by torsten.zesch on 2011-10-03 10:28:18

Web1TFormatWriter should work with arbitrary annotation types

So far, the Web1TFormatWriter always writes Token frequencies.
It should be possible to use different annotation types, e.g. Lemmas.

Original issue reported on code.google.com by torsten.zesch on 2011-10-14 12:21:09

[TreeTagger] Specify executable and model via parameters

I would be nice to optionally secify all neccessary properties, executables and resources
in the parameters of the analysis engine.

Example:
The TreeTagger installation for its wrapper in DKPro is currently only added by maven.
It's not possible for other developers to include DKPro components only with the descriptors.

Original issue reported on code.google.com by [email protected] on 2011-02-01 14:06:27

BreakIteratorSegmenter: New parameter for punctuation marks

I would appreciate a new boolean parameter in BreakIteratorSegmenter which constitutes
whether to mark punctuation marks as tokens or not.
(If available, see Bug 851 in DKPro Semantics.)

Thanks in advance,
Marko

Original issue reported on code.google.com by [email protected] on 2011-05-19 16:22:19

documentUri and documentId not set correctly

The documentUri is set to the ID of the document and the documentId is set to a running
number. Since multiple documents are in one file, the URI should be set to something
like file://path#docId and the documentId should be set to the docId.

Original issue reported on code.google.com by richard.eckart on 2012-01-14 17:57:03

DocumentMetaData should work even if created after setting text

Adding DocumentMetaData after text has been set means ending up with two DocumentAnnotation
instances in the CAS, one created by UIMA when setDocumentText() is called and one
created by DocumentMetaData.create(). This should *just work* without having to think
too much about it.

Original issue reported on code.google.com by richard.eckart on 2012-01-04 21:51:06

[IO.WEB1T] Output encoding is platform dependent

Currently the Web1t writer uses the current platform encoding to write files. Per default
it should use UTF-8 to write file and there should be a parameter to change the encoding
if desired. For the parameter, the conventions from the api.parameter module should
be used.

Original issue reported on code.google.com by richard.eckart on 2011-10-03 18:35:22

XmlWriterInline documentation should state "inaccuracies", warnings should be issued to the log

We need to document the "inaccuracies" of CasToInlineXml and possibly include some sanity
checks that log warnings if a CAS contains overlapping annotations and complex feature
structures being used as features - just to be that novice users are aware that strange
things my be happening:

- Features whose values are FeatureStructures are not represented.
- Feature values which are strings longer than 64 characters are truncated.
- Feature values which are arrays of primitives are represented by strings that look
like [ xxx, xxx ]
- The Subject of analysis is presumed to be a text string.
-Some characters in the document's Subject-of-analysis are replaced by blanks, because
the characters aren't valid in xml documents.
- It doesn't work for annotations which are overlapping, because these cannot be properly
represented as properly - nested XML.

Original issue reported on code.google.com by richard.eckart on 2011-03-29 18:50:37

Support for Negra export format

The NEGRA export format is one of the formats used by the Tiger Corpus and by TüBa D/Z.
It would be nice to be able to read them.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:24:45

[JWPL] Wikipedia readers are outdated

The wikipedia readers have become quite complex and do not cover new funtionalities
of JWPL as e.g. revisions. A new set of readers should be provided.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:17:01

Location where XmiWriter writes type system should be configurable

The UIMA CAS Editor expects a file called TypeSystem.xml in the project root. It would
be convenient if the XmiWriter could be configured to write the type system in that
location.

Original issue reported on code.google.com by richard.eckart on 2011-04-17 17:27:02

[ANNIS] Support for RelAnnis format

Support to write data in the RelAnnis format used by Annis2 would be nice.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:23:18

URI created by TextReader stored in a documents DocumentMetaAnnotation is invalid when path contains whitespaces

How to reproduce the issue:
1. Let an AnalysisEngine process a document collection read with de.tudarmstadt.ukp.dkpro.core.io.text.TextReader
containing a whitespace in its path (Example: /var/lib/jenkins/jobs/DKPro Semantics/workspace/trunk/de.tudarmstadt.ukp.dkpro.semantics.bookindexing/src/test/resources/PhraseMatchEvaluator/)
2. Let the AnalysisEngine extract the String representation of the URI from the DocumentMetaData
and try to instantiate a new URI instance: URI uri = new URI(DocumentMetaData.get(jcas).getDocumentUri());
3. An exception will be thrown:
java.net.URISyntaxException: Illegal character in path at index 32: file:/var/lib/jenkins/jobs/DKPro
Semantics/workspace/trunk/de.tudarmstadt.ukp.dkpro.semantics.bookindexing/src/test/resources/PhraseMatchEvaluator/tokens%201.txt

The URI seems to be stored invalid in the DocumentMetaData, as the whitespace in the
path not been encoded as "%20". The basename is encoded correctly though.

Original issue reported on code.google.com by parzonka on 2011-09-05 15:21:58

[TreeTagger] Allow to select if POS annotations and/or Lemma annotations should be generated

The TreeTaggerPosLemma annotation always creates POS Tags and Lemmas. In the case that
a corpus is read that already provides POS tags, it would be nice to only add the lemmas.
Thus there should be switches to enable/disable the creation of Lemma and POS annotations.

Original issue reported on code.google.com by richard.eckart on 2011-05-28 14:58:09

Ark Tweet module has wrong artifactId

The artifactId of the new ark-tweek module does not end in "-asl".

Original issue reported on code.google.com by richard.eckart on 2012-02-03 12:26:22

[io.jwpl] WikipediaQueryReader MIN / MAX Tokens causes exception

What steps will reproduce the problem?
1. Use the WikipediaQueryReader with the parameter PARAM_MIN_TOKENS or PARAM_MAX_TOKENS,
respectively. Using a locally running MySQL DB.

What is the expected output? What do you see instead?
Get only Wikipages with at least MIN_TOKENS and not more than MAX_TOKENS. Causes an
IndexOutOfBounds-Exception during a substring operation (judging by the debugger) 

On what operating system?
OS-X

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:52:11

[IO] File-based writers all behave slightly differently

We have some file-based writers that all work slightly differently, in particular these:
- XmiWriter,
- XmlWriterInline,
- TextWriter

They also all have slightly different parameter names, do not all support compression,
etc.

Original issue reported on code.google.com by richard.eckart on 2012-01-28 19:50:43

[TreeTagger] Generate Maven metadata into JARs

It would be helpful if the model and binary JARs would contain Maven metadata that Artifactory
could read and already fill in the deploy form.

Original issue reported on code.google.com by richard.eckart on 2011-06-27 21:27:44

AlignedString in CasTransformation cannot properly handle transformations in first position.

replaceTest3() in AlignedStringTest reproduces the error.

Original issue reported on code.google.com by torsten.zesch on 2012-03-07 15:55:04

Support for n-gram frequencies as a UIMA resource

In several tasks we need access to n-gram frequencies, e.g. from the Google n-gram corpus.
These should be provided as an external resource.

Original issue reported on code.google.com by richard.eckart on 2011-10-02 15:19:26

[io.jwpl] WikipediaStandardReaderBase does not fill in several fields in DocumentMetaData and uses wrong language codes

WikipediaStandardReaderBase uses the collectionId property for the pageId and leaves
the documentId field empty. This causes problems with other components in a pipeline
which expect that documentId is always set. In general we consider documentUri and
documentId to be mandatory. baseUri and collectionId are optional. If baseUri is present,
it has to be a prefix of docUri.

E.g. TextWriter tries to use documentUri and baseUri to determine the relative output
path and file name.

Original issue reported on code.google.com by richard.eckart on 2011-08-30 17:38:21

Mandatory configuration parameter is hard to understand

ResourceCollectionReaderBase has a mandatory configuration parameter
"PARAM_PATTERNS" which IMHO should not be mandatory. Dfeault should be loading
all documents.

Original issue reported on code.google.com by richard.eckart on 2011-03-25 14:52:06