ivanhe / termolator

Chinese version of NYU's Termolator terminology extraction system. Also includes source code for the English part-of-speech tagger used in the English version.

Home Page: http://nlp.cs.nyu.edu/termolator/

License: Apache License 2.0

Java 100.00%

termolator's Issues

Document properties

The properties file contains the following properties:

stopWordListName = data/CN.nw.wordlist.txt
endWordListName = data/CN.endlist.txt
forbiddenCharListName = data/CN.charlist.txt
stopThreshold = 50
forbiddenThreshold = 800
minAV = 5
minCount = 3
minDocumentCount = 5
terminologyThreshold = 0.6

Could you document what each parameter is? I think the first 3 are obvious, but the others could use some explanation.

Thanks!
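Until the parameters are documented, the sketch below shows one way to load and type-check the values listed above, so that at least the expected types (file paths, integers, a fractional threshold) are explicit. It is only an illustration: the class name `TermolatorPropertiesSketch` and its helper methods are hypothetical and are not part of the termolator codebase, and the code makes no claim about what each parameter means or how the project actually reads the file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper, not part of termolator: loads the listed
// properties and exposes them with explicit types.
class TermolatorPropertiesSketch {

    // The property lines exactly as quoted in this issue.
    static final String SAMPLE = String.join("\n",
            "stopWordListName = data/CN.nw.wordlist.txt",
            "endWordListName = data/CN.endlist.txt",
            "forbiddenCharListName = data/CN.charlist.txt",
            "stopThreshold = 50",
            "forbiddenThreshold = 800",
            "minAV = 5",
            "minCount = 3",
            "minDocumentCount = 5",
            "terminologyThreshold = 0.6");

    // Parses key/value pairs in java.util.Properties syntax.
    static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Returns the raw value, failing loudly if the key is absent.
    static String required(Properties props, String key) {
        String raw = props.getProperty(key);
        if (raw == null) {
            throw new IllegalArgumentException("Missing property: " + key);
        }
        return raw.trim();
    }

    // Reads an integer-valued property such as minCount.
    static int intValue(Properties props, String key) {
        return Integer.parseInt(required(props, key));
    }

    // Reads a fractional property such as terminologyThreshold.
    static double doubleValue(Properties props, String key) {
        return Double.parseDouble(required(props, key));
    }
}
```

Calling `TermolatorPropertiesSketch.intValue(props, "minAV")` on the sample above yields the integer 5; a missing or non-numeric value raises an exception instead of silently falling back to a default.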

GeniaNPParser fails on lines with less than 4 tokens

The pos2conll.py script produces the following output:

: <tab> : <tab> PU <tab> O
<unprintable char> <tab> <unprintable char> <tab> OD <tab> O
L <tab> L <tab> M <tab> O

GeniaNPParser fails on the line with the unprintable characters because the split() method finds only 2 tokens instead of 4. I have changed my local codebase to skip any line that contains fewer than 4 tokens and to log a message to the console with the line itself, its line number, and the file name. With this information, you can find the offending lines and decide whether they are worth fixing.

I'm happy to contribute my changes, or you could contribute your own changes that give the same result.
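For reference, the skip-and-log behaviour described above could look roughly like the sketch below. This is not the actual GeniaNPParser code or the local patch: the class `ConllLineFilterSketch`, the method `keepWellFormed`, and the file name used in the usage note are hypothetical, and splitting on whitespace after `trim()` is an assumption about how the tokens are obtained.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real GeniaNPParser: keeps lines with at
// least 4 tokens and reports every other line instead of failing.
class ConllLineFilterSketch {

    // Each usable line is expected to carry 4 columns, as in the
    // pos2conll.py sample output quoted in this issue.
    static final int MIN_TOKENS = 4;

    // Returns the token arrays of the well-formed lines. A message for
    // every skipped line is added to `skipped` and printed to stderr.
    static List<String[]> keepWellFormed(List<String> lines,
                                         String fileName,
                                         List<String> skipped) {
        List<String[]> rows = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            // Assumption: tokens are whitespace-separated. Leading control
            // characters are removed by trim(), which can reduce the count.
            String[] tokens = line.trim().split("\\s+");
            if (tokens.length < MIN_TOKENS) {
                // Report file name, 1-based line number, and the raw line.
                String message = String.format(
                        "Skipping %s:%d (%d tokens): [%s]",
                        fileName, i + 1, tokens.length, line);
                skipped.add(message);
                System.err.println(message);
                continue;
            }
            rows.add(tokens);
        }
        return rows;
    }
}
```

Given the three sample lines above, with a form feed standing in for the unprintable character and a made-up file name such as `sample.conll`, the first and third lines are kept and the second is reported with its line number, so the offending input can be located later.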
