Chinese version of NYU's Termolator terminology extraction system. Also includes source code for the English part-of-speech tagger used in the English version.

Home Page: http://nlp.cs.nyu.edu/termolator/

License: Apache License 2.0

Termolator - Chinese Terminology Extraction and TJet

This package contains NYU's Chinese terminology extraction system and a Jet wrapper to support English terminology extraction. The system is released under the Apache License, except for the files in the sampleRDG/ and sampleBackground/ directories. See https://github.com/AdamMeyers/The_Termolator for the English version.

Files in demo/ are taken from Wikipedia and licensed under the CC BY-SA 3.0 license.

Chinese Terminology Extraction

The binary release of the Chinese terminology extraction system can be downloaded from:

https://github.com/ivanhe/termolator/releases/download/Beta1/chinese_term_extraction.zip

To perform Chinese terminology extraction, first unzip the package and then run:

./run_cn.sh IN_DOMAIN_FILELIST OUT_OF_DOMAIN_FILELIST OUTPUT_FILE
  • IN_DOMAIN_FILELIST: List of in-domain files generated by a Chinese noun chunker, in CoNLL format. The CoNLL format we use assumes one word per line. Each line has four fields, delimited by the tab character. Field 1: Word; Field 2: Word; Field 3: Part-of-speech tag; Field 4: BIO tag for NP (B-NP, I-NP, or O).

  • OUT_OF_DOMAIN_FILELIST: List of background files generated by a Chinese noun chunker, in CoNLL format.

  • OUTPUT_FILE: Name of the output file. The output file will contain a ranked list of terms.
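The four-field format described above can be illustrated with a short Python sketch. It is not part of the Termolator code base: the function names `read_conll_tokens` and `noun_phrases` and the sample sentence (including its part-of-speech tags) are made up for illustration, and the handling of blank lines and stray I-NP tags is an assumption.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str, str]  # (word, word, POS tag, BIO tag)


def read_conll_tokens(lines: Iterable[str]) -> List[Token]:
    """Parse tab-delimited, four-field lines; blank lines are skipped (assumption)."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"expected 4 tab-separated fields, got {len(fields)}: {line!r}")
        tokens.append(tuple(fields))
    return tokens


def noun_phrases(tokens: List[Token]) -> List[List[str]]:
    """Group words into NP chunks using the B-NP / I-NP / O tags in field 4."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for word, _word2, _pos, bio in tokens:
        if bio == "B-NP":
            if current:
                chunks.append(current)
            current = [word]
        elif bio == "I-NP" and current:
            current.append(word)
        else:
            # "O", or an I-NP with no open chunk (ignored in this sketch)
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample input, one word per line
sample = [
    "拜占庭\t拜占庭\tNR\tB-NP",
    "帝国\t帝国\tNN\tI-NP",
    "的\t的\tDEG\tO",
    "历史\t历史\tNN\tB-NP",
]
tokens = read_conll_tokens(sample)
print(noun_phrases(tokens))  # [['拜占庭', '帝国'], ['历史']]
```

A check like this can be useful for validating chunker output before passing the file lists to run_cn.sh.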

The in-domain corpus is the corpus from which terms are extracted; the out-of-domain corpus should be a general-domain corpus. To try the system out, run:

./run_cn.sh demo.pos.filelist demo.neg.filelist demo.output

Here, the in-domain corpus is five documents related to the history of the Byzantine Empire, and the out-of-domain corpus consists of three random documents. There will be one term extracted in demo.output: "拜占庭" (Byzantine).

Building the System

The system is built with Maven. In the FuseJet directory, run:

mvn package

The resulting jar file serves as FuseJet.jar in the Chinese system and as TJet.jar in the English system.

Using the Word Segmenter and Part-of-Speech Tagger

We provide a Chinese word segmenter and part-of-speech tagger, by courtesy of the Chinese Language Processing Group, Brandeis University. It is available at:

https://github.com/ivanhe/termolator/releases/download/Beta1/brandeis-segmenter-postagger.tgz

The Termolator license terms do NOT cover the word segmenter and part-of-speech tagger. Usage and license terms for the Brandeis tagger are given in Readme.txt inside the downloaded archive.

We also provide a Python3 script to convert the word segmenter/pos tagger output into the CoNLL format that our term extraction system requires. Usage:

./pos2conll.py POS_OUTPUT_DIR CONLL_OUTPUT_DIR CONLL_FILE_LIST

where POS_OUTPUT_DIR is the output directory of the Brandeis tagger, CONLL_OUTPUT_DIR is the directory in which the output files in CoNLL format are saved, and CONLL_FILE_LIST is an output file: pos2conll.py writes to CONLL_FILE_LIST the list of files it has created in CONLL_OUTPUT_DIR.

CONLL_FILE_LIST can then be used as an input file list for run_cn.sh.
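If the CoNLL files already exist (for example, for a background corpus prepared separately), a file list can be produced by hand. The sketch below assumes the list contains one file path per line, which this README does not state explicitly; compare with the demo.pos.filelist shipped with the release before relying on it. The helper name `write_file_list` and the directory and file names are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def write_file_list(conll_dir: str, list_path: str) -> List[str]:
    """Write the paths of all regular files in conll_dir to list_path.

    Assumption: the file list format is one path per line.
    """
    paths = sorted(str(p) for p in Path(conll_dir).iterdir() if p.is_file())
    Path(list_path).write_text("".join(p + "\n" for p in paths), encoding="utf-8")
    return paths


# Self-contained demonstration in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    conll_dir = Path(tmp) / "conll"
    conll_dir.mkdir()
    for name in ("doc2.conll", "doc1.conll"):
        (conll_dir / name).write_text("词\t词\tNN\tB-NP\n", encoding="utf-8")

    list_path = Path(tmp) / "background.filelist"
    listed = write_file_list(str(conll_dir), str(list_path))
    contents = list_path.read_text(encoding="utf-8").splitlines()

print(len(listed), "files listed")
```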

Property File

The parameters in the Chinese property file are explained below:

# Note that the word lists contain words and their absolute frequencies in a news corpus. Whether
# a word/character is treated as a stop word/character is determined by the thresholds below
stopWordListName = data/CN.nw.wordlist.txt
endWordListName = data/CN.endlist.txt
forbiddenCharListName = data/CN.charlist.txt
# words with frequency higher than this threshold will be filtered out
stopThreshold = 50
# words containing characters with frequency higher than this threshold will be filtered out
forbiddenThreshold = 800
#
# The following 3 parameters are currently hard-coded in the system. The values in the properties file
# are not used. The hard-coded values are minAV=3 minCount=5 minDocumentCount=3
# This behavior can be changed in the constructor of ChineseTypedTermFilter
# 1) Threshold for the accessor variety (AV) statistic. Terms with AV less than this will be filtered out
# See Feng, Chen, Deng, and Zheng (2004): Accessor Variety Criteria for Chinese Word Extraction.
# Computational Linguistics 30 (1)
minAV = 5
# 2) Minimum absolute count for a term to be included in the output
minCount = 3
# 3) Terms that appear in fewer than the threshold number of documents will be filtered out
minDocumentCount = 5
# 
# Fraction of all terms to output (0.6 means that the top 60% of all unfiltered candidates will be returned)
terminologyThreshold = 0.6
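To make the threshold semantics concrete, here is a minimal Python sketch of how such a property file could be read and how stopThreshold and terminologyThreshold might be applied. It is an illustration only: the actual logic lives in the Java code (for example, ChineseTypedTermFilter) and may differ, notably in how the cutoff is rounded. The helper names, the word frequencies, and the candidate terms are all hypothetical.

```python
from typing import Dict, List


def parse_properties(text: str) -> Dict[str, str]:
    """Read 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def stop_words(frequencies: Dict[str, int], threshold: int) -> set:
    """Words whose corpus frequency is higher than the threshold."""
    return {word for word, freq in frequencies.items() if freq > threshold}


def top_fraction(ranked_terms: List[str], fraction: float) -> List[str]:
    """Keep the top share of a ranked list (truncating; the rounding is an assumption)."""
    keep = int(len(ranked_terms) * fraction)
    return ranked_terms[:keep]


text = """
# words with frequency higher than this threshold will be filtered out
stopThreshold = 50
terminologyThreshold = 0.6
"""
props = parse_properties(text)

# Hypothetical word frequencies and ranked candidates
frequencies = {"的": 90000, "历史": 420, "拜占庭": 12}
stops = stop_words(frequencies, int(props["stopThreshold"]))

ranked = ["t1", "t2", "t3", "t4", "t5"]
kept = top_fraction(ranked, float(props["terminologyThreshold"]))
print(sorted(stops), kept)
```

With five ranked candidates and a threshold of 0.6, this sketch keeps the first three.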

Authors

Termolator is developed by Adam Meyers, Yifan He, Zachary Glass and Shasha Liao. The English version is available at: https://github.com/AdamMeyers/The_Termolator

The code for the Chinese terminology extractor and the English part-of-speech tagger in this git repository is developed by Yifan He and Shasha Liao.

We thank the Chinese Language Processing Group at Brandeis University and Prof. Nianwen Xue for providing the Chinese word segmenter and part-of-speech tagger.
