rakuten-nlp / rakutenma

Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.

License: Apache License 2.0

JavaScript 100.00%

rakutenma's Issues

Save model

How can I save the model after training my own analysis model from scratch?
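
A minimal sketch of one way to do this, assuming the trained weights are exposed as a plain, JSON-serializable object on `rma.model`. The bundled `model_ja.json` and `model_zh.json` files are loaded with `JSON.parse`, which suggests the reverse direction works too, but that is an assumption here. The helper names `saveModel` and `loadModel` and the file name `my_model.json` are hypothetical and not part of RakutenMA.

```javascript
const fs = require('fs');

// Hypothetical helper: write a model object to disk as JSON.
function saveModel(model, path) {
  fs.writeFileSync(path, JSON.stringify(model));
}

// Hypothetical helper: read a model back, the same way the bundled models are loaded.
function loadModel(path) {
  return JSON.parse(fs.readFileSync(path, 'utf8'));
}

// Assumed usage with a trained instance `rma` (not executed here):
//   saveModel(rma.model, 'my_model.json');
//   const rma2 = new RakutenMA(loadModel('my_model.json'));
```

As with the bundled models, the reloaded instance presumably needs the same `featset` and `hash_func` settings that were used during training.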

How to tokenize Chinese in Node.js?

Hello, I'm trying to use your module, but I don't understand how to get the same results in Node.js as your JavaScript demo for Chinese.

Do I have to train the model each time? Is there a simple way to pass in Chinese text and get the morphemes back without any training?

I'm using the following code:

var RakutenMA = require('rakutenma');
var fs = require('fs');
var model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_zh.json"));
var rma = new RakutenMA(model, 1024, 0.007812);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);

console.log(rma.tokenize('叙利亚毒气攻击遭到导弹回应'));

Thank you!

This is not an issue, just a suggestion.

Suggestion

It would be nice if users could see the part of speech in Japanese. At the moment, the tokenizer returns abbreviated part-of-speech tags like these:

"A-c", "F", ...

However, I guess that in most cases users want the Japanese part-of-speech names directly, as MeCab provides:

"形容詞-一般", "副詞", ...

I would be really happy to hear your feedback.
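
Until something like this is built in, the mapping can be done in user code. The sketch below assumes tokens arrive as `[surface, tag]` pairs, as in the outputs quoted elsewhere in these issues. The table `JA_POS_NAMES` and the helper `expandTags` are hypothetical names, and the table contains only the two pairs quoted above, so it would need to be extended with the remaining tags.

```javascript
// Partial lookup table: only the two pairs quoted in this issue.
const JA_POS_NAMES = {
  'A-c': '形容詞-一般',
  'F': '副詞',
};

// Hypothetical helper: replace each abbreviated tag with its full name,
// leaving tags that are missing from the table untouched.
function expandTags(tokens, names = JA_POS_NAMES) {
  return tokens.map(([surface, tag]) => [
    surface,
    Object.prototype.hasOwnProperty.call(names, tag) ? names[tag] : tag,
  ]);
}

// Token pairs in the [surface, tag] shape; 'N-nc' is not in the table and passes through.
console.log(expandTags([['とても', 'F'], ['美しい', 'A-c'], ['花', 'N-nc']]));
```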

abcd漢字 fails to tokenize

With "abcd漢字" I get 6 tokens, with no POS info. But with "abc漢字" I get 2 tokens, with POS values.

Curiously, when I try it at http://rakuten-nlp.github.io/rakutenma/, both sentences work.

My (node.js) code is like this, using the default model_ja.json file that comes with the tokenizer.

const fs = require('fs');
const RakutenMA = require('rakutenma');
const model = JSON.parse(fs.readFileSync("model_ja.json"));
const rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.
const s = "abcd漢字";
rma.tokenize(s)

Is there anything I am missing, or doing wrong, here?

Documentation: ctype_func should be set for Chinese

The following code sample produces poor parse output unless the commented-out rma.ctype_func line is uncommented:

var RakutenMA = require('./rakutenma');
var fs = require('fs');

var model = JSON.parse(fs.readFileSync("model_zh.json"));
var rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);
var chardic = JSON.parse(fs.readFileSync("zh_chardic.json"))
// rma.ctype_func = RakutenMA.create_ctype_chardic_func(chardic);

console.log(rma.tokenize("下雨了。奶奶望着窗户外边哗哗的大雨，心里很着急。她想：京京带着伞，不要紧。小玲忘了带伞，一定要淋湿了。"));

I think this should be mentioned in the section "Using bundled models to analyze Chinese/Japanese sentences" in the readme.

npm support

npm is a popular package manager for JavaScript.
It would be great if you published this library there.
All you need to do is create a package.json via npm init and run npm publish.

Tips

If you do not want to include the *.json files in the package, you can exclude them with .npmignore or the files field in package.json.

Thanks!

The sentence is split up to individual characters with no PoS tags

Issue:

When the input text is long (over 20,000 characters in the sample case), the sentence is split into individual characters with no PoS tags.

bug.js

var fs = require('fs');
var RakutenMA = require('rakutenma');  // lowercase, as in the other snippets on this page

var model = JSON.parse(fs.readFileSync("../model_ja.json"));
var rma = new RakutenMA(model, 1024, 0.007812);  // Specify hyperparameters for SCW (for demonstration purposes)
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15);

fs.readFile('sample.txt', 'utf8', function (err, text) {
    if (err) throw err;
    console.log(text);
    var tokens = rma.tokenize(text);
    // String(tokens) joins the nested [surface, tag] pairs with commas,
    // which is the format shown in tokens.txt.
    fs.writeFile('tokens.txt', String(tokens), function (err) {
        if (err) console.log(err);
    });
});

tokens.txt

著,,雍,,争,,奪,,戦,,で,,は,,、,,初,,日,,か,,ら,,前,,線,,へ,,出,,過,,ぎ,,た,,こ,,と,,で,,荀,,早,,隊,,に,,囚,,わ,,れ,,る,,失,,態,,を,,犯,,す,,も,,、,,咄,,嗟,,の,,判,,断,,で,,羌,,瘣,,に,,荀,,早,,を,,捕,,ら,,え,,さ,,せ,,た,,。,,そ,,の,,夜,,、,,魏,,国,,兵,,た,,ち,,に,,槍,,で,,小,,突,,き,,回,,さ,,れ,,る,,が,,、,,眼,,前,,の,,凱,,孟,,か,,ら,,の,,問,,い,,に,,臆,,せ,,ず,,素,,直,,に,,胸,,の,,内,,を,,語,,っ,,た,,こ,,と,,で,,、,,そ,,れ,,以,,上,,は,,粗,,略,,に,,さ,,れ,,ず,,、,,翌,,日,,の,,人,,質,,交,,換,,で,,飛,,信,,隊 (and so on)
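
The report does not establish the cause. One possible workaround is to split the text into sentences and tokenize each one separately. Whether that avoids the failure is an assumption, not something confirmed in this thread. The helpers `splitSentences` and `tokenizeInChunks` are hypothetical names. The tokenizer is passed in as a function so the sketch does not depend on a particular RakutenMA setup.

```javascript
// Split text into chunks that end at Japanese sentence punctuation or a line break.
// Whitespace-only chunks are dropped.
function splitSentences(text) {
  const chunks = text.match(/[^。！？\n]+[。！？]*|[。！？]+/g) || [];
  return chunks.filter((chunk) => chunk.trim() !== '');
}

// Tokenize each sentence separately and concatenate the resulting token lists.
// `tokenize` is any function that maps a string to an array of [surface, tag] pairs.
function tokenizeInChunks(tokenize, text) {
  const tokens = [];
  for (const sentence of splitSentences(text)) {
    tokens.push(...tokenize(sentence));
  }
  return tokens;
}

// Assumed usage with a configured instance `rma` (not executed here):
//   const tokens = tokenizeInChunks((s) => rma.tokenize(s), text);
```

Splitting on sentence boundaries keeps each call short while leaving the context inside a sentence intact, at the cost of losing any context that spans sentences.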

How to add just a one word fix?

In the code below, 百五銀行 is split into three tokens tagged N-n, N-n, N-nc. I'd like to update the model dynamically so that it is treated as a single proper noun.

With just let rma = new RakutenMA(model);, nothing changes.
With new RakutenMA(model, 1024, 0.007812); I get this very weird result:

[ '百', 'N-n' ],
[ '五銀', 'N-np' ],
[ '行', 'N-nc' ],

I tried calling train_one() multiple times, but it reports updated:false after the first call.

const fs = require('fs');
const RakutenMA = require('rakutenma');

const model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_ja.json"));
let rma = new RakutenMA(model, 1024, 0.007812);
//let rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.

console.log(rma.tokenize("三重県内では最大手の百五銀行の約５兆３０００億円に迫る"));
var res = rma.train_one([["百五銀行","N-np"]]);
console.log(res);
console.log(rma.tokenize("三重県内では最大手の百五銀行の約５兆３０００億円に迫る"));

Full output: (using Node 6.5)

[ [ '三重', 'N-pn' ],
  [ '県', 'N-nc' ],
  [ '内', 'Q-n' ],
  [ 'で', 'P-k' ],
  [ 'は', 'P-rj' ],
  [ '最', 'P' ],
  [ '大手', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '百', 'N-n' ],
  [ '五', 'N-n' ],
  [ '銀行', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '約', 'P' ],
  [ '５', 'N-n' ],
  [ '兆', 'N-nc' ],
  [ '３０００', 'N-n' ],
  [ '億', 'N-n' ],
  [ '円', 'N-nc' ],
  [ 'に', 'P-k' ],
  [ '迫る', 'V-c' ] ]
{ ans: [ [ '百五銀行', 'N-np' ] ],
  sys: [ [ '百', 'N-n' ], [ '五', 'N-n' ], [ '銀行', 'N-nc' ] ],
  updated: true }
[ [ '三重', 'N-pn' ],
  [ '県', 'N-nc' ],
  [ '内', 'Q-n' ],
  [ 'で', 'P-k' ],
  [ 'は', 'P-rj' ],
  [ '最', 'P' ],
  [ '大手', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '百', 'N-n' ],
  [ '五銀', 'N-np' ],
  [ '行', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '約', 'P' ],
  [ '５', 'N-n' ],
  [ '兆', 'N-nc' ],
  [ '３０００', 'N-n' ],
  [ '億', 'N-n' ],
  [ '円', 'N-nc' ],
  [ 'に', 'P-k' ],
  [ '迫る', 'V-c' ] ]
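
One thing that might be worth trying is to train on the whole corrected sentence rather than on the isolated word, so that the model sees the word with its surrounding context. The sketch below builds such a corrected sentence from the tokenizer's own output. The helper `mergeTokens` is a hypothetical name, and whether this actually fixes the segmentation is an assumption. Note also that the output above tags 三重 as N-pn, so N-pn rather than N-np is presumably the proper-noun tag; that is an inference from the output, not something confirmed here.

```javascript
// Hypothetical helper: merge the consecutive tokens whose surfaces
// concatenate to `word` into a single [word, tag] token.
function mergeTokens(tokens, word, tag) {
  const out = [];
  let i = 0;
  while (i < tokens.length) {
    let joined = '';
    let j = i;
    // Extend the span while the joined surface is still a prefix of `word`.
    while (j < tokens.length && joined.length < word.length &&
           word.startsWith(joined + tokens[j][0])) {
      joined += tokens[j][0];
      j++;
    }
    if (joined === word) {
      out.push([word, tag]);  // the span matched exactly: emit one merged token
      i = j;
    } else {
      out.push(tokens[i]);    // no match starting here: keep the token as is
      i++;
    }
  }
  return out;
}

// Assumed usage with a configured instance `rma` (not executed here):
//   const sentence = "三重県内では最大手の百五銀行の約５兆３０００億円に迫る";
//   const gold = mergeTokens(rma.tokenize(sentence), '百五銀行', 'N-pn');
//   console.log(rma.train_one(gold));
```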
