rakuten-nlp / rakutenma
Rakuten MA - morphological analyzer (word segmentor + PoS Tagger) for Chinese and Japanese written purely in JavaScript.
License: Apache License 2.0
How can I save the model after training my own analysis model from scratch?
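One possible approach, sketched here under assumptions: if the trained weights are exposed as a plain object on the analyzer (assumed below to be rma.model), they can be written out with JSON.stringify and loaded again the same way the bundled model_ja.json is loaded in the issues on this page. The helper names saveModel and loadModel are hypothetical, and a stand-in object replaces a real RakutenMA instance so the sketch runs on its own.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: serialize the analyzer's model object to a JSON file.
// Assumption: the trained weights are reachable as `rma.model`.
function saveModel(rma, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(rma.model), 'utf8');
}

// Hypothetical helper: read a saved model back as a plain object, ready to be
// passed to `new RakutenMA(model)` like the bundled JSON models.
function loadModel(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Stand-in for a trained analyzer (not a real RakutenMA instance).
const fakeRma = { model: { mu: { 'w0-彼': 0.5 } } };

const modelPath = path.join(os.tmpdir(), 'my_model.json');
saveModel(fakeRma, modelPath);
const restored = loadModel(modelPath);
console.log(Object.keys(restored.mu).length);
```

With the real library, fakeRma would be replaced by the instance that train_one() was called on; remember to set featset and hash_func again after reloading, as in the other snippets here.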
Hello, I'm trying to use your module, but I don't understand how to get the same results in Node.js as your Chinese demo in JavaScript.
Do I have to train the model each time? Is there a simple way to use it, where I just pass in the Chinese text and get the morphemes back without any training?
I'm using the following code:
var RakutenMA = require('rakutenma');
var fs = require('fs');
var model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_zh.json"));
var rma = new RakutenMA(model, 1024, 0.007812);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);
console.log(rma.tokenize('叙利亚毒气攻击遭到导弹回应'));
Thank you!
It would be great if users could see the part of speech in Japanese. Currently, the part of speech produced when tokenizing is an abbreviated form, like below.
"A-c", "F", ...
However, I guess that in most cases users want to get the Japanese part-of-speech names directly, as with MeCab.
"形容詞-一般", "副詞", ...
I would be really happy to hear your feedback.
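Until something like this is built in, the conversion could be done on the caller's side. The following is a minimal sketch, not part of the library: POS_NAMES_JA and withJapanesePos are hypothetical names, the table contains only the two pairs quoted above (the remaining tags would have to be filled in from the project's tag list), and the sample token list is hand-written rather than real tokenizer output.

```javascript
// Hypothetical lookup table from abbreviated tags to Japanese names.
// Only the two pairs quoted in this issue are filled in.
const POS_NAMES_JA = {
  'A-c': '形容詞-一般',
  'F': '副詞'
};

// Replace each abbreviated tag with its Japanese name when one is known,
// and leave unknown tags unchanged.
function withJapanesePos(tokens, names) {
  return tokens.map(([surface, tag]) => {
    const known = Object.prototype.hasOwnProperty.call(names, tag);
    return [surface, known ? names[tag] : tag];
  });
}

// Hand-written sample in the [surface, tag] shape that tokenize() returns.
const sample = [['うらやましい', 'A-c'], ['とても', 'F'], ['猫', 'N-nc']];
const labeled = withJapanesePos(sample, POS_NAMES_JA);
console.log(labeled);
```

Keeping the table outside the tokenizer means the abbreviated tags stay available for code that depends on them.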
With "abcd漢字" I get 6 tokens, with no POS info. But with "abc漢字" I get 2 tokens, with POS values.
Curiously, when I try it at http://rakuten-nlp.github.io/rakutenma/, both sentences work.
My Node.js code looks like this, using the default model_ja.json file that comes with the tokenizer.
const fs = require('fs');
const RakutenMA = require('rakutenma');
const model = JSON.parse(fs.readFileSync("model_ja.json"));
const rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.
const s = "abcd漢字";
rma.tokenize(s)
Is there anything I am missing, or doing wrong, here?
The following code sample produces poor parse output unless the second-to-last line is uncommented:
var RakutenMA = require('./rakutenma');
var fs = require('fs');
var model = JSON.parse(fs.readFileSync("model_zh.json"));
var rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);
var chardic = JSON.parse(fs.readFileSync("zh_chardic.json"))
// rma.ctype_func = RakutenMA.create_ctype_chardic_func(chardic);
console.log(rma.tokenize("下雨了。奶奶望着窗户外边哗哗的大雨,心里很着急。她想:京京带着伞,不要紧。小玲忘了带伞,一定要淋湿了。"));
I think this should be mentioned in the "Using bundled models to analyze Chinese/Japanese sentences" section of the README.
npm is a popular package manager for JavaScript.
It would be great if you published this library there.
All you need to do is create a package.json via npm init and run npm publish.
If you do not want to include the *.json files in the package, you can exclude them with .npmignore or the files attribute in package.json.
Thanks!
When the text is fairly long (in the sample case, over 20,000 characters), the sentence is split into individual characters with no PoS tags.
bug.js
var fs = require('fs');
var RakutenMA = require("rakutenMA")
var model = JSON.parse(fs.readFileSync("../model_ja.json"));
rma = new RakutenMA(model, 1024, 0.007812); // Specify hyperparameter for SCW (for demonstration purpose)
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15);
var tokens = null
fs.readFile('sample.txt', 'utf8', function (err, text) {
console.log(text);
tokens = rma.tokenize(text)
fs.writeFile('tokens.txt', tokens , function (err) {
console.log(err);
});
});
tokens.txt
著,,雍,,争,,奪,,戦,,で,,は,,、,,初,,日,,か,,ら,,前,,線,,へ,,出,,過,,ぎ,,た,,こ,,と,,で,,荀,,早,,隊,,に,,囚,,わ,,れ,,る,,失,,態,,を,,犯,,す,,も,,、,,咄,,嗟,,の,,判,,断,,で,,羌,,瘣,,に,,荀,,早,,を,,捕,,ら,,え,,さ,,せ,,た,,。,,そ,,の,,夜,,、,,魏,,国,,兵,,た,,ち,,に,,槍,,で,,小,,突,,き,,回,,さ,,れ,,る,,が,,、,,眼,,前,,の,,凱,,孟,,か,,ら,,の,,問,,い,,に,,臆,,せ,,ず,,素,,直,,に,,胸,,の,,内,,を,,語,,っ,,た,,こ,,と,,で,,、,,そ,,れ,,以,,上,,は,,粗,,略,,に,,さ,,れ,,ず,,、,,翌,,日,,の,,人,,質,,交,,換,,で,,飛,,信,,隊 (and so on)
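A possible workaround, offered as an untested sketch: since short inputs tokenize normally, the text could be cut into sentence-sized pieces and each piece tokenized separately. The issue does not identify the cause of the failure, so this rests on the assumption that shorter input avoids it. splitSentences and tokenizeInChunks are hypothetical helpers, and a stub stands in for the real analyzer so the sketch runs on its own.

```javascript
// Split text into sentence-sized chunks, keeping closing punctuation with
// its sentence. Surrounding whitespace and empty chunks are dropped.
function splitSentences(text) {
  const chunks = text.match(/[^。！？\n]+[。！？\n]*/g) || [];
  return chunks.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Tokenize each chunk separately and concatenate the results.
// `rma` is anything with a tokenize(string) method returning an array
// of [surface, tag] pairs.
function tokenizeInChunks(rma, text) {
  const tokens = [];
  for (const sentence of splitSentences(text)) {
    for (const token of rma.tokenize(sentence)) {
      tokens.push(token);
    }
  }
  return tokens;
}

// Stub analyzer (not the real library): returns each chunk as one token.
const stubRma = { tokenize: (s) => [[s, '']] };
const chunked = tokenizeInChunks(stubRma, '今日は晴れ。明日は雨！\nまたね');
console.log(chunked.length);
```

Note that whitespace between sentences is discarded, so the combined token list is not identical to what a single tokenize() call over the whole text would return.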
In the code below, it treats 百五銀行 as N-n, N-n, N-nc. I'd like to dynamically update the model so that it is tagged as a proper noun.
With just let rma = new RakutenMA(model); nothing changes.
With new RakutenMA(model, 1024, 0.007812); I get this very odd result:
[ '百', 'N-n' ],
[ '五銀', 'N-np' ],
[ '行', 'N-nc' ],
I tried calling train_one() multiple times, but it reports updated:false after the first call.
const fs = require('fs');
const RakutenMA = require('rakutenma');
const model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_ja.json"));
let rma = new RakutenMA(model, 1024, 0.007812);
//let rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.
console.log(rma.tokenize("三重県内では最大手の百五銀行の約5兆3000億円に迫る"));
var res = rma.train_one([["百五銀行","N-np"]]);
console.log(res);
console.log(rma.tokenize("三重県内では最大手の百五銀行の約5兆3000億円に迫る"));
Full output (using Node 6.5):
[ [ '三重', 'N-pn' ],
[ '県', 'N-nc' ],
[ '内', 'Q-n' ],
[ 'で', 'P-k' ],
[ 'は', 'P-rj' ],
[ '最', 'P' ],
[ '大手', 'N-nc' ],
[ 'の', 'P-k' ],
[ '百', 'N-n' ],
[ '五', 'N-n' ],
[ '銀行', 'N-nc' ],
[ 'の', 'P-k' ],
[ '約', 'P' ],
[ '5', 'N-n' ],
[ '兆', 'N-nc' ],
[ '3000', 'N-n' ],
[ '億', 'N-n' ],
[ '円', 'N-nc' ],
[ 'に', 'P-k' ],
[ '迫る', 'V-c' ] ]
{ ans: [ [ '百五銀行', 'N-np' ] ],
sys: [ [ '百', 'N-n' ], [ '五', 'N-n' ], [ '銀行', 'N-nc' ] ],
updated: true }
[ [ '三重', 'N-pn' ],
[ '県', 'N-nc' ],
[ '内', 'Q-n' ],
[ 'で', 'P-k' ],
[ 'は', 'P-rj' ],
[ '最', 'P' ],
[ '大手', 'N-nc' ],
[ 'の', 'P-k' ],
[ '百', 'N-n' ],
[ '五銀', 'N-np' ],
[ '行', 'N-nc' ],
[ 'の', 'P-k' ],
[ '約', 'P' ],
[ '5', 'N-n' ],
[ '兆', 'N-nc' ],
[ '3000', 'N-n' ],
[ '億', 'N-n' ],
[ '円', 'N-nc' ],
[ 'に', 'P-k' ],
[ '迫る', 'V-c' ] ]