rakutenma's Issues

Save model

How can I save the model after training my own analysis model from scratch?
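
A minimal sketch of one way to persist a trained model, assuming the model is a plain JSON-serializable object (the bundled model_ja.json and model_zh.json files are loaded with JSON.parse in the snippets further down this page, which suggests as much). The helper names saveModel and loadModel and the file name my_model.json are hypothetical, and reading the trained weights from rma.model is an assumption about the library rather than something confirmed in this thread.

```javascript
const fs = require('fs');
const os = require('os');
const path = require('path');

// Hypothetical helper: serialize a model object to a JSON file.
function saveModel(model, filePath) {
  fs.writeFileSync(filePath, JSON.stringify(model), 'utf8');
}

// Hypothetical helper: read a model back, the same way the bundled models are loaded.
function loadModel(filePath) {
  return JSON.parse(fs.readFileSync(filePath, 'utf8'));
}

// Stand-in object for demonstration only. With the real library this would be
// the trained model (assumed here to be available as rma.model).
const trained = { weights: { 'feature-a': 0.5, 'feature-b': -1.25 } };

const modelPath = path.join(os.tmpdir(), 'my_model.json');
saveModel(trained, modelPath);
const restored = loadModel(modelPath);
```

If the assumption holds, the restored object could then be passed to new RakutenMA(restored) just like the bundled models.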

How to tokenize Chinese in Node.js?

Hello, I'm trying to use your module, but I don't understand how to get the same results in Node.js as your Chinese demo in JavaScript.

Do I have to train the model each time? Is there a simple way to pass in Chinese text and get the morphemes back without any training?

I'm using the following code:

var RakutenMA = require('rakutenma');
var fs = require('fs');
var model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_zh.json"));
var rma = new RakutenMA(model, 1024, 0.007812);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);

console.log(rma.tokenize('叙利亚毒气攻击遭到导弹回应'));

Thank you!

This is not an issue, just a suggestion.

Suggestion

It would be nice if users could see the part of speech in Japanese. Currently, the part-of-speech tags produced by tokenization are abbreviated, as below.

"A-c", "F", ...

However, I guess that in most cases users want the Japanese part-of-speech names directly, as in MeCab.

"形容詞-一般", "副詞", ...

I would be really happy to hear your feedback.
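
Until something like this is built in, the mapping could be applied on the caller's side. The sketch below is only an illustration: the table contains just the two pairs quoted in this issue (a full table would have to come from the project's tag list), the helper name withJapanesePos is hypothetical, and the sample tokens are hand-written rather than real tokenizer output.

```javascript
// Partial lookup table: only the two tag/name pairs quoted in this issue.
const POS_JA = new Map([
  ['A-c', '形容詞-一般'],
  ['F', '副詞'],
]);

// Hypothetical helper: replace each abbreviated tag with its Japanese name,
// leaving tags that are not in the table unchanged.
function withJapanesePos(tokens, table = POS_JA) {
  return tokens.map(([surface, tag]) => [
    surface,
    table.has(tag) ? table.get(tag) : tag,
  ]);
}

// Hand-written [surface, tag] pairs in the shape returned by rma.tokenize().
const sample = [['とても', 'F'], ['美しい', 'A-c'], ['花', 'N-nc']];
const labelled = withJapanesePos(sample);
console.log(labelled);
```

Unknown tags fall through untouched, so the table can be filled in gradually.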

abcd漢字 fails to tokenize

With "abcd漢字" I get 6 tokens, with no POS info. But with "abc漢字" I get 2 tokens, with POS values.

Curiously, when I try it at http://rakuten-nlp.github.io/rakutenma/, both sentences work.

My (Node.js) code looks like this, using the default model_ja.json file that comes with the tokenizer:

const fs = require('fs');
const RakutenMA = require('rakutenma');
const model = JSON.parse(fs.readFileSync("model_ja.json"));
const rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.
const s = "abcd漢字";
console.log(rma.tokenize(s)); // print the [surface, PoS] pairs

Is there anything I am missing, or doing wrong, here?

Documentation: ctype_func should be set for Chinese

The following code sample produces poor parse output unless the commented-out ctype_func line is uncommented:

var RakutenMA = require('./rakutenma');
var fs = require('fs');

var model = JSON.parse(fs.readFileSync("model_zh.json"));
var rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_zh;
rma.hash_func = RakutenMA.create_hash_func(15);
var chardic = JSON.parse(fs.readFileSync("zh_chardic.json"));
// rma.ctype_func = RakutenMA.create_ctype_chardic_func(chardic);

console.log(rma.tokenize("下雨了。奶奶望着窗户外边哗哗的大雨,心里很着急。她想:京京带着伞,不要紧。小玲忘了带伞,一定要淋湿了。"));

I think this should be mentioned in the section "Using bundled models to analyze Chinese/Japanese sentences" in the readme.

npm support

npm is a popular package manager for JavaScript.
It would be great if you publish this library there.
All you need to do is create package.json via npm init and run npm publish.

Tips

If you do not want to include the *.json files in the package, you can exclude them with .npmignore or the files field in package.json.

Thanks!

The sentence is split up into individual characters with no PoS tags

Issue:

When the text is fairly long (in the sample case, over 20,000 characters), the sentence is split up into individual characters with no PoS tags.

bug.js

var fs = require('fs');
var RakutenMA = require("rakutenma");

var model = JSON.parse(fs.readFileSync("../model_ja.json"));
var rma = new RakutenMA(model, 1024, 0.007812);  // Specify hyperparameter for SCW (for demonstration purpose)
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15);

fs.readFile('sample.txt', 'utf8', function (err, text) {
    if (err) throw err;
    console.log(text);
    var tokens = rma.tokenize(text);
    // String(tokens) flattens the [surface, PoS] pairs into the comma-separated form shown in tokens.txt.
    fs.writeFile('tokens.txt', String(tokens), function (err) {
        console.log(err);
    });
});

tokens.txt

著,,雍,,争,,奪,,戦,,で,,は,,、,,初,,日,,か,,ら,,前,,線,,へ,,出,,過,,ぎ,,た,,こ,,と,,で,,荀,,早,,隊,,に,,囚,,わ,,れ,,る,,失,,態,,を,,犯,,す,,も,,、,,咄,,嗟,,の,,判,,断,,で,,羌,,瘣,,に,,荀,,早,,を,,捕,,ら,,え,,さ,,せ,,た,,。,,そ,,の,,夜,,、,,魏,,国,,兵,,た,,ち,,に,,槍,,で,,小,,突,,き,,回,,さ,,れ,,る,,が,,、,,眼,,前,,の,,凱,,孟,,か,,ら,,の,,問,,い,,に,,臆,,せ,,ず,,素,,直,,に,,胸,,の,,内,,を,,語,,っ,,た,,こ,,と,,で,,、,,そ,,れ,,以,,上,,は,,粗,,略,,に,,さ,,れ,,ず,,、,,翌,,日,,の,,人,,質,,交,,換,,で,,飛,,信,,隊 (and so on)
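
One possible workaround, assuming the failure really is tied to input length as the report suggests (this is not confirmed in the thread), is to split the text at sentence-ending punctuation and tokenize each piece separately. The helper names splitSentences and tokenizeInChunks are hypothetical, and the tokenizer is passed in as a function so that the sketch can be tried with a stub. With the real library one would pass something like function (s) { return rma.tokenize(s); }.

```javascript
// Hypothetical helper: cut the text after each 。, ！ or ？ so that every
// piece handed to the tokenizer stays short. Trailing text without a
// sentence-ending mark is kept as a final piece.
function splitSentences(text) {
  return text.match(/[^。！？]*[。！？]|[^。！？]+$/g) || [];
}

// Hypothetical helper: tokenize each sentence on its own and concatenate
// the resulting [surface, PoS] pairs in order.
function tokenizeInChunks(text, tokenize) {
  const all = [];
  for (const sentence of splitSentences(text)) {
    for (const token of tokenize(sentence)) {
      all.push(token);
    }
  }
  return all;
}

// Stub tokenizer for demonstration only: one token per character, empty tag.
const stubTokenize = (s) => Array.from(s, (ch) => [ch, '']);

const pieces = splitSentences('今日は晴れ。明日は雨');
const result = tokenizeInChunks('今日は晴れ。明日は雨', stubTokenize);

// JSON.stringify keeps the pairs readable when writing them to a file.
console.log(JSON.stringify(result));
```

Writing the result with JSON.stringify also avoids the doubled commas seen in tokens.txt, which come from converting nested arrays with empty PoS strings directly to a string.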

How to add just a one word fix?

In the code below, 百五銀行 is tokenized as N-n, N-n, N-nc. I'd like to dynamically update the model so that it is treated as a proper noun.

  1. With just let rma = new RakutenMA(model);, nothing changes.

  2. With new RakutenMA(model, 1024, 0.007812); I get this very odd result:

    [ '百', 'N-n' ],
    [ '五銀', 'N-np' ],
    [ '行', 'N-nc' ],

  3. I tried calling train_one() multiple times, but it says updated:false after the first call.

const fs = require('fs');
const RakutenMA = require('rakutenma');

const model = JSON.parse(fs.readFileSync("node_modules/rakutenma/model_ja.json"));
let rma = new RakutenMA(model, 1024, 0.007812);
//let rma = new RakutenMA(model);
rma.featset = RakutenMA.default_featset_ja;
rma.hash_func = RakutenMA.create_hash_func(15); //The 15 is to match the pre-trained data.

console.log(rma.tokenize("三重県内では最大手の百五銀行の約5兆3000億円に迫る"));
const res = rma.train_one([["百五銀行","N-np"]]);
console.log(res);
console.log(rma.tokenize("三重県内では最大手の百五銀行の約5兆3000億円に迫る"));

Full output: (using Node 6.5)

[ [ '三重', 'N-pn' ],
  [ '県', 'N-nc' ],
  [ '内', 'Q-n' ],
  [ 'で', 'P-k' ],
  [ 'は', 'P-rj' ],
  [ '最', 'P' ],
  [ '大手', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '百', 'N-n' ],
  [ '五', 'N-n' ],
  [ '銀行', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '約', 'P' ],
  [ '5', 'N-n' ],
  [ '兆', 'N-nc' ],
  [ '3000', 'N-n' ],
  [ '億', 'N-n' ],
  [ '円', 'N-nc' ],
  [ 'に', 'P-k' ],
  [ '迫る', 'V-c' ] ]
{ ans: [ [ '百五銀行', 'N-np' ] ],
  sys: [ [ '百', 'N-n' ], [ '五', 'N-n' ], [ '銀行', 'N-nc' ] ],
  updated: true }
[ [ '三重', 'N-pn' ],
  [ '県', 'N-nc' ],
  [ '内', 'Q-n' ],
  [ 'で', 'P-k' ],
  [ 'は', 'P-rj' ],
  [ '最', 'P' ],
  [ '大手', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '百', 'N-n' ],
  [ '五銀', 'N-np' ],
  [ '行', 'N-nc' ],
  [ 'の', 'P-k' ],
  [ '約', 'P' ],
  [ '5', 'N-n' ],
  [ '兆', 'N-nc' ],
  [ '3000', 'N-n' ],
  [ '億', 'N-n' ],
  [ '円', 'N-nc' ],
  [ 'に', 'P-k' ],
  [ '迫る', 'V-c' ] ]
