
Comments (50)

BrikerMan commented on July 17, 2024

We only used MITIE for the word vectors, so could we use gensim's word2vec to replace them? Or is there a fundamental difference between the two?

from rasa_nlu_chi.

crownpku commented on July 17, 2024

This file provides word-vector support for rasa NLU, and it should be in MITIE's own binary format. What are you trying to accomplish with this issue?

Jacky-Chiu commented on July 17, 2024

I read your article and followed your WeChat account. My main goal right now is to get some corpora to build a knowledge base. I also seem to have seen a knowledge-graph API that can be called, and I want to try building a question-answering bot based on the materials and papers I've collected.

crownpku commented on July 17, 2024

total_word_feature_extractor_zh.dat is just word vectors; it has nothing to do with a knowledge base.

Jacky-Chiu commented on July 17, 2024

Understood, thanks!

BrikerMan commented on July 17, 2024

Hi, I now have a batch of movie titles and related corpora. Is there a way to continue training on top of your trained total_word_feature_extractor_zh.dat with this new data, or can it only be retrained from scratch with wordrep?

crownpku commented on July 17, 2024

@BrikerMan As far as I know, you can only retrain from scratch (if the movie corpus isn't large enough, you can train it together with something like a Wikipedia dump). Also, you should use the same jieba setup, loaded with your own custom dictionary, for the segmentation preprocessing.

BrikerMan commented on July 17, 2024

@crownpku Got it, thanks! I'll give it a try.

BrikerMan commented on July 17, 2024

@crownpku Have you tried training a spaCy model? MITIE training is single-threaded only, which is too slow, and every time the movie-title list is updated the whole step has to be redone.

crownpku commented on July 17, 2024

spaCy's Chinese support also just calls jieba for the segmentation part... My MITIE training takes about 2 days, which is actually tolerable.
This model doesn't need frequent updates. I think retraining is only needed when the corpus changes or grows by, say, 30% or more; otherwise the difference is small.

BrikerMan commented on July 17, 2024

OK, looks like that's the only option. Besides that, after my MITIE model finished training, training rasa NLU is also very slow. I only have 30 samples; it seems to be the same problem as mit-nlp/MITIE#11 (comment). Roughly how much data does your NLU have, and how long does it take to train?

crownpku commented on July 17, 2024

The MITIE classifier is slow; using sklearn for classification is much faster, and 30 samples should finish training within a minute.
In theory word2vec is the more common approach. The rasa_nlu team insists on using MITIE to train word vectors, apparently because, combined with MITIE's NLP algorithms, the vectors store more semantic information and give better results.

BrikerMan commented on July 17, 2024

With MITIE for Chinese NLU, there's no way to use sklearn as the classifier, right? With the config below, 30 samples take about 40 minutes.

{
  "name": "rasa_zh_nlu",
  "pipeline": [
    "nlp_mitie",
    "tokenizer_bf",
    "ner_mitie",
    "ner_synonyms",
    "intent_entity_featurizer_regex",
    "intent_featurizer_mitie",
    "intent_classifier_sklearn"
  ],
  "language": "zh",
  "mitie_file": "./data/total_word_feature_extractor.dat",
  "path": "./models",
  "data": "./data/nlu_data.json"
}
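For context, under the rasa_nlu 0.x releases this thread is about, a JSON config like the one above was passed to the training module from the command line. A sketch with the config saved as a hypothetical `config.json`; exact flags may vary between minor versions:

```shell
# Train rasa_nlu 0.x using the JSON pipeline config shown above.
# Assumes rasa_nlu (0.x) and its MITIE/sklearn dependencies are installed.
python -m rasa_nlu.train --config config.json
```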


crownpku commented on July 17, 2024

That config already uses intent_classifier_sklearn; MITIE is only used to generate features.
With an essentially identical config, mine did finish training within a minute, though the jieba part wasn't using a custom dictionary.
Also, is tokenizer_bf a custom tokenizer of yours? Could it be the reason for the slowness?

BrikerMan commented on July 17, 2024

The tokenizer is basically the same as yours; I just added loading of a custom dictionary. Can I share my data with you so you can try running it? The data is here: https://github.com/BrikerMan/rasa-demo/blob/master/data.json

crownpku commented on July 17, 2024

@BrikerMan Sure, send it to my email: [email protected]
My suspicion is that loading the custom dictionary is what's slow...

BrikerMan commented on July 17, 2024

It's the same even after switching to 'tokenizer_jieba'. It seems to be this problem: RasaHQ/rasa#260 (comment)

BrikerMan commented on July 17, 2024

@crownpku Any results?

crownpku commented on July 17, 2024

@BrikerMan I haven't received your sample data...

BrikerMan commented on July 17, 2024

I put it directly on GitHub, as mentioned above: https://github.com/BrikerMan/rasa-demo/blob/master/data.json

crownpku commented on July 17, 2024

I'm running with your data now; it is indeed very slow at the classification step...

Part I: train segmenter
words in dictionary: 200000
num features: 271
now do training
C:           20
epsilon:     0.01
num threads: 1
cache size:  5
max iterations: 2000
loss per missed segment:  3
C: 20   loss: 3         0.807018
C: 35   loss: 3         0.807018
C: 20   loss: 4.5       0.877193
C: 5   loss: 3  0.807018
C: 20   loss: 1.5       0.789474
C: 20   loss: 6         0.877193
C: 20   loss: 5.25      0.877193
C: 21.5   loss: 4.65    0.877193
C: 16.9684   loss: 4.72073      0.877193
C: 18.2577   loss: 4.43072      0.877193
C: 18.2131   loss: 4.55681      0.877193
C: 20   loss: 4.4       0.877193
C: 20.9694   loss: 4.47547      0.877193
best C: 20
best loss: 4.5
num feats in chunker model: 4095
train: precision, recall, f1-score: 1 1 1
Part I: elapsed time: 4 seconds.

Part II: train segment classifier
now do training
num training samples: 58


Still running. It's stuck at ner_mitie; let me think about what's going on.


BrikerMan commented on July 17, 2024

@crownpku OK, thanks! I'm also trying to figure out why it's so slow.

BrikerMan commented on July 17, 2024

Any progress?

kevinsay commented on July 17, 2024

I have 178 samples, and it's very slow whether or not I add a custom dictionary.

cloudskyme commented on July 17, 2024

Hi, total_word_feature_extractor_zh.dat can no longer be downloaded. Is there anywhere else to get it?

crapthings commented on July 17, 2024

I've downloaded the file; where should it go?

I put it at
models/default.dat
but it still says the file can't be found.

Every run I have to pass --path ./models/default.data

Then it reports:

curl -XPOST localhost:5000/parse -d '{"q":"我发烧了该吃什么药?", "project": "rasa_nlu_test", "model": "model_20170921-170911"}' | python -mjson.tool
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   160    0    60  100   100   7545  12575 --:--:-- --:--:-- --:--:-- 14285
{
    "error": "No project found with name 'rasa_nlu_test'."
}


KevinZhou92 commented on July 17, 2024

@kevinsay Hi, could you share total_word_feature_extractor_zh.dat again? When I use the copy I downloaded, I get: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 40: invalid start byte

kevinsay commented on July 17, 2024

@KevinZhou92 https://pan.baidu.com/s/1ojAr5usOtThrTtHDSdpwiw aqqd

KevinZhou92 commented on July 17, 2024

@kevinsay Thanks!

yuxuan2015 commented on July 17, 2024

Does anyone know what the data inside total_word_feature_extractor_zh.dat looks like?

crapthings commented on July 17, 2024

@yuxuan2015
It's a trained binary; I don't think inspecting it would help.

yuxuan2015 commented on July 17, 2024

@crapthings Then do you know how to replace it with word2vec vectors?

mashagua commented on July 17, 2024

Hi, this file is gone. Could you share a copy? @KevinZhou92

KevinZhou92 commented on July 17, 2024

@mashagua Link: https://pan.baidu.com/s/1kNENvlHLYWZIddmtWJ7Pdg Password: p4vx

commented on July 17, 2024

Hi, was the cause of the slow training on 58 samples that BrikerMan reported above ever found? Training 90 samples is also very slow for me; it has been running for several hours without finishing.

yanolele commented on July 17, 2024

Hi!
@KevinZhou92
The file is gone again; could you share another copy with me?

siennx commented on July 17, 2024

Could some kind soul share the file? I've been searching for a long time and all the links are dead. Thanks.

KevinZhou92 commented on July 17, 2024

@siennx @yanolele Link: https://pan.baidu.com/s/1kNENvlHLYWZIddmtWJ7Pdg Password: p4vx

Edit: I posted the wrong link earlier, sorry; it has been fixed.

siennx commented on July 17, 2024

@KevinZhou92 Thanks for sharing, but the first time I opened it and entered the password it said the page doesn't exist, and every attempt after that says the page doesn't exist too. Am I doing something wrong?
Update: Sorry, I tried the new link and still hit the "page does not exist" problem. Could you take another look? Thanks.

aqiank commented on July 17, 2024

I downloaded this file a long time ago. I'm not sure whether it's the same one. I've uploaded it to MEGA; the download may be a bit slow.

Link: https://mega.nz/#!EWgTHSxR!NbTXDAuVHwwdP2-Ia8qG7No-JUsSbH5mNQSRDsjztSA
SHA-1: 1c0f473464d14c706af695f5791e6e959d5efac8

siennx commented on July 17, 2024

Thanks for sharing the file; I've downloaded it.

Ma-Dan commented on July 17, 2024

MITIE's wordrep training is extremely time-consuming. Training on about 1 GB of Chinese Wikipedia corpus required 64 GB of RAM, and wordrep uses only a single CPU core: it took 56 hours from start until word_vects.dat was produced, and then another 7 hours to train total_word_feature_extractor.dat from word_vects.dat.
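For reference, the wordrep pipeline described above is driven from MITIE's command-line tool; a sketch with illustrative paths (wordrep builds with CMake and, as noted, runs single-threaded):

```shell
# Build MITIE's wordrep tool (paths are illustrative).
git clone https://github.com/mit-nlp/MITIE.git
cd MITIE/tools/wordrep
mkdir build && cd build
cmake .. && cmake --build .

# corpus_dir should contain only plain-text files, pre-segmented with
# the same jieba dictionary you will use at inference time. This step
# produces word_vects.dat and eventually total_word_feature_extractor.dat.
./wordrep -e /path/to/corpus_dir
```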

red-frog commented on July 17, 2024

I've run into the same slowness problem. Is there a solution yet?

yijinsheng commented on July 17, 2024

One issue is the long training time. On top of that, I used a 118 MB training dataset and training simply died: on an 8-core CentOS machine with more than 500 GB of RAM, after several hours it just showed "Killed". Has anyone run into this? Googling suggests it may be caused by using the MITIE classifier. There isn't much material on this for Chinese; any pointers would be appreciated.

yangyang1719 commented on July 17, 2024

Running coloredlogs-10.0/setup.py -q bdist_egg --dist-dir /tmp/easy_install-tkWOQ3/coloredlogs-10.0/egg-dist-tmp-Bmzmr6
Killed: 9

Why does it return Killed: 9?

shengyaokai commented on July 17, 2024

@BrikerMan Could I ask how you all train the utterances you need?

shengyaokai commented on July 17, 2024

@crownpku OK, thanks! I'm also trying to figure out why it's so slow.

Could I ask how you train the utterances you need?
