
Comments (17)

crownpku avatar crownpku commented on July 17, 2024 2

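Another, uglier, approach is to write a script that replaces the space with @@ in every term like "MINI JCW" in the dictionary and in every "MINI JCW" matched in the text, turning it into "MINI@@JCW". Then the problem you describe no longer arises.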
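A minimal sketch of such a script, assuming the multi-word terms are available as a Python list. The function name `protect_terms` and the sample terms are illustrative and not part of rasa_nlu_chi:

```python
def protect_terms(text, terms, joiner="@@"):
    """Replace the spaces inside known multi-word terms with `joiner`."""
    # Longest terms first, so a shorter term never breaks up a longer one.
    for term in sorted(terms, key=len, reverse=True):
        if " " in term:
            text = text.replace(term, term.replace(" ", joiner))
    return text


terms = ["MINI JCW", "ANNA SUI"]  # hypothetical dictionary entries
print(protect_terms("我想买 MINI JCW 和 ANNA SUI", terms))
```

The same function would have to be applied both to the dictionary file and to the text, so that the two stay consistent.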

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024 2

@DoubleAix
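I trained it myself on a machine with 256 GB of RAM.

There are many Python packages online for converting between Simplified and Traditional Chinese. You can also keep using the model I trained and simply convert Traditional input to Simplified in your program before processing it.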
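A rough sketch of that preprocessing step. The mapping table below is a toy stand-in that covers only a handful of characters; in practice a full conversion package (OpenCC is one example) would take its place. The name `to_simplified` is hypothetical:

```python
# Toy Traditional -> Simplified table; a real converter ships a complete mapping.
T2S_TABLE = str.maketrans({
    "訓": "训", "練": "练", "內": "内", "體": "体",
    "簡": "简", "轉": "转", "換": "换",
})


def to_simplified(text):
    """Convert the characters covered by the toy table; leave the rest as-is."""
    return text.translate(T2S_TABLE)


# Convert the user input first, then pass the result on to the NLU pipeline.
print(to_simplified("簡繁轉換"))
```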

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024 1

Found it:
https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie/conll_tokenizer.h
Line 179:
else if (ch == ' ' || ch == '\t' || ch == '\n' || ch == '\r')
Change it to:
else if (ch == '@' || ch == '\t' || ch == '\n' || ch == '\r')

Here ch is of type char, so @@ cannot be used. I changed it to a single @ instead, otherwise the change would be much larger.
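To illustrate the effect of that one-character change, here is a small Python emulation of a character-by-character scan that splits on a set of single-character delimiters. It mirrors only the delimiter test quoted above and is not MITIE's code; the real tokenizer does more than this, and `split_on_delims` is a hypothetical name:

```python
def split_on_delims(line, delims="@\t\n\r"):
    """Split `line` on any single character in `delims`, one char at a time."""
    tokens, current = [], []
    for ch in line:
        if ch in delims:
            # A delimiter ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


# The space is no longer a delimiter, so "MINI JCW" stays in one piece.
print(split_on_delims("我@来到@MINI JCW@专卖店\n"))
```

Because each comparison looks at a single character, a two-character marker such as @@ would need extra state in the loop, which is why a single @ keeps the change small.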

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024 1

Only the MITIE training step needs a lot of memory; the other steps do not. If you use the MITIE model I trained, an ordinary PC should be enough to run rasa nlu.

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024

Which step produced this error? Pre-training word vectors with MITIE? I used a machine with 256 GB of RAM for that step.

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024

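Yes, it is the word-vector training. Our machines are mainly used as a cluster and no single node has much memory, so it looks like we will have to find a way to add RAM.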

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024

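Right. You could look through the related issues in the MITIE repository, or ask them directly, to see whether there is a distributed solution or something similar: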
https://github.com/mit-nlp/MITIE

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024

OK, thanks.

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024

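@crownpku Sorry to bother you again; I have another question about MITIE. In some cases a term itself contains a space, for example foreign brand names such as MINI JCW (cars) or ANNA SUI 安娜·苏 (cosmetics). So I would like to ask: how should I modify the MITIE code so that the jieba segmentation output can use a different delimiter, for example @@?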

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024

This is a question about the jieba segmentation step; please see https://github.com/fxsjy/jieba

An example:

import jieba

seg_list = jieba.cut("我来到北京清华大学", cut_all=False)
print("Default Mode: " + "@@".join(seg_list))  # accurate mode

Output:

Default Mode: 我@@来到@@北京@@清华大学

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024

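I have already changed that part. The point is that segmentation in the current pipeline is done like this:
python -m jieba -d " " ./test > ./test_cut
What I mean is, if I change it to
python -m jieba -d "@@" ./test > ./test_cut
I am afraid MITIE will not recognize it?
If I only change the jieba part but its output delimiter is still a space, then terms that contain spaces are still effectively split apart, aren't they?
The file previously given to MITIE was:
我 来到 北京 清华大学
Now it becomes:
我@@来到@@北京@@清华大学
Doesn't the MITIE code need to be changed as well?
I am not familiar with C and cannot really follow the MITIE code, so this part is unclear to me. Please bear with me.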
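The concern can be shown with a small sketch. It assumes the segmenter has already produced "MINI JCW" as a single token (the token list is made up for illustration) and compares what a downstream reader recovers from each delimiter:

```python
# Hypothetical segmenter output in which one token contains a space.
tokens = ["我", "喜欢", "MINI JCW"]

space_joined = " ".join(tokens)   # what -d " " would write
at_joined = "@@".join(tokens)     # what -d "@@" would write

# Splitting on spaces cannot tell the inner space from a delimiter.
print(space_joined.split(" "))

# Splitting on "@@" recovers the original tokens, but only if the
# consumer (here MITIE's tokenizer) also splits on that delimiter.
print(at_joined.split("@@"))
```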

from rasa_nlu_chi.

crownpku avatar crownpku commented on July 17, 2024

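MITIE's tokenizer should be here; you will need to read the code carefully to find exactly what to change:
https://github.com/mit-nlp/MITIE/blob/master/mitielib/include/mitie/unigram_tokenizer.h

Taking a step back, this step only builds the embedding, so special terms like "MINI JCW" do not matter much here.
My suggestion is to add such terms to the jieba dictionary only at the final training and inference stage.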

from rasa_nlu_chi.

jxg972 avatar jxg972 commented on July 17, 2024

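Thanks, I will go look at the code now. The main reason is that we are trying to apply this in the automotive domain, where special terms like "MINI JCW" are actually quite common, so I am worried the impact will be fairly large.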

from rasa_nlu_chi.

DoubleAix avatar DoubleAix commented on July 17, 2024

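Hello crownpku,
I have to train MITIE on Traditional Chinese wiki content,
but I cannot find a total_word_feature_extractor_chi.dat trained on Traditional Chinese anywhere online.
So I would like to ask how many GB of RAM is actually enough?
(The official site says 128 GB is needed; your blog suggests it may take tens of GB.)

Sorry for the trouble, and thank you!!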

from rasa_nlu_chi.

DoubleAix avatar DoubleAix commented on July 17, 2024

@crownpku
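That was a quick reply, and it is a really clever trick. I will try MITIE once my machine arrives.
Also, I am not familiar with rasa-nlu. From reading the documentation, is this MITIE step the only one that needs a lot of memory?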

from rasa_nlu_chi.

fangnster avatar fangnster commented on July 17, 2024

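@crownpku Hello, my machine has too little memory to train and produce the file total_word_feature_extractor_chi.dat.
The link where you previously shared this file no longer works. Could you share the trained file again?
Many thanks :-)~~~~~~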

from rasa_nlu_chi.

siennx avatar siennx commented on July 17, 2024

@crownpku Could you please share the total_word_feature_extractor_chi.dat file once more? All the links I found online are dead. Thanks.

from rasa_nlu_chi.
