niutrans / niutrans.smt


NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from the NLP Lab at Northeastern University and the NiuTrans Team. The system is written entirely in C++, so it runs fast and uses less memory. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string, and tree-to-tree) models for research-oriented studies.

License: GNU General Public License v2.0

Perl 1.94% Makefile 0.01% C++ 97.86% C 0.15% Shell 0.01% Python 0.01% Prolog 0.02% Batchfile 0.01% Raku 0.01%
machine-translation statistical-machine-translation decoder phrase-based-translation parsing

niutrans.smt's Issues

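Domain-specific translation: roughly how much data does a statistical translation model need to produce reasonable translations?

First of all, thank you for this project. Without knowing any Perl, I managed to run the whole pipeline on my own corpus. (I hit only one error, caused by a "#" character.)

My current dataset has only a few thousand entries. Without any data preprocessing, my experiment gives a BLEU of 0.76 on the training set and 0.26 on the test set.
The model used is the hierarchical phrase-based model.

Besides the question in the title, I would also like to know whether switching to another open-source translation model would improve translation quality.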

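Corpus alignment question

[image]
The example uses a Chinese-to-English system: src is set to the path of the Chinese corpus and tgt to the path of the English corpus.
If I want to train an English-to-Chinese system, should src still be the Chinese corpus and tgt the English corpus?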

Error about "NiuTrans-running-segmenter"

An error occurred when I ran this script:
# Chinese preprocessing
perl NiuTrans-running-segmenter.pl \
    -lang ch \
    -input ../work/preprocessing/chinese.clean.txt \
    -output ../work/preprocessing/chinese.clean.txt.prepro \
    -method 01

The error output is as follows:

`########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
die with signal 11, with coredump
`
Environment
Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )

perl NiuTrans-running-segmenter.pl -lang ch -input ../sample-data/sample-submission-version/Test-set/Niu.test.txt -output ./sample-data/sample-submission-version/Test-set/pred -method 11

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn; not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
die with signal 11, with coredump
zyyt@ubuntu:~/liuqingmin/enkk_wmt/tools/NiuTrans.SMT/scripts$ vi ../sample-data/sample-submission-version/Test-set/Niu.test.txt
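
Both logs above report that the dictionary files under ../resource/Dict0920/ were "not found or can't open", and those paths are relative, so they are resolved against the directory the script is run from. One way to narrow this down is to check from that same directory which of the listed files can actually be opened. The following is only a minimal sketch of such a check; the function name `UnreadablePaths` is hypothetical and is not part of NiuTrans, and the paths in the usage note are simply the ones printed in the error message.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of NiuTrans): returns the subset of
// `paths` that cannot be opened for reading from the current working
// directory. Relative paths are resolved against that directory.
std::vector<std::string> UnreadablePaths(const std::vector<std::string>& paths) {
    std::vector<std::string> missing;
    for (const std::string& p : paths) {
        std::ifstream in(p, std::ios::binary);
        if (!in.is_open()) {
            missing.push_back(p);  // missing file or insufficient permissions
        }
    }
    return missing;
}
```

Calling it from the scripts directory with "../resource/Dict0920/len2.lex", "../resource/Dict0920/len2.loc", "../resource/bi.org.dict" and "../resource/Dict0920/len2.psn" would list whichever of those files are absent or unreadable. An empty result would suggest the problem lies elsewhere, for example in the "keys wcstok failed" step that precedes the error.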

Multiple compile issues on Linux

My C++ is very weak, but a fresh pull and make from master on Ubuntu 18.04 fails with multiple errors. There appear to be many issues, the first of which is:

OurTree.cpp: In member function ‘bool smt::Tree::CreateForest(const char*)’:
OurTree.cpp:377:23: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
         while(ibeg != '\0'){
                       ^~~~
Makefile:13: recipe for target 'OurTree.o' failed
make[1]: *** [OurTree.o] Error 1
make[1]: Leaving directory '/home/a.melser/dev/NiuTrans.SMT/src/NiuTrans.Decoder'
Makefile:12: recipe for target 'all' failed
make: *** [all] Error 2
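
The first error says that `ibeg` is a pointer being compared with the character literal `'\0'`. Older compilers accepted this because `'\0'` was treated as a null pointer constant, while newer GCC versions reject it. The sketch below shows the usual fix under the assumption that `ibeg` is a `const char*` and that the loop is meant to stop at the end of the string. The function `CountChars` is a hypothetical stand-in, not code from OurTree.cpp. If the original intent was a null-pointer check instead, the equivalent fix would be `ibeg != nullptr`.

```cpp
#include <cstddef>

// Hypothetical stand-in for the loop at OurTree.cpp:377 (not the real code):
// walks a C string with a pointer and counts characters before the NUL.
std::size_t CountChars(const char* text) {
    const char* ibeg = text;
    std::size_t n = 0;
    // Rejected form: while (ibeg != '\0')  compares the pointer itself.
    // Dereferencing tests the character the pointer currently refers to.
    while (*ibeg != '\0') {
        ++ibeg;
        ++n;
    }
    return n;
}
```

Which of the two readings is correct can only be decided by looking at how `ibeg` is advanced inside the real loop body.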

But there seem to be many others, such as missing variables (src/NiuTrans.PhraseExtractor/dispatcher.cpp, options.sort_phrase_table) and missing methods:

ruletable_scorer.cpp: In member function ‘bool ruletable_scorer::PhraseTable::generatePhraseTable(ruletable_scorer::PhraseAlignment&, bool&, std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&)’:
ruletable_scorer.cpp:280:80: error: no matching function for call to ‘ruletable_scorer::PhraseTable::output(std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&, double&)’
         output( outfile, inverseFlag, options, scoreClassifyNum ,totalFrequency);

And maybe more. Is there something I am missing, or has this version not been tested on Linux? If you have a version that is known to compile on Linux that I can compare against, I can help get this working!

FYI, none of the package download links on http://www.nlplab.com/NiuPlan/NiuTrans.html or http://www.niutrans.com/niutrans/NiuTrans.html work anymore.

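How should emoji and special symbols be handled in machine translation?

In neural machine translation, I have already collected a fair amount of data. Chinese-to-English accuracy is acceptable, but things always go wrong when I set up the parallel sentences, and during training the conversion between Chinese and Japanese or Korean is inaccurate. Korean and Japanese grammar differs from Chinese grammar in many ways, so I would like to ask whether anyone has good suggestions. Also, if a sentence contains emoji during training, language identification goes wrong and the emoji are swallowed. Is there a good way to solve these problems?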

Segmentation fault when using NiuTrans.Decoder

Hi, I am using the latest version of NiuTrans.SMT, and while following the Quick Walkthrough in the user manual, I encountered the following issue:
[image]
I am running this on CentOS Stream 8. Please advise. Thank you!
