
niutrans / niutrans.smt

NiuTrans.SMT is an open-source statistical machine translation system developed by a joint team from the NLP Lab at Northeastern University and the NiuTrans Team. The system is written entirely in C++, so it runs fast and uses little memory. It currently supports phrase-based, hierarchical phrase-based, and syntax-based (string-to-tree, tree-to-string, and tree-to-tree) models for research-oriented studies.

License: GNU General Public License v2.0

Perl 1.94% Makefile 0.01% C++ 97.86% C 0.15% Shell 0.01% Python 0.01% Prolog 0.02% Batchfile 0.01% Raku 0.01%
decoder machine-translation parsing phrase-based-translation statistical-machine-translation

niutrans.smt's People

Contributors

liyinqiao2012, xiaotong


niutrans.smt's Issues

How should emoji and special symbols be handled in machine translation?

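I am working on neural machine translation and have already collected a fair amount of data. Chinese-to-English accuracy is acceptable, but things always go wrong when I set up parallel sentences, and during training the conversion between Chinese and Japanese or Korean is not accurate. Korean and Japanese grammar differs from Chinese grammar in many ways, so I would like to ask whether anyone has good suggestions. In addition, if a sentence contains an emoji during training, language identification goes wrong and the emoji is swallowed. Is there a good way to solve these problems?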
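One common workaround for the swallowed-emoji part of this question, independent of NiuTrans and not a feature of it, is to replace each emoji with a plain-ASCII placeholder before language identification and translation, then put the original characters back afterwards. The sketch below is only an illustration under stated assumptions: the function names (`mask_emoji`, `restore_emoji`), the `__EMO<k>__` placeholder format, and the two code point ranges used to detect emoji are all hypothetical choices, and the ranges are a rough approximation rather than the full Unicode emoji set.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Decode the UTF-8 code point starting at s[i] and advance i past it.
// Assumes well-formed UTF-8; this sketch does no validation.
static std::uint32_t next_code_point(const std::string& s, std::size_t& i) {
    assert(i < s.size());
    const unsigned char c = static_cast<unsigned char>(s[i]);
    const int len = (c < 0x80) ? 1 : ((c >> 5) == 0x6) ? 2 : ((c >> 4) == 0xE) ? 3 : 4;
    std::uint32_t cp = (len == 1) ? c : (c & (0xFF >> (len + 1)));
    for (int k = 1; k < len && i + k < s.size(); ++k)
        cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    i += len;
    return cp;
}

// Rough emoji test (assumption): two common blocks only, not the full emoji set.
static bool is_emoji(std::uint32_t cp) {
    return (cp >= 0x1F300 && cp <= 0x1FAFF) || (cp >= 0x2600 && cp <= 0x27BF);
}

// Replace every emoji with "__EMO<k>__" and remember the original bytes in `saved`.
std::string mask_emoji(const std::string& text, std::vector<std::string>& saved) {
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t start = i;
        const std::uint32_t cp = next_code_point(text, i);
        const std::string bytes = text.substr(start, i - start);
        if (is_emoji(cp)) {
            out += "__EMO" + std::to_string(saved.size()) + "__";
            saved.push_back(bytes);
        } else {
            out += bytes;
        }
    }
    return out;
}

// Put the saved emoji back wherever their placeholders survived.
std::string restore_emoji(std::string text, const std::vector<std::string>& saved) {
    for (std::size_t k = 0; k < saved.size(); ++k) {
        const std::string tag = "__EMO" + std::to_string(k) + "__";
        const std::size_t pos = text.find(tag);
        if (pos != std::string::npos) text.replace(pos, tag.size(), saved[k]);
    }
    return text;
}
```

The placeholder only helps if the tokenizer and the translation model pass it through unchanged, which has to be checked for the system in use. It does not address the grammar differences between Chinese, Japanese, and Korean raised in the same question.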

Corpus alignment question

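The example uses a Chinese-to-English system: src is set to the Chinese corpus path and tgt to the English corpus path.
If I want to train an English-to-Chinese system, should src still be the Chinese corpus and tgt still the English corpus?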

perl NiuTrans-running-segmenter.pl -lang ch -input ../sample-data/sample-submission-version/Test-set/Niu.test.txt -output ./sample-data/sample-submission-version/Test-set/pred -method 11

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn; not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11111
die with signal 11, with coredump
zyyt@ubuntu:~/liuqingmin/enkk_wmt/tools/NiuTrans.SMT/scripts$ vi ../sample-data/sample-submission-version/Test-set/Niu.test.txt
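The log above reports that four resource files are "not found or can't open", and all four paths are relative, so they are resolved against the directory the script is run from (the `scripts` directory, according to the shell prompt). One thing to rule out first is whether those files can actually be opened from there. The following is a minimal sketch of that check; `niuseg_resource_paths` and `unreadable_files` are hypothetical helper names, and the paths are copied from the error message. It says nothing about the separate "keys wcstok failed" line, which may have a different cause.

```cpp
#include <cassert>
#include <fstream>
#include <string>
#include <vector>

// The four paths named in the error message, relative to the working directory.
std::vector<std::string> niuseg_resource_paths() {
    return {"../resource/Dict0920/len2.lex",
            "../resource/Dict0920/len2.loc",
            "../resource/bi.org.dict",
            "../resource/Dict0920/len2.psn"};
}

// Hypothetical helper: returns the paths that cannot be opened for reading.
std::vector<std::string> unreadable_files(const std::vector<std::string>& paths) {
    std::vector<std::string> failed;
    for (const std::string& p : paths) {
        assert(!p.empty());
        std::ifstream in(p.c_str(), std::ios::binary);
        if (!in.is_open()) failed.push_back(p);
    }
    return failed;
}
```

Calling `unreadable_files(niuseg_resource_paths())` from the `scripts` directory and printing the result would show which of the four files, if any, are missing or unreadable.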

Domain-specific translation: roughly how much data does a statistical translation model need to produce reasonable translations?

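First of all, thank you for this project. Without knowing any Perl, I managed to complete the whole pipeline on my own corpus. (The only error I ran into was caused by a "#" character.)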

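My current data set has only a few thousand entries. With no data preprocessing at all, my experiment gives a BLEU score of 0.76 on the training set and 0.26 on the test set.
The model used is the hierarchical phrase-based model.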

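Besides the question in the title, I would also like to know whether switching to another open-source translation model would improve translation quality.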

Segmentation fault when using NiuTrans.Decoder

Hi, I am using the latest version of NiuTrans.SMT, and while following the Quick Walkthrough of the user manual, I encountered the following issue:
I am running this on CentOS Stream 8. Please advise. Thank you!

Multiple compile issues on Linux

My C++ is very weak, but a fresh pull followed by make produces multiple errors when building from master on Ubuntu 18.04. There appear to be many issues, the first of which is:

OurTree.cpp: In member function ‘bool smt::Tree::CreateForest(const char*)’:
OurTree.cpp:377:23: error: ISO C++ forbids comparison between pointer and integer [-fpermissive]
         while(ibeg != '\0'){
                       ^~~~
Makefile:13: recipe for target 'OurTree.o' failed
make[1]: *** [OurTree.o] Error 1
make[1]: Leaving directory '/home/a.melser/dev/NiuTrans.SMT/src/NiuTrans.Decoder'
Makefile:12: recipe for target 'all' failed
make: *** [all] Error 2

But there seem to be many others, such as missing variables (src/NiuTrans.PhraseExtractor/dispatcher.cpp, options.sort_phrase_table) and missing methods:

ruletable_scorer.cpp: In member function ‘bool ruletable_scorer::PhraseTable::generatePhraseTable(ruletable_scorer::PhraseAlignment&, bool&, std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&)’:
ruletable_scorer.cpp:280:80: error: no matching function for call to ‘ruletable_scorer::PhraseTable::output(std::ofstream&, bool&, ruletable_scorer::OptionsOfScore&, ruletable_scorer::ScoreClassifyNum&, double&)’
         output( outfile, inverseFlag, options, scoreClassifyNum ,totalFrequency);

And maybe more. Is there something I am missing, or has this version not been tested on Linux? If you have a version that has definitely been compiled on Linux, I can compare against it and help get this working!

FYI, none of the links to download packages on http://www.nlplab.com/NiuPlan/NiuTrans.html or http://www.niutrans.com/niutrans/NiuTrans.html are still working.
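For the first error quoted above (`OurTree.cpp:377`), the message indicates that `ibeg` is a pointer being compared with the character literal `'\0'`. Older C++ accepted that form because `'\0'` counted as a null pointer constant, so the loop effectively tested `ibeg != NULL`; since C++11 only a literal `0` (or `nullptr`) qualifies, which is why newer GCC versions reject it. Which meaning the author intended cannot be determined from the error alone. The sketch below is not the code from `OurTree.cpp`; it is a hypothetical helper, `count_until_nul`, that shows the dereferencing form, assuming the loop was meant to stop at the string terminator.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from OurTree.cpp): counts characters before the NUL terminator.
std::size_t count_until_nul(const char* ibeg) {
    assert(ibeg != nullptr);
    std::size_t n = 0;
    // `while (ibeg != '\0')` compares the pointer itself and no longer compiles.
    // Dereferencing tests the character instead. To keep the old behavior
    // (a null-pointer test), the condition would be `ibeg != nullptr`.
    while (*ibeg != '\0') {
        ++n;
        ++ibeg;
    }
    return n;
}
```

Building with `-fpermissive`, as the diagnostic hints, downgrades this error to a warning, but it leaves the ambiguous comparison in place.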

Error about "NiuTrans-running-segmenter"

Some errors occurred when I ran this script:
# Chinese preprocessing
perl NiuTrans-running-segmenter.pl \
    -lang ch \
    -input ../work/preprocessing/chinese.clean.txt \
    -output ../work/preprocessing/chinese.clean.txt.prepro \
    -method 01

The error output is as follows:

`########### SCRIPT ########### SCRIPT ############ SCRIPT ##########

NiuTrans Running NiuSeg (version 1.2.0 Beta) --www.nlplab.com

########### SCRIPT ########### SCRIPT ############ SCRIPT ##########
Running: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
--- Initialize Chinese program ...
--- Chinese_Wrapper : Load configure file.
--- Chinese_Wrapper : Configure file load finished.
--- Chinese_Wrapper : Initialize segmentation ...
Reading keys from ../resource/Dict0920/len2.lex...
Sorting keys...
Analyzing ...
keys wcstok failed

Error ##### chi_LM-Based_word_breaker reports lex:../resource/Dict0920/len2.lex||||||loc:../resource/Dict0920/len2.loc||||||org:../resource/bi.org.dict||||||psn:../resource/Dict0920/len2.psn not found or can't open!

--- Chinese_Wrapper : Segmentations initialize finished.
--- Chinese_Wrapper : Initialize preprocessor ...
--- all_PreProcessing_FullToHalf stand ready.
--- Chinese_Wrapper : PreProcessors initialize finished.
--- Chinese_Wrapper : Initialize prev-recognizers ...
--- all_PrevRecognition_RegexRecognizer stand ready.
--- Chinese_Wrapper : Prev-recognizers initialize finished.
--- Chinese_Wrapper : Initialize post-recognizers ...
--- chi_All_Post_Details stand ready.
--- all_PostRecognition_MergeAtomToCompose stand ready.
--- Chinese_Wrapper : Post-recognizers initialize finished.
--- Chinese_Wrapper : Initialize translators ...
--- chi_Translation_ChinumToArabicnum stand ready.
--- chi_Translation_ArabicNumToEngTranslate stand ready.
--- chi_Translation_BilingualDictionary stand ready.
--- chi_Translation_NumberTranslator stand ready.
Error: Execution of: ../bin/NiuSegmenter_CN_x64 ../config/NiuTrans.NiuSeg.ch.config tmp.config.chi 11101
die with signal 11, with coredump
`
Environment
Linux version 4.4.0-62-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) )
