aalto-speech / morfessor
Morfessor is a tool for unsupervised and semi-supervised morphological segmentation
Home Page: http://morpho.aalto.fi
License: BSD 2-Clause "Simplified" License
Morfessor 2.0 - Quick start
===========================

Installation
------------

Morfessor 2.0 is installed using setuptools library for Python. To build
and install the module and scripts to default paths, type

python setup.py install

For details, see http://docs.python.org/install/

Documentation
-------------

User instructions for Morfessor 2.0 are available in the docs directory
as Sphinx source files (see http://sphinx-doc.org/). Instructions how to
build the documentation can be found in docs/README.

The documentation is also available on-line at http://morfessor.readthedocs.org/

Details of the implemented algorithms and methods and a set of experiments
are described in the following technical report:

Sami Virpioja, Peter Smit, Stig-Arne Grönroos, and Mikko Kurimo.
Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline.
Aalto University publication series SCIENCE + TECHNOLOGY, 25/2013.
Aalto University, Helsinki, 2013. ISBN 978-952-60-5501-5.

The report is available online at http://urn.fi/URN:ISBN:978-952-60-5501-5

Contact
-------

Questions or feedback? Email: [email protected]
I noticed that the website says the latest version is 2.0.1 but the latest GitHub tag is 2.0.6:
http://morpho.aalto.fi/projects/morpho/morfessor2.html
https://github.com/aalto-speech/morfessor/releases
Is there a trained model for Finnish available for download somewhere?
Command-line.
flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2145) [01:21:54]
$ cat > kolme
yksi
kolme
flammie@saarkaany ~/Koodit/mt-development/complexity-stats (2146) [01:22:08]
$ morfessor -l europarl-v7.fi-en.fi.morfessor --output-format-separator '> <' --output-newlines --output-format '{analysis} ' -T - < kolme
INFO:morfessor.io:Loading model from 'europarl-v7.fi-en.fi.morfessor'...
INFO:morfessor.io:Done.
No training data files specified.
Segmenting test data...
INFO:morfessor.io:Reading corpus from '-'...
yksi
kolme
INFO:morfessor.io:Done.
Done.
There should be an empty line between yksi and kolme in the output. This is useful in machine translation pipelines, where tools commonly fail when the lines of the input and output don't match.
Hi, I'm developing a tokenizer for Korean.
Since my project is to develop a language model using SRILM's ngram, the role of the tokenizer is very important.
I haven't been able to experiment yet because the corpus is very large, but I would like an answer soon, so I'm opening this issue.
Is the result of Morfessor deterministic? In other words, will the same model be produced if training is repeated dozens of times?
If it is non-deterministic, are there any metrics or methods for measuring how much the resulting tokenizers differ in performance?
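One way to put a number on how much two training runs differ is to compare where they place morph boundaries on the same word list. The following is a minimal sketch under stated assumptions, not part of Morfessor: `boundaries` and `boundary_f1` are hypothetical helpers, and the two segmentations are hand-written stand-ins for the output of two runs.

```python
def boundaries(morphs):
    """Return the set of character offsets where a split occurs."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:  # no boundary after the last morph
        offset += len(morph)
        positions.add(offset)
    return positions


def boundary_f1(run_a, run_b):
    """Boundary F1 between two segmentations of the same words.

    run_a and run_b map each word to its list of morphs.
    Returns 1.0 when both runs split every word identically.
    """
    shared = total_a = total_b = 0
    for word in run_a:
        a = boundaries(run_a[word])
        b = boundaries(run_b[word])
        shared += len(a & b)
        total_a += len(a)
        total_b += len(b)
    if total_a == 0 and total_b == 0:
        return 1.0  # neither run split anything
    if shared == 0:
        return 0.0
    precision = shared / float(total_a)
    recall = shared / float(total_b)
    return 2 * precision * recall / (precision + recall)


# Hypothetical output of two training runs on the same word list.
run_1 = {"biography": ["bio", "graphy"], "design": ["de", "sign"]}
run_2 = {"biography": ["bi", "o", "graphy"], "design": ["de", "sign"]}
score = boundary_f1(run_1, run_2)
```

A score close to 1.0 across repeated runs would suggest that the runs produce nearly the same tokenizer; lower values indicate that the runs disagree on where words are split.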
Hi,
I use the following command for model training (Morfessor 2.0):
morfessor-train --traindata-list --logfile=log.log -S model.segm -d ones inputdata.txt
Then I use the following command for word segmentation:
morfessor-segment -L model.segm test.txt
Why is the segmentation output printed to the terminal? How can I save the segmented words to a specified file?
Looking forward to your advice or answers.
Best regards,
yapingzhao
I used morfessor-segment -L en.model test.data > test.morf
It works, but the resulting file test.morf has one word on each line.
As my corpus has one sentence per line, I would like the output in the same format, but I cannot find how to achieve that.
Thanks in advance
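One possible workaround is to do the line handling in a small script and call the segmenter once per word. The sketch below assumes that approach; `segment_lines` and `toy_segment` are hypothetical names, and `toy_segment` is only a stand-in. With a loaded Morfessor model, a callable such as `lambda w: model.viterbi_segment(w)[0]` could take its place.

```python
def segment_lines(lines, segment_word, morph_sep=" "):
    """Yield one output line per input line, keeping empty lines."""
    for line in lines:
        words = line.split()
        yield " ".join(morph_sep.join(segment_word(w)) for w in words)


# Stand-in segmenter for illustration only: splits longer words after
# the second character. It does not reflect any trained model.
def toy_segment(word):
    return [word[:2], word[2:]] if len(word) > 3 else [word]


sentences = ["the design works", "", "gender"]
output = list(segment_lines(sentences, toy_segment, morph_sep="+"))
```

Because every input line produces exactly one output line, empty lines included, the input and output files stay aligned line by line.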
I want to use Morfessor to separate Turkish words into stem+suffixes.
I don't have a sample database. So, I must create a new data set for training.
Can you give me some example data lines, in Turkish or English, of the kind that should be in the data set?
Thanks.
I wrote Python code to segment given words. The main code is:
import morfessor
io = morfessor.MorfessorIO()
model = io.read_any_model('model.bin')
with open('test.txt', 'r') as input_file:
    for line in input_file:
        words = line.strip().split()
        # Pair each word with its space-joined morphs
        morphemes = [(w, " ".join(model.viterbi_segment(w)[0])) for w in words]
Only a few words are segmented, but when I use the same model on the command line to segment the same text, most of the words are segmented:
$ morfessor-segment -l model.bin test.txt
So any idea what is wrong in my Python code? Thank you!
Hi There,
I tried to craft some simple training data like
design de sign, de sign
gender gen der, gen der
bilingual bi lingual, bi lingual
biography bio graphy, bio graphy
with a test list of
design
gender
bilingual
biography
and got this result:
morfessor -t td1.txt -S model.segm -T text.txt
Reading corpus from 'td1.txt'...
Detected utf-8 encoding
Done.
Compounds in training data: 16 types / 16 tokens
Starting batch training
Epochs: 0 Cost: 344.6809466060173
.................
Epochs: 1 Cost: 206.03260380373735
.................
Epochs: 2 Cost: 206.0326038037374
Done.
Epochs: 2
Final cost: 206.0326038037374
Training time: 0.017s
Saving segmentations to 'model.segm'...
Done.
Segmenting test data...
Reading corpus from 'text.txt'...
de sign
gen der
bi lingual
bi o graphy
Done.
Done.
whereas the expected result is
de sign
gen der
bi lingual
bio graphy
My question is
-R
Jarod
Downloading Morfessor-2.0.2.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-build-wy451zlq/morfessor/setup.py", line 9, in <module>
main_py = open('morfessor/__init__.py').read()
File "/usr/lib/python3.5/encodings/ascii.py", line 26, in decode
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 373: ordinal not in range(128)
cf061cf#diff-4ffe01edeab0886c81c728bf704ac894R13
Maybe you need to add # -*- coding: utf-8 -*- (see PEP 263):
https://www.python.org/dev/peps/pep-0263/
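The traceback shows a file being read with the ASCII codec, which can happen when `open()` falls back to the locale's default encoding. The sketch below reproduces that failure on a temporary file and shows that passing the encoding explicitly avoids it. The file name and its one-line content are made up for illustration, and this is one possible fix rather than a record of how the project resolved the issue.

```python
import io
import os
import tempfile

# Write a small file containing a non-ASCII character, encoded as UTF-8.
# The content is hypothetical; "\u00f6" is encoded as bytes 0xc3 0xb6.
text = u"__author__ = 'Stig-Arne Gr\u00f6nroos'\n"
path = os.path.join(tempfile.mkdtemp(), "example_init.py")
with io.open(path, "w", encoding="utf-8") as handle:
    handle.write(text)

# Decoding the raw bytes as ASCII fails, as in the traceback above.
with io.open(path, "rb") as handle:
    raw = handle.read()
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Passing the encoding explicitly works regardless of the locale.
with io.open(path, encoding="utf-8") as handle:
    decoded = handle.read()
```

`io.open` is used because it accepts the `encoding` argument on both Python 2 and Python 3.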
Hi,
I am trying to load a model on Python 3.6 using the Python API, but it fails.
Python 3.6.2 |Continuum Analytics, Inc.| (default, Jul 20 2017, 13:51:32)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import morfessor
>>> io = morfessor.MorfessorIO()
>>> mf = 'something.morfmodel.bin'
>>> model = io.read_binary_model_file(mf)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/someone/libs/miniconda3/envs/py3/lib/python3.6/site-packages/morfessor/io.py", line 179, in read_binary_model_file
model = pickle.load(fobj)
AttributeError: 'ConstrNode' object has no attribute '__dict__'
It is possible that the model was trained on Python 2 (I am not sure; my coworker trained it).
The question is: shouldn't a model trained on Python 2 work on Python 3, given that the same code is used?
Hi,
I was wondering if the English trained model behind your demo is available for others to use. I hope this is the case.
Colin Goldberg
vocab-vi.txt is a list of Vietnamese terms, with syllables separated by _. I tried using Morfessor to group the syllables into words:
morfessor -t vocab-vi.txt -T vocab-vi.txt -x lexicon-vi.txt -S lexicon-vi.morf --traindata-list --atom-separator '_'
and I got this error, from code that apparently hasn't been ported to Python 3:
Traceback (most recent call last):
File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
main(sys.argv[1:])
File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
morfessor.main(args)
File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 393, in main
args.finish_threshold, args.maxepochs)
File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 572, in train_batch
(w, _constructions_to_str(segments)))
File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/baseline.py", line 17, in _constructions_to_str
isinstance(constructions[0], unicode)):
NameError: name 'unicode' is not defined
If I try replacing that check with just a check for str, it doesn't solve the problem either; it just uncovers another one:
Traceback (most recent call last):
File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 22, in <module>
main(sys.argv[1:])
File "/home/rspeer/.virtualenvs/lum/bin/morfessor", line 13, in main
morfessor.main(args)
File "/home/rspeer/.virtualenvs/lum/lib/python3.5/site-packages/morfessor/cmd.py", line 466, in main
analysis = csep.join(constructions)
TypeError: sequence item 0: expected str instance, tuple found
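The two tracebacks point at two separate issues: the name `unicode` does not exist on Python 3, and the second error suggests that constructions can be tuples of atoms rather than plain strings when an atom separator is used. The sketch below shows one way to handle both cases. `constructions_to_str` here is a hypothetical stand-alone helper written for illustration, not Morfessor's own function, and the sample data is invented.

```python
try:
    text_types = (str, unicode)  # Python 2
except NameError:
    text_types = (str,)  # Python 3: `unicode` no longer exists


def constructions_to_str(constructions, atom_sep="_"):
    """Join constructions with ' + ', handling strings and atom tuples."""
    if constructions and isinstance(constructions[0], text_types):
        return " + ".join(constructions)
    # Otherwise assume each construction is a tuple of atoms.
    return " + ".join(atom_sep.join(atoms) for atoms in constructions)


plain = constructions_to_str(["bio", "graphy"])
atoms = constructions_to_str([("viet", "nam"), ("ha", "noi")])
```

Joining the atoms of each tuple first means the outer join always receives strings, which is what the `TypeError` above complains about.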
I am using Morfessor with word counts generated from Wikipedia. I noticed that the larger the word count file is, the larger the model is: the pickle file is around 0.5 GiB.
Is there a correlation?
What do you think the best practice is?
This is in reference to Gensim PR #1067.
On Python 2.6 (Travis job), the version check on line 18 of morfessor/io.py fails with the following error:
Traceback (most recent call last):
File "/home/travis/build/RaRe-Technologies/gensim/gensim/test/test_varembed_wrapper.py", line 49, in testEnsembleMorphemeEmbeddings
morfessor_model=varembed_model_morfessor_file, use_morphemes=True)
File "/home/travis/build/RaRe-Technologies/gensim/gensim/models/wrappers/varembed.py", line 70, in load_varembed_format
import morfessor
File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/__init__.py", line 29, in <module>
from .cmd import main, get_default_argparser, main_evaluation, \
File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/cmd.py", line 12, in <module>
from .io import MorfessorIO
File "/home/travis/miniconda2/envs/gensim-test/lib/python2.6/site-packages/morfessor/io.py", line 18, in <module>
PY3 = sys.version_info.major == 3
AttributeError: 'tuple' object has no attribute 'major'
There seems to be a really simple fix to get this working on Python 2.6: use sys.version_info[0] instead of sys.version_info.major.
If that's fine, I'll go ahead and submit a PR with this fix, as we plan to integrate that PR into Gensim as well.
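The proposed change can be sketched as follows. Index access works on the plain tuple that Python 2.6 returns as well as on the named tuple available from Python 2.7 onwards, so the check no longer depends on the `major` attribute.

```python
import sys

# sys.version_info[0] is the major version number on every Python
# version, whereas the .major attribute is missing on Python 2.6.
PY3 = sys.version_info[0] == 3
```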
Hello,
I'm getting this issue:
p3/bin/morfessor -t en-cs/train.en.tok --num-morph-types 50000 -S morf-models/morf-model.train.en-cs.50k.en -s morf-model.train.en-cs.50k.pickle.en
INFO:morfessor.io:Reading corpus from 'en-cs/train.en.tok'...
INFO:morfessor.io:Detected utf-8 encoding
INFO:morfessor.io:Done.
INFO:morfessor.baseline:Compounds in training data: 1938261 types / 1938261 tokens
INFO:morfessor.baseline:Starting batch training
INFO:morfessor.baseline:Epochs: 0 Cost: 75567655.89912468
.......................................................ERROR:morfessor:Fatal Error <class 'KeyError'> 'lhjij'
Traceback (most recent call last):
File "p3/bin/morfessor", line 22, in <module>
main(sys.argv[1:])
File "p3/bin/morfessor", line 13, in main
morfessor.main(args)
File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/cmd.py", line 435, in main
args.finish_threshold, args.maxepochs)
File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 595, in train_batch
segments = self._recursive_optimize(w, *algorithm_params)
File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 299, in _recursive_optimize
constructions += self._recursive_split(part)
File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 312, in _recursive_split
rcount, count = self._remove(construction)
File "/lnet/spec/work/people/machacek/morf-seg-nmt/p3/lib/python3.4/site-packages/morfessor/baseline.py", line 124, in _remove
rcount, count, splitloc = self._analyses[construction]
KeyError: 'lhjij'
Morfessor (2.0.3)
The input file is tokenized English side of CzEng. Is it correct?