ericxsun / fasttext
This project forked from facebookresearch/fasttext
Library for fast text representation and classification.
License: Other
Is installation on Windows supported?
I encounter the following errors:
Using make:
c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
process_begin: CreateProcess(NULL, c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc, ...) failed.
make (e=2): The system cannot find the file specified.
make: *** [args.o] Error 2
Using cmake (on executing make && make install):
make: *** No targets specified and no makefile found. Stop.
$ git clone https://github.com/ericxsun/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install
gives me
c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/args.cc
src/args.cc: In member function ‘void fasttext::Args::printTrainingHelp()’:
src/args.cc:259:7: error: expected primary-expression before ‘<<’ token
<< " -incr incremental training, default ["
^
make: *** [args.o] Error 1
I want to do incremental training on the pretrained wiki.bin. Could you please tell me the format of the train.txt file that has to be provided? Should only sentences be provided, or are labels also required? Can the sentences be given as-is, or should they be tokenised?
First of all, thank you very much for the great work. Would really appreciate this feature. However, as I played around with it, I could not get started on re-training a word embedding model. Here are the steps to reproduce the error:
# run the sample script first to generate a model
$ bash word-vector-example.sh
# after the model is generated from the sample script
# we run
./fasttext skipgram -input data/fil9 -inputModel result/fil9.bin -output retrained -incr
The output displayed was:
Update args
Load dict from trained model
Load dict from training data
Read 124M words
Number of words: 218316
Number of labels: 0
Merge dict
Read 124M words
Number of words: 0
Number of labels: 0
terminate called after throwing an instance of 'std::invalid_argument'
what(): Empty vocabulary. Try a smaller -minCount value.
Aborted
I have tried adjusting the -minCount option, but it did not work.
After looking at the code, I suspect the error has something to do with
dict_->addDict(dictInData, false);
at line 744 of fasttext.cc, but I am not sure about the exact cause. In fact, I am a bit confused about why we can load a dictionary from the raw text we are supposed to train on.
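The log above shows the second dictionary reading 124M words but recording "Number of words: 0" before the merge. A minimal sketch of a minCount-style prune (hypothetical code, not fastText's actual implementation) illustrates why lowering -minCount cannot help once the merged counts are zero:

```cpp
#include <map>
#include <string>

// Keep only words whose count reaches the threshold, in the spirit of
// fastText's -minCount prune. If the merged counts arrive as zero, every
// entry is dropped no matter how small minCount is (it must be >= 1),
// which would produce exactly the "Empty vocabulary" error seen above.
std::map<std::string, long> pruneVocab(const std::map<std::string, long>& counts,
                                       long minCount) {
  std::map<std::string, long> kept;
  for (const auto& entry : counts) {
    if (entry.second >= minCount) {
      kept.insert(entry);
    }
  }
  return kept;
}
```

For example, pruneVocab({{"the", 0}, {"cat", 0}}, 1) is empty while pruneVocab({{"the", 5}}, 1) keeps "the" — consistent with the log, where no value of -minCount can rescue a merge that recorded zero counts.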
Hi,
I downloaded the german wiki word vectors from fasttext: https://fasttext.cc/docs/en/pretrained-vectors.html (bin+text)
I copy-pasted some political text into a training file (around 180 MB) and cleaned it with:
cat train.txt | sed -E -e 'y/[]/()/' -e "s|([.!?,'/()])| \1 |g" | tr "[:upper:]" "[:lower:]" | sed -e 'y/ÖÜÄ/öüä/' > train_clean.txt
Trained the wiki.de.bin
./fasttext skipgram -input train_clean.txt -inputModel wiki.de.bin -output wikiplus -incr
Extracted the vectors from the *.vec files
Performed a cluster analysis and was amazed at how poor my new vectors are. E.g. the clustering with the original vectors would put bus and train together; the new vectors make little sense.
Do you have any idea why?
Trained the wiki.en.bin
./fasttext supervised -input formed.txt -inputModel wiki.en.bin -output retrained -incr
What(): wiki.wiki.en.bin cannot be opened for re-training
Aborted
A sample of our data looks like this "__label__1 Tom is a good cat"
Could you please look at my error and help me?
Hi,
I am trying to use the word vectors from here:
https://fasttext.cc/docs/en/crawl-vectors.html
and use the -pretrainedVectors option to build a classifier.
./fasttext supervised -input some_data.txt -output some_data.txt.model -pretrainedVectors model.bin -dim 300
However I get:
terminate called after throwing an instance of 'std::invalid_argument'
what(): Dimension of pretrained vectors (-1) does not match dimension (300)!
Aborted (core dumped)
or
terminate called after throwing an instance of 'std::invalid_argument'
what(): Dimension of pretrained vectors (-283686952306184) does not match dimension (300)!
Aborted (core dumped)
Content of sample_data.txt is:
__label__a hello
__label__b good
__label__c bad
I appreciate any advice.
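One hedged guess: -pretrainedVectors expects a text-format .vec file whose first line is "<vocab_size> <dim>", not a binary .bin model, so feeding it model.bin would make the parsed dimension come out as garbage like the -1 or -283686952306184 above. A small sketch of that header parse (parseDim is a hypothetical helper, not fastText's API):

```cpp
#include <sstream>
#include <string>

// Parse the first line of a .vec file, "<vocab_size> <dim>".
// On binary input the formatted extraction fails, dim keeps its
// sentinel value, and the caller would report something like
// "Dimension of pretrained vectors (-1) does not match dimension (300)".
long long parseDim(const std::string& firstLine) {
  std::istringstream in(firstLine);
  long long words = 0;
  long long dim = -1;
  in >> words >> dim;
  return dim;
}
```

For a well-formed header such as "2000000 300" this returns 300; on non-numeric bytes it returns the -1 sentinel. If this guess is right, passing the .vec download from the crawl-vectors page instead of the .bin should avoid the mismatch.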
Hi,
Being able to fine-tune a pre-trained model is an amazing feature. However, I don't know why, but when I start running my command './fasttext supervised -input -inputModel -output -thread 25 -incr', it says 'Load dict from trained model' and never goes on to the next step (I waited more than 1 hour).
Is the syntax correct? What am I missing?
Thanks
Yohan
Hi,
I cloned this project and got an error while executing the command "make".
-------------- first error begin --------------
/opt/workspace/fastText/src/args.cc: In member function ‘void fasttext::Args::printTrainingHelp()’:
/opt/workspace/fastText/src/args.cc:259:7: error: expected primary-expression before ‘<<’ token
<< " -incr incremental training, default ["
^~
CMakeFiles/fasttext-static.dir/build.make:62: recipe for target 'CMakeFiles/fasttext-static.dir/src/args.cc.o' failed
make[2]: *** [CMakeFiles/fasttext-static.dir/src/args.cc.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fasttext-static.dir/all' failed
make[1]: *** [CMakeFiles/fasttext-static.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
-------------- first error end --------------
The above error occurs at line 258 in args.cc
The semicolon should be removed.
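For reference, "expected primary-expression before '<<'" is what a stray semicolon inside a chained stream-insertion statement produces: the semicolon ends the statement early, so the next line begins with << and has no left operand. A minimal illustration (the help-text lines here are invented, not the actual args.cc contents):

```cpp
#include <sstream>
#include <string>

// The whole << chain must be one statement, terminated once at the end.
// Writing ';' after the first insertion (as at args.cc line 258) makes
// the following line start with '<<', which triggers
// "error: expected primary-expression before '<<' token".
std::string trainingHelp() {
  std::ostringstream out;
  out << "  -lr    learning rate\n"
      << "  -incr  incremental training, default [false]\n";
  return out.str();
}
```

Deleting the premature semicolon rejoins the chain into a single valid statement, which is exactly the fix described above.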
After removing the semicolon, further errors occur, as follows:
-------------- second error begin --------------
Scanning dependencies of target fasttext-static
[ 2%] Building CXX object CMakeFiles/fasttext-static.dir/src/args.cc.o
[ 5%] Building CXX object CMakeFiles/fasttext-static.dir/src/dictionary.cc.o
[ 8%] Building CXX object CMakeFiles/fasttext-static.dir/src/fasttext.cc.o
/opt/workspace/fastText/src/fasttext.cc: In member function ‘void fasttext::FastText::predict(std::istream&, int32_t, bool, fasttext::real, std::__cxx11::string)’:
/opt/workspace/fastText/src/fasttext.cc:440:19: error: declaration of ‘std::istream& in’ shadows a parameter
std::istream& in,
^~
/opt/workspace/fastText/src/fasttext.cc:440:19: error: ‘in’ declared as reference but not initialized
/opt/workspace/fastText/src/fasttext.cc:441:13: error: expected initializer before ‘k’
int32_t k,
^
/opt/workspace/fastText/src/fasttext.cc:468:31: error: qualified-id in declaration before ‘(’ token
void FastText::printLabelStats(
^
/opt/workspace/fastText/src/fasttext.cc:509:31: error: qualified-id in declaration before ‘(’ token
void FastText::printLabelStats(std::istream& in, int32_t k, real threshold)
^
/opt/workspace/fastText/src/fasttext.cc:537:33: error: qualified-id in declaration before ‘(’ token
void FastText::getSentenceVector(std::istream& in, fasttext::Vector& svec) {
^
/opt/workspace/fastText/src/fasttext.cc:570:28: error: qualified-id in declaration before ‘(’ token
void FastText::ngramVectors(std::string word) {
^
/opt/workspace/fastText/src/fasttext.cc:588:37: error: qualified-id in declaration before ‘(’ token
void FastText::precomputeWordVectors(Matrix& wordVectors) {
^
/opt/workspace/fastText/src/fasttext.cc:601:22: error: qualified-id in declaration before ‘(’ token
void FastText::findNN(
^
/opt/workspace/fastText/src/fasttext.cc:631:25: error: qualified-id in declaration before ‘(’ token
void FastText::analogies(int32_t k) {
^
/opt/workspace/fastText/src/fasttext.cc:663:27: error: qualified-id in declaration before ‘(’ token
void FastText::trainThread(int32_t threadId) {
^
/opt/workspace/fastText/src/fasttext.cc:702:27: error: qualified-id in declaration before ‘(’ token
void FastText::loadVectors(std::string filename) {
^
/opt/workspace/fastText/src/fasttext.cc:745:21: error: qualified-id in declaration before ‘(’ token
void FastText::train(const Args args) {
^
/opt/workspace/fastText/src/fasttext.cc:953:28: error: qualified-id in declaration before ‘(’ token
void FastText::startThreads() {
^
/opt/workspace/fastText/src/fasttext.cc:981:27: error: qualified-id in declaration before ‘(’ token
int FastText::getDimension() const {
^
/opt/workspace/fastText/src/fasttext.cc:985:23: error: qualified-id in declaration before ‘(’ token
bool FastText::isQuant() const {
^
/opt/workspace/fastText/src/fasttext.cc: At global scope:
/opt/workspace/fastText/src/fasttext.cc:989:1: error: expected ‘}’ at end of input
} // namespace fasttext
^
CMakeFiles/fasttext-static.dir/build.make:110: recipe for target 'CMakeFiles/fasttext-static.dir/src/fasttext.cc.o' failed
make[2]: *** [CMakeFiles/fasttext-static.dir/src/fasttext.cc.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fasttext-static.dir/all' failed
make[1]: *** [CMakeFiles/fasttext-static.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2
-------------- second error end --------------
How do I build this project?
Hi Eric,
Thanks for the excellent enhancement. I am trying to use your repo for incremental learning, but I am getting a memory error while running the script. My machine has 32 GB of RAM, and I am otherwise able to load the pre-trained model for inference tasks.
Pre-trained model size: 6.8 GB
Command executed:
./fasttext skipgram -input /home/aaa/Downloads/datasets/nlu/sed_sof_corpus.txt -inputModel /home/aaa/Downloads/datasets/wiki-news-300d-1M-subword.bin -output sed_sof_trlearn -incr