ericxsun / fasttext

This project forked from facebookresearch/fasttext

Library for fast text representation and classification.

License: Other

Shell 1.99% CMake 0.12% Makefile 0.09% Python 4.34% C++ 7.52% JavaScript 10.08% HTML 73.57% CSS 2.17% Perl 0.12%

fasttext's Issues

Installation on Windows

Is installation on Windows supported?
I encounter the following errors:

  1. Using make:
     c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc
     process_begin: CreateProcess(NULL, c++ -pthread -std=c++0x -O3 -funroll-loops -c src/args.cc, ...) failed.
     make (e=2): The system cannot find the file specified.
     make: *** [args.o] Error 2

  2. Using cmake (on executing make && make install):
     make: *** No targets specified and no makefile found. Stop.

Error in make

$ git clone https://github.com/ericxsun/fastText.git
$ cd fastText
$ mkdir build && cd build && cmake ..
$ make && make install

gives me

c++ -pthread -std=c++0x -march=native -O3 -funroll-loops -c src/args.cc
src/args.cc: In member function ‘void fasttext::Args::printTrainingHelp()’:
src/args.cc:259:7: error: expected primary-expression before ‘<<’ token
       << "  -incr               incremental training, default ["
       ^
make: *** [args.o] Error 1

Regarding Incremental training

I want to do incremental training on the pretrained wiki.bin. Could you please tell me the format of the train.txt file that has to be provided? Should only the sentences be provided, or are the labels also required? Can the sentences be given as they are, or should they be tokenised?

Errors during loading Dictionary for word embedding incremental training

First of all, thank you very much for the great work; I really appreciate this feature. However, when I played around with it, I could not get started on retraining a word embedding model. Here are the steps to reproduce the error:

# run the sample script first to generate a model
$ bash word-vector-example.sh

# after the model is generated from the sample script
# we run
./fasttext skipgram -input data/fil9 -inputModel result/fil9.bin -output retrained -incr

The output displayed was:

Update args
Load dict from trained model
Load dict from training data
Read 124M words
Number of words:  218316
Number of labels: 0
Merge dict
Read 124M words
Number of words:  0
Number of labels: 0
terminate called after throwing an instance of 'std::invalid_argument'
  what():  Empty vocabulary. Try a smaller -minCount value.
Aborted

I have tried to adjust -minCount option, but it did not work.

After looking at the code, I feel that the error has something to do with

dict_->addDict(dictInData, false);

from line 744 of fasttext.cc, but I am not sure about the exact cause of this problem. In fact, I am a bit confused about why we load a dictionary from the raw text we are supposed to train on.

Word embeddings (vectors) lost their 'sense' after incr training

Hi,
I downloaded the German wiki word vectors from fastText: https://fasttext.cc/docs/en/pretrained-vectors.html (bin+text)
I copy-pasted some political text into a training file (around 180 MB) and cleaned it with:
cat train.txt | sed -e 'y/[]/()/' -e "s/\([.!?,'/()]\)/ \1 /g" | tr "[:upper:]" "[:lower:]" | sed -e 'y/ÖÜÄ/öüä/' > train_clean.txt
Then I continued training from wiki.de.bin:
./fasttext skipgram -input train_clean.txt -inputModel wiki.de.bin -output wikiplus -incr
I extracted the vectors from the *.vec files and ran a cluster analysis, and I was surprised at how poor the new vectors are. For example, the original vectors cluster "bus" and "train" together, while the clusters from the new vectors do not make much sense.
Do you have any idea why?
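
For reference, the cleaning step in this report can be approximated in Python. This is only a sketch of what the sed/tr pipeline appears intended to do (map square brackets to parentheses, pad punctuation with spaces, lowercase the text). The name clean_line is hypothetical, and Python's str.lower() lowercases every Unicode letter rather than only Ö, Ü and Ä, so the output may differ slightly from the shell version. It does not address the vector quality question itself.

```python
import re

# Characters the sed expression pads with spaces: . ! ? , ' / ( )
PUNCT = re.compile(r"([.!?,'/()])")
# Same mapping as sed's y/[]/()/
BRACKETS = str.maketrans("[]", "()")


def clean_line(line: str) -> str:
    """Approximate the shell cleaning pipeline for one line of text."""
    line = line.translate(BRACKETS)   # [ ] -> ( )
    line = PUNCT.sub(r" \1 ", line)   # surround punctuation with spaces
    return line.lower()               # Unicode-aware, so Ö/Ü/Ä become ö/ü/ä


if __name__ == "__main__":
    # Splitting on whitespace shows the resulting tokens.
    print(clean_line("Der Bus [Linie 5] fährt, ÖPNV!").split())
```

Comparing a few cleaned lines against the tokens in the pretrained vocabulary is one way to check that the new corpus is tokenised the same way as the data the original model was trained on.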

wiki.en.bin cannot be opened for re-training?

I tried to retrain wiki.en.bin:
./fasttext supervised -input formed.txt -inputModel wiki.en.bin -output retrained -incr

What(): wiki.wiki.en.bin cannot be opened for re-training
Aborted

A sample of our data looks like this: "__label__1 Tom is a good cat"

Could you please look at this error and help me?
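
As a side note on the data format: in upstream fastText's supervised mode, labels are tokens that start with a prefix, `__label__` by default, and the remaining tokens are the text. Below is a minimal Python sketch for sanity-checking training lines before a run. split_labels is a hypothetical helper, not part of fastText, and it says nothing about the "cannot be opened" error itself.

```python
LABEL_PREFIX = "__label__"  # fastText's default label prefix


def split_labels(line: str, prefix: str = LABEL_PREFIX):
    """Split one training line into (labels, text).

    Any whitespace-separated token that starts with the prefix is
    treated as a label; everything else is kept as text.
    """
    labels, words = [], []
    for token in line.split():
        if token.startswith(prefix):
            labels.append(token[len(prefix):])
        else:
            words.append(token)
    return labels, " ".join(words)


if __name__ == "__main__":
    print(split_labels("__label__1 Tom is a good cat"))
```

A line that comes back with an empty label list would be worth inspecting before supervised training.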

Dimension mismatch when using pretrained vectors

Hi,

I am trying to use the word vectors from here:
https://fasttext.cc/docs/en/crawl-vectors.html

with the -pretrainedVectors option to train a classifier:
./fasttext supervised -input some_data.txt -output some_data.txt.model -pretrainedVectors model.bin -dim 300

However I get:

terminate called after throwing an instance of 'std::invalid_argument'
what(): Dimension of pretrained vectors (-1) does not match dimension (300)!
Aborted (core dumped)

or

terminate called after throwing an instance of 'std::invalid_argument'
what(): Dimension of pretrained vectors (-283686952306184) does not match dimension (300)!
Aborted (core dumped)

Content of some_data.txt is:
__label__a hello
__label__b good
__label__c bad

I appreciate any advice.
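
One likely cause, assuming this fork behaves like upstream fastText here: -pretrainedVectors expects a text .vec file whose first line holds the number of words and the vector dimension, whereas the command above passes a binary .bin model. Reading a binary file as text would explain the meaningless dimension values. The Python sketch below checks a file's header before training; parse_vec_header and check_pretrained are hypothetical helper names, not part of fastText.

```python
from typing import Optional, Tuple


def parse_vec_header(first_line: bytes) -> Optional[Tuple[int, int]]:
    """Return (n_words, dim) if the bytes look like a .vec header, else None."""
    try:
        fields = first_line.decode("utf-8").split()
        if len(fields) != 2:
            return None
        n_words, dim = int(fields[0]), int(fields[1])
    except (UnicodeDecodeError, ValueError):
        # Binary data or non-numeric fields: not a text header.
        return None
    if n_words <= 0 or dim <= 0:
        return None
    return n_words, dim


def check_pretrained(path: str, expected_dim: int) -> None:
    """Raise ValueError unless `path` is a text .vec file of the expected dimension."""
    with open(path, "rb") as f:
        header = parse_vec_header(f.readline(256))
    if header is None:
        raise ValueError(f"{path}: no '<n_words> <dim>' text header (binary .bin model?)")
    if header[1] != expected_dim:
        raise ValueError(f"{path}: dimension {header[1]} does not match -dim {expected_dim}")
```

If the check fails on model.bin, pointing -pretrainedVectors at the corresponding text .vec file from the same download page may be worth trying.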

Load dict from trained model

Hi,

Being able to fine-tune a pre-trained model is an amazing tool. However, when I run my command './fasttext supervised -input -inputModel -output -thread 25 -incr', it prints 'Load dict from trained model' and never goes on to the next step (I waited more than 1 hour).

Is the syntax correct? What am I missing?

Thanks
Yohan

build failed

Hi,
I cloned this project and got errors while running the command "make".

-------------- first error begin --------------

/opt/workspace/fastText/src/args.cc: In member function ‘void fasttext::Args::printTrainingHelp()’:
/opt/workspace/fastText/src/args.cc:259:7: error: expected primary-expression before ‘<<’ token
<< " -incr incremental training, default ["
^~
CMakeFiles/fasttext-static.dir/build.make:62: recipe for target 'CMakeFiles/fasttext-static.dir/src/args.cc.o' failed
make[2]: *** [CMakeFiles/fasttext-static.dir/src/args.cc.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fasttext-static.dir/all' failed
make[1]: *** [CMakeFiles/fasttext-static.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

-------------- first error end --------------

The above error is caused by line 258 in args.cc: the semicolon at the end of that line should be removed.

After removing the semicolon, further errors occur:
-------------- second error begin --------------
Scanning dependencies of target fasttext-static
[ 2%] Building CXX object CMakeFiles/fasttext-static.dir/src/args.cc.o
[ 5%] Building CXX object CMakeFiles/fasttext-static.dir/src/dictionary.cc.o
[ 8%] Building CXX object CMakeFiles/fasttext-static.dir/src/fasttext.cc.o
/opt/workspace/fastText/src/fasttext.cc: In member function ‘void fasttext::FastText::predict(std::istream&, int32_t, bool, fasttext::real, std::__cxx11::string)’:
/opt/workspace/fastText/src/fasttext.cc:440:19: error: declaration of ‘std::istream& in’ shadows a parameter
std::istream& in,
^~
/opt/workspace/fastText/src/fasttext.cc:440:19: error: ‘in’ declared as reference but not initialized
/opt/workspace/fastText/src/fasttext.cc:441:13: error: expected initializer before ‘k’
int32_t k,
^
/opt/workspace/fastText/src/fasttext.cc:468:31: error: qualified-id in declaration before ‘(’ token
void FastText::printLabelStats(
^
/opt/workspace/fastText/src/fasttext.cc:509:31: error: qualified-id in declaration before ‘(’ token
void FastText::printLabelStats(std::istream& in, int32_t k, real threshold)
^
/opt/workspace/fastText/src/fasttext.cc:537:33: error: qualified-id in declaration before ‘(’ token
void FastText::getSentenceVector(std::istream& in, fasttext::Vector& svec) {
^
/opt/workspace/fastText/src/fasttext.cc:570:28: error: qualified-id in declaration before ‘(’ token
void FastText::ngramVectors(std::string word) {
^
/opt/workspace/fastText/src/fasttext.cc:588:37: error: qualified-id in declaration before ‘(’ token
void FastText::precomputeWordVectors(Matrix& wordVectors) {
^
/opt/workspace/fastText/src/fasttext.cc:601:22: error: qualified-id in declaration before ‘(’ token
void FastText::findNN(
^
/opt/workspace/fastText/src/fasttext.cc:631:25: error: qualified-id in declaration before ‘(’ token
void FastText::analogies(int32_t k) {
^
/opt/workspace/fastText/src/fasttext.cc:663:27: error: qualified-id in declaration before ‘(’ token
void FastText::trainThread(int32_t threadId) {
^
/opt/workspace/fastText/src/fasttext.cc:702:27: error: qualified-id in declaration before ‘(’ token
void FastText::loadVectors(std::string filename) {
^
/opt/workspace/fastText/src/fasttext.cc:745:21: error: qualified-id in declaration before ‘(’ token
void FastText::train(const Args args) {
^
/opt/workspace/fastText/src/fasttext.cc:953:28: error: qualified-id in declaration before ‘(’ token
void FastText::startThreads() {
^
/opt/workspace/fastText/src/fasttext.cc:981:27: error: qualified-id in declaration before ‘(’ token
int FastText::getDimension() const {
^
/opt/workspace/fastText/src/fasttext.cc:985:23: error: qualified-id in declaration before ‘(’ token
bool FastText::isQuant() const {
^
/opt/workspace/fastText/src/fasttext.cc: At global scope:
/opt/workspace/fastText/src/fasttext.cc:989:1: error: expected ‘}’ at end of input
} // namespace fasttext
^
CMakeFiles/fasttext-static.dir/build.make:110: recipe for target 'CMakeFiles/fasttext-static.dir/src/fasttext.cc.o' failed
make[2]: *** [CMakeFiles/fasttext-static.dir/src/fasttext.cc.o] Error 1
CMakeFiles/Makefile2:67: recipe for target 'CMakeFiles/fasttext-static.dir/all' failed
make[1]: *** [CMakeFiles/fasttext-static.dir/all] Error 2
Makefile:127: recipe for target 'all' failed
make: *** [all] Error 2

-------------- second error end --------------

How do I build this project?

Memory error while loading wiki-news model for incremental learning

Hi Eric,
Thanks for the excellent enhancement. I am trying to use your repo for incremental learning, but I get a memory error while running the script. My machine has 32 GB of RAM, and I am otherwise able to load the pre-trained model for inference tasks.

Pre-trained model size: 6.8 GB
Command executed:
./fasttext skipgram -input /home/aaa/Downloads/datasets/nlu/sed_sof_corpus.txt -inputModel /home/aaa/Downloads/datasets/wiki-news-300d-1M-subword.bin -output sed_sof_trlearn -incr
