pommedeterresautee / fastrtext

R wrapper for fastText

Home Page: https://pommedeterresautee.github.io/fastrtext/

License: Other

R 19.01% C++ 80.78% Shell 0.08% C 0.13%
fasttext rstats machine-learning nlp classification word-embeddings text-classification neural-network embeddings

fastrtext's People

Contributors

pommedeterresautee, vrasneur

fastrtext's Issues

Question - how to load vec & bin file from external source?

Hi,

I would like to load a pre-trained non-binary model, but I am not sure how to do it with load_model.

Does it need to be in .bin format for load_model to work, or is there an argument that could be passed, such as binary = FALSE?

I'm trying to use this:
https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki-news-300d-1M.vec.zip

Not sure if this is a duplicate of #17.

I also tried to load a .bin file (from https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.zip) and got an error:

modelwiki <- load_model('C:\\Users\\xxx\\R\\fastrtext_test\\wiki.en\\wiki.en')
# load_model prints: add .bin extension to the path
Error in (function (env, objName) :
argument to 'findVar' is not an environment
get_analogies(modelwiki, "PhD", "MS", "school")
Error in model$get_nn_by_vector(vec, c(w1, w2, w3), k) : Encountered NaN.

get_parameters(modelwiki)
$learning_rate
[1] 0.05
$learning_rate_update
[1] 100
$dim
[1] 300
$context_window_size
[1] 5
$epoch
[1] 5
$min_count
[1] 5
$min_count_label
[1] 0
$n_sampled_negatives
[1] 5
$word_ngram
[1] 1
$bucket
[1] 2000000
$min_ngram
[1] 3
$max_ngram
[1] 6
$sampling_threshold
[1] 1e-04
$label_prefix
[1] "label"
$pretrained_vectors_filename
[1] ""
$nlabels
[1] 0
$n_words
[1] 2519370
$loss_name
[1] "ns"
$model_name
[1] "sg"
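If load_model indeed only accepts the binary .bin format, a .vec file is just a plain-text table (a header line "n_words dim", then one word per line followed by its vector), so one workaround — a sketch independent of fastrtext, with a hypothetical local path — is to parse it directly:

```r
# Sketch: read a fastText .vec file into a matrix with words as rownames.
# Assumes the official format: a "n_words dim" header, space-separated values,
# no trailing spaces (some exports have them; production code should trim).
read_vec <- function(path) {
  lines  <- readLines(path, encoding = "UTF-8")
  header <- as.integer(strsplit(lines[1], " ")[[1]])   # c(n_words, dim)
  parts  <- strsplit(trimws(lines[-1]), " ", fixed = TRUE)
  words  <- vapply(parts, `[[`, character(1), 1)
  mat    <- t(vapply(parts, function(p) as.numeric(p[-1]), numeric(header[2])))
  rownames(mat) <- words
  mat
}

vectors <- read_vec("wiki-news-300d-1M.vec")  # hypothetical local path
```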

Build package issue

Issue posted on http://lists.r-forge.r-project.org/pipermail/rcpp-devel/
http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2017-August/009711.html

Some posts:
https://www.google.fr/search?q=%22undefined+symbol%22+site:http://lists.r-forge.r-project.org/pipermail/rcpp-devel&ei=l36AWej_LYSwaa6Qn6gB&start=10&sa=N&biw=1204&bih=890

https://stackoverflow.com/questions/13995266/using-3rd-party-header-files-with-rcpp

==> R CMD INSTALL --preclean --no-multiarch --with-keep.source FastRText

* installing to library ‘/home/geantvert/R/x86_64-pc-linux-gnu-library/3.4’
* installing *source* package ‘FastRText’ ...
** libs
g++ -std=gnu++11 -I/usr/share/R/include -DNDEBUG -I../inst/include/ -I"/home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include"    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-X2xP8j/r-base-3.4.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c RcppExports.cpp -o RcppExports.o
g++ -std=gnu++11 -I/usr/share/R/include -DNDEBUG -I../inst/include/ -I"/home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/Rcpp/include"    -fpic  -g -O2 -fdebug-prefix-map=/build/r-base-X2xP8j/r-base-3.4.1=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -g -c fastRtext.cpp -o fastRtext.o
g++ -std=gnu++11 -shared -L/usr/lib/R/lib -Wl,-Bsymbolic-functions -Wl,-z,relro -o FastRText.so RcppExports.o fastRtext.o -llapack -lblas -lgfortran -lm -lquadmath -L/usr/lib/R/lib -lR
installing to /home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/FastRText/libs
** R
** inst
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
Error: package or namespace load failed for ‘FastRText’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/FastRText/libs/FastRText.so':
  /home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/FastRText/libs/FastRText.so: undefined symbol: _ZN8fasttext8FastTextC1Ev
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/home/geantvert/R/x86_64-pc-linux-gnu-library/3.4/FastRText’

Exited with status 1.

Space-separated bigram output from get_nn

In my execute() call, I'm able to easily tell fastrtext to use bigrams, per the instructions/commands for fastText:

execute(
  commands = c(
    "skipgram",
    "-input",
    tmp_file_txt,
    "-output",
    tmp_file_model,
    "-verbose",
    1,
    "-wordNgrams",
    2
  )
)

However, I'm not sure how to get space-separated bigrams out of the get_nn function, and it isn't covered in the documentation:

nndf <- as.data.frame(get_nn(model, term, 250))

Thus I end up with messy bigrams in the DF. Any way to force get_nn to return space-separated bigrams?
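fastText stores word n-grams only as hashed features, so get_nn returns single tokens. One common workaround — a sketch that assumes you joined bigrams with an underscore in your own preprocessing, which is an assumption about your pipeline — is to undo that join on the returned names:

```r
# Sketch: recover space-separated bigrams, assuming they were joined
# with "_" before training (an assumption about your preprocessing).
nn <- get_nn(model, term, 250)
nndf <- data.frame(term       = gsub("_", " ", names(nn), fixed = TRUE),
                   similarity = as.numeric(nn))
```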

Save fastrtext trained model

Hi,

I would like to save a trained model (more specifically a supervised model for text classification) on disk for later re-use (so it should not be a temporary file).

I am using fastrtext on Microsoft Azure Machine Learning Studio for a project and it would be ideal if trained models could be saved as ".rds" files. Is this possible, and if not what would you suggest as a workaround?

Thanks!
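A fastrtext model is essentially the .bin file that execute() writes to -output, so one approach — a sketch with a hypothetical path — is to point -output at a permanent location instead of a tempfile and reload it with load_model later. An .rds wrapper is unlikely to work, because the model lives in external C++ memory rather than in a serializable R object:

```r
library(fastrtext)

model_path <- "C:/models/azure_classifier"  # hypothetical permanent path
execute(commands = c("supervised", "-input", train_file,
                     "-output", model_path, "-dim", 20))

# ... later, in a fresh session:
model <- load_model(model_path)  # reads azure_classifier.bin
```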

This application has requested the Runtime to terminate it

I am running the supervised learning algorithm on 350k mails with 55 categories:

execute(
  commands = c("supervised", "-input", train_tmp_file_txt, "-output", tmp_file_model, "-dim", 100, "-lr", 1, "-epoch", 20, "-wordNgrams", 2, "-verbose", 1)
)

It's difficult to make a reproducible example, since the data is confidential. I also made a test with dummy data, where the code did not fail with over 2m "text elements". But I am very willing to do my best to make it more reproducible, if possible.

Issue:
When running the execute part of the code Rstudio crashes with:

This application has requested the Runtime to terminate it in an unusual way.
Please contact the application's support team for more information.
terminate called recursively

Is this a known error?

What I tried:

Maybe there is a way to test the code easily outside R.
> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2    packrat_0.5.0 

Hardware info: (screenshot in the original issue)

Loading time for supervised learning with pretrainedVectors param

Hi, I found that in supervised learning, if I include the -pretrainedVectors parameter, loading for this command is painfully slow.

Without it, it is very quick. My pretrained vectors file is 1.5 GB. I am not sure if this is the case, but does -pretrainedVectors read the file into R memory? Checking my drive, it only reads at around 400 KB/s.

Thanks

Predict error

Hi,

I am using fastrtext to label some twitter messages. I wish to use supervised learning to get a model and use it to predict some other twitter messages. Luckily, some people helped me classify the twitter messages into two labels.

Since each twitter message is different, one cannot include all the words in the trained model.

After I try to get the prediction:
predictions <- predict(model, sentences = test_to_write)

I get this message which I believe is normal because I didn't have all the words in my model:

Error: Some sentences have no predictions. It may be caused by the fact that all their words are have not been seen during the training.

So after getting this error, I would like to at least see which messages failed to be classified. Is there a way to do this? The "predictions" object does not exist because the error was raised.

Lastly, for data like twitter, is there a way to clean the text with regular expressions? For example, in twitter messages "yeah" can be written as "yh", "yea", etc., and these are obviously treated as three totally different words.

Many thanks.
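There is no built-in normalizer in fastrtext as far as I know; for the slang question, a small hand-rolled normalization step in base R is a common approach. A sketch — the substitution table is illustrative only, not an exhaustive mapping:

```r
# Sketch: map common twitter spellings onto a canonical form before
# training/prediction. The replacement table is illustrative only.
normalize_tweet <- function(x) {
  x <- tolower(x)
  x <- gsub("\\b(yh|yea|yeh)\\b", "yeah", x)   # collapse slang variants
  x <- gsub("[^a-z0-9#@' ]", " ", x)           # drop punctuation
  gsub("\\s+", " ", trimws(x))                 # squeeze whitespace
}

normalize_tweet("Yh, that's great!!")   # "yeah that's great"
```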

How to quantize

How do I quantize an existing model?

It's alluded to on the package intro help page, but I couldn't find any reference to it in the help. According to the fastText page, they use a quantize command which runs against an existing model. Is this supported?

Many thanks!
Alan
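fastrtext exposes the fastText CLI through execute(), and fastText's quantize command runs against an existing model. A sketch — paths and the cutoff value are illustrative, and I have not verified every flag against the fastText version bundled with the package:

```r
library(fastrtext)

# Sketch: quantize an existing supervised model. Paths are hypothetical;
# model_path.bin must already exist from a previous training run.
execute(commands = c("quantize",
                     "-input", train_file,    # original training data
                     "-output", model_path,
                     "-qnorm", "-retrain", "-cutoff", 100000))

model_q <- load_model(paste0(model_path, ".ftz"))  # quantized model file
```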

Get similar documents using fastrtext

Hello~

I am wondering if there's a built-in function in the fastrtext package, similar to get_nn, that can find similar documents, not just similar words. Thanks!
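I don't believe there is a document-level get_nn, but get_sentence_representation returns one vector per document, so nearest documents can be ranked with plain cosine similarity. A sketch — docs and query are hypothetical character vectors, and it assumes get_sentence_representation returns documents as rows:

```r
library(fastrtext)

# Sketch: rank documents by cosine similarity to a query document.
doc_vecs  <- get_sentence_representation(model, docs)        # one row per doc
query_vec <- get_sentence_representation(model, query)[1, ]
cos_sim <- apply(doc_vecs, 1, function(v)
  sum(v * query_vec) / (sqrt(sum(v^2)) * sqrt(sum(query_vec^2))))
head(docs[order(cos_sim, decreasing = TRUE)])                # most similar docs
```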

Load a fastText pre-trained model (Chinese), how to set encoding?

Hi there,

I am using load_model() function to load a Chinese fastText pre-trained model, and here is the command I used:

model <- load_model("D:/CMD/cc.zh.300.bin/cc.zh.300.bin")
my <- get_word_vectors(model)

However, the loaded word vectors are not correctly encoded, and the problem remained after I set Save with Encoding to UTF-8. I am wondering whether there is a way to set the encoding when loading the pre-trained model? Thanks for your time in advance!

Jovian
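load_model has no encoding argument as far as I can tell. Since fastText files are UTF-8, a workaround that sometimes helps on Windows — a sketch, not a confirmed fix, and it assumes get_word_vectors labels its rows with the words — is to mark the returned strings as UTF-8 explicitly:

```r
library(fastrtext)

model <- load_model("D:/CMD/cc.zh.300.bin/cc.zh.300.bin")
vecs  <- get_word_vectors(model)

# Sketch: tell R the row names are UTF-8 rather than the native locale.
rn <- rownames(vecs)
Encoding(rn) <- "UTF-8"
rownames(vecs) <- rn
```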

Analogies?

How could one use FastRText to execute the 'analogies' command of fasttext?
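The package ships a get_analogies() helper for exactly this (it appears elsewhere in these issues). A minimal sketch, with a hypothetical English model path:

```r
library(fastrtext)

model <- load_model("wiki.en.bin")   # hypothetical path to a .bin model
# "berlin" is to "germany" as ? is to "france"
get_analogies(model, "berlin", "germany", "france", k = 5)
```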

Debugging sentences with no predictions

Hi there,

Thanks for a great package - really enjoying trying this out.

Had a question: I'm getting the "sentences have no predictions" error. To get around this I've set unlock_empty_predictions = TRUE so that I can see the output to debug, as per the help file.

How do I go about debugging this? Is there an easy way to see which input texts are missing predictions? The output is just a list of classes and probabilities, so I wasn't sure of the best way to investigate further.

Any tips? I looked for short tweets (I'm using tweets, so I wondered if it was posts with just a single hashtag).

Many thanks

Alan
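One way to find the offending inputs — a sketch, assuming get_dictionary returns the model vocabulary and your texts are whitespace-tokenized the same way as the training data — is to flag sentences whose tokens are all out of vocabulary:

```r
library(fastrtext)

# Sketch: flag sentences where no token appears in the model's dictionary.
vocab   <- get_dictionary(model)
tokens  <- strsplit(tolower(test_sentences), "\\s+")
all_oov <- vapply(tokens, function(w) !any(w %in% vocab), logical(1))
test_sentences[all_oov]   # the inputs likely to get no prediction
```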

Why increasing wordNgrams in execute() makes accuracy decrease

Hi,
I noticed that the accuracy of my prediction (using the predict function) decreased from 0.9 to 0.8 when I increased the parameter wordNgrams from 1 to 2 during training. As I kept increasing wordNgrams, the accuracy kept decreasing; it even hit 0.002 when wordNgrams reached 5. I was really confused, since I thought increasing wordNgrams would improve the performance of training. Can anyone tell me what's going on?
Thanks!

Fastrtext fails to install direct from Git

Hi there

I was trying to install on a fresh OSX installation. As the package isn't on CRAN, I tried to install from Github using devtools, but this fails with compiling errors.

Any idea of a workaround?

Alan

R
library(devtools)
install_github("pommedeterresautee/fastrtext")

With session

R version 3.5.2 (2018-12-20)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] usethis_1.5.0  devtools_2.0.2

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.5        magrittr_1.5      pkgload_1.0.2     R6_2.4.1
 [5] rlang_0.4.9       fansi_0.4.0       tools_3.5.2       pkgbuild_1.0.3
 [9] sessioninfo_1.1.1 cli_2.0.1         withr_2.1.2       remotes_2.0.4
[13] assertthat_0.2.1  digest_0.6.24     rprojroot_1.3-2   crayon_1.3.4
[17] processx_3.4.2    callr_3.2.0       fs_1.3.1          ps_1.3.2
[21] curl_3.3          testthat_2.1.1    memoise_1.1.0     glue_1.4.2
[25] compiler_3.5.2    desc_1.2.0        backports_1.1.4   prettyunits_1.0.2

Returns

Downloading GitHub repo pommedeterresautee/fastrtext@master
✔  checking for file ‘/private/var/folders/lb/1sl3d1_n78jgtdt0x4php_j80000gn/T/RtmpIID4qT/remotes156871b7181d/pommedeterresautee-fastrtext-b63c5de/DESCRIPTION’
─  preparing ‘fastrtext’:
✔  checking DESCRIPTION meta-information ...
─  cleaning src
─  running ‘cleanup’
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  looking to see if a ‘data/datalist’ file should be added
─  building ‘fastrtext_0.3.4.tar.gz’

* installing *source* package ‘fastrtext’ ...
** libs
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c add_prefix.cpp -o add_prefix.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c r_compliance.cc -o r_compliance.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/autotune.cc -o fasttext/autotune.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/args.cc -o fasttext/args.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/matrix.cc -o fasttext/matrix.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/dictionary.cc -o fasttext/dictionary.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/loss.cc -o fasttext/loss.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/productquantizer.cc -o fasttext/productquantizer.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/densematrix.cc -o fasttext/densematrix.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/quantmatrix.cc -o fasttext/quantmatrix.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/vector.cc -o fasttext/vector.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/model.cc -o fasttext/model.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/utils.cc -o fasttext/utils.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/meter.cc -o fasttext/meter.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/fasttext.cc -o fasttext/fasttext.o
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fasttext/main.cc -o fasttext/main.o
fasttext/main.cc:324:14: warning: variable 'k' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized]
  } else if (args.size() == 4) {
             ^~~~~~~~~~~~~~~~
fasttext/main.cc:330:7: note: uninitialized use occurs here
  if (k <= 0) {
      ^
fasttext/main.cc:324:10: note: remove the 'if' if its condition is always true
  } else if (args.size() == 4) {
         ^~~~~~~~~~~~~~~~~~~~~~
fasttext/main.cc:321:12: note: initialize the variable 'k' to silence this warning
  int32_t k;
           ^
            = 0
1 warning generated.
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c fastrtext.cpp -o fastrtext.o
fastrtext.cpp:203:13: warning: unused variable 'i' [-Wunused-variable]
    int32_t i = 0;
            ^
1 warning generated.
clang++ -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -include r_compliance.h -I./fasttext -I"/Library/Frameworks/R.framework/Versions/3.5/Resources/library/Rcpp/include" -I/usr/local/include   -fPIC  -Wall -g -O2 -c RcppExports.cpp -o RcppExports.o
clang++ -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o fastrtext.so add_prefix.o r_compliance.o ./fasttext/autotune.o ./fasttext/args.o ./fasttext/matrix.o ./fasttext/dictionary.o ./fasttext/loss.o ./fasttext/productquantizer.o ./fasttext/densematrix.o ./fasttext/quantmatrix.o ./fasttext/vector.o ./fasttext/model.o ./fasttext/utils.o ./fasttext/meter.o ./fasttext/fasttext.o ./fasttext/main.o fastrtext.o RcppExports.o -pthread -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
clang: warning: argument unused during compilation: '-pthread'
ld: warning: text-based stub file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation.tbd and library file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation are out of sync. Falling back to library file for linking.
installing to /Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext/libs
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded
Error: package or namespace load failed for ‘fastrtext’ in dyn.load(file, DLLpath = DLLpath, ...):
 unable to load shared object '/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext/libs/fastrtext.so':
  dlopen(/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext/libs/fastrtext.so, 6): Symbol not found: __ZN8fasttext8Autotune12kCutoffLimitE
  Referenced from: /Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext/libs/fastrtext.so
  Expected in: flat namespace
 in /Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext/libs/fastrtext.so
Error: loading failed
Execution halted
ERROR: loading failed
* removing ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext’
* restoring previous ‘/Library/Frameworks/R.framework/Versions/3.5/Resources/library/fastrtext’
Error in i.p(...) :
  (converted from warning) installation of package ‘/var/folders/lb/1sl3d1_n78jgtdt0x4php_j80000gn/T//RtmpIID4qT/file15682dc6f3a7/fastrtext_0.3.4.tar.gz’ had non-zero exit status

Sorting multiclass prediction output by label name

Working on a multi-class classification. Looking for a way to flatten the list with output predictions. Currently I have outputs in this format:

$document1_text
__label__B __label__C __label__A
0.9129 0.0441 0.0166
$document2_text
__label__A __label__C __label__B
0.0741 0.0736 0.0730

Given a command like t(as.data.frame(predictions)) I am able to get to the following flat format:

id____________ __label__B __label__C __label__A
document1_text 0.9129 0.0441 0.0166
document2_text 0.0741 0.0736 0.0730

The issue is that, due to differences in the order of the labels, observation document2_text gets the wrong values in each of the columns. I hope that in authoring this package you might have already come across this situation, even though it is more about general list manipulation in R. Unfortunately your documented example ("you can get a flat list of results when you are retrieving only one label per observation": print(head(predict(model, sentences = test_to_write, simplify = TRUE)))) does not help given my current design.

I think this would be solved easily if we could order prediction outputs by class name after X most likely classes are provided, as in __label__A __label__B __label__C. Can you recommend a way to do this?
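Since predict() returns a named numeric vector per document, one way to get label-aligned columns — a sketch; k = 3 and the object names are taken from the example above — is to sort each element by label name before binding:

```r
# Sketch: align multiclass prediction columns by label name.
predictions <- predict(model, sentences = test_to_write, k = 3)

aligned <- t(vapply(predictions,
                    function(p) p[order(names(p))],  # same label order per row
                    numeric(3)))
colnames(aligned) <- sort(names(predictions[[1]]))   # __label__A, __label__B, ...
```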

Convenience functions for training

When training models, the whole train_tmp_file_txt <- tempfile(); writeLines(text = train_to_write, con = train_tmp_file_txt) dance can get quite tiresome and error-prone, and so I have written a few convenience functions for training supervised and unsupervised models, that also expose all possible arguments in tab-complete and documentation, when using these in RStudio or similar environment.

Is there any interest in including such functions in fastrtext? I can prepare a PR with the ones I use, or we can discuss the API and I can prepare new versions.

Chinese text classification issue

I got an issue with a Chinese text classification prediction model, as follows:

test_sentences$text2[9]
[1] "蛋白粉 开封 后 两个 月 在 次 食用 味道 发苦"
predict(model,test_sentences$text2[9])
[[1]]
__label__262
0.5312194

predict(model, "蛋白粉 开封 后 两个 月 在 次 食用 味道 发苦")
[[1]]
__label__314
0.9935217

Basically, after training the model with fastrtext, if you try to predict a tokenized Chinese text passed as an object (e.g. test_sentences$text2[9] in my case), it gives a wrong prediction with low probability. If you simply paste the tokenized Chinese text into the predict call, as I did above, it gives the correct one with high probability. I am really confused by this situation. Can anyone help with it? Much appreciated!
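A plausible cause — an assumption, not a confirmed diagnosis — is that the string stored in the data frame carries a non-UTF-8 (native locale) encoding on Windows, while a literal typed at the console is UTF-8. Forcing the input to UTF-8 before predicting may be worth trying:

```r
# Sketch: make sure the text handed to predict() is UTF-8.
txt <- enc2utf8(test_sentences$text2[9])
predict(model, txt)
```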

Add autotune option

Hi there,

I saw on the Fasttext page here they've added an autotune feature, which automatically optimizes the various hyperparameters.

It seems it can be activated with the -autotune-validation option, which isn't currently supported. I wondered whether this could be added with the updates for CRAN?

https://fasttext.cc/docs/en/autotune.html

best

Alan

computation with sentences instead of single words

For example, if in get_word_distance(model, w1, w2) w1 and w2 are n-word sentences instead of single words, is the result of this command the real distance between the sentences, or does it work only with single words?
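get_word_distance most likely looks up w1 and w2 as single dictionary tokens, so a multi-word string is unlikely to behave as a sentence. For sentence-level distance, a sketch using get_sentence_representation and a manual cosine (assuming rows correspond to sentences):

```r
library(fastrtext)

# Sketch: cosine distance between two sentences.
sv <- get_sentence_representation(model, c("she was here", "and so was he"))
v1 <- sv[1, ]
v2 <- sv[2, ]
1 - sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))
```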

removed from cran

Package ‘fastrtext’ was removed from the CRAN repository.

Formerly available versions can be obtained from the archive.

Archived on 2019-09-04 as check problems were not corrected in time. 

Are you going to get it back on CRAN?

Preprocessing Text for multiclass text classification

I am trying to perform supervised multiclass email text classification with fastrtext loading the Wikipedia pretrained vectors. I am experiencing low performance in overall accuracy (< 0.2) despite

  1. I have reduced the number of classes from 150 to 32 (setting a threshold and performing sampling on the overrepresented classes);
  2. I have tuned hyperparameters such as loss function, learning rate, minimum and maximum characters n-grams, epochs etc.

I have also performed some text preprocessing, even though I have not been able to obtain a perfectly clean text as desired. Do you think the low accuracy is related to text preprocessing (low quality of text), or am I missing something obvious?

If this is the case, could you point me to some R libraries that could help me achieve my goal?

Any help appreciated!

Incorrect values for sentences from get_word_distance and get_nn

Hello, first of all, thank you for this package.
I’m interested in cosine similarities between sentences or between word and sentences. The following code I believe produces correct results:

pv <- get_sentence_representation(mod, c("she was", "and to") )
pv <- t(pv)

# using lsa package
lsa::cosine(pv)

# manual
v1 <- as.numeric(pv[,1])
v2 <- as.numeric(pv[,2])
sum(v1*v2) / ( sqrt(sum(v1*v1)) * sqrt(sum(v2*v2)) )

The manual way and lsa produce the same results. However, I obtain different results if I try to use get_word_distance (which gives the same similarity score as get_nn):

1 - get_word_distance(mod, "she was", "and to")

Is it correct that get_word_distance does not work with sentences? If so, it would be very helpful to get an error message instead of some value.

Thank you,
Luca
