mayabot / fasttext4j

An implementation of Facebook's FastText in Java

fasttext kotlin word2vec wordembeddings java

fasttext4j's Issues

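Test set text

My test set contains very little text. Could that cause the probabilities across the categories to flatten out? Or is fasttext4j not well suited to Chinese, so that the probabilities end up averaged?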

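Error loading a bin classifier trained with the official Python version

Is it possible to load a bin classifier trained with Facebook's official Python interface?
Loading it directly throws an error: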
Exception in thread "main" java.lang.RuntimeException: Model file has wrong file format!
at com.mayabot.mynlp.fasttext.LoadFastTextFromClangModel.loadCModel(LoadFastTextFromClangModel.kt:32)
at com.mayabot.mynlp.fasttext.LoadFastTextFromClangModel.loadCModel(LoadFastTextFromClangModel.kt:124)
at com.mayabot.mynlp.fasttext.FastText$Companion.loadFasttextBinModel(FastText.kt:366)
at com.mayabot.mynlp.fasttext.FastText.loadFasttextBinModel(FastText.kt)

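Model testing problem

In the output printed by model.test(),
the sample count shown as "N" does not match the number of samples in my test file; it is considerably lower.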
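
One possible explanation, assuming fasttext4j mirrors the official C++ implementation: there, test only counts a line toward N when it yields at least one label and at least one input token, so other lines are skipped silently. The sketch below approximates that check by counting lines that carry a `__label__`-prefixed token plus at least one ordinary token. `__label__` is official fastText's default label prefix; the class and method names are illustrative only, and the real check also depends on which labels and words exist in the trained dictionary.

```java
import java.util.Arrays;
import java.util.List;

final class TestLineCounter {
    // Default label prefix in official fastText; assumed to apply to fasttext4j too.
    static final String LABEL_PREFIX = "__label__";

    /** Counts lines with at least one label token and at least one ordinary token. */
    static long countUsableLines(List<String> lines) {
        long usable = 0;
        for (String line : lines) {
            boolean hasLabel = false;
            boolean hasWord = false;
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // an empty line splits into one empty token
                }
                if (token.startsWith(LABEL_PREFIX)) {
                    hasLabel = true;
                } else {
                    hasWord = true;
                }
            }
            if (hasLabel && hasWord) {
                usable++;
            }
        }
        return usable;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "__label__sports the match ended in a draw",
                "no label on this line",
                "__label__news",
                "");
        // Only the first line has both a label and text.
        System.out.println(countUsableLines(lines) + " of " + lines.size() + " lines usable");
    }
}
```

Running this count over the real test file and comparing it with the reported N would show whether skipped lines account for the gap.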

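MAX_VOCAB_SIZE is too large and causes a Java out-of-memory error

Hello. Is the default MAX_VOCAB_SIZE = 30000000 in the code set too high? When I load several fasttext models I run out of memory (by my estimate the size implied by this parameter far exceeds the physical machine's memory, while the number of words in the actual models is nowhere near that). Is there a workaround, or could the parameter be made configurable?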
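
A rough way to size the problem, under an assumption the issue does not confirm: that each loaded model allocates one int hash table with MAX_VOCAB_SIZE slots, as the official C++ code does for its word-to-id table. The class and method names below are illustrative, and array headers and any other per-model structures are ignored.

```java
final class VocabTableEstimate {
    static final int MAX_VOCAB_SIZE = 30_000_000; // value quoted in the issue

    /** Bytes needed for one int table of MAX_VOCAB_SIZE slots per model. */
    static long tableBytes(int models) {
        return (long) models * MAX_VOCAB_SIZE * Integer.BYTES;
    }

    public static void main(String[] args) {
        for (int models : new int[] {1, 5, 10}) {
            System.out.printf("%d model(s): %d MB%n", models, tableBytes(models) / 1_000_000);
        }
    }
}
```

Under that assumption the table alone costs about 120 MB per model, so the total grows linearly with the number of models and has to fit inside the JVM heap limit (`-Xmx`) rather than physical memory.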

Test set format

https://upload-images.jianshu.io/upload_images/12081581-70d412eebb570280?imageMogr2/auto-orient/strip%7CimageView2/2/w/323

Could you check whether this format is acceptable? If not, does the test set have to be in the "label, txt" format?
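
For reference, official fastText expects supervised data with one example per line: one or more `__label__`-prefixed tokens followed by the space-separated text. Assuming fasttext4j reads the same layout, a "label,text" file could be converted with something like the sketch below. The class name, the method name, and the choice to split at the first comma are illustrative assumptions.

```java
final class LabelFormat {
    /**
     * Converts "label,text" into "__label__label text".
     * Returns null when the line has no comma or an empty label.
     */
    static String toFastTextLine(String csvLine) {
        int comma = csvLine.indexOf(',');
        if (comma < 0) {
            return null;
        }
        // Labels must be a single token, so inner whitespace becomes underscores.
        String label = csvLine.substring(0, comma).trim().replaceAll("\\s+", "_");
        if (label.isEmpty()) {
            return null;
        }
        String text = csvLine.substring(comma + 1).trim();
        return "__label__" + label + " " + text;
    }

    public static void main(String[] args) {
        System.out.println(toFastTextLine("sports,the match ended in a draw"));
    }
}
```

For Chinese input the text part still needs to be word-segmented and space-separated before training or testing.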

FAILURE: Build failed with an exception.

When I run gradle compile 'com.mayabot:fastText4j:1.1.5', I get:
FAILURE: Build failed with an exception.

Where:
Build file '...fastText4j/build.gradle' line: 105
What went wrong:
A problem occurred evaluating root project 'fastText4j'.

Could not get unknown property 'oss_user' for repository container of type org.gradle.api.internal.artifacts.dsl.DefaultRepositoryHandler.

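FastText training is slow

Hello:
I am training with fasttext4j. Training fasttext skipgram on a 60M corpus is extremely slow; the progress display estimates that more than 200 hours are needed to finish.
My parameters are: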

lr:0.1
dim:128
wordNgrams:3
Bucket: 100
neg:5
thread:12

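Both on a well-provisioned server and on my local machine, training is slow compared with the Python version of fasttext. Is there a problem with my configuration?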

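Serializing FastText

Hello. Because wordVector is declared with lateinit, serializing a FastText object with fastjson's Json.toJsonString fails. Why is this modifier used? Or is there another serialization method I could use? I would like to keep the object in a cache. Thanks.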

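Pretrained word vectors have no effect

inputArgs.setPretrainedVectors()
Setting pretrained word vectors with the method above writes the pretrainedVectors property,
but the check that decides whether pretrained vectors should be used reads the preTrainedVectors property.
The two names differ only in the case of one letter.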
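
The failure mode described here can be reproduced in isolation. The class below is a hypothetical stand-in, not fasttext4j's actual source: it only shows how a setter that writes one field and a check that reads a near-identical field silently disables the feature.

```java
final class ArgsSketch {
    private String pretrainedVectors = ""; // written by the setter
    private String preTrainedVectors = ""; // read by the buggy check, never written

    void setPretrainedVectors(String path) {
        this.pretrainedVectors = path;
    }

    /** Buggy check: reads the field that nothing ever sets. */
    boolean usesPretrainedVectorsBuggy() {
        return !preTrainedVectors.isEmpty();
    }

    /** Fixed check: reads the same field the setter writes. */
    boolean usesPretrainedVectorsFixed() {
        return !pretrainedVectors.isEmpty();
    }

    public static void main(String[] args) {
        ArgsSketch sketch = new ArgsSketch();
        sketch.setPretrainedVectors("vectors.vec"); // hypothetical file name
        System.out.println("buggy check: " + sketch.usesPretrainedVectorsBuggy());
        System.out.println("fixed check: " + sketch.usesPretrainedVectorsFixed());
    }
}
```

Keeping a single field and removing the duplicate is the simplest fix for this kind of mismatch.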

IllegalArgumentException "Unknown EntryType enum second" when loading saved model.

I'm using FastText to train a model and training works fine, but I get the following exception when loading the saved files:

Stacktrace:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown EntryType enum second :136
at com.mayabot.mynlp.fasttext.EntryType$Companion.fromValue(Dictionary.kt:651)
at com.mayabot.mynlp.fasttext.Dictionary.load(Dictionary.kt:596)
at com.mayabot.mynlp.fasttext.FastText$Companion.loadModel(FastText.kt:410)
at com.mayabot.mynlp.fasttext.FastText.loadModel(FastText.kt)

Version: "com.mayabot" % "fastText4j" % "1.2.3".

Please help, and let me know if you need more information.

Thx!

OOM on first use

JDK8_121
Read file build dictionary ...

Read 0M words
Number of words: 251
Number of labels: 0
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space

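The input is only a few dozen lines of Chinese text, already segmented and space-separated. A test run that trains word vectors hits OOM immediately.
Other Java w2v implementations handle the same data fine. Some lines contain rather few words.

Supervised training, however, reports no error: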
Read file build dictionary ...

Read 0M words
Number of words: 958
Number of labels: 0

Progress: 100.95% words/sec/thread: Infinity lr: -0.00095 loss: 0.00000 ETA: 0h 0m 0s
Progress: 100.00% words/sec/thread: Infinity lr: 0.00000 loss: 0.00000 ETA: 0h 0m 0s
Train use time 123 ms
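
The odd values in this log (progress above 100%, a negative lr, and Infinity words/sec) are consistent with the linear learning-rate decay used by official fastText, lr = baseLr * (1 - progress), applied to a corpus so small that the token count overshoots the target and the elapsed time rounds to zero. The sketch below reproduces the arithmetic; a base lr of 0.1 is an assumption (it is the official supervised default, and the issue does not state the value), and the clamped variant is only a suggestion, not the library's behavior.

```java
final class ProgressSketch {
    /** Linear decay as in official fastText: lr shrinks to 0 as progress reaches 1. */
    static double decayedLr(double baseLr, double progress) {
        return baseLr * (1.0 - progress);
    }

    /** Same decay, but progress is capped at 1 so lr never goes negative. */
    static double clampedLr(double baseLr, double progress) {
        return baseLr * (1.0 - Math.min(1.0, progress));
    }

    /** Throughput; a zero elapsed time yields Infinity in floating point. */
    static double wordsPerSecond(long tokens, double elapsedSeconds) {
        return tokens / elapsedSeconds;
    }

    public static void main(String[] args) {
        double progress = 1.0095; // the 100.95% shown in the log
        System.out.printf("lr: %.5f%n", decayedLr(0.1, progress));
        System.out.printf("clamped lr: %.5f%n", clampedLr(0.1, progress));
        System.out.println("words/sec: " + wordsPerSecond(958, 0.0));
    }
}
```

With progress at 1.0095 the unclamped formula gives -0.00095, matching the log line, so the output looks like a cosmetic side effect of a tiny corpus rather than a training failure.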

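Setting the wordNgrams parameter in TrainArgs

I want to train a text classifier with fasttext. TrainArgs does have a wordNgram parameter, but after adding the fastText4j jar through Maven I cannot find a corresponding set method for wordNgram among the TrainArgs settings. I'm not familiar with Kotlin, so any pointers would be appreciated. Thanks!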

cutoff and retrain args missing while quantizing

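Hello,

While using the library I found that only the dsub and qnorm parameters take effect during quantize, while cutoff and retrain do not appear in the code. Is this an unfinished feature? If so, are there plans to add them soon?

Also, the version number in the installation section of the front-page documentation is still 1.2.2, a version without parameters such as minn and maxn, while the latest release is already 1.2.3. I only found this out by reading the issues. Please update the documentation.

Best regards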

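How to set other training parameters

Apart from the input file and the model type, how do I set the other parameters of FastText.train(), such as the learning rate, epoch,
n-gram window, and so on? Any guidance is appreciated.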

similar code?

Hello, what is the difference between your code and the code in https://github.com/linkfluence/fastText4j?

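How to evaluate on a test set?

Besides predict, models in the official fasttext also have a test method that directly reports precision and recall on a test set. This Java implementation does not seem to have that interface?

Could you provide an all-Java version of the code? I'm not familiar with Kotlin and would need time to learn it; with Java I could dive straight in. Programmers all like it that way... :)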

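Same model file (ftz trained with the official C++ code), inconsistent predictions

I trained a model (ftz format) with the latest official fasttext code and loaded it in fasttext4j to predict. Out of 20,000 samples, the predictions for 3,000 differ from those of the official code.

Only the classification labels were compared.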

reduce_model from the Python version cannot be found here

In Python the code is fasttext.util.reduce_model(fast_model, 128).
I cannot find a corresponding method here.

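How to configure offline use of the jar

If I use the fasttext4j jar offline, and the jar is already built, what else needs to be configured? Our company intranet cannot download anything from Maven.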

Error loading model

Loading the model produces the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown EntryType enum second :136
at com.mayabot.nlp.fasttext.dictionary.EntryType$Companion.fromValue(DictUtils.kt:57)
at com.mayabot.nlp.fasttext.dictionary.Dictionary$Companion.loadModel(Dictionary.kt:384)
at com.mayabot.nlp.fasttext.FastText$Companion.loadModel(FastText.kt:711)
at cn.com.duiba.spark.dmp.api.FasttextTest.loadModel(FasttextTest.java:61)
at cn.com.duiba.spark.dmp.api.FasttextTest.main(FasttextTest.java:43)

Any help would be appreciated.

The problem in JAVA model

Hi, I tested fastText4j and trained a model with:
fastText = FastText.train("ag.train", ModelName.sup);
fastText.saveModel(java_model);

then loaded the model to predict a sentence:
fastText = FastText.loadModel(java_model);
fastText.predict(Arrays.asList(testStr.split(" ")), 5);

and got this error:
java.lang.ArithmeticException: / by zero
at com.mayabot.mynlp.fasttext.AreaByteBufferMatrix.get(Matrix.kt:222)
at com.mayabot.mynlp.fasttext.FastText.addInputVector(FastText.kt:261)
at com.mayabot.mynlp.fasttext.FastText.getWordVector(FastText.kt:196)
at com.mayabot.mynlp.fasttext.FastText.getWordVector(FastText.kt:207)

But when I use the same data, train the model with fastText-0.1.0, and load it with:
FastText.loadFasttextBinModel(cpp_model);
predict = fastText.predict(Arrays.asList(testStr.split(" ")), 5);

That is OK:

predict result:[[__label__4,0.9996598], [__label__1,2.4863696E-4], [__label__3,1.076918E-4], [__label__2,2.3928827E-5]]

Why?

Thank you.

This library returns wrong vectors when reading from cpp binary

I have a fasttext model dump that was trained unsupervised with the Python code and then quantized. When I call getWordVector from the Java code and from the Python code, I see very different vectors.

How to load crawl-300d-2M-subword.zip vector?

Hi there,

Can you share any info on how to load the English vector 'crawl-300d-2M-subword.zip' from https://fasttext.cc/docs/en/english-vectors.html into this library?

It looks like you can only load .vec and .bin files. But from the description on the website, it appears that this is a text file. Any ideas how it can be loaded into fastText4j?
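
A .zip file is a container, so a reasonable first step is to check what the archive actually holds before choosing a loader. The sketch below only lists entry names using the standard java.util.zip API; the class name is illustrative, and nothing here depends on fastText4j. If a .bin or .vec entry shows up, it can be extracted and passed to the matching load method.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class ZipListing {
    /** Returns the names of all entries in the given zip archive. */
    static List<String> entryNames(String zipPath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(zipPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ZipListing <archive.zip>");
            return;
        }
        for (String name : entryNames(args[0])) {
            System.out.println(name);
        }
    }
}
```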

Thanks.

Parameter question

A probability greater than 1 shows up here. What is going on?

nnSearch and analogies do not return the specified k

I loaded the English model; when I set k=5, fewer results are returned.

    private static FastText model;

    @BeforeClass
    public static void onlyOnce() throws Exception {
        String modelPath = "../cc.en.300.bin";
        model = FastText.loadFasttextBinModel(modelPath);
    }

    @Test
    public void nnSearch() {
        int topN = 5;
        List<FloatStringPair> predict = model.nearestNeighbor("king", topN);
        assertEquals(topN, predict.size());
    }

    @Test
    public void analogies() {
        int topN = 5;
        List<FloatStringPair> predict = model.analogies("berlin", "germany", "france", topN);
        assertEquals(topN, predict.size());
    }

java.lang.AssertionError: 
Expected :5
Actual   :3

java.lang.AssertionError: 
Expected :5
Actual   :4
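
For comparison, here is a minimal top-k search that returns exactly k neighbors whenever the vocabulary has at least k other words. It is a sketch of the general technique (cosine similarity with a bounded min-heap), not fasttext4j's code, and all names are illustrative. The point it shows is that excluded words must be skipped before a candidate enters the heap; trimming to k first and filtering afterwards is one common way to end up with fewer than k results, though whether that is the cause here is not known.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

final class TopKSketch {
    private static final class Scored {
        final String word;
        final double score;

        Scored(String word, double score) {
            this.word = word;
            this.score = score;
        }
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    /** Returns up to k words most similar to the query, best first. */
    static List<String> nearest(Map<String, float[]> vectors, String query, int k) {
        List<String> result = new ArrayList<>();
        float[] q = vectors.get(query);
        if (q == null) {
            return result;
        }
        // Min-heap on score: the root is the weakest of the current best k.
        PriorityQueue<Scored> heap =
                new PriorityQueue<>((x, y) -> Double.compare(x.score, y.score));
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (e.getKey().equals(query)) {
                continue; // exclude the query before ranking, not after trimming
            }
            heap.add(new Scored(e.getKey(), cosine(q, e.getValue())));
            if (heap.size() > k) {
                heap.poll(); // drop the current weakest
            }
        }
        while (!heap.isEmpty()) {
            result.add(heap.poll().word);
        }
        Collections.reverse(result); // heap drains weakest first
        return result;
    }

    public static void main(String[] args) {
        Map<String, float[]> v = new LinkedHashMap<>();
        v.put("king", new float[] {1f, 0f});
        v.put("queen", new float[] {0.9f, 0.1f});
        v.put("prince", new float[] {0.8f, 0.3f});
        v.put("apple", new float[] {0f, 1f});
        System.out.println(nearest(v, "king", 2)); // prints [queen, prince]
    }
}
```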

Same model different prediction results

I have a binary classification use case and trained the model in C++ (with non-default hyperparameters). However, I'm getting completely different prediction results for the same test cases in C++ and in fasttext4j (the C++ results are correct for the 10 cases I tested).

Any help or explanation would be appreciated!

Thanks,
Catherine

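train and predict work well, but I cannot find a test method

train and predict work well, but I cannot find a test method. Neither TrainArgs nor FastText seems to have one. Does it not exist, or is it somewhere else?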
