mayabot / fasttext4j
Implementing Facebook's FastText with Java
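The test set contains very little text. Could that cause the probabilities across the classes to even out? Or is fasttext4j not well suited to Chinese, so that the probabilities get averaged out?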
I would like to ask whether it is possible to load a bin classifier trained with Facebook's official Python interface.
Loading it directly throws an error:
Exception in thread "main" java.lang.RuntimeException: Model file has wrong file format!
at com.mayabot.mynlp.fasttext.LoadFastTextFromClangModel.loadCModel(LoadFastTextFromClangModel.kt:32)
at com.mayabot.mynlp.fasttext.LoadFastTextFromClangModel.loadCModel(LoadFastTextFromClangModel.kt:124)
at com.mayabot.mynlp.fasttext.FastText$Companion.loadFasttextBinModel(FastText.kt:366)
at com.mayabot.mynlp.fasttext.FastText.loadFasttextBinModel(FastText.kt)
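In the information printed by model.test(),
the sample count printed as "N" does not match the number of samples in my test file; it is considerably lower.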
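One possible explanation, offered as an assumption rather than a confirmed diagnosis: the official C++ fastText only counts a test line towards N when the line yields at least one label and at least one word token. Whether fasttext4j applies the same rule is not confirmed here. The sketch below approximates that rule on raw text so the result can be compared with the reported N; the class `TestLineCounter` is a hypothetical helper, and `__label__` is assumed as the label prefix.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of fasttext4j.
class TestLineCounter {

    // Counts lines that contain at least one label token and at least one other token.
    static int countScorable(List<String> lines, String labelPrefix) {
        int scorable = 0;
        for (String line : lines) {
            boolean hasLabel = false;
            boolean hasWord = false;
            for (String token : line.trim().split("\\s+")) {
                if (token.isEmpty()) {
                    continue; // blank line
                }
                if (token.startsWith(labelPrefix)) {
                    hasLabel = true;
                } else {
                    hasWord = true;
                }
            }
            if (hasLabel && hasWord) {
                scorable++;
            }
        }
        return scorable;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "__label__sports the match ended in a draw",
                "a line without any label",
                "__label__news",
                "");
        System.out.println("scorable lines: " + countScorable(lines, "__label__")
                + " of " + lines.size());
    }
}
```

If the count from such a check matches N, the missing samples are most likely lines without a recognised label or without text.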
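Hello, is the default value MAX_VOCAB_SIZE = 30000000 in the code too large? When I load several fasttext models, it causes an out-of-memory error (by my estimate, the size implied by this parameter far exceeds the physical memory of the machine, while the number of words in the actual models is nowhere near this size). Is there a workaround, or could this be exposed as a configuration parameter?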
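A rough way to size the cost, under an assumption that has not been checked against the fasttext4j source: the dictionary preallocates one 4-byte int slot per MAX_VOCAB_SIZE entry, as the hash table in the C++ fastText does. The class `VocabMemoryEstimate` and the per-slot size are illustrative only.

```java
// Back-of-the-envelope estimate; the per-slot size is an assumption.
class VocabMemoryEstimate {
    static final long MAX_VOCAB_SIZE = 30_000_000L;
    static final int BYTES_PER_SLOT = Integer.BYTES; // assumed: one int per slot

    // Bytes reserved for the vocabulary hash tables of modelCount loaded models.
    static long bytesForModels(int modelCount) {
        return MAX_VOCAB_SIZE * BYTES_PER_SLOT * modelCount;
    }

    public static void main(String[] args) {
        for (int models : new int[] {1, 5, 10}) {
            double mib = bytesForModels(models) / (1024.0 * 1024.0);
            System.out.printf("%d model(s): %.1f MiB%n", models, mib);
        }
    }
}
```

Under that assumption each loaded model reserves 120,000,000 bytes for the table alone, regardless of how many words it actually contains, so the cost grows linearly with the number of models.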
Could you check whether this format is acceptable? If not, must the test set be in the "label,txt" format?
When I run gradle compile 'com.mayabot:fastText4j:1.1.5', I get:
FAILURE: Build failed with an exception.
Where:
Build file '...fastText4j/build.gradle' line: 105
What went wrong:
A problem occurred evaluating root project 'fastText4j'.
Could not get unknown property 'oss_user' for repository container of type org.gradle.api.internal.artifacts.dsl.DefaultRepositoryHandler.
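Hello,
When I use fasttext4j to run fasttext skipgram training on a 60M corpus, training is extremely slow; the progress display estimates more than 200 hours to finish.
My parameters are: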
lr:0.1
dim:128
wordNgrams:3
Bucket: 100
neg:5
thread:12
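Whether on a fairly powerful server or on my local machine, training is slower than with the Python version of fasttext. Could you tell me whether something is wrong with my configuration?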
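Hello, because wordVector is declared with lateinit, serializing a FastText object fails (I am using fastjson's JSON.toJSONString). Why is this modifier used? Or is there another serialization method I could use? I would like to keep the object in a cache. Thanks.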
inputArgs.setPretrainedVectors()
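When the method above is used to set pretrained word vectors, the property that gets set is pretrainedVectors,
but the property that is actually checked when deciding whether to use pretrained word vectors is preTrainedVectors.
The two names differ by a single character.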
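The kind of mismatch described here can be reproduced in isolation. The class below is a hypothetical stand-in, not the fasttext4j source: the setter writes `pretrainedVectors` while the buggy check reads `preTrainedVectors`, so the configured path is silently ignored.

```java
// Hypothetical illustration of a setter and a check that use different fields.
class ArgsMismatchDemo {

    static class InputArgs {
        private String pretrainedVectors = ""; // written by the setter
        private String preTrainedVectors = ""; // read by the buggy check, never written

        void setPretrainedVectors(String path) {
            this.pretrainedVectors = path;
        }

        // Reads the wrong field, so it stays false after the setter is called.
        boolean usesPretrainedVectorsBuggy() {
            return !preTrainedVectors.isEmpty();
        }

        // Reads the same field that the setter writes.
        boolean usesPretrainedVectorsFixed() {
            return !pretrainedVectors.isEmpty();
        }
    }

    public static void main(String[] args) {
        InputArgs inputArgs = new InputArgs();
        inputArgs.setPretrainedVectors("vectors.vec"); // placeholder path
        System.out.println("buggy check: " + inputArgs.usesPretrainedVectorsBuggy());
        System.out.println("fixed check: " + inputArgs.usesPretrainedVectorsFixed());
    }
}
```

Keeping a single field for the setting and reading it everywhere removes this class of bug.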
I'm using FastText to train a model and everything works, but I get the following exception when loading from the saved files:
Stacktrace:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown EntryType enum second :136
at com.mayabot.mynlp.fasttext.EntryType$Companion.fromValue(Dictionary.kt:651)
at com.mayabot.mynlp.fasttext.Dictionary.load(Dictionary.kt:596)
at com.mayabot.mynlp.fasttext.FastText$Companion.loadModel(FastText.kt:410)
at com.mayabot.mynlp.fasttext.FastText.loadModel(FastText.kt)
Version: "com.mayabot" % "fastText4j" % "1.2.3".
Please help, and let me know if you need more information.
Thanks!
JDK8_121
Read file build dictionary ...
Read 0M words
Number of words: 251
Number of labels: 0
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
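The input is only a few dozen lines of Chinese, already segmented with words separated by spaces. A quick test of word vector training runs out of memory immediately.
Other Java w2v implementations handle it fine; some of the lines contain only a few words.
Supervised training, however, did not report an error: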
Read file build dictionary ...
Read 0M words
Number of words: 958
Number of labels: 0
Progress: 100.95% words/sec/thread: Infinity lr: -0.00095 loss: 0.00000 ETA: 0h 0m 0s
Progress: 100.00% words/sec/thread: Infinity lr: 0.00000 loss: 0.00000 ETA: 0h 0m 0s
Train use time 123 ms
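I would like to use fasttext for text classification training. I can see that TrainArgs has a wordNgram parameter, but after adding the fastText4j jar through Maven I cannot find a corresponding set method for wordNgram among the TrainArgs settings. I am not familiar with Kotlin; please advise. Thanks!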
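Hello,
While using the library I found that only the dsub and qnorm parameters take effect during quantize; cutoff and retrain do not appear in the code. Is this functionality not finished yet? If so, are there plans to add them soon?
Also, the version number in the installation section of the home page documentation is still 1.2.2, a version without parameters such as minn and maxn, while the latest version is already 1.2.3. I only found this out by reading the issues; please update the documentation.
Best regards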
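Besides the file and the model training type, how can the other parameters of FastText.train() be set, for example the learning rate, epoch, n-gram window, and so on? Please advise.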
Hello, what is the difference between your code and the code in https://github.com/linkfluence/fastText4j?
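In the official version of fasttext, the model has a test method in addition to predict, which directly reports precision and recall on a test set. This Java implementation does not seem to have that interface?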
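For reference, precision and recall at k can be computed from the gold labels and the top-k predicted labels alone. The sketch below is a standalone illustration, not fasttext4j API: the class `PrecisionRecallAtK` and its method names are invented for this example, and the predicted labels are assumed to come from whatever predict call the application already uses.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Standalone illustration; not part of fasttext4j.
class PrecisionRecallAtK {
    private long examples;
    private long correct;   // predicted labels that are also gold labels
    private long predicted; // total number of predicted labels
    private long gold;      // total number of gold labels

    // Records one test example: its gold labels and the top-k predicted labels.
    void log(List<String> goldLabels, List<String> predictedTopK) {
        Set<String> goldSet = new HashSet<>(goldLabels);
        for (String label : predictedTopK) {
            if (goldSet.contains(label)) {
                correct++;
            }
        }
        predicted += predictedTopK.size();
        gold += goldSet.size();
        examples++;
    }

    long examples() {
        return examples;
    }

    // Fraction of predicted labels that are correct.
    double precision() {
        return predicted == 0 ? 0.0 : (double) correct / predicted;
    }

    // Fraction of gold labels that were predicted.
    double recall() {
        return gold == 0 ? 0.0 : (double) correct / gold;
    }
}
```

A caller would create one meter, call `log` once per test line, and read `precision()` and `recall()` at the end. With one label per example and k = 1, the two values coincide with plain accuracy.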
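I trained a model (ftz format) with the latest official fasttext code. When I load it in fasttext4j and predict, 3,000 out of 20,000 samples get a prediction that differs from the one produced by the official code.
Only the classification labels were compared.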
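When narrowing down a discrepancy like this, it can help to list which samples disagree so they can be inspected for common traits (for instance unusual tokens or very short texts). The helper below is a hypothetical sketch that compares two label lists position by position; it assumes both lists hold the top-1 label for the same samples in the same order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical comparison helper; not part of fasttext4j.
class LabelAgreement {

    // Returns the indices at which the two label lists differ.
    static List<Integer> mismatchIndices(List<String> official, List<String> ported) {
        if (official.size() != ported.size()) {
            throw new IllegalArgumentException("label lists must have the same length");
        }
        List<Integer> mismatches = new ArrayList<>();
        for (int i = 0; i < official.size(); i++) {
            if (!official.get(i).equals(ported.get(i))) {
                mismatches.add(i);
            }
        }
        return mismatches;
    }

    // Share of positions that differ, between 0.0 and 1.0.
    static double mismatchRate(List<String> official, List<String> ported) {
        if (official.isEmpty()) {
            return 0.0;
        }
        return (double) mismatchIndices(official, ported).size() / official.size();
    }
}
```

For the figures reported here, 3,000 differing labels out of 20,000 corresponds to a mismatch rate of 15%.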
In Python there is this call: fasttext.util.reduce_model(fast_model, 128)
I cannot find a corresponding method here.
If I use the fasttext4j jar offline, and the jar is already compiled, what else needs to be configured? Our company intranet cannot download anything from Maven.
When loading the model, the following error occurs:
Exception in thread "main" java.lang.IllegalArgumentException: Unknown EntryType enum second :136
at com.mayabot.nlp.fasttext.dictionary.EntryType$Companion.fromValue(DictUtils.kt:57)
at com.mayabot.nlp.fasttext.dictionary.Dictionary$Companion.loadModel(Dictionary.kt:384)
at com.mayabot.nlp.fasttext.FastText$Companion.loadModel(FastText.kt:711)
at cn.com.duiba.spark.dmp.api.FasttextTest.loadModel(FasttextTest.java:61)
at cn.com.duiba.spark.dmp.api.FasttextTest.main(FasttextTest.java:43)
Any help would be appreciated.
Hi, I tested fastText4j and trained a model with:
fastText = FastText.train("ag.train", ModelName.sup);
fastText.saveModel(java_model);
then loaded the model to predict a sentence:
fastText = FastText.loadModel(java_model);
fastText.predict(Arrays.asList(testStr.split(" ")), 5);
and got this error:
java.lang.ArithmeticException: / by zero
at com.mayabot.mynlp.fasttext.AreaByteBufferMatrix.get(Matrix.kt:222)
at com.mayabot.mynlp.fasttext.FastText.addInputVector(FastText.kt:261)
at com.mayabot.mynlp.fasttext.FastText.getWordVector(FastText.kt:196)
at com.mayabot.mynlp.fasttext.FastText.getWordVector(FastText.kt:207)
But when I used the same data, trained the model with fastText-0.1.0, and loaded it with:
FastText.loadFasttextBinModel(cpp_model);
predict = fastText.predict(Arrays.asList(testStr.split(" ")), 5);
it works:
predict result:[[__label__4,0.9996598], [__label__1,2.4863696E-4], [__label__3,1.076918E-4], [__label__2,2.3928827E-5]]
Why?
Thank you.
I have a fasttext model dump that was trained unsupervised with the Python code and then quantized. When I call getWordVector from the Java code and from the Python code, I get very different vectors.
Hi there,
Can you share any info on how to load the English vector 'crawl-300d-2M-subword.zip' from https://fasttext.cc/docs/en/english-vectors.html into this library?
It looks like you can only load .vec and .bin files. But from the description on the website, it appears that this is a text file. Any ideas how it can be loaded into fastText4j?
Thanks.
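For the text variant, the .vec layout documented on fasttext.cc is simple: the first line holds the vocabulary size and the vector dimension, and every following line holds a word followed by its space-separated values. The sketch below reads that layout with plain Java I/O. It is an illustration only: the class `VecTextReader` is hypothetical, and how the resulting map would be handed to fastText4j is not covered.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for the .vec text layout; not part of fastText4j.
class VecTextReader {

    // Reads "<count> <dim>" followed by one "word v1 ... vdim" entry per line.
    static Map<String, float[]> read(Reader source) {
        try (BufferedReader reader = new BufferedReader(source)) {
            String header = reader.readLine();
            if (header == null) {
                throw new IllegalArgumentException("empty input");
            }
            int dim = Integer.parseInt(header.trim().split("\\s+")[1]);
            Map<String, float[]> vectors = new LinkedHashMap<>();
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.trim().split("\\s+");
                if (fields.length != dim + 1) {
                    continue; // skip blank or malformed lines
                }
                float[] vector = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vector[i] = Float.parseFloat(fields[i + 1]);
                }
                vectors.put(fields[0], vector);
            }
            return vectors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String sample = "2 3\nking 0.1 0.2 0.3\nqueen 0.4 0.5 0.6\n";
        Map<String, float[]> vectors = read(new StringReader(sample));
        System.out.println("loaded " + vectors.size() + " vectors");
    }
}
```

Holding 2 million 300-dimensional vectors in a map this way needs roughly 2.4 GB for the float values alone, so a real loader would likely stream or filter the file rather than keep everything in memory.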
I loaded the English model. When I set k=5, fewer results are returned:
private static FastText model;

@BeforeClass
public static void onlyOnce() throws Exception {
    String modelPath = "../cc.en.300.bin";
    model = FastText.loadFasttextBinModel(modelPath);
}

@Test
public void nnSearch() {
    int topN = 5;
    List<FloatStringPair> predict = model.nearestNeighbor("king", topN);
    assertEquals(topN, predict.size());
}

@Test
public void analogies() {
    int topN = 5;
    List<FloatStringPair> predict = model.analogies("berlin", "germany", "france", topN);
    assertEquals(topN, predict.size());
}
java.lang.AssertionError:
Expected :5
Actual :3
java.lang.AssertionError:
Expected :5
Actual :4
I have a binary classification use case and trained the model in C++ (with non-default hyperparameters). However, I'm getting completely different prediction results for the same test cases in C++ and in fasttext4j (the results from C++ are correct for the 10 cases I tested).
Any help or explanation would be appreciated!
Thanks,
Catherine
train and predict work well, but I cannot find a test method. Neither TrainArgs nor FastText seems to have one. Does it not exist, or is it somewhere else?