Java API for word embeddings
This is the source code repository for the linguatools DISCO API. For more information on DISCO visit http://www.linguatools.de/disco/disco_en.html.
Download the source code by cloning this repository:
git clone git@github.com:linguatools/disco.git
Go into the repository folder and build the executable jar with dependencies:
cd disco/
./gradlew shadowJar
For instructions on command-line usage, run the DISCO API jar without any parameters:
java -jar build/libs/disco-3.0.0-all.jar
or consult the web page.
Download a fastText vector file in text format and unpack it:
wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.vec.gz
gunzip cc.de.300.vec.gz
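For orientation, the unpacked `.vec` file is plain text: the first line gives the number of words and the vector dimension, and every following line holds a word followed by its space-separated components. The sketch below reads the first few entries of such a file. `VecFormatSketch` and its `read` method are hypothetical names used only for this illustration. This is not how DISCO Builder imports the file, just one way to inspect the format.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: not part of DISCO or DISCO Builder.
final class VecFormatSketch {

    /** Reads at most maxWords vectors from a text-format vector file. */
    static Map<String, float[]> read(BufferedReader reader, int maxWords) {
        Map<String, float[]> vectors = new LinkedHashMap<>();
        try {
            // Header line: "<number of words> <dimension>"
            String header = reader.readLine();
            if (header == null) {
                throw new IllegalArgumentException("empty vector file");
            }
            String[] sizes = header.trim().split(" ");
            if (sizes.length < 2) {
                throw new IllegalArgumentException("missing header line");
            }
            int dim = Integer.parseInt(sizes[1]);

            String line;
            while (vectors.size() < maxWords && (line = reader.readLine()) != null) {
                String[] parts = line.trim().split(" ");
                if (parts.length != dim + 1) {
                    continue; // skip lines that do not match the declared dimension
                }
                float[] vector = new float[dim];
                for (int i = 0; i < dim; i++) {
                    vector[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], vector);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return vectors;
    }

    public static void main(String[] args) {
        // Tiny made-up sample in the same layout as a .vec file.
        String sample = "2 3\nHaus 0.1 0.2 0.3\nWohnung 0.2 0.1 0.4\n";
        Map<String, float[]> vectors = read(new BufferedReader(new StringReader(sample)), 10);
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            System.out.println(e.getKey() + " has " + e.getValue().length + " dimensions");
        }
    }
}
```

To look at the real file, pass `java.nio.file.Files.newBufferedReader(java.nio.file.Paths.get("cc.de.300.vec"))` instead of the sample string and keep `maxWords` small.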
Download DISCO Builder:
wget http://www.linguatools.de/disco/DISCOBuilder-1.1.1.tar.bz2
tar jxf DISCOBuilder-1.1.1.tar.bz2
Convert the vector file into a DISCO DenseMatrix:
java -Xmx8g -cp DISCOBuilder-1.1.1/DISCOBuilder-1.1.1-all.jar de.linguatools.disco.builder.Import -in cc.de.300.vec -out cc.de.300.col.denseMatrix -wsType COL
Query the new DISCO word space from the command line with the DISCO API:
java -Xmx4g -jar ~/repos-linguatools/disco/build/libs/disco-3.0.0-all.jar cc.de.300.col.denseMatrix/cc.de.300-COL.denseMatrix -s Haus Wohnung COSINE
0.64413786
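The number printed above is the cosine similarity of the two word vectors. As a rough illustration of what that measure computes, here is a minimal, self-contained sketch on plain float arrays. The class `CosineSketch` and the toy vectors are made up for this example and are not part of the DISCO API. DISCO's own implementation may differ.

```java
// Illustrative sketch only: not DISCO's implementation.
final class CosineSketch {

    /** Cosine of the angle between two vectors of equal dimension; 0 if either is all zeros. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same dimension");
        }
        double dot = 0.0;
        double squaredNormA = 0.0;
        double squaredNormB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            squaredNormA += (double) a[i] * a[i];
            squaredNormB += (double) b[i] * b[i];
        }
        if (squaredNormA == 0.0 || squaredNormB == 0.0) {
            return 0.0; // similarity with a zero vector is undefined; report 0
        }
        return dot / Math.sqrt(squaredNormA * squaredNormB);
    }

    public static void main(String[] args) {
        // Toy values, not real embeddings.
        float[] sameDirectionA = {1.0f, 2.0f, 0.0f};
        float[] sameDirectionB = {2.0f, 4.0f, 0.0f};
        float[] orthogonal = {0.0f, 0.0f, 3.0f};
        System.out.println("parallel vectors:   " + cosine(sameDirectionA, sameDirectionB));
        System.out.println("orthogonal vectors: " + cosine(sameDirectionA, orthogonal));
    }
}
```

Values close to 1 mean the vectors point in nearly the same direction, values near 0 mean they are unrelated.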
To include DISCO in your Maven or Gradle project see below or visit the DISCO page on JitPack.
Add this to your build.gradle file:
repositories {
maven { url 'https://jitpack.io' }
}
dependencies {
implementation 'com.github.linguatools:disco:v3.0.0'
}
Add this to your pom.xml file:
<repositories>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependency>
<groupId>com.github.linguatools</groupId>
<artifactId>disco</artifactId>
<version>v3.0.0</version>
</dependency>
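Use the DISCO API from Java:
// load the word space (adjust the path to your denseMatrix file)
DISCO disco = DISCO.load("cc.de.300-COL.denseMatrix");
// compute the cosine similarity between two words
float sim = disco.semanticSimilarity("Haus", "Häuschen",
        DISCO.getVectorSimilarity(SimilarityMeasure.COSINE));
System.out.println("similarity between 'Haus' and 'Häuschen': " + sim);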
// get word vector for "Haus" as map
Map<String,Float> wordVectorHaus = disco.getWordvector("Haus");
// get word embedding for "Haus" as float array
float[] wordEmbeddingHaus = ((DenseMatrix) disco).getWordEmbedding("Haus");
// solve analogy x is to "Frau" as "König" is to "Mann"
List<ReturnDataCol> result = Compositionality.solveAnalogy("Frau", "König", "Mann", disco);
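To give an idea of what solving an analogy on word embeddings can look like, the following sketch uses the common vector-offset approach: build a target vector from the three given words and return the nearest remaining word by cosine similarity. Whether `Compositionality.solveAnalogy` works exactly this way is an assumption. `AnalogySketch`, its methods, and the toy vocabulary with made-up three-dimensional vectors are hypothetical and exist only for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: vector-offset analogy on a toy word space, not DISCO code.
final class AnalogySketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += (double) a[i] * b[i];
            na += (double) a[i] * a[i];
            nb += (double) b[i] * b[i];
        }
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    /**
     * Finds x such that "x is to a as b is to c" by searching for the word
     * closest to (b - c + a). Returns null if one of the words is unknown.
     */
    static String solve(Map<String, float[]> space, String a, String b, String c) {
        float[] va = space.get(a), vb = space.get(b), vc = space.get(c);
        if (va == null || vb == null || vc == null) {
            return null;
        }
        float[] target = new float[va.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = vb[i] - vc[i] + va[i];
        }
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : space.entrySet()) {
            String word = e.getKey();
            if (word.equals(a) || word.equals(b) || word.equals(c)) {
                continue; // the query words themselves are not valid answers
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = word;
            }
        }
        return best;
    }

    /** Made-up vectors; the dimensions loosely mean (relative, male, female). */
    static Map<String, float[]> toySpace() {
        Map<String, float[]> space = new LinkedHashMap<>();
        space.put("Mann", new float[]{0.1f, 1.0f, 0.0f});
        space.put("Frau", new float[]{0.1f, 0.0f, 1.0f});
        space.put("Onkel", new float[]{1.0f, 1.0f, 0.0f});
        space.put("Tante", new float[]{1.0f, 0.0f, 1.0f});
        space.put("Haus", new float[]{0.2f, 0.1f, 0.1f});
        return space;
    }

    public static void main(String[] args) {
        // x is to "Frau" as "Onkel" is to "Mann"
        System.out.println(solve(toySpace(), "Frau", "Onkel", "Mann"));
    }
}
```

Excluding the three query words from the candidates matters in practice, because the target vector usually stays close to one of them.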
- Import vector files (text format) from word2vec, GloVe or fastText using DISCO Builder. Pre-computed fastText vector files are available for 157 languages.
- Download ready-to-use native DISCO word spaces (high-dimensional distributional count vectors) and DISCO word embeddings (low-dimensional predict vectors) imported from word2vec for several languages at http://www.linguatools.de/disco/disco-download_en.html.
- Create your own high-dimensional distributional count vectors from a text corpus using DISCO Builder.
- Native Java API.
- The API provides methods for computing text similarity, solving analogies, clustering similar words, compositional semantics, and more.
- Efficient storage of high-dimensional sparse matrices (distributional count vectors) as well as low-dimensional dense matrices (word embeddings).
- Higher-order word similarities can be stored and retrieved efficiently.
- The API is open source under the Apache license.
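The list above distinguishes sparse count vectors from dense embeddings. To illustrate the difference, a sparse vector can be held as a map from feature name to weight (the same shape as the `Map<String,Float>` word vector in the Java example), so that only non-zero entries take up space. The sketch below computes cosine similarity on such maps. `SparseVectorSketch` is a hypothetical name and the weights are invented. DISCO's internal storage format is not shown here and may look quite different.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: cosine similarity on sparse vectors stored as maps.
final class SparseVectorSketch {

    static double squaredNorm(Map<String, Float> v) {
        double sum = 0.0;
        for (float weight : v.values()) {
            sum += (double) weight * weight;
        }
        return sum;
    }

    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        // Iterate over the smaller map; only shared features contribute to the dot product.
        Map<String, Float> small = a.size() <= b.size() ? a : b;
        Map<String, Float> large = (small == a) ? b : a;
        double dot = 0.0;
        for (Map.Entry<String, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                dot += (double) e.getValue() * other;
            }
        }
        double na = squaredNorm(a);
        double nb = squaredNorm(b);
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Invented co-occurrence weights for two words.
        Map<String, Float> haus = new HashMap<>();
        haus.put("wohnen", 3.0f);
        haus.put("Dach", 2.0f);
        Map<String, Float> wohnung = new HashMap<>();
        wohnung.put("wohnen", 4.0f);
        wohnung.put("Miete", 1.0f);
        System.out.println(cosine(haus, wohnung));
    }
}
```

With a dense embedding, every word has a value in every dimension, so a plain `float[]` is the more compact representation.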