Compute semantic similarity between arbitrary words and phrases in many languages.

Home Page: http://www.linguatools.de/disco/disco_en.html

DISCO API

Java API for word embeddings

This is the source code repository for the linguatools DISCO API. For more information on DISCO, visit http://www.linguatools.de/disco/disco_en.html.

Quickstart

Install DISCO API

Download the source code by cloning this repository:

git clone git@github.com:linguatools/disco.git

Go into the repository folder and build the executable jar with dependencies:

cd disco/
./gradlew shadowJar

For command-line usage instructions, run the DISCO API jar without any arguments:

java -jar build/libs/disco-3.0.0-all.jar

or consult the web page.

Import a vector file from fastText

Download a fastText vector file in text format and unpack it:

wget https://s3-us-west-1.amazonaws.com/fasttext-vectors/word-vectors-v2/cc.de.300.vec.gz
gunzip cc.de.300.vec.gz
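
The unpacked file is plain text: a header line with the number of words and the vector dimension, followed by one line per word holding the word and its space-separated vector components. The following is a minimal sketch of reading that layout, useful for a quick sanity check before importing. The class name `VecFormatSketch` and its `read` method are hypothetical and are not part of DISCO or DISCO Builder; the sketch also loads everything into memory, which is only practical for small files or samples.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of DISCO: parses the text vector format.
final class VecFormatSketch {

    /** Reads a "count dimension" header followed by one "word v1 v2 ..." line per word. */
    static Map<String, float[]> read(BufferedReader in) {
        try {
            String header = in.readLine();
            if (header == null) {
                throw new IllegalArgumentException("empty input");
            }
            // header[0] is the word count, header[1] the vector dimension
            String[] h = header.trim().split(" ");
            int dim = Integer.parseInt(h[1]);

            Map<String, float[]> vectors = new LinkedHashMap<>();
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;
                }
                // split() drops trailing empty strings, so a trailing space is harmless
                String[] parts = line.split(" ");
                if (parts.length != dim + 1) {
                    throw new IllegalArgumentException("unexpected field count for: " + parts[0]);
                }
                float[] v = new float[dim];
                for (int i = 0; i < dim; i++) {
                    v[i] = Float.parseFloat(parts[i + 1]);
                }
                vectors.put(parts[0], v);
            }
            return vectors;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Tiny made-up sample in the same layout (values are illustrative only)
        String sample = "2 3\nHaus 0.1 0.2 0.3\nWohnung 0.2 0.1 0.4\n";
        Map<String, float[]> vectors = read(new BufferedReader(new StringReader(sample)));
        System.out.println(vectors.keySet());
    }
}
```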

Download DISCO Builder:

wget http://www.linguatools.de/disco/DISCOBuilder-1.1.1.tar.bz2
tar jxf DISCOBuilder-1.1.1.tar.bz2

Convert the vector file into a DISCO DenseMatrix:

java -Xmx8g -cp DISCOBuilder-1.1.1/DISCOBuilder-1.1.1-all.jar de.linguatools.disco.builder.Import -in cc.de.300.vec -out cc.de.300.col.denseMatrix -wsType COL 

Query the new DISCO word space from the command line with the DISCO API:

java -Xmx4g -jar ~/repos-linguatools/disco/build/libs/disco-3.0.0-all.jar cc.de.300.col.denseMatrix/cc.de.300-COL.denseMatrix -s Haus Wohnung COSINE
0.64413786
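
The value printed above is the cosine similarity of the two word vectors. As a reminder of what that measure computes, here is a minimal self-contained sketch on plain `float` arrays. It is the textbook formula (dot product divided by the product of the vector lengths), not DISCO's own implementation, and the class name `CosineSketch` is hypothetical.

```java
// Hypothetical helper, not part of DISCO: textbook cosine similarity.
final class CosineSketch {

    /** Returns dot(a, b) / (|a| * |b|), or 0 if either vector has zero length. */
    static double cosine(float[] a, float[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("dimension mismatch");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        // Toy vectors, illustrative only
        float[] haus = {0.1f, 0.2f, 0.3f};
        float[] wohnung = {0.2f, 0.1f, 0.4f};
        System.out.println(cosine(haus, wohnung));
    }
}
```

The result lies between -1 and 1, and a higher value means the two vectors point in more similar directions.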

Java API

To include DISCO in your Maven or Gradle project, use the snippets below or visit the DISCO page on JitPack.

Gradle

Add this to your build.gradle file:

repositories {
    maven { url 'https://jitpack.io' }
}
dependencies {
    // 'compile' was removed in Gradle 7; on Gradle versions before 3.4 use 'compile' instead
    implementation 'com.github.linguatools:disco:v3.0.0'
}

Maven

Add this to your pom.xml file:

<repositories>
	<repository>
	    <id>jitpack.io</id>
	    <url>https://jitpack.io</url>
	</repository>
</repositories>

<dependency>
    <groupId>com.github.linguatools</groupId>
    <artifactId>disco</artifactId>
    <version>v3.0.0</version>
</dependency>

Example Java code

DISCO disco = DISCO.load("cc.de.300-COL.denseMatrix");
float sim = disco.semanticSimilarity("Haus", "Häuschen",
        DISCO.getVectorSimilarity(SimilarityMeasure.COSINE));
System.out.println("similarity between 'Haus' and 'Häuschen': "+sim);
// get word vector for "Haus" as map
Map<String,Float> wordVectorHaus = disco.getWordvector("Haus");
// get word embedding for "Haus" as float array
float[] wordEmbeddingHaus = ((DenseMatrix) disco).getWordEmbedding("Haus");
// solve the analogy: x is to "Frau" as "König" is to "Mann"
List<ReturnDataCol> result = Compositionality.solveAnalogy("Frau", "König", "Mann", disco);
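
For readers unfamiliar with analogy solving on word embeddings, the sketch below shows the commonly used vector-offset approach on toy data: build the target vector `w2 - w3 + w1` and return the remaining word whose vector has the highest cosine similarity to it. This is an assumption about the general technique, not a description of how `Compositionality.solveAnalogy` is implemented. The class `AnalogySketch`, its methods, and the two-dimensional vectors are all hypothetical; only the argument order mirrors the call above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration of the vector-offset method, not DISCO code.
final class AnalogySketch {

    static double cosine(float[] a, float[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    /** Solves "x is to w1 as w2 is to w3" by searching for the vector nearest to w2 - w3 + w1. */
    static String solve(Map<String, float[]> space, String w1, String w2, String w3) {
        float[] v1 = space.get(w1), v2 = space.get(w2), v3 = space.get(w3);
        float[] target = new float[v1.length];
        for (int i = 0; i < target.length; i++) {
            target[i] = v2[i] - v3[i] + v1[i];
        }
        String best = null;
        double bestSim = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, float[]> e : space.entrySet()) {
            String word = e.getKey();
            // the three query words are excluded from the candidates
            if (word.equals(w1) || word.equals(w2) || word.equals(w3)) {
                continue;
            }
            double sim = cosine(target, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                best = word;
            }
        }
        return best;
    }

    /** Made-up 2-dimensional vectors, chosen so the offsets line up exactly. */
    static Map<String, float[]> toySpace() {
        Map<String, float[]> space = new LinkedHashMap<>();
        space.put("Mann", new float[]{1f, 0f});
        space.put("Frau", new float[]{1f, 1f});
        space.put("Onkel", new float[]{3f, 0f});
        space.put("Tante", new float[]{3f, 1f});
        space.put("Haus", new float[]{0f, -1f});
        return space;
    }

    public static void main(String[] args) {
        // x is to "Frau" as "Onkel" is to "Mann"
        System.out.println(solve(toySpace(), "Frau", "Onkel", "Mann")); // prints "Tante"
    }
}
```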

Documentation

How to get word spaces for DISCO?

Features

  • native Java API
  • methods for computing text similarity, solving analogies, clustering similar words, compositional semantics, and more
  • efficient storage of high-dimensional sparse matrices (distributional count vectors) as well as low-dimensional dense matrices (word embeddings)
  • efficient storage and retrieval of higher-order word similarities
  • open source under the Apache license
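
To illustrate the difference between the two kinds of vectors mentioned above: a dense embedding stores a value for every dimension, while a sparse count vector stores only the non-zero features, for example as a `Map<String,Float>` like the one returned by `getWordvector` in the example code. The sketch below computes a dot product and cosine over such maps by iterating over the smaller one. It is a minimal illustration under that assumption; the class `SparseVectorSketch` and the feature names are hypothetical, and DISCO's internal storage is not shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of similarity on sparse feature maps, not DISCO code.
final class SparseVectorSketch {

    /** Dot product that only visits the features of the smaller map. */
    static double dot(Map<String, Float> a, Map<String, Float> b) {
        Map<String, Float> small = a.size() <= b.size() ? a : b;
        Map<String, Float> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<String, Float> e : small.entrySet()) {
            Float other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Cosine similarity over sparse maps; features missing from a map count as zero. */
    static double cosine(Map<String, Float> a, Map<String, Float> b) {
        double normA = dot(a, a);
        double normB = dot(b, b);
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot(a, b) / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        // Made-up co-occurrence features and counts
        Map<String, Float> haus = new HashMap<>();
        haus.put("Dach", 2f);
        haus.put("Garten", 1f);
        Map<String, Float> wohnung = new HashMap<>();
        wohnung.put("Dach", 1f);
        wohnung.put("Miete", 3f);
        System.out.println(cosine(haus, wohnung));
    }
}
```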

Contributors

pekoli, pkolb
