andy-wagner / dawg-1

This project forked from hedgeonline/dawg

Yet another Java library for storing (and searching) strings in a Directed Acyclic Word Graph aka Minimal Acyclic Finite-State Automaton

License: MIT License

Java 100.00%

Directed Acyclic Word Graph

This is another Java implementation of the algorithm described in https://www.aclweb.org/anthology/W98-1305.pdf. It was written in 2015 for a natural language processing engine and is now available here. The current implementation uses binary search for both compilation and search, and requires less memory for dictionaries with large alphabets.
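
To illustrate the binary search idea, here is a minimal sketch of a node that keeps its outgoing transition labels in a sorted array sized to the transitions it actually has, rather than one slot per alphabet symbol. This is a plausible reason for the lower memory use with large alphabets, but it is an assumption: the class `SortedTransitions` and its methods are hypothetical and are not taken from this library's source.

```java
import java.util.Arrays;

/** Hypothetical node layout (not the library's): labels stay sorted so lookup is a binary search. */
final class SortedTransitions {
    private char[] labels = new char[0]; // outgoing labels, sorted ascending
    private int[] targets = new int[0];  // targets[i] is the state reached via labels[i]

    /** Returns the target state for the label, or -1 if no such transition exists. */
    int find(char label) {
        int i = Arrays.binarySearch(labels, label);
        return i >= 0 ? targets[i] : -1;
    }

    /** Inserts or overwrites a transition while keeping the labels sorted. */
    void put(char label, int target) {
        int i = Arrays.binarySearch(labels, label);
        if (i >= 0) {
            targets[i] = target; // label already present: overwrite its target
            return;
        }
        int at = -(i + 1); // binarySearch encodes the insertion point as -(point) - 1
        char[] newLabels = new char[labels.length + 1];
        int[] newTargets = new int[targets.length + 1];
        System.arraycopy(labels, 0, newLabels, 0, at);
        System.arraycopy(targets, 0, newTargets, 0, at);
        newLabels[at] = label;
        newTargets[at] = target;
        System.arraycopy(labels, at, newLabels, at + 1, labels.length - at);
        System.arraycopy(targets, at, newTargets, at + 1, targets.length - at);
        labels = newLabels;
        targets = newTargets;
    }

    /** Number of outgoing transitions stored in this node. */
    int size() {
        return labels.length;
    }
}
```

With this layout a node with three outgoing transitions stores three labels and three targets regardless of alphabet size, and a lookup costs O(log k) comparisons for k transitions.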

Download

If you want a plain JAR file (JDK 1.7), you can get it here: https://sourceforge.net/projects/hedgeonline-dawg/files

Dependencies

No dependencies except for JUnit4, used for testing.

Usage

// Create modifiable automaton
Automaton auto = new Automaton();

// Add strings
auto.add("some word or phrase");

// Can check if string is present in the automaton
boolean result = auto.contains("some word or phrase");

// Can list all suffixes for a given prefix
List<String> suffixes = auto.listSuffixes("some word");

// ... or can list all entries
List<String> entries = auto.listSuffixes("");

// Can also save as a read-only binary file (needs less space)
auto.save(new FileOutputStream("mydict_readonly.bin"), false);

// ... or in modifiable format
auto.save(new FileOutputStream("mydict_modifiable.bin"), true);

// Read binary file into a read-only search instance
ISearch dict = Dictionary.load("mydict_readonly.bin");

// ... this works for both format types (separate variable name, so the snippet compiles)
ISearch dictFromModifiable = Dictionary.load("mydict_modifiable.bin");

// Finally, an appendable automaton can be reinitialized from the modifiable format
Automaton newAuto = Automaton.load("mydict_modifiable.bin");
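
To make the prefix query above more concrete, here is a small sketch of how listing suffixes for a prefix can work: walk down the prefix, then collect every path that ends in a terminal state. It uses a plain, unminimized trie, not this library's Automaton, and `TrieSketch` is a hypothetical name. It also assumes the result is the remainder after the prefix (which matches the empty prefix returning whole entries); the library's exact return format is not confirmed here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative plain trie (hypothetical, unminimized); not the library's Automaton. */
final class TrieSketch {
    private static final class Node {
        final Map<Character, Node> next = new TreeMap<Character, Node>(); // labels kept sorted
        boolean terminal; // true if a stored string ends at this node
    }

    private final Node root = new Node();

    void add(String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            Node child = n.next.get(c);
            if (child == null) {
                child = new Node();
                n.next.put(c, child);
            }
            n = child;
        }
        n.terminal = true;
    }

    /** Returns every remainder r such that prefix + r was added, in sorted order. */
    List<String> listSuffixes(String prefix) {
        List<String> out = new ArrayList<String>();
        Node n = root;
        for (int i = 0; i < prefix.length() && n != null; i++) {
            n = n.next.get(prefix.charAt(i)); // null once the prefix leaves the trie
        }
        if (n != null) {
            collect(n, new StringBuilder(), out);
        }
        return out;
    }

    /** Depth-first walk that records the path to each terminal node. */
    private void collect(Node n, StringBuilder path, List<String> out) {
        if (n.terminal) {
            out.add(path.toString());
        }
        for (Map.Entry<Character, Node> e : n.next.entrySet()) {
            path.append(e.getKey().charValue());
            collect(e.getValue(), path, out);
            path.setLength(path.length() - 1); // backtrack before trying the next label
        }
    }
}
```

In this sketch, after adding "car", "cart", "cat" and "dog", the prefix "ca" yields the remainders "r", "rt" and "t", and the empty prefix yields all four entries.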

Performance

Tested (and heavily used) as the core of a morphological dictionary (POS tagging and lemmatization) for Russian. 5 million word forms with annotations compile in about 50 seconds into a 4-5 MB binary file (depending on the format) on an i5-2400. Suffix search speed (needed for retrieving morphological annotations) on the same CPU is about 250K searches per second single-threaded, with the Java process consuming 40-65 MB of memory. The Automaton class is not thread-safe; the Dictionary class can be accessed by several threads because it is stateless.
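
The thread-safety note can be illustrated with a short sketch of one read-only lookup object shared by a pool of worker threads without any locking. `FrozenWordSet` and `SharedLookupDemo` are hypothetical stand-ins written for this example (a sorted array with binary search), not classes from this library; a loaded Dictionary would play the role of the shared object, assuming it is safe to share as stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical stand-in for a read-only dictionary: never modified after construction. */
final class FrozenWordSet {
    private final String[] sorted;

    FrozenWordSet(List<String> words) {
        sorted = words.toArray(new String[0]);
        Arrays.sort(sorted);
    }

    /** Read-only lookup; safe to call from many threads because nothing is mutated. */
    boolean contains(String word) {
        return Arrays.binarySearch(sorted, word) >= 0;
    }
}

final class SharedLookupDemo {
    /** Counts how many queries are present, running the lookups on a thread pool. */
    static int countHits(final FrozenWordSet dict, List<String> queries, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
            for (final String q : queries) {
                results.add(pool.submit(new Callable<Boolean>() {
                    public Boolean call() {
                        return dict.contains(q); // same instance shared by all tasks
                    }
                }));
            }
            int hits = 0;
            for (Future<Boolean> f : results) {
                if (f.get()) {
                    hits++;
                }
            }
            return hits;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

A mutable structure such as the Automaton would instead need external synchronization, or confinement to a single thread, before being used this way.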
