Coder Social home page Coder Social logo

dawg's People

Contributors

kawu avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar

Forkers

ypares

dawg's Issues

Simplify library interface

Only the following type combinations really make sense:

  • Static automaton / Vector-based transition map.
  • Dynamic automaton / Map-based transition map.

Therefore, a particular transition back-end can be used directly within the context of an appropriate automaton. Right now, a class-based interface is used to provide all automaton/transition type combinations, which results in a rather ugly library interface.

Try using Data.Graph module for topological sorting

Topological sorting is needed when translating an immutable automaton to a mutable one. When the graph is sorted topologically, it is easy to compute numbers of ingoing paths for each node in the graph.

Optimization

Structures

There are functions and structures which should be optimized, if possible. Profiling shows that the most time consuming functions are related to the HashMap implementation:


COST CENTRE MODULE %time %alloc

insertUnsafe Data.DAWG.HashMap 7.8 10.0
deleteUnsafe Data.DAWG.HashMap 7.5 8.0
lookup Data.DAWG.HashMap 3.7 0.2
lookupUnsafe Data.DAWG.HashMap 3.7 0.3


Surprisingly, lookup and lookupUnsafe (which are evaluated almost the same number of times) need the same amount of time and memory. Nevertheless, if we substitute lookup for lookupUnsafe in the library code, number of (==) evaluations with respect to HashMap keys (automaton states) is greatly increased, so it seems that workload of HashMap functions is primarily related to the HashMap structure itself.

Other functions, which summarily consume a lot of resources, are from the Graph module:


decIngo Data.DAWG.Graph 5.9 11.3
remNode.nodeMap' Data.DAWG.Graph 4.8 6.0
remNode.ingoMap' Data.DAWG.Graph 4.0 6.0
newNode.nodeMap' Data.DAWG.Graph 3.6 6.9
newNode.ingoMap' Data.DAWG.Graph 3.2 6.9
newNode.(...) Data.DAWG.Graph 3.0 2.7
decIngo.k Data.DAWG.Graph 2.7 0.5
nodeBy Data.DAWG.Graph 2.6 0.3
newNode Data.DAWG.Graph 2.5 3.0
...


Algorithms

Another optimization which comes to mind is to change the algorithm of (path, value) insertion. We can take the number of ingoing edges into account, i.e. there's no need to clone states with only one in-transition (root state, for example), we can just modify them. But if we modify the set of transitions outgoing from the particular state, hash of the state will also change (unless we fundamentally change the method of hash computation), so perhaps its better to leave this idea for another time.

Keep weights in transition labels

In static representation of the automaton, weights should be moved from nodes to edges. It will allow to implement fast, O(|w| * log(|A|)) indexing operations, where w denotes input word and A denotes alphabet.

On the other hand, it will increase the size of the graph, since there will be additional integer number stored for each edge in the graph.

Distinguish branch/leaf nodes in static automaton

Branch and leaf (value) nodes should be distinguished also in the static automaton. Otherwise, when parsing binary representation of the automaton, value for each node will be created individually.

Improve `Dynamic.insertM` implementation

Consider optimization used in the Dynamic.Internal.insert function of the new, alternative version of the library: if the ID of the subgraph hasn't changed, there's no need to update the current node.

Node type class

Abstract node operations with a type class. This modification will allow to use the dynamic automata version with different node implementations. In particular, user will be able to decide whether the speed is more important (and use IntMap for adjacency vectors' representation) or memory (vector containers).

Add thaw function

Implement thaw function which transforms immutable automaton into a mutable one.

Insert operation in a presence of a big alphabet

Currently, edges are kept in adjacency vectors and, therefore, the insert operation works in O(|w| * |A|) where w is an input word and A is an alphabet. It should be possible to use a version with edges stored in IntMap's, which would make the insert operation work in O(|w| * log(|A|)). It is especially important in the presence of big-alphabet dictionaries.

Common interface for both versions of DAWG

Context: we want to implement approximate dictionary searching library on top of the DAWG library, but we don't know in advance which version of DAWG the end user will want to use.

Add useful features (static DAWG)

The following set of functions will be useful:

  • Return the sub-DAWG containing all keys beginning with a prefix.
  • Return the number of elements in a weighted DAWG.

Traditional weight semantics

Employ traditional semantics of weight annotations. See "Dynamic Perfect Hashing with Finite-State Automata", chapter 3.

Add tests

Or copy from the dawg-new library.

Increased size of the automaton in the program memory

Experiments suggest that size of the automaton is higher after loading it from a disk to a program memory (14MB -> 80MB, for example). Check if it is really the case (perhaps the redundant memory is allocated when loading the automaton?) and, if so, identify the cause and find the solution. Hint: try using the Storable data types.

Inconsistent level of exports

Character of functions exported from two DAWG modules is different:

  • Internal representation of the dynamic automaton is exported, internal representation of the static one is not,
  • In the static module you can see that automaton consists of a list of nodes, but there is no way to check the implementation of node.

Potential solutions:

  • Hide internal representations of both automata versions,
  • Add separate internal modules for the static automata.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.