dawg,kawu

Count reachable states instead of ingoing paths

DAWG tracks numbers of ingoing paths for each node in the automaton. It should track numbers of reachable final states instead. It will allow to use DAWG as a perfect hash automaton.

Simplify library interface

Only the following type combinations really make sense:

Static automaton / Vector-based transition map.
Dynamic automaton / Map-based transition map.

Therefore, a particular transition back-end can be used directly within the context of an appropriate automaton. Right now, a class-based interface is used to provide all automaton/transition type combinations, which results in a rather ugly library interface.

Move vector containers into separate library

It might be a good idea to implement vector containers (Data.Vector.Map, Data.Vector.Set, etc.) as a separate library.

Try using Data.Graph module for topological sorting

Topological sorting is needed when translating an immutable automaton to a mutable one. When the graph is sorted topologically, it is easy to compute numbers of ingoing paths for each node in the graph.

Optimization

Structures

There are functions and structures which should be optimized, if possible. Profiling shows that the most time consuming functions are related to the HashMap implementation:

COST CENTRE MODULE %time %alloc

insertUnsafe Data.DAWG.HashMap 7.8 10.0
deleteUnsafe Data.DAWG.HashMap 7.5 8.0
lookup Data.DAWG.HashMap 3.7 0.2
lookupUnsafe Data.DAWG.HashMap 3.7 0.3

Surprisingly, lookup and lookupUnsafe (which are evaluated almost the same number of times) need the same amount of time and memory. Nevertheless, if we substitute lookup for lookupUnsafe in the library code, number of (==) evaluations with respect to HashMap keys (automaton states) is greatly increased, so it seems that workload of HashMap functions is primarily related to the HashMap structure itself.

Other functions, which summarily consume a lot of resources, are from the Graph module:

decIngo Data.DAWG.Graph 5.9 11.3
remNode.nodeMap' Data.DAWG.Graph 4.8 6.0
remNode.ingoMap' Data.DAWG.Graph 4.0 6.0
newNode.nodeMap' Data.DAWG.Graph 3.6 6.9
newNode.ingoMap' Data.DAWG.Graph 3.2 6.9
newNode.(...) Data.DAWG.Graph 3.0 2.7
decIngo.k Data.DAWG.Graph 2.7 0.5
nodeBy Data.DAWG.Graph 2.6 0.3
newNode Data.DAWG.Graph 2.5 3.0
...

Algorithms

Another optimization which comes to mind is to change the algorithm of (path, value) insertion. We can take the number of ingoing edges into account, i.e. there's no need to clone states with only one in-transition (root state, for example), we can just modify them. But if we modify the set of transitions outgoing from the particular state, hash of the state will also change (unless we fundamentally change the method of hash computation), so perhaps its better to leave this idea for another time.

Keep weights in transition labels

In static representation of the automaton, weights should be moved from nodes to edges. It will allow to implement fast, O(|w| * log(|A|)) indexing operations, where w denotes input word and A denotes alphabet.

On the other hand, it will increase the size of the graph, since there will be additional integer number stored for each edge in the graph.

Distinguish branch/leaf nodes in static automaton

Branch and leaf (value) nodes should be distinguished also in the static automaton. Otherwise, when parsing binary representation of the automaton, value for each node will be created individually.

Improve `Dynamic.insertM` implementation

Consider optimization used in the Dynamic.Internal.insert function of the new, alternative version of the library: if the ID of the subgraph hasn't changed, there's no need to update the current node.

Use one Node data type in both automata implementations

It looks difficult to carry out because the generalized version of the Node module works much slower in the context of dynamic automata computations.

Amend numStates and numEdges functions

Functions should take the current root into account.

Distinguish numNodes and numLeaves functions

It should be possible to compute number of internal automaton nodes and automaton leaves separately.

Implement perfect hashing function for dictionary entries

Node type class

Abstract node operations with a type class. This modification will allow to use the dynamic automata version with different node implementations. In particular, user will be able to decide whether the speed is more important (and use IntMap for adjacency vectors' representation) or memory (vector containers).

Add thaw function

Implement thaw function which transforms immutable automaton into a mutable one.

Insert operation in a presence of a big alphabet

Currently, edges are kept in adjacency vectors and, therefore, the insert operation works in O(|w| * |A|) where w is an input word and A is an alphabet. It should be possible to use a version with edges stored in IntMap's, which would make the insert operation work in O(|w| * log(|A|)). It is especially important in the presence of big-alphabet dictionaries.

Common interface for both versions of DAWG

Context: we want to implement approximate dictionary searching library on top of the DAWG library, but we don't know in advance which version of DAWG the end user will want to use.

Add useful features (static DAWG)

The following set of functions will be useful:

Return the sub-DAWG containing all keys beginning with a prefix.
Return the number of elements in a weighted DAWG.

Add phantom type for character representation

Add benchmarks

Or copy from the dawg-new library.

Traditional weight semantics

Employ traditional semantics of weight annotations. See "Dynamic Perfect Hashing with Finite-State Automata", chapter 3.

Add tests

Or copy from the dawg-new library.

Function `submap` should return value of type Maybe

Implement frozen version of DAWG

Remove hash and unHash functions

These functions are just synonyms.

Increased size of the automaton in the program memory

Experiments suggest that size of the automaton is higher after loading it from a disk to a program memory (14MB -> 80MB, for example). Check if it is really the case (perhaps the redundant memory is allocated when loading the automaton?) and, if so, identify the cause and find the solution. Hint: try using the Storable data types.

Internal representation of the dynamic automaton is exported, internal representation of the static one is not,
In the static module you can see that automaton consists of a list of nodes, but there is no way to check the implementation of node.

Potential solutions:

Hide internal representations of both automata versions,
Add separate internal modules for the static automata.

kawu / dawg Goto Github PK

dawg's People

Contributors

Stargazers

Watchers

Forkers

dawg's Issues

Structures

Algorithms

Recommend Projects

Recommend Topics

Recommend Org