kawu / dawg Goto Github PK
View Code? Open in Web Editor NEWDirected acyclic word graphs
License: BSD 2-Clause "Simplified" License
Directed acyclic word graphs
License: BSD 2-Clause "Simplified" License
DAWG tracks numbers of ingoing paths for each node in the automaton. It should track numbers of reachable final states instead. It will allow to use DAWG as a perfect hash automaton.
Only the following type combinations really make sense:
Therefore, a particular transition back-end can be used directly within the context of an appropriate automaton. Right now, a class-based interface is used to provide all automaton/transition type combinations, which results in a rather ugly library interface.
It might be a good idea to implement vector containers (Data.Vector.Map, Data.Vector.Set, etc.) as a separate library.
Topological sorting is needed when translating an immutable automaton to a mutable one. When the graph is sorted topologically, it is easy to compute numbers of ingoing paths for each node in the graph.
There are functions and structures which should be optimized, if possible. Profiling shows that the most time consuming functions are related to the HashMap implementation:
COST CENTRE MODULE %time %alloc
insertUnsafe Data.DAWG.HashMap 7.8 10.0
deleteUnsafe Data.DAWG.HashMap 7.5 8.0
lookup Data.DAWG.HashMap 3.7 0.2
lookupUnsafe Data.DAWG.HashMap 3.7 0.3
Surprisingly, lookup and lookupUnsafe (which are evaluated almost the same number of times) need the same amount of time and memory. Nevertheless, if we substitute lookup for lookupUnsafe in the library code, number of (==) evaluations with respect to HashMap keys (automaton states) is greatly increased, so it seems that workload of HashMap functions is primarily related to the HashMap structure itself.
Other functions, which summarily consume a lot of resources, are from the Graph module:
decIngo Data.DAWG.Graph 5.9 11.3
remNode.nodeMap' Data.DAWG.Graph 4.8 6.0
remNode.ingoMap' Data.DAWG.Graph 4.0 6.0
newNode.nodeMap' Data.DAWG.Graph 3.6 6.9
newNode.ingoMap' Data.DAWG.Graph 3.2 6.9
newNode.(...) Data.DAWG.Graph 3.0 2.7
decIngo.k Data.DAWG.Graph 2.7 0.5
nodeBy Data.DAWG.Graph 2.6 0.3
newNode Data.DAWG.Graph 2.5 3.0
...
Another optimization which comes to mind is to change the algorithm of (path, value) insertion. We can take the number of ingoing edges into account, i.e. there's no need to clone states with only one in-transition (root state, for example), we can just modify them. But if we modify the set of transitions outgoing from the particular state, hash of the state will also change (unless we fundamentally change the method of hash computation), so perhaps its better to leave this idea for another time.
In static representation of the automaton, weights should be moved from nodes to edges. It will allow to implement fast, O(|w| * log(|A|)) indexing operations, where w denotes input word and A denotes alphabet.
On the other hand, it will increase the size of the graph, since there will be additional integer number stored for each edge in the graph.
Branch and leaf (value) nodes should be distinguished also in the static automaton. Otherwise, when parsing binary representation of the automaton, value for each node will be created individually.
Consider optimization used in the Dynamic.Internal.insert
function of the new, alternative version of the library: if the ID of the subgraph hasn't changed, there's no need to update the current node.
It looks difficult to carry out because the generalized version of the Node module works much slower in the context of dynamic automata computations.
Functions should take the current root into account.
It should be possible to compute number of internal automaton nodes and automaton leaves separately.
Abstract node operations with a type class. This modification will allow to use the dynamic automata version with different node implementations. In particular, user will be able to decide whether the speed is more important (and use IntMap for adjacency vectors' representation) or memory (vector containers).
Implement thaw function which transforms immutable automaton into a mutable one.
Currently, edges are kept in adjacency vectors and, therefore, the insert operation works in O(|w| * |A|) where w is an input word and A is an alphabet. It should be possible to use a version with edges stored in IntMap's, which would make the insert operation work in O(|w| * log(|A|)). It is especially important in the presence of big-alphabet dictionaries.
Context: we want to implement approximate dictionary searching library on top of the DAWG library, but we don't know in advance which version of DAWG the end user will want to use.
The following set of functions will be useful:
Or copy from the dawg-new
library.
Employ traditional semantics of weight annotations. See "Dynamic Perfect Hashing with Finite-State Automata", chapter 3.
Or copy from the dawg-new
library.
These functions are just synonyms.
Experiments suggest that size of the automaton is higher after loading it from a disk to a program memory (14MB -> 80MB, for example). Check if it is really the case (perhaps the redundant memory is allocated when loading the automaton?) and, if so, identify the cause and find the solution. Hint: try using the Storable data types.
Like: intersection, union
It will involve optimizing the VMap module.
Character of functions exported from two DAWG modules is different:
Potential solutions:
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.