bzaar / dawgsharp
DAWG String Dictionary in C#
Home Page: http://www.nuget.org/packages/DawgSharp/
License: GNU General Public License v3.0
... and verify that this actually speeds things up on a typical 4-core system. Excessive locking might negate all the benefits of parallelism. Try locking at the Node and DawgBuilder levels and see what difference it makes.
Rewrite lookup code in IL to match the performance of a Dictionary. Use the IL Support VS extension. Or maybe this? https://www.codeproject.com/articles/438868/inline-msil-in-csharp-vb-net-and-generic-pointers
Performance doesn't seem to be an issue at the moment, so I'm leaving this here as a suggestion. Let me know in the comments if DawgSharp is falling short for your project because of performance. Or just upvote this issue.
I'm trying to get this to work in VS2008 so I've had to tweak the code a bit (to replace the => syntax with normal properties). Now it compiles; but when I try your simple Usage example from the GitHub project page, the Dawg.Save call fails:
"DawgSharp.Dawg does not contain a definition for Save..."
And indeed, the only Dawg.Save method is one with private accessibility and two parameters:
private void Save (Stream stream, Action <OldDawg, BinaryWriter> save)
Looks like I should use SaveTo instead (a public method) -- so does your sample code need fixing?
When the payloads are strings or objects, it may make sense to write each distinct payload once and store an index wherever a payload is referenced.
This is where payloads are saved currently:
https://github.com/bzaar/DawgSharp/blob/master/DawgSharp/OldDawg.cs#L107
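The idea of writing each payload once can be sketched as follows. This is only an illustration under stated assumptions, not DawgSharp's actual file format: `PayloadTable` is a hypothetical helper, payloads are assumed to be non-null strings, and indexes are written as plain 32-bit integers.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of DawgSharp.
public static class PayloadTable
{
    // Writes each distinct payload once, then one index per original payload.
    public static void Write(BinaryWriter writer, IReadOnlyList<string> payloads)
    {
        var slotOf = new Dictionary<string, int>();
        var distinct = new List<string>();
        var slots = new int[payloads.Count];

        for (int i = 0; i < payloads.Count; i++)
        {
            if (!slotOf.TryGetValue(payloads[i], out int slot))
            {
                slot = distinct.Count;          // first time we see this payload
                slotOf.Add(payloads[i], slot);
                distinct.Add(payloads[i]);
            }
            slots[i] = slot;
        }

        writer.Write(distinct.Count);
        foreach (string payload in distinct) writer.Write(payload);

        writer.Write(slots.Length);
        foreach (int s in slots) writer.Write(s);
    }

    // Reads the table back and expands the indexes into payloads again.
    public static string[] Read(BinaryReader reader)
    {
        var distinct = new string[reader.ReadInt32()];
        for (int i = 0; i < distinct.Length; i++) distinct[i] = reader.ReadString();

        var payloads = new string[reader.ReadInt32()];
        for (int i = 0; i < payloads.Length; i++) payloads[i] = distinct[reader.ReadInt32()];
        return payloads;
    }
}
```

The saving depends on how often payloads repeat; with mostly unique payloads the extra index table would make the file larger.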
This will involve two passes. In pass 1, it computes the hashes for all nodes. A hash for a node consists of the payload hash plus the hashes of all children's chars plus the hashes of all child nodes, i.e. it is recursive.
In pass 2, it uses the hashes to compare and merge the nodes efficiently.
Both passes are bottom-up, so each node is visited only twice.
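A minimal sketch of the two passes described above, on a stand-alone trie rather than DawgSharp's own Node class. `TrieNode`, `NodeMerger` and `ShallowComparer` are hypothetical names, a `bool` stands in for the payload, and the hash formula is just one reasonable choice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class TrieNode
{
    public bool IsWord;   // stands in for the payload
    public SortedDictionary<char, TrieNode> Children = new SortedDictionary<char, TrieNode>();
    public int Hash;      // filled in by pass 1
}

public static class NodeMerger
{
    public static void Insert(TrieNode root, string word)
    {
        TrieNode node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out TrieNode child))
                node.Children[c] = child = new TrieNode();
            node = child;
        }
        node.IsWord = true;
    }

    // Pass 1 (bottom-up): combine the payload hash, child chars and child hashes.
    public static int ComputeHash(TrieNode node)
    {
        int hash = node.IsWord ? 1 : 0;
        foreach (var pair in node.Children)
        {
            hash = unchecked(hash * 31 + pair.Key);
            hash = unchecked(hash * 31 + ComputeHash(pair.Value));
        }
        return node.Hash = hash;
    }

    // Pass 2 (bottom-up): replace each node by the first equivalent node seen.
    public static TrieNode Merge(TrieNode node, Dictionary<TrieNode, TrieNode> canonical)
    {
        foreach (char c in node.Children.Keys.ToList())
            node.Children[c] = Merge(node.Children[c], canonical);

        if (canonical.TryGetValue(node, out TrieNode existing)) return existing;
        canonical.Add(node, node);
        return node;
    }

    // Children are already canonical when a node is compared,
    // so comparing child references is enough.
    public sealed class ShallowComparer : IEqualityComparer<TrieNode>
    {
        public int GetHashCode(TrieNode n) => n.Hash;

        public bool Equals(TrieNode a, TrieNode b) =>
            a.IsWord == b.IsWord &&
            a.Children.Count == b.Children.Count &&
            a.Children.Zip(b.Children,
                (x, y) => x.Key == y.Key && ReferenceEquals(x.Value, y.Value)).All(ok => ok);
    }
}
```

Usage would be `ComputeHash(root)` followed by `Merge(root, new Dictionary<TrieNode, TrieNode>(new NodeMerger.ShallowComparer()))`. Both passes here are recursive, so very long keys would need an explicit stack instead.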
Hello,
I'm thinking about recommending this to be used in an upcoming project, but I can't use GPL code. Could you release this as possibly LGPL? I'd be happy to maybe help contribute to the project if I find problems with the code (if this DAWG implementation gets selected), but the licensing thing is a showstopper.
Great library, by the way! Thanks for considering!
Greetings! I'm writing in Russian because I see that you are Russian and the question concerns the Russian language.
The Python DAWG implementation has a search with replacements, which finds words containing ё even when they are written with е. Is there anything like that here?
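One way to approximate a search with replacements on top of any exact-match lookup is to generate the alternative spellings of the query and test each one. The sketch below assumes nothing about DawgSharp itself: `ReplacementSearch` is a hypothetical helper, and `contains` stands in for whatever exact lookup the dictionary offers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of DawgSharp.
public static class ReplacementSearch
{
    // Yields the query itself plus every spelling obtained by swapping
    // a replaceable character (e.g. 'е') for its alternative (e.g. 'ё').
    public static IEnumerable<string> Variants(
        string query, IReadOnlyDictionary<char, char> replacements, int start = 0)
    {
        for (int i = start; i < query.Length; i++)
        {
            if (replacements.TryGetValue(query[i], out char alternative))
            {
                string swapped = query.Substring(0, i) + alternative + query.Substring(i + 1);
                // Branch: keep the original character, or use the alternative.
                return Variants(query, replacements, i + 1)
                    .Concat(Variants(swapped, replacements, i + 1));
            }
        }
        return new[] { query };
    }

    // 'contains' stands in for an exact-match lookup in the dictionary.
    public static IEnumerable<string> Find(
        string query, Func<string, bool> contains, IReadOnlyDictionary<char, char> replacements) =>
        Variants(query, replacements).Where(contains);
}
```

With `new Dictionary<char, char> { { 'е', 'ё' } }` as the replacement table, the query "еж" would also be tried as "ёж". The number of variants doubles with every replaceable character, so a built-in implementation would more likely branch while walking the graph instead.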
Hi there,
Great library, and I'm considering it for commercial use; however, there are a few things I need to clear up.
How can I use it to solve anagrams? Suppose I have random letters like a, c, x, y and I want to see how many words can be formed using only these 4 letters. Which function should I use?
Thanks
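The question above is usually answered by walking the letter graph while keeping a count of the letters still available. The following is a minimal sketch on a plain stand-alone trie; `LetterTrie` is a hypothetical class and does not use DawgSharp's API.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-alone trie, not part of DawgSharp.
public sealed class LetterTrie
{
    private readonly Dictionary<char, LetterTrie> children = new Dictionary<char, LetterTrie>();
    private bool isWord;

    public void Add(string word)
    {
        LetterTrie node = this;
        foreach (char c in word)
        {
            if (!node.children.TryGetValue(c, out LetterTrie child))
                node.children[c] = child = new LetterTrie();
            node = child;
        }
        node.isWord = true;
    }

    // Returns every stored word that can be spelled from the given letters,
    // using each letter at most as many times as it appears in 'letters'.
    public List<string> WordsFrom(string letters)
    {
        var available = new Dictionary<char, int>();
        foreach (char c in letters)
            available[c] = available.TryGetValue(c, out int n) ? n + 1 : 1;

        var results = new List<string>();
        Search(this, available, new char[letters.Length], 0, results);
        return results;
    }

    private static void Search(LetterTrie node, Dictionary<char, int> available,
                               char[] prefix, int length, List<string> results)
    {
        if (node.isWord) results.Add(new string(prefix, 0, length));

        foreach (var pair in node.children)
        {
            if (!available.TryGetValue(pair.Key, out int count) || count == 0) continue;

            available[pair.Key] = count - 1;   // take the letter
            prefix[length] = pair.Key;
            Search(pair.Value, available, prefix, length + 1, results);
            available[pair.Key] = count;       // put it back
        }
    }
}
```

Only branches whose letters are still available are followed, so the search never visits more nodes than the supplied letters allow. To require that all letters are used, keep only results whose length equals `letters.Length`.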
BuildDawg failed using the word list from https://raw.github.com/eneko/data-repository/master/data/words.txt due to StackOverflowException generated from LevelBuilder.GetLevel method.
Hi there,
This amazing library already has MatchPrefix, which covers half of the cases. Can you please either provide MatchSuffix or advise how to implement this?
Thanks
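A common way to get suffix matching from a structure that only supports prefix matching is to build a second dictionary from the reversed words and query it with the reversed suffix. The sketch below assumes only that: `SuffixSearch` is a hypothetical helper, and `matchPrefixInReversed` stands in for a prefix query against that second dictionary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of DawgSharp.
public static class SuffixSearch
{
    // Reverses UTF-16 code units; surrogate pairs would need extra care.
    public static string Reverse(string s)
    {
        char[] chars = s.ToCharArray();
        Array.Reverse(chars);
        return new string(chars);
    }

    // 'matchPrefixInReversed' stands in for a prefix query against a second
    // dictionary that was built from Reverse(word) for every word.
    public static IEnumerable<string> MatchSuffix(
        string suffix, Func<string, IEnumerable<string>> matchPrefixInReversed) =>
        matchPrefixInReversed(Reverse(suffix)).Select(w => Reverse(w));
}
```

The cost is a second dictionary in memory, although a DAWG of reversed words tends to compress shared prefixes of the original words just as the forward one compresses shared suffixes.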
Hello,
I am receiving this error when trying to load a Dawg bin file:
"Unable to read beyond the end of the stream"
Here is how I am creating the dawg file:
string path = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_words.txt";
var s = File.ReadAllLines(path);
var dawgBuilder = new DawgSharp.DawgBuilder<bool>();
foreach (var line in s)
{
    dawgBuilder.Insert(line, true);
}
var dawgpath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_dawg.bin";
var dawg = dawgBuilder.BuildDawg();
dawg.SaveTo(File.Create(dawgpath));
And here is how I am loading the dawg file:
var fpath = Environment.GetFolderPath(Environment.SpecialFolder.Desktop) + "/EN_dawg.bin";
var file = File.Open(fpath, FileMode.Open);
dawgg = Dawg<bool>.Load(file);
The error occurs in YaleDawg.cs at line 66.
Thank you very much for the attention.
My main use case involved 2.5M words that were represented using 150K nodes. But other users might have < 32K nodes in which case they may benefit from a 2x memory footprint cut (as if we haven't done enough!) It might be worth taking this number up to 64K-1 as we only need one special value.
If we are doing this, we may also want to think of users who might have more than 2G nodes and add a ulong version. Unlikely, but future-ready. Add a long version of GetNodeCount ().
I.e. add another type parameter to YaleDawg<TPayload>, TIndex, which can be byte, ushort, uint or ulong, and instantiate the required specialization based on the total number of nodes.
The Node class has a children field which is a Dictionary (essentially a hash table). This is not particularly memory efficient. It could be replaced with a simple sorted array (Node[]) and use BinarySearch for lookups.
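A minimal sketch of the sorted-array idea, assuming children are fixed once a node is built. `CompactNode` is a hypothetical class, and it uses two parallel arrays (characters and child nodes) rather than a single Node[], which is one possible layout among several.

```csharp
using System;

// Hypothetical class, not DawgSharp's Node.
public sealed class CompactNode
{
    // Parallel arrays, both ordered by character, replace the per-node hash table.
    private readonly char[] keys;
    private readonly CompactNode[] children;

    public CompactNode(char[] sortedKeys, CompactNode[] childrenInKeyOrder)
    {
        keys = sortedKeys;
        children = childrenInKeyOrder;
    }

    // Returns the child reached by 'c', or null when there is none.
    public CompactNode GetChild(char c)
    {
        int index = Array.BinarySearch(keys, c);   // negative when c is absent
        return index >= 0 ? children[index] : null;
    }
}
```

Lookups become O(log k) in the number of children instead of O(1), but most nodes have only a handful of children, so the memory saved per node is likely to matter more than the extra comparisons.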
I am trying to store a List<string> in the value field. I am using the SaveTo overload which accepts a custom action to serialize the type. The file is successfully written. When I try to read in the file using the overload of the Load method I get the exception:
System.IO.EndOfStreamException: 'Unable to read beyond the end of the stream.'
I know the List payload is successfully deserialized in the Func as I can break before returning and see the List containing the expected data.
I am able to reproduce the exception in Visual Studio 16.6.0 Preview 2 and LinqPad 6 on a Windows 10 machine.
private static void Main(string[] args)
{
    var dawgBuilder = new DawgBuilder<List<string>>();
    dawgBuilder.Insert("test", new List<string> { "test", "hello", "world" });
    var dawg = dawgBuilder.BuildDawg();
    using (var file = File.Create(@"E:\Data\Dawg\DAWG-Test.bin"))
        dawg.SaveTo(file, new Action<BinaryWriter, List<string>>((r, payload) =>
        {
            byte[] bytes = null;
            BinaryFormatter bf = new BinaryFormatter();
            using (MemoryStream ms = new MemoryStream())
            {
                bf.Serialize(ms, payload);
                bytes = ms.ToArray();
            }
            r.Write(bytes, 0, bytes.Length);
        }));
    dawg = Dawg<List<string>>.Load(File.Open(@"E:\Data\Dawg\DAWG-Test.bin", FileMode.Open), new Func<BinaryReader, List<string>>(r =>
    {
        List<string> result = null;
        using (var ms = new MemoryStream())
        {
            r.BaseStream.CopyTo(ms);
            BinaryFormatter bf = new BinaryFormatter();
            ms.Seek(0, SeekOrigin.Begin);
            result = (List<string>)bf.Deserialize(ms);
        }
        return result;
    }));
}
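One plausible cause of the exception in the repro above is that the read callback copies the entire remainder of `r.BaseStream`, which would also consume any bytes stored after the payload. A payload reader generally has to consume exactly what the matching writer produced. The sketch below shows a length-prefixed pair of callbacks with the same shapes as the ones in the repro; `ListPayload` is a hypothetical helper, and it writes the strings directly instead of going through BinaryFormatter.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of DawgSharp.
public static class ListPayload
{
    // Writes the element count first so the reader knows exactly how much to consume.
    public static void Write(BinaryWriter writer, List<string> payload)
    {
        writer.Write(payload.Count);
        foreach (string item in payload) writer.Write(item);
    }

    // Reads back only the bytes written by Write, leaving the rest of the stream untouched.
    public static List<string> Read(BinaryReader reader)
    {
        int count = reader.ReadInt32();
        var payload = new List<string>(count);
        for (int i = 0; i < count; i++) payload.Add(reader.ReadString());
        return payload;
    }
}
```

These methods match the `Action<BinaryWriter, List<string>>` and `Func<BinaryReader, List<string>>` delegates used in the repro, so they could presumably be passed to the same SaveTo and Load overloads in place of the lambdas.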
I am trying this simple piece of code, for benchmarking purposes, but I get an OutOfMemoryException when reaching around 350K elements inserted.
DawgBuilder<bool> dawgBuilder = new DawgBuilder<bool>();
for (int i = 0; i < 10000000; i++)
{
    dawgBuilder.Insert(Guid.NewGuid().ToString(), true);
}
DawgBuilder.TryGetValue() might return true even if the queried key has never been added. E.g.:
var builder = new DawgBuilder<string>();
builder.Insert("dates", "dates");
bool b = builder.TryGetValue("date", out var v);
Assert.Null(v); // Okay
Assert.True(b); // that's unexpected
I'd expect b to be false here because "date" was not added and no value was obtained by TryGetValue().
I want to pick a random word from the dictionary, but there is no method that helps with that. Since you developed this amazing library, can you please advise how you would approach such a requirement?
Thanks
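If the dictionary can enumerate its words (for example through a prefix query with an empty prefix, which is an assumption about the API rather than something stated here), a uniformly random word can be picked in a single pass with reservoir sampling. `RandomWord` below is a hypothetical helper that works on any `IEnumerable<string>`.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of DawgSharp.
public static class RandomWord
{
    // Reservoir sampling with a reservoir of one: a single pass, uniform over
    // all words, without needing to know the word count in advance.
    public static string Pick(IEnumerable<string> words, Random random)
    {
        string chosen = null;
        int seen = 0;
        foreach (string word in words)
        {
            seen++;
            if (random.Next(seen) == 0) chosen = word;   // keep with probability 1/seen
        }
        return chosen;   // null when the sequence is empty
    }
}
```

This visits every word on each call. If random picks are frequent, storing a running index as the payload and drawing a random index would avoid the full enumeration.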
Hi there,
Currently MatchSuffix returns all matching words, which may not be ideal in some situations. Please provide an optional maxCount argument so that, for example, passing 10 returns at most 10 words.
Thanks
...by caching the hash codes inside NodeWrappers.
and migrate to VS 2017 project format.