Comments (6)
The short answer is that, as things stand, fuzzy matching will do the trick for you. I think the confusion comes from your use of wildcards in the matching. This fiddle has some working code for you to play with:
var index = new FullTextIndexBuilder<int>()
    .WithQueryParser(o => o.AssumeFuzzySearchTerms())
    .Build();
await index.AddAsync(1, "Murphy's law");
Console.WriteLine(index.Search("Murphys").Count());
// Prints "1"
Console.WriteLine(index.Search("Murphy's").Count());
// Prints "1"
But I feel that your question deserves a bit of a deeper dive into what's going on.
By default LIFTI will split words on punctuation, including apostrophes. This means that (rightly or wrongly) "Murphy's law" will actually get tokenized as three words:
- Murphy
- s
- law
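To make the splitting concrete, here is a minimal sketch in plain .NET that breaks text on whitespace and punctuation. It is not LIFTI's actual tokenizer; `SplitOnPunctuation` is a hypothetical helper written only to illustrate why the apostrophe produces a separate "s" token.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; this is NOT LIFTI's tokenizer.
// It ends the current token whenever it sees whitespace or punctuation.
static List<string> SplitOnPunctuation(string text)
{
    var tokens = new List<string>();
    var current = new StringBuilder();

    foreach (char c in text)
    {
        if (char.IsWhiteSpace(c) || char.IsPunctuation(c))
        {
            // A separator closes the token being built (if there is one).
            if (current.Length > 0)
            {
                tokens.Add(current.ToString());
                current.Clear();
            }
        }
        else
        {
            current.Append(c);
        }
    }

    // Flush the final token.
    if (current.Length > 0)
    {
        tokens.Add(current.ToString());
    }

    return tokens;
}

Console.WriteLine(string.Join(", ", SplitOnPunctuation("Murphy's law")));
// Prints "Murphy, s, law"
```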
When you search for "Murphy's", the query is actually parsed as "Murphy & s" because search terms are split using the same tokenization. You can see this using the following code (with fuzzy matching enabled by default):
var query = index.QueryParser.Parse(index.FieldLookup, "Murphy's", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());
// Prints "?3,?MURPHY & ?0,?S"
This is searching for documents containing fuzzy matches for "murphy" and "s" (although because "s" is so short, only an exact match would be acceptable - the zero in ?0,? states that no edits are allowed).
When you add wildcards into the mix, for your first example you get this:
var query = index.QueryParser.Parse(index.FieldLookup, "*Murphy's*", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());
// Prints "*MURPHY & S*"
This essentially means documents containing any word ending with "murphy" and any word starting with "s". By coincidence, this will match your document.
But your second search looks like this:
var query = index.QueryParser.Parse(index.FieldLookup, "*Murphys*", index.DefaultTokenizer);
Console.WriteLine(query.Root.ToString());
// Prints "*MURPHYS*"
This means documents with any word that contains "murphys" - and that won't match anything in your index.
from lifti.
Thanks very much @mikegoatly. I have been playing around with it a bit, and I have it working pretty well with the fuzzy logic. I also mixed in the wildcard search like so:
?1?query | *query*
This is so I can match on "murphys", "murphy's" but also "mur". I found the default fuzzy matching was returning too many irrelevant results in my database. I'll tweak this over time once I get used to the library a bit more.
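A minimal sketch of building that combined query text in code follows. `BuildFuzzyOrWildcardQuery` is a hypothetical helper, not part of LIFTI; it only formats the query string shown above and assumes the term is a single word with no characters that are special in the query syntax.

```csharp
using System;

// Hypothetical helper, not part of LIFTI: formats the combined query text.
// Assumes `term` is a single word with no query-syntax special characters.
static string BuildFuzzyOrWildcardQuery(string term)
{
    // "?1?term" is the fuzzy clause exactly as written in the comment above;
    // "*term*" is the wildcard clause; "|" joins the two alternatives.
    return $"?1?{term} | *{term}*";
}

Console.WriteLine(BuildFuzzyOrWildcardQuery("mur"));
// Prints "?1?mur | *mur*"
```

The resulting string can be passed to index.Search(...) in the same way as the earlier examples.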
Thanks again!
from lifti.
No problem at all, glad to help.
One thing to be aware of is that using a * at the start of your search terms isn't particularly efficient for large indexes at the moment because of the way the index structure has to be recursively scanned to find the first character to match. The queries will be faster if you can just use a wildcard at the end, e.g. mur*, although you may not notice if your index isn't particularly big.
from lifti.
At the moment, this particular index is only 2,000 records, and will probably never be any more than 10,000. I assume they would be considered small numbers?
from lifti.
It'll probably be fine - it was more just something to be aware of. Let me know if you run into any problems though.
from lifti.
Closing this as the question is resolved. Feel free to raise another issue if anything else comes up.
from lifti.