resin,kreeben

Parse query into doubly-chained linked list

Regarding Resin's QL: A plus sign means "AND". A space means "OR". A minus sign means "NOT.

The QL currently doesn't allow for grouping/nesting. We need nesting to be able to rewrite this fuzzy query of two terms:

+title:religion +body:jesus~

into these three terms:

+title:religion +(body:jesus body:jesuz)

Let each term (query clause) be a node in a doubly-chained linked list, let left be down and let right be forward. The depth of a node will represent the nesting level.

Implement: Search across multiple collections

Re-use Query when mapping across multiple ReadSessions.

Implement: Delete doc from index

Re-balance search tree upon insert

The root of the tree should be in the center so the tree is split in half upon first traversal step.

Application startup exception: System.PlatformNotSupportedException: The named version of this synchronization primitive is not supported on this platform.

I'm getting the following error while trying to run Http server in macos.

Application startup exception: System.PlatformNotSupportedException: The named version of this synchronization primitive is not supported on this platform.
   at System.Threading.Semaphore.CreateSemaphore(Int32 initialCount, Int32 maximumCount, String name)
   at System.Threading.Semaphore..ctor(Int32 initialCount, Int32 maximumCount, String name, Boolean& createdNew)
   at Sir.Store.SessionFactory..ctor(ITokenizer tokenizer, IConfigurationProvider config) in /Users/kshitij/github/resin/src/Sir.Store/Session/SessionFactory.cs:line 36
   at Sir.Store.Start.OnApplicationStartup(IServiceCollection services, ServiceProvider serviceProvider, IConfigurationProvider config) in /Users/kshitij/github/resin/src/Sir.Store/Start.cs:line 16
   at Sir.HttpServer.ServiceConfiguration.Configure(IServiceCollection services) in /Users/kshitij/github/resin/src/Sir.HttpServer/ServiceConfiguration.cs:line 66
   at Sir.HttpServer.Startup.ConfigureServices(IServiceCollection services) in /Users/kshitij/github/resin/src/Sir.HttpServer/Startup.cs:line 26
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Hosting.ConventionBasedStartup.ConfigureServices(IServiceCollection services)
   at Microsoft.AspNetCore.Hosting.Internal.WebHost.EnsureApplicationServices()
   at Microsoft.AspNetCore.Hosting.Internal.WebHost.Initialize()
--- End of stack trace from previous location where exception was thrown ---
   at Microsoft.AspNetCore.Hosting.Internal.WebHost.BuildApplication()

Dotnet version

01:13 $ dotnet --info
.NET Core SDK (reflecting any global.json):
 Version:   2.2.107
 Commit:    2212cac826

Runtime Environment:
 OS Name:     Mac OS X
 OS Version:  10.14
 OS Platform: Darwin
 RID:         osx.10.14-x64
 Base Path:   /usr/local/share/dotnet/sdk/2.2.107/

Host (useful for support):
  Version: 2.2.5
  Commit:  0a3c9209c0

.NET Core SDKs installed:
  2.2.106 [/usr/local/share/dotnet/sdk]
  2.2.107 [/usr/local/share/dotnet/sdk]

.NET Core runtimes installed:
  Microsoft.AspNetCore.All 2.2.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.All 2.2.5 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.All]
  Microsoft.AspNetCore.App 2.2.4 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 2.2.5 [/usr/local/share/dotnet/shared/Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 2.2.4 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]
  Microsoft.NETCore.App 2.2.5 [/usr/local/share/dotnet/shared/Microsoft.NETCore.App]

To install additional .NET Core runtimes or SDKs:
  https://aka.ms/dotnet-download

I have tried to run the server on tag v0.3a and it worked completely fine.

@kreeben - Can you please take a look at it.
I have some experience working with C# and I'm ready to help if you could guide me.

And I want to thank you for making this open source.

Use relative paths only and Path.Combine (for Linux compatibility)

This includes readers, writers and tests.

store datetime as utc

see https://github.com/kreeben/resin/blob/master/src/ResinCore/Field.cs#L30

Shouldn't the datetime value be stored as utc?

ps. numeric values are also stored using the culture specific ToString().

Split IScoringScheme in two: factory and scheme

Add wasCompressed flag to IxInfo

So that readers never have to know beforehand if a index was compressed or not.

Implement collation

Lexiographical ordering of keys is currently achieved by adhering to the Unicode ordering of characters. This will not work for all cultures.

Wrap all document-related readers and writers in IDocReader/Writer to make the storage engine pluggable

Let there be a IDocumentReader and IDocumentWriter for all document operations so that the storage engine becomes pluggable.

Migrate to Core

Create Core sln
Add all files from v4 sln
Add JSON.NET and log4net Core Nuget packages
Unit testing for Core? Right now I'm using nunit 2.

Let v4 and Core solutions live in same repo.

Should be it.

faceted search queries

Can we expect faceting features like in solr?

Demo page (didyougogo.com) fails to load

The GET request seems to fail with a timeout.

Write char constants as either character-literals or integer-literals cast to char

You'll find ' ', ':', '~' and more in the code. Mostly query parsing does this.

Wireframes for GUI á la Luke/Management Studio

Honor surrogate pairs when analysing

https://msdn.microsoft.com/en-us/library/system.globalization.stringinfo(v=vs.110).aspx

Split long strings

Let the analyzer split long strings to make tries shallow.

Create abstraction layer for System.IO

Today we want to be able to promise that a commit is acctually commited. We can do this by making the same promise to our clients as the file system make to us. In other words, we can write a file to disk, verify it's really there, commited and in a readable state and then we can tell our client that the scope of our promise has been fulfilled.

Later, we want to make other types of promises that the file system cannot subscribe to, e.g. we want to promise that data has been persisted not only one machine but on two and that both machines have the data in a readable state.

Thus the need for a file system abstraction layer.

change file extension doc into unique extension

doc is already used by Microsoft Word. Maybe use a different extension? .rsin ?

Implement "delete by term"

There is a delete by pk operation that might be interesting to have a look at. If one resolves a term into primary keys, one could reuse the existing delete operation.

Implement bm25

Replace log4net with microsoft.extensions.logging.abstractions

Please replace log4net as a dependency with https://www.nuget.org/packages/Microsoft.Extensions.Logging.Abstractions
That allows you to use the logger of your application instead hard depending on a specific logger. It's up to the application to configure logging not the library.

[Questions] Architecture documentation

Hi

Very interesting project.

We also looking for a good search engine for Location based data (mostly addresses) and now we are working on a Elastic based version but would be great to create other prototype with this. I'm fan of .Net Core so I'm happy to see project like this implementing a good tech stuff on .Net Core.
I have some question what came into my mind.

Are there any project or company behind this implementation or it is just your free time project?
Do you have architecture documentation? I'm very interested in how this system works, but reading the code is not the best starting point. I'm interested in a high level architecture and how the docs and trie scanned, how docs and trie held in memory or scanned on disk, etc.

Thanks

Benchmarks

Hi,

I am following Ayende's review and I am motivated to make some tests with Ants Profiler.

What do I have to do to run some benchmarks?

Btw: I found this free profiler: http://www.getcodetrack.com/

Add support for IComparable instead of string

Improve IAnalyzer architecture

It's just messy now. Make it pretty where ever there is a analyzer.

Linux version in C/C++

Grouping of query statements

The query

+label:golden +label:age of porn~ +genre:documentaries

should be rewritten by the query parser to

+label:golden +(label:age~ label:of~ label:porn~) +genre:documentaries

make Levenshtein implementation pluggable

Is it planned to make the Levenshtein pluggable and for example replace it with Trigram?

Implement: Query multiple index versions

Prohibit first statement in query from being a "not" statement but allow all other query operators

Prohibit first statement in a clause from being a "not" statement. "Or" and "and" statements are allowed.

Implement TruncateOperation

Remove old documents. Re-write indices.

Implement a custom storage engine to test the contracts/abstractions

GetTicks/GetNextChronologicalFileId question

I was looking over your commits and noticed your changes to GetTicks() in 139da03

Since GetNextChronologicalFileId() is the sole consumer of GetTicks() I am trying to understand what you are trying to do with it? Before it was just a wrapper for DateTime.Now.Ticks, but now it is an incrementing number. The implementation of GetNext() is also not theadsafe. Since Random isn't a thread safe type and the Ticks++ is not guaranteed to be atomic. In fact i'm not sure what the use of Random is besides to introduce a delay in the function.

Based on the name of GetNextChronologicalFileId() and the commit message I am assuming it is intended to be:

Strictly increasing in return value
Safe to call simultaneously from multiple threads

Are there any other rules that this function must follow?

Version for .NETFramework,Version=v4.5.2

Severity Code Description Project File Line Suppression State
Error Could not install package 'ResinDB 2.0.3'. You are trying to install this package into a project that targets '.NETFramework,Version=v4.5.2', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author. 0

Create slack channel

Add support for dates

Implement "concept" as a first-class citizen along-side Term.

The purpose of a concept is to give meaning to a word or cluster of words so that an aggregated concept can be built that describe either a paragraph or the document in its entirety.

In a corpus there are always fewer concepts than there are terms. Therefore, if you could compare concepts in vector space instead of terms, you would gain in querying speed.

In order to give new meaning to a word or cluster of words, more information has to be added to the equation than just the words.

It's a good thing then that concepts may be extracted from the context in which a word or sentence live, the context being the words or sentences that surround them.

Sounds like fun, right?

Async await instead of Parallel.ForEach

Increase phrase search relevance by storing term positions

Store term positions at indexing time. At scoring time multiply the weight of a term in a phrase query by a factor proportional to the distance to its predecessor term.

Search results aren't good

Using these terms def con 26 badge ama the web page I wanted was this: https://www.reddit.com/r/Defcon/comments/973jik/dc26_official_badge_hardware_ama/

(This is just a real life example...)

Google returns that page for those terms as the second result. Duck Duck Go returns the parent page, but not the desired page.

DidYouGoGo.com doesn't return anything related.

Implement term count as DocumentPosting<int>

Let each trie node carry a Data field where T can be any class or struct. Return this via a Word to the Collector and include it in the scorable DocumentPosting. Data.TermCount is where the data goes that support the tf-idf scoring model. Data.Value is where your custom data is. Only EndOfWord nodes carry data.

kreeben / resin Goto Github PK

resin's People

Contributors

Stargazers

Watchers

Forkers

resin's Issues

Recommend Projects

Recommend Topics

Recommend Org