
zuliasearch's Introduction

Zulia

Distributed Lucene with deep object searching

Zulia is a real-time distributed search and storage system. Zulia is designed to scale both vertically and horizontally across servers.

Zulia is:

  • Realtime
  • Distributed
  • Pure Java
  • Open Source
  • Based on Lucene 9.x

Zulia supports:

To learn more, see the wiki.


zuliasearch's Issues

Allow better query handling of nested fields

Consider the sample document:

{
   "id": "123",
   "title": "The best title",
   "authors": [
      {
         "firstName": "Tom",
         "lastName": "Jones"
      },
      {
         "firstName": "Jennifer",
         "lastName": "Smith"
      }
   ]
}

Currently it is possible to search authors.firstName:"Tom" AND authors.lastName:"Smith", but that would erroneously match the document above because the matching is done at the root document level rather than per author.

Using Lucene's block join, it is possible to index child documents alongside the parent document and run separate queries against the child documents.

Zulia should support defining authors as a nested type and then allow searching on firstName and lastName correctly within that nested type.
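
For reference, a minimal sketch of the raw Lucene block-join pattern this could build on (the docType marker field and the flattened child field names are illustrative, not existing Zulia API):

import java.util.List;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.join.QueryBitSetProducer;
import org.apache.lucene.search.join.ScoreMode;
import org.apache.lucene.search.join.ToParentBlockJoinQuery;

// index each author as its own child document, parent document last in the block
void indexBook(IndexWriter writer) throws Exception {
    Document tom = new Document();
    tom.add(new StringField("authors.firstName", "Tom", Field.Store.NO));
    tom.add(new StringField("authors.lastName", "Jones", Field.Store.NO));

    Document jennifer = new Document();
    jennifer.add(new StringField("authors.firstName", "Jennifer", Field.Store.NO));
    jennifer.add(new StringField("authors.lastName", "Smith", Field.Store.NO));

    Document parent = new Document();
    parent.add(new StringField("id", "123", Field.Store.YES));
    parent.add(new StringField("docType", "parent", Field.Store.NO));

    writer.addDocuments(List.of(tom, jennifer, parent)); // keeps the block contiguous
}

// both clauses must now match within the SAME author child document, so
// firstName "Tom" with lastName "Smith" no longer matches the document above
Query authorQuery() {
    Query childQuery = new BooleanQuery.Builder()
            .add(new TermQuery(new Term("authors.firstName", "Tom")), BooleanClause.Occur.MUST)
            .add(new TermQuery(new Term("authors.lastName", "Smith")), BooleanClause.Occur.MUST)
            .build();
    QueryBitSetProducer parentFilter = new QueryBitSetProducer(new TermQuery(new Term("docType", "parent")));
    return new ToParentBlockJoinQuery(childQuery, parentFilter, ScoreMode.None);
}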

Add term query syntax and numeric set syntax

Currently term queries and numeric set queries can be executed as separate queries but not as part of the search syntax.
Add new search syntax to enable these queries inline inside another query.

Term Queries

field1:zl:tq(term1 term2 term3)  // search for the terms term1 term2 term3 in field1
zl:tq(term1 term2)               // search for the terms term1 term2 in the default search fields

Keyword tokenized fields can contain terms with spaces or dashes; such terms should be quoted.

Numeric Set Queries

field1:zl:ns(1 2 3)              // field1 must be numeric
zl:ns(1 2 3 4)                   // uses the default search fields which all must be numeric

To search a field literally named zl (not a recommended field name):

// add zl as a query field instead of using the zl: prefix
new FilterQuery("value").addQueryFields("zl");

// or use multi-field syntax with an empty second field
new FilterQuery("zl,:ns");

Update to Lucene 9.7

Full List:
https://lucene.apache.org/core/9_7_0/changes/Changes.html#v9.7.0

Release notes most relevant to Zulia:

  • KNN indexing and querying can now take advantage of vectorization for distance computation between vectors. To enable this, use exactly Java 20 or 21, and pass --add-modules jdk.incubator.vector as a command-line parameter to the Java program (see the example after this list).

  • Queries sorted by field are now able to dynamically prune hits only using the after value. This yields major speedups when paginating deeply.

  • Reduced merge-time overhead of computing the number of soft deletes.

  • KNN vectors are now disallowed to have non-finite values such as NaN or ±Infinity.
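
For example, if zuliad is launched directly with a java command (the actual launcher script and jar name may differ; the jar name below is illustrative), the incubator module is enabled like this:

java --add-modules jdk.incubator.vector -jar zulia-server.jar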

Allow efficiently fetching all values for facetable fields

Currently the only way to fetch all values for a field is to use a keyword analyzer and run a GetTerms request (which will contain deleted values) on the keyword-analyzed field, or to do a *:* search with a facet on a facetable field.

The Zulia facet index already has all the values, and they can be fetched without running a *:* query and counting every document.

Add a GetFacetValuesRequest that takes an optional list of categories and an optional path. Facet values are never deleted from the taxonomy, so this has the same deleted-value issue as GetTerms.
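
A hypothetical sketch of the proposed client call (GetFacetValuesRequest and its accessors do not exist yet; every name below is illustrative):

// hypothetical API: fetch all known values of the "authors" dimension
GetFacetValuesRequest request = new GetFacetValuesRequest("myIndex")
        .setCategories(List.of("authors"))  // optional categories
        .setPath("Jones");                  // optional path within a hierarchical facet
GetFacetValuesResult result = zuliaWorkPool.getFacetValues(request);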

Default Command Line Host/Port to Environment Variables

Follow these rules for Zulia server specification in the command line utils (everything except zuliad):

  • If --server is given, use the server given on the command line
  • If --server is NOT given, use ZULIA_HOST as the default server if set; otherwise default to localhost

Follow the same logic for --port and ZULIA_PORT, with a default value of 32191.
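
A minimal sketch of that resolution logic, assuming the parsed option values arrive as nullable strings (helper names are illustrative):

// --server wins, then ZULIA_HOST, then localhost
static String resolveServer(String serverArg) {
    if (serverArg != null) {
        return serverArg;
    }
    String envHost = System.getenv("ZULIA_HOST");
    return envHost != null ? envHost : "localhost";
}

// --port wins, then ZULIA_PORT, then 32191
static int resolvePort(String portArg) {
    if (portArg != null) {
        return Integer.parseInt(portArg);
    }
    String envPort = System.getenv("ZULIA_PORT");
    return envPort != null ? Integer.parseInt(envPort) : 32191;
}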

Change Associated Meta To Accept Any JSON/BSON metadata

The current structure for meta is Map<String, String>. This was done due to a protobuf limitation: maps are key/value only, with no handling for generic objects. This needs to be changed to accept a byte[] so that a BSON document with proper types can be stored.
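
A minimal sketch of round-tripping a typed document through a byte[] using the org.bson classes (assuming the MongoDB BSON library is available; the meta field names are illustrative):

import java.nio.ByteBuffer;

import org.bson.BsonBinaryReader;
import org.bson.BsonBinaryWriter;
import org.bson.Document;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.DocumentCodec;
import org.bson.codecs.EncoderContext;
import org.bson.io.BasicOutputBuffer;

// encode a BSON document with real types (ints, booleans, nested docs) to byte[]
Document meta = new Document("pages", 42).append("published", true);
BasicOutputBuffer buffer = new BasicOutputBuffer();
new DocumentCodec().encode(new BsonBinaryWriter(buffer), meta, EncoderContext.builder().build());
byte[] bytes = buffer.toByteArray();

// decode the byte[] back into a document, types intact
Document decoded = new DocumentCodec().decode(
        new BsonBinaryReader(ByteBuffer.wrap(bytes)), DecoderContext.builder().build());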

List Length is Wrong for Arrays of Objects

With the data structure:

{
  "id": "123",
  "authors": [
    {
      "firstName": "Bob",
      "lastName": "Jones",
      "degrees": ["PHD", "MD"]
    },
    {
      "firstName": "Tom",
      "lastName": "Smith",
      "degrees": ["PHD"]
    }
  ]
}

searching |||authors|||:2 will find that record but |||authors.degrees|||:3 will not. Fix the computation to correctly count the number of items in these cases.
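
A minimal sketch of the intended counting, where the length for a nested path sums the array sizes across all parent objects (the helper is illustrative, not Zulia code):

import java.util.List;
import java.util.Map;

// listLength(doc, "authors") == 2 and listLength(doc, "authors.degrees") == 3
static int listLength(Object node, String path) {
    if (node == null) {
        return 0;
    }
    if (node instanceof List<?> list) {
        if (path.isEmpty()) {
            return list.size();
        }
        int total = 0; // sum across every element of an array of objects
        for (Object item : list) {
            total += listLength(item, path);
        }
        return total;
    }
    if (path.isEmpty()) {
        return 1; // a lone scalar counts as a single item
    }
    int dot = path.indexOf('.');
    String head = dot < 0 ? path : path.substring(0, dot);
    String rest = dot < 0 ? "" : path.substring(dot + 1);
    return node instanceof Map<?, ?> map ? listLength(map.get(head), rest) : 0;
}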

Finish adding segment replication functionality

Zulia already contains indexing and query routing and some configuration for replicas; however, there is no code for copying data to the replicas or wiring up the complete handling on replicas. There is also no code for new primary election.

Add distinct value facet

Instead of counting occurrences of each facet, just return the existence of a facet for a search. Storage can be a BitSet instead of the much more memory- and compute-heavy HashIntIntMap. Values would be returned in A-Z order up to a count limit.
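
A minimal sketch of existence-only collection, assuming facet values map to taxonomy ordinals (names are illustrative):

import java.util.BitSet;

BitSet seenOrdinals = new BitSet(); // one bit per facet ordinal, no counts

// inside the collector, for each facet ordinal of a matching document:
// seenOrdinals.set(ordinal);

// after collection, take up to maxValues distinct ordinals; a taxonomy
// lookup (not shown) would translate them back to labels for A-Z ordering
int maxValues = 100;
int[] firstOrdinals = seenOrdinals.stream().limit(maxValues).toArray();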

Update to Lucene 9.6

Adds support for the Java 19/20 optimized foreign memory API (instead of just Java 19)
Adds string optimizations when doc values exist (for example in KeywordField.newSetQuery)

Compress documents on disk

Since Zulia 3.0, Zulia has been storing the BSON document in binary doc values fields instead of stored fields. Inside Lucene, stored fields are compressed but binary doc values are not compressed in newer versions of Lucene.

Goals:

  • Turn on document level compression by default but allow it to be turned off at the index level
  • Use a high speed compression algorithm
  • Existing indexes must not break, but new documents (or reindexed documents) will be stored compressed unless the user disables document-level compression in the index config

Implementation:

  • Add a single-byte flag (bool) to the id meta info (IdInfo) of each document that indicates whether the document is compressed. The protobuf field will default to false to preserve backwards compatibility with existing documents
  • Add a boolean flag to IndexSettings called disableCompression that turns off compression at the index level. The protobuf field will default to false, enabling compression for new or reindexed documents by default
  • Use Snappy compression for high-throughput compression (a sketch follows below)
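
A minimal sketch of both paths with the org.xerial.snappy library (assuming that dependency; the flag handling mirrors the IdInfo byte described above):

import java.io.IOException;

import org.xerial.snappy.Snappy;

// write path: compress unless the index config disables it; the caller
// stores the resulting compressed flag with the document's IdInfo
static byte[] maybeCompress(byte[] bsonBytes, boolean disableCompression) throws IOException {
    return disableCompression ? bsonBytes : Snappy.compress(bsonBytes);
}

// read path: the flag defaults to false in protobuf, so documents written
// before this change are read back unchanged, preserving compatibility
static byte[] readDocument(byte[] stored, boolean compressedFlag) throws IOException {
    return compressedFlag ? Snappy.uncompress(stored) : stored;
}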

Create Index Aliases

Index aliases are pointers to other indexes that can be switched quickly to point to another index.

Add commands:

  • Create/Update Alias (aliasName, indexToPointTo)
  • Remove Alias (aliasName)
  • List Aliases
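
A minimal sketch of the core data structure, assuming aliases resolve to a concrete index name at request time (illustrative, not Zulia code):

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// alias -> index name; put() atomically repoints an alias in one step
ConcurrentHashMap<String, String> aliases = new ConcurrentHashMap<>();

aliases.put("articles", "articles_v1");             // create
aliases.put("articles", "articles_v2");             // update: instant switch
aliases.remove("articles");                         // remove
Map<String, String> snapshot = Map.copyOf(aliases); // list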
