askorama / orama

🌌 Fast, dependency-free, full-text and vector search engine with typo tolerance, filters, facets, stemming, and more. Works with any JavaScript runtime, browser, server, service!

Home Page: https://docs.askorama.ai

License: Other

TypeScript 97.00% Shell 0.47% Dockerfile 0.01% Astro 1.27% MDX 1.04% Rust 0.22%
data-structures full-text search typo-tolerance algorithm search-engine search-algorithm javascript typescript node vector vector-database vector-database-embedding vector-search vector-search-engine

orama's Introduction


Full-text, vector, and hybrid search with a unique API.
On your browser, server, mobile app, or at the edge.
In less than 2kb.


Join Orama's Slack channel

If you need more info, help, or want to provide general feedback on Orama, join the Orama Slack channel

Highlighted features

Installation

You can install Orama using npm, yarn, pnpm, or bun:

npm i @orama/orama

Or import it directly in a browser module:

<html>
  <body>
    <script type="module">
      import { create, search, insert } from 'https://unpkg.com/@orama/orama@latest/dist/index.js'

      // ...
    </script>
  </body>
</html>

With Deno, you can just use the same CDN URL or use npm specifiers:

import { create, search, insert } from 'npm:@orama/orama'

Read the complete documentation at https://docs.askorama.ai.

Usage

Orama is quite simple to use. The first thing to do is to create a new database instance and set an indexing schema:

import { create, insert, remove, search, searchVector } from '@orama/orama'

const db = await create({
  schema: {
    name: 'string',
    description: 'string',
    price: 'number',
    embedding: 'vector[1536]', // Vector size must be expressed during schema initialization
    meta: {
      rating: 'number',
    },
  },
})

Orama currently supports 10 different data types:

Type | Description | Example
string | A string of characters. | 'Hello world'
number | A numeric value, either float or integer. | 42
boolean | A boolean value. | true
enum | An enum value. | 'drama'
geopoint | A geopoint value. | { lat: 40.7128, lon: 74.0060 }
string[] | An array of strings. | ['red', 'green', 'blue']
number[] | An array of numbers. | [42, 91, 28.5]
boolean[] | An array of booleans. | [true, false, false]
enum[] | An array of enums. | ['comedy', 'action', 'romance']
vector[<size>] | A vector of numbers to perform vector search on. | [0.403, 0.192, 0.830]

Orama will only index properties specified in the schema but will allow you to set and store additional data if needed.

Once the db instance is created, you can start adding some documents:

await insert(db, {
  name: 'Wireless Headphones',
  description: 'Experience immersive sound quality with these noise-cancelling wireless headphones.',
  price: 99.99,
  embedding: [...],
  meta: {
    rating: 4.5,
  },
})

await insert(db, {
  name: 'Smart LED Bulb',
  description: 'Control the lighting in your home with this energy-efficient smart LED bulb, compatible with most smart home systems.',
  price: 24.99,
  embedding: [...],
  meta: {
    rating: 4.3,
  },
})

await insert(db, {
  name: 'Portable Charger',
  description: 'Never run out of power on-the-go with this compact and fast-charging portable charger for your devices.',
  price: 29.99,
  embedding: [...],
  meta: {
    rating: 3.6,
  },
})

After the data has been inserted, you can finally start to query the database.

const searchResult = await search(db, {
  term: 'headphones',
})

In the case above, you will be searching for all the documents containing the word "headphones", looking up in every string property specified in the schema:

{
  elapsed: {
    raw: 99512,
    formatted: '99μs',
  },
  hits: [
    {
      id: '41013877-56',
      score: 0.925085832971998432,
      document: {
        name: 'Wireless Headphones',
        description: 'Experience immersive sound quality with these noise-cancelling wireless headphones.',
        price: 99.99,
        meta: {
          rating: 4.5
        }
      }
    }
  ],
  count: 1
}

You can also restrict the lookup to a specific property:

const searchResult = await search(db, {
  term: 'immersive sound quality',
  properties: ['description'],
})

Result:

{
  elapsed: {
    raw: 21492,
    formatted: '21μs',
  },
  hits: [
    {
      id: '41013877-56',
      score: 0.925085832971998432,
      document: {
        name: 'Wireless Headphones',
        description: 'Experience immersive sound quality with these noise-cancelling wireless headphones.',
        price: 99.99,
        meta: {
          rating: 4.5
        }
      }
    }
  ],
  count: 1
}

You can use non-string data to filter, group, and create facets:

const searchResult = await search(db, {
  term: 'immersive sound quality',
  where: {
    price: {
      lte: 199.99
    },
    'meta.rating': {
      gt: 4
    }
  },
})
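
To make the filter semantics concrete, here is a minimal sketch of how a `where` clause with comparison operators could be evaluated against plain objects. It is not Orama's implementation: `getPath`, `operators`, and `matchesWhere` are hypothetical helpers written for this illustration, and nested properties are addressed with dotted paths such as 'meta.rating', as in the hybrid search example in this README.

```javascript
// Hypothetical helper: resolve a dotted path such as 'meta.rating' on a document.
function getPath(doc, path) {
  return path
    .split('.')
    .reduce((value, key) => (value == null ? undefined : value[key]), doc)
}

// Hypothetical comparison table, keyed by operator name (gt, gte, lt, lte, eq).
const operators = {
  gt: (a, b) => a > b,
  gte: (a, b) => a >= b,
  lt: (a, b) => a < b,
  lte: (a, b) => a <= b,
  eq: (a, b) => a === b,
}

// Returns true when every condition in `where` holds for the document.
function matchesWhere(doc, where) {
  return Object.entries(where).every(([path, conditions]) => {
    const value = getPath(doc, path)
    return Object.entries(conditions).every(([op, expected]) => {
      const compare = operators[op]
      if (!compare) throw new Error(`Unsupported operator: ${op}`)
      // Missing or non-numeric values never satisfy a numeric filter.
      return typeof value === 'number' && compare(value, expected)
    })
  })
}

const products = [
  { name: 'Wireless Headphones', price: 99.99, meta: { rating: 4.5 } },
  { name: 'Smart LED Bulb', price: 24.99, meta: { rating: 4.3 } },
  { name: 'Portable Charger', price: 29.99, meta: { rating: 3.6 } },
]

const filtered = products.filter((doc) =>
  matchesWhere(doc, { price: { lte: 199.99 }, 'meta.rating': { gt: 4 } })
)
console.log(filtered.map((doc) => doc.name))
```

All conditions are combined with a logical AND, so a document must satisfy both the price and the rating condition to be kept.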

Performing hybrid and vector search

Orama is a full-text and vector search engine. This allows you to adopt different kinds of search paradigms depending on your specific use case.

To perform vector or hybrid search, you can use the same search method used for full-text search.

You'll just have to specify which property you want to perform vector search on, and a vector to be used to perform vector similarity:

const searchResult = await search(db, {
  mode: 'vector', // or 'hybrid'
  vector: {
    value: [...], // OpenAI embedding or similar vector to be used as an input
    property: 'embedding' // Property to search through. Mandatory for vector search
  }
})
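
For intuition about what "vector similarity" means here, the sketch below ranks documents by cosine similarity between their stored vectors and a query vector. Cosine similarity is an assumption for this illustration (the README does not name the metric), and `cosineSimilarity` and `rankBySimilarity` are hypothetical helpers, not Orama APIs. The three-dimensional vectors are toy data; real embeddings would have the size declared in the schema.

```javascript
// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error('Vectors must have the same length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  // A zero vector has no direction, so treat its similarity as 0.
  if (normA === 0 || normB === 0) return 0
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical helper: score every document on one vector property, best first.
function rankBySimilarity(docs, queryVector, property) {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(doc[property], queryVector) }))
    .sort((x, y) => y.score - x.score)
}

const docs = [
  { id: 'a', embedding: [1, 0, 0] },
  { id: 'b', embedding: [0, 1, 0] },
  { id: 'c', embedding: [0.9, 0.1, 0] },
]

const ranked = rankBySimilarity(docs, [1, 0, 0], 'embedding')
console.log(ranked.map((hit) => hit.doc.id))
```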

If you're using the Orama Secure AI Proxy (highly recommended), you can skip the vector configuration at search time, since the official Orama Secure AI Proxy plugin will take care of it automatically for you:

import { create, search } from '@orama/orama'
import { pluginSecureProxy } from '@orama/plugin-secure-proxy'

const secureProxy = pluginSecureProxy({
  apiKey: '<YOUR-PUBLIC-API-KEY>',
  defaultProperty: 'embedding', // the default property to perform vector and hybrid search on
  model: 'openai/text-embedding-ada-002' // the model to use to generate embeddings
})

const db = await create({
  schema: {
    name: 'string',
    description: 'string',
    price: 'number',
    embedding: 'vector[1536]',
    meta: {
      rating: 'number',
    },
  },
  plugins: [secureProxy]
})

const resultsHybrid = await search(db, {
  mode: 'vector', // or 'hybrid'
  term: 'Videogame for little kids with a passion about ice cream',
  where: {
    price: {
      lte: 19.99
    },
    'meta.rating': {
      gte: 4.5
    }
  }
})

Performing Geosearch

Orama supports Geosearch as a search filter. It will search through all the properties specified as geopoint in the schema:

import { create, insert, search } from '@orama/orama'

const db = await create({
  schema: {
    name: 'string',
    location: 'geopoint'
  }
})

await insert(db, { name: 'Duomo di Milano', location: { lat: 45.46409, lon: 9.19192 } })
await insert(db, { name: 'Piazza Duomo',    location: { lat: 45.46416, lon: 9.18945 } })
await insert(db, { name: 'Piazzetta Reale', location: { lat: 45.46339, lon: 9.19092 } })

const searchResult = await search(db, {
  term: 'Duomo',
  where: {
    location: {           // The property we want to filter by
      radius: {           // The filter we want to apply (in that case: "radius")
        coordinates: {    // The central coordinate
          lat: 45.4648, 
          lon: 9.18998
        },
        unit: 'm',        // The unit of measurement. The default is "m" (meters)
        value: 1000,      // The radius length. In that case, 1km
        inside: true      // Whether we want to return the documents inside or outside the radius. The default is "true"
      }
    }
  }
})

Orama Geosearch APIs support distance-based search (via radius), or polygon-based search (via polygon).

By default, Orama will use the Haversine formula to perform Geosearch, but high-precision search can be enabled by passing the highPrecision option in your radius or polygon configuration. This will tell Orama to use Vincenty's formulae instead, which are more precise for longer distances.

Read more in the official docs.
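
As a rough illustration of the default distance calculation, the following sketch implements the Haversine formula and applies a 1000 m radius filter to the three locations from the example above. It is a standalone approximation, not Orama's code: `haversineDistance` is a hypothetical helper and the mean Earth radius constant is an assumption.

```javascript
const EARTH_RADIUS_M = 6371000 // mean Earth radius in meters (approximation)

function toRadians(degrees) {
  return (degrees * Math.PI) / 180
}

// Great-circle distance in meters between two { lat, lon } points.
function haversineDistance(a, b) {
  const dLat = toRadians(b.lat - a.lat)
  const dLon = toRadians(b.lon - a.lon)
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(a.lat)) * Math.cos(toRadians(b.lat)) * Math.sin(dLon / 2) ** 2
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(h))
}

const center = { lat: 45.4648, lon: 9.18998 }
const places = [
  { name: 'Duomo di Milano', location: { lat: 45.46409, lon: 9.19192 } },
  { name: 'Piazza Duomo', location: { lat: 45.46416, lon: 9.18945 } },
  { name: 'Piazzetta Reale', location: { lat: 45.46339, lon: 9.19092 } },
]

// Keep only the places within 1000 meters of the center.
const insideRadius = places.filter(
  (place) => haversineDistance(center, place.location) <= 1000
)
console.log(insideRadius.map((place) => place.name))
```

All three places lie a few hundred meters or less from the center, so each of them passes the radius check.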

Official Docs

Read the complete documentation at https://docs.askorama.ai.

Official Orama Plugins

Write your own plugin: https://docs.askorama.ai/open-source/plugins/writing-your-own-plugins

License

Orama is licensed under the Apache 2.0 license.

orama's People

Contributors

allevo, balastrong, castarco, codyzu, danielefedeli, dbritto-dev, ematipico, h4ad, ilteoood, ishibi, jkomyno, marco-ippolito, mateonunez, micheleriva, okikio, optic-release-automation[bot], philippebeaulieu, rafaelgss, raiindev, rishi-raj-jain, samsalvatico, saravieira, shogunpanda, sp321, stearm, thezalrevolt, thomscoder, valeriocomo, valstu, yusufyilmazfr

orama's Issues

Search case-sensitive for terms with a length less than 3

Describe the bug
It seems that the search is case-sensitive for words with a length < 3

To Reproduce
Steps to reproduce the behavior:

  1. Insert records like "if you are interested" and "if you will go out..".
  2. Search for the term "If": you retrieve 0 results.
  3. Search for the terms "if you": you get results.

Expected behavior
I expect the search to be either case-sensitive or case-insensitive according to a specific, consistent rule.

related PR: #57 (but it doesn't fix the problem)

[BUG] Duplicate hits when overriding `id`

Describe the bug
When overriding the property id of documents, search will return more hits than the actual count of hits in the search results. The additional hits will be duplicated documents.

To Reproduce
Steps to reproduce the behavior:

Use this sample script:

import { create, insert, search } from '@nearform/lyra';

const db = create({
	schema: {
		id: 'string',
		quote: 'string',
		author: 'string',
		what: 'string',
		who: 'string',
	},
});

insert(db, {
	id: '1.0',
	quote: 'It is during our darkest moments that we must focus to see the light.',
	author: 'Aristotle',
	what: '',
	who: '',
});

insert(db, {
	id: '2.0',
	quote: 'If you really look closely, most overnight successes took a long time.',
	author: 'Steve Jobs',
	what: '',
	who: '',
});

insert(db, {
	id: '3.0',
	quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
	author: 'Jim Rohn',
	what: '',
	who: '',
});

insert(db, {
	id: '4.0',
	quote: 'You miss 100% of the shots you don\'t take',
	author: 'Wayne Gretzky - Michael Scott',
	what: '',
	who: '',
});

const searchResult = search(db, {
	term: 'if',
	properties: '*',
});

console.log(searchResult);

Sample output:

{
  elapsed: 169640n,
  hits: [
    {
      id: '2.0',
      quote: 'If you really look closely, most overnight successes took a long time.',
      author: 'Steve Jobs',
      what: '',
      who: ''
    },
    {
      id: '3.0',
      quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
      author: 'Jim Rohn',
      what: '',
      who: ''
    },
    {
      id: '2.0',
      quote: 'If you really look closely, most overnight successes took a long time.',
      author: 'Steve Jobs',
      what: '',
      who: ''
    },
    {
      id: '3.0',
      quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
      author: 'Jim Rohn',
      what: '',
      who: ''
    },
    {
      id: '2.0',
      quote: 'If you really look closely, most overnight successes took a long time.',
      author: 'Steve Jobs',
      what: '',
      who: ''
    },
    {
      id: '3.0',
      quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
      author: 'Jim Rohn',
      what: '',
      who: ''
    },
    {
      id: '2.0',
      quote: 'If you really look closely, most overnight successes took a long time.',
      author: 'Steve Jobs',
      what: '',
      who: ''
    },
    {
      id: '3.0',
      quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
      author: 'Jim Rohn',
      what: '',
      who: ''
    }
  ],
  count: 2
}

Expected behavior
It is expected that the hits array contains no more than count: n and no duplicated documents.
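
Until the underlying bug is fixed, a caller could work around it by dropping repeated hits on their side. The sketch below is only such a workaround, not a fix for Lyra itself; `dedupeHits` is a hypothetical helper, and the sample hits mirror the shape of the output above.

```javascript
// Keep the first hit for each id and drop later repeats, preserving order.
function dedupeHits(hits) {
  const seen = new Set()
  return hits.filter((hit) => {
    if (seen.has(hit.id)) return false
    seen.add(hit.id)
    return true
  })
}

const rawHits = [
  { id: '2.0', author: 'Steve Jobs' },
  { id: '3.0', author: 'Jim Rohn' },
  { id: '2.0', author: 'Steve Jobs' },
  { id: '3.0', author: 'Jim Rohn' },
]

const uniqueHits = dedupeHits(rawHits)
console.log(uniqueHits.length)
```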

Desktop (please complete the following information):

  • OS: Fedora 36
  • Browser: N/A
  • Version: @nearform/lyra 0.0.4, 0.0.5

Additional context
Bug is reproducible under Node 16, 17, and 18, and Deno 1.24.x (using import of https://cdn.skypack.dev/@nearform/[email protected]?dts)

Missing .tsv file in benchmarks

Describe the bug
Hoping I'm not missing some configurations... benchmarks are crashing at

const fileStream = fs.createReadStream("./dataset/title.tsv");

To Reproduce
Steps to reproduce the behavior:

  1. Running pnpm benchmark from root directory (supposing no external/manual configurations are needed 👀)

Screenshots
Schermata 2022-07-24 alle 00 05 33

Desktop (please complete the following information):

  • OS: MacOS M1 Monterey 12.4
  • Node: 18.6.0

JavaScript heap out of memory (node)

I'm already giving Node 6 GB of memory.

I am trying to index a file of 6 million documents, but each document is tiny, literally like:

{ title: "hello this is a Wikipedia title" }

Most titles are actually shorter; this is a dump of Wikipedia titles. The whole dump is 350 MB on disk as text, so I would expect it to be much lighter in memory.

A second question, please: I already have a tokenizer (the natural library), so what I'm doing is:

insert(index, {
    title:
        nex.join(' '),
});

So is there a way to deactivate the Lyra tokenizer (and pass an array of words instead)?

Thanks a lot

Add possibility to disable stemming during indexing

There might be cases where we want to store the exact document without stemming. I propose an API similar to the following:

const INSERT_CONFIG = {
  stemming: false
};

const doc = {
  quote: "hello world",
  author: "me"
};

await lyra.insert(doc, INSERT_CONFIG);

We could also add the stemming: <bool> property while initializing a new Lyra instance:

const db = new Lyra({
  schema: {},
  defaultLanguage: 'english',
  stemming: false // true by default
});

Bring coverage >90% for v0.1.0

Is your feature request related to a problem? Please describe.
Before launching Lyra as a stable project, we should improve our tests and bring the test coverage up to 90% minimum

Count term occurrences in document

Is your feature request related to a problem? Please describe.
Right now, we're not considering how many times a term appears in a document.

For instance, given the following strings:

  • "It's alive! It's alive!"
  • "I saw them alive"

When searching for "alive", the first string should have priority as the term "alive" appears twice.
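
The idea can be sketched by counting how often the term occurs in each string and sorting on that count. This is only an illustration of the proposal under simple assumptions: `tokenize` and `termFrequency` are hypothetical helpers, and the tokenizer is a naive lowercase split on non-alphanumeric characters rather than Lyra's own.

```javascript
// Naive tokenizer: lowercase, split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
}

// Number of times `term` appears as a whole token in `text`.
function termFrequency(text, term) {
  const needle = term.toLowerCase()
  return tokenize(text).filter((token) => token === needle).length
}

const strings = ["It's alive! It's alive!", 'I saw them alive']

// Rank the strings so that more occurrences of the term come first.
const ranked = strings
  .map((text) => ({ text, occurrences: termFrequency(text, 'alive') }))
  .sort((a, b) => b.occurrences - a.occurrences)

console.log(ranked)
```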

CaseSensitive search

Is your feature request related to a problem? Please describe.
At the moment the search is case-insensitive, so it would be useful to add a parameter that specifies whether the search must be case-sensitive.

Describe the solution you'd like
Adding a parameter during the search called caseSensitive could be good enough.

Benchmarks for typo tolerance do not perform typo-tolerant search

Describe the bug

The benchmarks for typo-tolerance are executing a search with exact: true, so they only perform exact match.

To Reproduce

No reproduction as this is a problem with the benchmarks, but see how the benchmark sets exact: true. The current implementation (as of Lyra 0.0.1-beta-14) returns early if exact is true and an exact match is found, no matter if tolerance is set.

Expected behavior

The benchmark for typo tolerance should perform a typo tolerant search.

Note: it would probably also make sense to disable stemming on the typo tolerance benchmarks to avoid confounding results.

Create better docs

With Lyra's upcoming first stable release, I would love to have a better documentation website, maybe using Docusaurus or something similar. The current design is not optimal, and the documentation is poorly organized.

Typo tolerance misses results

Describe the bug

The bug occurs in lyra-0.0.1-beta-13.

Typo tolerant searches miss expected results.

To Reproduce

const db = new Lyra({
  schema: {
    txt: "string",
  },
  stemming: false // Disabling stemming to avoid confounding the results
})

// Insert "stelle", and other words that are all within edit distance 2
await db.insert({ txt: 'stelle' })
await db.insert({ txt: 'stele' })
await db.insert({ txt: 'snelle' })
await db.insert({ txt: 'stellle' })
await db.insert({ txt: 'scelte' })

await db.search({ term: 'stelle', tolerance: 2 })
/* returns:
{
  elapsed: '1ms',
  hits: [
    { id: 'cuD10UKGVzBmEQz30dQKI', txt: 'stelle' },
    { id: 'DqXro5_9z2wiPQlTmK91n', txt: 'stellle' },
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined
  ],
  count: 2
}
*/

Expected behavior

All 5 documents should be returned, since they all contain a term within edit distance 2 from the query.
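
The claim that all five words are within edit distance 2 of the query can be checked independently of Lyra. The sketch below uses a standard Levenshtein distance (insertions, deletions, and substitutions each cost 1); `levenshtein` is a hypothetical helper and says nothing about how Lyra computes tolerance internally.

```javascript
// Levenshtein distance computed row by row with dynamic programming.
function levenshtein(a, b) {
  // previous[j] holds the distance between the first i-1 chars of a and first j chars of b.
  let previous = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const current = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1, // deletion
        current[j - 1] + 1, // insertion
        previous[j - 1] + cost // substitution or match
      )
    }
    previous = current
  }
  return previous[b.length]
}

const words = ['stelle', 'stele', 'snelle', 'stellle', 'scelte']
const distances = words.map((word) => ({ word, distance: levenshtein('stelle', word) }))
const withinTolerance = distances.filter((entry) => entry.distance <= 2)

console.log(distances)
```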

Support for nested properties

Is your feature request related to a problem? Please describe.
As for now, Lyra does not support nested properties. Thus, the following code will break:

import { Lyra } from '@nearform/lyra';

const movieDB = new Lyra({
  schema: {
    title: 'string',
    director: 'string',
    plot: 'string',
    year: 'number',
    isFavorite: 'boolean',
    cast: { // <-------- objects are not supported yet
      director: 'string',
      leading: 'string'
    }
  }
});

We should grant support for nested properties to Lyra

Error on Nebula Run

Describe the bug

file:///Users/nathanclevenger/.nvm/versions/node/v16.13.0/lib/node_modules/@lyrasearch/nebula/dist/deploy.js:1
import { access, constants } from "node:fs/promises";
                 ^^^^^^^^^
SyntaxError: The requested module 'node:fs/promises' does not provide an export named 'constants'
    at ModuleJob._instantiate (node:internal/modules/esm/module_job:124:21)
    at async ModuleJob.run (node:internal/modules/esm/module_job:181:5)
    at async Promise.all (index 0)
    at async ESMLoader.import (node:internal/modules/esm/loader:281:24)
    at async loadESM (node:internal/process/esm_loader:88:5)
    at async handleMainPromise (node:internal/modules/run_main:65:12)

To Reproduce
Steps to reproduce the behavior:

  1. I followed the instructions at https://docs.lyrajs.io/deployment/nebula/running-nebula
  2. Execute nebula run

Expected behavior
I expected it to generate src/index.js

Desktop (please complete the following information):

  • Mac OS X
  • Node v16.13.0

Error stemming in Dutch and Spanish

Describe the bug
When using the PorterStemmerNl and PorterStemmerEs, natural module can't resolve the following methods: postlude() and isVowel().

PorterStemmerNl:

 TypeError: Cannot read property 'postlude' of undefined

      51 |   }
      52 |
    > 53 |   return input.map(stemmer.stem);
         |                ^
      54 | }
      55 |

postlude() returns the word in lowercase

PorterStemmerEs:

    TypeError: Cannot read property 'isVowel' of undefined

      51 |   }
      52 |
    > 53 |   return input.map(stemmer.stem);
         |                ^
      54 | }
      55 |

To Reproduce
Steps to reproduce the behavior:

  1. Create a new test
  2. Add the following code:

PorterStemmerNl:

  it("Should stem correctly in dutch", async () => {
    // some words in dutch
    const input: string[] = ["banken"];

    // the expected output
    const expected = ["bank"];

    const output = stemArray(input, "dutch");

    expect(output).toEqual(expected);
  });

PorterStemmerEs:

  it("Should stem correctly in spanish", async () => {
    // some words in spanish
    const input: string[] = ["avenida"];

    // the expected output
    const expected = ["aven"];

    const output = stemArray(input, "spanish");

    expect(output).toEqual(expected);
  });

Desktop (please complete the following information):

  • OS: Windows 11
  • Environment: Jest
  • Node: v18.6.0
  • pnpm: v7.5.1
  • Lyra: v0.0.1-beta-10

Possible solution

I was testing some solutions for this issue, and I've found that the error is caused when the stemArray method passes the reference of the stem method to the input.map()

return input.map(stemmer.stem);

After fixing the typos and changing the map callback so that it calls the stem method explicitly, the stem method works as expected.

I've published a repo with the possible solution and the stemmer tests.
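
The failure mode described above can be reproduced without the natural module. In the sketch below, `ToyStemmer` is a hypothetical stand-in (its suffix rule is invented for the example): passing `stemmer.stem` by reference detaches the method from its object, so `this` is undefined inside it, while an arrow function or `bind` keeps the receiver.

```javascript
// Hypothetical stand-in for a stemmer whose stem() relies on `this`.
class ToyStemmer {
  postlude(word) {
    return word.toLowerCase()
  }

  stem(word) {
    // Uses `this`, so it breaks when the method is called without its object.
    return this.postlude(word).replace(/en$/, '')
  }
}

const stemmer = new ToyStemmer()
const input = ['Banken']

// Passing the method by reference loses the receiver and throws a TypeError.
let unboundError = null
try {
  input.map(stemmer.stem)
} catch (error) {
  unboundError = error
}

// Two ways to keep the receiver: wrap the call, or bind the method.
const viaArrow = input.map((word) => stemmer.stem(word))
const viaBind = input.map(stemmer.stem.bind(stemmer))

console.log(unboundError instanceof TypeError, viaArrow, viaBind)
```

The arrow function also avoids passing the extra index and array arguments that `map` supplies to its callback.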

Error with fresh Lyra installation

Describe the bug
I've created a new project where I wanted to use Lyra, so I added it as a dependency. When I try to import Lyra, I get the following error output:

❯ npm run test

> [email protected] test
> node index.js

node:internal/errors:465
    ErrorCaptureStackTrace(err);
    ^

Error [ERR_MODULE_NOT_FOUND]: Cannot find package '~/dev/new-lyra/node_modules/@nearform/lyra/' imported from ~/dev/new-lyra/index.js

To Reproduce
Steps to reproduce the behavior:

  1. Create a new empty JS project: npm init --yes
  2. Install Lyra as a dependency: npm install @nearform/lyra
  3. Create a simple file that imports Lyra.
  4. Run it.

Screenshots

This is the content of Lyra's directory inside node_modules

image

Desktop (please complete the following information):

  • OS: macOS Monterey 12.4
  • Environment: Node
  • Node: v18.6.0
  • npm: v8.14.0
  • Lyra: v0.0.1-beta-11

Additional context

I've created a repository with the reproduced error hoping it could be useful.

Search method returns undefined elements

Describe the bug
When searching for a term that doesn't exist, Lyra returns X undefined elements.

The error also occurs using limit, property or exact params.

To Reproduce
Steps to reproduce the behavior:

  1. lyra.search({ term: "whatever" })

Expected behavior
Lyra shouldn't return X undefined elements; it should return an empty array instead.

Screenshots

image

Desktop:

  • OS: Windows 11
  • Environment: Node
  • Node: v18.6.0
  • pnpm: v7.5.1
  • Lyra: v0.0.1-beta-12

How does insert work? Why storing individual tokens?

Hi,

Apologies, I'm just exploring the code and had a question about insert(nodes: Nodes, node: Node, word: string, docId: string). Looking through the code it looks like word is being looped across, and the parameter, word implied to me it takes a single word.

However, I was checking out the test:

const phrases = [
  { id: "1", doc: "the quick, brown fox" },
  { id: "2", doc: "jumps over the lazy dog" },
  { id: "3", doc: "just in time!" },
  { id: "4", doc: "there is something wrong in there" },
  { id: "5", doc: "this is me" },
  { id: "6", doc: "thought it was sunday" },
  { id: "7", doc: "let's try this trie" },
];

t.test("trie", t => {
  t.plan(3);

  t.test("should correctly index phrases into a prefix tree", t => {
    t.plan(phrases.length);

    const nodes = {};
    const trie = createNode();

    for (const { doc, id } of phrases) {
      trieInsert(nodes, trie, doc, id);
    }
...

From this code, it looks like word can be any string, i.e. a phrase.

The purpose of this issue isn't a "gotcha", but just me trying to get used to libraries that I'm not familiar with and taking small steps to investigate the code.

I've added a few console.logs and reran the tests to check and it does seem that i.e the quick, brown fox is passed in as word, is this assumption correct? As far as I can see, it shouldn't matter if you passed in h, hello or hello world, as the function loops over each character anyway.

P.S I don't think there's anything wrong with this, just wanted to see if my testing was correct. Although, would be interested in your thoughts on renaming the parameter to phrase? This is really not crucial, or important whatsoever, so apologies again if it's distracted you from more important features/issues.

Is it possible not to use any Lyra tokenizer ?

Hi,
Thanks for providing Lyra!

I already rely on a tokenizer in my algorithm. Is it possible to pass plain arrays of words, or do I need to re-join the words and go through Lyra's internal tokenizer again?

Remove Natural as a dependency

As seen in #31, Natural is crashing on the browser, and there's no easy workaround for it. We should consider moving away from this library before hitting a stable release.

Sequential match priority

Is your feature request related to a problem? Please describe.
As for now, Lyra performs searches on individual tokens.

So for example, if I have the following documents:

  • "Hello everyone here. My name is Michele"
  • "Hello Michele"

When searching for the term "Hello Michele", there's still a chance to get the first document. We should take token order into account, so that documents containing the tokens in the same sequence as the query get priority.
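
One possible way to express this priority is to check whether the query tokens appear as a contiguous run in the document and rank those documents first. The sketch below is an assumption about how such a feature could behave, not a description of Lyra; `tokenize`, `containsSequence`, and `rankBySequence` are hypothetical helpers.

```javascript
// Naive tokenizer: lowercase, split on anything that is not a letter or digit.
function tokenize(text) {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
}

// True when queryTokens occur in docTokens as one contiguous, ordered run.
function containsSequence(docTokens, queryTokens) {
  if (queryTokens.length === 0) return true
  for (let start = 0; start + queryTokens.length <= docTokens.length; start++) {
    if (queryTokens.every((token, offset) => docTokens[start + offset] === token)) {
      return true
    }
  }
  return false
}

// Documents with a sequential match are sorted before the others.
function rankBySequence(docs, query) {
  const queryTokens = tokenize(query)
  return docs
    .map((text) => ({ text, sequential: containsSequence(tokenize(text), queryTokens) }))
    .sort((a, b) => Number(b.sequential) - Number(a.sequential))
}

const ranked = rankBySequence(
  ['Hello everyone here. My name is Michele', 'Hello Michele'],
  'Hello Michele'
)
console.log(ranked)
```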

React example can't find built Lyra package

Describe the bug

With a fresh clone of Lyra, I tried to run the example/with-react package and got the following error:

image

To Reproduce
Steps to reproduce the behavior:

  1. git clone https://github.com/nearform/lyra
  2. cd lyra && npm|yarn|pnpm install
  3. cd packages/examples/with-react
  4. npm|yarn|pnpm dev

Additional context

This issue could be fixed by building the local Lyra package before building or serving the example.

Consider Dependency Injection

Currently the natural language processing from natural includes a great deal of large dependencies. For example when I bundle Lyra with esbuild to deploy to lambda it is 13.1 MB in size.

Is it possible to allow injection of both the stemmer, and word tokenizer, so long as those injected dependencies fulfill the contract of the function interface?

This would mean a smaller, but likely less accurate, dependency could be used, making Lyra a lot smaller - possibly allowing for FE deployment - as at 13.1 MB I would not deploy into my browser.
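
A minimal sketch of what such injection could look like follows. It is not Lyra's API: `createIndexer` and its `tokenize` and `stem` options are hypothetical, and the contract assumed here is that `tokenize` maps a string to an array of tokens and `stem` maps one token to one string.

```javascript
// Small built-in defaults, so no heavy NLP dependency is required.
const defaultTokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
const identityStem = (token) => token

// Hypothetical factory: callers may inject their own tokenizer and stemmer.
function createIndexer({ tokenize = defaultTokenize, stem = identityStem } = {}) {
  const index = new Map() // stemmed token -> Set of document ids

  return {
    insert(id, text) {
      for (const token of tokenize(text)) {
        const key = stem(token)
        if (!index.has(key)) index.set(key, new Set())
        index.get(key).add(id)
      }
    },
    search(term) {
      const ids = new Set()
      for (const token of tokenize(term)) {
        for (const id of index.get(stem(token)) ?? []) ids.add(id)
      }
      return [...ids]
    },
  }
}

// Inject a tiny (and deliberately crude) stemmer that strips a trailing "s".
const indexer = createIndexer({ stem: (token) => token.replace(/s$/, '') })
indexer.insert('1', 'Wireless headphones')
indexer.insert('2', 'Portable charger')

const plainIndexer = createIndexer()
plainIndexer.insert('1', 'Wireless headphones')

console.log(indexer.search('headphone'), plainIndexer.search('headphone'))
```

With this shape, a caller who already has tokens could inject a tokenizer that simply returns them, and the bundle would only contain the stemmer the caller chooses to ship.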

Follow Up

  • Create example web app with bundled size
  • Create example edge function with bundled size

Deleted documents still appear in searches

Describe the bug

The bug occurs in lyra-0.0.1-beta-13.

In some cases, deleted documents still appear in later searches

To Reproduce

The bug seems to depend on the specific content of the index. For example, start with a setup identical to #38 :

const db = new Lyra({
  schema: {
    txt: "string",
  },
  stemming: false // Disabling stemming to avoid confounding the results
})

await db.insert({ txt: 'stelle' })
//=> { id: '3A5o53WhYbi6o1i2ZgO7B' }
await db.insert({ txt: 'stele' })
//=> { id: 'DDWCn_B1z5SF9TRr8yIOb' }
await db.insert({ txt: 'snelle' })
//=> { id: 'XF3RgLkQEMY0OlFn95Nl_' }
await db.insert({ txt: 'stellle' })
//=> { id: 'UAUueC4rOGfnFhky7oEMm' }
await db.insert({ txt: 'scelte' })
//=> { id: '04fWgLHV3zA9GQHm8mFay' }

await db.search({ term: 'stelle', tolerance: 2 })
/* It returns two results, which is incorrect but not the point of this issue (see #38 for that):
{
  elapsed: '624μs',
  hits: [
    { id: '3A5o53WhYbi6o1i2ZgO7B', txt: 'stelle' },
    { id: 'UAUueC4rOGfnFhky7oEMm', txt: 'stellle' },
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined
  ],
  count: 2
}
*/

// Now delete one of the documents that were returned by the previous query:
await db.delete('UAUueC4rOGfnFhky7oEMm')
//=> true

await db.search({ term: 'stelle', tolerance: 2 })
/* Still returns two results:
{
  elapsed: '624μs',
  hits: [
    { id: '3A5o53WhYbi6o1i2ZgO7B', txt: 'stelle' },
    { id: 'UAUueC4rOGfnFhky7oEMm' },
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined
  ],
  count: 2
}
*/

Note that the last search results include the deleted document, although with no fields apart from the id.

Expected behavior

Deleted documents should not be returned by any search performed after the deletion.
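
The expected behavior can be illustrated with a toy inverted index in which deleting a document also removes its id from every posting list. This is an assumption-laden sketch, not Lyra's data structure: `createIndex` is hypothetical, it supports exact token lookup only, and it ignores typo tolerance.

```javascript
// Toy inverted index: token -> Set of document ids, plus a document store.
function createIndex() {
  const docs = new Map()
  const postings = new Map()
  const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)

  return {
    insert(id, txt) {
      docs.set(id, txt)
      for (const token of tokenize(txt)) {
        if (!postings.has(token)) postings.set(token, new Set())
        postings.get(token).add(id)
      }
    },
    remove(id) {
      const txt = docs.get(id)
      if (txt === undefined) return false
      // Remove the id from every posting list it appears in, not just the store.
      for (const token of tokenize(txt)) {
        const ids = postings.get(token)
        if (!ids) continue
        ids.delete(id)
        if (ids.size === 0) postings.delete(token)
      }
      docs.delete(id)
      return true
    },
    search(term) {
      const ids = postings.get(term.toLowerCase()) ?? []
      return [...ids].map((id) => ({ id, txt: docs.get(id) }))
    },
  }
}

const index = createIndex()
index.insert('a', 'stelle')
index.insert('b', 'stelle stellle')

const before = index.search('stelle')
index.remove('b')
const after = index.search('stelle')

console.log(before.length, after.length)
```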

Deleted documents still appear in searches - (through substrings)

Describe the bug
Most likely related to #39 ( @lucaong ). On Lyra 0.0.1-beta-14...

To Reproduce

import { Lyra } from '@nearform/lyra';

const lyra = new Lyra({
  schema: {
    word: "string",
  }
});

const { id: halo } = await lyra.insert({ word: 'Halo' });
const { id: halloween } = await lyra.insert({ word: 'Halloween' });
const { id: greenLantern } = await lyra.insert({ word: 'Hal' });

await lyra.delete(halo);

const search = await lyra.search({
  term: 'Hal',
});

console.log(search);

The result contains the id of the deleted document, but without its fields:

{
  elapsed: '164μs',
  hits: [
    { id: 'FUKQeo1l-rnOwzQyHLSlf', word: 'Hal' },
    { id: 'oMk8_x6NpV6siIn6SysEE' },
    { id: '3uv5A8-OCVIxfbwcWzxK7', word: 'Halloween' }
  ],
  count: 3
}

Expected behavior

{
  elapsed: '164μs',
  hits: [
    { id: 'FUKQeo1l-rnOwzQyHLSlf', word: 'Hal' },
    { id: '3uv5A8-OCVIxfbwcWzxK7', word: 'Halloween' }
  ],
  count: 2
}

Screenshot

Screenshot 2022-07-28 152442

[new language support] Apply a PR for chinese language support

Hello! First of all, thanks for Lyra.
I'm a new Chinese user, and I added it to my code five minutes after first seeing it; it is easy to understand and use.
Unfortunately, it does not support Chinese. I found that tokenizer/index.ts is loosely coupled, so I can add language support conveniently. May I submit a PR for Chinese language support?
I'm asking for your permission, since the guidelines say I need to ask before submitting a PR.
Thanks.

Is your feature request related to a problem? Please describe.
No Chinese language support.

Describe the solution you'd like
Add a regular expression in tokenizer/index.ts like

chinese: /[^a-z0-9_\u4e00-\u9fa5-]+/gim

It is easy to test in Node.js like

"chinese support test 中文 支持 测试".match(/[a-z0-9_\u4e00-\u9fa5-]+/gim)
>"[ 'chinese', 'support', 'test', '中文', '支持', '测试' ]"

(I'll do more tests for the regular expression.)
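
To see the proposed rule used the way a split-based tokenizer would use it, the sketch below lowercases the input and splits on the negated character class. The `tokenize` function is a hypothetical stand-in for the tokenizer in tokenizer/index.ts, modeled only on the `input.toLowerCase().split(splitRule)` pattern quoted in another issue on this page.

```javascript
// The proposed split rule: anything outside a-z, 0-9, "_", "-", and the
// CJK range \u4e00-\u9fa5 is treated as a separator.
const splitRule = /[^a-z0-9_\u4e00-\u9fa5-]+/gim

// Hypothetical tokenizer: lowercase, split on the rule, drop empty strings.
function tokenize(input) {
  return input.toLowerCase().split(splitRule).filter(Boolean)
}

const spaced = tokenize('chinese support test 中文 支持 测试')
const unspaced = tokenize('中文支持测试')

console.log(spaced, unspaced)
```

Note that the rule only separates on characters outside the class, so Chinese text written without spaces stays a single token; real word segmentation would need more than a regular expression.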

Natural dep residual - crash

Describe the bug
I was creating examples for #24 and noticed a crash on lyra beta 16

To Reproduce

import { create, insert, search } from '@nearform/lyra';

const db = create({
  schema: {
    author: 'string',
    quote: 'string'
  }
});

insert(db, {
  quote: 'If you really look closely, most overnight successes took a long time.',
  author: 'Steve Jobs'
});

insert(db, {
  quote: 'If you are not willing to risk the usual, you will have to settle for the ordinary.',
  author: 'Jim Rohn'
});


const searchResult = search(db, {
  term: 'if',
  properties: '*'
});

console.log(searchResult)

Screenshots
Schermata 2022-07-30 alle 11 35 44

Desktop (please complete the following information):

  • OS: Monterey 12.4
  • Node: 14.16.0 <= x <= 18.7.0

input is not a string type (crash on `remove` method)

Describe the bug
The remove method makes Lyra beta_17 crash.

It gives the following error

TypeError: input.toLowerCase is not a function

at

const tokens = input.toLowerCase().split(splitRule);

To Reproduce

import { create, insert, remove } from '@nearform/lyra';

const movieDB = create({
  schema: {
    title: 'string',
    director: 'string',
    plot: 'string',
    year: 'number',
    isFavorite: 'boolean'
  }
});

const { id: harryPotter } = insert(movieDB, {
  title: 'Harry Potter and the Philosopher\'s Stone',
  director: 'Chris Columbus',
  plot: 'Harry Potter, an eleven-year-old orphan, discovers that he is a wizard and is invited to study at Hogwarts. Even as he escapes a dreary life and enters a world of magic, he finds trouble awaiting him.',
  year: 2001,
  isFavorite: false
});


remove(movieDB, harryPotter);

Screenshots
Schermata 2022-07-31 alle 00 31 24

Desktop (please complete the following information):

  • OS: Monterey 12.4
  • Node: 14.16.0 <= x <= 18.7.0

Relevance of schema fields

Is your feature request related to a problem? Please describe.
It would be great if the schema could allow specifying the relevance for individual fields. For example, if you have a schema with a title and a content field, then in many cases a match on the title implies that the result is more relevant than others where the keyword just matches on the content.

Describe the solution you'd like
Here is an example API that makes title 10x more important for ranking than content:

const db = create({
  schema: {
    content: {relevance: 10, type: 'string'},
    title: {relevance: 100, type: 'string'},
  },
});

Describe alternatives you've considered
One could issue multiple calls to search, for each property, and then manually merge the results and rank them. That's a lot of work!
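To show what that manual alternative involves, here is a rough sketch of merging per-field results with weights. It assumes each per-field search has already produced a list of hit IDs; `mergeByRelevance` and the additive scoring are illustrative choices, not part of Lyra.

```typescript
type Hit = { id: string }
type RankedHit = { id: string; score: number }

// Sum the weight of every field in which a document matched,
// then sort documents by descending total score.
function mergeByRelevance(
  hitsByField: Record<string, Hit[]>,
  relevance: Record<string, number>,
): RankedHit[] {
  const scores = new Map<string, number>()
  for (const field of Object.keys(hitsByField)) {
    const weight = relevance[field] ?? 1 // fields without a weight count once
    for (const hit of hitsByField[field]) {
      scores.set(hit.id, (scores.get(hit.id) ?? 0) + weight)
    }
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score)
}

// Document "a" matches in both fields, document "b" only in the content.
const ranked = mergeByRelevance(
  { title: [{ id: 'a' }], content: [{ id: 'a' }, { id: 'b' }] },
  { title: 100, content: 10 },
)
console.log(ranked)
```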

Disk persistence

Is your feature request related to a problem? Please describe.
To cope with cloud environments (for example, instance reboots), an API to persist and restore the in-memory database would be fantastic.

Describe the solution you'd like
I'd like this to use the file system, because disks can be resized without rebooting the instance and disk space is cheaper than RAM.

TypeError when trying to run simple Node project with Lyra imported

Describe the bug
I get "TypeError: Cannot read properties of undefined (reading 'length')" when trying to run a bare-bones Node project that imports Lyra.

To Reproduce
Steps to reproduce the behavior:

  1. Make a Node project and import Lyra.
  2. Make a simple Lyra DB and add some items to it.
  3. Set up a variable containing a search result from the DB.
  4. console.log out the search result.
  5. Try to run the file by typing node FileName.
  6. See an error in the console.

Expected behavior
I expected the console log to run smoothly and print out the object that the Search function returns.

Screenshots
The example code I tried to run when I got the error.
bug

The error in the terminal.
bugterminal

Desktop

  • OS: Windows 10
  • Browser: Brave
  • Version: 1.42.97

Additional context
This bug shows up in Lyra 0.0.5. I don't know if it shows up in earlier versions.

Allow passing custom IDs and doc updating

Is your feature request related to a problem? Please describe.
Hi, I have documents that already have their own unique IDs, and when those docs are updated I want to be able to tell Lyra to update the index with that doc in mind. Currently I can do this like so:

  1. const {id: lyraId} = insert(db, doc)
  2. lyraMap[MY_DOC_ID] = lyraId
  3. an edit is made to MY_DOC_ID
  4. const lyraId = lyraMap[MY_DOC_ID]
  5. remove(db, lyraId)
  6. const {id: newLyraId} = insert(db, doc)
  7. lyraMap[MY_DOC_ID] = newLyraId

This is far too involved imo.

Describe the solution you'd like
The ability to pass a custom ID into the insert function, followed by an update function that efficiently updates the index with that record in mind. Or just modify the insert function to be more of an upsert method, e.g. you can always call something like insert(db, doc, {id: MY_ID}) and it will do the right thing (based on whether the ID already exists or not). If no ID is passed initially then an ID is generated as is the case now.
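Until something like that exists, the bookkeeping can be hidden behind a small wrapper. The sketch below is only an illustration: `SearchIndex`, `UpsertableIndex`, and `ToyIndex` are hypothetical names, and the wrapper still performs a remove followed by an insert rather than an efficient in-place update.

```typescript
// Hypothetical minimal index shape: insert returns a generated id, remove deletes by that id.
interface SearchIndex<T> {
  insert(doc: T): { id: string }
  remove(id: string): void
}

// Maps caller-supplied IDs to generated ones so callers never see the generated id.
class UpsertableIndex<T> {
  private readonly index: SearchIndex<T>
  private readonly internalIds = new Map<string, string>()

  constructor(index: SearchIndex<T>) {
    this.index = index
  }

  upsert(customId: string, doc: T): void {
    const previous = this.internalIds.get(customId)
    if (previous !== undefined) this.index.remove(previous)
    this.internalIds.set(customId, this.index.insert(doc).id)
  }
}

// Toy in-memory index, for demonstration only.
class ToyIndex<T> implements SearchIndex<T> {
  readonly docs = new Map<string, T>()
  private counter = 0

  insert(doc: T): { id: string } {
    const id = `doc-${this.counter++}`
    this.docs.set(id, doc)
    return { id }
  }

  remove(id: string): void {
    this.docs.delete(id)
  }
}

const toy = new ToyIndex<{ title: string }>()
const wrapped = new UpsertableIndex(toy)
wrapped.upsert('MY_DOC_ID', { title: 'first draft' })
wrapped.upsert('MY_DOC_ID', { title: 'second draft' })
console.log(Array.from(toy.docs.values()))
```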

Natural (stemming dependency) crashes in browser

Describe the bug
When running Lyra in a browser, you will get the following error:

chunk-OROXOI2D.js?v=4fc34af8:10 Uncaught Error: Dynamic require of "webworker-threads" is not supported
    at chunk-OROXOI2D.js?v=4fc34af8:10:9
    at ../../../node_modules/.pnpm/[email protected]/node_modules/natural/lib/natural/classifiers/classifier_train_parallel.js (classifier_train_parallel.js:6:13)
    at __require2 (chunk-OROXOI2D.js?v=4fc34af8:16:50)
    at ../../../node_modules/.pnpm/[email protected]/node_modules/natural/lib/natural/classifiers/classifier.js (classifier.js:28:25)
    at __require2 (chunk-OROXOI2D.js?v=4fc34af8:16:50)
    at ../../../node_modules/.pnpm/[email protected]/node_modules/natural/lib/natural/classifiers/bayes_classifier.js (bayes_classifier.js:26:20)
    at __require2 (chunk-OROXOI2D.js?v=4fc34af8:16:50)
    at ../../../node_modules/.pnpm/[email protected]/node_modules/natural/lib/natural/classifiers/index.js (index.js:25:27)
    at __require2 (chunk-OROXOI2D.js?v=4fc34af8:16:50)
    at ../../../node_modules/.pnpm/[email protected]/node_modules/natural/lib/natural/index.js (index.js:37:3)

To Reproduce
Steps to reproduce the behavior:

  1. Clone the Lyra repository
  2. Go inside the packages/examples/with-react directory
  3. Run the development server with pnpm dev
  4. Go to http://localhost:3000 and open the development console

Expected behavior
Lyra should work out of the box on browsers

Screenshots
Screenshot 2022-07-19 at 10 28 48

Desktop (please complete the following information):

  • OS: OSX Monterey
  • Browser: Chrome
  • Version: latest

Consider adding JSDoc to the APIs

Is your feature request related to a problem? Please describe.
I've started working on the docusaurus docs as in #24.
I think these two issues can be related to each other 🙂

Describe the solution you'd like
I'd like to add some JSDoc to current or future APIs.
While it is true they're quite intuitive, I think JSDoc would make Lyra's methods easier for end users to use.

Additional context
Before:
Screenshot 2022-07-19 111205

After (merely an example):
Screenshot 2022-07-19 111608

Extend query parameters by adding query clauses

Is your feature request related to a problem? Please describe.
As of now, Lyra is capable of indexing documents with searchable and non-searchable fields.

For instance, given the following schema, we index the following fields:

import { Lyra } from '@nearform/lyra';

const movieDB = new Lyra({
  schema: {
    // searchable fields
    title: 'string',
    director: 'string',
    plot: 'string',

    // non searchable
    year: 'number',
    isFavorite: 'boolean'
  }
});

Even though numbers and booleans are non-searchable fields, we should start using them for performing queries using the where keyword. An example could be:

const result = await movieDB.search({
  term: 'love',
  limit: 10,
  offset: 5,
  where: {
    year: { '>=': 1990 },
    isFavorite: true
  }
});

As a first iteration, we could support AND only (so, in the above example, WHERE year >= 1990 AND isFavorite = true). In the future, we might want to support AND, OR, CONTAINS, etc.
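A rough sketch of how an AND-only `where` clause could be evaluated against a single document. The types and the `matchesWhere` helper are hypothetical, written only to illustrate the semantics described above; they are not Lyra's implementation.

```typescript
type Operator = '>' | '>=' | '<' | '<=' | '='
type Comparison = Partial<Record<Operator, number>>
type WhereClause = Record<string, Comparison | boolean>
type Doc = Record<string, string | number | boolean>

const operators: Record<Operator, (value: number, operand: number) => boolean> = {
  '>': (value, operand) => value > operand,
  '>=': (value, operand) => value >= operand,
  '<': (value, operand) => value < operand,
  '<=': (value, operand) => value <= operand,
  '=': (value, operand) => value === operand,
}

// Every field condition must hold (AND semantics).
function matchesWhere(doc: Doc, where: WhereClause): boolean {
  return Object.keys(where).every((field) => {
    const condition = where[field]
    const value = doc[field]
    // Boolean fields are compared by equality.
    if (typeof condition === 'boolean') return value === condition
    // Numeric fields must satisfy every listed comparison.
    if (typeof value !== 'number') return false
    return (Object.keys(condition) as Operator[]).every((op) => {
      const operand = condition[op]
      return operand !== undefined && operators[op](value, operand)
    })
  })
}

const movies: Doc[] = [
  { title: 'Love Actually', year: 2003, isFavorite: true },
  { title: 'Love Story', year: 1970, isFavorite: true },
  { title: 'Love & Basketball', year: 2000, isFavorite: false },
]

const filtered = movies.filter((movie) =>
  matchesWhere(movie, { year: { '>=': 1990 }, isFavorite: true }),
)
console.log(filtered)
```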

bug: Default search language differs from default insert language

It was fun to read this source code, so thank you 👍

Describe the bug

Possibly it is a bug, but it may be the intended behavior, in which case, please close.

Looking at the insert API of the Lyra class:

https://github.com/nearform/lyra/blob/0fa4078a79dc7036042d2a6f0e6611555cfd924c/packages/lyra/src/lyra.ts#L233

We see inserts into the Trie are using tokens of the specified language.

When performing a search, we always tokenize the term as English:

lyra-tokenize

I expect this would give suboptimal results if the default language were not English.

To Reproduce

Steps to reproduce the behavior:

  1. Create an index with a non-English language
  2. Insert non-English documents into the index
  3. Perform a search using a non-English term
  4. Expect a suboptimal match

Expected behavior

Searching the trie would be performed in the default language unless otherwise specified.

Possible API:

  async search(params: SearchParams, language: Language = this.defaultLanguage): SearchResult {
    const tokens = tokenize(params.term, language).values();

Deleting one document also causes other documents to disappear from searches

Describe the bug
The bug occurs in lyra-0.0.1-beta-13.

Sorry for opening several bugs, I was trying to compare Lyra with MiniSearch and ran into these issues.

To Reproduce

const db = new Lyra({
  schema: {
    txt: "string",
  }
})

await db.insert({ txt: 'abc' })
//=> { id: 'J-LiSKu45O4d1phnVWvkc' }
await db.insert({ txt: 'abc' })
//=> { id: 'yx-s9o0-8I4uWcPsMyOWL' }
await db.insert({ txt: 'abcd' })
//=> { id: 'XoKJ6uWhRkftAt06UsdKo' }

await db.search({ term: 'abc', exact: true })
/* Returns what is expected:
{
  elapsed: '814μs',
  hits: [
    { id: 'J-LiSKu45O4d1phnVWvkc', txt: 'abc' },
    { id: 'yx-s9o0-8I4uWcPsMyOWL', txt: 'abc' },
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined,
    undefined
  ],
  count: 2
}
*/

// Delete one of the two documents containing term `"abc"`
db.delete('yx-s9o0-8I4uWcPsMyOWL')

await db.search({ term: 'abc', exact: true })
/* Returns no results, even if one doc with "abc" should still be in the index:
{
  elapsed: '770μs',
  hits: [
    undefined, undefined,
    undefined, undefined,
    undefined, undefined,
    undefined, undefined,
    undefined, undefined
  ],
  count: 0
}
*/

Expected behavior

Deleting one document should have no effect on other documents.

Add exact match

Is your feature request related to a problem? Please describe.
Right now, Lyra only implements prefix search.

Describe the solution you'd like
Lyra needs to implement an exact search. For example:

await lyra.search({
  term: "now",
  limit: 10,
  offset: 0,
  exact: true
});

Lyra should match:

{
  txt: "now is time"
}

but not:

{
  txt: "nowadays"
}
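The difference between the two modes can be sketched on plain tokens. This is a simplified illustration with a hypothetical `matches` helper and a naive tokenizer, not Lyra's trie-based lookup.

```typescript
// Naive tokenizer for the illustration: lowercase and split on non-alphanumerics.
function tokenizeWords(text: string): string[] {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0)
}

// With exact = true a token must equal the term; otherwise a shared prefix is enough.
function matches(text: string, term: string, exact: boolean): boolean {
  const needle = term.toLowerCase()
  return tokenizeWords(text).some((token) =>
    exact ? token === needle : token.startsWith(needle),
  )
}

console.log(matches('now is time', 'now', true))  // exact: whole token "now" is present
console.log(matches('nowadays', 'now', true))     // exact: "nowadays" is not "now"
console.log(matches('nowadays', 'now', false))    // prefix: "nowadays" starts with "now"
```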

Duplicate results when the term matches multiple fields

Describe the bug

On Lyra 0.0.1-beta-14 when a search term matches multiple fields, the results contain duplicates.

To Reproduce

Here's a reproduction test:

it("Should return unique results, even if the term matches multiple fields", async () => {
  const db = new Lyra({
    schema: {
      title: "string",
      text: "string",
    }
  });

  await db.insert({
    title: "something",
    text: "something",
  });

  const results = await db.search({ term: "something" });

  // The following assertion fails:
  // Expected: 1
  // Received: 2
  expect(results.count).toEqual(1);
});

Expected behavior

Results should always be unique, even if the search term matches multiple fields.
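One way to get that behavior is to deduplicate by document ID before counting. The sketch below assumes, hypothetically, that the raw matches arrive as one entry per matching field; `FieldHit` and `uniqueById` are illustrative names only.

```typescript
type FieldHit = { id: string; field: string }

// Keep the first occurrence of each document id, preserving order.
function uniqueById(hits: FieldHit[]): string[] {
  const seen = new Set<string>()
  const ids: string[] = []
  for (const hit of hits) {
    if (!seen.has(hit.id)) {
      seen.add(hit.id)
      ids.push(hit.id)
    }
  }
  return ids
}

// The same document matched on both "title" and "text".
const rawHits: FieldHit[] = [
  { id: 'doc-1', field: 'title' },
  { id: 'doc-1', field: 'text' },
]

const uniqueIds = uniqueById(rawHits)
console.log({ hits: uniqueIds, count: uniqueIds.length })
```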

Suggestion: add other JS libraries to the benchmark, like Flexsearch

Is your feature request related to a problem? Please describe.
It would be better if we could compare how Lyra performs against other existing full-text search JS libraries.

Describe the solution you'd like
I propose adding the main existing libraries to the benchmark, to make Lyra's benefits easier to compare.

What is the best way to do date range search

I have documents with dates (these can be converted to milliseconds since the epoch). What's the best way to formulate a query such as "date is less than a given parameter and text contains Parrot"?

Is this something that is in the roadmap, or doable already?
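As a stopgap, the range check can be done in application code on epoch milliseconds. This sketch filters a plain array and does not use any Lyra API; `DatedDoc` and `searchBefore` are hypothetical names, and the substring test stands in for a real full-text search.

```typescript
type DatedDoc = { text: string; date: string } // date as an ISO 8601 string

// Keep documents dated strictly before the cutoff whose text contains the term.
function searchBefore(docs: DatedDoc[], term: string, before: Date): DatedDoc[] {
  const cutoff = before.getTime() // milliseconds since the epoch
  const needle = term.toLowerCase()
  return docs.filter(
    (doc) =>
      new Date(doc.date).getTime() < cutoff &&
      doc.text.toLowerCase().includes(needle),
  )
}

const datedDocs: DatedDoc[] = [
  { text: 'The Parrot sketch', date: '2021-05-01T00:00:00Z' },
  { text: 'Parrot care guide', date: '2023-01-15T00:00:00Z' },
  { text: 'Cheese shop', date: '2020-03-10T00:00:00Z' },
]

const found = searchBefore(datedDocs, 'Parrot', new Date('2022-01-01T00:00:00Z'))
console.log(found)
```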

Typo tolerance

Is your feature request related to a problem?
As of now, Lyra does not implement typo tolerance during the search.

That means that if I misspell something like seaorse instead of seahorse, I don't get the expected results.

Describe the solution you'd like
Lyra should implement typo-tolerance using the Levenshtein algorithm or similar while traversing the trie.
Users should be able to disable the typo-tolerance feature as follows:

await lyra.search({
  term: "seaorse",
  limit: 5,
  offset: 2,
  exact: true,
  fixTypos: false
})

fixTypos should be enabled by default.
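For reference, a minimal sketch of the Levenshtein distance and how it could drive a tolerant lookup. It compares the term against a flat word list rather than traversing a trie, and the `fuzzyMatch` helper and its `tolerance` parameter are assumptions made for the example.

```typescript
// Classic dynamic-programming edit distance, keeping only two rows in memory.
function levenshtein(a: string, b: string): number {
  let previous: number[] = Array.from({ length: b.length + 1 }, (_, j) => j)
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      current[j] = Math.min(
        previous[j] + 1,        // deletion
        current[j - 1] + 1,     // insertion
        previous[j - 1] + cost, // substitution
      )
    }
    previous = current
  }
  return previous[b.length]
}

// Return every known word within `tolerance` edits of the term.
function fuzzyMatch(term: string, vocabulary: string[], tolerance = 1): string[] {
  return vocabulary.filter((word) => levenshtein(term, word) <= tolerance)
}

console.log(levenshtein('seaorse', 'seahorse'))
console.log(fuzzyMatch('seaorse', ['seahorse', 'sea', 'horse']))
```

Scanning the whole vocabulary costs time proportional to its size, which is why the proposal above applies the distance during trie traversal instead.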

Suggestion: Add `matchingTerms` array to `search()` output

Is your feature request related to a problem? Please describe.
I suspect a common use of this library is to search through a list of messages / emails etc and show matching results, and then highlight the matching terms in the resulting filtered list. This is not super-straightforward with the current implementation.

Describe the solution you'd like
In the search() output it would be useful to have a matchingTerms key for each hit - its value could be an array of terms that matched with the hit.

For example, I have a list of emails with titles: "Your bill", "View your bill", "Make a payment", "Foo bar".

  • I search for "bill payment".
  • search() seems to treat a multiple-word term as, in this case, "bill OR payment"
  • My app displays a list of emails that match the search criteria and I want to highlight "bill" and "payment" in each email's title to indicate why that email matched the search (in this case, those emails would be "Your bill" and "View your bill" and "Make a payment")
  • If we had a matchingTerms array for each hit, this would be very easy to achieve.
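The scenario above can be sketched in application code as follows. The `matchingTerms` function is a hypothetical helper that uses whole-word, case-insensitive matching; it only illustrates the requested output shape and is not something `search()` returns today.

```typescript
// Return the query terms that appear as whole words in the title.
function matchingTerms(title: string, query: string): string[] {
  const titleTokens = new Set(
    title.toLowerCase().split(/\s+/).filter((token) => token.length > 0),
  )
  return query
    .toLowerCase()
    .split(/\s+/)
    .filter((term) => term.length > 0 && titleTokens.has(term))
}

const titles = ['Your bill', 'View your bill', 'Make a payment', 'Foo bar']

// Treat the query as "bill OR payment" and attach the terms that matched.
const emailHits = titles
  .map((title) => ({ title, matchingTerms: matchingTerms(title, 'bill payment') }))
  .filter((hit) => hit.matchingTerms.length > 0)

console.log(emailHits)
```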

Describe alternatives you've considered
It's perfectly possible to achieve the above result without a matchingTerms output, but it seems to me it would be a welcome feature for many developers.

As an aside, it might actually help to spot bugs too.

Process.hrtime is undefined with Next.js since 0.0.5

Describe the bug

On upgrade to 0.1.1 (from 0.0.4) the search function spits out a process error in Next.js

TypeError: process.hrtime is undefined

Guessing this might be due to the changes in b51456c which were added to 0.0.5
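If that guess is right, a timer that feature-detects its environment would avoid the error. The sketch below is one possible approach under that assumption, not the fix applied in the library; `nowMs` is a hypothetical helper.

```typescript
// Return a timestamp in milliseconds using the best clock available at runtime.
function nowMs(): number {
  const g = globalThis as any
  // Node.js: process.hrtime() returns [seconds, nanoseconds].
  if (g.process && typeof g.process.hrtime === 'function') {
    const [seconds, nanoseconds] = g.process.hrtime() as [number, number]
    return seconds * 1e3 + nanoseconds / 1e6
  }
  // Browsers and other runtimes that expose the Performance API.
  if (g.performance && typeof g.performance.now === 'function') {
    return g.performance.now()
  }
  // Last resort: wall-clock time with millisecond resolution.
  return Date.now()
}

const startedAt = nowMs()
const elapsedMs = nowMs() - startedAt
console.log(`elapsed: ${elapsedMs}ms`)
```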

Here's a sandbox reproduction using the most popular TS/Next.js configuration

https://codesandbox.io/s/magical-carlos-3p11ow

Sorry I can't be more helpful and submit a PR, but I'm mostly a designer! Great little lib, ty for putting it together.

Chore: Distribution files should just be dist

Looking at the entry points of the distribution:

https://github.com/nearform/lyra/blob/6d4fd373b40e834fc0cfd7fe633460ac0a85b4e0/packages/lyra/package.json#L22-L24

The entry points all live in the dist folder.

The actual package includes multiple CI / Testing fixtures / coverage report files that are not needed and likely bloat the package install.

Image shows all files included that are not necessary:

image

Suggested change

Utilise the files declaration to denote which files should actually be included in the distribution.

Required change:

  // packages/lyra/package.json
  "files": [
    "dist"
  ]
