lilacai / lilac

Curate better data for LLMs

Home Page: http://lilacml.com

License: Apache License 2.0

Shell 0.42% Python 50.05% JavaScript 0.10% CSS 0.23% TypeScript 11.21% HTML 0.03% Jupyter Notebook 18.12% Svelte 19.74% Dockerfile 0.12%
artificial-intelligence data-analysis dataset-analysis unstructured-data

lilac's People

Contributors

albertvillanova, brilee, contributorrandom, dechantoine, drikster80, dsmilkov, halfdanj, hinthornw, hynky1999, nsthorat


lilac's Issues

STRING_SPAN should be implemented as a tuple, not a dictionary.

Currently STRING_SPANs are implemented as {'start': 0, 'end': 10}.

The data format will not encode 'start' and 'end', but there is runtime overhead in emitting this dictionary. Since we may emit a lot of spans for some datasets (e.g. sentence splits), we should optimize this by using a tuple.

This should have no API effect.
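
A minimal sketch of the difference; the helper names below are hypothetical, not the actual API:

# Today each span is emitted as a dict per occurrence:
def span_as_dict(start: int, end: int) -> dict:
  return {'start': start, 'end': end}

# Proposed: a plain (start, end) tuple, which is cheaper to construct and
# carries the same information; the serialized data format is unchanged.
def span_as_tuple(start: int, end: int) -> tuple:
  return (start, end)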

Unable to UDF concept_score

Trying to add a UDF column to selectRows(Schema) for a concept score. Cohere embeddings have already been computed.

The request looks like this:

{
  "filters": [],
  "sort_by": [],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true
}

Getting the following Python error:

Failed to merge schemas. Origin schema has fields but destination does not
Traceback (most recent call last):
  File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
    return await original_route_handler(request)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 181, in select_rows_schema
    return db.select_rows_schema(
  File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 724, in select_rows_schema
    return merge_schemas(col_schemas)
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 212, in merge_schemas
    _merge_field_into(cast(Field, s), cast(Field, merged_schema))
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 192, in _merge_field_into
    raise ValueError('Failed to merge schemas. Origin schema has fields but destination does not')
ValueError: Failed to merge schemas. Origin schema has fields but destination does not

Routing considerations

Was looking a bit at how you've structured the routing.
I'd suggest putting all the Python endpoints behind a prefix like /api, so that the frontend can expand into URL routing without conflicting with REST calls. It will also make it easier to serve the frontend and backend separately, which can come in handy, for example, for scaling.
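
A minimal sketch of what this could look like with FastAPI's include_router; the router module names here are hypothetical stand-ins for the existing routers:

from fastapi import FastAPI

# Hypothetical imports standing in for the existing dataset/concept routers.
from src import router_concept, router_dataset

app = FastAPI()

# Every Python endpoint lives behind /api, so frontend URL routes like
# /datasets/... never collide with REST calls.
app.include_router(router_dataset.router, prefix='/api/v1/datasets')
app.include_router(router_concept.router, prefix='/api/v1/concepts')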

I also think it's worth considering now how URL routing in the frontend should be supported, and whether, for example, it would be worth using Next.js for it.

Allow sort_order per column

Right now you can only supply sort order globally, not per column supplied in sort_by.

One idea: instead of having two fields, perhaps this is a better interface:

{
  sort_by: Array<string | { column: Path, order: 'asc' | 'desc' }>
}

SignalInfo json_schema is untyped

The json_schema field in SignalInfo.ts is typed as any:

export type SignalInfo = {
    name: string;
    enrichment_type: EnrichmentType;
    json_schema: any;
};

But this field is important for populating arguments to signals and such.

Sources should emit items and not be responsible for writing to parquet internally.

Currently sources are responsible for writing parquet, and this introduced a bug where the CSV source bypassed our write_items_to_parquet as an optimization. In this situation sources need to know to write UUIDs, wrap values, etc.

Let's simplify and make sources just emit an iterable of items, like signals, and do the work for them.

In the future we can speed this up with dask & sharding.
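
A rough sketch of the proposed shape, just to make the idea concrete; the class and method names below are assumptions, not the actual API:

from typing import Any, Iterable

Item = dict[str, Any]

class Source:
  """Hypothetical base class: a source only yields items."""

  def process(self) -> Iterable[Item]:
    raise NotImplementedError

class CSVSource(Source):
  """Example: the CSV source just parses rows and yields them."""

  def __init__(self, filepath: str) -> None:
    self.filepath = filepath

  def process(self) -> Iterable[Item]:
    import csv
    with open(self.filepath, newline='') as f:
      yield from csv.DictReader(f)

# The framework, not the source, then assigns UUIDs, wraps values, and writes
# parquet, e.g.: write_items_to_parquet(source.process(), output_dir, schema)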

Explicit columns don't get nested

If I make a select rows query like

{
... 
columns: ['struct', 'field']
...
}

I would expect to get the value back nested; right now it comes back under a serialized, flattened key "struct.field".

This prevents me from requesting all columns (from the schema) and adding UDF columns in addition to them.

Relax the signal API to return primitives.

We currently don't allow signals to return primitives; they have to return a named object. This makes the structure a little more cumbersome.

We ideally have the following APIs when splitting & enriching:

(API sketches attached as screenshots: IMG_6830 and IMG_5156.)
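
For illustration, a hedged sketch of what a relaxed signal could look like; the compute method and base-class details here are assumptions:

from typing import Iterable, Optional

class TextLengthSignal:
  """Hypothetical signal that returns a bare primitive per item."""

  name = 'text_length'

  def compute(self, data: Iterable[str]) -> Iterable[Optional[int]]:
    for text in data:
      # With the relaxed API, yield the primitive directly...
      yield len(text) if text is not None else None
      # ...instead of a named object like {'text_length': len(text)}.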

Use Vite for dev server

Vite is a super fast frontend dev server that is rapidly taking over frontend apps; it provides a much faster dev experience than webpack.

The configuration is also much simpler than webpack's, and it has built-in proxy configuration that can forward API calls to the Python server port.

Moving over to it would provide instant hot module reloading, making frontend dev faster.

UDF columns in selectRows

Posting from our chat to keep track

This code adds a new column in selectRows that dynamically computes concept score:

  const signal: ConceptScoreSignal = {
    signal_name: 'concept_score',
    namespace: 'local',
    concept_name: 'toxicity',
    embedding_name: 'cohere'
  };

  const transform: SignalTransform = { signal };
  const conceptColumn: Column = {
    feature: [LILAC_COLUMN, 'comment_text', 'cohere', ENTITY_FEATURE_KEY],
    transform
  };

I tried then to do the same for PII, but that doesn't work:

  const signal: Signal = {
    signal_name: 'pii'
  };

  const transform: SignalTransform = { signal };
  const conceptColumn: Column = {
    feature: [LILAC_COLUMN, 'label'],
    transform
  };

Can't instantiate abstract class Signal with abstract method fields (type=type_error)

Filter bug: String doesn't get escaped

With following query:

{
  "filters": [
    {
      "path": [
        "text"
      ],
      "op": "equals",
      "value": "She won't be there for long."
    }
  ],
  "sort_by": [],
  "sort_order": "ASC",
  "columns": [],
  "combine_columns": true
}

getting the following error:

set_duckdb.py", line 690, in select_rows
    query = con.sql(f"""
duckdb.ParserException: Parser Error: syntax error at or near "t"
LINE 3: ... SELECT "text" AS "text", "label" AS "label", "label_value" AS "label_value", "__hfsplit__" AS "__hfsplit__", "__rowid__" AS "__rowid__" FROM t WHERE "text"['__value__'] = 'She won't be there for long.' ^

If I remove the ' in the string, the error disappears.

I believe this should be fixed on the Python side and not on the client side.
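
A minimal sketch of a server-side fix, either escaping the literal or (preferably) binding it as a parameter; the table and column names here are illustrative:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT 'some text' AS text")

value = "She won't be there for long."

# Option 1: escape single quotes by doubling them when building SQL by hand.
escaped = value.replace("'", "''")
con.execute(f"SELECT * FROM t WHERE text = '{escaped}'")

# Option 2 (preferred): let duckdb bind the value as a prepared-statement parameter.
con.execute("SELECT * FROM t WHERE text = ?", [value]).fetchall()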

Fix csv source

Since the csv source doesn't use write_items_to_parquet, it doesn't wrap primitives in __value__, so we can't work with the movies dataset.

Wrapping will either need to happen in pure SQL (fastest), or we should migrate to using a CSV parser in Python.
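
If we go the Python-parser route, a hedged sketch of the wrapping; the helper name is hypothetical, VALUE_KEY as described above:

import csv

VALUE_KEY = '__value__'

def read_csv_items(filepath: str):
  """Read a CSV and wrap each primitive cell as {'__value__': cell}."""
  with open(filepath, newline='') as f:
    for row in csv.DictReader(f):
      yield {column: {VALUE_KEY: cell} for column, cell in row.items()}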

API error when adding concept to useGetItem

I tried to add the same logic from useGetIds to useGetItem to fetch the activeConcept as part of the item response. When I do, some (but not all) responses fail with the following error:

Traceback (most recent call last):
  File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
    return await original_route_handler(request)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 198, in select_rows
    db.select_rows(
  File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 786, in select_rows
    concept_model = self._concept_model_db.get(signal.namespace, signal.concept_name,
  File "/Users/jongejan/dev/lilac/./src/concepts/db_concept.py", line 133, in get
    return pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

Can it be because of the many parallel requests?

This is the code I'm adding, copy-pasted from useGetIds:

 if (activeConcept) {
    const signal: ConceptScoreSignal = {
      signal_name: 'concept_score',
      namespace: activeConcept.concept.namespace,
      concept_name: activeConcept.concept.name,
      embedding_name: activeConcept.embedding.name,
    };
    const alias = getConceptAlias(
      activeConcept.concept,
      activeConcept.column,
      activeConcept.embedding
    );
    const transform: SignalTransform = {signal};
    const conceptColumn: Column = {feature: activeConcept.column, transform, alias};
    columns = [...(columns || []), conceptColumn];
  }

selectRows throws error right after compute signal finishes

If I fetch rows right after a task has finished, select_rows throws errors such as

duckdb.BinderException: Binder Error: Duplicate alias "cohere(comment_text)" in query!

If I wait a bit before reloading the data, the error doesn't happen.

Entities: Embeddings, TextSpanEntities should be dtypes.

When these become dtypes, we can store data as {value: Embedding} and simply check dtypes for these entities to know that they have values.

This is towards the refactor where everything is a Node, and some Nodes have values (and/or children).

Sort by UDF

select_rows throws the following error when trying to sort by a UDF column:

ValueError: Column ('__lilac__', 'text', 'cohere', 'local/toxicity') is not defined as an alias in the given columns and is not defined in the select. The sort by path must be defined in either the columns or as a column alias.Available sort by aliases: {'*'}.
Available columns: [('text',), ('label',), ('label_value',), ('__hfsplit__',), ('__rowid__',), ('__lilac__',), Column(feature=('__lilac__', 'text', 'cohere'), alias=None, transform=SignalTransform(signal=ConceptScoreSignal(signal_name='concept_score', namespace='local', concept_name='toxicity', embedding_name='cohere')))].

Request:

{
  "limit": 40,
  "filters": [],
  "sort_by": [
    ["__lilac__", "text", "cohere", "local/toxicity"]
  ],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true,
  "offset": 0
}

Investigate pnpm instead of npm.

https://pnpm.io/

According to Jonas

"A lot of projects with those needs are using pnpm, it supports more complex nested structures. And is starting to become a standard on a lot of projects i've seen lately"

Add signal_info to schema fields

In addition to signal_root, adding the signal info will allow the frontend to show information about the field and map it to UDFs.

Add `semantic_search` back in

Seems like the signal needs to be updated to correctly receive embeddings as a field. Enrichment type is still set to text.

Don't wrap sources in VALUE_KEY before writing to parquet.

This can be done entirely as an internal detail: when we read from duckdb on the other side, we can fake lilac_item() on the result of the source selection before merging with signals.

The pro of this is that we don't modify sources, meaning the parquet file can be our public API. Right now, users have to go through our write_items_to_parquet in Python, which is suboptimal if a user has data in BigQuery they want to dump.

Use dotenv files for environment variables

Suggestion: something I've really come to appreciate from Vite land is the use of .env files. There are a few ways to use them, but the setup I'd suggest is the following.

A file called .env.example is added to the repo with content like this:

COHERE_API_KEY= # Your API key
LILAC_DATA_PATH=./gcs_cache
... 

The user copies this file to .env and modifies it. The .env file is gitignored.
The content is loaded into the Python code (using python-dotenv), set up so that environment variables that are already exported overwrite the file's contents. The content is also automatically loaded by Vite, though only keys prefixed with VITE_ are accessible to client-side code, to avoid leaking data (https://vitejs.dev/guide/env-and-mode.html).
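
A minimal sketch of the Python side, assuming python-dotenv; the variable names mirror the example file above:

import os

from dotenv import load_dotenv  # pip install python-dotenv

# With override=False (the default), variables already exported in the shell
# take precedence over the values in the .env file.
load_dotenv('.env', override=False)

COHERE_API_KEY = os.environ.get('COHERE_API_KEY')
LILAC_DATA_PATH = os.environ.get('LILAC_DATA_PATH', './gcs_cache')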

The start_dev_server script can check for the existence of the file and throw an error if the user hasn't yet created the .env file.

The big benefit of this setup is that it's obvious what env variables are available to be modified, without reading a manual, and it's standardized across both the client and the server side. .env files are also very handy for deployments: it's easy to switch between environments by swapping files.

If you think it's a good idea, I'll happily implement it.

Caching bug with `db.manifest()`

db.manifest(), which holds embedding index info, is cached on a list of all the signal manifest files. But writing a new embedding index doesn't add any new signal manifests, just new embedding index manifests, so no cache eviction happens.

Three options:

  1. Not great: add the embedding manifest files to the cache key. This forces db.manifest() to know about and enumerate every asset that gets written to disk.
  2. Simple: instead of adding the embedding manifest files to the cache key, write a uuid version file, and have anybody that writes files change that uuid (see the sketch below).
  3. Bullet-proof but slow over a networked drive like GCS or S3: recursively compute the mtime (modification time) of the gcs_cache folder. The OS doesn't give you folder-level mtime, so you compute it by recursively taking the latest mtime of any file inside.
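
A rough sketch of option 2, assuming a version file next to the dataset; every name here is hypothetical:

import functools
import os
import uuid

VERSION_FILE = 'manifest.version'

def bump_version(dataset_path: str) -> None:
  """Anything that writes new files to the dataset calls this to invalidate caches."""
  with open(os.path.join(dataset_path, VERSION_FILE), 'w') as f:
    f.write(uuid.uuid4().hex)

def _read_version(dataset_path: str) -> str:
  path = os.path.join(dataset_path, VERSION_FILE)
  return open(path).read() if os.path.exists(path) else ''

@functools.lru_cache(maxsize=None)
def _manifest_impl(dataset_path: str, version: str):
  ...  # expensive: read all signal and embedding index manifests from disk

def manifest(dataset_path: str):
  # The version string is part of the cache key, so bumping it evicts the cache.
  return _manifest_impl(dataset_path, _read_version(dataset_path))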

Cast uuid() to blob/bytes in duckdb when support lands

From Daniel:

Currently parquet files generated by duckdb are different from parquet files generated manually in python.

The difference is how the __rowid__ uuid values are treated:

  • manually generated parquet files have __rowid__ as bytes
  • duckdb generated parquet files have __rowid__ as logical UUID

This makes duckdb sometimes return a string and sometimes return bytes when reading the uuid column. To fix this we need to cast uuid() to bytes when we create the parquet files; however, casting is not yet supported.

Filed a feature request in duckdb: duckdb/duckdb#5705
