lilacai / lilac

Curate better data for LLMs

Home Page: http://lilacml.com

License: Apache License 2.0

Shell 0.42% Python 50.05% JavaScript 0.10% CSS 0.23% TypeScript 11.21% HTML 0.03% Jupyter Notebook 18.12% Svelte 19.74% Dockerfile 0.12%
artificial-intelligence data-analysis dataset-analysis unstructured-data

lilac's People

Contributors

albertvillanova, brilee, contributorrandom, dechantoine, drikster80, dsmilkov, halfdanj, hinthornw, hynky1999, nsthorat


lilac's Issues

STRING_SPAN should be implemented as a tuple, not a dictionary.

Currently STRING_SPANs are implemented as {'start': 0, 'end': 10}.

The data format will not encode 'start' and 'end', but there is runtime overhead in emitting this dictionary. Since we may emit a lot of spans for some datasets (e.g. sentence splits), we should optimize this by using a tuple.

This should have no API effect.
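
A minimal sketch of the difference; the helper names below are hypothetical, not the actual API:

# Today each span is emitted as a dict per occurrence:
def span_as_dict(start: int, end: int) -> dict:
  return {'start': start, 'end': end}

# Proposed: a plain (start, end) tuple, which is cheaper to construct and
# carries the same information; the serialized data format is unchanged.
def span_as_tuple(start: int, end: int) -> tuple:
  return (start, end)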

Unable to UDF concept_score

Trying to add a UDF column to selectRows(Schema) for a concept score. Cohere embeddings have already been computed.

The request looks like this:

{
  "filters": [],
  "sort_by": [],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true
}

Getting the following Python error:

Failed to merge schemas. Origin schema has fields but destination does not
Traceback (most recent call last):
  File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
    return await original_route_handler(request)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 181, in select_rows_schema
    return db.select_rows_schema(
  File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 724, in select_rows_schema
    return merge_schemas(col_schemas)
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 212, in merge_schemas
    _merge_field_into(cast(Field, s), cast(Field, merged_schema))
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
    _merge_field_into(subfield, destination.fields[field_name])
  File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 192, in _merge_field_into
    raise ValueError('Failed to merge schemas. Origin schema has fields but destination does not')
ValueError: Failed to merge schemas. Origin schema has fields but destination does not

Routing considerations

Was looking a bit at how you've structured the routing.
I'd suggest putting all the Python endpoints behind a prefix like /api, so that the frontend can expand into URL routing without conflicting with REST calls. It will also make it easier to serve the frontend and backend separately, which can come in handy, for example, for scaling.
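
A minimal sketch of what this could look like with FastAPI's include_router; the router module names here are hypothetical stand-ins for the existing routers:

from fastapi import FastAPI

# Hypothetical imports standing in for the existing dataset/concept routers.
from src import router_concept, router_dataset

app = FastAPI()

# Every Python endpoint lives behind /api, so frontend URL routes like
# /datasets/... never collide with REST calls.
app.include_router(router_dataset.router, prefix='/api/v1/datasets')
app.include_router(router_concept.router, prefix='/api/v1/concepts')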

I also think it's worth considering now how URL routing in the frontend should be supported, and whether, for example, it would be worth using Next.js for it.

Allow sort_order per column

Right now you can only supply sort order globally, not per column supplied in sort_by.

One idea: instead of having two fields, perhaps this is a better interface:

{
  sort_by: Array<string | { column: Path, order: 'asc' | 'desc' }>
}

SignalInfo json_schema is untyped

The json_schema field in SignalInfo.ts is typed as any:

export type SignalInfo = {
    name: string;
    enrichment_type: EnrichmentType;
    json_schema: any;
};

But this field is important for populating arguments to signals and such.

Sources should emit items and not be responsible for writing to parquet internally.

Currently sources are responsible for writing parquet, and this introduced a bug where the CSV source bypassed our write_items_to_parquet as an optimization. In this situation sources need to know to write UUIDs, wrap values, etc.

Let's simplify and make sources just emit an iterable of items, like signals, and do the work for them.

In the future we can speed this up with dask & sharding.
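
A rough sketch of the proposed shape, just to make the idea concrete; the class and method names below are assumptions, not the actual API:

from typing import Any, Iterable

Item = dict[str, Any]

class Source:
  """Hypothetical base class: a source only yields items."""

  def process(self) -> Iterable[Item]:
    raise NotImplementedError

class CSVSource(Source):
  """Example: the CSV source just parses rows and yields them."""

  def __init__(self, filepath: str) -> None:
    self.filepath = filepath

  def process(self) -> Iterable[Item]:
    import csv
    with open(self.filepath, newline='') as f:
      yield from csv.DictReader(f)

# The framework, not the source, then assigns UUIDs, wraps values, and writes
# parquet, e.g.: write_items_to_parquet(source.process(), output_dir, schema)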

Explicit columns don't get nested

If I make a select rows query like

{
... 
columns: ['struct', 'field']
...
}

I would expect to get the value back nested; right now it comes back under a serialized, flattened key "struct.field".

This prevents me from requesting all columns (from the schema) and adding UDF columns in addition to them.

Relax the signal API to return primitives.

We currently don't allow signals to return primitives; they have to return a named object. This makes the structure a little more cumbersome.

We ideally have the following APIs when splitting & enriching:

(API sketches attached as screenshots: IMG_6830 and IMG_5156.)
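
For illustration, a hedged sketch of what a relaxed signal could look like; the compute method and base-class details here are assumptions:

from typing import Iterable, Optional

class TextLengthSignal:
  """Hypothetical signal that returns a bare primitive per item."""

  name = 'text_length'

  def compute(self, data: Iterable[str]) -> Iterable[Optional[int]]:
    for text in data:
      # With the relaxed API, yield the primitive directly...
      yield len(text) if text is not None else None
      # ...instead of a named object like {'text_length': len(text)}.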

Use Vite for dev server

Vite is a super fast frontend dev server that is rapidly taking over frontend apps; it provides a much faster dev experience than webpack.

The configuration is also much simpler than webpack's, and it has built-in proxy configuration that can forward API calls to the Python server port.

Moving over to it would provide instant hot module reloading, making frontend dev faster.

UDF columns in selectRows

Posting from our chat to keep track

This code adds a new column in selectRows that dynamically computes concept score:

  const signal: ConceptScoreSignal = {
    signal_name: 'concept_score',
    namespace: 'local',
    concept_name: 'toxicity',
    embedding_name: 'cohere'
  };

  const transform: SignalTransform = { signal };
  const conceptColumn: Column = {
    feature: [LILAC_COLUMN, 'comment_text', 'cohere', ENTITY_FEATURE_KEY],
    transform
  };

I tried then to do the same for PII, but that doesn't work:

  const signal: Signal = {
    signal_name: 'pii'
  };

  const transform: SignalTransform = { signal };
  const conceptColumn: Column = {
    feature: [LILAC_COLUMN, 'label'],
    transform
  };

Can't instantiate abstract class Signal with abstract method fields (type=type_error)

Filter bug: String doesn't get escaped

With following query:

{
  "filters": [
    {
      "path": [
        "text"
      ],
      "op": "equals",
      "value": "She won't be there for long."
    }
  ],
  "sort_by": [],
  "sort_order": "ASC",
  "columns": [],
  "combine_columns": true
}

getting the following error:

set_duckdb.py", line 690, in select_rows
    query = con.sql(f"""
duckdb.ParserException: Parser Error: syntax error at or near "t"
LINE 3: ... SELECT "text" AS "text", "label" AS "label", "label_value" AS "label_value", "__hfsplit__" AS "__hfsplit__", "__rowid__" AS "__rowid__" FROM t WHERE "text"['__value__'] = 'She won't be there for long.' ^

If I remove the ' in the string, the error disappears.

I believe this should be fixed on the Python side and not on the client side.
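
A minimal sketch of a server-side fix, either escaping the literal or (preferably) binding it as a parameter; the table and column names here are illustrative:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT 'some text' AS text")

value = "She won't be there for long."

# Option 1: escape single quotes by doubling them when building SQL by hand.
escaped = value.replace("'", "''")
con.execute(f"SELECT * FROM t WHERE text = '{escaped}'")

# Option 2 (preferred): let duckdb bind the value as a prepared-statement parameter.
con.execute("SELECT * FROM t WHERE text = ?", [value]).fetchall()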

Fix csv source

Since the csv source doesn't use write_items_to_parquet, it doesn't wrap primitives in __value__, so we can't work with the movies dataset.

Wrapping will either need to happen in pure SQL (fastest), or we should migrate to using a CSV parser in Python.
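
If we go the Python-parser route, a hedged sketch of the wrapping; the helper name is hypothetical, VALUE_KEY as described above:

import csv

VALUE_KEY = '__value__'

def read_csv_items(filepath: str):
  """Read a CSV and wrap each primitive cell as {'__value__': cell}."""
  with open(filepath, newline='') as f:
    for row in csv.DictReader(f):
      yield {column: {VALUE_KEY: cell} for column, cell in row.items()}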

API error when adding concept to useGetItem

I tried to add the same logic from useGetIds to useGetItem to fetch the activeConcept as part of the item response. When I do, some (but not all) responses fail with the following error:

Traceback (most recent call last):
  File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
    return await original_route_handler(request)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
    raw_response = await run_endpoint_function(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
    return await anyio.to_thread.run_sync(func, *args)
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 198, in select_rows
    db.select_rows(
  File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 786, in select_rows
    concept_model = self._concept_model_db.get(signal.namespace, signal.concept_name,
  File "/Users/jongejan/dev/lilac/./src/concepts/db_concept.py", line 133, in get
    return pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated

Can it be because of the many parallel requests?

This is the code I'm adding, copy-pasted from useGetIds:

 if (activeConcept) {
    const signal: ConceptScoreSignal = {
      signal_name: 'concept_score',
      namespace: activeConcept.concept.namespace,
      concept_name: activeConcept.concept.name,
      embedding_name: activeConcept.embedding.name,
    };
    const alias = getConceptAlias(
      activeConcept.concept,
      activeConcept.column,
      activeConcept.embedding
    );
    const transform: SignalTransform = {signal};
    const conceptColumn: Column = {feature: activeConcept.column, transform, alias};
    columns = [...(columns || []), conceptColumn];
  }

selectRows throws error right after compute signal finishes

If I fetch rows right after a task has finished, select_rows throws errors such as

duckdb.BinderException: Binder Error: Duplicate alias "cohere(comment_text)" in query!

If I wait a bit before reloading the data, the error doesn't happen.

Entities: Embeddings, TextSpanEntities should be dtypes.

When these become dtypes, we can store data as {value: Embedding} and simply check dtypes for these entities to know that they have values.

This is towards the refactor where everything is a Node, and some Nodes have values (and/or children).

Sort by UDF

select_rows throws the following error when trying to sort by a UDF column:

ValueError: Column ('__lilac__', 'text', 'cohere', 'local/toxicity') is not defined as an alias in the given columns and is not defined in the select. The sort by path must be defined in either the columns or as a column alias.Available sort by aliases: {'*'}.
Available columns: [('text',), ('label',), ('label_value',), ('__hfsplit__',), ('__rowid__',), ('__lilac__',), Column(feature=('__lilac__', 'text', 'cohere'), alias=None, transform=SignalTransform(signal=ConceptScoreSignal(signal_name='concept_score', namespace='local', concept_name='toxicity', embedding_name='cohere')))].

Request:

{
  "limit": 40,
  "filters": [],
  "sort_by": [
    ["__lilac__", "text", "cohere", "local/toxicity"]
  ],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true,
  "offset": 0
}

Investigate pnpm instead of npm.

https://pnpm.io/

According to Jonas

"A lot of projects with those needs are using pnpm, it supports more complex nested structures. And is starting to become a standard on a lot of projects i've seen lately"

Add signal_info to schema fields

In addition to signal_root, adding the signal info will allow the frontend to show information about the field and map it to UDFs.

Add `semantic_search` back in

Seems like the signal needs to be updated to correctly receive embeddings as a field. Enrichment type is still set to text.

Don't wrap sources in VALUE_KEY before writing to parquet.

This can be done entirely as an internal detail: when we read from duckdb on the other side, we can fake lilac_item() on the result of the source selection before merging with signals.

The pro of this is that we don't modify sources, meaning the parquet file can be our public API. Right now, users have to go through our write_items_to_parquet in Python, which is suboptimal if a user has data in BigQuery they want to dump.

Use dotenv files for environment variables

Suggestion: something I've really come to appreciate from Vite land is the use of .env files. There are a few ways to use them, but the setup I'd suggest is the following.

A file called .env.example is added to the repo with content like this:

COHERE_API_KEY= # Your API key
LILAC_DATA_PATH=./gcs_cache
... 

The user copies this file to .env and modifies it. The .env file is gitignored.
The content is loaded into the Python code (using python-dotenv), set up so that environment variables that are already exported overwrite the file's contents. The content is also automatically loaded by Vite, though only keys prefixed with VITE_ are accessible to client-side code, to avoid leaking data (https://vitejs.dev/guide/env-and-mode.html).
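
A minimal sketch of the Python side, assuming python-dotenv; the variable names mirror the example file above:

import os

from dotenv import load_dotenv  # pip install python-dotenv

# With override=False (the default), variables already exported in the shell
# take precedence over the values in the .env file.
load_dotenv('.env', override=False)

COHERE_API_KEY = os.environ.get('COHERE_API_KEY')
LILAC_DATA_PATH = os.environ.get('LILAC_DATA_PATH', './gcs_cache')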

The start_dev_server script can check for the existence of the file and throw an error if the user hasn't yet created the .env file.

The big benefit of this setup is that it's obvious what env variables are available to be modified, without reading a manual, and it's standardized across both the client and the server side. .env files are also very handy for deployments: it's easy to switch between environments by swapping files.

If you think it's a good idea, I'll happily implement it.

Caching bug with `db.manifest()`

db.manifest(), which holds embedding index info, is cached on a list of all the signal manifest files. But writing a new embedding index doesn't add any new signal manifests, just new embedding index manifests, so no cache eviction happens.

Three options:

  1. Not great: add the embedding manifest files to the cache key. This forces db.manifest() to know about and enumerate every asset that gets written to disk.
  2. Simple: instead of adding the embedding manifest files to the cache key, write a uuid version file, and have anybody that writes files change that uuid (see the sketch below).
  3. Bullet-proof but slow over a networked drive like GCS or S3: recursively compute the mtime (modification time) of the gcs_cache folder. The OS doesn't give you folder-level mtime, so you compute it by recursively taking the latest mtime of any file inside.
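
A rough sketch of option 2, assuming a version file next to the dataset; every name here is hypothetical:

import functools
import os
import uuid

VERSION_FILE = 'manifest.version'

def bump_version(dataset_path: str) -> None:
  """Anything that writes new files to the dataset calls this to invalidate caches."""
  with open(os.path.join(dataset_path, VERSION_FILE), 'w') as f:
    f.write(uuid.uuid4().hex)

def _read_version(dataset_path: str) -> str:
  path = os.path.join(dataset_path, VERSION_FILE)
  return open(path).read() if os.path.exists(path) else ''

@functools.lru_cache(maxsize=None)
def _manifest_impl(dataset_path: str, version: str):
  ...  # expensive: read all signal and embedding index manifests from disk

def manifest(dataset_path: str):
  # The version string is part of the cache key, so bumping it evicts the cache.
  return _manifest_impl(dataset_path, _read_version(dataset_path))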

Cast uuid() to blob/bytes in duckdb when support lands

From Daniel:

Currently parquet files generated by duckdb are different from parquet files generated manually in python.

The difference is how the __rowid__ uuid values are treated:

  • manually generated parquet files have __rowid__ as bytes
  • duckdb generated parquet files have __rowid__ as logical UUID

This makes duckdb sometimes return a string and sometimes return bytes when reading the uuid column. To fix this we need to cast uuid() to bytes when we create the parquet files; however, casting is not yet supported.

Filed a feature request in duckdb: duckdb/duckdb#5705
