lilacai / lilac
Curate better data for LLMs
Home Page: http://lilacml.com
License: Apache License 2.0
Currently STRING_SPANs are implemented as {'start': 0, 'end': 10}.
The serialized data format does not encode 'start' and 'end', but there is runtime overhead in emitting this dictionary. Since we may emit a lot of spans for some datasets (e.g. sentence splits), we should optimize this by using a tuple.
This should have no API effect.
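A minimal sketch of the change (the helper names are hypothetical; the real span construction lives in the signal code):

# Current: a dict with 'start'/'end' keys is allocated for every span at runtime.
def make_span_dict(start: int, end: int) -> dict:
  return {'start': start, 'end': end}

# Proposed: a plain tuple avoids the per-span dict overhead.
def make_span_tuple(start: int, end: int) -> tuple:
  return (start, end)

# Accessors can hide the representation so callers see no API change.
def span_start(span: tuple) -> int:
  return span[0]

def span_end(span: tuple) -> int:
  return span[1]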
To filter for, for example, the presence of PII in a row.
This separation lets us show "media" (e.g. large text blobs) larger, and "metadata" as smaller auxiliary information.
Trying to add a UDF column to selectRow(Schema) for a concept score. Cohere embeddings have already been computed.
The request looks like this:
{
  "filters": [],
  "sort_by": [],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true
}
Getting the following Python error:
Failed to merge schemas. Origin schema has fields but destination does not
Traceback (most recent call last):
File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
return await original_route_handler(request)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 181, in select_rows_schema
return db.select_rows_schema(
File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 724, in select_rows_schema
return merge_schemas(col_schemas)
File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 212, in merge_schemas
_merge_field_into(cast(Field, s), cast(Field, merged_schema))
File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
_merge_field_into(subfield, destination.fields[field_name])
File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
_merge_field_into(subfield, destination.fields[field_name])
File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 197, in _merge_field_into
_merge_field_into(subfield, destination.fields[field_name])
File "/Users/jongejan/dev/lilac/./src/data/dataset_utils.py", line 192, in _merge_field_into
raise ValueError('Failed to merge schemas. Origin schema has fields but destination does not')
ValueError: Failed to merge schemas. Origin schema has fields but destination does not
This comes from: #198
Was looking a bit at how you've structured the routing.
I'd suggest putting all the Python endpoints behind a prefix like /api or such, enabling the frontend to expand into URL routing without conflicting with REST calls. It will also make it easier to serve the frontend and backend separately, which can come in very handy for scaling reasons.
I also think it's worth considering now how URL routing in the frontend should be supported, and whether it would be worth using Next.js for it.
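A sketch of what the prefix could look like with FastAPI (the router names here are illustrative, not the actual ones in the repo):

from fastapi import APIRouter, FastAPI

app = FastAPI()

# Hypothetical routers; the real ones live in router_dataset.py, etc.
dataset_router = APIRouter()
concept_router = APIRouter()

# Mounting everything under /api leaves all other URLs free for frontend routing.
app.include_router(dataset_router, prefix='/api/datasets')
app.include_router(concept_router, prefix='/api/concepts')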
As of #198 we can delete this.
Filing an issue since it's not high-priority and we'll break the old server.
This can happen now that LILAC_COLUMN is removed. If one path is the child of another, we make two selects, and merge the results. Currently, we just blindly merge them.
This can be short-circuited by only making the query to the parent, knowing the child will be there.
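A sketch of the short-circuit (prune_child_paths is a hypothetical helper; paths are tuples of field names):

def prune_child_paths(paths: list) -> list:
  """Drop any path whose ancestor is also selected; the parent query already returns it."""
  selected = set(paths)

  def has_selected_ancestor(path: tuple) -> bool:
    return any(path[:i] in selected for i in range(1, len(path)))

  return [p for p in paths if not has_selected_ancestor(p)]

# prune_child_paths([('a',), ('a', 'b'), ('c',)]) == [('a',), ('c',)]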
After #106 we can delete the entity index since we can derive all the indexes and what they enrich from the schema.
Add filter currently does nothing. We need to 1) add dataset-level filters to the redux state and 2) make the network request to the database with the filters applied.
Right now you can only supply the sort order globally, not per column supplied in sort_by.
One idea: instead of having two fields, perhaps this is a better interface:
{
  sort_by: Array<string | { column: Path, order: 'asc' | 'desc' }>
}
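Under this proposal, a plain string would use the global default order, while an object overrides it per column, e.g.:

{
  sort_by: ['text', { column: ['label'], order: 'desc' }]
}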
Fundamentally text splitters and signals do the same thing -- add a new column with data.
Splitters have special semantics about them generating spans, but signals should have that capability too. Let's generalize these together.
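A rough sketch of the unified shape (class and method names are hypothetical; spans use the {'start', 'end'} format from above):

class Signal:
  """Base: compute a new column of data for each input item."""

  def fields(self):
    raise NotImplementedError

  def compute(self, data):
    raise NotImplementedError

class SentenceSplitter(Signal):
  """A splitter is then just a signal whose output values are spans."""

  def compute(self, data):
    for text in data:
      spans, offset = [], 0
      for sentence in text.split('. '):  # naive split, for illustration only
        spans.append({'start': offset, 'end': offset + len(sentence)})
        offset += len(sentence) + 2
      yield spans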
The json_schema field in SignalInfo.ts is typed as any:
export type SignalInfo = {
  name: string;
  enrichment_type: EnrichmentType;
  json_schema: any;
};
But this field is important for populating arguments to signals and such.
This will allow the structure of computed signals and UDFs' signal outputs to look the same, making client code that reads them much simpler.
Currently sources are responsible for writing parquet, and this introduced a bug where the CSV source bypassed our write_items_to_parquet as an optimization. In this situation sources need to know to write UUIDs, wrap values, etc.
Let's simplify and make sources just emit an iterable of items, like signals do, and do that work for them.
In the future we can speed this up with dask & sharding.
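A sketch of the proposed shape (CSVSource here is illustrative; write_items_to_parquet is the existing helper):

import csv
from typing import Iterable

class Source:
  def process(self) -> Iterable[dict]:
    """Sources just yield plain items and know nothing about parquet."""
    raise NotImplementedError

class CSVSource(Source):
  def __init__(self, filepath: str) -> None:
    self.filepath = filepath

  def process(self) -> Iterable[dict]:
    with open(self.filepath) as f:
      yield from csv.DictReader(f)

# The framework then owns the rest: assign __rowid__ UUIDs, wrap values,
# and hand the items to write_items_to_parquet.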
When running the Python GitHub checks, the Python installation cache always misses. We should investigate making this a cache hit to speed up the Python checks.
If I make a select rows query like:
{
  ...
  columns: ['struct', 'field']
  ...
}
I would expect to get the value back nested; right now it comes back under a serialized, flattened key "struct.field".
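That is, for the query above:

Expected: {"struct": {"field": ...}}
Actual: {"struct.field": ...}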
This prevents me from requesting all columns (from the schema) and adding UDF columns in addition to them.
Vite is a super fast frontend server that is rapidly taking over frontend apps; it provides a much faster dev experience compared to webpack.
The configuration is also much simpler than webpack's, and it has built-in proxy configuration that can forward API calls to the Python server port.
Moving over to it would provide instant hot module reloading, making frontend dev faster.
Posting from our chat to keep track.
This code adds a new column in selectRows that dynamically computes a concept score:
const signal: ConceptScoreSignal = {
  signal_name: 'concept_score',
  namespace: 'local',
  concept_name: 'toxicity',
  embedding_name: 'cohere'
};
const transform: SignalTransform = { signal };
const conceptColumn: Column = {
  feature: [LILAC_COLUMN, 'comment_text', 'cohere', ENTITY_FEATURE_KEY],
  transform
};
I then tried to do the same for PII, but that doesn't work:
const signal: Signal = {
  signal_name: 'pii'
};
const transform: SignalTransform = { signal };
const conceptColumn: Column = {
  feature: [LILAC_COLUMN, 'label'],
  transform
};
Can't instantiate abstract class Signal with abstract method fields (type=type_error)
Support sending requests to selectRow and selectRowSchema (and maybe other places?) that can look like:
columns: [
  'comment_text',
  'comment_text.pii.email'
]
And add an escape character for fields with "." in them.
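A sketch of parsing such path strings, with backslash-dot as the escape (parse_path is a hypothetical helper; the real path type is a tuple of parts):

def parse_path(path: str) -> tuple:
  """Split 'comment_text.pii.email' into parts, honoring an escaped '\\.'."""
  parts, current, i = [], '', 0
  while i < len(path):
    if path[i] == '\\' and i + 1 < len(path) and path[i + 1] == '.':
      current += '.'
      i += 2
    elif path[i] == '.':
      parts.append(current)
      current = ''
      i += 1
    else:
      current += path[i]
      i += 1
  parts.append(current)
  return tuple(parts)

# parse_path('comment_text.pii.email') == ('comment_text', 'pii', 'email')
# parse_path('some\\.field.x') == ('some.field', 'x')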
With the following query:
{
  "filters": [
    {
      "path": ["text"],
      "op": "equals",
      "value": "She won't be there for long."
    }
  ],
  "sort_by": [],
  "sort_order": "ASC",
  "columns": [],
  "combine_columns": true
}
I'm getting the following error:
set_duckdb.py", line 690, in select_rows query = con.sql(f""" duckdb.ParserException: Parser Error: syntax error at or near "t" LINE 3: ... SELECT "text" AS "text", "label" AS "label", "label_value" AS "label_value", "__hfsplit__" AS "__hfsplit__", "__rowid__" AS "__rowid__" FROM t WHERE "text"['__value__'] = 'She won't be there for long.' ^
If I remove the ' from the string, the error disappears.
I believe this should be fixed on the Python side, not on the client side.
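A sketch of the Python-side fix, assuming the query is built as an f-string in db_dataset_duckdb.py: either double embedded single quotes before inlining, or (preferably) let duckdb bind the value as a parameter:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE t AS SELECT 'hello' AS text")
value = "She won't be there for long."

# Option 1: escape by doubling single quotes before inlining into the SQL string.
escaped = value.replace("'", "''")
con.sql(f"SELECT * FROM t WHERE text = '{escaped}'")

# Option 2 (preferred): parameter binding, no string interpolation at all.
con.execute("SELECT * FROM t WHERE text = ?", [value])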
This will hopefully prevent accidental charges of many dollars, and data potentially being sent to another server.
This will allow us to separate the Python class name, the unique identifier, and what we display in the UI.
Potentially a better way to structure the top-level tsconfig:
https://www.typescriptlang.org/docs/handbook/project-references.html
Since the csv source doesn't use write_items_to_parquet, it doesn't wrap primitives in __value__, so we can't work with the movies dataset.
Wrapping will either need to happen in pure SQL (fastest), or we should migrate to using a CSV parser in Python.
We should be able to compute the total count of a leaf; this is helpful for embedding computation and questions like "how many sentences or words are there?"
I tried to add the same logic from useGetIds to useGetItem to fetch the activeConcept as part of the item response. When I do, some (but not all) responses fail with the following error:
Traceback (most recent call last):
File "/Users/jongejan/dev/lilac/./src/router_utils.py", line 19, in custom_route_handler
return await original_route_handler(request)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 237, in app
raw_response = await run_endpoint_function(
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/fastapi/routing.py", line 165, in run_endpoint_function
return await run_in_threadpool(dependant.call, **values)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/starlette/concurrency.py", line 41, in run_in_threadpool
return await anyio.to_thread.run_sync(func, *args)
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
return await get_asynclib().run_sync_in_worker_thread(
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
return await future
File "/Users/jongejan/dev/lilac/.venv/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
result = context.run(func, *args)
File "/Users/jongejan/dev/lilac/./src/router_dataset.py", line 198, in select_rows
db.select_rows(
File "/Users/jongejan/dev/lilac/./src/data/db_dataset_duckdb.py", line 786, in select_rows
concept_model = self._concept_model_db.get(signal.namespace, signal.concept_name,
File "/Users/jongejan/dev/lilac/./src/concepts/db_concept.py", line 133, in get
return pickle.load(f)
_pickle.UnpicklingError: pickle data was truncated
Can it be because of the many parallel requests?
This is the code I'm adding, copy-pasted from useGetIds:
if (activeConcept) {
  const signal: ConceptScoreSignal = {
    signal_name: 'concept_score',
    namespace: activeConcept.concept.namespace,
    concept_name: activeConcept.concept.name,
    embedding_name: activeConcept.embedding.name,
  };
  const alias = getConceptAlias(
    activeConcept.concept,
    activeConcept.column,
    activeConcept.embedding
  );
  const transform: SignalTransform = {signal};
  const conceptColumn: Column = {feature: activeConcept.column, transform, alias};
  columns = [...(columns || []), conceptColumn];
}
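If the root cause is a reader hitting the concept-model pickle mid-write, one possible fix is to write it atomically. A sketch (save_pickle_atomically is hypothetical; the real write lives in db_concept.py):

import os
import pickle
import tempfile

def save_pickle_atomically(obj, path: str) -> None:
  """Write to a temp file in the same directory, then rename over the target.

  os.replace is atomic on POSIX, so a concurrent reader sees either the old
  file or the new one, never a truncated one.
  """
  fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(path))
  try:
    with os.fdopen(fd, 'wb') as f:
      pickle.dump(obj, f)
    os.replace(tmp_path, path)
  except BaseException:
    os.unlink(tmp_path)
    raise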
If I fetch rows right after a task has finished, select rows throws errors such as:
duckdb.BinderException: Binder Error: Duplicate alias "cohere(comment_text)" in query!
If I wait a bit before reloading the data, the error doesn't happen.
When these become dtypes, we can store data as {value: Embedding} and simply check dtypes for these entities to know that they have values.
This is towards the refactor where everything is a Node, and some Nodes have values (and/or children).
After #164 I will optimize this (just want to land it).
select_rows throws the following error when trying to sort by a UDF column:
ValueError: Column ('__lilac__', 'text', 'cohere', 'local/toxicity') is not defined as an alias in the given columns and is not defined in the select. The sort by path must be defined in either the columns or as a column alias.Available sort by aliases: {'*'}.
Available columns: [('text',), ('label',), ('label_value',), ('__hfsplit__',), ('__rowid__',), ('__lilac__',), Column(feature=('__lilac__', 'text', 'cohere'), alias=None, transform=SignalTransform(signal=ConceptScoreSignal(signal_name='concept_score', namespace='local', concept_name='toxicity', embedding_name='cohere')))].
Request:
{
  "limit": 40,
  "filters": [],
  "sort_by": [
    ["__lilac__", "text", "cohere", "local/toxicity"]
  ],
  "columns": [
    ["text"],
    ["label"],
    ["label_value"],
    ["__hfsplit__"],
    ["__rowid__"],
    ["__lilac__"],
    {
      "feature": ["__lilac__", "text", "cohere"],
      "transform": {
        "signal": {
          "signal_name": "concept_score",
          "namespace": "local",
          "concept_name": "toxicity",
          "embedding_name": "cohere"
        }
      }
    }
  ],
  "combine_columns": true,
  "offset": 0
}
According to Jonas:
"A lot of projects with those needs are using pnpm; it supports more complex nested structures, and it's starting to become a standard on a lot of projects I've seen lately."
In addition to signal_root, adding the signal info will allow the frontend to show information about the field, and to map it to UDFs.
Seems like the signal needs to be updated to correctly receive embeddings as a field. The enrichment type is still set to text.
This can be done totally as an internal detail: when we read from duckdb on the other side, we can fake lilac_item() on the result of the source selection before merging with signals.
The pro of this is that we don't modify sources, meaning the parquet file can be our public API. As of right now, users have to go through our write_items_to_parquet in Python, which is suboptimal if a user has data in BigQuery they want to dump.
This will basically just give us a single, large object that we return to users under select_rows.
We should also migrate everyone to field.
Suggestion: something I've really come to appreciate from vite land is the use of .env files. There are a few ways to use them, but the setup I'd suggest is the following:
A file called .env.example is added to the repo with content like this:
COHERE_API_KEY= # Your API key
LILAC_DATA_PATH=./gcs_cache
...
The user copies this file to .env and modifies it. This file is gitignored.
The content is loaded into Python code (using python-dotenv), set up so that environment variables that are exported overwrite the file's content. The content is also automatically loaded by vite, though only keys prefixed with VITE_ are accessible by client-side code, to avoid leaking data (https://vitejs.dev/guide/env-and-mode.html).
The start_dev_server script can check for the existence of the file and throw an error if the user hasn't yet created the .env file.
The big benefit of this setup is that it's obvious what env variables are available to be modified without reading a manual, and it's standardized across both client and server side. .env files are also very handy for deployments: it's easy to switch between environments by swapping the file.
If you think it's a good idea I'll happily implement it.
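A sketch of the Python side (python-dotenv's load_dotenv already defaults to not overriding exported environment variables):

import os
from dotenv import load_dotenv

# Loads .env into os.environ; with override=False (the default),
# variables already exported in the shell win over the file.
load_dotenv(override=False)

COHERE_API_KEY = os.environ.get('COHERE_API_KEY')
LILAC_DATA_PATH = os.environ.get('LILAC_DATA_PATH', './gcs_cache')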
db.manifest(), which holds embedding index info, is cached on a list of all the signal manifest files. But writing a new embedding index doesn't add any new signal manifests, just new embedding index manifests, so no cache eviction happens.
Three options:
This will allow us to see how sources were created.
Currently, the schema has "leafs", which returns all the nodes with values. This is technically incorrect, so let's introduce "petal" terminology for nodes with values; leafs are truly leafs.
Making embeddings entities makes them look like spans (e.g. a derivation of source data). This will let us remove "vector_based=True" and have signals talk about the type of embedding via EnrichmentType.
This will also make things line up nicely for when embeddings can eventually be stored in duckdb instead of on disk.
We often break the UI because of small bugs that happen between browser <-- pydantic --> handler.
From Daniel:
Currently parquet files generated by duckdb are different from parquet files generated manually in Python.
The difference is how the __rowid__ uuid values are treated:
- Python writes __rowid__ as bytes
- duckdb writes __rowid__ as a logical UUID
This makes duckdb sometimes return a string and sometimes return bytes when reading the uuid column. To fix this we need to cast uuid() to bytes when we create the parquet files; however, casting is not yet supported.
Filed a feature request in duckdb: duckdb/duckdb#5705
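Until then, a sketch of the Python side writing __rowid__ as raw bytes, so both producers could agree on the physical type (assumes pyarrow, which the manual writer presumably uses):

import uuid
import pyarrow as pa
import pyarrow.parquet as pq

# 16-byte uuid values stored as fixed-size binary, not as a logical UUID type.
rowids = [uuid.uuid4().bytes for _ in range(3)]
table = pa.table({'__rowid__': pa.array(rowids, type=pa.binary(16))})
pq.write_table(table, 'items.parquet')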
Currently tasks that are backgrounded don't propagate errors back to the UI.