vespa-engine / pyvespa
Python API for https://vespa.ai, the open big data serving engine
Home Page: https://pyvespa.readthedocs.io/
License: Apache License 2.0
Is there a way to create a Struct data type within a schema? I have an array that I’m trying to use as an imported field, but I can’t find any reference for the struct field.
Also, is there any built-in support for creating document summaries?
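For reference, here is a sketch of what a struct and an array-of-struct field look like in a `.sd` schema file directly (Vespa schema syntax, not pyvespa code; the struct and field names are illustrative):

```
struct my_entry {
    field key type string {}
    field value type int {}
}
field entries type array<my_entry> {
    indexing: summary
    struct-field key { indexing: attribute }
}
```

As the question notes, pyvespa's Field class does not appear to expose an equivalent of this.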
I'm trying to run a query using nearest neighbour search while limiting it to records with a certain value, but I get errors when I try to combine the fields in a single yql query.
Using the example from Image Search, I can run:
response = app.query(body={
    "yql": 'select * from sources * where ([{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})
and return results.
My records also have numerical attributes "value" and "cost". I can filter on a specific value of "value" or "cost" individually, or on both together, e.g.
response = app.query(body={
    "yql": 'select * from sources * where (value=100 and cost=10);',
    "hits": 100
})
but when I try to combine the embedding search while filtering to a value, I get an error
response = app.query(body={
    "yql": 'select * from sources * where (value=100 and [{"targetNumHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.features.query(embedding_text)": [0.632, -0.987, ..., 0.534],
    "ranking.profile": "embedding_similarity"
})
The error is
mismatched input 'nearestNeighbor' expecting {<EOF>, 'select', ';'}
Expected query example:
select * from sources * where ({"grammar": "tokenize", "targetHits": 100, "defaultIndex": "default"}userInput("this is a test"));
When deploying a specialized application package such as:
from vespa.gallery import TextSearch
app_package = TextSearch(id_field="id", text_fields=["title", "body"])
the instance returned by the deployment method could be specific to this specialized application, such as VespaTextSearch, inheriting from the base class Vespa.
Different app use cases have different needs and pyvespa currently lacks a pattern to encode those needs. Just to give another example, this would be useful to natively support TextImageSearch and similar use cases.
targetHits is the preferred name. targetNumHits is an old alias.
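For instance, the nearestNeighbor query above could use the preferred annotation as follows (a sketch; the query-embedding feature is omitted for brevity, and the field names are taken from the example above):

```python
# Query body using the preferred targetHits annotation instead of the
# deprecated targetNumHits alias.
body = {
    "yql": 'select * from sources * where ([{"targetHits":100}]nearestNeighbor(embedding_image,embedding_text));',
    "hits": 100,
    "ranking.profile": "embedding_similarity",
}
```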
Debugging pyvespa pipeline problems would be easier in a verbose mode where responses from select operations are output to stdout (results from requests, etc.). Maybe an environment variable or similar could enable it.
Current usage:
vespa_docker = VespaDocker(port=8080)
app = vespa_docker.deploy(
    application_package=app_package,
    container_memory="8G",
    disk_folder="/Users/username/app_folder"
)
Suggested usage:
vespa_docker = VespaDocker(
    port=8080,
    container_memory="8G",
    disk_folder="/Users/username/app_folder"
)
app = vespa_docker.deploy(
    application_package=app_package,
)
Reason: container config parameters belong in the initialization method, as we should only specify them once. The deploy method will be called every time we need to redeploy our application package, and it makes no sense to repeat container config args such as container_memory every time we redeploy to the same container.
The Field class distance_metric parameter should be "euclidean" instead of "enclidean". Example from code:
>>> Field(name="tensor_field",
... type="tensor<float>(x[128])",
... indexing=["attribute"],
... ann=HNSW(
... distance_metric="enclidean",
... max_links_per_node=16,
... neighbors_to_explore_at_insert=200,
... ),
... )
Hi,
I think there is a misleading situation when using the recall keyword, with respect to the number of results returned from app.query.
When running
query_results = app.query(query=query,
    query_model=query_model,
    recall=recall_docs,
)
query_results.get_hits()
The number of results is 10 (the default value of hits).
I think the default hits should be the number of docs in recall_docs:
query_results = app.query(query=query,
    query_model=query_model,
    recall=recall_docs,
    hits=len(recall_docs)
)
query_results.get_hits()
If the number of recall docs is less than 10, I would expect to get fewer than 10 results; likewise, when the number of docs in the recall list is more than 10, I would expect to get all the hits for the docs in the recall list.
Does this sound reasonable?
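The proposed default could be sketched like this (a hypothetical helper, not pyvespa API):

```python
def effective_hits(recall_docs, default_hits=10):
    # When a recall list is supplied, return enough hits to cover every
    # recalled document; otherwise keep the current default of 10.
    if recall_docs:
        return len(recall_docs)
    return default_hits
```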
When sampling additional documents for each relevant document, it might happen that the relevant document is included again by chance. This happens often when the number of additional documents being sampled is large compared to the expected number of documents available for sampling.
Are there any plans to make it simpler for users to utilize any type of ml model (cnn, rnn, gnn, etc.) they want with Vespa during inference?
pyvespa is a useful tool for integrating ml models with the Vespa engine, but it still feels limited. For example, in sequence-classification-task-with-vespa-cloud it is only able to load huggingface text models. Are there any plans to create a wrapper for users to implement their own models, with customizability for pre/post-processing, in pyvespa? I say this because having ml developers rewrite pre/post-processing in java is not a fun experience. Could this also be possible when the model runs inference within the content cluster? I am finding that loading embeddings directly into Vespa is painless, but trying to load models into Vespa causes some pain.
Thanks.
Current documentation is built by readthedocs once a PR gets merged to master. However, we should add a link check, and check that the documentation can be built, in our CI to catch documentation errors before they hit the master branch.
Ref question on the public Slack:
I am trying out pyvespa and I wanted to deploy my application to Docker in the cloud, where I only have endpoint access and not full Docker daemon access. In pyvespa I wasn't able to find a way to deploy without access to the whole Docker daemon. I found a way using a curl command:
curl --header Content-Type:application/zip --data-binary @application.zip localhost:19071/application/v2/tenant/default/prepareandactivate
Is there any way to do endpoint-based deployment using pyvespa?
On older versions of pyvespa (I was previously using 0.5.0) one had to manually create the HTTPAdapter for pyvespa. Now this is handled in the vespa library (I like the design choice) but the issue is that the maximum pool size for the HTTPAdapter's connections is no longer exposed. This destroys multithreading with Vespa, and should be exposed through the constructor:
Line 112 in e4b968e
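Until the constructor exposes it, a workaround can be sketched as follows (assuming the requests library, which the issue refers to; the function name and default size are illustrative):

```python
import requests
from requests.adapters import HTTPAdapter

def make_session(pool_maxsize: int = 32) -> requests.Session:
    # Mount an HTTPAdapter whose connection pool is sized for
    # multithreaded use instead of the library default of 10.
    session = requests.Session()
    adapter = HTTPAdapter(pool_connections=pool_maxsize, pool_maxsize=pool_maxsize)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
    return session
```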
I want to reproduce the modeling of this schema line in pyvespa code:
How can I write a Schema model in Python that generates such a line?
pyvespa documentation page for Vespa Cloud deployment is currently not functional from an integration test point of view. Update the notebook to make it functional.
Take #343 into account as it simplified certificate and key management.
Currently, the code below adds two fields named title to the application package schema instead of updating the existing title field.
from vespa.package import ApplicationPackage, Field

app_package = ApplicationPackage(name="news")

# Add title field
app_package.schema.add_fields(
    Field(name="title", type="string", indexing=["index", "summary"])
)

# Update title field
app_package.schema.add_fields(
    Field(name="title", type="string", indexing=["index", "summary"], index=["enable-bm25"])
)
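The expected upsert behavior could be sketched like this (using minimal stand-ins for the Field and Schema classes, not the actual vespa.package implementations):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Field:
    # minimal stand-in for vespa.package.Field
    name: str
    type: str = "string"

@dataclass
class Schema:
    fields: List[Field] = field(default_factory=list)

    def add_fields(self, *new_fields: Field) -> None:
        # Replace an existing field with the same name instead of
        # appending a duplicate entry.
        for new in new_fields:
            for i, existing in enumerate(self.fields):
                if existing.name == new.name:
                    self.fields[i] = new
                    break
            else:
                self.fields.append(new)
```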
The latest changes broke the collect training data doc. Fix doc and add code to an integration test.
Implement integration tests in python files instead. We would like to move away from the nbdev library.
Hi
In the document v1 API guide, the put operation (which is used under the hood by batch_update) separates the namespace from the schema, e.g.
http://hostname:8080/document/v1/namespace/music/docid/1
In pyvespa, the code for get_data and update_data does not separate the namespace from the schema, e.g.
end_point = "{}/document/v1/{}/{}/docid/{}?create={}".format(
    self.app.end_point, schema, schema, str(data_id), str(create).lower()
)
This causes a bug when updating documents with pyvespa (it either doesn't update the document or creates a duplicate if the create parameter is set to True).
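A sketch of the corrected endpoint construction, with the namespace as a separate argument that defaults to the schema name for backward compatibility (function and parameter names are illustrative):

```python
def document_endpoint(base_url, schema, data_id, namespace=None, create=False):
    # Separate the namespace from the schema, as in the /document/v1 API;
    # defaulting the namespace to the schema name matches current behavior.
    if namespace is None:
        namespace = schema
    return "{}/document/v1/{}/{}/docid/{}?create={}".format(
        base_url, namespace, schema, str(data_id), str(create).lower()
    )
```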
We currently implement put with the following method:
response = app.feed_data_point(
    schema="msmarco",
    data_id=1,
    fields={
        "id": "1",
        "title": "This is a text",
        "body": "This is the body of the text"
    }
)
A better way would be to have the methods insert_data, update_data, remove_data and get_data.
pyvespa currently generates a data plane certificate and key and stores it in a file every time a deployment is made. This behavior conflicts with the workflow of using the vespa-cli to generate API key and dataplane certificate and key.
I suggest we remove this functionality from pyvespa and rely solely on vespa-cli to set up certificates and keys for Vespa Cloud interaction.
The requirement of installing ML libraries through pip install pyvespa[ml] should only apply when using modules that require them, such as vespa.ml or vespa.experimental.ranking.
This should avoid a new occurrence of #341.
When running the documentation example from the how-to https://pyvespa.readthedocs.io/en/latest/howto/deploy_app_package/deploy-docker.html#Deploy-application-package-created-with-pyvespa, the script fails with the error:
RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="my_package")

from vespa.package import Field

app_package.schema.add_fields(
    Field(name="cord_uid", type="string", indexing=["attribute", "summary"]),
    Field(name="title", type="string", indexing=["index", "summary"], index="enable-bm25"),
    Field(name="abstract", type="string", indexing=["index", "summary"], index="enable-bm25")
)

from vespa.package import FieldSet

app_package.schema.add_field_set(
    FieldSet(name="default", fields=["title", "abstract"])
)

from vespa.package import RankProfile

app_package.schema.add_rank_profile(
    RankProfile(name="bm25", first_phase="bm25(title) + bm25(abstract)")
)

import os
from vespa.deployment import VespaDocker

disk_folder = "sample_application"  # specify your desired absolute path here
vespa_docker = VespaDocker(
    port=8083,
    disk_folder=disk_folder
)
app = vespa_docker.deploy(
    application_package=app_package,
)
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
/tmp/ipykernel_285/3496665072.py in <module>
8 )
9
---> 10 app = vespa_docker.deploy(
11 application_package = app_package,
12 )
/.env/lib/python3.8/site-packages/vespa/deployment.py in deploy(self, application_package)
261 self.export_application_package(application_package=application_package)
262
--> 263 return self._execute_deployment(
264 application_name=application_package.name,
265 disk_folder=self.disk_folder,
/.env/lib/python3.8/site-packages/vespa/deployment.py in _execute_deployment(self, application_name, disk_folder, container_memory, application_folder, application_package)
232
233 if not any(re.match("Generation: [0-9]+", line) for line in deployment_message):
--> 234 raise RuntimeError(deployment_message)
235
236 app = Vespa(
RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']
Hi, I am having some issues deploying the app locally on an M1 Mac. Was wondering if this is a known issue?
Versions:
This is what I see in docker desktop:
When I try to create an app deployment, the application hangs as follows.
from vespa.gallery import QuestionAnswering
from vespa.deployment import VespaDocker
app_package = QuestionAnswering()
vespa_docker = VespaDocker(port=8089)
app = vespa_docker.deploy(application_package=app_package)
Output
Waiting for configuration server.
Waiting for configuration server.
Waiting for configuration server.
... (repeats indefinitely)
Thanks for all the work in creating a python wrapper for vespa!
We have a class named RankProfile in the vespa.package module to create rank-profiles in the application package. However, we also have a class named RankProfile in the vespa.query module to define which rank-profile should be used in the query model.
This is unnecessarily confusing. I suggest we use vespa.query.Ranking instead of vespa.query.RankProfile to clarify the different use cases.
We currently create a Vespa Cloud instance each time the pyvespa integration tests are run. We should reuse the pyvespa integration test instance instead, because a new TLS certificate is created every time we create a new instance on Vespa Cloud.
Similar to what was done in #25 for data collection.
For automated testing, it is useful to be able to set a deploy timeout.
11:50:14 docs/sphinx/source/deploy-docker.ipynb
11:50:14 /usr/local/lib/python3.8/site-packages/runnb/runnb.py:28: DeprecationWarning: The notebook is NOT trusted.
11:50:14 warnings.warn('The notebook is NOT trusted.', DeprecationWarning)
11:50:47 Waiting for configuration server.
11:50:53 Waiting for configuration server.
11:50:58 Waiting for configuration server.
11:51:03 Waiting for configuration server.
11:51:08 Waiting for configuration server.
11:51:13 Waiting for configuration server.
11:51:18 Waiting for configuration server.
In https://docs.vespa.ai/en/vespa-quick-start.html we do:
vespa status deploy --wait 300
I suggest we add an optional wait parameter to the deploy command.
Example:
rank-profile collect_rank_features inherits default {
    first-phase {
        expression: random
    }
    ignore-default-rank-features
    rank-features {
        bm25(title)
        bm25(body)
        nativeRank(title)
        nativeRank(body)
    }
}
This is important when collecting training data for example.
It seems that previously defined volumes are not preserved. To reproduce try to deploy the same application with two different instances of VespaDocker. The first deployment will create the container. The second deployment will retrieve the already existing container but deployment will fail with
RuntimeError: ["Command failed. No directory or zip file found: '/app/application'", '']
Required to reproduce https://github.com/vespa-engine/sample-apps/blob/master/semantic-qa-retrieval/src/main/application/schemas/sentence.sd
document sentence inherits context {
    field sentence_embedding type tensor<float>(x[512]) {
        indexing: attribute|index
        attribute {
            distance-metric: euclidean
        }
        index {
            hnsw {
                max-links-per-node: 16
                neighbors-to-explore-at-insert: 500
            }
        }
    }
}
To reproduce, clean all local vespa images and try to deploy. It will run for a long time, and docker image ls -a will show many vespa images.
It seems that the process for deploying to prod is different from dev deployments. The prod deployment page asks for two zip files. I downloaded the application zip file from Vespa Cloud, but I see that it's possible to do this with a to_files call. However, I'm not sure how to generate the test package. Please help me understand the process.
I've deployed my Vespa app using pyvespa VespaDocker, which I can connect to on localhost on the same machine, but trying to connect to it from another machine results in a timeout. Do we need to run the application on 0.0.0.0 (as with flask, for example) to enable connections from external machines? If so, looking at the source code, "localhost" is hardcoded in several places, so I guess it's not currently possible?
pyvespa will fail if we deploy an app with VespaDocker.deploy and then try to redeploy after the container has been stopped externally. We need to add a check to see whether the app container exists and is running. If it exists but is not running, we need to start it again.
We create new application files every time we redeploy an application package.
Creating new files instead of modifying them in place breaks the bind-mount (see this issue), so changes to application files on the host are not propagated to the container, and new changes are not deployed.
Restarting the container re-establishes the bind-mount and should solve the problem.
Hi,
In the Evaluation application documentation example, there is no mention of the option of using recall for specific documents.
Since this option can be very useful in an evaluation process, I'm suggesting adding an example with the recall argument:
top_ids = [...]
query_evaluation = app.evaluate_query(
    eval_metrics=eval_metrics,
    query_model=query_model,
    query_id=query_data["query_id"],
    query=query_data["query"],
    id_field="id",
    relevant_docs=query_data["relevant_docs"],
    default_score=0,
    recall=("id", top_ids[1:3])
)
Since pyvespa is still under active development:
Please consider using python3 non-blocking coroutines, i.e. the async / await keywords.
Many modern python frameworks (fastapi, sanic, quart, starlette, ...) are based on async.
Integrating blocking code into an async application is burdensome (you have to delegate to threads or processes).
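A minimal sketch of what async support could look like, by delegating a blocking call to a worker thread (blocking_query stands in for a pyvespa call such as app.query; requires Python 3.9+ for asyncio.to_thread):

```python
import asyncio

def blocking_query(body):
    # stand-in for a blocking pyvespa call such as app.query(body=...)
    return {"root": {"children": []}}

async def query_async(body):
    # Delegate the blocking call to a worker thread so the event loop
    # is not blocked while waiting on the network.
    return await asyncio.to_thread(blocking_query, body)

result = asyncio.run(query_async({"yql": "select * from sources *;"}))
```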
vespa_docker = VespaDocker(
    port=8080,
    disk_folder="/User/username/sample_app",
    container_memory="8G"
)
vespa_docker.deploy(application_package=app_package)
After deploying in a Docker container like the above, we might want to continue to use the same running container in a future python session. For that, we need a way to instantiate VespaDocker from the container name:
VespaDocker.from_container_name(app_package.name)
In issue #231 it would have been nice to specify the container image to use. This is useful not only for custom builds or previews; some users might also want to use the GitHub Container Registry (ghcr.io/vespa-engine/vespa) instead of Docker Hub as the container registry.
Items to work on
Create a new method app.insert_data and add a deprecation warning to app.feed_data_point, to follow the same pattern as the other data operations.
As demonstrated in https://github.com/sha124/vespa/blob/main/VespaDocSimilarity.ipynb, if there is a parsing error in the query and the search returns no hits plus a 4xx status code, the error is swallowed. The library should raise an error in this case.
Hi 👋
When I try to ingest data into Vespa Cloud, I get this error: OSError: [Errno 24] Too many open files.
When I select only the first few documents in my dataset, the feed works. If I use the whole dataset, I get that error. I don't see a way to reset the connections or close files, so pyvespa won't let me upload any more data unless I quit the python session and start over. Synchronous batch feed works, but it is too slow for my use case.
Code:
# works
app.feed_batch(schema="myschema", batch=batch_data[:1000], batch_size=1000, total_timeout=200, asynchronous=True)
# fails
app.feed_batch(schema="myschema", batch=batch_data, batch_size=1000, total_timeout=200, asynchronous=True)
Hi,
I have a problem when trying to deploy an app with an onnx model to Docker on Windows 10.
When deploying the app on Docker, I get the following error:
vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
----------------------------------------------------------
RuntimeError: ["Uploading application '/app/application' using http://localhost:19071/application/v2/tenant/default/session", "Session 15 for tenant 'default' created.",
'Preparing session 15 using http://localhost:19071/application/v2/tenant/default/session/15/prepared',
'Request failed. HTTP status code: 400', 'Invalid application package: Error loading default.default: Could not parse schema file \'crossencoder.sd\':
Unknown symbol: Lexical error at line -1, column 356. Encountered: "\\\\" (92), after : ""', '']
The application package has worked on another OS, so the definition of it should not be the problem.
Example from a use case:
from collections import Counter
from typing import Dict

# retry utilities, presumably from the tenacity library
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1), stop=stop_after_attempt(10))
def send_feed_batch(self, feed_batch, total_timeout=10000):
    feed_results = self.app.feed_batch(
        batch=feed_batch, total_timeout=total_timeout
    )
    return feed_results

def index(self, corpus: Dict[str, Dict[str, str]], batch_size=1000):
    batch_feed = [
        {
            "id": idx,
            "fields": {
                "id": idx,
                "title": corpus[idx].get("title", None),
                "body": corpus[idx].get("text", None),
            },
        }
        for idx in list(corpus.keys())
    ]
    mini_batches = [
        batch_feed[i : i + batch_size]
        for i in range(0, len(batch_feed), batch_size)
    ]
    for idx, feed_batch in enumerate(mini_batches):
        feed_results = self.send_feed_batch(feed_batch=feed_batch)
        status_code_summary = Counter([x.status_code for x in feed_results])
        print(
            "Successful documents fed: {}/{}.\nBatch progress: {}/{}.".format(
                status_code_summary[200], len(feed_batch), idx, len(mini_batches)
            )
        )
    return 0