Comments (7)
Thank you so much for the details. This is good enough. Will get back to you soon.
from weaviate.
@darshilshahquantive Thank you for bringing up this issue.
Would you be able to share a script that reproduces it?
from weaviate.
I don't have a single script to share, because several of our applications use Weaviate rather than one script. I can, however, describe how we import data, how we run search queries, and so on.
If you can send a list of questions about our architecture, that would be very helpful.
from weaviate.
Hey @darshilshahquantive, if you could share the class schemas, how you import objects, and the queries used for testing, that would be helpful!
from weaviate.
We have three products. For each one, I am providing the functions used to define the schema, import objects, and retrieve objects.
PRODUCT 1
Class Schemas
The schema for a document class in Weaviate is defined as follows:
json
{
  "class": "Document",
  "properties": [
    { "name": "title", "dataType": ["string"] },
    { "name": "content", "dataType": ["text"] },
    { "name": "account_id", "dataType": ["string"] }
  ],
  "multiTenancyConfig": { "enabled": true },
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "resourceName": "openai-resource",
      "deploymentId": "text-embedding-ada-002"
    }
  }
}
Importing Objects
Objects are imported using the insert_chunked_data_into_vector_store function:
python
def insert_chunked_data_into_vector_store(chunks, uploaded_file_name, account_id, document_id):
    vector_store = get_raw_vector_store()
    WeaviateUtility.add_documents(
        vector_store,
        chunks,
        StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME,
        account_id,
        uploaded_file_name,
        document_id,
    )
Query Used for Testing
A function for fetching relevant documents based on text:
python
def get_relevant_documents_for_text(account_id, text, document_ids):
    vs = get_raw_vector_store()
    query = (
        vs.query.get(StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME, ["content", "source_file", "document_id"])
        .with_tenant(account_id)
        .with_hybrid(query=text, properties=["content"])
        .with_additional('rerank(property: "content" query: "' + text + '") { score }')
        .with_limit(5)
        .with_where({"path": ["document_id"], "operator": "ContainsAny", "valueTextArray": document_ids})
    )
    return query.do()
Explanation
Class Schemas: Defines the structure and properties of the data stored in Weaviate, including multi-tenancy and vectorizer configurations.
Importing Objects: Uses the Weaviate client to insert chunked data into the vector store.
Query: Fetches documents relevant to a given text, using a hybrid query with additional re-ranking based on content.
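A side note on the rerank clause in the query above: it is built by concatenating `text` directly into a GraphQL string, so a query text that contains a double quote would likely break the clause. Below is a minimal sketch of one way to escape it, assuming JSON-style string escaping is acceptable for the GraphQL string literal here. `build_rerank_clause` is a hypothetical helper, not part of the Weaviate client.

```python
import json


def build_rerank_clause(text: str, prop: str = "content") -> str:
    """Build the `_additional` rerank clause with the query text escaped.

    Hypothetical helper for illustration only.
    """
    # json.dumps wraps the value in double quotes and escapes embedded
    # quotes, backslashes, and newlines.
    return f"rerank(property: {json.dumps(prop)} query: {json.dumps(text)}) {{ score }}"


clause = build_rerank_clause('what is "hybrid" search?')
print(clause)
```

The result could then be passed as `.with_additional(build_rerank_clause(text))` in place of the concatenated string.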
PRODUCT 2
Class Schemas
Example schema for class:
json
{
  "class": "TrackBC",
  "multiTenancyConfig": {"enabled": true},
  "vectorizer": "none",
  "properties": [
    {"name": "work_item_id", "dataType": ["text"]},
    {"name": "dataset_ids", "dataType": ["text[]"]},
    {"name": "external_id", "dataType": ["text"]},
    {"name": "external_key", "dataType": ["text"]},
    {"name": "external_system_url", "dataType": ["text"]},
    {"name": "name", "dataType": ["text"]},
    {"name": "description", "dataType": ["text"]},
    {"name": "suggested_description", "dataType": ["text"]},
    {"name": "status", "dataType": ["text"]},
    {"name": "type", "dataType": ["text"]},
    {"name": "assignee_emails", "dataType": ["text[]"]},
    {
      "name": "effort",
      "dataType": ["object"],
      "nestedProperties": [
        {"dataType": ["number"], "name": "amount"},
        {"dataType": ["text"], "name": "unit"},
        {"dataType": ["text"], "name": "description"}
      ]
    },
    {"name": "hash", "dataType": ["text"]}
  ]
}
Importing Objects
Objects are imported using the add_single_object function:
python
def add_single_object(client: weaviate.Client, class_name: str, tenant: str, uuid: str, vector: list, object_data: Dict[str, Any]) -> None:
    tenants = get_existing_tenants(client, class_name)
    if tenant in tenants:
        # Validate the object against the class schema before creating it.
        valid = client.data_object.validate(class_name=class_name, data_object=object_data, uuid=uuid, vector=vector)
        if valid["valid"]:
            try:
                client.data_object.create(class_name=class_name, data_object=object_data, tenant=tenant, uuid=uuid, vector=vector)
            except weaviate.ObjectAlreadyExistsException:
                logger.info(f"Object {uuid} already exists in weaviate")
                raise
            except Exception as e:
                logger.error(f"Error while adding weaviate object: {e}")
                raise
        else:
            logger.error(f"Weaviate Object {object_data} is not valid")
            raise Exception(f"Weaviate Object {object_data} is not valid")
    else:
        logger.error(f"Tenant {tenant} does not exist for class {class_name}")
        raise Exception(f"Tenant {tenant} does not exist for class {class_name}")
Query Used for Testing
Retrieving data in batches:
python
def get_data_in_batch(client: weaviate.Client, class_name: str, tenant: str, where_filter: Dict, properties: List = [], batch_size: int = 200) -> List[Dict[str, Any]]:
    # Count the matching objects first, then page through them by offset.
    count = get_count_of_objects(client, class_name, tenant, where_filter)
    weaviate_data = []
    for i in range(0, count, batch_size):
        try:
            response = (
                client.query.get(class_name, properties)
                .with_tenant(tenant)
                .with_where(where_filter)
                .with_additional(["vector id"])
                .with_limit(batch_size)
                .with_offset(i)
                .do()
            )
            if response.get("errors"):
                logger.error(f"Error while getting objects in batch: {response['errors']}")
                raise Exception(f"Error while getting objects in batch: {response['errors']}")
            else:
                objects_list = response["data"]["Get"][class_name]
        except Exception as e:
            logger.error(f"Error while getting weaviate data in batch: {e}")
            raise
        weaviate_data.extend(objects_list)
    return weaviate_data
Explanation
Class Schemas: Defines the structure and properties for data stored in Weaviate, including multi-tenancy and nested properties.
Importing Objects: Uses the Weaviate client to validate and insert data objects into the vector store, ensuring that the correct tenant exists.
Query: Retrieves data in batches from Weaviate based on a filter, useful for testing and verifying the number of objects.
PRODUCT 3
Class Schemas
Class schemas are created dynamically based on the properties provided when creating the collection:
python
def _use_collection(self, collection_name: str, properties: List[str], non_vectorized_properties: List[str] = []) -> None:
    existing_collections = self.client.schema.get()["classes"]
    collection_names = [collection["class"] for collection in existing_collections]
    capitalized_collection_name = collection_name.capitalize()
    # Create the class only if it does not exist yet.
    if capitalized_collection_name not in collection_names:
        class_obj = {
            "class": capitalized_collection_name,
            "description": "A collection of documents",
            "multiTenancyConfig": {"enabled": True},
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": p.lower(),
                    "description": "string",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": p in non_vectorized_properties
                        }
                    },
                } for p in properties
            ],
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "ada",
                    "modelVersion": "002",
                    "type": "text",
                }
            }
        }
        self.client.schema.create_class(class_obj)
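One detail worth noting about `_use_collection`: Python's `str.capitalize()` uppercases the first character and lowercases everything after it. For all-lowercase collection names this makes no difference, but a mixed-case name would be altered. The sketch below shows the behaviour next to a variant that only touches the first character. `weaviate_class_name` is a hypothetical helper, not something from our codebase or the Weaviate client.

```python
def weaviate_class_name(collection_name: str) -> str:
    """Uppercase only the first character and leave the rest untouched."""
    return collection_name[:1].upper() + collection_name[1:]


# str.capitalize() also lowercases the remainder of the string.
print("distinct_column_values".capitalize())
print("TrackBC".capitalize())
print(weaviate_class_name("TrackBC"))
```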
Importing Objects
Checking document existence:
python
def _document_exist(self, collection_name: str, doc_id: str, namespace: str = None) -> bool:
    try:
        doc = (
            self.client.data_object.get_by_id(
                doc_id, class_name=collection_name.capitalize(), tenant=namespace
            )
            is not None
        )
        return doc
    except Exception:
        # Any lookup failure is treated as "document does not exist".
        return False
Adding a single object:
python
def _add_single_object_to_collection(self, document: Dict, collection_name: str, doc_id: str, overwrite: bool, namespace: str) -> bool:
    capitalized_collection_name = collection_name.capitalize()
    doc_exist = self._document_exist(capitalized_collection_name, doc_id, namespace)
    if not doc_exist or overwrite:
        properties = {k.lower(): v for k, v in document.items()}
        vec = None
        self.client.data_object.create(
            data_object=properties,
            class_name=capitalized_collection_name,
            uuid=doc_id,
            tenant=namespace,
            vector=vec,
        )
        return True
    return False
Adding multiple documents:
python
def _add_documents(self, documents: List[Dict], collection_name: str, namespace: str, overwrite: bool = False) -> None:
    for doc in documents:
        doc_id = str(doc["doc_id"])
        self._add_single_object_to_collection(doc, collection_name, doc_id, overwrite, namespace)
Queries Used for Retrieval
Constructing and executing queries:
python
def _get_documents(self, collection_name: str, query: str, num_results: int, filters: Dict = None, query_projections: List[str] = ["*"], query_properties: List[str] = ["*"], is_hybrid: bool = False, alpha: float = 0.5, namespace: str = None) -> List[Dict]:
    capitalized_collection_name = collection_name.capitalize()
    search = self.client.query.get(capitalized_collection_name, query_projections)
    if filters is not None:
        search = search.with_where(filters)
    if is_hybrid:
        raw_response = (
            search.with_hybrid(query=query, alpha=alpha, properties=query_properties)
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )
    else:
        raw_response = (
            search.with_near_text({"concepts": [query]})
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )
    documents = [
        {"score": item["_additional"]["score"], **item} for item in raw_response["data"]["Get"][capitalized_collection_name]
    ]
    return documents
We also use And operators in filters, both to retrieve the top matches and to delete objects by filter:
python
def _retrieve_top_distinct_values(
    self,
    query: str,
    database_name: str,
    schema_name: str,
    table_name: str,
    column_name: str,
    namespace: str,
    num_results=3,
) -> list[str]:
    """
    Retrieve the top distinct values for the given column and query.
    """
    # Named where_filter to avoid shadowing the built-in filter().
    where_filter = {
        "operator": "And",
        "operands": [
            {
                "path": ["database_name"],
                "operator": "Equal",
                "valueText": database_name,
            },
            {
                "path": ["schema_name"],
                "operator": "Equal",
                "valueText": schema_name,
            },
            {
                "path": ["table_name"],
                "operator": "Equal",
                "valueText": table_name,
            },
            {
                "path": ["column_name"],
                "operator": "Equal",
                "valueText": column_name,
            },
        ],
    }
    distinct_documents = self._get_documents(
        query=query,
        collection_name="distinct_column_values",
        num_results=num_results,
        namespace=namespace,
        filters=where_filter,
    )
    distinct_values = [doc["value"] for doc in distinct_documents]
    return distinct_values
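The And filter above repeats the same operand shape four times. As a rough sketch, the same structure could be generated from a mapping of property names to values. `build_and_equal_filter` is a hypothetical helper, and the example values are placeholders rather than real data.

```python
from typing import Any, Dict


def build_and_equal_filter(conditions: Dict[str, str]) -> Dict[str, Any]:
    """Build an And filter with one Equal/valueText operand per entry."""
    return {
        "operator": "And",
        "operands": [
            {"path": [name], "operator": "Equal", "valueText": value}
            for name, value in conditions.items()
        ],
    }


# Placeholder values for illustration only.
example_filter = build_and_equal_filter(
    {
        "database_name": "example_db",
        "schema_name": "example_schema",
        "table_name": "example_table",
        "column_name": "example_column",
    }
)
print(example_filter["operands"][0])
```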
Summary
Class Schemas: Defined dynamically when creating collections.
Importing Objects: Functions check for document existence and then add documents to the collection.
Retrieval Queries: Constructed to perform hybrid or near-text searches, with optional filters and projections.
I hope this helps. Please feel free to ask if anything is unclear.
from weaviate.
Also @rthiiyer82, I have the pprof heap profile output as well, in case it helps:
File: weaviate
Type: inuse_space
Time: Jun 10, 2024 at 7:31am (UTC)
Showing nodes accounting for 16638.89MB, 96.41% of 17258.55MB total
Dropped 406 nodes (cum <= 86.29MB)
      flat  flat%   sum%        cum   cum%
13244.84MB 76.74% 76.74% 13244.84MB 76.74%  github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw/distancer.Normalize (inline)
 1466.63MB  8.50% 85.24%  1466.63MB  8.50%  github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw.(*Deserializer).ReadLink
  418.43MB  2.42% 87.67%   418.43MB  2.42%  github.com/weaviate/weaviate/adapters/repos/db/vector/cache.(*shardedLockCache[go.shape.float32]).Grow
The actual heap usage is approximately 19.5 GB according to our monitoring metrics, while the profiler reports approximately 17.3 GB.
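To put the numbers side by side, here is a small arithmetic sketch using only the figures quoted above. It assumes pprof's "MB" means 2^20 bytes. The monitoring figure's unit is not known, so the profile total is converted both ways instead of computing a single gap.

```python
MIB = 1024 ** 2

profile_total_mb = 17258.55  # "total" from the pprof header
normalize_mb = 13244.84      # flat in-use space attributed to distancer.Normalize

# Share of the profiled heap held by Normalize allocations.
normalize_share = normalize_mb / profile_total_mb

# Assumption: pprof "MB" is 2**20 bytes.
profile_total_bytes = profile_total_mb * MIB
profile_total_gib = profile_total_bytes / 1024 ** 3  # binary gigabytes
profile_total_gb = profile_total_bytes / 1000 ** 3   # decimal gigabytes

print(f"Normalize share: {normalize_share:.2%}")
print(f"Profile total: {profile_total_gib:.2f} GiB / {profile_total_gb:.2f} GB")
```

Depending on which unit the monitoring dashboard uses, the difference from the reported 19.5 GB changes noticeably, so it is worth confirming the unit before reading too much into the gap.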
from weaviate.
@rthiiyer82 any updates on this?
By the way, I also see logs like the following continuously while memory is spiking, and they disappear as soon as Weaviate is restarted (GCTRACE logging is enabled).
Could this be contributing to the memory spikes?
{"action":"lsm_memtable_flush","class":"Account_accountsg_categories","error":"switch active memtable: init commit logger: open /var/lib/weaviate/accountaccountsg_categories//lsm/objects/segment-1718480924736645345.wal: no such file or directory","index":"accountaccountsg_categories","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/accountaccount_sg_categories//lsm/objects","shard":"","time":"2024-06-15T19:48:44Z"}
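To check whether these flush failures line up with the memory spikes, one option is to count them per class from the JSON log stream. This is a minimal sketch: `count_flush_failures` is a hypothetical helper, and the sample entries are abbreviated from the log line above (only the `action`, `class`, `level`, and `msg` fields are kept).

```python
import json
from collections import Counter
from typing import Iterable


def count_flush_failures(lines: Iterable[str]) -> Counter:
    """Count lsm_memtable_flush error entries per class."""
    failures: Counter = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("action") == "lsm_memtable_flush" and entry.get("level") == "error":
            failures[entry.get("class", "<unknown>")] += 1
    return failures


# Abbreviated sample entries; not a verbatim copy of real log output.
sample_lines = [
    '{"action":"lsm_memtable_flush","class":"Account_accountsg_categories","level":"error","msg":"flush and switch failed"}',
    "this line is not JSON",
    '{"action":"lsm_memtable_flush","class":"Account_accountsg_categories","level":"error","msg":"flush and switch failed"}',
]

print(count_flush_failures(sample_lines))
```

Bucketing the same entries by their `time` field would show whether the failures start before or after memory begins to climb.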
from weaviate.