Weaviate memory usage increasing gradually over the time about weaviate HOT 7 OPEN

darshilshahquantive commented on September 27, 2024 1

Weaviate memory usage increasing gradually over the time

from weaviate.

Comments (7)

rthiiyer82 commented on September 27, 2024 1

Thank you so much for the details. This is good enough. Will get back to you soon.

rthiiyer82 commented on September 27, 2024

@darshilshahquantive Thank you for bringing up this issue.

Could you share a script that reproduces the issue?

darshilshahquantive commented on September 27, 2024

I don't have a single script to share, because several different applications use Weaviate rather than one script. I can, however, describe how we import the data, how we run search queries, and so on.

If you can send a list of questions about our architecture, that would be very helpful.

rthiiyer82 commented on September 27, 2024

Hey @darshilshahquantive, if you could provide the class schemas, how you are importing objects, and the queries used for testing, that would be helpful!

darshilshahquantive commented on September 27, 2024

We have three products. For each one, I am providing the functions used to define the schema, import objects, and retrieve objects.

PRODUCT 1
Class Schemas

The schema for a document class in Weaviate is defined as follows:

json

{
  "class": "Document",
  "properties": [
    { "name": "title", "dataType": ["string"] },
    { "name": "content", "dataType": ["text"] },
    { "name": "account_id", "dataType": ["string"] }
  ],
  "multiTenancyConfig": { "enabled": true },
  "vectorizer": "text2vec-openai",
  "moduleConfig": {
    "text2vec-openai": {
      "resourceName": "openai-resource",
      "deploymentId": "text-embedding-ada-002"
    }
  }
}

Importing Objects

Objects are imported using the insert_chunked_data_into_vector_store function:

python

def insert_chunked_data_into_vector_store(chunks, uploaded_file_name, account_id, document_id):
    vector_store = get_raw_vector_store()
    WeaviateUtility.add_documents(
        vector_store,
        chunks,
        StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME,
        account_id,
        uploaded_file_name,
        document_id,
    )

Query Used for Testing

A function for fetching relevant documents based on text:

python

def get_relevant_documents_for_text(account_id, text, document_ids):
    vs = get_raw_vector_store()
    query = (
        vs.query.get(StrategyConfig.Weaviate.VS_MULTI_TENANCY_CLASS_NAME, ["content", "source_file", "document_id"])
        .with_tenant(account_id)
        .with_hybrid(query=text, properties=["content"])
        .with_additional('rerank(property: "content" query: "' + text + '") { score }')
        .with_limit(5)
        .with_where({"path": ["document_id"], "operator": "ContainsAny", "valueTextArray": document_ids})
    )
    return query.do()

Explanation

Class Schemas: Defines the structure and properties of the data stored in Weaviate, including multi-tenancy and vectorizer configurations.
Importing Objects: Uses the Weaviate client to insert chunked data into the vector store.
Query: Fetches documents relevant to a given text, using a hybrid query with additional re-ranking based on content.
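
The query above builds the rerank clause by concatenating the raw search text into a GraphQL string, so a double quote or backslash in the text would produce a malformed query. A minimal sketch of one way to escape it, assuming JSON string escaping is acceptable for ordinary text; `build_rerank_clause` is a hypothetical helper, not part of the original code:

```python
import json


def build_rerank_clause(query_text: str, prop: str = "content") -> str:
    """Build the `_additional` rerank clause with the query text escaped.

    json.dumps emits a double-quoted string with quotes, backslashes and
    newlines escaped, so user text cannot terminate the literal early.
    (Hypothetical helper; the property name defaults to "content".)
    """
    return (
        f"rerank(property: {json.dumps(prop)} "
        f"query: {json.dumps(query_text)}) {{ score }}"
    )


# Example: a query containing double quotes.
print(build_rerank_clause('what is "HNSW"?'))
```

The result could be passed to `.with_additional(...)` in place of the concatenated string.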

PRODUCT 2
Class Schemas

Example schema for class:

json

{
  "class": "TrackBC",
  "multiTenancyConfig": {"enabled": true},
  "vectorizer": "none",
  "properties": [
    {"name": "work_item_id", "dataType": ["text"]},
    {"name": "dataset_ids", "dataType": ["text[]"]},
    {"name": "external_id", "dataType": ["text"]},
    {"name": "external_key", "dataType": ["text"]},
    {"name": "external_system_url", "dataType": ["text"]},
    {"name": "name", "dataType": ["text"]},
    {"name": "description", "dataType": ["text"]},
    {"name": "suggested_description", "dataType": ["text"]},
    {"name": "status", "dataType": ["text"]},
    {"name": "type", "dataType": ["text"]},
    {"name": "assignee_emails", "dataType": ["text[]"]},
    {
      "name": "effort",
      "dataType": ["object"],
      "nestedProperties": [
        {"dataType": ["number"], "name": "amount"},
        {"dataType": ["text"], "name": "unit"},
        {"dataType": ["text"], "name": "description"}
      ]
    },
    {"name": "hash", "dataType": ["text"]}
  ]
}

Importing Objects

Objects are imported using the add_single_object function:

python

def add_single_object(client: weaviate.Client, class_name: str, tenant: str, uuid: str, vector: list, object_data: Dict[str, Any]) -> None:
    tenants = get_existing_tenants(client, class_name)
    if tenant in tenants:
        valid = client.data_object.validate(class_name=class_name, data_object=object_data, uuid=uuid, vector=vector)
        if valid["valid"]:
            try:
                client.data_object.create(class_name=class_name, data_object=object_data, tenant=tenant, uuid=uuid, vector=vector)
            except weaviate.ObjectAlreadyExistsException:
                logger.info(f"Object {uuid} already exists in weaviate")
                raise
            except Exception as e:
                logger.error(f"Error while adding weaviate object: {e}")
                raise
        else:
            logger.error(f"Weaviate Object {object_data} is not valid")
            raise Exception(f"Weaviate Object {object_data} is not valid")
    else:
        logger.error(f"Tenant {tenant} does not exist for class {class_name}")
        raise Exception(f"Tenant {tenant} does not exist for class {class_name}")

Query Used for Testing

Retrieving data in batches:

python

def get_data_in_batch(client: weaviate.Client, class_name: str, tenant: str, where_filter: Dict, properties: List = [], batch_size: int = 200) -> List[Dict[str, Any]]:
    count = get_count_of_objects(client, class_name, tenant, where_filter)
    weaviate_data = []
    for i in range(0, count, batch_size):
        try:
            response = (
                client.query.get(class_name, properties)
                .with_tenant(tenant)
                .with_where(where_filter)
                .with_additional(["vector id"])
                .with_limit(batch_size)
                .with_offset(i)
                .do()
            )

            if response.get("errors"):
                logger.error(f"Error while getting weaviate data in batch: {response['errors']}")
                raise Exception(f"Error while getting weaviate data in batch: {response['errors']}")
            else:
                objects_list = response["data"]["Get"][class_name]
        except Exception as e:
            logger.error(f"Error while getting weaviate data in batch: {e}")
            raise

        weaviate_data.extend(objects_list)
    return weaviate_data

Explanation

Class Schemas: Defines the structure and properties for data stored in Weaviate, including multi-tenancy and nested properties.
Importing Objects: Uses the Weaviate client to validate and insert data objects into the vector store, ensuring that the correct tenant exists.
Query: Retrieves data in batches from Weaviate based on a filter, useful for testing and verifying the number of objects.

PRODUCT 3
Class Schemas

Class schemas are created dynamically based on the properties provided when creating the collection:

python

def _use_collection(self, collection_name: str, properties: List[str], non_vectorized_properties: List[str] = []) -> None:
    existing_collections = self.client.schema.get()["classes"]
    collection_names = [collection["class"] for collection in existing_collections]
    capitalized_collection_name = collection_name.capitalize()
    
    if capitalized_collection_name not in collection_names:
        class_obj = {
            "class": capitalized_collection_name,
            "description": "A collection of documents",
            "multiTenancyConfig": {"enabled": True},
            "vectorizer": "text2vec-openai",
            "properties": [
                {
                    "name": p.lower(),
                    "description": "string",
                    "dataType": ["text"],
                    "moduleConfig": {
                        "text2vec-openai": {
                            "skip": p in non_vectorized_properties
                        }
                    },
                } for p in properties
            ],
            "moduleConfig": {
                "text2vec-openai": {
                    "model": "ada",
                    "modelVersion": "002",
                    "type": "text",
                }
            }
        }
        self.client.schema.create_class(class_obj)
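
One detail worth checking in the function above: Python's `str.capitalize()` upper-cases the first character and lower-cases everything after it. For all-lowercase names such as `distinct_column_values` the result is the same either way, but a mixed-case name would be altered. A small sketch, where `to_class_name` is a hypothetical helper that changes only the first character:

```python
def to_class_name(collection_name: str) -> str:
    """Upper-case only the first character, leaving the rest untouched."""
    if not collection_name:
        raise ValueError("collection_name must not be empty")
    return collection_name[0].upper() + collection_name[1:]


print("trackBC".capitalize())                 # Trackbc
print(to_class_name("trackBC"))               # TrackBC
print("distinct_column_values".capitalize())  # Distinct_column_values
```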

Importing Objects

Checking document existence:

python

def _document_exist(self, collection_name: str, doc_id: str, namespace: str = None) -> bool:
    try:
        doc = (
            self.client.data_object.get_by_id(
                doc_id, class_name=collection_name.capitalize(), tenant=namespace
            )
            is not None
        )
        return doc
    except Exception:
        return False

Adding a single object:

python

def _add_single_object_to_collection(self, document: Dict, collection_name: str, doc_id: str, overwrite: bool, namespace: str) -> bool:
    capitalized_collection_name = collection_name.capitalize()
    doc_exist = self._document_exist(capitalized_collection_name, doc_id, namespace)
    if not doc_exist or (doc_exist and overwrite):
        properties = {k.lower(): v for k, v in document.items()}
        vec = None
        self.client.data_object.create(
            data_object=properties,
            class_name=capitalized_collection_name,
            uuid=doc_id,
            tenant=namespace,
            vector=vec,
        )
        return True
    return False

Adding multiple documents:

python

def _add_documents(self, documents: List[Dict], collection_name: str, namespace: str, overwrite: bool = False) -> None:
    for doc in documents:
        doc_id = str(doc["doc_id"])
        self._add_single_object_to_collection(doc, collection_name, doc_id, overwrite, namespace)
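
In `_add_single_object_to_collection`, the overwrite branch still calls `data_object.create` for an ID that already exists, which is the situation Product 2 guards against by catching `weaviate.ObjectAlreadyExistsException`. A minimal sketch that separates the decision from the client call; `WriteAction` and `choose_write_action` are hypothetical names, and mapping `REPLACE` to a replace-style client call (for example `data_object.replace`) is an assumption not verified against the client version in use:

```python
from enum import Enum


class WriteAction(Enum):
    CREATE = "create"    # object is new
    REPLACE = "replace"  # object exists and the caller asked to overwrite
    SKIP = "skip"        # object exists and must be left alone


def choose_write_action(doc_exists: bool, overwrite: bool) -> WriteAction:
    """Decide which write to issue for a document (hypothetical helper)."""
    if not doc_exists:
        return WriteAction.CREATE
    return WriteAction.REPLACE if overwrite else WriteAction.SKIP


for exists in (False, True):
    for overwrite in (False, True):
        print(exists, overwrite, choose_write_action(exists, overwrite).value)
```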

Queries Used for Retrieval

Constructing and executing queries:

python

def _get_documents(self, collection_name: str, query: str, num_results: int, filters: Dict = None, query_projections: List[str] = ["*"], query_properties: List[str] = ["*"], is_hybrid: bool = False, alpha: float = 0.5, namespace: str = None) -> List[Dict]:
    capitalized_collection_name = collection_name.capitalize()
    search = self.client.query.get(capitalized_collection_name, query_projections)
    
    if filters is not None:
        search = search.with_where(filters)
    
    if is_hybrid:
        raw_response = (
            search.with_hybrid(query=query, alpha=alpha, properties=query_properties)
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )
    else:
        raw_response = (
            search.with_near_text({"concepts": [query]})
            .with_additional("score")
            .with_limit(num_results)
            .with_tenant(namespace)
            .do()
        )

    documents = [
        {"score": item["_additional"]["score"], **item} for item in raw_response["data"]["Get"][capitalized_collection_name]
    ]
    return documents

We also use the And operator to filter documents, both when retrieving the top matches and when deleting via filter:

python

def _retrieve_top_distinct_values(
    self,
    query: str,
    database_name: str,
    schema_name: str,
    table_name: str,
    column_name: str,
    namespace: str,
    num_results=3,
) -> list[str]:
    """
    Retrieve the top distinct values for the given column and query.
    """
    filter = {
        "operator": "And",
        "operands": [
            {
                "path": ["database_name"],
                "operator": "Equal",
                "valueText": database_name,
            },
            {
                "path": ["schema_name"],
                "operator": "Equal",
                "valueText": schema_name,
            },
            {
                "path": ["table_name"],
                "operator": "Equal",
                "valueText": table_name,
            },
            {
                "path": ["column_name"],
                "operator": "Equal",
                "valueText": column_name,
            },
        ],
    }

    distinct_documents = self._get_documents(
        query=query,
        collection_name="distinct_column_values",
        num_results=num_results,
        namespace=namespace,
        filters=filter,
    )

    distinct_values = [doc["value"] for doc in distinct_documents]
    return distinct_values
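
The four equality operands above all share one shape, so the filter can be generated from a mapping of property names to values. A minimal sketch, assuming every condition is a text equality; `build_and_filter` is a hypothetical helper and not part of the original code:

```python
from typing import Any, Dict, List


def build_and_filter(equals: Dict[str, str]) -> Dict[str, Any]:
    """Build an And filter with one Equal/valueText operand per entry."""
    if not equals:
        raise ValueError("at least one condition is required")
    operands: List[Dict[str, Any]] = [
        {"path": [name], "operator": "Equal", "valueText": value}
        for name, value in equals.items()
    ]
    return {"operator": "And", "operands": operands}


# Example with placeholder values.
example_filter = build_and_filter({"database_name": "db1", "table_name": "orders"})
print(example_filter)
```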

Summary

Class Schemas: Defined dynamically when creating collections.
Importing Objects: Functions check for document existence and then add documents to the collection.
Retrieval Queries: Constructed to perform hybrid or near-text searches, with optional filters and projections.

I hope this helps. Please feel free to ask if anything is unclear.

darshilshahquantive commented on September 27, 2024

Also @rthiiyer82, I have pprof heap profile output as well, in case it helps:

File: weaviate
Type: inuse_space
Time: Jun 10, 2024 at 7:31am (UTC)
Showing nodes accounting for 16638.89MB, 96.41% of 17258.55MB total
Dropped 406 nodes (cum <= 86.29MB)
      flat  flat%   sum%        cum   cum%
13244.84MB 76.74% 76.74% 13244.84MB 76.74%  github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw/distancer.Normalize (inline)
 1466.63MB  8.50% 85.24%  1466.63MB  8.50%  github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw.(*Deserializer).ReadLink
  418.43MB  2.42% 87.67%   418.43MB  2.42%  github.com/weaviate/weaviate/adapters/repos/db/vector/cache.(*shardedLockCache[go.shape.float32]).Grow

The actual heap usage is approximately 19.5 GB according to our monitoring metrics, while the profiler accounts for only about 17.3 GB.
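
The figures in the profile can be cross-checked with a little arithmetic. The sketch below parses one row of the pprof `top` output, recomputes the share taken by `distancer.Normalize`, and computes the gap between the monitored and profiled totals. All numbers come from this comment; `parse_top_row` and the regular expression are hypothetical helpers that only handle rows reported in MB.

```python
import re
from typing import NamedTuple


class TopRow(NamedTuple):
    flat_mb: float
    flat_pct: float
    sum_pct: float
    cum_mb: float
    cum_pct: float
    name: str


# flat, flat%, sum%, cum, cum%, then the function name (may contain spaces).
_ROW = re.compile(
    r"^\s*([\d.]+)MB\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)MB\s+([\d.]+)%\s+(.+?)\s*$"
)


def parse_top_row(line: str) -> TopRow:
    """Parse one pprof `top` row whose sizes are given in MB."""
    m = _ROW.match(line)
    if m is None:
        raise ValueError(f"not a pprof top row in MB: {line!r}")
    flat, flat_pct, sum_pct, cum, cum_pct, name = m.groups()
    return TopRow(float(flat), float(flat_pct), float(sum_pct),
                  float(cum), float(cum_pct), name)


row = parse_top_row(
    "13244.84MB 76.74% 76.74% 13244.84MB 76.74% "
    "github.com/weaviate/weaviate/adapters/repos/db/vector/hnsw/distancer.Normalize (inline)"
)
total_mb = 17258.55  # "of 17258.55MB total" from the profile header
share = row.flat_mb / total_mb * 100
print(f"Normalize share of profiled heap: {share:.2f}%")

monitored_gb, profiled_gb = 19.5, 17.3  # approximate figures quoted above
print(f"Not attributed by the profile: {monitored_gb - profiled_gb:.1f} GB")
```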

darshilshahquantive commented on September 27, 2024

@rthiiyer82 any updates on this?

By the way, I am also seeing continuous logs like the one below while memory usage climbs. As soon as Weaviate restarts, these logs disappear (I have enabled the GCTRACE logs). Could this be contributing to the memory spikes?

{"action":"lsm_memtable_flush","class":"Account_accountsg_categories","error":"switch active memtable: init commit logger: open /var/lib/weaviate/accountaccountsg_categories//lsm/objects/segment-1718480924736645345.wal: no such file or directory","index":"accountaccountsg_categories","level":"error","msg":"flush and switch failed","path":"/var/lib/weaviate/accountaccount_sg_categories//lsm/objects","shard":"","time":"2024-06-15T19:48:44Z"}
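
To see whether these flush errors line up with the memory growth, it may help to count them per class over time. A minimal sketch, assuming the server logs are available as one JSON object per line with the `action`, `level`, and `class` fields shown above; `count_flush_failures` and the sample lines are hypothetical:

```python
import json
from collections import Counter
from typing import Iterable


def count_flush_failures(lines: Iterable[str]) -> Counter:
    """Count error-level lsm_memtable_flush entries per class."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines such as GC trace output
        if not isinstance(entry, dict):
            continue
        if entry.get("action") == "lsm_memtable_flush" and entry.get("level") == "error":
            counts[entry.get("class", "<unknown>")] += 1
    return counts


# Hypothetical sample lines in the same shape as the log entry above.
sample = [
    '{"action":"lsm_memtable_flush","class":"ExampleA","level":"error","msg":"flush and switch failed"}',
    '{"action":"lsm_memtable_flush","class":"ExampleA","level":"error","msg":"flush and switch failed"}',
    '{"action":"startup","level":"info","msg":"ready"}',
    "gc 12 @3.2s 1%: not a JSON line",
]
failures = count_flush_failures(sample)
print(failures.most_common())
```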
