Comments (5)

Athe-kunal avatar Athe-kunal commented on August 26, 2024

@okhat
Can you suggest something here?

from colbert.

detaos avatar detaos commented on August 26, 2024

It's not the easiest thing to use, but ColBERT does support pre-filtering:

Here's the chunk I use:

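    import torch

    if len(query.conditions) > 0:
        # filter_fn receives a tensor of candidate passage IDs and returns the subset to keep,
        # with the same dtype and on the same device as the input.
        def filter_fn(pids):
            kept = [index for index in pids.tolist() if keepResult(query, index)]
            return torch.tensor(kept, dtype=pids.dtype, device=pids.device)
        results = searcher.search(query.query, k=query.k, filter_fn=filter_fn)
    else:
        results = searcher.search(query.query, k=query.k, full_length_search=True)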

Note: The query object contains the filter conditions. The keepResult function returns a boolean indicating whether the metadata for the given passage ID (the index parameter) matches the filter in the query.
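
For readers wondering what such a keepResult function might look like, here is a minimal sketch. The Query dataclass, its conditions dict, and the PASSAGE_METADATA table are hypothetical stand-ins invented for illustration; none of them are part of ColBERT. Only the idea of returning a boolean per passage ID comes from the note above.

```python
from dataclasses import dataclass, field

# Hypothetical per-passage metadata, keyed by passage ID (assumption for illustration).
PASSAGE_METADATA = {
    0: {"ticker": "AAPL", "year": 2023},
    1: {"ticker": "AAPL", "year": 2024},
    2: {"ticker": "MSFT", "year": 2024},
}


@dataclass
class Query:
    """Hypothetical query object: search text, result count, and filter conditions."""
    query: str
    k: int = 10
    conditions: dict = field(default_factory=dict)  # metadata field -> required value


def keep_result(query: Query, pid: int) -> bool:
    """Return True when the passage's metadata satisfies every condition in the query."""
    meta = PASSAGE_METADATA.get(pid, {})
    return all(meta.get(key) == value for key, value in query.conditions.items())


def filter_pids(query: Query, pids: list) -> list:
    """Keep only the candidate passage IDs that pass the filter."""
    return [pid for pid in pids if keep_result(query, pid)]


q = Query(query="revenue guidance", conditions={"ticker": "AAPL", "year": 2024})
print(filter_pids(q, [0, 1, 2]))
```

A passage ID with no metadata entry is rejected whenever at least one condition is set, which seems like the safer default for a pre-filter.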

Athe-kunal avatar Athe-kunal commented on August 26, 2024

Hi @detaos

Thank you for your response.
During indexing, how should I index it with metadata? My indexing function looks something like this:

    from colbert import Indexer
    from colbert.infra import Run, RunConfig, ColBERTConfig

    # nranks specifies the number of GPUs to use
    with Run().context(RunConfig(nranks=1, experiment=EXPERIMENT_NAME)):
        # kmeans_niters specifies the number of iterations of k-means clustering;
        # 4 is a good and fast default. Consider larger numbers for small datasets.
        config = ColBERTConfig(doc_maxlen=doc_maxlen, nbits=nbits, kmeans_niters=4)

        indexer = Indexer(checkpoint=COLBERT_CHECKPOINT, config=config)
        for name, text_list in texts_dict.items():
            index_name = f'SEC.Earningcalls.{ticker}.{year}.{name}.{nbits}bits'
            indexer.index(name=index_name, collection=text_list, overwrite=True)

How can I pass the metadata information? Currently I am just passing the list of texts. Thanks in advance.

detaos avatar detaos commented on August 26, 2024

You don't need to index metadata that won't help the search. For example, lastmod dates from HTML pages are useful metadata, but no one is searching for a lastmod date. So, I keep my non-search-related metadata separate. I have a mapping object from passage ID to page ID, then have a metadata object that has the metadata for each page. My keepResult function uses the mapping for the candidate passage ID to the page ID to get the page's metadata to check against the filter. Essentially: metadata[page_ids[passage_id]]

It's a bit convoluted, but if you store the metadata for each passage, then you end up with a LOT of redundant metadata (presupposing you have many pages that are longer than 1 passage, which I do).
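
The two-level lookup described above can be sketched roughly as follows. The pages list, the passage_metadata helper, and the lastmod-based keep_result are hypothetical and only illustrate the shape of the data. The sketch also assumes that a passage's ID is its position in the list passed as the collection, which is worth verifying against your own index.

```python
# Hypothetical pages; each page carries its metadata exactly once.
pages = [
    {"lastmod": "2024-01-15", "passages": ["Intro text.", "More detail."]},
    {"lastmod": "2023-06-30", "passages": ["A single-passage page."]},
]

collection = []  # flat list of passage texts, as would be handed to the indexer
page_ids = []    # page_ids[passage_id] -> page ID
metadata = {}    # metadata[page_id] -> that page's metadata (never indexed)

for page_id, page in enumerate(pages):
    metadata[page_id] = {"lastmod": page["lastmod"]}
    for passage in page["passages"]:
        # Assumption: the passage ID equals the passage's position in `collection`.
        collection.append(passage)
        page_ids.append(page_id)


def passage_metadata(passage_id):
    """Resolve passage -> page -> metadata, i.e. metadata[page_ids[passage_id]]."""
    return metadata[page_ids[passage_id]]


def keep_result(min_lastmod, passage_id):
    """Example filter: keep passages whose page was modified on or after min_lastmod."""
    # ISO 8601 date strings compare correctly as plain strings.
    return passage_metadata(passage_id)["lastmod"] >= min_lastmod


print(page_ids)
print([pid for pid in range(len(collection)) if keep_result("2024-01-01", pid)])
```

Only the text in collection goes to the indexer; page_ids and metadata stay in application code and are consulted at search time by the filter.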

Athe-kunal avatar Athe-kunal commented on August 26, 2024

Ok, understood
Thanks @detaos, will implement this in my code
