
deepset-ai / haystack-core-integrations


Additional packages (components, document stores, and the like) to extend the capabilities of Haystack version 2.0 and onwards

Home Page: https://haystack.deepset.ai

License: Apache License 2.0

Python 99.98% Shell 0.02%
ai haystack llm mlops nlp

haystack-core-integrations's People

Contributors

agnieszka-m, alistairlr112, amnah199, anakin87, anushreebannadabhavi, awinml, bilgeyucel, davidsbatista, dependabot[bot], dfokina, erichare, github-actions[bot], haystackbot, jlonge4, joanfm, julian-risch, lambda-science, lbux, masci, nickprock, paulmartrencharpro, sahusiddharth, shademe, silvanocerza, sjrl, tstadel, tuanacelik, vblagoje, wochinge, zansara


haystack-core-integrations's Issues

Add Amazon Bedrock Integration

Summary and motivation

As a user, I'd like to use models on Amazon Bedrock in my Haystack 2.0 pipelines. This was brought up in deepset-ai/haystack#6545

Detailed design

A new generator class for Amazon Bedrock with the same features as the 1.x version, which was implemented as an invocation layer. All models currently available on Bedrock should be supported, and streaming should work for every model that supports it. For example, we should be able to load the Llama 2 Chat 13B model:

from amazon_bedrock_haystack.generators import AmazonBedrockGenerator
generator = AmazonBedrockGenerator(model_name_or_path="meta.llama2-13b-chat-v1")

Related PRs for the support in Haystack 1.x were deepset-ai/haystack#6226 and deepset-ai/haystack#6406
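The streaming requirement above can be illustrated with a toy sketch. All names here (StreamingChunk, streaming_callback) are assumptions for illustration, not the final API of the Bedrock generator:

```python
# Toy sketch of a streaming-callback contract: the generator invokes a
# user-supplied callback once per chunk, then returns the full reply.
class StreamingChunk:
    def __init__(self, content: str):
        self.content = content

def stream_generate(chunks, streaming_callback=None):
    """Stand-in for a generator's token loop (not real Bedrock code)."""
    collected = []
    for piece in chunks:
        chunk = StreamingChunk(piece)
        if streaming_callback is not None:
            streaming_callback(chunk)
        collected.append(chunk.content)
    return {"replies": ["".join(collected)]}
```

A real implementation would pull the chunks from Bedrock's streaming response instead of an in-memory list.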

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Jina AI Integration

Summary and motivation

Add the Jina AI embedding functionality to Haystack

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Add MongoDBAtlasDocumentStore

Summary and motivation

Similar to 1.x we'd like to have a document store for MongoDB's Atlas.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Unstructured: better support for hosted APIs

  • The current logic to detect whether the user is using the hosted API is not great (here)

  • The Unstructured SaaS API was recently announced, which involves several changes. (I don't know whether we support it as of now)

In conclusion, we can make our implementation more robust and flexible, so that users can use:

  • the free API via Docker
  • the free hosted API
  • the SaaS API

AstraDB document store

Summary and motivation

DataStax Astra DB is a serverless vector database that’s perfect for managing mission-critical AI workloads. It’s built on Apache Cassandra®, making Astra DB a highly scalable, reliable database technology. This makes it a powerful all-in-one data storage solution, ideal for Generative AI projects.

Detailed design

Implement the DocumentStore protocol and provide a specific documents retriever.
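As a reference for what "implement the DocumentStore protocol" entails, here is an in-memory sketch of the protocol surface (method names taken from Haystack 2.x); the real store would translate each method into Astra DB API calls, and filter handling is reduced to exact-match for brevity:

```python
# In-memory sketch of the DocumentStore protocol an Astra DB store would
# implement. Documents are modelled as plain dicts keyed by "id".
class AstraDocumentStoreSketch:
    def __init__(self):
        self._docs = {}  # doc id -> document dict

    def count_documents(self) -> int:
        return len(self._docs)

    def write_documents(self, documents) -> int:
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def filter_documents(self, filters=None):
        # Exact-match filtering only; the real protocol supports richer filters.
        if not filters:
            return list(self._docs.values())
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in filters.items())]

    def delete_documents(self, document_ids) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)
```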

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch #144
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • #164
  • The integration has been listed in the Inventory section of this repo README
  • #163
  • The feature was announced through social media

Weaviate Document Store

Summary and motivation

We want to support Weaviate as a Document Store in Haystack 2.x, much like we did for Haystack 1.x

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Docs: Vertex AI integration docs

Tasks

When they're published, we should link them from the "Google Vertex AI integration" page

Add AzureOpenAIGenerator

Is your feature request related to a problem? Please describe.
We should support the Azure OpenAI Service, which is widely used and particularly relevant for professional production use.

See also: deepset-ai/haystack#6620

Describe the solution you'd like
Similar to 1.x we need a component that supports OpenAI models hosted on Azure.

Describe alternatives you've considered
None

Filter ElasticSearch results by min_score

Problem:
I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). Prior to the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the ElasticsearchDocumentStore, I need to set top_k=10000 or higher and filter the results afterwards, taking only documents with a _score higher than x. Retrieving this many documents takes several seconds.

Solution
Filtering query results by a minimum score value is already implemented in the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you use in client.search(). See my example:

body = {
    "size": top_k,
    "min_score": min_score,
    "query": self._get_vector_similarity_query(query_emb, top_k),
}

I changed the body of the function def query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score higher than min_score.

Additional context
In case the user wants to filter the results by the cosine similarity metric the min_score parameter needs to be scaled appropriately before using it in the body.
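The scaling caveat can be sketched as follows. The assumption here is that Haystack's scaled cosine scores map a raw similarity s in [-1, 1] to (s + 1) / 2 in [0, 1], so a user-facing min_score must be unscaled before Elasticsearch sees it; the helper name and parameters are illustrative:

```python
# Sketch of the proposed body construction, including the cosine-scaling
# caveat. `vector_query` stands in for self._get_vector_similarity_query(...).
def build_query_body(vector_query, top_k, min_score=None,
                     similarity="dot_product", scale_score=True):
    body = {"size": top_k, "query": vector_query}
    if min_score is not None:
        if similarity == "cosine" and scale_score:
            # Invert the (s + 1) / 2 scaling so the threshold matches
            # the raw scores Elasticsearch compares against.
            min_score = min_score * 2 - 1
        body["min_score"] = min_score
    return body
```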

Add Supabase integration

Hi, wondering if you will support Supabase in the near future.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Tasks

Package build broken if integration folder name contains hyphens

Describe the bug

We recently moved (#103) package versioning from hardcoded strings in the __about__.py file to git tags using setuptools_scm through hatch.

The problem is that setuptools_scm splits the tag on -, so hyphens in the string prepending the version in the tag name, for example in integrations/google-vertex-v0.0.1, confuse the plugin.

There isn't an easy fix, so I propose the following workaround:

  • Rename all the tags integrations/google-vertex-vXXX to integrations/google_vertex-vXXX
  • Rename all the tags integrations/instructor-embedders-vXXX to integrations/instructor_embedders-vXXX
  • Push the new tags, CI will attempt to rebuild the packages but will fail because the path google_vertex doesn't exist
  • Rename the folder integrations/google-vertex to integrations/google_vertex #114
  • Rename the folder integrations/instructor-embedders to integrations/instructor_embedders #114
  • Enforce the new naming convention with a job in the CI #119

The workaround won't affect the name of the package on PyPI, nor the import paths.

To Reproduce

Checkout the latest tag for google vertex and call hatch version

Describe your environment (please complete the following information):

  • OS: [e.g. iOS]
  • Haystack version:
  • Integration version:

[Elasticsearch] BM25 retrieval is too restrictive

To Reproduce

from haystack.dataclasses import Document  # missing in the original snippet

from elasticsearch_haystack.bm25_retriever import ElasticsearchBM25Retriever
from elasticsearch_haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
print(retriever.run(query="How much self awareness do elephants have?"))
# {'documents': []}

This should return the second Document, but it does not because of this AND operator:

See for comparison the same query in Haystack 1.x:
https://github.com/deepset-ai/haystack/blob/c812250453ab7da35f526a5f2a53e18c058fe2ff/haystack/document_stores/search_engine.py#L1100
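The effect of the operator can be sketched like this. The field list and extra options are assumptions; the point is the `operator` key in the multi_match clause: with "AND" every query term must match the document, which drops the elephant document above, while "OR" (Elasticsearch's default) restores the expected behaviour:

```python
# Hedged sketch of a BM25 multi_match body with a configurable operator.
def build_bm25_query(query, fuzziness="AUTO", operator="OR"):
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fuzziness": fuzziness,
                "type": "most_fields",
                "operator": operator,
            }
        }
    }
```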

Build LlamaHub Integration

LlamaHub has added a function to convert outputs to Haystack format, which, luckily, seems compatible with 2.0 even though it was built for 1.x.

The 'verified' data loaders have a standardized way to be loaded and used. Let's create an integration that consists of a custom component whose run function takes only a few things, such as the name of the loader for download_loader('name of loader') and the input expected by load_data().

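A possible shape for such a component, sketched under assumptions: `download_loader` is injected as a parameter so the example is self-contained, whereas in practice it would be llama_index's download_loader, and the run() signature would follow Haystack's component conventions:

```python
# Sketch of a LlamaHub wrapper component. The injected `download_loader`
# maps a loader name to a loader class, mirroring LlamaHub's API.
class LlamaHubLoader:
    def __init__(self, loader_name, download_loader):
        self.loader_cls = download_loader(loader_name)

    def run(self, **load_kwargs):
        # Instantiate the loader and forward the kwargs to load_data().
        loader = self.loader_cls()
        documents = loader.load_data(**load_kwargs)
        return {"documents": documents}
```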

Add Ollama Generator

Summary and motivation

See deepset-ai/haystack#6514

Detailed design

See deepset-ai/haystack#6514

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Rename the `QdrantRetriever` + `/retrievers` folder

Is your feature request related to a problem? Please describe.
As a user, I'd like to have predictable names for components; the names of other retrievers contain the retrieval method.

Describe the solution you'd like
Renaming the QdrantRetriever as QdrantEmbeddingRetriever to align with the current convention. Also, since we foresee QdrantSparseRetriever and QdrantHybridRetriever, it makes sense to create a new retrievers folder and move QdrantEmbeddingRetriever into it.

Describe alternatives you've considered
N/A

Additional context
N/A

Retrievers: should `document_store` be a private attribute?

  • Retrievers of InMemoryDocumentStore (InMemoryBM25Retriever...) have the attribute document_store

  • Retrievers of Elasticsearch and Opensearch Document Store have the attribute _document_store

We should probably agree on a consistent approach,
so that for example pipeline.get_component("retriever").document_store always works.

@silvanocerza

Update Chroma example colab

Describe the bug
The Chroma example colab is not working. Link to the notebook: https://colab.research.google.com/drive/1YpDetI8BRbObPDEVdfqUcwhEX9UUXP-m?usp=sharing. The code in example.py should be updated accordingly.

To Reproduce
Most up to date code block 👇 It seems like DocumentWriter is not writing any documents to the document store

import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from chroma_haystack import ChromaDocumentStore
from chroma_haystack.retriever import ChromaQueryRetriever

file_paths = ["data" / Path(name) for name in os.listdir("data")]

document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

querying = Pipeline()
querying.add_component("retriever", ChromaQueryRetriever(document_store))
results = querying.run({"retriever": {"queries": ["Variable declarations"], "top_k": 3}})

for d in results["retriever"][0]:
    print(d.metadata, d.score)

Describe your environment (please complete the following information):

  • OS: Colab
  • Haystack version: haystack-ai==2.0.0b3
  • Integration version: 0.8.1

Elasticsearch Document Store - investigate scaling scores for embedding retrieval

Currently, Embedding Retrieval in the Elasticsearch Document Store does not allow scaling scores in the range [0, 1].

I have not implemented this feature for two reasons:

  • I have the impression that in the latest versions of Elasticsearch, it comes out of the box
  • It's not trivial to do it right

We should investigate these points further.

Add OptimumEmbedder

Is your feature request related to a problem? Please describe.
Hugging Face's Optimum library provides faster inference through ONNX and TensorRT. This can be used to create blazingly fast embedding components. The concepts used in Optimum also play well with some of the concepts we have in Haystack. For example:

Loading non-ONNX checkpoints requires a conversion step, which takes some time. We can do that step in our warm_up function (https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models#loading-a-vanilla-transformers-model).

Describe the solution you'd like

Describe alternatives you've considered

Additional context
https://colab.research.google.com/drive/10UAtpz26Gv2LtamT8j33LmI5UFQFwF4T?usp=sharing
https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb
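The warm_up idea above can be sketched without any Optimum dependency. Everything here is illustrative: the class name, the placeholder "loading" step (which in the real component would be something like an ONNX export via Optimum), and the dummy embedding:

```python
# Sketch of deferring the slow ONNX conversion/export to warm_up(),
# mirroring how Haystack components separate __init__ from warm_up.
class OptimumEmbedderSketch:
    def __init__(self, model_name):
        self.model_name = model_name
        self.model = None  # nothing heavy happens at construction time

    def warm_up(self):
        # The real component would run the ONNX export here; this stand-in
        # just records that loading happened.
        if self.model is None:
            self.model = f"loaded:{self.model_name}"

    def run(self, text):
        assert self.model is not None, "call warm_up() first"
        return {"embedding": [float(len(text))]}  # placeholder embedding
```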

Tasks

  1. P1 integration:optimum type:documentation
    dfokina

Add one object storage based document store

Currently, Haystack supports storing data in Elasticsearch, in memory, and in RDBMSs.
It would be nice to add support for object storage like S3, which is very cheap and requires less maintenance.

In a first step, AWS S3 can be supported, as they recently added the S3 Select option, which can help retrieve only a subset of data from an object (currently supporting CSV objects in compressed or uncompressed format).

Ideally, we can add a metadata service as well, which may help to use Haystack along with data lakes.

Chroma DocumentStore throws error at creation if collection exists

Describe the bug
When initializing the ChromaDocumentStore, if you rerun the cell without changing the collection name, you will get an error telling you the collection already exists.

Expected behaviour: instead of the init calling 'create' by default, if the collection already exists it should simply connect to the existing collection. The current state means we cannot reuse a collection later on.
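The desired behaviour is a get-or-create pattern. As a rough sketch (the Chroma client is modelled as a plain dict here so the example is self-contained; Chroma's own client exposes a similar get_or_create_collection helper that could back the real fix):

```python
# Sketch of get-or-create semantics: create the collection on first use,
# connect to the existing one on any later call instead of raising.
def get_or_create_collection(client, name):
    if name not in client:
        client[name] = {"name": name, "items": []}  # "create"
    return client[name]  # otherwise just connect to the existing one
```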

To Reproduce
Any colab that has a cell with document_store = ChromaDocumentStore()
Here is one for example: https://colab.research.google.com/drive/19NzliNb5ZUo1fUbplwnUrZpNBxs9HvoR?usp=sharing

Describe your environment (please complete the following information):

  • Integration version: chroma_haystack-0.9.0

Google Gemini (Rest API) Integration

Summary and motivation

Google Gemini isn't only provided via Vertex AI. Let's create an integration for the REST API offering, too. There you only need a simple API Key - no need for a GCP account.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Add SagemakerGenerator

Is your feature request related to a problem? Please describe.
Sagemaker is widely used for LLM inference in production use cases. We should support Sagemaker in 2.0.

Describe the solution you'd like
Generator similar to the Sagemaker support in 1.x

Tasks

Pinecone Document Store

Summary and motivation

Briefly explain the request: why do we need this integration? What use cases does it support?

Detailed design

Explain the design in enough detail for somebody familiar with Haystack to understand, and for somebody familiar with
the implementation to implement. Get into specifics and corner-cases, and include examples of how the feature is used.
Also, if there's any new terminology involved, define it here.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • #161
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • deepset-ai/haystack-integrations#107
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • deepset-ai/devrel-board#241

Preview needs removing

The instructor embedder (and probably others) is causing errors right now because the imports still have '.preview' in them. There might also be some other out-of-date imports.
(Reported on discord)

milvus

why? why would you do such a thing? 😭

Add Cohere LLM integration

Summary and motivation

Add support for CohereGenerator and CohereChatGenerator.

Detailed design

Cohere is one of the leading LLM providers and we should have Haystack Cohere LLM integration. See https://docs.cohere.com/reference/generate and https://docs.cohere.com/reference/chat for more details.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/ (Chat generator missing)
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • deepset-ai/haystack-integrations#150
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • #265

Add AnthropicGenerator

Is your feature request related to a problem? Please describe.
Anthropic models are widely used. We should support them in 2.0.

Describe the solution you'd like
Generator similar to the Anthropic support in 1.x

Tasks

Add AzureOpenAIEmbedders

Is your feature request related to a problem? Please describe.
OpenAI's embedding models hosted on Azure are widely used. We should add the relevant embedders to support them.

See also: deepset-ai/haystack#6620

Describe the solution you'd like
Similar to 1.x support the OpenAI embedding models hosted on Azure.

Describe alternatives you've considered
None

Add PGVector DocumentStore

Summary and motivation

PGVector is a popular request from our community. We should have it in 2.0.

Detailed design

TBD

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Support Pinecone's hybrid vectors

Is your feature request related to a problem? Please describe.
Recently, Pinecone announced support for sparse-dense embeddings, allowing for hybrid vector search (both semantic and keyword search). However, Haystack doesn't currently support these.

Describe the solution you'd like
Native support for sparse-dense vectors!

Describe alternatives you've considered
I've created a small package called haystack-hybrid-embedding which can be installed to support hybrid vectors for now, but it's a hack!
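For context, a sparse-dense vector in Pinecone combines a dense values list with a sparse part expressed as parallel index/value arrays. A minimal sketch of the payload shape (the upsert plumbing and any Haystack wiring are omitted; the helper name is illustrative):

```python
# Sketch of Pinecone's sparse-dense vector payload: dense `values` plus a
# `sparse_values` object holding parallel indices/values arrays.
def make_sparse_dense_vector(vec_id, dense, sparse_indices, sparse_values):
    assert len(sparse_indices) == len(sparse_values)
    return {
        "id": vec_id,
        "values": dense,
        "sparse_values": {"indices": sparse_indices, "values": sparse_values},
    }
```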

Qdrant integration

Summary and motivation

The integration has already been made and published.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Have a separate folder for each component type under integrations

Is your feature request related to a problem? Please describe.
As a user, I'm familiar with the convention in the haystack repository, and having the same convention here would make it easy for me to navigate.

Describe the solution you'd like
Create a /generators folder under the cohere_haystack integration folder.
The same can apply to jina_haystack and unstructured. This way, we can be more open to new components coming from the same providers.

Describe alternatives you've considered
We can leave it as it is.

Additional context
N/A

Tasks

  1. P1 integration:amazon-bedrock
    vblagoje
  2. P1 integration:astra
    masci
  3. P1 integration:cohere
    vblagoje
  4. P1 integration:elasticsearch
    masci
  5. P1 integration:google-ai
    masci
  6. P1 integration:google-vertex
    masci
  7. 1 of 1
    P1 integration:gradient
    masci
  8. P1 integration:instructor-embedders
    masci
  9. P1 integration:jina
    masci
  10. P1 integration:llama_cpp
    anakin87
  11. P1 integration:ollama
    anakin87
  12. 1 of 1
    P1 integration:opensearch
    masci
  13. P1 integration:pinecone
    masci
  14. P1 integration:qdrant
    masci
  15. P1 integration:unstructured-fileconverter
    anakin87
  16. P1 integration:weaviate
    silvanocerza
  17. integration:chroma
    masci

Add support for Cohere Embed v3 Models

Is your feature request related to a problem? Please describe.
Cohere has a new type of embedding models: Embed v3. Let's add support for them in 2.0 pipelines.
https://txt.cohere.com/introducing-embed-v3/
https://docs.cohere.com/reference/embed

Describe the solution you'd like
It seems like the only change we need to make in CohereEmbedder is to add the input_type parameter. See the documentation: https://docs.cohere.com/reference/embed
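The shape of that change can be sketched as follows. The input_type values shown ("search_document", "search_query") come from Cohere's Embed v3 documentation; the helper itself is illustrative rather than the actual CohereEmbedder code:

```python
# Sketch of an embed request payload: v3 models take an input_type
# parameter, while older v2 models are called without it.
def build_embed_request(texts, model, input_type=None):
    payload = {"texts": texts, "model": model}
    if input_type is not None:
        payload["input_type"] = input_type  # e.g. "search_document"
    return payload
```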

Describe alternatives you've considered
There's no other way of using these v3 models. When we pass the v3 model names, the CohereEmbedder (or the Cohere API, not quite sure which one) may not raise any error, but the model falls back to a v2 model.

Additional context
N/A

Google Vertex AI integration

Summary and motivation

This is already mostly implemented, but I'm creating this issue to track the remaining work.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • #155
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • #120
  • The integration has been listed in the Inventory section of this repo README
  • #117
  • #116
