
deepset-ai / haystack-core-integrations


Additional packages (components, document stores, and the like) to extend the capabilities of Haystack version 2.0 and onwards

Home Page: https://haystack.deepset.ai

License: Apache License 2.0

Python 99.98% Shell 0.02%
ai haystack llm mlops nlp

haystack-core-integrations's People

Contributors

agnieszka-m, alistairlr112, amnah199, anakin87, anushreebannadabhavi, awinml, bilgeyucel, davidsbatista, dependabot[bot], dfokina, erichare, github-actions[bot], haystackbot, jlonge4, joanfm, julian-risch, lambda-science, lbux, masci, nickprock, paulmartrencharpro, sahusiddharth, shademe, silvanocerza, sjrl, tstadel, tuanacelik, vblagoje, wochinge, zansara


haystack-core-integrations's Issues

Add Amazon Bedrock Integration

Summary and motivation

As a user, I'd like to use models on Amazon Bedrock in my Haystack 2.0 pipelines. This was brought up in deepset-ai/haystack#6545

Detailed design

A new generator class for Amazon Bedrock with the same features as the 1.x version, which was implemented as an invocation layer. All models currently available on Bedrock should be supported, and streaming should work for every model that supports it. For example, we should be able to load the Llama 2 Chat 13B model:

from amazon_bedrock_haystack.generators import AmazonBedrockGenerator
generator = AmazonBedrockGenerator(model_name_or_path="meta.llama2-13b-chat-v1")

Related PRs for the support in Haystack 1.x were deepset-ai/haystack#6226 and deepset-ai/haystack#6406
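The streaming requirement above can be illustrated with a toy sketch. All names here (StreamingChunk, streaming_callback) are assumptions for illustration, not the final API of the Bedrock generator:

```python
# Toy sketch of a streaming-callback contract: the generator invokes a
# user-supplied callback once per chunk, then returns the full reply.
class StreamingChunk:
    def __init__(self, content: str):
        self.content = content

def stream_generate(chunks, streaming_callback=None):
    """Stand-in for a generator's token loop (not real Bedrock code)."""
    collected = []
    for piece in chunks:
        chunk = StreamingChunk(piece)
        if streaming_callback is not None:
            streaming_callback(chunk)
        collected.append(chunk.content)
    return {"replies": ["".join(collected)]}
```

A real implementation would pull the chunks from Bedrock's streaming response instead of an in-memory list.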

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Jina AI Integration

Summary and motivation

Add the Jina AI embedding functionality to Haystack

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Add MongoDBAtlasDocumentStore

Summary and motivation

Similar to 1.x we'd like to have a document store for MongoDB's Atlas.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Unstructured: better support for hosted APIs

  • The current logic to detect whether the user is using the hosted API is not great (here)

  • The Unstructured SaaS API was recently announced, which involves several changes. (I don't know whether we support it as of now)

In conclusion, we can make our implementation more robust and flexible, so that users can use:

  • the free API via Docker
  • the free hosted API
  • the SaaS API

AstraDB document store

Summary and motivation

DataStax Astra DB is a serverless vector database that’s perfect for managing mission-critical AI workloads. It’s built on Apache Cassandra®, making Astra DB a highly scalable, reliable database technology. This makes it a powerful all-in-one data storage solution, ideal for Generative AI projects.

Detailed design

Implement the DocumentStore protocol and provide a specific documents retriever.
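As a reference for what "implement the DocumentStore protocol" entails, here is an in-memory sketch of the protocol surface (method names taken from Haystack 2.x); the real store would translate each method into Astra DB API calls, and filter handling is reduced to exact-match for brevity:

```python
# In-memory sketch of the DocumentStore protocol an Astra DB store would
# implement. Documents are modelled as plain dicts keyed by "id".
class AstraDocumentStoreSketch:
    def __init__(self):
        self._docs = {}  # doc id -> document dict

    def count_documents(self) -> int:
        return len(self._docs)

    def write_documents(self, documents) -> int:
        for doc in documents:
            self._docs[doc["id"]] = doc
        return len(documents)

    def filter_documents(self, filters=None):
        # Exact-match filtering only; the real protocol supports richer filters.
        if not filters:
            return list(self._docs.values())
        return [d for d in self._docs.values()
                if all(d.get(k) == v for k, v in filters.items())]

    def delete_documents(self, document_ids) -> None:
        for doc_id in document_ids:
            self._docs.pop(doc_id, None)
```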

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch #144
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • #164
  • The integration has been listed in the Inventory section of this repo README
  • #163
  • The feature was announced through social media

Weaviate Document Store

Summary and motivation

We want to support Weaviate as a Document Store in Haystack 2.x, much like we did for Haystack 1.x

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Docs: Vertex AI integration docs

Tasks

When they're published, we should link them from the "Google Vertex AI integration" page

Add AzureOpenAIGenerator

Is your feature request related to a problem? Please describe.
We should support the Azure OpenAI Service, which is widely used and particularly relevant for professional production use.

See also: deepset-ai/haystack#6620

Describe the solution you'd like
Similar to 1.x we need a component that supports OpenAI models hosted on Azure.

Describe alternatives you've considered
None

Filter ElasticSearch results by min_score

Problem:
I want to retrieve all relevant (similar) documents from the ElasticsearchDocumentStore based on the _score, using the EmbeddingRetriever (I am not using the Reader). Prior to the search, I don't know how many relevant documents exist. To make sure that I retrieve all relevant entries from the ElasticsearchDocumentStore, I need to set top_k=10000 or higher and filter the results afterwards, taking only documents with a _score higher than x. Retrieving this many documents takes several seconds.

Solution
Filtering query results by a minimum score value is already implemented in the Python Elasticsearch client. You could add another parameter (min_score), similar to top_k, and add it to the body that you use in client.search(). See my example:

body = {
    "size": top_k,
    "min_score": min_score,
    "query": self._get_vector_similarity_query(query_emb, top_k),
}

I changed the body of the function def query_by_embedding(...) in the file haystack/document_stores/elasticsearch.py. Now the results contain only documents that have a _score higher than min_score.

Additional context
In case the user wants to filter the results by the cosine similarity metric the min_score parameter needs to be scaled appropriately before using it in the body.
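The scaling caveat can be sketched as follows. The assumption here is that Haystack's scaled cosine scores map a raw similarity s in [-1, 1] to (s + 1) / 2 in [0, 1], so a user-facing min_score must be unscaled before Elasticsearch sees it; the helper name and parameters are illustrative:

```python
# Sketch of the proposed body construction, including the cosine-scaling
# caveat. `vector_query` stands in for self._get_vector_similarity_query(...).
def build_query_body(vector_query, top_k, min_score=None,
                     similarity="dot_product", scale_score=True):
    body = {"size": top_k, "query": vector_query}
    if min_score is not None:
        if similarity == "cosine" and scale_score:
            # Invert the (s + 1) / 2 scaling so the threshold matches
            # the raw scores Elasticsearch compares against.
            min_score = min_score * 2 - 1
        body["min_score"] = min_score
    return body
```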

Add Supabase integration

Hi, wondering if you will support Supabase in the near future.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Tasks

Package build broken if integration folder name contains hyphens

Describe the bug

We recently moved (#103) package versioning from hardcoded strings in the __about__.py file to git tags using setuptools_scm through hatch.

The problem is that setuptools_scm splits the tag on -, so hyphens in the string prepending the version in the tag name, for example in integrations/google-vertex-v0.0.1, confuse the plugin.

There isn't an easy fix, so I propose the following workaround:

  • Rename all the tags integrations/google-vertex-vXXX to integrations/google_vertex-vXXX
  • Rename all the tags integrations/instructor-embedders-vXXX to integrations/instructor_embedders-vXXX
  • Push the new tags, CI will attempt to rebuild the packages but will fail because the path google_vertex doesn't exist
  • Rename the folder integrations/google-vertex to integrations/google_vertex #114
  • Rename the folder integrations/instructor-embedders to integrations/instructor_embedders #114
  • Enforce the new naming convention with a job in the CI #119

The workaround won't affect the name of the package on PyPI, nor the import paths.

To Reproduce

Checkout the latest tag for google vertex and call hatch version

Describe your environment (please complete the following information):

  • OS: [e.g. iOS]
  • Haystack version:
  • Integration version:

[Elasticsearch] BM25 retrieval is too restrictive

To Reproduce

from haystack.dataclasses import Document  # missing in the original snippet

from elasticsearch_haystack.bm25_retriever import ElasticsearchBM25Retriever
from elasticsearch_haystack.document_store import ElasticsearchDocumentStore

document_store = ElasticsearchDocumentStore(hosts="http://localhost:9200")

documents = [
    Document(content="There are over 7,000 languages spoken around the world today."),
    Document(content="Elephants have been observed to behave in a way that indicates a high level of self-awareness, such as recognizing themselves in mirrors."),
    Document(content="In certain parts of the world, like the Maldives, Puerto Rico, and San Diego, you can witness the phenomenon of bioluminescent waves."),
]
document_store.write_documents(documents)

retriever = ElasticsearchBM25Retriever(document_store=document_store)
print(retriever.run(query="How much self awareness do elephants have?"))
# {'documents': []}

This should return the second Document, but it does not because of this AND operator:

See for comparison the same query in Haystack 1.x:
https://github.com/deepset-ai/haystack/blob/c812250453ab7da35f526a5f2a53e18c058fe2ff/haystack/document_stores/search_engine.py#L1100
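The effect of the operator can be sketched like this. The field list and extra options are assumptions; the point is the `operator` key in the multi_match clause: with "AND" every query term must match the document, which drops the elephant document above, while "OR" (Elasticsearch's default) restores the expected behaviour:

```python
# Hedged sketch of a BM25 multi_match body with a configurable operator.
def build_bm25_query(query, fuzziness="AUTO", operator="OR"):
    return {
        "query": {
            "multi_match": {
                "query": query,
                "fuzziness": fuzziness,
                "type": "most_fields",
                "operator": operator,
            }
        }
    }
```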

Build LlamaHub Integration

LlamaHub has added a function to convert outputs to Haystack format, which, luckily, seems compatible with 2.0 even though it was built for 1.x.

The 'verified' data loaders have a standardized way to be loaded and used. Let's create an integration that consists of a custom component whose run function takes only a few things, such as the name of the loader for download_loader('name of loader') and the input expected by load_data().

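A possible shape for such a component, sketched under assumptions: `download_loader` is injected as a parameter so the example is self-contained, whereas in practice it would be llama_index's download_loader, and the run() signature would follow Haystack's component conventions:

```python
# Sketch of a LlamaHub wrapper component. The injected `download_loader`
# maps a loader name to a loader class, mirroring LlamaHub's API.
class LlamaHubLoader:
    def __init__(self, loader_name, download_loader):
        self.loader_cls = download_loader(loader_name)

    def run(self, **load_kwargs):
        # Instantiate the loader and forward the kwargs to load_data().
        loader = self.loader_cls()
        documents = loader.load_data(**load_kwargs)
        return {"documents": documents}
```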

Add Ollama Generator

Summary and motivation

See deepset-ai/haystack#6514

Detailed design

See deepset-ai/haystack#6514

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Rename the `QdrantRetriever` + `/retrievers` folder

Is your feature request related to a problem? Please describe.
As a user, I'd like to have predictable names for components; the names of other retrievers contain the retrieval method.

Describe the solution you'd like
Renaming the QdrantRetriever as QdrantEmbeddingRetriever to align with the current convention. Also, since we foresee QdrantSparseRetriever and QdrantHybridRetriever, it makes sense to create a new retrievers folder and move QdrantEmbeddingRetriever into it.

Describe alternatives you've considered
N/A

Additional context
N/A

Retrievers: should `document_store` be a private attribute?

  • Retrievers of InMemoryDocumentStore (InMemoryBM25Retriever...) have the attribute document_store

  • Retrievers of Elasticsearch and Opensearch Document Store have the attribute _document_store

We should probably agree on a consistent approach,
so that for example pipeline.get_component("retriever").document_store always works.

@silvanocerza

Update Chroma example colab

Describe the bug
The Chroma example colab is not working. Link to the notebook: https://colab.research.google.com/drive/1YpDetI8BRbObPDEVdfqUcwhEX9UUXP-m?usp=sharing. The code in example.py should be updated accordingly.

To Reproduce
Most up to date code block 👇 It seems like DocumentWriter is not writing any documents to the document store

import os
from pathlib import Path

from haystack import Pipeline
from haystack.components.converters import TextFileToDocument
from haystack.components.writers import DocumentWriter

from chroma_haystack import ChromaDocumentStore
from chroma_haystack.retriever import ChromaQueryRetriever

file_paths = ["data" / Path(name) for name in os.listdir("data")]

document_store = ChromaDocumentStore()

indexing = Pipeline()
indexing.add_component("converter", TextFileToDocument())
indexing.add_component("writer", DocumentWriter(document_store))
indexing.connect("converter", "writer")
indexing.run({"converter": {"sources": file_paths}})

querying = Pipeline()
querying.add_component("retriever", ChromaQueryRetriever(document_store))
results = querying.run({"retriever": {"queries": ["Variable declarations"], "top_k": 3}})

for d in results["retriever"][0]:
    print(d.metadata, d.score)

Describe your environment (please complete the following information):

  • OS: Colab
  • Haystack version: haystack-ai==2.0.0b3
  • Integration version: 0.8.1

Elasticsearch Document Store - investigate scaling scores for embedding retrieval

Currently, Embedding Retrieval in the Elasticsearch Document Store does not allow scaling scores in the range [0, 1].

I have not implemented this feature for two reasons:

  • I have the impression that in the latest versions of Elasticsearch, it comes out of the box
  • It's not trivial to do it right

We should investigate these points further.

Add OptimumEmbedder

Is your feature request related to a problem? Please describe.
Hugging Face's Optimum library provides faster inference through ONNX and TensorRT. This can be used to create blazingly fast embedding components. The concepts used in Optimum also play well with some of the concepts we have in Haystack. For example:

Loading non-ONNX checkpoints requires a conversion step, which takes some time. We can do that step in our warm_up function (https://huggingface.co/docs/optimum/onnxruntime/usage_guides/models#loading-a-vanilla-transformers-model).

Describe the solution you'd like

Describe alternatives you've considered

Additional context
https://colab.research.google.com/drive/10UAtpz26Gv2LtamT8j33LmI5UFQFwF4T?usp=sharing
https://github.com/huggingface/optimum-benchmark/tree/main/examples/fast-mteb
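The warm_up idea above can be sketched without any Optimum dependency. Everything here is illustrative: the class name, the placeholder "loading" step (which in the real component would be something like an ONNX export via Optimum), and the dummy embedding:

```python
# Sketch of deferring the slow ONNX conversion/export to warm_up(),
# mirroring how Haystack components separate __init__ from warm_up.
class OptimumEmbedderSketch:
    def __init__(self, model_name):
        self.model_name = model_name
        self.model = None  # nothing heavy happens at construction time

    def warm_up(self):
        # The real component would run the ONNX export here; this stand-in
        # just records that loading happened.
        if self.model is None:
            self.model = f"loaded:{self.model_name}"

    def run(self, text):
        assert self.model is not None, "call warm_up() first"
        return {"embedding": [float(len(text))]}  # placeholder embedding
```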

Tasks

  1. P1 integration:optimum type:documentation
    dfokina

Add one object storage based document store

Currently, Haystack supports storing data in Elasticsearch, in memory, and in RDBMSs.
It would be nice to add support for object storage like S3, which is very cheap and requires less maintenance.

In a first step, AWS S3 can be supported, as they recently added the S3 Select option, which can help retrieve only a subset of data from an object (currently supporting CSV objects in compressed or uncompressed format).

Ideally, we can add a metadata service as well, which may help to use Haystack along with data lakes.

Chroma DocumentStore throws error at creation if collection exists

Describe the bug
When initializing the ChromaDocumentStore, if you rerun the cell without changing the collection name, you will get an error telling you the collection already exists.

Expected behaviour: instead of the init calling 'create' by default, if the collection already exists it should simply connect to the existing collection. The current state means we cannot reuse a collection later on.
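The desired behaviour is a get-or-create pattern. As a rough sketch (the Chroma client is modelled as a plain dict here so the example is self-contained; Chroma's own client exposes a similar get_or_create_collection helper that could back the real fix):

```python
# Sketch of get-or-create semantics: create the collection on first use,
# connect to the existing one on any later call instead of raising.
def get_or_create_collection(client, name):
    if name not in client:
        client[name] = {"name": name, "items": []}  # "create"
    return client[name]  # otherwise just connect to the existing one
```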

To Reproduce
Any colab that has a cell with document_store = ChromaDocumentStore()
Here is one for example: https://colab.research.google.com/drive/19NzliNb5ZUo1fUbplwnUrZpNBxs9HvoR?usp=sharing

Describe your environment (please complete the following information):

  • Integration version: chroma_haystack-0.9.0

Google Gemini (Rest API) Integration

Summary and motivation

Google Gemini isn't only provided via Vertex AI. Let's create an integration for the REST API offering, too. There you only need a simple API Key - no need for a GCP account.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Add SagemakerGenerator

Is your feature request related to a problem? Please describe.
Sagemaker is widely used for LLM inference in production use cases. We should support Sagemaker in 2.0.

Describe the solution you'd like
Generator similar to the Sagemaker support in 1.x

Tasks

Pinecone Document Store

Summary and motivation

Briefly explain the request: why do we need this integration? What use cases does it support?

Detailed design

Explain the design in enough detail for somebody familiar with Haystack to understand, and for somebody familiar with
the implementation to implement. Get into specifics and corner-cases, and include examples of how the feature is used.
Also, if there's any new terminology involved, define it here.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • #161
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • deepset-ai/haystack-integrations#107
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • deepset-ai/devrel-board#241

Preview needs removing

The instructor embedder (and probably others) is causing errors right now because the imports still have '.preview' in them. There might also be some other out-of-date imports.
(Reported on discord)

milvus

why? why would you do such a thing? 😭

Add Cohere LLM integration

Summary and motivation

Add support for CohereGenerator and CohereChatGenerator.

Detailed design

Cohere is one of the leading LLM providers and we should have Haystack Cohere LLM integration. See https://docs.cohere.com/reference/generate and https://docs.cohere.com/reference/chat for more details.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/ (Chat generator missing)
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • deepset-ai/haystack-integrations#150
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • #265

Add AnthropicGenerator

Is your feature request related to a problem? Please describe.
Anthropic models are widely used. We should support them in 2.0.

Describe the solution you'd like
Generator similar to the Anthropic support in 1.x

Tasks

Add AzureOpenAIEmbedders

Is your feature request related to a problem? Please describe.
OpenAI's embedding models hosted on Azure are widely used. We should add the relevant embedders to support them.

See also: deepset-ai/haystack#6620

Describe the solution you'd like
Similar to 1.x support the OpenAI embedding models hosted on Azure.

Describe alternatives you've considered
None

Add PGVector DocumentStore

Summary and motivation

PGVector is a popular request from our community. We should have it in 2.0.

Detailed design

TBD

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

Support Pinecone's hybrid vectors

Is your feature request related to a problem? Please describe.
Recently, Pinecone announced support for sparse-dense embeddings, allowing for hybrid vector search (both semantic and keyword search). However, Haystack doesn't currently support these.

Describe the solution you'd like
Native support for sparse-dense vectors!

Describe alternatives you've considered
I've created a small package called haystack-hybrid-embedding which can be installed to support hybrid vectors for now, but it's a hack!
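For context, a sparse-dense vector in Pinecone combines a dense values list with a sparse part expressed as parallel index/value arrays. A minimal sketch of the payload shape (the upsert plumbing and any Haystack wiring are omitted; the helper name is illustrative):

```python
# Sketch of Pinecone's sparse-dense vector payload: dense `values` plus a
# `sparse_values` object holding parallel indices/values arrays.
def make_sparse_dense_vector(vec_id, dense, sparse_indices, sparse_values):
    assert len(sparse_indices) == len(sparse_values)
    return {
        "id": vec_id,
        "values": dense,
        "sparse_values": {"indices": sparse_indices, "values": sparse_values},
    }
```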

Qdrant integration

Summary and motivation

The integration has already been made and published.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • Docs are published at https://docs.haystack.deepset.ai/
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • An integration tile has been added to https://github.com/deepset-ai/haystack-integrations
  • The integration has been listed in the Inventory section of this repo README
  • There is an example available to demonstrate the feature
  • The feature was announced through social media

Have a separate folder for each component type under integrations

Is your feature request related to a problem? Please describe.
As a user, I'm familiar with the convention in the haystack repository, and having the same convention here would make it easy for me to navigate.

Describe the solution you'd like
Create a /generators folder under the cohere_haystack integration folder.
The same can apply to jina_haystack and unstructured. This way, we can be more open to new components coming from the same providers.

Describe alternatives you've considered
We can leave it as it is.

Additional context
N/A

Tasks

  1. P1 integration:amazon-bedrock
    vblagoje
  2. P1 integration:astra
    masci
  3. P1 integration:cohere
    vblagoje
  4. P1 integration:elasticsearch
    masci
  5. P1 integration:google-ai
    masci
  6. P1 integration:google-vertex
    masci
  7. 1 of 1
    P1 integration:gradient
    masci
  8. P1 integration:instructor-embedders
    masci
  9. P1 integration:jina
    masci
  10. P1 integration:llama_cpp
    anakin87
  11. P1 integration:ollama
    anakin87
  12. 1 of 1
    P1 integration:opensearch
    masci
  13. P1 integration:pinecone
    masci
  14. P1 integration:qdrant
    masci
  15. P1 integration:unstructured-fileconverter
    anakin87
  16. P1 integration:weaviate
    silvanocerza
  17. integration:chroma
    masci

Add support for Cohere Embed v3 Models

Is your feature request related to a problem? Please describe.
Cohere has a new type of embedding models: Embed v3. Let's add support for them in 2.0 pipelines.
https://txt.cohere.com/introducing-embed-v3/
https://docs.cohere.com/reference/embed

Describe the solution you'd like
It seems like the only change we need to make in CohereEmbedder is to add the input_type parameter. See the documentation: https://docs.cohere.com/reference/embed
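The shape of that change can be sketched as follows. The input_type values shown ("search_document", "search_query") come from Cohere's Embed v3 documentation; the helper itself is illustrative rather than the actual CohereEmbedder code:

```python
# Sketch of an embed request payload: v3 models take an input_type
# parameter, while older v2 models are called without it.
def build_embed_request(texts, model, input_type=None):
    payload = {"texts": texts, "model": model}
    if input_type is not None:
        payload["input_type"] = input_type  # e.g. "search_document"
    return payload
```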

Describe alternatives you've considered
There's no other way of using these v3 models. When we pass the v3 model names, the CohereEmbedder (or the Cohere API, not quite sure which one) may not raise any error, but the model falls back to a v2 model.

Additional context
N/A

Google Vertex AI integration

Summary and motivation

This is already mostly implemented, but I'm creating this issue to track the remaining work.

Checklist

If the request is accepted, ensure the following checklist is complete before closing this issue.

  • The code was merged in the main branch
  • #155
  • There is a Github workflow running the tests for the integration nightly and at every PR
  • A label named like integration:<your integration name> has been added to this repo
  • The labeler.yml file has been updated
  • The package has been released on PyPI
  • #120
  • The integration has been listed in the Inventory section of this repo README
  • #117
  • #116
