Jun 29, 2023

Haystack + Pinecone Hybrid Vectors

Recently, Pinecone announced support for sparse-dense embeddings, which enables hybrid vector search. This is pretty awesome, as it lets you support both keyword-style queries that require exact matches (with the sparse vectors) and semantic queries that capture the intent of the query (with the dense vectors). The two components of the hybrid vector are sent to Pinecone separately, and it stores them as a single unified vector. Your ANN results then incorporate both components, with a configurable \( \alpha \) to weight one against the other. The standard combination is \( \alpha \cdot \text{dense} + (1-\alpha) \cdot \text{sparse} \), so \( \alpha = 1 \) is purely dense and \( \alpha = 0 \) is purely sparse.
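To make the weighting concrete, here's a minimal sketch in plain Python, assuming dot-product scoring for both components. The function names (`weight_hybrid_query`, `hybrid_score`) are my own illustrations, not part of the Pinecone client or of haystack-hybrid-embedding. It shows that scaling the query's dense part by α and its sparse part by 1 − α before taking dot products gives the same number as weighting the two scores afterwards.

```python
from typing import Dict, List, Tuple


def dense_dot(a: List[float], b: List[float]) -> float:
    """Dot product of two dense vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def sparse_dot(a: Dict[int, float], b: Dict[int, float]) -> float:
    """Dot product of two sparse vectors stored as {index: value}."""
    return sum(value * b.get(index, 0.0) for index, value in a.items())


def weight_hybrid_query(
    dense: List[float], sparse: Dict[int, float], alpha: float
) -> Tuple[List[float], Dict[int, float]]:
    """Scale the dense part by alpha and the sparse part by (1 - alpha).

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    weighted_dense = [alpha * v for v in dense]
    weighted_sparse = {i: (1.0 - alpha) * v for i, v in sparse.items()}
    return weighted_dense, weighted_sparse


def hybrid_score(
    query_dense: List[float],
    query_sparse: Dict[int, float],
    doc_dense: List[float],
    doc_sparse: Dict[int, float],
    alpha: float,
) -> float:
    """alpha * dense score + (1 - alpha) * sparse score."""
    return alpha * dense_dot(query_dense, doc_dense) + (1.0 - alpha) * sparse_dot(
        query_sparse, doc_sparse
    )


# Toy vectors, made up for this example.
query_dense, query_sparse = [1.0, 0.0], {3: 2.0}
doc_dense, doc_sparse = [0.5, 0.5], {3: 1.0, 7: 4.0}

# Weight the query first, then score with plain dot products.
wd, ws = weight_hybrid_query(query_dense, query_sparse, alpha=0.8)
score_from_weighted_query = dense_dot(wd, doc_dense) + sparse_dot(ws, doc_sparse)

# Score each component separately, then weight the scores.
score_from_formula = hybrid_score(
    query_dense, query_sparse, doc_dense, doc_sparse, alpha=0.8
)
```

Both routes agree because the dot product is linear, so the weighting can be applied to the query vectors on the client side before they are sent off.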

I've recently been on a Haystack kick and it's fantastic; I find both the code and the docs to be much higher quality than LangChain's (though I still fall back to LangChain for some complex agents). Sadly, Haystack doesn't currently support hybrid vectors. Since I wanted to start playing with this right away, I built a little library called haystack-hybrid-embedding, which adds hybrid vector support to Haystack.

Just pip install haystack-hybrid-embedding and you're off!

from haystack_hybrid_embedding import SpladeEmbeddingEncoder
from haystack_hybrid_embedding.pinecone import PineconeHybridDocumentStore, SparseDenseRetriever

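document_store = PineconeHybridDocumentStore(...)  # constructor arguments elided

retriever = SparseDenseRetriever(
    sparse_encoder=SpladeEmbeddingEncoder(),  # SPLADE encoder for the sparse component
    alpha=0.8,  # weight on the dense component, per the formula above
    ...  # remaining arguments elided
)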

Hopefully we get native support in Haystack soon, but for now feel free to use haystack-hybrid-embedding to bridge the gap.