Choosing a Vector Database

Choosing a vector database: a constraint-first guide to a noisy market.

Vector-database shopping is one of the most overcomplicated decisions in modern RAG engineering: dozens of vendors with confusingly similar pitches, benchmarks that mostly measure the wrong thing, and the worst possible default, which is picking by brand familiarity and then switching twice. This entry is opinionated. It explains what a vector DB actually is, which seven axes genuinely differ between products, just enough ANN internals to read a vendor's claims, and a constraint-first selection procedure that fits on one page. The conclusion most teams arrive at after doing this work is "Postgres with `pgvector` is fine", and the value of this entry is letting you reach that conclusion deliberately instead of after migrating off Pinecone twice.

STEP 1

What "a vector database" actually is.

Strip the marketing and the minimum definition is three pieces of machinery glued together:

Vector storage. A typed column or table that holds fixed-dimension float arrays, with the metadata you index alongside them (doc id, source, tags, timestamps, tenant id).
An approximate-nearest-neighbor (ANN) index. A data structure built over those vectors that returns the top-k nearest vectors to a query in sub-linear time. Exact nearest-neighbor search at scale is too slow; approximation is the price you pay for being able to query millions of vectors in milliseconds.
A query engine. Takes a query vector plus filter conditions plus a k, walks the ANN index, applies filters, and returns ranked results — ideally also fusing with a sparse (BM25) signal when you ask it to.
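
To make the three pieces concrete, here is a minimal sketch in plain Python. It is not how any real product is built: the "index" is an exact brute-force scan standing in for ANN, and the names `TinyVectorStore`, `Record`, and `where` are invented for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Record:
    doc_id: str
    vector: list
    metadata: dict = field(default_factory=dict)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TinyVectorStore:
    """Toy store: storage + an exact scan (standing in for ANN) + a query engine."""

    def __init__(self, dim):
        self.dim = dim
        self.records = []  # vector storage: fixed-dimension vectors plus metadata

    def add(self, doc_id, vector, **metadata):
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dims, got {len(vector)}")
        self.records.append(Record(doc_id, list(vector), metadata))

    def query(self, vector, k=3, where=None):
        # Query engine: apply equality filters, score every survivor, rank, cut at k.
        where = where or {}
        candidates = [
            r for r in self.records
            if all(r.metadata.get(key) == val for key, val in where.items())
        ]
        scored = [(r.doc_id, round(cosine(vector, r.vector), 3)) for r in candidates]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore(dim=3)
store.add("a", [1.0, 0.0, 0.0], tenant_id=7, lang="en")
store.add("b", [0.9, 0.1, 0.0], tenant_id=7, lang="de")
store.add("c", [0.0, 1.0, 0.0], tenant_id=8, lang="en")

print(store.query([1.0, 0.0, 0.0], k=2))
# → [('a', 1.0), ('b', 0.994)]
print(store.query([1.0, 0.0, 0.0], k=2, where={"tenant_id": 7, "lang": "en"}))
# → [('a', 1.0)]
```

Everything a real product adds (a sub-linear index, durable storage, filter-aware traversal) replaces one of these three parts without changing the overall shape.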

By this definition, "vector database" includes the obvious dedicated products (Pinecone, Qdrant, Weaviate, Milvus, Chroma, Turbopuffer) and any general-purpose database that bolts on a vector column and an ANN index: Postgres with pgvector, OpenSearch with knn_vector, MongoDB Atlas Vector Search, SQLite with the sqlite-vss extension. The first decision is not "which dedicated vector DB" but "do I need a dedicated one at all, or does a vector column on a database I already operate solve this?" For a large fraction of teams, the honest answer is the latter.

The diagnostic question, before any vendor evaluation: what am I going to do with the vectors that my existing database cannot do? If the answer is "I don't know yet," start with pgvector on the Postgres you already run. Migrating off it later is cheap; migrating off a managed dedicated vendor with proprietary APIs and a price floor is not.

STEP 2

The axes that actually differ between products.

Most vendor comparison tables score the wrong things ("supports cosine similarity: yes/yes/yes"). The axes that move real production decisions:

ANN algorithm and its tuning surface. HNSW, IVF, IVF-PQ, DiskANN, ScaNN. Each implies a different recall×latency×memory profile (see Step 3). Vendors that lock the algorithm choice or hide the tuning knobs are making the trade-off for you.
Filtering quality. "Retrieve top-k where tenant_id = 7 AND lang = 'en' AND created > '2026-01-01'" is the realistic query shape, and it is exactly where naïve ANN implementations break (see Step 4 of hybrid search and reranking for the related precision/recall question). Whether the vendor supports pre-filtered, post-filtered, or properly filter-aware ANN search dominates production recall.
Freshness and update model. HNSW does not love deletes; some indexes are append-only, some require periodic rebuilds, some claim "real-time" but stall under heavy update load. If your corpus updates frequently — tickets, docs, code — freshness is a first-class requirement, not a footnote.
Hybrid retrieval support. A two-stage pipeline (BM25 plus dense retrieval, followed by a reranker) is the production default. Vendors with first-class lexical search in the same query path (OpenSearch, Weaviate, Qdrant, recent pgvector + Postgres FTS, with the caveat that stock Postgres FTS ranks with ts_rank rather than true BM25) save you a whole subsystem. Dense-only vendors that require you to operate a separate BM25 store are quietly more expensive.
Multi-tenancy story. Many small tenants vs one big tenant changes everything: namespaces, per-tenant indexes, isolation, noisy-neighbor exposure, per-tenant cost. Some vendors handle this elegantly (Pinecone namespaces, Qdrant payload-based partitioning within a collection); others force you to manage shard-per-tenant yourself.
Operational model. Managed cloud only, self-hostable, library-embedded, or "library plus standalone server." Each is a different on-call story. A managed cloud vendor goes down and you wait; a self-hosted cluster goes down and you fix it. Both are fine; they suit different organizations.
Cost shape. Per-vector storage, per-query, per-pod-hour, or rolled into a database bill. Pricing is often the deciding factor, but "cheap per vector" with a steep query-cost gradient can be more expensive at production QPS than a flat per-node cost.
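
The cost-shape point is easiest to see with arithmetic. The sketch below compares a usage-priced plan (per stored vector plus per query) with a flat per-node plan. Every price and capacity figure is invented for illustration and does not reflect any vendor's price list.

```python
import math

SECONDS_PER_MONTH = 60 * 60 * 24 * 30


def usage_priced_monthly(n_vectors, qps, price_per_million_vectors, price_per_million_queries):
    """Monthly bill when storage and queries are metered separately."""
    storage = n_vectors / 1e6 * price_per_million_vectors
    queries = qps * SECONDS_PER_MONTH / 1e6 * price_per_million_queries
    return storage + queries


def node_priced_monthly(n_vectors, vectors_per_node, price_per_node):
    """Monthly bill when you pay per node regardless of query volume."""
    nodes = max(1, math.ceil(n_vectors / vectors_per_node))
    return nodes * price_per_node


# Hypothetical inputs: 5M vectors, $2 per million vectors stored,
# $4 per million queries, versus one $400 node that holds 20M vectors.
for qps in (1, 200):
    usage = usage_priced_monthly(5_000_000, qps, 2.0, 4.0)
    flat = node_priced_monthly(5_000_000, 20_000_000, 400.0)
    print(f"qps={qps:>3}  usage-priced=${usage:,.2f}  per-node=${flat:,.2f}")
```

Under these made-up numbers the usage-priced plan is far cheaper at 1 QPS and several times more expensive at 200 QPS, which is the "steep query-cost gradient" in action. Plug in real quotes before drawing conclusions.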

The trap is comparing vendors on the same axes that demos optimize for — latency at trivial scale, recall on synthetic benchmarks — while ignoring the axes that decide whether the system holds together in year two.
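
The filtering axis deserves a concrete picture. This sketch uses an exact top-k scan as a stand-in for an ANN index and compares post-filtering (search first, filter the results) with pre-filtering (filter first, search the survivors). The data and field names are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def top_k(query, rows, k):
    # Stand-in for the ANN index: exact ranking by dot product.
    return sorted(rows, key=lambda r: dot(query, r["vec"]), reverse=True)[:k]


def post_filter_search(query, rows, k, predicate):
    # Search first, filter second: the filter can only shrink the result set.
    return [r for r in top_k(query, rows, k) if predicate(r)]


def pre_filter_search(query, rows, k, predicate):
    # Filter first, then rank only the rows that are allowed to match.
    return top_k(query, [r for r in rows if predicate(r)], k)


# Ten rows; tenant 7 owns only rows that sit far from the query.
rows = [
    {"id": i, "vec": [1 - i / 10, i / 10], "tenant_id": 7 if i in (5, 7, 8, 9) else 1}
    for i in range(10)
]
query = [1.0, 0.0]


def is_tenant_7(row):
    return row["tenant_id"] == 7


print([r["id"] for r in post_filter_search(query, rows, 3, is_tenant_7)])  # → []
print([r["id"] for r in pre_filter_search(query, rows, 3, is_tenant_7)])   # → [5, 7, 8]
```

Post-filtering returns nothing because the three global nearest neighbors all belong to another tenant. Real products soften this with over-fetching or filter-aware index traversal, and how well they do it is exactly what to test.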

STEP 3

ANN internals: just enough to read a vendor pitch.

You do not need to implement HNSW to choose a vector DB. You do need to know what each of the major families buys you and charges you for, because every product pitches an algorithm and the algorithm constrains the trade-offs.

algorithm   |  recall  |  latency  |  memory  |  build  |  updates
----------------------------------------------------------------
HNSW        |  high    |  low      |  HIGH    |  slow   |  poor (deletes)
IVF (flat)  |  med     |  med      |  med     |  fast   |  good
IVF-PQ      |  med-low |  low      |  LOW     |  med    |  good
DiskANN     |  high    |  med      |  low(*)  |  slow   |  good (with care)
ScaNN       |  high    |  low      |  med     |  med    |  ok

(*) DiskANN keeps the bulk of the index on SSD, with a small RAM
    working set — the headline trick that makes billion-scale on
    one box plausible.
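
The memory column can be turned into a back-of-envelope estimate. The formulas below are common rules of thumb, not exact accounting for any product: float32 vectors cost 4 bytes per dimension, an HNSW graph adds roughly 2·M four-byte links per vector at the base layer, and IVF-PQ stores one byte per sub-quantizer plus an 8-byte id. The corpus size and parameters are assumptions chosen for illustration.

```python
def raw_bytes(n, dim):
    """Uncompressed float32 vectors."""
    return n * dim * 4


def hnsw_bytes(n, dim, m):
    """Rule of thumb: full vectors plus about 2*M four-byte links per vector."""
    return n * (dim * 4 + 2 * m * 4)


def ivfpq_bytes(n, n_subquantizers, id_bytes=8):
    """8-bit PQ codes (one byte per sub-quantizer) plus an id per vector."""
    return n * (n_subquantizers + id_bytes)


n, dim = 100_000_000, 768  # assumed corpus: 100M vectors of 768 dimensions
gb = 1e9

print(f"raw float32 : {raw_bytes(n, dim) / gb:8.1f} GB")       # →    307.2 GB
print(f"HNSW (M=16) : {hnsw_bytes(n, dim, 16) / gb:8.1f} GB")  # →    320.0 GB
print(f"IVF-PQ (64) : {ivfpq_bytes(n, 64) / gb:8.1f} GB")      # →      7.2 GB
```

With these assumptions the PQ codes alone are 48× smaller than the raw vectors (3072 bytes down to 64), which is why the HIGH and LOW entries in the table sit so far apart.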

HNSW is the default in most dedicated vendors and the usual choice in pgvector (which also offers IVFFlat). Best-in-class recall and latency at moderate scale; pays for it in memory (the whole graph lives in RAM) and unhappy update behavior (deletes leave tombstones that degrade quality over time).
IVF (inverted file) clusters vectors into Voronoi cells and searches only the nearest few cells. Lower memory than HNSW, simpler, friendlier to updates — but tuning nlist and nprobe is more art than science.
IVF-PQ adds product quantization on top of IVF: vectors are compressed 10–100×, memory drops dramatically, recall takes a hit. The right choice when you have hundreds of millions of vectors and cannot afford to hold them all in RAM.
DiskANN is the modern "many vectors, one box" answer: most of the graph on SSD, a small RAM cache, careful prefetching. Billion-scale at low fixed cost. DiskANN-style indexes ship in Milvus, Azure Cosmos DB, and pgvectorscale, and the same keep-the-index-off-RAM idea drives object-storage-backed services such as Turbopuffer. Worth knowing about before you assume you need a sharded cluster.
ScaNN (Google) and FAISS (Meta) are toolkits as much as indexes — ScaNN ships inside several products, FAISS is the library underneath many of them. Knowing they exist is enough.
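
The IVF idea in particular is small enough to sketch. The toy below clusters vectors with a few rounds of k-means, then searches only the `nprobe` nearest cells. It is a teaching sketch in pure Python under simplifying assumptions (random data, squared Euclidean distance, naive k-means); real implementations such as FAISS are far more careful.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))


def assign(vectors, centroids):
    cells = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        cells[nearest(v, centroids)].append(idx)
    return cells


def build_ivf(vectors, nlist, iters=10, seed=0):
    """Naive k-means: returns (centroids, cells), where cells hold vector indices."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    centroids = [list(v) for v in rng.sample(vectors, nlist)]
    for _ in range(iters):
        for c, members in enumerate(assign(vectors, centroids)):
            if members:  # an empty cell keeps its previous centroid
                centroids[c] = [
                    sum(vectors[i][d] for i in members) / len(members) for d in range(dim)
                ]
    return centroids, assign(vectors, centroids)


def search_ivf(query, vectors, centroids, cells, k, nprobe):
    """Scan only the nprobe cells whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(query, centroids[i]))
    candidates = [idx for c in order[:nprobe] for idx in cells[c]]
    return sorted(candidates, key=lambda i: sq_dist(query, vectors[i]))[:k]


rng = random.Random(42)
data = [[rng.gauss(0, 1) for _ in range(8)] for _ in range(500)]
centroids, cells = build_ivf(data, nlist=16)
query = [rng.gauss(0, 1) for _ in range(8)]
exact = sorted(range(len(data)), key=lambda i: sq_dist(query, data[i]))[:10]

for nprobe in (1, 4, 16):
    approx = search_ivf(query, data, centroids, cells, k=10, nprobe=nprobe)
    overlap = len(set(approx) & set(exact)) / len(exact)
    print(f"nprobe={nprobe:2d}  recall@10={overlap:.2f}")
```

Probing all 16 cells reproduces the exact answer by construction. Smaller `nprobe` values scan fewer vectors and may miss neighbors that landed in an unprobed cell, which is the recall-for-latency trade the table describes.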

The recall×latency curve is not a vendor property; it is an algorithm-and-parameters property. A "fast and accurate" vendor running default HNSW with ef_search=16 will look faster, and return worse recall, than the same algorithm in another product at ef_search=128. Insist on apples-to-apples benchmark numbers (same algorithm, same recall target) or run your own, and never take marketing latency charts at face value.
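
One way to enforce "same recall target" is to compare each product at its cheapest setting that meets the target rather than at its defaults. The measurements below are invented purely to show the mechanics; `best_at_target` is a hypothetical helper, not part of any benchmark tool.

```python
def best_at_target(curve, recall_target):
    """Lowest-latency measured point that still meets the recall target, or None."""
    eligible = [p for p in curve if p["recall"] >= recall_target]
    return min(eligible, key=lambda p: p["p99_ms"]) if eligible else None


# Hypothetical sweep results for two products running the same algorithm.
product_a = [
    {"ef_search": 16, "recall": 0.82, "p99_ms": 3.0},
    {"ef_search": 64, "recall": 0.94, "p99_ms": 7.5},
    {"ef_search": 128, "recall": 0.97, "p99_ms": 12.0},
]
product_b = [
    {"ef_search": 16, "recall": 0.80, "p99_ms": 4.0},
    {"ef_search": 64, "recall": 0.95, "p99_ms": 6.0},
    {"ef_search": 128, "recall": 0.98, "p99_ms": 10.0},
]

for target in (0.80, 0.95):
    a = best_at_target(product_a, target)
    b = best_at_target(product_b, target)
    winner = "A" if a["p99_ms"] < b["p99_ms"] else "B"
    print(f"recall>={target:.2f}: A={a['p99_ms']}ms  B={b['p99_ms']}ms  faster={winner}")
# → recall>=0.80: A=3.0ms  B=4.0ms  faster=A
# → recall>=0.95: A=12.0ms  B=6.0ms  faster=B
```

In this made-up data the ranking flips once the recall target rises, which is why a latency chart without a stated recall target tells you very little.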

STEP 4

The landscape, by category not by brand.

Treat the market as four categories with different operational shapes. Pick the category first; the brand within a category is mostly a taste-and-pricing question.

Dedicated cloud (Pinecone, Turbopuffer, Zilliz Cloud). Pay someone else to operate the index. Zero ops, fastest path to production, opinionated APIs. Trade-offs: vendor lock-in (proprietary client libraries, migration is real work), price-per-vector that grows with the corpus, less control over the algorithm/tuning surface. Right for "I want to ship a feature this quarter and revisit the bill later."
Open-source dedicated (Qdrant, Weaviate, Milvus, Chroma, Vespa). You run the cluster. More tuning surface, no vendor lock-in, the option of self-hosted or managed. Trade-offs: real operational burden (HA, backups, upgrades), tuning required to hit the latencies the cloud vendors hand you. Right when scale or cost or compliance pushes you off the managed-cloud path.
General database + vectors (Postgres + pgvector, OpenSearch + knn_vector, MongoDB Atlas Vector, SQLite + vss). The vector index lives in a database you already operate. Joins, transactions, filters, BM25, and vectors in one query language. Cheaper than dedicated for small-to-medium corpora. Trade-offs: ANN quality and index size limits are bounded by the host DB's choices — pgvector with HNSW is excellent up to roughly 10–50M vectors per node and starts to hurt beyond that. Right for the majority of teams who do not actually have a billion-vector problem.
Library + your own server (FAISS, ScaNN, hnswlib). You build the storage layer. Maximum flexibility, the lowest possible cost at scale, the highest ongoing engineering investment. Right when none of the above fit (custom compression, exotic distance, embedded in another system) and you have engineers who want to own this surface area.

Two specific products are worth naming because they shift the default. pgvector has gotten good enough since ~2024 that "just use Postgres" is the right answer for far more teams than the dedicated-vendor market wants to admit. Turbopuffer (and disk- or object-storage-backed indexes generally, DiskANN included) made billion-scale retrieval cheap enough that the old "you need a sharded cluster at 100M vectors" rule no longer holds. If your last vector-DB evaluation was before 2025, redo it.

STEP 5

A constraint-first selection procedure.

The decision should run constraint-first, not vendor-first. Write down your real constraints, then map them to a category. The questions, in roughly this order:

How many vectors, at what QPS, with what p99 latency budget? Under 10M vectors and a few hundred QPS → pgvector on your existing Postgres handles it. 10–100M, growing → dedicated open-source or managed cloud is worth it. Hundreds of millions and up → the choice narrows to vendors with serious large-scale credentials (Milvus, Vespa, Turbopuffer, DiskANN-based) and the question shifts to ops.
What does filtering look like? If queries are pure k-NN with no filters, almost any product is fine. If most queries carry tenant-id, language, date-range, or category filters, the vendor's filter implementation matters more than its peak QPS. Test the filtered query shape, not the bare one.
How fresh does retrieval need to be? Minutes-old is fine for most knowledge-base RAG; seconds-old is needed for ticket-search-during-a-call style use cases. Vendors with batch-rebuild models cannot do the latter without engineering effort.
Do you need hybrid (BM25 + dense)? Almost all production RAG does (see hybrid search and reranking). Picking a dense-only vendor and operating BM25 separately is a real ongoing cost; picking a vendor with first-class lexical search in the same query path (OpenSearch, Weaviate, Qdrant, Postgres FTS + pgvector, with the caveat that stock Postgres FTS ranks with ts_rank rather than true BM25) saves a subsystem.
What is your multi-tenancy story? Many small tenants → vendors with namespace/collection isolation. One giant tenant → doesn't matter; pick on other axes.
What is the operational appetite? A team that already runs Postgres at scale should not adopt a new database to add a vector column. A team without strong ops capacity should not self-host a distributed dedicated vector DB.
What is the cost ceiling? Estimate cost at 12–24 months of growth, not today. Dedicated managed vendors are cheap at small scale and expensive at production scale; as a rough rule the curves cross somewhere around 10–30M vectors at typical pricing, so check the crossover against your own quote.
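
The first and sixth questions do most of the sorting, so they can be sketched as a small function. This is a deliberately crude encoding of the thresholds above, not a complete decision tool: the 500 QPS cut-off is an assumed reading of "a few hundred", the field names are invented, and the filtering, freshness, hybrid, multi-tenancy, and cost questions still need human judgment.

```python
from dataclasses import dataclass


@dataclass
class Constraints:
    n_vectors: int
    qps: int
    already_runs_host_db: bool   # e.g. a Postgres or OpenSearch you operate today
    has_ops_capacity: bool       # engineers willing to run a cluster
    needs_custom_index: bool = False  # exotic distance, custom compression, embedding


def pick_category(c):
    """Map scale and operational appetite to one of the Step 4 categories."""
    if c.needs_custom_index:
        return "library + your own server"
    if c.n_vectors < 10_000_000 and c.qps <= 500 and c.already_runs_host_db:
        return "general database + vectors"
    if c.n_vectors >= 100_000_000:
        return "large-scale specialist (ops-driven choice)"
    return "open-source dedicated" if c.has_ops_capacity else "dedicated cloud"


print(pick_category(Constraints(2_000_000, 100, True, False)))
# → general database + vectors
print(pick_category(Constraints(40_000_000, 300, True, True)))
# → open-source dedicated
print(pick_category(Constraints(40_000_000, 300, False, False)))
# → dedicated cloud
```

The ordering of the checks is the point: constraints pick the category, and only then does a brand comparison make sense.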

Apply those filters and the answer is usually narrow: most teams end up at pgvector or OpenSearch (because they already run one of them), Pinecone or Turbopuffer (because the team has no ops headcount), or Qdrant/Milvus/Weaviate (because they need the tuning surface and have engineers who will operate it). The numbers that matter when evaluating any of them are recall@k and latency measured on your own data, never the vendor's published benchmark.
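
Measuring recall@k and latency on your own data needs very little code. The harness below is a minimal sketch: `search_fn` is a hypothetical adapter you would write around each candidate's client, and the ground truth is assumed to come from an exact search over the same corpus, with at least one true neighbor per query.

```python
import math
import time


def percentile(sorted_values, pct):
    """Nearest-rank percentile over an already sorted list (pct is an int, 1-100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def evaluate(search_fn, queries, ground_truth, k):
    """Run every query once; report mean recall@k plus p50 and p99 latency in ms."""
    latencies_ms, recalls = [], []
    for query, truth in zip(queries, ground_truth):
        start = time.perf_counter()
        result = search_fn(query, k)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        expected = set(truth[:k])
        recalls.append(len(expected & set(result[:k])) / len(expected))
    latencies_ms.sort()
    return {
        "recall_at_k": sum(recalls) / len(recalls),
        "p50_ms": percentile(latencies_ms, 50),
        "p99_ms": percentile(latencies_ms, 99),
    }


# Stand-in for a real client: the second query misses one true neighbor.
def fake_search(query, k):
    answers = {"q0": [1, 2, 3], "q1": [4, 5, 99]}
    return answers[query][:k]


truth = [[1, 2, 3], [4, 5, 6]]
report = evaluate(fake_search, ["q0", "q1"], truth, k=3)
print(round(report["recall_at_k"], 3))  # → 0.833
```

Feed it the realistic query shape, filters included, and run the same query set against each candidate so the numbers are comparable.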

STEP 6

Common selection mistakes, in declining order of frequency.

Benchmark shopping. Picking the vendor with the best number on a public benchmark, run on a corpus and at a recall target that does not match yours. The benchmark numbers reflect one parameter setting on one dataset; your numbers will be different. Run a small evaluation (a few thousand queries) on your real corpus before signing anything.
Ignoring the filter story until production. Demo on bare k-NN, ship with filters, watch recall collapse. Filter behavior is the single most common nasty surprise. Test it first.
Optimizing for "billion-vector scale" you don't have. Most teams have 1–10M vectors and pick a vendor sized for 1B+. They pay for unused headroom and inherit complexity for nothing. Pick for next year's corpus, not the rocketship.
Treating it as a one-way door. A vector DB is not your durable system of record; it is an index over content that lives elsewhere. As long as you can rebuild from source, migration is annoying but bounded. Picking a vendor where you cannot rebuild (because the source documents weren't preserved, or because the ingestion pipeline only existed inside the vendor) is the real lock-in — not the API.
Switching too early. The first six months of any RAG system are dominated by retrieval-quality issues that no vendor change fixes. The right vendor fed the wrong queries, parsing (see document parsing), or chunking still loses to a worse vendor fed the right ones. Fix the upstream before re-platforming the index.
Skipping the rebuild drill. Build the script that recreates the entire vector index from source documents before you have an incident that requires it. Time it. If it takes a week and the index is your customer's search bar, that is a P0 you don't know you have yet.
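
Before running the rebuild drill for real, a rough estimate tells you whether to expect hours or days. The sketch below assumes the rebuild runs as a pipeline whose slowest stage sets the pace. The corpus size and throughput figures are placeholders to be replaced with numbers measured on your own pipeline.

```python
def rebuild_hours(n_docs, chunks_per_doc, embed_chunks_per_sec, upsert_chunks_per_sec):
    """Estimated wall-clock hours to re-embed and re-index the whole corpus."""
    total_chunks = n_docs * chunks_per_doc
    # Pipelined stages: the slower of embedding and upserting is the bottleneck.
    bottleneck = min(embed_chunks_per_sec, upsert_chunks_per_sec)
    return total_chunks / bottleneck / 3600


# Hypothetical corpus: 2M documents, 12 chunks each,
# embedding at 500 chunks/s and upserting at 2,000 chunks/s.
hours = rebuild_hours(2_000_000, 12, 500, 2_000)
print(f"estimated rebuild: {hours:.1f} h")  # → estimated rebuild: 13.3 h
```

An estimate is not a drill: parsing failures, rate limits, and index build time only show up when you actually run the script, so treat this number as a lower bound.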

The unifying point: vector DB selection is mostly a constraint problem dressed as a vendor problem. Know your scale, your filter shape, your freshness target, your hybrid needs, your multi-tenancy story, your operational capacity, and your cost ceiling, in that order, and the right category usually picks itself. The brand within the category is the part that gets the marketing attention and matters least. Spend the evaluation budget on running your own corpus against two candidates, not on reading more comparison tables.