Vector Search Benchmark[eting] - Philipp Krenn, Elastic

The Problem with Benchmarking 00:03

  • The term "benchmarketing" captures what most published benchmarks really are: marketing, with vendors framing results to favor their own product.
  • Many vector search users are unhappy with performance, and existing benchmarks are often perceived as very hard to interpret or misleading.
  • Depending on whose benchmark you read, every vendor appears both faster and slower than every competitor, which shows how unreliable these comparisons are.
  • The speaker advises against trusting "glossy charts": the better a benchmark looks, the more it may be hiding weak underlying material.

Common Benchmarking Pitfalls 01:41

  • Picking the Right (or Wrong) Use Case: Companies often define scenarios that favor their product and disadvantage competitors, then generalize results to claim superiority (e.g., "40% faster").
  • Read-Only Workloads: Most benchmarks focus on read-only data sets in optimized formats because they are easier and more reproducible, despite most real-world workloads involving read/write operations.
  • Filtering in Vector Search: For Approximate Nearest Neighbor (ANN) search, especially HNSW, applying a restrictive filter can counterintuitively make queries slower, because many more candidates must be traversed before enough of them pass the filter (a minimal sketch of this effect follows this list).
  • Outdated Competitor Software: Benchmarkers frequently update their own software but use outdated versions (e.g., 18 months old) of competitors' products, leading to skewed comparisons.
  • Implicit Biases and Defaults: Benchmarkers might unknowingly pick scenarios, shard sizes, memory allocations, or instance configurations that work well for their system but are suboptimal for competitors.
  • Cheating and Quality Metrics: In ANN search, parameters can be tuned to trade result quality (precision/recall) for speed, and these quality metrics are often skipped, intentionally or not, in performance benchmarks (a recall@k sketch also follows this list).
  • Creative Statistics: Benchmarkers might identify one specific edge case (e.g., a particular query or data type) where their system performs exceptionally (e.g., 10x faster) and then generalize that single win to claim overall superiority.
  • Lack of Reproducibility: Benchmarks often fail to publish all the necessary details or pieces for others to independently run and verify the results.
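
The candidate-explosion effect behind filtered ANN search is easy to reproduce even without a real HNSW index. The sketch below is a rough stand-in for post-filtering, not a graph traversal: plain NumPy, a toy corpus, and hypothetical names throughout. It scores vectors exhaustively and counts how many top-ranked candidates have to be walked before k results survive a roughly 1%-selective filter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 100k vectors, each tagged with a category used for filtering.
n, dim, k = 100_000, 64, 10
vectors = rng.normal(size=(n, dim)).astype(np.float32)
categories = rng.integers(0, 100, size=n)         # 100 categories -> ~1% selectivity each

def search(query, filter_category=None, k=10):
    """Exhaustive scoring as a stand-in for an ANN index: walk candidates
    best-first and stop once k of them pass the (optional) filter."""
    scores = vectors @ query                       # dot-product similarity
    order = np.argsort(-scores)                    # best candidates first
    hits, examined = [], 0
    for idx in order:
        examined += 1
        if filter_category is None or categories[idx] == filter_category:
            hits.append(idx)
            if len(hits) == k:
                break
    return hits, examined

query = rng.normal(size=dim).astype(np.float32)
_, unfiltered = search(query)                      # examines exactly k candidates
_, filtered = search(query, filter_category=42)    # examines on the order of k / 0.01 candidates
print(f"candidates examined: unfiltered={unfiltered}, filtered={filtered}")
```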
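
Result quality is also cheap to measure alongside latency. Below is a minimal recall@k helper, assuming you can compute exact (brute-force) neighbors for a sample of queries as ground truth; the function name and array layout are illustrative, not from the talk.

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the ANN engine actually returned.
    Both arrays have shape (num_queries, k) and hold document ids."""
    found = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return found / (len(exact_ids) * k)

# Two queries, k=3: the first result set misses one true neighbor -> recall 5/6.
exact = np.array([[1, 2, 3], [4, 5, 6]])
approx = np.array([[1, 2, 9], [4, 5, 6]])
print(recall_at_k(approx, exact, k=3))             # ~0.83
```

Reporting only queries per second leaves out exactly this number, which is how a tuned-down recall can masquerade as a speedup.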

Improving Benchmarks 09:50

  • Automation and Reproducibility: Benchmarks should be automated and run regularly (e.g., nightly) to track performance over time and catch gradual degradation (the "boiling frog" problem).
  • Do Your Own Benchmarks: Users should benchmark with their own data size and structure, read/write ratio, exact query types, acceptable latency, and hardware, since existing benchmarks are unlikely to be 100% relevant (see the harness sketch after this list).
  • Learn from Flaws: Even if a benchmark is flawed or biased, it can still provide useful insights into where a system's potential strengths lie, what scenarios a vendor highlights, and what works well for them.
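
A do-it-yourself benchmark does not have to be elaborate. The sketch below assumes `search_fn` wraps your own client call, `queries` is a sample of your real queries, and `baseline.json` stores the previous (e.g., nightly) run; all of these names are placeholders. Measuring latency percentiles on your own workload and comparing them against a stored baseline covers both the automation and the do-your-own-benchmark points above.

```python
import json
import time
from pathlib import Path

def run_benchmark(search_fn, queries, warmup=20):
    """Time search_fn over your own queries and report latency percentiles in ms."""
    for q in queries[:warmup]:                     # warm up caches before measuring
        search_fn(q)
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[int(len(latencies) * 0.99)],
    }

def check_regression(result, baseline_path="baseline.json", tolerance=1.10):
    """Compare this run against the stored baseline and flag >10% slowdowns.
    The first run simply becomes the baseline for future (e.g., nightly) runs."""
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps(result))
        return []
    baseline = json.loads(path.read_text())
    return [metric for metric, value in result.items()
            if value > baseline[metric] * tolerance]
```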