Vector Search Benchmark[eting] - Philipp Krenn, Elastic

The Problem with Benchmarking 00:03

  • The term "benchmarketing" captures what most published benchmarks really are: marketing, with vendors framing results to favor their own product.
  • Many vector search users are unhappy with performance, and existing benchmarks are often perceived as very hard to interpret or misleading.
  • Depending on whose benchmark you read, every vendor appears both faster and slower than every competitor, which shows how unreliable these comparisons are.
  • The speaker advises against trusting "glossy charts": the better a benchmark looks, the more it may be hiding weak underlying material.

Common Benchmarking Pitfalls 01:41

  • Picking the Right (or Wrong) Use Case: Companies often define scenarios that favor their product and disadvantage competitors, then generalize results to claim superiority (e.g., "40% faster").
  • Read-Only Workloads: Most benchmarks focus on read-only data sets in optimized formats because they are easier and more reproducible, despite most real-world workloads involving read/write operations.
  • Filtering in Vector Search: For Approximate Nearest Neighbor (ANN) search, especially HNSW, applying a restrictive filter can counterintuitively make queries slower, because many more candidates must be traversed before enough of them pass the filter (a minimal sketch of this effect follows this list).
  • Outdated Competitor Software: Benchmarkers frequently update their own software but use outdated versions (e.g., 18 months old) of competitors' products, leading to skewed comparisons.
  • Implicit Biases and Defaults: Benchmarkers might unknowingly pick scenarios, shard sizes, memory allocations, or instance configurations that work well for their system but are suboptimal for competitors.
  • Cheating and Quality Metrics: In ANN search, parameters can be tuned to trade result quality (precision/recall) for speed, and these quality metrics are often skipped, intentionally or not, in performance benchmarks (a recall@k sketch also follows this list).
  • Creative Statistics: Benchmarkers might identify one specific edge case (e.g., a particular query or data type) where their system performs exceptionally (e.g., 10x faster) and then generalize that single win to claim overall superiority.
  • Lack of Reproducibility: Benchmarks often fail to publish all the necessary details or pieces for others to independently run and verify the results.
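
The candidate-explosion effect behind filtered ANN search is easy to reproduce even without a real HNSW index. The sketch below is a rough stand-in for post-filtering, not a graph traversal: plain NumPy, a toy corpus, and hypothetical names throughout. It scores vectors exhaustively and counts how many top-ranked candidates have to be walked before k results survive a roughly 1%-selective filter.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: 100k vectors, each tagged with a category used for filtering.
n, dim, k = 100_000, 64, 10
vectors = rng.normal(size=(n, dim)).astype(np.float32)
categories = rng.integers(0, 100, size=n)         # 100 categories -> ~1% selectivity each

def search(query, filter_category=None, k=10):
    """Exhaustive scoring as a stand-in for an ANN index: walk candidates
    best-first and stop once k of them pass the (optional) filter."""
    scores = vectors @ query                       # dot-product similarity
    order = np.argsort(-scores)                    # best candidates first
    hits, examined = [], 0
    for idx in order:
        examined += 1
        if filter_category is None or categories[idx] == filter_category:
            hits.append(idx)
            if len(hits) == k:
                break
    return hits, examined

query = rng.normal(size=dim).astype(np.float32)
_, unfiltered = search(query)                      # examines exactly k candidates
_, filtered = search(query, filter_category=42)    # examines on the order of k / 0.01 candidates
print(f"candidates examined: unfiltered={unfiltered}, filtered={filtered}")
```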
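
Result quality is also cheap to measure alongside latency. Below is a minimal recall@k helper, assuming you can compute exact (brute-force) neighbors for a sample of queries as ground truth; the function name and array layout are illustrative, not from the talk.

```python
import numpy as np

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray, k: int = 10) -> float:
    """Fraction of the true top-k neighbors that the ANN engine actually returned.
    Both arrays have shape (num_queries, k) and hold document ids."""
    found = sum(len(set(a[:k]) & set(e[:k])) for a, e in zip(approx_ids, exact_ids))
    return found / (len(exact_ids) * k)

# Two queries, k=3: the first result set misses one true neighbor -> recall 5/6.
exact = np.array([[1, 2, 3], [4, 5, 6]])
approx = np.array([[1, 2, 9], [4, 5, 6]])
print(recall_at_k(approx, exact, k=3))             # ~0.83
```

Reporting only queries per second leaves out exactly this number, which is how a tuned-down recall can masquerade as a speedup.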

Improving Benchmarks 09:50

  • Automation and Reproducibility: Benchmarks should be automated and run regularly (e.g., nightly) to track performance over time and catch gradual degradation (the "boiling frog" problem).
  • Do Your Own Benchmarks: Users should benchmark with their own data size and structure, read/write ratio, exact query types, acceptable latency, and hardware, since existing benchmarks are unlikely to be 100% relevant (see the harness sketch after this list).
  • Learn from Flaws: Even if a benchmark is flawed or biased, it can still provide useful insights into where a system's potential strengths lie, what scenarios a vendor highlights, and what works well for them.
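
A do-it-yourself benchmark does not have to be elaborate. The sketch below assumes `search_fn` wraps your own client call, `queries` is a sample of your real queries, and `baseline.json` stores the previous (e.g., nightly) run; all of these names are placeholders. Measuring latency percentiles on your own workload and comparing them against a stored baseline covers both the automation and the do-your-own-benchmark points above.

```python
import json
import time
from pathlib import Path

def run_benchmark(search_fn, queries, warmup=20):
    """Time search_fn over your own queries and report latency percentiles in ms."""
    for q in queries[:warmup]:                     # warm up caches before measuring
        search_fn(q)
    latencies = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q)
        latencies.append((time.perf_counter() - start) * 1000)
    latencies.sort()
    return {
        "p50_ms": latencies[len(latencies) // 2],
        "p99_ms": latencies[int(len(latencies) * 0.99)],
    }

def check_regression(result, baseline_path="baseline.json", tolerance=1.10):
    """Compare this run against the stored baseline and flag >10% slowdowns.
    The first run simply becomes the baseline for future (e.g., nightly) runs."""
    path = Path(baseline_path)
    if not path.exists():
        path.write_text(json.dumps(result))
        return []
    baseline = json.loads(path.read_text())
    return [metric for metric, value in result.items()
            if value > baseline[metric] * tolerance]
```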