Picking the Right (or Wrong) Use Case: Companies often define scenarios that favor their product and disadvantage competitors, then generalize results to claim superiority (e.g., "40% faster").
Read-Only Workloads: Most benchmarks focus on read-only data sets in pre-optimized formats because such setups are easier to run and more reproducible, even though most real-world workloads mix reads and writes.
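A minimal sketch of what measuring a mixed read/write workload might look like; `client` is a hypothetical wrapper exposing `search()` and `index()` methods, standing in for whatever client your system provides:

```python
import random
import time

def run_mixed_workload(client, queries, documents, write_ratio=0.3, n_ops=10_000):
    """Interleave reads and writes instead of benchmarking a frozen, read-only index."""
    latencies = {"read": [], "write": []}
    for _ in range(n_ops):
        if random.random() < write_ratio:
            doc = random.choice(documents)
            start = time.perf_counter()
            client.index(doc)                     # write path: ingest while querying
            latencies["write"].append(time.perf_counter() - start)
        else:
            query = random.choice(queries)
            start = time.perf_counter()
            client.search(query)                  # read path: measured under concurrent writes
            latencies["read"].append(time.perf_counter() - start)
    return latencies
```

Read latencies collected this way are typically less flattering than read-only numbers, because the system is also merging, flushing, or rebuilding index structures while it serves queries.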
Filtering in Vector Search: For Approximate Nearest Neighbor (ANN) search, especially with HNSW, applying a restrictive filter can counterintuitively make queries slower, because the index must examine far more candidates to collect enough results that satisfy the filter.
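A small, self-contained illustration (brute-force search with NumPy standing in for an ANN index, with made-up data) of why a restrictive filter forces the search to walk through many more candidates to return k results:

```python
import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(100_000, 128))
allowed = rng.random(100_000) < 0.01          # restrictive filter: only ~1% of docs qualify
query = rng.normal(size=128)

def filtered_top_k(query, vectors, allowed, k=10):
    """Post-filtering: rank everything, then walk down the list until k allowed hits are found."""
    order = np.argsort(np.linalg.norm(vectors - query, axis=1))
    hits, examined = [], 0
    for idx in order:
        examined += 1
        if allowed[idx]:
            hits.append(idx)
            if len(hits) == k:
                break
    return hits, examined

hits, examined = filtered_top_k(query, vectors, allowed)
# With a 1% filter, 'examined' is typically on the order of a thousand candidates for k=10.
# An HNSW traversal faces the same problem: most visited neighbors fail the filter.
print(f"examined {examined} candidates to collect {len(hits)} allowed results")
```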
Outdated Competitor Software: Benchmarkers frequently update their own software but use outdated versions (e.g., 18 months old) of competitors' products, leading to skewed comparisons.
Implicit Biases and Defaults: Benchmarkers might unknowingly pick scenarios, shard sizes, memory allocations, or instance configurations that work well for their system but are suboptimal for competitors.
Cheating and Quality Metrics: In ANN search, parameters can be tuned to trade result quality (precision/recall) for speed, and these quality metrics are often, intentionally or not, omitted from performance benchmarks.
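A hedged sketch of how quality could be reported alongside speed: compute recall@k against exact (brute-force) ground truth, so a configuration that trades recall for latency cannot hide it. The `ann_index.search` call in the usage comment is a placeholder for whatever ANN system is being measured.

```python
import numpy as np

def exact_top_k(query, vectors, k):
    """Brute-force ground truth for one query."""
    return set(np.argsort(np.linalg.norm(vectors - query, axis=1))[:k])

def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors the approximate search actually returned."""
    return len(set(approx_ids) & exact_ids) / len(exact_ids)

# Usage sketch: report recall next to every latency number.
# approx_ids = ann_index.search(query, k=10)          # hypothetical ANN call
# truth = exact_top_k(query, vectors, k=10)
# print(f"recall@10 = {recall_at_k(approx_ids, truth):.3f}")
```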
Creative Statistics: Benchmarkers might identify one specific edge case (e.g., a particular query or data type) where their system performs exceptionally well (e.g., 10x faster) and then generalize that single win into a claim of overall superiority.
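For illustration (the numbers are made up), a short sketch showing how summarizing the same per-query speedups with the maximum versus the geometric mean or median tells very different stories:

```python
import statistics

# Hypothetical per-query speedups of "our system" vs. a competitor.
speedups = [1.1, 0.9, 1.0, 1.2, 0.8, 10.0]     # one 10x edge case among ordinary queries

headline = max(speedups)                        # "10x faster!" -- the cherry-picked claim
geomean = statistics.geometric_mean(speedups)   # ~1.5x -- the overall picture
median = statistics.median(speedups)            # ~1.05x -- the typical query

print(f"headline: {headline:.1f}x, geometric mean: {geomean:.2f}x, median: {median:.2f}x")
```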
Lack of Reproducibility: Benchmarks often fail to publish all of the details and artifacts needed for others to independently run and verify the results.
Automation and Reproducibility: Benchmarks should be automated and run regularly (e.g., nightly) to track performance changes over time and avoid gradual degradation (the "slow boiling frog" problem).
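A minimal sketch of the kind of nightly check that catches the slow boiling frog: compare tonight's latency against a rolling baseline and fail loudly on regression. The file path, window size, and 10% threshold are illustrative assumptions.

```python
import json
import statistics
import sys
from pathlib import Path

HISTORY = Path("benchmark_history.json")   # illustrative path; keep it in the repo or CI artifact store
THRESHOLD = 1.10                           # flag anything more than 10% slower than the baseline

def check_regression(tonight_p95_ms: float) -> bool:
    history = json.loads(HISTORY.read_text()) if HISTORY.exists() else []
    if history:
        baseline = statistics.median(history[-14:])   # rolling baseline over the last ~two weeks of runs
        if tonight_p95_ms > baseline * THRESHOLD:
            print(f"REGRESSION: p95 {tonight_p95_ms:.1f} ms vs baseline {baseline:.1f} ms")
            return False
    history.append(tonight_p95_ms)
    HISTORY.write_text(json.dumps(history))
    return True

if __name__ == "__main__":
    # The nightly job passes in the measured p95 latency, e.g., from a harness like the ones above.
    ok = check_regression(float(sys.argv[1]))
    sys.exit(0 if ok else 1)
```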
Do Your Own Benchmarks: Users should conduct their own benchmarks tailored to their specific data size, structure, read/write ratio, exact query types, acceptable latency, and hardware, as existing benchmarks are unlikely to be 100% relevant.
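A sketch of a do-it-yourself harness: replay your own queries against your own data and report the latency percentiles you actually care about. The `client.search` call is again a placeholder for whichever system you are evaluating.

```python
import statistics
import time

def benchmark_own_queries(client, queries, warmup=100):
    """Measure latency percentiles for *your* query mix, not a vendor's."""
    for q in queries[:warmup]:
        client.search(q)                      # warm caches before measuring
    samples = []
    for q in queries[warmup:]:
        start = time.perf_counter()
        client.search(q)
        samples.append((time.perf_counter() - start) * 1000)   # milliseconds
    cuts = statistics.quantiles(samples, n=100)                 # 99 cut points -> percentiles
    p50, p95, p99 = cuts[49], cuts[94], cuts[98]
    print(f"p50={p50:.1f} ms  p95={p95:.1f} ms  p99={p99:.1f} ms over {len(samples)} queries")
    return samples
```

Run it on the hardware you will actually deploy on, with your real documents and filters, and the numbers will tell you more than any published comparison.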
Learn from Flaws: Even if a benchmark is flawed or biased, it can still provide useful insights into where a system's potential strengths lie, what scenarios a vendor highlights, and what works well for them.