Information Retrieval from the Ground Up - Philipp Krenn, Elastic

Introduction and Setup 00:00

  • Philipp Krenn introduces the talk's hands-on approach, which focuses on the "retrieval" half of retrieval-augmented generation (RAG) rather than on generation.
  • The audience is invited to follow along on a shared Elastic Cloud instance (elastai.engineer), with guidance on avoiding data collisions.
  • Elasticsearch, built on Apache Lucene, is the main search technology used; it supports keyword, vector, and hybrid search.

Basics of Keyword (Lexical) Search 03:42

  • Classical keyword or lexical search involves breaking text into individual tokens (words) using standard tokenization (mostly based on whitespace and punctuation).
  • Token data includes the term, start and end offset (for highlighting matches), and position (for phrase matching).
  • Search engines preprocess and store tokens and positions at ingestion time for efficient querying.
  • Removing stop words (common, less useful words) and stemming (reducing words to their root form) shrink the index and improve matching.
  • After this processing, the phrase "these are not the droids you're looking for" is reduced to just three tokens: "droid", "you", and "look".
  • Mismatched language analyzers can yield poor results, so appropriate language-specific analysis is necessary.
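
The analysis steps above can be sketched in plain Python. This is a toy illustration, not Elasticsearch's analyzer: the regular-expression tokenizer and the `toy_stem` helper are assumptions made for this example (a real English analyzer uses a proper stemming algorithm), and the stop word set is the 33-word English default list used by Lucene.

```python
import re

# Default English stop words as shipped with Lucene (33 entries).
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these", "they", "this",
    "to", "was", "will", "with",
}

def tokenize(text):
    """Split text into tokens, recording offsets and positions."""
    tokens = []
    pattern = r"[A-Za-z0-9]+(?:'[A-Za-z]+)?"
    for position, match in enumerate(re.finditer(pattern, text)):
        tokens.append({
            "term": match.group(),
            "start": match.start(),   # start offset, used for highlighting
            "end": match.end(),       # end offset, used for highlighting
            "position": position,     # token position, used for phrase queries
        })
    return tokens

def toy_stem(term):
    """Hypothetical, deliberately crude stemmer (NOT a real stemming algorithm)."""
    term = term.split("'")[0]         # drop contraction suffixes: you're -> you
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def analyze(text):
    """Lowercase, drop stop words (keeping original positions), then stem."""
    result = []
    for token in tokenize(text):
        term = token["term"].lower()
        if term in STOP_WORDS:
            continue
        result.append({**token, "term": toy_stem(term)})
    return result

tokens = analyze("These are not the droids you're looking for")
print([t["term"] for t in tokens])
print([t["position"] for t in tokens])
```

Note that the surviving tokens keep their original positions, so the removed stop words leave gaps; this matters for phrase matching later on.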

Handling Stop Words and Analysis Trade-offs 13:35

  • Each language comes with a carefully curated stop word list (e.g., ~33 stop words in English by default), but users may modify or skip stop word removal based on context.
  • Removing stop words can lead to data loss for searches like "to be or not to be", where all terms could be removed.
  • Tokenization choices and analyzers (e.g., handling hyphens, email addresses, or different languages) significantly affect search results.
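
The "to be or not to be" problem is easy to reproduce. The sketch below assumes a whitespace tokenizer and uses only the handful of English stop words this example needs (all of them are in the default list); `remove_stop_words` is a name made up for this illustration.

```python
# Subset of the default English stop word list; only the entries needed here.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "is"}

def remove_stop_words(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

print(remove_stop_words("To be or not to be"))   # nothing is left to search for
print(remove_stop_words("the droid is lost"))
```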

Creating and Querying Indexes in Elasticsearch 16:20

  • Practical demonstration of creating an index with a custom analysis pipeline (e.g., HTML removal, tokenization, lowercasing, stop word removal, stemming).
  • Sample Star Wars phrases are stored and analyzed in the index.
  • Searches match singular/plural and casing differences due to stemming and case normalization.
  • The "inverted index" data structure stores tokens with document references and positions for efficient matching.
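
A minimal sketch of an inverted index with positions, assuming plain lowercasing and whitespace tokenization (no stop words or stemming). The sample phrases and the `build_inverted_index` helper are illustrative, not taken from the demo.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each token to the documents containing it and its positions there."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(text.lower().split()):
            index[token][doc_id].append(position)
    return index

# Hypothetical sample documents.
docs = {
    1: "these are not the droids",
    2: "no i am your father",
    3: "do or do not there is no try",
}
index = build_inverted_index(docs)

# Looking up a term is a dictionary access, not a scan over all documents.
print(dict(index["no"]))
print(dict(index["do"]))
```

Because no stemming is applied here, a search for "droid" would not find document 1, which only contains "droids".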

Synonyms, Phrases, and Query Nuances 23:32

  • Keyword search matches only exact tokens; synonyms (like "droid" and "robot") require explicit synonym lists, which can now be LLM-generated.
  • Purely lexical search is simple and scalable, but it is limited by ambiguous terms (homonyms) and a lack of context.
  • Phrase queries leverage token positions for matching, but stop word removal and position shifts can cause mismatches.
  • "Slop" can be adjusted to permit small gaps in phrase matching.
  • Dictionary-based enhancements can help but involve licensing and availability challenges.
  • Compound noun languages (e.g., German) and partial word matches require tools like "n-grams," which increase storage and query cost.
  • Suggestions include storing multiple representations (with and without stop words or n-grams) and weighting different fields for hybrid queries.
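
To make the role of positions and slop concrete, here is a simplified phrase matcher. It is a sketch under stated assumptions: terms must appear in query order, and slop is counted as the number of extra positions between them; Lucene's actual slop calculation is more permissive (it also allows reordering at a higher cost). `phrase_match` is a hypothetical helper.

```python
def phrase_match(term_positions, phrase_terms, slop=0):
    """Return True if the terms occur in order with at most `slop` gaps in total.

    term_positions maps each term in one document to its list of positions.
    """
    if not phrase_terms or any(t not in term_positions for t in phrase_terms):
        return False
    for start in term_positions[phrase_terms[0]]:
        previous = start
        complete = True
        for term in phrase_terms[1:]:
            later = [p for p in term_positions[term] if p > previous]
            if not later:
                complete = False
                break
            previous = min(later)  # the closest following occurrence
        # A gap-free phrase spans exactly len(phrase_terms) - 1 positions.
        if complete and (previous - start) - (len(phrase_terms) - 1) <= slop:
            return True
    return False

# Positions left after stop word removal in
# "these are not the droids you're looking for".
doc = {"droid": [4], "you": [5], "look": [6]}

print(phrase_match(doc, ["droid", "look"], slop=0))  # "you" sits in between
print(phrase_match(doc, ["droid", "look"], slop=1))
```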

Fuzziness and Misspellings 36:08

  • Fuzzy search allows automatic correction for small misspellings (using Levenshtein distance per token).
  • Elasticsearch can auto-tune fuzziness based on word length; excessive fuzziness can degrade precision.
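
The per-token edit distance can be sketched as follows. The thresholds in `auto_fuzziness` follow Elasticsearch's documented `AUTO` defaults (terms of 1-2 characters must match exactly, 3-5 characters allow one edit, longer terms allow two). The distance here is plain Levenshtein; Elasticsearch by default also counts a swap of two adjacent characters as a single edit, which this sketch does not.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]

def auto_fuzziness(term):
    """Allowed edits by term length, mirroring the AUTO defaults."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2

def fuzzy_match(query_term, indexed_term):
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)

print(fuzzy_match("droit", "droid"))   # one substitution
print(fuzzy_match("driod", "droid"))   # a swap costs two plain edits
```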

Scoring and Relevance in Search 38:14

  • Documents are ranked by score, with the primary algorithm being BM25 (an improved form of TF-IDF).
    • Term Frequency (TF): More occurrences make a term more relevant, but importance flattens at higher frequencies.
    • Inverse Document Frequency (IDF): Rare terms are more relevant; common terms less so.
    • Field Length Normalization: Matches in shorter fields (like titles) score higher than those in longer fields.
  • Scores are only comparable within the same query, not between queries or as percentages.
  • Multi-term searches combine the per-term scores (and, in some scoring models, a coordination factor), so documents matching more of the query's terms rank higher.
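
The three BM25 ingredients can be written down as a short function. This is the textbook form of the formula, with k1 = 1.2 and b = 0.75 as in Lucene's defaults; the exact constants and factors inside a given Lucene version may differ, so treat absolute values as illustrative. The function names are made up for this sketch.

```python
import math

def idf(total_docs, docs_with_term):
    """Inverse document frequency: rare terms get a higher weight."""
    return math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))

def bm25_term_score(tf, field_len, avg_field_len, total_docs, docs_with_term,
                    k1=1.2, b=0.75):
    """Score of one query term in one document field."""
    # Field length normalization: longer-than-average fields are penalized.
    norm = k1 * (1 - b + b * field_len / avg_field_len)
    # Term frequency saturates: each extra occurrence adds less than the last.
    return idf(total_docs, docs_with_term) * tf * (k1 + 1) / (tf + norm)

# Same term and corpus statistics, increasing term frequency.
for tf in (1, 2, 5, 10):
    print(tf, round(bm25_term_score(tf, 10, 10, 1000, 50), 3))
```

For a multi-term query, the per-term scores are summed, which is why the resulting numbers are only meaningful relative to other documents for the same query.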

Limitations and Hybridization 45:59

  • Synonyms, context, and meaning are limited in classic keyword search; managing these requires additional tools or ML models.
  • Embedding models (vector-based approaches) allow representing documents and queries in high-dimensional space, enabling semantic matching.
    • Demonstrated with simple and practical representations of Star Wars character traits.
  • Dense vector models (e.g., OpenAI embeddings) store arrays of floating-point values; similarity is computed by proximity in vector space.
  • Sparse vector models (such as SPLADE) use numerous tokens with varying weights; more interpretable but potentially slower for large queries.
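
As a toy illustration of "proximity in vector space", the sketch below ranks hand-written two-dimensional vectors by cosine similarity. The dimensions and all values are hypothetical and chosen for this example only; real embedding models produce hundreds or thousands of dimensions with no human-readable meaning.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical dimensions: [force affinity, machine-likeness].
characters = {
    "Luke Skywalker": [0.9, 0.1],
    "Darth Vader": [0.9, 0.6],
    "R2-D2": [0.05, 0.95],
}

def nearest(query_vector, vectors):
    """Return names ordered from most to least similar to the query."""
    return sorted(vectors,
                  key=lambda name: cosine_similarity(query_vector, vectors[name]),
                  reverse=True)

query = [0.1, 0.9]  # "something droid-like"
print(nearest(query, characters))
```

Every character gets a non-zero similarity here, including the least relevant one; the ranking is informative, but there is no natural cut-off.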

Hands-on with Vector and Sparse Search 54:06

  • Examples show dense and sparse vector models being used alongside keyword search in Elasticsearch.
  • Sparse models expand queries into large token sets; overlap in these tokens between queries and documents yields relevance scores.
  • Dense models always produce some degree of match, which can lead to less obvious relevance for short queries and make setting relevance thresholds difficult.
  • Dense embedding-based searches depend strongly on selecting the right model and context window.
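
Sparse scoring by token overlap can be sketched as a weighted dot product over shared tokens. The expansions and weights below are invented for illustration; a real sparse model produces them, typically with many more tokens per text.

```python
def sparse_score(query_weights, doc_weights):
    """Sum of weight products over the tokens both sides share."""
    return sum(weight * doc_weights[token]
               for token, weight in query_weights.items()
               if token in doc_weights)

# Hypothetical expansion of the query "droid".
query = {"droid": 2.0, "robot": 1.4, "machine": 0.8}

# Hypothetical expansions of two documents.
doc_about_robots = {"robot": 1.5, "repair": 0.9}
doc_about_family = {"father": 1.2, "son": 0.7}

print(sparse_score(query, doc_about_robots))  # matches via the expanded token "robot"
print(sparse_score(query, doc_about_family))  # no shared tokens, so no match at all
```

Unlike the dense case, a document with no overlapping tokens scores exactly zero and drops out of the results, and the shared tokens show why a match happened.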

Challenges, Chunking, and Scaling 77:33

  • Text chunking (splitting large texts into smaller units) is critical for vector-based search to ensure semantic relevance and enable highlighting.
  • Strategies for chunking vary (by page, paragraph, sentence), sometimes with overlapping context.
  • Trade-offs in chunk size and context window affect recall and precision.
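
One of the simplest chunking strategies is a fixed-size word window with overlap. The sketch below assumes whitespace-separated words as the unit; `chunk_words` and its parameters are hypothetical, and real pipelines often split on sentences or paragraphs instead.

```python
def chunk_words(text, chunk_size, overlap=0):
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

print(chunk_words("a b c d e f g", chunk_size=3, overlap=1))
```

Larger chunks carry more context per vector but blur what the vector represents; more overlap reduces the risk of cutting a thought in half at the cost of more chunks to embed and store.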

Combining Multiple Search Methods (Hybrid Search) 80:05

  • Hybrid search combines keyword, dense, and sparse vector searches, potentially boosting certain fields (like brand names) or using custom ranking signals (e.g., product rating).
  • Reciprocal Rank Fusion (RRF) blends result positions from different search types, mitigating issues with differently scaled scores.
  • Elasticsearch supports combining query types in a single request, handling retrieval and reranking centrally.
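
Reciprocal Rank Fusion only looks at positions, which is why differently scaled scores stop being a problem. A minimal sketch, assuming the commonly used rank constant k = 60 and made-up document IDs:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs; each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for stability.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

keyword_hits = ["a", "b", "c"]   # hypothetical BM25 ranking
vector_hits = ["b", "c", "d"]    # hypothetical vector ranking

print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
```

Documents found by both retrievers ("b" and "c") end up above documents that only one retriever ranked, even one that ranked it first.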

Managing Indices, Performance, and Advanced Features 86:32

  • PostgreSQL vector search (pgvector) lacks some search-specific features (like full BM25). Hybrid search is recommended for most practical use cases.
  • For ingestion and deduplication, using content hashes as document IDs (with "create" operation) avoids redundant vector generation.
  • Handling extremely large datasets and updates requires an understanding of index structures and possible performance optimizations (e.g., HNSW merges in Elastic).
  • Reranking (retrieve, then rerank) improves result quality with minimal performance penalty; Elasticsearch supports rescoring and hybrid ranking.
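
The content-hash deduplication idea can be sketched with an in-memory stand-in for an index. `ToyIndex` and `fake_embed` are hypothetical and do not use the Elasticsearch client; the point is only that a "create"-style operation that refuses existing IDs lets the expensive embedding step be skipped for content that was already ingested.

```python
import hashlib

class ToyIndex:
    """In-memory stand-in for an index with create-only semantics."""

    def __init__(self, embed):
        self._docs = {}
        self._embed = embed  # expensive function: text -> vector

    def create(self, text):
        """Store text under its content hash; skip work if it already exists."""
        doc_id = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if doc_id in self._docs:
            return doc_id, False          # duplicate: no new embedding computed
        self._docs[doc_id] = {"text": text, "vector": self._embed(text)}
        return doc_id, True

embed_calls = []

def fake_embed(text):
    """Placeholder for a real embedding model; records each call."""
    embed_calls.append(text)
    return [float(len(text))]

index = ToyIndex(fake_embed)
first_id, first_created = index.create("These are not the droids")
second_id, second_created = index.create("These are not the droids")
print(first_created, second_created, len(embed_calls))
```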

Advanced Query Language and Usability Improvements 103:03

  • A new, simpler piped query language (ES|QL) is available to reduce verbose JSON, supporting both dense and sparse fields, although language-specific bindings are still improving.
  • The new query approach supports matching on embeddings and can handle joins, but not all search features are implemented yet.

Final Q&A and Closing Notes 104:44

  • Elasticsearch enables combining retrieval, reranking, and hybrid search in one endpoint, reducing application-side complexity.
  • The instance used in the demo is left running for further exploration; attendees are invited to try more queries and visit the Elastic booth for discussion and swag.