SUMM

Philipp Krenn introduces the talk's hands-on approach, focusing on the "retrieval" component of retrieval augmented generation (RAG), not generation.
The audience is invited to follow along using a shared Elastic Cloud instance (elastai.engineer), with guidance on avoiding data collisions.
Elasticsearch, based on Apache Lucene, is identified as the main search technology, featuring keyword, vector, and hybrid search capabilities.

Classical keyword or lexical search involves breaking text into individual tokens (words) using standard tokenization (mostly based on whitespace and punctuation).
Token data includes the term, start and end offset (for highlighting matches), and position (for phrase matching).
Search engines preprocess and store tokens and positions at ingestion time for efficient querying.
Stop words (common, less useful words) and stemming (reducing words to root form) simplify data and improve matching.
Analyzing the phrase "these are not the droids you're looking for" leaves only "droid", "you", and "look" as tokens after stop word removal and stemming.
Mismatched language analyzers can yield poor results, so appropriate language-specific analysis is necessary.
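The analysis steps described above can be sketched in a few lines of Python. This is a toy illustration, not the analyzer Elasticsearch or Lucene actually ships: the stop word list, the contraction handling, and the suffix-stripping `naive_stem` function are all assumptions made up for this example.

```python
import re

# Assumption: a tiny illustrative stop word list, not Lucene's actual English list.
STOP_WORDS = {"these", "are", "not", "the", "for", "a", "an", "and", "to", "of"}

# Words, optionally followed by an apostrophe suffix such as "'re".
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def naive_stem(term):
    """Toy suffix stripper standing in for a real stemmer (hypothetical rules)."""
    for suffix in ("ing", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def analyze(text):
    """Return tokens with term, character offsets, and position."""
    tokens = []
    for position, match in enumerate(TOKEN_RE.finditer(text)):
        # Lowercase and drop any contraction suffix ("you're" -> "you").
        term = match.group().lower().split("'")[0]
        if term in STOP_WORDS:
            continue  # the position counter still advances, leaving a gap
        tokens.append({
            "term": naive_stem(term),
            "start": match.start(),  # offsets point into the original text
            "end": match.end(),
            "position": position,
        })
    return tokens


for token in analyze("These are not the droids you're looking for"):
    print(token)
```

Under these toy rules the phrase reduces to the terms "droid", "you", and "look" at positions 4, 5, and 6. The positions keep the gaps left by removed stop words, which matters later for phrase matching.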

Each language comes with a carefully curated stop word list (e.g., ~33 stop words in English by default), but users may modify or skip stop word removal based on context.
Removing stop words can lead to data loss for searches like "to be or not to be", where all terms could be removed.
Tokenization choices and analyzers (e.g., handling hyphens, email addresses, or different languages) significantly affect search results.

The talk demonstrates creating an index with a custom analysis pipeline (e.g., HTML removal, tokenization, lowercasing, stop word removal, stemming).
Sample Star Wars phrases are stored and analyzed in the index.
Searches match singular/plural and casing differences due to stemming and case normalization.
The "inverted index" data structure stores tokens with document references and positions for efficient matching.
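A minimal sketch of an inverted index, assuming documents have already been analyzed into lists of terms. The sample documents and the `build_inverted_index` and `search` helpers are made up for illustration; real engines use far more compact on-disk structures.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, terms in docs.items():
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, term):
    """Return the sorted IDs of documents containing the term."""
    return sorted(index.get(term, {}))


# Hypothetical, already analyzed sample documents.
docs = {
    1: ["droid", "you", "look"],
    2: ["no", "i", "am", "your", "father"],
    3: ["obi", "wan", "never", "told", "you", "what", "happen", "your", "father"],
}
index = build_inverted_index(docs)

print(search(index, "you"))     # documents 1 and 3 contain "you"
print(index["father"])          # per-document positions of "father"
```

Looking up a term is a single dictionary access, which is why the expensive analysis work is done once at ingestion time rather than per query.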

Keyword search matches only exact tokens; synonyms (like "droid" and "robot") require explicit synonym lists, which can now be LLM-generated.
Ambiguous terms (homonyms) and lack of context limit purely lexical search, although it remains simple and scalable.
Phrase queries leverage token positions for matching, but stop word removal and position shifts can cause mismatches.
"Slop" can be adjusted to permit small gaps in phrase matching.
Dictionary-based enhancements can help but involve licensing and availability challenges.
Compound noun languages (e.g., German) and partial word matches require tools like "n-grams," which increase storage and query cost.
Suggestions include storing multiple representations (with and without stop words or n-grams) and weighing different fields for hybrid queries.
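To make the role of positions and slop concrete, here is a simplified phrase matcher. It is an approximation written for this summary, not Lucene's actual sloppy phrase algorithm: each term's position is shifted by its index in the query, and the phrase matches when the shifted positions differ by at most `slop`.

```python
from itertools import product


def phrase_match(doc_positions, query_terms, slop=0):
    """Check whether query_terms occur as a phrase in one document.

    doc_positions maps term -> list of positions within the document.
    """
    if not query_terms:
        return False
    position_lists = []
    for term in query_terms:
        positions = doc_positions.get(term)
        if not positions:
            return False  # a missing term can never form the phrase
        position_lists.append(positions)
    # Try every combination of one position per query term.
    for combo in product(*position_lists):
        shifted = [pos - i for i, pos in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


# Positions as an analyzer might store them after stop word removal.
doc = {"droid": [4], "you": [5], "look": [6]}

print(phrase_match(doc, ["droid", "you", "look"]))    # adjacent terms, exact phrase
print(phrase_match(doc, ["droid", "look"]))           # one-token gap, no slop allowed
print(phrase_match(doc, ["droid", "look"], slop=1))   # same gap, tolerated by slop
```

The same mechanism explains the stop word problem: if the query side and the document side disagree about which positions were removed, the shifted positions no longer line up unless slop absorbs the difference.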

Fuzzy search allows automatic correction for small misspellings (using Levenshtein distance per token).
Elasticsearch can auto-tune fuzziness based on word length; excessive fuzziness can degrade precision.
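A small sketch of per-token fuzzy matching. The `levenshtein` function is the plain textbook edit distance; Elasticsearch by default also treats a transposition of two adjacent characters as a single edit, which this sketch does not. The length thresholds in `auto_fuzziness` follow the documented `AUTO` behavior (0 edits up to 2 characters, 1 edit for 3 to 5, 2 edits beyond that); the function names themselves are hypothetical.

```python
def levenshtein(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Allowed edits by term length, mirroring the AUTO thresholds."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term, indexed_term):
    """True when the indexed term is within the allowed edit distance."""
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)


print(fuzzy_match("droit", "droid"))      # one substitution on a 5-letter term
print(fuzzy_match("skywalkr", "skywalker"))  # one insertion on a longer term
print(fuzzy_match("at", "it"))            # short terms must match exactly
```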

Documents are ranked by score, with the primary algorithm being BM25 (an improved form of TF-IDF).
- Term Frequency (TF): More occurrences make a term more relevant, but importance flattens at higher frequencies.
- Inverse Document Frequency (IDF): Rare terms are more relevant; common terms less so.
- Field Length Normalization: Matches in shorter fields (like titles) score higher than those in longer fields.
Scores are only comparable within the same query, not between queries or as percentages.
Multi-term searches combine per-term scores and coordination factors to prioritize documents matching more of the query's terms.
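The three ingredients above can be seen in the textbook BM25 formula for a single term. This is a sketch: the parameter values `k1=1.2` and `b=0.75` are the commonly cited defaults, and the exact constants and IDF variant differ between implementations and versions, so the absolute numbers should not be read as what any particular engine returns.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one document."""
    # Inverse document frequency: rare terms get a higher weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field length normalization: longer-than-average fields are penalized.
    norm = 1 - b + b * doc_len / avg_doc_len
    # Term frequency with saturation controlled by k1.
    return idf * tf * (k1 + 1) / (tf + k1 * norm)


def bm25_score(term_stats, doc_len, avg_doc_len, n_docs):
    """Sum per-term scores; term_stats is a list of (tf, doc_freq) pairs."""
    return sum(
        bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq)
        for tf, doc_freq in term_stats
    )


# Saturation: going from 1 to 2 occurrences helps far more than 9 to 10.
gain_low = bm25_term_score(2, 100, 100, 1000, 10) - bm25_term_score(1, 100, 100, 1000, 10)
gain_high = bm25_term_score(10, 100, 100, 1000, 10) - bm25_term_score(9, 100, 100, 1000, 10)
print(gain_low > gain_high)
```

Because the score depends on corpus statistics and on the number of query terms, it has no fixed upper bound, which is why scores cannot be read as percentages or compared across queries.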

Synonyms, context, and meaning are limited in classic keyword search; managing these requires additional tools or ML models.
Embedding models (vector-based approaches) allow representing documents and queries in high-dimensional space, enabling semantic matching.
- Demonstrated with simple and practical representations of Star Wars character traits.
Dense vector models (e.g., OpenAI embeddings) store arrays of floating-point values; similarity is computed by proximity in vector space.
Sparse vector models (such as SPLADE) use numerous tokens with varying weights; more interpretable but potentially slower for large queries.
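As a rough illustration of dense vector matching, the sketch below ranks a few hand-made vectors by cosine similarity. The three dimensions and all values are invented for this example; real embedding models produce hundreds or thousands of dimensions whose meaning is not individually interpretable.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical hand-made "embeddings": [force_sensitive, dark_side, mechanical]
characters = {
    "Luke Skywalker": [0.9, 0.1, 0.1],
    "Darth Vader": [0.9, 0.9, 0.5],
    "R2-D2": [0.0, 0.0, 1.0],
}


def nearest(query_vector, vectors):
    """Return all names, most similar first."""
    return sorted(
        vectors,
        key=lambda name: cosine_similarity(query_vector, vectors[name]),
        reverse=True,
    )


# A query vector that is mostly "mechanical".
print(nearest([0.0, 0.1, 0.9], characters))
```

Note that every stored vector receives some similarity value, so a dense search always returns a ranking even when nothing is a good match.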

Examples show using dense and sparse vector models alongside keyword search in Elasticsearch.
Sparse models expand queries into large token sets; overlap in these tokens between queries and documents yields relevance scores.
Dense models always produce some degree of match, which can lead to less obvious relevance for short queries and make setting relevance thresholds difficult.
Dense embedding-based searches depend strongly on selecting the right model and context window.
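The sparse scoring idea, overlap between expanded token sets, can be sketched as a dot product over shared tokens. The expansions and weights below are invented for illustration; an actual sparse model would produce them, typically with many more tokens per text.

```python
def sparse_score(query_weights, doc_weights):
    """Dot product over the tokens that query and document share."""
    return sum(
        weight * doc_weights[token]
        for token, weight in query_weights.items()
        if token in doc_weights
    )


# Hypothetical expansion of the query "droid" into weighted tokens.
query = {"droid": 2.0, "robot": 1.5, "machine": 0.5}

# Hypothetical expanded documents.
repair_manual = {"robot": 1.0, "repair": 0.8}
jedi_history = {"jedi": 1.2, "force": 1.0}

print(sparse_score(query, repair_manual))  # shares "robot" with the query
print(sparse_score(query, jedi_history))   # shares nothing with the query
```

Unlike the dense case, a document with no overlapping tokens scores zero and drops out of the result set, and the shared tokens show why a document matched.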

Text chunking (splitting large texts into smaller units) is critical for vector-based search to ensure semantic relevance and enable highlighting.
Strategies for chunking vary (by page, paragraph, sentence), sometimes with overlapping context.
Trade-offs in chunk size and context window affect recall and precision.
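One simple chunking strategy is a sliding window over words with some overlap between neighbors. The sketch below assumes whitespace-separated words as the unit and uses arbitrary default sizes; splitting by sentence, paragraph, or model tokens follows the same pattern.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of chunk_size words, each sharing
    `overlap` words with the previous chunk."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
for chunk in chunk_words(sample, chunk_size=4, overlap=1):
    print(chunk)
```

Larger chunks carry more context per vector but make it harder to point at the exact passage that matched; more overlap reduces the chance of cutting a thought in half at the cost of more vectors to store.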

Hybrid search combines keyword, dense, and sparse vector searches, potentially boosting certain fields (like brand names) or using custom ranking signals (e.g., product rating).
Reciprocal Rank Fusion (RRF) blends result positions from different search types, mitigating issues with differently scaled scores.
Elasticsearch supports combining query types in a single request, handling retrieval and reranking centrally.
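Reciprocal Rank Fusion only looks at each document's rank in every result list, which is why differently scaled scores do not matter. A minimal sketch, assuming string document IDs and the commonly used constant `k=60`; the result lists are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from two retrievers.
keyword_results = ["a", "b", "c"]
vector_results = ["c", "a", "d"]

print(reciprocal_rank_fusion([keyword_results, vector_results]))
```

Documents that appear near the top of several lists rise to the top of the fused list, while a document found by only one retriever can still appear further down.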

PostgreSQL vector search (pgvector) lacks some search-specific features (like full BM25). Hybrid search is recommended for most practical use cases.
For ingestion and deduplication, using content hashes as document IDs (with "create" operation) avoids redundant vector generation.
Handling extremely large datasets and updates requires an understanding of index structures and possible performance optimizations (e.g., HNSW merges in Elastic).
Reranking (retrieve, then rerank) allows quality improvements with minimal performance penalty; Elasticsearch supports rescoring and hybrid ranking.
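The deduplication idea can be sketched without a running cluster. `InMemoryIndex` below is a hypothetical stand-in for a search index that computes embeddings at ingestion time; its `create` method mimics a create-only operation that rejects an existing ID before any embedding work happens. The `fake_embed` function is a placeholder, not a real model.

```python
import hashlib


def content_id(text):
    """Derive a stable document ID from the content itself."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


class InMemoryIndex:
    """Hypothetical stand-in for an index that embeds documents on ingest."""

    def __init__(self, embed):
        self.embed = embed
        self.docs = {}

    def create(self, doc_id, text):
        """Store the document only if the ID is new; never overwrite."""
        if doc_id in self.docs:
            return False  # conflict: skip before computing an embedding
        self.docs[doc_id] = {"text": text, "vector": self.embed(text)}
        return True


def ingest(index, texts):
    """Ingest texts and return how many were newly created."""
    return sum(index.create(content_id(text), text) for text in texts)


embed_calls = []


def fake_embed(text):
    """Placeholder embedding that records each call."""
    embed_calls.append(text)
    return [float(len(text))]


index = InMemoryIndex(fake_embed)
created = ingest(index, ["use the force", "I am your father", "use the force"])
print(created, len(embed_calls))
```

Because identical content always hashes to the same ID, re-running the ingestion over the same data creates nothing new and triggers no further embedding calls.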

A new, simpler "pipe query" language is available to reduce verbose JSON, supporting both dense and sparse fields, although language-specific bindings are still improving.
The new query approach supports matching on embeddings and can handle joins, but not all search features are implemented yet.

Elasticsearch enables combining retrieval, reranking, and hybrid search in one endpoint, reducing application-side complexity.
The instance used in the demo is left running for further exploration; attendees are invited to try more queries and visit the Elastic booth for discussion and swag.

Information Retrieval from the Ground Up - Philipp Krenn, Elastic