Classical keyword (lexical) search breaks text into individual tokens (words) using standard tokenization, mostly splitting on whitespace and punctuation.
Token data includes the term, start and end offsets (for highlighting matches), and position (for phrase matching).
Search engines preprocess and store tokens and positions at ingestion time for efficient querying.
Removing stop words (common, low-value words) and stemming (reducing words to their root form) simplify the data and improve matching.
Running keyword analysis on the phrase "these are not the droids you're looking for" leaves only "droid", "you", and "look" as tokens after processing.
Mismatched language analyzers can yield poor results, so appropriate language-specific analysis is necessary.
Each language comes with a carefully curated stop word list (e.g., ~33 stop words in English by default), but users may modify or skip stop word removal based on context.
Removing stop words can lead to data loss for searches like "to be or not to be", where all terms could be removed.
Tokenization choices and analyzers (e.g., handling hyphens, email addresses, or different languages) significantly affect search results.
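The whole analysis chain described above can be sketched in a few lines. This is a toy version only: the stop word set is a small illustrative subset (not Lucene's actual 33-word English list), and the suffix stripping stands in for a real stemmer such as Porter.

```python
import re

# Illustrative subset of an English stop word list (real lists are curated per language).
STOP_WORDS = {"these", "are", "not", "the", "for", "to", "be", "or", "a", "an"}

def naive_stem(token):
    # Crude suffix stripping standing in for a real stemmer (e.g. Porter).
    for suffix in ("'re", "ing", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) > 2:
            return token[: -len(suffix)]
    return token

def analyze(text):
    # Lowercase, tokenize on non-letter characters, drop stop words, then stem.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]

print(analyze("these are not the droids you're looking for"))
# → ['droid', 'you', 'look']
print(analyze("to be or not to be"))
# → [] — every term was a stop word, illustrating the data-loss risk above
```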
Creating and Querying Indexes in Elasticsearch 16:20
Practical demonstration of creating an index with a custom analysis pipeline (e.g., HTML stripping, tokenization, lowercasing, stop word removal, stemming).
Sample Star Wars phrases are stored and analyzed in the index.
Searches match singular/plural and casing differences due to stemming and case normalization.
The "inverted index" data structure stores tokens with document references and positions for efficient matching.
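A minimal sketch of that structure: each token maps to the documents containing it and the positions where it occurs (document references enable fast lookup without scanning text; positions enable phrase matching). Document IDs and contents here are made up.

```python
from collections import defaultdict

def build_inverted_index(docs):
    # token -> {doc_id: [positions]}
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            index[token][doc_id].append(pos)
    return index

docs = {1: "the droid you seek", 2: "no droid here"}
idx = build_inverted_index(docs)
# "droid" occurs at position 1 in both documents:
print({doc: positions for doc, positions in idx["droid"].items()})  # → {1: [1], 2: [1]}
```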
Examples show using dense and sparse vector models alongside keyword search in Elasticsearch.
Sparse models expand queries into large token sets; overlap in these tokens between queries and documents yields relevance scores.
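The overlap scoring can be sketched as a weighted intersection of two expansions. The token weights below are invented for illustration, not output from any real sparse model.

```python
def sparse_score(query_expansion, doc_expansion):
    # Score = sum over shared tokens of (query weight x document weight).
    # Tokens appearing in only one expansion contribute nothing.
    return sum(w * doc_expansion[t]
               for t, w in query_expansion.items() if t in doc_expansion)

# Hypothetical learned expansions (token -> weight); values are illustrative only.
query = {"droid": 2.1, "robot": 1.4, "android": 0.9}
doc = {"droid": 1.8, "machine": 0.7, "android": 1.1}
print(round(sparse_score(query, doc), 2))  # "droid" and "android" overlap
```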
Dense models return some similarity score for every document, which can make relevance less obvious for short queries and relevance thresholds difficult to set.
Dense embedding-based searches depend strongly on selecting the right model and context window.
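The thresholding difficulty follows from how dense similarity works: cosine similarity between embeddings is rarely zero, so even unrelated documents score above nothing. The toy vectors below are invented to illustrate this, not real model embeddings.

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

related = cosine([0.9, 0.1, 0.2], [0.8, 0.2, 0.3])
unrelated = cosine([0.9, 0.1, 0.2], [0.1, 0.9, 0.4])
# Both scores are nonzero, so there is no natural zero point to cut off at;
# only the relative ordering is clearly meaningful.
print(related > unrelated)  # → True
print(unrelated > 0)        # → True
```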
Hybrid search combines keyword, dense, and sparse vector searches, potentially boosting certain fields (like brand names) or using custom ranking signals (e.g., product rating).
Reciprocal Rank Fusion (RRF) blends result positions from different search types, mitigating issues with differently scaled scores.
Elasticsearch supports combining query types in a single request, handling retrieval and reranking centrally.
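RRF can be sketched in a few lines: each retriever contributes 1 / (k + rank) per document, so only rank positions matter and differently scaled scores never need to be normalised against each other. The constant k = 60 is the conventional default from the original RRF paper; the doc IDs below are made up.

```python
def rrf(rankings, k=60):
    # rankings: one ranked list of doc IDs per retriever (keyword, dense, sparse, ...).
    scores = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    # Final order: highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
# "b" wins (ranks 2 and 1); "a" is next (ranks 1 and 3):
print(rrf([keyword_hits, dense_hits]))  # → ['b', 'a', 'd', 'c']
```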
Managing Indices, Performance, and Advanced Features 86:32
PostgreSQL's vector search extension (pgvector) lacks some search-specific features (such as full BM25 scoring); hybrid search is recommended for most practical use cases.
For ingestion and deduplication, using content hashes as document IDs (with the "create" operation) avoids redundant vector generation.
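The idea in miniature, with a plain dict standing in for the index: identical content always hashes to the same ID, and a create-only write refuses duplicates, so unchanged documents never trigger re-embedding. The in-memory store and function names are illustrative, not Elasticsearch API calls.

```python
import hashlib

def content_id(text):
    # Identical content always maps to the same document ID.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

index = {}

def create(text):
    # Create-only semantics: do nothing if the ID already exists,
    # so unchanged documents skip the (expensive) embedding step.
    doc_id = content_id(text)
    if doc_id in index:
        return False            # duplicate: skip embedding and indexing
    index[doc_id] = text        # a real pipeline would embed + store here
    return True

print(create("These are not the droids."))  # → True (new document, indexed)
print(create("These are not the droids."))  # → False (deduplicated)
```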
Handling extremely large datasets and updates requires an understanding of index structures and possible performance optimizations (e.g., HNSW merges in Elastic).
Rescoring queries (retrieve, then rerank) allows quality improvements with minimal performance penalty; Elasticsearch supports rescoring and hybrid ranking.
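The retrieve-then-rerank pattern in outline: a cheap scorer ranks everything and keeps a small window, and an expensive scorer reorders only that window. The toy scorers and documents below are invented for illustration.

```python
def retrieve_then_rerank(query, docs, cheap, expensive, window=100, k=10):
    # First pass: rank all docs with the fast scorer, keep only a window.
    shortlist = sorted(docs, key=lambda d: cheap(query, d), reverse=True)[:window]
    # Second pass: apply the expensive scorer only within that window,
    # so its cost is bounded regardless of corpus size.
    return sorted(shortlist, key=lambda d: expensive(query, d), reverse=True)[:k]

docs = ["apple pie recipe", "apple tart", "banana bread"]
cheap = lambda q, d: sum(term in d.split() for term in q.split())   # term overlap
expensive = lambda q, d: cheap(q, d) / len(d.split())               # toy length-normalised rescore
print(retrieve_then_rerank("apple pie", docs, cheap, expensive, window=2, k=1))
# → ['apple pie recipe']
```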
Advanced Query Language and Usability Improvements 103:03
A new, simpler piped query language (ES|QL) is available to reduce verbose JSON; it supports both dense and sparse fields, although language-specific bindings are still maturing.
The new query approach supports matching on embeddings and can handle joins, but not all search features are implemented yet.
Final Q&A and Closing Notes 104:44
Elasticsearch enables combining retrieval, reranking, and hybrid search in one endpoint, reducing application-side complexity.
The instance used in the demo is left running for further exploration; attendees are invited to try more queries and visit the Elastic booth for discussion and swag.