POC to PROD: Hard Lessons from 200+ Enterprise GenAI Deployments - Randall Hunt, Caylent

Introduction & Company Background 00:03

  • Caylent builds custom solutions for clients ranging from startups to Fortune 500 companies, tackling challenges from application development to database migration
  • The team consists of passionate autodidacts with a bias toward rapid prototyping across a diverse range of products
  • Generative AI is viewed as a powerful but not all-encompassing solution; misconceptions often arise about its capabilities

Enterprise GenAI Use Cases & Customer Examples 01:07

  • Developed an agent for Brainbox AI to optimize HVAC systems across tens of thousands of buildings, resulting in significant greenhouse gas reductions
  • Built AI-driven solutions for Simmons (water management), Pipes AI, Virtual Moving Technologies, and Z5 Inventory
  • Demonstrated a multimodal search system for Nature Footage, indexing an extensive stock-video collection and making it searchable via vector embeddings

Technical Deep Dives on Multimodal Search & Video Understanding 02:49

  • For Nature Footage, built pooled multimodal embeddings from sampled video frames combined with text to power advanced search (see the first sketch after this list)
  • For a sports video customer, used audio amplitude spectrography to identify highlights by tracking crowd cheering, combined audio and video embeddings for event detection, and sent notifications on detected plays (a toy cheer detector is sketched after this list)
  • Simple video annotation, such as overlaying the three-point line, drastically improves model performance for event detection
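
The talk doesn't include code, but a minimal sketch of the frame-sampling-plus-pooling idea might look like the following. Assumptions not stated in the talk: Amazon Titan Multimodal Embeddings on Bedrock as the embedding model, OpenCV for frame sampling, and simple mean pooling over frame vectors.

```python
import base64
import json

import boto3
import cv2  # OpenCV, used here for frame sampling
import numpy as np

bedrock = boto3.client("bedrock-runtime")

def embed_frame(jpeg_bytes: bytes, caption: str | None = None) -> np.ndarray:
    """Embed one frame (optionally with text) via Titan Multimodal Embeddings."""
    body = {"inputImage": base64.b64encode(jpeg_bytes).decode()}
    if caption:
        body["inputText"] = caption
    resp = bedrock.invoke_model(
        modelId="amazon.titan-embed-image-v1",  # assumed model choice
        body=json.dumps(body),
    )
    return np.array(json.loads(resp["body"].read())["embedding"])

def embed_video(path: str, caption: str, every_n_frames: int = 30) -> np.ndarray:
    """Sample every Nth frame, embed each, and mean-pool into one vector."""
    cap = cv2.VideoCapture(path)
    vectors, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % every_n_frames == 0:
            ok_enc, jpeg = cv2.imencode(".jpg", frame)
            if ok_enc:
                vectors.append(embed_frame(jpeg.tobytes(), caption))
        i += 1
    cap.release()
    pooled = np.mean(vectors, axis=0)
    return pooled / np.linalg.norm(pooled)  # normalize for cosine search
```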
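
Likewise, a toy version of the crowd-cheer detector. The talk mentions amplitude spectrography; this sketch substitutes a simpler per-window RMS-energy z-score, and assumes the audio track has already been extracted to a WAV file.

```python
import numpy as np
from scipy.io import wavfile  # assumes audio was extracted to WAV

def cheer_highlights(wav_path: str, window_s: float = 1.0, z_thresh: float = 2.5):
    """Flag moments where short-window loudness spikes well above the mean,
    a rough proxy for crowd cheering."""
    rate, samples = wavfile.read(wav_path)
    if samples.ndim > 1:                      # mix stereo down to mono
        samples = samples.mean(axis=1)
    win = int(rate * window_s)
    n = len(samples) // win
    rms = np.sqrt(np.mean(
        samples[: n * win].astype(np.float64).reshape(n, win) ** 2, axis=1))
    z = (rms - rms.mean()) / rms.std()        # standardize loudness per window
    return [i * window_s for i in np.flatnonzero(z > z_thresh)]  # seconds

# e.g. cheer_highlights("game_audio.wav") -> [431.0, 432.0, 1088.0, ...]
```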

Architecture & Model Selection 04:39

  • Utilizes storage and search solutions such as Postgres pgvector and OpenSearch for efficient vector search
  • Prefers pgvector for vector storage (see the sketch after this list), but also leverages Redis and AWS MemoryDB where RAM-fast search is required, mindful of cost and scalability
  • Runs workloads on AWS services like Bedrock and SageMaker, and explores custom silicon (Trainium, Inferentia) for price/performance advantages (around 60% better than Nvidia GPUs in specific scenarios)
  • Model use includes proprietary (Claude, Nova), open-source (Llama, DeepSeek), and embeddings tuned for business needs
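
A minimal pgvector setup along these lines (table, HNSW index, and cosine-distance query); the schema, vector dimension, and connection string are illustrative, not from the talk.

```python
import psycopg  # psycopg 3; assumes the pgvector extension is available

DDL = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    """CREATE TABLE IF NOT EXISTS clips (
           id        bigserial PRIMARY KEY,
           title     text,
           embedding vector(1024))""",  # dimension must match your model
    """CREATE INDEX IF NOT EXISTS clips_embedding_idx
           ON clips USING hnsw (embedding vector_cosine_ops)""",
]

def search(conn, query_vec, limit=10):
    # <=> is pgvector's cosine-distance operator; smaller means more similar
    return conn.execute(
        "SELECT id, title FROM clips ORDER BY embedding <=> %s::vector LIMIT %s",
        (str(query_vec), limit),
    ).fetchall()

with psycopg.connect("dbname=demo") as conn:  # hypothetical DSN
    for stmt in DDL:
        conn.execute(stmt)
    rows = search(conn, [0.01] * 1024)
```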

Building Robust GenAI Systems & Business Moats 06:07

  • Enterprises often require custom fine-tuning or applications layered on top of self-service tools
  • Key differentiator is leveraging context; richer user context leads to smarter, more relevant LLM-powered applications (a context-injection sketch follows this list)
  • Tracking and administering third-party tool usage is a recurring challenge, often requiring network-level monitoring
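
A sketch of that context-injection idea using the Bedrock Converse API; the user-profile fields and model ID are assumptions for illustration.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

def answer(question: str, user: dict) -> str:
    """Prepend user-specific context so the model can tailor its response.
    The profile fields here are illustrative, not from the talk."""
    context = (
        f"User role: {user['role']}. Team: {user['team']}. "
        f"Recent activity: {', '.join(user['recent_docs'])}."
    )
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model
        system=[{"text": f"Answer using this user context: {context}"}],
        messages=[{"role": "user", "content": [{"text": question}]}],
    )
    return resp["output"]["message"]["content"][0]["text"]
```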

Lessons Learned from 200+ Deployments 12:02

  • Successful systems require more than embeddings and evals; understanding real user access patterns is crucial
  • Faceted search and filters built atop embeddings yield more usable results (a faceted-search sketch follows this list); speed is critical, since slow models risk user abandonment
  • Good UX can compensate for some inference latency through strategic design (e.g., loading spinners)
  • With modern models, prompt engineering is often more effective than fine-tuning, and prompts need fewer ongoing fixes as models improve
  • Automations like prompt/context management, eval layers, and cost tracking are vital to long-term maintainability and scalability
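
Building on the pgvector table sketched earlier, faceted search might combine structured filters with vector ranking in a single query, so users narrow by metadata before semantic similarity takes over. Column names here are hypothetical.

```python
def faceted_search(conn, query_vec, species=None, min_duration=None, limit=10):
    """Apply facet filters (WHERE) before vector ranking (ORDER BY),
    so structured narrowing and semantic similarity compose in one query."""
    where, params = [], []
    if species:
        where.append("species = %s")
        params.append(species)
    if min_duration:
        where.append("duration_s >= %s")
        params.append(min_duration)
    sql = "SELECT id, title FROM clips"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " ORDER BY embedding <=> %s::vector LIMIT %s"
    params += [str(query_vec), limit]
    return conn.execute(sql, params).fetchall()
```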

Model Version, Prompt Engineering, and Cost Management 13:45

  • Well-engineered prompts keep paying off as models improve (e.g., moving from Claude 3.5 to Claude 4 brought marked improvements without prompt regressions)
  • The economics of inference must be considered; high-end models can be costly (“Is this inference going to bankrupt my company?”)
  • Effective caching and prompt design can optimize both cost and reliability (a prompt-caching sketch follows this list)
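
A sketch of prompt caching via the Bedrock Converse API's cachePoint block, which lets repeat calls reuse a large static prefix at reduced cost. The model ID and context file are assumptions, and caching requires a model that supports it.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

LONG_STATIC_CONTEXT = open("product_docs.txt").read()  # reused across calls

def ask(question: str) -> str:
    """Mark the large, stable prefix as cacheable so repeat calls only pay
    full price for the new question."""
    resp = bedrock.converse(
        modelId="anthropic.claude-3-5-sonnet-20241022-v2:0",  # assumed model
        system=[
            {"text": LONG_STATIC_CONTEXT},
            {"cachePoint": {"type": "default"}},  # cache everything above
        ],
        messages=[{"role": "user", "content": [{"text": question}]}],
        inferenceConfig={"maxTokens": 512},  # cap output tokens for cost
    )
    return resp["output"]["message"]["content"][0]["text"]
```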

Evaluation, UX, and Personalization in Production 15:00

  • Evaluation suites start with “vibe checks” and evolve into binary or scored metrics for continuous improvement (a minimal harness is sketched after this list)
  • UX orchestration, prompt versioning, and generative UI enable dynamic, user-personalized responses (e.g., just-in-time React components for dashboards)
  • Production features include adaptive UI per user, efficient document delivery for bandwidth-limited users, and channel selection (chat vs. voice) informed by actual user workflows
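
A minimal binary eval harness of the kind described: hand-written pass/fail checks that turn vibe checks into a regression gate run on every prompt or model change. The cases and the `generate` callable are placeholders.

```python
# Hypothetical eval cases: each pairs a prompt with a binary check.
EVAL_CASES = [
    {"prompt": "Summarize: ...", "check": lambda out: len(out) < 500},
    {"prompt": "What is our refund window?", "check": lambda out: "30 days" in out},
]

def run_evals(generate) -> float:
    """Score each case pass/fail and report the overall pass rate."""
    passed = sum(
        1 for case in EVAL_CASES if case["check"](generate(case["prompt"]))
    )
    return passed / len(EVAL_CASES)

# e.g. print(f"pass rate: {run_evals(my_model):.0%}")
```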

Final Recommendations and Contact 18:44

  • Delegate computation tasks appropriately; don’t use LLMs for math operations when native code suffices (see the tool-use sketch at the end of these notes)
  • Manage output tokens carefully to control inference costs
  • Take advantage of batch inference and prompt caching to further reduce costs
  • Continually refine context input for efficient, accurate LLM responses: strip irrelevant info and add user-specific context for optimal results
  • Speaker invites discussion of new use cases and collaboration opportunities
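
A sketch of delegating arithmetic to native code via Bedrock Converse tool use, with maxTokens capped to bound output cost; the tool definition and model ID are illustrative, not from the talk.

```python
import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical calculator tool: the model plans, native code does the math.
TOOLS = {"tools": [{"toolSpec": {
    "name": "add",
    "description": "Add two numbers exactly.",
    "inputSchema": {"json": {
        "type": "object",
        "properties": {"a": {"type": "number"}, "b": {"type": "number"}},
        "required": ["a", "b"],
    }},
}]}}

resp = bedrock.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",  # assumed model
    messages=[{"role": "user", "content": [{"text": "What is 81927 + 4411?"}]}],
    toolConfig=TOOLS,
    inferenceConfig={"maxTokens": 256},  # cap output tokens to bound cost
)

if resp["stopReason"] == "tool_use":
    for block in resp["output"]["message"]["content"]:
        if "toolUse" in block:
            args = block["toolUse"]["input"]
            print(args["a"] + args["b"])  # exact arithmetic in native code
```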