
Krakow, Poland, 17 - 19 June 2026
Sho Tanaka is a Lead Developer Advocate at Snowflake, focused on AI/ML and data engineering. He previously worked at Google (gTech), delivering ML/data solutions across Japan, APAC, and globally. He is a Google Developer Expert (AI/ML) and co-founder of the MLOps community in Japan, and he enjoys turning messy real-world ML projects into reproducible, production-minded architectures.
Building AI agents is accessible, but ensuring their reliability in production is a major engineering challenge. Unlike deterministic software, agents are probabilistic: a binary "Pass/Fail" test is often insufficient to capture the nuances of an agent's reasoning process.
In this talk, we explore "Evaluation-Driven Development"—a paradigm shift for Python engineers building AI systems. We will focus on measuring the quality of agent trajectories using Python tools and visualizations.
The session covers:
- From Testing to Evaluation: Why we need to move beyond standard assertions to probabilistic scoring (0.0 to 1.0) for Generative AI.
- Metrics as Code: Implementing specific evaluation metrics in Python:
  - Faithfulness: Scoring whether the answer is grounded in the retrieved context to detect hallucinations.
  - Tool Selection Accuracy: Evaluating if the agent chose the correct tool (e.g., search vs. calculation) for the user's intent.
  - Answer Relevancy: Using embedding similarity to measure if the response actually answers the prompt.
- Visualizing the Black Box: A live demo using Streamlit. We will showcase a custom dashboard that runs these evaluations, allowing developers to visualize the "reasoning trace" and identify exactly where the agent failed (Retrieval layer vs. Generation layer).
- The Feedback Loop: How to use these evaluation scores to iteratively improve prompts and context retrieval logic.
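To make the "Metrics as Code" idea concrete, here is a minimal sketch of an Answer Relevancy score in [0.0, 1.0] computed as cosine similarity between prompt and answer vectors. The `embed` function is a toy bag-of-words stand-in purely to keep the sketch runnable; in practice you would replace it with a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Swap in a real
    # embedding model for meaningful semantic similarity.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevancy(prompt: str, answer: str) -> float:
    # Probabilistic score (0.0 to 1.0) instead of a binary pass/fail.
    return cosine(embed(prompt), embed(answer))


relevant = answer_relevancy("what is the capital of France",
                            "the capital of France is Paris")
off_topic = answer_relevancy("what is the capital of France",
                             "bananas are yellow")
```

The point is the shape of the API, not the scoring backend: each metric is an ordinary Python function returning a float, so it can be logged, thresholded, and charted in a dashboard.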
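Tool Selection Accuracy can be sketched the same way: compare the tool the agent actually invoked at each step of a trajectory against the tool a reference trajectory calls for. The `Step` structure and tool names below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of an agent trajectory (illustrative schema)."""
    user_intent: str
    expected_tool: str  # tool the reference trajectory calls for
    chosen_tool: str    # tool the agent actually invoked


def tool_selection_accuracy(trajectory: list[Step]) -> float:
    # Fraction of steps where the agent picked the expected tool.
    if not trajectory:
        return 0.0
    correct = sum(s.chosen_tool == s.expected_tool for s in trajectory)
    return correct / len(trajectory)


trajectory = [
    Step("what is 17 * 23", "calculator", "calculator"),
    Step("latest Snowflake release notes", "search", "search"),
    Step("convert 5 miles to km", "calculator", "search"),  # wrong tool
]
accuracy = tool_selection_accuracy(trajectory)  # 2 of 3 steps correct
```

A per-step comparison like this is also what lets a dashboard point at exactly which step of the reasoning trace went wrong, rather than only reporting an end-to-end failure.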
Venue address
ICE Krakow, ul. Marii Konopnickiej 17
Phone
+48 691 793 877
info@devoxx.pl
