
Krakow, Poland, 17 - 19 June 2026
Sho Tanaka is a Lead Developer Advocate at Snowflake, focused on AI/ML and data engineering. He previously worked at Google (gTech), delivering ML/data solutions across Japan, APAC, and globally. He is a Google Developer Expert (AI/ML) and co-founder of the MLOps community in Japan, and he enjoys turning messy real-world ML projects into reproducible, production-minded architectures.
Building AI agents is accessible, but ensuring their reliability in production is a major engineering challenge. Unlike deterministic software, agents are probabilistic: a binary "Pass/Fail" test is often insufficient to capture the nuances of an agent's reasoning process.
In this talk, we explore "Evaluation-Driven Development"—a paradigm shift for Python engineers building AI systems. We will focus on measuring the quality of agent trajectories using Python tools and visualizations.
The session covers:
- From Testing to Evaluation: Why we need to move beyond standard assertions to probabilistic scoring (0.0 to 1.0) for Generative AI.
- Metrics as Code: Implementing specific evaluation metrics in Python:
  - Faithfulness: Scoring whether the answer is grounded in the retrieved context to detect hallucinations.
  - Tool Selection Accuracy: Evaluating if the agent chose the correct tool (e.g., search vs. calculation) for the user's intent.
  - Answer Relevancy: Using embedding similarity to measure if the response actually answers the prompt.
- Visualizing the Black Box: A live demo using Streamlit. We will showcase a custom dashboard that runs these evaluations, allowing developers to visualize the "reasoning trace" and identify exactly where the agent failed (Retrieval layer vs. Generation layer).
- The Feedback Loop: How to use these evaluation scores to iteratively improve prompts and context retrieval logic.
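To make the "Metrics as Code" idea concrete, here is a minimal sketch of an Answer Relevancy score in [0.0, 1.0] computed as cosine similarity between prompt and answer vectors. The `embed` function is a toy bag-of-words stand-in purely to keep the sketch runnable; in practice you would replace it with a real embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words count vector. Swap in a real
    # embedding model for meaningful semantic similarity.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def answer_relevancy(prompt: str, answer: str) -> float:
    # Probabilistic score (0.0 to 1.0) instead of a binary pass/fail.
    return cosine(embed(prompt), embed(answer))


relevant = answer_relevancy("what is the capital of France",
                            "the capital of France is Paris")
off_topic = answer_relevancy("what is the capital of France",
                             "bananas are yellow")
```

The point is the shape of the API, not the scoring backend: each metric is an ordinary Python function returning a float, so it can be logged, thresholded, and charted in a dashboard.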
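Tool Selection Accuracy can be sketched the same way: compare the tool the agent actually invoked at each step of a trajectory against the tool a reference trajectory calls for. The `Step` structure and tool names below are illustrative assumptions, not a fixed schema.

```python
from dataclasses import dataclass


@dataclass
class Step:
    """One step of an agent trajectory (illustrative schema)."""
    user_intent: str
    expected_tool: str  # tool the reference trajectory calls for
    chosen_tool: str    # tool the agent actually invoked


def tool_selection_accuracy(trajectory: list[Step]) -> float:
    # Fraction of steps where the agent picked the expected tool.
    if not trajectory:
        return 0.0
    correct = sum(s.chosen_tool == s.expected_tool for s in trajectory)
    return correct / len(trajectory)


trajectory = [
    Step("what is 17 * 23", "calculator", "calculator"),
    Step("latest Snowflake release notes", "search", "search"),
    Step("convert 5 miles to km", "calculator", "search"),  # wrong tool
]
accuracy = tool_selection_accuracy(trajectory)  # 2 of 3 steps correct
```

A per-step comparison like this is also what lets a dashboard point at exactly which step of the reasoning trace went wrong, rather than only reporting an end-to-end failure.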
Venue address
ICE Krakow, ul. Marii Konopnickiej 17
Phone
+48 691 793 877
info@devoxx.pl
