
Testing

How DataTalks.Club guests test data, ML, and AI systems through data checks, CI/CD, evaluation sets, monitoring, and production readiness practices.

Testing in data and ML work means proving that a change still behaves correctly enough for the downstream decision. That decision may depend on a dashboard or model. It can also depend on a product flow or agent action. In the DataTalks.Club archive, guests don’t treat testing as only unit tests around Python code.

They discuss data-quality checks, pipeline tests, and regression tests; prompt and RAG evaluation; and agent integration tests, monitoring, and incident response.

The testing boundary overlaps with DataOps and CI/CD. It also overlaps with data quality and observability, evaluation, and production. A test can catch a known failure before release. Monitoring and observability catch failures the team doesn’t know how to encode yet.

Common Definition

Across the archive, a test is a check the team can define in advance, so that a known failure is caught before it reaches a user. A test may assert that a dbt column isn’t null. Another test may check that a pipeline produces an expected snapshot or that an agent calls the right tool with the right parameters.

Victoria Perez Mola gives the analytics engineering version in Analytics Engineer Skills and Tools. At 6:59, she explains how dbt brings software development habits into data work. In dbt, SQL files live with YAML documentation, version control, and tests.

At 39:04, she describes dbt tests as queries that return failing rows. A non-null test passes when the query returns nothing. A failure can become a warning or an error before dependent models build.
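The failing-rows idea can be sketched outside dbt. The example below is a minimal illustration, not dbt’s implementation: it uses Python’s built-in sqlite3 module, and the `orders` table, its columns, and the helper names `failing_rows` and `run_check` are all assumptions made for the example.

```python
import sqlite3

def failing_rows(conn, table, column):
    """Return rows where `column` is NULL; an empty result means the check passes."""
    # Identifiers are interpolated only because this is a fixed, illustrative schema.
    query = f"SELECT * FROM {table} WHERE {column} IS NULL"
    return conn.execute(query).fetchall()

def run_check(conn, table, column, severity="error"):
    """Return 'pass' when no rows fail, otherwise the configured severity."""
    if not failing_rows(conn, table, column):
        return "pass"
    return severity

# Hypothetical source table with one missing city.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, city TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "Berlin"), (2, None)])

print(run_check(conn, "orders", "id"))                     # pass
print(run_check(conn, "orders", "city", severity="warn"))  # warn
```

An empty result set is the passing case, which is why the check reads like a search for bad rows rather than an assertion about good ones.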

Bartosz Mikulski gives the production AI version in Production AI Engineering. At 10:44, he says a team needs tests to prove a data pipeline works before it can defend a dashboard number. At 11:53-14:04, he prefers making the pipeline run, observing acceptable outputs, and turning those examples into tests. For data pipelines, that often means integration or snapshot tests rather than only small unit tests.

Guest Differences

Guests differ on what testing can guarantee. Perez Mola uses dbt tests to stop known bad source data from building downstream models. She also says at 40:19 that teams rarely reach a point where every future data-quality problem is covered. Barr Moses makes that limit explicit in Data Observability Explained. At 51:33-53:26, she argues that data teams need both tests and monitoring because tests cover expected failures while observability catches unknown unknowns.

Christopher Bergh starts from operating discipline rather than one testing tool. In Mastering DataOps, he ties testing to the definition of “done”: a pipeline isn’t done merely because a stakeholder saw a dashboard. Around 21:02, he says the system should tell the team when something is wrong while it runs and should let someone make a change quickly. In DataOps for Data Engineering, he connects that idea to CI/CD and regression tests at 30:55 and 42:39. He also includes realistic test data, version control, and automated checks.

LLM and agent guests use the word evaluation more often than testing, but the discipline is similar. At 23:35-24:59 in Practical LLM Engineering and RAG, Hugo Bowne-Anderson argues that teams eventually need representative gold test sets to build reliable software. Ranjitha Kulkarni adds in Building Agentic AI Systems that public benchmarks measure model capability, not the deployed agent system. At 53:20-56:02, she frames agent checks as software tests, with examples that include mocked tools, integration tests, regression tests, and outcome assertions.

Data and Analytics Tests

Data tests usually encode expectations about required fields and allowed ranges. They also cover uniqueness, source quality, and transformation assumptions. In Perez Mola’s dbt discussion, tests protect analytical models from bad inputs. She describes checks for city names and numeric ranges around 36:57. At 39:04, she explains that source tests can block downstream models so teams don’t build reports on wrong data (Analytics Engineer Skills and Tools).
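One way to picture source tests blocking downstream models is a small gate in plain Python. This is a sketch under stated assumptions, not how dbt schedules models: the city list, the amount range, and the names `source_failures` and `build_if_source_is_clean` are hypothetical.

```python
# Hypothetical expectations for a source table.
KNOWN_CITIES = {"Berlin", "Lisbon", "Warsaw"}
MIN_AMOUNT, MAX_AMOUNT = 0, 10_000

def source_failures(rows):
    """Collect (row id, reason) pairs for rows that violate an expectation."""
    failures = []
    for row in rows:
        if row["city"] not in KNOWN_CITIES:
            failures.append((row["id"], "unknown city"))
        if not MIN_AMOUNT <= row["amount"] <= MAX_AMOUNT:
            failures.append((row["id"], "amount out of range"))
    return failures

def build_if_source_is_clean(rows, build_model):
    """Run the downstream build only when the source checks find nothing."""
    failures = source_failures(rows)
    if failures:
        return {"built": False, "failures": failures}
    return {"built": True, "result": build_model(rows)}

rows = [
    {"id": 1, "city": "Berlin", "amount": 120},
    {"id": 2, "city": "Brelin", "amount": -5},  # typo and negative amount
]
outcome = build_if_source_is_clean(rows, build_model=lambda rs: sum(r["amount"] for r in rs))
print(outcome)
```

In this sketch the report total is never computed from the bad rows, which mirrors the point that downstream models shouldn’t build on data that already failed a known check.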

That makes analytics engineering testing different from a one-off analyst query. The test belongs near the model definition, runs with the transformation flow, and tells the team whether a shared business definition can still be trusted. Documentation and peer review matter too. Perez Mola ties good SQL, tests, guidelines, and review practices to the analytics engineer role at 42:27-44:12.

Mikulski adds another data-pipeline check: run representative input through the pipeline and compare the output with an expected snapshot. At 13:32-14:04 in Production AI Engineering, he says unit tests are less useful for whole pipelines than integration-style checks. Those tests can be named after the business rule they protect, which makes failures easier to understand during debugging.
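A snapshot-style check of this kind might look like the sketch below. The pipeline step `total_revenue_by_city`, the input rows, and the refund rule are invented for illustration; only the pattern (representative input, reviewed expected output, test named after the business rule) comes from the discussion above.

```python
def total_revenue_by_city(rows):
    """Hypothetical pipeline step: sum order amounts per city, skipping refunds."""
    totals = {}
    for row in rows:
        if row["status"] == "refunded":
            continue
        totals[row["city"]] = totals.get(row["city"], 0) + row["amount"]
    return totals

# Representative input chosen to exercise the refund rule.
REPRESENTATIVE_INPUT = [
    {"city": "Berlin", "amount": 40, "status": "paid"},
    {"city": "Berlin", "amount": 10, "status": "refunded"},
    {"city": "Lisbon", "amount": 25, "status": "paid"},
]

# Snapshot captured from a run whose output was reviewed and accepted.
EXPECTED_SNAPSHOT = {"Berlin": 40, "Lisbon": 25}

def test_refunded_orders_are_excluded_from_revenue():
    # The test name states the business rule, so a failure points at the rule.
    assert total_revenue_by_city(REPRESENTATIVE_INPUT) == EXPECTED_SNAPSHOT

test_refunded_orders_are_excluded_from_revenue()
```

If someone later changes the refund handling, the failing test name says which business rule broke before anyone has to read the diff.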

CI/CD and Pipeline Regression

Testing becomes more valuable when it runs automatically, and Bergh’s DataOps episodes make automation the core operating point. In DataOps for Data Engineering, the 30:55 discussion ties safe change to regression tests and automated deployment. It also includes monitoring, realistic test data, and infrastructure as code. At 42:39, he warns that Git alone isn’t enough. Teams need end-to-end tests and automated checks before production.

In Mastering DataOps, he lists practical components at 33:47-48:25. Teams can use version control, automated tests, CI/CD, and test environments. They can also use dbt tests, Great Expectations, SQL checks, and other strategies that fit the pipeline. The exact tool matters less than the habit of proving a change with data before relying on it downstream.

This is why testing belongs beside CI/CD and reproducibility. A data or ML release should preserve the relationship between code, data, artifacts, and tests. It should also preserve metadata and deployment behavior. Otherwise a team may know a pipeline passed once but still not know what changed after a failure.

Evaluation for ML, Search, and LLM Systems

Some systems need evaluation sets rather than pass/fail data checks. For ML or search applications, teams ask whether the system performs well enough against representative cases and a meaningful baseline. RAG or LLM applications need the same question with language-specific checks.
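Comparing a system against a baseline on representative cases can be sketched in a few lines. Everything here is an assumption for illustration: the gold set, the routing labels, and the `baseline` and `candidate` functions are hypothetical stand-ins for a real system and its evaluation data.

```python
# Hypothetical gold set: representative queries with the expected routing label.
GOLD_SET = [
    {"query": "reset password", "expected": "account-help"},
    {"query": "change email", "expected": "account-help"},
    {"query": "close my profile", "expected": "account-help"},
    {"query": "download invoice", "expected": "billing"},
    {"query": "track delivery", "expected": "shipping"},
    {"query": "where is my package", "expected": "shipping"},
]

def baseline(query):
    """Baseline: always predict the most common label in the gold set."""
    return "account-help"

def candidate(query):
    """Candidate system under evaluation: a simple keyword router."""
    if "invoice" in query:
        return "billing"
    if "delivery" in query:
        return "shipping"
    return "account-help"

def accuracy(system, gold_set):
    """Share of gold cases where the system returns the expected label."""
    hits = sum(system(case["query"]) == case["expected"] for case in gold_set)
    return hits / len(gold_set)

print(f"baseline:  {accuracy(baseline, GOLD_SET):.2f}")   # 0.50
print(f"candidate: {accuracy(candidate, GOLD_SET):.2f}")  # 0.83
```

The score only means something next to the baseline: here the candidate is judged by how far it improves on the trivial router, not by its raw accuracy alone.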

Bowne-Anderson’s LLM engineering episode gives a practical LLM testing approach. At 23:35, he compares gold test sets to holdout and test sets in machine learning. The practice differs from classical ML because the outputs are natural language and tool calls.

At 24:39-24:59, he warns that teams don’t need an LLM judge for everything. Cheaper options include structured output checks, regular expressions, string matching, cheaper models, and human review (Practical LLM Engineering and RAG, LLM Evaluation Workflows).
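A structured output check with a regular expression needs no model at all. The sketch below assumes a hypothetical response format (JSON with an `intent` and an `order_id`); the field names, allowed intents, and ID pattern are made up for the example.

```python
import json
import re

# Hypothetical contract for a model response.
ALLOWED_INTENTS = {"refund", "status", "cancel"}
ORDER_ID_PATTERN = re.compile(r"ORD-\d{6}")

def check_response(raw_text):
    """Return a list of problems in one model response; an empty list means it passed."""
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    problems = []
    intent = payload.get("intent")
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        problems.append("unexpected intent")
    order_id = payload.get("order_id")
    if not isinstance(order_id, str) or not ORDER_ID_PATTERN.fullmatch(order_id):
        problems.append("order_id does not match pattern")
    return problems

print(check_response('{"intent": "refund", "order_id": "ORD-123456"}'))  # []
print(check_response("Sure! Here is your refund."))                      # ['not valid JSON']
```

Checks like this are deterministic and cheap to run on every response, so a judge model or human review can be reserved for cases the rules cannot decide.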

Failure analysis turns evaluation into engineering work. At 26:43 in the same episode, Bowne-Anderson recommends categorizing errors and ranking the largest failure classes. If most failures come from retrieval, teams should fix retrieval before polishing formatting. This connects testing to RAG, search, and knowledge systems, and to production search evaluation.
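Ranking failure classes is mostly counting. The sketch below assumes failures have already been labeled by hand; the category names and the `rank_failure_classes` helper are hypothetical.

```python
from collections import Counter

# Hypothetical hand-labeled failures from one evaluation run.
labeled_failures = [
    {"case_id": 1, "category": "retrieval"},
    {"case_id": 2, "category": "retrieval"},
    {"case_id": 3, "category": "formatting"},
    {"case_id": 4, "category": "retrieval"},
    {"case_id": 5, "category": "formatting"},
    {"case_id": 6, "category": "tool_call"},
]

def rank_failure_classes(failures):
    """Return (category, count, share of all failures), largest class first."""
    counts = Counter(failure["category"] for failure in failures)
    total = sum(counts.values())
    return [(category, count, count / total) for category, count in counts.most_common()]

for category, count, share in rank_failure_classes(labeled_failures):
    print(f"{category:<12} {count}  {share:.0%}")
```

With this data, retrieval accounts for half of the failures, so it would be the first class to work on.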

Agent and Tool Tests

Agent systems add tool behavior to answer quality. In Kulkarni’s discussion, model benchmarks and system benchmarks are separate. At 51:42 in Building Agentic AI Systems, she says teams need datasets that represent real users. Public benchmarks measure model capability, not the product system.

At 53:20-55:13, she describes teams mocking external tools and asserting on outputs, as in software-style tests. They also check whether the agent tries to call the right system with the right parameters. That matters because a calendar, SRE, or enterprise agent can fail through a bad tool call even when the generated text looks plausible.

Kulkarni also warns against overfitting the test to one reasoning path. At 56:02, she says an LLM can reach the same goal through different acceptable paths. For agent engineering, the test should often assert the outcome and required constraints, not every intermediate step.
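A mocked-tool test that asserts the outcome rather than the path could look like this. The sketch uses `unittest.mock.Mock` from the Python standard library; `run_agent`, the tool names, and the calendar scenario are hypothetical, and the agent is a deterministic stand-in rather than a real model loop.

```python
from unittest.mock import Mock

def run_agent(request, tools):
    """Hypothetical stand-in for an agent loop; a real agent would let a model pick tools."""
    slots = tools["find_free_slots"](date=request["date"])
    chosen = slots[0]
    tools["create_event"](title=request["title"], date=request["date"], start=chosen)
    return {"status": "booked", "start": chosen}

def test_agent_books_meeting_in_a_free_slot():
    free_slots = ["10:00", "14:00"]
    # External systems are mocked so the test never touches a real calendar.
    tools = {
        "find_free_slots": Mock(return_value=free_slots),
        "create_event": Mock(return_value={"id": "evt-1"}),
    }

    result = run_agent({"title": "Sync", "date": "2024-05-01"}, tools)

    # Outcome and constraints: exactly one event, on the requested date, in a free slot.
    tools["create_event"].assert_called_once()
    kwargs = tools["create_event"].call_args.kwargs
    assert kwargs["date"] == "2024-05-01"
    assert kwargs["start"] in free_slots
    assert result["status"] == "booked"
    # Deliberately not asserted: which slot was picked or how often slots were looked up.

test_agent_books_meeting_in_a_free_slot()
```

The assertions pin down the tool, its parameters, and the result, while leaving the agent free to reach that result through any acceptable sequence of steps.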

Monitoring After Tests Pass

Tests don’t remove the need for production monitoring. In Data Observability Explained, Moses frames that gap at 16:38 by naming what observability watches: freshness, volume, distribution, schema, and lineage.

At 21:57, she distinguishes a successful pipeline run from good data. A job can complete while publishing late, incomplete, shifted, or semantically wrong data.
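The gap between a successful run and good data can be illustrated with freshness and volume checks. The thresholds, the tolerance, and the `assess_published_data` helper below are assumptions for the example, not values from the episode.

```python
from datetime import datetime, timedelta, timezone

def assess_published_data(job_succeeded, latest_loaded_at, row_count,
                          now, max_age, expected_rows, tolerance=0.5):
    """Return issues with the published data; a successful job can still have some."""
    issues = []
    if not job_succeeded:
        issues.append("job failed")
    # Freshness: the newest record must be recent enough.
    if now - latest_loaded_at > max_age:
        issues.append("stale data")
    # Volume: the row count must stay within a band around the expected size.
    lower = expected_rows * (1 - tolerance)
    upper = expected_rows * (1 + tolerance)
    if not lower <= row_count <= upper:
        issues.append("unexpected volume")
    return issues

now = datetime(2024, 5, 2, 12, 0, tzinfo=timezone.utc)
issues = assess_published_data(
    job_succeeded=True,  # the scheduler reports success
    latest_loaded_at=datetime(2024, 4, 30, 12, 0, tzinfo=timezone.utc),
    row_count=100,
    now=now,
    max_age=timedelta(hours=24),
    expected_rows=1000,
)
print(issues)  # ['stale data', 'unexpected volume']
```

The job status is green in this example, yet the data is two days old and a tenth of its usual size, which is the situation a run-status check alone would miss.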

Her test-driven data discussion at 51:33-53:26 marks the useful boundary for this page. Tests specify what the team already knows might go wrong. Monitoring and observability help the team notice new failures and diagnose their root cause. That is why data quality and observability sits next to testing instead of after it.

For ML and AI systems, monitoring also checks whether evaluation still holds after launch. A feature distribution can shift, labels can arrive late, a schema can change, or an upstream retrieval index can become stale. Use model monitoring when the alert concerns model behavior, and use DataOps when the root cause sits in the pipeline or release path.

Production Readiness

The archive treats production readiness as the point where tests and evaluation meet monitoring and ownership. Bergh’s “done versus good” discussion in Mastering DataOps gives the operating version. A team should be able to run the system, know when something is wrong, make changes safely, and onboard another person into the work.

Mikulski’s production AI discussion makes the same point through trust. At 9:51-11:31 in Production AI Engineering, he says the phrase “this number doesn’t look correct” is damaging because trust is hard to regain. Tests don’t prove perfection, but they give the team something concrete to rely on during debugging.

For LLM and agent systems, teams add representative test sets and failure analysis. They also add traces, mocked tool tests, outcome assertions, and feedback from real use. That connects testing to LLM production patterns, LLM evaluation workflows, and production. A demo becomes a system only when the team can change it, check it, observe it, and respond when it fails.
