20 Testing data science code
20.1 The assert that doesn’t work
You know how to write tests. You’ve written thousands of them — unit tests that check return values, integration tests that verify API contracts, end-to-end tests that simulate user workflows. Your CI pipeline runs them on every push. A red build means something is broken. A green build means your code does what you said it would.
Now try writing a test for a machine learning model:
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(0, 1, (500, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.3, 500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.4f}")

# This is a test... but what should the expected value be?
assert accuracy > 0.80, f"Model accuracy {accuracy:.4f} below threshold"
Accuracy: 0.9200
The assertion passes, but something about it feels wrong. In software testing, expected values come from specifications — the function should return 35.96 because the invoice maths says so. Here, the threshold of 0.80 came from… where? You picked it because it seemed reasonable. If the accuracy comes back at 0.79, is the model broken? Is the data different? Is the random seed behaving differently on this machine? You’re no longer testing correctness; you’re testing “good enough”, and the boundary between those two is fuzzy.
This is the fundamental tension in testing data science code. Your engineering instincts are exactly right — untested code is untrustworthy code — but the testing patterns you know were designed for deterministic systems. Data science code produces approximate, probabilistic, data-dependent outputs. You need a different testing vocabulary, built on the same disciplined foundations.
Engineering Bridge
In web development, you test against a specification: the API contract says GET /users/42 returns a JSON object with an id field. In data science, there is often no specification — the model’s behaviour is the specification, discovered empirically from data. This is closer to testing a search engine’s relevance ranking than testing a CRUD endpoint. You can’t assert exact outputs, but you can assert properties: results should be ordered by relevance, the top result should be relevant, and adding a typo shouldn’t completely destroy the rankings. The shift is from exact-output testing to behavioural testing — asserting what kind of answer you expect rather than which exact answer. (This is distinct from property-based testing in the Hypothesis/QuickCheck sense, where a framework generates random inputs to find counterexamples — a complementary technique worth exploring separately.)
20.2 Testing the deterministic parts first
Before wrestling with probabilistic outputs, recognise that most data science code is deterministic. Data cleaning functions, feature engineering transforms, and preprocessing pipelines take a DataFrame in and produce a DataFrame out. These are ordinary functions that deserve ordinary tests.
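def compute_recency_features(
    df: pd.DataFrame,
    reference_date: pd.Timestamp,
) -> pd.DataFrame:
    """Compute days-since features relative to a reference date."""
    result = df.copy()
    result["days_since_last_order"] = (
        (reference_date - df["last_order_date"]).dt.days
    )
    result["days_since_signup"] = (
        (reference_date - df["signup_date"]).dt.days
    )
    result["order_recency_ratio"] = (
        result["days_since_last_order"]
        / result["days_since_signup"]
    )
    return result


# --- Tests ---
def test_recency_features():
    ref_date = pd.Timestamp("2025-01-15")
    test_data = pd.DataFrame({
        "last_order_date": [pd.Timestamp("2025-01-10"),
                            pd.Timestamp("2024-12-15")],
        "signup_date": [pd.Timestamp("2024-01-15"),
                        pd.Timestamp("2023-01-15")],
    })

    result = compute_recency_features(test_data, ref_date)

    # Exact assertions — these are deterministic computations
    assert result["days_since_last_order"].tolist() == [5, 31]
    assert result["days_since_signup"].tolist() == [366, 731]
    np.testing.assert_allclose(
        result["order_recency_ratio"].values,
        [5 / 366, 31 / 731],
    )

    # Structural assertions — output shape and columns
    assert list(result.columns) == [
        "last_order_date", "signup_date",
        "days_since_last_order", "days_since_signup",
        "order_recency_ratio",
    ]


test_recency_features()
print("All deterministic tests passed.")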
These tests use exact equality because the computations are exact. Date arithmetic, string manipulation, joins, filters, aggregations — all deterministic, all testable in the same way you’d test any utility function. The advice here is simple: extract your data transformations into pure functions and test them like any other code. The fact that they happen to operate on DataFrames rather than strings or integers doesn’t change the testing discipline.
Treat pytest as your primary testing framework. The standard project layout works for data science exactly as it works for application code — a tests/ directory, one test module per source module, fixtures for shared test data. In a real project, these tests would live in separate files:
# tests/test_features.py (not executed — illustrative)
import pytest
import pandas as pd


@pytest.fixture
def sample_orders():
    return pd.DataFrame({
        "last_order_date": pd.to_datetime(
            ["2025-01-10", "2024-12-15"]
        ),
        "signup_date": pd.to_datetime(
            ["2024-01-15", "2023-01-15"]
        ),
    })


def test_recency_days_are_positive(sample_orders):
    result = compute_recency_features(
        sample_orders, pd.Timestamp("2025-01-15")
    )
    assert (result["days_since_last_order"] >= 0).all()
    assert (result["days_since_signup"] >= 0).all()


def test_recency_ratio_bounded(sample_orders):
    result = compute_recency_features(
        sample_orders, pd.Timestamp("2025-01-15")
    )
    # Ratio should be between 0 and 1 for customers
    # who ordered after signup
    assert (result["order_recency_ratio"] >= 0).all()
    assert (result["order_recency_ratio"] <= 1).all()
Notice that the second test checks a property (the ratio is between 0 and 1) rather than an exact value. This is a preview of the pattern we’ll use increasingly for statistical code.
20.3 Approximate equality and tolerances
When code involves floating-point arithmetic — and nearly all data science code does — exact equality breaks down. Two mathematically equivalent computations can produce slightly different floating-point results. This is the same issue from Section 16.5, now applied to testing.
NumPy provides np.testing.assert_allclose for this purpose. It checks that values are equal within a relative tolerance (rtol) and an absolute tolerance (atol):
# These are mathematically equal but differ at machine precision
a = 0.1 + 0.2
b = 0.3
print(f"0.1 + 0.2 = {a!r}")
print(f"0.3 = {b!r}")
print(f"Equal? {a == b}")

# assert_allclose handles this gracefully
np.testing.assert_allclose(a, b, rtol=1e-10)
print("assert_allclose passed (rtol=1e-10)")
The choice of tolerance matters. Too tight and your tests become flaky — failing on different machines or library versions because of insignificant numerical differences. Too loose and your tests miss genuine bugs. Figure 20.1 shows the practical ranges for different kinds of computation.
import matplotlib.pyplot as plt
import numpy as np

fig, ax = plt.subplots(figsize=(10, 3))
categories = [
    "Floating-point\narithmetic",
    "Iterative\nalgorithms",
    "Cross-platform\nmodels",
]
starts = [1e-12, 1e-6, 1e-4]
ends = [1e-8, 1e-4, 1e-2]
colours = ["#2563eb", "#0d9488", "#e85d4a"]

for i, (cat, s, e, c) in enumerate(zip(categories, starts, ends, colours)):
    ax.barh(
        i,
        np.log10(e) - np.log10(s),
        left=np.log10(s),
        height=0.5,
        color=c,
        alpha=0.8,
    )

ax.set_yticks(range(len(categories)))
ax.set_yticklabels(categories, fontsize=10)
ax.set_xlabel("Tolerance (rtol)", fontsize=10)
ax.set_xticks(range(-12, 0, 2))
ax.set_xticklabels(
    [f"1e{x}" for x in range(-12, 0, 2)], fontsize=9
)
ax.axvline(
    x=-6,
    color="#64748b",
    linestyle="--",
    linewidth=1.5,
    label="Practical default (1e-6)",
)
ax.legend(fontsize=9, frameon=False)
ax.spines[["top", "right"]].set_visible(False)
ax.invert_yaxis()
plt.tight_layout()
plt.show()
Figure 20.1: Tolerance guidelines for different types of numerical computation. Tighter tolerances catch more bugs but risk flaky tests; looser tolerances are more robust across platforms.
For computations where the expected value may be close to zero, rtol alone is insufficient — the relative comparison divides by a near-zero denominator, making even tiny absolute differences look large. In those cases, set both rtol and atol (e.g., rtol=1e-10, atol=1e-12).
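A tiny illustration of the failure mode (not from the pipeline above): comparing a result that should be exactly zero against a value carrying a little rounding error.

# The desired value is exactly zero, so rtol * |desired| is zero and
# any nonzero error fails the relative check
computed = 1e-14   # e.g. accumulated rounding error
expected = 0.0

try:
    np.testing.assert_allclose(computed, expected, rtol=1e-6)
except AssertionError:
    print("rtol-only comparison failed near zero, as expected")

# atol provides an absolute floor that makes the comparison meaningful
np.testing.assert_allclose(computed, expected, rtol=1e-6, atol=1e-12)
print("rtol + atol comparison passed")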
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X_tol, y_tol = make_classification(
    n_samples=200, n_features=5, random_state=42
)

# Train the same model twice — numerically identical to near-machine
# precision on the same platform, because the same deterministic solver
# runs on the same data (not because of the convergence tolerance)
model_a = LogisticRegression(
    random_state=42, max_iter=200
).fit(X_tol, y_tol)
model_b = LogisticRegression(
    random_state=42, max_iter=200
).fit(X_tol, y_tol)

np.testing.assert_allclose(
    model_a.predict_proba(X_tol),
    model_b.predict_proba(X_tol),
    rtol=1e-10,
)
print("Same-machine reproducibility: passed (rtol=1e-10)")

# For cross-platform comparisons, you'd use a looser tolerance
# and store expected values in a fixture file rather than recomputing
Same-machine reproducibility: passed (rtol=1e-10)
Author’s Note
I once spent an afternoon debugging a “flaky” test that failed intermittently in CI. The test compared model coefficients to six decimal places. The root cause: our CI runners used different CPU architectures (AMD vs Intel), which use different BLAS implementations, which produced legitimately different optimisation paths. The fix was trivial — loosen the tolerance from rtol=1e-8 to rtol=1e-5 — but it took me embarrassingly long to diagnose because I was looking for a code bug rather than a numerical one. Now I default to rtol=1e-6 for anything involving optimisation and only tighten it when I have a reason to.
20.4 Behavioural tests for models
You can’t assert that a model returns a specific prediction — the exact output depends on the training data, the random seed, and the library version. But you can assert how the model should behave. These behavioural tests check properties of the model’s outputs rather than their exact values.
There are four patterns that cover most cases. Each is worth understanding individually before we combine them in the worked example.
20.4.1 Smoke tests
Smoke tests check that the model runs without crashing and produces outputs of the expected shape and type. These catch packaging errors, missing dependencies, and serialisation bugs:
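from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification

X_smoke, y_smoke = make_classification(
    n_samples=300, n_features=5, random_state=42
)
smoke_model = GradientBoostingClassifier(
    n_estimators=50, random_state=42
)
smoke_model.fit(X_smoke, y_smoke)

# Smoke tests
X_new = np.random.default_rng(99).normal(0, 1, (10, 5))
predictions = smoke_model.predict(X_new)
probabilities = smoke_model.predict_proba(X_new)

assert predictions.shape == (10,), "Wrong prediction shape"
assert probabilities.shape == (10, 2), "Wrong probability shape"
assert set(predictions).issubset({0, 1}), "Unexpected labels"
assert np.all(probabilities >= 0), "Negative probabilities"
assert np.allclose(probabilities.sum(axis=1), 1.0), (
    "Probabilities don't sum to 1"
)
print("All smoke tests passed.")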
20.4.2 Directional tests
Directional tests (also called perturbation tests) check that changing an input in a known direction produces the expected change in output. If you increase a customer’s spending, a churn model should predict lower churn probability. If you increase an error rate, an incident model should predict higher incident probability:
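from sklearn.ensemble import GradientBoostingClassifier
from scipy.special import expit

# Train an incident prediction model
rng = np.random.default_rng(42)
n = 800
incident_features = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),
    "request_rate": rng.exponential(100, n),
    "p99_latency_ms": rng.lognormal(4, 0.8, n),
})
log_odds = (
    -3.5
    + 0.4 * incident_features["error_pct"]
    + 0.005 * incident_features["request_rate"]
)
incident_target = rng.binomial(1, expit(log_odds))

incident_model = GradientBoostingClassifier(
    n_estimators=100, random_state=42
).fit(incident_features, incident_target)

# Directional test: increasing error_pct should increase
# incident probability
baseline = pd.DataFrame({
    "error_pct": [2.0],
    "request_rate": [100.0],
    "p99_latency_ms": [50.0],
})
perturbed = baseline.copy()
perturbed["error_pct"] = 8.0  # much higher error rate

p_baseline = incident_model.predict_proba(baseline)[0, 1]
p_perturbed = incident_model.predict_proba(perturbed)[0, 1]

assert p_perturbed > p_baseline, (
    f"Higher error rate should increase incident probability, "
    f"but {p_perturbed:.3f} <= {p_baseline:.3f}"
)
print(f"Baseline (error=2%): {p_baseline:.3f}")
print(f"Perturbed (error=8%): {p_perturbed:.3f}")
print("Directional test passed.")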
Baseline (error=2%): 0.043
Perturbed (error=8%): 0.891
Directional test passed.
20.4.3 Invariance tests
Invariance tests check that features which should not affect the output genuinely do not. A credit scoring model shouldn’t change its prediction when you vary the applicant’s name. A fraud detection model shouldn’t depend on the transaction ID. This pattern catches both data leakage bugs (the model accidentally learned from an identifier) and fairness issues (the model responds to a protected attribute):
# Add a column that the model should NOT depend on
baseline_with_id = baseline.copy()
baseline_with_id["server_id"] = "web-eu-042"

altered_id = baseline_with_id.copy()
altered_id["server_id"] = "web-us-999"

# The model was trained on numeric features only, so server_id
# should be irrelevant. Verify by predicting on the shared columns.
feature_cols = ["error_pct", "request_rate", "p99_latency_ms"]
p_original = incident_model.predict_proba(
    baseline_with_id[feature_cols]
)[0, 1]
p_altered = incident_model.predict_proba(
    altered_id[feature_cols]
)[0, 1]

assert p_original == p_altered, (
    "Changing server_id altered the prediction — "
    "possible data leakage through an identifier"
)
print(f"Prediction with server_id='web-eu-042': {p_original:.3f}")
print(f"Prediction with server_id='web-us-999': {p_altered:.3f}")
print("Invariance test passed.")
Prediction with server_id='web-eu-042': 0.043
Prediction with server_id='web-us-999': 0.043
Invariance test passed.
In a real system, the invariance test becomes more meaningful when the irrelevant column is present in the training data — for example, checking that a model trained on features including user_id doesn’t actually depend on it. Here the test is straightforward because we excluded server_id from training. The pattern matters most when you’re not sure what the model learned.
20.4.4 Minimum performance tests
Minimum performance tests set a floor that the model must exceed. The threshold should come from a meaningful baseline — the performance of a simple heuristic, a previous model version, or a dummy classifier — not from an arbitrary number:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# Establish a meaningful baseline. "stratified" preserves
# the class distribution, giving a non-trivial baseline even
# with imbalanced classes (unlike "most_frequent", which
# predicts everything as the majority class and produces
# F1 = 0 for the minority class)
dummy = DummyClassifier(strategy="stratified", random_state=42)
dummy_scores = cross_val_score(
    dummy,
    incident_features,
    incident_target,
    cv=5,
    scoring="roc_auc",
)

# cross_val_score clones the estimator and refits from scratch
# on each fold — the existing fitted state is ignored
model_scores = cross_val_score(
    incident_model,
    incident_features,
    incident_target,
    cv=5,
    scoring="roc_auc",
)

print(f"Dummy baseline AUC: "
      f"{dummy_scores.mean():.3f} +/- {dummy_scores.std():.3f}")
print(f"Model AUC: "
      f"{model_scores.mean():.3f} +/- {model_scores.std():.3f}")

# The model must meaningfully outperform the dummy
assert model_scores.mean() > dummy_scores.mean() + 0.05, (
    "Model does not meaningfully outperform "
    "the stratified dummy baseline"
)
print("Minimum performance test passed.")
Dummy baseline AUC: 0.518 +/- 0.026
Model AUC: 0.623 +/- 0.053
Minimum performance test passed.
Engineering Bridge
These four patterns map to testing concepts you already use, though some analogies are closer than others. Smoke tests are health checks — the equivalent of hitting /healthz after a deployment. Directional tests are like testing that a rate limiter actually limits: when you double the request rate, the rejection count should increase. You don’t check the exact count, but you verify the direction. Invariance tests are like checking that an authorisation decision doesn’t depend on the X-Request-ID header — the request ID should be irrelevant to the outcome. Minimum performance tests are closest to SLO assertions: the service must maintain at least 99.9% availability, or the model must beat a baseline AUC of 0.50.
20.5 Testing for data leakage
Data leakage — when information from the test set or the future leaks into training — is the most dangerous class of data science bug. It’s dangerous because the model appears to work better than it should: accuracy goes up, metrics look impressive, and nobody suspects a problem until the model fails in production. The leakage tests in Section 16.1 focused on reproducibility; here we focus on catching leakage in the data itself.
A temporal leakage test verifies that your train/test split respects time ordering. If you’re predicting next month’s churn, the training data must not include features computed from next month’s activity:
rng = np.random.default_rng(42)

# Simulate a dataset with temporal structure
n = 1000
events = pd.DataFrame({
    "event_date": pd.date_range(
        "2024-01-01", periods=n, freq="D"
    ),
    "revenue": rng.exponential(50, n),
    "user_count": rng.poisson(500, n),
})

# Define the prediction target date
prediction_date = pd.Timestamp("2024-10-01")

# Split: train on data before prediction_date, test on data after
train = events[events["event_date"] < prediction_date]
test = events[events["event_date"] >= prediction_date]

# Assert no temporal overlap
assert train["event_date"].max() < test["event_date"].min(), (
    "Temporal leakage: training data overlaps with test data"
)

# Assert test set starts at the expected boundary
assert test["event_date"].min() == prediction_date, (
    "Test set doesn't start at prediction_date"
)

# Assert all rows are accounted for
assert len(train) + len(test) == len(events), (
    "Rows lost during splitting"
)

print(f"Train: {train['event_date'].min().date()} to "
      f"{train['event_date'].max().date()} ({len(train)} rows)")
print(f"Test: {test['event_date'].min().date()} to "
      f"{test['event_date'].max().date()} ({len(test)} rows)")
print("Temporal leakage test passed.")
Train: 2024-01-01 to 2024-09-30 (274 rows)
Test: 2024-10-01 to 2026-09-26 (726 rows)
Temporal leakage test passed.
A target leakage test checks that none of your features are derived from the target variable. This often happens subtly — a feature that is a near-perfect proxy for the label because it was computed from the same underlying data:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500

# Create a dataset where one feature leaks the target
leak_features = pd.DataFrame({
    "legitimate_feature_1": rng.normal(0, 1, n),
    "legitimate_feature_2": rng.normal(0, 1, n),
    "total_refund_amount": rng.exponential(10, n),
})
leak_target = (
    leak_features["total_refund_amount"] > 15
).astype(int)  # leaky!

# A suspiciously high cross-validation score is a red flag
rf = RandomForestClassifier(n_estimators=50, random_state=42)
scores = cross_val_score(
    rf, leak_features, leak_target, cv=5, scoring="accuracy"
)
print(f"CV accuracy with all features: {scores.mean():.3f}")

# Drop the suspect feature and compare
scores_clean = cross_val_score(
    rf,
    leak_features[["legitimate_feature_1",
                   "legitimate_feature_2"]],
    leak_target,
    cv=5,
    scoring="accuracy",
)
print(f"CV accuracy without suspect: {scores_clean.mean():.3f}")

accuracy_drop = scores.mean() - scores_clean.mean()
assert accuracy_drop < 0.15, (
    f"Dropping 'total_refund_amount' reduced accuracy by "
    f"{accuracy_drop:.1%} — possible target leakage. "
    f"Investigate before proceeding."
)

# This assertion WILL fail — demonstrating leakage detection
print(f"Accuracy drop: {accuracy_drop:.1%}")
CV accuracy with all features: 1.000
CV accuracy without suspect: 0.730
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[11], line 34
     31 print(f"CV accuracy without suspect: {scores_clean.mean():.3f}")
     33 accuracy_drop = scores.mean() - scores_clean.mean()
---> 34 assert accuracy_drop < 0.15, (
     35     f"Dropping 'total_refund_amount' reduced accuracy by "
     36     f"{accuracy_drop:.1%} — possible target leakage. "
     37     f"Investigate before proceeding."
     38 )
     39 # This assertion WILL fail — demonstrating leakage detection
     40 print(f"Accuracy drop: {accuracy_drop:.1%}")

AssertionError: Dropping 'total_refund_amount' reduced accuracy by 27.0% — possible target leakage. Investigate before proceeding.
The pattern is straightforward: if removing a single feature causes a dramatic drop in performance, that feature is doing suspiciously heavy lifting. It may be legitimate — some features really are that predictive — but it warrants investigation. In this case the assertion fails because total_refund_amount is a direct proxy for the target, exactly the kind of leakage we want to catch.
20.6 Data validation tests
Models fail silently when their input data drifts from what they were trained on. A column that was never null during training starts arriving with nulls. A categorical feature gains a new level. Numeric values shift because an upstream system changed its units. The model keeps returning predictions — they’re just wrong.
Data validation tests catch these problems at the pipeline boundary, before the data reaches the model. Think of them as contract tests for your data:
def validate_model_input(df: pd.DataFrame) -> list[str]:
    """Validate a DataFrame against the expected model input schema.

    Returns a list of validation errors (empty if valid).
    """
    errors = []

    # --- Schema checks ---
    required_columns = {
        "error_pct", "request_rate", "p99_latency_ms",
    }
    missing = required_columns - set(df.columns)
    if missing:
        errors.append(f"Missing columns: {missing}")
        return errors  # can't check further without columns

    # --- Null checks ---
    for col in required_columns:
        null_count = df[col].isna().sum()
        if null_count > 0:
            errors.append(
                f"Column '{col}' has {null_count} null values"
            )

    # --- Range checks (based on training distribution) ---
    if (df["error_pct"] < 0).any():
        errors.append("error_pct contains negative values")
    if (df["request_rate"] < 0).any():
        errors.append("request_rate contains negative values")
    if (df["p99_latency_ms"] < 0).any():
        errors.append("p99_latency_ms contains negative values")

    # --- Distribution drift checks ---
    if df["error_pct"].mean() > 20:
        errors.append(
            f"error_pct mean ({df['error_pct'].mean():.1f}) "
            f"is unusually high — possible upstream issue"
        )

    return errors


# Test with valid data
rng = np.random.default_rng(42)
valid_data = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, 100), 0, None),
    "request_rate": rng.exponential(100, 100),
    "p99_latency_ms": rng.lognormal(4, 0.8, 100),
})
assert validate_model_input(valid_data) == []

# Test with invalid data
invalid_data = valid_data.copy()
invalid_data.loc[0, "error_pct"] = np.nan
invalid_data.loc[1, "request_rate"] = -5.0

errors = validate_model_input(invalid_data)
for error in errors:
    print(f"  {error}")
assert len(errors) == 2, f"Expected 2 errors, got {len(errors)}"
print(f"\nCaught {len(errors)} validation errors as expected.")
Column 'error_pct' has 1 null values
request_rate contains negative values
Caught 2 validation errors as expected.
This is the same principle as input validation on an API endpoint — check that the data conforms to the expected schema before processing it. The difference is that data validation also checks statistical properties (distributions, ranges, cardinalities) that an API schema wouldn’t capture. For more principled drift detection, formal two-sample tests (Kolmogorov–Smirnov, Population Stability Index) compare the incoming distribution against the training distribution — we met these tools in Section 5.1.
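As a sketch of that kind of check (using illustrative synthetic samples rather than the pipeline's data), scipy.stats.ks_2samp compares an incoming batch against a training-time reference:

from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Hypothetical reference sample captured at training time,
# and a new batch arriving at scoring time
training_error_pct = np.clip(rng.normal(2, 1.5, 1000), 0, None)
incoming_error_pct = np.clip(rng.normal(2, 1.5, 200), 0, None)

result = ks_2samp(training_error_pct, incoming_error_pct)
print(f"KS statistic: {result.statistic:.3f}, "
      f"p-value: {result.pvalue:.3f}")

# A very small p-value means the incoming distribution has shifted;
# flag the batch for investigation rather than silently scoring it
assert result.pvalue > 0.01, (
    "error_pct distribution differs from training (possible drift)"
)
print("Drift check passed.")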
For production systems, libraries like Great Expectations and pandera provide declarative validation frameworks. They express these same checks as reusable expectations that can be versioned, shared across teams, and integrated into CI. They also add operational complexity — start with simple validation functions like the one above and adopt a framework when the number of checks or teams makes manual maintenance painful. The following illustrates the pandera approach:
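# pandera schema (not executed — illustrative)
import pandera as pa

model_input_schema = pa.DataFrameSchema({
    "error_pct": pa.Column(float, pa.Check.ge(0)),
    "request_rate": pa.Column(float, pa.Check.ge(0)),
    "p99_latency_ms": pa.Column(float, pa.Check.ge(0)),
})

# Raises SchemaError if validation fails
validated_df = model_input_schema.validate(incoming_data)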
20.7 Regression tests for model behaviour
When you refactor a function in application code, your tests ensure the behaviour doesn’t change. Models need the same protection. A “harmless” change to a feature engineering pipeline, a library upgrade, or a data refresh can silently change model behaviour. Model regression tests capture a known-good state and alert you when outputs drift.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from scipy.special import expit

# Train the reference model
rng = np.random.default_rng(42)
n = 800
X_ref = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),
    "request_rate": rng.exponential(100, n),
    "p99_latency_ms": rng.lognormal(4, 0.8, n),
})
log_odds = (
    -3.5
    + 0.4 * X_ref["error_pct"]
    + 0.005 * X_ref["request_rate"]
)
y_ref = rng.binomial(1, expit(log_odds))

ref_model = GradientBoostingClassifier(
    n_estimators=100, random_state=42
).fit(X_ref, y_ref)

# Create a fixed set of reference inputs and capture outputs
reference_inputs = pd.DataFrame({
    "error_pct": [1.0, 3.0, 5.0, 8.0, 12.0],
    "request_rate": [50.0, 100.0, 200.0, 500.0, 1000.0],
    "p99_latency_ms": [30.0, 60.0, 100.0, 200.0, 500.0],
})
reference_outputs = ref_model.predict_proba(
    reference_inputs
)[:, 1]

# In practice, save to a fixture file:
# np.save("tests/fixtures/reference_preds.npy",
#         reference_outputs)
# Then load in your test:
# expected = np.load("tests/fixtures/reference_preds.npy")

# The test itself is just an assert_allclose call —
# let the AssertionError propagate so pytest sees the failure
def test_model_regression(model, inputs, expected, rtol=1e-5):
    """Assert model outputs haven't changed from reference."""
    actual = model.predict_proba(inputs)[:, 1]
    np.testing.assert_allclose(
        actual,
        expected,
        rtol=rtol,
        err_msg="Model regression detected",
    )


test_model_regression(
    ref_model, reference_inputs, reference_outputs
)
print("Regression test passed.")
print(f"Reference predictions: "
      f"{np.round(reference_outputs, 4)}")
Regression test passed.
Reference predictions: [0.0481 0.1778 0.2788 0.9384 0.8722]
The fixture file (reference_predictions.npy) is the key. It records the model’s output on a fixed set of inputs at a point in time when you know the model is correct. Any future change that alters those outputs triggers a failure — not necessarily a bug, but a change that deserves investigation.
Author’s Note
I learned the value of model regression tests the hard way. We upgraded scikit-learn from 1.3 to 1.4 as part of routine dependency maintenance. All our unit tests passed. The model trained without errors. The accuracy on the test set was within normal range.
But two weeks later, a product manager noticed that a customer segment that used to receive a particular recommendation had stopped receiving it. The upgrade had changed the default behaviour of a preprocessing step, which slightly shifted the decision boundary, which affected predictions for customers near that boundary. A regression test on a fixed set of inputs would have caught it immediately.
20.8 What not to test
An engineer’s natural instinct is to aim for high code coverage. For data science code, this instinct needs calibrating. There is no value in testing that model.fit(X, y) works — you’d be testing scikit-learn, not your code. Similarly, auto-generated pipeline boilerplate and third-party library internals don’t benefit from coverage-driven tests.
Focus your testing effort where your logic lives: data transforms, feature engineering, validation rules, and the behavioural properties of the trained model. Coverage metrics are a misleading signal for ML projects — a pipeline can have 95% line coverage and still silently produce wrong predictions because the tests don’t check what matters.
20.9 Organising your test suite
A well-structured test suite for a data science project has three tiers, mirroring the test pyramid from software engineering. The pyramid shape — many fast tests at the base, fewer slow tests at the top — applies, though the relative importance differs: in traditional software, a unit test failure is usually more informative than an integration test failure, but in data science, a statistical test that detects model degradation may be the most important test you have.
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(10, 5))
tiers = [
    ("Unit tests",
     "Feature transforms, validation "
     "logic\nMilliseconds — every commit",
     "#2563eb", 0, 0.9),
    ("Integration tests",
     "Pipeline end-to-end, small "
     "fixed data\nSeconds — every PR",
     "#0d9488", 1, 0.65),
    ("Statistical tests",
     "Model quality, regression, "
     "directional\nMinutes — nightly / on model changes",
     "#e85d4a", 2, 0.4),
]

for label, desc, colour, row, width in tiers:
    left = (1 - width) / 2
    rect = patches.FancyBboxPatch(
        (left, row * 1.1 + 0.05),
        width,
        0.9,
        boxstyle="round,pad=0.03",
        facecolor=colour,
        edgecolor="white",
        linewidth=2,
    )
    ax.add_patch(rect)
    ax.text(
        0.5, row * 1.1 + 0.6, label,
        ha="center", va="center",
        fontsize=12, fontweight="bold", color="white",
    )
    ax.text(
        0.5, row * 1.1 + 0.3, desc,
        ha="center", va="center",
        fontsize=9, color="white",
    )

ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 3.4)
ax.axis("off")
plt.tight_layout()
plt.show()
Figure 20.2: The data science test pyramid. Unit tests form the fast, broad base; statistical tests are fewer but catch the most dangerous failures.
In pytest, mark your statistical tests so they can be run separately. The following pytest.ini configuration lets you run pytest -m "not statistical" on every commit (fast) and pytest -m statistical nightly or when model code changes:
[pytest]
markers =
    statistical: tests that involve model training (slow)
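Applying the marker is a one-line decorator on the slow tests (illustrative sketch; the module and test names are hypothetical):

# tests/test_model_quality.py (illustrative)
import pytest

@pytest.mark.statistical
def test_model_beats_dummy_baseline():
    # trains and cross-validates the model (minutes, not milliseconds)
    ...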
A typical project layout mirrors standard application code:
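One illustrative arrangement (directory and file names follow the conventions used elsewhere in this chapter, such as tests/fixtures/):

src/
    features.py              # feature engineering (pure functions)
    validation.py            # data validation rules
    train.py                 # model training entry point
tests/
    fixtures/                # small fixed datasets, reference predictions
    test_features.py         # unit tests, run on every commit
    test_validation.py       # unit tests, run on every commit
    test_pipeline.py         # integration tests, run on every PR
    test_model_quality.py    # statistical tests (marked), run nightly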
20.10 Worked example: a tested prediction pipeline
Let’s put these patterns together into a realistic, fully tested prediction pipeline. We’ll build a small incident prediction system with tests at every layer.
First, the reusable functions — feature engineering and data validation:
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Transform raw metrics into model features."""
    result = df.copy()
    result["error_rate_log"] = np.log1p(df["error_pct"])
    result["latency_over_threshold"] = (
        df["p99_latency_ms"] > 200
    ).astype(int)
    result["load_x_errors"] = (
        df["request_rate"] * df["error_pct"]
    )
    return result


def validate_raw_data(df: pd.DataFrame) -> list[str]:
    """Check raw data before feature engineering."""
    errors = []
    for col in ["error_pct", "request_rate", "p99_latency_ms"]:
        if col not in df.columns:
            errors.append(f"Missing column: {col}")
        elif df[col].isna().any():
            errors.append(f"Nulls in {col}")
        elif (df[col] < 0).any():
            errors.append(f"Negative values in {col}")
    return errors
Now generate training data and run the tests in sequence — the deterministic checks first (data validation and feature engineering), then the behavioural checks on the trained model:
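A compact sketch of how the five checks fit together, reusing engineer_features and validate_raw_data from above on synthetic incident data (the exact scores printed depend on the data and seed):

# Illustrative sketch on synthetic incident data, as in earlier examples
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from scipy.special import expit

rng = np.random.default_rng(7)
n = 800
raw = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),
    "request_rate": rng.exponential(100, n),
    "p99_latency_ms": rng.lognormal(4, 0.8, n),
})
target = rng.binomial(1, expit(-3.0 + 0.9 * raw["error_pct"]))

# Test 1: raw data passes validation
assert validate_raw_data(raw) == []

# Test 2: feature engineering adds the expected columns
features = engineer_features(raw)
assert {"error_rate_log", "latency_over_threshold",
        "load_x_errors"} <= set(features.columns)

pipeline_model = GradientBoostingClassifier(
    n_estimators=100, random_state=42
).fit(features, target)

# Test 3: smoke test, predictions have the right shape and range
proba = pipeline_model.predict_proba(features.head(10))
assert proba.shape == (10, 2)
assert np.allclose(proba.sum(axis=1), 1.0)
print("Test 3 passed: smoke test")

# Test 4: minimum performance, must beat a stratified dummy
dummy_auc = cross_val_score(
    DummyClassifier(strategy="stratified", random_state=42),
    features, target, cv=5, scoring="roc_auc",
).mean()
model_auc = cross_val_score(
    pipeline_model, features, target, cv=5, scoring="roc_auc"
).mean()
assert model_auc > dummy_auc + 0.05
print(f"Test 4 passed: performance "
      f"(model AUC={model_auc:.3f} vs dummy={dummy_auc:.3f})")

# Test 5: directional, a higher error rate should mean higher risk
low = engineer_features(raw.head(1).assign(error_pct=1.0))
high = engineer_features(raw.head(1).assign(error_pct=10.0))
p_low = pipeline_model.predict_proba(low)[0, 1]
p_high = pipeline_model.predict_proba(high)[0, 1]
assert p_high > p_low
print(f"Test 5 passed: directional (low={p_low:.3f}, high={p_high:.3f})")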
Test 3 passed: smoke test
Test 4 passed: performance (model AUC=0.613 vs dummy=0.482)
Test 5 passed: directional (low=0.059, high=0.779)
Five tests, five different aspects of correctness. None of them requires knowing the exact model output. Together, they provide strong evidence that the pipeline is working as intended — and they’ll catch most regressions caused by code changes, data shifts, or library upgrades.
20.11 Summary
Testing data science code is testing — the same discipline, the same rigour, the same CI integration you already know. The difference is in what you assert:
Test the deterministic parts exactly. Feature engineering, data transforms, and validation logic are ordinary functions. Test them with exact assertions, like any other code.
Use tolerances for numerical code. np.testing.assert_allclose with an appropriate tolerance handles floating-point imprecision without making your tests flaky. Set both rtol and atol when values may be near zero.
Test model behaviour, not exact outputs. Smoke tests, directional tests, invariance tests, and minimum performance tests together provide strong coverage without requiring you to predict exact model outputs.
Test for data leakage explicitly. Temporal leakage and target leakage are the most dangerous data science bugs because they make models look better than they should. Write tests that check for them directly.
Validate data at pipeline boundaries. Schema checks, null checks, range checks, and distribution checks catch upstream data problems before they silently corrupt your model’s predictions.
The applied chapters in Part 5 will exercise these patterns on real industry problems — from churn prediction to fraud detection — where the choice of test strategy directly affects business outcomes.
20.12 Exercises
Write a validate_prediction_output function that checks model predictions before they’re returned to a consumer. It should verify: (a) no null values, (b) all probabilities are between 0 and 1, (c) probabilities sum to 1 per row (within tolerance), and (d) the predicted class distribution is not suspiciously concentrated (e.g., >99% of predictions are the same class). Write pytest-style tests for each condition, including tests that deliberately trigger each failure.
A colleague argues that setting random_state=42 in every test makes the tests pass by coincidence — the model might fail with a different seed. Write a test that trains and evaluates a model across 10 different random seeds and asserts that the accuracy is above a threshold for all of them. What does this test tell you that a single-seed test does not? Where does this approach break down?
Build a directional test suite for a churn prediction model. Assume the model takes features tenure_months, monthly_spend, support_tickets_last_30d, and days_since_last_login. Write at least four directional tests, each checking that a plausible change in one feature moves the churn probability in the expected direction. For one of the features, explain a scenario where the expected direction might be wrong.
Conceptual: Your team has a model in production with no tests. You’ve been asked to add a test suite retroactively. You can’t write exact-output tests because you don’t know what the “correct” outputs should be — the model is the only source of truth. Describe a strategy for building a test suite from scratch around an existing model. Which types of tests can you add immediately, and which require baseline measurements first?
Conceptual: A data scientist on your team says “We don’t need unit tests — we have model evaluation metrics. If the AUC is good, the code is correct.” Write a rebuttal (3–5 sentences). Give at least two concrete examples of bugs that would not be caught by good evaluation metrics but would be caught by the testing patterns described in this chapter.
---title: "Testing data science code"---## The assert that doesn't work {#sec-testing-intro}You know how to write tests. You've written thousands of them — unit tests that check return values, integration tests that verify API contracts, end-to-end tests that simulate user workflows. Your CI pipeline runs them on every push. A red build means something is broken. A green build means your code does what you said it would.Now try writing a test for a machine learning model:```{python}#| label: naive-model-test#| echo: trueimport numpy as npimport pandas as pdfrom sklearn.ensemble import RandomForestClassifierfrom sklearn.model_selection import train_test_splitrng = np.random.default_rng(42)X = rng.normal(0, 1, (500, 5))y = (X[:, 0] +0.5* X[:, 1] + rng.normal(0, 0.3, 500) >0).astype(int)X_train, X_test, y_train, y_test = train_test_split( X, y, test_size=0.2, random_state=42)model = RandomForestClassifier(n_estimators=100, random_state=42)model.fit(X_train, y_train)accuracy = model.score(X_test, y_test)print(f"Accuracy: {accuracy:.4f}")# This is a test... but what should the expected value be?assert accuracy >0.80, f"Model accuracy {accuracy:.4f} below threshold"```The assertion passes, but something about it feels wrong. In software testing, expected values come from specifications — the function *should* return 35.96 because the invoice maths says so. Here, the threshold of 0.80 came from... where? You picked it because it seemed reasonable. If the accuracy comes back at 0.79, is the model broken? Is the data different? Is the random seed behaving differently on this machine? You're no longer testing correctness; you're testing "good enough", and the boundary between those two is fuzzy.This is the fundamental tension in testing data science code. Your engineering instincts are exactly right — untested code is untrustworthy code — but the testing patterns you know were designed for deterministic systems. Data science code produces approximate, probabilistic, data-dependent outputs. You need a different testing vocabulary, built on the same disciplined foundations.::: {.callout-note}## Engineering BridgeIn web development, you test against a *specification*: the API contract says `GET /users/42` returns a JSON object with an `id` field. In data science, there is often no specification — the model's behaviour *is* the specification, discovered empirically from data. This is closer to testing a search engine's relevance ranking than testing a CRUD endpoint. You can't assert exact outputs, but you can assert properties: results should be ordered by relevance, the top result should be relevant, and adding a typo shouldn't completely destroy the rankings. The shift is from **exact-output testing** to **behavioural testing** — asserting *what kind of* answer you expect rather than *which exact* answer. (This is distinct from *property-based testing* in the Hypothesis/QuickCheck sense, where a framework generates random inputs to find counterexamples — a complementary technique worth exploring separately.):::## Testing the deterministic parts first {#sec-testing-deterministic}Before wrestling with probabilistic outputs, recognise that most data science code *is* deterministic. Data cleaning functions, feature engineering transforms, and preprocessing pipelines take a DataFrame in and produce a DataFrame out. 
These are ordinary functions that deserve ordinary tests.```{python}#| label: deterministic-transform-tests#| echo: truedef compute_recency_features( df: pd.DataFrame, reference_date: pd.Timestamp,) -> pd.DataFrame:"""Compute days-since features relative to a reference date.""" result = df.copy() result["days_since_last_order"] = ( (reference_date - df["last_order_date"]).dt.days ) result["days_since_signup"] = ( (reference_date - df["signup_date"]).dt.days ) result["order_recency_ratio"] = ( result["days_since_last_order"] / result["days_since_signup"] )return result# --- Tests ---def test_recency_features(): ref_date = pd.Timestamp("2025-01-15") test_data = pd.DataFrame({"last_order_date": [pd.Timestamp("2025-01-10"), pd.Timestamp("2024-12-15")],"signup_date": [pd.Timestamp("2024-01-15"), pd.Timestamp("2023-01-15")], }) result = compute_recency_features(test_data, ref_date)# Exact assertions — these are deterministic computationsassert result["days_since_last_order"].tolist() == [5, 31]assert result["days_since_signup"].tolist() == [366, 731] np.testing.assert_allclose( result["order_recency_ratio"].values, [5/366, 31/731], )# Structural assertions — output shape and columnsassertlist(result.columns) == ["last_order_date", "signup_date","days_since_last_order", "days_since_signup","order_recency_ratio", ]test_recency_features()print("All deterministic tests passed.")```These tests use exact equality because the computations are exact. Date arithmetic, string manipulation, joins, filters, aggregations — all deterministic, all testable in the same way you'd test any utility function. The advice here is simple: **extract your data transformations into pure functions and test them like any other code.** The fact that they happen to operate on DataFrames rather than strings or integers doesn't change the testing discipline.Treat `pytest` as your primary testing framework. The standard project layout works for data science exactly as it works for application code — a `tests/` directory, one test module per source module, fixtures for shared test data. In a real project, these tests would live in separate files:```python# tests/test_features.py (not executed — illustrative)import pytestimport pandas as pd@pytest.fixturedef sample_orders():return pd.DataFrame({"last_order_date": pd.to_datetime( ["2025-01-10", "2024-12-15"] ),"signup_date": pd.to_datetime( ["2024-01-15", "2023-01-15"] ), })def test_recency_days_are_positive(sample_orders): result = compute_recency_features( sample_orders, pd.Timestamp("2025-01-15") )assert (result["days_since_last_order"] >=0).all()assert (result["days_since_signup"] >=0).all()def test_recency_ratio_bounded(sample_orders): result = compute_recency_features( sample_orders, pd.Timestamp("2025-01-15") )# Ratio should be between 0 and 1 for customers# who ordered after signupassert (result["order_recency_ratio"] >=0).all()assert (result["order_recency_ratio"] <=1).all()```Notice that the second test checks a *property* (the ratio is between 0 and 1) rather than an exact value. This is a preview of the pattern we'll use increasingly for statistical code.## Approximate equality and tolerances {#sec-approximate-equality}When code involves floating-point arithmetic — and nearly all data science code does — exact equality breaks down. Two mathematically equivalent computations can produce slightly different floating-point results. This is the same issue from @sec-taming-randomness, now applied to testing.NumPy provides `np.testing.assert_allclose` for this purpose. 
It checks that values are equal within a relative tolerance (`rtol`) and an absolute tolerance (`atol`):```{python}#| label: approximate-equality#| echo: true# These are mathematically equal but differ at machine precisiona =0.1+0.2b =0.3print(f"0.1 + 0.2 = {a!r}")print(f"0.3 = {b!r}")print(f"Equal? {a == b}")# assert_allclose handles this gracefullynp.testing.assert_allclose(a, b, rtol=1e-10)print("assert_allclose passed (rtol=1e-10)")```The choice of tolerance matters. Too tight and your tests become flaky — failing on different machines or library versions because of insignificant numerical differences. Too loose and your tests miss genuine bugs. @fig-tolerance-spectrum shows the practical ranges for different kinds of computation.```{python}#| label: fig-tolerance-spectrum#| echo: true#| fig-cap: "Tolerance guidelines for different types of numerical computation. Tighter tolerances catch more bugs but risk flaky tests; looser tolerances are more robust across platforms."#| fig-alt: "Horizontal bar chart with a logarithmic x-axis ranging from 1e-12 to 1e-1. Three horizontal bars represent tolerance ranges: 'Floating-point arithmetic' spans roughly 1e-12 to 1e-8 in blue, 'Iterative algorithms' spans 1e-6 to 1e-4 in teal, and 'Cross-platform models' spans 1e-4 to 1e-2 in coral. A vertical dashed grey line marks 1e-6 as a practical default."import matplotlib.pyplot as pltimport numpy as npfig, ax = plt.subplots(figsize=(10, 3))categories = ["Floating-point\narithmetic","Iterative\nalgorithms","Cross-platform\nmodels",]starts = [1e-12, 1e-6, 1e-4]ends = [1e-8, 1e-4, 1e-2]colours = ["#2563eb", "#0d9488", "#e85d4a"]for i, (cat, s, e, c) inenumerate(zip(categories, starts, ends, colours)): ax.barh( i, np.log10(e) - np.log10(s), left=np.log10(s), height=0.5, color=c, alpha=0.8, )ax.set_yticks(range(len(categories)))ax.set_yticklabels(categories, fontsize=10)ax.set_xlabel("Tolerance (rtol)", fontsize=10)ax.set_xticks(range(-12, 0, 2))ax.set_xticklabels( [f"1e{x}"for x inrange(-12, 0, 2)], fontsize=9)ax.axvline( x=-6, color="#64748b", linestyle="--", linewidth=1.5, label="Practical default (1e-6)",)ax.legend(fontsize=9, frameon=False)ax.spines[["top", "right"]].set_visible(False)ax.invert_yaxis()plt.tight_layout()plt.show()```For computations where the expected value may be close to zero, `rtol` alone is insufficient — the relative comparison divides by a near-zero denominator, making even tiny absolute differences look large. In those cases, set both `rtol` and `atol` (e.g., `rtol=1e-10, atol=1e-12`).```{python}#| label: tolerance-levels#| echo: truefrom sklearn.linear_model import LogisticRegressionfrom sklearn.datasets import make_classificationX_tol, y_tol = make_classification( n_samples=200, n_features=5, random_state=42)# Train the same model twice — numerically identical to near-machine# precision on the same platform, because the same deterministic solver# runs on the same data (not because of the convergence tolerance)model_a = LogisticRegression( random_state=42, max_iter=200).fit(X_tol, y_tol)model_b = LogisticRegression( random_state=42, max_iter=200).fit(X_tol, y_tol)np.testing.assert_allclose( model_a.predict_proba(X_tol), model_b.predict_proba(X_tol), rtol=1e-10,)print("Same-machine reproducibility: passed (rtol=1e-10)")# For cross-platform comparisons, you'd use a looser tolerance# and store expected values in a fixture file rather than recomputing```::: {.callout-tip}## Author's NoteI once spent an afternoon debugging a "flaky" test that failed intermittently in CI. 
The test compared model coefficients to six decimal places. The root cause: our CI runners used different CPU architectures (AMD vs Intel), which use different BLAS implementations, which produced legitimately different optimisation paths. The fix was trivial — loosen the tolerance from `rtol=1e-8` to `rtol=1e-5` — but it took me embarrassingly long to diagnose because I was looking for a code bug rather than a numerical one. Now I default to `rtol=1e-6` for anything involving optimisation and only tighten it when I have a reason to.:::## Behavioural tests for models {#sec-behavioural-tests}You can't assert that a model returns a specific prediction — the exact output depends on the training data, the random seed, and the library version. But you *can* assert how the model should *behave*. These **behavioural tests** check properties of the model's outputs rather than their exact values.There are four patterns that cover most cases. Each is worth understanding individually before we combine them in the worked example.### Smoke tests {#sec-smoke-tests}Smoke tests check that the model runs without crashing and produces outputs of the expected shape and type. These catch packaging errors, missing dependencies, and serialisation bugs:```{python}#| label: smoke-tests#| echo: truefrom sklearn.ensemble import GradientBoostingClassifierfrom sklearn.datasets import make_classificationX_smoke, y_smoke = make_classification( n_samples=300, n_features=5, random_state=42)smoke_model = GradientBoostingClassifier( n_estimators=50, random_state=42)smoke_model.fit(X_smoke, y_smoke)# Smoke testsX_new = np.random.default_rng(99).normal(0, 1, (10, 5))predictions = smoke_model.predict(X_new)probabilities = smoke_model.predict_proba(X_new)assert predictions.shape == (10,), "Wrong prediction shape"assert probabilities.shape == (10, 2), "Wrong probability shape"assertset(predictions).issubset({0, 1}), "Unexpected labels"assert np.all(probabilities >=0), "Negative probabilities"assert np.allclose(probabilities.sum(axis=1), 1.0), ("Probabilities don't sum to 1")print("All smoke tests passed.")```### Directional tests {#sec-directional-tests}Directional tests (also called **perturbation tests**) check that changing an input in a known direction produces the expected change in output. If you increase a customer's spending, a churn model should predict lower churn probability. 
If you increase an error rate, an incident model should predict higher incident probability:```{python}#| label: directional-tests#| echo: truefrom sklearn.ensemble import GradientBoostingClassifierfrom scipy.special import expit# Train an incident prediction modelrng = np.random.default_rng(42)n =800incident_features = pd.DataFrame({"error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),"request_rate": rng.exponential(100, n),"p99_latency_ms": rng.lognormal(4, 0.8, n),})log_odds = (-3.5+0.4* incident_features["error_pct"]+0.005* incident_features["request_rate"])incident_target = rng.binomial(1, expit(log_odds))incident_model = GradientBoostingClassifier( n_estimators=100, random_state=42).fit(incident_features, incident_target)# Directional test: increasing error_pct should increase# incident probabilitybaseline = pd.DataFrame({"error_pct": [2.0],"request_rate": [100.0],"p99_latency_ms": [50.0],})perturbed = baseline.copy()perturbed["error_pct"] =8.0# much higher error ratep_baseline = incident_model.predict_proba(baseline)[0, 1]p_perturbed = incident_model.predict_proba(perturbed)[0, 1]assert p_perturbed > p_baseline, (f"Higher error rate should increase incident probability, "f"but {p_perturbed:.3f} <= {p_baseline:.3f}")print(f"Baseline (error=2%): {p_baseline:.3f}")print(f"Perturbed (error=8%): {p_perturbed:.3f}")print("Directional test passed.")```### Invariance tests {#sec-invariance-tests}Invariance tests check that features which *should not* affect the output genuinely do not. A credit scoring model shouldn't change its prediction when you vary the applicant's name. A fraud detection model shouldn't depend on the transaction ID. This pattern catches both data leakage bugs (the model accidentally learned from an identifier) and fairness issues (the model responds to a protected attribute):```{python}#| label: invariance-tests#| echo: true# Add a column that the model should NOT depend onbaseline_with_id = baseline.copy()baseline_with_id["server_id"] ="web-eu-042"altered_id = baseline_with_id.copy()altered_id["server_id"] ="web-us-999"# The model was trained on numeric features only, so server_id# should be irrelevant. Verify by predicting on the shared columns.feature_cols = ["error_pct", "request_rate", "p99_latency_ms"]p_original = incident_model.predict_proba( baseline_with_id[feature_cols])[0, 1]p_altered = incident_model.predict_proba( altered_id[feature_cols])[0, 1]assert p_original == p_altered, ("Changing server_id altered the prediction — ""possible data leakage through an identifier")print(f"Prediction with server_id='web-eu-042': {p_original:.3f}")print(f"Prediction with server_id='web-us-999': {p_altered:.3f}")print("Invariance test passed.")```In a real system, the invariance test becomes more meaningful when the irrelevant column *is* present in the training data — for example, checking that a model trained on features including `user_id` doesn't actually depend on it. Here the test is straightforward because we excluded `server_id` from training. The pattern matters most when you're not sure what the model learned.### Minimum performance tests {#sec-minimum-performance-tests}Minimum performance tests set a floor that the model must exceed. 
The threshold should come from a meaningful baseline — the performance of a simple heuristic, a previous model version, or a dummy classifier — not from an arbitrary number:```{python}#| label: minimum-performance#| echo: truefrom sklearn.dummy import DummyClassifierfrom sklearn.model_selection import cross_val_score# Establish a meaningful baseline. "stratified" preserves# the class distribution, giving a non-trivial baseline even# with imbalanced classes (unlike "most_frequent", which# predicts everything as the majority class and produces# F1 = 0 for the minority class)dummy = DummyClassifier(strategy="stratified", random_state=42)dummy_scores = cross_val_score( dummy, incident_features, incident_target, cv=5, scoring="roc_auc",)# cross_val_score clones the estimator and refits from scratch# on each fold — the existing fitted state is ignoredmodel_scores = cross_val_score( incident_model, incident_features, incident_target, cv=5, scoring="roc_auc",)print(f"Dummy baseline AUC: "f"{dummy_scores.mean():.3f} +/- {dummy_scores.std():.3f}")print(f"Model AUC: "f"{model_scores.mean():.3f} +/- {model_scores.std():.3f}")# The model must meaningfully outperform the dummyassert model_scores.mean() > dummy_scores.mean() +0.05, ("Model does not meaningfully outperform ""the stratified dummy baseline")print("Minimum performance test passed.")```::: {.callout-note}## Engineering BridgeThese four patterns map to testing concepts you already use, though some analogies are closer than others. Smoke tests are health checks — the equivalent of hitting `/healthz` after a deployment. Directional tests are like testing that a rate limiter actually limits: when you double the request rate, the rejection count should increase. You don't check the exact count, but you verify the direction. Invariance tests are like checking that an authorisation decision doesn't depend on the `X-Request-ID` header — the request ID should be irrelevant to the outcome. Minimum performance tests are closest to SLO assertions: the service must maintain at least 99.9% availability, or the model must beat a baseline AUC of 0.50.:::## Testing for data leakage {#sec-testing-leakage}Data leakage — when information from the test set or the future leaks into training — is the most dangerous class of data science bug. It's dangerous because the model appears to work *better* than it should: accuracy goes up, metrics look impressive, and nobody suspects a problem until the model fails in production. The leakage tests in @sec-repro-intro focused on reproducibility; here we focus on catching leakage in the data itself.A temporal leakage test verifies that your train/test split respects time ordering. 
If you're predicting next month's churn, the training data must not include features computed from next month's activity:```{python}#| label: temporal-leakage-test#| echo: truerng = np.random.default_rng(42)# Simulate a dataset with temporal structuren =1000events = pd.DataFrame({"event_date": pd.date_range("2024-01-01", periods=n, freq="D" ),"revenue": rng.exponential(50, n),"user_count": rng.poisson(500, n),})# Define the prediction target dateprediction_date = pd.Timestamp("2024-10-01")# Split: train on data before prediction_date, test on data aftertrain = events[events["event_date"] < prediction_date]test = events[events["event_date"] >= prediction_date]# Assert no temporal overlapassert train["event_date"].max() < test["event_date"].min(), ("Temporal leakage: training data overlaps with test data")# Assert test set starts at the expected boundaryassert test["event_date"].min() == prediction_date, ("Test set doesn't start at prediction_date")# Assert all rows are accounted forassertlen(train) +len(test) ==len(events), ("Rows lost during splitting")print(f"Train: {train['event_date'].min().date()} to "f"{train['event_date'].max().date()} ({len(train)} rows)")print(f"Test: {test['event_date'].min().date()} to "f"{test['event_date'].max().date()} ({len(test)} rows)")print("Temporal leakage test passed.")```A target leakage test checks that none of your features are derived from the target variable. This often happens subtly — a feature that is a near-perfect proxy for the label because it was computed from the same underlying data:```{python}#| label: target-leakage-test#| echo: true#| error: truefrom sklearn.ensemble import RandomForestClassifierfrom sklearn.model_selection import cross_val_scorerng = np.random.default_rng(42)n =500# Create a dataset where one feature leaks the targetleak_features = pd.DataFrame({"legitimate_feature_1": rng.normal(0, 1, n),"legitimate_feature_2": rng.normal(0, 1, n),"total_refund_amount": rng.exponential(10, n),})leak_target = ( leak_features["total_refund_amount"] >15).astype(int) # leaky!# A suspiciously high cross-validation score is a red flagrf = RandomForestClassifier(n_estimators=50, random_state=42)scores = cross_val_score( rf, leak_features, leak_target, cv=5, scoring="accuracy")print(f"CV accuracy with all features: {scores.mean():.3f}")# Drop the suspect feature and comparescores_clean = cross_val_score( rf, leak_features[["legitimate_feature_1","legitimate_feature_2"]], leak_target, cv=5, scoring="accuracy",)print(f"CV accuracy without suspect: {scores_clean.mean():.3f}")accuracy_drop = scores.mean() - scores_clean.mean()assert accuracy_drop <0.15, (f"Dropping 'total_refund_amount' reduced accuracy by "f"{accuracy_drop:.1%} — possible target leakage. "f"Investigate before proceeding.")# This assertion WILL fail — demonstrating leakage detectionprint(f"Accuracy drop: {accuracy_drop:.1%}")```The pattern is straightforward: if removing a single feature causes a dramatic drop in performance, that feature is doing suspiciously heavy lifting. It may be legitimate — some features really are that predictive — but it warrants investigation. In this case the assertion fails because `total_refund_amount` is a direct proxy for the target, exactly the kind of leakage we want to catch.## Data validation tests {#sec-data-validation-tests}Models fail silently when their input data drifts from what they were trained on. A column that was never null during training starts arriving with nulls. A categorical feature gains a new level. 
Numeric values shift because an upstream system changed its units. The model keeps returning predictions — they're just wrong.Data validation tests catch these problems at the pipeline boundary, before the data reaches the model. Think of them as contract tests for your data:```{python}#| label: data-validation#| echo: truedef validate_model_input(df: pd.DataFrame) ->list[str]:"""Validate a DataFrame against the expected model input schema. Returns a list of validation errors (empty if valid). """ errors = []# --- Schema checks --- required_columns = {"error_pct", "request_rate", "p99_latency_ms", } missing = required_columns -set(df.columns)if missing: errors.append(f"Missing columns: {missing}")return errors # can't check further without columns# --- Null checks ---for col in required_columns: null_count = df[col].isna().sum()if null_count >0: errors.append(f"Column '{col}' has {null_count} null values" )# --- Range checks (based on training distribution) ---if (df["error_pct"] <0).any(): errors.append("error_pct contains negative values")if (df["request_rate"] <0).any(): errors.append("request_rate contains negative values")if (df["p99_latency_ms"] <0).any(): errors.append("p99_latency_ms contains negative values")# --- Distribution drift checks ---if df["error_pct"].mean() >20: errors.append(f"error_pct mean ({df['error_pct'].mean():.1f}) "f"is unusually high — possible upstream issue" )return errors# Test with valid datarng = np.random.default_rng(42)valid_data = pd.DataFrame({"error_pct": np.clip(rng.normal(2, 1.5, 100), 0, None),"request_rate": rng.exponential(100, 100),"p99_latency_ms": rng.lognormal(4, 0.8, 100),})assert validate_model_input(valid_data) == []# Test with invalid datainvalid_data = valid_data.copy()invalid_data.loc[0, "error_pct"] = np.naninvalid_data.loc[1, "request_rate"] =-5.0errors = validate_model_input(invalid_data)for error in errors:print(f" {error}")assertlen(errors) ==2, f"Expected 2 errors, got {len(errors)}"print(f"\nCaught {len(errors)} validation errors as expected.")```This is the same principle as input validation on an API endpoint — check that the data conforms to the expected schema before processing it. The difference is that data validation also checks *statistical* properties (distributions, ranges, cardinalities) that an API schema wouldn't capture. For more principled drift detection, formal two-sample tests (Kolmogorov–Smirnov, Population Stability Index) compare the incoming distribution against the training distribution — we met these tools in @sec-hypothesis-testing.For production systems, libraries like Great Expectations and pandera provide declarative validation frameworks. They express these same checks as reusable expectations that can be versioned, shared across teams, and integrated into CI. They also add operational complexity — start with simple validation functions like the one above and adopt a framework when the number of checks or teams makes manual maintenance painful. 
For production systems, libraries like Great Expectations and pandera provide declarative validation frameworks. They express these same checks as reusable expectations that can be versioned, shared across teams, and integrated into CI. They also add operational complexity — start with simple validation functions like the one above and adopt a framework when the number of checks or teams makes manual maintenance painful. The following illustrates the pandera approach:

```python
# pandera schema (not executed — illustrative)
import pandera as pa

model_input_schema = pa.DataFrameSchema({
    "error_pct": pa.Column(float, pa.Check.ge(0)),
    "request_rate": pa.Column(float, pa.Check.ge(0)),
    "p99_latency_ms": pa.Column(float, pa.Check.ge(0)),
})

# Raises SchemaError if validation fails
validated_df = model_input_schema.validate(incoming_data)
```

## Regression tests for model behaviour {#sec-regression-tests}

When you refactor a function in application code, your tests ensure the behaviour doesn't change. Models need the same protection. A "harmless" change to a feature engineering pipeline, a library upgrade, or a data refresh can silently change model behaviour. **Model regression tests** capture a known-good state and alert you when outputs drift.

```{python}
#| label: regression-test
#| echo: true
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from scipy.special import expit

# Train the reference model
rng = np.random.default_rng(42)
n = 800
X_ref = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),
    "request_rate": rng.exponential(100, n),
    "p99_latency_ms": rng.lognormal(4, 0.8, n),
})
log_odds = (
    -3.5
    + 0.4 * X_ref["error_pct"]
    + 0.005 * X_ref["request_rate"]
)
y_ref = rng.binomial(1, expit(log_odds))

ref_model = GradientBoostingClassifier(
    n_estimators=100, random_state=42
).fit(X_ref, y_ref)

# Create a fixed set of reference inputs and capture outputs
reference_inputs = pd.DataFrame({
    "error_pct": [1.0, 3.0, 5.0, 8.0, 12.0],
    "request_rate": [50.0, 100.0, 200.0, 500.0, 1000.0],
    "p99_latency_ms": [30.0, 60.0, 100.0, 200.0, 500.0],
})
reference_outputs = ref_model.predict_proba(
    reference_inputs
)[:, 1]

# In practice, save to a fixture file:
# np.save("tests/fixtures/reference_preds.npy",
#         reference_outputs)
# Then load in your test:
# expected = np.load("tests/fixtures/reference_preds.npy")


# The test itself is just an assert_allclose call —
# let the AssertionError propagate so pytest sees the failure
def test_model_regression(model, inputs, expected, rtol=1e-5):
    """Assert model outputs haven't changed from reference."""
    actual = model.predict_proba(inputs)[:, 1]
    np.testing.assert_allclose(
        actual, expected, rtol=rtol,
        err_msg="Model regression detected",
    )


test_model_regression(
    ref_model, reference_inputs, reference_outputs
)
print("Regression test passed.")
print(f"Reference predictions: "
      f"{np.round(reference_outputs, 4)}")
```

The fixture file (`reference_preds.npy`) is the key. It records the model's output on a fixed set of inputs at a point in time when you know the model is correct. Any future change that alters those outputs triggers a failure — not necessarily a bug, but a change that deserves investigation.

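Wired into a standalone test suite, the fixture-based version might look something like this sketch. The model path, the use of joblib for persistence, and reading the reference inputs from `sample_data.parquet` are assumptions about project layout, not requirements:

```python
# tests/statistical/test_regression.py (illustrative sketch).
# Assumes the model is persisted with joblib, the parquet fixture holds
# exactly the model's feature columns, and the reference predictions
# were saved with np.save as shown above.
import joblib
import numpy as np
import pandas as pd


def test_predictions_match_reference():
    model = joblib.load("models/incident_model.joblib")  # assumed path
    inputs = pd.read_parquet("tests/fixtures/sample_data.parquet")
    expected = np.load("tests/fixtures/reference_preds.npy")

    actual = model.predict_proba(inputs)[:, 1]
    np.testing.assert_allclose(
        actual, expected, rtol=1e-5,
        err_msg="Model regression detected; investigate before merging",
    )
```
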
::: {.callout-tip}
## Author's Note

I learned the value of model regression tests the hard way. We upgraded scikit-learn from 1.3 to 1.4 as part of routine dependency maintenance. All our unit tests passed. The model trained without errors. The accuracy on the test set was within normal range.

But two weeks later, a product manager noticed that a customer segment that used to receive a particular recommendation had stopped receiving it. The upgrade had changed the default behaviour of a preprocessing step, which slightly shifted the decision boundary, which affected predictions for customers near that boundary. A regression test on a fixed set of inputs would have caught it immediately.
:::

## What not to test {#sec-what-not-to-test}

An engineer's natural instinct is to aim for high code coverage. For data science code, this instinct needs calibrating. There is no value in testing that `model.fit(X, y)` works — you'd be testing scikit-learn, not your code. Similarly, auto-generated pipeline boilerplate and third-party library internals don't benefit from coverage-driven tests.

Focus your testing effort where *your* logic lives: data transforms, feature engineering, validation rules, and the behavioural properties of the trained model. Coverage metrics are a misleading signal for ML projects — a pipeline can have 95% line coverage and still silently produce wrong predictions because the tests don't check what matters.

## Organising your test suite {#sec-test-organisation}

A well-structured test suite for a data science project has three tiers, mirroring the test pyramid from software engineering. The pyramid shape — many fast tests at the base, fewer slow tests at the top — applies, though the relative *importance* differs: in traditional software, a unit test failure is usually more informative than an integration test failure, but in data science, a statistical test that detects model degradation may be the most important test you have.

```{python}
#| label: fig-test-pyramid
#| echo: true
#| fig-cap: "The data science test pyramid. Unit tests form the fast, broad base; statistical tests are fewer but catch the most dangerous failures."
#| fig-alt: "A pyramid diagram with three horizontal tiers. The bottom tier, labelled 'Unit tests' in blue, is the widest and annotated 'Feature transforms, validation logic — milliseconds, every commit'. The middle tier, labelled 'Integration tests' in teal, is annotated 'Pipeline end-to-end — seconds, every PR'. The top tier, labelled 'Statistical tests' in coral, is the narrowest and annotated 'Model quality, regression, directional — minutes, nightly or on model changes'."
import matplotlib.pyplot as plt
import matplotlib.patches as patches

fig, ax = plt.subplots(figsize=(10, 5))
tiers = [
    ("Unit tests",
     "Feature transforms, validation "
     "logic\nMilliseconds — every commit",
     "#2563eb", 0, 0.9),
    ("Integration tests",
     "Pipeline end-to-end, small "
     "fixed data\nSeconds — every PR",
     "#0d9488", 1, 0.65),
    ("Statistical tests",
     "Model quality, regression, "
     "directional\nMinutes — nightly / on model changes",
     "#e85d4a", 2, 0.4),
]
for label, desc, colour, row, width in tiers:
    left = (1 - width) / 2
    rect = patches.FancyBboxPatch(
        (left, row * 1.1 + 0.05), width, 0.9,
        boxstyle="round,pad=0.03",
        facecolor=colour, edgecolor="white", linewidth=2,
    )
    ax.add_patch(rect)
    ax.text(
        0.5, row * 1.1 + 0.6, label,
        ha="center", va="center", fontsize=12,
        fontweight="bold", color="white",
    )
    ax.text(
        0.5, row * 1.1 + 0.3, desc,
        ha="center", va="center", fontsize=9,
        color="white",
    )
ax.set_xlim(-0.05, 1.05)
ax.set_ylim(-0.05, 3.4)
ax.axis("off")
plt.tight_layout()
plt.show()
```

In `pytest`, mark your statistical tests so they can be run separately. The following `pytest.ini` configuration lets you run `pytest -m "not statistical"` on every commit (fast) and `pytest -m statistical` nightly or when model code changes:

```ini
[pytest]
markers =
    statistical: tests that involve model training (slow)
```

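In the test modules themselves, the marker is applied as a decorator. A minimal sketch, where `load_training_sample` is a hypothetical helper standing in for however your project loads a fixed training sample:

```python
# tests/statistical/test_performance.py (illustrative sketch).
# `load_training_sample` is a hypothetical helper, not a real API.
import pytest
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score


@pytest.mark.statistical
def test_model_beats_chance():
    X, y = load_training_sample()  # hypothetical fixture loader
    model = GradientBoostingClassifier(random_state=42)
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    assert auc > 0.55, f"Model barely beats chance (AUC={auc:.3f})"
```
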
A typical project layout mirrors standard application code:

```text
project/
├── src/
│   ├── features.py                # Feature engineering
│   ├── validation.py              # Data validation logic
│   ├── train.py                   # Training pipeline
│   └── predict.py                 # Prediction service
├── tests/
│   ├── unit/
│   │   ├── test_features.py       # Fast, deterministic
│   │   └── test_validation.py     # Fast, deterministic
│   ├── integration/
│   │   └── test_pipeline.py       # End-to-end, small data
│   ├── statistical/
│   │   ├── test_performance.py    # Cross-validation checks
│   │   └── test_regression.py     # Reference output comparison
│   └── fixtures/
│       ├── sample_data.parquet    # Fixed test datasets
│       └── reference_preds.npy    # Known-good outputs
└── pytest.ini
```

## Worked example: a tested prediction pipeline {#sec-testing-worked-example}

Let's put these patterns together into a realistic, fully tested prediction pipeline. We'll build a small incident prediction system with tests at every layer.

First, the reusable functions — feature engineering and data validation:

```{python}
#| label: worked-example-functions
#| echo: true
def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Transform raw metrics into model features."""
    result = df.copy()
    result["error_rate_log"] = np.log1p(df["error_pct"])
    result["latency_over_threshold"] = (
        df["p99_latency_ms"] > 200
    ).astype(int)
    result["load_x_errors"] = (
        df["request_rate"] * df["error_pct"]
    )
    return result


def validate_raw_data(df: pd.DataFrame) -> list[str]:
    """Check raw data before feature engineering."""
    errors = []
    for col in ["error_pct", "request_rate", "p99_latency_ms"]:
        if col not in df.columns:
            errors.append(f"Missing column: {col}")
        elif df[col].isna().any():
            errors.append(f"Nulls in {col}")
        elif (df[col] < 0).any():
            errors.append(f"Negative values in {col}")
    return errors
```

Now generate training data and run the deterministic tests — data validation and feature engineering:

```{python}
#| label: worked-example-deterministic
#| echo: true
rng = np.random.default_rng(42)
n = 1000
raw = pd.DataFrame({
    "error_pct": np.clip(rng.normal(2, 1.5, n), 0, None),
    "request_rate": rng.exponential(100, n),
    "p99_latency_ms": rng.lognormal(4, 0.8, n),
})
log_odds = (
    -3.5
    + 0.4 * raw["error_pct"]
    + 0.005 * raw["request_rate"]
    + 0.002 * raw["p99_latency_ms"]
)
raw["incident"] = rng.binomial(1, expit(log_odds))

# Test 1: Data validation
validation_errors = validate_raw_data(raw)
assert validation_errors == [], (
    f"Validation failed: {validation_errors}"
)
print("Test 1 passed: data validation")

# Test 2: Feature engineering (deterministic)
features = engineer_features(raw)
assert "error_rate_log" in features.columns
assert "latency_over_threshold" in features.columns
assert "load_x_errors" in features.columns
assert features["error_rate_log"].isna().sum() == 0
assert set(
    features["latency_over_threshold"].unique()
).issubset({0, 1})
print("Test 2 passed: feature engineering")
```

Train the model and run the behavioural tests — smoke, minimum performance, and directional:

```{python}
#| label: worked-example-model-tests
#| echo: true
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.dummy import DummyClassifier

feature_cols = [
    "error_pct", "request_rate", "p99_latency_ms",
    "error_rate_log", "latency_over_threshold",
    "load_x_errors",
]
X_we = features[feature_cols]
y_we = raw["incident"]
X_train, X_test, y_train, y_test = train_test_split(
    X_we, y_we, test_size=0.2, random_state=42
)

we_model = GradientBoostingClassifier(
    n_estimators=100, max_depth=3, random_state=42
)
we_model.fit(X_train, y_train)

# Test 3: Smoke test
preds = we_model.predict(X_test)
probs = we_model.predict_proba(X_test)
assert preds.shape == (len(X_test),)
assert probs.shape == (len(X_test), 2)
assert np.allclose(probs.sum(axis=1), 1.0)
print("Test 3 passed: smoke test")

# Test 4: Minimum performance (vs stratified dummy)
dummy = DummyClassifier(
    strategy="stratified", random_state=42
)
model_auc = cross_val_score(
    we_model, X_we, y_we, cv=5, scoring="roc_auc"
).mean()
dummy_auc = cross_val_score(
    dummy, X_we, y_we, cv=5, scoring="roc_auc"
).mean()
assert model_auc > dummy_auc + 0.05, (
    f"Model AUC ({model_auc:.3f}) doesn't beat "
    f"dummy ({dummy_auc:.3f}) by at least 0.05"
)
print(f"Test 4 passed: performance "
      f"(model AUC={model_auc:.3f} vs dummy={dummy_auc:.3f})")

# Test 5: Directional test
low_error = pd.DataFrame({
    "error_pct": [1.0], "request_rate": [100.0],
    "p99_latency_ms": [50.0],
    "error_rate_log": [np.log1p(1.0)],
    "latency_over_threshold": [0],
    "load_x_errors": [100.0],
})
high_error = pd.DataFrame({
    "error_pct": [10.0], "request_rate": [100.0],
    "p99_latency_ms": [50.0],
    "error_rate_log": [np.log1p(10.0)],
    "latency_over_threshold": [0],
    "load_x_errors": [1000.0],
})
p_low = we_model.predict_proba(low_error[feature_cols])[0, 1]
p_high = we_model.predict_proba(
    high_error[feature_cols]
)[0, 1]
assert p_high > p_low, (
    "Higher error rate should increase incident probability"
)
print(f"Test 5 passed: directional "
      f"(low={p_low:.3f}, high={p_high:.3f})")
```

Five tests, five different aspects of correctness. None of them requires knowing the exact model output. Together, they provide strong evidence that the pipeline is working as intended — and they'll catch most regressions caused by code changes, data shifts, or library upgrades.

## Summary {#sec-testing-summary}

Testing data science code is testing — the same discipline, the same rigour, the same CI integration you already know. The difference is in *what* you assert:

1. **Test the deterministic parts exactly.** Feature engineering, data transforms, and validation logic are ordinary functions. Test them with exact assertions, like any other code.
2. **Use tolerances for numerical code.** `np.testing.assert_allclose` with an appropriate tolerance handles floating-point imprecision without making your tests flaky. Set both `rtol` and `atol` when values may be near zero.
3. **Test model *behaviour*, not exact outputs.** Smoke tests, directional tests, invariance tests, and minimum performance tests together provide strong coverage without requiring you to predict exact model outputs.
4. **Test for data leakage explicitly.** Temporal leakage and target leakage are the most dangerous data science bugs because they make models look *better* than they should. Write tests that check for them directly.
5. **Validate data at pipeline boundaries.** Schema checks, null checks, range checks, and distribution checks catch upstream data problems before they silently corrupt your model's predictions.

The applied chapters in Part 5 will exercise these patterns on real industry problems — from churn prediction to fraud detection — where the choice of test strategy directly affects business outcomes.

## Exercises {#sec-testing-exercises}

1. Write a `validate_prediction_output` function that checks model predictions before they're returned to a consumer. It should verify: (a) no null values, (b) all probabilities are between 0 and 1, (c) probabilities sum to 1 per row (within tolerance), and (d) the predicted class distribution is not suspiciously concentrated (e.g., >99% of predictions are the same class). Write pytest-style tests for each condition, including tests that deliberately trigger each failure.
2. A colleague argues that setting `random_state=42` in every test makes the tests pass by coincidence — the model might fail with a different seed. Write a test that trains and evaluates a model across 10 different random seeds and asserts that the accuracy is above a threshold for *all* of them. What does this test tell you that a single-seed test does not? Where does this approach break down?
3. Build a directional test suite for a churn prediction model. Assume the model takes features `tenure_months`, `monthly_spend`, `support_tickets_last_30d`, and `days_since_last_login`. Write at least four directional tests, each checking that a plausible change in one feature moves the churn probability in the expected direction. For one of the features, explain a scenario where the expected direction might be wrong.
4. **Conceptual:** Your team has a model in production with no tests. You've been asked to add a test suite retroactively. You can't write exact-output tests because you don't know what the "correct" outputs should be — the model is the only source of truth. Describe a strategy for building a test suite from scratch around an existing model. Which types of tests can you add immediately, and which require baseline measurements first?
5. **Conceptual:** A data scientist on your team says "We don't need unit tests — we have model evaluation metrics. If the AUC is good, the code is correct." Write a rebuttal (3–5 sentences). Give at least two concrete examples of bugs that would not be caught by good evaluation metrics but would be caught by the testing patterns described in this chapter.