OCT. 17, 2025
16 Min Read
The business problem: ensuring trustworthy data in production environments
Trustworthy data rests on fundamentals like accuracy, consistency, timeliness, and overall data quality. When a dataset is promoted to production, engineering teams must be confident that it is genuinely production-ready. Reaching that level requires more than good intentions: it means rigorous quality checks, validated ingestion paths, and transformation and aggregation logic verified end-to-end.
One of the best ways to assert the correctness of logic is with unit tests. By evaluating small, isolated components, unit tests confirm each piece behaves as intended, which in turn stabilizes the broader pipeline. In practice, we pursue this assurance through comprehensive testing across our pipelines, with unit tests providing the core diligence that keeps systems robust.
This write-up tackles the challenge directly with a practical walkthrough for adding unit tests to Python/PySpark data projects. I’ll also show how to automate those tests in GitHub CI so changes can’t land in a development branch unless they pass. Adopting these habits raises the reliability of your pipelines and builds a stronger foundation for data-driven decision-making across your organization.
Introduction
Unit testing is a staple of sound software engineering, ensuring individual parts of a codebase behave as designed. In data engineering, especially with Python and PySpark, this matters even more because pipelines underpin critical analytics and applications. In this post, I’ll outline how to unit test ingestion modules, convert Databricks notebooks into importable Python modules, and wire everything into a clean CI/CD flow on GitHub. We’ll cover environment setup, test organization, using Pytest effectively, and integrating coverage reporting in CI. If your goals are higher reliability and a smoother testing workflow, the steps below will get you there.
What is unit testing?
Unit testing means writing focused, automated checks for small slices of functionality, typically a function or method. These tests verify expected behavior and identify defects early in the development process. Because they’re fast and deterministic, they give you the safety net to refactor confidently and keep code quality high over time. Unit tests fit naturally with TDD and continuous integration, and they work best when they stay narrow in scope and quick to execute.
Unit testing in Python and PySpark
Python offers the built-in unittest framework as well as pytest, a popular third-party option that favors concise, readable tests. Both can do the job; they just take different approaches. Here’s a quick comparison that informed my choice:
- Ease of use & syntax unittest: Follows a JUnit-like, class-based style using unittest.TestCase, with methods prefixed by test_ and assertions like self.assertEqual() or self.assertTrue(). pytest: Emphasizes simplicity; tests can be plain functions and use Python’s assert directly, which keeps test code compact and readable.
- Test discovery & execution unittest: You’ll invoke unittest.main() or use python -m unittest discover for discovery. pytest: Automatically finds tests in files named with test_ and runs them with a simple pytest command.
- Fixtures & setup/teardown unittest: Uses setUp()/tearDown() methods in classes for preparation and cleanup. pytest: Provides a flexible fixture system with scoping (function/class/module/session) and easy reuse/parameterization.
- Mocking support unittest: Ships with unittest.mock for mocks and patches. pytest: Works with unittest.mock and also supports pytest-mock for a smoother experience.
- Plugins & extensibility unittest: Limited out of the box; add-ons required for extras. pytest: Rich plugin ecosystem—pytest-cov (coverage), pytest-mock (mocking), pytest-spark (PySpark helpers), and more.
- Parameterization unittest: Typically uses subTest() for multiple input variants. pytest: Offers @pytest.mark.parametrize which is straightforward and expressive.
For this project, I went with pytest because of its fixtures, readability, and extensibility.
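To make the contrast concrete, here is the same check written both ways; divide is just a placeholder function for illustration:

import unittest

import pytest

def divide(a, b):
    return a / b

# unittest style: class-based, framework-specific assertion methods
class TestDivide(unittest.TestCase):
    def test_divide(self):
        self.assertEqual(divide(10, 2), 5)

# pytest style: plain functions and plain assert statements
def test_divide():
    assert divide(10, 2) == 5

def test_divide_by_zero():
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)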
Converting Databricks notebooks into Python modules
In this repo, the pipelines/modules folder holds Databricks notebooks and Python modules containing ingestion and processing logic. These aren’t meant to run as standalone scripts; they’re building blocks invoked by jobs or workflows. That’s why functions should be factored into reusable Python modules that we can import wherever needed.
A Python module here is simply a .py file that groups related functions—often by data source or function type.
To write unit tests, the code must live in a module. If it currently sits in a Databricks notebook (or similar), move it into a Python file first.
Steps to convert a Databricks notebook:
- Databricks notebooks saved as .py include # Databricks notebook source headers when viewed in an editor.
- To turn a notebook into a module:
- Create a new .py file and paste the notebook code.
- Remove all # Databricks notebook source and # COMMAND ---------- markers; they only indicate cell boundaries (see the example below).
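For example, a notebook exported as a .py file and its converted module might look like this (the functions are illustrative, not code from the repo):

# --- Before: notebook source as exported by Databricks ---
# Databricks notebook source
import pyspark.sql.functions as F

# COMMAND ----------

def add_ingest_date(df):
    return df.withColumn("ingest_date", F.current_date())

# COMMAND ----------

def drop_empty_rows(df):
    return df.dropna(how="all")

# --- After: plain module, e.g. pipelines/modules/example_processing.py ---
import pyspark.sql.functions as F

def add_ingest_date(df):
    return df.withColumn("ingest_date", F.current_date())

def drop_empty_rows(df):
    return df.dropna(how="all")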

If the notebook relies on dbutils (available in Databricks notebooks by default), you’ll need to refactor because dbutils doesn’t exist in a plain Python environment. I’ll outline an approach to replace it shortly.
Also note that a Spark session is implicitly available in a Databricks notebook. In a standalone module it isn’t, so if your code references spark, you must create/import it explicitly:
from pyspark.sql import SparkSession
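One common option (a sketch, not the only way) is to create or fetch the session inside the module, or better, to pass it in as a parameter so tests can inject a mock; the function names below are illustrative:

from pyspark.sql import SparkSession

def get_spark() -> SparkSession:
    """Return the active SparkSession, creating one if none exists."""
    return SparkSession.builder.getOrCreate()

def read_orders(spark: SparkSession, path: str):
    """Illustrative reader that takes the session as an explicit argument."""
    return spark.read.format("delta").load(path)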
Making modules importable in the repository
Another prerequisite for testing is ensuring Python recognizes your directories as packages. Add an __init__.py file to each directory (and subdirectory) you plan to import from. The file can be empty; its presence is what matters.
In my layout, __init__.py appears in both pipelines/ and pipelines/modules/, which makes each a valid package. That way, tests can import modules directly and target functions unambiguously.

Setting up a virtual environment for Pytest
To run tests locally, set up an isolated environment:
- Clone the repository and change to the project root.
- Create a virtual environment: python -m venv venv
- Activate it:
- macOS/Linux: source venv/bin/activate
- Windows: venv\Scripts\activate
- Install dependencies: pip install -r tests/requirements.txt
- Run tests: pytest
Organizing test files
Pytest discovers tests by filename patterns. The convention used here is:
- Test directory: tests/
- Test file name: test_module_<module_name>.py
- Example: for google_sheet_processing.py, the test file lives at tests/test_module_google_sheet_processing.py.

First, add a requirements.txt for test dependencies. You might not need it for local, ad-hoc runs, but CI will. During CI, the runner installs the packages listed there.
While developing, install libraries into your virtual environment with pip install. As you add dependencies, record each one in requirements.txt right away—starting with pytest.
Pinning versions is generally safer to avoid breakage due to upstream changes. If you hit conflicts, you can temporarily omit versions so CI resolves compatible sets automatically.
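As an illustration, a tests/requirements.txt for a project like this might contain something along these lines; the exact packages depend on which libraries your modules import:

# Test dependencies; pin versions once you have verified them (e.g., pytest==8.3.2)
pytest
pytest-cov
pytest-mock
databricks-connect
pandas
gspread
paramiko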

Writing Pytest cases
Create a test file in tests/ using the naming pattern test_module_<module_name>.py. For instance, if your module is google_sheet_processing.py, the test file should be tests/test_module_google_sheet_processing.py.
At the top of the test file, import the module under test. For example:
from pipelines.modules.google_sheet_processing import *
In many cases, import * is acceptable here because you’ll exercise most of the module’s functions, and it brings in the same library symbols your module uses (unless you prefer to import dependencies explicitly in the tests).
The CI coverage script I added enforces at least one unit test per function. In practice, write as many focused tests as necessary to exercise all relevant behavior.
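As a sketch, suppose the module exposes a small helper called build_sheet_name (a hypothetical function, shown here only to illustrate the shape of a test):

import pytest

from pipelines.modules.google_sheet_processing import *

# Hypothetical helper assumed for this illustration:
#   build_sheet_name(source: str, run_date: str) -> str
def test_build_sheet_name_formats_source_and_date():
    assert build_sheet_name("orders", "2025-10-17") == "orders_2025-10-17"

@pytest.mark.parametrize("source", ["", None])
def test_build_sheet_name_rejects_empty_source(source):
    with pytest.raises(ValueError):
        build_sheet_name(source, "2025-10-17")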
Pytest commands
Pytest includes convenient CLI options for driving your test run:
- Run all tests (defaults to tests/): pytest
- Run a single file: pytest test_file.py
- Run one test within a file: pytest test_file.py::test_function
- Verbose output: pytest -v
- Stop at first failure: pytest -x
- Filter by keyword: pytest -k "keyword"
- Run tests with a marker: pytest -m marker_name
- Run with coverage (requires pytest-cov). More on that below: pytest --cov=your_package
Pytest fixtures
Fixtures are one of pytest’s strongest features: they set up (and optionally tear down) shared state for tests, improving readability and reuse.
Why fixtures help:
- Reusability: Define once, use across many tests.
- Scoped lifetimes: Choose function/class/module/session based on how often you need the setup.
- Automatic setup/cleanup: Keep tests focused on assertions, not plumbing.
- Dependency injection: Request a fixture by naming it in a test’s parameters.
For example, test_module_google_sheet_processing.py defines fixtures to mock a gspread client, a Google Sheet, and a Spark DataFrame. Tests use them repeatedly, and you can pair them with patches when needed.
import gspread as gs
import pandas as pd
import pytest
from unittest.mock import MagicMock
from pyspark.sql import DataFrame
from pyspark.sql.types import DoubleType, IntegerType, StringType, StructField, TimestampType

@pytest.fixture
def mock_google_client():
    """Fixture to provide a mock gspread client instance."""
    return MagicMock(spec=gs.client.Client)

@pytest.fixture
def mock_google_sheet():
    """Fixture to provide a mock Google Sheet instance."""
    return MagicMock()

@pytest.fixture
def mock_df():
    """Fixture to provide a mocked Spark DataFrame that supports all test cases."""
    mock_spark_df = MagicMock(spec=DataFrame)
    # Mock the Pandas conversion (for test cases that call toPandas)
    mock_spark_df.toPandas.return_value = pd.DataFrame(
        {"column1": ["value1", "value2"], "column2": ["value3", "value4"]}
    )
    # Define schema with different column types (for clean_df_to_write_gsheet tests)
    mock_spark_df.dtypes = [
        ("decimal_col", "decimal"),
        ("int_col", "int"),
        ("float_col", "float"),
        ("double_col", "double"),
        ("timestamp_col", "timestamp"),
        ("string_col", "string"),
    ]
    # Mock schema access
    mock_spark_df.schema = {
        "decimal_col": StructField("decimal_col", DoubleType(), True),
        "int_col": StructField("int_col", IntegerType(), True),
        "float_col": StructField("float_col", DoubleType(), True),
        "double_col": StructField("double_col", DoubleType(), True),
        "timestamp_col": StructField("timestamp_col", TimestampType(), True),
        "string_col": StructField("string_col", StringType(), True),
    }
    return mock_spark_df
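With those fixtures in place, a test simply requests them by name. The function under test here (df_to_gsheet_values) and its expected behavior are illustrative, not the exact code in the repo:

def test_df_to_gsheet_values_converts_via_pandas(mock_df, mock_google_sheet):
    # Hypothetical function under test: turns a Spark DataFrame into a
    # list-of-lists payload (header row + data rows) for gspread.
    values = df_to_gsheet_values(mock_df)

    # The fixture's toPandas() mock drives the expected output
    mock_df.toPandas.assert_called_once()
    assert values[0] == ["column1", "column2"]
    assert ["value1", "value3"] in values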
Some fixtures are common across many test files. Instead of redefining them everywhere, place them in tests/conftest.py and they’ll be available project-wide. For example, I keep mock_spark, mock_secret_manager, and mock_sftp_client there as shared fixtures.
import paramiko
import pytest
from unittest.mock import MagicMock, Mock
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def mock_spark():
    """Fixture to provide a mocked SparkSession."""
    mock = MagicMock(spec=SparkSession)
    return mock

@pytest.fixture(scope="session")
def mock_secret_manager():
    """Fixture to provide a mocked SecretsManager."""
    mock = MagicMock()
    mock.get_secret.return_value = "mocked_secret_value"
    mock.secrets.list.return_value = ["key1", "key2", "key3"]
    return mock

@pytest.fixture
def mock_sftp_client():
    """Fixture to provide a mocked SFTP client."""
    mock = Mock(spec=paramiko.SFTPClient)
    return mock
Fixture scope determines lifecycle:
- function (default): new instance per test function.
- class: one instance per test class.
- module: one instance per module.
- session: one instance for the entire test run.
Use broader scopes for expensive setups (e.g., DB connections, Spark). Prefer function scope when isolation is more important than speed.
About patching
patch from unittest.mock lets you replace external dependencies with mocks so you can isolate the code under test. Apply it as a decorator (@patch) or a context manager (with patch(...):). The target object is swapped only within the designated scope and then restored, keeping tests clean and deterministic.
Why it matters here:
- No real API calls: Tests run quickly and don’t depend on live services (e.g., Google Sheets).
- Controlled return values: Simulate edge cases like empty results without altering production code.
- True unit isolation: Focus on logic specific to your function rather than network or I/O behavior.
In short, patching keeps your tests predictable and free from external side effects.
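For instance, a test might patch the gspread dependency so no real Google API call is made; the patch target and the get_google_client helper below are illustrative, not the repo's exact code:

from unittest.mock import MagicMock, patch

# Assumes the module under test calls gs.service_account() inside a
# hypothetical get_google_client() helper imported via the module import.
@patch("pipelines.modules.google_sheet_processing.gs")
def test_get_google_client_uses_service_account(mock_gs):
    mock_client = MagicMock()
    mock_gs.service_account.return_value = mock_client

    client = get_google_client()

    mock_gs.service_account.assert_called_once()
    assert client is mock_client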
Handling dbutils in Python modules
dbutils is available in Databricks notebooks but not in standalone Python execution. To make code testable outside Databricks, refactor any dbutils-dependent logic. (See my separate notes on replacing dbutils idioms.)
If a module genuinely needs dbutils, encapsulate it in a class that accepts a dbutils instance at initialization. Implement your logic as methods and inject dbutils only at runtime from a Databricks notebook. That avoids importing dbutils in plain Python and keeps tests feasible.
For example, secrets_manager.py demonstrates replacing dbutils.secrets usage in a way that works in both environments.
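A minimal sketch of that pattern, assuming a dbutils-style secrets interface (the real secrets_manager.py may differ in detail):

class SecretsManager:
    """Thin wrapper around dbutils.secrets so modules never import dbutils directly."""

    def __init__(self, dbutils):
        # dbutils is injected at runtime from a Databricks notebook or job
        self._dbutils = dbutils

    def get_secret(self, scope: str, key: str) -> str:
        return self._dbutils.secrets.get(scope=scope, key=key)

    def list_keys(self, scope: str):
        return [s.key for s in self._dbutils.secrets.list(scope)]

# In a Databricks notebook:  secrets = SecretsManager(dbutils)
# In unit tests:             pass a MagicMock in place of dbutils.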
In notebooks, SparkSession exists automatically; in modules, it does not, so import it explicitly:
from pyspark.sql import SparkSession
This ensures Spark operations work when running as a script.
Using Spark for testing via Databricks Connect
Certain Spark behaviors aren’t faithfully reproduced by mocks, so some tests are better run against a real Spark session. Databricks Connect provides that path.
Setup steps:
- Install the client: pip install databricks-connect
- Add databricks-connect to tests/requirements.txt so CI installs it.
- Configure your .databrickscfg (in your user home) with host, token, and profiles (e.g., default, dev, prod); an example follows below.
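For reference, a .databrickscfg along these lines (values are placeholders) defines a default profile plus named ones:

[DEFAULT]
host  = https://<your-workspace>.cloud.databricks.com
token = <personal-access-token>

[dev]
host  = https://<dev-workspace>.cloud.databricks.com
token = <dev-token>

[prod]
host  = https://<prod-workspace>.cloud.databricks.com
token = <prod-token>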

These credentials are used to connect to a Serverless cluster. You can point at an all-purpose cluster, but consider the trade-offs:
- If it’s always on, behavior mirrors serverless.
- If it spins down, each test run waits ~4–5 minutes for startup, which is painful in CI.
For speed and consistency (locally and in CI), serverless is the better choice.
Setting up a Pytest fixture for Spark connection
Since multiple tests across files need a real Spark connection, define a fixture in tests/conftest.py.
Importing DatabricksSession: Bring in DatabricksSession from databricks.connect:
from databricks.connect import DatabricksSession
Note that it’s DatabricksSession, not SparkSession.
Also import SparkSession separately for the mocked variant:
from pyspark.sql import SparkSession
This supports both real (connect) and mocked Spark in the same suite.
import os

import pytest
from databricks.connect import DatabricksSession

@pytest.fixture(scope="session")
def databricks_spark():
    """Fixture to provide a real SparkSession using Databricks Connect in CI/CD with Serverless enabled."""
    # Detect if running in GitHub Actions (the GITHUB_ACTIONS env var is set automatically)
    is_ci = os.getenv("GITHUB_ACTIONS") == "true"
    if is_ci:
        # Running in CI/CD → use environment variables provided by GitHub Secrets
        required_env_vars = ["DATABRICKS_HOST", "DATABRICKS_TOKEN", "DATABRICKS_HTTP_PATH"]
        for var in required_env_vars:
            if not os.getenv(var):
                raise ValueError(f"Missing required environment variable: {var}")
        print("Using GitHub Actions environment variables for Databricks SparkSession.")
        # Enable Serverless mode
        return DatabricksSession.builder.serverless().getOrCreate()
    # Running locally → use the DATABRICKS_HOST, DATABRICKS_TOKEN, and
    # DATABRICKS_HTTP_PATH environment variables set on your machine
    print("Running locally: using local environment variables for Databricks SparkSession.")
    return DatabricksSession.builder.serverless().getOrCreate()
Running tests with DatabricksSession
To use the Databricks-backed session, you only need:
DatabricksSession.builder.serverless().getOrCreate()
The serverless() call provisions a Serverless cluster for the session.
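A test that needs real Spark behavior can then request the databricks_spark fixture; for example (a generic sketch, not a test from the repo):

from pyspark.sql import functions as F

def test_aggregation_with_real_spark(databricks_spark):
    # Build a tiny DataFrame on the Serverless cluster and verify an aggregation
    df = databricks_spark.createDataFrame(
        [("a", 1), ("a", 2), ("b", 4)], ["key", "value"]
    )
    result = (
        df.groupBy("key")
        .agg(F.sum("value").alias("total"))
        .orderBy("key")
        .collect()
    )
    assert [(r["key"], r["total"]) for r in result] == [("a", 3), ("b", 4)]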
Authentication in CI/CD
In CI, environment variables are provided via GitHub Secrets:
- DATABRICKS_HOST
- DATABRICKS_TOKEN
- DATABRICKS_HTTP_PATH
These allow the test suite to authenticate to Databricks and use a Serverless cluster during runs.
Setting up environment variables locally
You can also set these directly on your workstation instead of using .databrickscfg.
Steps:
- Search for “Environment Variables” in your OS settings.
- Under “User variables for <your_username>”, add or edit entries.
- Create the following with appropriate values:
- DATABRICKS_HOST
- DATABRICKS_TOKEN
- DATABRICKS_HTTP_PATH
This approach keeps naming consistent between local and CI. Make sure your local environment variable names match the names configured in GitHub. For example, mine do not match because I had already configured them for a different project, so I kept my existing names and updated the CI configuration accordingly.

Adding unit tests to CI/CD
Unit tests run automatically via a dedicated GitHub Actions workflow. I created a UnitTests job in .github/workflows/unit-tests.yml. It operates independently from other workflows (e.g., infrastructure in ci-cd.yml).
name: Unit Tests

on:
  pull_request:
    branches:
      - dev
    paths:
      - pipelines/modules/**
      - tests/**
      - .github/workflows/unit-tests.yml

jobs:
  UnitTests:
    name: Unit Tests
    runs-on: ubuntu-latest
    timeout-minutes: 5
    env:
      DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST_DEV }}
      DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
      DATABRICKS_HTTP_PATH: ${{ secrets.DATABRICKS_WAREHOUSE_HTTP_PATH }}
      DATABRICKS_SERVERLESS_COMPUTE: "auto"
    steps:
      - name: Checkout
        uses: actions/checkout@v4
        with:
          fetch-depth: 0 # Fetch full history so git diff works
      - name: Set up Python 3.10
        uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install dependencies
        run: |
          pip install -r tests/requirements.txt
      - name: Run unit tests
        run: |
          pytest -v
      - name: Check for missing unit tests
        run: |
          chmod +x .github/scripts/check_missing_tests.sh
          .github/scripts/check_missing_tests.sh
      - name: Run tests with coverage
        run: |
          COVERAGE_THRESHOLD=0 # Set to 0 to disable failure enforcement
          pytest --cov=pipelines/modules --cov-report=term-missing --cov-fail-under=$COVERAGE_THRESHOLD
A small shell script (.github/scripts/check_missing_tests.sh) enforces two rules:
- Every module must have a corresponding test file.
- Every function must have at least one test.
If anything is missing, the CI job fails and logs the gaps.
Checking for missing unit tests in CI
The custom script scans modules and test files to ensure there’s at least one test per function. If it finds omissions, CI fails with a clear report so you can add what’s missing before merging.
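The repo's script is written in shell; to illustrate the idea in Python instead, a rough equivalent of the per-function check could look like this (paths and naming conventions follow the ones described above, and the heuristics are simplified):

import re
import sys
from pathlib import Path

MODULES_DIR = Path("pipelines/modules")
TESTS_DIR = Path("tests")
# Match top-level function definitions only
FUNC_RE = re.compile(r"^def (\w+)\(", re.MULTILINE)

missing = []
for module in MODULES_DIR.glob("*.py"):
    if module.name == "__init__.py":
        continue
    test_file = TESTS_DIR / f"test_module_{module.name}"
    if not test_file.exists():
        missing.append(f"No test file for {module.name}")
        continue
    test_source = test_file.read_text()
    for func in FUNC_RE.findall(module.read_text()):
        if f"test_{func}" not in test_source:
            missing.append(f"No test found for {module.name}::{func}")

if missing:
    print("Missing unit tests:")
    print("\n".join(missing))
    sys.exit(1)
print("All modules and functions have at least one test.")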

Using pytest-cov for coverage analysis
pytest-cov measures which lines and branches of your code were executed during tests. It highlights untested paths and skipped branches (e.g., if/else), and provides an overall percentage. This keeps attention on meaningful coverage, not just the presence of tests.
Steps taken to integrate test coverage in CI
To check both the existence and the effectiveness of tests, I implemented:
- Custom shell script in the GitHub workflow: the script inspects the codebase to confirm each function and module is covered by at least one test. If not, the job fails and asks for more tests. It also compares the feature branch's coverage with the latest dev branch to catch regressions before merge.
- pytest-cov for detailed line & branch coverage: coverage reports show which lines ran and which didn't, including branch coverage for conditional logic. Developers can review the line-by-line output to spot gaps and add tests where they matter most.
Comparison: shell script vs. pytest-cov

Why keep both approaches?
- The script guarantees a baseline: at least one test per function and per module.
- Coverage ensures those tests actually execute important code paths.
- Comparing feature branches against dev helps prevent coverage backslides over time.
Using both yields automated enforcement, better coverage, and consistently high-quality code.
Conclusion
Adding unit tests to your ingestion modules hardens your pipelines and makes the codebase easier to extend and maintain. Baking those tests into CI/CD adds automation and visibility through coverage reports, so quality stays high as the project evolves. With Pytest, Databricks Connect, and GitHub Actions, you can assemble a reliable testing workflow that surfaces issues early and improves developer confidence. As you adopt these practices, you’ll raise the bar on stability and elevate your data engineering process to a more professional and dependable standard.

