
Beyond Prompt Engineering: From Code Coverage to Code Confidence, Master Unit Tests with GitHub Copilot Agent



Introduction

As we bring 2025 to a close, software engineers like me find ourselves immersed in all things AI from every direction: Medium blogs, AI forums like Rundown AI, AI meetups, and more, all in a constant state of flux. The message couldn't be clearer: ignoring AI is no longer an option. Those who resist these tools risk falling drastically behind in an industry moving at lightning speed. The greatest folly isn't experimenting with AI and making mistakes; it's refusing to adapt at all.

But here’s the critical insight most engineers miss: it’s not about using AI; it’s about using AI systematically.

Take the example from my last post: writing comprehensive unit tests for data‑engineering pipelines isn't just about achieving high coverage numbers; it's about building confidence that your transformations work correctly under all conditions. GitHub Copilot Agent turns this challenge from a tedious manual process into a systematic, AI‑assisted workflow that drives both coverage and quality.

The Challenge: Testing AWS Glue Pipelines

Testing AWS Glue ETL pipelines raises a distinct set of hurdles:

  1. Environment Dependencies – Tests require awsglue modules that are only available in specific Docker containers.
  2. Complex Data Transformations – Bronze, Silver, and Gold layer pipelines contain intricate business logic.
  3. Integration Points – S3 interactions, DynamicFrames, Spark sessions, and AWS service mocking.
  4. Scale – Multiple pipelines (Plain and Purchase Order) across three transformation layers.

Traditional approaches often result in inconsistent test quality, missing edge cases, poor coverage of critical paths, and unreliable refactoring.

The Solution: Structured Copilot Agent Methodology

1. Create a Reusable Prompt Template

Store this file as .github/prompts/unit-tester.prompt.md:

# Unit Test Generation Guidelines (Python)

You are a unit test generator assistant for Python code.  
Strictly follow these rules when generating tests:

## Test Structure

Use the AAA (Arrange‑Act‑Assert) pattern for structuring tests:

1. **Arrange** – Set up test data, fixtures, and preconditions.
2. **Act** – Execute the function/method under test.
3. **Assert** – Verify results and side effects.

## Naming Convention

- Name tests as `test_should_<expected_behavior>_when_<state_under_test>`; the `test_` prefix is required for pytest discovery.
- Use clear, descriptive names that document the test's purpose.
- For pytest, use snake_case function names.

## Best Practices

1. Test one behavior per test.
2. Use meaningful, representative test data.
3. Handle expected exceptions using `pytest.raises`.
4. Add comments and a docstring per test.
5. Include negative tests and edge cases (empty strings, None, large values).
6. Prefer pure functions and deterministic outcomes for unit tests.
7. Use fixtures (`@pytest.fixture`) for shared setup instead of global state.
8. Mock external systems (I/O, network, AWS) using `pytest-mock`.
9. Add integration tests for critical paths.

## Additional Rules

1. Revalidate and think step‑by‑step.
2. Always follow the AAA pattern.
3. Keep assertions focused and specific.
4. Isolate unit tests from environment and external services.
5. When testing PySpark/Glue code, prefer isolating transformation logic into small functions that can be tested without cluster dependencies; use a local SparkSession only when necessary.
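
To make the last rule concrete, here is a minimal sketch of pulling transformation logic out of a Glue job into a pure function that can be tested against plain DataFrames with a local SparkSession. The module and function names are illustrative only, not taken from the actual pipelines:

# transforms.py – illustrative only; the real jobs in the repo differ
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


def standardise_amounts(df: DataFrame) -> DataFrame:
    """Pure transformation: no GlueContext, no S3, no job bookkeeping."""
    return (
        df.withColumn("amount", F.col("amount").cast("double"))
        .filter(F.col("amount").isNotNull())
    )


# tests/test_transforms.py – exercised with a local SparkSession fixture
def test_should_drop_rows_when_amount_is_null(spark_session):
    # Arrange
    df = spark_session.createDataFrame(
        [("o1", "10.50"), ("o2", None)], ["order_id", "amount"]
    )

    # Act
    result = standardise_amounts(df)

    # Assert
    assert result.count() == 1

Because the function touches only DataFrame APIs, the test needs no awsglue import and can run wherever PySpark is installed.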

2. Establish Test Infrastructure

Docker Configuration (docker-compose.yaml)

Mount the tests, scripts, and tooling configuration into the container that ships the awsglue runtime (only the volumes section is shown here):

volumes:
  - ./tests:/app/tests
  - ./scripts:/app/scripts
  - ./pytest.ini:/app/pytest.ini
  - ./pyproject.toml:/app/pyproject.toml
  - ./poetry.lock:/app/poetry.lock

Test Runner Script (scripts/run-tests-docker.bash)

#!/usr/bin/env bash
# Usage: ./scripts/run-tests-docker.bash <command>
# Commands: all, plain, po, bronze, silver, gold, fast, coverage, shell, clean

set -e

CMD=$1

case $CMD in
  all)      poetry run pytest -v --cov=. --cov-report=html ;;
  plain)    poetry run pytest -v tests/plain ;;
  po)       poetry run pytest -v tests/purchase_order ;;
  bronze)   poetry run pytest -v tests/*/test_bronze_job.py ;;
  silver)   poetry run pytest -v tests/*/test_silver_job.py ;;
  gold)     poetry run pytest -v tests/*/test_gold_job.py ;;
  fast)     poetry run pytest -v --maxfail=1 ;;
  coverage) poetry run pytest -v --cov=. --cov-report=html ;;
  shell)    /bin/bash ;;
  clean)    rm -rf .pytest_cache htmlcov .coverage ;;
  *) echo "Unknown command: $CMD" && exit 1 ;;
esac

Pytest Configuration (pytest.ini)

[pytest]
testpaths = tests
python_files = test_*.py
python_classes = Test*
python_functions = test_*
addopts = -v --cov=. --cov-report=html --cov-report=term

3. Use Copilot Agent with Context

When generating tests, provide Copilot Agent with:

  • Your Prompt Template (@.github/prompts/unit-tester.prompt.md)
  • The Code to Test (e.g., plain/bronze_job.py or purchase_order/bronze_job.py)
  • Shared Fixtures (tests/conftest.py)

Example Prompt

@unit-tester.prompt.md Generate comprehensive unit tests for plain/bronze_job.py
focusing on:
- Data ingestion from S3
- Schema validation
- Error handling for malformed data
- Logging functionality
- Integration with GlueContext

4. Organize Tests by Layer and Concern

tests/
├── conftest.py                    # Shared fixtures
├── plain/
│   ├── test_bronze_job.py         # 4 test classes, ~12 tests
│   ├── test_silver_job.py         # 4 test classes, ~15 tests
│   └── test_gold_job.py           # 4 test classes, ~14 tests
└── purchase_order/
    ├── test_bronze_job.py         # 4 test classes, ~10 tests
    ├── test_silver_job.py         # 4 test classes, ~17 tests
    └── test_gold_job.py           # Similar structure
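
The shared conftest.py is not reproduced in full here, but a minimal sketch of the kind of fixture the generated tests lean on (the spark_session used below) could look like this; the real fixtures in the repo may differ:

# tests/conftest.py – illustrative sketch of a shared fixture
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark_session():
    """Local SparkSession for tests that need real DataFrames."""
    spark = (
        SparkSession.builder.master("local[1]")
        .appName("glue-unit-tests")
        .getOrCreate()
    )
    yield spark
    spark.stop()

pytest-mock supplies the mocker fixture automatically once the plugin is installed, so only the Spark-related setup needs to live in conftest.py.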

5. Example AI‑Generated Test Class

class TestBronzeJobDataIngestion:
    """Test data ingestion functionality for bronze layer."""

    def test_should_read_data_from_s3_when_valid_path_provided(
        self, spark_session, mocker
    ):
        """Arrange‑Act‑Assert: Valid S3 path should successfully read data."""
        # Arrange
        mock_glue_context = mocker.Mock()
        s3_path = "s3://bucket/data/"
        expected_count = 100

        # Mock the DynamicFrame creation
        mock_dynamic_frame = mocker.Mock()
        mock_dynamic_frame.count.return_value = expected_count
        mock_glue_context.create_dynamic_frame.from_options.return_value = (
            mock_dynamic_frame
        )

        # Act
        result = read_bronze_data(mock_glue_context, s3_path)

        # Assert
        assert result.count() == expected_count
        mock_glue_context.create_dynamic_frame.from_options.assert_called_once()

    def test_should_raise_error_when_s3_path_is_empty(self, mocker):
        """Arrange‑Act‑Assert: Empty S3 path should raise ValueError."""
        # Arrange
        mock_glue_context = mocker.Mock()
        invalid_path = ""

        # Act & Assert
        with pytest.raises(ValueError, match="S3 path cannot be empty"):
            read_bronze_data(mock_glue_context, invalid_path)

    def test_should_handle_missing_files_when_path_not_found(
        self, spark_session, mocker
    ):
        """Arrange‑Act‑Assert: Missing S3 path should raise FileNotFoundError."""
        # Arrange
        mock_glue_context = mocker.Mock()
        mock_glue_context.create_dynamic_frame.from_options.side_effect = (
            FileNotFoundError("Path not found")
        )
        s3_path = "s3://bucket/nonexistent/"

        # Act & Assert
        with pytest.raises(FileNotFoundError):
            read_bronze_data(mock_glue_context, s3_path)
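
For context, a read_bronze_data function that would satisfy the tests above might look roughly like the sketch below; it is a plausible shape for such a function, shown for illustration rather than the repo's actual bronze job code:

def read_bronze_data(glue_context, s3_path):
    """Read raw source data from S3 into a DynamicFrame via the GlueContext."""
    if not s3_path:
        raise ValueError("S3 path cannot be empty")
    # Any failure from Glue (e.g. a missing path) propagates to the caller
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [s3_path]},
        format="json",
    )

Because the tests pass in a mocked GlueContext, none of this touches S3 or requires a running Glue job.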

Real Results: Coverage and Quality Metrics

Coverage Achieved

| Name | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| plain/bronze_job.py | 45 | 6 | 95% | 105-114, 135 |
| plain/silver_job.py | 36 | 1 | 97% | 105 |
| plain/gold_job.py | 47 | 1 | 98% | 165 |
| purchase_order/bronze_job.py | 42 | 6 | 86% | 149-158, 175 |
| purchase_order/silver_job.py | 42 | 1 | 98% | 166 |
| purchase_order/gold_job.py | 44 | 1 | 98% | 152 |
| TOTAL | 252 | 16 | 94% | |

Test Suite Statistics

  1. Plain Pipeline – 41 tests across 3 layers
  2. Purchase Order Pipeline – 44 tests across 3 layers
  3. Total – 85+ unit tests (plus integration suite)
  4. Execution Time – under 30 seconds for the full suite

Quality Improvements

  • Consistent AAA structure across all tests
  • Clear, descriptive naming (should_*_when_*)
  • Comprehensive edge‑case coverage (null values, empty data, malformed input)
  • Full mocking of external dependencies (S3, AWS services)
  • Detailed docstrings for every test

Key Strategies That Worked

  1. Prompt Engineering as Code – Storing the prompt template in version control guarantees uniform standards.
  2. Iterative Refinement – Initial generations are generic; refined prompts yield high‑quality tests.
  3. Context‑Aware Generation – Supplying fixtures, naming conventions, and business rules guides Copilot to produce relevant tests.
  4. Docker‑First Testing – Guarantees a reproducible environment where awsglue modules are available.
  5. Layered Test Strategy – Bronze focuses on ingestion, Silver on transformations, Gold on aggregations; integration tests validate end‑to‑end flow.

Practical Workflow

  1. Define Your Standards – Create .github/prompts/unit-tester.prompt.md.
  2. Set Up Infrastructure – Docker volumes, pytest config, shared fixtures (conftest.py).
  3. Generate Tests with Copilot – Use the prompt template and explicit requirements.
  4. Review and Refine – Run tests, examine coverage, ask Copilot to fill gaps.
  5. Iterate on Uncovered Code – Target specific uncovered lines with new prompts (see the example below).
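
For example, a follow-up prompt for step 5, driven by the coverage report above, might look like:

@unit-tester.prompt.md The coverage report shows lines 105-114 and 135 of
plain/bronze_job.py are still uncovered. Generate additional tests for those
paths, following the AAA structure and reusing the fixtures in tests/conftest.py.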

Lessons Learned

  • Structured prompts dramatically improve test consistency.
  • Docker eliminates environment‑related failures.
  • Incremental, layer‑by‑layer testing is more manageable than tackling the whole project at once.
  • Coverage reports guide focused test generation.

Advanced Techniques

  1. Parametrized Test Generation – Produce data‑driven tests with @pytest.mark.parametrize (see the sketch after this list).
  2. Property‑Based Testing – Use hypothesis to validate invariants across arbitrary inputs.
  3. Integration Test Generation – Write end‑to‑end tests that run the full Bronze→Silver→Gold pipeline inside Docker.
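
As a quick illustration of the first two techniques, here is a self-contained sketch; the validation helper is hypothetical and stands in for whichever small, pure function you want to exercise:

# Hypothetical helper plus data-driven and property-based tests
import pytest
from hypothesis import given, strategies as st


def is_valid_s3_path(path):
    """Return True only for non-empty strings using the s3:// scheme."""
    return bool(path) and path.startswith("s3://")


@pytest.mark.parametrize(
    "path, expected",
    [
        ("s3://bucket/data/", True),   # happy path
        ("", False),                   # empty string edge case
        (None, False),                 # missing value edge case
        ("/tmp/local", False),         # wrong scheme
    ],
)
def test_should_validate_s3_path_when_given_various_inputs(path, expected):
    # Arrange & Act
    result = is_valid_s3_path(path)

    # Assert
    assert result is expected


@given(st.text())
def test_should_never_raise_when_path_is_arbitrary_text(path):
    # Property: the validator always returns a bool and never raises
    assert is_valid_s3_path(path) in (True, False)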

Measuring Success

| Metric | Before Copilot | After Copilot |
|---|---|---|
| Test Coverage | 35% | 95% |
| Time to Write Tests (per module) | 2-3 h | 30-45 min |
| Bugs Detected Pre-Production | 0 (undetected) | 12 (caught) |
| Team Confidence | Low | High |
| ROI (time saved) | – | ~60% reduction |

Conclusion

GitHub Copilot Agent isn’t merely a code‑completion tool—it’s a disciplined partner that turns unit‑test generation into a repeatable, high‑quality process. By combining a clear prompt template, Docker‑first infrastructure, and systematic coverage analysis, you achieve real code confidence, not just high coverage numbers. The workflow scales from a single module to an entire data‑engineering codebase, delivering faster development cycles, fewer production bugs, and a stronger testing culture.

For the full Copilot-driven unit and integration test implementation, you can browse my commits here, or read the accompanying testing README doc.


Till next time, Happy Coding, and I wish you a happy and safe Xmas holiday season ahead 🎄🎅🤶🧑‍🎄!
