Keeping Your Jupyter Notebooks Clean

We’ve all been there: you’re about to commit a Jupyter notebook, but it’s full of massive dataframes, messy plots, or maybe it doesn’t even have a proper title. It’s just good hygiene to keep things tidy, especially if you’re sharing your work or putting it in version control.

Since notebooks are just JSON under the hood, we can easily whip up a zero-dependency Python script to do the “chores” for us.
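To make "just JSON" concrete, here is a minimal sketch of what a notebook file holds, assuming the nbformat 4 layout: a top-level "cells" list whose entries carry "cell_type" and "source", plus "outputs" and "execution_count" on code cells. The notebook content is made up for illustration.

```python
import json

# A hand-written, minimal notebook in the nbformat 4 layout (illustrative content).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# My analysis"]},
        {
            "cell_type": "code",
            "metadata": {},
            "source": ["print(1 + 1)"],
            "execution_count": 1,
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["2\n"]}
            ],
        },
    ],
}

# Round-trip through a JSON string, the same way a .ipynb file is read from disk.
data = json.loads(json.dumps(notebook))

# Everything the checks need is reachable with plain dict and list access.
cell_types = [cell["cell_type"] for cell in data["cells"]]
cells_with_outputs = [
    index for index, cell in enumerate(data["cells"]) if cell.get("outputs")
]

print(cell_types)
print(cells_with_outputs)
```

Markdown cells carry no "outputs" key at all, which is why it is safer to read it with .get() than with direct indexing.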

"""Check Jupyter notebooks for quality criteria."""

import json
import sys
from pathlib import Path
from typing import TypedDict, cast


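class Cell(TypedDict, total=False):
    """A cell in a Jupyter notebook.

    Not every key is present on every cell: markdown cells, for example,
    have no outputs or execution_count.
    """

    cell_type: str
    outputs: list[dict]
    execution_count: int | None
    source: list[str] | str
    metadata: dict[str, object]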


def first_cell_is_markdown(cells: list[Cell]) -> bool:
    """Check if the first cell in the notebook is a markdown cell."""
    first = cells[0]
    cell_type: str | None = first.get("cell_type")
    return cell_type == "markdown"


def outputs_are_empty(cells: list[Cell]) -> bool:
    """Check if all cells in the notebook have empty outputs."""
    for cell in cells:
        outputs = cell.get("outputs")
        if outputs:
            return False
    return True


def check_notebook(path: Path) -> bool:
    """Check if a notebook satisfies the quality criteria."""
    json_string = path.read_text(encoding="utf8")
    data = json.loads(json_string)

    cells = data.get("cells")
    if not cells:
        # Nothing to check: a notebook without cells passes by default.
        return True
    cells = cast("list[Cell]", cells)

    return all(
        (
            first_cell_is_markdown(cells),
            outputs_are_empty(cells),
            # Potentially more checks could be added here in the future.
        )
    )


def check_directory(path_str: str) -> int:
    """Check all notebooks in a directory and its subdirectories."""
    all_notebooks = Path(path_str).glob("**/*.ipynb")
    failed = [path for path in all_notebooks if not check_notebook(path)]

    if not failed:
        return 0

    failed_str = "\n\t".join(str(path) for path in failed)
    print(f"Failed check on notebooks:\n\t{failed_str}")
    return 1


if __name__ == "__main__":
    sys.exit(check_directory("."))

The script implements two checks:

first_cell_is_markdown: Passes only if the first cell of the notebook is a Markdown cell. This is a cheap way to make sure every notebook starts with a title or introduction.
outputs_are_empty: Passes only if no cell has stored outputs. This helps prevent committing large dataframes, plots, or potentially sensitive information that might be stored in the notebook’s execution results.
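As a quick illustration of how the two checks behave, the sketch below restates them compactly (with plain dicts instead of the Cell type, so the snippet stands alone) and runs them on two made-up cell lists. The names clean and dirty and their contents are assumptions for the example.

```python
def first_cell_is_markdown(cells: list[dict]) -> bool:
    """Compact restatement of the check from the script."""
    return cells[0].get("cell_type") == "markdown"


def outputs_are_empty(cells: list[dict]) -> bool:
    """Compact restatement: True only if no cell has stored outputs."""
    return not any(cell.get("outputs") for cell in cells)


# A tidy notebook: title first, code cell with cleared outputs.
clean = [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "outputs": [], "execution_count": None},
]

# An untidy one: starts with code and still carries a result.
dirty = [
    {
        "cell_type": "code",
        "source": ["x"],
        "outputs": [{"output_type": "execute_result"}],
        "execution_count": 1,
    },
]

clean_result = (first_cell_is_markdown(clean), outputs_are_empty(clean))
dirty_result = (first_cell_is_markdown(dirty), outputs_are_empty(dirty))

print(clean_result)
print(dirty_result)
```

The clean list passes both checks and the dirty list fails both, so check_notebook would accept only the first.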

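The check_directory function uses Path(path_str).glob("**/*.ipynb") to find every notebook under the given directory, including all subdirectories; the script calls it with ".", the current working directory. It runs check_notebook on each file, prints the paths that fail, and returns 1 if any did (0 otherwise), so sys.exit passes a non-zero exit code to whatever invoked the script, such as a CI job or a Git hook. Using the pathlib module keeps the path handling cross-platform.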
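One caveat with a recursive glob: Jupyter typically writes autosave copies into hidden .ipynb_checkpoints folders, and Path.glob will pick those up too. The sketch below shows one possible way to skip them; find_notebooks is a hypothetical helper, not part of the script above, and the temporary directory exists only to demonstrate it.

```python
import tempfile
from pathlib import Path


def find_notebooks(root: Path) -> list[Path]:
    """Hypothetical helper: recursive glob that skips checkpoint folders."""
    return sorted(
        path
        for path in root.glob("**/*.ipynb")
        if ".ipynb_checkpoints" not in path.parts
    )


# Build a throwaway directory tree to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    checkpoints = root / "sub" / ".ipynb_checkpoints"
    checkpoints.mkdir(parents=True)

    (root / "top.ipynb").write_text("{}", encoding="utf-8")
    (root / "sub" / "nested.ipynb").write_text("{}", encoding="utf-8")
    (checkpoints / "nested-checkpoint.ipynb").write_text("{}", encoding="utf-8")

    # Store paths relative to the root so they outlive the temporary directory.
    found = [path.relative_to(root).as_posix() for path in find_notebooks(root)]

print(found)
```

Filtering on path.parts rather than on the path string avoids accidentally excluding a notebook whose file name merely contains the word "checkpoints".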

Automating these little chores saves a lot of headaches in the long run. The script is a simple base that you can tweak, or extend with more rules, as you go.
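As an example of what an additional rule might look like, here is a sketch of a check that code cells carry no leftover execution counts. The function execution_counts_are_cleared is hypothetical and not part of the original script; to use it, it would be added as one more entry in the tuple inside check_notebook.

```python
def execution_counts_are_cleared(cells: list[dict]) -> bool:
    """Hypothetical extra rule: no code cell keeps an execution count."""
    return all(
        cell.get("execution_count") is None
        for cell in cells
        if cell.get("cell_type") == "code"
    )


# Made-up cell lists: one never executed (or cleared), one executed.
fresh = [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "outputs": [], "execution_count": None},
]
executed = [
    {"cell_type": "markdown", "source": ["# Title"]},
    {"cell_type": "code", "source": ["x = 1"], "outputs": [], "execution_count": 3},
]

fresh_ok = execution_counts_are_cleared(fresh)
executed_ok = execution_counts_are_cleared(executed)

print(fresh_ok, executed_ok)
```

Markdown cells are skipped because they have no execution count to begin with.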
