SMODER Tutorial 01: Mouse Brain H3K27ac Quick Start

This tutorial demonstrates how to run the current validated SMODER workflow on a mouse brain RNA + peak dataset.

This notebook is a runnable quick-start example for the smoder package. It walks through the full workflow:

configuration
input checking
data loading
preprocessing
feature engineering
graph construction
model training
result saving
output inspection

This tutorial is intended as a development-stage example and may continue to evolve as the package structure is refined.

Before you begin

This tutorial assumes that:

the smoder package has already been installed
a compatible Python environment is available
the required input datasets have already been prepared
the input data files are readable from the environment in which this notebook runs

At the current stage, this tutorial is based on the validated mouse brain H3K27ac example workflow.
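
The first two assumptions above can be checked before running any tutorial cell. The following is a minimal sketch that uses only the standard library; check_packages is a hypothetical helper written for this illustration and is not part of smoder.

```python
import importlib.util
import sys


def check_packages(names):
    """Return a dict mapping each top-level package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# "smoder" and "pandas" are the third-party packages imported later in this tutorial.
status = check_packages(["smoder", "pandas"])
print("Python:", sys.version.split()[0])
for name, available in status.items():
    print(f"{name}: {'available' if available else 'NOT FOUND'}")
```

If either package is reported as not found, the import cell below will fail, so it is worth fixing the environment first.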

[ ]:

import os
import sys
from pprint import pprint

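import pandas as pd
from IPython.display import display  # explicit import so the preview cell does not rely on the notebook namespace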

import smoder
from smoder.config.defaults import (
    get_mousebrain_h3k27ac_base_config,
    get_mousebrain_h3k27ac_params,
)
from smoder.pipelines.mousebrain_h3k27ac import (
    load_and_show_data,
    init_model_and_preprocess,
    feature_engineering_and_graph_build,
    train_model,
    save_results,
)

print("Python executable:")
print(sys.executable)

print("\nSMODER package location:")
print(smoder.__file__)

Step 1. Define tutorial settings

We first define a tutorial-specific run name and choose whether to perform full training.

If RUN_TRAINING = False, this notebook can still be used for configuration checking and data loading without performing the full training step.

[ ]:

RUN_NAME = "tutorial_01_mousebrain_h3k27ac_v2"
RUN_TRAINING = True

base_config = get_mousebrain_h3k27ac_base_config(run_name=RUN_NAME)
params = get_mousebrain_h3k27ac_params()

# Keep the tutorial lightweight
params["epochs"] = 10
params["model_save"] = False

Step 2. Review configuration

The configuration contains:

input data paths
output directory
model save directory
preprocessing parameters
training parameters

If the paths printed below do not match your environment, update the corresponding entries in base_config before continuing.

[ ]:

print("Base configuration:")
pprint(base_config)

print("\nParameter configuration:")
pprint(params)
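
One way to adjust paths for a different environment is to override individual entries while leaving the defaults untouched. The sketch below assumes that base_config behaves like a plain Python dict, which the key lookups later in this tutorial suggest but do not guarantee. The helper with_overrides and all paths shown are hypothetical placeholders, not SMODER defaults.

```python
import copy


def with_overrides(config, **overrides):
    """Return a copy of `config` with selected keys replaced.

    Unknown keys raise KeyError so that a typo does not silently add a new entry.
    """
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"Unknown configuration keys: {unknown}")
    updated = copy.deepcopy(config)
    updated.update(overrides)
    return updated


# Stand-in for base_config; the paths are placeholders, not the real defaults.
example_config = {
    "sc_rna_path": "/path/to/sc_rna_file",
    "st_rna_path": "/path/to/st_rna_file",
    "st_adt_path": "/path/to/st_peak_file",
    "output_dir": "/path/to/output",
}

custom_config = with_overrides(example_config, output_dir="/path/to/my_output")
print(custom_config["output_dir"])
```

Working on a copy keeps the original defaults available for comparison, and rejecting unknown keys catches misspelled path names early.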

Step 3. Check required input files

The current validated workflow expects three input files:

single-cell reference RNA
spatial RNA
spatial peak data

We verify that all three files exist before proceeding. In the configuration, the spatial peak file is referenced by the st_adt_path key.

[ ]:

required_files = {
    "sc_rna_path": base_config["sc_rna_path"],
    "st_rna_path": base_config["st_rna_path"],
    "st_adt_path": base_config["st_adt_path"],
}

missing_files = []

for key, path in required_files.items():
    exists = os.path.exists(path)
    print(f"{key}: {path}")
    print(f"  -> {'FOUND' if exists else 'MISSING'}")
    if not exists:
        missing_files.append(path)

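if missing_files:
    # List every missing path so that all problems can be fixed in one pass
    raise FileNotFoundError(
        "Some required input files are missing:\n  "
        + "\n  ".join(map(str, missing_files))
    )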
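
The same check can be factored into a small reusable function, which is convenient when several configurations need to be validated. This is a sketch only: find_missing is a hypothetical helper, and the demonstration uses temporary files rather than the tutorial data.

```python
import os
import tempfile


def find_missing(paths):
    """Return the keys of `paths` whose values do not exist on disk."""
    return [key for key, path in paths.items() if not os.path.exists(path)]


# Demonstration with one real temporary file and one path that does not exist.
with tempfile.TemporaryDirectory() as tmp_dir:
    present = os.path.join(tmp_dir, "present.txt")
    with open(present, "w") as handle:
        handle.write("placeholder")
    demo_paths = {
        "present": present,
        "absent": os.path.join(tmp_dir, "absent.txt"),
    }
    missing = find_missing(demo_paths)

print(missing)
```

Returning the keys rather than raising immediately lets the caller decide whether a missing file is fatal.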

Step 4. Load input data

This step loads the validated mouse brain example data and prints basic information such as:

dataset shapes
modality types
number of cell types
selected categories

[ ]:

ref_dict, smo_dict, params = load_and_show_data(base_config, params)

Step 5. Initialize the model and preprocess the data

This step initializes the SMODER model and performs built-in preprocessing, including:

reference RNA preprocessing
spatial RNA preprocessing
second-modality preprocessing
spot alignment
informative gene selection

[ ]:

model = init_model_and_preprocess(ref_dict, smo_dict, params)

Step 6. Perform feature engineering and graph construction

In this step, the pipeline performs:

RNA feature engineering
second-modality feature engineering
spatial graph construction
feature graph construction

[ ]:

model, dim_rna, dim_modal2 = feature_engineering_and_graph_build(model, params)

print("RNA feature dimension:", dim_rna)
print("Second modality feature dimension:", dim_modal2)
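
This tutorial does not show how SMODER builds its graphs internally. As a generic illustration of what a spatial neighbor graph is, the sketch below links each spot to its k nearest neighbors by Euclidean distance using NumPy. The function knn_graph, the choice of k, and the toy coordinates are all assumptions made for this example and should not be read as the package's actual method.

```python
import numpy as np


def knn_graph(coords, k):
    """Return a symmetric 0/1 adjacency matrix linking each spot to its k nearest neighbors."""
    coords = np.asarray(coords, dtype=float)
    n_spots = coords.shape[0]
    if not 0 < k < n_spots:
        raise ValueError("k must satisfy 0 < k < number of spots")
    # Pairwise Euclidean distances between all spots.
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # Exclude self-matches by making the diagonal infinitely far away.
    np.fill_diagonal(dist, np.inf)
    neighbors = np.argsort(dist, axis=1)[:, :k]
    adjacency = np.zeros((n_spots, n_spots), dtype=int)
    rows = np.repeat(np.arange(n_spots), k)
    adjacency[rows, neighbors.ravel()] = 1
    # Symmetrize: keep an edge if either endpoint selected the other.
    return np.maximum(adjacency, adjacency.T)


# Four toy spots on a line; the coordinates are illustrative only.
toy_coords = [[0.0, 0.0], [1.0, 0.0], [2.5, 0.0], [10.0, 0.0]]
toy_adjacency = knn_graph(toy_coords, k=1)
print(toy_adjacency)
```

The dense distance matrix keeps the example short but scales quadratically with the number of spots, so a tree-based or approximate neighbor search would be preferable for real datasets.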

Step 7. Optional model training

The next step trains the model and produces the main deconvolution outputs.

For this tutorial, we keep the number of epochs intentionally small so that the workflow remains suitable for quick validation.

If RUN_TRAINING = False, the training and saving steps will be skipped.

[ ]:

adata_result = None
cross_fusion = None

if RUN_TRAINING:
    adata_result, cross_fusion = train_model(
        model=model,
        params=params,
        dim_rna=dim_rna,
        dim_modal2=dim_modal2,
        base_config=base_config,
    )
else:
    print("Training step skipped because RUN_TRAINING = False")

Step 8. Save results

If training has been performed, the outputs are saved to the tutorial-specific output directory.

[ ]:

if RUN_TRAINING:
    save_results(
        adata_result=adata_result,
        model=model,
        base_config=base_config,
        params=params,
    )
else:
    print("Save step skipped because training was not run.")

Step 9. Inspect output directory

We now inspect the output directory and list the generated files.

[ ]:

output_dir = base_config["output_dir"]

print("Output directory:")
print(output_dir)
print()

if os.path.exists(output_dir):
    for name in sorted(os.listdir(output_dir)):
        full_path = os.path.join(output_dir, name)
        if os.path.isdir(full_path):
            print(f"[DIR]  {name}")
        else:
            size_mb = os.path.getsize(full_path) / (1024 * 1024)
            print(f"[FILE] {name} ({size_mb:.2f} MB)")
else:
    print("Output directory does not exist.")
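
The listing above covers only the top level of the output directory. If results are written into subdirectories, a recursive listing may be more informative. The following sketch uses pathlib; list_files is a hypothetical helper, and the file names in the demonstration are made up rather than actual SMODER outputs.

```python
import tempfile
from pathlib import Path


def list_files(root):
    """Return (relative_path, size_in_bytes) for every file under `root`, sorted by path."""
    root = Path(root)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


# Demonstration on a temporary directory with one nested file.
with tempfile.TemporaryDirectory() as tmp_dir:
    nested = Path(tmp_dir) / "figures"
    nested.mkdir()
    (Path(tmp_dir) / "summary.txt").write_text("abc")
    (nested / "plot.txt").write_text("hello")
    listing = list_files(tmp_dir)

print(listing)
```

Calling list_files(output_dir) after training would apply the same listing to the tutorial output.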

Step 10. Preview the cell type proportions

If training completed successfully, the output directory should contain a cell_type_proportions.csv file. We preview its first few rows below.

[ ]:

cell_type_csv = os.path.join(output_dir, "cell_type_proportions.csv")

if os.path.exists(cell_type_csv):
    df_props = pd.read_csv(cell_type_csv)
    print("cell_type_proportions.csv shape:", df_props.shape)
    display(df_props.head())
else:
    print("cell_type_proportions.csv not found.")
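
Beyond previewing rows, two quick sanity checks are often useful for a proportion table: whether each row sums to one, and which cell type dominates each spot. The sketch below assumes a layout of one row per spot and one column per cell type, with spot identifiers in the index. The tutorial does not document the actual layout of cell_type_proportions.csv, so the table may first need to be loaded with an index column or have non-numeric columns dropped. The helper summarize_proportions and the toy names are hypothetical.

```python
import pandas as pd


def summarize_proportions(df, tolerance=1e-6):
    """Return per-spot row sums, a sums-to-one flag, and the dominant cell type.

    Assumes every column of `df` is a cell type and every row is a spot.
    """
    summary = pd.DataFrame(index=df.index)
    summary["row_sum"] = df.sum(axis=1)
    summary["sums_to_one"] = (summary["row_sum"] - 1.0).abs() <= tolerance
    summary["dominant_type"] = df.idxmax(axis=1)
    return summary


# Toy table with hypothetical spot and cell type names.
toy = pd.DataFrame(
    {"TypeA": [0.7, 0.2, 0.5], "TypeB": [0.3, 0.8, 0.4]},
    index=["spot_1", "spot_2", "spot_3"],
)
toy_summary = summarize_proportions(toy)
print(toy_summary)
```

Rows flagged as not summing to one are not necessarily errors; they only indicate that the output is not normalized in the way this sketch assumes.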

Step 11. Inspect result object metadata

If training has been executed, we can also inspect some basic information stored in the result object.

[ ]:

if adata_result is not None:
    print("Result AnnData shape:")
    print(adata_result.shape)

    print("\nAvailable obsm keys:")
    print(list(adata_result.obsm.keys()))

    print("\nAvailable uns keys (first 20):")
    print(list(adata_result.uns.keys())[:20])
else:
    print("No result object is available because training was skipped.")

Summary

In this tutorial, we completed a full SMODER quick-start workflow for the current validated mouse brain RNA + peak example.

This notebook covered:

configuration preparation
input file checking
data loading
model initialization and preprocessing
feature engineering and graph construction
optional model training
result saving
output inspection
result preview

This notebook is the first practical tutorial for the SMODER package and will be refined as the package, its documentation, and the public release workflow mature.

Next steps

For downstream visualization and representative result figures, please see:

Tutorial 02: Mouse Brain H3K27ac Result Visualization
Tutorial 03: Simulated Human Melanoma Result Visualization
Tutorial 04: HBC Result Visualization

These tutorials describe how to generate representative SMODER result figures for each dataset.