SMODER Tutorial 01: Mouse Brain H3K27ac Quick Start

This tutorial demonstrates how to run the current validated SMODER workflow on a mouse brain RNA + peak dataset.

This notebook is a runnable quick-start example for the smoder package. It walks through the full workflow:

  1. configuration

  2. input checking

  3. data loading

  4. preprocessing

  5. feature engineering

  6. graph construction

  7. model training

  8. result saving

  9. output inspection

This is a development-stage example and may change as the package structure is refined.

Before you begin

This tutorial assumes that:

  • the smoder package has already been installed

  • a compatible Python environment is available

  • the required input datasets have already been prepared

  • the input data files are readable from the environment in which this notebook runs

The tutorial is currently based on the validated mouse brain H3K27ac example workflow.

[ ]:
import os
import sys
from pprint import pprint

import pandas as pd

import smoder
from smoder.config.defaults import (
    get_mousebrain_h3k27ac_base_config,
    get_mousebrain_h3k27ac_params,
)
from smoder.pipelines.mousebrain_h3k27ac import (
    load_and_show_data,
    init_model_and_preprocess,
    feature_engineering_and_graph_build,
    train_model,
    save_results,
)

print("Python executable:")
print(sys.executable)

print("\nSMODER package location:")
print(smoder.__file__)

Step 1. Define tutorial settings

We first define a tutorial-specific run name and choose whether to perform full training.

If RUN_TRAINING = False, the notebook still runs configuration checking, data loading, preprocessing, and graph construction, but skips the training and saving steps.

[ ]:
RUN_NAME = "tutorial_01_mousebrain_h3k27ac_v2"
RUN_TRAINING = True

base_config = get_mousebrain_h3k27ac_base_config(run_name=RUN_NAME)
params = get_mousebrain_h3k27ac_params()

# Keep the tutorial lightweight
params["epochs"] = 10
params["model_save"] = False

Step 2. Review configuration

The configuration contains:

  • input data paths

  • output directory

  • model save directory

  • preprocessing parameters

  • training parameters

You can modify the paths in base_config if they do not match your own environment.

[ ]:
print("Base configuration:")
pprint(base_config)

print("\nParameter configuration:")
pprint(params)
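
If the default paths do not match your environment, one option is to override them before any data is loaded. The sketch below assumes that base_config behaves like a plain dictionary containing the keys used in this tutorial. apply_overrides is a hypothetical helper, not part of smoder, and all paths shown are placeholders.

```python
from copy import deepcopy


def apply_overrides(config, overrides):
    """Return a copy of `config` with `overrides` applied.

    Unknown keys raise KeyError so that a typo such as "st_rna_pth"
    is caught early instead of silently adding a new entry.
    """
    unknown = sorted(set(overrides) - set(config))
    if unknown:
        raise KeyError(f"Unknown configuration keys: {unknown}")
    updated = deepcopy(config)
    updated.update(overrides)
    return updated


# Stand-in for base_config; in the tutorial the real values come from
# get_mousebrain_h3k27ac_base_config(). All paths are placeholders.
example_config = {
    "sc_rna_path": "/path/to/sc_rna",
    "st_rna_path": "/path/to/st_rna",
    "st_adt_path": "/path/to/st_peaks",
    "output_dir": "outputs/default",
}

custom_config = apply_overrides(
    example_config,
    {"output_dir": "outputs/my_run"},
)
print(custom_config["output_dir"])
```

Returning a copy keeps the defaults untouched, so the original configuration can still be printed or compared after the override.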

Step 3. Check required input files

The current validated workflow expects three input files:

  • single-cell reference RNA

  • spatial RNA

  • spatial peak data

We verify that all required input files are available before proceeding.

[ ]:
required_files = {
    "sc_rna_path": base_config["sc_rna_path"],  # single-cell reference RNA
    "st_rna_path": base_config["st_rna_path"],  # spatial RNA
    "st_adt_path": base_config["st_adt_path"],  # spatial peak data (second modality)
}

missing_files = []

for key, path in required_files.items():
    exists = os.path.exists(path)
    print(f"{key}: {path}")
    print(f"  -> {'FOUND' if exists else 'MISSING'}")
    if not exists:
        missing_files.append(path)

if missing_files:
    # Report every missing path in the error message itself
    raise FileNotFoundError(
        "Missing required input files:\n  "
        + "\n  ".join(str(p) for p in missing_files)
    )

Step 4. Load input data

This step loads the validated mouse brain example data and prints basic information such as:

  • dataset shapes

  • modality types

  • number of cell types

  • selected categories

[ ]:
ref_dict, smo_dict, params = load_and_show_data(base_config, params)

Step 5. Initialize the model and preprocess the data

This step initializes the SMODER model and performs built-in preprocessing, including:

  • reference RNA preprocessing

  • spatial RNA preprocessing

  • second-modality preprocessing

  • spot alignment

  • informative gene selection

[ ]:
model = init_model_and_preprocess(ref_dict, smo_dict, params)
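
The preprocessing itself happens inside init_model_and_preprocess, and this tutorial does not describe its internals. As a conceptual illustration only, one common meaning of spot alignment is restricting both spatial modalities to the spots they share. The sketch below shows that idea with pandas; the barcodes, column names, and values are made up, and this is not SMODER's implementation.

```python
import pandas as pd

# Made-up spot barcodes for two spatial modalities.
rna_spots = pd.Index(["spot_1", "spot_2", "spot_3", "spot_4"])
peak_spots = pd.Index(["spot_3", "spot_2", "spot_5"])

# Toy measurement tables, one row per spot.
rna = pd.DataFrame({"gene_a": [1, 0, 2, 3]}, index=rna_spots)
peaks = pd.DataFrame({"peak_a": [5, 6, 7]}, index=peak_spots)

# Keep only the spots present in both modalities, in RNA order.
shared_spots = rna_spots[rna_spots.isin(peak_spots)]

# Reindex both tables so that row i refers to the same spot in each.
rna_aligned = rna.loc[shared_spots]
peaks_aligned = peaks.loc[shared_spots]

print(list(shared_spots))
print(rna_aligned.index.equals(peaks_aligned.index))
```

After this step, both tables have identical row order, which is what downstream steps that pair the two modalities per spot would typically rely on.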

Step 6. Perform feature engineering and graph construction

In this step, the pipeline performs:

  • RNA feature engineering

  • second-modality feature engineering

  • spatial graph construction

  • feature graph construction

[ ]:
model, dim_rna, dim_modal2 = feature_engineering_and_graph_build(model, params)

print("RNA feature dimension:", dim_rna)
print("Second modality feature dimension:", dim_modal2)
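
The graph construction is handled internally by feature_engineering_and_graph_build, and this tutorial does not state which algorithm it uses. To make the idea of a spatial graph concrete, the sketch below builds a symmetric k-nearest-neighbour adjacency matrix from 2D spot coordinates with NumPy. The function name knn_adjacency, the choice of k, and the toy grid are all assumptions made for illustration.

```python
import numpy as np


def knn_adjacency(coords, k):
    """Build a symmetric k-nearest-neighbour adjacency matrix.

    coords: (n_spots, n_dims) array of coordinates.
    k: number of neighbours selected per spot (excluding the spot itself).
    """
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]

    # Pairwise Euclidean distances, shape (n, n).
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Exclude self-matches by treating each spot as infinitely far from itself.
    np.fill_diagonal(dist, np.inf)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    adj = np.zeros((n, n), dtype=bool)
    rows = np.repeat(np.arange(n), k)
    adj[rows, neighbours.ravel()] = True

    # Symmetrise: keep an edge if either endpoint selected the other.
    return adj | adj.T


# Toy coordinates: a 3 x 3 grid of spots.
grid = np.array([(x, y) for x in range(3) for y in range(3)])
adjacency = knn_adjacency(grid, k=2)
print(adjacency.shape)
print(adjacency.sum(axis=1))
```

The same function could in principle be applied to feature vectors instead of coordinates to obtain a feature graph. The dense distance matrix needs memory quadratic in the number of spots, so a tree-based neighbour search would be preferable for large datasets.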

Step 7. Optional model training

The next step trains the model and produces the main deconvolution outputs.

For this tutorial, we keep the number of epochs small (10, as set in Step 1) so that the workflow remains suitable for quick validation.

If RUN_TRAINING = False, the training and saving steps will be skipped.

[ ]:
adata_result = None
cross_fusion = None

if RUN_TRAINING:
    adata_result, cross_fusion = train_model(
        model=model,
        params=params,
        dim_rna=dim_rna,
        dim_modal2=dim_modal2,
        base_config=base_config,
    )
else:
    print("Training step skipped because RUN_TRAINING = False")

Step 8. Save results

If training has been performed, the outputs are saved to the tutorial-specific output directory.

[ ]:
if RUN_TRAINING:
    save_results(
        adata_result=adata_result,
        model=model,
        base_config=base_config,
        params=params,
    )
else:
    print("Save step skipped because training was not run.")

Step 9. Inspect output directory

We now inspect the output directory and list the generated files.

[ ]:
output_dir = base_config["output_dir"]

print("Output directory:")
print(output_dir)
print()

if os.path.exists(output_dir):
    for name in sorted(os.listdir(output_dir)):
        full_path = os.path.join(output_dir, name)
        if os.path.isdir(full_path):
            print(f"[DIR]  {name}")
        else:
            size_mb = os.path.getsize(full_path) / (1024 * 1024)
            print(f"[FILE] {name} ({size_mb:.2f} MB)")
else:
    print("Output directory does not exist.")
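
The listing above covers only the top level of the output directory. If the outputs are organised into subdirectories, a recursive listing may be more convenient. The following sketch uses pathlib for this; list_files is a hypothetical helper, and the temporary directory and file names only stand in for a real output directory.

```python
import tempfile
from pathlib import Path


def list_files(root):
    """Return (relative_path, size_in_bytes) for every file under `root`."""
    root = Path(root)
    return sorted(
        (path.relative_to(root).as_posix(), path.stat().st_size)
        for path in root.rglob("*")
        if path.is_file()
    )


# Demonstrate on a temporary directory containing one nested file.
with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / "figures"
    nested.mkdir()
    (Path(tmp) / "summary.txt").write_text("abc")
    (nested / "plot.txt").write_text("hello")
    listing = list_files(tmp)

for rel_path, size in listing:
    print(f"{rel_path}: {size} bytes")
```

In the notebook, list_files(output_dir) could be called in the same way once the output directory exists.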

Step 10. Preview the cell type proportions

If training ran successfully, the output directory should contain a cell_type_proportions.csv file. We preview the first few rows below.

[ ]:
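# Explicit import so the cell also works outside notebook front ends
from IPython.display import display

cell_type_csv = os.path.join(output_dir, "cell_type_proportions.csv")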

if os.path.exists(cell_type_csv):
    df_props = pd.read_csv(cell_type_csv)
    print("cell_type_proportions.csv shape:", df_props.shape)
    display(df_props.head())
else:
    print("cell_type_proportions.csv not found.")
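
Beyond previewing the table, two quick sanity checks are often useful for deconvolution output: whether the proportions of each spot sum to approximately 1, and which cell type dominates each spot. The sketch below assumes a layout with one row per spot, one column per cell type, and the spot identifier in the first column. That layout, the cell type names, and the values are assumptions for illustration and may not match the actual cell_type_proportions.csv.

```python
import io

import pandas as pd

# Made-up stand-in for cell_type_proportions.csv; the real column
# layout (index column, cell type names) is an assumption here.
csv_text = """spot,Astro,Neuron,Oligo
spot_1,0.2,0.7,0.1
spot_2,0.5,0.1,0.4
spot_3,0.3,0.3,0.4
"""
props = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Proportions for each spot would normally be expected to sum to about 1.
row_sums = props.sum(axis=1)
rows_ok = (row_sums - 1.0).abs() < 1e-6

# Dominant cell type per spot (column with the largest proportion).
dominant = props.idxmax(axis=1)

print("All rows sum to 1:", bool(rows_ok.all()))
print(dominant.to_dict())
```

If the real file stores the proportions without normalisation, the row-sum check will fail, which is itself useful information before plotting.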

Step 11. Inspect result object metadata

If training has been executed, we can also inspect some basic information stored in the result object.

[ ]:
if adata_result is not None:
    print("Result AnnData shape:")
    print(adata_result.shape)

    print("\nAvailable obsm keys:")
    print(list(adata_result.obsm.keys()))

    # Only the first 20 keys are shown to keep the output short
    print("\nAvailable uns keys (first 20):")
    print(list(adata_result.uns.keys())[:20])
else:
    print("No result object is available because training was skipped.")

Summary

In this tutorial, we completed a full SMODER quick-start workflow for the current validated mouse brain RNA + peak example.

This notebook covered:

  • configuration preparation

  • input file checking

  • data loading

  • model initialization and preprocessing

  • feature engineering and graph construction

  • optional model training

  • result saving

  • output inspection

  • result preview

This notebook is the first practical tutorial for the current SMODER package and will be refined as the package, documentation, and public release workflow mature.

Next steps

For downstream visualization and representative result figures, please see:

  • Tutorial 02: Mouse Brain H3K27ac Result Visualization

  • Tutorial 03: Simulated Human Melanoma Result Visualization

  • Tutorial 04: HBC Result Visualization

These tutorials describe how to generate representative SMODER result figures for each dataset.