{
 "cells": [
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "# SMODER Tutorial 01: Mouse Brain H3K27ac Quick Start\n",
    "\n",
    "This tutorial demonstrates how to run the current validated SMODER workflow on a mouse brain RNA + peak dataset.\n",
    "\n",
    "At the current stage, this notebook serves as a runnable quick-start example for the `smoder` package. It introduces the full workflow, including:\n",
    "\n",
    "1. configuration\n",
    "2. input checking\n",
    "3. data loading\n",
    "4. preprocessing\n",
    "5. feature engineering\n",
    "6. graph construction\n",
    "7. model training\n",
    "8. result saving\n",
    "9. output inspection\n",
    "\n",
    "This tutorial is intended as a development-stage example and may continue to evolve as the package structure is refined."
   ],
   "id": "d87f546a6e1f0259"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Before you begin\n",
    "\n",
    "This tutorial assumes that:\n",
    "\n",
    "- the `smoder` package has already been installed\n",
    "- a compatible Python environment is available\n",
    "- the required input datasets have already been prepared\n",
    "- the current workflow is being run in an environment where the SMODER package can access the data files\n",
    "\n",
    "At the current stage, this tutorial is based on the validated mouse brain H3K27ac example workflow."
   ],
   "id": "74349887d27c0ea1"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "import os\n",
    "import sys\n",
    "from pprint import pprint\n",
    "\n",
    "import pandas as pd\n",
    "\n",
    "import smoder\n",
    "from smoder.config.defaults import (\n",
    "    get_mousebrain_h3k27ac_base_config,\n",
    "    get_mousebrain_h3k27ac_params,\n",
    ")\n",
    "from smoder.pipelines.mousebrain_h3k27ac import (\n",
    "    load_and_show_data,\n",
    "    init_model_and_preprocess,\n",
    "    feature_engineering_and_graph_build,\n",
    "    train_model,\n",
    "    save_results,\n",
    ")\n",
    "\n",
    "print(\"Python executable:\")\n",
    "print(sys.executable)\n",
    "\n",
    "print(\"\\nSMODER package location:\")\n",
    "print(smoder.__file__)"
   ],
   "id": "883f20933085fa64"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 1. Define tutorial settings\n",
    "\n",
    "We first define a tutorial-specific run name and choose whether to perform full training.\n",
    "\n",
    "If `RUN_TRAINING = False`, this notebook can still be used for configuration checking and data loading without performing the full training step."
   ],
   "id": "f5de031058c4b971"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "RUN_NAME = \"tutorial_01_mousebrain_h3k27ac_v2\"\n",
    "RUN_TRAINING = True\n",
    "\n",
    "base_config = get_mousebrain_h3k27ac_base_config(run_name=RUN_NAME)\n",
    "params = get_mousebrain_h3k27ac_params()\n",
    "\n",
    "# Keep the tutorial lightweight\n",
    "params[\"epochs\"] = 10\n",
    "params[\"model_save\"] = False"
   ],
   "id": "b0c8495764b1ba7d"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 2. Review configuration\n",
    "\n",
    "The configuration contains:\n",
    "\n",
    "- input data paths\n",
    "- output directory\n",
    "- model save directory\n",
    "- preprocessing parameters\n",
    "- training parameters\n",
    "\n",
    "You can optionally modify the paths below if needed for your own environment."
   ],
   "id": "4e71aa9466be0670"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "print(\"Base configuration:\")\n",
    "pprint(base_config)\n",
    "\n",
    "print(\"\\nParameter configuration:\")\n",
    "pprint(params)"
   ],
   "id": "8d0e7a8b6bb8d4ae"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 3. Check required input files\n",
    "\n",
    "The current validated workflow expects three input files:\n",
    "\n",
    "- single-cell reference RNA\n",
    "- spatial RNA\n",
    "- spatial peak data\n",
    "\n",
    "We verify that all required input files are available before proceeding."
   ],
   "id": "9389dba89bd124e"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "required_files = {\n",
    "    \"sc_rna_path\": base_config[\"sc_rna_path\"],\n",
    "    \"st_rna_path\": base_config[\"st_rna_path\"],\n",
    "    \"st_adt_path\": base_config[\"st_adt_path\"],\n",
    "}\n",
    "\n",
    "missing_files = []\n",
    "\n",
    "for key, path in required_files.items():\n",
    "    exists = os.path.exists(path)\n",
    "    print(f\"{key}: {path}\")\n",
    "    print(f\"  -> {'FOUND' if exists else 'MISSING'}\")\n",
    "    if not exists:\n",
    "        missing_files.append(path)\n",
    "\n",
    "if missing_files:\n",
    "    raise FileNotFoundError(\"Some required input files are missing.\")"
   ],
   "id": "b42ca688b026346a"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 4. Load input data\n",
    "\n",
    "This step loads the validated mouse brain example data and prints basic information such as:\n",
    "\n",
    "- dataset shapes\n",
    "- modality types\n",
    "- number of cell types\n",
    "- selected categories"
   ],
   "id": "667b622c389b795d"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "ref_dict, smo_dict, params = load_and_show_data(base_config, params)",
   "id": "fcd1557f042604dc"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 5. Initialize the model and preprocess the data\n",
    "\n",
    "This step initializes the SMODER model and performs built-in preprocessing, including:\n",
    "\n",
    "- reference RNA preprocessing\n",
    "- spatial RNA preprocessing\n",
    "- second-modality preprocessing\n",
    "- spot alignment\n",
    "- information gene selection"
   ],
   "id": "70fd9d84d000d284"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": "model = init_model_and_preprocess(ref_dict, smo_dict, params)",
   "id": "9485337e992f5cc"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 6. Perform feature engineering and graph construction\n",
    "\n",
    "In this step, the pipeline performs:\n",
    "\n",
    "- RNA feature engineering\n",
    "- second-modality feature engineering\n",
    "- spatial graph construction\n",
    "- feature graph construction"
   ],
   "id": "e73680bfe9961e63"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "model, dim_rna, dim_modal2 = feature_engineering_and_graph_build(model, params)\n",
    "\n",
    "print(\"RNA feature dimension:\", dim_rna)\n",
    "print(\"Second modality feature dimension:\", dim_modal2)"
   ],
   "id": "4e2c9393a39e2a09"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 7. Optional model training\n",
    "\n",
    "The next step trains the model and produces the main deconvolution outputs.\n",
    "\n",
    "For this tutorial, we keep the number of epochs intentionally small so that the workflow remains suitable for quick validation.\n",
    "\n",
    "If `RUN_TRAINING = False`, the training and saving steps will be skipped."
   ],
   "id": "2389aad8e35630ef"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "adata_result = None\n",
    "cross_fusion = None\n",
    "\n",
    "if RUN_TRAINING:\n",
    "    adata_result, cross_fusion = train_model(\n",
    "        model=model,\n",
    "        params=params,\n",
    "        dim_rna=dim_rna,\n",
    "        dim_modal2=dim_modal2,\n",
    "        base_config=base_config,\n",
    "    )\n",
    "else:\n",
    "    print(\"Training step skipped because RUN_TRAINING = False\")"
   ],
   "id": "d5e5f727aa21ac81"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 8. Save results\n",
    "\n",
    "If training has been performed, the outputs are saved to the tutorial-specific output directory."
   ],
   "id": "d975335aa06bc3ff"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "if RUN_TRAINING:\n",
    "    save_results(\n",
    "        adata_result=adata_result,\n",
    "        model=model,\n",
    "        base_config=base_config,\n",
    "        params=params,\n",
    "    )\n",
    "else:\n",
    "    print(\"Save step skipped because training was not run.\")"
   ],
   "id": "561b69b4c67363c1"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 9. Inspect output directory\n",
    "\n",
    "We now inspect the output directory and list the generated files."
   ],
   "id": "92fc7ec6af3b7f60"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "output_dir = base_config[\"output_dir\"]\n",
    "\n",
    "print(\"Output directory:\")\n",
    "print(output_dir)\n",
    "print()\n",
    "\n",
    "if os.path.exists(output_dir):\n",
    "    for name in sorted(os.listdir(output_dir)):\n",
    "        full_path = os.path.join(output_dir, name)\n",
    "        if os.path.isdir(full_path):\n",
    "            print(f\"[DIR]  {name}\")\n",
    "        else:\n",
    "            size_mb = os.path.getsize(full_path) / (1024 * 1024)\n",
    "            print(f\"[FILE] {name} ({size_mb:.2f} MB)\")\n",
    "else:\n",
    "    print(\"Output directory does not exist.\")"
   ],
   "id": "8e66b5abc6c8b024"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 10. Preview the cell type proportions\n",
    "\n",
    "If training has been run successfully, the output directory should contain a `cell_type_proportions.csv` file. We preview the first few rows below."
   ],
   "id": "8847bdb9846a7dd"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "cell_type_csv = os.path.join(output_dir, \"cell_type_proportions.csv\")\n",
    "\n",
    "if os.path.exists(cell_type_csv):\n",
    "    df_props = pd.read_csv(cell_type_csv)\n",
    "    print(\"cell_type_proportions.csv shape:\", df_props.shape)\n",
    "    display(df_props.head())\n",
    "else:\n",
    "    print(\"cell_type_proportions.csv not found.\")"
   ],
   "id": "55ba5aeb329fc10f"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Step 11. Inspect result object metadata\n",
    "\n",
    "If training has been executed, we can also inspect some basic information stored in the result object."
   ],
   "id": "333b6afb1377b48d"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "if adata_result is not None:\n",
    "    print(\"Result AnnData shape:\")\n",
    "    print(adata_result.shape)\n",
    "\n",
    "    print(\"\\nAvailable obsm keys:\")\n",
    "    print(list(adata_result.obsm.keys()))\n",
    "\n",
    "    print(\"\\nAvailable uns keys:\")\n",
    "    print(list(adata_result.uns.keys())[:20])\n",
    "else:\n",
    "    print(\"No result object is available because training was skipped.\")"
   ],
   "id": "dccbf18b6090448e"
  },
  {
   "metadata": {},
   "cell_type": "markdown",
   "source": [
    "## Summary\n",
    "\n",
    "In this tutorial, we completed a full SMODER quick-start workflow for the current validated mouse brain RNA + peak example.\n",
    "\n",
    "This notebook covered:\n",
    "\n",
    "- configuration preparation\n",
    "- input file checking\n",
    "- data loading\n",
    "- model initialization and preprocessing\n",
    "- feature engineering and graph construction\n",
    "- optional model training\n",
    "- result saving\n",
    "- output inspection\n",
    "- result preview\n",
    "\n",
    "This notebook is intended to serve as the first practical tutorial for the current SMODER package and can be further refined as the package, documentation, and public release workflow continue to improve."
   ],
   "id": "f878331087a30a64"
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 5
}