{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Qwen3-0.6B Full-Parameter Fine-Tuning\n",
    "\n",
    "This notebook targets the **Ascend 910B CANN MindSpore image** and follows the upstream Qwen3 MindSpore fine-tuning flow from `MindSpeed-LLM/docs/zh/mindspore/quick_start.md`.\n",
    "\n",
    "To match the image-bundled source tree and this verification scenario, the notebook inlines commands equivalent to the official scripts:\n",
    "- `examples/mindspore/qwen3/ckpt_convert_qwen3_hf2mcore.sh`\n",
    "- `examples/mindspore/qwen3/data_convert_qwen3_instruction.sh`\n",
    "- `examples/mindspore/qwen3/tune_qwen3_0point6b_4K_full_ms.sh`\n",
    "\n",
    "**Workflow:**\n",
    "1. Environment checks\n",
    "2. Prepare the instruction dataset\n",
    "3. Verify the bundled MindSpeed-Core-MS/MindSpeed-LLM source tree\n",
    "4. Convert HF weights to MindSpeed/Mcore format\n",
    "5. Preprocess the fine-tuning data\n",
    "6. Launch full-parameter SFT fine-tuning\n",
    "7. Validate the output checkpoint\n",
    "\n",
    "> The default parameters are conservative and intended for image validation. For longer training closer to the upstream baseline, increase `SEQ_LENGTH`, `TRAIN_ITERS`, and `MBS` as needed.\n"
   ],
   "id": "bf537aee6af17427"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 0. Parameters\n"
   ],
   "id": "24a8a8925043c5aa"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "warnings.filterwarnings('ignore', category=DeprecationWarning)\n",
    "warnings.filterwarnings('ignore', category=ImportWarning)\n",
    "warnings.filterwarnings('ignore', category=UserWarning)\n",
    "warnings.filterwarnings('ignore', category=FutureWarning)\n",
    "\n",
    "from pathlib import Path\n",
    "\n",
    "# ===== Path configuration =====\n",
    "MINDSPEED_CORE_MS_DEFAULT_PATH = Path('/opt/app-root/share/MindSpeed-Core-MS')\n",
    "MINDSPEED_CORE_MS_PATH = Path(\n",
    "    os.environ.get('MINDSPEED_CORE_MS_PATH', str(MINDSPEED_CORE_MS_DEFAULT_PATH))\n",
    ")\n",
    "MINDSPEED_LLM_DIR = MINDSPEED_CORE_MS_PATH / 'MindSpeed-LLM'\n",
    "MINDSPEED_DIR = MINDSPEED_CORE_MS_PATH / 'MindSpeed'\n",
    "MSADAPTER_DIR = MINDSPEED_CORE_MS_PATH / 'MSAdapter'\n",
    "MEGATRON_DIR = MINDSPEED_CORE_MS_PATH / 'Megatron-LM'\n",
    "SET_PATH_SCRIPT = MINDSPEED_CORE_MS_PATH / 'tests' / 'scripts' / 'set_path.sh'\n",
    "\n",
    "OFFICIAL_CONVERT_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'ckpt_convert_qwen3_hf2mcore.sh'\n",
    "OFFICIAL_PREPROCESS_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'data_convert_qwen3_instruction.sh'\n",
    "OFFICIAL_TUNE_SCRIPT = MINDSPEED_LLM_DIR / 'examples' / 'mindspore' / 'qwen3' / 'tune_qwen3_0point6b_4K_full_ms.sh'\n",
    "CONVERT_CKPT_ENTRY = MINDSPEED_LLM_DIR / 'mindspeed_llm' / 'mindspore' / 'convert_ckpt.py'\n",
    "assert CONVERT_CKPT_ENTRY.exists(), f'Official MindSpore conversion entry not found: {CONVERT_CKPT_ENTRY}'\n",
    "\n",
    "HF_MODEL_DIR = Path('/opt/app-root/src/models/Qwen3-0.6B')\n",
    "WORK_DIR = Path('/opt/app-root/src/Qwen3-0.6B-work-dir')\n",
    "DATA_DIR = WORK_DIR / 'finetune_dataset'\n",
    "RAW_DATA_FILE = DATA_DIR / 'alpaca_sample.jsonl'\n",
    "PROCESSED_DATA_PREFIX = DATA_DIR / 'alpaca'\n",
    "OUTPUT_DIR = WORK_DIR / 'output' / 'qwen3_0.6b_finetuned'\n",
    "LOGS_DIR = WORK_DIR / 'logs'\n",
    "PRECHECK_LOG_DIR = LOGS_DIR / 'preflight'\n",
    "\n",
    "# ===== Optional: real dataset path =====\n",
    "ALPACA_PARQUET = Path('/opt/app-root/src/datasets/alpaca/train-00000-of-00001-a09b74b3ef9c3b56.parquet')\n",
    "\n",
    "# ===== Ascend environment scripts =====\n",
    "CANN_ENV = '/usr/local/Ascend/cann/set_env.sh'\n",
    "ATB_ENV = '/usr/local/Ascend/nnal/atb/set_env.sh'\n",
    "\n",
    "# ===== Repository and model configuration =====\n",
    "MODEL_SPEC = 'mindspeed_llm.tasks.models.spec.qwen3_spec layer_spec'\n",
    "MASTER_ADDR = 'localhost'\n",
    "MASTER_PORT = 6015\n",
    "NNODES = 1\n",
    "NODE_RANK = 0\n",
    "DISTRIBUTED_BACKEND = 'nccl'  # Keep nccl to match the upstream tune_qwen3_0point6b_4K_full_ms.sh behavior.\n",
    "\n",
    "# ===== Parallelism configuration (must match weight conversion) =====\n",
    "# Qwen3-0.6B is small enough for TP=1 and PP=1; use both cards for data parallelism.\n",
    "TP = 1\n",
    "PP = 1\n",
    "MBS = 2  # micro-batch=2 fits on each 32G card.\n",
    "\n",
    "# ===== Weight conversion output (include TP/PP in the path to avoid stale reuse) =====\n",
    "MCORE_WEIGHTS_DIR = WORK_DIR / 'model_weights' / f'qwen3_mcore_tp{TP}_pp{PP}'\n",
    "\n",
    "# ===== Training hyperparameters (for 2x 910B 32G) =====\n",
    "SEQ_LENGTH = 2048  # The upstream quick start example uses 4096; reduce to 2048 for validation.\n",
    "TRAIN_ITERS = 100  # The upstream quick start example uses 2000; reduce to 100 for validation.\n",
    "LR = 1.25e-6\n",
    "MIN_LR = 1.25e-7\n",
    "\n",
    "# ===== Data preprocessing =====\n",
    "ENABLE_THINKING = 'true'\n",
    "HANDLER_NAME = 'AlpacaStyleInstructionHandler'\n",
    "TOKENIZER_TYPE = 'PretrainedFromHF'\n",
    "PROMPT_TYPE = 'qwen3'\n",
    "DATA_PATH = str(PROCESSED_DATA_PREFIX)\n",
    "\n",
    "print('Configuration loaded')\n",
    "print(f'  MindSpeed-Core-MS: {MINDSPEED_CORE_MS_PATH}')\n",
    "print(f'  MindSpeed-LLM:     {MINDSPEED_LLM_DIR}')\n",
    "print(f'  Official convert script:    {OFFICIAL_CONVERT_SCRIPT}')\n",
    "print(f'  Official preprocess script: {OFFICIAL_PREPROCESS_SCRIPT}')\n",
    "print(f'  Official fine-tune script:  {OFFICIAL_TUNE_SCRIPT}')\n",
    "print(f'  Official conversion entry:  {CONVERT_CKPT_ENTRY}')\n",
    "print(f'  Model:                      {HF_MODEL_DIR}')\n",
    "print(f'  Work dir:                   {WORK_DIR}')\n",
    "print(f'  Dataset: {ALPACA_PARQUET}' if ALPACA_PARQUET.exists() else '  Dataset: built-in sample dataset')\n",
    "print(f'  TP={TP}, PP={PP}, MBS={MBS}, SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}')\n",
    "print(f'  Distributed backend: {DISTRIBUTED_BACKEND}')\n"
   ],
   "id": "436e89b6617b17e6"
  },
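  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The values above are deliberately conservative for image validation. As a minimal sketch, a run closer to the upstream quick-start baseline would raise the three knobs called out in the comments above (4096 and 2000 are the upstream values quoted there; whether `MBS` can grow depends on card memory) and then re-execute the cells below:\n",
    "\n",
    "```python\n",
    "# Hypothetical overrides for a longer run; rerun the notebook from here after changing them.\n",
    "SEQ_LENGTH = 4096   # upstream quick-start value\n",
    "TRAIN_ITERS = 2000  # upstream quick-start value\n",
    "MBS = 2             # raise only if accelerator memory allows\n",
    "```\n"
   ],
   "id": "7c2f41d09ab83e10"
  },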
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Helpers\n"
   ],
   "id": "27f668d69a63d358"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import json\n",
    "import os\n",
    "import shlex\n",
    "import subprocess\n",
    "\n",
    "_SUPPRESS_WARNINGS = 'ignore::DeprecationWarning,ignore::ImportWarning,ignore::UserWarning,ignore::FutureWarning'\n",
    "\n",
    "def q(value):\n",
    "    return shlex.quote(str(value))\n",
    "\n",
    "def run_cmd(cmd, cwd=None, check=True, step_name=None, log_file=None):\n",
    "    'Run a bash command inside the Ascend environment, keep pipefail semantics, and optionally tee output to a log file.'\n",
    "    env_parts = ['set -o pipefail', f'source {q(CANN_ENV)}', f'source {q(ATB_ENV)}']\n",
    "    if SET_PATH_SCRIPT.exists():\n",
    "        env_parts.append(f'source {q(SET_PATH_SCRIPT)}')\n",
    "    env_prefix = ' && '.join(env_parts)\n",
    "    effective_cwd = Path(cwd or WORK_DIR)\n",
    "    resolved_log_file = Path(log_file) if log_file else None\n",
    "    wrapped_cmd = cmd\n",
    "    if resolved_log_file is not None:\n",
    "        resolved_log_file.parent.mkdir(parents=True, exist_ok=True)\n",
    "        wrapped_cmd = f'{{\n",
    "{cmd}\n",
    "}} 2>&1 | tee {q(resolved_log_file)}'\n",
    "    full_cmd = f'{env_prefix} && {wrapped_cmd}'\n",
    "    print(f'$ {cmd}\n",
    "')\n",
    "    if resolved_log_file is not None:\n",
    "        print(f'Log file: {resolved_log_file}\n",
    "')\n",
    "    run_env = os.environ.copy()\n",
    "    run_env['PYTHONWARNINGS'] = _SUPPRESS_WARNINGS\n",
    "    result = subprocess.run(\n",
    "        ['bash', '-lc', full_cmd],\n",
    "        cwd=str(effective_cwd),\n",
    "        text=True,\n",
    "        env=run_env,\n",
    "    )\n",
    "    if check and result.returncode != 0:\n",
    "        step_label = f'Step[{step_name}]' if step_name else 'Command'\n",
    "        log_hint = f', log file: {resolved_log_file}' if resolved_log_file is not None else ''\n",
    "        raise RuntimeError(f'{step_label} failed with exit code {result.returncode}{log_hint}')\n",
    "    return result\n",
    "\n",
    "print('Helper ready: run_cmd()')\n",
    "print(f'Preflight log dir: {PRECHECK_LOG_DIR}')\n"
   ],
   "id": "24c1197bd5a2b547"
  },
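  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`run_cmd()` sources the CANN and ATB environment scripts (plus `set_path.sh` when present), runs the command via `bash -lc` with `pipefail`, and optionally tees output to a log file. A minimal usage sketch, assuming the image ships the `npu-smi` device utility:\n",
    "\n",
    "```python\n",
    "# Query the NPUs and keep the output in a preflight log (npu-smi availability is an assumption).\n",
    "run_cmd('npu-smi info', step_name='npu-smi', log_file=PRECHECK_LOG_DIR / 'npu_smi_info.log')\n",
    "```\n"
   ],
   "id": "5b8e02d7c3f1a964"
  },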
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1. Environment Checks\n"
   ],
   "id": "54d5bc600212c226"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import os\n",
    "import warnings\n",
    "\n",
    "print('=' * 60)\n",
    "print('Environment checks')\n",
    "print('=' * 60)\n",
    "\n",
    "assert Path(CANN_ENV).exists(), f'Ascend CANN environment script not found: {CANN_ENV}'\n",
    "assert Path(ATB_ENV).exists(), f'Ascend ATB environment script not found: {ATB_ENV}'\n",
    "print(f'Expected MindSpeed-Core-MS path: {MINDSPEED_CORE_MS_PATH}')\n",
    "assert MINDSPEED_CORE_MS_PATH.exists(), f'MindSpeed-Core-MS source tree not found: {MINDSPEED_CORE_MS_PATH}'\n",
    "assert SET_PATH_SCRIPT.exists(), f'set_path.sh not found: {SET_PATH_SCRIPT}'\n",
    "for repo_dir in (MINDSPEED_LLM_DIR, MINDSPEED_DIR, MSADAPTER_DIR, MEGATRON_DIR):\n",
    "    assert repo_dir.exists(), f'Bundled source tree missing: {repo_dir}'\n",
    "for script_path in (OFFICIAL_CONVERT_SCRIPT, OFFICIAL_PREPROCESS_SCRIPT, OFFICIAL_TUNE_SCRIPT):\n",
    "    assert script_path.exists(), f'Official example script missing: {script_path}'\n",
    "\n",
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter('ignore', DeprecationWarning)\n",
    "    warnings.simplefilter('ignore', ImportWarning)\n",
    "    warnings.simplefilter('ignore', UserWarning)\n",
    "    warnings.simplefilter('ignore', FutureWarning)\n",
    "    import mindspore as ms\n",
    "    import msadapter\n",
    "    import mindspeed\n",
    "    import mindspeed_llm\n",
    "\n",
    "print(f'MindSpore:          {ms.__version__}')\n",
    "print(f'CANN env script:    {CANN_ENV}')\n",
    "print(f'ATB env script:     {ATB_ENV}')\n",
    "print(f'MindSpeed-Core-MS:  {MINDSPEED_CORE_MS_PATH}')\n",
    "print(f'set_path.sh:        {SET_PATH_SCRIPT}')\n",
    "\n",
    "device_target = 'unknown'\n",
    "if hasattr(ms, 'get_context'):\n",
    "    try:\n",
    "        device_target = ms.get_context('device_target')\n",
    "    except Exception:\n",
    "        pass\n",
    "\n",
    "nproc = None\n",
    "hal = getattr(ms, 'hal', None)\n",
    "if hal is not None and hasattr(hal, 'device_count'):\n",
    "    try:\n",
    "        nproc = int(hal.device_count())\n",
    "    except Exception:\n",
    "        nproc = None\n",
    "\n",
    "if not nproc:\n",
    "    visible_devices = os.environ.get('ASCEND_RT_VISIBLE_DEVICES') or os.environ.get('ASCEND_VISIBLE_DEVICES')\n",
    "    if visible_devices:\n",
    "        nproc = len([d for d in visible_devices.split(',') if d.strip()])\n",
    "    else:\n",
    "        nproc = int(os.environ.get('RANK_SIZE', '1'))\n",
    "\n",
    "print(f'Device target:      {device_target}')\n",
    "print(f'NPU count:          {nproc}')\n",
    "\n",
    "print(f'\n",
    "Model directory: {HF_MODEL_DIR}')\n",
    "assert HF_MODEL_DIR.exists(), f'Model directory not found: {HF_MODEL_DIR}'\n",
    "required_model_files = [\n",
    "    HF_MODEL_DIR / 'config.json',\n",
    "    HF_MODEL_DIR / 'tokenizer.json',\n",
    "    HF_MODEL_DIR / 'tokenizer_config.json',\n",
    "]\n",
    "for required_file in required_model_files:\n",
    "    assert required_file.exists(), f'Required model file not found: {required_file}'\n",
    "safetensor_files = sorted(HF_MODEL_DIR.glob('*.safetensors'))\n",
    "assert safetensor_files, f'No safetensors weight files found: {HF_MODEL_DIR}'\n",
    "print('Required model files:')\n",
    "for required_file in required_model_files:\n",
    "    print(f'  {required_file.name}: OK')\n",
    "print(f'  safetensors: {len(safetensor_files)} file(s)')\n",
    "\n",
    "model_files = sorted(HF_MODEL_DIR.glob('*'))\n",
    "for f in model_files[:5]:\n",
    "    if f.is_file():\n",
    "        print(f'  {f.name} ({f.stat().st_size / 1e9:.2f} GB)')\n",
    "if len(model_files) > 5:\n",
    "    print(f'  ... total {len(model_files)} files')\n",
    "\n",
    "py_path_entries = [Path(p) for p in os.environ.get('PYTHONPATH', '').split(':') if p]\n",
    "expected_entries = [\n",
    "    MINDSPEED_CORE_MS_PATH / 'msadapter',\n",
    "    MSADAPTER_DIR,\n",
    "    MINDSPEED_CORE_MS_PATH / 'msadapter' / 'msa_thirdparty',\n",
    "    MSADAPTER_DIR / 'msa_thirdparty',\n",
    "    MINDSPEED_LLM_DIR,\n",
    "    MEGATRON_DIR,\n",
    "    MINDSPEED_DIR,\n",
    "]\n",
    "print('\n",
    "PYTHONPATH key entries:')\n",
    "for entry in expected_entries:\n",
    "    present = entry in py_path_entries\n",
    "    print(f'  {entry}: {\"OK\" if present else \"missing\"}')\n",
    "missing_entries = [str(entry) for entry in expected_entries if entry not in py_path_entries]\n",
    "assert not missing_entries, f'PYTHONPATH is missing required entries: {missing_entries}'\n",
    "\n",
    "assert nproc >= TP * PP, f'NPU count ({nproc}) < TP*PP ({TP * PP}); reduce PP and retry.'\n",
    "DP = max(1, nproc // (TP * PP))\n",
    "GBS = DP * MBS\n",
    "print(f'\n",
    "Parallel configuration: TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n",
    "\n",
    "subprocess_env_check = '\n",
    "'.join([\n",
    "    \"python - <<'PY'\",\n",
    "    'import importlib',\n",
    "    'import os',\n",
    "    'import sys',\n",
    "    'from pathlib import Path',\n",
    "    '',\n",
    "    \"core_root = Path(os.environ['MINDSPEED_CORE_MS_PATH'])\",\n",
    "    \"print(f'sys.executable: {sys.executable}')\",\n",
    "    \"print(f'MINDSPEED_CORE_MS_PATH: {core_root}')\",\n",
    "    \"for name in ('mindspore', 'msadapter', 'mindspeed', 'mindspeed_llm', 'transformers'):\",\n",
    "    '    module = importlib.import_module(name)',\n",
    "    \"    print(f'{name}: {getattr(module, \"__file__\", \"<namespace>\")}')\",\n",
    "    '',\n",
    "    \"cfg = core_root / 'MindSpeed-LLM' / 'configs' / 'checkpoint' / 'model_cfg.json'\",\n",
    "    \"print(f'checkpoint_cfg: {cfg} exists={cfg.exists()}')\",\n",
    "    'PY',\n",
    "    '',\n",
    "])\n",
    "run_cmd(\n",
    "    subprocess_env_check,\n",
    "    step_name='subprocess-import-check',\n",
    "    log_file=PRECHECK_LOG_DIR / 'subprocess_import_check.log',\n",
    ")\n",
    "print(f'Subprocess environment check log: {PRECHECK_LOG_DIR / \"subprocess_import_check.log\"}')\n",
    "print('\n",
    "Environment checks passed!')\n"
   ],
   "id": "458f673071da592d"
  },
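  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The parallel layout printed above follows directly from `DP = nproc // (TP * PP)` and `GBS = DP * MBS`. A worked example with the notebook defaults on a 2-NPU node:\n",
    "\n",
    "```python\n",
    "# Parallelism arithmetic: 2 NPUs, TP=1, PP=1, MBS=2.\n",
    "nproc_example = 2\n",
    "dp = max(1, nproc_example // (1 * 1))  # -> 2 data-parallel replicas\n",
    "gbs = dp * 2                           # -> global batch size 4\n",
    "print(dp, gbs)  # 2 4\n",
    "```\n"
   ],
   "id": "9d4c7e1fa2b05c38"
  },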
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Prepare the Dataset\n",
    "\n",
    "Create sample Alpaca-formatted instruction data for fine-tuning flow validation.\n",
    "\n",
    "To use a real dataset, place a parquet file at `ALPACA_PARQUET` or write a JSONL file to `RAW_DATA_FILE`, one JSON object per line:\n",
    "\n",
    "```json\n",
    "{\"instruction\": \"...\", \"input\": \"...\", \"output\": \"...\"}\n",
    "```\n"
   ],
   "id": "f29d99ee638e7039"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "import warnings\n",
    "\n",
    "DATA_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "if ALPACA_PARQUET.exists():\n",
    "    print(f'Loading Alpaca dataset: {ALPACA_PARQUET.name}')\n",
    "    with warnings.catch_warnings():\n",
    "        warnings.simplefilter('ignore', DeprecationWarning)\n",
    "        import pandas as pd\n",
    "        df = pd.read_parquet(ALPACA_PARQUET)\n",
    "\n",
    "    required_columns = {'instruction', 'input', 'output'}\n",
    "    missing_columns = required_columns.difference(df.columns)\n",
    "    assert not missing_columns, f'Parquet dataset is missing required columns: {sorted(missing_columns)}'\n",
    "\n",
    "    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n",
    "        for item in df[['instruction', 'input', 'output']].to_dict('records'):\n",
    "            item['input'] = item.get('input') or ''\n",
    "            f.write(json.dumps(item, ensure_ascii=False) + '\n",
    "')\n",
    "\n",
    "    print(f'Converted parquet to JSONL: {RAW_DATA_FILE}')\n",
    "    preview_records = df[['instruction', 'input', 'output']].head(3).to_dict('records')\n",
    "else:\n",
    "    print('Alpaca dataset not found, using built-in sample data\n",
    "')\n",
    "    sample_data = [\n",
    "        {'instruction': 'Translate the following sentence into French', 'input': 'The weather is nice today.', 'output': \"Il fait beau aujourd'hui.\"},\n",
    "        {'instruction': 'Translate the following sentence into Spanish', 'input': 'I like programming.', 'output': 'Me gusta programar.'},\n",
    "        {'instruction': 'Summarize the following in one sentence', 'input': 'Machine learning is fascinating and widely used in many fields.', 'output': 'Machine learning is widely used across many fields.'},\n",
    "        {'instruction': 'Rewrite in a more formal tone', 'input': 'Hello, how are you?', 'output': 'Hello, how have you been today?'},\n",
    "        {'instruction': 'Introduce MindSpore in one sentence', 'input': '', 'output': 'MindSpore is an all-scenario AI computing framework for device, edge, and cloud.'},\n",
    "        {'instruction': 'List three common sorting algorithms', 'input': '', 'output': 'Three common sorting algorithms are bubble sort, quicksort, and merge sort.'},\n",
    "        {'instruction': 'Explain what full-parameter fine-tuning means', 'input': '', 'output': 'Full-parameter fine-tuning updates all model weights instead of training only a lightweight adapter.'},\n",
    "        {'instruction': 'Write a Python function that adds two numbers', 'input': '', 'output': 'def add(a, b):\n",
    "    return a + b'},\n",
    "        {'instruction': 'Rewrite in a more concise way', 'input': 'Artificial intelligence is changing the world.', 'output': 'AI is changing the world.'},\n",
    "        {'instruction': 'What is Ascend 910B?', 'input': '', 'output': 'Ascend 910B is an AI accelerator chip designed by Huawei for deep learning training and inference.'},\n",
    "    ]\n",
    "    with open(RAW_DATA_FILE, 'w', encoding='utf-8') as f:\n",
    "        for item in sample_data:\n",
    "            f.write(json.dumps(item, ensure_ascii=False) + '\n",
    "')\n",
    "    preview_records = sample_data[:3]\n",
    "    print(f'Sample dataset created: {RAW_DATA_FILE}')\n",
    "    print(f'{len(sample_data)} samples total')\n",
    "\n",
    "print('\n",
    "Data preview:')\n",
    "for item in preview_records:\n",
    "    inp = f' {item[\"input\"]}' if item.get('input') else ''\n",
    "    print(f'  Q: {item[\"instruction\"][:80]}{inp[:40]}')\n",
    "    print(f'  A: {str(item[\"output\"])[:80]}')\n"
   ],
   "id": "d16a7a6cad0a1069"
  },
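  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "When substituting your own JSONL, a quick structural check before preprocessing can save a failed run. A minimal sketch against the file written above:\n",
    "\n",
    "```python\n",
    "# Verify every line parses as JSON and carries the three Alpaca fields.\n",
    "import json\n",
    "with open(RAW_DATA_FILE, encoding='utf-8') as f:\n",
    "    for lineno, line in enumerate(f, 1):\n",
    "        record = json.loads(line)\n",
    "        missing = {'instruction', 'input', 'output'} - record.keys()\n",
    "        assert not missing, f'line {lineno} is missing fields: {missing}'\n",
    "print('JSONL records look well-formed')\n",
    "```\n"
   ],
   "id": "2e6a93b14fd07c85"
  },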
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Verify the Bundled Source Tree\n",
    "\n",
    "The image keeps the `MindSpeed-Core-MS`, `MindSpeed-LLM`, `MindSpeed`, `MSAdapter`, and `Megatron-LM` source trees in the same layout used by the official `MindSpeed-Core-MS/tests/scripts/set_path.sh`, so the notebook does not need to clone extra repositories.\n",
    "\n",
    "The bundled source tree lives under `/opt/app-root/share/MindSpeed-Core-MS`, while `/opt/app-root/src` remains available for the workbench PVC, models, datasets, and training outputs.\n",
    "\n",
    "This cell only performs directory checks and smoke tests for the official entry points. It does not modify the bundled source tree at notebook runtime.\n"
   ],
   "id": "851e67430d5ed824"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "WORK_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "repos = [\n",
    "    ('MindSpeed-Core-MS', MINDSPEED_CORE_MS_PATH),\n",
    "    ('MindSpeed-LLM', MINDSPEED_LLM_DIR),\n",
    "    ('MindSpeed', MINDSPEED_DIR),\n",
    "    ('MSAdapter', MSADAPTER_DIR),\n",
    "    ('Megatron-LM', MEGATRON_DIR),\n",
    "]\n",
    "\n",
    "print('Bundled source tree:')\n",
    "for name, repo_dir in repos:\n",
    "    exists = repo_dir.exists()\n",
    "    print(f'  [{name}] {repo_dir}: {\"OK\" if exists else \"missing\"}')\n",
    "    assert exists, f'Bundled source tree missing: {repo_dir}'\n",
    "\n",
    "script_checks = [\n",
    "    ('Official weight conversion script', OFFICIAL_CONVERT_SCRIPT),\n",
    "    ('Official MindSpore conversion entry', CONVERT_CKPT_ENTRY),\n",
    "    ('Official instruction data script', OFFICIAL_PREPROCESS_SCRIPT),\n",
    "    ('Data preprocessing entry', MINDSPEED_LLM_DIR / 'preprocess_data.py'),\n",
    "    ('Official fine-tune script', OFFICIAL_TUNE_SCRIPT),\n",
    "    ('Fine-tuning training entry', MINDSPEED_LLM_DIR / 'posttrain_gpt.py'),\n",
    "]\n",
    "\n",
    "print('\n",
    "Official scripts and entry points:')\n",
    "for name, script_path in script_checks:\n",
    "    exists = script_path.exists()\n",
    "    print(f'  [{name}] {script_path}: {\"OK\" if exists else \"missing\"}')\n",
    "    assert exists, f'Required script missing: {script_path}'\n",
    "\n",
    "repo_smoke_cmd = ' && '.join([\n",
    "    f'cd {q(MINDSPEED_LLM_DIR)}',\n",
    "    f'python {q(CONVERT_CKPT_ENTRY)} --load-model-type hf --help >/dev/null',\n",
    "    'python ./preprocess_data.py --help >/dev/null',\n",
    "    'python ./posttrain_gpt.py --help >/dev/null',\n",
    "])\n",
    "run_cmd(\n",
    "    repo_smoke_cmd,\n",
    "    cwd=MINDSPEED_LLM_DIR,\n",
    "    step_name='repo-smoke-check',\n",
    "    log_file=PRECHECK_LOG_DIR / 'repo_smoke_check.log',\n",
    ")\n",
    "print(f'Repository entry-point check log: {PRECHECK_LOG_DIR / \"repo_smoke_check.log\"}')\n",
    "print('\n",
    "Source tree checks passed!')\n"
   ],
   "id": "24f4339feef7faac"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Convert HF Weights to MindSpeed/Mcore Format\n",
    "\n",
    "Convert HuggingFace Qwen3-0.6B weights into the Mcore checkpoint format required by the MindSpore training flow. The first conversion usually takes a few minutes.\n",
    "\n",
    "> If conversion hits device-side OOM, refer to the upstream quick start and run the conversion on the CPU side instead.\n"
   ],
   "id": "1e08be6c6932204f"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "MCORE_WEIGHTS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "conversion_marker = MCORE_WEIGHTS_DIR / 'latest_checkpointed_iteration.txt'\n",
    "iter_dirs = sorted(MCORE_WEIGHTS_DIR.glob('iter_*'))\n",
    "converted = conversion_marker.exists() and bool(iter_dirs)\n",
    "\n",
    "if converted:\n",
    "    print(f'Weights already exist, skipping conversion: {MCORE_WEIGHTS_DIR}')\n",
    "    print(f'Latest checkpoint marker: {conversion_marker.read_text().strip()}')\n",
    "else:\n",
    "    convert_args = [\n",
    "        '--use-mcore-models',\n",
    "        '--model-type', 'GPT',\n",
    "        '--load-model-type', 'hf',\n",
    "        '--save-model-type', 'mg',\n",
    "        '--target-tensor-parallel-size', str(TP),\n",
    "        '--target-pipeline-parallel-size', str(PP),\n",
    "        '--load-dir', str(HF_MODEL_DIR),\n",
    "        '--save-dir', str(MCORE_WEIGHTS_DIR),\n",
    "        '--tokenizer-model', str(HF_MODEL_DIR / 'tokenizer.json'),\n",
    "        '--params-dtype', 'bf16',\n",
    "        '--model-type-hf', 'qwen3',\n",
    "        '--ai-framework', 'mindspore',\n",
    "    ]\n",
    "    if MODEL_SPEC:\n",
    "        convert_args.extend(['--spec', *MODEL_SPEC.split()])\n",
    "\n",
    "    convert_cmd = ' && '.join([\n",
    "        f'cd {q(MINDSPEED_LLM_DIR)}',\n",
    "        'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
    "        ' '.join(['python', q(CONVERT_CKPT_ENTRY), *[q(arg) for arg in convert_args]]),\n",
    "    ])\n",
    "    print('Converting weights through the official entry point...')\n",
    "    run_cmd(\n",
    "        convert_cmd,\n",
    "        cwd=MINDSPEED_LLM_DIR,\n",
    "        step_name='weight-convert',\n",
    "        log_file=LOGS_DIR / 'convert_qwen3_0.6b_ms.log',\n",
    "    )\n",
    "    print('Weight conversion completed!')\n",
    "\n",
    "print('\n",
    "Converted checkpoint files:')\n",
    "for p in sorted(MCORE_WEIGHTS_DIR.glob('*'))[:20]:\n",
    "    print(f'  {p.name}')\n",
    "\n",
    "assert conversion_marker.exists(), f'Checkpoint marker file not found: {conversion_marker}'\n",
    "assert any(MCORE_WEIGHTS_DIR.glob('iter_*')), f'Converted iter_* checkpoint not found: {MCORE_WEIGHTS_DIR}'\n"
   ],
   "id": "c11053fb769c124f"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Data Preprocessing\n",
    "\n",
    "Convert Alpaca-formatted instruction data into the packed binary format required by the MindSpore Qwen3 fine-tuning flow.\n"
   ],
   "id": "8b4eb7214878fee9"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "preprocess_cmd = ' && '.join([\n",
    "    f'cd {q(MINDSPEED_LLM_DIR)}',\n",
    "    f'mkdir -p {q(DATA_DIR)}',\n",
    "    'python ./preprocess_data.py'\n",
    "    f' --input {q(RAW_DATA_FILE)}'\n",
    "    f' --tokenizer-name-or-path {q(HF_MODEL_DIR)}'\n",
    "    f' --output-prefix {q(PROCESSED_DATA_PREFIX)}'\n",
    "    f' --handler-name {HANDLER_NAME}'\n",
    "    f' --tokenizer-type {TOKENIZER_TYPE}'\n",
    "    ' --workers 4'\n",
    "    ' --log-interval 1'\n",
    "    f' --enable-thinking {ENABLE_THINKING}'\n",
    "    f' --prompt-type {PROMPT_TYPE}',\n",
    "])\n",
    "\n",
    "print('Preprocessing data...')\n",
    "run_cmd(\n",
    "    preprocess_cmd,\n",
    "    cwd=MINDSPEED_LLM_DIR,\n",
    "    step_name='data-preprocess',\n",
    "    log_file=LOGS_DIR / 'preprocess_qwen3_0.6b_ms.log',\n",
    ")\n",
    "\n",
    "expected_outputs = [\n",
    "    DATA_DIR / 'alpaca_packed_attention_mask_document.bin',\n",
    "    DATA_DIR / 'alpaca_packed_attention_mask_document.idx',\n",
    "    DATA_DIR / 'alpaca_packed_input_ids_document.bin',\n",
    "    DATA_DIR / 'alpaca_packed_input_ids_document.idx',\n",
    "    DATA_DIR / 'alpaca_packed_labels_document.bin',\n",
    "    DATA_DIR / 'alpaca_packed_labels_document.idx',\n",
    "]\n",
    "\n",
    "print('\n",
    "Preprocessing output files:')\n",
    "for f in expected_outputs:\n",
    "    size_kb = f.stat().st_size / 1024 if f.exists() else 0\n",
    "    print(f'  {f.name}: {\"OK\" if f.exists() else \"missing\"} ({size_kb:.1f} KB)')\n",
    "\n",
    "missing = [str(f) for f in expected_outputs if not f.exists()]\n",
    "assert not missing, f'Preprocessing output files were not generated: {missing}'\n",
    "print('\n",
    "Data preprocessing completed!')\n"
   ],
   "id": "1bbde65b273d888a"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Launch Fine-Tuning\n",
    "\n",
    "Run full-parameter SFT for Qwen3-0.6B with the MindSpore backend. Training logs stream to the notebook.\n",
    "\n",
    "> The current configuration `SEQ_LENGTH=2048, TRAIN_ITERS=100, MBS=2` targets 2x 910B 32G NPUs. For larger-scale training, increase `TRAIN_ITERS`, `SEQ_LENGTH`, and the dataset size.\n"
   ],
   "id": "66219c9355e9d630"
  },
  {
   "metadata": {},
   "cell_type": "code",
   "outputs": [],
   "execution_count": null,
   "source": [
    "LOGS_DIR.mkdir(parents=True, exist_ok=True)\n",
    "OUTPUT_DIR.mkdir(parents=True, exist_ok=True)\n",
    "\n",
    "WORLD_SIZE = nproc * NNODES\n",
    "DP = max(1, nproc // (TP * PP))\n",
    "GBS = DP * MBS\n",
    "\n",
    "train_log_file = LOGS_DIR / 'tune_qwen3_0.6b_ms.log'\n",
    "\n",
    "env = ' && '.join([\n",
    "    f'cd {q(MINDSPEED_LLM_DIR)}',\n",
    "    'export CUDA_DEVICE_MAX_CONNECTIONS=1',\n",
    "    'export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True',\n",
    "    f'mkdir -p {q(LOGS_DIR)}',\n",
    "    f'mkdir -p {q(OUTPUT_DIR)}',\n",
    "])\n",
    "\n",
    "distributed = ' '.join([\n",
    "    'msrun',\n",
    "    f'--local_worker_num {nproc}',\n",
    "    f'--worker_num {WORLD_SIZE}',\n",
    "    f'--node_rank {NODE_RANK}',\n",
    "    f'--master_addr {MASTER_ADDR}',\n",
    "    f'--master_port {MASTER_PORT}',\n",
    "    f'--log_dir={q(LOGS_DIR / \"msrun\")}',\n",
    "    '--join=True',\n",
    "    '--cluster_time_out=300',\n",
    "])\n",
    "\n",
    "model_args = ' '.join([\n",
    "    '--use-mcore-models',\n",
    "    f'--tensor-model-parallel-size {TP}',\n",
    "    f'--pipeline-model-parallel-size {PP}',\n",
    "    '--sequence-parallel',\n",
    "    f'--spec {MODEL_SPEC}',\n",
    "    '--kv-channels 128',\n",
    "    '--qk-layernorm',\n",
    "    '--use-flash-attn',\n",
    "    '--num-layers 28',\n",
    "    '--hidden-size 1024',\n",
    "    '--use-rotary-position-embeddings',\n",
    "    '--num-attention-heads 16',\n",
    "    '--ffn-hidden-size 3072',\n",
    "    '--max-position-embeddings 32768',\n",
    "    f'--seq-length {SEQ_LENGTH}',\n",
    "    '--make-vocab-size-divisible-by 1',\n",
    "    '--padded-vocab-size 151936',\n",
    "    '--rotary-base 1000000',\n",
    "    '--disable-bias-linear',\n",
    "    '--swiglu',\n",
    "    '--tokenizer-type PretrainedFromHF',\n",
    "    f'--tokenizer-name-or-path {q(HF_MODEL_DIR)}',\n",
    "    '--normalization RMSNorm',\n",
    "    '--position-embedding-type rope',\n",
    "    '--norm-epsilon 1e-6',\n",
    "    '--hidden-dropout 0',\n",
    "    '--attention-dropout 0',\n",
    "    '--no-gradient-accumulation-fusion',\n",
    "    '--attention-softmax-in-fp32',\n",
    "    '--exit-on-missing-checkpoint',\n",
    "    '--no-masked-softmax-fusion',\n",
    "    '--group-query-attention',\n",
    "    '--num-query-groups 8',\n",
    "    '--seed 42',\n",
    "    '--bf16',\n",
    "    '--transformer-impl local',\n",
    "    '--ckpt-format msadapter',\n",
    "])\n",
    "\n",
    "train_args = ' '.join([\n",
    "    f'--train-iters {TRAIN_ITERS}',\n",
    "    f'--micro-batch-size {MBS}',\n",
    "    f'--global-batch-size {GBS}',\n",
    "    f'--lr {LR}',\n",
    "    f'--min-lr {MIN_LR}',\n",
    "    '--weight-decay 1e-1',\n",
    "    '--lr-warmup-fraction 0.01',\n",
    "    '--clip-grad 1.0',\n",
    "    '--adam-beta1 0.9',\n",
    "    '--adam-beta2 0.95',\n",
    "    '--no-load-optim',\n",
    "    '--no-load-rng',\n",
    "])\n",
    "\n",
    "data_args = ' '.join([\n",
    "    f'--data-path {q(DATA_PATH)}',\n",
    "    '--split 100,0,0',\n",
    "])\n",
    "\n",
    "output_args = ' '.join([\n",
    "    '--log-interval 1',\n",
    "    f'--save-interval {TRAIN_ITERS}',\n",
    "    f'--eval-interval {TRAIN_ITERS}',\n",
    "    '--eval-iters 0',\n",
    "    '--log-throughput',\n",
    "])\n",
    "\n",
    "tune_args = ' '.join([\n",
    "    '--finetune',\n",
    "    '--stage sft',\n",
    "    '--is-instruction-dataset',\n",
    "    '--prompt-type qwen3',\n",
    "    '--no-pad-to-seq-lengths',\n",
    "])\n",
    "\n",
    "cmd = (\n",
    "    f'{env} && {distributed} posttrain_gpt.py '\n",
    "    f'{model_args} {train_args} {data_args} {output_args} {tune_args} '\n",
    "    f'--distributed-backend {DISTRIBUTED_BACKEND} '\n",
    "    f'--load {q(MCORE_WEIGHTS_DIR)} '\n",
    "    f'--save {q(OUTPUT_DIR)} '\n",
    "    f'--ai-framework mindspore '\n",
    ")\n",
    "\n",
    "print(f'Training config: {nproc} NPU, TP={TP}, PP={PP}, DP={DP}, GBS={GBS}')\n",
    "print(f'SEQ={SEQ_LENGTH}, ITERS={TRAIN_ITERS}, MASTER={MASTER_ADDR}:{MASTER_PORT}')\n",
    "print(f'Log file: {train_log_file}')\n",
    "print('\n",
    "Starting fine-tuning...\n",
    "')\n",
    "run_cmd(\n",
    "    cmd,\n",
    "    cwd=MINDSPEED_LLM_DIR,\n",
    "    step_name='train',\n",
    "    log_file=train_log_file,\n",
    ")\n",
    "print(f'\n",
    "Fine-tuning completed! Weights saved to: {OUTPUT_DIR}')\n"
   ],
   "id": "3dd57cef5ae58e69"
  },
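  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a quick read on training progress without scrolling the full output, you can filter the tee'd log for loss lines. A minimal sketch, assuming Megatron-style iteration lines containing `lm loss` (the exact log format is an assumption):\n",
    "\n",
    "```python\n",
    "# Print the last few iteration lines that report a loss value.\n",
    "loss_lines = [l for l in train_log_file.read_text(errors='ignore').splitlines() if 'lm loss' in l]\n",
    "for line in loss_lines[-5:]:\n",
    "    print(line)\n",
    "```\n"
   ],
   "id": "4f7b2c85e09da316"
  },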
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 7. Validate Outputs\n"
   ],
   "id": "b765e7120b74eab3"
  },
  {
   "cell_type": "code",
   "metadata": {},
   "execution_count": null,
   "outputs": [],
   "source": [
    "log_file = LOGS_DIR / 'tune_qwen3_0.6b_ms.log'\n",
    "latest_marker = OUTPUT_DIR / 'latest_checkpointed_iteration.txt'\n",
    "iter_dirs = sorted(OUTPUT_DIR.glob('iter_*'))\n",
    "\n",
    "print(f'Output directory: {OUTPUT_DIR}')\n",
    "print(f'Log file: {log_file}')\n",
    "\n",
    "assert OUTPUT_DIR.exists() and any(OUTPUT_DIR.iterdir()), f'No artifacts found in output directory: {OUTPUT_DIR}'\n",
    "\n",
    "if latest_marker.exists():\n",
    "    print(f'Latest checkpoint marker: {latest_marker.read_text().strip()}')\n",
    "else:\n",
    "    print('Latest checkpoint marker not found, listing all checkpoint artifacts instead.')\n",
    "\n",
    "if iter_dirs:\n",
    "    print('\n",
    "Checkpoint iteration directories:')\n",
    "    for d in iter_dirs:\n",
    "        print(f'  {d.name}')\n",
    "\n",
    "print('\n",
    "Output artifacts (first 20 entries):')\n",
    "for p in sorted(OUTPUT_DIR.rglob('*'))[:20]:\n",
    "    if p.is_file():\n",
    "        print(f'  {p.relative_to(OUTPUT_DIR)} ({p.stat().st_size / 1024:.1f} KB)')\n",
    "    else:\n",
    "        print(f'  {p.relative_to(OUTPUT_DIR)}/')\n",
    "\n",
    "assert log_file.exists(), f'Training log was not generated: {log_file}'\n",
    "print('\n",
    "Validation passed!')\n"
   ],
   "id": "ae0cdda33955c5d9"
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Using a Real Dataset\n",
    "\n",
    "Once validation passes, you can switch to a real dataset:\n",
    "\n",
    "1. **Prepare the data**\n",
    "   - Place a parquet dataset at `ALPACA_PARQUET`, or write Alpaca-formatted JSONL to `RAW_DATA_FILE`.\n",
    "   - Each record should contain `instruction`, `input`, and `output` fields.\n",
    "\n",
    "2. **Adjust the training scale**\n",
    "   - Increase `SEQ_LENGTH` gradually based on available accelerator memory and target context length.\n",
    "   - Increase `TRAIN_ITERS` based on dataset size.\n",
    "   - Adjust `MBS` to match available memory.\n",
    "\n",
    "3. **Re-convert weights when TP/PP changes**\n",
    "   - The `MCORE_WEIGHTS_DIR` path includes `TP` and `PP` to prevent accidental reuse of stale converted weights.\n",
    "\n",
    "4. **Adjust instruction preprocessing behavior**\n",
    "   - The upstream example uses `ENABLE_THINKING=true` by default.\n",
    "   - Change `ENABLE_THINKING` if your dataset or prompt style needs different behavior.\n",
    "\n",
    "5. **Multi-node training**\n",
    "   - Update `MASTER_ADDR`, `MASTER_PORT`, `NNODES`, and `NODE_RANK`, then rerun the training cell.\n",
    "\n",
    "This notebook only validates checkpoint generation. Add inference steps later after the target image and upstream scripts provide a stable MindSpore inference path.\n"
   ],
   "id": "8c3896d42381d177"
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "name": "python",
   "version": "3.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
