Lab reset playbook: what we automate overnight
We treat every shared cluster like a rehearsal stage: if the props drift, the story stops making sense. Nightly jobs rebuild namespaces from pinned manifests so learners always start from the same baseline described in the syllabus.
The automation intentionally avoids clever mutations. Instead, it reapplies a known-good bundle, then runs a smoke suite that mirrors the first exercise in each active cohort. When a check fails, the job opens an internal ticket with logs—never silent repair.
We document the reset contract in the learner handbook so advanced students know which directories persist across sessions and which are ephemeral. That transparency reduces support pings about “mysteriously missing” PVCs.
Finally, we snapshot anonymized timing data to see which modules stress the cluster most. That informs when we add throttling or split cohorts, keeping wall-clock lab time honest without promising infinite capacity.