Skip to content

feat(BA-6242): give custom runtime variant a default health check#11863

Merged
HyeockJinKim merged 8 commits into
mainfrom
feat/BA-6242
Jun 5, 2026
Merged

feat(BA-6242): give custom runtime variant a default health check#11863
HyeockJinKim merged 8 commits into
mainfrom
feat/BA-6242

Conversation

@seedspirit
Copy link
Copy Markdown
Contributor

@seedspirit seedspirit commented May 29, 2026

Summary

  • Seed the custom runtime variant's default_model_definition with a baseline health_check (path /health, interval 10s, max_retries 10, initial_delay 1800s) via alembic migration ed42bc179b91.
  • Previously custom held {"models": null}, so deployments that omitted health_check fell back to the 60s schema default and could be marked unhealthy while large models load (up to ~30 min). 1800s matches the prebuilt variants.
  • Merge stays field-wise: a user's model-definition.yaml / request still overrides each field, so this is purely a fallback layer. Operator-customised rows are left untouched (WHERE-guarded UPDATE, idempotent).
  • Kept both fixtures (fixtures/manager/ and src/ai/backend/install/fixtures/) in sync with the migration.

Test plan

  • pants fmt/lint/check on the migration
  • Real alembic run on a throwaway DB: 0113c63f3261 → ed42bc179b91 seeds the def; downgrade -1 reverts to {"models": null}; re-upgrade head is idempotent
  • Both fixtures parse as valid JSON
  • CI

Resolves BA-6242


📚 Documentation preview 📚: https://sorna--11863.org.readthedocs.build/en/11863/


📚 Documentation preview 📚: https://sorna-ko--11863.org.readthedocs.build/ko/11863/

@github-actions github-actions Bot added size:L 100~500 LoC comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated labels May 29, 2026
seedspirit added a commit that referenced this pull request May 29, 2026
@seedspirit seedspirit requested a review from HyeockJinKim May 29, 2026 08:23
@seedspirit seedspirit marked this pull request as ready for review May 29, 2026 08:57
@seedspirit seedspirit requested a review from a team as a code owner May 29, 2026 08:57
Copilot AI review requested due to automatic review settings May 29, 2026 08:57
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Seeds the custom runtime variant’s default_model_definition with a baseline health_check so deployments that omit health_check don’t inherit the strict 60s schema default and get marked unhealthy while large models are still loading.

Changes:

  • Add an Alembic data migration to update the custom runtime variant’s default_model_definition from {"models": null} to a draft containing service.health_check defaults (guarded to avoid overwriting operator customizations).
  • Update both runtime-variant fixture JSON files to match the new custom default.
  • Add a towncrier news fragment describing the behavior change.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/ai/backend/manager/models/alembic/versions/ed42bc179b91_set_custom_runtime_variant_default_definition.py Adds an idempotent, guarded UPDATE to seed custom’s default model-definition draft with a long initial_delay health check.
src/ai/backend/install/fixtures/example-runtime-variants.json Updates the installer fixture so fresh installs include the new custom baseline health check.
fixtures/manager/example-runtime-variants.json Keeps the Manager-side fixture in sync with the installer fixture for the custom default.
changes/11863.feature.md Documents the new default health check behavior in release notes.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread changes/11863.feature.md Outdated
@@ -0,0 +1 @@
Apply a default health check (/health, 1800s initial delay) to the custom runtime variant so model deployments no longer fail health checks while large models load
Copy link
Copy Markdown
Member

@fregataa fregataa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

health check field is nullable and when it is null then no health check runs

seedspirit and others added 3 commits June 4, 2026 10:18
Seed the custom variant's default_model_definition with a baseline
health_check (path /health, initial_delay 1800s) so deployments that
omit it no longer fall back to the 60s schema default and fail health
checks while large models load. User model-definition.yaml / request
values still override field-by-field. Migration + both fixtures kept
in sync.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…flag

Add an `enable` flag (default False) to ModelHealthCheck across the config
model, v2 DTOs, GraphQL type/input, adapter, and runtime-variant fixtures.
health_check_config() now gates on `enable` instead of presence, and a
model-definition file that declares a health_check is treated as enabled
via ModelDefinitionDraft.from_file_payload. Re-parent the custom runtime
variant seed migration onto eb9d9c018e85 and seed a disabled default
health check so custom deployments can opt in later.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@github-actions github-actions Bot added the comp:common Related to Common component label Jun 4, 2026
Co-authored-by: octodog <mu001@lablup.com>
@github-actions github-actions Bot added the area:docs Documentations label Jun 4, 2026
@seedspirit seedspirit requested review from a team and fregataa June 4, 2026 08:04
Comment thread src/ai/backend/common/dto/manager/v2/deployment/request.py Outdated
Comment thread src/ai/backend/manager/api/gql/deployment/types/revision.py Outdated
if not changed:
return draft
return draft.model_copy(update={"models": new_models})

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this draft resolved?

seedspirit and others added 3 commits June 4, 2026 19:00
Per review, the creation input boolean does not need to be nullable.
ModelHealthCheckInput.enable and ModelHealthCheckInputGQL.enable now
default to False instead of None. The draft type stays nullable so merge
semantics are unaffected (to_draft uses exclude_unset).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…licit opt-out

An empty `health_check:` (null or {}) in a model-definition.yaml now resolves to
a disabled override instead of leaking the runtime-variant/preset baseline. A
non-empty block still opts in (enable defaults to True). Normalization runs on
the canonical-name dump so it is correct for both snake_case and kebab-case keys.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Comment thread fixtures/manager/example-runtime-variants.json Outdated
…riants

Built-in variants ship a known health_check endpoint, so they should be
health-checked by default (enable=true); only `custom` stays opt-in (false).
Corrects the example fixtures and folds the existing-DB backfill into the
ed42bc179b91 health-check opt-in migration.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@HyeockJinKim HyeockJinKim merged commit 11508ba into main Jun 5, 2026
36 checks passed
@HyeockJinKim HyeockJinKim deleted the feat/BA-6242 branch June 5, 2026 07:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:docs Documentations comp:common Related to Common component comp:manager Related to Manager component require:db-migration Automatically set when alembic migrations are added or updated size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants