diff --git a/README.fr.md b/README.fr.md index 6db14ae..1e82afe 100644 --- a/README.fr.md +++ b/README.fr.md @@ -100,7 +100,7 @@ Gaspillage minimum estimé : ~$25 944/mois - Détecte le gaspillage IA/ML coûteux : SageMaker, AML, Vertex AI — ressources GPU signalées comme candidats à risque plus élevé (500–23 000 $/mois) - Fonctionne sur AWS, Azure et GCP en un seul outil - S'exécute entièrement dans votre environnement — aucun agent, pas de SaaS, aucun credential stocké -- 46 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC +- 47 règles de détection sélectives et haut signal, conçues pour éviter les faux positifs en environnements IaC - Prêt pour CI/CD — codes de sortie d'application + sorties JSON/CSV/markdown ### Ce que CleanCloud ne fait PAS @@ -151,6 +151,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l | Endpoint SageMaker (GPU) | 500 – 23 000 $ / mois | | Instance Notebook SageMaker (GPU) | 500 – 23 000+ $ / mois | | Studio Apps SageMaker (KernelGateway/JupyterLab/CodeEditor) | 42 – 1 600+ $ / mois | +| Domaine SageMaker (stockage EFS inactif) | Charges EFS continues | | Training Job SageMaker (job GPU runaway/bloqué) | 670 – 2 360+ $ / jour | | Cluster AML Compute Azure (GPU) | 600 – 15 000 $ / mois | | Instance de calcul Azure ML (GPU) | 600 – 15 000+ $ / mois | @@ -165,7 +166,7 @@ L'infrastructure IA/ML inactive est la source de gaspillage cloud invisible à l CleanCloud détecte les endpoints à zéro invocation / zéro prédiction, l'activité de contrôle inactive sur les notebooks et apps managés, ainsi que les training jobs managés anormalement longs sur les 3 clouds. Les outils natifs montrent la facture — ils ne nomment pas la ressource concrète à examiner. 
```bash -cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + Studio apps SageMaker + training jobs SageMaker + EC2 GPU +cleancloud scan --provider aws --category ai # PTUs Bedrock + endpoints + notebooks + domaines + Studio apps SageMaker + training jobs SageMaker + EC2 GPU cleancloud scan --provider azure --category ai # clusters AML + instances ML + endpoints en ligne + AI Search + PTUs OpenAI cleancloud scan --provider gcp --category ai # endpoints Vertex AI + Workbench + training jobs + Cloud TPU + Feature Stores cleancloud scan --provider aws --category all # hygiène + IA/ML ensemble @@ -432,7 +433,7 @@ Oui. CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud ## Ce que CleanCloud détecte -46 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC. +47 règles pour AWS, Azure et GCP — conservatrices, haut signal, conçues pour éviter les faux positifs en environnements IaC. **AWS :** - Compute : instances arrêtées 30+ jours (charges EBS continuent) @@ -441,7 +442,7 @@ Oui. 
CleanCloud n'a besoin d'accès réseau qu'aux endpoints API de votre cloud - Plateforme : instances RDS inactives (HIGH) - Observabilité : logs CloudWatch à rétention infinie - Gouvernance : ressources sans tags, security groups inutilisés -- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h +- IA/ML *(opt-in : `--category ai`)* : Bedrock Provisioned Throughput (Model Units) inactifs avec zéro invocation depuis 7+ jours ; endpoints SageMaker sans trafic `InvokeEndpoint` observé depuis 14+ jours ; instances Notebook SageMaker avec timestamps de contrôle inactifs depuis 14+ jours ; Domaines SageMaker sans apps en cours d'exécution sur tous les profils et espaces depuis 30+ jours (coût de stockage EFS continu) ; Studio Apps SageMaker (`KernelGateway`/`JupyterLab`/`CodeEditor`) sans signal d'activité récent exploitable depuis 7+ jours ; training jobs SageMaker toujours `InProgress` au-delà du seuil de 24h **Azure :** - Compute : VMs arrêtées (non désallouées) (HIGH) diff --git a/README.md b/README.md index dc77bdc..53a45a2 100644 --- a/README.md +++ b/README.md @@ -151,6 +151,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend | SageMaker endpoint (GPU) | $500 – $23,000 / month | | SageMaker Notebook Instance (GPU) | $500 – $23,000+ / month | | SageMaker Studio Apps (KernelGateway/JupyterLab/CodeEditor) | $42 – $1,600+ / month | +| SageMaker Domain (idle EFS storage) | Continuous EFS charges | | SageMaker Training Job (runaway/hung GPU job) | $670 – $2,360+ / day | | Azure AML compute cluster (GPU) | $600 – 
$15,000 / month | | Azure ML Compute Instance (GPU) | $600 – $15,000+ / month | @@ -165,7 +166,7 @@ Idle AI/ML infrastructure is the fastest-growing source of invisible cloud spend CleanCloud detects zero-invocation / zero-prediction endpoints, stale managed notebook and app activity, and long-running managed training jobs across all three clouds. Native cost tools show the bill — they do not name the specific resource to review. ```bash -cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + Studio apps + training jobs + idle GPU EC2 +cleancloud scan --provider aws --category ai # Bedrock PTUs + SageMaker endpoints + notebooks + domains + Studio apps + training jobs + idle GPU EC2 cleancloud scan --provider azure --category ai # AML compute + ML instances + online endpoints + AI Search + OpenAI PTUs cleancloud scan --provider gcp --category ai # Vertex AI endpoints + Workbench + training jobs + Cloud TPU + Feature Stores cleancloud scan --provider aws --category all # hygiene + AI/ML together @@ -432,7 +433,7 @@ Yes. CleanCloud only needs network access to your cloud provider's API endpoints ## What CleanCloud Detects -46 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments. +47 rules across AWS, Azure, and GCP — conservative, high-signal, designed to avoid false positives in IaC environments. **AWS:** - Compute: stopped instances 30+ days (EBS charges continue) @@ -441,7 +442,7 @@ Yes. 
CleanCloud only needs network access to your cloud provider's API endpoints - Platform: idle RDS instances (HIGH) - Observability: infinite retention CloudWatch Logs - Governance: untagged resources, unused security groups -- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold +- AI/ML *(opt-in: `--category ai`)*: idle Bedrock Provisioned Throughput (Model Units) with zero invocations 7+ days; idle SageMaker endpoints with no observed `InvokeEndpoint` traffic 14+ days; SageMaker Notebook Instances with stale control-plane timestamps 14+ days; SageMaker Domains with no running apps across all user profiles and spaces 30+ days (continuous EFS storage cost); SageMaker Studio apps (`KernelGateway`/`JupyterLab`/`CodeEditor`) with no usable recent activity signal 7+ days; SageMaker training jobs still `InProgress` beyond the 24h threshold **Azure:** - Compute: stopped (not deallocated) VMs (HIGH) diff --git a/cleancloud/doctor/aws.py b/cleancloud/doctor/aws.py index 69e9ba8..dc12fea 100644 --- a/cleancloud/doctor/aws.py +++ b/cleancloud/doctor/aws.py @@ -751,7 +751,30 @@ def run_aws_ai_doctor(profile: Optional[str], region: Optional[str] = None) -> N permissions_failed.append(("sagemaker:DescribeNotebookInstance", str(e))) warn(f"sagemaker:DescribeNotebookInstance - {e}") - # --- sagemaker:ListApps (aws.sagemaker.studio_app.idle) --- + # --- sagemaker:ListDomains + sagemaker:DescribeDomain (aws.sagemaker.domain.idle) --- + try: + sagemaker.list_domains(MaxResults=1) + permissions_tested.append("sagemaker:ListDomains") + success("sagemaker:ListDomains") + except Exception as e: + 
permissions_failed.append(("sagemaker:ListDomains", str(e))) + warn(f"sagemaker:ListDomains - {e}") + + try: + # DescribeDomain — attempt only if a domain exists to avoid a spurious miss + _domains = sagemaker.list_domains(MaxResults=1) + _domain_list = _domains.get("Domains", []) + if _domain_list: + sagemaker.describe_domain(DomainId=_domain_list[0]["DomainId"]) + permissions_tested.append("sagemaker:DescribeDomain") + success("sagemaker:DescribeDomain") + else: + info("sagemaker:DescribeDomain - not tested (no SageMaker domain found to probe)") + except Exception as e: + permissions_failed.append(("sagemaker:DescribeDomain", str(e))) + warn(f"sagemaker:DescribeDomain - {e}") + + # --- sagemaker:ListApps (aws.sagemaker.studio_app.idle + aws.sagemaker.domain.idle) --- try: sagemaker.list_apps(MaxResults=1) permissions_tested.append("sagemaker:ListApps") diff --git a/cleancloud/providers/aws/rules/ai/sagemaker_domain_idle.py b/cleancloud/providers/aws/rules/ai/sagemaker_domain_idle.py new file mode 100644 index 0000000..1cee618 --- /dev/null +++ b/cleancloud/providers/aws/rules/ai/sagemaker_domain_idle.py @@ -0,0 +1,379 @@ +""" +Rule: aws.sagemaker.domain.idle + + (spec — docs/specs/aws/ai/sagemaker_domain_idle.md) + +Intent: + Detect Amazon SageMaker Domains that are InService, old enough to evaluate, + and have no currently running apps across all user profiles and spaces, so + they can be reviewed as potential FinOps cleanup candidates. + + A SageMaker Domain creates a managed EFS file system on first user onboarding. + That file system persists and incurs continuous storage charges regardless of + whether any Studio apps are running. A domain with no active apps represents + wasted EFS cost with no current compute value. + + This is a read-only review-candidate rule — not a delete-safe rule. 
+ +Exclusions: + - DomainArn absent (malformed identity) + - DomainId absent (cannot filter ListApps) + - Status absent or not "InService" + - CreationTime absent, naive, or future + - age_days < idle_days_threshold (too young) + - any app in InService or Pending status + - any app entry with unclassifiable Status (absent or undocumented value) + - DescribeDomain non-permission failure (item-scoped skip) + +Detection: + - InService domain older than idle_days_threshold + - ListApps fully paginated, zero apps in InService or Pending status + +Key rules: + - Signal: control-plane ListApps state (sole trusted activity source) + - LastUserActivityTimestamp explicitly excluded (contaminated by health checks) + - estimated_monthly_cost_usd = None + - Confidence: HIGH always (direct control-plane state) + - Risk: HIGH if HomeEfsFileSystemId present; MEDIUM otherwise + - ListDomains failure → FAIL RULE + - ListApps failure → FAIL RULE + - Permission-denied on any required API → FAIL RULE + - DescribeDomain non-permission failure → SKIP ITEM + - Unclassifiable app Status → SKIP ITEM (domain not emitted) + +APIs: + - sagemaker:ListDomains + - sagemaker:DescribeDomain + - sagemaker:ListApps +""" + +from collections import Counter +from datetime import datetime, timezone +from typing import List, Optional + +import boto3 +from botocore.exceptions import BotoCoreError, ClientError + +from cleancloud.core.confidence import ConfidenceLevel +from cleancloud.core.evidence import Evidence +from cleancloud.core.finding import Finding +from cleancloud.core.risk import RiskLevel + +# --- Module-level constants --- + +_DEFAULT_IDLE_DAYS_THRESHOLD = 30 +_ELIGIBLE_STATUS = "InService" + +# Documented App.Status values — anything else is unclassifiable +_KNOWN_APP_STATUSES = {"Deleted", "Deleting", "Failed", "InService", "Pending"} + +# App statuses that indicate active compute presence +_BILLABLE_APP_STATUSES = {"InService", "Pending"} + +_PERMISSION_ERROR_CODES = ("AccessDenied", 
"UnauthorizedOperation", "AccessDeniedException") + +_FINDING_TITLE = "Idle SageMaker domain review candidate" + +_SIGNALS_NOT_CHECKED = ( + "LastUserActivityTimestamp from DescribeApp was not used as evidence because " + "AWS documents it as updated on health checks, making it unreliable as a " + "user-activity signal", + "A user may start a new app shortly after evaluation; this is a point-in-time check", + "The domain may be intentionally kept active for periodic or scheduled use", + "Deleting apps may transition back to InService if the deletion fails", + "EFS storage cost depends on per-user home directory content; this rule does not " + "inspect directory sizes or file counts", + "Native idle shutdown configuration (AppLifecycleManagement.IdleSettings) is surfaced " + "as context but does not affect eligibility", +) + +RULE_METADATA = { + "id": "aws.sagemaker.domain.idle", + "category": "ai", + "service": "sagemaker", + "cost_impact": "high", +} + + +def _str(value: object) -> Optional[str]: + """Return value as str only when it is a non-empty string; else None.""" + return value if isinstance(value, str) and value else None + + +def _check_idle_shutdown(settings: dict) -> bool: + """Check if idle shutdown is enabled in JupyterLab or CodeEditor app settings.""" + for app_key in ("JupyterLabAppSettings", "CodeEditorAppSettings"): + app_settings = settings.get(app_key, {}) + if not isinstance(app_settings, dict): + continue + lifecycle = app_settings.get("AppLifecycleManagement", {}) + if not isinstance(lifecycle, dict): + continue + idle_settings = lifecycle.get("IdleSettings", {}) + if not isinstance(idle_settings, dict): + continue + if idle_settings.get("LifecycleManagement") == "Enabled": + return True + return False + + +def _normalize_domain(item: object, now_utc: datetime) -> Optional[dict]: + """Normalize a raw ListDomains item to the canonical field shape. 
+ + Returns None when required identity/status/timestamp fields are absent or + invalid — the caller must skip the item. + """ + if not isinstance(item, dict): + return None + + # --- Identity (required; absent → skip) --- + domain_arn = _str(item.get("DomainArn")) + if domain_arn is None: + return None + + domain_id = _str(item.get("DomainId")) + if domain_id is None: + return None + + # --- Status (required; absent → skip) --- + normalized_status = _str(item.get("Status")) + if normalized_status is None: + return None + + # --- CreationTime (required; absent, naive, future → skip) --- + raw_ct = item.get("CreationTime") + if not isinstance(raw_ct, datetime): + return None + if raw_ct.tzinfo is None: + return None + creation_time_utc = raw_ct.astimezone(timezone.utc) + if creation_time_utc > now_utc: + return None + + # --- Derived fields --- + age_days = int((now_utc - creation_time_utc).total_seconds() // 86400) + + # --- Optional context fields --- + domain_name = _str(item.get("DomainName")) + + raw_lmt = item.get("LastModifiedTime") + last_modified_time_utc = None + if isinstance(raw_lmt, datetime) and raw_lmt.tzinfo is not None: + lmt = raw_lmt.astimezone(timezone.utc) + if lmt <= now_utc: + last_modified_time_utc = lmt + + return { + "resource_id": domain_arn, + "domain_arn": domain_arn, + "domain_id": domain_id, + "domain_name": domain_name, + "normalized_status": normalized_status, + "creation_time_utc": creation_time_utc, + "last_modified_time_utc": last_modified_time_utc, + "age_days": age_days, + } + + +def _enrich_domain(describe_response: dict) -> dict: + """Extract enrichment fields from DescribeDomain response.""" + home_efs_id = _str(describe_response.get("HomeEfsFileSystemId")) + home_efs_creation = _str(describe_response.get("HomeEfsFileSystemCreation")) + app_network_access_type = _str(describe_response.get("AppNetworkAccessType")) + auth_mode = _str(describe_response.get("AuthMode")) + + # Check idle shutdown across both DefaultUserSettings 
and DefaultSpaceSettings + idle_shutdown = False + for settings_key in ("DefaultUserSettings", "DefaultSpaceSettings"): + settings = describe_response.get(settings_key, {}) + if isinstance(settings, dict) and _check_idle_shutdown(settings): + idle_shutdown = True + break + + return { + "home_efs_file_system_id": home_efs_id, + "home_efs_file_system_creation": home_efs_creation, + "app_network_access_type": app_network_access_type, + "auth_mode": auth_mode, + "idle_shutdown_configured": idle_shutdown, + } + + +def find_idle_sagemaker_domains( + session: boto3.Session, + region: str, + idle_days_threshold: int = _DEFAULT_IDLE_DAYS_THRESHOLD, +) -> List[Finding]: + sagemaker = session.client("sagemaker", region_name=region) + + # --- Step 1: Validate permission, then fully paginate ListDomains --- + # Pre-flight direct call catches AccessDeniedException reliably even if + # the paginator were to silently return an empty page on permission errors. + try: + sagemaker.list_domains(MaxResults=1) + except ClientError as exc: + if exc.response["Error"]["Code"] in _PERMISSION_ERROR_CODES: + raise PermissionError("Missing required IAM permission: sagemaker:ListDomains") from exc + raise + except BotoCoreError: + raise + + try: + paginator = sagemaker.get_paginator("list_domains") + pages = list(paginator.paginate()) + except ClientError as exc: + if exc.response["Error"]["Code"] in _PERMISSION_ERROR_CODES: + raise PermissionError("Missing required IAM permission: sagemaker:ListDomains") from exc + raise + except BotoCoreError: + raise + + now = datetime.now(timezone.utc) + findings: List[Finding] = [] + + for page in pages: + for raw_item in page.get("Domains", []): + # --- Step 2: Normalize domain summary --- + n = _normalize_domain(raw_item, now) + if n is None: + continue + + # --- Step 3: Exclusion rules --- + if n["normalized_status"] != _ELIGIBLE_STATUS: + continue + + if n["age_days"] < idle_days_threshold: + continue + + # --- Step 4: DescribeDomain enrichment --- + 
try: + describe = sagemaker.describe_domain(DomainId=n["domain_id"]) + except ClientError as exc: + if exc.response["Error"]["Code"] in _PERMISSION_ERROR_CODES: + raise PermissionError( + "Missing required IAM permission: sagemaker:DescribeDomain" + ) from exc + continue # non-permission failure → SKIP ITEM + except BotoCoreError: + continue # transport error → SKIP ITEM + + enrichment = _enrich_domain(describe) + n.update(enrichment) + + # --- Step 6: ListApps for this domain --- + try: + apps_paginator = sagemaker.get_paginator("list_apps") + apps_pages = list(apps_paginator.paginate(DomainIdEquals=n["domain_id"])) + except ClientError as exc: + if exc.response["Error"]["Code"] in _PERMISSION_ERROR_CODES: + raise PermissionError( + "Missing required IAM permission: sagemaker:ListApps" + ) from exc + raise # other ListApps failure → FAIL RULE + except BotoCoreError: + raise # transport failure → FAIL RULE + + # --- Step 7-9: Evaluate app statuses --- + status_counts: Counter = Counter() + skip_domain = False + + for apps_page in apps_pages: + for app_entry in apps_page.get("Apps", []): + # Non-dict entries have no extractable status → + # unclassifiable, handled by the check below. 
+ raw_status = ( + _str(app_entry.get("Status")) if isinstance(app_entry, dict) else None + ) + + # Unclassifiable status → SKIP ITEM + if raw_status is None or raw_status not in _KNOWN_APP_STATUSES: + skip_domain = True + break + + status_counts[raw_status] += 1 + + # Billable app → SKIP ITEM + if raw_status in _BILLABLE_APP_STATUSES: + skip_domain = True + break + + if skip_domain: + break + + if skip_domain: + continue + + # --- Step 10: EMIT --- + total_apps = sum(status_counts.values()) + apps_by_status = dict(status_counts) + + has_efs = n["home_efs_file_system_id"] is not None + risk = RiskLevel.HIGH if has_efs else RiskLevel.MEDIUM + + efs_signal = ( + f"EFS file system {n['home_efs_file_system_id']} incurs continuous " + f"storage charges" + if has_efs + else "No HomeEfsFileSystemId was returned by DescribeDomain" + ) + + signals_used = [ + f"Domain status is '{_ELIGIBLE_STATUS}'", + f"Domain age is {n['age_days']} days, meeting the " + f"{idle_days_threshold}-day threshold (applied to domain age, " + f"not measured inactivity duration)", + "ListApps was fully paginated and found zero apps in InService " "or Pending state", + efs_signal, + ] + + domain_display = n["domain_name"] or n["domain_id"] + + findings.append( + Finding( + provider="aws", + rule_id="aws.sagemaker.domain.idle", + resource_type="aws.sagemaker.domain", + resource_id=n["domain_arn"], + region=region, + estimated_monthly_cost_usd=None, + title=_FINDING_TITLE, + summary=( + f"SageMaker domain {domain_display} is currently InService, " + f"{n['age_days']} days old, and has no running apps" + ), + reason=( + f"InService SageMaker domain is {n['age_days']} days old " + f"and currently has no InService or Pending apps across " + f"all user profiles and spaces" + ), + risk=risk, + confidence=ConfidenceLevel.HIGH, + detected_at=now, + evidence=Evidence( + signals_used=signals_used, + signals_not_checked=list(_SIGNALS_NOT_CHECKED), + time_window=f"{idle_days_threshold} days", + ), + details={ + 
"evaluation_path": "idle-sagemaker-domain-review-candidate", + "domain_arn": n["domain_arn"], + "domain_id": n["domain_id"], + "domain_name": n["domain_name"], + "normalized_status": n["normalized_status"], + "creation_time": n["creation_time_utc"].isoformat(), + "age_days": n["age_days"], + "idle_days_threshold": idle_days_threshold, + "home_efs_file_system_id": n["home_efs_file_system_id"], + "home_efs_file_system_creation": n["home_efs_file_system_creation"], + "app_network_access_type": n["app_network_access_type"], + "auth_mode": n["auth_mode"], + "idle_shutdown_configured": n["idle_shutdown_configured"], + "total_apps_evaluated": total_apps, + "apps_by_status": apps_by_status, + "inservice_app_count": 0, + "pending_app_count": 0, + }, + ) + ) + + return findings diff --git a/cleancloud/providers/aws/scan.py b/cleancloud/providers/aws/scan.py index f54b925..03eead4 100644 --- a/cleancloud/providers/aws/scan.py +++ b/cleancloud/providers/aws/scan.py @@ -12,6 +12,9 @@ find_idle_bedrock_provisioned_throughputs, ) from cleancloud.providers.aws.rules.ai.ec2_gpu_idle import find_idle_gpu_instances +from cleancloud.providers.aws.rules.ai.sagemaker_domain_idle import ( + find_idle_sagemaker_domains, +) from cleancloud.providers.aws.rules.ai.sagemaker_endpoint_idle import ( find_idle_sagemaker_endpoints, ) @@ -68,6 +71,7 @@ AWS_RULE_MAP_AI: Dict[str, Callable] = { "aws.sagemaker.endpoint.idle": find_idle_sagemaker_endpoints, "aws.sagemaker.notebook.idle": find_idle_sagemaker_notebooks, + "aws.sagemaker.domain.idle": find_idle_sagemaker_domains, "aws.ec2.gpu.idle": find_idle_gpu_instances, "aws.bedrock.provisioned_throughput.idle": find_idle_bedrock_provisioned_throughputs, "aws.sagemaker.studio_app.idle": find_idle_sagemaker_studio_apps, diff --git a/deploy/cloudformation/cleancloud-role.yaml b/deploy/cloudformation/cleancloud-role.yaml index 1bc9236..c874957 100644 --- a/deploy/cloudformation/cleancloud-role.yaml +++ b/deploy/cloudformation/cleancloud-role.yaml @@ 
-137,6 +137,8 @@ Resources: - sagemaker:DescribeEndpointConfig - sagemaker:ListNotebookInstances - sagemaker:DescribeNotebookInstance + - sagemaker:ListDomains + - sagemaker:DescribeDomain - sagemaker:ListApps - sagemaker:DescribeApp - sagemaker:ListTrainingJobs diff --git a/deploy/terraform/aws/main.tf b/deploy/terraform/aws/main.tf index 6b769f6..d89083d 100644 --- a/deploy/terraform/aws/main.tf +++ b/deploy/terraform/aws/main.tf @@ -59,6 +59,8 @@ resource "aws_iam_role_policy" "cleancloud_ai" { "sagemaker:DescribeEndpointConfig", "sagemaker:ListNotebookInstances", "sagemaker:DescribeNotebookInstance", + "sagemaker:ListDomains", + "sagemaker:DescribeDomain", "sagemaker:ListApps", "sagemaker:DescribeApp", "sagemaker:ListTrainingJobs", diff --git a/docs/aws.md b/docs/aws.md index 41deb8b..6feccef 100644 --- a/docs/aws.md +++ b/docs/aws.md @@ -309,7 +309,7 @@ For the complete production workflow with enforcement flags, scheduling, and art > |------|----------|---------------| > | `base-readonly.json` | `sts:GetCallerIdentity`, `cloudwatch:GetMetricStatistics` | **Always — every scan, every category** | > | `hygiene-readonly.json` | EC2, RDS, ELB, S3, logs | `--category hygiene` (default) | -> | `ai-readonly.json` | Bedrock Provisioned Throughput, SageMaker endpoints/notebooks/Studio apps/training jobs, EC2 GPU instances, CloudWatch metrics | `--category ai` | +> | `ai-readonly.json` | Bedrock Provisioned Throughput, SageMaker endpoints/notebooks/domains/Studio apps/training jobs, EC2 GPU instances, CloudWatch metrics | `--category ai` | > > `base-readonly.json` must be attached alongside any category file. It provides `sts:GetCallerIdentity` (used at startup and by `doctor` to verify credentials) and shared CloudWatch metric access. Attach `hygiene-readonly.json` for the default scan path, and `ai-readonly.json` for `--category ai`. 
@@ -409,7 +409,7 @@ Attach this policy to your IAM role or user for the default hygiene scan path (c - Safe for production accounts - Compatible with security-reviewed pipelines -For AI/ML scans, also attach [`security/aws/ai-readonly.json`](../security/aws/ai-readonly.json). It adds permissions for Bedrock Provisioned Throughput, SageMaker endpoints, notebook instances, SageMaker Studio apps (`sagemaker:ListApps`, `sagemaker:DescribeApp`), SageMaker training jobs (`sagemaker:ListTrainingJobs`, `sagemaker:DescribeTrainingJob`), EC2 GPU instances, and `cloudwatch:ListMetrics` for GPU metric discovery. +For AI/ML scans, also attach [`security/aws/ai-readonly.json`](../security/aws/ai-readonly.json). It adds permissions for Bedrock Provisioned Throughput, SageMaker endpoints, notebook instances, SageMaker Domains (`sagemaker:ListDomains`, `sagemaker:DescribeDomain`), SageMaker Studio apps (`sagemaker:ListApps`, `sagemaker:DescribeApp`), SageMaker training jobs (`sagemaker:ListTrainingJobs`, `sagemaker:DescribeTrainingJob`), EC2 GPU instances, and `cloudwatch:ListMetrics` for GPU metric discovery. --- @@ -923,7 +923,7 @@ Permissions Tested: 17/17 passed ====================================================================== ``` -**What the AI doctor adds:** Bedrock Provisioned Throughput, SageMaker endpoints, notebook instances, SageMaker Studio apps, SageMaker training jobs, EC2 GPU inventory, and the CloudWatch permissions those AI rules need. Run it before `cleancloud scan --provider aws --category ai`. +**What the AI doctor adds:** Bedrock Provisioned Throughput, SageMaker endpoints, notebook instances, SageMaker Domains, SageMaker Studio apps, SageMaker training jobs, EC2 GPU inventory, and the CloudWatch permissions those AI rules need. Run it before `cleancloud scan --provider aws --category ai`. 
--- diff --git a/docs/rules.md b/docs/rules.md index 2e20540..cad1ad9 100644 --- a/docs/rules.md +++ b/docs/rules.md @@ -1,10 +1,10 @@ # CleanCloud Rules -46 rules across three providers (30 hygiene + 16 AI/ML). +47 rules across three providers (30 hygiene + 17 AI/ML). | Provider | Hygiene | AI/ML | Total | Catalog | |---|---|---|---|---| -| AWS | 13 | 6 | 19 | [rules/aws.md](rules/aws.md) | +| AWS | 13 | 7 | 20 | [rules/aws.md](rules/aws.md) | | Azure | 12 | 5 | 17 | [rules/azure.md](rules/azure.md) | | GCP | 5 | 5 | 10 | [rules/gcp.md](rules/gcp.md) | diff --git a/docs/rules/aws.md b/docs/rules/aws.md index 402a756..814efe1 100644 --- a/docs/rules/aws.md +++ b/docs/rules/aws.md @@ -1,6 +1,6 @@ # AWS Rules -19 rules (13 hygiene + 6 AI/ML). AI/ML rules require `--category ai`. +20 rules (13 hygiene + 7 AI/ML). AI/ML rules require `--category ai`. ← [Back to index](../rules.md) @@ -23,6 +23,7 @@ | `aws.sagemaker.notebook.idle` | AI/ML | SageMaker Notebook Instances with stale activity 14+ days | | `aws.ec2.gpu.idle` | AI/ML | EC2 GPU/accelerator instances with <5% GPU or <10% CPU over 7 days | | `aws.bedrock.provisioned_throughput.idle` | AI/ML | Bedrock Provisioned Throughput with zero invocations 7+ days | +| `aws.sagemaker.domain.idle` | AI/ML | SageMaker Domains 30+ days old with no running apps at scan time (continuous EFS cost) | | `aws.sagemaker.studio_app.idle` | AI/ML | SageMaker Studio apps with no usable activity 7+ days | | `aws.sagemaker.training_job.long_running` | AI/ML | SageMaker training jobs still running beyond threshold | @@ -275,6 +276,19 @@ **Spec:** [specs/aws/ai/bedrock_provisioned_idle.md](../specs/aws/ai/bedrock_provisioned_idle.md) +#### `aws.sagemaker.domain.idle` +**Detects:** `InService` SageMaker Domains at least `idle_days_threshold` days old (domain age, not measured inactivity) with no apps currently in `InService` or `Pending` state across all user profiles and spaces (continuous EFS storage cost) + +**Confidence / Risk:** HIGH (fully paginated ListApps control-plane state) / HIGH
(`HomeEfsFileSystemId` present); MEDIUM (no EFS) + +**Permissions:** `sagemaker:ListDomains`, `sagemaker:DescribeDomain`, `sagemaker:ListApps` + +**Params:** `idle_days_threshold` (default: 30) + +**Exclusions:** non-`InService` domains; domains younger than threshold; any app in `InService` or `Pending` state; unclassifiable app status entries + +**Spec:** [specs/aws/ai/sagemaker_domain_idle.md](../specs/aws/ai/sagemaker_domain_idle.md) + #### `aws.sagemaker.studio_app.idle` **Detects:** SageMaker Studio `KernelGateway`/`JupyterLab`/`CodeEditor` apps `InService` with no usable recent activity for `idle_days_threshold` diff --git a/docs/specs/aws/ai/sagemaker_domain_idle.md b/docs/specs/aws/ai/sagemaker_domain_idle.md new file mode 100644 index 0000000..d7ccaa7 --- /dev/null +++ b/docs/specs/aws/ai/sagemaker_domain_idle.md @@ -0,0 +1,469 @@ +# aws.sagemaker.domain.idle — Canonical Rule Specification + +## 1. Intent + +Detect Amazon SageMaker Domains in the currently evaluated account/Region that are +`InService` and have **no currently running apps** across all user profiles and spaces, +so they can be reviewed as potential FinOps cleanup candidates. + +A SageMaker Domain creates a managed EFS file system on first user onboarding. That file +system persists and incurs continuous storage charges regardless of whether any Studio apps +are running. A domain with no active apps represents wasted EFS cost with no current +compute value. + +This is a **read-only review-candidate rule**. It is not proof that the domain is safe to +delete, not proof that no user intends to start an app shortly, and not proof of the exact +storage cost. + +--- + +## 2. AWS API Grounding + +Based on official Amazon SageMaker Studio, SageMaker API Reference, SageMaker Studio +pricing, and IAM permissions documentation. + +### Key AWS facts + +1. `ListDomains` is the canonical SageMaker Domain inventory API and supports pagination. +2. 
`DomainDetails` (from `ListDomains`) documents `DomainId`, `DomainArn`, `DomainName`, + `Status`, `CreationTime`, `LastModifiedTime`, and `Url`. +3. `Domain.Status` valid values are `Deleting`, `Failed`, `InService`, `Pending`, + `Updating`, `Update_Failed`, and `Delete_Failed`. +4. `DescribeDomain` returns additional fields including `HomeEfsFileSystemId`, + `HomeEfsFileSystemCreation`, `AppNetworkAccessType`, `DefaultUserSettings`, + `DefaultSpaceSettings`, `AuthMode`, `VpcId`, and `SubnetIds`. +5. `HomeEfsFileSystemId` documents the ID of the EFS file system managed by the domain. +6. AWS documentation states that when the first user is onboarded to a domain, SageMaker + creates an EFS volume and that "a storage charge is incurred for this directory." +7. `ListApps` supports `DomainIdEquals` filtering and returns `AppDetails` items. +8. `AppDetails` (from `ListApps`) documents `AppName`, `AppType`, `CreationTime`, + `DomainId`, `UserProfileName`, `SpaceName`, `Status`, and `ResourceSpec`. +9. `App.Status` valid values are `Deleted`, `Deleting`, `Failed`, `InService`, and `Pending`. +10. `App.AppType` valid values include `JupyterServer`, `KernelGateway`, `JupyterLab`, + `CodeEditor`, `RStudioServerPro`, `RSessionGateway`, `DetailedProfiler`, `TensorBoard`, + and `Canvas`. +11. `DescribeApp` returns `LastUserActivityTimestamp` and `LastHealthCheckTimestamp`. +12. AWS documentation explicitly states: "`LastUserActivityTimestamp` is also updated when + SageMaker AI performs health checks without user activity. As a result, this value is + set to the same value as `LastHealthCheckTimestamp`." This makes it unreliable as a + canonical user-activity signal. +13. There is no documented CloudWatch namespace or metric for SageMaker Studio Domain or + App-level activity. No `KernelGateway` or domain-level CloudWatch metrics are documented. +14. Apps in `InService` status incur hourly compute charges. 
AWS documentation states that + "launching a JupyterLab application, even if no resources or jobs are launched in the + application, incurs costs." +15. There is no charge for the SageMaker Studio UI or the domain itself beyond EFS storage + and app compute. +16. `DefaultUserSettings` may contain `AppLifecycleManagement.IdleSettings` for JupyterLab + and CodeEditor apps, reflecting whether native idle shutdown is configured. + +### Implications + +- Only `InService` Domains are eligible. +- Age thresholding is supportable because `CreationTime` is documented in `ListDomains`. +- The canonical idle signal is control-plane state: presence or absence of `InService` apps + under the domain, not a CloudWatch metric (none is documented). +- `LastUserActivityTimestamp` from `DescribeApp` is explicitly documented as contaminated + by health checks and must not be used as a primary idle signal. +- `HomeEfsFileSystemId` surfaces cost context but cannot be mapped to a canonical per-domain + monthly cost estimate. +- `estimated_monthly_cost_usd = null`. + +--- + +## 3. Scope and Terminology + +- **"Domain"** — an item returned by `ListDomains`. +- **"App"** — an item returned by `ListApps(DomainIdEquals=domain_id)`. +- **"idle"** — the domain has no apps in `InService` or `Pending` status across all user + profiles and spaces at the time of evaluation. +- **"billable app state"** — `InService` or `Pending`. `Pending` means app launch is in + progress; the domain is not considered idle while any app is starting. +- **`idle_days_threshold`** — operator-configurable threshold applied to domain age, + default `30`. This threshold applies to domain age, not measured inactivity duration. +- **`reference_time_utc = CreationTime`** (domain level; `LastModifiedTime` reflects + config changes, not user activity, and must not be used as the age reference). 
+- **`age_days = floor((now_utc − reference_time_utc) / 86400 seconds)`** + +### Included + +- Domains in the currently evaluated Region/account +- `Status == "InService"` +- `age_days >= idle_days_threshold` +- no apps in `InService` or `Pending` status across the domain at evaluation time + +### Excluded + +- `Deleting`, `Pending`, `Updating`, `Update_Failed`, `Delete_Failed`, `Failed` +- missing or invalid stable identity +- missing or invalid `CreationTime` +- too new to evaluate (`age_days < idle_days_threshold`) +- any app currently `InService` or `Pending` in the domain + +--- + +## 4. Canonical Rule Statement + +A Domain is eligible only when **all** of the following are true: + +- stable domain identity (`DomainId`, `DomainArn`) exists +- `Status == "InService"` +- `CreationTime` is valid and not in the future +- `age_days >= idle_days_threshold` +- `ListApps(DomainIdEquals=domain_id)` returns no apps with `Status` in + `{"InService", "Pending"}` + +No additional predicate may be required for baseline eligibility, including: + +- auth mode (`SSO` vs `IAM`) +- network access type (`PublicInternetOnly` vs `VpcOnly`) +- whether native idle shutdown is configured (`AppLifecycleManagement.IdleSettings`) +- VPC configuration +- KMS key presence +- number of user profiles or spaces +- `HomeEfsFileSystemCreation` setting +- tags + +--- + +## 5. Normalization Contract + +All rule logic must operate on normalized fields only. 
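The eligibility conditions in §4, evaluated over normalized fields only, could be sketched roughly as follows. This is a minimal illustration, not the code in `sagemaker_domain_idle.py`: `NormalizedDomain`, `age_days`, `is_eligible`, and `BILLABLE_APP_STATES` are hypothetical names, and the sketch assumes every app status has already been validated against the documented `App.Status` values.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional

# "Billable app state" as defined in §3 of the spec.
BILLABLE_APP_STATES = frozenset({"InService", "Pending"})


@dataclass(frozen=True)
class NormalizedDomain:
    """Hypothetical container holding only the fields the §4 predicate needs."""

    domain_arn: str
    domain_id: str
    normalized_status: str
    creation_time_utc: datetime


def age_days(creation_time_utc: datetime, now_utc: datetime) -> int:
    """floor((now_utc - creation_time_utc) / 86400 seconds), as defined in §3."""
    return int((now_utc - creation_time_utc).total_seconds() // 86400)


def is_eligible(
    domain: NormalizedDomain,
    app_statuses: Iterable[str],
    idle_days_threshold: int = 30,
    now_utc: Optional[datetime] = None,
) -> bool:
    """Return True only when every §4 condition holds.

    Assumption: app_statuses contains only documented App.Status values;
    unclassifiable entries are expected to be handled before this call.
    """
    now = now_utc or datetime.now(timezone.utc)
    # Stable identity must exist.
    if not domain.domain_arn or not domain.domain_id:
        return False
    # Only InService domains are eligible.
    if domain.normalized_status != "InService":
        return False
    # CreationTime must be timezone-aware and not in the future.
    created = domain.creation_time_utc
    if created.tzinfo is None or created > now:
        return False
    # The threshold applies to domain age, not to measured inactivity.
    if age_days(created, now) < idle_days_threshold:
        return False
    # No app may be in a billable state at evaluation time.
    return not any(status in BILLABLE_APP_STATES for status in app_statuses)
```

Passing `now_utc` explicitly keeps the check deterministic in tests; in a real scan the evaluation time would presumably be captured once per run. Auth mode, network access type, tags, and the other attributes listed in §4 are deliberately absent from the predicate.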
+ +### 5.1 Domain-Level Fields + +| Canonical field | Source field | Absent / invalid | +|---|---|---| +| `resource_id` | `DomainArn` | skip item | +| `domain_arn` | `DomainArn` | skip item | +| `domain_id` | `DomainId` | skip item | +| `domain_name` | `DomainName` | null | +| `normalized_status` | `Status` | skip item | +| `creation_time_utc` | `CreationTime` (tz-aware UTC) | skip item | +| `last_modified_time_utc` | `LastModifiedTime` (tz-aware UTC) | null | +| `age_days` | floor((now − creation_time_utc) / 86400) | skip item | +| `home_efs_file_system_id` | `DescribeDomain.HomeEfsFileSystemId` | null | +| `home_efs_file_system_creation` | `DescribeDomain.HomeEfsFileSystemCreation` | null | +| `app_network_access_type` | `DescribeDomain.AppNetworkAccessType` | null | +| `auth_mode` | `DescribeDomain.AuthMode` | null | +| `idle_shutdown_configured` | `true` if `AppLifecycleManagement.IdleSettings.LifecycleManagement == "Enabled"` in any of: `DefaultUserSettings.JupyterLabAppSettings`, `DefaultUserSettings.CodeEditorAppSettings`, `DefaultSpaceSettings.JupyterLabAppSettings`, `DefaultSpaceSettings.CodeEditorAppSettings` | `false` | + +### 5.2 Normalization Requirements + +- String-valued identifiers must normalize only from non-empty strings. +- Timestamp fields must be timezone-aware UTC before use; naive → skip item for required + timestamps, null for contextual timestamps. +- Future `CreationTime` → skip item. +- `resource_id` must be `DomainArn`, not `DomainId` or `DomainName`. +- `DescribeDomain` is called per domain to obtain EFS and settings context. + - Permission-denied (`AccessDeniedException`) → **FAIL RULE** (the IAM policy is + missing a required permission; continuing would produce systematically incomplete + results). + - All other `DescribeDomain` failures (throttling after retries, resource-not-found + race, transient network error) → **SKIP ITEM** for that domain's enrichment; the + rule continues evaluating remaining domains. 
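The `DescribeDomain` failure split described in §5.2 might be expressed as a small classifier over the AWS error code. This is a hedged sketch with hypothetical names (`Outcome`, `PERMISSION_ERROR_CODES`, `classify_describe_domain_failure`). The spec names only `AccessDeniedException`; the two extra codes are included on the assumption, taken from the tests in this change, that `AccessDenied` and `UnauthorizedOperation` are also treated as permission failures.

```python
from enum import Enum
from typing import Optional

# AccessDeniedException is the code named in the spec. AccessDenied and
# UnauthorizedOperation are assumptions based on the accompanying tests.
PERMISSION_ERROR_CODES = frozenset(
    {"AccessDeniedException", "AccessDenied", "UnauthorizedOperation"}
)


class Outcome(Enum):
    """Hypothetical labels for the two failure scopes used in the spec."""

    FAIL_RULE = "fail_rule"  # abort the whole rule: IAM policy is incomplete
    SKIP_ITEM = "skip_item"  # drop this domain only; keep evaluating the rest


def classify_describe_domain_failure(error_code: Optional[str]) -> Outcome:
    """Map a DescribeDomain failure to FAIL RULE or SKIP ITEM.

    error_code is the AWS error code string, or None when the failure carried
    no service error code (for example a transient network error).
    """
    if error_code in PERMISSION_ERROR_CODES:
        return Outcome.FAIL_RULE
    # Throttling after retries, resource-not-found races, and transient
    # network errors are all item-scoped.
    return Outcome.SKIP_ITEM
```

With boto3, the code would typically come from `ClientError.response["Error"]["Code"]`. Keeping the classifier free of SDK types makes the permission boundary easy to unit-test on its own.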
+ +### 5.3 App-Level Fields + +| Canonical field | Source field | Absent / invalid | +|---|---|---| +| `app_name` | `AppName` | null (tolerated) | +| `app_type` | `AppType` | null | +| `app_status` | `Status` | skip domain (unclassifiable) | +| `app_creation_time_utc` | `CreationTime` (tz-aware UTC) | null | +| `user_profile_name` | `UserProfileName` | null | +| `space_name` | `SpaceName` | null | + +App normalization requirements: + +- An app entry is **unclassifiable** if `app_status` is absent or not one of the + documented `App.Status` values. +- If any app entry in the domain is unclassifiable → **SKIP ITEM** (the domain must not + be emitted, because the unclassifiable app could be in a billable state). +- `AppName` absent is tolerated for status-checking purposes: an app with a valid + `app_status` but missing `AppName` still counts toward the billable-app check. + +--- + +## 6. Idle-Activity Determination + +The control-plane `ListApps` response is the **sole trusted activity source** for this +rule. There is no documented CloudWatch metric for SageMaker Studio Domain or App activity. + +### Required API contract + +| Field | Value | +|---|---| +| API | `ListApps` | +| Filter | `DomainIdEquals = domain_id` | +| Pagination | full pagination required via `NextToken` | + +### Billable app states + +Apps in the following states must be treated as active compute presence: + +- `InService` — app is running and billing +- `Pending` — app launch is in progress; domain is not considered idle + +### Interpretation rules + +- If any app entry has an unclassifiable `Status` (absent or not a documented value) → + **SKIP ITEM** (cannot confirm the domain is idle) +- If any app in the domain has `Status` in `{"InService", "Pending"}` → **not idle** → + **SKIP ITEM** +- Domain is idle only when all returned apps have `Status` in `{"Deleted", "Deleting", + "Failed"}`, or there are no apps at all + +### Pagination requirement + +`ListApps` must be fully paginated. 
Partial results must not be interpreted as confirming +zero active apps. + +### Failure semantics + +- `ListApps` request or pagination failure → **FAIL RULE** +- Any app entry with unclassifiable `app_status` → **SKIP ITEM** for the entire domain + (the unclassifiable entry could be billable) + +### `LastUserActivityTimestamp` — explicitly excluded as primary signal + +AWS documentation states `LastUserActivityTimestamp` is also set on health checks and +equals `LastHealthCheckTimestamp`. This field must not be used as canonical user-activity +evidence. It may be surfaced as optional context only, with caveats. + +--- + +## 7. Pricing / Cost Boundary + +- `estimated_monthly_cost_usd = null` + +### What is documentable + +- The domain has an EFS file system (`HomeEfsFileSystemId`) that incurs continuous EFS + storage charges +- Apps in `InService` incur hourly compute charges; the idle domain has none currently +- The finding may state that EFS storage costs continue until the domain is deleted + +### Mandatory rules + +- MUST NOT emit a fixed monthly EFS storage estimate per domain +- MUST NOT infer immediate savings from idle state alone +- MAY surface `home_efs_file_system_id` and `home_efs_file_system_creation` as cost context + +--- + +## 8. Deterministic Evaluation Order + +1. Retrieve and fully paginate `ListDomains` +2. Normalize each domain summary item +3. For each normalized item: + - `domain_arn` or `domain_id` absent → **SKIP ITEM** + - `normalized_status` absent → **SKIP ITEM** + - `normalized_status != "InService"` → **SKIP ITEM** + - `creation_time_utc` absent/invalid/future → **SKIP ITEM** + - `age_days < idle_days_threshold` → **SKIP ITEM** +4. Call `DescribeDomain(DomainId=domain_id)` to obtain `HomeEfsFileSystemId`, + `HomeEfsFileSystemCreation`, and settings context + - `DescribeDomain` permission-denied (`AccessDeniedException`) → **FAIL RULE** + - `DescribeDomain` other failure → **SKIP ITEM** (item-scoped, rule continues) +5. 
Normalize domain enrichment fields +6. Call and fully paginate `ListApps(DomainIdEquals=domain_id)` + - `ListApps` failure or pagination failure → **FAIL RULE** +7. Normalize app entries +8. If any app entry has unclassifiable `app_status` (absent or not a documented value) + → **SKIP ITEM** +9. If any app has `app_status` in `{"InService", "Pending"}` → **SKIP ITEM** +10. Otherwise → **EMIT** + +--- + +## 9. Exclusion Rules + +1. `domain_arn` absent → malformed identity +2. `domain_id` absent → cannot filter `ListApps` +3. `normalized_status` absent → missing current-state signal +4. `normalized_status != "InService"` → domain not currently active +5. `creation_time_utc` absent/naive/future → invalid age source +6. `age_days < idle_days_threshold` → too new +7. any app with `Status == "InService"` → compute currently running +8. any app with `Status == "Pending"` → app launch in progress, domain not idle + +No exclusion for: auth mode, network access type, VPC config, KMS key, Studio version, +native idle shutdown config, tag state, or EFS creation mode. + +--- + +## 10. Failure Model + +### Rule-level failures (FAIL RULE) + +- `ListDomains` request or pagination failure +- `ListApps` request or pagination failure for any domain that reached the app-check step +- Permission-denied (`AccessDeniedException`) on any required API (`ListDomains`, + `DescribeDomain`, `ListApps`) + +### Item-level skips (SKIP ITEM) + +- malformed identity or missing `CreationTime` +- non-`InService` domain status +- domain too new +- `DescribeDomain` non-permission failure (throttling after retries, resource-not-found + race, transient network error) +- any app in `InService` or `Pending` state +- any app entry with unclassifiable `Status` (see §6 malformed-app handling) + +--- + +## 11. Confidence Model + +| Condition | Confidence | +|---|---| +| `ListApps` fully paginated, zero `InService` or `Pending` apps | `HIGH` | + +**Mandatory rule:** use `HIGH` confidence. 
The finding is based on direct control-plane +state from fully paginated `ListApps`. The absence of running apps at evaluation time is +a documentable fact, not an inference from a metric proxy. + +--- + +## 12. Risk Model + +| Condition | Risk | +|---|---| +| `HomeEfsFileSystemId` is present (non-null, non-empty) | `HIGH` | +| `HomeEfsFileSystemId` is absent or null | `MEDIUM` | + +**Note:** risk is based on `HomeEfsFileSystemId` presence from `DescribeDomain`, not on +verified EFS file system existence. No EFS API call is made. A present ID is a strong +signal of continuous storage cost because SageMaker-managed EFS volumes are not deleted +until the domain itself is deleted. + +--- + +## 13. Evidence / Details Contract + +### Required details fields + +```text +evaluation_path = "idle-sagemaker-domain-review-candidate" +domain_arn +domain_id +domain_name +normalized_status = "InService" +creation_time +age_days +idle_days_threshold +home_efs_file_system_id +home_efs_file_system_creation +app_network_access_type +auth_mode +idle_shutdown_configured +total_apps_evaluated +apps_by_status (dict: status → count across all evaluated app entries) +inservice_app_count = 0 +pending_app_count = 0 +``` + +### Optional context fields + +```text +last_modified_time +user_profile_count (if available from ListUserProfiles — enrichment only) +space_count (if available from ListSpaces — enrichment only) +``` + +### Required evidence wording + +**Signals used** must state: + +- domain is currently `InService` +- domain age met the configured threshold using `CreationTime` +- `ListApps` was fully paginated and found zero apps in `InService` or `Pending` state +- EFS file system ID is surfaced as continuous cost context + +**Signals not checked** must state major blind spots: + +- `LastUserActivityTimestamp` from `DescribeApp` was not used as evidence because AWS + documents it as updated on health checks, making it unreliable as a user-activity signal +- a user may start a new app 
shortly after evaluation; this is a point-in-time check +- the domain may be intentionally kept active for periodic or scheduled use +- `Deleting` apps may transition back to `InService` if the deletion fails +- EFS storage cost depends on per-user home directory content; this rule does not inspect + directory sizes or file counts +- native idle shutdown configuration (`AppLifecycleManagement.IdleSettings`) is surfaced + as context but does not affect eligibility + +--- + +## 14. Non-goals / Blind Spots + +This rule does **not** prove any of the following: + +- that the domain is safe to delete +- that users are not actively using the domain outside the observation window +- that no app will be started imminently +- that all data in the EFS home directories is safe to remove +- the exact current EFS storage cost +- that the domain has no operational dependencies (CI/CD pipelines, scheduled notebooks) + +--- + +## 15. API and IAM Contract + +### Required APIs + +- `sagemaker:ListDomains` +- `sagemaker:DescribeDomain` +- `sagemaker:ListApps` + +### Optional enrichment APIs (not required for eligibility) + +- `sagemaker:ListUserProfiles` +- `sagemaker:ListSpaces` + +### Mandatory API usage rules + +- `ListDomains` must be fully paginated +- `ListApps` must be called with `DomainIdEquals` and fully paginated +- `DescribeApp` must not be called as part of the canonical eligibility path (it is not + required for idle determination; `LastUserActivityTimestamp` is excluded as a signal) +- undocumented fallback activity signals must not be substituted + +--- + +## 16. Acceptance Scenarios + +### Must emit + +1. `InService` Domain older than threshold, `ListApps` returns zero apps → emit with + `risk = HIGH` if `HomeEfsFileSystemId` present +2. `InService` Domain older than threshold, all apps have `Status == "Deleted"` → + emit with `risk = HIGH` if `HomeEfsFileSystemId` present +3. 
`InService` Domain older than threshold, all apps have `Status == "Failed"` → + emit with `risk = MEDIUM` or `HIGH` depending on EFS presence + +### Must skip + +4. `Pending` domain +5. `Updating` domain +6. `Deleting` domain +7. `Failed` domain +8. `InService` domain younger than threshold +9. malformed item without `DomainArn` +10. malformed item without `DomainId` +11. malformed item with missing/invalid/future `CreationTime` +12. domain with any app in `Status == "InService"` +13. domain with any app in `Status == "Pending"` +14. domain with any app entry where `Status` is absent or not a documented value +15. `DescribeDomain` non-permission failure for a specific domain (item skipped, rule + continues) + +### Must fail + +16. `ListDomains` request or pagination failure +17. `ListApps` request or pagination failure for any domain that reached the app-check step +18. `DescribeDomain` permission-denied (`AccessDeniedException`) + +--- + +Rule: aws.sagemaker.domain.idle diff --git a/pyproject.toml b/pyproject.toml index 217504b..729ee62 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -54,7 +54,7 @@ azure = [ "azure-mgmt-monitor>=6.0.0", "azure-mgmt-containerregistry>=10.0.0", "azure-mgmt-cognitiveservices>=13.5.0", - "azure-mgmt-machinelearningservices>=1.0.0", + "azure-mgmt-machinelearningservices>=1.0.0,<1.0.1", "azure-ai-ml>=1.0.0", "azure-mgmt-search>=9.0.0", "azure-core>=1.38.0", @@ -81,7 +81,7 @@ all = [ "azure-mgmt-monitor>=6.0.0", "azure-mgmt-containerregistry>=10.0.0", "azure-mgmt-cognitiveservices>=13.5.0", - "azure-mgmt-machinelearningservices>=1.0.0", + "azure-mgmt-machinelearningservices>=1.0.0,<1.0.1", "azure-ai-ml>=1.0.0", "azure-mgmt-search>=9.0.0", "azure-core>=1.38.0", diff --git a/security/aws/ai-readonly.json b/security/aws/ai-readonly.json index 15243e1..1e2ecae 100644 --- a/security/aws/ai-readonly.json +++ b/security/aws/ai-readonly.json @@ -17,7 +17,9 @@ "sagemaker:DescribeEndpoint", "sagemaker:DescribeEndpointConfig", 
"sagemaker:ListNotebookInstances", - "sagemaker:DescribeNotebookInstance" + "sagemaker:DescribeNotebookInstance", + "sagemaker:ListDomains", + "sagemaker:DescribeDomain" ], "Resource": "*" }, diff --git a/tests/cleancloud/config/test_accounts_config.py b/tests/cleancloud/config/test_accounts_config.py index d7b1f1b..aafc332 100644 --- a/tests/cleancloud/config/test_accounts_config.py +++ b/tests/cleancloud/config/test_accounts_config.py @@ -10,14 +10,18 @@ def test_load_basic_accounts(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ role_name: CleanCloudReadOnlyRole accounts: - id: "111111111111" name: prod - id: "222222222222" name: dev - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -31,13 +35,17 @@ def test_load_basic_accounts(tmp_path): def test_load_external_id(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ role_name: CleanCloudReadOnlyRole external_id: cleancloud-secret accounts: - id: "111111111111" name: prod - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -46,12 +54,16 @@ def test_load_external_id(tmp_path): def test_load_scan_timeout(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ scan_timeout: 7200 accounts: - id: "111111111111" name: prod - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -60,11 +72,15 @@ def test_load_scan_timeout(tmp_path): def test_default_role_name_when_omitted(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ accounts: - id: "111111111111" name: prod - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -75,10 +91,14 @@ def 
test_default_role_name_when_omitted(tmp_path): def test_account_name_defaults_to_id_when_omitted(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ accounts: - id: "111111111111" - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -87,11 +107,15 @@ def test_account_name_defaults_to_id_when_omitted(tmp_path): def test_account_id_coerced_to_string(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ accounts: - id: 111111111111 name: prod - """)) + """ + ) + ) config = load_accounts_config(str(config_file)) @@ -101,10 +125,14 @@ def test_account_id_coerced_to_string(tmp_path): def test_empty_accounts_raises(tmp_path): config_file = tmp_path / "accounts.yaml" - config_file.write_text(textwrap.dedent("""\ + config_file.write_text( + textwrap.dedent( + """\ role_name: CleanCloudReadOnlyRole accounts: [] - """)) + """ + ) + ) with pytest.raises(ValueError, match="No accounts found"): load_accounts_config(str(config_file)) diff --git a/tests/cleancloud/providers/aws/ai/test_aws_sagemaker_domain_idle.py b/tests/cleancloud/providers/aws/ai/test_aws_sagemaker_domain_idle.py new file mode 100644 index 0000000..c5e125a --- /dev/null +++ b/tests/cleancloud/providers/aws/ai/test_aws_sagemaker_domain_idle.py @@ -0,0 +1,1138 @@ +from datetime import datetime, timedelta, timezone +from unittest.mock import MagicMock + +import pytest +from botocore.exceptions import BotoCoreError, ClientError + +from cleancloud.providers.aws.rules.ai.sagemaker_domain_idle import ( + RULE_METADATA, + _check_idle_shutdown, + _enrich_domain, + _normalize_domain, + find_idle_sagemaker_domains, +) + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + +_DEFAULT_THRESHOLD = 30 
+_ARN_PREFIX = "arn:aws:sagemaker:us-east-1:123456789012:domain" + + +def _make_session(sagemaker_mock): + session = MagicMock() + session.client.return_value = sagemaker_mock + return session + + +def _make_domain( + domain_id="d-abc123", + name="ml-research", + age_days=60, + status="InService", +): + """Build a ListDomains response entry.""" + now = datetime.now(timezone.utc) + return { + "DomainId": domain_id, + "DomainArn": f"{_ARN_PREFIX}/{domain_id}", + "DomainName": name, + "Status": status, + "CreationTime": now - timedelta(days=age_days), + "LastModifiedTime": now - timedelta(days=age_days - 1), + } + + +def _describe_response( + domain_id="d-abc123", + efs_id="fs-abc123", + efs_creation="Automatic", + network_access="PublicInternetOnly", + auth_mode="IAM", + idle_shutdown=False, +): + """Build a DescribeDomain response.""" + resp = { + "DomainId": domain_id, + "HomeEfsFileSystemId": efs_id, + "HomeEfsFileSystemCreation": efs_creation, + "AppNetworkAccessType": network_access, + "AuthMode": auth_mode, + "DefaultUserSettings": {}, + "DefaultSpaceSettings": {}, + } + if idle_shutdown: + resp["DefaultUserSettings"] = { + "JupyterLabAppSettings": { + "AppLifecycleManagement": { + "IdleSettings": { + "LifecycleManagement": "Enabled", + } + } + } + } + return resp + + +def _make_apps(*statuses): + """Build a ListApps response with the given app statuses.""" + apps = [] + for i, status in enumerate(statuses): + apps.append( + { + "AppName": f"app-{i}", + "AppType": "JupyterLab", + "Status": status, + "DomainId": "d-abc123", + "CreationTime": datetime.now(timezone.utc) - timedelta(days=5), + } + ) + return apps + + +def _setup_sagemaker( + domains=None, + describe_response=None, + apps=None, + describe_side_effect=None, + list_apps_side_effect=None, +): + """Wire up a fully mocked SageMaker client.""" + sm = MagicMock() + + # ListDomains paginator + domain_paginator = MagicMock() + domain_paginator.paginate.return_value = [{"Domains": domains or []}] + + # 
ListApps paginator + apps_paginator = MagicMock() + if list_apps_side_effect: + apps_paginator.paginate.side_effect = list_apps_side_effect + else: + apps_paginator.paginate.return_value = [{"Apps": apps if apps is not None else []}] + + def get_paginator(name): + if name == "list_domains": + return domain_paginator + if name == "list_apps": + return apps_paginator + raise ValueError(f"Unexpected paginator: {name}") + + sm.get_paginator.side_effect = get_paginator + + # DescribeDomain + if describe_side_effect: + sm.describe_domain.side_effect = describe_side_effect + else: + sm.describe_domain.return_value = describe_response or _describe_response() + + return sm + + +def _run( + domains=None, + describe_response=None, + apps=None, + threshold=_DEFAULT_THRESHOLD, + region="us-east-1", + describe_side_effect=None, + list_apps_side_effect=None, +): + sm = _setup_sagemaker( + domains=domains, + describe_response=describe_response, + apps=apps, + describe_side_effect=describe_side_effect, + list_apps_side_effect=list_apps_side_effect, + ) + return find_idle_sagemaker_domains(_make_session(sm), region, threshold) + + +def _arn(domain_id): + return f"{_ARN_PREFIX}/{domain_id}" + + +# --------------------------------------------------------------------------- +# TestMustEmit +# --------------------------------------------------------------------------- + + +class TestMustEmit: + """Spec §16: scenarios 1-3.""" + + def test_idle_domain_no_apps_emits(self): + """Scenario 1: InService domain older than threshold, zero apps.""" + findings = _run(domains=[_make_domain(age_days=60)], apps=[]) + assert len(findings) == 1 + + def test_idle_domain_all_deleted_apps_emits(self): + """Scenario 2: all apps Deleted.""" + findings = _run( + domains=[_make_domain(age_days=60)], + apps=_make_apps("Deleted", "Deleted"), + ) + assert len(findings) == 1 + + def test_idle_domain_all_failed_apps_emits(self): + """Scenario 3: all apps Failed.""" + findings = _run( + 
domains=[_make_domain(age_days=60)], + apps=_make_apps("Failed"), + ) + assert len(findings) == 1 + + def test_idle_domain_mixed_non_billable_emits(self): + """Mix of Deleted, Deleting, Failed → still idle.""" + findings = _run( + domains=[_make_domain(age_days=60)], + apps=_make_apps("Deleted", "Deleting", "Failed"), + ) + assert len(findings) == 1 + + def test_resource_id_is_domain_arn(self): + findings = _run(domains=[_make_domain(domain_id="d-xyz789", age_days=60)]) + assert findings[0].resource_id == _arn("d-xyz789") + + def test_resource_type(self): + findings = _run(domains=[_make_domain(age_days=60)]) + assert findings[0].resource_type == "aws.sagemaker.domain" + + def test_provider(self): + findings = _run(domains=[_make_domain(age_days=60)]) + assert findings[0].provider == "aws" + + def test_rule_id(self): + findings = _run(domains=[_make_domain(age_days=60)]) + assert findings[0].rule_id == "aws.sagemaker.domain.idle" + + def test_region_preserved(self): + findings = _run( + domains=[_make_domain(age_days=60)], + region="ap-southeast-1", + ) + assert findings[0].region == "ap-southeast-1" + + def test_no_domains_returns_empty(self): + assert _run(domains=[]) == [] + + def test_summary_contains_domain_name(self): + findings = _run( + domains=[_make_domain(name="fraud-model-studio", age_days=60)], + ) + assert "fraud-model-studio" in findings[0].summary + + def test_exactly_at_threshold_emits(self): + findings = _run(domains=[_make_domain(age_days=30)]) + assert len(findings) == 1 + + +# --------------------------------------------------------------------------- +# TestMustSkip +# --------------------------------------------------------------------------- + + +class TestMustSkip: + """Spec §16: scenarios 4-15.""" + + def test_pending_domain_skipped(self): + """Scenario 4.""" + assert _run(domains=[_make_domain(age_days=60, status="Pending")]) == [] + + def test_updating_domain_skipped(self): + """Scenario 5.""" + assert 
_run(domains=[_make_domain(age_days=60, status="Updating")]) == [] + + def test_deleting_domain_skipped(self): + """Scenario 6.""" + assert _run(domains=[_make_domain(age_days=60, status="Deleting")]) == [] + + def test_failed_domain_skipped(self): + """Scenario 7.""" + assert _run(domains=[_make_domain(age_days=60, status="Failed")]) == [] + + def test_domain_too_young_skipped(self): + """Scenario 8.""" + assert _run(domains=[_make_domain(age_days=29)]) == [] + + def test_missing_domain_arn_skipped(self): + """Scenario 9.""" + d = _make_domain(age_days=60) + del d["DomainArn"] + assert _run(domains=[d]) == [] + + def test_empty_domain_arn_skipped(self): + d = _make_domain(age_days=60) + d["DomainArn"] = "" + assert _run(domains=[d]) == [] + + def test_missing_domain_id_skipped(self): + """Scenario 10.""" + d = _make_domain(age_days=60) + del d["DomainId"] + assert _run(domains=[d]) == [] + + def test_empty_domain_id_skipped(self): + d = _make_domain(age_days=60) + d["DomainId"] = "" + assert _run(domains=[d]) == [] + + def test_missing_creation_time_skipped(self): + """Scenario 11.""" + d = _make_domain(age_days=60) + del d["CreationTime"] + assert _run(domains=[d]) == [] + + def test_naive_creation_time_skipped(self): + d = _make_domain(age_days=60) + d["CreationTime"] = datetime.now() - timedelta(days=60) + assert d["CreationTime"].tzinfo is None + assert _run(domains=[d]) == [] + + def test_future_creation_time_skipped(self): + d = _make_domain(age_days=60) + d["CreationTime"] = datetime.now(timezone.utc) + timedelta(days=1) + assert _run(domains=[d]) == [] + + def test_missing_status_skipped(self): + d = _make_domain(age_days=60) + del d["Status"] + assert _run(domains=[d]) == [] + + def test_empty_status_skipped(self): + d = _make_domain(age_days=60) + d["Status"] = "" + assert _run(domains=[d]) == [] + + def test_domain_with_inservice_app_skipped(self): + """Scenario 12.""" + assert ( + _run( + domains=[_make_domain(age_days=60)], + 
+                apps=_make_apps("InService"),
+            )
+            == []
+        )
+
+    def test_domain_with_pending_app_skipped(self):
+        """Scenario 13."""
+        assert (
+            _run(
+                domains=[_make_domain(age_days=60)],
+                apps=_make_apps("Pending"),
+            )
+            == []
+        )
+
+    def test_domain_with_unclassifiable_app_status_skipped(self):
+        """Scenario 14: app with missing Status → skip domain."""
+        bad_app = {"AppName": "app-0", "AppType": "JupyterLab", "DomainId": "d-abc123"}
+        # No Status key
+        assert (
+            _run(
+                domains=[_make_domain(age_days=60)],
+                apps=[bad_app],
+            )
+            == []
+        )
+
+    def test_domain_with_unknown_app_status_skipped(self):
+        """Scenario 14: app with undocumented Status value → skip domain."""
+        bad_app = {
+            "AppName": "app-0",
+            "AppType": "JupyterLab",
+            "Status": "SomeNewStatus",
+            "DomainId": "d-abc123",
+        }
+        assert (
+            _run(
+                domains=[_make_domain(age_days=60)],
+                apps=[bad_app],
+            )
+            == []
+        )
+
+    def test_domain_with_empty_app_status_skipped(self):
+        bad_app = {
+            "AppName": "app-0",
+            "Status": "",
+            "DomainId": "d-abc123",
+        }
+        assert (
+            _run(
+                domains=[_make_domain(age_days=60)],
+                apps=[bad_app],
+            )
+            == []
+        )
+
+    def test_describe_domain_non_permission_failure_skips(self):
+        """Scenario 15: DescribeDomain throttle/not-found → skip item, rule continues."""
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_side_effect=ClientError(
+                {"Error": {"Code": "ThrottlingException", "Message": "slow down"}},
+                "DescribeDomain",
+            ),
+        )
+        assert findings == []
+
+    def test_describe_domain_botocore_error_skips(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_side_effect=BotoCoreError(),
+        )
+        assert findings == []
+
+    def test_mixed_billable_and_non_billable_apps_skipped(self):
+        """One InService app among Deleted apps → skip."""
+        assert (
+            _run(
+                domains=[_make_domain(age_days=60)],
+                apps=_make_apps("Deleted", "InService", "Deleted"),
+            )
+            == []
+        )
+
+    def test_non_dict_app_entry_skips_domain(self):
+        """Non-dict app entry is unclassifiable → skip domain."""
+        sm = _setup_sagemaker(
+            domains=[_make_domain(age_days=60)],
+        )
+        # Override apps paginator to return non-dict entry
+        apps_paginator = MagicMock()
+        apps_paginator.paginate.return_value = [{"Apps": [None, "bad"]}]
+
+        original_get_paginator = sm.get_paginator.side_effect
+
+        def patched_get_paginator(name):
+            if name == "list_apps":
+                return apps_paginator
+            return original_get_paginator(name)
+
+        sm.get_paginator.side_effect = patched_get_paginator
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert findings == []
+
+    def test_non_dict_item_in_domains_skipped(self):
+        sm = MagicMock()
+        paginator = MagicMock()
+        paginator.paginate.return_value = [{"Domains": [None, "bad", 42]}]
+        sm.get_paginator.return_value = paginator
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert findings == []
+
+    def test_age_zero_skipped(self):
+        assert _run(domains=[_make_domain(age_days=0)]) == []
+
+
+# ---------------------------------------------------------------------------
+# TestMustFailRule
+# ---------------------------------------------------------------------------
+
+
+class TestMustFailRule:
+    """Spec §16: scenarios 16-18."""
+
+    def test_list_domains_preflight_permission_denied_fails(self):
+        """Pre-flight direct call catches permission error before paginator."""
+        sm = MagicMock()
+        sm.list_domains.side_effect = ClientError(
+            {"Error": {"Code": "AccessDeniedException", "Message": "denied"}},
+            "ListDomains",
+        )
+        with pytest.raises(PermissionError) as exc_info:
+            find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert "sagemaker:ListDomains" in str(exc_info.value)
+        # Paginator should never be called if pre-flight fails
+        sm.get_paginator.assert_not_called()
+
+    def test_list_domains_paginator_permission_denied_fails(self):
+        """Scenario 16 (permission variant) — paginator also catches."""
+        sm = MagicMock()
+        paginator = MagicMock()
+        paginator.paginate.side_effect = ClientError(
+            {"Error": {"Code": "AccessDeniedException", "Message": "denied"}},
+            "ListDomains",
+        )
+        sm.get_paginator.return_value = paginator
+        with pytest.raises(PermissionError) as exc_info:
+            find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert "sagemaker:ListDomains" in str(exc_info.value)
+
+    def test_list_domains_other_client_error_propagates(self):
+        """Scenario 16 (non-permission variant)."""
+        sm = MagicMock()
+        paginator = MagicMock()
+        paginator.paginate.side_effect = ClientError(
+            {"Error": {"Code": "InternalFailure", "Message": "oops"}},
+            "ListDomains",
+        )
+        sm.get_paginator.return_value = paginator
+        with pytest.raises(ClientError):
+            find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+
+    def test_list_domains_botocore_error_propagates(self):
+        sm = MagicMock()
+        paginator = MagicMock()
+        paginator.paginate.side_effect = BotoCoreError()
+        sm.get_paginator.return_value = paginator
+        with pytest.raises(BotoCoreError):
+            find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+
+    def test_list_apps_permission_denied_fails(self):
+        """Scenario 17."""
+        with pytest.raises(PermissionError) as exc_info:
+            _run(
+                domains=[_make_domain(age_days=60)],
+                list_apps_side_effect=ClientError(
+                    {"Error": {"Code": "AccessDeniedException", "Message": "denied"}},
+                    "ListApps",
+                ),
+            )
+        assert "sagemaker:ListApps" in str(exc_info.value)
+
+    def test_list_apps_other_client_error_propagates(self):
+        with pytest.raises(ClientError):
+            _run(
+                domains=[_make_domain(age_days=60)],
+                list_apps_side_effect=ClientError(
+                    {"Error": {"Code": "InternalFailure", "Message": "oops"}},
+                    "ListApps",
+                ),
+            )
+
+    def test_list_apps_botocore_error_propagates(self):
+        with pytest.raises(BotoCoreError):
+            _run(
+                domains=[_make_domain(age_days=60)],
+                list_apps_side_effect=BotoCoreError(),
+            )
+
+    def test_describe_domain_permission_denied_fails(self):
+        """Scenario 18."""
+        with pytest.raises(PermissionError) as exc_info:
+            _run(
+                domains=[_make_domain(age_days=60)],
+                describe_side_effect=ClientError(
+                    {"Error": {"Code": "AccessDeniedException", "Message": "denied"}},
+                    "DescribeDomain",
+                ),
+            )
+        assert "sagemaker:DescribeDomain" in str(exc_info.value)
+
+    def test_describe_domain_access_denied_fails(self):
+        with pytest.raises(PermissionError):
+            _run(
+                domains=[_make_domain(age_days=60)],
+                describe_side_effect=ClientError(
+                    {"Error": {"Code": "AccessDenied", "Message": "denied"}},
+                    "DescribeDomain",
+                ),
+            )
+
+    def test_describe_domain_unauthorized_operation_fails(self):
+        with pytest.raises(PermissionError):
+            _run(
+                domains=[_make_domain(age_days=60)],
+                describe_side_effect=ClientError(
+                    {"Error": {"Code": "UnauthorizedOperation", "Message": "denied"}},
+                    "DescribeDomain",
+                ),
+            )
+
+
+# ---------------------------------------------------------------------------
+# TestConfidenceModel
+# ---------------------------------------------------------------------------
+
+
+class TestConfidenceModel:
+    def test_confidence_always_high(self):
+        findings = _run(domains=[_make_domain(age_days=60)])
+        assert findings[0].confidence.value == "high"
+
+    def test_confidence_high_with_efs(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_response=_describe_response(efs_id="fs-123"),
+        )
+        assert findings[0].confidence.value == "high"
+
+    def test_confidence_high_without_efs(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_response=_describe_response(efs_id=None),
+        )
+        assert findings[0].confidence.value == "high"
+
+
+# ---------------------------------------------------------------------------
+# TestRiskModel
+# ---------------------------------------------------------------------------
+
+
+class TestRiskModel:
+    def test_efs_present_is_high_risk(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_response=_describe_response(efs_id="fs-abc123"),
+        )
+        assert findings[0].risk.value == "high"
+
+    def test_efs_absent_is_medium_risk(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_response=_describe_response(efs_id=None),
+        )
+        assert findings[0].risk.value == "medium"
+
+    def test_efs_empty_string_is_medium_risk(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            describe_response=_describe_response(efs_id=""),
+        )
+        assert findings[0].risk.value == "medium"
+
+    def test_no_critical_risk_emitted(self):
+        findings = _run(domains=[_make_domain(age_days=60)])
+        for f in findings:
+            assert f.risk.value != "critical"
+
+
+# ---------------------------------------------------------------------------
+# TestCostModel
+# ---------------------------------------------------------------------------
+
+
+class TestCostModel:
+    def test_estimated_cost_is_none(self):
+        findings = _run(domains=[_make_domain(age_days=60)])
+        assert findings[0].estimated_monthly_cost_usd is None
+
+
+# ---------------------------------------------------------------------------
+# TestNormalization
+# ---------------------------------------------------------------------------
+
+
+class TestNormalization:
+    def _now(self):
+        return datetime.now(timezone.utc)
+
+    def test_returns_none_for_non_dict(self):
+        assert _normalize_domain(None, self._now()) is None
+        assert _normalize_domain("bad", self._now()) is None
+        assert _normalize_domain(42, self._now()) is None
+
+    def test_returns_none_when_arn_missing(self):
+        now = self._now()
+        item = {
+            "DomainId": "d-abc",
+            "DomainName": "test",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+        }
+        assert _normalize_domain(item, now) is None
+
+    def test_returns_none_when_domain_id_missing(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainName": "test",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+        }
+        assert _normalize_domain(item, now) is None
+
+    def test_returns_none_when_status_missing(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "CreationTime": now - timedelta(days=30),
+        }
+        assert _normalize_domain(item, now) is None
+
+    def test_returns_none_for_naive_creation_time(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": datetime.now() - timedelta(days=30),
+        }
+        assert _normalize_domain(item, now) is None
+
+    def test_returns_none_for_future_creation_time(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now + timedelta(days=1),
+        }
+        assert _normalize_domain(item, now) is None
+
+    def test_age_days_computed_correctly(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=45),
+        }
+        n = _normalize_domain(item, now)
+        assert n is not None
+        assert n["age_days"] == 45
+
+    def test_domain_name_optional(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+        }
+        n = _normalize_domain(item, now)
+        assert n is not None
+        assert n["domain_name"] is None
+
+    def test_last_modified_time_optional(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+        }
+        n = _normalize_domain(item, now)
+        assert n is not None
+        assert n["last_modified_time_utc"] is None
+
+    def test_naive_last_modified_time_normalized_to_none(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+            "LastModifiedTime": datetime.now() - timedelta(days=10),  # naive
+        }
+        n = _normalize_domain(item, now)
+        assert n is not None
+        assert n["last_modified_time_utc"] is None
+
+    def test_future_last_modified_time_normalized_to_none(self):
+        now = self._now()
+        item = {
+            "DomainArn": _arn("d-abc"),
+            "DomainId": "d-abc",
+            "Status": "InService",
+            "CreationTime": now - timedelta(days=30),
+            "LastModifiedTime": now + timedelta(days=1),
+        }
+        n = _normalize_domain(item, now)
+        assert n is not None
+        assert n["last_modified_time_utc"] is None
+
+
+# ---------------------------------------------------------------------------
+# TestEnrichment
+# ---------------------------------------------------------------------------
+
+
+class TestEnrichment:
+    def test_enrich_extracts_efs_id(self):
+        resp = _describe_response(efs_id="fs-abc123")
+        e = _enrich_domain(resp)
+        assert e["home_efs_file_system_id"] == "fs-abc123"
+
+    def test_enrich_efs_id_none_when_absent(self):
+        resp = _describe_response()
+        del resp["HomeEfsFileSystemId"]
+        e = _enrich_domain(resp)
+        assert e["home_efs_file_system_id"] is None
+
+    def test_enrich_auth_mode(self):
+        resp = _describe_response(auth_mode="SSO")
+        e = _enrich_domain(resp)
+        assert e["auth_mode"] == "SSO"
+
+    def test_idle_shutdown_default_false(self):
+        resp = _describe_response(idle_shutdown=False)
+        e = _enrich_domain(resp)
+        assert e["idle_shutdown_configured"] is False
+
+    def test_idle_shutdown_true_when_configured(self):
+        resp = _describe_response(idle_shutdown=True)
+        e = _enrich_domain(resp)
+        assert e["idle_shutdown_configured"] is True
+
+    def test_idle_shutdown_from_default_space_settings(self):
+        resp = _describe_response()
+        resp["DefaultSpaceSettings"] = {
+            "CodeEditorAppSettings": {
+                "AppLifecycleManagement": {
+                    "IdleSettings": {
+                        "LifecycleManagement": "Enabled",
+                    }
+                }
+            }
+        }
+        e = _enrich_domain(resp)
+        assert e["idle_shutdown_configured"] is True
+
+
+# ---------------------------------------------------------------------------
+# TestIdleShutdownCheck
+# ---------------------------------------------------------------------------
+
+
+class TestIdleShutdownCheck:
+    def test_enabled_in_jupyterlab(self):
+        settings = {
+            "JupyterLabAppSettings": {
+                "AppLifecycleManagement": {"IdleSettings": {"LifecycleManagement": "Enabled"}}
+            }
+        }
+        assert _check_idle_shutdown(settings) is True
+
+    def test_enabled_in_code_editor(self):
+        settings = {
+            "CodeEditorAppSettings": {
+                "AppLifecycleManagement": {"IdleSettings": {"LifecycleManagement": "Enabled"}}
+            }
+        }
+        assert _check_idle_shutdown(settings) is True
+
+    def test_disabled(self):
+        settings = {
+            "JupyterLabAppSettings": {
+                "AppLifecycleManagement": {"IdleSettings": {"LifecycleManagement": "Disabled"}}
+            }
+        }
+        assert _check_idle_shutdown(settings) is False
+
+    def test_empty_settings(self):
+        assert _check_idle_shutdown({}) is False
+
+    def test_malformed_settings_handled(self):
+        assert _check_idle_shutdown({"JupyterLabAppSettings": "not-a-dict"}) is False
+        assert (
+            _check_idle_shutdown({"JupyterLabAppSettings": {"AppLifecycleManagement": None}})
+            is False
+        )
+
+
+# ---------------------------------------------------------------------------
+# TestDetailsContract
+# ---------------------------------------------------------------------------
+
+
+class TestDetailsContract:
+    def _finding(self):
+        return _run(
+            domains=[_make_domain(domain_id="d-test", name="my-domain", age_days=60)],
+            describe_response=_describe_response(
+                domain_id="d-test",
+                efs_id="fs-test",
+                efs_creation="Automatic",
+                network_access="VpcOnly",
+                auth_mode="SSO",
+                idle_shutdown=True,
+            ),
+            apps=_make_apps("Deleted", "Failed"),
+        )[0]
+
+    def test_evaluation_path(self):
+        assert (
+            self._finding().details["evaluation_path"] == "idle-sagemaker-domain-review-candidate"
+        )
+
+    def test_domain_arn(self):
+        assert self._finding().details["domain_arn"] == _arn("d-test")
+
+    def test_domain_id(self):
+        assert self._finding().details["domain_id"] == "d-test"
+
+    def test_domain_name(self):
+        assert self._finding().details["domain_name"] == "my-domain"
+
+    def test_normalized_status(self):
+        assert self._finding().details["normalized_status"] == "InService"
+
+    def test_creation_time_present(self):
+        assert "creation_time" in self._finding().details
+
+    def test_age_days(self):
+        assert self._finding().details["age_days"] == 60
+
+    def test_idle_days_threshold(self):
+        assert self._finding().details["idle_days_threshold"] == 30
+
+    def test_home_efs_file_system_id(self):
+        assert self._finding().details["home_efs_file_system_id"] == "fs-test"
+
+    def test_home_efs_file_system_creation(self):
+        assert self._finding().details["home_efs_file_system_creation"] == "Automatic"
+
+    def test_app_network_access_type(self):
+        assert self._finding().details["app_network_access_type"] == "VpcOnly"
+
+    def test_auth_mode(self):
+        assert self._finding().details["auth_mode"] == "SSO"
+
+    def test_idle_shutdown_configured(self):
+        assert self._finding().details["idle_shutdown_configured"] is True
+
+    def test_total_apps_evaluated(self):
+        assert self._finding().details["total_apps_evaluated"] == 2
+
+    def test_apps_by_status(self):
+        d = self._finding().details["apps_by_status"]
+        assert d == {"Deleted": 1, "Failed": 1}
+
+    def test_inservice_app_count_zero(self):
+        assert self._finding().details["inservice_app_count"] == 0
+
+    def test_pending_app_count_zero(self):
+        assert self._finding().details["pending_app_count"] == 0
+
+
+# ---------------------------------------------------------------------------
+# TestEvidenceContract
+# ---------------------------------------------------------------------------
+
+
+class TestEvidenceContract:
+    def _evidence(self):
+        return _run(domains=[_make_domain(age_days=60)])[0].evidence
+
+    def test_signals_used_non_empty(self):
+        assert len(self._evidence().signals_used) > 0
+
+    def test_signals_used_mentions_inservice(self):
+        sigs = " ".join(self._evidence().signals_used)
+        assert "InService" in sigs
+
+    def test_signals_used_mentions_list_apps(self):
+        sigs = " ".join(self._evidence().signals_used)
+        assert "ListApps" in sigs
+
+    def test_signals_used_mentions_domain_age(self):
+        sigs = " ".join(self._evidence().signals_used)
+        assert "domain age" in sigs
+
+    def test_signals_not_checked_mentions_last_user_activity(self):
+        not_checked = " ".join(self._evidence().signals_not_checked)
+        assert "LastUserActivityTimestamp" in not_checked
+
+    def test_signals_not_checked_mentions_health_checks(self):
+        not_checked = " ".join(self._evidence().signals_not_checked)
+        assert "health checks" in not_checked
+
+    def test_signals_not_checked_mentions_point_in_time(self):
+        not_checked = " ".join(self._evidence().signals_not_checked)
+        assert "point-in-time" in not_checked
+
+    def test_signals_not_checked_mentions_efs_storage(self):
+        not_checked = " ".join(self._evidence().signals_not_checked)
+        assert "EFS storage cost" in not_checked
+
+    def test_time_window(self):
+        assert self._evidence().time_window == "30 days"
+
+
+# ---------------------------------------------------------------------------
+# TestPagination
+# ---------------------------------------------------------------------------
+
+
+class TestPagination:
+    def test_multiple_domain_pages_aggregated(self):
+        sm = MagicMock()
+
+        domain_paginator = MagicMock()
+        domain_paginator.paginate.return_value = [
+            {"Domains": [_make_domain("d-1", "dom-1", age_days=60)]},
+            {"Domains": [_make_domain("d-2", "dom-2", age_days=60)]},
+        ]
+
+        apps_paginator = MagicMock()
+        apps_paginator.paginate.return_value = [{"Apps": []}]
+
+        def get_paginator(name):
+            if name == "list_domains":
+                return domain_paginator
+            if name == "list_apps":
+                return apps_paginator
+            raise ValueError(f"Unexpected: {name}")
+
+        sm.get_paginator.side_effect = get_paginator
+        sm.describe_domain.return_value = _describe_response()
+
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert len(findings) == 2
+
+    def test_multiple_apps_pages_aggregated(self):
+        """Apps across multiple pages are all checked."""
+        sm = _setup_sagemaker(domains=[_make_domain(age_days=60)])
+
+        apps_paginator = MagicMock()
+        apps_paginator.paginate.return_value = [
+            {"Apps": _make_apps("Deleted")},
+            {"Apps": _make_apps("InService")},  # second page has active app
+        ]
+
+        original = sm.get_paginator.side_effect
+
+        def patched(name):
+            if name == "list_apps":
+                return apps_paginator
+            return original(name)
+
+        sm.get_paginator.side_effect = patched
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert findings == []
+
+
+# ---------------------------------------------------------------------------
+# TestMultipleDomains
+# ---------------------------------------------------------------------------
+
+
+class TestMultipleDomains:
+    def test_only_idle_domains_emitted(self):
+        idle = _make_domain("d-idle", "idle-domain", age_days=60)
+        young = _make_domain("d-young", "young-domain", age_days=10)
+
+        sm = MagicMock()
+
+        domain_paginator = MagicMock()
+        domain_paginator.paginate.return_value = [{"Domains": [idle, young]}]
+
+        apps_paginator = MagicMock()
+        apps_paginator.paginate.return_value = [{"Apps": []}]
+
+        def get_paginator(name):
+            if name == "list_domains":
+                return domain_paginator
+            if name == "list_apps":
+                return apps_paginator
+            raise ValueError(f"Unexpected: {name}")
+
+        sm.get_paginator.side_effect = get_paginator
+        sm.describe_domain.return_value = _describe_response()
+
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert len(findings) == 1
+        assert findings[0].details["domain_id"] == "d-idle"
+
+    def test_describe_failure_skips_one_continues_other(self):
+        """DescribeDomain fails for one domain but succeeds for another."""
+        d1 = _make_domain("d-fail", "fail-domain", age_days=60)
+        d2 = _make_domain("d-ok", "ok-domain", age_days=60)
+
+        sm = MagicMock()
+
+        domain_paginator = MagicMock()
+        domain_paginator.paginate.return_value = [{"Domains": [d1, d2]}]
+
+        apps_paginator = MagicMock()
+        apps_paginator.paginate.return_value = [{"Apps": []}]
+
+        def get_paginator(name):
+            if name == "list_domains":
+                return domain_paginator
+            if name == "list_apps":
+                return apps_paginator
+            raise ValueError(f"Unexpected: {name}")
+
+        sm.get_paginator.side_effect = get_paginator
+
+        call_count = [0]
+
+        def describe_side_effect(**kwargs):
+            call_count[0] += 1
+            if call_count[0] == 1:
+                raise ClientError(
+                    {"Error": {"Code": "ResourceNotFoundException", "Message": "gone"}},
+                    "DescribeDomain",
+                )
+            return _describe_response(domain_id=kwargs.get("DomainId", "d-ok"))
+
+        sm.describe_domain.side_effect = describe_side_effect
+
+        findings = find_idle_sagemaker_domains(_make_session(sm), "us-east-1")
+        assert len(findings) == 1
+        assert findings[0].details["domain_id"] == "d-ok"
+
+
+# ---------------------------------------------------------------------------
+# TestCustomThreshold
+# ---------------------------------------------------------------------------
+
+
+class TestCustomThreshold:
+    def test_custom_threshold_7_days(self):
+        findings = _run(
+            domains=[_make_domain(age_days=7)],
+            threshold=7,
+        )
+        assert len(findings) == 1
+
+    def test_age_just_below_custom_threshold_skipped(self):
+        findings = _run(
+            domains=[_make_domain(age_days=6)],
+            threshold=7,
+        )
+        assert findings == []
+
+    def test_custom_threshold_stored_in_details(self):
+        findings = _run(
+            domains=[_make_domain(age_days=60)],
+            threshold=7,
+        )
+        assert findings[0].details["idle_days_threshold"] == 7
+
+
+# ---------------------------------------------------------------------------
+# TestTitleAndReason
+# ---------------------------------------------------------------------------
+
+
+class TestTitleAndReason:
+    def test_title_is_spec_mandated(self):
+        findings = _run(domains=[_make_domain(age_days=60)])
+        assert findings[0].title == "Idle SageMaker domain review candidate"
+
+    def test_reason_contains_key_wording(self):
+        findings = _run(domains=[_make_domain(age_days=60)])
+        assert "InService SageMaker domain" in findings[0].reason
+        assert "60 days old" in findings[0].reason
+        assert "no InService or Pending apps" in findings[0].reason
+
+    def test_reason_does_not_imply_inactivity_duration(self):
+        """Spec: threshold applies to domain age, not measured inactivity."""
+        findings = _run(domains=[_make_domain(age_days=60)])
+        assert "for at least" not in findings[0].reason
+
+
+# ---------------------------------------------------------------------------
+# TestRuleMetadata
+# ---------------------------------------------------------------------------
+
+
+class TestRuleMetadata:
+    def test_rule_id(self):
+        assert RULE_METADATA["id"] == "aws.sagemaker.domain.idle"
+
+    def test_category(self):
+        assert RULE_METADATA["category"] == "ai"
+
+    def test_service(self):
+        assert RULE_METADATA["service"] == "sagemaker"
+
+    def test_cost_impact(self):
+        assert RULE_METADATA["cost_impact"] == "high"
diff --git a/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py b/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py
index faf66cd..be84955 100644
--- a/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py
+++ b/tests/cleancloud/safety/aws/test_aws_iam_policy_parity.py
@@ -63,7 +63,10 @@
     # aws.sagemaker.notebook.idle
     "sagemaker:ListNotebookInstances",
     "sagemaker:DescribeNotebookInstance",
-    # aws.sagemaker.studio_app.idle
+    # aws.sagemaker.domain.idle
+    "sagemaker:ListDomains",
+    "sagemaker:DescribeDomain",
+    # aws.sagemaker.studio_app.idle (ListApps shared with domain.idle)
     "sagemaker:ListApps",
     "sagemaker:DescribeApp",
     # aws.sagemaker.training_job.long_running
diff --git a/tests/e2e/aws/test_aws_ai_rules_smoke.py b/tests/e2e/aws/test_aws_ai_rules_smoke.py
index 62ff116..1db603c 100644
--- a/tests/e2e/aws/test_aws_ai_rules_smoke.py
+++ b/tests/e2e/aws/test_aws_ai_rules_smoke.py
@@ -8,6 +8,9 @@
     find_idle_bedrock_provisioned_throughputs,
 )
 from cleancloud.providers.aws.rules.ai.ec2_gpu_idle import find_idle_gpu_instances
+from cleancloud.providers.aws.rules.ai.sagemaker_domain_idle import (
+    find_idle_sagemaker_domains,
+)
 from cleancloud.providers.aws.rules.ai.sagemaker_endpoint_idle import (
     find_idle_sagemaker_endpoints,
 )
@@ -24,6 +27,7 @@
 _AWS_AI_RULE_IDS = {
     "aws.sagemaker.endpoint.idle",
     "aws.sagemaker.notebook.idle",
+    "aws.sagemaker.domain.idle",
     "aws.ec2.gpu.idle",
     "aws.bedrock.provisioned_throughput.idle",
     "aws.sagemaker.studio_app.idle",
@@ -40,6 +44,7 @@ def test_aws_ai_rules_run_without_error():
     rules = [
         find_idle_sagemaker_endpoints,
         find_idle_sagemaker_notebooks,
+        find_idle_sagemaker_domains,
         find_idle_gpu_instances,
         find_idle_bedrock_provisioned_throughputs,
         find_idle_sagemaker_studio_apps,