From 53153641b9c9cc91dfc3f014a58eba5473a432f6 Mon Sep 17 00:00:00 2001 From: Stanislav Uschakow Date: Mon, 8 Jun 2026 10:46:13 +0200 Subject: [PATCH] docs: generating documentation for the architecture The docs explain the overall architecture, AWS related entrypoints and visualize the dataflow using diagrams. The documentation should help developers to understand the code and accelerate the onboarding process for this project. Signed-off-by: Stanislav Uschakow --- docs/01-overview.md | 212 ++++++++++++++++++ docs/02-invocation-control-flow.md | 199 +++++++++++++++++ docs/03-pipeline-orchestration.md | 208 +++++++++++++++++ docs/04-execution-fargate-ec2.md | 196 ++++++++++++++++ docs/05-storage-dataflow.md | 178 +++++++++++++++ docs/06-aws-provisioning.md | 206 +++++++++++++++++ docs/07-kernelci-kcidb-integration.md | 244 ++++++++++++++++++++ docs/08-analysis-regression.md | 308 ++++++++++++++++++++++++++ docs/09-cost-leak-prevention.md | 224 +++++++++++++++++++ 9 files changed, 1975 insertions(+) create mode 100644 docs/01-overview.md create mode 100644 docs/02-invocation-control-flow.md create mode 100644 docs/03-pipeline-orchestration.md create mode 100644 docs/04-execution-fargate-ec2.md create mode 100644 docs/05-storage-dataflow.md create mode 100644 docs/06-aws-provisioning.md create mode 100644 docs/07-kernelci-kcidb-integration.md create mode 100644 docs/08-analysis-regression.md create mode 100644 docs/09-cost-leak-prevention.md diff --git a/docs/01-overview.md b/docs/01-overview.md new file mode 100644 index 0000000..bbf25ba --- /dev/null +++ b/docs/01-overview.md @@ -0,0 +1,212 @@ +# System Overview & Component Map + +`pullab_cloud` (Python package `kernel_ci_cloud_labs`) is a KernelCI "pull lab": it bridges the KernelCI API to ephemeral AWS infrastructure, running kernel boot/benchmark tests on freshly spawned EC2 VMs and reporting results back to KernelCI/KCIDB. Two end-to-end flows drive it: the **direct pipeline run** (CLI / EventBridge) and the **pull-lab poller** flow (kernelci-api -> pipeline -> KCIDB). + +## 1. Layered architecture + +A small **registry** decouples the orchestration core from concrete cloud backends. Three pluggable abstractions - *provider*, *storage*, *auth* - are registered by decorator and instantiated by name from configuration. + +```mermaid +graph TD + CLI["cli.py (kernel-ci-cloud-runner)"] + MAIN["main.py"] + EB["eventbridge_handler.py"] + POLL["pull_labs_poller.py"] + REG["core/registry.py"] + PIPE["core/pipeline.run_pipeline"] + PROV["providers/aws_provider.AWSProvider"] + STOR["storage/s3_storage.S3Storage"] + AUTH["auth/aws_auth.AWSAuth"] + + CLI --> REG + MAIN --> REG + EB --> REG + POLL --> REG + REG --> PROV + REG --> STOR + REG --> AUTH + CLI --> PIPE + MAIN --> PIPE + EB --> PIPE + POLL --> PIPE + PIPE --> PROV + PIPE --> STOR + PROV --> AUTH + STOR --> AUTH +``` + +### The registry (`core/registry.py`) + +`registry.py` declares three module-level dicts - `PROVIDER_REGISTRY`, `STORAGE_REGISTRY`, `AUTH_REGISTRY` - plus the decorators `register_provider` / `register_storage` / `register_auth` that populate them, and read-side helpers `get_provider` / `get_storage` / `get_auth`. Concrete classes self-register at import time: `AWSProvider` via `@register_provider("aws")`, `S3Storage` via `@register_storage("s3")`, `AWSAuth` via `@register_auth("aws")`. + +Since registration only fires when the defining module is imported, every entry point first calls `main.import_all_packages(...)` to walk and import every submodule of `kernel_ci_cloud_labs.providers`, `.storage`, and `.auth` so the decorators run before the registries are read. + +## 2. Entry points + +There are **four entry-point objects**. Three - the CLI, `main.main()`, and the EventBridge handler - call `run_pipeline` directly; `PullLabsPoller` calls it indirectly through a swappable job executor. All converge on the same registry-based `auth -> storage -> provider -> run_pipeline` sequence. [Chapter 02](02-invocation-control-flow.md) enumerates these as **five invocation modes**, since the poller is reachable three ways (`run_forever`, `--once`, `lambda_handler`). + +| Entry point | File | Trigger | +|---|---|---| +| `kernel-ci-cloud-runner` CLI | `cli.py` | Human / shell | +| `main.main()` | `main.py` | Library / direct call | +| `handle_eventbridge` (a.k.a. `lambda_handler`) | `eventbridge_handler.py` | EventBridge / Lambda | +| `PullLabsPoller` | `pull_labs_poller.py` | kernelci-api polling loop | + +### CLI (`cli.py`) + +`cli.py` builds an `argparse` tree rooted at `kernel-ci-cloud-runner aws ...` with subcommands `run`, `analyze`, and `setup` (`configure`, `upload-rpms`, `upload-tests`, `cleanup`, `validate`). `run` dispatches to `cmd_run`, which: + +1. Optionally downloads a config from S3 when `--config-s3` is given (S3 config takes precedence over `--config`, for EventBridge-style triggers). +2. Imports all provider/storage/auth packages so the registries populate. +3. Loads `config.json` and merges in `credentials.json` via `main.load_credentials`. +4. Instantiates the three backends from the registries and calls `run_pipeline`: + +```python +auth = AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]](config, credentials) +storage = STORAGE_REGISTRY[config["storage"]["type"]](storage_config, auth) +provider = PROVIDER_REGISTRY[config["provider"]](auth, config, storage) +run_pipeline(provider, storage, run_dir=run_dir) +``` + +`main.main()` performs the same lookup/instantiation; `eventbridge_handler.handle_eventbridge` repeats it after fetching the config from S3 (Flow B). + +## 3. The pipeline (`core/pipeline.run_pipeline`) + +`run_pipeline(provider, storage, run_dir=None)` is the provider-agnostic orchestration heart - it only calls methods on the `provider` and `storage` objects handed to it. Key behaviours grounded in the source: + +- **Run prefix.** Derives a per-run prefix `run_{test_id}_{run_timestamp}`, where `run_timestamp` is `datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")`. This scopes every S3 key and the per-run CloudWatch VM log group, and is written into `provider.config["run_prefix"]`. +- **Boot-log public-read probe.** Before spawning, `_warn_if_logs_not_public` does a read-only bucket-policy check and only *warns* if kernel boot logs would not be publicly reachable by KCIDB dashboard users; it never aborts. +- **Crash-aware wait.** `provider.wait_for_task_completion()` (section 4) blocks until the ECS task stops; a non-zero container exit shortens the subsequent VM-log wait and publishes the container's own log as the failure URL. +- **Artifacts.** After VM logs are pulled, `collect_run_artifacts` downloads each instance's `console-output.log` and writes `artifacts.json` (section 6). +- **Cleanup is unconditional.** A `finally` block stops the ECS task and terminates every EC2 instance tagged `run_prefix=` still `pending`/`running`. +- **Summary.** `create_summary` parses `vms/.log` files into per-instance PASS/FAIL rows and writes `summary.json`; the `vms.instances` list is the per-VM ground truth later joined against `artifacts.json` by the poller. + +## 4. AWS provider (`providers/aws_provider.AWSProvider`) + +`AWSProvider` runs a single launcher container on **AWS Fargate** and waits for it to finish, with kernel-crash detection layered on plain status polling. + +### `spawn_container` + +Builds ECS `containerOverrides` environment variables and uploads the test config to S3. The container environment: + +| Env var | Source | +|---|---| +| `RUN_PREFIX` | `config["run_prefix"]` | +| `S3_BUCKET` | `storage.bucket` | +| `AWS_REGION` | `config["region"]` | +| `EC2_LOG_GROUP` | first `cloudwatch.log_groups` key containing `/ec2/` | +| `KCI_DEBUG` | host env `KCI_DEBUG` (only if set) | +| `TEST_CONFIG_FILENAME` | always `"test_config.json"` | + +The full `config["test_config"]` is serialised to JSON and uploaded to `{run_prefix}/test_config.json`; only the bare filename is passed in `TEST_CONFIG_FILENAME`, and the container rebuilds the full path from `RUN_PREFIX` + filename. The task launches with `ecs.run_task(..., launchType="FARGATE", enableExecuteCommand=True, ...)`, with up to 5 retries for transient (IAM-propagation) errors. + +### `wait_for_task_completion` + +Polls task status until `STOPPED` while tailing the per-run VM console log group `{EC2_LOG_GROUP}/{run_prefix}` for kernel crash/stall patterns. It aborts early - stopping the task and raising `RuntimeError` - on: + +- a kernel-side crash/stall pattern (panic, Oops:, BUG:, soft lockup, RCU stall, hung task, GP fault, paging fault); +- no new console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` seconds (default **600**); +- overall `PULLAB_TASK_WAIT_TIMEOUT_SEC` seconds elapsed (default **3600**). + +Poll interval is `PULLAB_TASK_POLL_INTERVAL_SEC` (default **30**); a progress line logs every `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). Crash detection is disabled (falling back to pure status polling) when no `/ec2/` log group or no `run_prefix` is configured. + +## 5. In-container launcher and VM client + +### Container image (`dockerfiles/aws/test.dockerfile`) + +Image is `python:3.12-slim`, installs **only** `boto3`, and copies `launch_vm.py`, `debug_aws_setup.py`, the package `__init__.py` stubs, and `core/log_scrub.py` (so `launch_vm.py`'s `from kernel_ci_cloud_labs.core.log_scrub import scrub_text` resolves). `CMD` runs `debug_aws_setup.py` (ignoring its exit code) then `launch_vm.py`. + +### `launch_vm.py` + +`launch_vms_from_config()` is the container entrypoint. It reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `TEST_CONFIG_FILENAME` from the environment, loads the test config from S3 at `{run_prefix}/{config_filename}`, expands the `vms` array (one VM per test, `min_count` instances each), and launches each in its own thread. Per VM, `launch_and_test_vm`: + +1. Spawns an EC2 instance (`VMLauncher.spawn_vm`) tagged `run_prefix=`. +2. Runs the test via SSM (`execute_test_via_ssm`). +3. **Always** calls `check_test_result` to read `result.txt` from S3 as the source of truth, even if SSM reported `Failed` - short-lived VMs often shut down before SSM reports final status. +4. In `finally`, runs `cleanup`, which captures the EC2 serial console (scrubbed for secrets) and uploads it to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log`. + +`execute_test_via_ssm` sets the SSM command timeout to `min(max_runtime + 3600, 43200)` (max 12h) and directs SSM Run Command output to CloudWatch log group `{ec2_log_group}/{run_prefix}`. + +### `test-vm-client.sh` + +The client script SSM downloads and runs inside each VM: + +- Finds and runs `run*.sh` stages **sorted with `sort -V`**, executing exactly one stage per `RUN_ID` (persisted per-instance in S3 and incremented each invocation). +- Uses **exit code 194** to signal the SSM agent to reboot the VM and re-run for the next stage; the final stage exits with the script's own code and triggers a delayed shutdown. +- On the final stage, uploads `result.txt`, `stats.json`, and every `benchmark-*.csv` (and `results_*`) to `{run_prefix}/test_{test}/output/{instance_id}/`. On a mid-chain failure it writes a `FAILED` `result.txt`/`stats.json` and stops the chain. + +## 6. Results & artifacts + +`core/artifacts.collect_run_artifacts` discovers each `(test, instance_id)` pair under the run prefix, downloads `console-output.log` into `run_dir/vms/-console.log`, and writes an `artifacts.json` manifest with sha256/size/content-type plus the S3 URI and a public HTTPS `log_url`. The URL is built by `s3_public_url`, forming `https://.s3..amazonaws.com/`; it only resolves when the bucket carries the public-read boot-log policy. + +`storage/s3_storage.S3Storage` provides the S3 backend. `upload_tests` and `upload_test_payload` MD5-compare the local artifact's hash against the existing S3 object's ETag to skip redundant uploads. Separately, `copy_external_requirements` copies each *enabled* folder listed in a test's `external_requirements.json` from the `external_storage` bucket into `{run_prefix}/shared/{folder}/` via server-side S3 `copy_object`, skipping a folder that already exists there - an S3 folder-existence check (`_check_s3_folder_exists`), **not** an MD5 comparison. + +## 7. Authentication & resource provisioning (`auth/aws_auth.AWSAuth`) + +`AWSAuth.authenticate()` is more than a credential check: when the corresponding config sections are present, the *same* call provisions the full AWS footprint the pipeline needs: + +- IAM roles via `AWSRoleManager.ensure_exists` (honouring `force_recreate_roles`); +- the ECR repository, and if a `docker` section is present, builds and pushes the Docker image; +- the ECS cluster; +- CloudWatch log groups; +- the ECS task definition. + +It tracks whether any resource was newly created (`_resources_created`) so the pipeline can wait for AWS propagation before spawning. + +## 8. Flow A - CLI / EventBridge direct run + +```mermaid +sequenceDiagram + participant U as "CLI / EventBridge" + participant P as "run_pipeline" + participant A as "AWSAuth" + participant PR as "AWSProvider" + participant ECS as "ECS Fargate task" + participant VM as "EC2 VMs" + participant S3 as "S3" + U->>P: instantiate auth/storage/provider, call run_pipeline + P->>A: authenticate (provision IAM/ECR/ECS/CW) + P->>PR: spawn_container (env + test_config.json to S3) + PR->>ECS: run_task FARGATE + ECS->>VM: launch_vm spawns EC2, runs test via SSM + VM->>S3: upload result.txt, console-output.log, benchmarks + PR->>P: wait_for_task_completion (crash/hang/timeout aware) + P->>S3: collect_run_artifacts -> artifacts.json + P->>U: summary.json +``` + +`eventbridge_handler.handle_eventbridge` downloads the config from `config_s3_uri`, calls `_prepare_kernel_rpms` (a placeholder that currently just expects pre-uploaded RPMs), makes the config run-local by appending a unique suffix to `test_config.test_id`, then runs the standard `run_pipeline` via the registry instantiation. + +## 9. Flow B - Pull-lab poller (`pull_labs_poller.PullLabsPoller`) + +- **Polling/matching.** `poll_once` issues `GET /events?state=available&kind=job` and matches runtime + platform. +- **Claiming.** `_claim_node` records `data.job_id` on the node and **leaves the state `available`**. kernelci-api's state machine forbids `available -> running`, and `available -> closing` would be auto-finished (~60s, no result) by kernelci-pipeline's timeout handler - so `data.job_id` is the only viable claim marker. +- **Translation.** `translate_job` (in `pull_labs_translate.py`) raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing, produces exactly one `vms[*]` entry, derives `test_id = pulllab--`, and passes `KERNEL_URL` / `MODULES_URL` / `ARCH` (plus optional `ROOTFS_URL`, `KERNELCI_NODE_ID`) as `test_params`. +- **Execution.** `_default_job_executor` uses the same registry-based instantiation as `main.py`, calls `run_pipeline`, and post-processes via `_extract_test_results`. +- **Build ID.** `resolve_build_id` walks `node.parent` up to **8** hops to a `kbuild` ancestor and returns `origin:`. +- **KCIDB reporting.** Direct KCIDB submission from the poller is **currently disabled** (the `submit_tests` call is commented out). Instead the boot-log URL is written onto the maestro node's `artifacts.test_log` so kernelci-pipeline's `send_kcidb` emits the single dashboard-visible row. The row-building code (and `_node_result_from_rows`) is kept so outcome derivation still works and dual submission can be re-enabled cheaply. + +`kcidb_submit.build_test_row` validates `origin` against `[a-z0-9_]+` and `path` against the KCIDB v5.3 dot-segment grammar, raising `ValueError` on invalid values; `KCIDB_SCHEMA_VERSION` is `{"major": 5, "minor": 3}`. + +## 10. Configuration (`examples/aws/config.json`) + +The example config's `kernelci` section sets `runtime_name = "pull-labs-aws-ec2"`, `platforms = ["aws-ec2-x86_64"]`, and `kcidb_origin = "pullab_cloud_aws"`, with both `api_token` and `kcidb_jwt` left `null` in the file - these are injected at runtime via the `KERNELCI_API_TOKEN` / `KCIDB_JWT` env vars (or `UNIFIED_TOKEN` as a shared fallback). The `platforms` filter narrows the shared `pull-labs-aws-ec2` runtime to x86_64 jobs. + +## Component reference + +| Component | File | Role | +|---|---|---| +| CLI | `cli.py` | Argparse entry point; `cmd_run` instantiates backends and runs the pipeline | +| Library main | `main.py` | `import_all_packages` + registry instantiation + `run_pipeline` | +| Registry | `core/registry.py` | `PROVIDER`/`STORAGE`/`AUTH_REGISTRY` + `register_*` decorators | +| Pipeline | `core/pipeline.py` | `run_pipeline` orchestration, summary, cleanup | +| AWS provider | `providers/aws_provider.py` | Fargate `run_task`, crash-aware completion wait | +| AWS auth | `auth/aws_auth.py` | Credentials + IAM/ECR/ECS/CloudWatch provisioning | +| S3 storage | `storage/s3_storage.py` | Test/payload upload, shared external requirements | +| Launcher | `launch_vm.py` | In-container EC2 spawn + SSM test execution | +| VM client | `vm-tests/test-vm-client.sh` | Staged `run*.sh` execution with reboot via exit 194 | +| Artifacts | `core/artifacts.py` | `collect_run_artifacts`, `artifacts.json`, public log URLs | +| EventBridge | `eventbridge_handler.py` | Lambda-compatible scheduled trigger | +| Poller | `pull_labs_poller.py` | kernelci-api <-> pipeline <-> KCIDB bridge | +| Translate | `pull_labs_translate.py` | PULL_LABS job_definition -> run config | +| KCIDB submit | `kcidb_submit.py` | KCIDB v5.3 row/revision build + REST submit | diff --git a/docs/02-invocation-control-flow.md b/docs/02-invocation-control-flow.md new file mode 100644 index 0000000..0bc542d --- /dev/null +++ b/docs/02-invocation-control-flow.md @@ -0,0 +1,199 @@ +# Invocation & Control Flow (5 entry modes) + +`pullab_cloud` (Python package `kernel_ci_cloud_labs`) has five entry points. All converge on the same registry-driven pattern (build `auth` -> `storage` -> `provider`, then `run_pipeline`), but differ in who triggers them, where config comes from, and what wraps the pipeline call. + +| # | Entry mode | Source symbol | Trigger | +|---|------------|---------------|---------| +| 1 | CLI subcommand | `kernel_ci_cloud_labs.cli:main` | Operator running `kernel-ci-cloud-runner aws run ...` | +| 2 | Library `main()` | `kernel_ci_cloud_labs.main:main` | `python -m`/import; default `config_path="examples/aws/config.json"` | +| 3 | EventBridge / Lambda (one-shot pipeline) | `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler` | EventBridge scheduled rule / custom event | +| 4 | Pull-lab poller (CLI / long-running loop) | `kernel_ci_cloud_labs.pull_labs_poller:main` | Container loop or cron with `--once` | +| 5 | Pull-lab poller (Lambda) | `kernel_ci_cloud_labs.pull_labs_poller.lambda_handler` | Lambda, single poll cycle per invocation | + +Modes 1-3 directly run one pipeline. Modes 4-5 poll `kernelci-api` for pull-lab jobs and invoke the pipeline indirectly through a pluggable *job executor*. + +--- + +## The shared core: registry instantiation + +Four of the five modes build the same trio from the registries in `kernel_ci_cloud_labs.core.registry`, keyed off the config dict: + +- `AUTH_REGISTRY[config["auth_credentials"]["auth_provider"]]` +- `STORAGE_REGISTRY[config["storage"]["type"]]` +- `PROVIDER_REGISTRY[config["provider"]]` + +To populate the registries, every submodule of `providers`, `storage`, and `auth` must be imported first so the `register_*` decorators run. `import_all_packages` (`main.py`) does this via `importlib.import_module` plus `pkgutil.iter_modules` walking `package.__path__`. + +The storage object is always built from a *merged* config - `config["storage"]` plus root-level `region` and `external_storage` - and this exact merge is duplicated in all four call sites (`cli.py`, `main.py`, `eventbridge_handler.py`, `pull_labs_poller.py`): + +```python +storage_config = { + **config["storage"], + "region": config.get("region"), + "external_storage": config.get("external_storage", {}), +} +``` + +`run_pipeline` lives in `core/pipeline.py` with signature `run_pipeline(provider, storage, run_dir=None)` and returns the summary dict. The generated run prefix has the format `run_{test_id}_{datetime}`, and per-instance VM console logs land at the S3 key `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` (`launch_vm.py`). + +--- + +## Mode 1 - CLI subcommand (`cli.py`) + +The console-script `kernel-ci-cloud-runner` is bound to `kernel_ci_cloud_labs.cli:main` (`setup.py`). `main()` builds a nested argparse tree (`cloud` -> `command` -> `setup_command`) and dispatches via `args.func(args)`. When no `func` is bound (incomplete subcommand), it prints help for the *deepest subparser reached* and exits code `1`: `setup` help when `cloud==aws and command==setup`, otherwise the `aws` parser help, otherwise the top-level parser help. + +The pipeline-running subcommand `aws run` is handled by `cmd_run`. Control flow: + +1. Set up logging into a fresh run directory. +2. Resolve config path: `config_path = args.config` by default, but **if `args.config_s3` is set, config is downloaded from S3 to a `NamedTemporaryFile(delete=False)` and `config_path` is reassigned** - so `--config-s3` takes precedence over `--config` (used for EventBridge-style triggers passing config in S3). +3. `import_all_packages` runs for `providers`, `storage`, `auth` *before* config load and registry lookups. +4. Load config JSON and call `load_credentials(config_path)`. +5. Build `auth`, the merged `storage_config` + `storage`, then `provider`. +6. Call `run_pipeline(provider, storage, run_dir=run_dir)` directly. + +```mermaid +flowchart TD + Main["main()"] --> Parse["argparse parse_args"] + Parse --> HasFunc{"hasattr args func?"} + HasFunc -->|no| Help["print deepest subparser help
sys.exit(1)"] + HasFunc -->|yes| Dispatch["args.func(args)"] + Dispatch --> CmdRun["cmd_run"] + CmdRun --> S3{"args.config_s3 set?"} + S3 -->|yes| Dl["download to NamedTemporaryFile
reassign config_path"] + S3 -->|no| UseCfg["use args.config"] + Dl --> Imports["import_all_packages x3"] + UseCfg --> Imports + Imports --> Build["auth / storage / provider"] + Build --> Run["run_pipeline(provider, storage, run_dir)"] +``` + +Note: CLI failure paths use `sys.exit(1)` (`cli.py`); subcommands like `validate` and `analyze` propagate the underlying function's return value through `sys.exit(...)`. + +--- + +## Mode 2 - Library `main()` (`main.py`) + +`main.py:main(config_path="examples/aws/config.json")` is the canonical library invocation and the template the other modes mirror. Module import has side effects: at import time it reads `LOG_LEVEL`, creates a run directory, and configures logging. + +`main()`: + +1. `import_all_packages` for `providers`, `storage`, `auth` to populate registries. +2. Load config JSON. +3. `load_credentials(config_path)` - reads a `credentials.json` sibling of the config file; returns `None` (after a warning) if absent. +4. Look up the three registry classes, build `auth`, merged `storage`, and `provider`. +5. `run_pipeline(provider, storage, run_dir=run_dir)`. + +--- + +## Mode 3 - EventBridge / Lambda one-shot (`eventbridge_handler.py`) + +The Lambda entry point is an alias: `lambda_handler = handle_eventbridge`. The Lambda handler name is `kernel_ci_cloud_labs.eventbridge_handler.lambda_handler`. + +`handle_eventbridge(event, context=None)` control flow: + +1. Set up per-invocation logging and generate an `invocation_id`. +2. Read `config_s3_uri` from the event; if missing, **return `{"status": "error", ...}` immediately**. Resolve `region` from `event["region"]`, else `AWS_DEFAULT_REGION`, else `us-west-2`. +3. Inside a `try`: + - Download config from S3 to a temp file (`_download_config`) and load it. + - `_prepare_kernel_rpms(config, region)` - a logging-only no-op expecting RPMs to be pre-uploaded. + - `_make_config_run_local(config)` - appends `uuid4().hex[:8]` to `test_config["test_id"]` so parallel invocations write to distinct S3 prefixes. + - Write the mutated config back to the temp file. + - Import packages, build `auth`/`storage`/`provider`, then call `run_pipeline(provider, storage, run_dir=run_dir)` directly. + - Return `{"status": "success", ...}`. +4. `except Exception` returns `{"status": "error", ...}`. +5. `finally`: if `config_path` is bound, best-effort `os.unlink(config_path)`, swallowing `OSError`. + +--- + +## Modes 4 & 5 - Pull-lab poller (`pull_labs_poller.py`) + +The poller bridges `kernelci-api` and `pullab_cloud`: it polls for available pull-lab job nodes, claims each, translates its job definition into a run config, runs it via a *job executor*, and finishes the node back in `kernelci-api`. KCIDB direct submission is currently disabled (see below). + +### Construction and configuration precedence (`PullLabsPoller.__init__`) + +- `api_token` precedence: `KERNELCI_API_TOKEN` -> `UNIFIED_TOKEN` -> `config["kernelci"]["api_token"]`. +- `poll_interval_sec` default `30` from `DEFAULT_POLL_INTERVAL_SEC`. +- Cursor file default `/tmp/pullab_cloud_cursor.json` from `DEFAULT_CURSOR_FILE`. +- KCIDB endpoint resolved by `_resolve_kcidb_endpoint` with four-tier precedence: (`KCIDB_SUBMIT_URL` + `KCIDB_JWT`) > `KCIDB_REST` (`https://@host/submit`) > `UNIFIED_TOKEN` (+ `KCIDB_SUBMIT_URL` or config URL) > config `kcidb_submit_url`/`kcidb_jwt`. +- `self.job_executor` defaults to `_default_job_executor`; a custom executor passed to the constructor bypasses it. +- With no custom executor, `_validate_default_executor_deps()` eagerly imports `boto3` plus the executor packages so a missing dependency fails at startup, not on first event. +- `_validate_api_token` does a one-shot `GET /whoami` preflight; **non-fatal** - a transient error only logs a warning. + +### Default job executor (`_default_job_executor`) + +Mirrors the registry pattern with two differences from `main.py`: + +1. `auth` is built with **`credentials=None`** - the poller has no `credentials.json` step. +2. `run_pipeline(provider, storage)` is called with **no `run_dir`**, so the pipeline creates its own. + +It then calls `_extract_test_results(summary or {})` to produce `(per_test_results, optional_log_url)`. + +### Polling and per-event flow + +`fetch_events(from_ts)` GETs `/events` with `state=available`, `kind=job`, `recursive=true`, `limit=1000`, `from=`. + +`process_event` returns `True` (benign skip) when: runtime mismatch, platform mismatch, no `job_definition` artifact, or the node cannot be claimed. + +Once claimed, a node *must* be finished: a default `NodeOutcome("incomplete", "Infrastructure", "unexpected internal error")` is set, `_execute_job` runs inside a `try`, and `_finish_node(node_id, node_outcome)` is always called in the `finally`. + +`_claim_node` claims by writing `data.job_id = ":"` while the node stays `available`. It re-reads the node first and skips if `state != "available"` or if `data.job_id` is already set. The claim is best-effort - `kernelci-api` has no compare-and-set, so the PUT is a full-document overwrite and two pollers can both claim the same node; parallel pollers must be partitioned by platform. + +In `_execute_job`, if the job executor raises, it is an infrastructure failure: outcome becomes `incomplete`/`Infrastructure`, and a synthetic `{"name": "boot.infrastructure", "status": "ERROR"}` row is emitted. + +**KCIDB direct submission is commented out / disabled.** Instead, the boot log URL is written onto the maestro node's `artifacts.test_log` (extra URLs under `test_log_{i}`), which `send_kcidb` later picks up. + +`poll_once` reads the cursor, fetches events, processes each (swallowing per-event exceptions), advances `last_ts` to the last event's `timestamp`, writes the cursor if it changed, and returns the processed count. + +`run_forever` loops `poll_once` and only sleeps `poll_interval_sec` when `count == 0`. + +```mermaid +flowchart TD + Poll["poll_once"] --> Fetch["fetch_events(from_ts)
state=available kind=job"] + Fetch --> Loop["for each event"] + Loop --> Match{"runtime + platform match
and job_definition present?"} + Match -->|no| Skip["return True (skip)"] + Match -->|yes| Claim{"_claim_node OK?"} + Claim -->|no| Skip + Claim -->|yes| Exec["_execute_job"] + Exec --> Trans["translate_job(jobdef, base_config, node_id)"] + Trans --> JobEx["job_executor(run_config)"] + JobEx -->|raises| Infra["incomplete / Infrastructure
boot.infrastructure ERROR row"] + JobEx -->|ok| Rows["build_test_row per result"] + Rows --> Artifacts["write log URL to artifacts.test_log"] + Infra --> Finish["_finish_node (in finally)"] + Artifacts --> Finish +``` + +### CLI vs Lambda entry points + +**Mode 4 - `main(argv)`** builds an argparse parser with `--config`, `--once`, `--log-level`. It loads the base config (`_load_base_config`) and constructs the poller. With `--once` it runs a single `poll_once` and returns `0`; otherwise it calls `run_forever()`. + +**Mode 5 - `lambda_handler(event, context=None)`** runs a single `poll_once` per invocation. It reads `config_path` from `event["config_path"]` or the `PULLAB_BASE_CONFIG` env var, loads config, constructs the poller, and returns `{"status": "ok", "processed": n}`. + +`_load_base_config` resolves its path from the argument, else `PULLAB_BASE_CONFIG`, else `examples/aws/config.json`. + +### Translation (`translate_job`) + +`_execute_job` calls `translate_job(jobdef, self.base_config, node_id=node_id)` - only `node_id` is passed; `platform_map` and `test_type_map` default internally to `DEFAULT_PLATFORM_MAP` / `DEFAULT_TEST_TYPE_MAP` (`pull_labs_translate.py`). `translate_job` (`pull_labs_translate.py`) deep-copies `base_config` and rewrites `test_config`; it raises `ValueError` if `artifacts.kernel` or `artifacts.modules` is missing. + +--- + +## Environment variables referenced across the entry modes + +| Env var | Used by | Purpose | +|---------|---------|---------| +| `LOG_LEVEL` | all modes | logging level | +| `KERNELCI_API_BASE_URI` | poller | kernelci-api base URI | +| `KERNELCI_API_TOKEN` / `UNIFIED_TOKEN` | poller | API token (in that precedence) | +| `KERNELCI_RUNTIME_NAME` | poller | runtime label used in claim job_id | +| `KERNELCI_PLATFORMS` | poller | optional platform allowlist | +| `KCIDB_SUBMIT_URL` / `KCIDB_JWT` / `KCIDB_REST` / `KCIDB_ORIGIN` | poller | KCIDB endpoint resolution | +| `PULLAB_BASE_CONFIG` | poller (CLI + Lambda) | default base config path | +| `PULLAB_POLL_INTERVAL_SEC` / `PULLAB_CURSOR_FILE` | poller | poll interval / cursor file overrides | +| `AWS_DEFAULT_REGION` | eventbridge_handler | region fallback (-> `us-west-2`) | + +--- + +## Summary + +All five entry modes funnel through the same registry-based `auth -> storage -> provider -> run_pipeline` core. Modes 1-3 run exactly one pipeline per invocation, sourcing config from a local path (mode 2), an optional S3 override (mode 1), or a required S3 URI (mode 3). Modes 4-5 add a polling/claim/translate/finish loop, deferring the pipeline run to a swappable job executor (the default re-uses the same core, minus `credentials.json` and minus an externally supplied `run_dir`). Direct KCIDB submission in the poller is currently disabled in favor of writing the boot-log URL onto the maestro node's `artifacts.test_log`. diff --git a/docs/03-pipeline-orchestration.md b/docs/03-pipeline-orchestration.md new file mode 100644 index 0000000..2c3ad0f --- /dev/null +++ b/docs/03-pipeline-orchestration.md @@ -0,0 +1,208 @@ +# Pipeline Orchestration - run_pipeline() + +`run_pipeline(provider, storage, run_dir=None)` in `src/kernel_ci_cloud_labs/core/pipeline.py` is the single entry point that turns a resolved test configuration into a Fargate task, a fleet of guest VMs, on-disk log/artifact files, and a `summary.json`. + +It is invoked from four places, all passing the same `(provider, storage[, run_dir])` shape: + +- `pull_labs_poller.py` - `summary = run_pipeline(provider, storage)` +- `cli.py` - `run_pipeline(provider, storage, run_dir=run_dir)` +- `eventbridge_handler.py` - `run_pipeline(provider, storage, run_dir=run_dir)` +- `main.py` - `run_pipeline(provider, storage, run_dir=run_dir)` + +The provider is concretely `AWSProvider` (`src/kernel_ci_cloud_labs/providers/aws_provider.py`), a subclass of abstract `BaseProvider` (`src/kernel_ci_cloud_labs/core/base_provider.py`), which declares only `authenticate`, `spawn_container`, and `stop_all_tasks` as abstract. + +## 1. Run directory and the two timestamps + +If no `run_dir` is passed, `run_pipeline` calls `create_run_directory()` (`core/logging_config.py`), which builds `logs/run_` where ` = datetime.now().strftime("%Y%m%d_%H%M%S")` - **local** time. + +This differs from the **S3 run prefix** computed later in `pipeline.py`: + +```python +run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") +run_prefix = f"run_{test_id}_{run_timestamp}" +``` + +So `run_prefix` is `run_{test_id}_{UTC-timestamp}`. The local-time log directory and the UTC S3 prefix are independent timestamps and will not match in clock value off-UTC - keep this in mind when correlating local log folders to S3 keys. + +First, `_warn_if_logs_not_public(provider, storage)` runs: a read-only probe of the bucket policy that only warns if the public-read boot-log policy is missing. It never aborts the run. + +## 2. Expected VM count + +Inside the `try` block, the test config is read from `provider.config`. For each entry in `test_config["vms"]`: + +- `test` is a **list** -> every name added to `test_names`; `expected_vm_count += min_count * len(test_value)`. +- `test` is a **single** value -> added; `expected_vm_count += min_count`. + +`min_count` defaults to `1` via `vm.get("min_count", 1)` in both branches. Used later only to detect a spawn shortfall in the summary. + +## 3. run_prefix propagation, uploads, authentication + +After computing `run_prefix`, it is pushed into both storage and provider config: `storage.set_run_prefix(run_prefix)` if available, and `provider.config["run_prefix"] = run_prefix`. The latter is what the `finally` cleanup re-reads and what `spawn_container` forwards to the container. + +Test scripts and payloads are uploaded per unique test name. Authentication is lazy: `provider.authenticate()` runs only if `provider.auth.is_authenticated` is false, followed by an optional `provider.auth.wait_for_resources()` when resources were just created. + +## 4. spawn_container() - the Fargate launch + +`task_arn = provider.spawn_container()`. `AWSProvider.spawn_container` (`aws_provider.py`) builds `containerOverrides` environment variables and uploads the test config: + +- Sets `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION`, `EC2_LOG_GROUP`, `KCI_DEBUG`. +- Uploads the full `test_config` JSON to S3 at `/test_config.json` via `storage.upload_string`, passing only the bare filename `test_config.json` in `TEST_CONFIG_FILENAME` - the container reconstructs the full key. + +The `run_task` call is wrapped in a retry loop of `max_retries = 5` with `2**attempt` backoff to ride out transient IAM-propagation errors. After the loop it raises on `response["failures"]` or when no tasks are returned, otherwise returns the task ARN. + +Back in `run_pipeline`, a `None` `task_arn` raises `RuntimeError("Container spawn failed")`. The task is then waited to RUNNING with the ECS `tasks_running` waiter. + +```mermaid +flowchart TD + A["run_pipeline()"] --> B["create_run_directory if run_dir None"] + B --> C["_warn_if_logs_not_public (warn only)"] + C --> D["compute expected_vm_count from test_config vms"] + D --> E["run_prefix = run_{test_id}_{UTC ts}"] + E --> F["set run_prefix on storage and provider.config"] + F --> G["upload test scripts and payloads"] + G --> H["authenticate if not already"] + H --> I["spawn_container -> task_arn"] + I --> J{"task_arn is None?"} + J -->|yes| K["raise RuntimeError"] + J -->|no| L["ECS waiter tasks_running"] + L --> M["wait_for_task_completion"] +``` + +## 5. wait_for_task_completion() - polling with crash/hang detection + +`final_status = provider.wait_for_task_completion()`. This method (`aws_provider.py`) is the heart of in-flight monitoring. + +It reads four tunables from the environment with these defaults: + +| Env var | Default | +|---|---| +| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` | +| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` | +| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` | +| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` | + +`last_event_ms` is initialized to `start_ms - 1` because `filter_log_events`' `startTime` is **inclusive**; the `-1` ensures the first poll picks up events whose timestamp equals `start_ms`. + +Crash detection is optional. `_build_vm_log_manager` returns `None` - disabling crash detection and falling back to pure status polling - unless **both** an `/ec2/` log group and a `run_prefix` are configured **and** a CloudWatch `logs` client can be obtained. + +Each loop iteration: + +1. If elapsed > `overall_timeout`: log error, call `self.terminate_container()`, raise `RuntimeError(f"task wait timeout exceeded after {int(elapsed)}s")`. +2. Get task status; if `STOPPED`, break. +3. If a CloudWatch manager exists, fetch new events with `get_logs_with_filter(start_time=last_event_ms + 1)`: + - **New events**: reset `last_event_seen_at` to now, advance `last_event_ms`, run `_scan_for_kernel_crash`. A match logs an error, calls `self.terminate_container()`, raises `RuntimeError(f"kernel crash detected in VM: {msg}")`. + - **Else** (`elif`): if more than `hang_threshold` seconds passed since `last_event_seen_at`, log a hang, call `self.terminate_container()`, raise `RuntimeError(f"no VM console output for {int(hang_threshold)}s")`. + +All three abort paths (timeout, crash, hang) call `self.terminate_container()` **and then** raise. The hang check is an `elif` on the no-new-events branch - any new events reset `last_event_seen_at` first, so a hang is declared only during true console silence. + +`_KERNEL_CRASH_PATTERNS` (`aws_provider.py`) is a tuple of compiled regexes covering: + +- `Kernel panic - not syncing` +- `\bOops\s*:` +- `\bBUG\s*:` +- `general protection fault` +- `unable to handle kernel paging request` +- `double fault` +- `Internal error\s*:` (arm/arm64 die() banner) +- `watchdog: BUG: soft lockup` +- `soft lockup - CPU#` +- `rcu_(?:sched|preempt|bh) detected stalls` +- `INFO: task .* blocked for more than` + +```mermaid +flowchart TD + S["wait_for_task_completion loop"] --> T{"elapsed gt overall_timeout?"} + T -->|yes| TA["terminate_container then raise RuntimeError timeout"] + T -->|no| U["get_task_status"] + U --> V{"status == STOPPED?"} + V -->|yes| W["break and read final_status"] + V -->|no| X{"cw_manager present?"} + X -->|no| Z["log progress, sleep poll_interval"] + X -->|yes| Y["get_logs_with_filter start gt last_event_ms"] + Y --> Y1{"new events?"} + Y1 -->|yes| Y2["reset last_event_seen_at, scan crash patterns"] + Y2 --> Y3{"crash hit?"} + Y3 -->|yes| TA2["terminate_container then raise RuntimeError crash"] + Y3 -->|no| Z + Y1 -->|no| Y4{"silent gt hang_threshold?"} + Y4 -->|yes| TA3["terminate_container then raise RuntimeError hang"] + Y4 -->|no| Z + Z --> S +``` + +`get_task_status` handles `ExpiredTokenException` by refreshing the ECS client (`self.ecs = self.auth.get_client("ecs")`) and retrying `describe_tasks` once. The `auth` client factory for the default credential chain (`auth/aws_auth.py`) creates a fresh `boto3.Session().client(service, region_name=self.region)` per call, so each refresh re-resolves rotated temporary credentials. + +## 6. container_failed and the shortened VM-log wait + +After the wait returns, `run_pipeline` computes: + +```python +container_failed = bool(final_status) and any( + (c.get("exit_code") or 0) != 0 + for c in (final_status.get("containers") or []) +) +``` + +So `container_failed` is `True` when `final_status` is truthy and **any** container's exit code (treating `None` as `0`) is non-zero. The code keys off the non-zero test only, not a specific numeric exit code. + +A non-zero container exit means the launcher died before any VM ran SSM, so the per-run `/ec2/.../` log group never appears. The CloudWatch client is then refreshed to handle credentials that may have expired during the wait: + +```python +cw_manager.client = provider.auth.get_client("logs") +``` + +## 7. Container log retrieval and the VM-log wait + +Container logs are pulled with `cw_manager.get_all_logs(log_group, log_stream)` and written to `container.log`. The group/stream were computed earlier: + +```python +log_group = f"/ecs/{provider.config['ecs']['task_definition']['family']}" +log_stream = f"ecs/{provider.config['ecs']['task_definition']['container_name']}/{task_id}" +``` + +VM logs are retrieved next. The retry budget depends on `container_failed`: + +- `container_failed is True` -> `max_retries = 1` (a single probe; the log group can't appear if no VM launched). +- otherwise -> `max_retries = 10`, with `retry_delay = 30`. + +Each instance's events are grouped by instance ID and written to `vms/.log` with separate STDOUT/STDERR sections. + +## 8. Boot logs, artifacts manifest, and the container-failure URL + +`collect_run_artifacts` (`core/artifacts.py`) downloads each instance's boot log from S3 key `{run_prefix}/test_{test_name}/output/{instance_id}/console-output.log` to `vms/-console.log`, and writes `artifacts.json` carrying `schema_version`. + +This pairs with `parse_vm_logs` (`pipeline.py`), which deliberately **skips** files ending in `-console.log` so boot logs are not double-counted or treated as failures. `parse_vm_logs` marks a VM `PASS` only if the literal string `"Test execution completed: SUCCESS"` is in the log content; everything else is `FAIL`. + +When `container_failed` is true and `container.log` exists, the container's own log is uploaded to `s3:////container-failure.log` and its public URL captured into `container_failure_log_url`. This URL is later threaded into the summary so KCIDB users land on the actual failure reason instead of an absent kernel log. + +## 9. Benchmark analysis and save_results + +Benchmark regression analysis runs best-effort, fully guarded by a broad `except`. Then `storage.save_results({"status": "success", "task_arn": task_arn})` is called. Note for the S3 backend `S3Storage.save_results` (`storage/s3_storage.py`) merely logs `"Saving: %s"` and does **not** persist anything - the durable record is `summary.json` plus the S3 objects. + +## 10. The finally block - cleanup is unconditional + +The `finally` block (`pipeline.py`) has **two independent try/except blocks**: + +1. **Stop the task**: if `task_arn` is in locals and truthy, `provider.terminate_container(task_arn)`. +2. **Terminate VMs**: re-reads `provider.config["run_prefix"]`, then `ec2.describe_instances` with `Filters` `tag:run_prefix = [run_prefix]` and `instance-state-name` in `("pending", "running")`, collects instance IDs, and calls `ec2.terminate_instances(InstanceIds=...)`. + +Because they are separate `try` blocks, a failure to stop the task does not prevent VM termination, and vice versa. + +## 11. create_summary - runs after finally + +`create_summary` (`pipeline.py`) is called **after** the `try/finally`, so the summary is produced even when the body raised (the `except` re-raises, but `finally` and the trailing `create_summary` still execute in the normal completion path; on an exception the function exits via the re-raise after `finally`). It is passed `container_failure_log_url` when set. + +Status determination inside `create_summary`: + +- starts as `"success"`; +- becomes `"partial_failure"` if `expected_vm_count` is known and `total_vms != expected_vm_count` (also logged at ERROR); +- becomes `"partial_failure"` if `vm_stats["failed"] > 0`. + +There is **no `"failure"` status value** - the only two outcomes written are `"success"` and `"partial_failure"`. + +## Key invariants to remember + +- The S3 `run_prefix` is `run_{test_id}_{UTC-timestamp}`; the local log directory `logs/run_` uses local-time. They are independent timestamps. +- Crash detection is best-effort: absent an `/ec2/` log group **and** `run_prefix` **and** an obtainable logs client, `wait_for_task_completion` degrades to plain status polling. +- The only summary statuses are `success` and `partial_failure`. +- Cleanup (stop task + terminate VMs) always runs in `finally`, in two independent try blocks, and `create_summary` runs afterward. diff --git a/docs/04-execution-fargate-ec2.md b/docs/04-execution-fargate-ec2.md new file mode 100644 index 0000000..c031fe4 --- /dev/null +++ b/docs/04-execution-fargate-ec2.md @@ -0,0 +1,196 @@ +# Execution Layer - Fargate -> EC2 VMs -> SSM + +This chapter traces how a kernel-CI job becomes a Fargate container, how that container launches throwaway EC2 VMs, and how those VMs run test stages over SSM with reboot support. + +**Key files:** + +- `src/kernel_ci_cloud_labs/providers/aws_provider.py` - Fargate orchestration (host side). +- `src/kernel_ci_cloud_labs/launch_vm.py` - in-container orchestrator that spawns EC2 VMs and drives them over SSM. +- `vm-tests/test-vm-client.sh` - guest-side client that runs test stages, persists state across reboots, and reports results. + +The container image running `launch_vm.py` is built from `dockerfiles/aws/test.dockerfile`. + +--- + +## 1. The big picture + +Three nested layers hand off in sequence: Host poller (`pipeline.py`) -> Fargate task (`launch_vm.py`) -> EC2 VM (`test-vm-client.sh`). The VM uploads `result.txt`/logs to S3; the Fargate task reads S3 via `check_test_result()`; the host tails CloudWatch (`EC2_LOG_GROUP/run_prefix`), which SSM RunCommand stdout feeds. + +The key design choice across all layers: **S3 is the source of truth for pass/fail**, not SSM command status nor the container exit code. SSM may report `Failed` simply because the VM shut down before SSM read back the final status, so `launch_vm.py` always re-checks `result.txt` in S3. + +--- + +## 2. Layer 1 - Fargate orchestration (`aws_provider.py`) + +### 2.1 Spawning the container + +`AWSProvider.spawn_container()` builds container environment overrides and calls ECS `run_task`. The host poller invokes it from `pipeline.py`. + +| Env var | Source | +|---|---| +| `RUN_PREFIX` | `config["run_prefix"]` | +| `S3_BUCKET` | `storage.bucket` | +| `AWS_REGION` | `config["region"]` | +| `EC2_LOG_GROUP` | the `/ec2/`-prefixed key in `cloudwatch.log_groups` | +| `KCI_DEBUG` | forwarded from the host's `KCI_DEBUG` env var | +| `TEST_CONFIG_FILENAME` | always `"test_config.json"` | + +The `vms` array is **not** passed inline. The whole `test_config` object is serialized to JSON, uploaded to S3 at `{run_prefix}/test_config.json` via `storage.upload_string(...)`, and only the filename `test_config.json` is passed as `TEST_CONFIG_FILENAME`. The container rebuilds the full S3 key from `RUN_PREFIX` + filename. + +`run_task` uses `launchType="FARGATE"`, `count=1`, `enableExecuteCommand=True`, wrapped in a 5-attempt retry loop with `2**attempt` backoff to absorb transient errors such as IAM propagation. + +```mermaid +flowchart TD + A["spawn_container()"] --> B["Build env_vars list"] + B --> C["Upload test_config.json to S3"] + C --> D["containerOverrides with env_vars"] + D --> E{"run_task attempt
max 5"} + E -->|"success"| F["task_arn = tasks[0].taskArn"] + E -->|"exception, attempt < 4"| G["sleep 2**attempt, retry"] + G --> E + E -->|"exception, last attempt"| H["raise"] + F --> I["return task_arn"] +``` + +### 2.2 Waiting for completion with crash detection + +`wait_for_task_completion()` polls task status until `STOPPED`. When both an `/ec2/` CloudWatch log group and a `run_prefix` are configured, it tails the per-run VM console group `{EC2_LOG_GROUP}/{run_prefix}` (where SSM RunCommand writes) and aborts early on any of: + +- **Kernel crash/stall pattern** in the guest console - compiled regexes in `_KERNEL_CRASH_PATTERNS`: panic, `Oops:`, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung-task. A hit calls `terminate_container()` and raises `RuntimeError`. +- **Silent stall** - no new VM console output for `PULLAB_TASK_HANG_THRESHOLD_SEC` (default 600). +- **Overall timeout** - `PULLAB_TASK_WAIT_TIMEOUT_SEC` elapsed (default 3600). + +Poll/log intervals are env-tunable: `PULLAB_TASK_POLL_INTERVAL_SEC` (default 30) and `PULLAB_TASK_PROGRESS_LOG_SEC` (default 120). `terminate_container()` issues `ecs.stop_task`. + +### 2.3 Reading the container exit code + +After the task stops, `pipeline.py` reads the container exit code from `final_status`: + +```python +container_failed = bool(final_status) and any( + (c.get("exit_code") or 0) != 0 + for c in (final_status.get("containers") or []) +) +``` + +A non-zero container exit means `launch_vm.py` died before SSM ran on any VM, so the `/ec2/.../{run_prefix}` log group never appears; the pipeline then shortens its VM-log wait and surfaces the container log as the failure URL. + +--- + +## 3. Layer 2 - VM orchestration (`launch_vm.py`) + +The container entrypoint is `launch_vms_from_config()`. Per `dockerfiles/aws/test.dockerfile`, the image is `python:3.12-slim` with only `boto3` pip-installed, and the `CMD` is: + +``` +python -u /app/debug_aws_setup.py || true && python -u /app/launch_vm.py +``` + +A best-effort diagnostic pass followed by the launcher. + +### 3.1 Config load and VM expansion + +`launch_vms_from_config()` reads `RUN_PREFIX`, `S3_BUCKET`, `AWS_REGION` (default `us-west-2`), and `TEST_CONFIG_FILENAME` from the environment, downloads `{run_prefix}/{config_filename}` from S3, and parses it as JSON. `RUN_PREFIX`, `S3_BUCKET`, and `TEST_CONFIG_FILENAME` are required; any missing var aborts with `None`. + +It then **expands** the `vms` array: each VM config's `test` field may be a list or scalar; the launcher emits one VM config per test into `expanded_vms`. For each expanded config it spawns `min_count` threads (default 1) running `launch_and_test_vm`. + +### 3.2 Per-VM lifecycle + +Each `launch_and_test_vm` thread runs a `VMLauncher` through `prepare_test_artifacts()` -> `spawn_vm()` -> `execute_test_via_ssm()` -> `check_test_result()`, with `cleanup()` in a `finally` block. + +The success/failure decision is **S3-first**: it always calls `check_test_result()` after SSM, and even when SSM reported failure it records success if `result.txt` contains `SUCCESS`. + +### 3.3 Spawning the EC2 VM + +`spawn_vm()` calls `run_instances` with `MinCount=1`, `MaxCount=1`, `InstanceInitiatedShutdownBehavior="terminate"`, and a gp3 root volume on `/dev/xvda` with `DeleteOnTermination=True`. The instance is tagged with `Name`, `TestID`, and `run_prefix` - the `run_prefix` tag is what the IAM policy keys off of (see section 5). + +If `ami_id` starts with `resolve:ssm:`, the launcher strips the prefix and resolves the remaining SSM parameter path via `ssm.get_parameter` (`_resolve_ssm_parameter`). + +The user-data script installs a `nohup` self-shutdown firing after `max_runtime + 600` seconds - a safety net that terminates the VM if the orchestrator dies before sending the SSM command. It then waits for the SSM agent to become active. + +### 3.4 Driving the test over SSM + +`execute_test_via_ssm()` builds a shell command that downloads `test-vm-client.sh` from S3 and runs it with four positional args - `bucket run_prefix test max_runtime`. It sends this via SSM `AWS-RunShellScript`: + +- `executionTimeout` and `TimeoutSeconds` are `min(max_runtime + 3600, 43200)` - capped at 12 hours. +- `CloudWatchOutputConfig.CloudWatchLogGroupName` is `{ec2_log_group}/{run_prefix}` - the same group `aws_provider.py` tails for crash detection. + +It polls `get_command_invocation` every 5 seconds. On terminal SSM status `Success`/`Failed`/`TimedOut`/`Cancelled` it stops; on anything other than `Success` it captures the console buffer with `reason="ssm-failure"` and returns `False`. The `Failed` branch notes the VM may have shut down before SSM could report - which is why the S3 result check in section 3.2 is authoritative. + +### 3.5 Console capture and cleanup + +`capture_console_output(reason=...)` fetches the EC2 serial console (`get_console_output`), scrubs it, scans the **scrubbed** text for `PANIC_PATTERNS`, and uploads to `{run_prefix}/test_{test}/output/{instance_id}/console-output.log` with metadata `capture-reason`, `scrubbed=v1`, and `panic-detected`. Scrubbing runs `scrub_text()` before upload because the results bucket is public-read, so an unredacted secret would be world-visible; the panic scan runs on the scrubbed buffer so the logged marker can't re-leak a token. + +Re-entrancy rules: + +- `reason="cleanup"` - skipped if a previous call already captured a non-empty buffer. +- `reason="ssm-failure"` and `reason="post-terminate"` - always run. +- `reason="post-terminate"` - polls up to 540s at 15s intervals because EC2 finalizes the serial-console mirror only after shutdown. + +`cleanup()` brackets `terminate_instances` with a pre-terminate capture (`reason="cleanup"`) and a post-terminate capture (`reason="post-terminate"`), each in its own `try/except`, and calls `_wait_for_terminated(timeout=90)` in between to let the buffer flush. + +### 3.6 Final aggregation + +`launch_vms_from_config()` joins all threads, counts successes, and returns `successful == total and total > 0`. In `__main__`, a `True` return maps to `sys.exit(0)`; both `False` (some/all VMs failed) and `None` (no VMs launched) map to `sys.exit(1)`. + +--- + +## 4. Layer 3 - Guest-side client (`test-vm-client.sh`) + +The VM downloads and runs `test-vm-client.sh` with args ` [timeout-seconds]`. If invoked as root, it re-executes itself as `ec2-user`/`ubuntu`/first `/home` user. + +### 4.1 Per-instance RUN_ID state across reboots + +The client tracks a per-instance `RUN_ID` counter persisted in S3 at `{run_prefix}/test_{test}/state/{instance_id}/run_id.txt`. On each boot it downloads the prior value (default 0), increments it, and uploads it back. The test payload zip is downloaded and unzipped **only when `RUN_ID == 1`**; later boots reuse the persisted working directory. A `RUN_ID` greater than the number of `run*.sh` scripts (`TOTAL_SCRIPTS`) is an error. + +### 4.2 Independent watchdog + +A self-contained watchdog (`start_watchdog`) is written out as a separate script and launched with `nohup`. It sleeps in 5s increments up to `SAFETY_TIMEOUT` (the 4th positional arg, default 1800) then runs `sudo shutdown -h now`. It is torn down cleanly via `cleanup_watchdog` before any reboot, completion, or failure exit. + +### 4.3 The exit-code contract (reboot signaling) + +The client runs the `run*.sh` stage for the current `RUN_ID`, captures `SCRIPT_EXIT_CODE` via `PIPESTATUS[0]`, then follows this contract: + +- **Stage failed** (`SCRIPT_EXIT_CODE != 0` and `!= 194`): write a `FAILED` `result.txt` / `stats.json`, tear down the watchdog, and `exit $SCRIPT_EXIT_CODE`. Codes above 100 are capped at 100 first. +- **Last stage succeeded** (`RUN_ID == TOTAL_SCRIPTS`, exit 0): write a `SUCCESS` `result.txt` / `stats.json`, upload `benchmark-*.csv` and `results_*` files, tear down the watchdog, schedule `sudo shutdown +5`, and `exit $SCRIPT_EXIT_CODE` (i.e. 0). +- **More stages remain** (stage succeeded but `RUN_ID < TOTAL_SCRIPTS`): tear down the watchdog, `sync`, and `exit 194` to signal SSM that a reboot is needed; the SSM agent reboots the instance and re-runs this same script with the incremented `RUN_ID`. + +The **194 reboot signal is emitted by `test-vm-client.sh` itself** based on `RUN_ID < TOTAL_SCRIPTS`, not by the individual stage scripts. The stage scripts exit 0 on success - e.g. `vm-tests/example-reboot-test/run-1.sh` and `vm-tests/example-kernel-reboot-test/run-01-install-first-kernel.sh` both `exit 0`. (The client does recognize 194 if a stage emits it directly, treating it as a reboot request, but the multi-stage reboot loop is driven by the client's own stage counter.) + +```mermaid +flowchart TD + A["test-vm-client.sh boot"] --> B["RUN_ID = S3 counter + 1"] + B --> C{"RUN_ID == 1?"} + C -->|"yes"| D["download + unzip payload"] + C -->|"no"| E["reuse existing files"] + D --> F["run stage RUN_ID"] + E --> F + F --> G{"exit code?"} + G -->|"non-zero, non-194"| H["upload FAILED result.txt
exit code (capped 100)"] + G -->|"0 and RUN_ID == TOTAL"| I["upload SUCCESS result.txt
shutdown +5, exit 0"] + G -->|"0 and RUN_ID < TOTAL"| J["cleanup_watchdog, sync
exit 194 (reboot)"] + J -->|"SSM reboots VM"| A +``` + +--- + +## 5. Security boundary - IAM scoping by `run_prefix` + +The execution layer bounds its blast radius through resource tags. The ECS task role's inline policy (`examples/aws/config.json`) restricts the two most dangerous actions to instances tagged `run_prefix=run_*`: + +- `ec2:TerminateInstances` - scoped to `arn:aws:ec2:*:*:instance/*` with a `StringLike` condition on `aws:ResourceTag/run_prefix` of `run_*`. +- `ssm:SendCommand` against instances - scoped the same way via `ssm:resourceTag/run_prefix` `run_*`. + +This is why `spawn_vm()` tags every instance with `run_prefix`: without that tag, the role could neither send the test command to the VM nor terminate it. `ec2:RunInstances`, `ec2:GetConsoleOutput`, and the describe/resolve actions are broader (`Resource: "*"`), but the state-changing terminate/command path is tag-gated. + +--- + +## 6. End-to-end sequence + +The recurring theme: **status flows up through S3, not through process exit codes.** SSM status, container exit codes, and the watchdog all exist to bound runtime and surface infrastructure failures, but the authoritative pass/fail signal for a test is `result.txt` in the S3 results bucket: + +1. Host uploads `test_config.json`, then `spawn_container` (`run_task` FARGATE). +2. Fargate loads `test_config.json`, calls `run_instances` (tagging `run_prefix`), and sends SSM `AWS-RunShellScript` -> `test-vm-client.sh`. +3. The VM streams console output to `EC2_LOG_GROUP/run_prefix`; the host tails it for crash/hang. +4. Per stage, the VM uploads client/run logs and may `exit 194` -> SSM reboot when more stages remain. +5. On the last stage, the VM uploads `result.txt` + `stats.json` and runs `shutdown +5`. +6. Fargate runs `check_test_result` (`result.txt` is source of truth), then `cleanup` (capture console, terminate), and returns the container exit code to the host, which sets `container_failed = exit_code != 0`. diff --git a/docs/05-storage-dataflow.md b/docs/05-storage-dataflow.md new file mode 100644 index 0000000..b0522b8 --- /dev/null +++ b/docs/05-storage-dataflow.md @@ -0,0 +1,178 @@ +# Storage & Data Flow - S3 layout + +This chapter traces every object `pullab_cloud` reads from or writes to S3, from naming a run through publishing a public kernel boot log. It is grounded in the source. + +**Key files:** `storage/s3_storage.py`, `core/artifacts.py`, `core/pipeline.py`, `providers/aws_provider.py`, `vm-tests/test-vm-client.sh`, `launch_vm.py`, `pull_labs_poller.py`, `core/benchmark_analyzer.py`, `setup_upload_tests.py`, `setup_upload_rpms.py`, `setup_validate.py`. + +## Two buckets, two roles + +| Bucket | Config key | Role | +|---|---|---| +| **Results bucket** | `bucket` | Everything a run produces, namespaced under a per-run prefix. Created on demand; may carry a narrow public-read policy for boot logs. | +| **External storage bucket** | `external_storage.bucket` | Pre-populated, read-only-from-the-pipeline. Holds reusable inputs: VM client script, per-test payload zips, kernel RPMs. | + +Note on the `pull_labs_jobs/` prefix: **this repo's `src/` never reads or writes it** (a grep returns nothing). It is a producer-side path written by the upstream `kernelci-core` runtime (`kernelci/runtime/pull_labs.py`, `PullLabs._store_job_definition`) into the external bucket as `pull_labs_jobs//.json`. pullab_cloud only fetches that job definition via the `artifacts.job_definition` URL handed to it by kernelci-api, never by constructing the key. See [07 - KernelCI & KCIDB Integration](07-kernelci-kcidb-integration.md). + +> `S3Storage.__init__` reads `results_prefix` (default `"results"`), but it is **never** used to build any key. Run output is written directly under `run__/`, not under a `results/` prefix. Treat `results_prefix` as dead/vestigial config. + +## The run prefix + +The orchestrator names each run with one prefix that all keys hang off of, built in `pipeline.py`: + +```python +run_timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S") +run_prefix = f"run_{test_id}_{run_timestamp}" +``` + +Template: `run_{test_id}_{YYYYMMDD_HHMMSS}`, timestamp in **UTC**. The orchestrator pushes the prefix into the storage backend via `storage.set_run_prefix(run_prefix)` and into the provider config as `provider.config["run_prefix"]`. + +## Bucket creation and the account-ID fallback + +`S3Storage._ensure_bucket` head-checks the configured bucket: +- **No error** - use as-is. +- **404** - create the bucket. +- **403** (exists but inaccessible, typically the global name is taken by another account) - append the caller's AWS account ID and retry: `self.bucket = f"{self.bucket}-{account_id}"`. If reachable, use it; otherwise create it. + +This is why the example IAM policy (`examples/aws/config.json`) grants access to **both** the bare and account-suffixed names: the `Resource` list includes `arn:aws:s3:::kernel-ci-exampleuser-results`, `.../*`, **and** `arn:aws:s3:::kernel-ci-exampleuser-results-*`, `.../*` - covering the `-` form the fallback may select. + +## Inputs: what the run pulls in + +### test_config.json + +`AWSProvider.spawn_container` ships the test config through S3, not as a giant env var. It serializes `self.config["test_config"]` and uploads it via `storage.upload_string` to `f"{run_prefix}/{config_filename}"` (`config_filename = "test_config.json"`). The container receives **only** the basename via `TEST_CONFIG_FILENAME=test_config.json` and rebuilds the full key in `launch_vm.py`: + +```python +config_key = f"{run_prefix}/{config_filename}" +``` + +The launcher refuses to start if `S3_BUCKET` or `TEST_CONFIG_FILENAME` is missing. + +### Test scripts and payloads + +- `upload_tests` places the VM client script at `{run_prefix}/test_{test_name}/input/{script_name}` (default `test-vm-client.sh`). +- `upload_test_payload` places the zipped test at `{run_prefix}/test_{test_name}/input/{test_name}_test_payload.zip`. + +Both follow a **local-first, external-fallback** pattern: +- If a local `vm-tests/` dir exists, the file is zipped/read and uploaded directly. +- Otherwise it is copied from the external bucket: `_copy_from_external_storage("test-scripts/{script_name}", ...)` for the client, and `test-scripts/{test_name}/{test_name}_test_payload.zip` for the payload. + +External bucket layout pre-populated by setup tooling: +- `test-scripts/test-vm-client.sh`, `test-scripts//_test_payload.zip`, `test-scripts//external_requirements.json` (`setup_upload_tests.py`) +- RPMs (`setup_upload_rpms.py`): `kernel-rpms/src`, `kernel-rpms/binary/x86_64`, `kernel-rpms/binary/aarch64` + +### Idempotent uploads via MD5 vs ETag + +Both local-upload paths skip unchanged content: they compute `hashlib.md5(local_content).hexdigest()` and compare it to the existing object's `head_object(...)["ETag"].strip('"')`. If they match, the upload is skipped. + +> Caveat: S3 ETag equals the body MD5 only for **single-PUT** objects. Both helpers upload the whole buffer in one `put_object` (`save_file`), so the comparison is valid here. It would silently break if switched to multipart uploads. + +### External requirements -> the shared/ area + +When a test declares external requirements, they are deduplicated into a per-run **shared** area so multiple tests don't recopy the same RPMs. `copy_external_requirements` reads each test's `external_requirements.json` and, for every enabled folder, copies it from the external bucket to `f"{self.run_prefix}/shared/{folder_name}/"`. Before copying it calls `_check_s3_folder_exists` (a `list_objects_v2(..., MaxKeys=1)`, non-empty `Contents` = "already present"); if present, the copy is skipped. The external-fallback path uses the same dedup via `_copy_external_requirements_from_s3`. + +## Outputs: what the VM writes back + +Inside each VM, `test-vm-client.sh` uses an `S3_PREFIX` that already includes the `test_` segment, so all keys land under `run_prefix/test_/...`. + +Per-instance output objects under `run_prefix/test_/output//`: + +| Object | Contents | +|---|---| +| `client-.log` | client wrapper log | +| `run--output.log` | script stdout/stderr | +| `result.txt` | pass/fail string | +| `stats.json` | timing/metadata | +| `benchmark-*.csv` | benchmark outputs, uploaded by basename | +| `results_*` | arbitrary results files, uploaded by basename | + +Multi-script tests reboot the VM between scripts: the client exits with **code 194**, which SSM interprets as "reboot and re-run" (documented in `vm-tests/README.md`). To survive the reboot the client persists its run counter as a tiny state object at `run_prefix/test_/state//run_id.txt` - read on start (default `0`) and rewritten after increment. + +```mermaid +flowchart TD + S["test-vm-client.sh start"] -->|"read run_id.txt"| ST["run_prefix/test_/state//run_id.txt"] + S --> RUN["run test script"] + RUN -->|"upload"| OUT["run_prefix/test_/output//{client,run,result,stats,benchmark-*,results_*}"] + RUN --> Q{"more scripts?"} + Q -->|"yes"| EXIT["exit 194 -> SSM reboot"] + EXIT --> S + Q -->|"no"| DONE["done"] +``` + +## The kernel boot console log + +The console boot log is **not** uploaded by the in-VM client; it is captured by the launcher in the ECS container via the EC2 console-output API and written by `capture_console_output` (`launch_vm.py`). The buffer is scrubbed of secrets before upload (the bucket is public-read for these objects, so anything left would be world-visible), then a panic scan runs on the **scrubbed** buffer. + +Uploaded to: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +with `ContentType="text/plain; charset=utf-8"` and metadata `capture-reason=`, `scrubbed=v1`, `panic-detected=true|false`. + +This single key per instance is the only thing exposed publicly. `setup_validate.py` installs a bucket policy whose statement `Sid` is `PublicReadKernelBootLogs`, granting anonymous `s3:GetObject` on: + +``` +arn:aws:s3:::/*/test_*/output/*/console-output.log +``` + +(pattern `_PUBLIC_LOGS_KEY_PATTERN = "*/test_*/output/*/console-output.log"`, assembled into the resource ARN). Everything else - payloads, `result.txt`, `stats.json`, benchmark CSVs - stays private. + +## Collecting artifacts: the manifest + +After CloudWatch logs are pulled, the orchestrator calls `collect_run_artifacts` (`artifacts.py`, invoked from `pipeline.py`). It: + +1. Discovers `(test_name, instance_id)` pairs by walking `run_prefix/` with `Delimiter="/"` twice (`_discover_instances`). Leaves not starting with `test_` are skipped, deliberately excluding the bare `run_prefix/test_config.json`. +2. For each instance, downloads `console-output.log` to `run_dir/vms/-console.log`. A `NoSuchKey` is a quiet skip producing a `status: "missing"` entry. +3. Builds the public URL via `s3_public_url`, returning a **virtual-hosted-style** URL `https://.s3..amazonaws.com/`. +4. Writes `run_dir/artifacts.json`. + +The manifest carries `schema_version = ARTIFACTS_MANIFEST_VERSION = 1`, plus `generated_at`, `run_prefix`, `s3_bucket`, `origin`, and an `artifacts` list. Each entry has: `test`, `instance_id`, `kind`, `kcidb_role`, `status` (`ready`/`missing`), `s3_uri`, `log_url`, `local_path`, `sha256`, `size_bytes`, `content_type`. The only known artifact kind is `console-output.log`, mapped to role `log` and content type `text/plain; charset=utf-8`. + +## The KCIDB hand-off (currently indirect) + +Direct KCIDB submission is **disabled**: in `pull_labs_poller.py` the `submit_tests(...)` call is commented out, and the dashboard never displayed the parallel `pull_labs_aws_ec2`-origin rows. Instead, per-instance `log_url` values are written back onto the maestro node's artifacts - the first goes to `outcome.artifacts["test_log"]`, extras to suffixed keys `test_log_`. `kernelci-pipeline`'s `send_kcidb` then emits the single, dashboard-visible row from `artifacts.test_log`. + +### Failure fallback: container-failure.log + +If the ECS container exited non-zero **before any VM booted**, there is no kernel log to link. The orchestrator uploads the container's own log to `f"{run_prefix}/container-failure.log"` (`pipeline.py`) with `ContentType="text/plain; charset=utf-8"`, and its public URL (same `s3_public_url`) becomes the fallback `log_url`. This key sits directly under `run_prefix` and is therefore **not** matched by the public-read pattern (`*/test_*/output/*/console-output.log`); it is publicly readable only if a broader policy exists. + +## Benchmark analysis reads back from output/ + +The benchmark analyzer reuses the output layout. `_analyze_test` lists `f"{self.run_prefix}/test_{test_name}/output/"`, filters to keys ending in `.csv` containing `benchmark-`, and classifies by filename: `benchmark-base-*` rows feed the baseline, `benchmark-tip-*` rows feed the candidate. + +## End-to-end key map + +```mermaid +flowchart LR + subgraph EXT["external bucket"] + E1["test-scripts/test-vm-client.sh"] + E2["test-scripts/<name>/<name>_test_payload.zip"] + E3["test-scripts/<name>/external_requirements.json"] + E4["kernel-rpms/src + binary/x86_64 + binary/aarch64"] + end + subgraph RES["results bucket / run_prefix"] + R0["test_config.json"] + R1["test_<name>/input/<script> + payload.zip"] + R2["shared/<folder>/ (deduped)"] + R3["test_<name>/state/<id>/run_id.txt"] + R4["test_<name>/output/<id>/console-output.log (public)"] + R5["test_<name>/output/<id>/{result,stats,benchmark-*,results_*}"] + R6["container-failure.log (fallback)"] + end + E1 --> R1 + E2 --> R1 + E3 --> R2 + E4 --> R2 + R4 --> M["artifacts.json -> node.artifacts.test_log"] + R6 --> M +``` + +## Summary of invariants + +- All run state hangs off `run_{test_id}_{YYYYMMDD_HHMMSS}` (UTC). +- `results_prefix` exists in config but is never used. +- Inputs come from local `vm-tests/` first, else the external bucket under `test-scripts/` and `kernel-rpms/`. +- Reusable inputs are deduplicated into `run_prefix/shared//`. +- Only `*/test_*/output/*/console-output.log` is public; everything else is private. +- The boot log is scrubbed before upload and is the only artifact surfaced to KCIDB, indirectly, via `node.artifacts.test_log`. diff --git a/docs/06-aws-provisioning.md b/docs/06-aws-provisioning.md new file mode 100644 index 0000000..f27fba0 --- /dev/null +++ b/docs/06-aws-provisioning.md @@ -0,0 +1,206 @@ +# AWS Resource Provisioning - ensure_exists + +This chapter documents how `pullab_cloud` provisions the AWS resources a Fargate-launched kernel-CI test run depends on. Provisioning is constructor-driven, idempotent, and built on one small abstraction: the `ensure_exists` "check, then create" pattern in `core/base_resource_manager.py`. Every resource type (IAM role, ECR repo, ECS cluster, CloudWatch log group, ECS task definition) is a thin subclass supplying three primitives; the base class orchestrates them. + +**Key files** + +- `auth/aws_auth.py` - `AWSAuth.__init__`, `authenticate`, `wait_for_resources`, `_build_and_push_docker_image` +- `core/base_resource_manager.py` - `BaseResourceManager`, `ensure_exists` +- `core/client_manager.py` - `ClientManager`, `get_client` +- `aws_role_manager.py`, `aws_ecr_manager.py`, `aws_cluster_manager.py`, `aws_task_definition_manager.py`, `aws_cloudwatch_manager.py`, `aws_network_manager.py` +- `main.py` - `main`, `load_credentials` +- `core/pipeline.py` - `run_pipeline` +- `examples/aws/config.json` + +--- + +## 1. Where provisioning starts: a constructor side effect + +`AWSAuth.__init__` ends by calling `self.authenticate()` (`auth/aws_auth.py`). There is no separate "provision" step - constructing the auth object both authenticates **and** ensures every configured resource exists. `main.main()` triggers this via `auth = auth_class(config, credentials)` (`main.py`). + +`authenticate()` is re-entry guarded: if `self._authenticated` is already `True` it returns immediately (`aws_auth.py`), so re-calls (e.g. via the lazy `get_client` / `get_credentials` fallbacks) are cheap. + +--- + +## 2. Credential resolution and precedence + +Before any resource is touched, `authenticate()` establishes a boto3 `Session`. Precedence is explicit-first, default-chain-second: + +1. **Explicit credentials** - if both `access_key_id` and `secret_access_key` are present in the `credentials` dict (from `credentials.json`), a `Session` is built from them (including optional `session_token`) and validated with `sts.get_caller_identity()`. On any failure it logs and falls back (`self._session = None`). +2. **Default chain** - if no explicit session was established, `_check_credentials()` tries `sts.get_caller_identity()` against the default boto3 chain (env vars, `~/.aws/credentials`, IAM role). On success a default `Session` is created. +3. **No credentials** - if `self._session` is still `None`, `authenticate()` raises `ValueError` with guidance to add keys or run `aws configure`. + +`credentials.json` is loaded by `main.load_credentials()`, which looks in the **same directory** as `config_path` via `os.path.join(os.path.dirname(config_path), "credentials.json")` and returns `None` (with a warning) if absent. + +> Note: `_run_setup_script()` exists and would loop on `setup-iam-user.sh` until valid credentials appear, but `authenticate()` does **not** invoke it - it raises `ValueError` directly when no credentials resolve. + +--- + +## 3. The auto-refreshing ClientManager + +Service clients go through `ClientManager` (`core/client_manager.py`), not the session directly. The manager is created with a factory closure and a `refresh_interval` defaulting to **59 seconds**. + +`get_client(service_name)` rebuilds a client when it is **missing OR older than `refresh_interval`**: + +```python +if service_name not in self._clients or now - self._timestamps.get(service_name, 0) > self._refresh_interval: + self._clients[service_name] = self._client_factory(service_name) + self._timestamps[service_name] = now +``` + +The factory closure differs by credential mode (`aws_auth.py`): + +- **Explicit-credential mode** - each call rebuilds a `boto3.Session` from the stored `access_key_id` / `secret_access_key` / `session_token`, then returns a client from it. +- **Default-chain mode** - each call does `boto3.Session().client(service, ...)`, so a fresh session re-resolves credentials (useful for short-lived assumed-role credentials that may refresh underneath the process). + +--- + +## 4. The ensure_exists pattern + +`BaseResourceManager` (`core/base_resource_manager.py`) is an ABC defining three abstract primitives - `check_exists`, `create`, `get_identifier` - and one concrete orchestrator, `ensure_exists`, which returns a **`(resource_identifier, was_created)` tuple**: + +```python +def ensure_exists(self, resource_name, resource_config=None, force_recreate=False) -> tuple: + if force_recreate and self.check_exists(resource_name): + if hasattr(self, "delete_role"): + self.delete_role(resource_name) + if self.check_exists(resource_name): + return self.get_identifier(resource_name), False + config = resource_config or self.config.get(resource_name, {}) + identifier = self.create(resource_name, config) + return identifier, True +``` + +Two easy-to-miss behaviors: + +- **`force_recreate` is gated on `hasattr(self, "delete_role")`.** Only `AWSRoleManager` defines `delete_role`, so `force_recreate` deletes-and-recreates for IAM roles only; for the ECR, cluster, and task-definition managers it is effectively a **no-op** (they fall through to "exists? then return it"). +- The `was_created` flag lets the caller decide whether to wait for AWS eventual consistency afterwards (see section 6). + +```mermaid +flowchart TD + EE["ensure_exists(name, config, force_recreate)"] --> FR{"force_recreate AND check_exists?"} + FR -->|yes| HD{"hasattr delete_role?"} + HD -->|yes| DEL["delete_role(name)"] + HD -->|no| SKIP["no-op"] + FR -->|no| CHK + DEL --> CHK + SKIP --> CHK + CHK{"check_exists(name)?"} -->|yes| EX["return get_identifier(name), False"] + CHK -->|no| CR["create(name, config)"] + CR --> NEW["return identifier, True"] +``` + +--- + +## 5. The resource managers + +Each manager is a small subclass implementing the three primitives: + +| Manager | check_exists | create | Identifier | +|---|---|---|---| +| `AWSRoleManager` | `get_role` succeeds (else `NoSuchEntityException`) | `create_role` + attach managed/inline policies + instance profile for EC2 roles | role ARN | +| `AWSECRManager` | `describe_repositories` succeeds (else `RepositoryNotFoundException`) | `create_repository` with `scanOnPush` | repository URI | +| `AWSClusterManager` | `describe_clusters` returns a cluster with `status == "ACTIVE"`; any exception means absent | `create_cluster` | cluster ARN | +| `AWSTaskDefinitionManager` | `describe_task_definition` returns `status == "ACTIVE"`; any exception means absent | `register_task_definition` (FARGATE, awsvpc) | task definition ARN | +| `AWSCloudWatchManager` | `describe_log_groups` returns a non-empty prefix match | `create_log_group` + `put_retention_policy` | log group name | +| `AWSNetworkManager` | always `True` (uses default VPC) | returns `"default-vpc"` (no-op) | `"default-vpc"` | + +Details worth calling out: + +- **Cluster and task-definition existence both require `status == "ACTIVE"`** and treat any exception as "does not exist" (`aws_cluster_manager.py`, `aws_task_definition_manager.py`). This is deliberate: an `INACTIVE`/deregistered resource is treated as needing re-creation, not silently reused. +- **CloudWatch retention** defaults to **7 days**: `create` reads `resource_config.get("retention_days", 7)` and calls `put_retention_policy` (`aws_cloudwatch_manager.py`). +- **AWSNetworkManager** never provisions anything: `check_exists` returns `True` and `create` returns the literal `"default-vpc"`. The real work is in `get_network_config` (`aws_network_manager.py`) - it resolves `default-for-az` subnets, takes the **first two**, finds the default security group in the same VPC, and sets `assignPublicIp` to `"ENABLED"`. + +### AWSRoleManager overrides ensure_exists to heal drift + +`AWSRoleManager.ensure_exists` (`aws_role_manager.py`) calls `super().ensure_exists(...)` and then, **for EC2-trusted roles only**, always re-runs `_ensure_instance_profile(resource_name)`. The base implementation short-circuits when the role already exists, so a drifted instance profile (profile present but no role attached) would never be repaired on re-run; the override forces the binding check every time. + +`_is_ec2_role` (`aws_role_manager.py`) inspects the trust policy's first statement `Principal` and returns `True` if `"ec2.amazonaws.com"` appears. `_ensure_instance_profile` (`aws_role_manager.py`) is independently idempotent for both create-profile and attach-role steps. + +--- + +## 6. How authenticate() drives provisioning + +After the `ClientManager` is built, `authenticate()` walks the config in a fixed order, calling `ensure_exists` on each configured section and recording whether anything new was created (`aws_auth.py`): + +1. **IAM roles** - if `config["roles"]` is present, an `AWSRoleManager` is created and each role passed through `ensure_exists` with `force_recreate = config.get("force_recreate_roles", False)`. ARNs collect into `role_arns`; a created role sets `self._resources_created = True`. +2. **ECR repository** - `AWSECRManager.ensure_exists` returns the repo URI; a created repo sets `_resources_created`. If `config["docker"]` is present, `_build_and_push_docker_image(repo_uri)` runs. +3. **ECS cluster** - `AWSClusterManager.ensure_exists`; a created cluster sets `_resources_created`. +4. **CloudWatch log groups** - only if `config["cloudwatch"]` exists; each log group is ensured. (Created log groups do **not** flip `_resources_created`.) +5. **Task definition** - `task_config` is copied from `config["ecs"]["task_definition"]`, and **both** `execution_role_arn` and `task_role_arn` are set to `next(iter(role_arns.values()), None)` - the **first** role's ARN. Then `AWSTaskDefinitionManager.ensure_exists` registers it. +6. **Network manager** - an `AWSNetworkManager` is stored on `self._network_manager` for later `get_network_config()` calls. + +Finally `self._authenticated = True`. + +```mermaid +flowchart TD + A["authenticate(): build ClientManager"] --> R{"config roles?"} + R -->|yes| RM["AWSRoleManager.ensure_exists per role"] + RM --> RA["collect role_arns; set _resources_created if created"] + R -->|no| EC + RA --> EC{"config ecr?"} + EC -->|yes| ER["AWSECRManager.ensure_exists"] + ER --> DK{"config docker?"} + DK -->|yes| BP["_build_and_push_docker_image(repo_uri)"] + DK -->|no| CL + BP --> CL + EC -->|no| CL{"config ecs?"} + CL -->|yes| CM["AWSClusterManager.ensure_exists"] + CM --> CW{"config cloudwatch?"} + CW -->|yes| LG["ensure each log group"] + CW -->|no| TD + LG --> TD["task roles = first role ARN; AWSTaskDefinitionManager.ensure_exists"] + TD --> NM["store AWSNetworkManager"] + CL -->|no| DONE + NM --> DONE["_authenticated = True"] +``` + +--- + +## 7. Docker image build / push + +`_build_and_push_docker_image` (`aws_auth.py`) is the one provisioning step that shells out: + +1. Read `docker` config: `dockerfile`, `build_context` (default `"."`), `tag` (default `"latest"`), `force_rebuild` (default `False`). +2. Unless `force_rebuild` is set, call `describe_images` for the tag. If the image **exists**, it logs, rewrites `config["ecs"]["task_definition"]["image"]` to the existing `{repo_uri}:{tag}`, and **returns early**. +3. If `describe_images` raises `ImageNotFoundException` (or `force_rebuild` is `True`), it gets an ECR auth token, `docker login`s, `docker build`s (with `--network host`), and `docker push`es. +4. It then rewrites `config["ecs"]["task_definition"]["image"]` to the freshly pushed URI so the subsequent task-definition registration uses the new image. + +--- + +## 8. Waiting for eventual consistency + +Newly created IAM roles and ECS resources are not immediately usable (AWS is eventually consistent). The `was_created` bookkeeping feeds a single decision in `run_pipeline` (`core/pipeline.py`): + +```python +if hasattr(provider.auth, "resources_were_created") and provider.auth.resources_were_created(): + provider.auth.wait_for_resources() +else: + logger.debug("Using existing resources, no propagation delay needed") +``` + +So the propagation wait is **skipped entirely** when nothing was created. + +`wait_for_resources()` (`aws_auth.py`) uses boto3 waiters: + +- For each configured role name, an IAM `role_exists` waiter. +- If ECS is configured, a `role_exists` waiter on the service-linked role `"AWSServiceRoleForECS"`. +- If a cluster is configured, a final `describe_clusters` verification. + +--- + +## 9. Worked example: the bundled config + +`examples/aws/config.json` exercises the full path. It sets `"force_recreate_roles": true`, so on every run the single configured role `kernel-ci-exampleuser-ecs-role` is deleted and recreated (the only manager for which `force_recreate` is not a no-op). That role's trust policy lists both `ecs-tasks.amazonaws.com` and `ec2.amazonaws.com`, so `_is_ec2_role` is `True` and an instance profile is created/healed. The same ARN is used as both the task definition's `execution_role_arn` and `task_role_arn`. + +The config also declares an ECR repo (`kernel-ci-exampleuser-ecr`), a Docker build (`force_rebuild: true`), an ECS cluster (`kernel-ci-exampleuser-cluster`), a task family (`kernel-ci-exampleuser-task`), and two CloudWatch log groups (`/ecs/...` retention 7 days, `/ec2/...` retention 3 days). + +--- + +## 10. Summary of guarantees + +- Provisioning is **constructor-driven**: building `AWSAuth` runs the full ensure-exists sweep. +- Every resource type implements the same three primitives and inherits the same idempotent orchestration; only `AWSRoleManager` can delete/recreate. +- `force_recreate` is meaningful only for IAM roles (gated on `hasattr(self, "delete_role")`). +- `AWSRoleManager` additionally self-heals drifted EC2 instance-profile bindings on every run. +- The expensive eventual-consistency wait happens only when something was actually created, tracked via the `(identifier, was_created)` tuple. diff --git a/docs/07-kernelci-kcidb-integration.md b/docs/07-kernelci-kcidb-integration.md new file mode 100644 index 0000000..6c8bd87 --- /dev/null +++ b/docs/07-kernelci-kcidb-integration.md @@ -0,0 +1,244 @@ +# KernelCI & KCIDB Integration - the pull-lab bridge + +This chapter documents how `pullab_cloud` bridges the KernelCI ecosystem (the `kernelci-api` "maestro" event service and the KCIDB results database) to the AWS-backed test pipeline. It is a "pull-lab": instead of maestro pushing jobs at a runtime, the runtime *polls* maestro for jobs it can run, executes them, and reports results back. + +**Key files:** + +- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_poller.py` - the long-lived poller / orchestrator. +- `pullab_cloud/src/kernel_ci_cloud_labs/pull_labs_translate.py` - translates a PULL_LABS job definition into a `pullab_cloud` run config. +- `pullab_cloud/src/kernel_ci_cloud_labs/kcidb_submit.py` - builds and (when enabled) submits KCIDB `tests[*]` rows. +- `kernelci-core/kernelci/runtime/pull_labs.py` - the upstream KernelCI runtime that *produces* PULL_LABS job definitions and stores them for labs to pull. + +## 1. The two halves of the protocol + +PULL_LABS is pull-based, with two cooperating sides: + +1. **Producer (upstream KernelCI):** `kernelci-core`'s `PullLabs(Runtime)` class renders a JSON job definition from a template (`PullLabs.generate`), then `submit()` *stores* that JSON in external storage rather than dispatching it. `submit()` returns `None` because pull-based labs pick up jobs asynchronously. `_store_job_definition` builds the storage path `pull_labs_jobs//.json` from `time.strftime("%Y%m%d")` and `uuid.uuid4().hex`, uploaded via `storage.upload_single(...)`. +2. **Consumer (this repo):** `PullLabsPoller` polls `kernelci-api` for job events, fetches each job's `job_definition` JSON, translates it, runs it on AWS, and writes results back. + +```mermaid +flowchart LR + subgraph Producer["kernelci-core PullLabs runtime"] + Gen["generate() renders JSON"] + Sub["submit() stores JSON"] + Gen --> Sub + Sub --> Store["external storage
pull_labs_jobs/YYYYMMDD/uuid.json"] + end + subgraph API["kernelci-api (maestro)"] + Node["job node
state=available
artifacts.job_definition=URL"] + end + subgraph Consumer["pullab_cloud PullLabsPoller"] + Poll["poll /events"] + Run["translate + run on AWS"] + Report["finish node + log_url back"] + Poll --> Run --> Report + end + Store -.-> Node + Node --> Poll + Report --> Node +``` + +## 2. The polling loop + +The poller polls `kernelci-api` `/events`, claims each job by recording `data.job_id`, fetches the `job_definition`, translates and runs it, submits to KCIDB, then marks the node done. + +`_events_url` builds the query with these exact parameters: + +``` +GET {api_base_uri}/events?state=available&kind=job&recursive=true&limit=1000&from= +``` + +The `from` value is a cursor timestamp persisted by `FileCursorStore`, defaulting to `DEFAULT_FROM_TIMESTAMP = "1970-01-01T00:00:00.000000"` and the file `/tmp/pullab_cloud_cursor.json`. `poll_once` reads the cursor, fetches events, processes each, then writes the last event's timestamp back. + +> Note: the upstream reference tool `kernelci-pipeline/tools/example_pull_lab.py` differs deliberately - it polls `state=done`, uses `requests`, runs `tuxrun`, waits for `input()` before launching, and defaults to `--group-filter pull-labs` / `--platform qemu-x86_64`. The production poller polls `state=available` (jobs not yet run) using stdlib `urllib`, runs the AWS pipeline, and never blocks on interactive input. + +## 3. Claiming a node + +`kernelci-api` has no node *state* usable as a "claimed" marker. Its state machine (`Node.validate_node_state_transition` using `state_transition_map`) allows: + +``` +running -> [available, closing, done] +available -> [closing, done] +closing -> [done] +done -> [] +``` + +There is **no** `available -> running` edge, so an `available` job cannot be promoted to `running`. A same-state transition returns `True` early, so `available -> available` is a no-op the API accepts. + +`_claim_node` therefore claims by writing `data.job_id` - the "Runtime job ID" field (`TestData.job_id`, used by both `Test` and `Job` nodes; the build-node analogue is the identically-named `KbuildData.job_id`). Procedure: + +1. Re-read the node (`GET /node/`). +2. Require `state == "available"`; skip otherwise. +3. Skip if `data.job_id` is already set (already claimed). +4. Set `data.job_id = f"{runtime_name}:{uuid.uuid4().hex}"`. +5. Strip `NODE_READ_ONLY_FIELDS` and `PUT` the full document back. + +The node is left in `available`; the claim is purely the `job_id` marker and is **best effort** - `kernelci-api` has no compare-and-set, so two pollers that both read before either writes can each claim the same node. Parallel pollers must be partitioned by platform (`KERNELCI_PLATFORMS`). + +`NODE_READ_ONLY_FIELDS` is exactly: `id`, `_id`, `created`, `updated`, `user`, `user_groups`, `owner`, `submitter`, `treeid`, `processed_by_kcidb_bridge`, `retry_counter`, `timeout`. These are omitted from PUT payloads to avoid FastAPI/Pydantic validation rejections. + +## 4. Per-event processing + +`process_event` runs each event end to end: + +1. `_matches_runtime` - skip unless `node.data.runtime == runtime_name`. +2. `_matches_platform` - skip unless `node.data.platform` is in the `KERNELCI_PLATFORMS` allowlist (`None` accepts all). +3. `_job_definition_url` - skip unless `node.artifacts.job_definition` is an `http` URL. +4. `_claim_node` - skip if it cannot be claimed. +5. `_execute_job` in a `try`, with `_finish_node` in a `finally` so an owned node is always finished. The default outcome before success is `NodeOutcome("incomplete", _ERR_INFRASTRUCTURE, "unexpected internal error")`. + +`_execute_job`: + +- Fetches the `job_definition` JSON (fetch failure -> `incomplete` / `Infrastructure`). +- Resolves `build_id` via `resolve_build_id`. +- Translates with `translate_job` (`ValueError` -> `incomplete` / `invalid_job_params`). +- Runs `self.job_executor(run_config)`; an executor exception is an infrastructure failure: emits a single `{"name": "boot.infrastructure", "status": "ERROR"}` row and marks the node `incomplete`. +- Builds KCIDB `tests[*]` rows with `build_test_row` (test id is `f"{node_id}.{instance_suffix}"`). +- If no rows came back, emits one synthetic `path="boot"` `ERROR` row and marks the node `incomplete`. + +### 4.1 Node result derivation + +`_node_result_from_rows` maps test statuses to a node result for a job that actually ran: + +- Any of `FAIL`, `ERROR`, `MISS` present -> `"fail"`. +- Else any of `PASS`, `DONE` present -> `"pass"`. +- Else `SKIP` present -> `"skip"`. +- Else -> `"fail"`. + +It **never** returns `"incomplete"` - that value is reserved for infrastructure failures and is decided by the caller (`_execute_job`). Those caller-side codes come from the `ErrorCodes` enum (`kernelci-core/kernelci/api/models.py`): `INVALID_JOB_PARAMS = "invalid_job_params"` and `INFRASTRUCTURE = "Infrastructure"`, surfaced in the poller as module constants `_ERR_INVALID_JOB_PARAMS` and `_ERR_INFRASTRUCTURE`. + +### 4.2 Finishing the node + +`_finish_node` re-reads the node, sets `state="done"` and `result`, and on an infrastructure failure also sets `data.error_code` / `data.error_msg`. It **merges** (not replaces) any `outcome.artifacts` into the node's existing `artifacts` dict so it never clobbers `job_definition`, then strips `NODE_READ_ONLY_FIELDS` and PUTs. + +## 5. Build-id resolution + +`resolve_build_id` walks `node.parent` upward looking for a `kbuild` ancestor, up to **8 hops**. On finding one it returns `f"{origin}:{kbuild_node['id']}"`, mirroring the convention in `kernelci-pipeline/src/send_kcidb.py` (`"id": f"{origin}:{node['id']}"`). If no `kbuild` ancestor is found the caller (`_execute_job`) falls back to `f"{origin}:unknown_{node_id}"`. + +## 6. Translation + +`translate_job` deep-copies `base_config` and rewrites `test_config` for the job. It requires `artifacts.kernel` **and** `artifacts.modules`, raising `ValueError` otherwise. + +`DEFAULT_PLATFORM_MAP`: + +| arch | instance_type | AMI hint | +|-------------------|----------------|---------------------------------------------------| +| `x86_64` | `c5a.4xlarge` | AL2023 `...al2023-ami-kernel-default-x86_64` | +| `arm64`/`aarch64` | `c6g.4xlarge` | AL2023 `...al2023-ami-kernel-default-arm64` | + +`DEFAULT_TEST_TYPE_MAP`: `baseline`, `ltp`, `unixbench` all map to `url-kernel-boot`. Unknown types fall back to `url-kernel-boot` via `_resolve_test_dir`, which uses `test_type_map["_default"]` if present, else the literal `"url-kernel-boot"`. + +The `test_params` dict carries: + +- `KERNEL_URL`, `MODULES_URL`, `ARCH` (always). +- `ROOTFS_URL` (only if `artifacts.rootfs` or `artifacts.ramdisk` present). +- `KERNELCI_NODE_ID` (only if `node_id` was passed). +- `PULL_LABS_TESTS` (only if the job has tests) - a comma-joined list of `id:type` pairs. + +The job `timeout` defaults to `3600`, is coerced to `int`, and maps to the VM entry's `max_runtime`. Each job becomes exactly one entry in `test_config.vms[*]` (one VM per job). + +## 7. KCIDB submission + +`kcidb_submit.py` builds tests-only KCIDB revisions. `STATUS_MAP`: + +- `pass`/`ok`/`success` -> `PASS` +- `fail`/`failed` -> `FAIL` +- `skip`/`skipped` -> `SKIP` +- `error`/`errored`/`incomplete` -> `ERROR` +- `miss` -> `MISS` +- `done` -> `DONE` + +`to_kcidb_status` defaults to `ERROR` for any unknown or empty value. + +`build_test_row` validates `origin` against `^[a-z0-9_]+$` (`validate_origin`) and `path` against dot-separated `[A-Za-z0-9_-]` segments (`validate_test_path`), raising `ValueError` on an invalid value so a bad test name fails locally rather than at the ingester. + +### 7.1 Direct submission is disabled + +Direct KCIDB submission is currently **disabled**. The `submit_tests(...)` call is preserved as a commented-out block, and the `build_test_row` machinery is kept so `_node_result_from_rows` still works and dual submission can be re-enabled cheaply. + +The reason: the old direct submission posted rows under origin `pull_labs_aws_ec2`, producing a parallel row keyed `(pull_labs_aws_ec2, .)` that KCIDB stored but the dashboard never displayed - the dashboard renders the **maestro-origin** row (`origin=maestro`, `id=maestro:`) emitted by kernelci-pipeline's `send_kcidb`. + +The new flow instead writes the boot-log URL onto the maestro node's `artifacts` under the `test_log` key (extra URLs from multi-VM jobs under suffixed `test_log_` keys). `send_kcidb` then picks it up: `_get_artifacts` walks the parent chain when a node has no artifacts of its own, and the test-node parser sets `log_url = artifacts.lava_log` if present, else `artifacts.test_log`. A `test_log` value written on the job node is thus visible to every test descendant. + +## 8. Artifact collection + +`collect_run_artifacts` (`core/artifacts.py`) runs after VM logs are pulled. For every `(test_name, instance_id)` discovered under the run prefix in S3, it downloads the boot console log and writes an `artifacts.json` manifest whose entries carry a `log_url` built by `s3_public_url` from the S3 key: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +The public URL form is `https://.s3..amazonaws.com/` and only resolves when the bucket carries the public-read policy. Each manifest entry is keyed by `test` and `instance_id`; the poller's `_load_artifact_log_urls` indexes them by the `(test, instance_id)` tuple to attach a `log_url` to each test row. + +## 9. Executor and pipeline + +The default executor `_default_job_executor` instantiates the auth, provider, and storage classes from the `AUTH_REGISTRY` / `PROVIDER_REGISTRY` / `STORAGE_REGISTRY` registries, calls `run_pipeline(provider, storage)`, and returns `_extract_test_results(summary)`. + +`run_pipeline` (`core/pipeline.py`) returns the dict produced by `create_summary`. That summary dict includes `run_directory`, `vms.instances[]` (per-instance ground truth), and `container_failure_log_url`. The per-run S3 prefix is `run_{test_id}_{datetime}` (`f"run_{test_id}_{run_timestamp}"`, timestamp `%Y%m%d_%H%M%S` UTC). + +`_extract_test_results` emits one row per VM instance from `summary["vms"]["instances"]`, joining with `artifacts.json` on `(test, instance_id)` to attach `log_url`. It falls back to legacy per-test aggregation when `instances` is absent, and sets the second tuple element to `summary["container_failure_log_url"]` so a container-died-before-boot failure still links to the container log. + +### 9.1 Boot-test path remapping + +`_BOOT_TEST_NAMES` is the frozenset `{"baseline", "url-kernel-boot", "boot"}`. `_test_name_to_path` remaps any of these to the path `"boot"` so the KernelCI dashboard's `is_boot()` classifier treats the row as a boot test; every other name passes through unchanged (then validated by `build_test_row`). + +## 10. Configuration and entry points + +### 10.1 Credential resolution + +`_resolve_kcidb_endpoint` picks the KCIDB submit URL and token in priority order: + +1. `KCIDB_SUBMIT_URL` + `KCIDB_JWT` (both set). +2. `KCIDB_REST` (kci-dev form `https://@host/submit`). +3. `UNIFIED_TOKEN` as the JWT, paired with `KCIDB_SUBMIT_URL` if set, else config `kcidb_submit_url`. +4. config `kernelci.kcidb_submit_url` + `kernelci.kcidb_jwt`. + +The kernelci-api token precedence is `KERNELCI_API_TOKEN` > `UNIFIED_TOKEN` > config `api_token`. + +`_parse_kcidb_rest` parses `https://@host[/path]`, extracts the username as the token, rebuilds the host (with port if any), and ensures the path ends with `/submit`. + +### 10.2 Token preflight + +`_validate_api_token` calls `GET /whoami` once at startup and checks that the user is a superuser or a member of one of: `node:edit:any`, `runtime::node-editor`, `runtime::node-admin`. It is never fatal - a transient API error must not stop the poller from starting. + +### 10.3 Environment variables + +All optional, falling back to the `kernelci` section of `config.json`: + +| Env var | Purpose | +|----------------------------|-----------------------------------------------| +| `KERNELCI_API_BASE_URI` | maestro API base URI | +| `KERNELCI_API_TOKEN` | maestro API token | +| `KERNELCI_RUNTIME_NAME` | runtime/lab name to match jobs against | +| `KERNELCI_PLATFORMS` | comma-separated platform allowlist | +| `KCIDB_SUBMIT_URL` | KCIDB `/submit` URL | +| `KCIDB_JWT` | KCIDB bearer token | +| `KCIDB_ORIGIN` | KCIDB origin string | +| `KCIDB_REST` | combined kci-dev `https://@host/submit`| +| `UNIFIED_TOKEN` | shared fallback for API token and KCIDB JWT | +| `PULLAB_CURSOR_FILE` | cursor file path | +| `PULLAB_POLL_INTERVAL_SEC` | poll interval seconds | +| `PULLAB_BASE_CONFIG` | path to the base config JSON | + +### 10.4 Entry points + +- `main`: CLI. `--once` runs a single `poll_once` and exits; otherwise `run_forever` loops, sleeping `poll_interval_sec` only when a cycle processed zero events. +- `lambda_handler`: AWS Lambda entry point that runs a single `poll_once` per invocation, with config read from env vars (and an optional `config_path` in the event payload), returning `{"status": "ok", "processed": }`. + +## 11. End-to-end data flow + +```mermaid +flowchart TD + Ev["GET /events?state=available&kind=job"] --> PE["process_event"] + PE --> Claim["_claim_node sets data.job_id"] + Claim --> Fetch["fetch job_definition JSON"] + Fetch --> Build["resolve_build_id -> origin:kbuild_id"] + Build --> Tr["translate_job -> run_config"] + Tr --> Exec["job_executor runs pipeline on AWS"] + Exec --> Sum["summary with vms.instances and run_directory"] + Sum --> Rows["_extract_test_results joins artifacts.json"] + Rows --> Outcome["_node_result_from_rows -> pass/fail/skip"] + Outcome --> Attach["attach test_log URL to artifacts"] + Attach --> Finish["_finish_node state=done + result"] + Finish --> SendK["send_kcidb emits maestro row with log_url"] +``` diff --git a/docs/08-analysis-regression.md b/docs/08-analysis-regression.md new file mode 100644 index 0000000..513a271 --- /dev/null +++ b/docs/08-analysis-regression.md @@ -0,0 +1,308 @@ +# Benchmark Regression Analysis & Tooling + +This chapter covers two adjacent concerns in `pullab_cloud`: the **benchmark regression analysis** subsystem (raw per-VM CSV metrics -> statistical regression verdicts) and the **operational tooling** around it (setup validation, configuration, uploads, cleanup, secret scrubbing). Both live under `src/kernel_ci_cloud_labs/` and run either inline in the pipeline or via `aws ...` CLI subcommands. + +**Two distinct regression engines** exist, and the distinction matters: + +| Engine | Module | Invoked by | Gate for "regression" | +|---|---|---|---| +| **In-pipeline analyzer** | `core/benchmark_analyzer.py` | `core/pipeline.py` (inline, post-run) | significance (p-value) **AND** effect size (Cohen's d) **AND** direction | +| **Offline analysis CLI** | `analysis/analyze_regressions.py` | `aws analyze` -> `analysis/run_analysis.py` | **sign of percent change only** (no significance/effect gate) | + +Same CSV schema, different verdicts. The analyzer is conservative (requires statistical evidence); the offline CLI is a visualization/plotting pipeline that flags any directional change. + +**Key files:** `core/benchmark_analyzer.py`, `core/pipeline.py`, `analysis/run_analysis.py`, `analysis/analyze_regressions.py`, `analysis/download_results.py`, `core/log_scrub.py`, `launch_vm.py`, `setup_validate.py`, `setup_configure.py`, `setup_upload_rpms.py`, `setup_upload_tests.py`, `setup_cleanup.py`, `cli.py`, `vm-tests/unixbench-kernel-regression/common_lib.sh`. + +--- + +## 1. The benchmark CSV contract + +Both engines read the same CSV schema emitted by the in-VM test harness. The header is written in `vm-tests/unixbench-kernel-regression/common_lib.sh`: + +``` +metric,unit,value,more_is_better,kernel_version,instance_id,instance_type,arch +``` + +(documented in `vm-tests/unixbench-kernel-regression/README.md`). `more_is_better` is computed per metric in the same awk block: every metric defaults to `"true"`; the **only** metric forced to `"false"` is `System_Call_Overhead` (a latency-style metric where larger is worse). + +The two CSV "sides" come from different run scripts: + +- `benchmark-base-*.csv` - written by `run-02-run-unixbench-setup-kernel-B.sh` (`summarize_unixbench_log ... "benchmark-base-...csv"`). +- `benchmark-tip-*.csv` - written by `run-03-run-second-unixbench.sh`. + +These filenames are load-bearing: the in-pipeline analyzer routes rows by the `benchmark-base-` / `benchmark-tip-` substring in the S3 key (see section 3). + +--- + +## 2. Statistical core (`core/benchmark_analyzer.py`) + +### Thresholds + +Two module-level constants gate a regression: + +```python +P_VALUE_THRESHOLD = 0.05 # significance +COHENS_D_THRESHOLD = 0.5 # meaningful effect size +``` + +### Dataclasses + +- `MetricStats` - value list deriving `mean`, `median`, `stddev` (sample, divides by `n-1`), and `cv` (coefficient of variation) in `__post_init__`. `stddev` computed only when `n > 1`; `cv` is `stddev/abs(mean)`, or `0.0` when `mean == 0`. +- `MetricComparison` - base/tip `MetricStats` plus computed statistics; `__post_init__` computes `pct_change`, then calls `_compute_tests()` and `_detect_regression()`. +- `TestBenchmarkResult` - aggregates comparisons for one test; exposes `regressions` / `has_regression`. +- `PipelineBenchmarkSummary` - aggregates across all tests, tracking `regression_test_names`, `tests_with_regression`, and failed-test bookkeeping. + +### Test computation (`_compute_tests`) + +If **either** sample has fewer than 2 values (`len(base_v) < 2 or len(tip_v) < 2`), it returns early with dataclass defaults (`t_pvalue = 1.0`, `u_pvalue = 1.0`, `cohens_d = 0.0`). Otherwise it computes Welch's t-test, Mann-Whitney U, and pooled Cohen's d. + +### Regression decision (`_detect_regression`) + +```python +significant = self.t_pvalue < P_VALUE_THRESHOLD or self.u_pvalue < P_VALUE_THRESHOLD +meaningful = abs(self.cohens_d) >= COHENS_D_THRESHOLD +if not (significant and meaningful): + self.is_regression = False + return +if self.more_is_better: + self.is_regression = self.pct_change < 0 +else: + self.is_regression = self.pct_change > 0 +``` + +All three must hold for a regression: + +1. **Significant** - `t_pvalue < 0.05` **OR** `u_pvalue < 0.05` (either test crossing suffices). +2. **Meaningful** - `abs(cohens_d) >= 0.5` (AND-ed with significance). +3. **Wrong direction** - for `more_is_better` metrics a drop (`pct_change < 0`) regresses; otherwise a rise (`pct_change > 0`) does. + +```mermaid +flowchart TD + S["_compute_tests"] --> G{"len base or tip lt 2"} + G -->|"yes"| D0["defaults p=1.0 d=0.0, no regression"] + G -->|"no"| C["Welch t, Mann-Whitney U, Cohen d"] + C --> SIG{"t_p lt 0.05 OR u_p lt 0.05"} + SIG -->|"no"| NR["is_regression = False"] + SIG -->|"yes"| EFF{"abs cohens_d ge 0.5"} + EFF -->|"no"| NR + EFF -->|"yes"| DIR{"more_is_better"} + DIR -->|"true"| P1{"pct_change lt 0"} + DIR -->|"false"| P2{"pct_change gt 0"} + P1 -->|"yes"| REG["is_regression = True"] + P2 -->|"yes"| REG + P1 -->|"no"| NR + P2 -->|"no"| NR +``` + +### Pure-Python statistics (no scipy/numpy) + +All helpers are hand-rolled so the analyzer carries no scientific dependencies: + +- `_welch_t_test` - unequal-variance t-test with Welch-Satterthwaite df; returns `(0.0, 1.0)` if either sample is `< 2` or standard error is `0`. +- `_mann_whitney_u` - rank-based U test with average-rank tie handling, reduced to a two-tailed p-value via normal approximation (`_normal_cdf`). (Docstring says "n >= 8" but the code applies the approximation unconditionally.) +- `_cohens_d` - pooled-stddev effect size; returns `0.0` if either sample is `< 2` or pooled std is `0`. +- `_normal_cdf` - standard normal CDF via `math.erfc`. +- `_t_distribution_two_tailed_p` - for `df > 100` uses the normal approximation; otherwise the regularized incomplete beta function. +- `_regularized_incomplete_beta` - Lentz continued-fraction evaluation with `max_iter = 200` and early break on delta convergence. + +--- + +## 3. S3 ingestion and comparison (`BenchmarkAnalyzer`) + +`BenchmarkAnalyzer` is constructed with `(s3_client, bucket, run_prefix)`. + +`analyze(test_names, vm_success_map=None)` seeds a `PipelineBenchmarkSummary`, optionally records success/fail counts from `vm_success_map`, then iterates each test name through `_analyze_test`, appending results and tallying regressions. + +`_analyze_test` lists objects under `f"{self.run_prefix}/test_{test_name}/output/"` and, for each key, **skips anything that is not a `.csv` and does not contain `benchmark-`**. Surviving rows route by substring: `benchmark-base-` -> base rows, `benchmark-tip-` -> tip rows. The whole S3 traversal is wrapped in a broad `try/except` that logs a warning and returns `None` on failure; if either side is empty it also returns `None`. + +`_compare` extracts `kernel_version` from the first row of each side, groups both sides via `_group_by_metric`, then iterates `sorted(set(base_by_metric) & set(tip_by_metric))` - i.e. **only metrics present in both** sides, sorted - building a `MetricComparison` per metric. + +`_group_by_metric` parses each row: skips rows with empty `metric`, parses `value` via `float(...)` and **skips the row on `ValueError`/`TypeError`**, and defaults `more_is_better` to `"true"`, treating it as `True` only when the lowercased string equals `"true"`. + +```mermaid +flowchart TD + A["analyze(test_names)"] --> B["for each test: _analyze_test"] + B --> C["list_objects_v2 under run_prefix/test_NAME/output/"] + C --> D{"endswith .csv AND contains benchmark-"} + D -->|"no"| C + D -->|"yes"| E["_download_csv"] + E --> F{"key has benchmark-base- ?"} + F -->|"base"| G["base_rows"] + F -->|"tip"| H["tip_rows"] + G --> I{"both non-empty"} + H --> I + I -->|"no"| J["return None"] + I -->|"yes"| K["_compare: group, intersect metrics, MetricComparison"] + K --> L["TestBenchmarkResult"] +``` + +### Reporting and the notification hook + +`log_benchmark_summary` renders a human-readable report: per test it logs base/tip kernel, metric count, and for each regression the means, stddevs, CVs, percent change, both p-values, and Cohen's d. It ends by listing `regression_test_names`. A documented **notification hook** comment block marks where downstream alerting (SNS, KCIDB, Slack/email, bisection triggers) would attach, noting that `PipelineBenchmarkSummary` supplies the structured payload (`regression_test_names`, per-metric stats, p-values, effect sizes). + +--- + +## 4. Pipeline wiring (`core/pipeline.py`) + +After a run, the pipeline performs benchmark analysis inline inside a broad `try/except`: + +```python +test_names = list({t for vm in provider.config["test_config"]["vms"] + for t in vm.get("test", [])}) +s3_client = provider.auth.get_client("s3") +analyzer = BenchmarkAnalyzer(s3_client, storage.bucket, run_prefix) +benchmark_summary = analyzer.analyze(test_names) # no vm_success_map +log_benchmark_summary(benchmark_summary) +``` + +Verified details: + +- `analyze` is called **without** a `vm_success_map`, so the summary's success/fail counts stay at defaults. +- A second copy of the notification-hook comment lives in `pipeline.py`, pointing back at `BenchmarkAnalyzer`. +- The block is guarded by `except Exception` that logs `"Benchmark analysis skipped: %s"` - a benchmark failure never fails the pipeline. + +--- + +## 5. Offline analysis path (`aws analyze`) + +The CLI `cmd_analyze` (`cli.py`) imports `analysis.run_analysis.main`; on `ImportError` it prints `"Analysis requires extra dependencies"` and the hint `pip install -e '.[analysis]'`, then exits. It builds an args namespace with the fixed `file_pattern="benchmark-*.csv"`. + +### `run_analysis.main` - three steps + +1. **Download** - builds a `download_args` namespace and calls `download_results.main`; returns `1` if it fails. +2. **Analyze** - calls `analyze_regressions.main`; returns `1` on failure. +3. **Optional upload** - only if `args.upload_analysis` is set, `upload_analysis_to_s3` pushes the combined CSV, regression CSV, and any `plots/*.png` to `{run_prefix}/analysis/`. + +### Download (`download_results.py`) + +`download_csvs_from_s3` paginates the whole `run_prefix`, keeping a key only if it **contains `/output/`** and its basename matches the pattern. Local files are named `f"{test_name}_{instance_id}_{Path(key).name}"`, where `test_name` strips the `test_` prefix from path part `[1]` and `instance_id` is path part `[3]`. `main` rejects any `file_pattern` not ending in `.csv`. + +### Offline regression math (`analyze_regressions.py`) + +`calculate_regression_simple` is the key contrast with the in-pipeline analyzer: per metric it computes `pct_change` from the two kernel means and flags `is_regression` purely on the **sign** of `pct_change` relative to `more_is_better` - **no significance test, no effect-size gate**. Only rows with `abs(percent_change) > 1.0` are passed to the plotter (`results_for_plot`). + +`main` loads the combined CSV, derives `kernel_base` by stripping a trailing `.x86_64` / `.aarch64` / `.arm64` suffix from `kernel_version`, requires **at least 2** distinct kernels, and compares the **two lowest** sorted kernels (`kernel_a, kernel_b = kernels[0], kernels[1]`). It produces an overall slice, then per-architecture slices for `x86_64` and `aarch64|arm64`, concatenates results, and writes the regression CSV. + +--- + +## 6. Secret scrubbing for public logs (`core/log_scrub.py`) + +Kernel boot console buffers are uploaded to a **public-read** prefix and their URLs published to KCIDB, so they are scrubbed at the upload boundary. + +`_RULES` is an ordered list - order matters where patterns overlap (more-specific wins): + +| Order | Rule name | Matches | Behavior | +|---|---|---|---| +| 1 | `pem-private-key` | `-----BEGIN ... PRIVATE KEY-----` ... `-----END ... PRIVATE KEY-----` (DOTALL) | full redaction | +| 2 | `ssh-public-key` | `ssh-rsa`/`ed25519`/`dss`/`ecdsa-*` + base64 body (40+) | full redaction | +| 3 | `jwt` | `eyJ...eyJ....` three base64url segments | full redaction | +| 4 | `github-token` | `(ghp\|gho\|ghu\|ghs\|ghr)_` + 36+ chars | full redaction | +| 5 | `aws-access-key-id` | `(AKIA\|ASIA)` + 16 chars | full redaction | +| 6 | `bearer-token` | `Bearer ` (case-insensitive) | **keeps** `"Bearer "` (group 1), redacts the token | +| 7 | `credential-kv` | secret-named `KEY=VALUE` / `KEY: VALUE` | **keeps** the `KEY=`/`KEY:`, redacts the value | + +`scrub_text` returns `("", {})` on empty input; otherwise the scrubbed string plus a `{rule_name: hits}` counter (zero-hit rules omitted). The substitution marker is `[REDACTED:{kind}]`. + +### Upload integration (`launch_vm.py`) + +In `capture_console_output`, the decoded buffer is scrubbed **before** upload. Redaction **counts** are logged, never the originals. The kernel-panic scan runs on the **scrubbed** text so a logged marker cannot re-leak a token. The object is written to: + +``` +{run_prefix}/test_{test}/output/{instance_id}/console-output.log +``` + +with object `Metadata` recording `"scrubbed": "v1"` (alongside `capture-reason` and `panic-detected`). + +--- + +## 7. Setup validation (`setup_validate.py`) + +`validate(...)` runs an ordered battery of checks and returns `0` only if **all** pass (`return 0 if all(results.values()) else 1`). Order: + +1. `aws_credentials` +2. `ec2_describe` +3. `ec2_console_output` +4. `ssm` +5. *(optional, only if `role_name` given)* `iam_role`, `instance_profile` +6. `s3_bucket` - and **only if the bucket exists/was created**, `s3_logs_public_policy` +7. `kernelci_api_token` +8. `kcidb_jwt` + +### Environment variables + +Held as plain strings to avoid a circular import on the poller: + +``` +KERNELCI_API_BASE_URI (ENV_API_BASE_URI) +KERNELCI_API_TOKEN (ENV_API_TOKEN) +KCIDB_SUBMIT_URL (ENV_KCIDB_URL) +KCIDB_JWT (ENV_KCIDB_JWT) +KCIDB_REST (ENV_KCIDB_REST) +UNIFIED_TOKEN (ENV_UNIFIED_TOKEN) +``` + +### Console-output permission probe + +`check_console_output_permission` deliberately calls `get_console_output` against the non-existent instance id `"i-0000000000000000f"` so the only thing under test is the IAM action. A `NotFound`/`Malformed` error means the call was authorized (pass); `UnauthorizedOperation`/`AccessDenied` is the failure case. + +### Public-read bucket policy + +The bucket is configured for **public access via bucket policy only** - ACLs stay blocked. `_create_s3_bucket` sets the PublicAccessBlock with `BlockPublicAcls=True`, `IgnorePublicAcls=True`, `BlockPublicPolicy=False`, `RestrictPublicBuckets=False`; the same shape is enforced by `_check_public_access_block`. + +The narrow public-read statement uses key pattern `"*/test_*/output/*/console-output.log"` (`_PUBLIC_LOGS_KEY_PATTERN`) and Sid `"PublicReadKernelBootLogs"` (`_PUBLIC_LOGS_SID`) - matching the console-log key layout from `launch_vm` (section 6). Everything else (payloads, results, benchmark CSVs) stays private. With `--fix`, `_check_bucket_policy_statement` merges the expected statement into the existing policy, replacing any prior statement carrying the same Sid. + +> **Production safety note.** Setting `BlockPublicPolicy=False` / `RestrictPublicBuckets=False` deliberately relaxes a public-access safeguard so the boot-log policy is accepted. It is scoped to a single narrow `s3:GetObject` statement and ACL-based public access remains blocked, but it is still a public-exposure surface - pair it with the `log_scrub` pass (section 6) and confirm the bucket is intended to host only world-readable boot logs before applying `--fix`. + +--- + +## 8. Configuration and resource lifecycle tooling + +### `setup_configure.py` + +`get_default_prefix` returns `f"kernel-ci-{user}-"` from `$USER`. `update_config` strips the trailing dash to form `base` and derives every resource name from it: + +- S3 buckets: `{base}-results` and `{base}-storage` +- IAM role key: `{base}-ecs-role` +- ECR repository: `{base}-ecr` +- ECS cluster: `{base}-cluster`; task family: `{base}-task` +- CloudWatch log groups: `/ecs/{base}-task` and `/ec2/{base}-vms` + +It also rewrites IAM policy ARNs to track the renamed role/buckets: the `AllowPassRole` resource, the optional `AllowIAMInstanceProfile` resource, and the `AllowS3Access` resource list. + +### `setup_upload_rpms.py` + +`upload_to_s3` places RPMs under a fixed layout: + +``` +kernel-rpms/src +kernel-rpms/binary/x86_64 +kernel-rpms/binary/aarch64 +``` + +Uploads are size-verified by `verify_s3_upload`, which re-`head_object`s and compares `ContentLength` to local size with up to `retries=3` and exponential backoff (`2**attempt`). + +### `setup_upload_tests.py` + +`upload_to_s3` uploads `test-scripts/test-vm-client.sh` once, then for **each subdirectory containing at least one `run*.sh`** writes a zip to `test-scripts/{name}/{name}_test_payload.zip` and, when present, uploads `external_requirements.json` **separately** to `test-scripts/{name}/external_requirements.json` so the pipeline can read it without unzipping. + +### `setup_cleanup.py` + +The tool **lists by default and deletes only with `--delete`** (closing guidance: `"Run with --delete to remove these resources"`). `delete_iam_role` unwinds in order: detach managed policies, delete inline policies, remove the role from its instance profile and delete that profile, then delete the role. Beyond prefix-derived resources it also checks the legacy default names `ecsTaskExecutionRole` and the `kernel-ci-test` ECR repository. + +--- + +## 9. Summary of the two regression verdicts + +The single most important takeaway: a "regression" means different things in the two code paths. + +```mermaid +flowchart TD + CSV["benchmark-base / benchmark-tip CSVs in S3"] --> A["In-pipeline analyzer"] + CSV --> B["aws analyze (offline)"] + A --> A1["sig (t_p OR u_p lt 0.05) AND effect (d ge 0.5) AND direction"] + B --> B1["sign of pct_change only; plot if abs gt 1.0"] + A1 --> R["PipelineBenchmarkSummary + log + notification hook"] + B1 --> P["regression_results.csv + plots, optional upload"] +``` + +The pipeline path is the authoritative pass/fail signal; the offline `aws analyze` path is a reporting/visualization aid that intentionally trades statistical rigor for a complete directional picture across architectures. diff --git a/docs/09-cost-leak-prevention.md b/docs/09-cost-leak-prevention.md new file mode 100644 index 0000000..f28938f --- /dev/null +++ b/docs/09-cost-leak-prevention.md @@ -0,0 +1,224 @@ +# Cost & Resource-Leak Prevention + +Every test run in `pullab_cloud` spins up real, billable AWS infrastructure: one ECS Fargate orchestrator task plus one or more EC2 VMs (each with an attached EBS root volume), CloudWatch log groups, and S3 objects. An orphaned `c5a.4xlarge` VM is the single most expensive failure mode. The codebase uses layered, defense-in-depth mechanisms so that no compute keeps running and no storage keeps accruing once a run is over - even when the orchestrator dies, SSM never connects, the guest kernel panics, or a thread crashes silently. + +**Key files:** `launch_vm.py`, `aws_provider.py`, `pipeline.py`, `test-vm-client.sh`, `aws_cloudwatch_manager.py`, `base_resource_manager.py`, `setup_cleanup.py`, `examples/aws/config.json`. + +```mermaid +flowchart TD + subgraph orchestrator["ECS Fargate orchestrator (launch_vm.py)"] + A["spawn_vm: run_instances"] + B["execute_test_via_ssm: send_command"] + C["cleanup: terminate_instances"] + end + subgraph vm["EC2 VM (guest)"] + D["UserData watchdog: sleep max_runtime+600 then shutdown -h now"] + E["test-vm-client.sh watchdog: sleep SAFETY_TIMEOUT then shutdown -h now"] + F["last script: sudo shutdown +5"] + end + A -->|"InstanceInitiatedShutdownBehavior = terminate"| vm + A --> D + B -->|"4th arg = max_runtime as SAFETY_TIMEOUT"| E + D -->|"OS shutdown"| G["EC2 terminates instance (DeleteOnTermination drops EBS)"] + E -->|"OS shutdown"| G + F -->|"OS shutdown"| G + C --> G +``` + +--- + +## 1. The keystone: instance self-termination + +The most important guarantee: an EC2 VM deletes itself and its disk if it ever shuts down its own OS, regardless of why. Two `run_instances` parameters in `launch_vm.py:spawn_vm` make this true: + +- `InstanceInitiatedShutdownBehavior="terminate"`. Any in-guest `shutdown -h now` (or `shutdown +N`) does not merely *stop* the instance - it *terminates* it. This converts every guest-side safety timer into a hard cost stop. +- `BlockDeviceMappings` for `/dev/xvda` with `Ebs.DeleteOnTermination=True`, `VolumeType="gp3"`, `VolumeSize=self.root_volume_size`. On termination the root EBS volume is deleted with the instance, so terminated VMs leave no lingering block storage. + +Because of these two settings, *any* path reaching an OS shutdown inside the guest results in full teardown (compute + disk). The remaining mechanisms exist to make sure such a shutdown - or an external `terminate_instances` - always eventually happens. + +--- + +## 2. Guest-side safety timers (two independent watchdogs + a fallback) + +| Mechanism | Timer | Covers | Action | +|---|---|---|---| +| UserData watchdog | `max_runtime + 600` (10-min buffer) | orchestrator dies before sending SSM command | `shutdown -h now` -> terminate | +| In-test watchdog (`SAFETY_TIMEOUT`) | `= max_runtime` (4th arg; default `1800` only if absent) | test hangs mid-run | `shutdown -h now` -> terminate | +| Post-completion fallback | `shutdown +5` | script exits but instance doesn't terminate | delayed shutdown -> terminate | + +### 2a. UserData watchdog - survives orchestrator death + +The cloud-init `UserData` arms a detached watchdog *before* it waits for the SSM agent (`launch_vm.py`): + +```bash +nohup bash -c 'sleep {self.max_runtime + 600}; \ +echo "UserData safety timeout reached, shutting down"; shutdown -h now' &>/dev/null & +``` + +The sleep is `max_runtime + 600` (`max_runtime` plus a 10-minute buffer). Its in-code comment states it "catches the case where the orchestrator dies before sending the SSM command". Because it is armed before the `while ! systemctl is-active --quiet amazon-ssm-agent` wait loop, even a VM that never becomes SSM-manageable self-terminates after `max_runtime + 600` seconds. This backstop means the VM does not depend on any external actor to die. + +### 2b. In-test watchdog - bounds the active test run + +`launch_vm.py:execute_test_via_ssm` invokes the client with `max_runtime` as the 4th positional argument: + +```bash +/tmp/test-vm-client.sh {self.s3_bucket} {self.run_prefix} {self.test} {self.max_runtime} +``` + +Inside the script that 4th arg becomes `SAFETY_TIMEOUT` (`test-vm-client.sh`): + +```bash +SAFETY_TIMEOUT=${4:-1800} # Default 30 minutes if not provided +``` + +The literal `1800` default applies only when the 4th arg is absent; the orchestrated path always supplies `max_runtime` (3600s in the example config), so the effective window equals `max_runtime`, not 1800s. `test-vm-client.sh:start_watchdog` writes a separate `watchdog_runner.sh` and launches it with `nohup`; it sleeps in 5-second increments up to `SAFETY_TIMEOUT`, then runs `sudo shutdown -h now`. It is cancellable: `test-vm-client.sh:cleanup_watchdog` removes an active-flag file and sends `SIGTERM` (escalating to `SIGKILL`) to the watchdog PID, so a normally-progressing multi-script run can stand the watchdog down between reboots. + +### 2c. Post-completion fallback shutdown + +After the final test script succeeds, the client schedules `sudo shutdown +5` (a 5-minute delayed shutdown) as a belt-and-suspenders fallback "in case the script exits but instance doesn't terminate". Combined with the `terminate` shutdown behavior, this guarantees teardown even if no external terminate call ever arrives. + +--- + +## 3. Orchestrator-side timeouts and crash detection + +### 3a. SSM command timeout (12-hour hard cap) + +`launch_vm.py:execute_test_via_ssm` computes one timeout for both the SSM command and the local wait loop: + +```python +total_timeout = min(self.max_runtime + 3600, 43200) # max 12 hours +``` + +Base + a 1-hour (3600s) reboot buffer, hard-capped at 43200s (12 hours). It is passed as both `executionTimeout` (in `Parameters`) and the top-level `TimeoutSeconds` of `send_command`, and bounds the polling `while` loop. + +On a terminal SSM status: +- `Success` -> return `True`. +- `Failed` -> soft warning ("VM may have shut down before SSM could report"), captures console buffer, returns `False` - the real verdict comes from S3 (`check_test_result`). +- `TimedOut` / `Cancelled` -> logged as error, console buffer captured, returns `False`. +- All non-`Success` terminal cases call `capture_console_output(reason="ssm-failure")` to grab the kernel tail before teardown. + +If the local wait loop instead exhausts `total_timeout` without a terminal status, it calls `ssm.cancel_command(...)` then `capture_console_output(reason="ssm-failure")` before returning `False`. Note: `cancel_command` is invoked only on this outer-loop-timeout path, not on the per-status terminal branch. + +### 3b. ECS task wait loop - crash, hang, and overall-timeout aborts + +`aws_provider.py:wait_for_task_completion` polls the ECS task to `STOPPED` while tailing the per-run VM console log group, aborting early on three conditions. All four tunables are env-var overridable: + +| Env var | Default | Role | +|---|---|---| +| `PULLAB_TASK_POLL_INTERVAL_SEC` | `30` | poll/sleep cadence | +| `PULLAB_TASK_PROGRESS_LOG_SEC` | `120` | progress-log cadence (cosmetic) | +| `PULLAB_TASK_HANG_THRESHOLD_SEC` | `600` | max silence before declaring a hang | +| `PULLAB_TASK_WAIT_TIMEOUT_SEC` | `3600` | overall wait ceiling | + +Each abort path calls `self.terminate_container()` then raises `RuntimeError`: +- **Overall timeout** - `elapsed > overall_timeout` -> stop task, raise. +- **Kernel crash/stall** - `_scan_for_kernel_crash` matches a guest console line (panic, Oops, `BUG:`, GP fault, kernel paging fault, double fault, arm/arm64 `Internal error:`, soft lockup, RCU stall, hung task) -> stop task, raise. +- **Hang** - no new console events for more than `hang_threshold` seconds -> stop task, raise. + +`terminate_container` is `ecs.stop_task(cluster=..., task=arn_to_stop)`. Crash detection is best-effort: it falls back to plain status polling when there is no `/ec2/` log group or no `run_prefix` configured. + +```mermaid +flowchart TD + Start["wait_for_task_completion loop"] + Start --> Chk{"elapsed > overall_timeout (3600s)?"} + Chk -->|"yes"| Abort1["terminate_container then raise RuntimeError"] + Chk -->|"no"| Status{"task status == STOPPED?"} + Status -->|"yes"| Done["return final_status"] + Status -->|"no"| Tail["tail VM console log group"] + Tail --> Crash{"crash pattern hit?"} + Crash -->|"yes"| Abort2["terminate_container then raise RuntimeError"] + Crash -->|"no"| Hang{"no new events > hang_threshold (600s)?"} + Hang -->|"yes"| Abort3["terminate_container then raise RuntimeError"] + Hang -->|"no"| Sleep["sleep poll_interval (30s)"] + Sleep --> Chk + Abort1 --> CleanFinally["run_pipeline finally sweep"] + Abort2 --> CleanFinally + Abort3 --> CleanFinally +``` + +--- + +## 4. Pipeline `finally` sweep - the unconditional cleanup + +`pipeline.py:run_pipeline` wraps the whole orchestration in `try/except/finally`. The `finally` block always runs (on success or any exception) and performs two independently guarded teardown steps: + +1. **Stop the ECS task** - `provider.terminate_container(task_arn)` (i.e. `ecs.stop_task`), only if `task_arn` is in locals and truthy, in its own `try/except`. +2. **Sweep this run's VMs** - `ec2.describe_instances` filtered by *both* `tag:run_prefix == ` *and* `instance-state-name in ["pending", "running"]`, then `ec2.terminate_instances(InstanceIds=...)` for matches, in a second separate `try/except`. + +The two-key filter (run_prefix tag AND state) is deliberate: it only terminates instances belonging to *this* run that are still consuming compute, and never touches `stopping`/`stopped`/`terminated` instances or instances from other runs. The per-instance tags that make this filter work (`Name`, `TestID`, `run_prefix`) are stamped at launch in `launch_vm.py:spawn_vm`, where `Name = "-"`, `TestID = test_id`, `run_prefix = run_prefix`. + +This sweep is the orchestrator's primary leak-stopper for the normal case; the guest-side watchdogs of section 2 cover the case where the orchestrator never reaches its `finally`. + +--- + +## 5. Thread-level guarding (no silent leaks from worker crashes) + +VMs are launched on one Python thread each (`launch_vm.py`). Each worker's `launch_and_test_vm` calls `launcher.cleanup()` in its own `finally`, *and* wraps that call in a second `try/except` so an unhandled exception escaping the thread surfaces a traceback rather than silently skipping teardown. `cleanup()` itself terminates the instance via `terminate_instances` with each stage individually guarded - so a console-capture failure cannot abort the terminate call. + +--- + +## 6. Storage-cost controls + +### 6a. CloudWatch log retention + +`aws_cloudwatch_manager.py:create` always sets a retention policy on log-group creation, defaulting to 7 days if none is specified: + +```python +retention_days = resource_config.get("retention_days", 7) +self.client.put_retention_policy(logGroupName=resource_name, retentionInDays=retention_days) +``` + +The example config (`examples/aws/config.json`) sets per-group retention explicitly: +- `/ecs/kernel-ci-exampleuser-task` -> `retention_days: 7` +- `/ec2/kernel-ci-exampleuser-vms` -> `retention_days: 3` + +So VM console/SSM logs (the higher-volume `/ec2/` group) expire after 3 days; orchestrator logs after 7. Without an explicit value, the code default of 7 applies. (The example config's `region` is `eu-west-2`; VM sizing is `instance_type: c5a.4xlarge`, `root_volume_size: 40`, `max_runtime: 3600`.) + +### 6b. EBS + +Covered by section 1: `DeleteOnTermination=True` means root volumes never outlive their instance. + +--- + +## 7. Manual reconciliation: `setup_cleanup.py` (read-only by default) + +`setup_cleanup.py:main` is the out-of-band janitor for cleaning up by resource prefix. It is read-only unless `--delete` is passed; `--delete` is an opt-in `store_true`, and without it the tool only lists what it found and prints "Run with --delete to remove these resources". The prefix `base` is derived as `args.prefix.rstrip("-")`. With `--delete`, it sweeps: + +| Resource | Match criteria | Delete action | +|---|---|---| +| EC2 instances | `tag:Name` in `["*", "kernel-ci-test-*"]` AND state in `pending/running/stopping/stopped` | `terminate_instances` | +| ECS tasks | RUNNING tasks in `-cluster` | `stop_task` each | +| ECS cluster | `-cluster` if ACTIVE | `delete_cluster` | +| Task-def families | `familyPrefix=`, status ACTIVE | deregister all ACTIVE revisions | +| IAM role | `-ecs-role` AND default `ecsTaskExecutionRole` | detach managed + delete inline policies, remove/delete instance profile, delete role | +| ECR repo | `-ecr` AND default `kernel-ci-test` | `delete_repository(force=True)` | +| Log groups | prefixes `/ecs/` and `/ec2/` | `delete_log_group` each | +| S3 buckets | name starts with `` | empty all objects, then `bucket.delete()` | + +Two "default-name extras" are swept beyond the prefixed names: the IAM role `ecsTaskExecutionRole` and the ECR repo `kernel-ci-test`. This tool is the safety net for resources left behind by an aborted setup or a run that escaped both the guest watchdogs and the pipeline sweep. + +> Production-safety note: `setup_cleanup.py --delete` performs irreversible deletions (terminating instances, deleting clusters/roles/repos, emptying and deleting S3 buckets). Treat any prefix you cannot positively identify as non-production as production, and run list-only (no `--delete`) first to review the matched set. + +--- + +## 8. Idempotent provisioning (avoids duplicate-resource cost) + +Resource managers extend `base_resource_manager.py:ensure_exists`, which is idempotent: it calls `check_exists` first and returns `(identifier, False)` without creating anything if the resource already exists; recreation is opt-in via `force_recreate` (default `False`), which only deletes-then-recreates when the resource both exists and a `delete_*` hook is present. This prevents the slow leak of duplicated clusters, roles, log groups, and repositories across repeated setup runs. + +--- + +## Summary of layers + +| Layer | Trigger it covers | Mechanism | Result | +|---|---|---|---| +| `InstanceInitiatedShutdownBehavior="terminate"` + `DeleteOnTermination=True` | any guest OS shutdown | EC2 instance + EBS deleted | no orphan compute/disk | +| UserData watchdog (`max_runtime + 600`) | orchestrator dies before SSM | `shutdown -h now` -> terminate | self-healing VM | +| Test-client watchdog (`SAFETY_TIMEOUT = max_runtime`) | test hangs mid-run | `shutdown -h now` -> terminate | bounded active run | +| `shutdown +5` fallback | script exits without terminating | delayed `shutdown` -> terminate | post-completion backstop | +| `execute_test_via_ssm` `min(max_runtime+3600, 43200)` | SSM stuck | command timeout + cancel + console capture | bounded orchestrator wait | +| `wait_for_task_completion` (30/600/3600) | crash / hang / overall timeout | `stop_task` + raise | early abort of wedged task | +| `run_pipeline` `finally` sweep | normal end + any exception | `stop_task` + tagged `terminate_instances` | guaranteed per-run teardown | +| Thread `finally` + guard | worker thread crash | `cleanup()` -> `terminate_instances` | no silent per-VM leak | +| CloudWatch retention (7 / 3) | log accumulation | `put_retention_policy` | bounded log storage | +| `setup_cleanup.py --delete` | escaped leaks / aborted setup | prefix sweep (read-only by default) | manual reconciliation | +| `ensure_exists` idempotency | repeated setup | check-before-create | no duplicate resources |