docs: add Terraform drift detection tutorial #300
Changes from all commits
29e14ed
fb8f65e
a3743c5
825d08c
ac91e5b
2af8ba4
f130222
7a92f41
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| --- | ||
| title: Detecting non-Terraform changes | ||
| description: Detect infrastructure changes made outside Terraform — console, API, or CLI edits — with a scheduled plan whose result is attested into a Kosli Environment. | ||
|
gsavage marked this conversation as resolved.
|
||
| --- | ||
|
|
||
| <Tooltip tip="Drift occurs when infrastructure diverges from the desired state defined in your version-controlled Terraform config.">Terraform drift</Tooltip> comes in two distinct types, and each is invisible to a detector built for the other: | ||
|
|
||
| 1. **Unexpected statefile changes** — someone runs `terraform apply` outside your pipeline, so the statefile and the world still agree and a plan comes back empty. See [Detecting unexpected statefile changes](/tutorials/detecting_unexpected_statefile_changes). | ||
| 2. **Non-Terraform changes** — someone edits the world directly via the cloud console, API, or CLI: a hotfix in the console, a partial apply failure, an out-of-band automation. Reality no longer matches the statefile, so a `terraform plan` catches it. This page covers detecting this type. | ||
|
|
||
| Both pages implement Kosli's [Drift Detection](https://sdlc.kosli.com/controls/runtime/drift_detection/) control (SDLC-CTRL-0018), a detective control that mitigates configuration drift risk under our secure SDLC framework. | ||
|
gsavage marked this conversation as resolved.
Comment on lines
+6
to
+11
Contributor
Suggestion: The intro block (lines 6–11) is nearly identical across both new pages. Fine for standalone reading, but if maintenance surface is a concern, a |
||
|
|
||
| ## How the detection works | ||
|
|
||
| The detector is a scheduled `terraform plan` against the last-applied git SHA, with the result recorded in a small marker file that Kosli watches for tampering: | ||
|
|
||
| - **At apply time**, the pipeline writes a fresh marker — `drift.plan.json`, stored next to the statefile — recording the applied SHA with `drift: false`, and attests it into your [Kosli Environment](/getting_started/environments): | ||
|
|
||
| ```json | ||
| { | ||
|   "sha": "abc123def456...", | ||
|   "drift": false | ||
| } | ||
| ``` | ||
|
|
||
| - **On a schedule**, the detector reads the marker, checks out the recorded SHA, and runs a read-only plan. The cleanest machine-readable signal is the plan exit code: | ||
|
|
||
| ```shell | ||
| terraform plan -input=false -lock=false -detailed-exitcode -no-color -out=tfplan | ||
| # exit 0 -> no changes (no drift) | ||
| # exit 2 -> changes present (DRIFT) | ||
| # exit 1 -> error | ||
| ``` | ||
|
|
||
| `-lock=false` means the read-only drift plan never contends with a real apply; `-input=false` means it can never hang waiting for a prompt. | ||
|
|
||
| - **When drift is found**, the detector overwrites the marker in S3 with `{sha, drift: <timestamp>}` — fresh, un-attested content. On its next snapshot, the Kosli reporter Lambda sees a marker that no longer matches its attestation, and the Environment reports itself as **non-compliant**. | ||
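As a rough sketch of the detector step described above, the exit status of `terraform plan -detailed-exitcode` can be mapped to an outcome, and a drift marker body built from it. The function names `classify_plan_exit` and `drift_marker_json`, and the choice to store the timestamp as a quoted string, are assumptions made for illustration; the actual `detect-drift.yml` workflow may do this differently.

```shell
# Hypothetical helpers for illustration; not taken from kosli-dev/tf.

# Map the exit status of `terraform plan -detailed-exitcode` to an outcome.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;  # no changes, so no drift
    2) echo "drift" ;;  # changes present, so drift
    *) echo "error" ;;  # 1 (or anything unexpected): the plan itself failed
  esac
}

# Build the marker body written when drift is found.
# Args: applied git SHA, timestamp.
# Assumption: the timestamp is stored as a quoted string.
drift_marker_json() {
  printf '{"sha": "%s", "drift": "%s"}\n' "$1" "$2"
}
```

In a CI step the exit status has to be captured explicitly, for example with `rc=0; terraform plan ... || rc=$?`, because GitHub Actions runs `run` steps with `bash -e` by default on Linux runners, and an exit status of 2 would otherwise abort the step before it can be classified.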
|
|
||
| <Tip> | ||
| The detector never calls the Kosli API. It just rewrites the marker in S3; the reporter Lambda does the detection on its next snapshot. Detection and evidence stay decoupled — fewer moving parts, one less credential in the detector, and a single place (the Environment) that tells you whether the world still matches what was approved. The Environment's compliance state, backed by attested artifacts linked to the git SHA that produced them, is exactly the kind of evidence an auditor wants for SOC 2 (CC7.2, CC8.1) and NIST SP 800-53 (CM-2, CM-3, SI-7). | ||
| </Tip> | ||
|
|
||
| ## Plan against the applied SHA, not against `main` | ||
|
|
||
| Planning against `main` is the single most common source of false positives. If changes are merged to `main` but not yet applied (because the apply is gated behind a manual approval, or batched into a release), a plan against `main` comes back non-empty because of pending intentional changes, not drift. The marker exists precisely to record the *applied* SHA, and the detector always checks out that commit before planning. | ||
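A minimal sketch of reading the applied SHA back out of the marker before planning. The helper name `marker_sha` is hypothetical, and the `sed` expression assumes the small flat JSON object shown earlier; a real detector would more likely use a proper JSON parser such as `jq`.

```shell
# Print the "sha" value from a marker file.
# Hypothetical helper; assumes a small flat JSON marker with a quoted "sha".
marker_sha() {
  sed -n 's/.*"sha"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Possible use inside the detector, before running the plan:
#   git checkout --detach "$(marker_sha drift.plan.json)"
```

Checking out the recorded commit in detached-HEAD mode keeps the plan tied to what was actually applied, regardless of what has since been merged.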
|
|
||
| ## Latch, don't spam | ||
|
|
||
| Once drift is flagged, you usually don't want to re-plan and re-alert every cycle until someone acts. The marker doubles as a latch: the detector only plans while `drift` is `false`, and the next successful apply writes a fresh `{sha, drift: false}` marker to reset it. | ||
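The latch can be sketched as a single guard in front of the plan. The helper name `marker_is_clean` is hypothetical, and the pattern match assumes the flat marker format shown earlier; it is a sketch of the idea, not the workflow's actual check.

```shell
# Succeed (exit 0) only when the marker still says `"drift": false`.
# Once the detector has replaced `false` with a timestamp, this fails
# and the plan is skipped until the next apply resets the marker.
# Hypothetical helper; assumes a small flat JSON marker.
marker_is_clean() {
  grep -Eq '"drift"[[:space:]]*:[[:space:]]*false' "$1"
}

# Possible use inside the detector:
#   marker_is_clean drift.plan.json || { echo "drift already flagged"; exit 0; }
```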
|
|
||
| ## Prerequisites | ||
|
|
||
| - Terraform is applied through CI/CD, not from laptops, as the normal path — with remote, locked state (for example, an S3 backend with the native S3 lockfile or DynamoDB). | ||
| - Keyless CI authentication to your cloud (for example, GitHub OIDC) with a dedicated, read-capable role for the detector. The detector never needs apply permissions. | ||
| - A [Kosli account and API token](/getting_started/authenticating_to_kosli). | ||
| - A Kosli [Environment](/getting_started/environments) for each Terraform environment you want to protect. | ||
| - The Kosli reporter Lambda deployed to snapshot the drift marker (and statefile) into that Environment on a schedule. | ||
|
|
||
| <Warning> | ||
| Drift detection on top of an undisciplined apply process produces mostly noise. Fix the pipeline first. | ||
| </Warning> | ||
|
|
||
| ## Setting it up with `kosli-dev/tf` | ||
|
|
||
| Everything above is implemented at [github.com/kosli-dev/tf](https://github.com/kosli-dev/tf): a thin Terraform wrapper (`tf`) and a set of reusable GitHub Actions workflows, both open source under the MIT license. Two of the workflows carry this control: | ||
|
|
||
| - **`apply.yml`** — the plan steps plus `tf apply`, then a reset-drift-detection job that writes a fresh `{sha, drift: false}` marker to S3 (the known-good baseline for the next drift run) and attests it, along with the plan, apply log, and statefile, into your Kosli Environment. See [Detecting unexpected statefile changes](/tutorials/detecting_unexpected_statefile_changes) for the caller workflow and flow template — the same apply setup covers both drift types. | ||
| - **`detect-drift.yml`** — the detector. Reads the baseline marker, and only if `drift == false` runs a plan against the baseline SHA. A non-empty plan overwrites the marker with `{sha, drift: <timestamp>}`; otherwise it records a no-drift summary. | ||
|
|
||
| A scheduled caller that runs the detector (use a matrix to fan out across environments): | ||
|
|
||
| ```yaml | ||
| name: Drift | ||
| on: | ||
|   schedule: | ||
|     - cron: "*/15 * * * *" | ||
|   workflow_dispatch: | ||
|
|
||
| jobs: | ||
|   drift: | ||
|     uses: kosli-dev/tf/.github/workflows/detect-drift.yml@main | ||
|     permissions: | ||
|       id-token: write | ||
|       contents: write | ||
|     with: | ||
|       aws_region: eu-west-1 | ||
|       aws_role_arn: arn:aws:iam::111122223333:role/my-role | ||
|       environment: production | ||
| ``` | ||
|
|
||
| ## Hardening | ||
|
|
||
| A detector that runs once and alerts once is easy. A detector you can depend on for an audit needs to handle the failure modes below. | ||
|
|
||
| <AccordionGroup> | ||
| <Accordion title="Monitor the monitor" icon="heart-pulse"> | ||
| This is the most dangerous failure mode. If the scheduled job silently stops running, no new evidence arrives to contradict the last result — so the environment looks green forever, even as drift accumulates. Treating "the dashboard is green" as proof of cleanliness, without also verifying the underlying job is running on schedule, is a misuse of the control. Add a heartbeat or alert on "job has not run in N intervals" for both the detector workflow and the reporter Lambda. | ||
| </Accordion> | ||
|
|
||
| <Accordion title="Terraform-managed resources only" icon="filter"> | ||
| `terraform plan` can only see resources Terraform manages. A resource created entirely outside Terraform — say, an IAM user added by hand in the console with no corresponding Terraform resource — is invisible to this control. Closing that gap is the job of an Infrastructure-as-Code coverage policy (everything in production must be defined as code in the first place); drift detection assumes that policy holds and does not substitute for it. | ||
| </Accordion> | ||
|
|
||
| <Accordion title="Tune cadence per environment" icon="gauge"> | ||
| Worst-case detection latency is the check interval **plus the reporter Lambda's snapshot interval**. A ten-minute check with a five-minute reporter Lambda surfaces drift within fifteen minutes. Set the schedule from each environment's rate-of-change and blast radius rather than using one global value. | ||
| </Accordion> | ||
|
|
||
| <Accordion title="Concurrency and least privilege" icon="lock"> | ||
| Guard against overlapping runs for the same environment with a concurrency group. Scope the detector's cloud role tightly: it needs to read state and plan, plus write the marker file — nothing more. It must never hold apply permissions. | ||
| </Accordion> | ||
| </AccordionGroup> | ||
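Two of the hardening points above reduce to simple arithmetic, sketched here. The function names and the epoch-seconds interface are assumptions for illustration; where the last-run timestamp comes from (a CI API, a metric, a log) is left open.

```shell
# Worst-case detection latency: check interval plus reporter snapshot interval.
# Args: check interval, snapshot interval (same unit, for example minutes).
worst_case_latency() {
  echo $(( $1 + $2 ))
}

# Succeed (exit 0) when a job has missed more than N intervals.
# Args: last_run_epoch now_epoch interval_seconds max_missed_intervals
job_is_stale() {
  [ $(( $2 - $1 )) -gt $(( $3 * $4 )) ]
}

# Possible use in a heartbeat check (last_run is hypothetical input):
#   job_is_stale "$last_run" "$(date +%s)" 900 2 && echo "detector has stopped running"
```

With a ten-minute check and a five-minute reporter, `worst_case_latency 10 5` gives the fifteen minutes quoted above.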
|
|
||
| ## Implementation checklist | ||
|
|
||
| - [ ] Terraform is applied through CI/CD, with remote, locked state. | ||
| - [ ] Each apply writes a fresh `{sha, drift: false}` marker and attests it into a Kosli Environment. | ||
| - [ ] A scheduled job plans against the applied SHA — not against `main` — using a read-only, lock-free plan. | ||
| - [ ] A non-empty plan overwrites the marker; the result latches until the next apply resets it. | ||
| - [ ] The Kosli reporter Lambda snapshots the marker from S3 into the Environment on a schedule. | ||
| - [ ] Both the detector workflow and the reporter Lambda are monitored for silent failure. | ||
| - [ ] The detector's cloud role can read state, plan, and write the marker file, but never apply. | ||
| - [ ] Cadence and concurrency are tuned per environment. | ||
|
|
||
| ## Related | ||
|
|
||
| - [Drift Detection (SDLC-CTRL-0018)](https://sdlc.kosli.com/controls/runtime/drift_detection/) — the control both drift-detection tutorials implement. | ||
| - [Detecting unexpected statefile changes](/tutorials/detecting_unexpected_statefile_changes) — the other drift type: out-of-CI applies a plan can *never* catch. | ||
| - [`kosli-dev/tf`](https://github.com/kosli-dev/tf) — the reference wrapper and reusable workflows. | ||
| - [Environments](/getting_started/environments) — the Kosli primitive that carries the compliance signal. | ||
| - [Terraform `-detailed-exitcode`](https://developer.hashicorp.com/terraform/cli/commands/plan#detailed-exitcode) and [remote/locked S3 backends](https://developer.hashicorp.com/terraform/language/backend/s3). | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,134 @@ | ||
| --- | ||
| title: Detecting unexpected statefile changes | ||
| description: Detect Terraform applies that bypass CI by attesting statefile provenance into a Kosli Environment — the class of drift a scheduled plan can never see. | ||
|
Contributor
Suggestion: Same as the sibling page: an |
||
| --- | ||
|
|
||
| <Tooltip tip="Drift occurs when infrastructure diverges from the desired state defined in your version-controlled Terraform config.">Terraform drift</Tooltip> comes in two distinct types, and each is invisible to a detector built for the other: | ||
|
|
||
| 1. **Unexpected statefile changes** — someone runs `terraform apply` outside your pipeline. A laptop apply updates the statefile and the world *together*, so they still agree and a scheduled `terraform plan` comes back empty. This page covers detecting this type. | ||
| 2. **Non-Terraform changes** — someone edits the world directly via the cloud console, API, or CLI, so reality no longer matches the statefile. See [Detecting non-Terraform changes](/tutorials/detecting_non_terraform_changes). | ||
|
|
||
| Both pages implement Kosli's [Drift Detection](https://sdlc.kosli.com/controls/runtime/drift_detection/) control (SDLC-CTRL-0018), a detective control that mitigates configuration drift risk under our secure SDLC framework. | ||
|
|
||
| ## Why a plan can never catch this | ||
|
|
||
| `terraform plan` compares the statefile to the world. An out-of-CI apply changes both in lockstep, so the comparison stays clean — the plan is structurally blind to it. What *has* changed is the statefile itself: it is now a file your pipeline never produced. Detecting that requires a record of where every statefile came from — a provenance system. | ||
|
|
||
| ## How Kosli detects it | ||
|
|
||
| The mechanism is **attestation plus continuous reporting** against a [Kosli Environment](/getting_started/environments): | ||
|
|
||
| - **At apply time**, the pipeline attests the Terraform statefile as an artifact into the Kosli Environment. Attestation fingerprints the file and records that your pipeline produced it, linked to the git SHA — establishing its provenance. | ||
| - **Continuously**, a scheduled Kosli reporter Lambda snapshots the live statefile from S3 into the same Environment. The Environment's policy requires every artifact to have known provenance. | ||
|
|
||
| The moment the statefile in S3 no longer matches an attestation — because an apply outside CI rewrote it — the next snapshot sees an unrecognized artifact and the Environment reports itself as **non-compliant**. No scheduled plan is involved; the reporter Lambda detects this entirely on its own. | ||
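The comparison at the heart of this can be sketched locally. This assumes the fingerprint is a SHA-256 digest of the file contents and that the known fingerprints are available as a plain list, one per line; both helper names are hypothetical, `sha256sum` is the GNU coreutils tool (macOS would need `shasum -a 256`), and Kosli performs the real matching server-side.

```shell
# SHA-256 of a file, hex-encoded.
# Assumption: the fingerprint is the digest of the raw file contents.
file_fingerprint() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# Succeed (exit 0) if the fingerprint appears, as a whole line,
# in a file listing attested fingerprints.
# Args: fingerprint, path to the list of attested fingerprints.
fingerprint_is_attested() {
  grep -Fxq "$1" "$2"
}
```

An out-of-CI apply changes the statefile's bytes, so its fingerprint changes and no longer appears in the attested list, even though a plan against the new state would come back clean.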
|
|
||
| <Tip> | ||
| "The plan was clean yesterday" is not evidence that the environment is clean today. A green dashboard can be stale (the job stopped running) or unverifiable (who can prove the statefile wasn't swapped?). Only a **current, verifiable** signal counts as evidence — which is why the signal here is a Kosli Environment's compliance state, backed by attested artifacts. That is exactly the kind of evidence an auditor wants for SOC 2 (CC7.2, CC8.1) and NIST SP 800-53 (CM-2, CM-3, SI-7). | ||
| </Tip> | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - Terraform is applied through CI/CD, not from laptops, as the normal path — with remote, locked state (for example, an S3 backend with the native S3 lockfile or DynamoDB). | ||
| - Keyless CI authentication to your cloud (for example, GitHub OIDC). | ||
| - A [Kosli account and API token](/getting_started/authenticating_to_kosli). | ||
| - A Kosli [Environment](/getting_started/environments) for each Terraform environment you want to protect. | ||
| - The Kosli reporter Lambda deployed to snapshot the statefile into that Environment on a schedule. | ||
|
|
||
| ## Setting it up with `kosli-dev/tf` | ||
|
|
||
| Everything above is implemented at [github.com/kosli-dev/tf](https://github.com/kosli-dev/tf): a thin Terraform wrapper (`tf`) and a set of reusable GitHub Actions workflows, both open source under the MIT license. You can call the workflows directly, or borrow their shape for your own CI. | ||
|
|
||
| ### The wrapper | ||
|
|
||
| `tf` is a drop-in replacement for `terraform` that removes the manual bookkeeping. It selects the correct `-var-file` for your active AWS profile and region, and injects the S3 backend config so you never hand-manage it. The backend is derived deterministically: | ||
|
|
||
| ```text | ||
| bucket = terraform-state-<sha1(account_id-region)> | ||
| key = terraform/<repo>/<state_file_name> # default: main.tfstate | ||
| region = <region> encrypt = true # native S3 lockfile by default | ||
| ``` | ||
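A sketch of how such a deterministic bucket name could be derived. The exact hash input is an assumption: this version hashes the string `<account_id>-<region>` with no trailing newline and uses the full 40-character hex digest; the real `tf` wrapper may format or truncate it differently. The helper name `state_bucket_name` is hypothetical.

```shell
# Derive a state bucket name from an AWS account id and region.
# Assumptions: input is "<account_id>-<region>" without a trailing newline,
# and the full hex SHA-1 digest is used as the suffix.
state_bucket_name() {
  printf 'terraform-state-%s\n' "$(printf '%s-%s' "$1" "$2" | sha1sum | cut -d ' ' -f 1)"
}

# Possible use:
#   state_bucket_name 111122223333 eu-west-1
```

Because the name depends only on the account and region, every repository and every CI run resolves the same bucket without any stored configuration.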
|
|
||
| `tf plan` saves a binary plan for later inspection; `tf apply` appends `-auto-approve` (the plan has already been reviewed, and CI has no interactive prompt). Locally you wrap it in your credential helper, for example `aws-vault exec staging -- tf plan`. | ||
|
|
||
| ### The apply workflow | ||
|
|
||
| The reusable `apply.yml` workflow runs the plan steps plus `tf apply`, then attests the plan, the apply log, and the statefile into your Kosli Environment. A caller workflow that applies on merge: | ||
|
|
||
| ```yaml | ||
| name: Apply | ||
|
|
||
| on: | ||
|   push: | ||
|     branches: [main] | ||
|
|
||
| jobs: | ||
|
gsavage marked this conversation as resolved.
|
||
|   apply: | ||
|     uses: kosli-dev/tf/.github/workflows/apply.yml@main | ||
|     permissions: | ||
|       id-token: write | ||
|       contents: write | ||
|       pull-requests: read # needed by the PR-attestation step | ||
|     with: | ||
|       aws_region: eu-west-1 | ||
|       aws_role_arn: arn:aws:iam::111122223333:role/my-role | ||
|       environment: production | ||
|       tf_version: v1.14.6 | ||
|       kosli_template_file: kosli-apply-template.yml | ||
|     secrets: | ||
|       kosli_api_token: ${{ secrets.KOSLI_API_TOKEN }} | ||
|       kosli_github_token: ${{ secrets.GITHUB_TOKEN }} | ||
| ``` | ||
|
|
||
| The Kosli [flow template](/template-reference/flow_template) declares every attestation and artifact the workflow emits: | ||
|
|
||
| ```yaml | ||
| # kosli-apply-template.yml | ||
| version: 1 | ||
| trail: | ||
|   attestations: | ||
|     - name: terraform-plan | ||
|       type: generic | ||
|     - name: terraform-apply | ||
|       type: generic | ||
|   artifacts: | ||
|     - name: terraform-state | ||
|     - name: drift-plan | ||
| ``` | ||
|
|
||
| <Info> | ||
| The `drift-plan` artifact belongs to the second drift type — the marker file used by the scheduled plan loop in [Detecting non-Terraform changes](/tutorials/detecting_non_terraform_changes). The same apply workflow attests both, so one setup covers both types. | ||
| </Info> | ||
|
|
||
| ## What a detection looks like | ||
|
|
||
| Someone runs `terraform apply` from a laptop. The statefile in S3 is rewritten with content your pipeline never attested. On its next snapshot the Kosli reporter Lambda finds an artifact with no known provenance, and the Environment turns non-compliant. The Environment's snapshot history shows exactly when the unrecognized statefile appeared and what its fingerprint is — a concrete starting point for the investigation, and a durable record for the audit trail. | ||
|
|
||
| ## Hardening | ||
|
|
||
| <AccordionGroup> | ||
| <Accordion title="Monitor the monitor" icon="heart-pulse"> | ||
| This is the most dangerous failure mode. If the reporter Lambda silently stops running, no new evidence arrives to contradict the last snapshot — so the Environment looks green forever, even as unattested statefiles accumulate. Treating "the dashboard is green" as proof of cleanliness, without also verifying the Lambda is running on schedule, is a misuse of the control. Add a heartbeat or alert on "no snapshot in N intervals". | ||
| </Accordion> | ||
|
|
||
| <Accordion title="Least privilege" icon="lock"> | ||
| The reporter Lambda needs read access to the state bucket and the ability to report snapshots to Kosli — nothing more. It must never hold apply permissions. | ||
| </Accordion> | ||
| </AccordionGroup> | ||
|
|
||
| ## Implementation checklist | ||
|
|
||
| - [ ] Terraform is applied through CI/CD, with remote, locked state. | ||
| - [ ] Each apply attests the statefile (plus plan and apply log) into a Kosli Environment. | ||
| - [ ] The Kosli reporter Lambda snapshots the live statefile from S3 into the Environment on a schedule. | ||
| - [ ] The Environment's policy requires known provenance for every artifact. | ||
| - [ ] The reporter Lambda is monitored for silent failure (heartbeat / not-run alert). | ||
| - [ ] Snapshot cadence is tuned per environment. | ||
|
|
||
| ## Related | ||
|
|
||
| - [Drift Detection (SDLC-CTRL-0018)](https://sdlc.kosli.com/controls/runtime/drift_detection/) — the control both drift-detection tutorials implement. | ||
| - [Detecting non-Terraform changes](/tutorials/detecting_non_terraform_changes) — the other drift type: console and API edits a plan *can* catch. | ||
| - [`kosli-dev/tf`](https://github.com/kosli-dev/tf) — the reference wrapper and reusable workflows. | ||
| - [Environments](/getting_started/environments) — the Kosli primitive that carries the compliance signal. | ||
| - [Flow template reference](/template-reference/flow_template) — declaring attestations and artifacts. | ||
Improvement: The redirect from the intermediate URL (/tutorials/terraform_drift_detection) lands on the "non-Terraform changes" page. A reader who bookmarked the old single-page tutorial might expect to see the statefile-provenance content too. Consider whether a short landing page or a redirect to a parent group would be friendlier; at minimum, the target page's intro already cross-links to the sibling, which mitigates this. Noting for awareness.