Incident response drill

Triage a payments outage

A deploy went live and checkout errors spiked. Use the alert feed and logs to find the root cause and pick a safe action.

Topic intro

Incident response is about fast, safe recovery. Find the first cause, stop the failure, then capture what happened. Use alerts to spot the impact and logs to confirm the cause.

Alert feed

Checkout latency spike
Severity: high
p95 jumped from 240ms to 1.8s in 10 minutes
Service: checkout

Payment errors above threshold
Severity: high
5xx rate at 7.2 percent
Service: payments

Webhook queue growing
Severity: medium
backlog 1,200 pending jobs
Service: payments

Mission brief

You are on call. Checkout is timing out and payments are failing. Work fast, keep changes low risk, and capture notes for the incident report.

Goal 1: stop the error spike
Goal 2: keep checkout available
Goal 3: write a short summary

Log explorer

Filter by service, level, or search terms.

Showing 16 entries
10:42:13 info · web · release banner enabled for version 2.3.1
10:42:19 info · api · deploy started for payments 2.3.1
10:42:52 info · payments · booting worker pool with 6 threads
10:43:03 warn · payments · webhook secret not found in env
10:43:04 error · payments · cannot verify signature, rejecting webhook
10:43:11 info · checkout · checkout request started request_id=8f1c
10:43:18 error · checkout · payment capture failed request_id=8f1c
10:43:25 warn · api · retrying payment provider request
10:43:33 error · payments · signature verification failed for event evt_7281
10:43:40 error · payments · missing webhook secret in runtime config
10:43:51 info · checkout · checkout retry scheduled request_id=8f1c
10:44:05 warn · payments · queue depth 980 jobs, oldest 6m
10:44:18 error · payments · event processing halted for invalid signature
10:44:37 info · api · feature flag payments_v2 enabled
10:45:02 warn · checkout · slow payment response 2.2s request_id=934b
10:45:16 error · payments · webhook retries exhausted for evt_7281
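
If you prefer to work outside the explorer, the same filtering by service, level, and search term can be sketched in a few lines of Python. This is a minimal illustration, not part of the lab tooling: the `LogEntry` type and the `filter_entries` and `first_problem_after` helpers are hypothetical names, and the entries are a hand-transcribed subset of the log above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogEntry:
    time: str      # HH:MM:SS, so plain string comparison preserves order
    level: str     # info, warn, or error
    service: str
    message: str


# A subset of the entries shown in the log explorer, transcribed by hand.
ENTRIES = [
    LogEntry("10:42:19", "info", "api", "deploy started for payments 2.3.1"),
    LogEntry("10:42:52", "info", "payments", "booting worker pool with 6 threads"),
    LogEntry("10:43:03", "warn", "payments", "webhook secret not found in env"),
    LogEntry("10:43:04", "error", "payments", "cannot verify signature, rejecting webhook"),
    LogEntry("10:43:18", "error", "checkout", "payment capture failed request_id=8f1c"),
    LogEntry("10:44:05", "warn", "payments", "queue depth 980 jobs, oldest 6m"),
]


def filter_entries(entries, service=None, levels=None, term=None):
    """Return the entries that match every filter that is set (hypothetical helper)."""
    result = []
    for entry in entries:
        if service is not None and entry.service != service:
            continue
        if levels is not None and entry.level not in levels:
            continue
        if term is not None and term.lower() not in entry.message.lower():
            continue
        result.append(entry)
    return result


def first_problem_after(entries, deploy_term="deploy started"):
    """Return the earliest warn/error entry logged after the first deploy entry."""
    deploys = filter_entries(entries, term=deploy_term)
    if not deploys:
        return None
    deploy_time = min(entry.time for entry in deploys)
    problems = filter_entries(entries, levels={"warn", "error"})
    later = [entry for entry in problems if entry.time > deploy_time]
    return min(later, key=lambda entry: entry.time) if later else None


if __name__ == "__main__":
    hit = first_problem_after(ENTRIES)
    print(hit.time, hit.level, hit.service, hit.message)
```

Ordering by timestamp and taking the earliest warning or error after the deploy is one simple way to separate a likely first cause from the downstream symptoms that follow it.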

Decision point

Root cause

Immediate action

Hint ladder

Hint 1

The first spike appears right after a deploy.

Hint 2

Webhook signature checks in payments start failing before the queue backs up.

Hint 3

Look for anything about a secret or config in the logs.

Artifact: write a short incident note and share it with your team.

Incident note

Write a short summary. Include the cause, the fix, and the prevention step.
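
One way to keep the note complete is to treat its three parts as required fields. The sketch below is only an illustration: the `IncidentNote` class is a hypothetical helper, and the placeholder strings in the usage example are not the drill's answer.

```python
from dataclasses import dataclass


@dataclass
class IncidentNote:
    cause: str
    fix: str
    prevention: str

    def missing_fields(self):
        """Names of the required parts that are empty or whitespace only."""
        names = ("cause", "fix", "prevention")
        return [name for name in names if not getattr(self, name).strip()]

    def render(self):
        """Return the note as three labelled lines, or raise if a part is missing."""
        missing = self.missing_fields()
        if missing:
            raise ValueError("incident note is missing: " + ", ".join(missing))
        return "\n".join([
            f"Cause: {self.cause.strip()}",
            f"Fix: {self.fix.strip()}",
            f"Prevention: {self.prevention.strip()}",
        ])


if __name__ == "__main__":
    # Placeholder text only; replace each value with your own findings.
    note = IncidentNote(
        cause="<what the logs show failed first>",
        fix="<the immediate action you took>",
        prevention="<the check that would catch this earlier>",
    )
    print(note.render())
```

Raising on an empty field is a deliberate choice here: a note without a prevention step is easy to file and hard to learn from.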