03 — Exchange / Platform Operational Outage
Trigger: Trading, deposits, withdrawals, login, or another core service is degraded or unavailable for users. Includes API outages affecting integrators.
First 30 minutes
Outages are the most frequent crisis type and the easiest to mishandle. The asymmetry: a brief outage handled well builds trust; a brief outage handled badly erodes trust more than the outage itself does.
- Confirm scope. Is it deposits, withdrawals, trading, login, or all? Some users? All users? Some regions? Your status page should already have this; if it doesn’t, that’s the first fix.
- Open the hotline. “Outage: [scope]. Started: [time]. ETA to resolution: [if known, else ‘investigating’].”
- Update status page within 5 minutes. Even if you don’t know the cause. “We are aware of [scope] and investigating.” Status pages exist to be updated, not to be perfect.
- Pause user-facing notifications until you have something true to say. Don’t email users every 10 minutes during a fast-moving incident.
- Engage support team with current scope. They are getting tickets within minutes. Same script as status page.
- Notify integrators (CEX listings, payment processors, on/off ramp partners). Their teams need to know what to tell their users.
- Decide: is this a 30-minute outage or a multi-hour event? Different communication cadences. If multi-hour, escalate to a holding statement on X and the blog.
Holding statement template
For outages exceeding 60 minutes, ship a public statement (X + status page) within 90 minutes of incident start.
[TIMESTAMP — UTC]
[COMPANY] is aware of an issue affecting [SPECIFIC SCOPE — e.g., "withdrawal processing for
ERC-20 assets", "trading on the [PRODUCT] platform"]. The issue began at approximately
[INCIDENT START UTC] and is currently [ONGOING / PARTIALLY RESOLVED].
[USER ACTION REQUIRED — e.g., "No user action is required" / "Please do not retry deposits;
duplicates may occur"].
User funds are not at risk. [Optional: "[OTHER SERVICES] remain operational."]
We are working to restore service and will update at [SPECIFIC INTERVAL — usually 30-60 min].
Status page: [URL]
— [Company / Team]
What this template doesn’t say:
- “Apologies for the inconvenience” in the opening; do that at the end, if at all. Action-first language.
- A specific cause. You’ll be wrong half the time; just describe scope.
- A specific ETA unless you actually know one. “Investigating” is more honest than “we’ll have it back in 30 min” if you don’t actually know.
- “Some users may experience”: vague language is the hallmark of failed status comms.
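If the holding statement is filled in by tooling rather than by hand, the template above could be rendered along these lines. This is a minimal sketch, not a prescribed implementation: the function name `render_holding_statement`, its parameters, and the `funds_confirmed_safe` flag are all hypothetical, and the exact wording of the update-interval line is an assumption. It rejects timezone-naive timestamps so that every time in the output is UTC, and it emits the funds reassurance only when the caller confirms it is true.

```python
from datetime import datetime, timezone
from typing import Optional

UTC_FORMAT = "%Y-%m-%d %H:%M UTC"


def _utc(ts: datetime) -> str:
    """Render an aware datetime in UTC; reject naive values to avoid local-time mixups."""
    if ts.tzinfo is None:
        raise ValueError("timestamps must be timezone-aware")
    return ts.astimezone(timezone.utc).strftime(UTC_FORMAT)


def render_holding_statement(
    *,
    company: str,
    scope: str,
    incident_start: datetime,
    now: datetime,
    partially_resolved: bool,
    user_action: str,
    funds_confirmed_safe: bool,
    update_interval_min: int,
    status_url: str,
    signoff: str,
    other_services: Optional[str] = None,
) -> str:
    """Fill the holding statement template with scope, times, and user action."""
    state = "partially resolved" if partially_resolved else "ongoing"
    lines = [
        _utc(now),
        f"{company} is aware of an issue affecting {scope}. The issue began at "
        f"approximately {_utc(incident_start)} and is currently {state}.",
        user_action,
    ]
    # Assumption: the funds line is included only once it has been confirmed true.
    if funds_confirmed_safe:
        funds_line = "User funds are not at risk."
        if other_services:
            funds_line += f" {other_services} remain operational."
        lines.append(funds_line)
    lines.append(
        "We are working to restore service and will update every "
        f"{update_interval_min} minutes."
    )
    lines.append(f"Status page: {status_url}")
    lines.append(signoff)
    return "\n".join(lines)
```

Note that the sketch has no parameter for a cause or an ETA, which mirrors the list of things the template deliberately leaves out.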
Stakeholder cascade
For outages exceeding 60 minutes:
| # | Audience | Channel | Who | Goal |
|---|---|---|---|---|
| 1 | Internal team (eng, support, exec) | Slack #incident | Incident commander | Unified picture |
| 2 | Status page | Public status URL | Eng or DevOps | First public signal |
| 3 | Support team | Slack briefing + scripts | Support lead | Consistent ticket responses |
| 4 | Major integrators | Pre-existing partner Slack channels or email | BD/Eng lead | Avoid integrator-side panic |
| 5 | All users | In-app banner + email if outage > 4h | Comms lead | Acknowledge, no action needed |
| 6 | X (general public) | Public account | Comms lead | Status visibility, prevent rumour cycle |
| 7 | Press | Reactive only unless severe | Comms lead | Quotable on facts |
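For teams that drive notifications from an incident tool, the cascade table above can be held as ordered data. The following is a rough sketch under stated assumptions: `CascadeStep`, `CASCADE`, and `cascade_for` are hypothetical names, and the only logic encoded is what the table states (the cascade applies beyond 60 minutes, and email to all users is added only beyond 4 hours). Deciding whether an incident is "severe" enough for proactive press outreach is left to a human.

```python
from dataclasses import dataclass, replace
from datetime import timedelta
from typing import List

CASCADE_THRESHOLD = timedelta(minutes=60)  # table applies to outages exceeding 60 minutes
EMAIL_THRESHOLD = timedelta(hours=4)       # row 5: email only if outage > 4h


@dataclass(frozen=True)
class CascadeStep:
    order: int
    audience: str
    channel: str
    owner: str


CASCADE = [
    CascadeStep(1, "Internal team", "Slack #incident", "Incident commander"),
    CascadeStep(2, "Status page", "Public status URL", "Eng or DevOps"),
    CascadeStep(3, "Support team", "Slack briefing + scripts", "Support lead"),
    CascadeStep(4, "Major integrators", "Partner Slack channels or email", "BD/Eng lead"),
    CascadeStep(5, "All users", "In-app banner", "Comms lead"),
    CascadeStep(6, "X (general public)", "Public account", "Comms lead"),
    CascadeStep(7, "Press", "Reactive only unless severe", "Comms lead"),
]


def cascade_for(duration: timedelta) -> List[CascadeStep]:
    """Return the cascade steps, in order, that apply to an outage of this duration."""
    if duration <= CASCADE_THRESHOLD:
        # The table does not apply yet; the first-30-minutes checklist still does.
        return []
    steps = []
    for step in CASCADE:
        if step.audience == "All users" and duration > EMAIL_THRESHOLD:
            step = replace(step, channel="In-app banner + email")
        steps.append(step)
    return steps
```

Keeping the order as data makes it harder for a public post on X to go out before support and integrators have been briefed.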
Do
- Update the status page on a strict schedule (every 30 min for active incidents) even if there’s nothing new. “We are continuing to investigate, no further user impact observed since last update” is a useful update.
- Be specific about scope. “Trading on USD pairs is affected; trading on EUR pairs is not” beats “some trading services may be impacted.”
- Reassure on funds early and clearly if it’s true. “User funds are not at risk” near the top of the holding statement is one of the most under-used trust-building moves in crypto comms.
- Acknowledge the inconvenience at the end, briefly. Action first, apology last.
- Time-stamp everything in UTC. Local times in mixed-region products are confusing.
Don’t
- Don’t disable the status page during the incident. This sounds absurd; people do it.
- Don’t blame “third-party issues” without naming the third party. It reads as deflection.
- Don’t promise refunds or compensation in the first hour. That’s a counsel decision and a precedent decision; don’t make it on the fly.
- Don’t tweet jokes during an active outage. The brand voice that works in normal times reads as tone-deaf during impact.
- Don’t go silent for more than an hour. Even no-update updates are reassuring relative to silence.
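The two cadence rules in the lists above (an update every 30 minutes during an active incident, and never more than an hour of silence) can be checked mechanically against a log of update timestamps, either live or in the retro. This is an illustrative sketch only: `cadence_gaps` and the "late" and "silent" labels are hypothetical, and it assumes all timestamps are timezone-aware and in UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

UPDATE_INTERVAL = timedelta(minutes=30)  # "every 30 min for active incidents"
MAX_SILENCE = timedelta(hours=1)         # "Don't go silent for more than an hour"

Gap = Tuple[datetime, datetime, str]


def cadence_gaps(
    incident_start: datetime,
    updates: List[datetime],
    incident_end: datetime,
) -> List[Gap]:
    """Return (gap_start, gap_end, label) for every gap longer than the 30-minute cadence."""
    points = [incident_start, *sorted(updates), incident_end]
    gaps: List[Gap] = []
    for earlier, later in zip(points, points[1:]):
        gap = later - earlier
        if gap > MAX_SILENCE:
            gaps.append((earlier, later, "silent"))  # broke the one-hour rule
        elif gap > UPDATE_INTERVAL:
            gaps.append((earlier, later, "late"))    # missed the 30-minute schedule
    return gaps
```

An empty result means the published cadence was kept for the whole incident; any "silent" entry is worth raising explicitly in the internal retro.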
Variants
Partial outage (one chain or one region). Frame the scope precisely so unaffected users know the platform works. “Withdrawals on Base are temporarily delayed; all other networks operating normally.”
Planned maintenance overrun. When scheduled maintenance runs longer than announced, the comms shift from “outage” to “maintenance taking longer than expected.” Different framing, same cadence.
Trading halted by venue / external dependency. When the outage is upstream (oracle, RPC provider, market-data feed), name the dependency without making the partner the scapegoat.
Login / authentication outage. Particularly anxiety-producing because users worry about account security. Add explicit reassurance: “This is a system availability issue. Account credentials and security are unaffected.”
24-hour follow-up
- Post-incident report (RCA) within 5 business days. Public, technical, blameless. Customers in regulated environments may need to forward this to their auditors.
- Review whether SLA credits or compensation are due (per your terms of service). Apply automatically if your terms require it; do not make customers ask.
- Internal retro: was the status page kept current? Did the support team have what they needed? Did the comms cadence match commitments?
- Update the status-page communication template based on what worked and what didn’t.
Cross-references: 01 — Hack, 08 — Withdrawal Pause.