How to Communicate During a Service Outage (Without Making It Worse)

An outage hurts once. Poor communication during that outage can hurt for months.

A 2023 PagerDuty and Dimensional Research study found that 62% of customers reduced usage or stopped using a service after an outage where communication was poor, compared to 28% who reduced usage after an outage where communication was handled well. The outage itself is less corrosive to trust than the silence or vagueness that surrounds it.

This guide covers the strategy behind outage communication: timing, tone, what to say to each audience, and the mistakes that turn a manageable incident into a customer retention problem.

The Core Principle: Communicate Before You Know the Cause

The instinct during an active incident is to wait until you understand what happened before saying anything. That instinct costs you.

Customers who see no status update for 20 minutes during an outage draw one of three conclusions:

You don't know it's happening
You know and don't care
You're hiding something

Usually none of these is true, but silence invites those interpretations regardless.

Post an "Investigating" update to your status page within 5 minutes of confirming an outage. It does not need to explain the cause. It needs to confirm that you know and that your team is working. That single post can reduce support ticket volume during the incident by as much as 60-80%, and it changes the customer's psychological frame from "they're ignoring this" to "they're on it."

Post first. Investigate simultaneously.

The Four Communication Audiences

Different people need different information during an outage. Conflating them causes problems.

Audience	What they need	Channel	Frequency
Customers	Status and impact	Public status page	Every 15-20 min
Stakeholders (internal)	Situation awareness	Private Slack channel	Every 15-20 min
Responders	Technical details, coordination	Incident channel	Continuous
Support team	What to tell customers	Dedicated thread	On change

The biggest mistake: letting these audiences mix. When executives join the incident channel and ask for status every 10 minutes, they pull responders' attention away from the fix. When customers see internal technical details, they get confused or alarmed.

Keep the channels separate. Designate one person as the communications lead whose job is updating the external channels so responders can focus on the technical problem.

Timing: The Communication Schedule

Customers do not expect instant resolution. They expect to be kept informed. An outage that lasts 90 minutes with clear updates every 20 minutes generates far fewer complaints than an outage that lasts 30 minutes with complete silence.

The update schedule for a live incident

Time	What to post
T+5 min	"Investigating" - you know about it, you're working on it
T+20 min	Update: what you've found (even if nothing yet)
T+40 min	Update: progress or new findings
Every 20 min after	Update until resolved

Commit to the next update time in every post. "Next update in 20 minutes" sets an expectation. Missing it tells customers nobody is watching the clock. Hitting it repeatedly - even with "still investigating" - builds trust faster than intermittent posts with more information.
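The schedule above is simple enough to compute rather than remember mid-incident. The following is a minimal sketch, assuming the clock starts when the outage is confirmed; the function names `update_schedule` and `is_overdue` and the example timestamp are hypothetical, not part of any status page tool.

```python
from datetime import datetime, timedelta, timezone

FIRST_POST = timedelta(minutes=5)     # "Investigating" deadline (T+5)
UPDATE_EVERY = timedelta(minutes=20)  # cadence for every later update


def update_schedule(confirmed_at: datetime, count: int) -> list[datetime]:
    """Return the first `count` posting deadlines: T+5, T+20, T+40, T+60, ..."""
    deadlines = [confirmed_at + FIRST_POST]
    deadlines += [confirmed_at + UPDATE_EVERY * i for i in range(1, count)]
    return deadlines[:count]


def is_overdue(next_promised: datetime, now: datetime) -> bool:
    """True when the update promised in the previous post has been missed."""
    return now > next_promised


# Hypothetical incident confirmed at 14:30 UTC.
confirmed = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
for deadline in update_schedule(confirmed, 4):
    print(deadline.strftime("%H:%M UTC"))
# prints 14:35 UTC, 14:50 UTC, 15:10 UTC, 15:30 UTC (one per line)
```

A check like `is_overdue` could drive a reminder for the communications lead, so that a promised "next update" time is never missed silently.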

What to Say at Each Stage

Investigating (T+5, cause unknown)

What to include:

Which service or feature is affected
What users experience (errors, slow responses, unavailability)
That your team is actively investigating
When the next update will come

What to leave out:

The cause (you don't know it yet)
Internal system names or technical jargon
Speculation about what might be wrong

Example:

We are investigating reports of errors affecting [service]. Some users may be unable to [specific action]. Our engineers are actively working on this.
Next update by [time].
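One way to keep an unfilled placeholder from reaching a published post is to fill the template in code. The sketch below is an illustration under assumptions: the function name `investigating_post` and the placeholder names `service`, `action`, and `next_update` are hypothetical, and the 20-minute default mirrors the update schedule described earlier.

```python
from datetime import datetime, timedelta, timezone
from string import Template

# Placeholder names are illustrative, not taken from any status page product.
INVESTIGATING = Template(
    "We are investigating reports of errors affecting $service. "
    "Some users may be unable to $action. "
    "Our engineers are actively working on this.\n"
    "Next update by $next_update."
)


def investigating_post(service: str, action: str, posted_at: datetime,
                       minutes_to_next: int = 20) -> str:
    """Fill the Investigating template and commit to the next update time."""
    next_update = posted_at + timedelta(minutes=minutes_to_next)
    # substitute() raises KeyError if any placeholder is left unfilled.
    return INVESTIGATING.substitute(
        service=service,
        action=action,
        next_update=next_update.strftime("%H:%M UTC"),
    )


posted = datetime(2024, 1, 1, 14, 35, tzinfo=timezone.utc)  # hypothetical timestamp
print(investigating_post("checkout", "complete purchases", posted))
```

Using `Template.substitute` rather than `safe_substitute` is deliberate here: a missing value fails loudly instead of publishing a literal placeholder to customers.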

Identified (cause found, fix in progress)

What to include:

A plain-language explanation of the cause
Which features are affected and which are working normally
What you're doing to fix it
Conservative time estimate if you have one

What to leave out:

Technical blame language ("a developer deployed a bad config")
Precise timeline commitments you cannot keep

Example:

We have identified the cause: [plain-language description, e.g., "a configuration change we deployed at 2:30 PM caused our database to reject new connections"]. Our team is deploying a fix now.
Affected: [feature]. Working normally: [other features].
We expect to restore service within [conservative estimate]. Next update by [time].

Specificity about the cause builds trust. "A database configuration change" is better than "an internal issue." Customers understand that systems are complex. What erodes trust is vagueness that reads as concealment.

Monitoring (fix deployed, watching for recovery)

What to include:

That the fix is deployed
That you're confirming recovery
What to do if customers still see issues

Example:

We have deployed a fix and are monitoring recovery. Most users should see service functioning normally now. If you continue to experience issues, contact [support email].
We will post a final update once recovery is confirmed.

Do not skip this stage to jump directly to Resolved. A second failure immediately after declaring resolved damages trust more than a longer "monitoring" period.
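The rule against skipping Monitoring can be enforced mechanically if your status updates go through a script. This is a rough sketch only: the `Stage` enum, the `ALLOWED` transition table, and `advance` are hypothetical names, and the fallback transitions (Monitoring back to Investigating or Identified when a fix does not hold) are an assumption rather than something this guide prescribes.

```python
from enum import Enum


class Stage(Enum):
    INVESTIGATING = "Investigating"
    IDENTIFIED = "Identified"
    MONITORING = "Monitoring"
    RESOLVED = "Resolved"


# Resolved is reachable only from Monitoring. The fallback edges out of
# Monitoring are an assumption for the case where the fix does not hold.
ALLOWED = {
    Stage.INVESTIGATING: {Stage.IDENTIFIED},
    Stage.IDENTIFIED: {Stage.MONITORING},
    Stage.MONITORING: {Stage.RESOLVED, Stage.INVESTIGATING, Stage.IDENTIFIED},
    Stage.RESOLVED: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return the new stage, or raise if the transition skips a step."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


stage = Stage.INVESTIGATING
for nxt in (Stage.IDENTIFIED, Stage.MONITORING, Stage.RESOLVED):
    stage = advance(stage, nxt)
print(stage.value)  # prints "Resolved"
```

Calling `advance(Stage.IDENTIFIED, Stage.RESOLVED)` raises `ValueError`, which is the point: the tool refuses to declare victory before a monitoring period.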

Resolved

What to include:

Confirmation that service is fully restored
Duration of the incident (start time to resolution time)
One-sentence cause explanation
A commitment to publish a postmortem (for significant incidents)

Example:

This incident is resolved. Service is fully operational as of [time] UTC.
Duration: [X hours Y minutes]. Cause: [one honest sentence].
We will publish a post-incident review within 48 hours. We apologize for the disruption.

Tone: The Three Rules

Rule 1: Specific beats vague.

"Database connection pool exhaustion caused elevated error rates for users attempting to log in" is better than "we experienced a technical issue." Vague language reads as either incompetence or concealment. Specific language, even when technical, reads as honest and competent.

Rule 2: Factual beats apologetic.

"We're so sorry for the terrible experience" reads as hollow. "Service was unavailable for 47 minutes. We've made X, Y, and Z changes to prevent this class of failure" reads as accountable. Apologize once, directly. Then focus on facts.

Rule 3: Committed beats hedged.

"We're working as hard as we can to address this" tells customers nothing. "We're deploying a rollback to restore service. Expected recovery in 20 minutes" commits to something actionable. If you miss the estimate, update immediately. Transparency about a missed estimate is better than silence.

What Not to Say

"We are experiencing some technical difficulties."

This says nothing. It signals either that you do not know what is happening or that you are not willing to share it. Both interpretations damage trust.

"This is affecting a small number of users."

Unless you have data to support this, avoid it. The customer reading the update does not know whether they are in the "small number." If they are, this reads as dismissive.

"This was due to an unprecedented situation."

Almost never true, and sounds defensive. Most outages have known causes. Own the specific cause.

"We are working around the clock."

Filler. Customers do not care about effort. They care about restoration.

Giving a timeline you cannot keep.

Missing a stated ETA without updating is worse than not giving one. If you say "resolved in 30 minutes" and post nothing for 90, you've compounded the original problem.

The Post-Incident Customer Email

Send a customer email for incidents lasting over 30 minutes with broad user impact. The status page handles real-time communication during the incident. The email handles the follow-up.

Send it within 2 hours of resolution - not the next day.
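The send-or-skip decision and the deadline can be written down as a small rule. The sketch below assumes that "broad user impact" is a human judgment passed in as a boolean; the function name `email_deadline` and the example timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_DURATION = timedelta(minutes=30)  # email only for incidents longer than this
SEND_WITHIN = timedelta(hours=2)      # deadline measured from resolution


def email_deadline(started_at: datetime, resolved_at: datetime,
                   broad_impact: bool) -> Optional[datetime]:
    """Return when the follow-up email is due, or None if no email is needed."""
    if not broad_impact or resolved_at - started_at <= MIN_DURATION:
        return None
    return resolved_at + SEND_WITHIN


# Hypothetical 47-minute incident with broad impact.
start = datetime(2024, 1, 1, 14, 30, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 15, 17, tzinfo=timezone.utc)
print(email_deadline(start, end, broad_impact=True))
```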

The email should cover:

What happened and when
Who was affected and how
What you've already done to fix it
What you're doing to prevent recurrence

Send it from the founder or CEO for major incidents, not a generic no-reply@ address. Customers who receive a personal note from the founder are far less likely to churn than customers who receive a template from a support queue.

Research from Zendesk's CX Trends report shows that customers rate companies 2.5x higher on trust when they receive proactive outage communication compared to when they find out through their own investigation.

The Most Expensive Silence: Not Having a Status Page

Teams without a status page force every outage into two channels: Twitter/social media and support tickets.

Outages generate an average of 3-5 support tickets per 100 affected users in the first 30 minutes. For a service where all 10,000 active users are affected, that is 300-500 tickets for your support team to process, each requiring an individual response, while the technical team is still fighting the incident.

A status page that customers can find and check cuts that volume by 70-80%. One URL and one post redirect the entire curiosity load away from your team.
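To see what those percentages mean for your own user base, the arithmetic can be sketched as follows. The rates are the ones quoted in this section (3-5 tickets per 100 affected users, a 70-80% reduction); the function names are hypothetical, and the result is a rough planning range, not a forecast.

```python
def expected_tickets(affected_users: int,
                     per_100: tuple[int, int] = (3, 5)) -> tuple[float, float]:
    """Low and high ticket estimates for the first 30 minutes."""
    low, high = per_100
    return affected_users * low / 100, affected_users * high / 100


def with_status_page(tickets: tuple[float, float],
                     reduction_pct: tuple[int, int] = (70, 80)) -> tuple[float, float]:
    """Apply the quoted 70-80% reduction: best case first, worst case second."""
    low, high = tickets
    min_cut, max_cut = reduction_pct
    return low * (100 - max_cut) / 100, high * (100 - min_cut) / 100


without_page = expected_tickets(10_000)       # (300.0, 500.0)
with_page = with_status_page(without_page)    # (60.0, 150.0)
print(f"{with_page[0]:.0f}-{with_page[1]:.0f} tickets")  # prints "60-150 tickets"
```

For the 10,000-user example, that is roughly 60-150 tickets instead of 300-500.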

If you don't have a status page, set one up before the next incident. Vantaj includes public status pages on every plan, including free. It takes about 3 minutes to configure one.

The Communication Checklist

For any SEV-1 or significant SEV-2:

Status page "Investigating" posted within 5 minutes of confirmed outage
Dedicated incident channel created in Slack/Teams
Support team briefed on what to tell customers
Status page updated every 20 minutes until resolved
Status page "Resolved" with duration and cause
Customer email sent within 2 hours of resolution (if impact was broad)
Postmortem scheduled within 48 hours