
How to Communicate During a Service Outage (Without Making It Worse)

Poor communication during an outage can do more damage than the outage itself. This guide covers timing, tone, audience, and what to say at each stage so you keep customer trust intact.

Vantaj Team · June 28, 2026 · 10 min read

An outage hurts once. Poor communication during that outage can hurt for months.

A 2023 PagerDuty and Dimensional Research study found that 62% of customers reduced usage or stopped using a service after an outage where communication was poor, compared to 28% who reduced usage after an outage where communication was handled well. The outage itself is less corrosive to trust than the silence or vagueness that surrounds it.

This guide covers the strategy behind outage communication: timing, tone, what to say to each audience, and the mistakes that turn a manageable incident into a customer retention problem.

The Core Principle: Communicate Before You Know the Cause

The instinct during an active incident is to wait until you understand what happened before saying anything. That instinct costs you.

Customers who see no status update for 20 minutes during an outage draw one of three conclusions:

  1. You don't know it's happening
  2. You know and don't care
  3. You're hiding something

Usually none of these is true, but silence invites those readings anyway.

Post an "Investigating" update to your status page within 5 minutes of confirming an outage. It does not need to explain the cause. It needs to confirm that you know and that your team is working. That single post can reduce support ticket volume during the incident by 60-80%, and it changes the customer's psychological frame from "they're ignoring this" to "they're on it."

Post first. Investigate simultaneously.

The Four Communication Audiences

Different people need different information during an outage. Conflating them causes problems.

Audience | What they need | Channel | Frequency
Customers | Status and impact | Public status page | Every 15-20 min
Stakeholders (internal) | Situation awareness | Private Slack channel | Every 15-20 min
Responders | Technical details, coordination | Incident channel | Continuous
Support team | What to tell customers | Dedicated thread | On change

The biggest mistake is letting these audiences mix. When executives join the incident channel and ask for status every 10 minutes, they pull responders' attention away from the fix. When customers see internal technical details, they get confused or alarmed.

Keep the channels separate. Designate one person as the communications lead whose job is updating the external channels so responders can focus on the technical problem.
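One way to keep the separation explicit is to write the audience table down as data that tooling can check. The sketch below is only an illustration: the `AudiencePlan` class, the dictionary keys, and both helper functions are hypothetical names, and the values are copied from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AudiencePlan:
    needs: str
    channel: str
    cadence: str


# Values copied from the audience table; the keys are illustrative labels.
AUDIENCES = {
    "customers": AudiencePlan("Status and impact", "Public status page", "Every 15-20 min"),
    "stakeholders": AudiencePlan("Situation awareness", "Private Slack channel", "Every 15-20 min"),
    "responders": AudiencePlan("Technical details, coordination", "Incident channel", "Continuous"),
    "support": AudiencePlan("What to tell customers", "Dedicated thread", "On change"),
}


def channel_for(audience: str) -> str:
    """Return the single channel on which an audience should be updated."""
    if audience not in AUDIENCES:
        raise ValueError(f"unknown audience: {audience!r}")
    return AUDIENCES[audience].channel


def shared_channels() -> set:
    """Return channels assigned to more than one audience (ideally none)."""
    seen, shared = set(), set()
    for plan in AUDIENCES.values():
        if plan.channel in seen:
            shared.add(plan.channel)
        seen.add(plan.channel)
    return shared
```

If `shared_channels()` ever returns a non-empty set, two audiences are being sent to the same place, which is the mixing problem described above.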

Timing: The Communication Schedule

Customers do not expect instant resolution. They expect to be kept informed. An outage that lasts 90 minutes with clear updates every 20 minutes generates far fewer complaints than an outage that lasts 30 minutes with complete silence.

The update schedule for a live incident

Time | What to post
T+5 min | "Investigating" - you know about it, you're working on it
T+20 min | Update: what you've found (even if nothing yet)
T+40 min | Update: progress or new findings
Every 20 min after | Update until resolved

Commit to the next update time in every post. "Next update in 20 minutes" sets an expectation. Missing it tells customers nobody is watching the clock. Hitting it repeatedly - even with "still investigating" - builds trust faster than intermittent posts with more information.
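The schedule is simple enough to compute rather than remember under pressure. The following is a minimal sketch, assuming timestamps are kept in UTC; the function names `update_schedule` and `next_update_line` are hypothetical and not part of any status page product.

```python
from datetime import datetime, timedelta, timezone


def update_schedule(confirmed_at: datetime, count: int) -> list:
    """Return the first `count` planned post times: T+5, T+20, T+40, then every 20 min."""
    offsets = [5, 20]
    while len(offsets) < count:
        offsets.append(offsets[-1] + 20)
    return [confirmed_at + timedelta(minutes=m) for m in offsets[:count]]


def next_update_line(posted_at: datetime, interval_min: int = 20) -> str:
    """Build the closing line that commits to the next update time."""
    due = posted_at + timedelta(minutes=interval_min)
    return f"Next update by {due:%H:%M} UTC."


# Example: outage confirmed at 14:00 UTC.
confirmed = datetime(2026, 6, 28, 14, 0, tzinfo=timezone.utc)
planned = update_schedule(confirmed, 4)            # 14:05, 14:20, 14:40, 15:00
closing = next_update_line(planned[1])             # "Next update by 14:40 UTC."
```

Generating the closing line from the post time keeps the promised time and the actual cadence from drifting apart.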

What to Say at Each Stage

Investigating (T+5, cause unknown)

What to include:

  • Which service or feature is affected
  • What users experience (errors, slow responses, unavailability)
  • That your team is actively investigating
  • When the next update will come

What to leave out:

  • The cause (you don't know it yet)
  • Internal system names or technical jargon
  • Speculation about what might be wrong

Example:

We are investigating reports of errors affecting [service]. Some users may be unable to [specific action]. Our engineers are actively working on this.

Next update by [time].
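Filling the template from a function makes it harder to publish a post with a blank left in it. This is a sketch using Python's standard `string.Template`; the field names (`service`, `action`, `next_update`) and the function `investigating_post` are assumptions chosen for illustration.

```python
from string import Template

# Wording follows the example above; the $fields are hypothetical placeholder names.
INVESTIGATING = Template(
    "We are investigating reports of errors affecting $service. "
    "Some users may be unable to $action. "
    "Our engineers are actively working on this.\n\n"
    "Next update by $next_update."
)


def investigating_post(service: str, action: str, next_update: str) -> str:
    """Render the first status post, refusing to publish with an empty field."""
    fields = {"service": service, "action": action, "next_update": next_update}
    blank = [name for name, value in fields.items() if not value.strip()]
    if blank:
        raise ValueError(f"missing fields: {', '.join(blank)}")
    return INVESTIGATING.substitute(fields)


post = investigating_post("checkout", "complete purchases", "14:25 UTC")
```

The function takes no cause or internal system name as input, which matches the "what to leave out" list for this stage.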


Identified (cause found, fix in progress)

What to include:

  • A plain-language explanation of the cause
  • Which features are affected and which are working normally
  • What you're doing to fix it
  • Conservative time estimate if you have one

What to leave out:

  • Technical blame language ("a developer deployed a bad config")
  • Precise timeline commitments you cannot keep

Example:

We have identified the cause: [plain-language description, e.g., "a configuration change we deployed at 2:30 PM caused our database to reject new connections"]. Our team is deploying a fix now.

Affected: [feature]. Working normally: [other features].

We expect to restore service within [conservative estimate]. Next update by [time].

Specificity about the cause builds trust. "A database configuration change" is better than "an internal issue." Customers understand that systems are complex. What erodes trust is vagueness that reads as concealment.


Monitoring (fix deployed, watching for recovery)

What to include:

  • That the fix is deployed
  • That you're confirming recovery
  • What to do if customers still see issues

Example:

We have deployed a fix and are monitoring recovery. Most users should see service functioning normally now. If you continue to experience issues, contact [support email].

We will post a final update once recovery is confirmed.

Do not skip this stage to jump directly to Resolved. A second failure immediately after declaring the incident resolved damages trust more than a longer "monitoring" period.


Resolved

What to include:

  • Confirmation that service is fully restored
  • Duration of the incident (start time to resolution time)
  • One-sentence cause explanation
  • A commitment to publish a postmortem (for significant incidents)

Example:

This incident is resolved. Service is fully operational as of [time] UTC.

Duration: [X hours Y minutes]. Cause: [one honest sentence].

We will publish a post-incident review within 48 hours. We apologize for the disruption.
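The duration in the Resolved post is easy to get wrong by hand at the end of a long incident. A small sketch of the calculation follows, assuming start and resolution are recorded as timestamps in the same time zone; `format_duration` is a hypothetical helper name.

```python
from datetime import datetime


def _unit(n: int, word: str) -> str:
    """Pluralize a unit: '1 hour', '2 hours'."""
    return f"{n} {word}" + ("" if n == 1 else "s")


def format_duration(started: datetime, resolved: datetime) -> str:
    """Express start-to-resolution time as whole hours and minutes."""
    if resolved < started:
        raise ValueError("resolution time precedes start time")
    total_minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    parts = []
    if hours:
        parts.append(_unit(hours, "hour"))
    if minutes or not hours:
        parts.append(_unit(minutes, "minute"))
    return " ".join(parts)


# An incident from 14:30 to 15:17 lasted 47 minutes.
line = "Duration: " + format_duration(
    datetime(2026, 6, 28, 14, 30), datetime(2026, 6, 28, 15, 17)
) + "."
```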


Tone: The Three Rules

Rule 1: Specific beats vague.

"Database connection pool exhaustion caused elevated error rates for users attempting to log in" is better than "we experienced a technical issue." Vague language reads as either incompetence or concealment. Specific language, even when technical, reads as honest and competent.

Rule 2: Factual beats apologetic.

"We're so sorry for the terrible experience" reads as hollow. "Service was unavailable for 47 minutes. We've made X, Y, and Z changes to prevent this class of failure" reads as accountable. Apologize once, directly. Then focus on facts.

Rule 3: Committed beats hedged.

"We're working as hard as we can to address this" tells customers nothing. "We're deploying a rollback to restore service. Expected recovery in 20 minutes" commits to something actionable. If you miss the estimate, update immediately. Transparency about a missed estimate is better than silence.

What Not to Say

"We are experiencing some technical difficulties."

This says nothing. It signals either that you do not know what is happening or that you are not willing to share it. Both interpretations damage trust.

"This is affecting a small number of users."

Unless you have data to support this, avoid it. The customer reading the update does not know whether they are in the "small number." If they are, this reads as dismissive.

"This was due to an unprecedented situation."

Almost never true, and sounds defensive. Most outages have known causes. Own the specific cause.

"We are working around the clock."

Filler. Customers do not care about effort. They care about restoration.

Giving a timeline you cannot keep.

Missing a stated ETA without updating is worse than not giving one. If you say "resolved in 30 minutes" and post nothing for 90, you've compounded the original problem.
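A draft update can be checked for these phrases before it is published. The sketch below is a plain substring check, not a language model or a complete style linter; the phrase list is taken from the examples in this section, and `flag_vague_phrases` is a hypothetical name.

```python
# Phrases drawn from the "What Not to Say" examples; extend the tuple as needed.
VAGUE_PHRASES = (
    "technical difficulties",
    "small number of users",
    "unprecedented",
    "around the clock",
)


def flag_vague_phrases(draft: str) -> list:
    """Return the listed phrases that appear in the draft, ignoring case."""
    lowered = draft.lower()
    return [phrase for phrase in VAGUE_PHRASES if phrase in lowered]


flags = flag_vague_phrases("We are experiencing some technical difficulties.")
# flags == ["technical difficulties"]
```

A check like this catches wording only. It cannot tell whether an ETA in the draft is one the team can keep.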

The Post-Incident Customer Email

Send a customer email for incidents lasting over 30 minutes with broad user impact. The status page handles real-time communication during the incident. The email handles the follow-up.

Send it within 2 hours of resolution - not the next day.

The email should cover:

  1. What happened and when
  2. Who was affected and how
  3. What you've already done to fix it
  4. What you're doing to prevent recurrence

Send it from the founder or CEO for major incidents, not a generic no-reply@ address. Customers who receive a personal note from the founder are far less likely to churn than customers who receive a template from a support queue.

Research from Zendesk's CX Trends report shows that customers rate companies 2.5x higher on trust when they receive proactive outage communication compared to when they find out through their own investigation.

The Most Expensive Silence: Not Having a Status Page

Teams without a status page force every outage into two channels: Twitter/social media and support tickets.

An outage generates an average of 3-5 support tickets per 100 affected users in the first 30 minutes. For a service where all 10,000 active users are affected, that is 300-500 tickets your support team has to process, each requiring an individual response, while the technical team is still fighting the incident.

A status page that customers can find and check cuts that volume by 70-80%. One URL and one post redirect most of that curiosity load away from your team.
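The arithmetic behind those figures can be worked through for any user count. This sketch takes the article's rates as given inputs (3-5 tickets per 100 affected users, a 70-80% reduction) and assumes every active user is affected; the function names are hypothetical.

```python
def ticket_range(affected_users: int, per_100=(3, 5)) -> tuple:
    """Low and high ticket estimates for the first 30 minutes."""
    low, high = per_100
    return affected_users * low // 100, affected_users * high // 100


def with_status_page(tickets: tuple, reduction_pct=(70, 80)) -> tuple:
    """Remaining ticket range after applying the stated percentage reduction."""
    low, high = tickets
    min_cut, max_cut = reduction_pct
    # Best case: fewest tickets with the largest cut. Worst case: the reverse.
    return low * (100 - max_cut) // 100, high * (100 - min_cut) // 100


baseline = ticket_range(10_000)          # (300, 500)
remaining = with_status_page(baseline)   # (60, 150)
```

Under these assumptions, a status page leaves roughly 60-150 tickets to answer instead of 300-500.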

If you don't have a status page, set one up before the next incident. Vantaj includes public status pages on every plan, including free. It takes about 3 minutes to configure one.

The Communication Checklist

For any SEV-1 or significant SEV-2:

  • Status page "Investigating" posted within 5 minutes of confirmed outage
  • Dedicated incident channel created in Slack/Teams
  • Support team briefed on what to tell customers
  • Status page updated every 20 minutes until resolved
  • Status page "Resolved" with duration and cause
  • Customer email sent within 2 hours of resolution (if impact was broad)
  • Postmortem scheduled within 48 hours
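The time-based items in this checklist can be checked against an incident's actual timeline afterwards. The sketch below covers only the three items with a stated deadline; `audit_timeline` and its parameters are hypothetical, and the thresholds are the ones given in the checklist.

```python
from datetime import datetime, timedelta
from typing import Optional


def audit_timeline(
    confirmed_at: datetime,
    post_times: list,
    resolved_at: datetime,
    email_sent_at: Optional[datetime] = None,
) -> list:
    """Return the timed checklist items that were missed (empty list means none)."""
    misses = []
    posts = sorted(post_times)
    # First status page post within 5 minutes of the confirmed outage.
    if not posts or posts[0] - confirmed_at > timedelta(minutes=5):
        misses.append("first post within 5 minutes")
    # No gap longer than 20 minutes between consecutive posts.
    gaps = [later - earlier for earlier, later in zip(posts, posts[1:])]
    if any(gap > timedelta(minutes=20) for gap in gaps):
        misses.append("updates every 20 minutes")
    # Customer email, when one was sent, within 2 hours of resolution.
    if email_sent_at is not None and email_sent_at - resolved_at > timedelta(hours=2):
        misses.append("email within 2 hours of resolution")
    return misses
```

Running this during the postmortem gives a concrete record of where communication slipped, separate from the technical timeline.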