Why Your Website Keeps Going Down (8 Common Causes)

Websites go down for a reason. If yours goes down repeatedly, the problem is usually one of a small set of known causes - and most of them are fixable once you identify which one you are dealing with.

This guide covers the 8 most common causes of recurring website downtime with practical diagnostics for each.

1. Server Resource Exhaustion

What it looks like: Your site goes down under traffic spikes, during batch jobs, or at predictable times like deploy cycles or cron runs.

When a server runs out of CPU, memory, or disk space, it stops serving requests. The failure mode depends on which resource hits the limit:

CPU exhaustion: requests slow to a crawl, then time out
Memory exhaustion: processes crash or get killed by the OS; 502 or 503 errors follow
Disk full: database writes fail, log rotation stops, application crashes on file I/O

How to diagnose:

Check server metrics at the time of each outage. Look for:

CPU above 90% sustained for more than 30 seconds
Memory usage above 85% with no free swap
Disk usage above 90%

Most hosting providers (AWS, Fly.io, Railway, Render) surface these in their dashboard. On a VPS, use top, free -h, and df -h.
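If you would rather check these thresholds from a script than by eye, a minimal Python sketch could look like the following. The function names are illustrative, and the CPU and memory readings are assumed to come from whatever metrics source you already have; only the disk figure is read directly here.

```python
import shutil

# Thresholds taken from the list above (percent).
CPU_LIMIT = 90.0
MEMORY_LIMIT = 85.0
DISK_LIMIT = 90.0


def disk_percent(path="/"):
    """Return used disk space for `path` as a percentage of total size.

    This is used/total, so it can differ slightly from the Use% column
    of df, which accounts for reserved blocks.
    """
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def breached(cpu_pct, memory_pct, disk_pct):
    """Return the names of resources that are above their threshold."""
    readings = {
        "cpu": (cpu_pct, CPU_LIMIT),
        "memory": (memory_pct, MEMORY_LIMIT),
        "disk": (disk_pct, DISK_LIMIT),
    }
    return [name for name, (value, limit) in readings.items() if value > limit]
```

Calling breached(cpu_pct, memory_pct, disk_percent("/")) with readings captured at the time of an outage tells you which resource, if any, was over its limit.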

Fix:

Scale your server vertically (more CPU/RAM) or horizontally (more instances)
Identify and fix memory leaks in application code
Set up log rotation and disk space monitoring with alerts before the disk fills

2. Database Connection Failures

What it looks like: Your application returns 500 errors. The web server is running but the app cannot reach the database. Often appears as "connection pool exhausted" or "too many connections" in logs.

Almost every request to a dynamic site touches the database. When the database becomes unreachable - due to connection limits, a crashed process, or network issues between the app and database host - every one of those requests fails.

Common triggers:

Connection pool size set too low for the request volume
Database process crashed due to OOM or configuration error
Network timeout between app server and database host
Database max connections limit reached (often 100 on default PostgreSQL setups)

How to diagnose:

Check your application logs for connection errors at the time of the outage. Look for:

FATAL: remaining connection slots are reserved
could not connect to server
Connection refused

On PostgreSQL, check active connections: SELECT count(*) FROM pg_stat_activity;

Fix:

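Raise max_connections in your database config only if the host has memory to spare (each PostgreSQL connection is a separate backend process), and put a connection pooler in front of the database (PgBouncer for PostgreSQL)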
Set connection pool size in your app to match what the database can handle
Add database availability as a monitored dependency alongside your web endpoints
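To make "match what the database can handle" concrete, here is a rough sizing sketch in Python. The function name and the other_clients parameter are illustrative assumptions. The default of 3 reserved slots mirrors PostgreSQL's default superuser_reserved_connections, so check your own setting.

```python
def pool_size_per_instance(max_connections, app_instances, reserved=3, other_clients=0):
    """Largest per-instance pool size that keeps total connections under the limit.

    reserved:      slots held back for superusers (PostgreSQL's
                   superuser_reserved_connections defaults to 3).
    other_clients: connections used by migrations, cron jobs, admin tools.
    """
    if app_instances < 1:
        raise ValueError("app_instances must be at least 1")
    available = max_connections - reserved - other_clients
    # Integer division: every instance gets the same share, rounded down.
    return max(available // app_instances, 0)
```

With the default limit of 100 and four app instances, this gives 24 connections per instance. If the result is too small for your request volume, that is the signal to add a pooler rather than keep raising the limit.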

3. SSL Certificate Expiry

What it looks like: Your site returns a browser security warning or stops loading entirely. Browsers report NET::ERR_CERT_DATE_INVALID (Chrome) or SEC_ERROR_EXPIRED_CERTIFICATE (Firefox), and command-line tools report certificate has expired.

SSL certificates have expiry dates. When a certificate expires, browsers refuse to load the site over HTTPS. For users, the experience is the same as a full outage - they cannot access your service.

This is one of the most avoidable causes of downtime. It fails on a known date, with weeks of warning available.

How to diagnose:

Check your certificate expiry date:

echo | openssl s_client -connect yourdomain.com:443 -servername yourdomain.com 2>/dev/null | openssl x509 -noout -dates

The notAfter= line in the output is the expiry date.

Fix:

Set up SSL certificate monitoring with alerts at 30 days, 14 days, and 7 days before expiry
Use Let's Encrypt with auto-renewal (via Certbot or your hosting platform)
Verify that auto-renewal is actually working - a misconfigured renewal script silently fails until the certificate expires
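The tiered alerts above can be sketched in Python with only the standard library. This is an illustration under stated assumptions, not a finished monitor: the function names are made up for the example, and fetch_not_after makes a real network connection to the host you pass in.

```python
import socket
import ssl
import time

ALERT_DAYS = (30, 14, 7)  # alert tiers from the list above


def fetch_not_after(hostname, port=443, timeout=5.0):
    """Return the certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'.

    The default context verifies the certificate, so if it has already
    expired the handshake raises ssl.SSLCertVerificationError instead.
    """
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now=None):
    """Days remaining before `not_after`; negative if already expired."""
    now = time.time() if now is None else now
    return (ssl.cert_time_to_seconds(not_after) - now) / 86400


def alert_tier(days_left):
    """Return the tightest tier that has been reached, or None if none has."""
    reached = [tier for tier in ALERT_DAYS if days_left <= tier]
    return min(reached) if reached else None
```

Run daily from cron, alert_tier(days_until_expiry(fetch_not_after("yourdomain.com"))) returns None while the certificate is healthy and 30, 14, or 7 as expiry approaches. It also catches the case where auto-renewal has silently stopped working.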

Vantaj monitors SSL certificate expiry for every HTTPS monitor and sends tiered alerts before it becomes an outage.

4. DNS Configuration Errors

What it looks like: Your site stops loading with "DNS_PROBE_FINISHED_NXDOMAIN" or similar DNS errors. Monitoring reports the domain as unreachable.

DNS errors prevent browsers from finding your server at all. Your server can be fully healthy and the site still shows as down if DNS breaks.

Common triggers:

Nameserver records changed or removed accidentally
Domain registration expired (the domain stops resolving or is redirected to a registrar parking page)
DNS provider outage
TTL settings that cause stale records to persist after a migration
A record pointing to an IP address that no longer routes to your server

How to diagnose:

# Check if the domain resolves
dig yourdomain.com

# Check nameservers
dig NS yourdomain.com

# Check from a different DNS resolver
dig @8.8.8.8 yourdomain.com

If the status field in dig's output is SERVFAIL or NXDOMAIN, the problem is DNS.

Fix:

Monitor your domain expiry date (Vantaj does this natively)
Monitor DNS record changes - an alert when your A record changes can catch misconfigurations before they propagate
Use multiple nameservers and verify all of them resolve correctly
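A simple way to detect an A record that has changed or stopped resolving is to compare what the domain resolves to against the addresses you expect. The Python sketch below uses the system resolver, so it sees cached answers rather than querying your nameservers directly. The function names and the expected_ips parameter are assumptions made for illustration.

```python
import socket


def resolve_ipv4(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IPv4 addresses for `hostname`, or [] if lookup fails."""
    try:
        results = resolver(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each result is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({sockaddr[0] for *_, sockaddr in results})


def check_dns(hostname, expected_ips, resolver=socket.getaddrinfo):
    """Compare current A records with the addresses you expect."""
    current = resolve_ipv4(hostname, resolver)
    if not current:
        return "unresolvable"
    if set(current) != set(expected_ips):
        return "changed"
    return "ok"
```

The resolver argument exists so the check can be exercised with a stub in tests. In normal use you would call check_dns("yourdomain.com", [...]) with your server's real addresses and alert on anything other than "ok".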

5. Traffic Spikes Without Autoscaling

What it looks like: Your site handles normal load fine but goes down when traffic doubles. It recovers on its own once the spike passes.

An unexpected surge of traffic - from a viral post, a newsletter send, or a product launch - can overwhelm a fixed-capacity server that handles daily load without issue.

How to diagnose:

Check your server access logs against your uptime history. Look for:

Request volume spike in the minutes before the outage
Response time degradation before the full failure
Load balancer connection queue growing
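If you can extract request timestamps from your access logs, the first signal above can be quantified. The Python sketch below assumes you have already parsed the log into datetime objects; the function names and the five-minute lookback are illustrative choices, not fixed rules.

```python
from collections import Counter
from datetime import datetime, timedelta
from statistics import median


def per_minute_counts(timestamps):
    """Count requests in each one-minute bucket."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def spike_ratio(counts, outage_start, lookback_minutes=5):
    """Peak requests/minute just before the outage, relative to the earlier median.

    Minutes with no requests at all are absent from `counts`, so the
    baseline is the median of minutes that saw at least one request.
    """
    outage_minute = outage_start.replace(second=0, microsecond=0)
    cutoff = outage_minute - timedelta(minutes=lookback_minutes)
    recent = [n for minute, n in counts.items() if cutoff <= minute <= outage_minute]
    earlier = [n for minute, n in counts.items() if minute < cutoff]
    if not recent or not earlier:
        return None
    return max(recent) / median(earlier)
```

A ratio close to 1 suggests traffic was normal and the cause lies elsewhere. A ratio of several times the baseline points at the spike.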

Fix:

Enable autoscaling on your hosting platform (most cloud providers support this)
Set up load testing before planned traffic events
Use a CDN to absorb static asset load so origin servers only handle dynamic requests
Configure rate limiting to prevent a single source from consuming all capacity
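Rate limiting is usually configured at the load balancer or CDN, but the underlying idea is simple. The sketch below is one common approach, a token bucket per client key, written in Python for illustration. The class names are invented for the example, and a production version would also need to evict idle keys so the bucket table does not grow without bound.

```python
import time


class TokenBucket:
    """Allow `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class RateLimiter:
    """One bucket per client key (for example, the client IP address)."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.buckets = {}

    def allow(self, key):
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self.buckets[key] = bucket
        return bucket.allow()
```

Because each key has its own bucket, one noisy source runs out of tokens and starts getting rejected while everyone else is served normally.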

6. Third-Party Service Dependencies

What it looks like: Your site goes down but your own infrastructure is healthy. The outage timing correlates with a status event from Stripe, Auth0, Twilio, or another vendor.

Modern applications depend on many external services. When Stripe's API is unavailable, your checkout flow fails. When your auth provider goes down, nobody can log in. If your code does not handle these failures gracefully, one external dependency outage takes your whole site down.

How to diagnose:

Check the status pages of every external service your application calls. Cross-reference the timing with your own outage window.

Services to monitor:

Payment processors (Stripe, Braintree)
Authentication providers (Auth0, Okta, Clerk)
Email delivery (SendGrid, Postmark, Resend)
Cloud providers (AWS, GCP, Azure)
CDN providers (Cloudflare, Fastly)

Fix:

Add vendor monitors to your uptime dashboard so you know immediately when a dependency goes down
Build timeout and fallback handling for every external API call
Display a maintenance page or degraded mode when critical dependencies are unreachable
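As a rough illustration of timeout and fallback handling, the Python sketch below wraps an external call so that a slow or unreachable vendor produces a degraded response instead of a hung request. The function names and the retry count are assumptions for the example; real code would tune the timeout per vendor and decide which errors are worth retrying.

```python
import socket
import urllib.error
import urllib.request


def fetch_bytes(url, timeout=3.0):
    """Fetch `url`, giving up after `timeout` seconds instead of hanging."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read()


def call_with_fallback(primary, fallback, retries=1):
    """Run `primary`; on timeout or connection failure, retry, then use `fallback`.

    Both arguments are zero-argument callables. Note that URLError also
    covers HTTP error responses, which this simple version retries too.
    """
    for _ in range(retries + 1):
        try:
            return primary()
        except (TimeoutError, socket.timeout, ConnectionError, urllib.error.URLError):
            continue
    return fallback()
```

A call such as call_with_fallback(lambda: fetch_bytes("https://api.example.com/prices"), lambda: b"[]") (the URL is a placeholder) returns the live data when the vendor responds and an empty list when it does not, so the page can still render in degraded mode.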

7. Deployment Failures

What it looks like: Your site goes down immediately after a deployment. Recovery requires a rollback.

A bad deploy is one of the most common causes of downtime in actively developed products. A missing environment variable, a migration that locks a database table, or a dependency version conflict can take down production within seconds of a deploy.

How to diagnose:

Check your deployment timeline against your incident timeline. If outages consistently start within 5-10 minutes of a deploy, the deploy is the cause.
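If you keep deploy timestamps (from CI logs or your hosting platform), this comparison is easy to automate. The following Python sketch uses the 10-minute upper bound mentioned above as its default window; the function name is illustrative.

```python
from datetime import datetime, timedelta


def deploy_before_outage(deploy_times, outage_start, window=timedelta(minutes=10)):
    """Return the most recent deploy within `window` before the outage, or None."""
    candidates = [
        deployed
        for deployed in deploy_times
        # Keep deploys that happened at or before the outage, within the window.
        if timedelta(0) <= outage_start - deployed <= window
    ]
    return max(candidates) if candidates else None
```

Running this over your last several incidents shows quickly whether deploys are the recurring trigger or only an occasional one.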

Fix:

Add a /health endpoint that checks database connectivity and critical dependencies - use it as the deploy health check
Use blue-green or canary deployments to roll out changes to a subset of traffic before full rollout
Run database migrations in a separate step from the application deploy
Set up an automatic rollback trigger if health checks fail after deploy
Test your rollback procedure before you need it under pressure
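A /health endpoint of the kind described in the first item might be sketched as follows, using only Python's standard library. The structure is an assumption for illustration: each dependency check is a zero-argument callable that raises on failure, and the endpoint returns 503 if any of them fail so that a deploy health check or load balancer can act on the status code alone.

```python
import json
from http.server import BaseHTTPRequestHandler


def run_checks(checks):
    """Run each named check; return (http_status, report).

    `checks` maps a name to a zero-argument callable that raises on failure.
    """
    report = {}
    for name, check in checks.items():
        try:
            check()
            report[name] = "ok"
        except Exception as exc:  # any failure marks the dependency unhealthy
            report[name] = f"failed: {exc}"
    healthy = all(result == "ok" for result in report.values())
    return (200 if healthy else 503), report


def make_handler(checks):
    """Build a request handler that serves the checks at /health."""

    class HealthHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            if self.path != "/health":
                self.send_error(404)
                return
            status, report = run_checks(checks)
            body = json.dumps(report).encode("utf-8")
            self.send_response(status)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    return HealthHandler
```

To serve it, pass make_handler({"database": ping_db}) to http.server.HTTPServer, where ping_db is a hypothetical function that runs a trivial query and raises if the database is unreachable. Keep the checks fast, because the deploy pipeline will call this endpoint repeatedly.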

8. Memory Leaks

What it looks like: Your site performs normally after a fresh start but degrades over time. Restarting the server temporarily fixes the problem. The pattern repeats on a predictable cycle (every 12 hours, every few days).

A memory leak in your application code causes memory usage to grow steadily until the process gets OOM-killed or becomes too slow to serve requests.

How to diagnose:

Plot your server memory usage over time. A memory leak shows a consistent upward slope with sharp drops when the process restarts.
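If you have memory samples but no graphing tool to hand, the slope can be estimated numerically. The Python sketch below fits a least-squares line to (hours since start, memory in MB) pairs; the function names are illustrative, and the samples are assumed to come from a single process lifetime, since a restart in the middle would distort the fit.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours_since_start, memory_mb) samples, in MB/hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def hours_until_limit(samples, limit_mb):
    """Estimate hours until memory reaches `limit_mb`; None if it is not growing."""
    slope = slope_per_hour(samples)
    if slope <= 0:
        return None
    latest_mb = samples[-1][1]
    return max((limit_mb - latest_mb) / slope, 0.0)
```

A steady positive slope across several restarts is the signature of a leak. The estimate from hours_until_limit should roughly match the restart cycle you observe.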

For Node.js applications:

node --inspect app.js

Use Chrome DevTools to take heap snapshots before and after periods of heavy usage. Compare object counts between snapshots.

Fix:

Fix the memory leak in code (unbounded caches, event listeners that are never removed, timers or closures that keep references to large objects alive)
Set --max-old-space-size (in MiB) in Node.js to cap heap growth, so the process crashes predictably instead of degrading silently
Use a process manager (PM2, systemd) that restarts the application automatically on crash
Set up memory usage monitoring with alerts at 80% and 90% thresholds

Build a Diagnostic Checklist for Every Outage

When your site goes down, run through these in order:

DNS - Does the domain resolve? Run dig yourdomain.com
SSL - Is the certificate valid and current?
Server health - CPU, memory, disk all within bounds?
Database - Can the application connect?
Recent deploy - Did the outage start within 10 minutes of a deployment?
Third-party dependencies - Are upstream services showing incidents?
Traffic spike - Did request volume jump before the failure?
Memory trend - Has memory been climbing since the last restart?

The answer is almost always in this list. Knowing which category applies turns a 45-minute investigation into a 5-minute one.