What to Monitor: The Complete Checklist for SaaS, E-commerce, and APIs
47 prioritized checks across HTTP, SSL, domain expiry, heartbeat, TCP, and DNS, organized by business type. Use this when setting up monitoring from scratch or auditing an existing setup.
The most common question from teams setting up monitoring for the first time is: what should I actually be watching?
Most guides list monitor types. This one tells you which specific endpoints, certificates, jobs, and records to monitor, organized by priority, so you can set up a complete monitoring stack without missing the things that matter.
Priority key: 🔴 Critical: alert immediately. 🟡 Important: alert within 5 minutes. 🟢 Informational: daily digest is sufficient.
HTTP and Application Monitors
These confirm your application is responding correctly, not just that the server is running.
For Every Product
| Monitor | Priority | Why |
|---|---|---|
| Homepage / root URL | 🟡 | First thing customers check when something feels wrong |
| Login / auth endpoint | 🔴 | If users can't log in, the rest of the product is irrelevant |
| Primary API endpoint | 🔴 | The most-called endpoint your product depends on |
| Health check endpoint | 🔴 | /health or /ping; your own team uses this to verify recovery |
| Signup / registration | 🟡 | A broken signup flow means zero new users until someone notices |
| Password reset | 🟡 | Silent broken state; only surfaces when a user is locked out |
Set up the health check endpoint if you don't already have one. A simple GET /health returning {"status": "ok"} with a 200 is enough. During an incident, this is the fastest way to confirm recovery.
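As a minimal sketch of such an endpoint, the following uses only Python's standard library. The handler name, the helper `health_payload`, and port 8080 are illustrative assumptions; in a real application you would more likely add the route to your existing web framework.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def health_payload() -> tuple[int, bytes]:
    """Return the status code and JSON body for a healthy service."""
    return 200, json.dumps({"status": "ok"}).encode("utf-8")


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and a small JSON body; 404 otherwise."""

    def do_GET(self) -> None:
        if self.path == "/health":
            status, body = health_payload()
        else:
            status, body = 404, b'{"status": "not found"}'
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Serve until interrupted. The port is an arbitrary example value."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` and pointing an HTTP monitor at `/health` is enough to get started. A health check that also touches the database or cache tells you more, at the cost of making the check itself slower and more failure-prone.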
Additional Checks for SaaS Products
| Monitor | Priority | Why |
|---|---|---|
| Core feature API | 🔴 | The endpoint behind your product's primary value |
| Webhook delivery endpoint | 🟡 | Webhook failures are silent: customers see nothing, their integrations just stop |
| Billing / subscription API | 🟡 | A broken billing page blocks upgrades and causes churn at renewal |
| User dashboard | 🟡 | The page users land on after login; degraded performance is noticed immediately |
Additional Checks for E-commerce
| Monitor | Priority | Why |
|---|---|---|
| Product catalog / listing page | 🔴 | If products don't load, nothing sells |
| Cart / checkout page | 🔴 | Direct, immediate, measurable revenue loss when broken |
| Payment processor integration | 🔴 | Stripe, Braintree, or PayPal endpoint; payment failures are the most urgent alert |
| Order confirmation page | 🔴 | Confirms the full purchase flow completed |
| Search / product search API | 🟡 | Second most impactful e-commerce failure after checkout |
For e-commerce, factor peak traffic into your alerting priorities: if revenue scales with traffic, a 4-hour outage during a 10x traffic period costs roughly 10x as much as the same outage on a normal day. When something breaks during a sale event, check your checkout monitor first.
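The peak multiplier is simple arithmetic. The sketch below assumes revenue scales linearly with traffic, and the $1,000 per hour figure in the usage note is a made-up example, not a benchmark.

```python
def outage_cost(hours: float, normal_hourly_revenue: float,
                traffic_multiplier: float = 1.0) -> float:
    """Estimate revenue lost to an outage.

    Assumes revenue scales linearly with traffic, which is a simplification.
    """
    return hours * normal_hourly_revenue * traffic_multiplier
```

With a hypothetical $1,000 per hour of normal revenue, `outage_cost(4, 1000)` gives 4,000 while `outage_cost(4, 1000, traffic_multiplier=10)` gives 40,000.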
Additional Checks for Developer APIs
| Monitor | Priority | Why |
|---|---|---|
| Primary API base URL | 🔴 | api.yourdomain.com with a lightweight authenticated request |
| Auth / token endpoint | 🔴 | If auth breaks, all API consumers break simultaneously |
| Documentation site | 🟡 | docs.yourdomain.com; downtime during an evaluation kills deals |
SSL Certificate Monitors
SSL failures block all users immediately. The browser shows a full-page warning; most users don't click through. Set expiry alerts well in advance, because 7 days is too short if renewal requires vendor coordination or a DNS change.
| Monitor | Priority | Recommended alert thresholds |
|---|---|---|
| Primary domain SSL | 🔴 | 90, 60, 30, 7, 1 day before expiry |
| API subdomain SSL | 🔴 | Same; expires independently of your main domain |
| App subdomain SSL | 🔴 | Same |
| Docs / marketing subdomains | 🟡 | 30, 7, 1 day before expiry |
| Custom customer domains | 🟡 | If you support CNAME-based custom domains, monitor a sample set; auto-renewal failures are common here |
Don't rely on auto-renewal alone. Let's Encrypt, AWS ACM, and commercial CA portals all have failure modes: DNS validation errors, expired billing, misconfigured ACME clients, CDN certificate caching. Monitoring catches silent renewal failures before they cause outages.
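To illustrate what an expiry check does, here is a rough sketch using Python's standard `ssl` and `socket` modules. The function names and the threshold tuple are assumptions for this example, and a hosted monitor will handle more edge cases (certificate chains, SNI variants, multiple regions).

```python
import socket
import ssl
import time
from typing import Optional

# Thresholds from the table above, in days before expiry.
ALERT_THRESHOLDS_DAYS = (90, 60, 30, 7, 1)


def days_until_expiry(host: str, port: int = 443, timeout: float = 10.0) -> float:
    """Connect over TLS and return the days left on the server certificate.

    An already expired or otherwise invalid certificate raises
    ssl.SSLCertVerificationError during the handshake; treat that as an alert.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires_at = ssl.cert_time_to_seconds(cert["notAfter"])
    return (expires_at - time.time()) / 86400


def current_threshold(days_left: float,
                      thresholds=ALERT_THRESHOLDS_DAYS) -> Optional[int]:
    """Return the tightest threshold reached, or None if none applies yet."""
    reached = [t for t in thresholds if days_left <= t]
    return min(reached) if reached else None
```

For example, a certificate with 25 days left has reached the 30-day threshold, so `current_threshold(25)` returns 30. Running `days_until_expiry("api.yourdomain.com")` on a schedule and alerting when the returned threshold changes is one way to get the staged alerts described above.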
Domain Expiry Monitors
Domain expiry is rarer than SSL expiry but more catastrophic. An expired domain takes your entire product offline, including the SSL certificate, DNS, and email. Recovery involves your registrar's support queue.
| Monitor | Priority | Recommended alert thresholds |
|---|---|---|
| Primary domain | 🔴 | 90, 60, 30, 14 days before expiry |
| Brand protection domains | 🟡 | .io, .co, .net variants you own; expiry lets squatters take them |
| Acquired product domains | 🟡 | Alert at 60 days; these often have different registrar accounts |
Heartbeat Monitors
Heartbeat monitoring inverts the check: instead of you pinging the job, the job pings a URL on each successful run. If the ping stops arriving, the monitor alerts. This is the only reliable way to detect silent cron failures.
| Job | Priority | Why |
|---|---|---|
| Database backup job | 🔴 | A backup that silently stops running is a disaster waiting for a trigger |
| Billing renewal / subscription sync | 🔴 | Subscription states diverge from your payment processor; silent revenue loss |
| Email delivery queue | 🔴 | Transactional emails (receipts, resets, notifications) stop without any error |
| User notification job | 🟡 | Digest emails, alerts, summaries; users notice when these go missing |
| Data sync / ETL pipeline | 🟡 | Stale data surfaces as product bugs, not monitoring alerts |
| Report generation job | 🟡 | Scheduled reports that internal teams rely on |
| Cleanup / maintenance jobs | 🟢 | Log rotation, temp file cleanup, expired session purge |
Configure heartbeat intervals to match your cron schedule plus a 10–20% grace period. A job that runs every hour should have a heartbeat window of 66–72 minutes, not 60, to account for startup time and processing delays.
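The sketch below shows both halves of this: computing the window from the schedule, and pinging only after the job succeeds. The function names and the ping URL are placeholders; your monitoring provider supplies the real URL.

```python
import urllib.request
from typing import Callable


def heartbeat_window_minutes(interval_minutes: int, grace_percent: int = 10) -> float:
    """Return the cron interval plus a percentage grace period, in minutes."""
    return interval_minutes * (100 + grace_percent) / 100


def run_with_heartbeat(job: Callable[[], None], ping_url: str,
                       timeout: float = 10.0) -> None:
    """Run job() and ping the monitor only if it finishes without raising.

    If job() raises, the exception propagates, no ping is sent, and the
    monitor alerts once the heartbeat window elapses.
    """
    job()
    urllib.request.urlopen(ping_url, timeout=timeout).close()
```

For an hourly job, `heartbeat_window_minutes(60, 10)` gives 66 and `heartbeat_window_minutes(60, 20)` gives 72, matching the range above. Placing the ping after the job, rather than before it, is what makes a crashed run show up as a missed heartbeat.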
TCP Port Monitors
Use for services that don't expose HTTP endpoints.
| Port | Service | Priority |
|---|---|---|
| 5432 | PostgreSQL | 🔴 |
| 3306 | MySQL | 🔴 |
| 27017 | MongoDB | 🔴 |
| 6379 | Redis | 🔴 |
| 587 / 465 | SMTP | 🟡 |
| 22 | SSH | 🟡 |
| 3389 | RDP | 🟢 |
A database host that stops accepting TCP connections causes application failures that surface as HTTP 500 errors, not as "database unavailable." The TCP port monitor tells you the failure is at the infrastructure layer before you spend 30 minutes debugging application code.
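A TCP port check is little more than an attempted connection. This minimal sketch uses Python's standard `socket` module; the function name and the 5-second timeout are assumptions for illustration.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A call such as `tcp_port_open("db.internal.example", 5432)` (hostname hypothetical) only confirms the port accepts connections. It does not prove the database can serve queries, so pair it with an application-level health check.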
DNS Monitors
DNS changes are rare, which is exactly why unexpected changes are significant. Alert on any value change rather than setting specific thresholds; the expected value of an NS record should never change without advance planning.
| Record | Priority | Alert condition |
|---|---|---|
| Primary domain A record | 🔴 | Any IP address change |
| NS records | 🔴 | Any change; unexpected NS changes are the strongest signal of DNS hijacking |
| MX records | 🟡 | Any change; a wrong value can stop inbound email for your entire domain |
| API subdomain A record | 🟡 | Any IP address change |
| SPF TXT record | 🟢 | Value change; affects email deliverability and spam filter performance |
| DMARC TXT record | 🟢 | Value change |
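Change detection for A records can be sketched as a set comparison against a known-good baseline. The example below uses the system resolver through Python's standard `socket` module, so it covers A records only; NS, MX, and TXT lookups need a DNS library and are not shown. The function names and the documentation-range IP addresses in the usage note are illustrative.

```python
import socket


def resolve_a_records(hostname: str) -> set[str]:
    """Return the IPv4 addresses the system resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None,
                               family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def record_changes(expected: set[str], observed: set[str]) -> dict[str, set[str]]:
    """Describe how the observed values differ from the expected baseline."""
    return {"added": observed - expected, "removed": expected - observed}


def has_changed(expected: set[str], observed: set[str]) -> bool:
    """Any difference at all counts as a change worth alerting on."""
    return expected != observed
```

If the baseline is `{"203.0.113.10"}` and a lookup returns `{"198.51.100.7"}`, `has_changed` is true and `record_changes` reports the new address as added and the old one as removed. Comparing sets rather than ordered lists avoids false alerts when a resolver returns the same records in a different order.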
Recommended Setup Order
If you're starting from zero, this order prioritizes coverage of the most impactful failures:
- Login endpoint (HTTP)
- Primary API endpoint (HTTP)
- Primary domain SSL certificate
- Homepage (HTTP)
- Checkout or core feature endpoint (HTTP)
- Primary domain expiry (WHOIS/RDAP)
- Database backup cron (heartbeat)
- Billing sync cron (heartbeat)
- Database TCP port
- NS records (DNS)
These 10 monitors cover the failures most likely to affect users and the silent failures most likely to compound into larger problems. Add the rest of the list once these are stable.
Monitor Settings Reference
| Monitor type | Check interval | Alert after |
|---|---|---|
| HTTP: critical endpoints | 1 minute | 2 consecutive failures from all regions |
| HTTP: secondary pages | 5 minutes | 2 consecutive failures |
| SSL certificate | 12 hours | At 90/60/30/7/1 days before expiry |
| Domain expiry | Daily | At 90/60/30/14 days before expiry |
| Heartbeat | Match cron schedule + 10% | 1 missed expected ping |
| TCP port | 5 minutes | 2 consecutive failures |
| DNS record | 15 minutes | Any value change |
Requiring 2 consecutive failures before alerting eliminates most false positives caused by transient network issues. A monitor checking every minute that requires 2 consecutive failures still alerts within 2 minutes of a real outage, fast enough for any production incident.
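The consecutive-failure rule amounts to a counter that resets on every success. This is a minimal sketch with a hypothetical class name; real monitoring tools also track recovery notifications and per-region results.

```python
class ConsecutiveFailureAlert:
    """Fire an alert once a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, at the moment the threshold is reached.
        return self.failures == self.threshold
```

A single failed check followed by a success never alerts, which is how transient network blips are filtered out. With 1-minute checks and a threshold of 2, the alert fires on the second failed check.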
Frequently Asked Questions
How many monitors do I need?
For a typical SaaS product, 15–25 monitors cover everything: 6–10 HTTP checks, 3–5 SSL certificates, 1–2 domain expiry monitors, 3–5 heartbeat monitors, and a handful of DNS and TCP checks. More monitors add coverage; they don't improve detection speed for the monitors you already have.
Should I monitor staging as well as production?
Monitor production first, completely. Staging monitors are useful for catching deployment issues before they reach production, but they're a secondary concern. A broken staging environment that hasn't been monitored for a week costs nothing; a broken production login endpoint that hasn't been monitored for an hour costs customers.
What check interval should I use?
1 minute for anything customer-facing that generates revenue or blocks access. 5 minutes for secondary pages. Faster than 1 minute is rarely necessary; most outages aren't recovered in under a minute, so additional checks don't change your response time.
Do I need separate tools for each monitor type?
No. Vantaj monitors HTTP endpoints, SSL certificates, domain expiry, heartbeats, TCP ports, and DNS records from a single dashboard. The free tier covers 20 monitors across all types, enough to get full coverage for most small products.
For a deeper look at each monitor type, see ICMP ping monitoring, heartbeat monitoring for cron jobs, and DNS monitoring.