What to Monitor: The Complete Checklist for SaaS, E-commerce, and APIs

The most common question from teams setting up monitoring for the first time is: what should I actually be watching?

Most guides list monitor types. This one tells you which specific endpoints, certificates, jobs, and records to monitor, organized by priority, so you can set up a complete monitoring stack without missing the things that matter.

Priority key: 🔴 Critical: alert immediately. 🟡 Important: alert within 5 minutes. 🟢 Informational: daily digest is sufficient.

HTTP and Application Monitors

These confirm your application is responding correctly, not just that the server is running.

For Every Product

Monitor	Priority	Why
Homepage / root URL	🟡	First thing customers check when something feels wrong
Login / auth endpoint	🔴	If users can't log in, the rest of the product is irrelevant
Primary API endpoint	🔴	The most-called endpoint your product depends on
Health check endpoint	🔴	`/health` or `/ping`; your own team uses this to verify recovery
Signup / registration	🟡	A broken signup flow means zero new users until someone notices
Password reset	🟡	Silent broken state; only surfaces when a user is locked out

Set up the health check endpoint if you don't already have one. A simple `GET /health` returning `{"status": "ok"}` with a 200 status is enough. During an incident, this is the fastest way to confirm recovery.
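If you have no health check yet, here is a minimal sketch of one using only Python's standard library. The handler name, the `make_server` helper, and port 8080 are illustrative assumptions; in practice you would add the same route to whatever framework your application already runs on.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answers GET /health with 200 and {"status": "ok"}; any other path gets 404."""

    def do_GET(self):
        if self.path == "/health":
            body = json.dumps({"status": "ok"}).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        # Silence per-request logging; a monitor may hit this every minute.
        pass


def make_server(port: int = 8080) -> HTTPServer:
    # 8080 is an arbitrary example port; passing 0 lets the OS pick a free one.
    return HTTPServer(("127.0.0.1", port), HealthHandler)
```

Calling `make_server().serve_forever()` starts it. Point the monitor at the `/health` path and have it expect a 200.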

Additional Checks for SaaS Products

Monitor	Priority	Why
Core feature API	🔴	The endpoint behind your product's primary value
Webhook delivery endpoint	🟡	Webhook failures are silent: customers see nothing, their integrations just stop
Billing / subscription API	🟡	A broken billing page blocks upgrades and causes churn at renewal
User dashboard	🟡	The page users land on after login; degraded performance is noticed immediately

Additional Checks for E-commerce

Monitor	Priority	Why
Product catalog / listing page	🔴	If products don't load, nothing sells
Cart / checkout page	🔴	Direct, immediate, measurable revenue loss when broken
Payment processor integration	🔴	Stripe, Braintree, or PayPal endpoint; payment failures are the most urgent alert
Order confirmation page	🔴	Confirms the full purchase flow completed
Search / product search API	🟡	Second most impactful e-commerce failure after checkout

For e-commerce, factor peak traffic into your alerting priorities: assuming revenue scales with traffic, a 4-hour outage during a 10x traffic period costs roughly 10x as much as the same outage on a normal day. When something breaks during a sale event, check your checkout monitor first.
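To make the peak multiplier concrete, here is a small sketch of the arithmetic. The $500 per hour figure is a made-up example, and the function assumes revenue scales linearly with traffic, which is a simplification.

```python
def outage_cost(hours: float, normal_revenue_per_hour: float,
                traffic_multiplier: float = 1.0) -> float:
    """Estimated revenue lost, assuming revenue scales linearly with traffic."""
    return hours * normal_revenue_per_hour * traffic_multiplier


# Illustrative numbers only: a store that normally makes $500 per hour.
normal_day = outage_cost(4, 500.0)        # 2000.0
sale_day = outage_cost(4, 500.0, 10.0)    # 20000.0
```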

Additional Checks for Developer APIs

Monitor	Priority	Why
Primary API base URL	🔴	`api.yourdomain.com` with a lightweight authenticated request
Auth / token endpoint	🔴	If auth breaks, all API consumers break simultaneously
Documentation site	🟡	`docs.yourdomain.com`; downtime during an evaluation kills deals

SSL Certificate Monitors

SSL failures block all users immediately. The browser shows a full-page warning; most users don't click through. Set expiry alerts well in advance, because 7 days is too short if renewal requires vendor coordination or a DNS change.

Monitor	Priority	Recommended alert thresholds and notes
Primary domain SSL	🔴	90, 60, 30, 7, 1 day before expiry
API subdomain SSL	🔴	Same; expires independently of your main domain
App subdomain SSL	🔴	Same
Docs / marketing subdomains	🟡	30, 7, 1 day before expiry
Custom customer domains	🟡	If you support CNAME-based custom domains, monitor a sample set; auto-renewal failures are common here

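Don't rely on auto-renewal alone. Let's Encrypt, AWS ACM, and commercial CA portals all have failure modes: DNS validation errors, expired billing, misconfigured ACME clients, CDN certificate caching. Monitoring catches silent renewal failures before they cause outages. For short-lived certificates, such as Let's Encrypt's 90-day certificates, drop the 90- and 60-day thresholds, which would otherwise fire during every normal renewal cycle.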
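An expiry check can be sketched with the standard library alone. The function names here are illustrative, not any tool's API. `fetch_not_after` reads the certificate the server actually presents, which is what catches the CDN-caching case even when renewal itself succeeded.

```python
import socket
import ssl
import time
from typing import List, Optional

ALERT_DAYS = (90, 60, 30, 7, 1)  # thresholds from the table above


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left, given a certificate's notAfter string, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def due_thresholds(days_left: float, thresholds=ALERT_DAYS) -> List[int]:
    """Thresholds that have been reached: days_left is at or below each of them."""
    return [t for t in thresholds if days_left <= t]


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Connect to host and return the notAfter field of the certificate it serves.

    With the default context, an already expired certificate fails the
    handshake and raises, which a monitor should treat as an alert too.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A real monitor would remember which thresholds it has already alerted on, so each one fires once rather than on every check.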

Domain Expiry Monitors

Domain expiry is rarer than SSL expiry but more catastrophic. An expired domain stops resolving, which takes your entire product offline: the website, the API, and email all fail together, and certificates for the domain can no longer be validated. Recovery involves your registrar's support queue.

Monitor	Priority	Recommended alert thresholds and notes
Primary domain	🔴	90, 60, 30, 14 days before expiry
Brand protection domains	🟡	`.io`, `.co`, `.net` variants you own; expiry lets squatters take them
Acquired product domains	🟡	Alert at 60 days; these often have different registrar accounts
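Where a registry supports RDAP, the expiry date appears as an `expiration` entry in the response's `events` list. This sketch only parses a response you have already fetched. The function names are illustrative, and not every TLD offers RDAP, so treat this as one possible approach rather than a universal one.

```python
from datetime import datetime, timezone
from typing import Optional


def rdap_expiration(rdap_domain: dict) -> Optional[datetime]:
    """Return the 'expiration' event date from an RDAP domain response, if present."""
    for event in rdap_domain.get("events", []):
        if event.get("eventAction") == "expiration":
            # RDAP dates are RFC 3339; Python before 3.11 rejects a trailing 'Z'.
            return datetime.fromisoformat(event["eventDate"].replace("Z", "+00:00"))
    return None


def days_left(expires: datetime, now: datetime) -> int:
    """Whole days between now and the expiry date."""
    return (expires - now).days
```

Compare `days_left` against the 90, 60, 30, and 14 day thresholds. A response with no expiration event returns `None`, which is worth alerting on rather than ignoring.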

Heartbeat Monitors

Heartbeat monitoring inverts the check: instead of the monitor polling the job, the job pings a URL on each successful run. If the ping stops arriving, the monitor alerts. A job that never runs produces no error to catch, so this is the most dependable way to detect silent cron failures.

Job	Priority	Why
Database backup job	🔴	A backup that silently stops running is a disaster waiting for a trigger
Billing renewal / subscription sync	🔴	Subscription states diverge from your payment processor; silent revenue loss
Email delivery queue	🔴	Transactional emails (receipts, resets, notifications) stop without any error
User notification job	🟡	Digest emails, alerts, summaries; users notice when these go missing
Data sync / ETL pipeline	🟡	Stale data surfaces as product bugs, not monitoring alerts
Report generation job	🟡	Scheduled reports that internal teams rely on
Cleanup / maintenance jobs	🟢	Log rotation, temp file cleanup, expired session purge

Configure heartbeat intervals to match your cron schedule plus a 10–20% grace period. A job that runs every hour should have a heartbeat window of 66–72 minutes, not 60, to account for startup time and processing delays.
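Both halves of a heartbeat can be sketched briefly: the window arithmetic on the monitoring side, and a wrapper on the job side that pings only after success. The ping URL is whatever your monitoring tool issues for the job, and the function names are illustrative.

```python
import urllib.request


def heartbeat_window_minutes(interval_minutes: float, grace: float = 0.10) -> float:
    """Longest allowed gap between pings: the schedule interval plus a grace fraction."""
    return interval_minutes * (1 + grace)


def is_overdue(minutes_since_last_ping: float, interval_minutes: float,
               grace: float = 0.10) -> bool:
    """True once the gap since the last ping exceeds the window."""
    return minutes_since_last_ping > heartbeat_window_minutes(interval_minutes, grace)


def run_with_heartbeat(job, ping_url: str, timeout: float = 10.0) -> None:
    """Run job(); send the ping only if it returns without raising."""
    job()
    # Reached only on success. An exception above skips the ping,
    # so the monitor sees a missed heartbeat and alerts.
    urllib.request.urlopen(ping_url, timeout=timeout).close()
```

For an hourly job, `heartbeat_window_minutes(60, 0.10)` comes to about 66 minutes and `heartbeat_window_minutes(60, 0.20)` to about 72, matching the range above.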

TCP Port Monitors

Use for services that don't expose HTTP endpoints.

Port	Service	Priority
5432	PostgreSQL	🔴
3306	MySQL	🔴
27017	MongoDB	🔴
6379	Redis	🔴
587 / 465	SMTP	🟡
22	SSH	🟡
3389	RDP	🟢

A database host that stops accepting TCP connections causes application failures that surface as HTTP 500 errors, not as "database unavailable." The TCP port monitor tells you the failure is at the infrastructure layer before you spend 30 minutes debugging application code.
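A TCP port check is just a connection attempt with a timeout. This sketch uses only the standard library, and the function name is illustrative. It confirms the port accepts connections, not that the database behind it answers queries.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """True if a TCP connection to host:port completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

For example, `port_is_open("db.internal.example", 5432)` would check a PostgreSQL host; the hostname is a placeholder.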

DNS Monitors

DNS changes are rare, which is exactly why unexpected changes are significant. Alert on any value change rather than setting specific thresholds; the expected value of an NS record should never change without advance planning.

Record	Priority	Alert condition
Primary domain A record	🔴	Any IP address change
NS records	🔴	Any change; unexpected NS changes are the strongest signal of DNS hijacking
MX records	🟡	Any change; a wrong MX record stops inbound email for your entire domain
API subdomain A record	🟡	Any IP address change
SPF TXT record	🟢	Value change; affects email deliverability and how spam filters treat your mail
DMARC TXT record	🟢	Value change
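"Alert on any value change" comes down to comparing what resolves now against a stored baseline. This sketch uses only the standard library, which can resolve addresses but not NS, MX, or TXT records; those need a DNS library, which is left out here. The function names are illustrative.

```python
import socket


def resolve_a_records(hostname: str) -> set:
    """IPv4 addresses the local resolver currently returns for hostname."""
    infos = socket.getaddrinfo(hostname, None, family=socket.AF_INET,
                               type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def record_changes(expected: set, observed: set) -> dict:
    """Values that appeared or disappeared relative to the expected baseline."""
    return {
        "added": sorted(observed - expected),
        "removed": sorted(expected - observed),
    }


def has_changed(expected: set, observed: set) -> bool:
    """Any difference at all counts as a change; order is ignored."""
    return expected != observed
```

Comparing sets rather than lists matters because resolvers often rotate the order of records between queries, and reordering alone should not trigger an alert.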

Recommended Setup Order

If you're starting from zero, this order prioritizes coverage of the most impactful failures:

1. Login endpoint (HTTP)
2. Primary API endpoint (HTTP)
3. Primary domain SSL certificate
4. Homepage (HTTP)
5. Checkout or core feature endpoint (HTTP)
6. Primary domain expiry (WHOIS/RDAP)
7. Database backup cron (heartbeat)
8. Billing sync cron (heartbeat)
9. Database TCP port
10. NS records (DNS)

These 10 monitors cover the failures most likely to affect users and the silent failures most likely to compound into larger problems. Add the rest of the list once these are stable.

Monitor Settings Reference

Monitor type	Check interval	Alert after
HTTP: critical endpoints	1 minute	2 consecutive failures from all regions
HTTP: secondary pages	5 minutes	2 consecutive failures
SSL certificate	12 hours	At 90/60/30/7/1 days before expiry
Domain expiry	Daily	At 90/60/30/14 days before expiry
Heartbeat	Match cron schedule + 10–20% grace	1 missed expected ping
TCP port	5 minutes	2 consecutive failures
DNS record	15 minutes	Any value change

Requiring 2 consecutive failures before alerting eliminates most false positives caused by transient network issues. A monitor that checks every minute and requires 2 consecutive failures still alerts within about 2 minutes of a real outage, fast enough for most production incidents.
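The consecutive-failure rule amounts to a small counter that resets on any success. This is a sketch of the logic, not any particular tool's implementation; the class name is illustrative.

```python
class ConsecutiveFailureAlert:
    """Fires once a check has failed `threshold` times in a row."""

    def __init__(self, threshold: int = 2):
        self.threshold = threshold
        self.failures = 0

    def record(self, check_passed: bool) -> bool:
        """Record one check result; return True when an alert should fire."""
        if check_passed:
            # Any success resets the streak, so isolated blips never alert.
            self.failures = 0
            return False
        self.failures += 1
        # Fire exactly once, at the moment the streak reaches the threshold.
        return self.failures == self.threshold
```

A single failed check between two passing ones never alerts. Two failures in a row alert once, and further failures in the same streak stay quiet until a success resets the counter.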

Frequently Asked Questions

How many monitors do I need?

For a typical SaaS product, 15–25 monitors cover the essentials: 6–10 HTTP checks, 3–5 SSL certificates, 1–2 domain expiry monitors, 3–5 heartbeat monitors, and a handful of DNS and TCP checks. More monitors add coverage; they don't improve detection speed for the monitors you already have.

Should I monitor staging as well as production?

Monitor production first, completely. Staging monitors are useful for catching deployment issues before they reach production, but they're a secondary concern. A broken staging environment that hasn't been monitored for a week costs nothing; a broken production login endpoint that hasn't been monitored for an hour costs customers.

What check interval should I use?

Use 1 minute for anything customer-facing that generates revenue or blocks access, and 5 minutes for secondary pages. Checking more often than once a minute is rarely necessary: most outages aren't resolved in under a minute, so more frequent checks don't change your response time.

Do I need separate tools for each monitor type?

No. Vantaj monitors HTTP endpoints, SSL certificates, domain expiry, heartbeats, TCP ports, and DNS records from a single dashboard. The free tier covers 20 monitors across all types, enough to get full coverage for most small products.

For a deeper look at individual monitor types, see the guides on ICMP ping monitoring, heartbeat monitoring for cron jobs, and DNS monitoring.