What is a liveness check?

A liveness check determines whether a process is alive enough to keep running. If it repeatedly fails, restarting the process may be the right recovery action.

What is a readiness check?

A readiness check determines whether a service is ready to receive traffic. If it fails, the service should usually be removed from traffic until it becomes ready again.

What is a startup check?

A startup check gives slow-starting applications time to initialize before stricter liveness or readiness checks begin.

What is synthetic monitoring?

Synthetic monitoring tests user-facing behavior from the outside, such as loading a page, calling an API, checking TLS, or completing a login flow.

Should liveness checks include database checks?

Usually no. A liveness check should fail when restarting the process is likely to help. Database checks are often better suited for readiness, dependency checks, or alerts.

Does Raff support custom server health checks?

Yes. Raff Linux VMs provide full root access and Docker-ready infrastructure, so teams can run application health endpoints, reverse proxies, monitoring agents, and synthetic checks.

Server Health Checks Guide

Q: What is a server health check?

A server health check is an automated check that decides whether an application, service, or server should receive traffic, restart, alert the team, or be investigated.


For small teams, health checks are often treated as a small monitoring detail. In reality, they are reliability decisions. A bad health check can restart a healthy service, send users to an unready server, hide a broken customer journey, or wake the team up for noise. Raff Technologies gives developers full-root Linux VMs, Docker-ready infrastructure, and flexible server control, which makes it practical to design health checks around real workload behavior instead of relying on one generic “is the server up?” signal. (See: Raff Linux VM)

This guide is part of Raff’s observability and reliability series, alongside guides on observability, incident response, performance bottlenecks, reverse proxies, load balancers, high availability, and disaster recovery. It focuses on the missing decision layer: what each health check should prove, which action a failure should trigger, and when synthetic monitoring is more useful than internal status checks.

Health Checks Are Not All the Same

A health check is only useful when the team knows what decision it controls.

Some checks decide whether a process should be restarted. Some decide whether a service should receive traffic. Some decide whether a new instance has finished starting. Some test the product from the outside, like a user would.

Mixing these together creates dangerous behavior.

For example, a database outage should not always restart every application server. A slow startup should not be treated as a dead process. A service can be alive but not ready. A homepage can return 200 OK while checkout is broken.

That is why health checks need names and responsibilities.

Health check type	Main question	Typical decision
Liveness check	Is the process alive enough to keep running?	Restart if repeatedly unhealthy
Readiness check	Is the service ready to receive traffic?	Remove from traffic until ready
Startup check	Has the service finished initialization?	Wait before applying other checks
Synthetic check	Can a user-facing journey work from outside?	Alert or investigate user impact
Dependency check	Is a required dependency usable?	Degrade, stop traffic, or alert
Deep health check	Can critical app behavior complete?	Investigate business-impacting failure

The key rule: a health check should control one decision, not every decision.

If the same endpoint is used for restarts, traffic routing, uptime alerts, and business health, it will eventually create the wrong reaction.
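As a minimal sketch of "one check, one decision", the function below maps separate health paths to separate answers. The path names (/livez, /startupz, /readyz) and the AppState flags are illustrative assumptions, not a convention required by any particular platform.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class AppState:
    """Hypothetical in-process flags that the application updates itself."""
    started: bool = False      # initialization has finished
    draining: bool = False     # shutting down; only finish in-flight requests
    db_pool_ok: bool = False   # last known state of the database pool


def health_response(path: str, state: AppState) -> tuple[int, str]:
    """Map a health path to (HTTP status, body). Each path answers one question."""
    if path == "/livez":
        # Liveness: reaching this code at all shows the process can respond.
        return 200, "alive"
    if path == "/startupz":
        # Startup: has initialization completed?
        return (200, "started") if state.started else (503, "starting")
    if path == "/readyz":
        # Readiness: should this instance receive new traffic right now?
        ready = state.started and state.db_pool_ok and not state.draining
        return (200, "ready") if ready else (503, "not ready")
    return 404, "unknown health path"
```

With this split, a draining instance keeps answering 200 on the liveness path while the readiness path returns 503, so a supervisor leaves it running and a load balancer stops routing to it.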

The Server Health Check Decision Framework

Use this framework to decide which health check fits each reliability question.

Scenario	Best check type	What it should prove	What it should not do
App process is deadlocked	Liveness	The process cannot make progress	Check every external dependency
App is starting slowly	Startup	Initialization is still in progress	Restart too early
App is running but warming cache	Readiness	Traffic should wait until safe	Mark the process dead
Database is temporarily unavailable	Readiness or degraded state	App may not be able to serve full traffic	Restart every app instance immediately
Homepage loads but checkout fails	Synthetic	User journey is broken	Depend only on internal metrics
Load balancer needs healthy targets	Readiness	Instance can accept requests	Prove every business workflow
Team needs uptime signal	Synthetic	Public service is reachable from outside	Replace logs, metrics, or traces
Background worker is stuck	Liveness plus job metrics	Worker is alive and jobs are moving	Treat web endpoint health as worker health

The safest pattern is:

use liveness checks for process survival,
use readiness checks for traffic safety,
use startup checks for slow initialization,
use synthetic monitoring for user-visible availability,
and use observability to explain why a check failed.

Liveness Checks Answer: Should This Process Keep Running?

A liveness check decides whether a process is alive enough to continue running.

This is useful when the service can become stuck in a state where it is technically still running but cannot make progress. A deadlock, frozen event loop, exhausted worker pool, or unrecoverable internal state may require a restart.

Kubernetes’ official documentation explains that liveness probes determine when a container should be restarted. The same idea applies beyond Kubernetes: a liveness check should answer whether restarting the process is a reasonable recovery action. (Source: Kubernetes Liveness, Readiness, and Startup Probes)

A good liveness check is usually narrow.

Liveness check should include	Liveness check should avoid
Process can respond at all	Full database queries
Runtime is not deadlocked	External API calls
Event loop or worker can make progress	Payment provider checks
Basic internal state is not corrupted	Expensive business workflows
Minimal timeout-sensitive check	Slow dependency chains

The danger is making liveness too deep.

If the liveness check depends on the database, and the database has a short outage, every application instance may restart even though restarting does not fix the database. That can turn a dependency problem into a wider outage.

A practical rule: liveness should fail only when restarting this process is likely to help.
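One narrow way to detect "running but not making progress" is a heartbeat: the main loop records a timestamp after each unit of work, and the liveness check only compares that timestamp with the clock. The sketch below assumes a hypothetical Heartbeat helper and a silence limit chosen by the team; it touches no external dependency.

```python
import time


class Heartbeat:
    """Records the last time the main loop made progress."""

    def __init__(self, max_silence_seconds: float, clock=time.monotonic):
        self.max_silence_seconds = max_silence_seconds
        self._clock = clock          # injectable so the check can be tested
        self._last_beat = clock()

    def beat(self) -> None:
        """Call from the event loop or worker loop after each unit of work."""
        self._last_beat = self._clock()

    def is_live(self) -> bool:
        """True while the loop has reported progress recently enough."""
        return (self._clock() - self._last_beat) <= self.max_silence_seconds
```

Because the check reads only local state, a database or third-party outage cannot make it fail, which keeps restarts tied to problems a restart can actually fix.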

Readiness Checks Answer: Should This Instance Receive Traffic?

A readiness check decides whether an instance should receive traffic right now.

A service can be alive but not ready. It may be starting, warming caches, applying migrations, waiting for configuration, reconnecting to dependencies, or draining before shutdown. In those states, it may be better to keep the process running but remove it from traffic.

Kubernetes’ documentation explains that readiness probes determine when a container is ready to accept traffic, and a pod is not considered ready when the readiness probe fails. (Source: Kubernetes Liveness, Readiness, and Startup Probes)

Readiness is especially important behind reverse proxies and load balancers.

Readiness signal	Why it matters
App has loaded configuration	Avoids serving broken startup state
Required local services are available	Prevents traffic before app can work
Database connection pool is usable	Avoids routing requests that cannot complete
Cache or warmup process is complete	Prevents slow first-user impact
Instance is not draining	Avoids sending traffic during shutdown
Critical feature dependency is available	Prevents known broken workflows

Readiness should control routing, not restarts.

If readiness fails, the system should usually stop sending new traffic to that instance while it recovers. It should not immediately kill the process unless the liveness check also proves the process itself is unhealthy.

A practical rule: readiness should fail when the instance should not receive traffic, even if it should keep running.
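A readiness check can be sketched as a function that combines a few local signals and reports why traffic should wait. The field names below are assumptions chosen to mirror the readiness signals listed above; a real service would pick only the signals that matter for its own traffic.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class InstanceStatus:
    """Hypothetical snapshot of the signals that affect routing."""
    config_loaded: bool
    db_pool_usable: bool
    cache_warm: bool
    draining: bool


def readiness(status: InstanceStatus) -> tuple[bool, list[str]]:
    """Return (ready, reasons). Reasons explain why traffic should wait."""
    reasons = []
    if not status.config_loaded:
        reasons.append("configuration not loaded")
    if not status.db_pool_usable:
        reasons.append("database pool not usable")
    if not status.cache_warm:
        reasons.append("cache still warming")
    if status.draining:
        reasons.append("instance is draining")
    return (not reasons, reasons)
```

Returning the reasons alongside the verdict is a design choice: the router only needs the boolean, but the reasons make a failed readiness check easier to explain in logs.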

Startup Checks Protect Slow Services From Restart Loops

Some applications need time to start.

They may load large models, run migrations, warm caches, build indexes, connect to multiple services, or initialize a heavy runtime. If normal liveness checks start too early, the orchestrator or supervisor may think the app is dead and restart it repeatedly.

A startup check solves this by giving the service time to finish initialization before liveness and readiness checks become strict. Kubernetes’ documentation describes startup probes as a way to know when a container application has started; if configured, liveness and readiness checks do not start until the startup probe succeeds. (Source: Kubernetes Liveness, Readiness, and Startup Probes)

Startup checks are useful for:

Workload	Why startup checks help
Large web frameworks	Boot time can vary after deploy
JVM or .NET services	Runtime warmup can be slower
ML or AI services	Models may need loading
Databases or search services	Recovery and index checks can take time
Apps with migrations	Startup may include schema or state checks
Heavy container images	Initialization can exceed normal check timeout

A startup check should not hide a broken deployment forever. It should give realistic startup time, then fail clearly if the service never becomes usable.

A practical rule: startup checks protect initialization; they should not become an excuse for unknown boot behavior.
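A "realistic startup time" can be written down as a budget: how often the check runs multiplied by how many failures are tolerated. The sketch below borrows the parameter names period, failure threshold, and initial delay from Kubernetes probe settings, but the functions themselves are hypothetical helpers, not part of any orchestrator.

```python
def startup_budget_seconds(
    period_seconds: float,
    failure_threshold: int,
    initial_delay_seconds: float = 0.0,
) -> float:
    """Longest time a service may take to start before it is declared failed."""
    return initial_delay_seconds + period_seconds * failure_threshold


def startup_verdict(elapsed_seconds: float, started: bool, budget_seconds: float) -> str:
    """Classify a starting service without hiding a broken deployment forever."""
    if started:
        return "started"
    if elapsed_seconds <= budget_seconds:
        return "still starting"   # liveness and readiness should stay relaxed
    return "failed to start"      # fail clearly instead of waiting indefinitely
```

For example, a check every 10 seconds with 30 tolerated failures gives a 300 second budget; a service that usually boots in 40 seconds probably does not need that much, and a budget far above observed boot time only delays detection of a bad deploy.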

Synthetic Monitoring Answers: Can Users Actually Use It?

Synthetic monitoring tests the system from the outside.

Instead of asking whether the process is alive internally, synthetic monitoring asks whether an external user path works. It may check a homepage, login page, API endpoint, checkout flow, dashboard load, DNS resolution, TLS certificate, or multi-step journey.

Google’s SRE material distinguishes white-box monitoring from black-box monitoring. White-box monitoring uses internal system knowledge, while black-box monitoring tests externally visible behavior as a user would see it. (Source: Google SRE: Monitoring Distributed Systems)

Datadog describes synthetic monitoring as a proactive way to simulate user flows and requests to applications, endpoints, and network layers. (Source: Datadog Synthetic Monitoring)

Synthetic checks are useful because internal health can be misleading.

Internal system says...	But synthetic monitoring may reveal...
App process is running	Public endpoint is unreachable
Database is healthy	Login flow is broken
Server CPU is normal	DNS or TLS is failing
Load balancer has healthy targets	Checkout returns an error
API service is up	Auth provider integration is broken
All metrics look normal	Users in one region cannot connect

Synthetic monitoring is best for user-visible truth.

A practical rule: synthetic checks should test what customers care about, not every internal detail.
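A single-step synthetic check can be sketched with the Python standard library: request a public URL from outside the service, then verify the status code and a piece of expected content. The function names, the expected-text approach, and the 5 second timeout are assumptions for illustration; hosted synthetic monitoring products offer much richer multi-step checks.

```python
from __future__ import annotations

import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class SyntheticResult:
    ok: bool
    detail: str
    elapsed_seconds: float


def fetch_url(url: str, timeout: float) -> tuple[int, str]:
    """Fetch a URL the way an outside client would; return (status, body text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status code worth reporting.
        return err.code, ""


def run_synthetic_check(
    url: str, expected_text: str, timeout: float = 5.0, fetch=fetch_url
) -> SyntheticResult:
    """Pass only if the page is reachable, returns 200, and contains expected_text."""
    started = time.monotonic()
    try:
        status, body = fetch(url, timeout)
    except (OSError, ValueError) as err:
        # DNS, TLS, connection, and timeout errors surface as OSError subclasses.
        return SyntheticResult(False, f"request failed: {err}", time.monotonic() - started)
    elapsed = time.monotonic() - started
    if status != 200:
        return SyntheticResult(False, f"unexpected status {status}", elapsed)
    if expected_text not in body:
        return SyntheticResult(False, "expected text missing", elapsed)
    return SyntheticResult(True, "ok", elapsed)
```

Checking for expected text matters because a page can return 200 while showing an error message. The fetch parameter is injectable so the check logic can be exercised without network access.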

Shallow and Deep Checks Serve Different Purposes

Health checks can be shallow or deep.

A shallow check proves the service can respond quickly. A deep check proves more meaningful behavior, often involving dependencies or business logic.

Both are useful, but they should not control the same decisions.

Check depth	Example	Best use
Shallow liveness	Process responds quickly	Restart decision
Shallow readiness	App is initialized and accepting traffic	Load balancer routing
Dependency readiness	Database/cache is reachable	Traffic safety
Deep health check	Login, checkout, or API journey works	Synthetic monitoring or alerts
Business health check	Critical workflow produces expected result	Customer-impacting monitoring

A deep health check is more valuable but also more fragile. If it depends on several systems, it can fail for reasons that do not mean the app process should restart.

This is why a deep check is often better as a synthetic monitor or alert, not a liveness check.

A practical rule: the deeper the check, the more careful you should be about what action it triggers.

Dependency Checks Need Careful Boundaries

Dependencies matter, but they can create bad health-check behavior.

A web app may depend on a database, cache, queue, object storage, payment API, email provider, authentication service, and internal API. If the health check requires every dependency to be perfect, the service may appear down too often. If it ignores every dependency, it may receive traffic it cannot handle.

The right boundary depends on whether the dependency is required for the specific traffic the service receives.

Dependency	Health-check decision
Primary database	Often part of readiness if most requests require it
Cache	May be degraded if app can still work without it
Queue	Should affect worker health, not always web health
External payment API	Better as synthetic or feature-specific check
Object storage	Important for upload/download paths
Email provider	Usually not liveness; may be app-specific alert
Auth provider	Important for login readiness or synthetic login flow
Internal API	Depends on whether requests can degrade gracefully

A dependency outage should trigger the right response.

If the database is unavailable, stopping traffic may be appropriate. Restarting every web server usually is not. If the email provider is down, checkout may still work, but notifications may be delayed. If object storage is down, uploads may fail while other pages continue working.

A practical rule: dependency checks should match the feature impact, not the emotional desire to check everything.
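One way to encode these boundaries is to label each dependency by its impact and derive a single status from whichever ones are down. The two impact levels and the status strings below are assumptions for this sketch; a team may need finer categories, such as per-feature impact.

```python
from __future__ import annotations

from enum import Enum


class Impact(Enum):
    REQUIRED = "required"   # most requests fail without this dependency
    OPTIONAL = "optional"   # the app degrades but still serves useful responses


def evaluate_dependencies(
    impacts: dict[str, Impact], up: dict[str, bool]
) -> tuple[str, list[str]]:
    """Return (status, down) where status is 'ready', 'degraded', or 'not_ready'."""
    # A dependency with no reported state is treated as down.
    down = [name for name in impacts if not up.get(name, False)]
    if any(impacts[name] is Impact.REQUIRED for name in down):
        return "not_ready", down
    if down:
        return "degraded", down
    return "ready", []
```

In this model an email provider outage yields "degraded", which suits a warning, while a primary database outage yields "not_ready", which suits removing the instance from traffic. Neither result is a reason to restart the process.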

Health Checks Should Support Load Balancers and Reverse Proxies

Health checks are often used by reverse proxies and load balancers to decide which backend receives traffic.

Raff’s Reverse Proxy vs Load Balancer guide explains that reverse proxies and load balancers sit in front of applications and control traffic flow in different ways. Health checks make that traffic flow safer because unhealthy or unready backends can be removed from rotation. (See: Reverse Proxy vs Load Balancer)

For load-balanced systems, readiness matters more than simple process uptime.

Backend state	Better traffic decision
Starting	Do not send traffic yet
Ready	Send traffic
Draining	Stop new traffic, finish existing requests
Dependency degraded	Route only if app can serve useful responses
Liveness failed	Restart or replace
Synthetic check failed	Investigate customer-facing path

A backend can pass liveness and fail readiness. That is normal.

For example, a service may still be alive while it is draining connections before deployment. It should not be killed, but it should stop receiving new traffic. A readiness check supports that behavior.
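The backend states in the table above can be sketched as a small routing filter. The enum values and helper names are hypothetical; real reverse proxies and load balancers implement this logic internally from their own health-check configuration.

```python
from __future__ import annotations

from enum import Enum


class BackendState(Enum):
    STARTING = "starting"
    READY = "ready"
    DRAINING = "draining"
    LIVENESS_FAILED = "liveness_failed"


def accepts_new_traffic(state: BackendState) -> bool:
    """Only fully ready backends receive new requests."""
    return state is BackendState.READY


def should_restart(state: BackendState) -> bool:
    """Restart or replace only when liveness has failed."""
    return state is BackendState.LIVENESS_FAILED


def routable(backends: dict[str, BackendState]) -> list[str]:
    """Names of backends eligible for new requests, in a stable order."""
    return sorted(name for name, state in backends.items() if accepts_new_traffic(state))
```

A draining backend is neither routable nor restartable here, which matches the behavior described above: it finishes existing requests and is left alone.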

Alerting Should Not Page on Every Failed Check

Not every health-check failure deserves an urgent alert.

Some failures are expected during deploys, restarts, warmups, or short dependency blips. If every check failure pages the team, health checks become noise. If no check failure alerts the team, users may discover outages first.

Google SRE guidance emphasizes that monitoring should help decide which problems deserve human attention and which do not. (Source: Google SRE: Monitoring Distributed Systems)

A practical alerting model looks like this:

Signal	Alert urgency
One readiness failure during deploy	Usually no page
One instance fails liveness and restarts	Ticket or watch if isolated
Many instances fail readiness	High urgency
Synthetic user journey fails from multiple locations	High urgency
Startup check fails after realistic window	Investigate deployment
Dependency check degraded but app still works	Warning or ticket
Public endpoint unavailable	Page if customer-impacting

The best alert is tied to user impact.

A liveness failure on one worker may be low severity if redundancy exists. A synthetic login failure for all users may be urgent even if internal metrics look healthy.

A practical rule: health checks should inform alerts, but customer impact should decide urgency.
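The alerting model above can be approximated as a function from check signals to urgency. The thresholds used here (half of instances unready, two or more failing synthetic locations) are illustrative assumptions, not recommendations from the sources cited in this guide; each team should tune them to its own redundancy.

```python
def alert_urgency(
    unready_fraction: float,
    synthetic_failed_locations: int,
    deploy_in_progress: bool = False,
) -> str:
    """Return 'page', 'ticket', or 'none' based on likely customer impact."""
    # A user journey failing from several locations is likely a real outage.
    if synthetic_failed_locations >= 2:
        return "page"
    # Many unready instances threaten capacity even if journeys still pass.
    if unready_fraction >= 0.5:
        return "page"
    # Isolated failures outside a deploy deserve a look, but not a wake-up call.
    if unready_fraction > 0 and not deploy_in_progress:
        return "ticket"
    if synthetic_failed_locations == 1:
        return "ticket"
    return "none"
```

Note that a single unready instance during a deploy produces no alert at all, while the same signal outside a deploy opens a ticket.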

False Positives and False Negatives Are Both Dangerous

A false positive says the system is unhealthy when it is actually acceptable. A false negative says the system is healthy when users are actually affected.

Both are expensive.

Error type	Example	Result
False positive	Readiness fails because cache is briefly slow, but app can still serve traffic	Unnecessary traffic removal
False positive	Liveness depends on external API and restarts app during API outage	Restart loop
False negative	Health endpoint returns OK while checkout is broken	Users see failure first
False negative	App process responds but worker queue is stuck	Background work silently stops
False negative	Server is up but DNS is broken	External users cannot reach app

Good health-check design reduces both.

A shallow liveness check reduces false restarts. A useful readiness check reduces traffic to unready instances. Synthetic checks reduce false confidence from internal-only monitoring.
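A common way to cut false positives from brief blips is to act only after several consecutive failures. The sketch below assumes a hypothetical FailureThreshold helper with a default of three failures to trip and one success to recover; the right numbers depend on check frequency and how quickly the team needs to react.

```python
class FailureThreshold:
    """Turns noisy individual check results into a steadier healthy/unhealthy state."""

    def __init__(self, failures_to_trip: int = 3, successes_to_recover: int = 1):
        self.failures_to_trip = failures_to_trip
        self.successes_to_recover = successes_to_recover
        self.healthy = True
        self._failures = 0    # consecutive failed checks
        self._successes = 0   # consecutive passed checks

    def record(self, check_passed: bool) -> bool:
        """Record one check result and return the current healthy state."""
        if check_passed:
            self._failures = 0
            self._successes += 1
            if not self.healthy and self._successes >= self.successes_to_recover:
                self.healthy = True
        else:
            self._successes = 0
            self._failures += 1
            if self.healthy and self._failures >= self.failures_to_trip:
                self.healthy = False
        return self.healthy
```

The trade-off is detection delay: a higher threshold ignores more blips but takes longer to react to a real failure, which pushes the risk toward false negatives.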

Health Checks Should Be Different for Web Apps, Workers, and Databases

Different workloads need different health checks.

A web application receives user traffic. A background worker processes jobs. A database stores state. A reverse proxy routes traffic. Treating them all with the same health endpoint creates confusion.

Workload	Best health signal
Web app	Liveness, readiness, public synthetic endpoint
API	Readiness, dependency checks, synthetic API check
Background worker	Worker process liveness, queue progress, job failure rate
Database	Connection availability, replication, disk, memory, backup status
Cache	Connection and response check, but not always app liveness
Reverse proxy	Backend availability and public endpoint checks
Scheduled jobs	Last successful run and duration
WebSocket service	Active connections, reconnect rate, message latency

For background workers, a web health endpoint is not enough. A worker can be alive but not processing jobs. For scheduled jobs, the health question is not whether a port responds; it is whether the job ran successfully on time.

A practical rule: health checks should match the workload’s responsibility.
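For workers and scheduled jobs, the health signal is progress rather than a responding port. The two helper functions below are a sketch under assumed inputs (age of the last successful run, current queue depth, jobs completed since the previous check) that a team would collect from its own job system.

```python
def scheduled_job_healthy(
    last_success_age_seconds: float,
    expected_interval_seconds: float,
    grace_seconds: float = 0.0,
) -> bool:
    """Healthy if the job last succeeded within its interval plus a grace period."""
    return last_success_age_seconds <= expected_interval_seconds + grace_seconds


def worker_progressing(queue_depth: int, jobs_completed_since_last_check: int) -> bool:
    """An empty queue is fine; a backlog with no completions suggests a stuck worker."""
    return queue_depth == 0 or jobs_completed_since_last_check > 0
```

For an hourly job with a 10 minute grace period, a last success 90 minutes ago fails the check even if the scheduler process itself is still running.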

Synthetic Monitoring Should Cover Critical User Journeys

Synthetic monitoring becomes more valuable when it covers the paths that matter most.

A homepage check is useful, but it may not prove the product works. For a SaaS application, login may matter more. For an API platform, an authenticated API call may matter more. For an e-commerce app, checkout matters more. For a control panel, VM creation or dashboard loading may matter more.

Product type	Useful synthetic check
Marketing site	Homepage loads and TLS is valid
SaaS app	Login and dashboard load
API platform	Authenticated API request returns expected response
E-commerce app	Product page and checkout path
Developer tool	API, docs, and status endpoint
Real-time app	WebSocket connect and basic message flow
Admin panel	Restricted login page availability
File app	Upload or download path

Synthetic checks should not test every feature at high frequency. That can create noise, cost, and false alarms. They should test the small number of user journeys that prove the service is usable.

A practical rule: synthetic monitoring should represent the customer experience, not the developer’s curiosity.

How Health Checks Apply on Raff

Raff gives teams the server-level control needed to design health checks around their actual application.

On a Raff Linux VM, developers can run application processes, Docker containers, reverse proxies, monitoring agents, cron jobs, workers, and custom health endpoints. Raff Linux VMs provide full root access, SSH key authentication, Docker-ready infrastructure, NVMe SSD storage, unmetered bandwidth, and deployment in under 60 seconds. (See: Raff Linux VM)

A practical Raff health-check model looks like this:

Need	Raff-friendly approach
Simple web app	Basic liveness and readiness endpoint
Docker app	Container health checks plus app-level readiness
Reverse proxy	Backend readiness and public synthetic checks
Background worker	Worker liveness plus queue progress
Production API	Readiness, dependency checks, synthetic API probe
Deployment safety	Startup checks and readiness before traffic
Incident response	Preserve health-check events with logs and metrics
Performance review	Combine health with CPU, RAM, disk, and network metrics

Health checks should not replace observability. They should work with observability.

Raff’s Observability guide explains metrics, logs, and traces as the signals that help teams understand system behavior. Health checks answer the first operational question: is this instance usable right now? Observability answers the next question: why?

Serdar’s infrastructure angle is direct: a health check is only as good as the action it triggers. If the action is wrong, the health check can create downtime instead of preventing it.

Common Health Check Mistakes

Using one endpoint for everything.
Liveness, readiness, startup, and synthetic checks should not all mean the same thing.

Making liveness too deep.
A liveness check that depends on every external service can restart healthy apps during dependency outages.

Making readiness too shallow.
A service that returns “OK” before it can serve traffic creates bad deployments and user errors.

Ignoring startup time.
Slow-starting apps can be restarted repeatedly if startup checks are not designed realistically.

Only monitoring from inside the server.
Internal metrics can look healthy while users cannot reach the app.

Alerting on every check failure.
Health checks should reduce noise, not create it.

Forgetting workers and scheduled jobs.
A website can be healthy while background processing is stuck.

Not reviewing checks after incidents.
Every incident should teach the team whether health checks were too shallow, too deep, or missing.

A Practical Health Check Policy for Small Teams

A small-team health check policy should be simple enough to follow.

Policy area	Recommended baseline
Liveness	Check whether restarting the process would help
Readiness	Check whether the instance should receive traffic
Startup	Give slow services enough time to initialize
Dependencies	Include only dependencies that affect the decision being made
Synthetic monitoring	Test critical user journeys from outside the system
Workers	Track process health and job progress
Alerts	Page on customer impact, not every isolated check failure
Deployment	Use readiness to avoid sending traffic too early
Review	Update checks after incidents and major architecture changes

The goal is not to add every possible check. The goal is to create the few checks that make production safer.

Good Health Checks Make Failure Boring

Server health checks are reliability controls.

Liveness checks keep stuck processes from staying stuck. Readiness checks keep traffic away from unready instances. Startup checks prevent slow services from being restarted too early. Synthetic monitoring proves whether users can actually reach and use the product.

For related reading, see Raff’s Observability for Small Teams guide, Server Incident Response guide, Performance Bottlenecks guide, Reverse Proxy vs Load Balancer guide, High Availability vs Disaster Recovery guide, and Auto-Scaling VM Planning guide.

On Raff, the practical path is to start with simple, accurate health checks, connect them to the right actions, and expand only when the workload proves it needs more detail. A good health check should make failure easier to detect, easier to route around, and easier to recover from.

Server Health Checks Explained: Liveness, Readiness, and Synthetic Monitoring

Key Takeaways