Server health checks are automated checks that decide whether an application, service, or server should receive traffic, restart, alert the team, or be investigated.
For small teams, health checks are often treated as a minor monitoring detail. In reality, they are reliability decisions. A bad health check can restart a healthy service, send users to an unready server, hide a broken customer journey, or wake the team up for noise. Raff Technologies gives developers Linux VMs with full root access, Docker-ready infrastructure, and flexible server control, which makes it practical to design health checks around real workload behavior instead of relying on one generic “is the server up?” signal. (Source: Raff Linux VM)
This guide is part of Raff’s observability and reliability series. Raff already has guides on observability, incident response, performance bottlenecks, reverse proxies, load balancers, high availability, and disaster recovery. This guide focuses on the missing decision layer: what each health check should prove, which failures it should trigger, and when synthetic monitoring is more useful than internal status checks.
Health Checks Are Not All the Same
A health check is only useful when the team knows what decision it controls.
Some checks decide whether a process should be restarted. Some decide whether a service should receive traffic. Some decide whether a new instance has finished starting. Some test the product from the outside, like a user would.
Mixing these together creates dangerous behavior.
For example, a database outage should not always restart every application server. A slow startup should not be treated as a dead process. A service can be alive but not ready. A homepage can return 200 OK while checkout is broken.
That is why health checks need names and responsibilities.
| Health check type | Main question | Typical decision |
|---|---|---|
| Liveness check | Is the process alive enough to keep running? | Restart if repeatedly unhealthy |
| Readiness check | Is the service ready to receive traffic? | Remove from traffic until ready |
| Startup check | Has the service finished initialization? | Wait before applying other checks |
| Synthetic check | Can a user-facing journey work from outside? | Alert or investigate user impact |
| Dependency check | Is a required dependency usable? | Degrade, stop traffic, or alert |
| Deep health check | Can critical app behavior complete? | Investigate business-impacting failure |
The key rule: a health check should control one decision, not every decision.
If the same endpoint is used for restarts, traffic routing, uptime alerts, and business health, it will eventually create the wrong reaction.
The Server Health Check Decision Framework
Use this framework to decide which health check fits each reliability question.
| Scenario | Best check type | What it should prove | What it should not do |
|---|---|---|---|
| App process is deadlocked | Liveness | The process cannot make progress | Check every external dependency |
| App is starting slowly | Startup | Initialization is still in progress | Restart too early |
| App is running but warming cache | Readiness | Traffic should wait until safe | Mark the process dead |
| Database is temporarily unavailable | Readiness or degraded state | App may not be able to serve full traffic | Restart every app instance immediately |
| Homepage loads but checkout fails | Synthetic | User journey is broken | Depend only on internal metrics |
| Load balancer needs healthy targets | Readiness | Instance can accept requests | Prove every business workflow |
| Team needs uptime signal | Synthetic | Public service is reachable from outside | Replace logs, metrics, or traces |
| Background worker is stuck | Liveness plus job metrics | Worker is alive and jobs are moving | Treat web endpoint health as worker health |
The safest pattern is:
- use liveness checks for process survival,
- use readiness checks for traffic safety,
- use startup checks for slow initialization,
- use synthetic monitoring for user-visible availability,
- and use observability to explain why a check failed.
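As a minimal sketch of the "one check, one decision" rule, the mapping below pairs each check type with the single action a failure is allowed to trigger. The enum names and the `action_for` helper are hypothetical and exist only for illustration; a real setup would wire these decisions to its own supervisor, proxy, and alerting tools.

```python
from enum import Enum


class CheckType(Enum):
    LIVENESS = "liveness"
    READINESS = "readiness"
    STARTUP = "startup"
    SYNTHETIC = "synthetic"


class Action(Enum):
    RESTART_PROCESS = "restart_process"
    REMOVE_FROM_TRAFFIC = "remove_from_traffic"
    KEEP_WAITING = "keep_waiting"
    ALERT_ON_USER_IMPACT = "alert_on_user_impact"


# Each check type controls exactly one decision (illustrative names, not a standard).
ACTION_FOR_FAILURE = {
    CheckType.LIVENESS: Action.RESTART_PROCESS,
    CheckType.READINESS: Action.REMOVE_FROM_TRAFFIC,
    CheckType.STARTUP: Action.KEEP_WAITING,
    CheckType.SYNTHETIC: Action.ALERT_ON_USER_IMPACT,
}


def action_for(check: CheckType) -> Action:
    """Return the single action a failing check of this type may trigger."""
    return ACTION_FOR_FAILURE[check]
```

Keeping the mapping explicit makes it harder for a readiness failure to end up triggering a restart by accident.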
Liveness Checks Answer: Should This Process Keep Running?
A liveness check decides whether a process is alive enough to continue running.
This is useful when the service can become stuck in a state where it is technically still running but cannot make progress. A deadlock, frozen event loop, exhausted worker pool, or unrecoverable internal state may require a restart.
Kubernetes’ official documentation explains that liveness probes determine when a container should be restarted. The same idea applies beyond Kubernetes: a liveness check should answer whether restarting the process is a reasonable recovery action. (Source: Kubernetes Liveness, Readiness, and Startup Probes)
A good liveness check is usually narrow.
| Liveness check should include | Liveness check should avoid |
|---|---|
| Process can respond at all | Full database queries |
| Runtime is not deadlocked | External API calls |
| Event loop or worker can make progress | Payment provider checks |
| Basic internal state is not corrupted | Expensive business workflows |
| Minimal timeout-sensitive check | Slow dependency chains |
The danger is making liveness too deep.
If the liveness check depends on the database, and the database has a short outage, every application instance may restart even though restarting does not fix the database. That can turn a dependency problem into a wider outage.
A practical rule: liveness should fail only when restarting this process is likely to help.
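One way to keep liveness narrow is a heartbeat: the main loop or worker records progress, and the check only asks whether that progress is recent. The sketch below assumes a hypothetical `Heartbeat` class and a `max_silence_seconds` threshold; neither comes from a specific framework, and the clock is injectable only so the behavior can be tested.

```python
import time
from typing import Callable


class Heartbeat:
    """Tracks the last time the main loop or a worker made progress."""

    def __init__(self, max_silence_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max_silence = max_silence_seconds
        self._clock = clock
        self._last_beat = clock()

    def beat(self) -> None:
        """Call this from the event loop or worker after each unit of work."""
        self._last_beat = self._clock()

    def is_alive(self) -> bool:
        """Liveness only: no database, no external APIs, no business logic."""
        return (self._clock() - self._last_beat) <= self._max_silence
```

Because the check never touches a dependency, a database outage cannot make it fail. It fails only when the process itself stops making progress, which is the case a restart can plausibly fix.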
Readiness Checks Answer: Should This Instance Receive Traffic?
A readiness check decides whether an instance should receive traffic right now.
A service can be alive but not ready. It may be starting, warming caches, applying migrations, waiting for configuration, reconnecting to dependencies, or draining before shutdown. In those states, it may be better to keep the process running but remove it from traffic.
Kubernetes’ documentation explains that readiness probes determine when a container is ready to accept traffic, and a pod is not considered ready when the readiness probe fails. (Source: Kubernetes Liveness, Readiness, and Startup Probes)
Readiness is especially important behind reverse proxies and load balancers.
| Readiness signal | Why it matters |
|---|---|
| App has loaded configuration | Avoids serving broken startup state |
| Required local services are available | Prevents traffic before app can work |
| Database connection pool is usable | Avoids routing requests that cannot complete |
| Cache or warmup process is complete | Prevents slow first-user impact |
| Instance is not draining | Avoids sending traffic during shutdown |
| Critical feature dependency is available | Prevents known broken workflows |
Readiness should control routing, not restarts.
If readiness fails, the system should usually stop sending new traffic to that instance while it recovers. It should not immediately kill the process unless the liveness check also proves the process itself is unhealthy.
A practical rule: readiness should fail when the instance should not receive traffic, even if it should keep running.
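A readiness check can be sketched as a small set of flags that the application flips as it starts, warms up, and drains. The `ReadinessState` class and its field names below are assumptions made for illustration, and which signals belong in it depends on the workload.

```python
from dataclasses import dataclass


@dataclass
class ReadinessState:
    config_loaded: bool = False
    db_pool_usable: bool = False
    warmup_complete: bool = False
    draining: bool = False

    def is_ready(self) -> bool:
        """Ready means every required signal is true and the instance is not draining."""
        return (self.config_loaded and self.db_pool_usable
                and self.warmup_complete and not self.draining)

    def http_status(self) -> int:
        # 503 asks a proxy or load balancer to stop routing here.
        # The process itself keeps running; only traffic is affected.
        return 200 if self.is_ready() else 503
```

Setting `draining = True` before shutdown is enough to take the instance out of rotation without touching its liveness result.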
Startup Checks Protect Slow Services From Restart Loops
Some applications need time to start.
They may load large models, run migrations, warm caches, build indexes, connect to multiple services, or initialize a heavy runtime. If normal liveness checks start too early, the orchestrator or supervisor may think the app is dead and restart it repeatedly.
A startup check solves this by giving the service time to finish initialization before liveness and readiness checks become strict. Kubernetes’ documentation describes startup probes as a way to know when a container application has started; if configured, liveness and readiness checks do not start until the startup probe succeeds. (Source: Kubernetes Liveness, Readiness, and Startup Probes)
Startup checks are useful for:
| Workload | Why startup checks help |
|---|---|
| Large web frameworks | Boot time can vary after deploy |
| JVM or .NET services | Runtime warmup can be slower |
| ML or AI services | Models may need loading |
| Databases or search services | Recovery and index checks can take time |
| Apps with migrations | Startup may include schema or state checks |
| Heavy container images | Initialization can exceed normal check timeout |
A startup check should not hide a broken deployment forever. It should allow a realistic startup window, then fail clearly if the service never becomes usable.
A practical rule: startup checks protect initialization; they should not become an excuse for unknown boot behavior.
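The idea of a bounded startup window can be sketched as a small gate with three states. The `StartupGate` class, its `budget_seconds` parameter, and the status strings are hypothetical; the budget should come from measured boot times rather than a guess.

```python
import time
from typing import Callable


class StartupGate:
    """Holds back stricter checks until startup succeeds or its budget runs out."""

    def __init__(self, budget_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._budget = budget_seconds
        self._clock = clock
        self._began = clock()
        self._started = False

    def mark_started(self) -> None:
        """Call once initialization (migrations, model load, warmup) has finished."""
        self._started = True

    def status(self) -> str:
        if self._started:
            return "started"
        if self._clock() - self._began <= self._budget:
            return "starting"  # still inside the realistic startup window
        return "failed"  # never became usable: fail clearly instead of waiting forever
```

A supervisor would apply liveness and readiness only after `status()` returns `"started"`, and treat `"failed"` as a deployment problem to investigate.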
Synthetic Monitoring Answers: Can Users Actually Use It?
Synthetic monitoring tests the system from the outside.
Instead of asking whether the process is alive internally, synthetic monitoring asks whether an external user path works. It may check a homepage, login page, API endpoint, checkout flow, dashboard load, DNS resolution, TLS certificate, or multi-step journey.
Google’s SRE material distinguishes white-box monitoring from black-box monitoring. White-box monitoring uses internal system knowledge, while black-box monitoring tests externally visible behavior as a user would see it. (Source: Google SRE: Monitoring Distributed Systems)
Datadog describes synthetic monitoring as a proactive way to simulate user flows and requests to applications, endpoints, and network layers. (Source: Datadog Synthetic Monitoring)
Synthetic checks are useful because internal health can be misleading.
| Internal system says... | But synthetic monitoring may reveal... |
|---|---|
| App process is running | Public endpoint is unreachable |
| Database is healthy | Login flow is broken |
| Server CPU is normal | DNS or TLS is failing |
| Load balancer has healthy targets | Checkout returns an error |
| API service is up | Auth provider integration is broken |
| All metrics look normal | Users in one region cannot connect |
Synthetic monitoring is best for user-visible truth.
A practical rule: synthetic checks should test what customers care about, not every internal detail.
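A very small external probe might look like the sketch below. It uses only Python's standard library. The function names `http_fetch` and `synthetic_check`, the five-second timeout, and the idea of matching an expected piece of page text are all assumptions for illustration, and the fetcher is injectable so the logic can be exercised without a network.

```python
import urllib.error
import urllib.request
from typing import Callable, Tuple

Fetcher = Callable[[str, float], Tuple[int, str]]


def http_fetch(url: str, timeout: float) -> Tuple[int, str]:
    """Fetch a URL from outside the app; returns (status_code, body_text)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        return err.code, ""  # server answered with an error status
    except OSError:
        return 0, ""  # unreachable: DNS, TLS, connection, or timeout failure


def synthetic_check(url: str, expected_text: str,
                    fetch: Fetcher = http_fetch, timeout: float = 5.0) -> bool:
    """Pass only if the page is reachable AND shows what a user expects to see."""
    status, body = fetch(url, timeout)
    return status == 200 and expected_text in body
```

Checking for expected content, not just a 200 status, is what lets the probe catch a page that loads but shows an error. The probe should run from a machine outside the application's own server so that DNS, TLS, and routing are part of what is tested.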
Shallow and Deep Checks Serve Different Purposes
Health checks can be shallow or deep.
A shallow check proves the service can respond quickly. A deep check proves more meaningful behavior, often involving dependencies or business logic.
Both are useful, but they should not control the same decisions.
| Check depth | Example | Best use |
|---|---|---|
| Shallow liveness | Process responds quickly | Restart decision |
| Shallow readiness | App is initialized and accepting traffic | Load balancer routing |
| Dependency readiness | Database/cache is reachable | Traffic safety |
| Deep health check | Login, checkout, or API journey works | Synthetic monitoring or alerts |
| Business health check | Critical workflow produces expected result | Customer-impacting monitoring |
A deep health check is more valuable but also more fragile. If it depends on several systems, it can fail for reasons that do not mean the app process should restart.
This is why a deep check is often better as a synthetic monitor or alert, not a liveness check.
A practical rule: the deeper the check, the more careful you should be about what action it triggers.
Dependency Checks Need Careful Boundaries
Dependencies matter, but they can create bad health-check behavior.
A web app may depend on a database, cache, queue, object storage, payment API, email provider, authentication service, and internal API. If the health check requires every dependency to be perfect, the service may appear down too often. If it ignores every dependency, it may receive traffic it cannot handle.
The right boundary depends on whether the dependency is required for the specific traffic the service receives.
| Dependency | Health-check decision |
|---|---|
| Primary database | Often part of readiness if most requests require it |
| Cache | May be degraded if app can still work without it |
| Queue | Should affect worker health, not always web health |
| External payment API | Better as synthetic or feature-specific check |
| Object storage | Important for upload/download paths |
| Email provider | Usually not liveness; may be app-specific alert |
| Auth provider | Important for login readiness or synthetic login flow |
| Internal API | Depends on whether requests can degrade gracefully |
A dependency outage should trigger the right response.
If the database is unavailable, stopping traffic may be appropriate. Restarting every web server usually is not. If the email provider is down, checkout may still work, but notifications may be delayed. If object storage is down, uploads may fail while other pages continue working.
A practical rule: dependency checks should match the feature impact, not the urge to check everything.
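The boundary between required and optional dependencies can be sketched as a three-state result. The `Health` enum and `evaluate_dependencies` function are hypothetical, and which dependencies count as required is an assumption each team has to make per service.

```python
from enum import Enum
from typing import Dict, Set


class Health(Enum):
    READY = "ready"
    DEGRADED = "degraded"
    NOT_READY = "not_ready"


def evaluate_dependencies(results: Dict[str, bool], required: Set[str]) -> Health:
    """Classify health from per-dependency results (True means usable)."""
    failed = {name for name, ok in results.items() if not ok}
    # A required dependency with no result at all is treated as failed.
    failed |= required - set(results)
    if failed & required:
        return Health.NOT_READY  # stop routing traffic that cannot complete
    if failed:
        return Health.DEGRADED  # keep serving, raise a warning or ticket
    return Health.READY
```

For example, with `required={"database"}`, a failed cache yields `DEGRADED` while a failed database yields `NOT_READY`. Neither result says anything about restarting the process.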
Health Checks Should Support Load Balancers and Reverse Proxies
Health checks are often used by reverse proxies and load balancers to decide which backend receives traffic.
Raff’s Reverse Proxy vs Load Balancer guide explains that reverse proxies and load balancers sit in front of applications and control traffic flow in different ways. Health checks make that traffic flow safer because unhealthy or unready backends can be removed from rotation.
For load-balanced systems, readiness matters more than simple process uptime.
| Backend state | Better traffic decision |
|---|---|
| Starting | Do not send traffic yet |
| Ready | Send traffic |
| Draining | Stop new traffic, finish existing requests |
| Dependency degraded | Route only if app can serve useful responses |
| Liveness failed | Restart or replace |
| Synthetic check failed | Investigate customer-facing path |
A backend can pass liveness and fail readiness. That is normal.
For example, a service may still be alive while it is draining connections before deployment. It should not be killed, but it should stop receiving new traffic. A readiness check supports that behavior.
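The difference between "stop routing" and "restart" can be shown with a tiny backend pool. The `Backend` dataclass and the two helper functions are illustrative assumptions, not the API of any particular proxy or load balancer.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backend:
    name: str
    alive: bool  # liveness result
    ready: bool  # readiness result (False while starting or draining)


def routable(backends: List[Backend]) -> List[str]:
    """Backends that may receive new traffic."""
    return [b.name for b in backends if b.alive and b.ready]


def to_restart(backends: List[Backend]) -> List[str]:
    """Only a liveness failure justifies a restart or replacement."""
    return [b.name for b in backends if not b.alive]
```

A draining backend (`alive=True, ready=False`) appears in neither list: it gets no new traffic, and nothing kills it while it finishes existing requests.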
Alerting Should Not Page on Every Failed Check
Not every health-check failure deserves an urgent alert.
Some failures are expected during deploys, restarts, warmups, or short dependency blips. If every check failure pages the team, health checks become noise. If no check failure alerts the team, users may discover outages first.
Google SRE guidance emphasizes that monitoring should help decide which problems deserve human attention and which do not. (Source: Google SRE: Monitoring Distributed Systems)
A practical alerting model looks like this:
| Signal | Alert urgency |
|---|---|
| One readiness failure during deploy | Usually no page |
| One instance fails liveness and restarts | Ticket or watch if isolated |
| Many instances fail readiness | High urgency |
| Synthetic user journey fails from multiple locations | High urgency |
| Startup check fails after realistic window | Investigate deployment |
| Dependency check degraded but app still works | Warning or ticket |
| Public endpoint unavailable | Page if customer-impacting |
The best alert is tied to user impact.
A liveness failure on one worker may be low severity if redundancy exists. A synthetic login failure for all users may be urgent even if internal metrics look healthy.
A practical rule: health checks should inform alerts, but customer impact should decide urgency.
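The alerting model above can be approximated with a small decision function. This is a sketch under stated assumptions: the `alert_urgency` name, the "two or more locations" rule, and the 50% instance threshold are illustrative values, not recommendations from the sources cited in this guide.

```python
from enum import Enum


class Urgency(Enum):
    NONE = "none"
    TICKET = "ticket"
    PAGE = "page"


def alert_urgency(failing_instances: int, total_instances: int,
                  synthetic_failed_locations: int,
                  deploy_in_progress: bool = False) -> Urgency:
    # A user journey failing from several locations means customers are affected.
    if synthetic_failed_locations >= 2:
        return Urgency.PAGE
    # Many instances failing at once threatens capacity (assumed threshold: half).
    if total_instances > 0 and failing_instances / total_instances >= 0.5:
        return Urgency.PAGE
    if failing_instances == 0 and synthetic_failed_locations == 0:
        return Urgency.NONE
    # A single instance failing during a deploy is expected churn.
    if deploy_in_progress and failing_instances == 1 and synthetic_failed_locations == 0:
        return Urgency.NONE
    return Urgency.TICKET
```

The order of the rules matters: user-visible failure is evaluated first, so a healthy-looking fleet never suppresses a failing synthetic journey.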
False Positives and False Negatives Are Both Dangerous
A false positive says the system is unhealthy when it is actually acceptable. A false negative says the system is healthy when users are actually affected.
Both are expensive.
| Error type | Example | Result |
|---|---|---|
| False positive | Readiness fails because cache is briefly slow, but app can still serve traffic | Unnecessary traffic removal |
| False positive | Liveness depends on external API and restarts app during API outage | Restart loop |
| False negative | Health endpoint returns OK while checkout is broken | Users see failure first |
| False negative | App process responds but worker queue is stuck | Background work silently stops |
| False negative | Server is up but DNS is broken | External users cannot reach app |
Good health-check design reduces both.
A shallow liveness check reduces false restarts. A useful readiness check reduces traffic to unready instances. Synthetic checks reduce false confidence from internal-only monitoring.
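One common way to reduce false positives from brief blips is to require several consecutive failures before changing state, and several consecutive successes before recovering. The `FailureThreshold` class and its default values of 3 and 2 are assumptions for illustration.

```python
class FailureThreshold:
    """Flips state only after a run of consecutive results, damping brief blips."""

    def __init__(self, failures_to_trip: int = 3, successes_to_recover: int = 2) -> None:
        self._failures_to_trip = failures_to_trip
        self._successes_to_recover = successes_to_recover
        self._fail_streak = 0
        self._ok_streak = 0
        self.healthy = True

    def record(self, ok: bool) -> bool:
        """Record one check result and return the current health state."""
        if ok:
            self._fail_streak = 0
            self._ok_streak += 1
            if not self.healthy and self._ok_streak >= self._successes_to_recover:
                self.healthy = True
        else:
            self._ok_streak = 0
            self._fail_streak += 1
            if self.healthy and self._fail_streak >= self._failures_to_trip:
                self.healthy = False
        return self.healthy
```

The trade-off is detection delay: a higher threshold means fewer false positives but a longer wait before a real failure is acted on, so the values should reflect how often the check runs.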
Health Checks Should Be Different for Web Apps, Workers, and Databases
Different workloads need different health checks.
A web application receives user traffic. A background worker processes jobs. A database stores state. A reverse proxy routes traffic. Treating them all with the same health endpoint creates confusion.
| Workload | Best health signal |
|---|---|
| Web app | Liveness, readiness, public synthetic endpoint |
| API | Readiness, dependency checks, synthetic API check |
| Background worker | Worker process liveness, queue progress, job failure rate |
| Database | Connection availability, replication, disk, memory, backup status |
| Cache | Connection and response check, but not always app liveness |
| Reverse proxy | Backend availability and public endpoint checks |
| Scheduled jobs | Last successful run and duration |
| WebSocket service | Active connections, reconnect rate, message latency |
For background workers, a web health endpoint is not enough. A worker can be alive but not processing jobs. For scheduled jobs, the health question is not whether a port responds; it is whether the job ran successfully on time.
A practical rule: health checks should match the workload’s responsibility.
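For scheduled jobs, the health question can be sketched as a freshness check on the last successful run. The function name `scheduled_job_is_healthy` and the `grace` allowance are hypothetical; the sketch assumes the job records its last success time somewhere the check can read it.

```python
from datetime import datetime, timedelta
from typing import Optional


def scheduled_job_is_healthy(last_success: Optional[datetime],
                             expected_interval: timedelta,
                             grace: timedelta,
                             now: datetime) -> bool:
    """Healthy means the job succeeded recently enough, not that a port responds."""
    if last_success is None:
        return False  # the job has never completed successfully
    return now - last_success <= expected_interval + grace
```

The same pattern works for background workers if "last success" is replaced with the time the worker last completed a job from the queue.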
Synthetic Monitoring Should Cover Critical User Journeys
Synthetic monitoring becomes more valuable when it covers the paths that matter most.
A homepage check is useful, but it may not prove the product works. For a SaaS application, login may matter more. For an API platform, an authenticated API call may matter more. For an e-commerce app, checkout matters more. For a control panel, VM creation or dashboard loading may matter more.
| Product type | Useful synthetic check |
|---|---|
| Marketing site | Homepage loads and TLS is valid |
| SaaS app | Login and dashboard load |
| API platform | Authenticated API request returns expected response |
| E-commerce app | Product page and checkout path |
| Developer tool | API, docs, and status endpoint |
| Real-time app | WebSocket connect and basic message flow |
| Admin panel | Restricted login page availability |
| File app | Upload or download path |
Synthetic checks should not test every feature at high frequency. That can create noise, cost, and false alarms. They should test the small number of user journeys that prove the service is usable.
A practical rule: synthetic monitoring should represent the customer experience, not the developer’s curiosity.
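A multi-step journey can be sketched as an ordered list of named steps, where the result is the first step that fails. The `run_journey` function and the step names in the usage example are hypothetical; each step would wrap a real request against the product in practice.

```python
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], bool]]


def run_journey(steps: List[Step]) -> Optional[str]:
    """Run steps in order; return the name of the first failing step, or None."""
    for name, step in steps:
        try:
            if not step():
                return name
        except Exception:
            # An exception in a step counts as a failure of that step.
            return name
    return None
```

For a SaaS app, a journey might be `[("load login page", ...), ("submit credentials", ...), ("load dashboard", ...)]`. Reporting the first failing step gives the alert a useful starting point without adding more checks.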
How Health Checks Apply on Raff
Raff gives teams the server-level control needed to design health checks around their actual application.
On a Raff Linux VM, developers can run application processes, Docker containers, reverse proxies, monitoring agents, cron jobs, workers, and custom health endpoints. Raff Linux VMs provide full root access, SSH key authentication, Docker-ready infrastructure, NVMe SSD storage, unmetered bandwidth, and deployment in under 60 seconds. (Source: Raff Linux VM)
A practical Raff health-check model looks like this:
| Need | Raff-friendly approach |
|---|---|
| Simple web app | Basic liveness and readiness endpoint |
| Docker app | Container health checks plus app-level readiness |
| Reverse proxy | Backend readiness and public synthetic checks |
| Background worker | Worker liveness plus queue progress |
| Production API | Readiness, dependency checks, synthetic API probe |
| Deployment safety | Startup checks and readiness before traffic |
| Incident response | Preserve health-check events with logs and metrics |
| Performance review | Combine health with CPU, RAM, disk, and network metrics |
Health checks should not replace observability. They should work with observability.
Raff’s Observability guide explains metrics, logs, and traces as the signals that help teams understand system behavior. Health checks answer the first operational question: is this instance usable right now? Observability answers the next question: why?
Serdar’s infrastructure angle is direct: a health check is only as good as the action it triggers. If the action is wrong, the health check can create downtime instead of preventing it.
Common Health Check Mistakes
Using one endpoint for everything.
Liveness, readiness, startup, and synthetic checks should not all mean the same thing.
Making liveness too deep.
A liveness check that depends on every external service can restart healthy apps during dependency outages.
Making readiness too shallow.
A service that returns “OK” before it can serve traffic creates bad deployments and user errors.
Ignoring startup time.
Slow-starting apps can be restarted repeatedly if startup checks are not designed realistically.
Only monitoring from inside the server.
Internal metrics can look healthy while users cannot reach the app.
Alerting on every check failure.
Health checks should reduce noise, not create it.
Forgetting workers and scheduled jobs.
A website can be healthy while background processing is stuck.
Not reviewing checks after incidents.
Every incident should teach the team whether health checks were too shallow, too deep, or missing.
A Practical Health Check Policy for Small Teams
A small-team health check policy should be simple enough to follow.
| Policy area | Recommended baseline |
|---|---|
| Liveness | Check whether restarting the process would help |
| Readiness | Check whether the instance should receive traffic |
| Startup | Give slow services enough time to initialize |
| Dependencies | Include only dependencies that affect the decision being made |
| Synthetic monitoring | Test critical user journeys from outside the system |
| Workers | Track process health and job progress |
| Alerts | Page on customer impact, not every isolated check failure |
| Deployment | Use readiness to avoid sending traffic too early |
| Review | Update checks after incidents and major architecture changes |
The goal is not to add every possible check. The goal is to create the few checks that make production safer.
Good Health Checks Make Failure Boring
Server health checks are reliability controls.
Liveness checks keep stuck processes from staying stuck. Readiness checks keep traffic away from unready instances. Startup checks prevent slow services from being restarted too early. Synthetic monitoring proves whether users can actually reach and use the product.
For related reading, see Raff’s Observability for Small Teams guide, Server Incident Response guide, Performance Bottlenecks guide, Reverse Proxy vs Load Balancer guide, High Availability vs Disaster Recovery guide, and Auto-Scaling VM Planning guide.
On Raff, the practical path is to start with simple, accurate health checks, connect them to the right actions, and expand only when the workload proves it needs more detail. A good health check should make failure easier to detect, easier to route around, and easier to recover from.
