10 Linux server errors Reflex fixes automatically
The Reflex Team · 11 min read · 9 May 2026
Most of the server incidents we see fall into the same ten patterns. They have the same symptoms, the same root causes, and the same fixes, applied manually by an engineer who SSHs in at 3am, runs the same diagnostic commands as last time, and applies the same repair as last time.
Reflex automates this loop. The reflexd agent detects the condition, the Brain decides which repair is appropriate and executes it within policy, and every action is logged to an audit trail. Here are the ten errors that account for the majority of automated repairs across our fleet.
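The detect, decide, act, audit loop described above can be sketched in a few lines. This is an illustration of the shape of such a loop, not Reflex's implementation: the `Finding`, `AuditLog` and `handle` names, the `REPAIRS` table, and the idea of policy as a simple allowlist are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Finding:
    """A condition reported by the agent (hypothetical shape)."""
    kind: str    # e.g. "oom_kill", "disk_full"
    target: str  # e.g. "php8.2-fpm", "/var"

# Hypothetical repair table and policy allowlist.
REPAIRS = {"oom_kill": "restart_manager", "disk_full": "clean_safe_files"}
POLICY_ALLOWED = {"restart_manager"}

@dataclass
class AuditLog:
    entries: list = field(default_factory=list)

    def record(self, finding, action, outcome):
        self.entries.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "kind": finding.kind,
            "target": finding.target,
            "action": action,
            "outcome": outcome,
        })

def handle(finding, audit):
    """Pick a repair, enforce policy, and always write an audit entry."""
    action = REPAIRS.get(finding.kind)
    if action is None:
        outcome = "alert_only"         # no known repair: hand over to a human
    elif action not in POLICY_ALLOWED:
        outcome = "blocked_by_policy"  # known repair, not allowed unattended
    else:
        outcome = "executed"           # a real agent would run the repair here
    audit.record(finding, action, outcome)
    return outcome
```

The key design point is that the audit entry is written on every path, including the ones where nothing was executed.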
1. OOM kill
What happens: The Linux kernel runs out of allocatable memory and invokes the OOM killer, which picks a victim by its oom_score (in practice, usually the process using the most memory) and sends it SIGKILL. SIGKILL cannot be caught, so there is no graceful shutdown, no cleanup, and no error handler.
Typical victims: PHP-FPM children, Node.js workers, or Java processes with aggressive heap settings.
How the Brain handles it: Reflex monitors memory pressure trends and dmesg for OOM kill events. When an OOM kill occurs, the Brain checks whether the killed process was managed (FPM pool, PM2 cluster, systemd service) and triggers a supervised restart of the process manager — not the individual process. It then flags the memory trend for review, because an OOM kill is a symptom, not a root cause.
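Detecting an OOM kill and choosing what to restart can be sketched as below. It parses kernel log lines such as `Out of memory: Killed process 4321 (php-fpm8.2) ...`, which you could read from `dmesg` or `journalctl -k`. The prefix-to-manager mapping and the `plan_for` helper are hypothetical examples, not Reflex's rule set.

```python
import re

# "Killed process <pid> (<comm>)" appears in both older and newer kernel OOM
# messages, including the cgroup variant ("Memory cgroup out of memory: ...").
OOM_RE = re.compile(r"Killed process (\d+) \(([^)]+)\)")

# Hypothetical mapping from process-name prefix to the manager that owns it.
MANAGERS = [
    ("php-fpm", "php-fpm pool"),
    ("node", "pm2 cluster"),
    ("java", "systemd service"),
]

def parse_oom_kills(lines):
    """Return (pid, comm) for every OOM kill found in kernel log lines."""
    events = []
    for line in lines:
        match = OOM_RE.search(line)
        if match:
            events.append((int(match.group(1)), match.group(2)))
    return events

def plan_for(comm):
    """Restart the owning manager, never the single process; always flag memory."""
    for prefix, manager in MANAGERS:
        if comm.startswith(prefix):
            return {"restart": manager, "flag_memory_trend": True}
    # Unmanaged process: nothing safe to restart, so alert only.
    return {"restart": None, "flag_memory_trend": True}
```

Note that `flag_memory_trend` is set on every path: the restart clears the symptom, but the memory trend still needs a human.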
2. PHP-FPM crash loop
What happens: PHP-FPM workers crash and respawn repeatedly, often due to a segfault in an extension, a corrupted opcache, or a fatal error in a preloaded file. Each restart attempt fails within seconds.
Symptoms: nginx logs flooded with "connect() to unix:/run/php/php-fpm.sock failed" errors, HTTP 502 responses, and FPM's emergency restart log lines.
How the Brain handles it: Reflex detects the crash-restart pattern by monitoring FPM's process stability and restart frequency. If workers cannot stay alive for longer than the configured min_uptime threshold, the Brain first attempts an opcache clear and graceful pool restart. If the loop continues, it escalates to an alert with diagnostic context — because a persistent crash loop usually means a code or extension issue that cannot be fixed by restarting harder.
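A crash loop can be detected from worker lifetimes alone. The sketch below assumes the agent already records how many seconds each recent worker survived. `min_uptime` is the threshold mentioned above; `min_samples` and both function names are hypothetical.

```python
def is_crash_loop(lifetimes, min_uptime, min_samples=3):
    """True if each of the last `min_samples` workers died before `min_uptime` seconds."""
    recent = lifetimes[-min_samples:]
    return len(recent) >= min_samples and all(t < min_uptime for t in recent)

def next_step(repairs_tried):
    """Escalation ladder: one automated repair, then hand over to a human."""
    if repairs_tried == 0:
        return "clear_opcache_and_graceful_reload"
    return "alert_with_diagnostics"
```

Capping the ladder at a single automated attempt is the point: a loop that survives an opcache clear and reload is almost certainly a code or extension problem.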
3. Nginx 502 Bad Gateway
What happens: nginx cannot connect to its upstream (PHP-FPM, Gunicorn, Node.js). Either the upstream has crashed, it is overloaded, or its socket or port is not listening.
How the Brain handles it: Reflex correlates the 502 rate with upstream process health. If the upstream process is dead, the Brain restarts it. If the upstream is alive but overloaded (FPM listen queue growing, all workers active), it logs the capacity constraint and alerts — because adding workers without checking memory arithmetic is how you trigger error number one on this list.
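The decision above, plus the memory arithmetic behind the warning about adding workers, can be sketched like this. It assumes the agent already knows whether the upstream is alive and has the listen queue and active process counts, for example from FPM's status page (`pm.status_path`). The sizing rule (memory budget for PHP divided by average worker RSS) is a common rule of thumb, not a Reflex formula, and the function names are hypothetical.

```python
def classify_502(upstream_alive, listen_queue, active_workers, max_workers):
    """Decide what to do about a 502 spike given upstream health."""
    if not upstream_alive:
        return "restart_upstream"
    if listen_queue > 0 and active_workers >= max_workers:
        # Saturated: adding workers blindly risks OOM kills, so alert instead.
        return "alert_capacity"
    return "alert_investigate"

def safe_max_children(available_mb, avg_worker_rss_mb):
    """Rule-of-thumb ceiling for pm.max_children: memory budget / worker size."""
    return int(available_mb // avg_worker_rss_mb)
```

With 4096 MB left for PHP after the OS, nginx and any database, and workers averaging 80 MB RSS, `safe_max_children(4096, 80)` gives 51. Setting `pm.max_children` to 100 on the same box would let the pool ask for about 8,000 MB under load.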
4. Disk full
What happens: A partition (usually /var or /) reaches 100% utilisation. Writes start failing with ENOSPC, and applications that ignore the error fail silently. Log files cannot be written, database transactions fail, session files cannot be created, and deployments break.
Symptoms: "No space left on device" errors, application 500s, failed apt operations, and database write failures.
How the Brain handles it: Reflex monitors disk utilisation on a trend basis, alerting at 85% and escalating at 95%. When a partition is critically full, the Brain identifies the largest space consumers (typically old log files, failed deployment releases, or temp files), and if the offending files match safe-to-clean patterns (rotated logs, orphaned release directories, PHP session files past TTL), it removes them and logs the action. Files that do not match known-safe patterns are flagged for human review.
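The thresholds and the safe-to-clean split can be sketched as follows. The glob patterns are hypothetical examples covering rotated logs and PHP session files only; `session_ttl` could come from PHP's `session.gc_maxlifetime` (1440 seconds by default). Files that match nothing are returned for human review, and session files still within their TTL are left alone.

```python
import fnmatch
import os
import shutil

# Hypothetical safe-to-clean rules; real rules would be configured per host.
ROTATED_LOG_PATTERNS = ["*.log.[0-9]", "*.log.[0-9][0-9]", "*.log.*.gz"]
SESSION_PATTERN = "sess_*"

def level_for(percent):
    """Map utilisation to the alert (85%) and escalate (95%) thresholds."""
    if percent >= 95:
        return "escalate"
    if percent >= 85:
        return "alert"
    return "ok"

def usage_level(path="/"):
    usage = shutil.disk_usage(path)
    return level_for(usage.used / usage.total * 100)

def split_candidates(files, now, session_ttl):
    """files: iterable of (path, mtime). Returns (safe_to_delete, needs_review)."""
    safe, review = [], []
    for path, mtime in files:
        name = os.path.basename(path)
        if any(fnmatch.fnmatch(name, p) for p in ROTATED_LOG_PATTERNS):
            safe.append(path)
        elif fnmatch.fnmatch(name, SESSION_PATTERN):
            if now - mtime > session_ttl:
                safe.append(path)  # expired session file
            # sessions still within their TTL are live data: skip them
        else:
            review.append(path)    # unknown file: never delete automatically
    return safe, review
```

Taking `(path, mtime)` pairs rather than walking the filesystem keeps the classification testable; a real agent would feed it from `os.scandir` over the largest directories.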
5. Dead queue worker
What happens: A Supervisor or systemd-managed queue worker stops processing jobs. The process may still be alive but stuck on a blocking operation, deadlocked, or consuming a poison job that causes infinite retries.
Symptoms: Jobs accumulate in the pending state, Laravel Horizon (if used) shows "inactive" supervisors, and background tasks (emails, notifications, webhooks) stop executing.
How the Brain handles it: Reflex monitors queue depth, worker process state, and job processing rate. When queue depth grows while workers appear idle or stuck, the Brain issues a graceful restart (php artisan queue:restart for Laravel, or the equivalent for other frameworks), which tells workers to exit after finishing their current job. If a stuck process has not exited after a configurable timeout, the Brain escalates to SIGTERM and respawns the worker.
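The stall check and escalation can be sketched as one pure function. The inputs (recent depth samples, jobs processed over the same window, and when a graceful restart was issued) are assumptions about what an agent would collect, and the function name and return values are hypothetical.

```python
def queue_action(depth_samples, jobs_processed, graceful_sent_at, now, timeout):
    """
    depth_samples: recent queue depths, oldest first.
    jobs_processed: jobs completed over the same window.
    graceful_sent_at: time a graceful restart was issued, or None.
    """
    growing = len(depth_samples) >= 2 and depth_samples[-1] > depth_samples[0]
    stalled = growing and jobs_processed == 0
    if not stalled:
        return "ok"
    if graceful_sent_at is None:
        return "graceful_restart"     # e.g. php artisan queue:restart
    if now - graceful_sent_at >= timeout:
        return "sigterm_and_respawn"  # graceful path did not work in time
    return "wait"                     # give the current job time to finish
```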
6. SSL certificate expiry
What happens: A Let's Encrypt (or other CA) certificate expires because the renewal cron failed, certbot's configuration drifted, or the renewal hook could not restart nginx. Browsers then block the site behind a full-page security warning.
Symptoms: Browser "connection not secure" warnings, failed HTTPS connections, and certbot error logs.
How the Brain handles it: Reflex checks SSL certificate expiry dates daily across all configured domains. At 14 days before expiry, it alerts. At 7 days, it attempts a certbot renewal. At 3 days, it escalates with high urgency. Post-renewal, it verifies the new certificate is loaded by checking the live TLS handshake — because a renewed certificate sitting on disk while nginx serves the old one is a common failure mode.
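Checking the certificate nginx is actually serving, rather than the file on disk, can be done with a live TLS handshake. The sketch below uses Python's standard `ssl` module and maps days remaining onto the 14/7/3-day tiers described above; the function names are hypothetical.

```python
import socket
import ssl
import time

def live_cert_expiry(host, port=443, timeout=5.0):
    """Return notAfter (epoch seconds) of the certificate the server presents.

    With the default context, an already-expired or otherwise invalid
    certificate raises ssl.SSLCertVerificationError during the handshake.
    """
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return ssl.cert_time_to_seconds(cert["notAfter"])

def days_until(expiry_epoch, now=None):
    now = time.time() if now is None else now
    return (expiry_epoch - now) / 86400

def expiry_action(days_left):
    """Tiers from the text: alert at 14 days, renew at 7, escalate at 3."""
    if days_left <= 3:
        return "escalate"
    if days_left <= 7:
        return "renew"
    if days_left <= 14:
        return "alert"
    return "ok"
```

Running `live_cert_expiry` again after a renewal is what catches the "renewed on disk, old certificate still served" case.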
7. High load average
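What happens: The system load average exceeds the CPU core count for sustained periods. On Linux, load average counts both runnable tasks waiting for CPU and tasks blocked in uninterruptible sleep (usually waiting on disk I/O), so a high value means more work is queued than the hardware can serve, not necessarily that the CPU is the bottleneck. Response times degrade, timeouts increase, and the server feels "slow" without any single process being obviously broken.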
How the Brain handles it: Reflex distinguishes between CPU-bound load (high %us and %sy in CPU stats) and I/O-bound load (high %wa). For CPU-bound spikes correlated with a recent deploy, it flags the deploy as a likely cause. For I/O-bound load, it checks disk latency and identifies processes with abnormal I/O patterns. Sustained high load triggers an alert with a ranked list of processes by CPU and I/O consumption — context that saves the responding engineer ten minutes of diagnostic commands.
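The CPU-bound versus I/O-bound split can be computed from two samples of the aggregate `cpu` line in `/proc/stat`, compared against the 1-minute load average and core count. The 20% iowait threshold and the function names are hypothetical choices for illustration.

```python
import os
import time

FIELDS = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffy counters.

    Only the first eight counters are used; guest time is already
    included in user time, so adding it would double-count.
    """
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:9])))

def cpu_percentages(before, after):
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    deltas = {k: a[k] - b[k] for k in FIELDS}
    total = sum(deltas.values()) or 1
    return {k: 100 * v / total for k, v in deltas.items()}

def classify_load(load1, cores, pct, wa_threshold=20):
    if load1 <= cores:
        return "normal"
    if pct["wa"] >= wa_threshold:
        return "io_bound"   # check disk latency and per-process I/O next
    return "cpu_bound"      # correlate with recent deploys next

def sample(interval=1.0):
    """Linux only: take two /proc/stat samples `interval` seconds apart."""
    with open("/proc/stat") as f:
        before = f.readline()
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = f.readline()
    return classify_load(os.getloadavg()[0], os.cpu_count(),
                         cpu_percentages(before, after))
```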
8. MySQL connection storm
What happens: The application exhausts MySQL's max_connections limit. New connection attempts receive ERROR 1040 (HY000): Too many connections and all database-dependent operations fail.
Common causes: the total number of PHP-FPM workers across all web servers, each holding a persistent connection, exceeds MySQL's limit; or a long-running query holds connections open while new requests pile up behind it.
How the Brain handles it: Reflex monitors MySQL's active connection count relative to max_connections. When utilisation exceeds 80%, it alerts. When connections are exhausted, it identifies the top connection consumers by process (typically FPM workers) and, if policy allows, gracefully restarts the FPM pool to release stale connections. It also flags slow queries from the MySQL slow log that may be holding connections open.
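The connection budget and the 80% threshold reduce to simple arithmetic. The live values would come from `SHOW GLOBAL STATUS LIKE 'Threads_connected'` and `SHOW VARIABLES LIKE 'max_connections'`. The worst-case formula assumes one persistent connection per FPM worker; the function names are hypothetical.

```python
def worst_case_connections(fpm_pools):
    """fpm_pools: list of (server_count, pm_max_children) pairs.

    Assumes every FPM worker can hold one persistent connection.
    """
    return sum(servers * children for servers, children in fpm_pools)

def connection_level(threads_connected, max_connections):
    ratio = threads_connected / max_connections
    if ratio >= 1.0:
        return "exhausted"  # identify top consumers, maybe restart the pool
    if ratio > 0.8:
        return "alert"
    return "ok"
```

Three web servers with `pm.max_children = 50` give a worst case of 150 connections, which on its own nearly fills MySQL's default `max_connections` of 151 before cron, queue workers or an admin session connect.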
9. Redis memory overflow
What happens: Redis reaches its maxmemory limit. Depending on the eviction policy, it either starts evicting keys (which can break session and cache assumptions) or rejects write commands that would use more memory with OOM errors.
Symptoms: Application errors on cache writes, session loss, and queue failures if Redis is the queue driver.
How the Brain handles it: Reflex monitors Redis memory usage relative to maxmemory. When usage exceeds 85%, it alerts with a breakdown of key patterns by memory consumption (using MEMORY USAGE sampling). If the overflow is caused by an identifiable pattern — such as a cache prefix growing unboundedly — it flags the specific pattern. If Redis is in an OOM-reject state and the queue is backed up, the Brain can trigger a targeted key eviction for known-safe cache prefixes while preserving session and queue data.
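The threshold check, the per-prefix breakdown, and the "known-safe prefixes only" eviction filter can be sketched as pure functions. The inputs are assumptions: `used_memory` and `maxmemory` as reported by `INFO memory`, and per-key sizes gathered with `SCAN` plus `MEMORY USAGE <key>`. The `cache:` / `session:` / `queues:` key naming and the safe-prefix set are hypothetical.

```python
from collections import defaultdict

# Hypothetical: only keys under these prefixes may ever be evicted.
SAFE_TO_EVICT = {"cache"}

def memory_level(used_memory, maxmemory, threshold=0.85):
    if maxmemory == 0:
        return "no_limit"  # maxmemory 0 means no limit is configured
    return "alert" if used_memory / maxmemory > threshold else "ok"

def bytes_by_prefix(samples, sep=":"):
    """samples: {key: bytes}. Groups by first key segment, largest first."""
    totals = defaultdict(int)
    for key, size in samples.items():
        totals[key.split(sep, 1)[0]] += size
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

def eviction_candidates(keys, safe_prefixes=SAFE_TO_EVICT, sep=":"):
    """Keep only keys under known-safe prefixes; sessions and queues survive."""
    return [k for k in keys if k.split(sep, 1)[0] in safe_prefixes]
```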
10. Cron failure
What happens: A scheduled task (Laravel's schedule:run, a system crontab entry, or a systemd timer) silently stops executing. No jobs are dispatched, no reports are generated, no cleanup runs. Often nobody notices for days.
Symptoms: Missing scheduled data, database tables that keep growing when they should have been pruned, and queue jobs that depend on cron dispatch never being created at all, so they never show up as pending or failed.
How the Brain handles it: Reflex monitors cron execution by tracking the expected schedule against actual execution timestamps. When a scheduled task misses its expected window (with a configurable grace period), it alerts immediately. For Laravel applications, it monitors schedule:run execution via the artisan process and verifies that expected commands actually fired. When cron itself is the problem (crond not running, crontab corrupted), the Brain restarts the cron service and verifies the schedule is intact.
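Tracking expected versus actual runs with a grace period can be sketched as below. The heuristic that "every job missed at once" points at cron itself rather than at a single task is an assumption made for illustration, as are the function names.

```python
from datetime import datetime, timedelta

def missed_window(last_run, interval, grace, now):
    """True if a job expected every `interval` has not run within interval + grace."""
    return now - last_run > interval + grace

def cron_action(jobs, now):
    """jobs: {name: (last_run, interval, grace)}. Returns (missed_names, action)."""
    missed = [name for name, (last, interval, grace) in jobs.items()
              if missed_window(last, interval, grace, now)]
    if missed and len(missed) == len(jobs):
        return missed, "check_crond"  # everything stalled: suspect cron itself
    return missed, ("alert" if missed else "ok")
```

For Laravel, `schedule:run` is the natural canary: it is expected every minute, so a short grace period catches a dead crond within a few minutes.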
The pattern
These ten errors share a common trait: the diagnosis and repair are well-understood, repeatable, and safe to automate. No engineer learns something new by restarting a crashed FPM pool at 3am for the twelfth time. The value of a human is in the investigation that follows — why did it crash, what changed, how do we prevent it. Reflex automates the immediate repair so the human conversation can start with "here is what happened and what we did" instead of "the site is down and we are still SSH-ing in."