· Yair Knijn
The four ways certificate renewal breaks (and which ones monitoring catches)
It is 14:00 on a Tuesday. A customer-facing API starts returning 502s. The on-call engineer looks at the certificate on the load balancer and it is, of course, valid until next month. The certificate that expired is on the origin server, and the rotation script did not run, and nobody is sure when it last ran, and the page that documents it is six months old. This is the shape of a certificate incident. The question worth asking is not "how do we monitor expiry" — it is "which of the four things that actually break did we just hit".
1. The DNS delegation expired
ACME issuance is gated on proving control of the domain. For DNS-01, the control is a TXT record under _acme-challenge. The record lives in a delegated zone — Cloudflare, Azure DNS, Route53, Google DNS, or whatever the firm uses. If the delegation expires, or the zone is moved, or the API token used to write the record is rotated without updating the automation, the challenge fails. Renewal is blocked. The certificate does not rotate, and the team does not find out until the old one expires.
This is invisible to a cert-expiry monitor, because the cert is still valid for weeks. The monitor fires a warning at thirty days, by which point the broken delegation has been broken for a while and the diagnostic trail is cold.
2. The ACME rate limit
Let's Encrypt publishes rate limits, and the most relevant one for a busy estate is the certificates per registered domain limit — 50 per week, per set of names, sliding window. A retry loop that fires on a flaky validation, a backfill after a discovery scan, or a misconfigured worker that issues a fresh certificate on every deploy will burn through the limit in a single morning. Once the limit is hit, issuance is throttled for the rest of the window, and the renewals that were supposed to happen quietly do not happen.
Rate limits are the failure mode where the platform did everything right and still failed. Monitoring helps, but only if the rate-limit counter is exposed somewhere visible. The lesson is that retry policy on a CA integration is a security control, not an engineering detail.
3. Issuance worked; deployment did not
This is the one most teams discover at the worst possible moment. The CA issued the new certificate, the API call returned 201, the record in the certificate store updated — and the web server, the API gateway, the database listener, or the SSH daemon is still presenting the old one. The reason is almost always drift between the issuance step and the deployment step. The certificate was stored in one place. The service reads it from another. Or the service caches the certificate at boot and a restart never happened. Or the deployment was rolled back, taking the cert with it, and the issuance is now a new cert for a hostname the platform is not serving.
External monitoring will show a valid certificate on the public endpoint, and the team will conclude that everything is fine. The break is in the gap between the two.
4. The orphaned cert nobody tracked
Discovery scans are the part of the lifecycle that reveals the certs your team did not know about. A developer stood up a service in a sandbox subscription six months ago, attached a public IP, requested a Let's Encrypt cert by hand, and then forgot. The sandbox was promoted to production. The cert is serving production traffic under a hostname the certificate inventory does not know is in scope. When that cert expires, the alarm fires on a hostname the on-call rotation does not own.
Tenant-wide discovery is the only honest answer. Anything narrower leaves a category of certs outside the rotation, and that category is where the incidents come from.
What good looks like
A healthy lifecycle handles all four. Delegation is monitored with a synthetic check that walks the _acme-challenge path through each provider, and the token used for automation is rotated on a known cadence with a documented owner. Rate-limit exposure is bounded by deduplication per registered domain, and issuance is idempotent against the same SAN set. Deployment is a single workflow per target — Azure, AWS, GCP, a database listener, an SSH host, a custom API — and the workflow's final step is a verification probe against the live endpoint, not against the certificate store. Discovery runs across the entire Azure tenant, on a schedule, and orphans flow into the same workflow that managed the known certs.
How Automate Certificates handles it
Automate Certificates issues through Let's Encrypt over ACME and renews automatically. DNS automation covers Cloudflare, Azure DNS, Route53, and Google DNS, so the delegation problem is reduced to a credential rotation problem with a known owner. Deployment workflows target Azure, AWS, GCP, databases, SSH, and custom HTTP APIs, with the same workflow model in each case; the failure mode where issuance succeeded and deployment did not is visible in the workflow log, not only on the wire.
Tenant-wide Azure discovery scans run on a schedule and surface certs the platform did not issue. Each discovered cert can be promoted into the managed workflow, which means an orphan gets a renewal owner rather than a ticket. The audit log is exportable as CSV and delivered by webhook, so the lifecycle state is queryable for ISO 27001 A.8.24 (use of cryptography) reviews rather than reconstructed.
What Automate Certificates does not do is reach into systems it has not been told about. The deployment steps are explicit, and the discovery scope is the tenant the platform is pointed at. The platform closes the four failure modes within the boundary it knows; the boundary itself is the operator's call.