SIP High Availability and Redundancy: How DNS Becomes the Control Plane

A server will die. A datacenter will lose power. A carrier will pull a route at 3 a.m. on a Sunday. None of that should reach a customer, and none of it should ask a customer to log in and change a setting. The way you get there in modern SIP is not a magic appliance: it is a naming discipline plus DNS used as a control plane. Here is how we build high availability and redundancy at DIDHub: one stable hostname a customer dials forever, SRV records that make every name redundant by default, GeoDNS that steers to the nearest healthy edge, a single wildcard certificate that covers the whole tree, and a fleet you can grow, drain, and replace with nothing more than a DNS edit.

2026-06-14 · 11 min read

By Daria Kesselman · DIDHub editorial

1. The single point of failure nobody budgets for

Most SIP outages are not exotic. A box runs out of memory, a hypervisor reboots, an upstream transit link flaps, a kernel panics during a patch window. Hardware and networks fail constantly, and a serious voice platform is judged not on whether a node dies but on whether a customer ever notices when one does.

The trap is the trunk config itself. The classic setup bakes the provider into the PBX as sip.example.com:5060, or worse, as a raw 198.51.100.10:5060. That is a single point of failure with a customer’s hands on it. The day you need to move that customer to a healthy node, you have exactly two bad options:

  • Email every customer and ask them to edit their PBX. Slow, error-prone, and politically expensive. Some will not act for weeks.
  • Float the IP with anycast or a hot-standby NAT layer that pretends one address is many boxes. Powerful, but heavy to operate and easy to get subtly wrong.

There is a third option that the SIP protocol was designed for and that costs almost nothing to run: never put a server in the customer’s config in the first place. Put a name there, and make the name the thing you control.

2. The one principle: a name you never have to change

Everything below follows from a single rule we never break: a customer only ever configures a stable name. Something like sip.didhub.io, or a region-pinned name like euro.sip.didhub.io. What sits behind that name, which physical servers, in which datacenters, on which ports, over which transport, is entirely ours to reshape, and we reshape it in DNS.

Remove a node from a name’s record set and new calls stop arriving at it within the record’s TTL. Calls already up finish naturally. A few hours later that node is idle and replaceable. No customer is notified, because from the customer’s side nothing changed: the name they dial is the same name it has always been.

The whole design in one sentence. Customers own a stable name; we own everything the name points at. Failover, load balancing, maintenance, capacity, and geographic expansion all become edits to a DNS zone we control, never edits to a config a customer controls.

The rest of this post is the machinery that makes that rule hold under real traffic: SRV records for redundancy, a tiered naming scheme so a name can mean “anywhere” or “exactly this datacenter,” GeoDNS for proximity, and a wildcard certificate so TLS stays valid no matter which node answers.

3. Two problems, two layers: redundancy vs proximity

High availability and low latency are different problems, and conflating them is where a lot of designs get muddy. Keep them on separate layers:

| Layer | Mechanism | Always on? | Solves |
|---|---|---|---|
| Redundancy | DNS SRV priority + weight | Yes, plain DNS, no add-on | A node, rack, or POP dying without taking calls with it. |
| Proximity | GeoDNS / health-checked pools | Opt-in, per name | Routing each customer to the nearest healthy edge for lower latency. |

Redundancy is the non-negotiable layer and it is free: it is just SRV records, resolvable by every SIP stack worth running, buildable today with nothing but a DNS zone. Proximity is the optimization layer you add only on the names where geography actually changes the answer. Build the first layer everywhere; add the second where it pays for itself. The next two sections take them in that order.

4. SRV: the redundancy primitive

A DNS SRV record (RFC 2782) returns four fields per result: priority, weight, port, and target. A SIP-compliant client discovers a server through them as part of the RFC 3263 resolution chain, and they do two jobs that together give you redundancy for free. (For the full NAPTR→SRV→A walkthrough, see our companion post, SIP SRV records explained.)

Priority is failover ordering. Lower wins. A client must exhaust every target at priority 10 before it touches anything at priority 20.

Weight is load balancing within one priority. Among targets that share a priority, weight is the proportional share of new sessions each one receives.

Put a region behind one name and the name is redundant the moment it has more than one target:

; euro.sip.didhub.io - the region name a customer dials.
; Two co-primary nodes, one standby. The customer sees one name.
_sips._tcp.euro.sip.didhub.io.  60 IN SRV  10 50 5061 fra1.sip.didhub.io.
_sips._tcp.euro.sip.didhub.io.  60 IN SRV  10 50 5061 lhr1.sip.didhub.io.
_sips._tcp.euro.sip.didhub.io.  60 IN SRV  20 100 5061 ams1.sip.didhub.io.

; Each target is a real host with its own A record.
fra1.sip.didhub.io.  60 IN A  198.51.100.11
lhr1.sip.didhub.io.  60 IN A  198.51.100.21
ams1.sip.didhub.io.  60 IN A  198.51.100.31

This reads: send new calls to fra1 and lhr1 split roughly 50/50; if both are unreachable, fall to ams1. A single URI in the customer’s PBX, and it already survives the loss of any one node. We change the whole topology by editing this one zone: drain a node by dropping its weight to 0 and then removing its line (under RFC 2782 a zero weight means “rarely chosen” while other targets carry weight, not “never”), add a node by appending one line, move the standby by editing the priority-20 target. The customer config never moves.
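To make the priority and weight rules concrete, here is a minimal Python sketch of the selection order described in RFC 2782, applied to the three records above. The `Srv` class and `order_targets` function are illustrative names, not part of any SIP stack; a real client does this inside its resolver.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Srv:
    priority: int
    weight: int
    port: int
    target: str


def order_targets(records, rng=random):
    """Return records in the order a client would try them (RFC 2782 style)."""
    ordered = []
    # Lower priority values are tried first; a higher tier is only reached
    # after every target in the lower tier has been placed ahead of it.
    for priority in sorted({r.priority for r in records}):
        # Weight-0 records go to the front of the list so they are rarely picked.
        group = sorted(
            (r for r in records if r.priority == priority),
            key=lambda r: r.weight != 0,
        )
        while group:
            total = sum(r.weight for r in group)
            pick = rng.randint(0, total)  # uniform in [0, total]
            running = 0
            for i, rec in enumerate(group):
                running += rec.weight
                if running >= pick:
                    ordered.append(group.pop(i))
                    break
    return ordered


EURO = [
    Srv(10, 50, 5061, "fra1.sip.didhub.io."),
    Srv(10, 50, 5061, "lhr1.sip.didhub.io."),
    Srv(20, 100, 5061, "ams1.sip.didhub.io."),
]

attempt_order = order_targets(EURO)
# fra1 and lhr1 come first in a random order; ams1 is always last.
```

Because the weighted draw is repeated for every new resolution, the 50/50 split only shows up across many calls, not within one.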

A note on REGISTER mode. SRV gives you clean per-call failover for outbound INVITEs. Inbound trunks in REGISTER mode pin to the target they registered through, so pair them with active SIP OPTIONS keepalives (30–60 s) and a short Expires so a dead registrar is detected fast rather than at binding-refresh time. Our deeper treatment is in SIP trunk failover that actually works.
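As a rough illustration of why the keepalive interval matters, the sketch below tracks consecutive unanswered OPTIONS probes and declares the registrar dead after a threshold. The class name, the 30 s interval, and the two-miss threshold are assumptions chosen for the example, not DIDHub settings, and the per-probe timeout is ignored.

```python
from dataclasses import dataclass


@dataclass
class KeepaliveMonitor:
    interval_s: float = 30.0  # assumed OPTIONS interval
    max_missed: int = 2       # hypothetical threshold before failing over
    missed: int = 0

    def record(self, answered: bool) -> bool:
        """Feed one probe result; return True while the registrar counts as alive."""
        self.missed = 0 if answered else self.missed + 1
        return self.missed < self.max_missed

    def worst_case_detection_s(self) -> float:
        """Longest gap between the last good probe and declaring the registrar dead."""
        return self.interval_s * self.max_missed


monitor = KeepaliveMonitor()
monitor.record(True)    # registrar answered
monitor.record(False)   # one miss: still considered alive
alive = monitor.record(False)  # second consecutive miss: considered dead
```

With these example values, a dead registrar is noticed within about a minute rather than at the next binding refresh.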

5. The name hierarchy: global, region, POP, node

One redundant name is good. The leverage comes from a tree of names, where each tier trades “automatic” for “specific” and every tier is itself redundant. A customer points at whichever tier matches how much control they want:

| Tier | Example name | Means | Resolves via |
|---|---|---|---|
| Global | sip.didhub.io | “Just work.” Nearest healthy region, redundant within it. | GeoDNS in front of SRV |
| Region | euro.sip.didhub.io | Pin to a region, still redundant across every node in it. | SRV (all region nodes) |
| Zone / POP | fra.sip.didhub.io | Pin to one datacenter (IATA code), redundant across its nodes. | SRV (POP nodes) |
| Node | fra1.sip.didhub.io | One physical SBC. Debug, special routing, the building block. | A record → IP |

The tree is in the naming and the membership, not in chained records: you cannot point an SRV at another SRV, so each higher tier enumerates the nodes beneath it. The region name lists its POPs’ nodes; the global name fronts the regions. That sounds like bookkeeping, and it is, which is exactly why it should be generated rather than hand-maintained (more on that in section 8).

Why offer four tiers instead of one? Because different customers want different promises. A startup wants sip.didhub.io and never thinks about it again. A carrier with its own redundancy logic wants to pin two specific POPs and run its own failover between them. A migration or a debugging session wants a single node by name. The same DNS tree serves all of them, and a customer can move up or down a tier by changing one hostname.

6. GeoDNS: nearest-healthy, where it earns its keep

Redundancy keeps calls up; proximity keeps them crisp. Geo steering returns different nodes depending on where the client resolves from, so a caller in Toronto lands on a North American edge and a caller in Frankfurt lands on a European one. It is a separate layer, added only on the names where distance changes the answer.

Two ways to implement it, both keeping the SIP path itself untouched:

  • DNS load balancing with health-checked pools (for example Cloudflare Load Balancing in DNS-only mode), one pool per region, steered by the resolver’s location and, where the resolver forwards it, EDNS Client Subnet (ECS).
  • A GeoDNS provider (NS1, Route 53) when you want richer steering policies or protocol-aware health checks.

The health check is the subtle part. Load balancers speak TCP and HTTP, not SIP, so the liveness probe is a TCP connect to the TLS port (5061) or an HTTP check against a node-local health endpoint, and an unhealthy node is pulled from the geo pool automatically. Turning geo on for a name is invisible to customers: the name is the same, the answers just get smarter about location.
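The TCP-connect style of liveness probe can be sketched in a few lines of Python. This is an illustration of the idea, not the probe any particular load balancer runs; the function name and the 2 s timeout are assumptions. Note that a successful connect only shows the port is accepting connections, not that SIP signalling behind it is healthy.

```python
import socket


def tcp_alive(host: str, port: int = 5061, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port completes within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False
```

A pool manager would call something like `tcp_alive("fra1.sip.didhub.io")` on a schedule and pull the node from the geo pool after repeated failures.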

Where does it actually pay off? On the global name, where “nearest region” is the entire point, and inside a geographically wide region (coast-to-coast in North America is roughly 60–70 ms round trip, enough that east-vs-west steering matters). It is wasted effort on a single-POP zone name like fra.sip.didhub.io, because there is only one place to go. Build redundancy everywhere; switch on geo only where the map is big enough to care.
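The “nearest healthy” decision itself is simple to express. The sketch below assumes a hypothetical per-region preference list and a set of currently healthy nodes; `steer`, `POOLS`, `FALLBACK`, and the `iad1` node name are invented for the example and do not describe a specific provider's policy engine.

```python
def steer(client_region, pools, healthy, fallback_order):
    """Return (region, nodes) for the nearest region that has a healthy node."""
    for region in fallback_order[client_region]:
        nodes = [n for n in pools.get(region, []) if n in healthy]
        if nodes:
            return region, nodes
    return None, []  # nothing healthy anywhere


POOLS = {"euro": ["fra1", "lhr1", "ams1"], "noam": ["iad1"]}
FALLBACK = {"euro": ["euro", "noam"], "noam": ["noam", "euro"]}

# All nodes healthy: a European client stays in Europe.
normal = steer("euro", POOLS, {"fra1", "lhr1", "ams1", "iad1"}, FALLBACK)

# Every European node unhealthy: the same client is steered to North America.
degraded = steer("euro", POOLS, {"iad1"}, FALLBACK)
```

On a single-POP name the preference list has one entry, which is why geo steering adds nothing there.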

7. One wildcard certificate for the whole tree

SRV introduces a TLS subtlety that catches people. When a customer dials euro.sip.didhub.io and the SRV chain lands the connection on fra1.sip.didhub.io, which name must the certificate match? RFC 5922 says the certificate must match the name the customer dialed, not the node it landed on. Get this wrong and strict clients reject the handshake even though routing worked perfectly.

A single wildcard certificate solves the entire hierarchy at once:

CN / SAN = *.sip.didhub.io   # euro. / noam. / fra. / iad. / fra1. ... every region, POP, and node
         + sip.didhub.io   # explicit SAN: a wildcard does NOT match the bare apex

One certificate, every name in the tree. Practical notes from running it:

  • Issue it per node via Let’s Encrypt DNS-01, so each box holds its own copy and no private key is ever shared between nodes.
  • Set it as the SBC’s default TLS profile, so any node can validly answer for any region or zone name it is listed under.
  • Remember the apex. A wildcard covers anything.sip.didhub.io but not sip.didhub.io itself, so the bare name needs its own explicit SAN entry.
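The matching rule behind these notes can be shown with a simplified Python check. It is a sketch of left-most-label wildcard matching only, under the assumption that `*` stands for exactly one label; real TLS libraries implement the full rules, and `cert_matches` and `cert_valid_for` are illustrative names. The name being checked is the one the customer dialed, not the SRV target.

```python
def cert_matches(pattern: str, hostname: str) -> bool:
    """Left-most-label wildcard match: '*' stands for exactly one label."""
    p = pattern.lower().rstrip(".").split(".")
    h = hostname.lower().rstrip(".").split(".")
    if len(p) != len(h):
        return False  # a wildcard never spans zero labels or several labels
    if p[0] == "*":
        return h[0] != "" and p[1:] == h[1:]
    return p == h


def cert_valid_for(sans, dialed_name: str) -> bool:
    """Check the dialed name against every SAN entry on the certificate."""
    return any(cert_matches(san, dialed_name) for san in sans)


SANS = ["*.sip.didhub.io", "sip.didhub.io"]

cert_valid_for(SANS, "euro.sip.didhub.io")                # True: one label under the wildcard
cert_valid_for(SANS, "sip.didhub.io")                     # True: only via the explicit apex SAN
cert_valid_for(["*.sip.didhub.io"], "sip.didhub.io")      # False: wildcard alone misses the apex
```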

Why this matters for failover. Because every node carries the same wildcard, you can repoint any name at any node and TLS stays valid instantly. The certificate is never the thing blocking a failover or a node swap, which is precisely what you want when you are moving traffic under pressure.

8. Operating the fleet: add, drain, replace, promote

Here is the payoff. Every routine fleet operation, the things that on a hard-coded platform mean a maintenance window and a customer notice, becomes a DNS edit that customers never see:

| Goal | The edit | Effect |
|---|---|---|
| Replace a server (same slot, new hardware) | Repoint fra1 A record to the new IP. | New calls hit the new box within the TTL; the old one drains as calls end. No SRV change. |
| Add capacity | New host A record, then append one SRV line to the POP and region names. | The new node starts taking its weight share within the TTL. |
| Drain for maintenance | Set its weight to 0, then delete its SRV line. | New INVITEs stop within the TTL; in-flight calls finish, then the node goes idle. |
| Promote a new location | Repoint a region or the global name’s SRV set to the newly live local node. | Traffic shifts to the closer node; customers on the region name change nothing. |

Notice what is missing from that table: a customer action, anywhere. That is the entire point of section 2 made concrete.

The one discipline this demands is that membership must never be hand-maintained, because a tree of region and POP names that each enumerate their nodes is easy to get inconsistent by hand. The clean pattern is a single source of truth, a small node inventory, that a sync script renders into every A and SRV record:

# nodes.json - the single source of truth for the whole tree
[
  { "slot": "fra1", "region": "euro", "pop": "fra",
    "ip": "198.51.100.11", "weight": 100, "status": "active" }
]
# Add an entry, flip status to "draining", or remove it,
# then one command syncs every SRV + A record to DNS.

Draining a node is now “flip status to draining and re-run.” Adding one is “append an entry and re-run.” The tree, including any fallback records for regions that do not yet have their own hardware, regenerates from that list, so it is always internally consistent.
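A sync script of this kind might look roughly like the Python sketch below. It is not DIDHub's actual tooling: `render_zone` is a hypothetical name, every SRV is emitted at priority 10 for simplicity (no standby tier), the `lhr1` entry and its `lhr` POP code are invented for the example, and the step that pushes records to a DNS provider is left out.

```python
import json

INVENTORY = """
[
  {"slot": "fra1", "region": "euro", "pop": "fra",
   "ip": "198.51.100.11", "weight": 100, "status": "active"},
  {"slot": "lhr1", "region": "euro", "pop": "lhr",
   "ip": "198.51.100.21", "weight": 100, "status": "draining"}
]
"""


def render_zone(nodes, base="sip.didhub.io", ttl=60, port=5061):
    """Render every A and SRV record for the tree from the node inventory."""
    lines = []
    for node in nodes:
        host = f"{node['slot']}.{base}."
        # Every node keeps its A record so it stays reachable by name.
        lines.append(f"{host} {ttl} IN A {node['ip']}")
        if node["status"] != "active":
            continue  # draining nodes are left out of every SRV set
        # Each higher tier enumerates the node directly: POP name, then region name.
        for tier in (node["pop"], node["region"]):
            lines.append(
                f"_sips._tcp.{tier}.{base}. {ttl} IN SRV 10 {node['weight']} {port} {host}"
            )
    return sorted(lines)


zone = render_zone(json.loads(INVENTORY))
print("\n".join(zone))
```

Because the POP and region sets are derived from the same list in one pass, they cannot drift apart the way hand-edited records can.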

9. How fast is failover, really?

“Within the TTL” deserves an honest number, because failover speed is not one value, it is a sum:

recovery time ≈ SRV TTL + the client’s re-resolution interval + its DNS cache

For outbound INVITEs the dominant term is usually the TTL, because a compliant stack re-resolves on new dialogs and skips a target it cannot reach. For REGISTER-mode trunks the dominant term is how quickly the client notices its registrar is gone, which is why OPTIONS keepalives matter so much there. A drained node never “disappears” mid-call: it simply stops being offered to new calls, while existing calls run to their natural end.
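A small worked example makes the sum tangible. The numbers below are purely illustrative assumptions (a 60 s TTL, a few seconds of client re-resolution delay, a short extra cache hold, and a two-minute registrar detection time); they are not measurements, and `recovery_estimate` is a hypothetical helper.

```python
def recovery_estimate(terms):
    """Sum the recovery terms (seconds) and name the one that dominates."""
    total = sum(terms.values())
    dominant = max(terms, key=terms.get)
    return total, dominant


# Outbound INVITEs: the SRV TTL is the largest term.
outbound = recovery_estimate({"srv_ttl": 60, "re_resolution": 5, "dns_cache": 10})
# → (75, "srv_ttl")

# REGISTER-mode trunk with slow failure detection: detection dominates instead.
register_mode = recovery_estimate(
    {"srv_ttl": 60, "registrar_detection": 120, "dns_cache": 10}
)
# → (190, "registrar_detection")
```

Shortening the TTL helps the first case; only faster keepalives help the second.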

The tuning knob is the TTL, and it is a deliberate trade. We publish short TTLs (on the order of 60 s) on the SIP SRV and node records so drains and failovers propagate fast. The cost is more DNS queries, which is cheap and cached aggressively up the resolver chain. As a rule, the records you reshape during an incident want short TTLs; records that never change can sit longer. Telephony failover is firmly in the first camp.

The safe-drain envelope. Because recovery is TTL plus re-resolution (or re-registration, for REGISTER-mode trunks) plus cache, the safe rule for fully retiring a node is “weight to 0, remove from the set, then wait hours, not minutes.” New traffic leaves within the TTL; the long tail is just in-flight calls and slow-to-refresh clients draining away. Nothing breaks during that window, the node is simply no longer offered.

10. What this means for a DIDHub trunk

Put together, the four ingredients (a stable name, SRV redundancy, optional GeoDNS, a wildcard certificate) give a customer carrier-grade resilience with a config that is one line long and never changes:

  • You configure a name, not a server. Point your PBX at your DIDHub SIP hostname as a bare hostname, no port, no IP, and let the SRV chain run. That single act opts you into every failover and capacity change we make behind it.
  • Redundancy is the default, not an upgrade. The name resolves to multiple nodes, so the loss of any one is absorbed without a ticket, a window, or a notification.
  • Maintenance is invisible. We replace hardware, patch kernels, and add capacity by editing DNS, draining nodes gracefully so calls finish rather than drop.
  • TLS just stays valid. One wildcard across the tree means the certificate is never what blocks a move.

If you run your own SIP infrastructure and want failover decided on your side, you can pin specific region or POP names and drive your own dispatcher logic across them; if you would rather we own the resilience end to end, the global name does it for you. Either way the contract is the same: you hold a name, we hold everything it points at.

Worth reading next: SIP SRV records explained for the resolution chain in depth, SIP trunk failover that actually works for the client-side routing strategies, and the glossary on SIP transports and SIP OPTIONS. When you are ready to put a trunk behind it, see DIDHub SIP trunks.

11. Bottom line

High availability in SIP is not a box you buy, it is a discipline you keep: never let a server into the customer’s config, and make DNS the control plane for everything behind the name. SRV records turn one hostname into a redundant set; a tiered naming tree lets that hostname mean “anywhere” or “exactly this datacenter”; GeoDNS adds proximity where the map is big enough to matter; and a single wildcard certificate keeps TLS valid no matter which node answers. The reward is the quiet kind of reliability: servers come and go, regions grow, hardware gets replaced, and the customer dialing sip.didhub.io never has a reason to know.

That is the test we hold ourselves to. If a node dies and you had to do something about it, we built it wrong.
