SIP Trunk Failover That Actually Works

A single SIP trunk is a single point of failure: one PoP hiccup, one expired cert, or one overloaded gateway and your inbound calls die silently. There is no alarm, just a quiet drop in answered calls until someone notices the revenue gap. Real resilience means more than a backup trunk in a config file. Here’s how to design routing that survives failure: how dead peers get detected, which SIP responses mean “try the next target” versus “this is the final answer,” and how the four routing strategies (failover, round-robin, weighted load-balancing and simultaneous ring) differ in practice.

2026-05-26 · 9 min read

By Daria Kesselman · DIDHub editorial

1. Why one trunk isn’t enough

A SIP trunk is a logical path to a single peer: one hostname, one set of credentials, one gateway behind it. That gateway is a machine in a building on a network, and every one of those layers can fail. The failure modes are mundane and they all produce the same symptom: calls stop completing.

  • PoP outage. The point of presence terminating your trunk goes down for maintenance, a power event, or a hardware fault. Every call routed there fails until it’s back.
  • Gateway overload. The peer is up but saturated: CPS limits hit, channels exhausted, CPU pegged. It starts returning 503 Service Unavailable or simply stops answering.
  • Network partition. A routing change, a flapping BGP session, or a transit provider problem makes the peer unreachable from your network even though it’s perfectly healthy from everywhere else.
  • Expired TLS certificate. If you run SIP over TLS, an expired or mis-renewed cert on either side breaks the handshake. Calls don’t degrade; they stop dead, and because the handshake fails before any SIP message is exchanged, the cause is invisible at the SIP layer.
  • Registration loss. For register-based trunks, a dropped registration means inbound has nowhere to land. The peer thinks you’re gone; you think everything is fine.
  • Upstream carrier issue. Your direct peer is healthy but its upstream, the carrier that actually reaches the PSTN, has a problem. You see 5xx or one-way audio you can’t fix from your side.

The common thread: failure is the normal state of distributed systems, not the exception. A “backup trunk” defined in config does nothing unless something actively detects that the primary is dead and moves traffic. Resilience is not a second trunk; it’s the detection plus the decision that sits in front of both.

2. How failure is detected

You can’t fail over from a peer you don’t know is broken. Detection comes in two flavours: proactive health-checking that runs continuously in the background, and reactive interpretation of what a peer tells you mid-call.

SIP OPTIONS pings

The standard keepalive is a periodic SIP OPTIONS request, a lightweight probe sent to the peer on an interval (commonly every 30 seconds). A healthy peer answers 200 OK; if it returns an error or doesn’t respond within the qualify timeout, the peer is marked down and pulled from the routing pool before a real call ever hits it. When it starts answering again, it’s marked back up. This is the difference between “the first call after an outage fails” and “no call ever touches the dead peer.”
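
As a rough illustration of the probe-and-mark cycle described above, here is a minimal Python sketch. It builds a bare-bones OPTIONS request, reads the status code from a reply, and flips a peer between up and down. The function and class names, the `probe` user part and the example hostnames are all hypothetical; a real stack would also handle retransmissions, authentication and transport details.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


def build_options(peer_host: str, local_host: str, local_port: int = 5060) -> bytes:
    """Build a minimal SIP OPTIONS request suitable for a UDP keepalive probe."""
    branch = "z9hG4bK" + uuid.uuid4().hex[:16]  # RFC 3261 branch cookie + unique part
    lines = [
        f"OPTIONS sip:{peer_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{peer_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",  # the two empty items produce the blank line that ends the headers
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status(response: Optional[bytes]) -> Optional[int]:
    """Return the status code of a SIP response, or None for a timeout or garbage."""
    if not response:
        return None
    parts = response.split(b"\r\n", 1)[0].split(b" ", 2)
    if len(parts) < 2 or parts[0] != b"SIP/2.0" or not parts[1].isdigit():
        return None
    return int(parts[1])


@dataclass
class PeerHealth:
    """Tracks whether a peer is currently eligible for routing (hypothetical helper)."""
    up: bool = True

    def record(self, response: Optional[bytes]) -> bool:
        # Following the rule in the text: 200 OK means up; an error or no reply means down.
        self.up = parse_status(response) == 200
        return self.up
```

A scheduler would send `build_options(...)` every 30 seconds or so, pass whatever came back within the qualify timeout (or `None`) to `PeerHealth.record`, and skip any peer whose `up` flag is false when routing.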

Registration state

For register-based trunks, the registration itself is a health signal. A lapsed or rejected REGISTER tells you the path is broken even when no call is in flight. Active OPTIONS probing is generally more reliable than registration alone because it tests the full request/response round-trip, not just the binding.

DNS SRV records

One hostname can resolve to multiple prioritized targets via SRV records, each carrying a priority and a weight. A client that honours SRV (Asterisk pjsip, FreeSWITCH, Kamailio and OpenSIPS all do) tries the target with the lowest priority number first and falls through to the next on failure; weighted load-balancing within a priority tier comes for free. SRV is the simplest form of failover because the resolver, not your dialplan, carries the target list.
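
To make the priority-then-weight behaviour concrete, the sketch below orders a set of SRV records the way an SRV-aware client roughly would: ascending priority number, with a weighted random draw inside each tier. It is a simplified take on the selection described in RFC 2782, not a faithful implementation, and the record values and hostnames are made up.

```python
import random
from typing import List, NamedTuple, Optional


class Srv(NamedTuple):
    priority: int  # lower number = tried first
    weight: int    # relative share within the same priority
    target: str


def srv_order(records: List[Srv], rng: Optional[random.Random] = None) -> List[Srv]:
    """Return records in attempt order: by priority tier, weighted-random within a tier."""
    rng = rng or random.Random()
    ordered: List[Srv] = []
    for prio in sorted({r.priority for r in records}):
        tier = [r for r in records if r.priority == prio]
        while tier:
            if sum(r.weight for r in tier) == 0:
                pick = rng.choice(tier)  # all weights zero: pick uniformly
            else:
                pick = rng.choices(tier, weights=[r.weight for r in tier], k=1)[0]
            tier.remove(pick)
            ordered.append(pick)
    return ordered


# Hypothetical records: two in-region targets split 70/30, one out-of-region backup.
records = [
    Srv(20, 0, "backup.lon.example.net"),
    Srv(10, 70, "a.ams.example.net"),
    Srv(10, 30, "b.ams.example.net"),
]
attempt_order = srv_order(records, random.Random(1))
```

Whatever the draw inside the first tier, the priority-20 backup always comes last, which is the failover property the records are meant to express.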

Response-code semantics

Mid-call, the peer’s response tells you whether to advance or stop. This is the single most-misunderstood part of failover: not every non-200 is a reason to retry. A 486 Busy Here or 603 Decline is a real call outcome: the destination was reached and the answer was “no.” Retrying it on another trunk is wrong: at best you waste time, at worst you ring a second device and deliver a duplicate call. Reserve retries for failures that mean “this path is broken,” not “this call is resolved.”

| SIP response | Retry / try next? | Why |
| --- | --- | --- |
| Timeout (no response) | Yes | Peer is dead or unreachable, nothing came back. Advance immediately. |
| 408 Request Timeout | Yes | Upstream didn’t respond in time; the path, not the call, failed. |
| 500 Server Internal Error | Yes | Peer-side fault. The next target may be healthy. |
| 503 Service Unavailable | Yes | Gateway overloaded or in maintenance, the canonical “use my backup” signal. |
| 504 Gateway Timeout | Yes | Peer’s upstream didn’t answer; try a different path to the PSTN. |
| 486 Busy Here | No | Destination reached and busy. A real outcome, retrying delivers a duplicate. |
| 603 Decline | No | Callee actively declined. Final answer; honour it. |
| 404 Not Found | No | The number doesn’t exist on that peer. Another trunk won’t change that, fix the routing. |
| 403 Forbidden | No | Authenticated but not permitted. A config/permission problem, not a transient fault. |

The rule of thumb: timeouts (including 408) and 5xx mean “try next”; other 4xx and 6xx call outcomes mean “stop.” The grey area is codes like 480 Temporarily Unavailable: treat them per carrier behaviour, but lean toward not retrying unless you know the code reliably indicates a path problem on your routes.
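
The rule of thumb fits in a few lines of Python. This is a sketch of the classification only, with a hypothetical function name; `extra_retryable` is an assumed escape hatch for grey-area codes such as 480 on routes where you have confirmed they signal a path problem.

```python
from typing import FrozenSet, Optional


def should_try_next(
    status: Optional[int],
    extra_retryable: FrozenSet[int] = frozenset(),
) -> bool:
    """True when the failure points at the path; False when the call got a real answer.

    `status` is the final SIP response code, or None when nothing came back.
    """
    if status is None:  # timeout: peer dead or unreachable
        return True
    if status == 408 or 500 <= status <= 599:
        return True
    return status in extra_retryable  # everything else is a final outcome by default


# Path failures advance to the next target...
retry_codes = [code for code in (None, 408, 500, 503, 504) if should_try_next(code)]
# ...while call outcomes stop the sequence.
final_codes = [code for code in (486, 603, 404, 403, 480) if not should_try_next(code)]
```

Keeping 480 out of the default set follows the "lean toward not retrying" advice; opting in per route is then an explicit decision, e.g. `should_try_next(480, frozenset({480}))`.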

3. The four routing strategies

Once you can detect a bad target, you need a policy for ordering and selecting among the good ones. There are four that cover essentially every real-world need, and they map one-to-one onto DIDHub Routing Profiles.

| Strategy | How it works | Best for |
| --- | --- | --- |
| Failover | Try targets in priority order; advance to the next only when the current one fails. | Primary / backup. A clear preferred path with hot standbys. |
| Round-robin | Rotate across targets call-by-call, spreading traffic evenly. | Equivalent targets where you want even load and no single hotspot. |
| Weighted load-balancing | Distribute calls by configured weight (e.g. 70 / 30). | Targets of different capacity, or a gradual carrier migration. |
| Simultaneous ring | Fork to all enabled targets at once; first to answer wins, the rest get CANCEL. | Lowest latency-to-answer, “reach a human fast.” |

Failover

The default and the one most people mean by “failover.” Targets carry an explicit order; the engine always prefers the top one and only walks down the list when a target fails the retry test above. Calls concentrate on the primary, which is exactly what you want when the primary is your best route (lowest cost, best quality, or a direct peering) and the others exist purely as insurance. Pair it with active health-checks so a dead primary is skipped instantly rather than costing every call a timeout.
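
A failover walk can be sketched as a short loop. The version below is an illustration under assumptions, not DIDHub's engine: `Target`, `failover_call` and the `attempt` callback (which returns the final SIP status, or `None` on timeout) are hypothetical names, and health is assumed to come from background probing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Target:
    name: str
    position: int         # lower position = tried first
    healthy: bool = True  # as reported by background health checks
    enabled: bool = True


def path_failed(status: Optional[int]) -> bool:
    """Timeouts, 408 and 5xx indict the path; anything else is a final answer."""
    return status is None or status == 408 or 500 <= status <= 599


def failover_call(
    targets: List[Target],
    attempt: Callable[[Target], Optional[int]],
    max_retries: int = 2,
) -> Tuple[Optional[str], Optional[int]]:
    """Walk targets in position order; return (target name, status) of the last attempt."""
    candidates = sorted(
        (t for t in targets if t.enabled and t.healthy),
        key=lambda t: t.position,
    )
    last: Tuple[Optional[str], Optional[int]] = (None, None)
    for target in candidates[: max_retries + 1]:  # first try plus bounded retries
        status = attempt(target)
        last = (target.name, status)
        if not path_failed(status):
            break  # answered, or a real outcome such as 486 / 603
    return last
```

Filtering out unhealthy targets before the loop is what makes a dead primary cost nothing: it is never attempted, so no call waits out its timeout.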

Round-robin

Rotates the starting target on each new call so traffic spreads evenly across the pool. Use it when your targets are genuinely interchangeable (two identical SBCs, two equal carrier routes) and you want to avoid hammering one while the other idles. Round-robin distributes load; it doesn’t weight by capacity, so it assumes the targets can each take an equal share.
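
Rotating the starting target is a one-line list rotation. The class below is a minimal, single-threaded sketch with hypothetical names; a production router would need a shared, thread-safe counter.

```python
from itertools import count
from typing import List, Sequence


class RoundRobin:
    """Yields a per-call attempt order whose starting target rotates each call."""

    def __init__(self, targets: Sequence[str]) -> None:
        if not targets:
            raise ValueError("round-robin needs at least one target")
        self._targets = list(targets)
        self._calls = count()  # number of calls routed so far

    def next_order(self) -> List[str]:
        start = next(self._calls) % len(self._targets)
        # Rotate so this call starts at `start`; the rest remain as fallbacks.
        return self._targets[start:] + self._targets[:start]


rr = RoundRobin(["sbc-a", "sbc-b"])
first_call = rr.next_order()   # starts at sbc-a
second_call = rr.next_order()  # starts at sbc-b
```

Returning the full rotated list, rather than a single target, keeps the remaining targets available as fallbacks if the first one fails the retry test.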

Weighted load-balancing

Like round-robin but proportional. Assign weights and the engine distributes calls in that ratio: for example, send 70% to carrier A and 30% to carrier B because A has more capacity or better pricing. Weights are also the cleanest migration tool: start a new carrier at a 5% weight, watch quality and CDRs, then ramp to 50/50 and eventually cut over, all without a flag-day change. Set a weight to zero to drain a target gracefully before removing it.
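
One common way to implement proportional selection is a weighted random draw per call, sketched below. The function name and target names are hypothetical, and a real engine might use a deterministic scheme (such as smooth weighted round-robin) instead of randomness.

```python
import random
from collections import Counter
from typing import Dict, Optional


def pick_weighted(weights: Dict[str, int], rng: Optional[random.Random] = None) -> str:
    """Pick one target with probability proportional to its weight.

    Targets with weight 0 are never chosen, which is how a drain works.
    """
    rng = rng or random.Random()
    live = {name: w for name, w in weights.items() if w > 0}
    if not live:
        raise ValueError("no target with a positive weight")
    names = list(live)
    return rng.choices(names, weights=[live[n] for n in names], k=1)[0]


# Simulate 10,000 calls across a 70/30 split with one target being drained.
rng = random.Random(42)
profile = {"carrier-a": 70, "carrier-b": 30, "old-carrier": 0}
tally = Counter(pick_weighted(profile, rng) for _ in range(10_000))
share_a = tally["carrier-a"] / 10_000  # lands close to 0.70
```

A migration ramp is then just a sequence of weight changes over time (5, then 25, then 50, and so on) applied to the same profile.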

Simultaneous ring (fork)

Rings every enabled target at the same instant; whichever answers first wins the call and the others receive a CANCEL. This minimises time-to-answer because you’re not waiting out one target’s ring timeout before trying the next. That makes it ideal for “get a human on the line as fast as possible” scenarios like a small support team spread across several endpoints. The cost is channel consumption: every fork holds a channel on every target until one answers, so sim-ring across many high-cost targets multiplies your concurrent-channel usage. Use it where speed-to-answer matters more than channel efficiency.
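
The fork-and-cancel pattern maps naturally onto concurrent tasks. The sketch below uses Python's asyncio as a stand-in for SIP forking: `ring` is a hypothetical coroutine that resolves to `True` when a target answers, and cancelling a task plays the role of sending CANCEL on a losing leg.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable, Optional


async def simultaneous_ring(
    targets: Iterable[str],
    ring: Callable[[str], Awaitable[bool]],
    timeout_s: float = 25.0,
) -> Optional[str]:
    """Ring all targets at once; return the first that answers, or None."""
    tasks: Dict[asyncio.Task, str] = {asyncio.create_task(ring(t)): t for t in targets}
    pending = set(tasks)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout_s
    winner: Optional[str] = None
    try:
        while pending and winner is None:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break  # overall ring timeout reached
            done, pending = await asyncio.wait(
                pending, timeout=remaining, return_when=asyncio.FIRST_COMPLETED
            )
            for task in done:
                # A leg that errored or was declined does not win; keep waiting.
                if task.exception() is None and task.result():
                    winner = tasks[task]
                    break
    finally:
        for task in pending:
            task.cancel()  # the losing legs get the equivalent of CANCEL
        await asyncio.gather(*pending, return_exceptions=True)
    return winner
```

Called as `asyncio.run(simultaneous_ring(["desk", "mobile"], ring))`, it returns the name of whichever target answered first. Every task stays alive until a winner emerges, which mirrors the channel cost described above: each fork holds a leg open on every target.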

4. Tuning: timeouts, retries, weights

The strategy decides which target; the tuning knobs decide how patiently the engine waits and how hard it tries. Get these wrong and a correct strategy still behaves badly.

  • Per-target ring timeout: how long to wait for an answer before giving up on a target (a sensible default is 25 seconds). Too short and you abandon callees who were about to pick up; too long and a dead-but-not-detected target makes the whole call sequence crawl before failover kicks in.
  • Max retries: how many additional targets the engine will walk through before declaring the call failed. This bounds total call-setup time. A small number (1–2 extra hops) is right for most setups; an unbounded retry chain just delays the inevitable failure and ties up resources.
  • Target priority / position: the order for failover. Lower position = tried first. This is where you encode “primary, then secondary, then tertiary.”
  • Weights: the ratio for weighted load-balancing, and the lever for migrations and graceful drains.
  • Enable / disable per target: pull a target out of rotation without deleting it. Indispensable for planned maintenance: disable, let it drain, work on it, re-enable.

The core tradeoff runs in both directions. Too aggressive (tight timeouts plus eager retries, especially when you retry on codes you shouldn’t) and you risk duplicate calls (the same call delivered to two targets), longer-than-necessary setup as the engine churns, and amplified load on already-struggling peers. Too lazy (long timeouts, few or no retries, no active health-checking) and calls sit waiting on a dead target and get dropped instead of rescued. The sweet spot is short timeouts backed by active OPTIONS probing (so failover is near-instant because dead targets are already excluded), a small bounded retry count, and retries scoped strictly to timeout/5xx conditions.
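
How ring timeout and max retries bound setup time is simple arithmetic, worked here as a small Python sketch. It assumes the sequential strategies (not simultaneous ring) and the worst case where every attempted target rings out in full; the function name is hypothetical.

```python
def worst_case_setup_s(ring_timeout_s: float, max_retries: int) -> float:
    """Upper bound on sequential call setup: the first target plus each retry rings out."""
    if ring_timeout_s < 0 or max_retries < 0:
        raise ValueError("ring timeout and max retries must be non-negative")
    return ring_timeout_s * (1 + max_retries)


default_bound = worst_case_setup_s(25, 2)  # 25 s x 3 attempts = 75 s
tight_bound = worst_case_setup_s(10, 1)    # 10 s x 2 attempts = 20 s
```

Seventy-five seconds is longer than most callers will wait, which is why active probing matters: with dead targets already excluded, the first attempt usually lands on a live peer and the bound is rarely approached.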

Inbound and outbound failover are not the same problem. Outbound failover is how your calls egress: your PBX or dialplan chooses among carrier trunks and advances on failure, and you own that logic. Inbound failover is how a DID provider routes calls to you: it must health-check your endpoints and pick a reachable one. With DIDHub, inbound is where a number’s Routing Profile applies: DIDHub probes your targets and routes incoming calls according to the profile’s strategy. Design both halves; an HA outbound path with a single brittle inbound endpoint is still half-exposed.

5. Multi-PoP & geographic redundancy

Everything above protects you when a target fails. The next tier protects you when a whole location fails. If your primary and secondary trunks both terminate in the same data centre, or both depend on the same metro’s transit, a single regional event takes out your “redundant” pair at once. True high availability means targets that fail independently.

Geographic redundancy means spreading targets across separate points of presence in different regions, so a PoP outage, a regional network partition, or a localized carrier problem only removes one path. The pattern composes cleanly with the strategies: a failover profile whose primary is your nearest PoP and whose backups are progressively more distant ones gives you low latency in the normal case and survivability when the near PoP dies; a weighted or round-robin profile across two healthy regions spreads load and survives the loss of either. DNS SRV with priority tiers expresses exactly this: in-region targets go in the tier with the lowest priority number (tried first), out-of-region targets in the next, and the failover is automatic for any SRV-aware client. Pair multi-PoP topology with active probing per target and you have routing that rides out both single-gateway faults and whole-region outages.
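
A quick way to audit a target list for shared failure domains is to group targets by location and flag any region that hosts more than one. The sketch below assumes you tag each target with a `region` label yourself; that label, like the other names here, is hypothetical and not a DIDHub field.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class PlacedTarget(NamedTuple):
    name: str
    region: str  # hypothetical label, e.g. the PoP or data centre hosting the target


def shared_regions(targets: List[PlacedTarget]) -> Dict[str, List[str]]:
    """Return regions hosting more than one target, i.e. targets that fail together."""
    by_region: Dict[str, List[str]] = defaultdict(list)
    for t in targets:
        by_region[t.region].append(t.name)
    return {region: names for region, names in by_region.items() if len(names) > 1}


def survives_region_loss(targets: List[PlacedTarget]) -> bool:
    """True when losing any single region still leaves at least one target."""
    return len({t.region for t in targets}) >= 2


pair = [PlacedTarget("primary", "lon-dc1"), PlacedTarget("secondary", "lon-dc1")]
spread = pair + [PlacedTarget("tertiary", "ams-dc2")]
```

Here `pair` looks redundant on paper but fails the `survives_region_loss` check, while `spread` passes it and still reports `lon-dc1` as a shared failure domain worth knowing about.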

6. How DIDHub Routing Profiles do it

DIDHub builds all of this into a single primitive: the Routing Profile. A profile is a named call-routing policy that chains an ordered list of targets and applies one of the four strategies to them: failover, round_robin, weighted_lb, or simultaneous_ring. Targets aren’t limited to SIP trunks: a single profile can mix SIP trunks, PSTN forwards (to any E.164 number), and HTTPS webhooks, so “try the SIP trunk, then fall back to forwarding to a mobile” is one profile, not a custom dialplan.

Each profile carries a ring timeout and a max-retries setting, and each target carries a position (the failover order), a weight (the load-balancing ratio), and an enable/disable toggle: the exact tuning knobs covered above. You attach a profile to any number of DIDs; the profile governs how DIDHub routes inbound calls to that number, and one profile can back hundreds of numbers so policy stays consistent across your estate.

Build a profile in the dashboard (create it, pick a strategy, add and order targets, set timeout and retries) or drive the whole thing over the API:

# 1. Create a failover profile (25s ring, 2 retries)
curl -X POST https://api.didhub.io/v1/routing-profiles \
  -H "Authorization: Bearer $DIDHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"name":"London inbound HA","strategy":"failover","ring_timeout_s":25,"max_retries":2}'

# 2. Add the primary SIP trunk (position 0), then a PSTN fallback (position 1)
#    Content-Type is set explicitly: with -d, curl otherwise sends form-encoded data.
curl -X POST https://api.didhub.io/v1/routing-profiles/rp_.../targets \
  -H "Authorization: Bearer $DIDHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kind":"sip_trunk","sip_trunk_id":"st_primary","position":0,"weight":100}'

curl -X POST https://api.didhub.io/v1/routing-profiles/rp_.../targets \
  -H "Authorization: Bearer $DIDHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"kind":"pstn_forward","pstn_forward_e164":"+447700900000","position":1}'

# 3. Attach the profile to a DID
curl -X PATCH https://api.didhub.io/v1/numbers/+442035550100 \
  -H "Authorization: Bearer $DIDHUB_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"routing_profile_id":"rp_..."}'

To switch a number from weighted load-balancing to sim-ring, you PATCH the profile’s strategy and every attached number changes at once. To migrate carriers, add the new trunk as a low-weight target and ramp the weight over days. To drain a target for maintenance, set enabled:false and let live calls finish. Full request and response shapes are in the API explorer, and ready-made trunk configs for Asterisk, FreePBX, 3CX, Kamailio and more live under integrations.

Bottom line

Failover that actually works is four decisions layered together, and skipping any one leaves a gap:

  • Always health-check. Active SIP OPTIONS probing is what turns a config-file backup into real failover: it removes dead targets before calls reach them, so failover costs nothing instead of one timeout per call.
  • Retry on the right signals only. Timeouts and 5xx mean “try the next target”; 486, 603 and other 4xx/6xx outcomes are real answers. Honour them, or you ship duplicate calls.
  • Pick the strategy that matches the goal. Failover for a clear primary with backups; weighted load-balancing for unequal capacity or carrier migration; round-robin for equivalent targets; simultaneous ring when speed-to-answer beats channel efficiency.
  • Go multi-PoP for true HA. Redundant targets that share a location aren’t redundant; spread them across regions so a whole-PoP failure only removes one path.

DIDHub Routing Profiles give you all four strategies, per-profile timeout and retries, per-target order/weight/enable, and mixed SIP / PSTN / webhook targets behind one assignment, configurable in the dashboard or via /v1/routing-profiles/*. Design both inbound and outbound paths, health-check everything, and your trunks stop being a single point of failure.
