WebRTC & Browser Calling: How Voice Works Without a Desk Phone

Making and taking real phone calls from a browser tab (no plugin, no softphone install, no desk phone) is now table stakes for support tools, sales diallers, and AI voice agents. WebRTC is what makes it possible, and it’s built into every modern browser. But WebRTC only carries you as far as the edge of the internet: bridging it to the actual phone network (the PSTN) over SIP is where the real engineering lives. Here’s how the whole chain fits together, and where the bodies are buried.

2026-05-26 · 9 min read

By Daria Kesselman · DIDHub editorial

1. What WebRTC is, and what it isn’t

WebRTC (Web Real-Time Communication) is a browser-native standard, jointly specified by the W3C (the JavaScript APIs) and the IETF (the on-the-wire protocols), for real-time audio, video, and arbitrary data between endpoints. It is built into Chrome, Edge, Firefox, and Safari, with no plugin, no Flash, no Java applet, and no download. That is the entire reason browser calling exists: the media engine, the codecs, the echo canceller, and the encryption all already ship inside the browser the user is reading this in.

Here is the single most important thing to understand before you build anything, and the thing that trips up almost everyone on their first WebRTC project: WebRTC is media, not signaling. The standard handles capturing the microphone, encoding and encrypting the audio, punching through firewalls, and getting the media stream from point A to point B. It does not tell the two endpoints how to find each other in the first place, who is calling whom, or that a call should ring at all. There is no “connect to this user” primitive in WebRTC.

That omission is deliberate. The WebRTC authors decided not to dictate a signaling protocol, so that the standard could drop into any existing application (a chat app, a SIP network, a game), using whatever channel that application already has for moving messages around. The consequence is that you bring your own signaling. Two browsers cannot establish a call by themselves; something outside WebRTC has to ferry a small amount of setup information between them first. Get that mental model right and the rest of the architecture falls into place.

2. The pieces: media, signaling & ICE

A browser call is assembled from a handful of distinct parts, only some of which are “WebRTC” proper. Knowing which is which tells you exactly what you have to build versus what the browser gives you for free:

Component	Role
getUserMedia()	Captures the local microphone (and camera, if any). Prompts the user for permission, then hands your code a live `MediaStream` of audio. This is the “turn the mic on” call; nothing leaves the device yet.
RTCPeerConnection	The heart of WebRTC: the actual media connection between two endpoints. It encodes and encrypts the captured audio, runs ICE to find a network path, negotiates codecs, and manages the live stream in both directions. Everything about getting real-time media from A to B lives here.
RTCDataChannel	A bidirectional channel for arbitrary application data over the same peer connection: chat, file transfer, game state. Not needed for a plain voice call, but it rides the same encrypted transport when you want it. (DTMF tones do not travel here; WebRTC sends them on the audio path through RTCDTMFSender.)
Signaling channel (yours)	Not part of WebRTC. A side channel, almost always a WebSocket, that you provide to exchange the SDP offer/answer and ICE candidates between the two endpoints so they can agree on codecs and find a network path. No signaling, no call.
SDP offer/answer	The Session Description Protocol blob each side generates to describe what it supports: codecs, media directions, encryption parameters. The caller sends an offer, the callee replies with an answer, both over your signaling channel.
ICE / STUN / TURN	The NAT-traversal machinery (next section). ICE gathers candidate network paths; STUN discovers your public address; TURN relays the media when a direct path is impossible.

The flow, in order: getUserMedia() turns on the mic and yields a stream; you add that stream’s audio track to an RTCPeerConnection; the connection produces an SDP offer, which you ship over your WebSocket to the far side; the far side answers; meanwhile both ends trickle ICE candidates to each other over that same WebSocket until they find a usable path; the peer connection then carries encrypted audio directly, or through a TURN relay when no direct path exists. The browser does the heavy lifting inside RTCPeerConnection; your job is the plumbing that carries the offer, the answer, and the candidates.
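The calling side of that flow can be sketched in browser TypeScript roughly as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the `Signal` message shapes, the `placeCall` and `send` names, and the STUN URL are invented for illustration, the WebSocket is assumed to be open already, and the far side is assumed to send its answer before its candidates.

```typescript
// Hypothetical message shapes for the signaling channel; WebRTC does not define these.
type Signal =
  | { type: "offer"; sdp: string }
  | { type: "answer"; sdp: string }
  | { type: "candidate"; candidate: RTCIceCandidateInit };

// Serialize one signaling message onto the WebSocket.
function send(ws: WebSocket, msg: Signal): void {
  ws.send(JSON.stringify(msg));
}

// Caller side. Invoke from a click handler so the permission prompt and
// audio playback are tied to a user gesture.
async function placeCall(
  signaling: WebSocket,          // assumed to be open already
  remoteAudio: HTMLAudioElement, // assumed to carry the autoplay attribute
): Promise<RTCPeerConnection> {
  // 1. Turn on the mic, keeping echo cancellation and noise suppression enabled.
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true },
  });

  // 2. Create the peer connection and attach the local audio track.
  //    The STUN URL is a placeholder, not a real server.
  const pc = new RTCPeerConnection({
    iceServers: [{ urls: "stun:stun.example.com:3478" }],
  });
  for (const track of stream.getAudioTracks()) pc.addTrack(track, stream);

  // 3. Play whatever audio the far side sends.
  pc.ontrack = (event) => {
    remoteAudio.srcObject = event.streams[0];
  };

  // 4. Trickle local ICE candidates to the far side as they are gathered.
  pc.onicecandidate = (event) => {
    if (event.candidate) {
      send(signaling, { type: "candidate", candidate: event.candidate.toJSON() });
    }
  };

  // 5. Apply the answer and the remote candidates as they arrive.
  signaling.onmessage = async (event) => {
    const msg = JSON.parse(event.data) as Signal;
    if (msg.type === "answer") {
      await pc.setRemoteDescription({ type: "answer", sdp: msg.sdp });
    } else if (msg.type === "candidate") {
      await pc.addIceCandidate(msg.candidate);
    }
  };

  // 6. Create the offer, apply it locally, and ship it to the far side.
  const offer = await pc.createOffer();
  await pc.setLocalDescription(offer);
  send(signaling, { type: "offer", sdp: offer.sdp ?? "" });
  return pc;
}
```

In a page this would typically hang off a button, for example `callButton.onclick = () => placeCall(ws, audioEl)`, where `callButton`, `ws`, and `audioEl` are whatever your app already has. Error handling (a denied mic prompt, a closed socket) is left out to keep the shape of the flow visible.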

For the signaling layer you have choices. A common one is SIP-over-WebSocket (RFC 7118), which runs the same SIP signaling the phone network uses, just framed over a WebSocket the browser can open; libraries like SIP.js and JsSIP implement exactly this. Equally common is a plain custom JSON signaling layer: you define your own little message format ({type:"offer", sdp:"..."}) and relay it through your own server. Both are valid; SIP-over-WebSocket buys you direct compatibility with SIP infrastructure, while custom JSON buys you simplicity. WebRTC genuinely does not care which you pick.
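For the custom JSON route, one possible message format and a defensive parser might look like the sketch below. The `SignalingMessage` shapes, the `bye` message, and the `parseSignalingMessage` name are assumptions made up for this example; nothing in WebRTC prescribes them.

```typescript
// Hypothetical wire format for a custom JSON signaling layer.
type SignalingMessage =
  | { type: "offer"; sdp: string }
  | { type: "answer"; sdp: string }
  | { type: "candidate"; candidate: string; sdpMid: string | null }
  | { type: "bye" };

// Parse one text frame from the WebSocket. Returns null for anything that is
// not a well-formed message, so a relay can drop it instead of forwarding it.
function parseSignalingMessage(raw: string): SignalingMessage | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== "object" || data === null) return null;

  const { type, sdp, candidate, sdpMid } = data as Record<string, unknown>;
  if (type === "offer" && typeof sdp === "string") return { type: "offer", sdp };
  if (type === "answer" && typeof sdp === "string") return { type: "answer", sdp };
  if (type === "candidate" && typeof candidate === "string") {
    return {
      type: "candidate",
      candidate,
      sdpMid: typeof sdpMid === "string" ? sdpMid : null,
    };
  }
  if (type === "bye") return { type: "bye" };
  return null; // unknown type or missing fields
}

console.log(parseSignalingMessage('{"type":"offer","sdp":"v=0"}'));
console.log(parseSignalingMessage("not json")); // prints: null
```

Validating at the relay keeps malformed frames from one client from reaching the other, which is most of what a custom signaling server has to do beyond routing.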

3. NAT traversal & why TURN matters (and costs)

Once two endpoints have exchanged SDP, they still have to find a network path to each other, and that is hard, because almost nobody is on the public internet directly. Laptops sit behind home routers, office machines behind corporate firewalls, phones behind carrier-grade NAT. Each of those does Network Address Translation, hiding the device behind a shared public IP and rewriting ports on the fly. Neither endpoint inherently knows its own public address, let alone how to reach the other’s.

WebRTC solves this with ICE (Interactive Connectivity Establishment), the framework that orchestrates the whole discovery dance. ICE gathers every candidate path it can find (the local LAN address, the public address as seen from outside, and a relayed address) and then systematically tries them, best to worst, until one connects. It leans on two helper services:

STUN (Session Traversal Utilities for NAT), a lightweight server that answers one question: “what public IP and port do I appear to be coming from?” Armed with that, an endpoint can advertise a reachable address, and in many cases the two sides connect directly, peer to peer, with STUN never touching the actual media. STUN is cheap: it’s a single round trip, and no media flows through it.
TURN (Traversal Using Relays around NAT), the fallback for when a direct path is impossible. A TURN server sits on the public internet and relays the media: both endpoints send their audio to the TURN server, which forwards it to the other side. It works on almost any network, because both ends only ever have to reach one well-known public host; offer it over TCP or TLS (commonly on port 443) as well as UDP so that it still works where UDP is blocked.
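The three kinds of path show up as a `typ` field in each ICE candidate string: `host` for a local address, `srflx` for the public address learned through STUN, and `relay` for a TURN allocation. A small parser makes that visible when you are debugging why a call did or did not connect. This is a sketch only: `parseCandidate` and `ParsedCandidate` are names chosen for this example, the sample addresses are documentation placeholders, and the parser reads only the leading fields of the candidate line.

```typescript
// host  = local LAN address
// srflx = public address as seen by a STUN server
// prflx = address discovered during connectivity checks
// relay = address allocated on a TURN server
type CandidateType = "host" | "srflx" | "prflx" | "relay";

interface ParsedCandidate {
  transport: string;
  priority: number;
  address: string;
  port: number;
  type: CandidateType;
}

function isCandidateType(s: string): s is CandidateType {
  return s === "host" || s === "srflx" || s === "prflx" || s === "relay";
}

// Field order: foundation, component, transport, priority, address, port, "typ", type.
function parseCandidate(line: string): ParsedCandidate | null {
  // Accept the bare "candidate:..." string or a full "a=candidate:..." SDP line.
  const fields = line.trim().replace(/^a=/, "").replace(/^candidate:/, "").split(/\s+/);
  if (fields.length < 8 || fields[6] !== "typ") return null;
  const kind = fields[7];
  if (!isCandidateType(kind)) return null;
  return {
    transport: fields[2].toLowerCase(),
    priority: Number(fields[3]),
    address: fields[4],
    port: Number(fields[5]),
    type: kind,
  };
}

const sampleCandidate =
  "candidate:842163049 1 udp 1677729535 203.0.113.7 46154 typ srflx raddr 10.0.0.5 rport 46154";
console.log(parseCandidate(sampleCandidate)?.type); // prints: srflx
```

If the only candidates that ever succeed for a given user are `relay`, that user is behind the kind of network where TURN is doing all the work.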

The reason TURN is not optional is symmetric NAT and locked-down corporate firewalls. Behind a symmetric NAT, the public port a device uses is different for every destination, so the address STUN discovered is useless to the peer and the hole-punch fails. Strict enterprise firewalls that block UDP outright produce the same outcome. In these environments, which are common rather than edge cases, there is no direct path to be found, and the call only completes if a TURN relay carries the media. For any production browser-calling product, TURN is mandatory for reliability; ship without it and a meaningful fraction of your users simply can’t connect.
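In practice that means handing the browser both a STUN and a TURN entry when the peer connection is created. The sketch below builds an object in the shape `RTCPeerConnection` accepts as its configuration. The host name, the credentials, the choice of ports 3478 and 443, and the `buildIceConfig` name are all assumptions for illustration; substitute whatever your own TURN deployment uses.

```typescript
// Local copies of the relevant fields, so the sketch also compiles outside a browser.
interface IceServer {
  urls: string | string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServer[];
  iceTransportPolicy: "all" | "relay";
}

function buildIceConfig(
  host: string,       // hypothetical STUN/TURN host
  username: string,   // TURN credentials issued by your own backend
  credential: string,
  forceRelay = false, // true = use TURN only, handy for testing the relay path
): IceConfig {
  return {
    iceServers: [
      { urls: `stun:${host}:3478` },
      {
        // Offer UDP first, then TCP and TLS for networks that block UDP.
        urls: [
          `turn:${host}:3478?transport=udp`,
          `turn:${host}:3478?transport=tcp`,
          `turns:${host}:443?transport=tcp`,
        ],
        username,
        credential,
      },
    ],
    iceTransportPolicy: forceRelay ? "relay" : "all",
  };
}

console.log(buildIceConfig("turn.example.com", "demo-user", "demo-pass").iceTransportPolicy); // prints: all
```

Setting `forceRelay` to true in a test build is a quick way to confirm the relay really works before a user on a locked-down network finds out for you.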

The catch is cost. Because TURN relays the actual audio, every relayed call consumes bandwidth on your TURN server in both directions, for the full duration of the call. STUN-assisted direct connections cost you essentially nothing once set up; TURN-relayed connections cost real egress bandwidth that scales linearly with relayed minutes. You can’t predict in advance which calls will need relaying, because it depends on each user’s network, so you provision TURN for the worst case and pay for the fraction that actually uses it. This single line item is why “just use WebRTC, it’s free” is a half-truth at scale.
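The linear scaling is easy to put numbers on. The sketch below estimates TURN egress from relayed minutes; every input in the example (the monthly minutes, the 20% relay share, the 40 kbps per direction) is a made-up illustration rather than a measurement, and `turnEgressGB` is a name invented here. Multiply the result by your own provider's per-gigabyte price to get a budget figure.

```typescript
// Egress, in decimal gigabytes, that a TURN server sends out for relayed audio.
// The relay forwards each side's audio to the other side, so outbound traffic
// is twice the one-way bitrate for the whole duration of the call.
function turnEgressGB(relayedMinutes: number, kbpsPerDirection: number): number {
  const seconds = relayedMinutes * 60;
  const bitsOut = seconds * kbpsPerDirection * 1000 * 2;
  return bitsOut / 8 / 1e9; // bits -> bytes -> gigabytes
}

// Hypothetical month: 1,000,000 call minutes, 20% of them relayed,
// roughly 40 kbps per direction on the wire (codec payload plus packet overhead).
const relayedMinutes = 1_000_000 * 0.2;
console.log(turnEgressGB(relayedMinutes, 40).toFixed(1)); // prints: 120.0
```

Egress is only part of the bill: the relay hosts themselves, and the headroom you keep for the worst case, sit on top of it.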

4. Bridging WebRTC to the phone network

Everything so far gets a browser talking to another WebRTC endpoint. But a real phone call has a phone on the other end, or a carrier, an IVR, a mobile network, and none of those speak WebRTC. The browser speaks WebRTC; the phone network speaks SIP for signaling and RTP for media. Those are different protocol stacks with different framing, different transports, and frequently different codecs. Something has to sit in the middle and translate. That something is a WebRTC-to-SIP gateway.

The gateway has two distinct jobs, and it’s worth separating them because they fail in different ways:

Job one: signaling translation

On the browser side, call setup arrives as SDP offers/answers over a WebSocket, either your custom JSON or SIP-over-WebSocket. On the phone-network side, call setup is SIP (INVITE, 200 OK, ACK, BYE) over UDP, TCP, or TLS. The gateway terminates the WebSocket-side signaling and re-originates it as standard SIP toward the carrier, mapping one call leg to the other and keeping their state machines in sync for the life of the call. If you used SIP-over-WebSocket this is largely a transport change; if you used custom JSON the gateway translates your messages into real SIP.
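One small piece of keeping the two legs in sync is turning SIP responses from the carrier into events the browser UI can act on. The sketch below maps SIP status codes by class, using the standard meanings (100 Trying, 180 Ringing, 2xx success, 3xx redirection, 4xx to 6xx failure). The `BrowserEvent` shapes and the `mapSipResponse` name are hypothetical, and a real gateway tracks far more state than this.

```typescript
// Hypothetical events a gateway might push to the browser over its signaling channel.
type BrowserEvent =
  | { kind: "progress"; ringing: boolean }
  | { kind: "answered" }
  | { kind: "redirected" }
  | { kind: "failed"; status: number };

// Translate a SIP response status on the carrier leg into a browser-leg event.
function mapSipResponse(status: number): BrowserEvent | null {
  if (status === 100) return null; // 100 Trying: nothing to show the user
  if (status > 100 && status < 200) {
    return { kind: "progress", ringing: status === 180 }; // 180 Ringing, 183 Session Progress
  }
  if (status >= 200 && status < 300) return { kind: "answered" };   // 200 OK
  if (status >= 300 && status < 400) return { kind: "redirected" }; // e.g. 302 Moved Temporarily
  if (status >= 400 && status < 700) return { kind: "failed", status }; // e.g. 486 Busy Here
  return null; // not a valid SIP status code
}

console.log(mapSipResponse(180)); // ringing on the far side
console.log(mapSipResponse(486)); // far side is busy
```

With SIP-over-WebSocket the browser library sees these responses itself; a mapping like this only matters when the browser leg speaks custom JSON.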

Job two: media translation (and usually transcoding)

The browser sends media as SRTP (RTP encrypted with keys negotiated over DTLS, which in WebRTC is mandatory, not a toggle), typically carrying Opus, the codec browsers prefer and one of WebRTC’s mandatory-to-implement audio codecs. The PSTN side, by contrast, usually expects plain RTP, and the legacy phone network overwhelmingly runs G.711. So the gateway has to decrypt the SRTP, and in most cases transcode Opus down to G.711 for the PSTN leg: a full decode-and-re-encode cycle, in real time, for every concurrent call. (For the why and the cost of that, see our deep dive on voice codecs; the Opus↔G.711 conversion is exactly the “transcoding tax” it describes.) Browsers are also required to implement G.711, so a gateway can negotiate it end to end and skip the transcode, at the price of giving up Opus’s quality and loss resilience on the internet leg. Where you do transcode, it burns CPU and adds latency, which is why gateway media capacity, not signaling, is usually what you size for.
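Whether a given call pays the transcoding cost comes down to whether the two legs share a codec. The sketch below pulls codec names out of the `a=rtpmap` lines of an SDP blob and checks for overlap with what the trunk accepts. It is a simplification under stated assumptions: it expects an audio-only SDP that carries an `rtpmap` line for every payload type, and `audioCodecs` and `needsTranscoding` are names invented for this example, not part of any gateway's API.

```typescript
// Names that appear in rtpmap lines but are not speech codecs.
const NON_CODECS = ["telephone-event", "cn"];

// Collect codec names from rtpmap lines, e.g. "a=rtpmap:111 opus/48000/2" -> "opus".
function audioCodecs(sdp: string): string[] {
  const names: string[] = [];
  for (const line of sdp.split(/\r?\n/)) {
    const match = /^a=rtpmap:\d+ ([^\/]+)\//.exec(line);
    if (!match) continue;
    const name = match[1].toLowerCase();
    if (NON_CODECS.indexOf(name) === -1 && names.indexOf(name) === -1) names.push(name);
  }
  return names;
}

// True when the two legs have no codec in common, so the gateway must decode and re-encode.
function needsTranscoding(browserLeg: string[], trunkLeg: string[]): boolean {
  const trunk = trunkLeg.map((c) => c.toLowerCase());
  return !browserLeg.some((c) => trunk.indexOf(c.toLowerCase()) !== -1);
}

const sampleOffer = [
  "v=0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0 8 126",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "a=rtpmap:8 PCMA/8000",
  "a=rtpmap:126 telephone-event/8000",
].join("\r\n");

console.log(audioCodecs(sampleOffer));                         // opus, pcmu, pcma
console.log(needsTranscoding(["opus"], ["PCMU", "PCMA"]));     // prints: true
console.log(needsTranscoding(audioCodecs(sampleOffer), ["PCMU"])); // prints: false
```

The second and third calls show the policy choice: pin the browser leg to Opus and every PSTN call is transcoded; let the gateway pick PCMU from the offer and none are.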

You don’t build this from scratch. Several mature, battle-tested gateways do exactly this job, and the right one depends on your scale and how much you want to assemble yourself:

Janus, a lightweight, plugin-based WebRTC server; its SIP plugin bridges browser calls to a SIP trunk. Popular when WebRTC is the center of gravity.
Asterisk (with chan_pjsip and its WebSocket transport), a full PBX that natively terminates WebRTC clients and routes them to SIP, so the same box does call logic and bridging.
FreeSWITCH, a high-performance media server with first-class WebRTC (Verto and SIP-over-WS) and strong transcoding; favored at higher call volumes.
Kamailio (often paired with RTPEngine), a carrier-grade SIP proxy that handles SIP-over-WebSocket signaling at scale, with RTPEngine doing the SRTP↔RTP media work. The choice when you need to front a lot of traffic.

The shape is always the same: browser → WebSocket/SRTP → gateway → SIP/RTP → carrier → PSTN. The gateway is the hinge the entire architecture turns on.

5. Why put calling in the browser

If bridging WebRTC to the PSTN is this much work, why not just hand users a softphone? Because the browser’s killer feature is what it removes: zero install. There is nothing to download, nothing to provision on the endpoint, nothing for IT to approve. It runs on Windows, macOS, Linux, ChromeOS, and mobile, on whatever browser is already there. That single property unlocks a set of use cases that are awkward or impossible with desk phones and installed softphones:

Support & contact-centre tools. An agent answers customer calls inside the same web app where the ticket, the CRM record, and the call controls already live, no separate handset, no alt-tabbing to a softphone, no per-seat client deployment.
Sales diallers. Reps work a calling queue straight from the browser CRM. Click a lead, the call connects, the disposition is logged in place. Onboarding a new rep is a login, not a desk setup.
Click-to-call. A “call us” button on a web page or in-app that connects the visitor instantly, with no number to dial and no app to install, the lowest-friction path from interest to a live conversation.
AI voice agents. A natural fit: the AI agent runs server-side, the human sits in a browser, and WebRTC carries the audio between them, while a gateway gives that same agent PSTN reach so it can also take calls from real phone numbers. Browser audio in, phone-network audio out, the model in the middle.
Embedded softphones. A full dial-pad, hold, transfer, and presence experience embedded directly in your product’s UI, so “calling” becomes a feature of your app rather than a separate tool the user has to run alongside it.

In every one of these, the value is the same: the call lives where the work already is, and there is nothing to install to get there.

6. Gotchas at scale

A two-person WebRTC demo works on the first afternoon. Production at volume is where the sharp edges show up, and most of them are predictable, so plan for them up front:

TURN bandwidth cost. As covered above, relayed calls consume real egress bandwidth for their full duration, and you can’t predict which calls will need it. Provision and budget TURN for a meaningful fraction of traffic; treat it as a recurring cost line, not a one-time setup.
Transcoding load. Opus→G.711 for the PSTN leg is CPU-heavy and adds latency, and it happens per concurrent call. Media-server transcoding capacity, not signaling throughput, is what caps a gateway, so size for simultaneous transcoded calls and scale the media tier horizontally.
Echo & audio quality. Browser laptops with open speakers and cheap mics are an echo machine. WebRTC ships acoustic echo cancellation and noise suppression, but you have to keep them enabled and avoid undoing them, and a PSTN leg can still inject echo from the far side that your browser AEC never sees.
Mic-permission & autoplay UX. Browsers gate microphone access behind an explicit user permission prompt, and they block audio from auto-playing until the user interacts with the page. If you call getUserMedia() at the wrong moment or try to play inbound audio before a click, the user gets silence or a denied prompt. Design the permission and first-interaction flow deliberately; it is a UX problem, not just a code one.
Mobile-browser quirks. Mobile Safari and mobile Chrome have their own rules around background tabs, audio interruptions (an incoming cellular call), and power management that can drop or mute a WebRTC session. Test on real devices; desktop behaviour does not predict mobile.
Multiparty needs an SFU. A simple peer-to-peer mesh, where every participant sends their stream to every other participant, falls apart past a handful of peers: each participant’s upload bandwidth and encoding CPU grow with every peer added, and the number of connections across the call grows with the square of the participant count. For any real multiparty calling, route media through a Selective Forwarding Unit (SFU): each participant sends one stream up, and the SFU forwards the right streams down. Mesh does not scale; an SFU does.
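The mesh-versus-SFU point in the last item is plain counting, sketched below. The `Load` fields and function names are invented for this example, and the SFU case assumes the simplest policy of forwarding every other participant's stream to each participant; real SFUs often forward fewer.

```typescript
interface Load {
  uploadsPerPeer: number;   // streams each participant must encode and send
  downloadsPerPeer: number; // streams each participant receives
  connections: number;      // peer connections across the whole call
  serverForwards: number;   // streams the server sends out (0 when there is no server)
}

// Full mesh: everyone connects to everyone else.
function meshLoad(n: number): Load {
  return {
    uploadsPerPeer: n - 1,
    downloadsPerPeer: n - 1,
    connections: (n * (n - 1)) / 2,
    serverForwards: 0,
  };
}

// SFU: everyone sends one stream up and receives the others from the server.
function sfuLoad(n: number): Load {
  return {
    uploadsPerPeer: 1,
    downloadsPerPeer: n - 1,
    connections: n,
    serverForwards: n * (n - 1),
  };
}

console.log(meshLoad(6)); // 5 uploads per peer, 15 connections
console.log(sfuLoad(6));  // 1 upload per peer, 6 connections, 30 forwarded streams
```

The work does not disappear with an SFU; it moves from each participant's uplink to a server you can scale.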

7. How DIDHub fits

Let’s be precise about where DIDHub sits in this picture, because it is one specific, load-bearing part of the chain, not the whole thing. WebRTC and the gateway are yours; DIDHub supplies the phone-network side. A WebRTC-to-SIP gateway is only useful if its SIP side connects to something that actually reaches the PSTN: real phone numbers to receive calls on, and a trunk to carry calls in and out. That is exactly what DIDHub provides: DIDs (inbound numbers in 130+ countries) and the SIP trunks that connect them to the global phone network.

The integration is deliberately boring, which is the point. You run, or use, a WebRTC-to-SIP gateway (Janus, Asterisk, FreeSWITCH, or Kamailio), you point its SIP side at a DIDHub trunk, and your browser app now reaches the phone network: inbound calls to your DIDHub numbers ring through the gateway to the browser, and outbound calls from the browser flow through the gateway and out over the DIDHub trunk to the PSTN. DIDHub doesn’t replace your gateway or lock you into a proprietary client; it terminates the SIP leg the gateway already speaks, so codec policy, media handling, and signaling stay in your control. The Opus↔G.711 transcoding the PSTN leg needs happens at your gateway boundary, exactly as the voice codecs guide describes; DIDHub just delivers standards-based SIP to whatever you point it at. See the integrations page for how this lands against Asterisk, FreeSWITCH, FreePBX, 3CX, and the AI-voice platforms.

One thing worth flagging directly: DIDHub’s routing profiles can include webrtc_user targets (beta), routing a DID toward a browser-side endpoint rather than only a classic SIP destination. It’s early and gated, so treat it as a preview rather than a finished product, but it points at where this is going: the number, the trunk, and the browser leg described in one routing policy.

If you’re architecting browser calling and want a second opinion on the trunk-and-numbering side (which countries, how to size the trunk for your concurrent-call profile, how the routing profiles fit your gateway), the DIDHub team will talk specifics.

Bottom line

WebRTC handles the browser’s end of a call, and only that. It captures the mic with getUserMedia(), carries encrypted media over an RTCPeerConnection, and punches through NAT with ICE/STUN/TURN, but it is media, not signaling, so you bring your own signaling channel, and you almost certainly need TURN (and its bandwidth bill) for the calls that can’t connect directly. To reach an actual phone, a WebRTC-to-SIP gateway (Janus, Asterisk, FreeSWITCH, or Kamailio) bridges the two worlds, translating WebSocket signaling to SIP and transcoding Opus to G.711 for the PSTN. And on the far side of that gateway, DIDHub supplies the numbers and the SIP trunk that connect the whole thing to the global phone network. Browser does the media, gateway does the bridge, DIDHub does the phone network: get those three boundaries right and a call from a browser tab is just a phone call.
