When you try to add communication capability to a project, plenty of APIs and protocols come rushing to mind: WebSockets, HTTP, long polling, Server-Sent Events, the Fetch API, curl, and many more, and every one of them seems to ask you to pick it. From what I have heard, and believe, a good developer always looks at the use case and chooses accordingly. That makes it fairly easy to pick one method from the pool above, but only if you know which problem each method solves. For this very reason I started researching WebRTC, and that’s when I realized it is not as simple as it sounds. Although the work done on the technology since its inception has made implementation quite straightforward, I am a curious person and wanted to understand what goes on beneath the surface, without going so deep that I fell into another rabbit hole. While researching and working with the technology, I found that it involves many complex components, which at some point confused me. So I have written this blog to explain the technology and its components in a brief yet simple manner. It may not look brief, but believe me, it is brief compared to the vast set of concepts and components that WebRTC includes.
What is WebRTC?
WebRTC is a free and open-source project that provides peer-to-peer, real-time communication in web browsers and mobile applications. With the help of this technology, you can develop applications similar to Zoom, Google Hangouts, Skype, and many more. You can even develop a game using WebRTC. The possibilities are endless. YouTube, which uses and optimizes WebRTC to provide an efficient real-time broadcasting feature on its platform, is one of the many use cases that highlight the strength of WebRTC.
WebRTC is currently supported as a built-in API by all major browsers, including Chrome, Microsoft Edge, Firefox, and Opera. The project started with the motivation of offering an alternative to products such as Adobe Flash; the technology was later acquired by Google and made open source in 2011. Ever since, work has continued to make it better. Last month, the World Wide Web Consortium (W3C) and the Internet Engineering Task Force (IETF) officially announced WebRTC as a standard.
WebRTC and its components
WebRTC lets you capture video, audio, and arbitrary data and send it directly to the other person or endpoint, without the data passing through a central server or anything else in between. Before explaining every single component of WebRTC, let’s look at an overview of how it works in its simplest form.
This diagram gives a bird’s-eye view of how the technology works. You probably have many questions about it. Don’t worry, we will address each of them in the following sections; for now, just take in the overview.
Understanding the Overview
The connection process starts when one of the peers (User-A) creates an offer, sets it as its local description, and sends it to the other peer (User-B). User-B sets this offer as its remote description, creates an answer, sets the answer as its local description, and sends it back to User-A, which then sets the answer as its remote description. Once this exchange completes, the two peers can be connected.
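The exchange above can be sketched in a few lines of TypeScript. This is only a simplified model: the method names mirror the browser’s RTCPeerConnection API, but `SimulatedPeer` and `negotiate` are made-up names for illustration, the "SDP" is just a label, and handing objects from one peer to the other stands in for whatever signaling channel you use.

```typescript
// Simplified model of the offer/answer exchange. The method names mirror the
// browser's RTCPeerConnection API, but this is a stand-in that runs anywhere.
type SdpType = "offer" | "answer";

interface Description {
  type: SdpType;
  sdp: string; // in a real session this is SDP text; here it is only a label
}

class SimulatedPeer {
  peerName: string;
  localDescription: Description | null = null;
  remoteDescription: Description | null = null;

  constructor(peerName: string) {
    this.peerName = peerName;
  }

  createOffer(): Description {
    return { type: "offer", sdp: `session details of ${this.peerName}` };
  }

  createAnswer(): Description {
    // An answer only makes sense as a reply to an offer we already hold.
    if (this.remoteDescription === null || this.remoteDescription.type !== "offer") {
      throw new Error(`${this.peerName} has no remote offer to answer`);
    }
    return { type: "answer", sdp: `session details of ${this.peerName}` };
  }

  setLocalDescription(description: Description): void {
    this.localDescription = description;
  }

  setRemoteDescription(description: Description): void {
    this.remoteDescription = description;
  }

  isNegotiated(): boolean {
    return this.localDescription !== null && this.remoteDescription !== null;
  }
}

// Runs the handshake in the order described above. Passing the objects
// directly stands in for the signaling channel.
function negotiate(caller: SimulatedPeer, callee: SimulatedPeer): void {
  const offer = caller.createOffer();
  caller.setLocalDescription(offer);
  callee.setRemoteDescription(offer); // "sent" over signaling
  const answer = callee.createAnswer();
  callee.setLocalDescription(answer);
  caller.setRemoteDescription(answer); // "sent" back over signaling
}

const userA = new SimulatedPeer("User-A");
const userB = new SimulatedPeer("User-B");
negotiate(userA, userB);
console.log(userA.isNegotiated() && userB.isNegotiated()); // prints true
```

In a browser the corresponding calls return promises and produce real SDP, so treat this purely as a picture of the ordering.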
Let’s simplify the process a little more: think of the offer as User-A’s address and the answer as User-B’s address. You need to know where to send a letter before you actually send it. The process described above lets each peer learn and store the other peer’s address. This address is later used to send video, audio, or any other data directly to the other peer.
Now, some of you must be wondering: if both peers can already send an offer and an answer back and forth, why do they need WebRTC at all, since they are already connected? The offer and answer travel over a separate signaling channel, usually through a server, and are only used to agree on how to talk; the audio, video, and data then flow directly between the peers. Let’s take an example to understand what is happening here. Suppose you met someone at your regular coffee shop or on your university campus and had a great conversation about a topic of mutual interest. When it is time to say goodbye, you both want to continue the conversation some other time, so you discuss how to stay in touch once you go your separate ways. As in any discussion, each of you puts forward a preferred way of communicating, and eventually you negotiate and agree on one particular way. Here, you can think of your preferred way of communicating as the OFFER and the other person’s as the ANSWER. The overview can be seen as a negotiation process that includes many smaller components not shown in the overview diagram. The most important of these is ICE candidates, which we will look into in later sections.
WebRTC does a great job of connecting peers directly, but it’s not magic. It relies on many other protocols that work together under the hood to provide the seamless connection that users enjoy. Before understanding how these protocols work together in WebRTC, let’s first understand what they are and why WebRTC needs them.
Session Description Protocol
Although it has “protocol” in its name, SDP is more of a format used, as the name suggests, to describe the session. The description covers everything needed for a proper connection, such as transport addresses, media details, encryption keys, bandwidth details, and any other attributes used within the session. According to the Mozilla Developer Network, SDP does not carry the content itself but rather the metadata that describes it. For example, if the content is a video that one user wants to send to another, SDP describes its metadata, such as resolution, formats, codecs, and encryption. SDP conveys these details so that both parties can negotiate and agree on mutually acceptable content. To initiate a connection with WebRTC, both parties need to send their SDP to one another. This can be done by any means, such as email, WhatsApp, iMessage, a letter, or anything else that guarantees delivery. In practice, it is generally done through WebSockets or a similar mechanism.
For more information, see RFC 4566, “SDP: Session Description Protocol” (since obsoleted by RFC 8866).
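To make the "metadata, not content" point concrete, here is a small sketch that reads a hand-written SDP sample and lists the media sections and codec names it describes. The sample is far shorter than what a browser actually generates, and `parseMediaSections` is a made-up helper for this post, not part of any WebRTC API; it only understands `m=` and `a=rtpmap:` lines.

```typescript
// One "m=" section of an SDP document, reduced to a few fields.
interface MediaSection {
  kind: string; // "audio", "video", "application", ...
  port: number;
  protocol: string;
  codecs: string[]; // codec names collected from "a=rtpmap:" lines
}

function parseMediaSections(sdp: string): MediaSection[] {
  const sections: MediaSection[] = [];
  for (const rawLine of sdp.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line.indexOf("m=") === 0) {
      // Example: "m=audio 9 UDP/TLS/RTP/SAVPF 111"
      const fields = line.slice(2).split(" ");
      sections.push({
        kind: fields[0] ?? "",
        port: Number(fields[1] ?? "0"),
        protocol: fields[2] ?? "",
        codecs: [],
      });
    } else if (line.indexOf("a=rtpmap:") === 0) {
      // Example: "a=rtpmap:111 opus/48000/2" -> codec name "opus"
      const current = sections[sections.length - 1];
      const encoding = line.split(" ")[1];
      if (current !== undefined && encoding !== undefined) {
        current.codecs.push(encoding.split("/")[0] ?? encoding);
      }
    }
  }
  return sections;
}

// Hand-written sample, much shorter than what a browser generates.
const sampleSdp = [
  "v=0",
  "o=- 4611731400430051336 2 IN IP4 127.0.0.1",
  "s=-",
  "t=0 0",
  "m=audio 9 UDP/TLS/RTP/SAVPF 111",
  "a=rtpmap:111 opus/48000/2",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

const mediaSections = parseMediaSections(sampleSdp);
for (const section of mediaSections) {
  console.log(`${section.kind}: ${section.codecs.join(", ")}`);
}
// prints "audio: opus" and then "video: VP8"
```

Notice that nothing in the sample is actual audio or video; it only says which kinds of media and which codecs the sender is prepared to use.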
Let’s take the letter example again and this time dive a little deeper into the details. For this example, assume you can contact the postman at any time and ask them to change their route.
Let’s say one of your friends told you that he would be sending you a letter, and you gave him your address. Now you are expecting a letter, but like everyone else you live on Earth, and Earth has cities, and cities have roads, and sometimes roads are closed. Even when none are closed, there are simply many ways to get from the city’s post office to your house. But there is one route that would let the postman reach you faster. So you contact the postman and tell them where to turn at each point on the map. Let’s say you want them to take the following series of turns:
- right turn onto abc street from xyz street,
- left turn onto def street from abc street,
- right turn onto ghi street from def street,
- and then go to house number 123.
Here, the postman already has house number 123, because that was exchanged between you and your friend when you decided how to contact one another and chose letters as your means of communication. But all the other turns are crucial for reaching your house quickly and efficiently. That’s where ICE candidates come into the picture. Each turn can be thought of as an ICE candidate, and just as a series of turns helps the postman reach the recipient quickly and efficiently, ICE candidates help form a connection between two peers so that data can travel between them quickly and efficiently.
In networking terms, an ICE candidate can loosely be thought of as a node in the network, such as your router or a switch. More precisely, each candidate is an address (an IP address, a port, and a transport protocol) at which a peer might be reachable. In the example above it was your house, so you knew which series of turns would get the postman to you fastest, but in network communication ICE candidates need to be identified and then sent to the other peer. This identification happens automatically, with the help of a STUN server, once you set the local description. But it’s your device, so why do you need another server to figure out a way to reach it? Can’t you do that yourself? Well, you used to be able to, but not anymore. What does that mean? Before getting into STUN, let me explain NAT in the next section.
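To make the idea of a candidate more tangible, the sketch below parses a few candidate strings of the kind a browser exchanges during signaling. The three sample strings are hand-written with documentation-only IP addresses, and `parseCandidate` is a hypothetical helper that reads just the leading fields; real candidate lines can carry more attributes than this handles.

```typescript
// The leading fields of a candidate string:
// candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
interface ParsedCandidate {
  foundation: string;
  component: number;
  transport: string;
  priority: number;
  address: string;
  port: number;
  candidateType: string; // "host", "srflx", "prflx" or "relay"
}

function parseCandidate(line: string): ParsedCandidate | null {
  const prefix = "candidate:";
  const body = line.indexOf(prefix) === 0 ? line.slice(prefix.length) : line;
  const parts = body.trim().split(/\s+/);
  // Anything without the "typ" keyword in the expected place is rejected.
  if (parts.length < 8 || parts[6] !== "typ") {
    return null;
  }
  return {
    foundation: parts[0] ?? "",
    component: Number(parts[1]),
    transport: parts[2] ?? "",
    priority: Number(parts[3]),
    address: parts[4] ?? "",
    port: Number(parts[5]),
    candidateType: parts[7] ?? "",
  };
}

// Hand-written samples: a local address, a public address discovered through
// STUN ("srflx"), and a relayed address handed out by a TURN server ("relay").
const sampleCandidates = [
  "candidate:1 1 udp 2122260223 192.168.1.10 54321 typ host",
  "candidate:2 1 udp 1686052607 203.0.113.7 61000 typ srflx raddr 192.168.1.10 rport 54321",
  "candidate:3 1 udp 41885439 198.51.100.20 50000 typ relay raddr 203.0.113.7 rport 61000",
];

for (const line of sampleCandidates) {
  const parsed = parseCandidate(line);
  if (parsed !== null) {
    console.log(`${parsed.candidateType}: ${parsed.address}:${parsed.port}`);
  }
}
```

Each candidate is one possible "route" to the same peer; the peers try them in pairs and keep one that works.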
As of now, almost every device sits behind a Network Address Translator (NAT). Some time ago, a device could simply have its own public IP address, which other devices could use to communicate with it directly. Check out the picture shown below for a simpler view.
In recent years, however, mainly because the pool of IPv4 addresses is running out, not every device has its own permanent public IP address. Instead, it usually sits behind a NAT, a device that monitors and manages every network call going between your device and the outside world and alters some details along the way, which also helps preserve your privacy and security. Your router acts as a NAT.
NAT in itself is a very big concept, and one I will not dive deep into in this blog. To understand WebRTC better, just know the following. Whenever a device behind a NAT tries to contact the outside world, the NAT replaces the device’s details (its private address and port) with stand-in public details, so that nobody outside can identify the device that initiated the request. When the outside world replies to those stand-in details, the NAT checks its own records to find which device they were issued for and forwards the response to that device. This is the most basic thing a NAT does. It also performs other tasks, such as choosing the stand-in details according to a particular method and applying criteria to decide which requests from the outside world make it through to the user’s device. These criteria are generally the hurdle for direct peer-to-peer communication.
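The record keeping described above can be pictured with a toy translation table. This is a deliberately simplified sketch: `SimulatedNat` is a made-up class, the addresses are documentation-only examples, and a real NAT also tracks destinations, protocols, and timeouts, and applies the filtering rules mentioned above.

```typescript
interface Endpoint {
  address: string;
  port: number;
}

// Toy NAT: maps each private address:port to a port on one public address.
class SimulatedNat {
  private publicAddress: string;
  private nextPort: number;
  private outbound: Record<string, number> = {}; // "private address:port" -> public port
  private inbound: Record<number, Endpoint> = {}; // public port -> private endpoint

  constructor(publicAddress: string, firstPort: number) {
    this.publicAddress = publicAddress;
    this.nextPort = firstPort;
  }

  // A device sends something out: swap its details for public stand-ins.
  translateOutbound(source: Endpoint): Endpoint {
    const key = `${source.address}:${source.port}`;
    let publicPort = this.outbound[key];
    if (publicPort === undefined) {
      publicPort = this.nextPort;
      this.nextPort += 1;
      this.outbound[key] = publicPort;
      this.inbound[publicPort] = source;
    }
    return { address: this.publicAddress, port: publicPort };
  }

  // A reply arrives: look up which device the stand-in details were issued for.
  translateInbound(publicPort: number): Endpoint | null {
    const target = this.inbound[publicPort];
    return target === undefined ? null : target;
  }
}

const homeRouter = new SimulatedNat("203.0.113.7", 61000);
const laptop: Endpoint = { address: "192.168.1.10", port: 54321 };

const seenFromOutside = homeRouter.translateOutbound(laptop);
console.log(`${seenFromOutside.address}:${seenFromOutside.port}`); // prints 203.0.113.7:61000

const replyTarget = homeRouter.translateInbound(seenFromOutside.port);
console.log(replyTarget !== null ? `${replyTarget.address}:${replyTarget.port}` : "dropped");
// prints 192.168.1.10:54321

console.log(homeRouter.translateInbound(62000) === null ? "dropped" : "forwarded"); // prints dropped
```

The last line is the part that hurts peer-to-peer connections: traffic arriving on a port with no record has nowhere to go.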
As I said, in the letter example you knew which series of turns would be fastest and most efficient. In WebRTC, the underlying algorithm works out efficiency, but identifying ICE candidates needs a STUN server. Gone are the days when every device had its own permanent public IP address. These days a NAT sits in between, actively monitoring your network traffic, and it can be very strict about the requests you receive from the internet, which is great for security but can create problems for WebRTC. The public address and port that the NAT presents for a device can also keep changing, which is a big hurdle when creating a peer-to-peer connection.
A STUN (Session Traversal Utilities for NAT) server helps figure out the public IP address currently associated with a peer’s device and the type of NAT the peer is behind. Since the NAT can be restrictive and there can be other hindrances such as firewalls, a STUN server is used when identifying ICE candidates. It performs a series of tasks to help the client prepare for a smooth, direct peer-to-peer connection. You can run your own STUN server, which should be cheap because it carries little load: all it does is send and receive a couple of requests here and there. There are also many free STUN servers from well-known organizations such as Google that you can use directly.
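For the curious, here is roughly what one of those "couple of requests" looks like on the wire. This is a minimal sketch based on my reading of the STUN specification, covering only IPv4 and doing no actual networking: it builds the 20-byte header of a Binding Request and decodes the XOR-MAPPED-ADDRESS value in which the server reports the public address it saw. The function names and the sample bytes are my own; the browser does all of this for you.

```typescript
const STUN_MAGIC_COOKIE = 0x2112a442; // fixed value present in every STUN message
const BINDING_REQUEST = 0x0001;

// Builds the 20-byte header of a Binding Request that carries no attributes.
function buildBindingRequest(transactionId: Uint8Array): Uint8Array {
  if (transactionId.length !== 12) {
    throw new Error("transaction ID must be 12 bytes");
  }
  const message = new Uint8Array(20);
  const view = new DataView(message.buffer);
  view.setUint16(0, BINDING_REQUEST); // message type
  view.setUint16(2, 0); // attribute length: none
  view.setUint32(4, STUN_MAGIC_COOKIE);
  message.set(transactionId, 8);
  return message;
}

// Decodes the 8-byte value of an IPv4 XOR-MAPPED-ADDRESS attribute.
// Layout: reserved byte, family byte, 2-byte port, 4-byte address, where port
// and address are XOR-ed with (parts of) the magic cookie.
function decodeXorMappedAddress(value: Uint8Array): { address: string; port: number } {
  if (value.length !== 8 || value[1] !== 0x01) {
    throw new Error("expected an IPv4 XOR-MAPPED-ADDRESS value");
  }
  const view = new DataView(value.buffer, value.byteOffset, value.byteLength);
  const port = view.getUint16(2) ^ (STUN_MAGIC_COOKIE >>> 16);
  const raw = (view.getUint32(4) ^ STUN_MAGIC_COOKIE) >>> 0;
  const address = [raw >>> 24, (raw >>> 16) & 0xff, (raw >>> 8) & 0xff, raw & 0xff].join(".");
  return { address, port };
}

// Made-up transaction ID; a real client uses random bytes.
const stunRequest = buildBindingRequest(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]));
console.log(stunRequest.length); // prints 20

// Hand-encoded sample value standing for 203.0.113.7:61000.
const sampleValue = new Uint8Array([0x00, 0x01, 0xcf, 0x5a, 0xea, 0x12, 0xd5, 0x45]);
const mapped = decodeXorMappedAddress(sampleValue);
console.log(`${mapped.address}:${mapped.port}`); // prints 203.0.113.7:61000
```

The decoded address is what ends up in a server-reflexive ICE candidate: the address the outside world sees, rather than the private one the device knows about.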
All is fine as long as the network monitoring by the NAT, a network firewall, or any other defense is not so restrictive that even STUN can’t prepare the client for direct communication. When it is, TURN comes into the picture.
A TURN (Traversal Using Relays around NAT) server does what its full name suggests: it links two peers through a server and relays the data between them, which ultimately makes the connection NOT peer-to-peer. This is the last resort and only happens when there is no way to form a direct peer-to-peer connection and a TURN server has been provided.
Unlike STUN servers, TURN servers are expensive to maintain and keep running. Because a TURN server relays all the communication between the two peers, it needs more computing power and network bandwidth, which costs money.
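In practice, you tell WebRTC about your STUN and TURN servers through a list of ICE servers. The sketch below builds such a list. The entry shape (`urls`, `username`, `credential`) follows the browser’s ICE server configuration, and `stun:stun.l.google.com:19302` is one of Google’s commonly used public STUN servers. Everything else is an assumption for illustration: `buildIceServers` and `TurnSettings` are made-up names, and the TURN URL and credentials are placeholders.

```typescript
// Shape of one ICE server entry, mirroring the browser's configuration object.
interface IceServerEntry {
  urls: string;
  username?: string;
  credential?: string;
}

// Hypothetical settings for a TURN server you run or rent.
interface TurnSettings {
  url: string;
  username: string;
  credential: string;
}

function buildIceServers(turn?: TurnSettings): IceServerEntry[] {
  // STUN is cheap, so a free public server is usually enough.
  const servers: IceServerEntry[] = [{ urls: "stun:stun.l.google.com:19302" }];
  // TURN is optional and only used as a fallback when no direct path works.
  if (turn !== undefined) {
    servers.push({ urls: turn.url, username: turn.username, credential: turn.credential });
  }
  return servers;
}

const stunOnly = buildIceServers();
const withRelay = buildIceServers({
  url: "turn:turn.example.org:3478", // placeholder address
  username: "demo-user", // placeholder credentials
  credential: "demo-secret",
});
console.log(stunOnly.length, withRelay.length); // prints 1 2
```

In a browser, the resulting list would be passed as `new RTCPeerConnection({ iceServers: withRelay })`. Leaving out the TURN entry keeps costs down, but then peers behind very strict NATs may not be able to connect at all.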
Let’s go back to our overview and walk through it again, now that we understand all its components.
- The connection is initiated when Peer A creates an OFFER and sets it as his LOCAL DESCRIPTION.
- ICE candidates start gathering at Peer A ***
- Peer A signals the OFFER to Peer B.
- Peer B receives the OFFER and sets it as her REMOTE DESCRIPTION.
- Peer B creates an ANSWER and sets it as her LOCAL DESCRIPTION.
- ICE candidates start gathering at Peer B ***
- Peer B signals the ANSWER to Peer A.
- Peer A receives the ANSWER and sets it as his REMOTE DESCRIPTION.
Notice the ***. Those steps need special treatment, because the ICE candidates gathered at those stages must be sent to the other peer so that the other peer can add them to its connection. An ICE candidate can only be added to a connection once the remote description has been set on the end where you are adding it. So if you are adding Peer B’s ICE candidates at Peer A’s end, Peer A must already have a remote description set, and vice versa for Peer B. How you handle that is up to you. One way is to put an event handler at both ends that, on receiving an ICE candidate, adds it to the connection only if the remote description is set; otherwise it either ignores the candidate or, more safely, stores it in a list and adds it once the remote description is set. If all the steps described above are done and a TURN server is not required, the connection should be formed successfully.
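The "store it in a list" approach can be sketched as follows. To keep the example runnable outside a browser, it talks to a small interface, `PeerConnectionLike`, whose method names mirror RTCPeerConnection, and uses a `FakeConnection` as a stand-in. Both names, and `CandidateBuffer` itself, are my own. One important simplification: in the browser, `setRemoteDescription` and `addIceCandidate` return promises that you should await, whereas the stand-ins here are synchronous.

```typescript
interface SessionDescriptionLike {
  type: string;
  sdp: string;
}

interface CandidateLike {
  candidate: string;
}

// The small part of a peer connection that this sketch needs.
interface PeerConnectionLike {
  remoteDescription: SessionDescriptionLike | null;
  setRemoteDescription(description: SessionDescriptionLike): void;
  addIceCandidate(candidate: CandidateLike): void;
}

// Holds back candidates that arrive before the remote description is set.
class CandidateBuffer {
  private connection: PeerConnectionLike;
  private pending: CandidateLike[] = [];

  constructor(connection: PeerConnectionLike) {
    this.connection = connection;
  }

  // Call when a candidate arrives from the other peer over signaling.
  onRemoteCandidate(candidate: CandidateLike): void {
    if (this.connection.remoteDescription === null) {
      this.pending.push(candidate); // too early: keep it for later
      return;
    }
    this.connection.addIceCandidate(candidate);
  }

  // Call when the offer or answer arrives from the other peer.
  onRemoteDescription(description: SessionDescriptionLike): void {
    this.connection.setRemoteDescription(description);
    const queued = this.pending;
    this.pending = [];
    for (const candidate of queued) {
      this.connection.addIceCandidate(candidate);
    }
  }
}

// Stand-in that refuses candidates before a remote description exists.
class FakeConnection implements PeerConnectionLike {
  remoteDescription: SessionDescriptionLike | null = null;
  added: string[] = [];

  setRemoteDescription(description: SessionDescriptionLike): void {
    this.remoteDescription = description;
  }

  addIceCandidate(candidate: CandidateLike): void {
    if (this.remoteDescription === null) {
      throw new Error("remote description is not set yet");
    }
    this.added.push(candidate.candidate);
  }
}

const connection = new FakeConnection();
const candidateBuffer = new CandidateBuffer(connection);
candidateBuffer.onRemoteCandidate({ candidate: "candidate-1" }); // early: queued
candidateBuffer.onRemoteDescription({ type: "answer", sdp: "placeholder" });
candidateBuffer.onRemoteCandidate({ candidate: "candidate-2" }); // added right away
console.log(connection.added.join(", ")); // prints candidate-1, candidate-2
```

The same buffer works on both ends: Peer B uses it for candidates that arrive before the offer, and Peer A for candidates that arrive before the answer.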
As I said in the introduction, the list of applications for WebRTC is endless. You can create anything that requires a peer-to-peer connection. Check out the YouTube video from WebRTC Boston in which Ying Yin from YouTube explains how YouTube uses WebRTC to optimize its real-time broadcasting feature. The talk also compares WebRTC to alternatives that have been used in the industry for a long time, some of which inspired its creation. The video is listed in the additional resources section below.
This brings us to the end of this blog. Thank you for reading. I hope you learned what WebRTC is and how its components help to provide a real-time peer-to-peer connection. I encourage you to take a look at the WebRTC resources I have added below to get a more in-depth understanding. If you find any mistakes in this blog, feel free to point them out in the comment section or reach out to me on Twitter. Keep learning and have a nice day. Cheers.