End-to-end encryption for audio and video calls

True end-to-end encryption for audio and video calls is in beta in RealtimeKit SDKs!

In this guide, we'll explore the mechanisms behind end-to-end encryption (E2EE) in Web Real-Time Communication (WebRTC), focusing on its standards, implementation strategies, impact on performance, and how it works within our SDK.

Inbuilt security

WebRTC communication is already encrypted. It uses DTLS combined with SRTP to secure both data and media communications.

SRTP (Secure Real-time Transport Protocol) is an extension of the Real-time Transport Protocol (RTP), which delivers audio and video over the Internet. RTP handles the delivery but lacks built-in security features, which is where SRTP comes into play.

Key Features of SRTP:

Encryption: SRTP encrypts the payload of RTP packets, which contains the actual media data (voice, video, etc.), using symmetric encryption algorithms (AES). This ensures that the content of the communication cannot be easily eavesdropped or intercepted by unauthorized parties.
Message Authentication: SRTP provides a mechanism to verify the authenticity of messages, ensuring that the data has not been tampered with in transit. This is typically achieved using a Message Authentication Code (MAC), a small piece of information, or a tag derived from the packet content and a secret key.
Replay Protection: SRTP implements replay protection so that attackers cannot capture packets and re-send them later in an attempt to disrupt the communication. The receiver keeps track of the index of each authenticated packet and rejects any packet that it has already seen or that is older than its replay window. Packets that merely arrive out of order within the window are still accepted.
Integrity Protection: SRTP also ensures the integrity of the data transmitted using MACs, confirming that it has not been altered from its original form during transit.
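
To make the replay check concrete, here is a minimal sketch of a sliding replay window. It is a simplification for illustration, not the SRTP implementation used by browsers: the class name `ReplayWindow` is hypothetical, and packet indices are plain integers, whereas real SRTP builds a 48-bit index from the 16-bit sequence number and a rollover counter.

```javascript
// Sliding-window replay check, simplified for illustration.
class ReplayWindow {
  constructor(size = 64) {
    this.size = size;      // how far behind the newest packet we still accept
    this.highest = -1;     // highest packet index accepted so far
    this.seen = new Set(); // accepted indices that are still inside the window
  }

  // Returns true if the packet is fresh (and records it), false if it must be dropped.
  accept(index) {
    if (index <= this.highest - this.size) return false; // older than the window
    if (this.seen.has(index)) return false;              // duplicate, i.e. a replay
    this.seen.add(index);
    if (index > this.highest) {
      this.highest = index;
      // Forget indices that have fallen out of the window.
      for (const i of this.seen) {
        if (i <= this.highest - this.size) this.seen.delete(i);
      }
    }
    return true;
  }
}

const replayWindow = new ReplayWindow(4);
console.log(replayWindow.accept(10)); // true: first packet
console.log(replayWindow.accept(9));  // true: late, but still inside the window
console.log(replayWindow.accept(10)); // false: replayed packet
```

In a real receiver, this check only runs on packets whose authentication tag has already been verified, so an attacker cannot advance the window with forged packets.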

The Man-in-the-middle: SFU

While WebRTC was designed as a peer-to-peer protocol, most common implementations involve a centralized server, a Selective Forwarding Unit (SFU), that routes media to the different parties. WebRTC connections are made from Client → SFU and then SFU → Client, which means the built-in encryption terminates at the server. Media is decrypted on the server and re-encrypted before it is forwarded.

While being secure in transit is acceptable in most security threat models, there are use cases where you want mathematical guarantees against tampering or eavesdropping.

Implementing End-to-End Encryption

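Video encoding is lossy, so encrypting raw frames before they are encoded would not work: the encoder would alter the ciphertext and make it impossible to decrypt. Ideally, you apply encryption to the encoded frames, after the encoder and before they are handed to the RTP packetizer for transmission.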


Now, there are two competing standards on how to do this on web browsers:

Insertable Streams API - Introduced in 2020, supported only on Chromium browsers. This is a more general-purpose API that not only allows modification post-encoding and pre-RTP packetization but also allows you to modify frames pre-encoding.
RTCRtpScriptTransform - The current standard. It is not supported by Chromium but is supported by Firefox and Safari. It is much more limited in scope and is designed for cases like end-to-end encryption.

The good news is that since they both come at the same stage of the pipeline, they are interoperable, i.e., media encrypted using Insertable Streams can be decrypted using RTCRtpScriptTransform.

Also, since these encryption steps would be computationally heavy, we don't want to do them on the main thread. We will use Web Workers to offload the encryption/decryption to a different thread.
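
As a rough sketch, choosing between the two APIs at runtime can look like the following. The function name `attachE2EETransform` and the `operation` message field are hypothetical and are not part of the RealtimeKit SDK. The `scope` parameter exists only so the sketch can be exercised outside a browser.

```javascript
// Attaches an encryption/decryption worker to an RTCRtpSender or RTCRtpReceiver.
// `operation` ("encrypt" or "decrypt") is a hypothetical field that the worker
// script is assumed to understand.
function attachE2EETransform(senderOrReceiver, worker, operation, scope = globalThis) {
  if (typeof scope.RTCRtpScriptTransform === "function") {
    // Standard path: the browser delivers encoded frames to the worker
    // through an `rtctransform` event.
    senderOrReceiver.transform = new scope.RTCRtpScriptTransform(worker, { operation });
    return "script-transform";
  }
  if (typeof senderOrReceiver.createEncodedStreams === "function") {
    // Insertable Streams path: the RTCPeerConnection must have been created
    // with `{ encodedInsertableStreams: true }` for this call to succeed.
    const { readable, writable } = senderOrReceiver.createEncodedStreams();
    // Streams are transferred to the worker so frames never touch the main thread.
    worker.postMessage({ operation, readable, writable }, [readable, writable]);
    return "insertable-streams";
  }
  throw new Error("This browser exposes no encoded-frame transform API");
}
```

On the worker side, the standard path receives the frame streams as `event.transformer.readable` and `event.transformer.writable` in an `rtctransform` event handler, while the Insertable Streams path receives them in an ordinary `message` event. In both cases the worker pipes the readable side through a `TransformStream` that encrypts or decrypts each frame and then into the writable side.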


Under the hood

Now, we have a place where we can encrypt/decrypt media, but what about the actual encryption process? Technically, you could do something naive such as XORing every frame with a static value, but that wouldn't be secure.

We chose AES-GCM to encrypt the media frames/samples. Our implementation of the encryption algorithm is inspired by LiveKit's implementation of the same feature.

IV Generation

The IV (initialization vector) gives each AES-GCM encryption operation a unique input. This ensures that the same payload (e.g., a video frame) encrypted multiple times with the same key results in different ciphertexts. For AES-GCM, the critical requirement is that an IV is never reused with the same key. We build the IV from a combination of time and WebRTC metadata around the stream, so that it is unique for every frame.
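
One possible way to build such an IV is sketched below. The exact fields are an assumption made for illustration: the sketch packs the stream's SSRC, the frame's RTP timestamp, and a per-sender frame counter into the 96 bits that AES-GCM expects. The function name `makeIV` is hypothetical and this is not necessarily the layout the SDK uses.

```javascript
// Builds a 12-byte (96-bit) AES-GCM IV from stream metadata.
// The field choice (SSRC, RTP timestamp, frame counter) is an assumption.
function makeIV(ssrc, rtpTimestamp, frameCounter) {
  const iv = new Uint8Array(12);
  const view = new DataView(iv.buffer);
  view.setUint32(0, ssrc >>> 0);         // identifies the media stream
  view.setUint32(4, rtpTimestamp >>> 0); // the "time" component
  view.setUint32(8, frameCounter >>> 0); // separates frames that share a timestamp
  return iv;
}

console.log(Array.from(makeIV(1, 2, 3))); // [0, 0, 0, 1, 0, 0, 0, 2, 0, 0, 0, 3]
```

Since the receiver needs the same IV to decrypt, the IV is typically sent along with the encrypted frame. It does not need to be secret.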

Key Derivation and Key Ratcheting

If your app users set the encryption key to "12345678", you don't want AES to use this weak key directly. Therefore, we use PBKDF2 to derive strong keys from weak keys. PBKDF2 puts the password and a salt through a pseudo-random function a set number of times, according to the iteration count. The final output is a key of the required length that is much more expensive to brute-force than the original password.

The same PBKDF2 mechanism can support key ratcheting, which involves periodically updating the encryption keys used in a communication session. Because each new key is derived from the previous one through a one-way function, an attacker who compromises a key cannot work backwards to earlier keys, which limits how much of the session a single compromised key exposes.

The current encryption key is used to derive a new key at regular intervals or based on specific conditions (e.g., the number of messages sent).
The new key replaces the old key for subsequent encryptions, effectively "ratcheting" forward the key material.
Participants in the communication must synchronize the ratcheting process to ensure they can decrypt received messages with the correct key.

Encrypting the frame

The media frame payload sometimes carries metadata that the SFU requires to function, such as whether a frame is a keyframe; therefore, part of the payload must be kept unencrypted.

How much must stay unencrypted differs for each codec (VP8/VP9/Opus) and each frame type. RealtimeKit SDK provides end-to-end encryption support for all the codecs we support: VP8 and VP9 for video and Opus for audio.

Enable end-to-end encryption in your RealtimeKit setups

We are rolling this out gradually, and therefore, you will need to contact support@dyte.io to have this enabled.

However, once this is enabled, the integration is relatively straightforward.

Let's first see how a typical RealtimeKit SDK initialization works.

import RealtimeKitClient from '@cloudflare/realtimekit';

const meeting = await RealtimeKitClient.init({
  authToken,
});

// use meeting object

To enable end-to-end encryption, create an E2EE manager with a key provider and pass it to init:

import RealtimeKitClient from "@cloudflare/realtimekit";
import RTKE2EEManager from "@cloudflare/realtimekit/modules/e2ee";

const sharedKeyProvider = new RTKE2EEManager.SharedKeyProvider();
sharedKeyProvider.setKey("meeting-password");

const e2eeManager = new RTKE2EEManager({ keyProvider: sharedKeyProvider });

const meeting = await RealtimeKitClient.init({
  authToken,
  modules: {
    e2ee: {
      enabled: true,
      manager: e2eeManager,
    },
  },
});

The above example uses a shared key provider, which, in simple words, means that a single key is used to encrypt every participant's media. You can also set a different key per participant using RTKE2EEManager.ParticipantKeyProvider(), but you will have to coordinate passing the correct key every time a participant joins.

The key takeaway is that you handle the movement of keys, ensuring all participants use the correct key. This key should ideally be transported outside of RealtimeKit-provided communication channels, over your own trusted communication channels. RealtimeKit will handle the encryption and media delivery.

Can I use the X feature while end-to-end encryption is enabled?

Generally, all features should be available when end-to-end encryption is enabled, except the Cloud Recording, AI, and Transcription features (since we can't decrypt media on our servers).

Are chat, data track, and plugins also end-to-end encrypted?

Not right now, but this should be available in the (very) near future.
