Episode Details

215: Turning FreeBSD up to 100 Gbps

Published 8 years, 5 months ago

Description

We look at how Netflix serves 100 Gbps from an Open Connect Appliance, read through the 2nd quarter FreeBSD status report, show you a freebsd-update speedup via nginx reverse proxy, and customize your OpenBSD default shell.

This episode was brought to you by

Headlines

Serving 100 Gbps from an Open Connect Appliance

In the summer of 2015, the Netflix Open Connect CDN team decided to take on an ambitious project. The goal was to leverage the new 100GbE network interface technology just coming to market in order to be able to serve at 100 Gbps from a single FreeBSD-based Open Connect Appliance (OCA) using NVM Express (NVMe)-based storage.
At the time, the bulk of our flash storage-based appliances were close to being CPU limited serving at 40 Gbps using single-socket Xeon E52697v2. The first step was to find the CPU bottlenecks in the existing platform while we waited for newer CPUs from Intel, newer motherboards with PCIe Gen3 x16 slots that could run the new Mellanox 100GbE NICs at full speed, and for systems with NVMe drives.

Fake NUMA

Normally, most of an OCAs content is served from disk, with only 1020% of the most popular titles being served from memory (see our previous blog, Content Popularity for Open Connect for details). However, our early pre-NVMe prototypes were limited by disk bandwidth. So we set up a contrived experiment where we served only the very most popular content on a test server. This allowed all content to fit in RAM and therefore avoid the temporary disk bottleneck. Surprisingly, the performance actually dropped from being CPU limited at 40 Gbps to being CPU limited at only 22 Gbps!
The ultimate solution we came up with is what we call Fake NUMA. This approach takes advantage of the fact that there is one set of page queues per NUMA domain. All we had to do was to lie to the system and tell it that we have one Fake NUMA domain for every 2 CPUs. After we did this, our lock contention nearly disappeared and we were able to serve at 52 Gbps (limited by the PCIe Gen3 x8 slot) with substantial CPU idle time.
After we had newer prototype machines, with an Intel Xeon E5 2697v3 CPU, PCIe Gen3 x16 slots for 100GbE NIC, and more disk storage (4 NVMe or 44 SATA SSD drives), we hit another bottleneck, also related to a lock on a global list. We were stuck at around 60 Gbps on this new hardware, and we were constrained by pbufs.
Our first problem was that the list was too small. We were spending a lot of time waiting for pbufs. This was easily fixed by increasing the number of pbufs allocated at boot time by increasing the kern.nswbuf tunable. However, this update revealed the next problem, which was lock contention on the global pbuf mutex. To solve this, we changed the vnode pager (which handles paging to files, rather than the swap partition, and hence handles all sendfile() I/O) to use the normal kernel zone allocator. This change removed the lock contention, and boosted our performance into the 70 Gbps range.
As noted above, we make heavy use of the VM page queues, especially the inactive queue. Eventually, the system runs short of memory and these queues need to be scanned by the page daemon t