Packets are distributed across the Maglev machines through ECMP.
Maglev serves traffic for Google services and GCP.
Every Google service has one or more VIPs.
Maglev associates each VIP with a set of service endpoints and announces it to the router over BGP. The router, in turn, announces the VIP to the Google backbone.
When the router receives a VIP packet, it forwards it to one of the Maglev machines in the cluster through ECMP, since all Maglev machines announce the VIP with the same cost. When a Maglev machine receives the packet, it selects an endpoint from the set of service endpoints associated with the VIP and encapsulates the packet using GRE. When the packet arrives at the selected service endpoint, it is decapsulated and consumed. The response, when ready, is put into an IP packet with the source address being the VIP and the destination being the IP of the user, and is sent directly back to the user (direct server return).
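To make the addressing at each hop concrete, here is a minimal sketch of the flow. All IPs and the `IPPacket` type are made up for illustration; real Maglev builds actual GRE/IP headers in its forwarder.

```python
from dataclasses import dataclass

@dataclass
class IPPacket:
    src: str        # source IP address
    dst: str        # destination IP address
    payload: object

# Hypothetical addresses, for illustration only.
USER = "203.0.113.7"
VIP = "198.51.100.1"     # the VIP announced over BGP
MAGLEV = "10.0.0.5"      # the Maglev machine picked by the router via ECMP
BACKEND = "10.0.0.42"    # the service endpoint selected by Maglev

# 1. The user's packet arrives at the router addressed to the VIP;
#    the router ECMP-hashes it to one of the Maglev machines.
inbound = IPPacket(src=USER, dst=VIP, payload=b"GET / ...")

# 2. Maglev wraps the original packet in an outer GRE/IP header addressed
#    to the selected backend; the inner packet is left untouched.
encapsulated = IPPacket(src=MAGLEV, dst=BACKEND, payload=inbound)

# 3. The backend decapsulates, handles the request, and replies with the
#    VIP as the source address, so the response goes straight back to the
#    user without passing through Maglev again.
response = IPPacket(src=VIP, dst=USER, payload=b"HTTP/1.1 200 OK ...")
```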
The forwarder receives packets from the NIC, rewrites them with proper GRE/IP headers, and then sends them back to the NIC (the Linux kernel is not involved).
Packets received by the NIC are first processed by the steering module of the forwarder, which calculates the 5-tuple hash of each packet and assigns it to a receiving queue based on the hash value. Each receiving queue is attached to a packet rewriter thread.
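A rough sketch of the steering step, assuming a hypothetical `Packet` type carrying the 5-tuple fields (the real forwarder does this in its own userspace code, not Python):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: int

def five_tuple_hash(p: Packet) -> int:
    """Hash the connection 5-tuple so all packets of one flow map to the same value."""
    key = f"{p.src_ip}|{p.dst_ip}|{p.src_port}|{p.dst_port}|{p.proto}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big")

def steer(packet: Packet, receiving_queues: list) -> None:
    """Steering module: assign the packet to a receiving queue based on its hash.
    Each queue is consumed by exactly one packet rewriter thread."""
    receiving_queues[five_tuple_hash(packet) % len(receiving_queues)].append(packet)
```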
First, the packet thread tries to match each packet to a configured VIP, filtering out unwanted packets.
Then it recomputes the 5-tuple hash and looks up the hash value in the connection tracking table (the hash is recomputed rather than passed along from the steering module to avoid cross-thread synchronization).
The connection table stores backend selection results for recent connections. If a match is found and the selected backend is still healthy, the result is reused. Otherwise, the thread consults the consistent hashing module and selects a new backend for the packet; it also adds an entry to the connection table for future packets with the same 5-tuple. A packet is dropped if no backend is available.
The forwarder maintains one connection table per packet thread to avoid access contention.
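A sketch of that per-thread selection logic, assuming a prebuilt Maglev lookup table (a list mapping hash buckets to backends, see the hashing sketch further down) and a health map fed by health checks; the `PacketThread` class and its fields are made up for illustration:

```python
class PacketThread:
    """One instance per packet rewriter thread; the connection table is
    private to the thread, so no locking is needed."""

    def __init__(self, lookup_table, health):
        self.conn_table = {}              # 5-tuple hash -> previously chosen backend
        self.lookup_table = lookup_table  # Maglev consistent-hashing table
        self.health = health              # backend -> True/False from health checks

    def select_backend(self, flow_hash):
        # Reuse the cached choice if the connection is known and the backend is healthy.
        backend = self.conn_table.get(flow_hash)
        if backend is not None and self.health.get(backend, False):
            return backend
        # Otherwise consult consistent hashing and remember the result for this flow.
        backend = self.lookup_table[flow_hash % len(self.lookup_table)]
        if backend is None or not self.health.get(backend, False):
            return None                   # no backend available: drop the packet
        self.conn_table[flow_hash] = backend
        return backend
```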
After a backend is selected, the packet thread encapsulates the packet with the proper GRE/IP headers and sends it to the attached transmission queue. The muxing module then polls all transmission queues and passes the packets to the NIC.
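The muxing step itself is simple; a minimal sketch, again with hypothetical queue and NIC objects:

```python
def mux(transmission_queues, nic):
    """Muxing module: poll every transmission queue and hand the
    encapsulated packets to the NIC for transmission."""
    for queue in transmission_queues:
        while queue:
            nic.send(queue.pop(0))
```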
Maglev is a userspace application running on commodity Linux servers. Since the Linux kernel network stack is rather computationally expensive, and Maglev doesn’t require any of the Linux stack’s features, it is desirable to make Maglev bypass the kernel entirely for packet processing.
Maglev hashing is designed to provide two properties:
load balancing: each backend will receive an almost equal number of connections.
minimal disruption: when the set of backends changes, a connection will likely be sent to the same backend as it was before.
The idea of Maglev hashing is to assign a preference list of all the lookup table positions to each backend. Then all the backends take turns filling their most-preferred table positions that are still empty, until the lookup table is completely filled in.
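Below is a small Python sketch of that table-population step, following the offset/skip permutation scheme from the paper (table size M is prime so each preference list visits every slot). The hash functions and the M value here are my own stand-ins, not the ones Maglev uses.

```python
import hashlib

def _h(name: str, salt: str) -> int:
    # Stand-in hash; the paper uses two different hash functions for offset and skip.
    return int.from_bytes(hashlib.md5(f"{salt}:{name}".encode()).digest()[:8], "big")

def permutation(name: str, M: int) -> list:
    """Preference list: the order in which this backend wants the table slots.
    With M prime and 1 <= skip < M, this visits every slot exactly once."""
    offset = _h(name, "offset") % M
    skip = _h(name, "skip") % (M - 1) + 1
    return [(offset + j * skip) % M for j in range(M)]

def build_lookup_table(backends: list, M: int = 65537) -> list:
    """Backends take turns claiming their most-preferred slot that is still
    empty, until every slot of the lookup table is filled."""
    assert backends, "need at least one backend"
    perms = {b: permutation(b, M) for b in backends}
    nxt = {b: 0 for b in backends}   # next index to try in each preference list
    table = [None] * M
    filled = 0
    while filled < M:
        for b in backends:
            # Advance to this backend's next preferred slot that is still empty.
            while table[perms[b][nxt[b]]] is not None:
                nxt[b] += 1
            table[perms[b][nxt[b]]] = b
            nxt[b] += 1
            filled += 1
            if filled == M:
                break
    return table
```

Because the backends claim slots in turns, each ends up owning roughly M / len(backends) entries (load balancing), and rebuilding the table after removing one backend changes mostly the entries that pointed at it (minimal disruption).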
Active-passive pairs provide failure resilience. Only active machines serve traffic in normal situations.
ECMP - Equal cost multipath
DSR - Direct server return
VIP - Virtual IP address
GRE - Generic routing encapsulation
NIC - Network interface card