VXLAN
The Virtual eXtensible Local Area Network protocol is used to encapsulate virtual Layer 2 networks over Layer 3+4, designed for use among multitenant hypervisors in (potentially multi-DC) cloud networks. It was formalized in 2014's RFC 7348. It avoids use of 802.1D's Spanning Tree Protocol while facilitating a full broadcast domain, superseding 802.1Q VLANs and their 12-bit VLAN IDs (VXLAN uses a 24-bit ID, the VXLAN Network Identifier aka VNI). The virtual layer 2 network thus created is known as a "VXLAN segment" or "VXLAN overlay network". The agents adding or removing VXLAN encapsulation (commonly hypervisors or switches) are referred to as "VXLAN Tunnel Endpoints" or VTEPs, and play roles similar to bridges, learning MACs and selectively forwarding frames.
The clients within a VXLAN segment needn't know that VXLAN is in use, and use standard unicast/broadcast traffic to talk to other hosts within the segment. Upon receipt of a frame from a client, the local VTEP looks up the remote VTEP with which the destination MAC is associated, and sends the encapsulated frame there. If the client does not know the destination's L2 address, ARP is performed via normal broadcast. Broadcasts within a VXLAN segment are carried to the other VTEPs over an IP multicast group (multicast traffic within the segment uses this same tree).
VTEPs are not allowed to fragment packets. It is thus important to choose a network MTU which allows the VXLAN header to be inserted without exceeding physical MTUs.
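The arithmetic behind that choice can be sketched as below. This is a minimal illustration, assuming an untagged inner frame and outer IP headers without options or extension headers; the function name `inner_mtu` is hypothetical.

```python
# Header sizes in bytes. The underlay MTU already excludes the outer
# Ethernet header, so only the headers inside it count as overhead.
OUTER_IPV4 = 20       # IPv4 header without options
OUTER_IPV6 = 40       # fixed IPv6 header, no extension headers
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14   # encapsulated frame's Ethernet header, no 802.1Q tag


def inner_mtu(underlay_mtu: int, ipv6: bool = False) -> int:
    """Largest MTU usable inside the segment without fragmenting the underlay."""
    outer_ip = OUTER_IPV6 if ipv6 else OUTER_IPV4
    overhead = outer_ip + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


print(inner_mtu(1500))             # IPv4 underlay
print(inner_mtu(1500, ipv6=True))  # IPv6 underlay
```

With a 1500-byte underlay this leaves 50 bytes of overhead over IPv4 and 70 over IPv6. The alternative is to raise the underlay MTU by the same amount so that clients can keep a 1500-byte MTU.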
VXLAN runs over udp/4789 by default (some vendors used 8472 prior to standardization). VXLANs can be stacked.
VXLAN encapsulation
The outermost header is a standard Ethernet header. The source hardware address is initially set to the originating VTEP's MAC address. The destination hardware address is the hardware address of the destination VTEP, or the router by which said VTEP is reached. 802.1Q tags can be used here as they would in any other case. In the case where routing hops exist between the two VTEPs, the source and destination addresses will change with each hop.
The next header is an IPv4 or IPv6 header with the VTEPs' L3 addresses used as source and destination. These addresses persist across hops. Within is the UDP header and its VXLAN payload. This datagram's destination port is the VTEP's VXLAN port, by default 4789. The source port is arbitrary, though RFC 7348 recommends that it be constructed using a hash over encapsulated headers, within the range 49152-65535.
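One way to derive such a source port is sketched below. RFC 7348 does not mandate a particular hash or set of fields; the use of CRC32 over the inner Ethernet header, and the name `vxlan_source_port`, are assumptions made purely for illustration.

```python
import zlib

PORT_MIN = 49152
PORT_MAX = 65535


def vxlan_source_port(inner_frame: bytes) -> int:
    """Map the inner Ethernet header to a UDP source port in 49152-65535."""
    # Hash the inner destination MAC, source MAC and EtherType (first 14 bytes).
    # CRC32 is an arbitrary choice here, not something the RFC specifies.
    digest = zlib.crc32(inner_frame[:14])
    return PORT_MIN + digest % (PORT_MAX - PORT_MIN + 1)


# Hypothetical inner frame: dst MAC, src MAC, EtherType 0x0800, then payload.
frame = bytes.fromhex("525400aabbcc5254001122330800") + b"payload"
print(vxlan_source_port(frame))
```

Because the port depends only on inner headers, all frames of one inner flow take the same path through ECMP routers in the underlay, while different flows spread across paths.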
The VXLAN payload consists of the VXLAN header, plus the original frame minus its Ethernet FCS. The VXLAN header is 8 bytes:
- 8 bits of flags, RRRRIRRR. All R bits must be 0. The I bit must be 1 for a valid VNI.
- 24 reserved bits. All must be 0.
- 24-bit VNI.
- 8 reserved bits. All must be 0.
The VTEP removes the original FCS, and adds its own.
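The header layout above can be sketched with Python's `struct` module. The function names are hypothetical. The parser ignores the reserved bits on receipt and checks only the I flag.

```python
import struct

VXLAN_FLAG_I = 0x08  # RRRRIRRR with only the I bit set


def pack_vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < (1 << 24):
        raise ValueError("VNI must fit in 24 bits")
    # Word 0: 8 flag bits followed by 24 reserved bits.
    # Word 1: 24-bit VNI followed by 8 reserved bits.
    return struct.pack("!II", VXLAN_FLAG_I << 24, vni << 8)


def parse_vxlan_header(header: bytes) -> int:
    """Return the VNI from a VXLAN header, or raise ValueError."""
    if len(header) < 8:
        raise ValueError("VXLAN header is 8 bytes")
    word0, word1 = struct.unpack("!II", header[:8])
    flags = word0 >> 24
    if not flags & VXLAN_FLAG_I:
        raise ValueError("I flag not set, VNI is not valid")
    return word1 >> 8


header = pack_vxlan_header(42)
print(header.hex())
print(parse_vxlan_header(header))
```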
Use on Linux
VXLAN was added to the Linux kernel in 3.7, with the DOVE extensions added in 3.8. Note that Open vSwitch has its own, distinct VXLAN implementation. This section describes the Linux kernel tunneling device.
- Create the interface using ip: ip l a VXLANDEV type vxlan id VNI dstport DSTPORT local VTEPIP
- Add group MCASTIP dev MCASTDEV for a multicast-based VXLAN (see below)
- The default multicast TTL is 1; change it by adding ttl MCASTTTL if necessary
- Add the nolearning flag to disable source address-based FDB learning
- Destroy the interface with ip l d VXLANDEV
- Get information with ip -d -s l sh VXLANDEV
- Note: ip l s will not work, as s will be interpreted as set
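When scripting the creation step, the options above can be assembled programmatically. The helper below is a hypothetical sketch that only builds the argument list, using the long form `ip link add` of the `ip l a` command shown above. It does not run anything, and the device names and addresses in the usage lines are placeholders.

```python
from typing import List, Optional


def vxlan_add_command(
    dev: str,
    vni: int,
    dstport: int,
    local: str,
    group: Optional[str] = None,
    mcast_dev: Optional[str] = None,
    ttl: Optional[int] = None,
    learning: bool = True,
) -> List[str]:
    """Build the argv for creating a VXLAN device with ip(8)."""
    cmd = ["ip", "link", "add", dev, "type", "vxlan",
           "id", str(vni), "dstport", str(dstport), "local", local]
    if group is not None:
        # A multicast-based VXLAN needs the device the group is joined on.
        if mcast_dev is None:
            raise ValueError("group requires mcast_dev")
        cmd += ["group", group, "dev", mcast_dev]
    if ttl is not None:
        cmd += ["ttl", str(ttl)]
    if not learning:
        cmd.append("nolearning")
    return cmd


# Placeholder values for illustration only.
cmd = vxlan_add_command("vx0", 100, 4789, "192.0.2.1",
                        group="239.1.1.1", mcast_dev="eth0", ttl=4)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which requires sufficient privileges to create network devices.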
systemd
Minimal systemd-networkd netdev file:
[NetDev]
Name=vx0
Kind=vxlan

[VXLAN]
VNI=whatever
Independent=yes
Without Independent=yes, the device must be bound to some other device via a network file, or it will be more or less silently ignored by systemd-networkd.
Forwarding tables
On Linux, the forwarding tables can be built up in four different ways:
- Automatically using multicast: supply the multicast IP as the group argument when creating the VXLAN interface
- Manually using bridge: add the entries using bridge fdb add to L2ADDR dst L3ADDR dev VXLANDEV
- Automatically using flooding: bridge fdb append 00:00:00:00:00:00 dev VXLANDEV dst RVTEPIP
- Frames for unknown L2 destinations, including ARP broadcasts, are flooded to every peer that has an all-zeros (00:00:00:00:00:00) entry
- BGP-EVPN (see RFC 7432)
- Delete an entry: bridge fdb delete L2ADDR dev VXLANDEV
- Dump the forwarding table: bridge fdb show dev VXLANDEV
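The lookup behavior these entries produce can be modeled roughly as follows. This is a toy sketch of the idea, not the kernel's data structure; the class `VxlanFdb` and its methods are hypothetical, and the addresses are placeholders.

```python
from collections import defaultdict
from typing import Dict, List

ALL_ZEROS = "00:00:00:00:00:00"


class VxlanFdb:
    """Toy forwarding table mapping inner MACs to remote VTEP addresses."""

    def __init__(self) -> None:
        self._entries: Dict[str, List[str]] = defaultdict(list)

    def append(self, mac: str, vtep_ip: str) -> None:
        """Add a remote VTEP for a MAC, keeping any existing ones."""
        mac = mac.lower()
        if vtep_ip not in self._entries[mac]:
            self._entries[mac].append(vtep_ip)

    def delete(self, mac: str) -> None:
        """Remove all entries for a MAC."""
        self._entries.pop(mac.lower(), None)

    def lookup(self, dst_mac: str) -> List[str]:
        """Remote VTEPs that should receive a frame addressed to dst_mac."""
        known = self._entries.get(dst_mac.lower())
        if known:
            return list(known)
        # Unknown destination: flood to every all-zeros peer.
        return list(self._entries.get(ALL_ZEROS, []))


fdb = VxlanFdb()
fdb.append(ALL_ZEROS, "192.0.2.2")
fdb.append(ALL_ZEROS, "192.0.2.3")
fdb.append("52:54:00:aa:bb:cc", "192.0.2.2")
print(fdb.lookup("52:54:00:aa:bb:cc"))  # known MAC: a single VTEP
print(fdb.lookup("ff:ff:ff:ff:ff:ff"))  # broadcast: all flood peers
```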
Offloading
Several NICs provide hardware offloading functionality, typically configured with ethtool. VXLAN offloading mainly means that other NIC offloading (UDP checksums, GRO, etc.) can be performed in the presence of VXLAN.
- Mellanox ConnectX-3
- Telefonica's "Maximizing Performance in VXLAN Networks", 2017-01-25
External links
- Linux kernel vxlan documentation
- Vincent Bernat's "VXLAN & Linux", 2017-05-03