Network Engineering Asked on December 30, 2021
This is a "why and how exactly does this work" question. The given problem is already solved.
QUESTION:
I’d like to understand what "rotate" in ip load-sharing address source-destination port source-destination rotate <value> actually does.
What is the "64bit stream" the documentation (as far as I could find it, see below) speaks of?
What goes into these 64bits? Is it the 64bits that come into play when one uses concatenation
?
I’ll happily also take pointers to advanced-level documentation of the Nexus 9300’s ECMP behaviour. It seems my google-fu is not good enough.
The Back Story
Using …
ip load-sharing address source-destination port source-destination rotate 30
… on the spines, I was able to cure a problem which looked very much like what I had come to understand as a CEF ECMP polarization problem; but since Nexus switches don’t actually run CEF, I wasn’t quite sure what I was looking at.
(The original post detailed the topology here: the general setup, the spines, the leafs, and the routing and links, with a diagram.)
Problem:
Server admin reported he could get only 2x5Gbps from Server50 (left) to Server51 (right), using 8 or 16 parallel TCP sessions with iPerf.
Analysis:
Looking at the loads of the interfaces involved, we could quickly see that…
Considerations
So this was all plausible. But why did this happen?
Research
There are many documents and blog posts explaining polarization with CEF and how to avoid it, but I struggle to find the same in-depth info about NXOS and the 9300 series.
Note: the 3164Q is much more of a 9300 than a 3100 Series switch (starting with the hardware itself); it even shares large parts of the config guide, software releases and release notes with the 9300 series, instead of the 3000/3100 series (see Cisco’s own READ ME FIRST about the 3164Q).
Probably the best I was able to dig up was this:
Cisco Nexus 9000 Series NX-OS Unicast Routing Configuration Guide, Release 9.3(x), Chapter: Managing the Unicast RIB and FIB
Quote therefrom:
The rotate option causes the hash algorithm to rotate the link picking
selection so that it does not continually choose the same link across
all nodes in the network. It does so by influencing the bit pattern
for the hash algorithm. This option shifts the flow from one link to
another and load balances the already load-balanced (polarized)
traffic from the first ECMP level across multiple links.

If you specify a rotate value, the 64-bit stream is interpreted starting from that bit position in a cyclic rotation. The rotate range is from 1 to 63, and the default is 32.

Note: With multi-tier Layer 3 topology, polarization is possible. To avoid polarization, use a different rotate bit at each tier of the topology.
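To get a concrete (if simplified) picture of what that could mean, here is a minimal Python sketch; the CRC-32 stand-in hash and the packing of the 64-bit stream are assumptions for illustration, not Cisco’s actual hardware implementation:

# Illustrative only: the real Nexus hardware hash and the exact packing
# of the 64-bit stream are not public; CRC-32 stands in for the hash.
import zlib

MASK64 = (1 << 64) - 1

def rotl64(x, n):
    # cyclic left rotation of a 64-bit value by n bits
    n %= 64
    return ((x << n) | (x >> (64 - n))) & MASK64

def flow_key(src_ip, dst_ip, sport, dport):
    # hypothetical packing of the flow fields into one 64-bit stream
    return (((src_ip ^ dst_ip) & 0xFFFFFFFF) << 32) | (sport << 16) | dport

def pick_path(key, n_paths, rotate=32):
    # 'rotate' re-reads the stream from a different bit position, so two
    # nodes with different rotate values disagree on the chosen bucket
    h = zlib.crc32(rotl64(key, rotate).to_bytes(8, "big"))
    return h % n_paths

For the same flow key, pick_path(key, 2, rotate=32) and pick_path(key, 2, rotate=30) may land in different buckets; across many flows the two mappings are decorrelated, which is the behaviour the quote describes.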
So I started looking into the load-sharing behaviour of the spines.
spine11# show ip load-sharing
IPv4/IPv6 ECMP load sharing:
Universal-id (Random Seed): 3549312827
Load-share mode : address source-destination port source-destination
GRE-Outer hash is disabled
Concatenation is disabled
Rotate: 32
And I ran a series of commands, one for each set of flow parameters (which I knew from iPerf’s output):
spine11# show routing hash 10.33.50.238 10.33.51.238 ip-proto 6 45440 5001 vrf VRFNAME
Load-share parameters used for software forwarding:
load-share mode: address source-destination port source-destination
Hash for VRF "VRFNAME"
Hashing to path *Eth1/51.301
Out Interface: Eth1/51.301
For route:
10.33.51.0/24, ubest/mbest: 2/0
*via 10.33.63.11, Eth1/19.301, [110/411], 19w0d, ospf-30000, intra
*via 10.33.63.12, Eth1/51.301, [110/411], 19w0d, ospf-30000, intra
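Since typing that 16 times is tedious, the commands can also be generated, e.g. with a few lines of Python (the source ports below are placeholders; the real ones came from iPerf’s output):

# Print one 'show routing hash' command per iPerf flow.
# The source ports are examples only; use the ones iPerf reports.
src, dst, dport = "10.33.50.238", "10.33.51.238", 5001
for sport in range(45440, 45440 + 32, 2):  # 16 example source ports
    print(f"show routing hash {src} {dst} ip-proto 6 {sport} {dport} vrf VRFNAME")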
I had 16 TCP sessions running, and running this command 16 times with the exact parameters of each flow, I got 8 results for Link E and 8 for Link H (cf. diagram).
From that, one should expect spine11 to load-share across both E and H, but…
… since spine11 only gets half (8/16) of the flows, and all of these had already been hashed/balanced by leaf101/leaf102 into the "left" bucket, spine11’s identical hash function necessarily arrives at a single hashing result for every one of them. And it all goes to one single egress link.
So that’s what ECMP polarization is.
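The whole effect fits into a small, self-contained simulation (the hash function, the flow keys and the two-link fan-out are illustrative assumptions, not NX-OS internals). Because the leaf and the spine use the identical hash, every flow the spine receives has, by construction, already hashed to bucket 0, so the spine makes the same decision for all of them; give the spine a different rotate value and the split returns:

# Self-contained sketch of two-tier ECMP polarization (illustrative only).
import zlib

MASK64 = (1 << 64) - 1

def pick_path(key, n_paths, rotate):
    r = rotate % 64
    rotated = ((key << r) | (key >> (64 - r))) & MASK64
    return zlib.crc32(rotated.to_bytes(8, "big")) % n_paths

# 16 flows differing only in the source port (hypothetical 64-bit packing)
flows = [(0x0A3332EE << 32) | ((45440 + 2 * i) << 16) | 5001 for i in range(16)]

for spine_rotate in (32, 30):
    # tier 1: the leafs pick one of two uplinks with the default rotate 32
    reaching_spine11 = [f for f in flows if pick_path(f, 2, 32) == 0]
    # tier 2: spine11 picks between its two egress links E and H
    picks = [pick_path(f, 2, spine_rotate) for f in reaching_spine11]
    print(f"spine rotate {spine_rotate}: link E={picks.count(0)}, link H={picks.count(1)}")

With rotate 32 at both tiers, every flow reaching spine11 lands on one link (polarization); with rotate 30 on the spine, the picks spread across both links again.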
Solution:
All while the streams were flowing from Server 50 to Server 51, I ran this command on the spines, as hinted at by the Cisco document (see link above) for a multi-tier Layer 3 topology.
ip load-sharing address source-destination port source-destination rotate 30
(to set a value other than the default of 32)
And very quickly, egress load on spine11 started to distribute evenly across links E and H, where it had been all on one link before. Consequently, the servers now experienced 2x10Gbps of total throughput.
Also, when reverting back to default (rotate 32), egress load shifted back to a single egress link.
The reason it works is that, with rotate 30, you are configuring that switch to make a different hashing decision than its neighboring switches (which run the implicit default config), even if all the inputs (packet header, ingress port index, etc.) are identical.
You mentioned you are familiar with the older CEF technology (not that different from what we have today, honestly). You might recall configuring ip cef load-sharing algorithm universal, which causes each node to generate a unique number used to influence the output of the hashing algorithm.
By giving different values for rotate <n>, you’re doing the same thing, but supplying an explicit value. The idea is that, in a typical datacenter network, you can configure a different value for rotate at each topology level of your network and make unwanted ECMP polarization unlikely. In a metro ring you could do the same.
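For a side-by-side feel of the two knobs, here is one more illustrative sketch (the seed mixing and the hash are assumptions, not how CEF or NX-OS actually compute anything): universal mixes an implicit per-node random seed into the hash, while rotate <n> shifts the hash input by an explicit, operator-chosen amount.

# Illustrative contrast of the two decorrelation knobs (not real NX-OS code).
import random
import zlib

MASK64 = (1 << 64) - 1

def bucket(key, n_paths, rotate=32, seed=0):
    r = rotate % 64
    k = ((key << r) | (key >> (64 - r))) & MASK64          # explicit rotate <n>
    data = k.to_bytes(8, "big") + seed.to_bytes(4, "big")  # per-node seed mix-in
    return zlib.crc32(data) % n_paths

flow = 0x0A3332EE0A3333EE                   # one 64-bit flow key (hypothetical)
node_seed = random.getrandbits(32)          # 'universal' style: random per node
print(bucket(flow, 2, rotate=32, seed=node_seed))
print(bucket(flow, 2, rotate=30))           # explicit per-tier rotate value

Both approaches achieve the same goal: neighboring tiers stop agreeing on the hash result for identical inputs.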
Answered by Jeff Wheeler on December 30, 2021