
DCI using VXLAN EVPN Multi-Site w/ vPC BGW

Let’s suppose we have the need to extend, at Layer 2 and/or Layer 3, via Data Center
Interconnection (DCI), two (or more) data centers that are using classic Ethernet
networks; so no fancy deployment, just classical vPC at most (or VSS, remaining in
the Cisco world), or else Spanning Tree or Cisco FabricPath inside each fabric.

What about using VXLAN EVPN Multi-Site with vPC BGW as the Layer 2 and Layer 3
overlay technology to interconnect them… in place of vPC, OTV, VPLS or EoMPLS?

The deployment of vPC BGWs is supported starting with Cisco NX-OS 9.2(1); even
though it can be introduced for several use cases, it can be considered the main
integration point for legacy networks into an EVPN Multi-Site deployment.
The vPC BGW in fact provides redundant Layer 2 attachment through vPC and hosts
the first-hop gateway by using the IP Anycast Gateway.

Even though the vPC BGW adapts well to managing the coexistence of a VXLAN BGP EVPN
fabric with a legacy network, in this paper we’ll talk about the pure DCI
interconnection between two legacy DCs.
Quite often, the scenario just mentioned represents the first step of a
migration procedure aiming to refresh the legacy technologies used inside each site,
replacing them with modern VXLAN EVPN fabrics.

Basically, from a data plane forwarding perspective, the vPC BGW nodes leverage
VXLAN tunnels that extend connectivity between the legacy data centers; in this
way, traffic locally originating at an endpoint in the legacy network and destined
for an endpoint in a remote site is VXLAN encapsulated and delivered across the
external network infrastructure via the VXLAN tunnel.

The huge advantage of the VXLAN EVPN Multi-Site vPC BGW solution, compared with
technologies such as VRF-lite and MPLS L3VPN that provide only Layer 3 connectivity,
or with others such as VPLS and Cisco OTV that provide only Layer 2 extension, is the
integration of Layer 2 and Layer 3 extension; this, combined with workload mobility
and with multitenancy between multiple legacy data center networks thanks to VRF
support, makes this technology very attractive and easy to implement.

One more thing that makes this technology very cool is the EVPN Multi-Site storm-
control feature; it can individually tune how much Broadcast, Unknown-unicast and
Multicast (BUM) traffic is allowed to propagate among the legacy sites.
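Just as a sketch, this tuning is done with a few global commands on the BGWs; the
percentages below are purely illustrative assumptions, not values taken from this
setup, and the exact syntax may vary slightly across NX-OS releases:

! hedged sketch - thresholds are assumed, expressed as a percentage of port bandwidth
! "unicast" here means unknown unicast
evpn storm-control broadcast level 10
evpn storm-control multicast level 10
evpn storm-control unicast level 10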

The vPC BGW nodes do not perform fragmentation, so it’s very important that the
transport network interconnecting the data centers is able to carry the extra 50
bytes added by the VXLAN encapsulation.
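For instance, a common practice (the interface name and the value below are just
assumptions, not taken from this lab) is to raise the MTU of the underlay interfaces
towards the inter-site network to a jumbo value:

interface Ethernet1/7
  ! hypothetical inter-site uplink; 9216 is a typical jumbo MTU, assumed here
  mtu 9216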

… I’m quite sure that the possibility, offered by the vPC BGW nodes, of using Ingress
Replication (IR) mode (aka Head-End Replication) to handle the BUM traffic between



the legacy data centers, without any requirement for the underlay transport
infrastructure to support multicast, is the coolest thing you could hear
concerning this technology! Basically, when the BGW receives the overlay BUM
traffic, it encapsulates the packets into unicast VXLAN packets and then sends one
copy to each remote VTEP peer that has the same VXLAN configured.

Ok, now it’s time to start digging into this technology with an example that, as
usual, makes everything much clearer and very easy to understand… so, let’s start our
journey!

Let’s base all our discussion on the scheme here below, where we have two sites,
with one VLAN extended at Layer 2 between them, and two other VLANs, one present in
site 1 and one in site 2, in order to test VLAN (and VXLAN) routing using the VRF VNI
identifier with Symmetric Integrated Routing & Bridging (IRB).

Concerning IRB, just as a reference to see how the source and destination MAC
addresses change along the way, here is a picture that highlights this aspect:

OSPF between the two sites is configured to make reachable the loopbacks used for
mp-BGP EVPN peering and for the VTEPs, besides the ones used as Multi-Site VIP addresses.

mp-BGP EVPN sessions are in place between BGWs for the propagation of L2 and/or
L3 endpoint prefixes.



The customer-side subnets (VLAN 100, 200 and 300) have their GWs configured on the
BGWs of each site.
We are supposing to have only one VRF, named VLAN_200_300, where all the VLANs
are hosted (obviously, given its multitenant nature, this solution also fits scenarios
with more VRFs configured and shared between the two sites).

On Site 1’s BGW1 and BGW2:

On site 2’s BGW3 and BGW4:

The command fabric forwarding mode anycast-gateway defines the SVI to be
used as Anycast Gateway; in this way, any relocation of a VM (vMotion) from one
site to the other doesn’t involve any change, since the GW remains the same.
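Just as a minimal sketch (the anycast gateway MAC and the SVI address below are
assumptions for illustration, not values taken verbatim from the lab), the relevant
configuration on each BGW would look more or less like this:

feature interface-vlan
feature fabric forwarding
! assumed anycast gateway MAC, identical on all BGWs
fabric forwarding anycast-gateway-mac 0000.2222.3333

interface Vlan100
  no shutdown
  vrf member VLAN_200_300
  ! assumed GW address for the stretched VLAN 100
  ip address 10.10.10.1/24
  fabric forwarding mode anycast-gateway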

As a reference, these are the IP addresses that will be referenced along this
document:



Site 1:

- BGW1 (SPINE11):
  - OSPF/BGP RID (L0): 10.10.10.101/32
  - VTEP (L1): 10.1.1.101/32 primary
  - VTEP vPC VIP (L1): 10.1.1.103/32 secondary
  - Multi-Site VIP (L100): 10.1.1.111/32

- BGW2 (SPINE12):
  - OSPF/BGP RID (L0): 10.10.10.102/32
  - VTEP (L1): 10.1.1.102/32 primary
  - VTEP vPC VIP (L1): 10.1.1.103/32 secondary
  - Multi-Site VIP (L100): 10.1.1.111/32

Site 2:

- BGW3 (SPINE21):
  - OSPF/BGP RID (L0): 10.10.10.104/32
  - VTEP (L1): 10.1.1.104/32 primary
  - VTEP vPC VIP (L1): 10.1.1.106/32 secondary
  - Multi-Site VIP (L100): 10.1.1.112/32

- BGW4 (SPINE22):
  - OSPF/BGP RID (L0): 10.10.10.105/32
  - VTEP (L1): 10.1.1.105/32 primary
  - VTEP vPC VIP (L1): 10.1.1.106/32 secondary
  - Multi-Site VIP (L100): 10.1.1.112/32

All the loopbacks are announced in OSPF, which is configured over point-to-point
interfaces between BGWs belonging to different sites and also between the two BGWs of
the same site.
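A minimal sketch of this underlay piece on BGW1, taking its addresses from the list
above (the OSPF process name, the physical interface and the /30 addressing are
assumptions used only for illustration):

feature ospf
router ospf UNDERLAY
  router-id 10.10.10.101

interface Ethernet1/7
  ! assumed point-to-point link towards a remote-site BGW
  ip address 192.168.1.1/30
  ip ospf network point-to-point
  ip router ospf UNDERLAY area 0.0.0.0
  no shutdown

interface loopback0
  ip address 10.10.10.101/32
  ip router ospf UNDERLAY area 0.0.0.0

interface loopback1
  ip address 10.1.1.101/32
  ip address 10.1.1.103/32 secondary
  ip router ospf UNDERLAY area 0.0.0.0

interface loopback100
  ip address 10.1.1.111/32
  ip router ospf UNDERLAY area 0.0.0.0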

The secondary IP address configured on L1 for the VTEP vPC VIP, the same on both
BGWs of each site, is used as source and destination IP address of the VXLAN
tunnel from one site to the other; traffic is nevertheless fairly distributed among
the equal-cost paths of the inter-site network, because the source UDP port, used as
entropy for selecting the outgoing interface per traffic flow, is calculated by
hashing the inner header of the original packet. Statistically, then, we can say that
different traffic flows will be distributed over different paths.
Between the BGWs and the core switches we have a normal double-sided vPC configured,
which means that the L2 domain is extended from the access switches up to the BGW
devices, where the L2/L3 demarcation point is defined.

Concerning the endpoints used for our tests, we have:

- On site1:

- On site2:

First of all, let’s start by analyzing the features enabled on the BGW devices.

The two I just want to spend a few words about are nv overlay evpn (which enables the
mp-BGP EVPN control plane) and feature nv overlay (which enables the VXLAN feature and
therefore the configuration of the VTEP); the other ones should be well known, like
for instance feature vn-segment-vlan-based, used for mapping VLANs to VXLAN segments.
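Just as a sketch, the feature set on each BGW would look more or less like this
(feature vpc and feature interface-vlan are assumed here because of the vPC attachment
and the anycast-gateway SVIs described in this document):

! enables the mp-BGP EVPN control plane
nv overlay evpn
feature ospf
feature bgp
! assumed, for the vPC BGW pair
feature vpc
! assumed, for the anycast-gateway SVIs
feature interface-vlan
! VLAN-to-VXLAN mapping
feature vn-segment-vlan-based
! enables VXLAN and the NVE (VTEP) interface
feature nv overlay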

Feature bgp is used for the mp-BGP EVPN peering among the BGWs; in fact, looking at
BGW1 of site 1, we find both the peerings with BGW3 and BGW4 of site 2:



The command peer-type fabric-external allows the rewriting of the next-hop IP and
next-hop MAC (RMAC) for all the overlay remote prefixes advertised to the remote-site
BGWs.
The command rewrite-evpn-rt-asn instead enables the rewriting of the Route-Target
values for the prefixes advertised to the remote BGWs (based on the BGP neighbor’s
remote ASN).
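A minimal sketch of this peering on BGW1 (AS 65001 for site 1 is taken from the RT
values shown later in this document, while the remote AS 65002 and the eBGP multihop
value are assumptions):

router bgp 65001
  router-id 10.10.10.101
  address-family l2vpn evpn
  neighbor 10.10.10.104
    remote-as 65002
    update-source loopback0
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community extended
      rewrite-evpn-rt-asn
  neighbor 10.10.10.105
    remote-as 65002
    update-source loopback0
    ebgp-multihop 5
    peer-type fabric-external
    address-family l2vpn evpn
      send-community extended
      rewrite-evpn-rt-asn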

Concerning the VRF configuration, we have the VNI defined as representative of the
VRF (in the example below, VLAN 23, mapped to the VXLAN ID 2030) for the L3 VXLAN
traffic; then, in our example, we have a stretched VLAN, VLAN 100, mapped to VXLAN ID
1000, and the local VLAN 200 mapped to VXLAN ID 2000 (the configuration for site 2 is
analogous).

The routing context (the VRF named VLAN_200_300) is configured with the
route-distinguisher and route-target definitions left to the system, in auto mode.
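Putting these pieces together, a sketch of this part of the configuration (using the
VLAN/VNI values quoted above) could be:

vlan 23
  ! L3VNI of the VRF
  vn-segment 2030
vlan 100
  ! stretched L2VNI
  vn-segment 1000
vlan 200
  ! site-local L2VNI
  vn-segment 2000

vrf context VLAN_200_300
  vni 2030
  rd auto
  address-family ipv4 unicast
    route-target both auto
    route-target both auto evpn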



In the address-family ipv4 unicast under BGP, the subnets of each customer are
announced towards the other peers (redistributed as connected routes):
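A minimal sketch of this piece (assuming a simple redistribution of connected routes;
the route-map name is a hypothetical placeholder) could be:

router bgp 65001
  vrf VLAN_200_300
    address-family ipv4 unicast
      advertise l2vpn evpn
      ! RM-CONNECTED is a hypothetical route-map matching the customer subnets
      redistribute direct route-map RM-CONNECTED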

Now, let’s examine the configuration of the VTEP interface, NVE1, which inherits
the Loopback 1 IP address:
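A minimal sketch of this NVE configuration, based on the commands discussed right
below (the Multi-Site site-id value is an assumption), could be:

! assumed site-id for site 1
evpn multisite border-gateway 1

interface nve1
  no shutdown
  host-reachability protocol bgp
  source-interface loopback1
  multisite border-gateway interface loopback100
  global ingress-replication protocol bgp
  member vni 1000
    ! VLAN 100 is the stretched VLAN
    multisite ingress-replication
  member vni 2000
  member vni 2030 associate-vrf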

The command host-reachability protocol bgp specifies that BGP is used as the
mechanism for host reachability advertisement.
The command global ingress-replication protocol bgp avoids the need for any
multicast configuration that might otherwise have been required in the underlay; it
enables the VTEP globally (for all VNIs) to exchange local and remote VTEP IP
addresses per VNI in order to build the ingress-replication list. This allows sending
and receiving BUM traffic for the VNI without the use of a multicast protocol.
The command multisite border-gateway interface loopback100 defines the
interface used for the border gateway virtual IP address (VIP), representative of the
site; it would actually be used as the source and destination IP address for inter-site
unicast traffic in case the internal site architecture used VXLAN as the overlay as well.

The most important command I want to talk about is multisite ingress-replication;
whereas in a VXLAN installation the site-internal BUM replication can use multicast
(PIM ASM) or ingress replication, the site-external BUM replication, at the moment,
can use only ingress replication. To prevent any chance of loops with BUM traffic,
a Designated Forwarder (DF) is elected dynamically on a per-L2-segment basis. The DF
uses its physical IP address as a unique destination within the BGW cluster.

Finally, we have the configuration concerning the RD and RT definition for each
VXLAN segment:
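A minimal sketch of it, with RD and RT left in auto mode, could look like this:

evpn
  vni 1000 l2
    rd auto
    route-target import auto
    route-target export auto
  vni 2000 l2
    rd auto
    route-target import auto
    route-target export auto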

After this introduction to the configurations, it’s now time to investigate a little
with the show commands…

Let’s start by checking the local endpoints on site 1:
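For this kind of check, commands such as the following can be used, for instance:

show mac address-table vlan 100
show l2route evpn mac all
show l2route evpn mac-ip all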



Let’s now check instead the remote endpoints as seen from site 1:
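Here, for instance, commands such as these help:

show nve peers
show l2route evpn mac-ip all
show bgp l2vpn evpn vni-id 1000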



And as usual, the BGP best-path algorithm is the master; the NH chosen is
10.10.10.104 because of step 10 of the algorithm; in fact, dealing with eBGP sessions
and having a tie on all the previous criteria, this is the oldest eBGP path.

The two IP/MAC prefixes just shown are BGP EVPN route-type 2 updates (where the
MAC Address Length (/48) and MAC Address fields are always present, while the IP
Address Length (/32, /128) and IP Address fields are optional), and the description
of each field can be seen in the following figure:



In particular, we can note some aspects:

- Received label 1000 2030 (MPLS Label1 (L2VNI) and MPLS Label2 (L3VNI)): these are
the VNI identifiers for BD 100 (relative to the entry 10.10.10.2) and for the
VRF that contains it

- Extcommunity: RT:65001:1000 RT:65001:2030 SOO:10.1.1.106 ENCAP:8

  are respectively:
  o The RTs rewritten (RT:Local_AS:VNI), based on the command rewrite-evpn-rt-asn,
    for the L2VNI (VLAN) and the L3VNI (VRF)
  o The Site of Origin, using the remote VTEP vPC VIP
  o The VXLAN encapsulation = 8

- Router MAC: 0200.0a01.016a is the MAC address of the remote NVE1 interface

…but mp-BGP EVPN also transports route-type 5 prefixes, as quoted here below, once
imported in the local VRF with the proper RD: 10.10.10.101:4…:

Route-type 5 provides IP prefix advertisement in EVPN; it decouples the IP prefix
from the MAC address (route-type 2) and provides flexible advertisement of IPv4 and
IPv6 prefixes.



On a vPC-enabled BGW switch, as we have seen, all route-type 5 prefixes are
advertised with the secondary IP address of the VTEP vPC VIP (Loopback1, in our
case) as the BGP next-hop IP address. Prefix routes and BGW switch-generated
routes are not synced between vPC peer switches. Using the VTEP vPC VIP as the
BGP next-hop for these types of routes can cause traffic to be forwarded to the
wrong vPC BGW and black-holed (for scenarios such as orphan-connected IP subnets
or individual loopback IP addresses, a Layer 3 routing adjacency on a per-VRF basis
would be required to ensure a routing exchange between the vPC peers, so that even
if the packet reaches the incorrect vPC peer, after decapsulation the routing table
lookup within the VRF does not suffer a lookup miss; the configuration of this
per-VRF routing adjacency between the vPC member switches would however be a bit
tedious). The provision (via the advertise-pip command) to use the primary IP
address (PIP) as the next-hop when advertising route-type 5 prefixes allows users
to select the PIP as the BGP next-hop for these types of routes, so that traffic
will always be forwarded to the right vPC-enabled switch.

Providing the advertise-pip command under address-family l2vpn evpn:

…the BGW installs in the BGP table the route-type 5 prefixes with the PIP
primary addresses as next hop. Together with it, the advertise virtual-rmac command
also has to be added under the NVE interface:
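A sketch of the two pieces together (reusing the AS number quoted earlier) could be:

router bgp 65001
  address-family l2vpn evpn
    advertise-pip

interface nve1
  advertise virtual-rmac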

Vice versa, hosts’ route-type 2 prefixes (so with a /32 IP address) are still
advertised using the VTEP vPC VIP secondary address as next hop.

The snapshot below demonstrates what was just said:



The 30.30.30.0/24 is a route-type 5 prefix and is announced with BGW3 and
BGW4 as next hops (primary VTEP PIP addresses), while 30.30.30.1/32 is a host
route-type 2 prefix that is still announced using the VTEP vPC VIP secondary
address.



…and the same goes for the remote route-type 5 prefixes received in mp-BGP EVPN
but not yet imported into the local VRF:



…and more or less, that’s all! :-)

I hope, once more, that you liked this virtual journey across the protocols in the
underlay of the DCI VXLAN framework.

© copyright 2021, Mario Rosi | All rights reserved
