Difference between revisions of "OCP4-IPI-libvirt"

From p0f

Revision as of 07:25, 26 January 2024

Introduction

What I Assume

  • You know how OpenShift installation works and what the difference between IPI and UPI is.
  • You know about the OpenShift Machine API and various underlying mechanisms.
  • You understand the different types of network interfaces on Linux and different libvirt networks.
  • You are familiar and comfortable with NetworkManager and the nmcli tool.
  • You are familiar and comfortable with libvirt CLI and XML.
  • You are familiar and comfortable with qemu-img tool.
  • We are not talking about any firewall restrictions here - it is your responsibility to ensure traffic is not blocked.

Outcomes

The installation described here is a fully managed IPI deployment of OpenShift Container Platform v4.14, initially with three control plane (master) nodes and two compute (worker) nodes.

At a later point, I will add a couple of steps needed to grow the cluster by one extra worker node.

OpenShift Container Platform IPI Installation Using Libvirt

Prerequisites

Hardware requirements for the cluster:

  • 136 GiB RAM (32 GiB per control plane, 20 GiB per compute node), max overcommit ratio of 1.5 (make sure enough swap is available)
  • 52 vCPUs (12 per control plane, 8 per compute node), max overcommit ratio of 1.3 (higher might work, but will slow down the installation horribly and may ultimately fail)
  • one physical network interface that will be used for the public bridged network
  • a physical or virtual network interface that will be used for the provisioning network bridge
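
As a quick sanity check of the totals above, here is a small sketch of the arithmetic. It assumes that "overcommit ratio" means allocated capacity divided by physical capacity; the variable names are made up for illustration.

```shell
#!/bin/sh
# Cluster layout from this guide: 3 control plane + 2 compute nodes.
cp_nodes=3; cp_ram=32; cp_vcpu=12
wk_nodes=2; wk_ram=20; wk_vcpu=8

total_ram=$(( cp_nodes * cp_ram + wk_nodes * wk_ram ))
total_vcpu=$(( cp_nodes * cp_vcpu + wk_nodes * wk_vcpu ))

# Smallest physical capacity that stays within the overcommit ratios
# (1.5 for RAM, 1.3 for CPU), computed with integer ceiling division.
# Assumption: ratio = allocated / physical.
min_ram=$(( (total_ram * 10 + 14) / 15 ))
min_cpu=$(( (total_vcpu * 10 + 12) / 13 ))

echo "VM allocation:    ${total_ram} GiB RAM, ${total_vcpu} vCPUs"
echo "Physical minimum: ${min_ram} GiB RAM, ${min_cpu} CPU threads"
```

Under that assumption, the hypervisor(s) need roughly 91 GiB of physical RAM (with swap covering the rest) and 40 CPU threads in total.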

Hardware requirements for the installation client (provisioner) machine:

  • a minimum of 8 GiB RAM and 4 CPUs
  • a network connection to both the public bridged network and the provisioning network

Because the provisioner needs access to both networks, and the provisioning network in this guide is a virtual one, it is probably easiest to define the provisioner as a VM with the same network interface settings as the control plane and compute nodes.

If you want to run the workloads spread across several hypervisor hosts, there are some extra steps, but nothing big. More on that in #Installation Spanning Multiple Hypervisors below.

Software artifacts needed on the provisioner host:

  • oc, the command line client, of the corresponding version - download from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/
  • libvirt-client package, required for openshift-baremetal-install to be able to communicate with the hypervisor(s)
  • ipmitool or some other IPMI client
  • a pull-secret file containing authentication credentials for OpenShift Container Platform registries - download from https://console.redhat.com/openshift/
  • an SSH keypair that can be used for accessing OpenShift nodes

IMPORTANT: The external IP addresses of cluster nodes must be assigned by your infrastructure DHCP server.

Host Configuration

Beyond the logical requirement of having libvirt installed and started, here are the other configuration details for the hypervisor.

Network Settings

The first thing you need to make sure of is that IP forwarding is enabled.

$ sysctl net.ipv4.ip_forward
1

Linux network settings need to be configured to have two Linux bridges, a public and a private provisioning one.

  • public bridge, call it bridge0, needs to have the public network interface enslaved to it
  • private bridge, call it provbr0, can be a virtual bridge since it is only needed for the provisioning network, which is supposed to be isolated and without any infrastructure services (such as DHCP, DNS, etc.)
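
If you are starting from scratch, the two bridges could be created with nmcli roughly as sketched below. This is not necessarily how my host was set up: the connection name bridge0-port is made up, the addresses and the NIC name enp86s0 are taken from my example host, and gateway/DNS settings are left out. The script only prints the commands unless you set APPLY=1.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

make_bridges() {
    # Public bridge carrying the host address, plus the physical NIC as its port.
    run nmcli connection add type bridge con-name bridge0 ifname bridge0 \
        ipv4.method manual ipv4.addresses 172.25.35.2/24
    run nmcli connection add type ethernet con-name bridge0-port ifname enp86s0 \
        master bridge0 slave-type bridge
    # Isolated provisioning bridge: no ports needed on a single hypervisor.
    run nmcli connection add type bridge con-name provbr0 ifname provbr0 \
        ipv4.method manual ipv4.addresses 10.1.1.2/24
}

make_bridges
```

Be careful when applying this over SSH: moving the host address from the physical interface to the bridge will most likely interrupt your connection, and any existing connection profile for the NIC needs to be deactivated first.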

It would be wonderful if the bridges could be Open vSwitch ones, but unfortunately the Terraform bundled with openshift-baremetal-install currently does not include an Open vSwitch provider, so there's goodbye to that.

As an example, here is my host configuration.

Public bridge:

$ ip addr show bridge0
6: bridge0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 48:21:0b:57:0e:06 brd ff:ff:ff:ff:ff:ff
    inet 172.25.35.2/24 brd 172.25.35.255 scope global noprefixroute bridge0
       valid_lft forever preferred_lft forever
    inet6 fe80::4a21:bff:fe57:e06/64 scope link
       valid_lft forever preferred_lft forever

$ ip addr show enp86s0
2: enp86s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master bridge0 state UP group default qlen 1000
    link/ether 48:21:0b:57:0e:06 brd ff:ff:ff:ff:ff:ff

$ bridge link | grep "master bridge0"
2: enp86s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master bridge0 state forwarding priority 32 cost 100

Provisioning bridge:

$ ip addr show provbr0
5: provbr0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether ce:70:26:9c:88:a4 brd ff:ff:ff:ff:ff:ff
    inet 10.1.1.2/24 brd 10.1.1.255 scope global noprefixroute provbr0
       valid_lft forever preferred_lft forever
    inet6 fe80::cc70:26ff:fe9c:88a4/64 scope link
       valid_lft forever preferred_lft forever

$ bridge link | grep "master provbr0"

Installation Spanning Multiple Hypervisors

If you want to have your cluster spanning multiple hypervisors, make sure there is also a VXLAN connection between all the provisioning bridges.

You can do that by creating a vxlan type interface, which is a slave connection of type bridge, and the master is set to provbr0. Choose any unique VXLAN ID, and make sure it is the same on all interconnected hosts.

As an example, here is one VXLAN interface connecting hypervisor A to B.

$ nmcli con show provbr0-vxlan10 | grep -E '^(connection|vxlan)' | grep -vE '(default|uuid|--|-1|unknown)'
connection.id:                          provbr0-vxlan10
connection.type:                        vxlan
connection.interface-name:              provbr0-vxlan10
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
connection.timestamp:                   1703164860
connection.read-only:                   no
connection.master:                      provbr0
connection.slave-type:                  bridge
connection.gateway-ping-timeout:        0
vxlan.id:                               10
vxlan.local:                            172.25.35.2
vxlan.remote:                           172.25.35.3
vxlan.source-port-min:                  0
vxlan.source-port-max:                  0
vxlan.destination-port:                 4790
vxlan.tos:                              0
vxlan.ttl:                              0
vxlan.ageing:                           300
vxlan.limit:                            0
vxlan.learning:                         yes
vxlan.proxy:                            no
vxlan.rsc:                              no
vxlan.l2-miss:                          no
vxlan.l3-miss:                          no

And this is the corresponding VXLAN interface definition connecting host B to A.

$ nmcli con show  provbr0-vxlan10 | grep -E '^(connection|vxlan)' | grep -vE '(default|uuid|--|-1|unknown)'
connection.id:                          provbr0-vxlan10
connection.type:                        vxlan
connection.interface-name:              provbr0-vxlan10
connection.autoconnect:                 yes
connection.autoconnect-priority:        0
connection.timestamp:                   1697549049
connection.read-only:                   no
connection.master:                      provbr0
connection.slave-type:                  bridge
connection.gateway-ping-timeout:        0
vxlan.id:                               10
vxlan.local:                            172.25.35.3
vxlan.remote:                           172.25.35.2
vxlan.source-port-min:                  0
vxlan.source-port-max:                  0
vxlan.destination-port:                 4790
vxlan.tos:                              0
vxlan.ttl:                              0
vxlan.ageing:                           300
vxlan.limit:                            0
vxlan.learning:                         yes
vxlan.proxy:                            no
vxlan.rsc:                              no
vxlan.l2-miss:                          no
vxlan.l3-miss:                          no

In this case, bridge link will of course show the provbr0-vxlan10 interface as a slave of provbr0, rather than the empty output shown above.

$ bridge link | grep "master provbr0"
7: provbr0-vxlan10: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 master provbr0 state forwarding priority 32 cost 100
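
A connection like the ones shown above could be created with a single nmcli command per host. The following is a sketch only, using the names, VXLAN ID, port, and addresses from my example; the add_vxlan helper is made up for illustration, and nothing is executed unless APPLY=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

# Usage: add_vxlan LOCAL_IP REMOTE_IP
# Creates a VXLAN interface and attaches it to provbr0 as a bridge port.
add_vxlan() {
    run nmcli connection add type vxlan con-name provbr0-vxlan10 \
        ifname provbr0-vxlan10 master provbr0 slave-type bridge \
        vxlan.id 10 vxlan.local "$1" vxlan.remote "$2" \
        vxlan.destination-port 4790
}

# On hypervisor A:
add_vxlan 172.25.35.2 172.25.35.3
# On hypervisor B, swap the two addresses:
# add_vxlan 172.25.35.3 172.25.35.2
```

The VXLAN ID and destination port must match on both ends, and the UDP port must not be blocked between the hypervisors.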

libvirt Settings

Your libvirt will of course need to know about those network bridges in order to be able to attach VMs to them.

For that, you will need two network definitions, looking a bit like the following XML. Set both to autostart to save yourself some headache.

$ sudo virsh net-dumpxml external
<network>
  <name>external</name>
  <uuid>whatever</uuid>
  <forward mode='bridge'/>
  <bridge name='bridge0'/>
</network>

$ sudo virsh net-dumpxml provisioning
<network>
  <name>provisioning</name>
  <uuid>whatever</uuid>
  <forward mode='bridge'/>
  <bridge name='provbr0'/>
</network>
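
If the networks do not exist yet, something along these lines should define them. This is a sketch under a few assumptions: the helper names and the temporary directory are made up, the uuid element is omitted so that libvirt generates one, and the virsh commands are only printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

workdir=$(mktemp -d)

# Write a bridge-mode network definition. Usage: write_net_xml NAME BRIDGE
write_net_xml() {
    cat > "$workdir/$1.xml" <<EOF
<network>
  <name>$1</name>
  <forward mode='bridge'/>
  <bridge name='$2'/>
</network>
EOF
}

# Define, start, and autostart one network. Usage: define_net NAME BRIDGE
define_net() {
    write_net_xml "$1" "$2"
    run sudo virsh net-define "$workdir/$1.xml"
    run sudo virsh net-start "$1"
    run sudo virsh net-autostart "$1"
}

define_net external bridge0
define_net provisioning provbr0
```

Afterwards, sudo virsh net-list --all should show both networks as active with autostart enabled.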

Additionally, you want to ensure that the storage pool is big enough, but that is not directly related to the subject at hand.

$ sudo virsh pool-info default
Name:           default
UUID:           whatever
State:          running
Persistent:     yes
Autostart:      yes
Capacity:       250.92 GiB
Allocation:     0 GiB
Available:      250.92 GiB

VirtualBMC

The most important part of IPI facilitation is to be able to simulate a baseboard management controller for your VMs. libvirt obviously doesn't do this, but luckily there's a small bit of Python code that does, and it's called virtualbmc.

In most Python environments you can install it using pip3; just make sure pip3 is up to date first.

$ pip3 install --upgrade pip
...

$ pip3 install virtualbmc
...

This gives you /usr/local/bin/vbmcd which you can control using the following systemd unit:

[Unit]
Description=vbmcd
[Service]
Type=forking
ExecStart=/usr/local/bin/vbmcd
[Install]
WantedBy=multi-user.target

Put the above content into /etc/systemd/system/vbmcd.service, reload systemd, and enable/start the service.

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now vbmcd

You now have the ability to associate a UDP port with a virtual machine defined on the hypervisor host, and have it simulate an IPMI BMC for that VM!

Virtual Machine Configuration

VM Definitions

The virtual machines need to be configured with sufficient amount of compute resources, as per #Prerequisites above.

This section ties into the #Network Settings section above. You need two bridges on your hypervisor(s), bridge0 and provbr0.

An example control plane node definition in libvirt XML would look like this:

<domain type='kvm'>
  <name>controlplane1</name>
  <memory unit='GiB'>32</memory>
  <currentMemory unit='GiB'>32</currentMemory>
  <vcpu placement='static'>12</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
    <boot dev='hd'/>
    <boot dev='network'/>
    <bootmenu enable='yes'/>
  </os>
  <features>
    <acpi/>
    <apic/>
  </features>
  <cpu mode='host-model' check='partial'>
    <model fallback='allow'/>
  </cpu>
  <clock offset='utc'>
    <timer name='rtc' tickpolicy='catchup'/>
    <timer name='pit' tickpolicy='delay'/>
    <timer name='hpet' present='no'/>
  </clock>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <pm>
    <suspend-to-mem enabled='no'/>
    <suspend-to-disk enabled='no'/>
  </pm>
  <devices>
    <emulator>/usr/libexec/qemu-kvm</emulator>
    <disk type='file' device='disk'>
      <driver name='qemu' type='qcow2'/>
      <source file='/var/lib/libvirt/images/controlplane1-vda.qcow2'/>
      <target dev='vda' bus='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x04' function='0x0'/>
    </disk>
    <controller type='usb' index='0' model='qemu-xhci' ports='15'/>
    <controller type='pci' index='0' model='pcie-root'/>
    <interface type='bridge'>
      <mac address='52:54:00:00:fb:11'/>
      <source bridge='provbr0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x03' function='0x0'/>
    </interface>
    <interface type='bridge'>
      <mac address='52:54:00:00:fa:11'/>
      <source bridge='bridge0'/>
      <model type='virtio'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </interface>
    <console type='pty'/>
    <channel type='unix'>
      <source mode='bind'/>
      <target type='virtio' name='org.qemu.guest_agent.0'/>
    </channel>
    <input type='tablet' bus='usb'/>
    <input type='mouse' bus='ps2'/>
    <input type='keyboard' bus='ps2'/>
    <graphics type='vnc' autoport='yes' listen='0.0.0.0'/>
    <video>
      <model type='virtio'/>
    </video>
    <memballoon model='virtio'/>
    <rng model='virtio'>
      <backend model='random'>/dev/urandom</backend>
    </rng>
  </devices>
</domain>

A couple of things to note:

  • the first network interface is attached to provbr0, and its PCI address is lower (0x03), causing it to be the PXE default device
  • the second network interface is attached to bridge0, and its PCI address is higher (0x09), making it the second interface (for external connections)
  • boot order is set to hard disk first, network second, which means the VM will only PXE boot if the disk image is unbootable
  • the disk image needs to be at least 64 GiB in size, but you can make it larger and/or add more disk images if you intend to use the local storage operator

When configuring other nodes, simply remember to change the node name, disk image name, and MAC addresses to be unique. Adjust hardware resources accordingly for compute nodes.
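
The empty disk images for all five nodes can be created in one go with qemu-img. A minimal sketch follows, assuming the image naming from the XML above; all node names other than controlplane1 are made up. Commands are only printed unless APPLY=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

images=/var/lib/libvirt/images

make_disks() {
    # Only controlplane1 is named in this guide; the other names are placeholders.
    for node in controlplane1 controlplane2 controlplane3 worker1 worker2; do
        # qemu-img size suffixes are powers of 1024, so 64G means 64 GiB.
        run sudo qemu-img create -f qcow2 "$images/$node-vda.qcow2" 64G
    done
}

make_disks
```

The images are sparse qcow2 files, so they take up almost no space until the installer writes to them.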

IPMI BMC

For OCP IPI to be able to perform the installation properly, each virtual machine must be registered with vbmcd and assigned a port.

$ vbmc add --port=6211 controlplane1
$ vbmc list
+-------------------+---------+---------+------+
| Domain name       | Status  | Address | Port |
+-------------------+---------+---------+------+
| controlplane1     | down    | ::      | 6211 |
+-------------------+---------+---------+------+
$ vbmc start controlplane1
$ vbmc list
+-------------------+---------+---------+------+
| Domain name       | Status  | Address | Port |
+-------------------+---------+---------+------+
| controlplane1     | running | ::      | 6211 |
+-------------------+---------+---------+------+
$ sudo ss -aunp | grep 6211
UNCONN 0      0                        *:6211             *:*    users:(("vbmcd",pid=766290,fd=21))

There is no need to run the vbmc client as root as it is the daemon that is running as root and can see all the VMs accessible through the qemu:///system URL.
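
Registering all five nodes can be scripted with a simple loop. The sketch below assumes a made-up naming and port numbering scheme (only controlplane1 on port 6211 comes from the example above) and only prints the vbmc commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Print each command; execute it only when APPLY=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${APPLY:-0}" = "1" ]; then "$@"; fi
}

register_bmcs() {
    # "name:port" pairs; everything except controlplane1:6211 is a placeholder.
    for entry in controlplane1:6211 controlplane2:6212 controlplane3:6213 \
                 worker1:6221 worker2:6222; do
        name=${entry%%:*}
        port=${entry##*:}
        run vbmc add --port="$port" "$name"
        run vbmc start "$name"
    done
}

register_bmcs
```

Each VM must already be defined in libvirt under the same name before it can be added, and every port may only be used once per listening address.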

Of course there are options. For any VM you add, you can specify a custom set of IPMI admin credentials (options --username and --password, they default to admin and password), a custom libvirt URL and credentials if necessary (options --libvirt-uri, --libvirt-sasl-username, and --libvirt-sasl-password), and a custom IP address to listen on (defaults to all addresses, use option --address to restrict it).

$ vbmc show controlplane1
+-----------------------+-------------------+
| Property              | Value             |
+-----------------------+-------------------+
| active                | True              |
| address               | ::                |
| domain_name           | controlplane1     |
| libvirt_sasl_password | ***               |
| libvirt_sasl_username | None              |
| libvirt_uri           | qemu:///system    |
| password              | ***               |
| port                  | 6211              |
| status                | running           |
| username              | admin             |
+-----------------------+-------------------+

Once started (which just opens the port) you can test the BMC connection using ipmitool or similar.

$ ipmitool -I lanplus -H localhost -p 6211 -U admin -P password chassis status
System Power         : off
Power Overload       : false
Power Interlock      : inactive
Main Power Fault     : false
Power Control Fault  : false
Power Restore Policy : always-off
Last Power Event     :
Chassis Intrusion    : inactive
Front-Panel Lockout  : inactive
Drive Fault          : false
Cooling/Fan Fault    : false

That's it! We're ready to install OCP!

Installer Configuration

This is where it all comes together. You need to execute the steps in this section (and the next) on the provisioner host, that is, a system that has access to both the external and provisioning networks of the OpenShift cluster-to-be. The access need not be direct, it can be routed, but if, as in our example, you configured the provisioning network to be an isolated virtual bridge, you are best off creating an additional VM that is directly connected to both bridges.

Gathering the Bits Together

The first step is to make sure the following artifacts are available:

  • oc
  • pull-secret
  • an SSH public key for compute node access

The global cluster network settings that we will need to configure are:

  • the parent DNS domain of the cluster (such as example.com)
  • the name of the cluster (concatenated with DNS domain, for example mycluster will become mycluster.example.com)
  • the provisioning network CIDR (in our case, it can be any IP address block not overlapping with the external network as the provisioning network is isolated)
  • from the external network address space, a designated:
    • API server VIP (api.mycluster.example.com should point to it)
    • ingress load balancer VIP (any host within apps.mycluster.example.com should resolve to it, usually via a wildcard record)

Additionally, for each node you will need its:

  • provisioning interface MAC address
  • IPMI BMC address and port
  • IPMI BMC credentials
  • the name of the provisioning interface as seen from within the VM (see #Virtual Machine Configuration above: since the PCI address of the interface is bus 0x00, slot 0x03, it will be named enp0s3)

As already said, DHCP address assignment to external interfaces is not managed by the installer. It must be handled by your infrastructure.
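
To keep the per-node values in one place, you could collect them in a small inventory and generate the hosts list from it. The sketch below prints a hosts stanza in the shape that, to my understanding, the bare-metal platform section of install-config.yaml expects (name, role, bmc, bootMACAddress). Only the controlplane1 values come from this guide; the other names, ports, and MAC addresses are placeholders, and the BMC address assumes vbmcd runs on the hypervisor at 172.25.35.2 with default credentials.

```shell
#!/bin/sh
# Inventory columns: name role bmc-port provisioning-MAC
# Only the controlplane1 row is taken from this guide; the rest is made up.
inventory='controlplane1 master 6211 52:54:00:00:fb:11
controlplane2 master 6212 52:54:00:00:fb:12
controlplane3 master 6213 52:54:00:00:fb:13
worker1 worker 6221 52:54:00:00:fb:21
worker2 worker 6222 52:54:00:00:fb:22'

# Address where vbmcd listens (assumption: the hypervisor's external address).
bmc_host=172.25.35.2

# Print one host entry per inventory row, indented for platform.baremetal.
hosts_stanza() {
    echo "    hosts:"
    echo "$inventory" | while read -r name role port mac; do
        cat <<EOF
    - name: $name
      role: $role
      bmc:
        address: ipmi://$bmc_host:$port
        username: admin
        password: password
      bootMACAddress: $mac
EOF
    done
}

hosts_stanza
```

Treat the output as a starting point to paste into your own install-config.yaml and double-check it against the installer documentation for your version.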

Installation

Post-Install Smoke Tests

Conclusion