Pod affinity is a feature that allows you to specify rules for how pods should be scheduled onto topologies (e.g. a node or an availability zone) within a Kubernetes cluster based on the presence or characteristics of other pods. Similarly, the PodAntiAffinity property allows you to schedule pods into topologies based on the absence of other pods.
PodAffinity & PodAntiAffinity rules require three parameters: a rule type, a labelSelector that targets the pods the rule refers to, and a topologyKey that defines the topology. There are two rule types:
requiredDuringSchedulingIgnoredDuringExecution: Pods must be scheduled in a way that satisfies the defined rule. If no topology that meets the rule's requirements is available, the pod will not be scheduled at all; it will remain in a Pending state until a suitable node becomes available.
preferredDuringSchedulingIgnoredDuringExecution: This rule type is more flexible. It expresses a preference for scheduling pods based on the defined rule but doesn't enforce a strict requirement. If topologies that meet the preference criteria are available, Kubernetes will try to schedule the pod there. However, if no such topologies are available, the pod can still be scheduled on other nodes that do not violate the preference. When using this rule type, you also have to pass a weight parameter (a number between 1 and 100) that defines the priority of each rule when you specify multiple rules.
The topologyKey is a node label whose value defines the topology. Common keys include:
kubernetes.io/hostname - pod scheduling is based on node hostnames.
kubernetes.io/arch - pod scheduling is based on node CPU architectures.
topology.kubernetes.io/zone - pod scheduling is based on availability zones.
topology.kubernetes.io/region - pod scheduling is based on node regions.
You might wonder why there are no requiredDuringExecution-type parameters. The reason is that, as of now, Kubernetes does not offer native support for descheduling pods, so parameters of this type do not exist. However, you can still enforce the eviction of pods that violate affinity rules or spread constraints using external tools like the Kubernetes descheduler.
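As a rough sketch, a descheduler policy that enables the relevant eviction strategies could look like the following (this uses the v1alpha1 policy format; strategy names and the policy API version vary between descheduler releases, so check the version you deploy):
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  # Evict running pods that violate inter-pod anti-affinity rules
  "RemovePodsViolatingInterPodAntiAffinity":
    enabled: true
  # Evict running pods that violate topology spread constraints
  "RemovePodsViolatingTopologySpreadConstraint":
    enabled: true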
Let’s dive into a practical example. The following defines a Pod that has strictly enforced affinity to itself, the topology being a single node. We'll now examine how multiple replicas of this Pod are scheduled within a cluster.
apiVersion: v1
kind: Pod
metadata:
  name: self-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - self-affinity-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0
Initially, there are no replicas of our pod on the cluster. Since the Pod has an affinity requirement to itself, its placement will be random (within the topologies that match its tolerations and node selectors, if any). If it lacked self-affinity and instead had a required affinity to another Pod that was not yet running, it would have remained in a Pending state.
Subsequently, additional replicas are scheduled on the same node, adhering to the specified affinity rule.
As soon as the designated topology reaches its capacity (e.g., when there is insufficient memory available on the node to accommodate a new Pod), the next replica will remain in a Pending state. This is due to the strict enforcement of the affinity rule using requiredDuringSchedulingIgnoredDuringExecution.
Let's explore the scenario where the affinity rule is not strictly enforced. Initially, the scheduling process will proceed as it did previously. However, when the topology reaches its capacity, the next Pod in line for scheduling will be assigned to a random topology, and the affinity rule will once again be applied until the topology reaches its maximum capacity.
apiVersion: v1
kind: Pod
metadata:
  name: self-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - self-affinity-app
          topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0
You can also use both requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution for the same pod, to strictly enforce some affinity rules and apply others when possible.
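For example, a minimal sketch (reusing the self-affinity-app label from the examples above, with a hypothetical pod name and an arbitrary weight of 50) could require replicas to share an availability zone while merely preferring that they also share a node:
apiVersion: v1
kind: Pod
metadata:
  name: mixed-affinity-pod
  labels:
    app: self-affinity-app
spec:
  affinity:
    podAffinity:
      # Hard requirement: schedule in a zone that already hosts a matching pod
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - self-affinity-app
        topologyKey: topology.kubernetes.io/zone
      # Soft preference: additionally favor nodes that already host a matching pod
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector:
            matchExpressions:
            - key: app
              operator: In
              values:
              - self-affinity-app
          topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0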
PodAntiAffinity is used to prevent the simultaneous scheduling of pods on the same topology. Let’s dive into an example that is analogous to the previous one, where the pods are scheduled with an anti-affinity rule to themselves.
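A minimal sketch of such a self-anti-affinity Pod, keeping the naming pattern of the previous examples (the pod name and label are hypothetical) and treating each node as a topology, could look like this:
apiVersion: v1
kind: Pod
metadata:
  name: self-anti-affinity-pod
  labels:
    app: self-anti-affinity-app
spec:
  affinity:
    podAntiAffinity:
      # No two pods carrying this label may share the same node
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: app
            operator: In
            values:
            - self-anti-affinity-app
        topologyKey: kubernetes.io/hostname
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0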
Initially, the first pod is scheduled onto a topology at random. When the rule is strictly enforced using requiredDuringSchedulingIgnoredDuringExecution, pods will continue to be allocated to different topologies until each topology houses one pod. After this point, pods will remain in a Pending status, awaiting available topologies to accommodate them.
If the affinity rule is not strictly enforced (i.e. with a preferredDuringScheduling... statement), pods will be scheduled as shown above. But once every topology already hosts one pod, the remaining pods will continue to be scheduled following the default scheduler rules and will fill all topologies.
As their name implies, TopologySpreadConstraints are scheduling constraints that allow you to spread pods evenly across topologies. More often than not, setting pods' anti-affinity to themselves is misused where a spread constraint would be the better fit. Spread constraints require the following parameters:
labelSelector: a selector that targets the pods to which the constraint applies.
topologyKey: the node label key that nodes need to share to belong to the same topology.
maxSkew: the maximum allowed skew, i.e. the maximum difference in the number of matching pods between any two topologies.
whenUnsatisfiable: similar to the requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution parameters for pod affinities, it can be set to DoNotSchedule or ScheduleAnyway.
When respecting spread constraints, pods are free to be scheduled in any way across topologies as long as the following rules are respected:
The difference in the number of matching pods between topologies never exceeds maxSkew.
When whenUnsatisfiable is set to DoNotSchedule, pods will remain in a Pending state if the rule cannot be respected.
For instance, the following spread constraint will enforce pods to be spread on different nodes, with a maximum skew of 2:
apiVersion: v1
kind: Pod
metadata:
  name: spread-pod
  labels:
    app: spread-app
spec:
  topologySpreadConstraints:
  - maxSkew: 2
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchExpressions:
      - key: app
        operator: In
        values:
        - spread-app
  containers:
  - name: pause
    image: registry.k8s.io/pause:2.0
You may note that you can define cluster-level default spread constraints on a self-managed cluster (through the scheduler configuration), and that the following default constraints are applied by the Kubernetes scheduler as of Kubernetes 1.24.
defaultConstraints:
- maxSkew: 3
  topologyKey: "kubernetes.io/hostname"
  whenUnsatisfiable: ScheduleAnyway
- maxSkew: 5
  topologyKey: "topology.kubernetes.io/zone"
  whenUnsatisfiable: ScheduleAnyway
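For reference, here is a sketch of where such defaultConstraints live in the scheduler configuration, assuming the kubescheduler.config.k8s.io/v1 API and a defaultingType of List (field names follow the PodTopologySpread plugin arguments; adapt them to your Kubernetes version):
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: PodTopologySpread
    args:
      # Cluster-level defaults applied to pods that define no constraints of their own
      defaultConstraints:
      - maxSkew: 3
        topologyKey: "kubernetes.io/hostname"
        whenUnsatisfiable: ScheduleAnyway
      - maxSkew: 5
        topologyKey: "topology.kubernetes.io/zone"
        whenUnsatisfiable: ScheduleAnyway
      defaultingType: List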
TopologySpreadConstraints, often set with a maxSkew of 1, serve as an alternative to strict self-anti-affinity rules to prevent pods from landing on the same node. This works well in larger clusters, but if there are fewer nodes than pods, spread constraints will not prevent pods from scheduling on the same node.
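A sketch of such a constraint, to be placed under a pod's spec and assuming the spread-app label from the previous example, simply tightens maxSkew to 1:
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: DoNotSchedule
  # Unlike required self-anti-affinity, extra replicas can still share a node
  # once every node hosts one, as long as the skew stays at or below 1
  labelSelector:
    matchLabels:
      app: spread-app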
As their name implies, TopologySpreadConstraints should be used to spread pods evenly across topologies; they should not be used to keep apart pods that genuinely cannot run next to each other, which is what anti-affinity is for.
Here are a few practical use cases of Affinity rules and spread constraints to better understand when to use them:
PodAffinity: co-locating pods that communicate heavily (e.g. an application and its cache) to reduce latency.
PodAntiAffinity: keeping apart workloads that compete for the same resources or that must not share a failure domain.
TopologySpreadConstraints: spreading the replicas of a Deployment across nodes or availability zones for high availability.
If you want to experiment with advanced pod scheduling by yourself, a great way to do so is to install Minikube and start a local Kubernetes cluster in which the number of pods per node is limited. Other solutions are also available to run local Kubernetes clusters.
# This will start a Kubernetes cluster with 4 nodes that can host a maximum of 8 pods each
minikube start --nodes 4 --extra-config=kubelet.max-pods=8
Use the following boilerplate Kubernetes deployment manifest and update it to experiment with affinities and spread constraints:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: self-affinity-required
spec:
  replicas: 10
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: pause
        image: registry.k8s.io/pause:2.0
      affinity:
        podAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - my-app
            topologyKey: kubernetes.io/hostname
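Once you apply the manifest with kubectl apply -f, you can watch where each replica lands (and which ones stay Pending when a rule cannot be satisfied) with kubectl get pods -o wide --watch.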
Finally, to master pod scheduling within your Kubernetes cluster, it's essential to have a solid understanding of concepts related to Taints and Tolerations.