Kubernetes 1.26: Pod Scheduling Readiness

Author: Wei Huang (Apple), Abdullah Gharaibeh (Google)

Kubernetes 1.26 introduced a new Pod feature: scheduling gates. In Kubernetes, scheduling gates are keys that tell the scheduler when a Pod is ready to be considered for scheduling.

What problem does it solve?

When a Pod is created, the scheduler will continuously attempt to find a node that fits it. This infinite loop continues until the scheduler either finds a node for the Pod, or the Pod gets deleted.

Pods that remain unschedulable for long periods of time (e.g., ones that are blocked on some external event) waste scheduling cycles. A scheduling cycle may take ≅20ms or more depending on the complexity of the Pod's scheduling constraints. Therefore, at scale, those wasted cycles significantly impact the scheduler's performance. See the arrows in the "scheduler" box below.

graph LR; pod((New Pod))-->queue subgraph Scheduler queue(scheduler queue) sched_cycle[/scheduling cycle/] schedulable{schedulable?} queue==>|Pop out|sched_cycle sched_cycle==>schedulable schedulable==>|No|queue subgraph note [Cycles wasted on keep rescheduling 'unready' Pods] end end classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff; classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px; class queue,sched_cycle,schedulable k8s; class pod plain; class note note; class Scheduler Scheduler;

Scheduling gates helps address this problem. It allows declaring that newly created Pods are not ready for scheduling. When scheduling gates are present on a Pod, the scheduler ignores the Pod and therefore saves unnecessary scheduling attempts. Those Pods will also be ignored by Cluster Autoscaler if you have it installed in the cluster.

Clearing the gates is the responsibility of external controllers with knowledge of when the Pod should be considered for scheduling (e.g., a quota manager).

graph LR; pod((New Pod))-->queue subgraph Scheduler queue(scheduler queue) sched_cycle[/scheduling cycle/] schedulable{schedulable?} popout{Pop out?} queue==>|PreEnqueue check|popout popout-->|Yes|sched_cycle popout==>|No|queue sched_cycle-->schedulable schedulable-->|No|queue subgraph note [A knob to gate Pod's scheduling] end end classDef plain fill:#ddd,stroke:#fff,stroke-width:1px,color:#000; classDef k8s fill:#326ce5,stroke:#fff,stroke-width:1px,color:#fff; classDef Scheduler fill:#fff,stroke:#bbb,stroke-width:2px,color:#326ce5; classDef note fill:#edf2ae,stroke:#fff,stroke-width:1px; classDef popout fill:#f96,stroke:#fff,stroke-width:1px; class queue,sched_cycle,schedulable k8s; class pod plain; class note note; class popout popout; class Scheduler Scheduler;

How does it work?

Scheduling gates in general works very similar to Finalizers. Pods with a non-empty spec.schedulingGates field will show as status SchedulingGated and be blocked from scheduling. Note that more than one gate can be added, but they all should be added upon Pod creation (e.g., you can add them as part of the spec or via a mutating webhook).

NAME       READY   STATUS            RESTARTS   AGE
test-pod   0/1     SchedulingGated   0          10s

To clear the gates, you update the Pod by removing all of the items from the Pod's schedulingGates field. The gates do not need to be removed all at once, but only when all the gates are removed the scheduler will start to consider the Pod for scheduling.

Under the hood, scheduling gates are implemented as a PreEnqueue scheduler plugin, a new scheduler framework extension point that is invoked at the beginning of each scheduling cycle.

Use Cases

An important use case this feature enables is dynamic quota management. Kubernetes supports ResourceQuota, however the API Server enforces quota at the time you attempt Pod creation. For example, if a new Pod exceeds the CPU quota, it gets rejected. The API Server doesn't queue the Pod; therefore, whoever created the Pod needs to continuously attempt to recreate it again. This either means a delay between resources becoming available and the Pod actually running, or it means load on the API server and Scheduler due to constant attempts.

Scheduling gates allows an external quota manager to address the above limitation of ResourceQuota. Specifically, the manager could add a example.com/quota-check scheduling gate to all Pods created in the cluster (using a mutating webhook). The manager would then remove the gate when there is quota to start the Pod.

Whats next?

To use this feature, the PodSchedulingReadiness feature gate must be enabled in the API Server and scheduler. You're more than welcome to test it out and tell us (SIG Scheduling) what you think!

Additional resources