[YUNIKORN-3119] Add Metrics for Monitoring Applications and Nodes Attempted in Each Scheduling Cycle by mitdesai · Pull Request #1041 · apache/yunikorn-core

mitdesai · 2025-10-30T15:39:52Z

What is this PR for?

Adds metrics for monitoring the nodes and applications attempted during each scheduling cycle.

What type of PR is it?

Todos

- Task

What is the Jira issue?

Jira https://issues.apache.org/jira/browse/YUNIKORN-3119

How should this be tested?

Screenshots (if appropriate)

Questions:

- The license files need an update.
- There are breaking changes for older versions.
- It needs documentation.

codecov · 2025-10-31T01:38:46Z

Codecov Report

❌ Patch coverage is 78.57143% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.58%. Comparing base (65d35fc) to head (0c9a3f8).
⚠️ Report is 1 commits behind head on master.

Files with missing lines	Patch %	Lines
pkg/scheduler/objects/application.go	80.35%	11 Missing ⚠️
pkg/scheduler/objects/queue.go	11.11%	8 Missing ⚠️
pkg/metrics/scheduler.go	92.68%	3 Missing ⚠️
pkg/scheduler/context.go	0.00%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
+ Coverage   81.56%   81.58%   +0.01%     
==========================================
  Files         103      103              
  Lines       13884    13964      +80     
==========================================
+ Hits        11324    11392      +68     
- Misses       2281     2294      +13     
+ Partials      279      278       -1


pbacsko · 2025-11-07T16:52:36Z

@mitdesai please rebase this PR. The unit test failure is unrelated.

mitdesai · 2025-11-07T19:34:06Z

Thanks @pbacsko I have rebased with master

manirajv06

Do we need to include other types of allocations in the scheduling cycle counts, such as placeholder (PH) and reservation allocations?

pbacsko

I definitely think this solution needs some rework. The current approach is a bit hard to understand.

1. Code should not communicate through metrics. For example, in Queue.tryAllocate() we call GetTryNode() twice to compute a difference. Why not just return this value from app.tryAllocate()? That would be much simpler. Then, instead of calling Inc() every time from the app, you can call Add() once with the number of nodes that were tried.
2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me, it's just odd: this information is not specific to a queue. As with #1, this data (the number of apps tried) should propagate back to a higher-level caller that does the necessary processing. You can easily add it to AllocationResult and record the metrics in PartitionContext.tryAllocate() after pc.root.TryAllocate(...) returns.
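The two suggestions above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the yunikorn-core code: the `counter` type stands in for a Prometheus counter, and `app`, `fitAt`, `queueTryAllocate`, and `partitionTryAllocate` are hypothetical names invented for the example. Only `AllocationResult`, `NodesTried`, and `ApplicationsTried` are names taken from the discussion, and their real definitions may differ.

```go
package main

import "fmt"

// AllocationResult is a simplified stand-in for the scheduler's result type.
// It carries the per-cycle counts back to the caller.
type AllocationResult struct {
	Allocated         bool
	NodesTried        int
	ApplicationsTried int
}

// counter is a hypothetical stand-in for a Prometheus counter.
type counter struct{ total int }

// Add records n attempts in one call instead of n separate Inc() calls.
func (c *counter) Add(n int) { c.total += n }

// app is a hypothetical application. fitAt is the 1-based index of the
// first node that fits; 0 means no node fits.
type app struct {
	name  string
	fitAt int
}

// tryAllocate walks the nodes and reports how many were tried in the result.
func (a app) tryAllocate(nodes int) AllocationResult {
	for i := 1; i <= nodes; i++ {
		if i == a.fitAt {
			return AllocationResult{Allocated: true, NodesTried: i}
		}
	}
	return AllocationResult{NodesTried: nodes}
}

// queueTryAllocate tries each app in order and sums the counts locally,
// without reading or writing any metric.
func queueTryAllocate(apps []app, nodes int) AllocationResult {
	total := AllocationResult{}
	for _, a := range apps {
		res := a.tryAllocate(nodes)
		total.ApplicationsTried++
		total.NodesTried += res.NodesTried
		if res.Allocated {
			total.Allocated = true
			break
		}
	}
	return total
}

// partitionTryAllocate is the single place where metrics are recorded.
func partitionTryAllocate(apps []app, nodes int, nodeCtr, appCtr *counter) AllocationResult {
	res := queueTryAllocate(apps, nodes)
	nodeCtr.Add(res.NodesTried)
	appCtr.Add(res.ApplicationsTried)
	return res
}

func main() {
	nodeCtr, appCtr := &counter{}, &counter{}
	apps := []app{
		{name: "app-1", fitAt: 0}, // never fits: tries all 3 nodes
		{name: "app-2", fitAt: 2}, // fits on the second node
		{name: "app-3", fitAt: 1}, // not reached
	}
	res := partitionTryAllocate(apps, 3, nodeCtr, appCtr)
	fmt.Println(res.Allocated, nodeCtr.total, appCtr.total) // true 5 2
}
```

In this shape the lower layers only count and return, and the metric is touched once per cycle by the top-level caller.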

- NodesTried and ApplicationsTried are tracked in the result structure
- Local applicationsTried counter increments for each application tried; returns a total count when returning the result
- add application tried counter field to SchedulerMetrics
- partition context records both NodesTried and ApplicationsTried
- added reset calls in ClusterContext.schedule()
- fixed linting issues

…ge gaps

- Add TestTryNodeCount covering AddTryNodeCount, GetTryNodeCount, ResetTryNodeCount
- Add TestTryApplicationCount covering AddTryApplicationCount, ResetTryApplicationCount
- Fix unregisterMetrics to also unregister tryApplicationCount
- Add ApplicationsTried assertion to TestApplicationsTriedCount
- Fix stale comments in TestApplicationsTriedCount

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
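As an illustration of what the add/get/reset helpers named in this commit might look like, here is a minimal sketch. It assumes plain atomic integers rather than the Prometheus types the real SchedulerMetrics presumably uses, and the method signatures are guesses. `cycleMetrics` and `getTryApplicationCount` are hypothetical names that do not appear in the PR.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cycleMetrics is a hypothetical holder for the two per-cycle counters.
type cycleMetrics struct {
	tryNodeCount        atomic.Int64
	tryApplicationCount atomic.Int64
}

// AddTryNodeCount adds the number of nodes tried in one allocation attempt.
func (m *cycleMetrics) AddTryNodeCount(n int) { m.tryNodeCount.Add(int64(n)) }

// GetTryNodeCount returns the nodes tried since the last reset.
func (m *cycleMetrics) GetTryNodeCount() int { return int(m.tryNodeCount.Load()) }

// ResetTryNodeCount clears the node counter, e.g. at the start of a cycle.
func (m *cycleMetrics) ResetTryNodeCount() { m.tryNodeCount.Store(0) }

// AddTryApplicationCount adds the number of applications tried.
func (m *cycleMetrics) AddTryApplicationCount(n int) { m.tryApplicationCount.Add(int64(n)) }

// getTryApplicationCount is a hypothetical getter used only by this example.
func (m *cycleMetrics) getTryApplicationCount() int { return int(m.tryApplicationCount.Load()) }

// ResetTryApplicationCount clears the application counter.
func (m *cycleMetrics) ResetTryApplicationCount() { m.tryApplicationCount.Store(0) }

func main() {
	m := &cycleMetrics{}
	m.AddTryNodeCount(3)
	m.AddTryNodeCount(2)
	m.AddTryApplicationCount(2)
	fmt.Println(m.GetTryNodeCount(), m.getTryApplicationCount()) // 5 2

	// Reset both counters before the next scheduling cycle.
	m.ResetTryNodeCount()
	m.ResetTryApplicationCount()
	fmt.Println(m.GetTryNodeCount(), m.getTryApplicationCount()) // 0 0
}
```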

pbacsko

LGTM

wilfred-s

We add an extra return value for the application methods:

tryAllocate
tryReservedAllocate
tryPlaceholderAllocate

However, all callers ignore the new return value and read NodesTried from the result object instead. We should not add that extra return value to the signatures.
Dropping it reduces the size of the change: no changes to application_test.go and preemption_test.go, and half as many changes to queue.go.

For the two remaining methods:

tryNodes
tryNodesNoReserve

Setting NodesTried on a non-nil result inside these methods is unnecessary: whenever the result is not nil, the caller overrides NodesTried anyway. We cannot assume that the first call to these methods succeeds. Both are called inside a loop, so the caller is the only place that can track NodesTried correctly, and it already does.
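The point about the loop can be illustrated with a small sketch. It reflects one reading of the comment, namely that tryNodes reports its own count to the caller and the caller owns the running total; the types, signatures, and the `ask`/`node` structs are hypothetical and much simpler than the real scheduler objects.

```go
package main

import "fmt"

// result is a simplified allocation result.
type result struct {
	NodeID     string
	NodesTried int
}

// node and ask are hypothetical, reduced to a single resource dimension.
type node struct {
	id   string
	free int
}

type ask struct {
	name string
	size int
}

// tryNodes examines nodes in order and returns the first fit together with
// the number of nodes examined during this call. It deliberately does not
// set NodesTried on the result: it only knows about its own call.
func tryNodes(nodes []node, size int) (*result, int) {
	tried := 0
	for _, n := range nodes {
		tried++
		if n.free >= size {
			return &result{NodeID: n.id}, tried
		}
	}
	return nil, tried
}

// tryAllocate calls tryNodes once per ask. Earlier asks may fail after
// examining nodes, so only this loop knows the true total.
func tryAllocate(asks []ask, nodes []node) *result {
	total := 0
	for _, a := range asks {
		res, tried := tryNodes(nodes, a.size)
		total += tried
		if res != nil {
			res.NodesTried = total // set once, by the caller
			return res
		}
	}
	// How the count is reported when nothing fits is left out of this sketch.
	return nil
}

func main() {
	nodes := []node{{id: "n1", free: 1}, {id: "n2", free: 4}}
	asks := []ask{{name: "big", size: 8}, {name: "small", size: 2}}
	res := tryAllocate(asks, nodes)
	fmt.Println(res.NodeID, res.NodesTried) // n2 4
}
```

Had tryNodes set NodesTried itself, the successful call would have reported 2, missing the two nodes examined for the first ask.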

wilfred-s assigned mitdesai Oct 31, 2025

mitdesai force-pushed the YUNIKORN-3119 branch from 6bdbaee to e002116 Compare October 31, 2025 02:56

pbacsko self-requested a review November 7, 2025 16:51

mitdesai force-pushed the YUNIKORN-3119 branch from e002116 to 0a89255 Compare November 7, 2025 19:33

manirajv06 reviewed Nov 11, 2025


pbacsko requested changes Nov 11, 2025


Comment thread pkg/scheduler/partition.go Outdated

Comment thread pkg/scheduler/objects/queue.go Outdated

Comment thread pkg/scheduler/objects/queue.go Outdated

mitdesai force-pushed the YUNIKORN-3119 branch from 0a89255 to d46128a Compare November 29, 2025 17:42

pbacsko requested a review from wilfred-s December 11, 2025 18:07

mitdesai added 3 commits April 1, 2026 11:21

Rebased with master

6710e34

Add code-coverage

f121866

mitdesai force-pushed the YUNIKORN-3119 branch from 1f9cd52 to d2d8893 Compare April 1, 2026 18:54

mitdesai force-pushed the YUNIKORN-3119 branch from d2d8893 to 0c9a3f8 Compare April 1, 2026 19:16

mitdesai requested review from manirajv06 and pbacsko April 1, 2026 23:23

pbacsko approved these changes Apr 7, 2026


wilfred-s requested changes Apr 9, 2026


mitdesai wants to merge 4 commits into apache:master from mitdesai:YUNIKORN-3119


4 participants
