[YUNIKORN-3119] Add Metrics for Monitoring Applications and Nodes Attempted in Each Scheduling Cycle#1041

Open
mitdesai wants to merge 4 commits into apache:master from mitdesai:YUNIKORN-3119

@mitdesai
Contributor

What is this PR for?

Adds metrics for monitoring the nodes and applications attempted during each scheduling cycle.

What type of PR is it?

  • - Bug Fix
  • - Improvement
  • - Feature
  • - Documentation
  • - Hot Fix
  • - Refactoring

Todos

  • - Task

What is the Jira issue?

Jira https://issues.apache.org/jira/browse/YUNIKORN-3119

How should this be tested?

Screenshots (if appropriate)

Questions:

  • - The license files need an update.
  • - There are breaking changes for older versions.
  • - It needs documentation.

@codecov

codecov Bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 78.57143% with 24 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.58%. Comparing base (65d35fc) to head (0c9a3f8).
⚠️ Report is 1 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/scheduler/objects/application.go | 80.35% | 11 Missing ⚠️ |
| pkg/scheduler/objects/queue.go | 11.11% | 8 Missing ⚠️ |
| pkg/metrics/scheduler.go | 92.68% | 3 Missing ⚠️ |
| pkg/scheduler/context.go | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
+ Coverage   81.56%   81.58%   +0.01%     
==========================================
  Files         103      103              
  Lines       13884    13964      +80     
==========================================
+ Hits        11324    11392      +68     
- Misses       2281     2294      +13     
+ Partials      279      278       -1     

@pbacsko pbacsko self-requested a review November 7, 2025 16:51
@pbacsko
Contributor

pbacsko commented Nov 7, 2025

@mitdesai please rebase this PR. The unit test failure is unrelated.

@mitdesai
Contributor Author

mitdesai commented Nov 7, 2025

Thanks @pbacsko, I have rebased onto master.

Contributor

@manirajv06 manirajv06 left a comment

Do we need to include other types of allocations in the scheduling cycle counts, e.g. placeholder (PH) and reserved allocations?

Contributor

@pbacsko pbacsko left a comment

I definitely think this solution needs some rework. The current approach is a bit hard to understand.

  1. Code should not communicate through metrics. E.g. in Queue.tryAllocate(), we call GetTryNode() twice to get a difference. Why not just return this from app.tryAllocate()? That would be much simpler. Then, instead of calling Inc() every time from the app, you can just call Add() once with the number of nodes that were tried.

  2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me, but it's just weird. This information is not specific to a queue. Similarly to #1, this data (number of apps tried) should propagate back to a higher-level caller which does the necessary processing. You can easily add this to AllocationResult and record the metrics in PartitionContext.tryAllocate() after pc.root.TryAllocate(...) returns.
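
A minimal sketch of the two suggestions above, under stated assumptions: the `allocationResult`, `schedulerMetrics`, `app`, `queueTryAllocate` and `partitionTryAllocate` names and signatures are hypothetical stand-ins and not the actual YuniKorn types. Only the `NodesTried` and `ApplicationsTried` fields and the `AddTryNodeCount` / `AddTryApplicationCount` method names are taken from this conversation. The counts travel back in the result, and the top-level caller records them with a single Add() per cycle.

```go
package main

import "fmt"

// allocationResult is a hypothetical stand-in for AllocationResult.
type allocationResult struct {
	Allocated         bool
	NodesTried        int
	ApplicationsTried int
}

// schedulerMetrics is a hypothetical stand-in for the scheduler metrics holder.
type schedulerMetrics struct {
	tryNodeCount        int
	tryApplicationCount int
}

func (m *schedulerMetrics) AddTryNodeCount(n int)        { m.tryNodeCount += n }
func (m *schedulerMetrics) AddTryApplicationCount(n int) { m.tryApplicationCount += n }

// app is a toy application: fitsOn is the index of the first node that can
// hold its request, or -1 if no node fits.
type app struct {
	name   string
	fitsOn int
}

// tryAllocate walks the nodes in order and reports in the result how many
// nodes it looked at. It does not touch any metrics.
func (a *app) tryAllocate(nodeCount int) allocationResult {
	res := allocationResult{ApplicationsTried: 1}
	for i := 0; i < nodeCount; i++ {
		res.NodesTried++
		if i == a.fitsOn {
			res.Allocated = true
			break
		}
	}
	return res
}

// queueTryAllocate tries the applications in order, sums their counts and
// stops at the first successful allocation.
func queueTryAllocate(apps []*app, nodeCount int) allocationResult {
	var total allocationResult
	for _, a := range apps {
		res := a.tryAllocate(nodeCount)
		total.NodesTried += res.NodesTried
		total.ApplicationsTried += res.ApplicationsTried
		if res.Allocated {
			total.Allocated = true
			break
		}
	}
	return total
}

// partitionTryAllocate is the only place that records metrics: one Add()
// per counter after the queue call returns.
func partitionTryAllocate(m *schedulerMetrics, apps []*app, nodeCount int) allocationResult {
	res := queueTryAllocate(apps, nodeCount)
	m.AddTryNodeCount(res.NodesTried)
	m.AddTryApplicationCount(res.ApplicationsTried)
	return res
}

func main() {
	m := &schedulerMetrics{}
	apps := []*app{
		{name: "app-1", fitsOn: -1}, // fits nowhere: all 3 nodes tried
		{name: "app-2", fitsOn: 1},  // fits on the second node: 2 nodes tried
		{name: "app-3", fitsOn: 0},  // never reached
	}
	res := partitionTryAllocate(m, apps, 3)
	fmt.Printf("allocated=%v nodesTried=%d appsTried=%d\n",
		res.Allocated, res.NodesTried, res.ApplicationsTried)
}
```

In this toy run the first application tries all three nodes, the second succeeds on its second node, and the third is never reached, so the metrics holder ends up with 5 nodes and 2 applications tried.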

Comment thread pkg/scheduler/partition.go Outdated
Comment thread pkg/scheduler/objects/queue.go Outdated
Comment thread pkg/scheduler/objects/queue.go Outdated
mitdesai added 3 commits April 1, 2026 11:21
  - NodesTried and ApplicationsTried are tracked in the result structure
  - Local applicationsTried counter increments for each application tried; returns a total count when returning the result
  - add application tried counter field to SchedulerMetrics
  - partition context records both NodesTried and ApplicationsTried
  - added reset calls in ClusterContext.schedule()
  - fixed linting issues
…ge gaps

- Add TestTryNodeCount covering AddTryNodeCount, GetTryNodeCount, ResetTryNodeCount
- Add TestTryApplicationCount covering AddTryApplicationCount, ResetTryApplicationCount
- Fix unregisterMetrics to also unregister tryApplicationCount
- Add ApplicationsTried assertion to TestApplicationsTriedCount
- Fix stale comments in TestApplicationsTriedCount

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor

@pbacsko pbacsko left a comment

LGTM

Contributor

@wilfred-s wilfred-s left a comment

We add an extra return value for the application methods:

  • tryAllocate
  • tryReservedAllocate
  • tryPlaceholderAllocate

However, all callers ignore the newly returned value and read NodesTried from the result object instead. We should not add that extra return value to the signatures.
Dropping it reduces the size of the change: no changes to application_test.go and preemption_test.go, and half the changes to queue.go.

For the two left over methods:

  • tryNodes
  • tryNodesNoReserve

Setting NodesTried on the non-nil result inside the method is unnecessary: if the result is not nil, the caller always overrides NodesTried. We cannot assume that the first call to these methods succeeds. Both are called inside a loop, so the caller is the only one that can track NodesTried correctly, and it already does that.
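
A rough sketch of why only the caller can produce the correct total, assuming a simplified model: the `ask`, `node` and `result` types and the signatures of `tryNodes` and `tryAllocate` below are hypothetical and do not reproduce the real methods. The helper reports how many nodes it looked at for one request and leaves `NodesTried` unset; the caller, which loops over the requests, accumulates the count and stamps it on the result once. The sketch covers only the success path.

```go
package main

import "fmt"

// Hypothetical, simplified types for illustration only.
type ask struct{ size int }

type node struct {
	id   string
	free int
}

type result struct {
	NodeID     string
	NodesTried int
}

// tryNodes checks one ask against the nodes in order. It returns the result
// (nil if nothing fits) and the number of nodes it looked at. It deliberately
// does not set NodesTried: it only knows about this single call.
func tryNodes(a ask, nodes []node) (*result, int) {
	tried := 0
	for _, n := range nodes {
		tried++
		if n.free >= a.size {
			return &result{NodeID: n.id}, tried
		}
	}
	return nil, tried
}

// tryAllocate loops over the asks. Earlier asks may fail after looking at
// every node, so the running total lives here and is written to the result
// exactly once, when an ask finally succeeds.
func tryAllocate(asks []ask, nodes []node) *result {
	total := 0
	for _, a := range asks {
		res, tried := tryNodes(a, nodes)
		total += tried
		if res != nil {
			res.NodesTried = total
			return res
		}
	}
	return nil
}

func main() {
	nodes := []node{{id: "n1", free: 1}, {id: "n2", free: 4}, {id: "n3", free: 4}}
	asks := []ask{{size: 8}, {size: 2}} // the first ask fits on no node
	if res := tryAllocate(asks, nodes); res != nil {
		fmt.Printf("node=%s nodesTried=%d\n", res.NodeID, res.NodesTried)
	}
}
```

Here the first request fails after three nodes and the second succeeds on its second node, so the result carries a total of 5. A value set inside `tryNodes` would have covered only the last call (2) and would be overwritten anyway.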

4 participants