[YUNIKORN-3119] Add Metrics for Monitoring Applications and Nodes Attempted in Each Scheduling Cycle#1041
mitdesai wants to merge 4 commits into
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1041      +/-   ##
==========================================
+ Coverage   81.56%   81.58%   +0.01%
==========================================
  Files         103      103
  Lines       13884    13964      +80
==========================================
+ Hits        11324    11392      +68
- Misses       2281     2294      +13
+ Partials      279      278       -1
Force-pushed from 6bdbaee to e002116
@mitdesai please rebase this PR. The unit test failure is unrelated.
Force-pushed from e002116 to 0a89255
Thanks @pbacsko, I have rebased with master.
manirajv06 left a comment
Do we need to include other types of allocations in scheduling cycles, e.g. placeholder (PH) and reservation allocations?
pbacsko left a comment
I definitely think this solution needs some re-work. Current approach is a bit hard to understand.
1. Code should not be communicating through metrics. E.g. in Queue.tryAllocate(), we call GetTryNode() twice to get a difference. Why not just return this from app.tryAllocate()? That would be much simpler. Then, instead of calling Inc() every time from the app, you can just call Add() with the number of nodes that were tried.
2. We're storing transient information specifically in the root queue, which involves constant queue walking. It's not the speed that bothers me, but it's just weird. This information is not specific to a queue. Similarly to #1, this data (number of apps tried) should propagate back to a higher-level caller which does the necessary processing. You can easily add this to AllocationResult and record the metrics in PartitionContext.tryAllocate() after pc.root.TryAllocate(...) returns.
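The two suggestions above can be sketched roughly as follows. This is a minimal illustration, not the YuniKorn code: the `AllocationResult` here is a cut-down stand-in, and `counter`, `app`, `queueTryAllocate` and `partitionTryAllocate` are hypothetical names invented for the example. The point is only that the counts travel back in the result and the metric is updated once, with `Add()`, by the top-level caller.

```go
package main

import "fmt"

// AllocationResult is a simplified stand-in for the scheduler's result type.
// Only NodesTried and ApplicationsTried come from the discussion; the rest
// of this sketch is hypothetical.
type AllocationResult struct {
	NodeID            string
	NodesTried        int
	ApplicationsTried int
}

// counter is a hypothetical stand-in for a metrics counter with an Add method.
type counter struct{ total int }

func (c *counter) Add(n int) { c.total += n }

// app models an application that needs a node with at least `ask` free units.
type app struct{ ask int }

// tryAllocate walks the nodes and reports how many it tried in the result,
// instead of incrementing a metric once per node.
func (a app) tryAllocate(free []int) *AllocationResult {
	res := &AllocationResult{}
	for i, f := range free {
		res.NodesTried++
		if f >= a.ask {
			res.NodeID = fmt.Sprintf("node-%d", i)
			break
		}
	}
	return res
}

// queueTryAllocate tries each application in order and aggregates the counts.
func queueTryAllocate(apps []app, free []int) *AllocationResult {
	total := &AllocationResult{}
	for _, a := range apps {
		total.ApplicationsTried++
		res := a.tryAllocate(free)
		total.NodesTried += res.NodesTried
		if res.NodeID != "" {
			total.NodeID = res.NodeID
			break
		}
	}
	return total
}

// partitionTryAllocate records the metrics once, after the queue call returns.
func partitionTryAllocate(apps []app, free []int, nodes, appsTried *counter) *AllocationResult {
	res := queueTryAllocate(apps, free)
	nodes.Add(res.NodesTried)
	appsTried.Add(res.ApplicationsTried)
	return res
}

func main() {
	nodes, appsTried := &counter{}, &counter{}
	res := partitionTryAllocate([]app{{ask: 10}, {ask: 2}}, []int{1, 4, 8}, nodes, appsTried)
	fmt.Println(res.NodeID, nodes.total, appsTried.total) // prints "node-1 5 2"
}
```

With this shape no code path reads a metric back to compute a difference, and nothing transient needs to live on the root queue.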
Force-pushed from 0a89255 to d46128a
- NodesTried and ApplicationsTried are tracked in the result structure
- Local applicationsTried counter increments for each application tried; returns a total count when returning the result
- add application tried counter field to SchedulerMetrics
- partition context records both NodesTried and ApplicationsTried
- added reset calls in ClusterContext.schedule()
- fixed linting issues
…ge gaps
- Add TestTryNodeCount covering AddTryNodeCount, GetTryNodeCount, ResetTryNodeCount
- Add TestTryApplicationCount covering AddTryApplicationCount, ResetTryApplicationCount
- Fix unregisterMetrics to also unregister tryApplicationCount
- Add ApplicationsTried assertion to TestApplicationsTriedCount
- Fix stale comments in TestApplicationsTriedCount
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
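For readers unfamiliar with the add/get/reset pattern these commits mention, here is a small sketch of a per-cycle counter. The method names are taken from the commit message, but the signatures, the `cycleMetrics` type and the atomic backing store are assumptions made for illustration; the PR's own metrics type is not shown in this conversation and presumably wraps a metrics library instead.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cycleMetrics is a hypothetical stand-in for the scheduler metrics type.
// It keeps the node-attempt count for the current scheduling cycle.
type cycleMetrics struct {
	tryNodeCount atomic.Int64
}

// AddTryNodeCount adds the number of nodes tried by one allocation attempt.
// (Assumed signature.)
func (m *cycleMetrics) AddTryNodeCount(n int) { m.tryNodeCount.Add(int64(n)) }

// GetTryNodeCount returns the count accumulated since the last reset.
// (Assumed signature.)
func (m *cycleMetrics) GetTryNodeCount() int { return int(m.tryNodeCount.Load()) }

// ResetTryNodeCount clears the count, e.g. at the start of a scheduling cycle.
// (Assumed signature.)
func (m *cycleMetrics) ResetTryNodeCount() { m.tryNodeCount.Store(0) }

func main() {
	m := &cycleMetrics{}

	// One scheduling cycle: reset, then record attempts as they are reported.
	m.ResetTryNodeCount()
	m.AddTryNodeCount(3)
	m.AddTryNodeCount(2)
	fmt.Println(m.GetTryNodeCount()) // prints 5

	// The next cycle starts from zero again.
	m.ResetTryNodeCount()
	fmt.Println(m.GetTryNodeCount()) // prints 0
}
```

A test along the lines of TestTryNodeCount would exercise exactly these three calls in sequence.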
We add an extra return value for the application methods:
- tryAllocate
- tryReservedAllocate
- tryPlaceholderAllocate
However, all callers ignore the new returned value and read NodesTried out of the result object. We should not add that extra return value to the signatures.
Dropping it reduces the change size: no changes to application_test.go and preemption_test.go, and half the changes to the queue.go file.
For the two remaining methods:
- tryNodes
- tryNodesNoReserve
Setting NodesTried in the non-nil result inside the method is unneeded. If the result is not nil, we always override NodesTried in the caller. We cannot assume the first call to these methods is a success. Both are called inside a loop, so the caller is the only one that can track NodesTried correctly, and it does that.
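The loop argument can be illustrated with a small sketch. Everything here is hypothetical (the `result` type, the integer "asks", and the simplified `tryNodes` signature); it only shows why a count set inside the helper would be wrong after an earlier failed iteration, and why the caller's running total is the value that ends up on the result.

```go
package main

import "fmt"

// result is a hypothetical, cut-down allocation result.
type result struct {
	nodeID     string
	NodesTried int
}

// tryNodes is a simplified stand-in: it returns a non-nil result if some node
// fits the ask, plus the number of nodes it examined in this one call.
// It deliberately does not set NodesTried on the result.
func tryNodes(ask int, free []int) (*result, int) {
	tried := 0
	for i, f := range free {
		tried++
		if f >= ask {
			return &result{nodeID: fmt.Sprintf("node-%d", i)}, tried
		}
	}
	return nil, tried
}

// allocateAsks calls tryNodes once per ask. Only this loop sees the attempts
// made by earlier, failed iterations, so it owns the running total and sets
// it on the result. When nothing fits it returns nil and the count is
// dropped, which is a simplification of this sketch.
func allocateAsks(asks []int, free []int) *result {
	total := 0
	for _, ask := range asks {
		res, tried := tryNodes(ask, free)
		total += tried
		if res != nil {
			res.NodesTried = total // caller sets the cumulative count
			return res
		}
	}
	return nil
}

func main() {
	res := allocateAsks([]int{10, 3}, []int{1, 4, 8})
	fmt.Println(res.nodeID, res.NodesTried) // prints "node-1 5"
}
```

Had `tryNodes` set the field itself, the successful second call would have reported 2 nodes tried and lost the 3 attempts from the first ask.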
What is this PR for?
Added metrics for monitoring the nodes and applications attempted during each scheduling cycle.
What type of PR is it?
Todos
What is the Jira issue?
Jira https://issues.apache.org/jira/browse/YUNIKORN-3119
How should this be tested?
Screenshots (if appropriate)
Questions: