-
Notifications
You must be signed in to change notification settings - Fork 332
SAMZA-2582: Add a metric to track container failure tracking metric for Samza #1417
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Changes from 2 commits
09a030e
4189ae8
f6d6214
1fe3c09
5ebf385
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,7 @@ | |
| import java.util.List; | ||
| import java.util.Map; | ||
| import java.util.Optional; | ||
| import java.util.concurrent.atomic.AtomicInteger; | ||
| import org.apache.commons.lang3.tuple.ImmutablePair; | ||
| import org.apache.commons.lang3.tuple.Pair; | ||
| import org.apache.samza.SamzaException; | ||
|
|
@@ -182,7 +183,7 @@ public ContainerProcessManager(Config config, SamzaApplicationState state, Metri | |
| this.jobConfig = new JobConfig(clusterManagerConfig); | ||
|
|
||
| this.hostAffinityEnabled = clusterManagerConfig.getHostAffinityEnabled(); | ||
|
|
||
| this.containerProcessManagerMetrics = new ContainerProcessManagerMetrics(jobConfig, state, registry); | ||
| this.clusterResourceManager = resourceManager; | ||
| this.containerManager = containerManager; | ||
| this.diagnosticsManager = Option.empty(); | ||
|
|
@@ -236,6 +237,12 @@ public void start() { | |
| Map<String, String> processorToHostMapping = state.jobModelManager.jobModel().getAllContainerLocality(); | ||
| containerAllocator.requestResources(processorToHostMapping); | ||
|
|
||
| // Initialize the per processor failure count to be 0 | ||
| processorToHostMapping.keySet().forEach(processorId -> { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. See comment below on how/why this information isnt really in "state"
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. replied there |
||
| state.perProcessorFailureCount.put(processorId, new AtomicInteger(0)); | ||
| containerProcessManagerMetrics.registerProcessorFailureCountMetric(processorId); | ||
| }); | ||
|
|
||
| // Start container allocator thread | ||
| LOG.info("Starting the container allocator thread"); | ||
| allocatorThread.start(); | ||
|
|
@@ -472,6 +479,9 @@ void onResourceCompletedWithUnknownStatus(SamzaResourceStatus resourceStatus, St | |
| LOG.info("Container ID: {} for Processor ID: {} failed with exit code: {}.", containerId, processorId, exitStatus); | ||
| Instant now = Instant.now(); | ||
| state.failedContainers.incrementAndGet(); | ||
| if (state.perProcessorFailureCount.get(processorId) != null) { | ||
| state.perProcessorFailureCount.get(processorId).incrementAndGet(); | ||
| } | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. else {
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This method is the helper to and is invoked from onResourceCompleted(...) which does the check for processorId to be legit, remeber that we also get redundant notifications so we cannot declare a container orphan / unknown, we need more testing to deem callback senarios as orphans and that work is beyond the scope of this change |
||
| state.jobHealthy.set(false); | ||
|
|
||
| state.neededProcessors.incrementAndGet(); | ||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -115,6 +115,13 @@ public enum SamzaAppStatus { UNDEFINED, SUCCEEDED, FAILED } | |
| */ | ||
| public final ConcurrentHashMap<String, SamzaResourceStatus> failedProcessors = new ConcurrentHashMap<>(0); | ||
|
|
||
|
|
||
| /** | ||
| * Map of the Samza processor ID to the count of failed attempts | ||
| * Modified by AMRMCallbackThread | ||
| */ | ||
| public final ConcurrentMap<String, AtomicInteger> perProcessorFailureCount = new ConcurrentHashMap<>(0); | ||
|
|
||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If this information is only useful for metric-emission, does it need to be stored in "state" ?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That is correct, we can directly wire metrics registry in ContainerManager, ContainerAllocator and instantiate new guage and counters in the code but all metrics related to AM are under this ContainerProcessManagerMetrics class which holds MetricsRegistry and SamzaApplicationState, so once does not need to wire MetricsRegistry individually to each AM class ContainerManager, ContainerAllocator. This is the justification for maintained this state variable to wire metrics, I feel this approach is cleaner
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Check SamzaApplicationState most of the state there is just used for metric emissions |
||
| /** | ||
| * Final status of the application. Made to be volatile s.t. changes will be visible in multiple threads. | ||
| */ | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe we decided to use "processorId" for 0,1,2..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that lingo is used internally in code as the naming conventions for javadocs, this is public-facing metrics page where we do not need to have context between processorId and containerId