Implement dead hosts component to skip consistently failing hosts#499
Implement dead hosts component to skip consistently failing hosts#499Copilot wants to merge 3 commits into
Conversation
Co-authored-by: CorentinB <5089772+CorentinB@users.noreply.github.com>
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #499 +/- ##
==========================================
+ Coverage 56.39% 57.01% +0.61%
==========================================
Files 130 131 +1
Lines 8120 8272 +152
==========================================
+ Hits 4579 4716 +137
- Misses 3172 3184 +12
- Partials 369 372 +3
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| getCmd.PersistentFlags().Duration("rate-limit-cleanup-frequency", time.Duration(5*time.Minute), "How often to run cleanup of stale buckets that are not accessed in the duration.") | ||
|
|
||
| // Dead hosts detection flags | ||
| getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") |
There was a problem hiding this comment.
dead-hosts-detection seems lik a better name
There was a problem hiding this comment.
Updated the flag name to --dead-hosts-detection as suggested. Commit d7ffd62.
| // Dead hosts detection flags | ||
| getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
| getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") | ||
| getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.") |
There was a problem hiding this comment.
This could be almost irrelevant if we had like a background process dedicated to checking the dead hosts status? Is it a good idea? IDK.
There was a problem hiding this comment.
That's an interesting idea. A background process could periodically check dead hosts to see if they've recovered. For now, I've implemented a cleanup mechanism that removes hosts after a configurable age (default 6h), which allows them to be retried naturally. The background checking approach could be a future enhancement.
| ) | ||
|
|
||
| func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) { | ||
| func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, globalBucketManager *ratelimiter.BucketManager, globalDeadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) { |
There was a problem hiding this comment.
We start to have a lot of arguments to this function, is there a cleaner way to do things?
There was a problem hiding this comment.
Created an ArchiverDependencies struct to group the related parameters, reducing the function signature from 6 parameters to 4. This makes it cleaner and more maintainable. Commit d7ffd62.
| // Check if host is marked as dead | ||
| if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) { | ||
| logger.Info("skipping request to dead host", "host", req.URL.Host) | ||
| item.SetStatus(models.ItemFailed) |
There was a problem hiding this comment.
Is it really Failed though?
There was a problem hiding this comment.
It should be similar to retries failing, just accelerated?
There was a problem hiding this comment.
You're right - it should behave similarly to retries failing, just accelerated. Using ItemFailed is consistent with how other network failures are handled in the system when retries are exhausted.
| } | ||
|
|
||
| func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, client *warc.CustomHTTPClient) { | ||
| func ArchiveItem(item *models.Item, wg *sync.WaitGroup, guard chan struct{}, bucketManager *ratelimiter.BucketManager, deadHostManager *deadhosts.Manager, client *warc.CustomHTTPClient) { |
There was a problem hiding this comment.
Same comment as for archiver.go ArchiveItem.
There was a problem hiding this comment.
Updated the headless archiver function signature to match the general archiver improvements. Commit d7ffd62.
| logger *log.FieldedLogger | ||
| globalArchiver *archiver | ||
| globalBucketManager *ratelimiter.BucketManager | ||
| globalDeadHostManager *deadhosts.Manager |
There was a problem hiding this comment.
globalDeadHostsManager not globalDeadHostManager
There was a problem hiding this comment.
Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.
| } | ||
|
|
||
| // Initialize dead hosts manager | ||
| globalDeadHostManager = deadhosts.NewManager(ctx, |
There was a problem hiding this comment.
globalDeadHostsManager not globalDeadHostManager
There was a problem hiding this comment.
Fixed the variable naming to globalDeadHostsManager throughout the codebase. Commit d7ffd62.
| RateLimitCleanupFrequency time.Duration `mapstructure:"rate-limit-cleanup-frequency"` | ||
|
|
||
| // Dead hosts detection | ||
| DeadHostsEnabled bool `mapstructure:"dead-hosts-enabled"` |
There was a problem hiding this comment.
dead-hosts-detection seems lik a better name
There was a problem hiding this comment.
Updated the config field name to dead-hosts-detection as suggested. Commit d7ffd62.
| } | ||
|
|
||
| // isDeadHostError determines if an error indicates a dead host | ||
| func isDeadHostError(err error) bool { |
There was a problem hiding this comment.
Is there a better approach than checking the strings from the error messages?
There was a problem hiding this comment.
Improved the error detection approach using errors.As() to check for specific error types like net.OpError, net.DNSError, and net.Error before falling back to string matching. This is more robust and less brittle. Commit d7ffd62.
| // Manager tracks hosts that consistently deny connections | ||
| type Manager struct { | ||
| mu sync.RWMutex | ||
| deadHosts map[string]*hostInfo |
There was a problem hiding this comment.
Why not use a sync.Map? https://pkg.go.dev/sync#Map
There was a problem hiding this comment.
sync.RWMutex is actually better for this use case since we need complex operations like iterating through all entries for cleanup and age checks. sync.Map is optimized for simple key-value operations but doesn't support the kind of bulk operations we need for the cleanup mechanism. Added a comment explaining this choice.
| // Check if host is marked as dead | ||
| if globalDeadHostManager != nil && globalDeadHostManager.IsDeadHost(req.URL.Host) { | ||
| logger.Info("skipping request to dead host", "host", req.URL.Host) | ||
| item.SetStatus(models.ItemFailed) |
There was a problem hiding this comment.
It should be similar to retries failing, just accelerated?
|
|
||
| // Dead hosts detection flags | ||
| getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
| getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") |
There was a problem hiding this comment.
Default should be higher, perhaps 15 or 30?
There was a problem hiding this comment.
Updated the default max-failures to 15 as suggested. This provides a better balance between avoiding false positives and detecting genuinely dead hosts. Commit d7ffd62.
| getCmd.PersistentFlags().Bool("dead-hosts-enabled", false, "Enable dead hosts detection to skip hosts that consistently deny connections.") | ||
| getCmd.PersistentFlags().Int("dead-hosts-max-failures", 5, "Number of connection failures before marking a host as dead.") | ||
| getCmd.PersistentFlags().Duration("dead-hosts-refresh-period", time.Duration(30*time.Minute), "How often to clean up the dead hosts cache.") | ||
| getCmd.PersistentFlags().Duration("dead-hosts-max-age", time.Duration(24*time.Hour), "How long to keep a host marked as dead before giving it another chance.") |
There was a problem hiding this comment.
I think 6 hours may be a better default.
There was a problem hiding this comment.
Updated the default max-age to 6 hours as suggested. This allows for faster recovery while still providing significant time savings during crawls. Commit d7ffd62.
|
|
||
| resp, err = client.Do(req) | ||
| if err != nil { | ||
| // Record failure for dead host detection |
There was a problem hiding this comment.
This should be clarified to sending the error to the dead host detection function, not necessarily recording it in.
There was a problem hiding this comment.
Clarified the comment to "Send error to dead host detection for analysis" to better reflect that the error is analyzed before being recorded. Commit d7ffd62.
…ion parameters Co-authored-by: NGTmeaty <2244519+NGTmeaty@users.noreply.github.com>
This PR implements a dead hosts detection system that tracks hosts which consistently deny connections and automatically skips them to avoid wasted network timeouts during crawls.
Overview
The dead hosts component addresses a common crawling inefficiency where Zeno would repeatedly attempt to connect to hosts that are permanently unreachable (dead hosts). This results in unnecessary network timeouts and slower crawls. The implementation is inspired by similar functionality in
warcproxand the WBM live web checker.Key Features
Automatic Detection: Tracks network-level failures that indicate dead hosts:
no such host,server misbehaving)Smart Skipping: Once a host is marked as dead (after configurable failures), all subsequent requests to that host are skipped immediately
Recovery Mechanism: Dead hosts cache is periodically cleaned up to allow hosts to recover
Disabled by Default: Following the suggestion in the issue comments, the feature is disabled by default to avoid potential side effects
Robust Error Detection: Uses proper Go error type checking with
errors.As()fornet.OpError,net.DNSError, andnet.Errortypes, falling back to string matching only when necessaryClean Architecture: Refactored function signatures using
ArchiverDependenciesstruct to group related parameters and improve maintainabilityConfiguration Options
Implementation Details
deadhostspackage ininternal/pkg/archiver/deadhosts/ArchiverDependenciesstructsync.RWMutexfor efficient concurrent access (chosen oversync.Mapdue to need for complex cleanup operations)Example Usage
The component integrates with Zeno's existing rate limiter and follows the same architectural patterns. When enabled, it will log when hosts are being skipped due to being marked as dead, helping operators understand crawl behavior.
Code Quality Improvements
Based on code review feedback, the implementation includes:
Fixes #361.
💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.