Optimize the OSS extraction process and implement Azure Blob extractor #281
Azure Blob's API is not S3 compatible, so it needs to be implemented separately.
    // Azure Blob Storage
    var azureServers = []string{
    	// Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
    	"Windows-Azure-Blob",
    	// Blob Service Version 1.0 Microsoft-HTTPAPI/2.0
    	"Blob Service Version",
    	// emulator, https://github.com/Azure/Azurite
    	"Azurite-Blob",
    }
I have tested Zeno with the Azure extractor locally against Azurite and it works, but I haven't run it against a real Azure Blob endpoint yet.
Also, I found that some Azure Blob endpoints send a `Server: Blob Service Version XXX...` header, and I don't know how they differ from the ones that send `Server: Windows-Azure-Blob...`.
https://en.fofa.info/result?qbase64=aGVhZGVyPSJXaW5kb3dzLUF6dXJlLUJsb2Ii
https://en.fofa.info/result?qbase64=aGVhZGVyPSJCbG9iIFNlcnZpY2UgVmVyc2lvbiI%3D
Pull Request Overview
This PR refactors outlink extraction to support multiple object storage backends by replacing the legacy S3 extraction with a more generic approach. Key changes include updating the extraction logic in outlinks.go, refactoring S3 extraction into a unified s3Compatible helper, and adding Azure extraction support with corresponding tests.
- Updates log messages and extraction calls from S3 to ObjectStorage.
- Refactors S3 extraction logic and error handling.
- Introduces Azure object storage extraction and tests for ObjectStorage detection.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| internal/pkg/postprocessor/outlinks.go | Replaces legacy S3 extraction with ObjectStorage extraction and updates log messages. |
| internal/pkg/postprocessor/extractor/utils.go | Adds a helper function toURLs for conversion of string slices to URL objects. |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Adds tests verifying object storage header detection. |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Updates tests to use the new s3Compatible helper and removes deprecated tests for IsS3. |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Refactors S3 extraction logic into s3Compatible with improved error wrapping and documentation. |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | Introduces Azure Blob Storage extraction logic with XML parsing of blob listings. |
| internal/pkg/postprocessor/extractor/object_storage.go | Provides a unified ObjectStorage function that delegates to the appropriate helper based on the server header. |
Pull Request Overview
This PR optimizes the OSS extraction process by unifying S3 extraction logic under a generalized ObjectStorage interface and adds support for Azure Blob Storage extraction.
- Updated outlinks extraction to use ObjectStorage instead of S3
- Added new utility functions and tests for both S3-compatible and Azure Blob Storage
- Removed legacy S3 tests and refactored extraction helper functions for better consistency
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| internal/pkg/postprocessor/outlinks.go | Updated extractor branch logic and logging to use ObjectStorage |
| internal/pkg/postprocessor/extractor/utils.go | Added a helper function toURLs |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Introduced tests for ObjectStorage extraction checks |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Added tests covering S3-compatible extraction within ObjectStorage |
| internal/pkg/postprocessor/extractor/object_storage_azure_test.go | Added tests for Azure Blob Storage extraction |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Refactored S3 extraction logic into s3Compatible with updated documentation |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | New implementation for Azure Blob Storage extraction |
| internal/pkg/postprocessor/extractor/object_storage.go | Unified the extraction interface for supported object storage servers |
Comments suppressed due to low confidence (1)
internal/pkg/postprocessor/extractor/object_storage_azure.go:24
- [nitpick] Consider renaming 'AZureBlobEnumerationResults' to 'AzureBlobEnumerationResults' for clarity and consistency with common naming conventions.
type AZureBlobEnumerationResults struct {
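For context, a rough sketch of what parsing an Azure Blob listing into blob URLs might look like, using the spelling suggested in the nitpick. The struct fields cover only the `EnumerationResults`, `Blobs`, `Blob`, `Name`, and `NextMarker` elements of the List Blobs XML response; the function name `blobURLs`, the sample host, and the URL-joining strategy are assumptions, not the PR's code.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/url"
	"strings"
)

// AzureBlobEnumerationResults models a small subset of the List Blobs
// response. Only the fields needed for URL extraction are declared.
type AzureBlobEnumerationResults struct {
	XMLName xml.Name `xml:"EnumerationResults"`
	Blobs   []struct {
		Name string `xml:"Name"`
	} `xml:"Blobs>Blob"`
	NextMarker string `xml:"NextMarker"`
}

// blobURLs is a hypothetical helper: it parses the listing and joins
// each blob name onto the container URL, dropping the listing query.
func blobURLs(containerURL string, listing []byte) ([]string, error) {
	base, err := url.Parse(containerURL)
	if err != nil {
		return nil, fmt.Errorf("parse container URL: %w", err)
	}
	var res AzureBlobEnumerationResults
	if err := xml.Unmarshal(listing, &res); err != nil {
		return nil, fmt.Errorf("parse blob listing: %w", err)
	}
	urls := make([]string, 0, len(res.Blobs))
	for _, b := range res.Blobs {
		u := *base // copy so the base URL is not mutated
		u.RawQuery = ""
		u.RawPath = ""
		u.Path = strings.TrimSuffix(base.Path, "/") + "/" + b.Name
		urls = append(urls, u.String())
	}
	return urls, nil
}

func main() {
	listing := []byte(`<EnumerationResults ContainerName="container">
  <Blobs>
    <Blob><Name>a.txt</Name></Blob>
    <Blob><Name>dir/b.txt</Name></Blob>
  </Blobs>
  <NextMarker></NextMarker>
</EnumerationResults>`)
	urls, err := blobURLs("https://example.blob.core.windows.net/container?restype=container&comp=list", listing)
	fmt.Println(urls, err)
}
```

A non-empty `NextMarker` would indicate further pages of results; this sketch does not follow it.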
Pull Request Overview
This PR optimizes the OSS extraction process by consolidating S3 extraction logic into a unified ObjectStorage extractor and adding Azure Blob extraction support.
- Updated outlinks extraction to use ObjectStorage instead of S3 exclusively
- Added dedicated extraction logic and tests for Azure Blob, and refactored S3 extraction into a single helper (s3Compatible)
- Introduced a helper function (toURLs) for converting string slices to URL objects
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/pkg/postprocessor/outlinks.go | Updated extraction logic to use ObjectStorage for both S3 and Azure sources |
| internal/pkg/postprocessor/extractor/utils.go | Added helper function toURLs for URL conversion |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Added tests for object storage recognition including Azure |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Added tests for S3 compatible extraction via s3Compatible |
| internal/pkg/postprocessor/extractor/object_storage_azure_test.go | Added tests for Azure Blob extraction |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Consolidated S3 extraction logic into s3Compatible with legacy and v2 handling |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | Implemented Azure Blob extraction with proper URL construction and error handling |
| internal/pkg/postprocessor/extractor/object_storage.go | Introduced the ObjectStorage dispatcher to choose between S3 and Azure extractors |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull Request Overview
This PR refactors the OSS extraction process by consolidating extraction logic into an ObjectStorage method and adds support for Azure Blob extraction alongside S3-compatible extraction. Key changes include updating outlink extraction routing, adding utility and test functions for both Azure and S3-compatible extraction, and removing the legacy S3-specific tests.
Reviewed Changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| internal/pkg/postprocessor/outlinks.go | Updated extractor routing to use ObjectStorage based on URL. |
| internal/pkg/postprocessor/extractor/utils.go | Added a helper to convert []string to []*models.URL. |
| internal/pkg/postprocessor/extractor/object_storage*.go | Implemented and tested Azure Blob and S3-compatible extractors. |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests in favor of consolidated ObjectStorage tests. |
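As an illustration of the `[]string` to URL conversion mentioned for `utils.go`, here is a small sketch. It uses the standard library's `*url.URL` as a stand-in for the project's `*models.URL`, whose definition is not shown in this conversation, and it skips unparsable entries, which is an assumption about the desired behavior.

```go
package main

import (
	"fmt"
	"net/url"
)

// toURLs converts raw strings into parsed URLs. This stand-in returns
// []*url.URL rather than the project's []*models.URL, and silently
// drops entries that fail to parse (an assumed policy).
func toURLs(raw []string) []*url.URL {
	out := make([]*url.URL, 0, len(raw))
	for _, r := range raw {
		u, err := url.Parse(r)
		if err != nil {
			continue
		}
		out = append(out, u)
	}
	return out
}

func main() {
	for _, u := range toURLs([]string{"https://example.com/a", "://bad"}) {
		fmt.Println(u.String())
	}
}
```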
When fixing #280, I found that Azure Blob is not S3 compatible, so I implemented a separate Azure sub-extractor. Since a given URL can only belong to one type of OSS, I created the `ObjectStorage()` method to call the matching sub-extractor (S3/Azure/...) conditionally.