
Optimize the OSS extraction process and implement Azure Blob extractor #281

Merged
NGTmeaty merged 9 commits into main from object-storages on May 21, 2025

@yzqzss yzqzss commented May 15, 2025

Collaborator

While fixing #280, I found that Azure Blob is not S3-compatible, so I implemented a separate Azure sub-extractor. Since a given URL can only belong to one type of object storage service (OSS), I also created the ObjectStorage() method, which calls the matching sub-extractor (S3/Azure/...) conditionally.

yzqzss added 2 commits May 15, 2025 23:07
Azure Blob's API is not S3 compatible, so it needs to be implemented separately.
@yzqzss yzqzss force-pushed the object-storages branch from 93fc415 to 2fe37b3 Compare May 15, 2025 15:23
Comment thread internal/pkg/postprocessor/extractor/object_storage.go Outdated
Comment on lines +11 to +19
// Azure Blob Storage
var azureServers = []string{
	// Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0
	"Windows-Azure-Blob",
	// Blob Service Version 1.0 Microsoft-HTTPAPI/2.0
	"Blob Service Version",
	// emulator, https://github.com/Azure/Azurite
	"Azurite-Blob",
}

@yzqzss yzqzss May 15, 2025

Collaborator Author

I have tested Zeno with the Azure extractor locally against Azurite and it works, but I haven't run it on a real Azure Blob yet.

Also, I found that some Azure Blobs use the Server: Blob Service Version XXX... header, and I don't know what the difference is between them and Server: Windows-Azure-Blob....

https://en.fofa.info/result?qbase64=aGVhZGVyPSJXaW5kb3dzLUF6dXJlLUJsb2Ii
https://en.fofa.info/result?qbase64=aGVhZGVyPSJCbG9iIFNlcnZpY2UgVmVyc2lvbiI%3D
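
For readers following along, here is a minimal sketch of how a single entry point could pick a sub-extractor from the `Server` header. It is not the merged code: `backend`, `extractFunc`, `dispatch`, and the stub extractors are hypothetical names, substring matching is an assumption, and `AmazonS3` is used only as an example S3 marker. The Azure markers are the ones quoted in the review comment above.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// extractFunc is a hypothetical sub-extractor signature: it receives a
// listing body and returns the outlinks found in it.
type extractFunc func(body []byte) ([]string, error)

// backend pairs Server header markers with the sub-extractor to call.
type backend struct {
	name    string
	markers []string
	extract extractFunc
}

var errUnsupported = errors.New("unsupported object storage server")

// backends uses stub extractors so the sketch stays self-contained.
var backends = []backend{
	{
		name:    "azure",
		markers: []string{"Windows-Azure-Blob", "Blob Service Version", "Azurite-Blob"},
		extract: func([]byte) ([]string, error) { return []string{"azure"}, nil },
	},
	{
		name:    "s3",
		markers: []string{"AmazonS3"}, // assumption: one example S3 marker
		extract: func([]byte) ([]string, error) { return []string{"s3"}, nil },
	},
}

// dispatch calls the first backend whose marker appears in the Server
// header, and returns an error when no backend matches.
func dispatch(server string, body []byte) ([]string, error) {
	for _, b := range backends {
		for _, m := range b.markers {
			if strings.Contains(server, m) {
				return b.extract(body)
			}
		}
	}
	return nil, fmt.Errorf("%w: %q", errUnsupported, server)
}

func main() {
	for _, server := range []string{
		"Windows-Azure-Blob/1.0 Microsoft-HTTPAPI/2.0",
		"Blob Service Version 1.0 Microsoft-HTTPAPI/2.0",
		"AmazonS3",
		"nginx",
	} {
		links, err := dispatch(server, nil)
		fmt.Println(server, links, err)
	}
}
```

Because a response carries one `Server` header, at most one backend runs per URL, which matches the "one URL, one type of OSS" reasoning in the PR description.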

Comment thread internal/pkg/postprocessor/extractor/object_storage_azure.go Outdated
Comment thread internal/pkg/postprocessor/extractor/object_storage_azure.go Outdated
@yzqzss yzqzss requested a review from Copilot May 15, 2025 15:55

Copilot AI left a comment

Contributor

Pull Request Overview

This PR refactors outlink extraction to support multiple object storage backends by replacing the legacy S3 extraction with a more generic approach. Key changes include updating the extraction logic in outlinks.go, refactoring S3 extraction into a unified s3Compatible helper, and adding Azure extraction support with corresponding tests.

  • Updates log messages and extraction calls from S3 to ObjectStorage.
  • Refactors S3 extraction logic and error handling.
  • Introduces Azure object storage extraction and tests for ObjectStorage detection.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.

Show a summary per file
| File | Description |
| --- | --- |
| internal/pkg/postprocessor/outlinks.go | Replaces legacy S3 extraction with ObjectStorage extraction and updates log messages. |
| internal/pkg/postprocessor/extractor/utils.go | Adds a helper function toURLs for conversion of string slices to URL objects. |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Adds tests verifying object storage header detection. |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Updates tests to use the new s3Compatible helper and removes deprecated tests for IsS3. |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Refactors S3 extraction logic into s3Compatible with improved error wrapping and documentation. |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | Introduces Azure Blob Storage extraction logic with XML parsing of blob listings. |
| internal/pkg/postprocessor/extractor/object_storage.go | Provides a unified ObjectStorage function that delegates to the appropriate helper based on the server header. |
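
As an illustration of the "XML parsing of blob listings" mentioned for object_storage_azure.go, the sketch below decodes an Azure List Blobs response with Go's `encoding/xml`. The type and function names (`listBlobsResult`, `blobEntry`, `parseBlobListing`) and the sample document are hypothetical and trimmed to the fields needed here; the struct in the PR is named differently and may map more fields.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// blobEntry maps a single <Blob> element; only the name is kept.
type blobEntry struct {
	Name string `xml:"Name"`
}

// listBlobsResult is a trimmed, hypothetical view of the
// <EnumerationResults> document returned by a container listing.
type listBlobsResult struct {
	XMLName    xml.Name    `xml:"EnumerationResults"`
	Blobs      []blobEntry `xml:"Blobs>Blob"`
	NextMarker string      `xml:"NextMarker"`
}

// sampleListing is made-up data for the sketch, not a captured response.
const sampleListing = `<?xml version="1.0" encoding="utf-8"?>
<EnumerationResults ContainerName="demo">
  <Blobs>
    <Blob><Name>a.txt</Name></Blob>
    <Blob><Name>dir/b.txt</Name></Blob>
  </Blobs>
  <NextMarker>page-2</NextMarker>
</EnumerationResults>`

// parseBlobListing returns the blob names in the listing and the marker
// for the next page (empty when the listing is complete).
func parseBlobListing(data []byte) ([]string, string, error) {
	var res listBlobsResult
	if err := xml.Unmarshal(data, &res); err != nil {
		return nil, "", fmt.Errorf("decode blob listing: %w", err)
	}
	names := make([]string, 0, len(res.Blobs))
	for _, b := range res.Blobs {
		names = append(names, b.Name)
	}
	return names, res.NextMarker, nil
}

func main() {
	names, next, err := parseBlobListing([]byte(sampleListing))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(names, next)
}
```

A non-empty `NextMarker` would be sent back as the `marker` query parameter to fetch the following page.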

Comment thread internal/pkg/postprocessor/extractor/object_storage_azure.go Outdated
@yzqzss yzqzss force-pushed the object-storages branch from fc99155 to 6ae7e70 Compare May 15, 2025 17:13
@yzqzss yzqzss force-pushed the object-storages branch from 6ae7e70 to fb9c57e Compare May 15, 2025 17:45
@yzqzss yzqzss changed the title from "WIP: Object storages" to "Optimize the OSS extraction process and implement Azure Blob extractor" May 15, 2025
@yzqzss yzqzss marked this pull request as ready for review May 15, 2025 18:02
@yzqzss yzqzss requested a review from Copilot May 15, 2025 18:02

Copilot AI left a comment

Contributor

Pull Request Overview

This PR optimizes the OSS extraction process by unifying S3 extraction logic under a generalized ObjectStorage interface and adds support for Azure Blob Storage extraction.

  • Updated outlinks extraction to use ObjectStorage instead of S3
  • Added new utility functions and tests for both S3-compatible and Azure Blob Storage
  • Removed legacy S3 tests and refactored extraction helper functions for better consistency

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
| File | Description |
| --- | --- |
| internal/pkg/postprocessor/outlinks.go | Updated extractor branch logic and logging to use ObjectStorage |
| internal/pkg/postprocessor/extractor/utils.go | Added a helper function toURLs |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Introduced tests for ObjectStorage extraction checks |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Added tests covering S3-compatible extraction within ObjectStorage |
| internal/pkg/postprocessor/extractor/object_storage_azure_test.go | Added tests for Azure Blob Storage extraction |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Refactored S3 extraction logic into s3Compatible with updated documentation |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | New implementation for Azure Blob Storage extraction |
| internal/pkg/postprocessor/extractor/object_storage.go | Unified the extraction interface for supported object storage servers |

Comments suppressed due to low confidence (1)

internal/pkg/postprocessor/extractor/object_storage_azure.go:24

  • [nitpick] Consider renaming 'AZureBlobEnumerationResults' to 'AzureBlobEnumerationResults' for clarity and consistency with common naming conventions.
type AZureBlobEnumerationResults struct {

@yzqzss yzqzss requested review from NGTmeaty and Copilot May 15, 2025 18:31

Copilot AI left a comment

Contributor

Pull Request Overview

This PR optimizes the OSS extraction process by consolidating S3 extraction logic into a unified ObjectStorage extractor and adding Azure Blob extraction support.

  • Updated outlinks extraction to use ObjectStorage instead of S3 exclusively
  • Added dedicated extraction logic and tests for Azure Blob, and refactored S3 extraction into a single helper (s3Compatible)
  • Introduced a helper function (toURLs) for converting string slices to URL objects

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
| --- | --- |
| internal/pkg/postprocessor/outlinks.go | Updated extraction logic to use ObjectStorage for both S3 and Azure sources |
| internal/pkg/postprocessor/extractor/utils.go | Added helper function toURLs for URL conversion |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests |
| internal/pkg/postprocessor/extractor/object_storage_test.go | Added tests for object storage recognition including Azure |
| internal/pkg/postprocessor/extractor/object_storage_s3_test.go | Added tests for S3 compatible extraction via s3Compatible |
| internal/pkg/postprocessor/extractor/object_storage_azure_test.go | Added tests for Azure Blob extraction |
| internal/pkg/postprocessor/extractor/object_storage_s3.go | Consolidated S3 extraction logic into s3Compatible with legacy and v2 handling |
| internal/pkg/postprocessor/extractor/object_storage_azure.go | Implemented Azure Blob extraction with proper URL construction and error handling |
| internal/pkg/postprocessor/extractor/object_storage.go | Introduced the ObjectStorage dispatcher to choose between S3 and Azure extractors |
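
The "legacy and v2 handling" noted for s3Compatible presumably refers to the two S3 listing APIs: the original ListObjects pages with a `marker` query parameter, while ListObjectsV2 uses `list-type=2` together with `continuation-token`. The sketch below shows only that difference when building the next-page URL. `nextListURL` is a hypothetical helper and the bucket host is made up; how s3Compatible actually paginates is not shown in this thread.

```go
package main

import (
	"fmt"
	"net/url"
)

// nextListURL returns the URL of the next listing page.
// For V2 listings, next is the NextContinuationToken from the response.
// For legacy listings, next is the marker to resume after (for example
// the last key of the current page).
func nextListURL(current string, isV2 bool, next string) (string, error) {
	u, err := url.Parse(current)
	if err != nil {
		return "", fmt.Errorf("parse listing URL: %w", err)
	}
	q := u.Query()
	if isV2 {
		q.Set("list-type", "2")
		q.Set("continuation-token", next)
	} else {
		q.Set("marker", next)
	}
	// Encode sorts the parameters by key and escapes their values.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	v2, _ := nextListURL("https://bucket.example.com/?list-type=2", true, "tok")
	legacy, _ := nextListURL("https://bucket.example.com/", false, "a/b.txt")
	fmt.Println(v2)
	fmt.Println(legacy)
}
```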

Comment thread internal/pkg/postprocessor/extractor/object_storage_azure.go Outdated
Comment thread internal/pkg/postprocessor/extractor/utils.go Outdated
@yzqzss yzqzss requested a review from CorentinB May 15, 2025 18:32
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
CorentinB previously approved these changes May 16, 2025

@CorentinB CorentinB left a comment

Collaborator

LGTM! Excellent work, thanks for your contribution! Ready to merge after @NGTmeaty's review.

Edit: also Copilot made a couple of good suggestions.

@CorentinB CorentinB requested a review from Copilot May 16, 2025 08:47

Copilot AI left a comment

Contributor

Pull Request Overview

This PR refactors the OSS extraction process by consolidating extraction logic into an ObjectStorage method and adds support for Azure Blob extraction alongside S3-compatible extraction. Key changes include updating outlink extraction routing, adding utility and test functions for both Azure and S3-compatible extraction, and removing the legacy S3-specific tests.

Reviewed Changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| internal/pkg/postprocessor/outlinks.go | Updated extractor routing to use ObjectStorage based on URL. |
| internal/pkg/postprocessor/extractor/utils.go | Added a helper to convert []string to []*models.URL. |
| internal/pkg/postprocessor/extractor/object_storage*.go | Implemented and tested Azure Blob and S3-compatible extractors. |
| internal/pkg/postprocessor/extractor/s3_test.go | Removed legacy S3 tests in favor of consolidated ObjectStorage tests. |
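
For context on the `[]string` to `[]*models.URL` helper, a conversion of that shape might look like the sketch below. The `URL` struct here is a hypothetical stand-in with a single `Raw` field; Zeno's real `models.URL` type and the actual body of toURLs are not shown in this thread and may differ.

```go
package main

import "fmt"

// URL is a hypothetical stand-in for models.URL.
type URL struct {
	Raw string
}

// toURLs wraps each raw string in a *URL, preserving order.
func toURLs(raws []string) []*URL {
	urls := make([]*URL, 0, len(raws))
	for _, raw := range raws {
		urls = append(urls, &URL{Raw: raw})
	}
	return urls
}

func main() {
	for _, u := range toURLs([]string{"https://example.com/a", "https://example.com/b"}) {
		fmt.Println(u.Raw)
	}
}
```

Centralizing the conversion lets each sub-extractor collect plain strings and convert once at the end.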

Comment thread internal/pkg/postprocessor/extractor/object_storage.go Outdated
Comment thread internal/pkg/postprocessor/extractor/object_storage_azure.go Outdated
@yzqzss yzqzss force-pushed the object-storages branch from ca282e6 to 7ec440f Compare May 16, 2025 13:06

@NGTmeaty NGTmeaty left a comment

Collaborator

LGTM! Thank you!

@NGTmeaty NGTmeaty merged commit 9ae3569 into main May 21, 2025
4 checks passed
@yzqzss yzqzss deleted the object-storages branch May 23, 2025 16:40