Commit c1ca078

feat: add local image downloading functionality

- Add --download-images flag to download images locally with posts
- Add --image-quality flag with high/medium/low options (1456px/848px/424px)
- Add --images-dir flag to customize image directory name
- Support all Substack CDN patterns (substackcdn.com, substack-post-media.s3.amazonaws.com, legacy bucketeer)
- Automatically update HTML/Markdown content to reference local image paths
- Create organized directory structure: {output}/images/{post-slug}/
- Generate filesystem-safe filenames while preserving uniqueness
- Graceful error handling for individual image download failures
- Comprehensive test suite with real Substack HTML integration tests
- Full backwards compatibility maintained

BREAKING CHANGE: None - all existing functionality preserved
1 parent d7d38ec commit c1ca078
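
Reviewer note: the commit message names three CDN host families, but `lib/images.go` is not shown in this view. The sketch below is one way such host matching could look. The function name `isSubstackImageURL` and the exact matching rules (in particular the `bucketeer-` prefix for the legacy hosts) are assumptions, not the committed implementation.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// isSubstackImageURL reports whether raw points at one of the CDN host
// families listed in the commit message. Hypothetical helper: the real
// patterns in lib/images.go may differ.
func isSubstackImageURL(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil {
		return false
	}
	host := strings.ToLower(u.Hostname())
	switch {
	case host == "substackcdn.com", strings.HasSuffix(host, ".substackcdn.com"):
		return true
	case host == "substack-post-media.s3.amazonaws.com":
		return true
	case strings.HasPrefix(host, "bucketeer-") && strings.HasSuffix(host, ".s3.amazonaws.com"):
		// Assumed shape of the "legacy bucketeer" bucket hosts.
		return true
	}
	return false
}

func main() {
	for _, raw := range []string{
		"https://substackcdn.com/image/fetch/w_1456/photo.png",
		"https://substack-post-media.s3.amazonaws.com/public/images/a.png",
		"https://example.com/a.png",
	} {
		fmt.Printf("%-5t %s\n", isSubstackImageURL(raw), raw)
	}
}
```

Matching on the parsed hostname rather than on a substring of the whole URL avoids false positives when a CDN name appears only in a path or query string.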

7 files changed: +1165 -20 lines changed

CLAUDE.md

Lines changed: 19 additions & 0 deletions
@@ -50,6 +50,13 @@ go mod download
 - Converts HTML to Markdown/Text using external libraries
 - Handles file writing with different formats
 
+### Image Downloader (`lib/images.go`)
+- Downloads images locally from Substack posts
+- Supports multiple image quality levels (high/medium/low)
+- Handles various Substack CDN URL patterns
+- Updates HTML/Markdown content to reference local image paths
+- Creates organized directory structure for downloaded images
+
 ### Commands Structure
 Uses Cobra framework:
 - `download`: Main functionality for downloading posts
@@ -76,6 +83,18 @@ go run . download --url https://example.substack.com --output ./downloads
 go run . download --url https://example.substack.com --verbose --dry-run
 ```
 
+### Downloading posts with images
+```bash
+# Download posts with high-quality images
+go run . download --url https://example.substack.com --download-images --image-quality high --output ./downloads
+
+# Download with medium quality images and custom images directory
+go run . download --url https://example.substack.com --download-images --image-quality medium --images-dir assets --output ./downloads
+
+# Download single post with images in markdown format
+go run . download --url https://example.substack.com/p/post-title --download-images --format md --output ./downloads
+```
+
 ### Building for release
 ```bash
 go build -ldflags="-s -w" -o sbstck-dl .

README.md

Lines changed: 53 additions & 7 deletions
@@ -59,12 +59,15 @@ Usage:
   sbstck-dl download [flags]
 
 Flags:
-      --add-source-url   Add the original post URL at the end of the downloaded file
-  -d, --dry-run          Enable dry run
-  -f, --format string    Specify the output format (options: "html", "md", "txt" (default "html")
-  -h, --help             help for download
-  -o, --output string    Specify the download directory (default ".")
-  -u, --url string       Specify the Substack url
+      --add-source-url         Add the original post URL at the end of the downloaded file
+      --download-images        Download images locally and update content to reference local files
+  -d, --dry-run                Enable dry run
+  -f, --format string          Specify the output format (options: "html", "md", "txt" (default "html")
+  -h, --help                   help for download
+      --image-quality string   Image quality to download (options: "high", "medium", "low") (default "high")
+      --images-dir string      Directory name for downloaded images (default "images")
+  -o, --output string          Specify the download directory (default ".")
+  -u, --url string             Specify the Substack url
 
 Global Flags:
       --after string   Download posts published after this date (format: YYYY-MM-DD)
@@ -84,6 +87,49 @@ If you use the `--add-source-url` flag, each downloaded file will have the follo
 
 Where `POST_URL` is the canonical URL of the downloaded post. For HTML format, this will be wrapped in a small paragraph with a link.
 
+#### Downloading Images
+
+Use the `--download-images` flag to download all images from Substack posts locally. This ensures posts remain accessible even if images are deleted from Substack's CDN.
+
+**Features:**
+- Downloads images at optimal quality (high/medium/low)
+- Creates organized directory structure: `{output}/images/{post-slug}/`
+- Updates HTML/Markdown content to reference local image paths
+- Handles all Substack image formats and CDN patterns
+- Graceful error handling for individual image failures
+
+**Examples:**
+
+```bash
+# Download posts with high-quality images (default)
+sbstck-dl download --url https://example.substack.com --download-images
+
+# Download with medium quality images
+sbstck-dl download --url https://example.substack.com --download-images --image-quality medium
+
+# Download with custom images directory name
+sbstck-dl download --url https://example.substack.com --download-images --images-dir assets
+
+# Download single post with images in markdown format
+sbstck-dl download --url https://example.substack.com/p/post-title --download-images --format md
+```
+
+**Image Quality Options:**
+- `high`: 1456px width (best quality, larger files)
+- `medium`: 848px width (balanced quality/size)
+- `low`: 424px width (smaller files, mobile-optimized)
+
+**Directory Structure:**
+```
+output/
+├── 20231201_120000_post-title.html
+└── images/
+    └── post-title/
+        ├── image1_1456x819.jpeg
+        ├── image2_848x636.png
+        └── image3_1272x720.webp
+```
+
 ### Listing posts
 
 ```bash
@@ -125,7 +171,7 @@ sbstck-dl download --url https://example.substack.com --cookie_name substack.sid
 
 - [x] Improve retry logic
 - [ ] Implement loading from config file
-- [ ] Add support for downloading media
+- [x] Add support for downloading images
 - [x] Add tests
 - [x] Add CI
 - [x] Add documentation

cmd/download.go

Lines changed: 38 additions & 12 deletions
@@ -15,12 +15,15 @@ import (
 
 // downloadCmd represents the download command
 var (
-	downloadUrl  string
-	format       string
-	outputFolder string
-	dryRun       bool
-	addSourceURL bool
-	downloadCmd  = &cobra.Command{
+	downloadUrl    string
+	format         string
+	outputFolder   string
+	dryRun         bool
+	addSourceURL   bool
+	downloadImages bool
+	imageQuality   string
+	imagesDir      string
+	downloadCmd    = &cobra.Command{
 		Use:   "download",
 		Short: "Download individual posts or the entire public archive",
 		Long:  `You can provide the url of a single post or the main url of the Substack you want to download.`,
@@ -54,9 +57,19 @@ var (
 					fmt.Printf("Writing post to file %s\n", path)
 				}
 
-				err = post.WriteToFile(path, format, addSourceURL)
-				if err != nil {
-					log.Printf("Error writing file %s: %v\n", path, err)
+				if downloadImages {
+					imageQualityEnum := lib.ImageQuality(imageQuality)
+					imageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, fetcher)
+					if err != nil {
+						log.Printf("Error writing file %s: %v\n", path, err)
+					} else if verbose && imageResult.Success > 0 {
+						fmt.Printf("Downloaded %d images (%d failed) for post %s\n", imageResult.Success, imageResult.Failed, post.Slug)
+					}
+				} else {
+					err = post.WriteToFile(path, format, addSourceURL)
+					if err != nil {
+						log.Printf("Error writing file %s: %v\n", path, err)
+					}
 				}
 
 				if verbose {
@@ -126,9 +139,19 @@ var (
 						fmt.Printf("Writing post to file %s\n", path)
 					}
 
-					err = post.WriteToFile(path, format, addSourceURL)
-					if err != nil {
-						log.Printf("Error writing file %s: %v\n", path, err)
+					if downloadImages {
+						imageQualityEnum := lib.ImageQuality(imageQuality)
+						imageResult, err := post.WriteToFileWithImages(ctx, path, format, addSourceURL, downloadImages, imageQualityEnum, imagesDir, fetcher)
+						if err != nil {
+							log.Printf("Error writing file %s: %v\n", path, err)
+						} else if verbose && imageResult.Success > 0 {
+							fmt.Printf("Downloaded %d images (%d failed) for post %s\n", imageResult.Success, imageResult.Failed, post.Slug)
+						}
+					} else {
+						err = post.WriteToFile(path, format, addSourceURL)
+						if err != nil {
+							log.Printf("Error writing file %s: %v\n", path, err)
+						}
 					}
 				}
 				if verbose {
@@ -146,6 +169,9 @@ func init() {
 	downloadCmd.Flags().StringVarP(&outputFolder, "output", "o", ".", "Specify the download directory")
 	downloadCmd.Flags().BoolVarP(&dryRun, "dry-run", "d", false, "Enable dry run")
 	downloadCmd.Flags().BoolVar(&addSourceURL, "add-source-url", false, "Add the original post URL at the end of the downloaded file")
+	downloadCmd.Flags().BoolVar(&downloadImages, "download-images", false, "Download images locally and update content to reference local files")
+	downloadCmd.Flags().StringVar(&imageQuality, "image-quality", "high", "Image quality to download (options: \"high\", \"medium\", \"low\")")
+	downloadCmd.Flags().StringVar(&imagesDir, "images-dir", "images", "Directory name for downloaded images")
 	downloadCmd.MarkFlagRequired("url")
 }
 
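
Reviewer note: in the hunks above, `lib.ImageQuality(imageQuality)` is a plain type conversion, so within this diff nothing rejects a value such as `--image-quality ultra` before it reaches the downloader. Whether `lib/images.go` validates it later is not visible here. A minimal validation sketch follows, assuming a string-based `ImageQuality` type and the widths listed in the README. `parseImageQuality` and `qualityWidths` are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// ImageQuality stands in for lib.ImageQuality; a string-based type is
// assumed because the diff converts the flag value directly.
type ImageQuality string

// qualityWidths uses the pixel widths documented in the README changes.
var qualityWidths = map[ImageQuality]int{
	"high":   1456,
	"medium": 848,
	"low":    424,
}

// parseImageQuality normalizes the flag value and returns the quality
// together with its target width, or an error for unknown values.
func parseImageQuality(s string) (ImageQuality, int, error) {
	q := ImageQuality(strings.ToLower(strings.TrimSpace(s)))
	width, ok := qualityWidths[q]
	if !ok {
		return "", 0, fmt.Errorf("invalid image quality %q (want high, medium, or low)", s)
	}
	return q, width, nil
}

func main() {
	for _, s := range []string{"high", "medium", "ultra"} {
		q, w, err := parseImageQuality(s)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("%s -> %dpx\n", q, w)
	}
}
```

Running a check like this once, before the download loop, would surface a typo in the flag immediately instead of after posts have started downloading.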

cmd/version.go

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ var versionCmd = &cobra.Command{
 	Short: "Print the version number of sbstck-dl",
 	Long:  `Display the current version of the app.`,
 	Run: func(cmd *cobra.Command, args []string) {
-		fmt.Println("sbstck-dl v0.4.0")
+		fmt.Println("sbstck-dl v0.5.0")
 	},
 }
 

lib/extractor.go

Lines changed: 90 additions & 0 deletions
@@ -127,6 +127,96 @@ func (p *Post) WriteToFile(path string, format string, addSourceURL bool) error
 	return os.WriteFile(path, []byte(content), 0644)
 }
 
+// WriteToFileWithImages writes the Post's content to a file with optional image downloading
+func (p *Post) WriteToFileWithImages(ctx context.Context, path string, format string, addSourceURL bool,
+	downloadImages bool, imageQuality ImageQuality, imagesDir string, fetcher *Fetcher) (*ImageDownloadResult, error) {
+
+	if err := os.MkdirAll(filepath.Dir(path), 0755); err != nil {
+		return nil, err
+	}
+
+	content, err := p.contentForFormat(format, true)
+	if err != nil {
+		return nil, err
+	}
+
+	var imageResult *ImageDownloadResult
+
+	// Download images if requested and format supports it
+	if downloadImages && (format == "html" || format == "md") {
+		outputDir := filepath.Dir(path)
+		imageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality)
+
+		// Only process HTML content for image downloading
+		htmlContent := content
+		if format == "md" {
+			// For markdown, we need to work with the original HTML
+			htmlContent = p.BodyHTML
+		}
+
+		imageResult, err = imageDownloader.DownloadImages(ctx, htmlContent, p.Slug)
+		if err != nil {
+			return nil, fmt.Errorf("failed to download images: %w", err)
+		}
+
+		// Update content based on format
+		if format == "html" {
+			content = imageResult.UpdatedHTML
+			// Re-add title if needed
+			if strings.HasPrefix(content, "<h1>") {
+				// Title already included
+			} else {
+				content = fmt.Sprintf("<h1>%s</h1>\n\n%s", p.Title, imageResult.UpdatedHTML)
+			}
+		} else if format == "md" {
+			// Convert updated HTML to markdown
+			updatedContent, err := mdConverter.ConvertString(imageResult.UpdatedHTML)
+			if err != nil {
+				return nil, fmt.Errorf("failed to convert updated HTML to markdown: %w", err)
+			}
+			content = fmt.Sprintf("# %s\n\n%s", p.Title, updatedContent)
+		}
+	} else if downloadImages && format == "txt" {
+		// For text format, we can't embed images, but we can still download them
+		outputDir := filepath.Dir(path)
+		imageDownloader := NewImageDownloader(fetcher, outputDir, imagesDir, imageQuality)
+
+		imageResult, err = imageDownloader.DownloadImages(ctx, p.BodyHTML, p.Slug)
+		if err != nil {
+			return nil, fmt.Errorf("failed to download images: %w", err)
+		}
+		// Keep original text content since we can't embed images in text format
+	}
+
+	// Add source URL if requested
+	if addSourceURL && p.CanonicalUrl != "" {
+		sourceLine := fmt.Sprintf("\n\noriginal content: %s", p.CanonicalUrl)
+
+		// Adjust formatting slightly for HTML
+		if format == "html" {
+			sourceLine = fmt.Sprintf("<p style=\"margin-top: 2em; font-size: small; color: grey;\">original content: <a href=\"%s\">%s</a></p>", p.CanonicalUrl, p.CanonicalUrl)
+		}
+		content += sourceLine
+	}
+
+	// Write the file
+	if err := os.WriteFile(path, []byte(content), 0644); err != nil {
+		return imageResult, err
+	}
+
+	// Return empty result if no image downloading was performed
+	if imageResult == nil {
+		imageResult = &ImageDownloadResult{
+			Images:      []ImageInfo{},
+			UpdatedHTML: content,
+			Success:     0,
+			Failed:      0,
+		}
+	}
+
+	return imageResult, nil
+}
+
 // PostWrapper wraps a Post object for JSON unmarshaling.
 type PostWrapper struct {
 	Post Post `json:"post"`
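
Reviewer note: `WriteToFileWithImages` relies on `imageResult.UpdatedHTML`, but how `DownloadImages` produces it is not shown in this view. The sketch below illustrates one possible rewriting step, assuming a map from remote URL to local path is already available. `rewriteImageURLs` is a hypothetical helper, and it works on raw strings only: it does not parse HTML and would miss URLs that appear HTML-escaped (for example with `&amp;`) unless those forms are also in the map.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// rewriteImageURLs replaces every occurrence of each remote URL in html
// with its local path. Longer URLs are tried first so that a URL which is
// a prefix of another (e.g. without and with a query string) does not
// shadow the longer one.
func rewriteImageURLs(html string, localPaths map[string]string) string {
	urls := make([]string, 0, len(localPaths))
	for u := range localPaths {
		urls = append(urls, u)
	}
	sort.Slice(urls, func(i, j int) bool {
		if len(urls[i]) != len(urls[j]) {
			return len(urls[i]) > len(urls[j])
		}
		return urls[i] < urls[j]
	})

	pairs := make([]string, 0, 2*len(urls))
	for _, u := range urls {
		pairs = append(pairs, u, localPaths[u])
	}
	// strings.NewReplacer compares old strings in argument order.
	return strings.NewReplacer(pairs...).Replace(html)
}

func main() {
	html := `<img src="https://substackcdn.com/a.png"><img src="https://substackcdn.com/a.png?w=848">`
	fmt.Println(rewriteImageURLs(html, map[string]string{
		"https://substackcdn.com/a.png":       "images/post-title/a.png",
		"https://substackcdn.com/a.png?w=848": "images/post-title/a_848.png",
	}))
}
```

Sorting the keys also makes the output deterministic, since Go map iteration order is randomized.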
