Commit 7750852

feat: add archive index page generation with comprehensive metadata
Add --create-archive flag to generate organized index pages linking all downloaded posts.

Features:
- Archive pages in HTML/Markdown/Text formats matching post format
- Post metadata: titles, publication/download dates, descriptions, cover images
- Automatic sorting by publication date (newest first)
- Enhanced post extraction for subtitle (.subtitle) and cover image (og:image)
- Integration with single post and bulk download workflows
- Comprehensive test coverage (30+ new test cases)
- Complete documentation and technical specifications

Usage: sbstck-dl download --url https://example.substack.com --create-archive

Generated files: index.{html|md|txt} in output directory root
1 parent 81844b2 commit 7750852

10 files changed: +1398 -3 lines changed
.gitignore

Lines changed: 1 addition & 1 deletion
@@ -29,4 +29,4 @@ test-download/
 .vscode/
 
 # serena
-cache/
+.serena/cache/

.serena/.gitignore

Lines changed: 1 addition & 0 deletions
@@ -0,0 +1 @@
+/cache

CLAUDE.md

Lines changed: 31 additions & 1 deletion
@@ -3,7 +3,7 @@
 This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
 
 ## Project Overview
-This is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, format conversion (HTML/Markdown/Text), and downloading of images and file attachments locally.
+This is a Go CLI tool for downloading posts from Substack blogs. It supports downloading individual posts or entire archives, with features for private newsletters (via cookies), rate limiting, format conversion (HTML/Markdown/Text), downloading of images and file attachments locally, and creating archive index pages that link all downloaded posts with their metadata.
 
 ## Architecture
 The project follows a standard Go CLI structure:
@@ -49,8 +49,11 @@ go mod download
 
 ### Extractor (`lib/extractor.go`)
 - Parses Substack post JSON from HTML
+- Extracts post metadata including subtitle (.subtitle CSS selector) and cover image (og:image meta tag)
 - Converts HTML to Markdown/Text using external libraries
 - Handles file writing with different formats
+- Provides archive page generation functionality (HTML/Markdown/Text formats)
+- Manages archive entries with automatic sorting by publication date (newest first)
 
 ### Image Downloader (`lib/images.go`)
 - Downloads images locally from Substack posts
@@ -67,6 +70,15 @@ go mod download
 - Handles filename sanitization and collision avoidance
 - Integrates with existing image download workflow
 
+### Archive Page Generator (`lib/extractor.go`)
+- Creates index pages linking all downloaded posts with metadata
+- Supports HTML, Markdown, and Text formats matching the selected output format
+- Includes post titles (linked to downloaded files with relative paths)
+- Shows publication dates and download timestamps
+- Displays post descriptions/subtitles and cover images when available
+- Automatically sorts posts by publication date (newest first)
+- Generates `index.{format}` in the output directory root
+
 ### Commands Structure
 Uses Cobra framework:
 - `download`: Main functionality for downloading posts
@@ -120,6 +132,24 @@ go run . download --url https://example.substack.com --download-files --files-di
 go run . download --url https://example.substack.com/p/post-title --download-images --download-files --output ./downloads
 ```
 
+### Creating archive index pages
+```bash
+# Download posts and create an archive index page
+go run . download --url https://example.substack.com --create-archive --output ./downloads
+
+# Download entire archive with archive index in markdown format
+go run . download --url https://example.substack.com --create-archive --format md --output ./downloads
+
+# Download single post with archive page (useful for building up an archive over time)
+go run . download --url https://example.substack.com/p/post-title --create-archive --output ./downloads
+
+# Download with all features: images, files, and archive page
+go run . download --url https://example.substack.com --download-images --download-files --create-archive --output ./downloads
+
+# Download archive with specific format and custom directories
+go run . download --url https://example.substack.com --create-archive --format html --images-dir assets --files-dir attachments --output ./downloads
+```
+
 ### Building for release
 ```bash
 go build -ldflags="-s -w" -o sbstck-dl .

README.md

Lines changed: 65 additions & 0 deletions
@@ -60,6 +60,7 @@ Usage:
 
 Flags:
       --add-source-url     Add the original post URL at the end of the downloaded file
+      --create-archive     Create an archive index page linking all downloaded posts
       --download-files     Download file attachments locally and update content to reference local files
       --download-images    Download images locally and update content to reference local files
   -d, --dry-run            Enable dry run
@@ -181,6 +182,68 @@ output/
 └── presentation.pptx
 ```
 
+#### Creating Archive Index Pages
+
+Use the `--create-archive` flag to generate an organized index page that links all downloaded posts with their metadata. This creates a beautiful overview of your downloaded content, making it easy to browse and access your Substack archive.
+
+**Features:**
+- Creates `index.{format}` file matching your selected output format (HTML/Markdown/Text)
+- Links to all downloaded posts using relative file paths
+- Displays post titles, publication dates, and download timestamps
+- Shows post descriptions/subtitles and cover images when available
+- Automatically sorts posts by publication date (newest first)
+- Works with both single post and bulk downloads
+
+**Examples:**
+
+```bash
+# Download entire archive and create index page
+sbstck-dl download --url https://example.substack.com --create-archive
+
+# Create archive index in Markdown format
+sbstck-dl download --url https://example.substack.com --create-archive --format md
+
+# Build archive over time with single posts
+sbstck-dl download --url https://example.substack.com/p/post-title --create-archive
+
+# Complete download with all features
+sbstck-dl download --url https://example.substack.com --download-images --download-files --create-archive
+
+# Custom directory structure with archive
+sbstck-dl download --url https://example.substack.com --create-archive --images-dir assets --files-dir attachments
+```
+
+**Archive Content Per Post:**
+- **Title**: Clickable link to the downloaded post file
+- **Publication Date**: When the post was originally published on Substack
+- **Download Date**: When you downloaded the post locally
+- **Description**: Post subtitle or description (when available)
+- **Cover Image**: Featured image from the post (when available)
+
+**Archive Format Examples:**
+
+*HTML Format:* Styled webpage with images, organized post cards, and hover effects
+*Markdown Format:* Clean markdown with headers, links, and image references
+*Text Format:* Plain text listing with all metadata for maximum compatibility
+
+**Directory Structure with Archive:**
+```
+output/
+├── index.html                          # Archive index page
+├── 20231201_120000_post-title.html
+├── 20231115_090000_another-post.html
+├── images/
+│   ├── post-title/
+│   │   └── image1_1456x819.jpeg
+│   └── another-post/
+│       └── image2_848x636.png
+└── files/
+    ├── post-title/
+    │   └── document.pdf
+    └── another-post/
+        └── spreadsheet.xlsx
+```
+
 ### Listing posts
 
 ```bash
@@ -223,6 +286,8 @@ sbstck-dl download --url https://example.substack.com --cookie_name substack.sid
 - [x] Improve retry logic
 - [ ] Implement loading from config file
 - [x] Add support for downloading images
+- [x] Add support for downloading file attachments
+- [x] Add archive index page functionality
 - [x] Add tests
 - [x] Add CI
 - [x] Add documentation

cmd/download.go

Lines changed: 43 additions & 0 deletions
@@ -26,12 +26,19 @@ var (
 	downloadFiles bool
 	fileExtensions string
 	filesDir string
+	createArchive bool
 	downloadCmd = &cobra.Command{
 		Use: "download",
 		Short: "Download individual posts or the entire public archive",
 		Long: `You can provide the url of a single post or the main url of the Substack you want to download.`,
 		Run: func(cmd *cobra.Command, args []string) {
 			startTime := time.Now()
+
+			// Create archive instance if flag is set
+			var archive *lib.Archive
+			if createArchive {
+				archive = lib.NewArchive()
+			}
 
 			// if url contains "/p/", we are downloading a single post
 			if strings.Contains(downloadUrl, "/p/") {
@@ -80,6 +87,11 @@ var (
 					}
 				}
 
+				// Add to archive if enabled
+				if archive != nil {
+					archive.AddEntry(post, path, startTime)
+				}
+
 				if verbose {
 					fmt.Println("Done in ", time.Since(startTime))
 				}
@@ -166,12 +178,42 @@ var (
 							log.Printf("Error writing file %s: %v\n", path, err)
 						}
 					}
+
+					// Add to archive if enabled and post was successfully written
+					if archive != nil {
+						archive.AddEntry(post, path, time.Now())
+					}
 				}
 				if verbose {
 					fmt.Println("Downloaded", downloadedPostsCount, "posts, out of", len(urls))
 					fmt.Println("Done in ", time.Since(startTime))
 				}
 			}
+
+			// Generate archive page if enabled
+			if archive != nil && len(archive.Entries) > 0 {
+				if verbose {
+					fmt.Printf("Generating archive page in %s format...\n", format)
+				}
+
+				var archiveErr error
+				switch format {
+				case "html":
+					archiveErr = archive.GenerateHTML(outputFolder)
+				case "md":
+					archiveErr = archive.GenerateMarkdown(outputFolder)
+				case "txt":
+					archiveErr = archive.GenerateText(outputFolder)
+				default:
+					archiveErr = fmt.Errorf("unknown format for archive: %s", format)
+				}
+
+				if archiveErr != nil {
+					log.Printf("Error generating archive page: %v\n", archiveErr)
+				} else if verbose {
+					fmt.Printf("Archive page generated: %s/index.%s\n", outputFolder, format)
+				}
+			}
 		},
 	}
 )
@@ -188,6 +230,7 @@ func init() {
 	downloadCmd.Flags().BoolVar(&downloadFiles, "download-files", false, "Download file attachments locally and update content to reference local files")
 	downloadCmd.Flags().StringVar(&fileExtensions, "file-extensions", "", "Comma-separated list of file extensions to download (e.g., 'pdf,docx,txt'). If empty, downloads all file types")
 	downloadCmd.Flags().StringVar(&filesDir, "files-dir", "files", "Directory name for downloaded file attachments")
+	downloadCmd.Flags().BoolVar(&createArchive, "create-archive", false, "Create an archive index page linking all downloaded posts")
 	downloadCmd.MarkFlagRequired("url")
 }
