// Route specification
Substack → Astro
A crossing for a Substack publication: newsletter posts, subtitles, images, and footnotes mapped to Astro content collections from the export archive — with subscribe widgets, paywall CTAs, and share buttons stripped out.
01 Overview
This route carries a Substack publication into an Astro project as content collections backed by Markdown or MDX. Substack has no stable public API, so the source of truth is the export archive — a ZIP of rendered post HTML plus a posts.csv metadata table.
Substack's HTML is heavy with growth chrome — subscribe widgets, share buttons, paywall calls-to-action, comment prompts. The bulk of this route is cleanup: reducing each post to the article itself, then converting to clean Markdown with footnotes intact.
Carried across: published posts, subtitles, body content, inline images, footnotes, publish dates, audience (free / paid) flags, and canonical post URLs.
02 Export
Request the archive under Settings → Exports → Create a new export; Substack emails a download link to a ZIP. Portage joins each post's HTML to its metadata row in posts.csv by post_id.
substack-export/
├── posts.csv            ← post_id, title, subtitle, post_date,
│                          type, audience, is_published, email_sent
├── posts/
│   ├── 148210.leaving-the-feed.html
│   └── … one HTML file per post
└── email_list.csv       ← subscribers (out of scope, see §07)

$ npx portage extract --from substack \
    --export ./substack-export.zip --to ./astro-project
→ 88 posts · 84 web · 4 paid (truncated) · 152 images referenced
The export is authoritative and offline-friendly. Unpublished drafts are included in posts.csv with is_published=false and skipped unless you pass --include-drafts.
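The join by post_id can be sketched roughly as follows. This is a minimal illustration, not Portage's actual code: the `PostRow` and `JoinedPost` shapes and both function names are hypothetical, and it assumes the post_id column holds the numeric prefix of the {post_id}.{slug}.html filename.

```typescript
// Hypothetical shapes: Portage's internal types are not documented in this spec.
interface PostRow {
  post_id: string;
  title: string;
  is_published: boolean;
}

interface JoinedPost {
  row: PostRow;
  slug: string;
  htmlFile: string;
}

// Split "148210.leaving-the-feed.html" into its post_id and slug parts.
function parsePostFilename(name: string): { postId: string; slug: string } | null {
  const match = /^(\d+)\.(.+)\.html$/.exec(name);
  return match ? { postId: match[1], slug: match[2] } : null;
}

// Join CSV rows to HTML files on post_id. Rows with no matching file
// are returned separately so they can be reported rather than lost.
function joinPosts(
  rows: PostRow[],
  files: string[],
): { joined: JoinedPost[]; missing: PostRow[] } {
  const byId = new Map<string, { slug: string; htmlFile: string }>();
  for (const file of files) {
    const parsed = parsePostFilename(file);
    if (parsed) byId.set(parsed.postId, { slug: parsed.slug, htmlFile: file });
  }
  const joined: JoinedPost[] = [];
  const missing: PostRow[] = [];
  for (const row of rows) {
    const hit = byId.get(row.post_id);
    if (hit) joined.push({ row, slug: hit.slug, htmlFile: hit.htmlFile });
    else missing.push(row);
  }
  return { joined, missing };
}
```

Keeping unmatched rows in a separate `missing` list mirrors the route's habit of reporting what it skipped instead of dropping it quietly.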
03 Content mapping
Metadata comes from posts.csv; the body comes from the matching HTML file. The filename pattern {post_id}.{slug}.html yields the slug.
| Substack field | Source | | Astro frontmatter | Notes |
|---|---|---|---|---|
| title | csv | → | title | Required. |
| {slug} | filename | → | (filename) | From {post_id}.{slug}.html. |
| subtitle | csv | → | description | Substack's standfirst. |
| (post body) | html | → | (body) | Cleaned & converted — see §04. |
| post_date | csv | → | pubDate | Coerced by zod. |
| audience | csv | → | access | everyone → public; only_paid·founding → members. |
| type | csv | → | type | newsletter · podcast · thread. |
| is_published | csv | → | draft | false → draft: true. |
| (first figure image) | html | → | heroImage | Derived — Substack has no feature-image field. |
| canonical | derived | → | canonicalURL | {publication}/p/{slug}. |
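As a rough sketch of the mapping in the table, the function below turns one posts.csv row into frontmatter. The `CsvRow` and `Frontmatter` types and the `toFrontmatter` name are hypothetical; the sketch also assumes CSV cells arrive as text and that any audience value other than everyone should be gated, which is a conservative guess rather than documented behavior.

```typescript
type Access = "public" | "members";

// Hypothetical row shape using the posts.csv columns named in the table.
interface CsvRow {
  title: string;
  subtitle: string;
  post_date: string;
  audience: string; // "everyone", "only_paid" or "founding"
  type: string;
  is_published: string; // CSV cells arrive as text: "true" / "false"
}

interface Frontmatter {
  title: string;
  description?: string;
  pubDate: string;
  access: Access;
  type: string;
  draft: boolean;
  canonicalURL: string;
}

function toFrontmatter(row: CsvRow, slug: string, publication: string): Frontmatter {
  // Assumption: unknown audience values are gated rather than published.
  const access: Access = row.audience === "everyone" ? "public" : "members";
  const fm: Frontmatter = {
    title: row.title,
    pubDate: row.post_date, // left as text; z.coerce.date() parses it at build time
    access,
    type: row.type,
    draft: row.is_published.toLowerCase() !== "true",
    // Canonical URL follows {publication}/p/{slug}; trailing slashes are trimmed first.
    canonicalURL: `${publication.replace(/\/+$/, "")}/p/${slug}`,
  };
  const subtitle = row.subtitle.trim();
  if (subtitle !== "") fm.description = subtitle;
  return fm;
}
```

Omitting `description` for an empty subtitle keeps the frontmatter compatible with a schema that marks the field optional.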
04 Content transforms
This is the heavy lifting. Substack wraps articles in .available-content markup interleaved with growth widgets. Portage strips the chrome, keeps the article, converts to Markdown, and preserves footnotes.
| Substack element | | Result |
|---|---|---|
| subscribe widget | → | Removed |
| share / "leave a comment" buttons | → | Removed |
| "Thanks for reading… Subscribe" footer | → | Removed |
| paywall marker (.paywall) | → | Boundary recorded; body flagged if truncated |
| captioned image (.captioned-image) | → | Markdown image + caption / <Figure> |
| footnotes (.footnote-anchor) | → | Markdown footnotes [^1] |
| pullquote / blockquote | → | Blockquote |
| tweet / YouTube / embedded post | → | Link fallback / <Embed> |
--strip-cta (default on) removes Substack's subscribe / share / comment chrome. Disable it with --strip-cta=false if you want to triage the widgets by hand.
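To illustrate the footnote row of the table, here is a small sketch that rewrites in-text footnote anchors as Markdown references. It assumes the anchor is an `<a>` element carrying the footnote-anchor class with the footnote number as its text, which may not match every export; `convertFootnoteRefs` is a hypothetical name, and a production transform would more likely walk a parsed HTML tree than use a regular expression.

```typescript
// Assumed anchor markup: <a class="footnote-anchor" ...>1</a>
const FOOTNOTE_ANCHOR =
  /<a\b[^>]*class="[^"]*\bfootnote-anchor\b[^"]*"[^>]*>\s*(\d+)\s*<\/a>/g;

// Replace each in-text footnote anchor with a Markdown reference such as [^1].
// Footnote definitions at the end of the post are not handled here.
function convertFootnoteRefs(html: string): string {
  return html.replace(FOOTNOTE_ANCHOR, "[^$1]");
}
```

Text without a matching anchor passes through unchanged, so the function is safe to run on every post body.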
05 Assets
Substack images are referenced through a CDN fetch proxy that wraps the real source URL — for example substackcdn.com/image/fetch/w_1456,…/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2F…png. Portage decodes the embedded origin, downloads the original, and rewrites the path.
- Decode the proxy URL: the URL-encoded S3 origin after /fetch/…/ is extracted; transform prefixes (w_1456,c_limit,f_auto…) are dropped.
- Download & dedupe: originals are content-hashed; duplicates collapse to one file.
- Rewrite references: hero and in-body images point at local paths; Astro regenerates responsive sizes.
Parsing the nested, URL-encoded proxy origin is the step that trips up most ad-hoc Substack scripts. Portage handles it as a first-class transform.
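The decode step can be sketched in a few lines. This is an illustration under assumptions, not Portage's implementation: `decodeProxyUrl` is a hypothetical name, and the sketch assumes the origin is always the first URL-encoded http(s):// that follows the /image/fetch/ segment.

```typescript
// Extract the original image URL from a Substack CDN fetch-proxy URL.
// Returns the input unchanged when it does not look like a proxy URL.
function decodeProxyUrl(src: string): string {
  const marker = "/image/fetch/";
  const start = src.indexOf(marker);
  if (start === -1) return src;

  // Everything after the marker: transform prefixes, a slash, then the encoded origin.
  const rest = src.slice(start + marker.length);

  // The origin begins at the first URL-encoded "http://" or "https://".
  const match = /https?%3A%2F%2F.+$/i.exec(rest);
  if (!match) return src;

  try {
    return decodeURIComponent(match[0]);
  } catch {
    // Malformed percent-encoding: leave the URL as-is for manual review.
    return src;
  }
}
```

For a made-up proxy URL ending in `/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fexample.png`, the function returns `https://substack-post-media.s3.amazonaws.com/public/images/example.png`, with the transform prefixes dropped along the way.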
06 Routes & redirects
Substack publishes every post under /p/{slug}. Portage preserves old links with redirects while moving to clean Astro routes.
| Substack route | | Astro route | Handled by |
|---|---|---|---|
| /p/{slug} | → | /{slug}/ | Post collection + 301 from /p/{slug} |
| /archive | → | / | Index / blog listing |
| /feed | → | /rss.xml | @astrojs/rss + redirect |
| {domain}/p/{slug} | → | /{slug}/ | Custom-domain mapping |
- Keep the prefix: preserve /p/{slug} instead with --route-base p.
- Trailing slashes & sitemap: astro.config.mjs sets trailingSlash: 'always' and points @astrojs/sitemap at the new base URL.
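For the default flattened routes, the redirect file can be pictured as one line per post. The sketch below emits the Netlify `_redirects` format (source path, target path, status code); `buildRedirects` is a hypothetical helper, and it covers only the /p/{slug} to /{slug}/ case, not the --route-base p variant.

```typescript
// One Netlify-style redirect line: "<from> <to> <status>".
function redirectLine(slug: string): string {
  return `/p/${slug} /${slug}/ 301`;
}

// Build the body of a _redirects file from post slugs.
// Duplicate slugs collapse to one line; input order is preserved.
function buildRedirects(slugs: string[]): string {
  const unique = Array.from(new Set(slugs));
  return unique.map(redirectLine).join("\n") + "\n";
}
```

The target keeps its trailing slash so the redirect lands directly on the canonical form when trailingSlash is set to 'always'.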
07 Out of scope
Substack is a newsletter, payments, and community platform. Portage moves the writing; the platform machinery stays behind, and it tells you what it skipped.
- Subscribers: email_list.csv is carried for portability, but sign-up and auth are yours to rebuild.
- Paid subscriptions & Stripe: billing relationships can't be migrated.
- Email delivery: past issues become posts; sending does not.
- Comments & Notes: not meaningfully exported.
- Podcast audio: episodes link out; audio files aren't rehosted by default.
- Recommendations & network: Substack-specific discovery features.
For only_paid / founding posts, the export often contains only the free preview above the paywall. Portage writes what's present, sets access: members + draft: true, and flags any truncated body so you can re-supply the full text.
08 Edge cases
- Paywalled posts truncated in export: flagged and gated, never published by accident.
- Post type: podcast and thread rows are skipped unless --include-podcasts / --include-threads.
- Footnotes: Substack anchors convert to Markdown footnotes.
- Authors: posts.csv lacks reliable per-post authorship; defaults to the publication, overridable via --author-map.
- Drafts: is_published=false rows carry draft: true, included only with --include-drafts.
- Image proxy URLs: nested URL-encoded S3 origins are decoded before download.
- Embeds: tweets and videos fall back to links, or MDX <Embed> components.
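The post-type and draft rules above amount to a small decision per row. The following sketch shows one way to express it; the `Row`, `Flags`, and `Decision` types and the `decide` function are hypothetical, with flag names mirroring the CLI options.

```typescript
type PostType = "newsletter" | "podcast" | "thread";

// Hypothetical row and flag shapes, named after the CSV columns and CLI flags.
interface Row {
  post_id: string;
  type: PostType;
  is_published: boolean;
}

interface Flags {
  includeDrafts?: boolean;
  includePodcasts?: boolean;
  includeThreads?: boolean;
}

type Decision =
  | { carry: true; draft: boolean }
  | { carry: false; reason: string };

// Decide whether a row is carried, and if not, record why it was skipped.
function decide(row: Row, flags: Flags = {}): Decision {
  if (row.type === "podcast" && !flags.includePodcasts) {
    return { carry: false, reason: "podcast skipped" };
  }
  if (row.type === "thread" && !flags.includeThreads) {
    return { carry: false, reason: "thread skipped" };
  }
  if (!row.is_published && !flags.includeDrafts) {
    return { carry: false, reason: "draft skipped" };
  }
  // Carried drafts keep draft: true so they are never published by accident.
  return { carry: true, draft: !row.is_published };
}
```

Returning a reason for every skipped row gives the dry-run and the manifest something concrete to list.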
09 Output
A predictable, buildable Astro project, with a manifest ledger at the root.
astro-project/
├── src/
│   ├── content/
│   │   └── blog/              ← 84 posts (4 paid gated)
│   ├── assets/blog/           ← 131 localized images
│   ├── components/portage/    ← MDX stubs (if --content mdx)
│   └── content.config.ts
├── public/_redirects          ← 88 /p/{slug} redirects
├── members.csv                ← exported subscribers (reference)
├── portage.manifest.json      ← extract / transform / load ledger
└── astro.config.mjs           ← trailingSlash · sitemap · redirects
Content collection schema
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.{md,mdx}', base: './src/content/blog' }),
  schema: ({ image }) =>
    z.object({
      title: z.string(),
      description: z.string().optional(),
      pubDate: z.coerce.date(),
      heroImage: image().optional(),
      access: z.enum(['public', 'members']).default('public'),
      type: z.enum(['newsletter', 'podcast', 'thread']).default('newsletter'),
      draft: z.boolean().default(false),
      canonicalURL: z.string().url().optional(),
    }),
});

export const collections = { blog };
Sample migrated post
---
title: "Leaving the Feed"
description: "On owning the publish button again."
pubDate: 2026-02-11
heroImage: ../../assets/blog/leaving-the-feed.jpg
access: public
type: newsletter
canonicalURL: "https://example.substack.com/p/leaving-the-feed"
---

The newsletter went out fine.[^1] The platform was the problem…

[^1]: 11,400 sends, 64% open rate, numbers I now keep myself.
10 CLI
Three stages, run in order. Cleanup runs at transform time.
$ npx portage extract --from substack --export ./substack-export.zip --to ./astro-project
$ npx portage transform --schema content-collections --content mdx --strip-cta
$ npx portage load --images assets --redirects netlify
| Flag | Values | Default | Purpose |
|---|---|---|---|
| --export | path | — | Substack export ZIP. Required. |
| --strip-cta | flag | on | Remove subscribe / share / comment chrome. |
| --include-drafts | flag | off | Carry unpublished rows. |
| --include-podcasts · --include-threads | flag | off | Bring non-newsletter post types. |
| --author-map | path | — | Assign per-post authors. |
| --route-base | string | / | Flatten or keep /p/. |
| --dry-run | flag | off | Plan & diff only. |
11 Verification
Every crossing is auditable. The dry-run is where paid-post truncation and skipped types surface — before anything is written.
- Dry-run first: --dry-run prints the plan and a diff; nothing is written.
- Manifest ledger: records each post_id, its checksum, transforms applied, and whether the body was truncated at a paywall.
- Counted on, counted off: gated, skipped, and truncated posts are listed explicitly.

$ npx portage load --dry-run
→ 88 posts → 84 files              ✓ (2 threads, 2 podcasts skipped)
→ 152 images → 131 unique          ✓ proxy URLs decoded
→ 88 redirects from /p/{slug}      ✓
→ 4 paid posts gated               ⚠ access: members · bodies truncated
→ 0 unresolved references          ✓ nothing left on the dock
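The "counted on, counted off" idea can be illustrated with a small tally over ledger entries. The real layout of portage.manifest.json is not shown in this spec, so the `LedgerEntry` shape and the `tally` function below are purely hypothetical.

```typescript
// Hypothetical ledger entry; not the actual portage.manifest.json schema.
interface LedgerEntry {
  postId: string;
  status: "written" | "skipped";
  gated: boolean;
  truncated: boolean;
}

interface Tally {
  total: number;
  written: number;
  skipped: number;
  gated: number;
  truncated: number;
}

// Count every entry exactly once so that written + skipped always equals total.
function tally(entries: LedgerEntry[]): Tally {
  const t: Tally = { total: entries.length, written: 0, skipped: 0, gated: 0, truncated: 0 };
  for (const entry of entries) {
    if (entry.status === "written") t.written += 1;
    else t.skipped += 1;
    if (entry.gated) t.gated += 1;
    if (entry.truncated) t.truncated += 1;
  }
  return t;
}
```

A ledger matching the dry-run above would tally to 88 total, 84 written, and 4 skipped, with 4 of the written posts gated and truncated.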