Portage / Routes / Substack → Astro

// Route specification

Substack Astro

A crossing for a Substack publication: newsletter posts, subtitles, images, and footnotes mapped to Astro content collections from the export archive — with subscribe widgets, paywall CTAs, and share buttons stripped out.

Beta portage v0.9.2
Source
Substack export
Method
Archive (HTML + CSV)
Output
Markdown · MDX
Target
Content collections
Spec rev
2026.06 · r1

01Overview

This route carries a Substack publication into an Astro project as content collections backed by Markdown or MDX. Substack has no stable public API, so the source of truth is the export archive — a ZIP of rendered post HTML plus a posts.csv metadata table.

Substack's HTML is heavy with growth chrome — subscribe widgets, share buttons, paywall calls-to-action, comment prompts. The bulk of this route is cleanup: reducing each post to the article itself, then converting to clean Markdown with footnotes intact.

Moves intact

Published posts, subtitles, body content, inline images, footnotes, publish dates, audience (free / paid) flags, and canonical post URLs.

02Export

From Settings → Exports → Create a new export, Substack emails a download link to a ZIP. Portage joins each post's HTML to its metadata row in posts.csv by post_id.

export archivezip
substack-export/
├── posts.csv               ← post_id, title, subtitle, post_date,  type, audience, is_published, email_sent
├── posts/
│   ├── 148210.leaving-the-feed.html
│   └── … one HTML file per post
└── email_list.csv          ← subscribers (out of scope — §07)
extract · archivebash
$ npx portage extract --from substack \
    --export ./substack-export.zip --to ./astro-project
→ 88 posts · 84 web · 4 paid (truncated) · 152 images referenced
Note

The export is authoritative and offline-friendly. Unpublished drafts are included in posts.csv with is_published=false and skipped unless you pass --include-drafts.

03Content mapping

Metadata comes from posts.csv; the body comes from the matching HTML file. The filename pattern {post_id}.{slug}.html yields the slug.

Substack fieldSourceAstro frontmatterNotes
titlecsvtitleRequired.
{slug}filename(filename)From {post_id}.{slug}.html.
subtitlecsvdescriptionSubstack's standfirst.
(post body)html(body)Cleaned & converted — see §04.
post_datecsvpubDateCoerced by zod.
audiencecsvaccesseveryone → public; only_paid·founding → members.
typecsvtypenewsletter · podcast · thread.
is_publishedcsvdraftfalsedraft: true.
(first figure image)htmlheroImageDerived — Substack has no feature-image field.
canonicalderivedcanonicalURL{publication}/p/{slug}.

04Content transforms

This is the heavy lifting. Substack wraps articles in .available-content markup interleaved with growth widgets. Portage strips the chrome, keeps the article, converts to Markdown, and preserves footnotes.

Substack elementResult
subscribe widgetRemoved
share / "leave a comment" buttonsRemoved
"Thanks for reading… Subscribe" footerRemoved
paywall marker (.paywall)Boundary recorded; body flagged if truncated
captioned image (.captioned-image)![alt](src) + caption / <Figure>
footnotes (.footnote-anchor)Markdown footnotes [^1]
pullquote / blockquoteBlockquote
tweet / YouTube / embedded postLink fallback / <Embed>
Stripping CTAs

--strip-cta (default on) removes Substack's subscribe / share / comment chrome. Disable it with --strip-cta=false if you want to triage the widgets by hand.

05Assets

Substack images are referenced through a CDN fetch proxy that wraps the real source URL — for example substackcdn.com/image/fetch/w_1456,…/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2F…png. Portage decodes the embedded origin, downloads the original, and rewrites the path.

  • Decode the proxy URL — the URL-encoded S3 origin after /fetch/…/ is extracted; transform prefixes (w_1456,c_limit,f_auto…) are dropped.
  • Download & dedupe — originals are content-hashed; duplicates collapse to one file.
  • Rewrite references — hero and in-body images point at local paths; Astro regenerates responsive sizes.
Substack-specific

Parsing the nested, URL-encoded proxy origin is the step that trips up most ad-hoc Substack scripts. Portage handles it as a first-class transform.

06Routes & redirects

Substack publishes every post under /p/{slug}. Portage preserves old links with redirects while moving to clean Astro routes.

Substack routeAstro routeHandled by
/p/{slug}/{slug}/Post collection + 301 from /p/{slug}
/archive/Index / blog listing
/feed/rss.xml@astrojs/rss + redirect
{domain}/p/{slug}/{slug}/Custom-domain mapping
  • Keep the prefix — preserve /p/{slug} instead with --route-base p.
  • Trailing slashes & sitemaptrailingSlash: 'always' and @astrojs/sitemap set to the new base URL.

07Out of scope

Substack is a newsletter, payments, and community platform. Portage moves the writing; the platform machinery stays behind, and it tells you what it skipped.

  • Subscribersemail_list.csv is carried for portability, but sign-up and auth are yours to rebuild.
  • Paid subscriptions & Stripe — billing relationships can't be migrated.
  • Email delivery — past issues become posts; sending does not.
  • Comments & Notes — not meaningfully exported.
  • Podcast audio — episodes link out; audio files aren't rehosted by default.
  • Recommendations & network — Substack-specific discovery features.
Paid posts

For only_paid / founding posts, the export often contains only the free preview above the paywall. Portage writes what's present, sets access: members + draft: true, and flags any truncated body so you can re-supply the full text.

08Edge cases

  • Paywalled posts truncated in export — flagged and gated, never published by accident.
  • Post typepodcast and thread rows are skipped unless --include-podcasts / --include-threads.
  • Footnotes — Substack anchors convert to Markdown footnotes.
  • Authorsposts.csv lacks reliable per-post authorship; defaults to the publication, overridable via --author-map.
  • Draftsis_published=false rows carry draft: true, included only with --include-drafts.
  • Image proxy URLs — nested URL-encoded S3 origins are decoded before download.
  • Embeds — tweets and videos fall back to links, or MDX <Embed> components.

09Output

A predictable, buildable Astro project, with a manifest ledger at the root.

project treeoutput
astro-project/
├── src/
│   ├── content/
│   │   └── blog/             ← 84 posts (4 paid gated)
│   ├── assets/blog/          ← 131 localized images
│   ├── components/portage/   ← MDX stubs (if --content mdx)
│   └── content.config.ts
├── public/_redirects        ← 88 /p/{slug} redirects
├── members.csv              ← exported subscribers (reference)
├── portage.manifest.json    ← extract / transform / load ledger
└── astro.config.mjs         ← trailingSlash · sitemap · redirects

Content collection schema

src/content.config.tstypescript
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

const blog = defineCollection({
  loader: glob({ pattern: '**/*.{md,mdx}', base: './src/content/blog' }),
  schema: ({ image }) => z.object({
    title: z.string(),
    description: z.string().optional(),
    pubDate: z.coerce.date(),
    heroImage: image().optional(),
    access: z.enum(['public', 'members']).default('public'),
    type: z.enum(['newsletter', 'podcast', 'thread']).default('newsletter'),
    draft: z.boolean().default(false),
    canonicalURL: z.string().url().optional(),
  }),
});

export const collections = { blog };

Sample migrated post

src/content/blog/leaving-the-feed.mdmarkdown
---
title: "Leaving the Feed"
description: "On owning the publish button again."
pubDate: 2026-02-11
heroImage: ../../assets/blog/leaving-the-feed.jpg
access: public
type: newsletter
canonicalURL: "https://example.substack.com/p/leaving-the-feed"
---

The newsletter went out fine.[^1] The platform was the problem…

[^1]: 11,400 sends, 64% open rate — numbers I now keep myself.

10CLI

Three stages, run in order. Cleanup runs at transform time.

portage · substack → astrobash
$ npx portage extract --from substack --export ./substack-export.zip --to ./astro-project
$ npx portage transform --schema content-collections --content mdx --strip-cta
$ npx portage load --images assets --redirects netlify
FlagValuesDefaultPurpose
--exportpathSubstack export ZIP. Required.
--strip-ctaflagonRemove subscribe / share / comment chrome.
--include-draftsflagoffCarry unpublished rows.
--include-podcasts · --include-threadsflagoffBring non-newsletter post types.
--author-mappathAssign per-post authors.
--route-basestring/Flatten or keep /p/.
--dry-runflagoffPlan & diff only.

11Verification

Every crossing is auditable. The dry-run is where paid-post truncation and skipped types surface — before anything is written.

  • Dry-run first--dry-run prints the plan and a diff; nothing is written.
  • Manifest ledger — records each post_id, its checksum, transforms applied, and whether the body was truncated at a paywall.
  • Counted on, counted off — gated, skipped, and truncated posts are listed explicitly.
portage load --dry-runbash
$ npx portage load --dry-run
→ 88 posts → 84 files          ✓ (2 threads, 2 podcasts skipped)
→ 152 images → 131 unique      ✓ proxy URLs decoded
→ 88 redirects from /p/{slug}  ✓
→ 4 paid posts gated           ⚠ access: members · bodies truncated
→ 0 unresolved references      ✓ nothing left on the dock