BATCH_SIZE = 200
for i in range(0, len(videos), BATCH_SIZE):
    batch = videos[i : i + BATCH_SIZE]
    prompt = f"""Topics:
{topics_as_numbered_list}
Assign each video to exactly one topic
by number. Return JSON.
Videos:
{format_batch(batch)}"""
    result = claude_haiku(prompt)
    apply_assignments(batch, result)
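The loop above leans on two helpers the slide does not show. A minimal sketch of what they might look like, assuming each video is a dict with a "title" key and that the model replies with a JSON object mapping video number to topic number. Both the reply shape and the extra topic_count parameter are assumptions for illustration, not the talk's actual code.

```python
import json


def format_batch(batch):
    """Render videos as a numbered list, one title per line."""
    return "\n".join(f"{n}. {v['title']}" for n, v in enumerate(batch, start=1))


def apply_assignments(batch, result, topic_count):
    """Copy topic numbers from the model's JSON reply onto each video.

    Assumed reply shape: {"1": 3, "2": 1, ...} (video number -> topic number).
    Returns the videos that got no valid assignment so they can be retried.
    """
    try:
        assignments = json.loads(result)
    except json.JSONDecodeError:
        return list(batch)  # unparseable reply: retry the whole batch
    if not isinstance(assignments, dict):
        return list(batch)

    unassigned = []
    for n, video in enumerate(batch, start=1):
        topic = assignments.get(str(n))
        is_number = isinstance(topic, int) and not isinstance(topic, bool)
        if is_number and 1 <= topic <= topic_count:
            video["topic"] = topic
        else:
            unassigned.append(video)  # missing or out-of-range topic
    return unassigned
```

Returning the leftovers instead of raising lets the caller re-queue only the videos the model skipped or mislabeled.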
Step 3 of 3 — refinement with Sonnet
The model does the surgery too.
"Python & Dev Tools" (400 videos, too broad) → Claude Sonnet → Python Basics · Dev Tooling · Testing & CI · Packaging
Split — break a broad topic into 3–8 subtopics automatically
Rename — suggest better names based on sample video titles
Reclassify — reassign a whole subtree with refined rules
Haiku classifies fast and cheap. Sonnet handles nuance.
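A split proposed by the model is worth checking before it touches the topic tree. The sketch below assumes the split comes back as a mapping from subtopic name to video IDs; that shape and the function name are hypothetical. It enforces the 3–8 subtopic range from the slide and the "exactly one topic" rule from the classification prompt.

```python
def validate_split(parent_video_ids, subtopics):
    """Check a proposed split: 3-8 subtopics, every video in exactly one.

    `subtopics` is assumed to map subtopic name -> list of video IDs.
    Returns a list of problems; an empty list means the split is usable.
    """
    problems = []
    if not 3 <= len(subtopics) <= 8:
        problems.append(f"expected 3-8 subtopics, got {len(subtopics)}")

    seen = {}  # video ID -> subtopic it was last placed in
    for name, ids in subtopics.items():
        for vid in ids:
            if vid in seen:
                problems.append(f"{vid} is in both {seen[vid]!r} and {name!r}")
            seen[vid] = name

    missing = set(parent_video_ids) - set(seen)
    if missing:
        problems.append(f"{len(missing)} videos left unassigned")
    unknown = set(seen) - set(parent_video_ids)
    if unknown:
        problems.append(f"{len(unknown)} videos not in the parent topic")
    return problems
```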
▶ ACT · III
Scraping without a quota.
YouTube gives you 10,000 API units/day. A channel fetch costs 100. Do the math.
YouTube Data API quota
400 creators. 100 fetches a day.
daily budget: 10,000 units
per channel fetch: 100 units
channels per day: 100 max
We follow 400+ creators. Polling all of them daily exhausts the quota before covering a quarter.
Discovery had to work without API calls.
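The math, spelled out with the numbers from the slide. The variable names are only for illustration, and 400 is the lower bound of "400+".

```python
import math

DAILY_BUDGET_UNITS = 10_000
UNITS_PER_CHANNEL_FETCH = 100
CREATORS = 400  # lower bound; the real list is "400+"

channels_per_day = DAILY_BUDGET_UNITS // UNITS_PER_CHANNEL_FETCH
coverage = channels_per_day / CREATORS
days_for_full_pass = math.ceil(CREATORS / channels_per_day)

print(channels_per_day, coverage, days_for_full_pass)  # 100 0.25 4
```

At best a quarter of the list gets polled each day, so a new upload can wait four days to be noticed.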
Discovery architecture · the fallback chain
Four layers. Zero quota.
The decision
😒 YouTube Data API: 100 units/channel · quota gone by 10am
😌 scrapetube + RSS feeds: no API key · no quota · no problem
youtube_channel_fallback.py
Two sources. One merge.
scrapetube: YouTube's internal API, great coverage
+
RSS feed: /feeds/videos.xml, catches Shorts + recents
import scrapetube

def fetch_channel_videos(channel_id):
    scrape_vids = list(scrapetube.get_channel(channel_id))
    rss_vids = fetch_rss(channel_id)
    seen = {}
    for v in rss_vids:      # RSS wins on overlap
        seen[v["videoId"]] = {**v, "source": "rss"}
    for v in scrape_vids:   # scrapetube fills gaps
        seen.setdefault(v["videoId"], {**v, "source": "scrape"})
    return list(seen.values())
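The merge above calls fetch_rss, which the slide does not show. One possible shape, using only the standard library: fetch the channel's public Atom feed and emit dicts keyed by "videoId" so they line up with scrapetube's output. The parse_feed helper, the "title" and "published" keys, and the timeout are assumptions for this sketch.

```python
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
}


def parse_feed(xml_text):
    """Turn the Atom feed into dicts keyed like scrapetube's ("videoId")."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return []  # garbage in, empty list out
    videos = []
    for entry in root.findall("atom:entry", NS):
        video_id = entry.findtext("yt:videoId", namespaces=NS)
        if not video_id:
            continue  # skip entries without an ID
        videos.append({
            "videoId": video_id,
            "title": entry.findtext("atom:title", default="", namespaces=NS),
            "published": entry.findtext("atom:published", default="", namespaces=NS),
        })
    return videos


def fetch_rss(channel_id, timeout=10):
    """Fetch and parse the feed; network failures yield an empty list."""
    url = FEED_URL.format(channel_id=channel_id)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return parse_feed(resp.read())
    except OSError:  # URLError, HTTPError, and timeouts all derive from OSError
        return []
```

Splitting parsing from fetching keeps the parser testable against a saved feed, with no network involved.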
Solution: dual extraction paths + regex chains + HTTP HEAD to expand URLs.
Core design rule: return empty, not error.
def get_channel_links(channel_id):
    try:
        # Path A: og:description meta tag
        # stable across layout changes
        links = extract_from_og_desc(channel_id)
        if links:
            return links
        # Path B: ytInitialData JSON blob
        # future-proofing
        links = extract_from_initial_data(
            channel_id
        )
        return links or []
    except Exception:
        return []  # never crash the caller
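The "regex chains + HTTP HEAD" part of the solution is not shown on the slide. A rough sketch of the idea follows: try patterns in order until one matches, then resolve short links with a HEAD request. The patterns, function names, and the injectable opener parameter are all assumptions made for illustration.

```python
import re
import urllib.request

# Tried in order; the first pattern that matches anything wins.
LINK_PATTERNS = [
    re.compile(r"https?://[^\s\"'<>]+"),   # full URLs
    re.compile(r"\bwww\.[^\s\"'<>]+"),     # bare www. links
]


def extract_links(text):
    """Return de-duplicated links from the first pattern that finds any."""
    for pattern in LINK_PATTERNS:
        found = [m.rstrip(".,)") for m in pattern.findall(text)]
        if found:
            return list(dict.fromkeys(found))  # de-duplicate, keep order
    return []


def expand_url(url, opener=urllib.request.urlopen, timeout=5):
    """Follow redirects with a HEAD request; fall back to the input URL."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with opener(request, timeout=timeout) as response:
            return response.geturl()  # final URL after redirects
    except (OSError, ValueError):  # network failure or malformed URL
        return url
```

Both functions follow the slide's design rule: a failure degrades to an empty list or the unexpanded URL, never an exception in the caller.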
▶ ACT · IV
Python inside a Mac app.
A venv bundled in the app, called from Swift via subprocess. With one nasty gotcha.
The deployment model
A venv, bundled.
.runtime/discovery-venv/ — managed Python env
Built by build-app.sh at build time
requirements-discovery.txt installs scrapetube
Swift calls Python via Process()
Input: channel IDs via args. Output: JSON on stdout
Clean boundary — just pipes, no shared state
let process = Process()
process.executableURL = venvPython
process.arguments = [
    scriptPath,
    "--channel-id", channelId
]
let stdout = Pipe()
let stderr = Pipe()
process.standardOutput = stdout
process.standardError = stderr
try process.run()
// ⚠️ DO NOT waitUntilExit() yet
// see next slide
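On the Python side of that boundary, the script only has to parse one flag and print one JSON document. A minimal sketch, assuming a payload with "channelId", "videos", and "error" keys. The payload shape is hypothetical, and the stub fetch_channel_videos stands in for the real scraper.

```python
import argparse
import json
import sys


def fetch_channel_videos(channel_id):
    """Stand-in for the real scraper; returns no videos here."""
    return []


def main(argv=None, fetch=fetch_channel_videos, out=None):
    if out is None:
        out = sys.stdout
    parser = argparse.ArgumentParser()
    parser.add_argument("--channel-id", required=True)
    args = parser.parse_args(argv)

    try:
        videos = fetch(args.channel_id)
        payload = {"channelId": args.channel_id, "videos": videos, "error": None}
    except Exception as exc:
        # Diagnostics go to stderr so stdout stays parseable JSON.
        print(f"fetch failed: {exc}", file=sys.stderr)
        payload = {"channelId": args.channel_id, "videos": [], "error": str(exc)}

    json.dump(payload, out)
    out.write("\n")
    return 0

# Entry point when run as a script: sys.exit(main())
```

Keeping stdout strictly for JSON means the Swift side can decode it without filtering log lines.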
The bug you will hit
The pipe buffer deadlock.
The script writes more than the pipe buffer holds (about 64KB). You wait for exit before reading. The process blocks on write until you read, so it can never exit. Neither side moves.
❌ naive — hangs at 64KB
try process.run()
process.waitUntilExit() // 💀
let data = stdout
    .fileHandleForReading
    .readDataToEndOfFile()
✓ concurrent drain
try process.run()
// drain WHILE process runs
let out = Task.detached {
    stdout.fileHandleForReading
        .readDataToEndOfFile()
}
let err = Task.detached {
    stderr.fileHandleForReading
        .readDataToEndOfFile()
}
process.waitUntilExit()
// .value waits for each detached read to finish
let result = await out.value
let errors = await err.value
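The same hazard exists in any language that spawns a child with piped output. As a cross-check, here is the equivalent fix in Python, where subprocess's communicate() drains both pipes while it waits. The function name and the timeout value are assumptions for the sketch.

```python
import subprocess
import sys


def run_and_capture(args, timeout=60):
    """Run a child process and collect stdout/stderr without deadlocking."""
    proc = subprocess.Popen(args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    try:
        # communicate() reads both pipes while the child runs, so the
        # child never blocks on a full pipe buffer.
        out, err = proc.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        proc.kill()
        out, err = proc.communicate()  # collect whatever was written
    return proc.returncode, out, err


if __name__ == "__main__":
    # A child that writes well past a typical 64KB pipe buffer.
    child = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 200000)"]
    code, out, err = run_and_capture(child)
    print(code, len(out))
```

Calling proc.wait() before reading would reproduce the hang from the naive Swift version.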
Being a good citizen
Don't get rate-limited.
2s minimum between any two requests
Random jitter — not perfectly spaced (more human)
10-min per-channel cooldown after any failure
5-min global cooldown after sustained failures
YouTube doesn't hard-block —
it slows down and returns garbage.
The symptoms are subtle. The limiter keeps you in the safe zone.
import random
import time

class ScrapeRateLimiter:
    MIN_INTERVAL = 2.0   # seconds
    JITTER = 1.5
    CHANNEL_COOL = 600   # 10 min
    GLOBAL_COOL = 300    # 5 min

    def __init__(self):
        self.last_req = 0.0   # time of the previous request
        self.cooldowns = {}   # channel_id -> time it may be retried

    def wait(self):
        elapsed = time.time() - self.last_req
        gap = (self.MIN_INTERVAL
               + random.uniform(0, self.JITTER))
        if elapsed < gap:
            time.sleep(gap - elapsed)
        self.last_req = time.time()

    def record_failure(self, channel_id):
        self.cooldowns[channel_id] = (
            time.time() + self.CHANNEL_COOL
        )
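The slide's snippet records a failure but does not show how the cooldowns are checked, or how the global cooldown kicks in. One way this could work is sketched below as a separate class. The threshold of three consecutive failures for "sustained", the method names, and the injectable clock are all assumptions.

```python
import time


class CooldownTracker:
    CHANNEL_COOL = 600       # 10 min, from the slide
    GLOBAL_COOL = 300        # 5 min, from the slide
    FAILURE_THRESHOLD = 3    # assumption: what counts as "sustained"

    def __init__(self, clock=time.monotonic):
        self.clock = clock               # injectable so tests need no sleeping
        self.cooldowns = {}              # channel_id -> earliest retry time
        self.global_until = 0.0          # no fetches at all before this time
        self.consecutive_failures = 0

    def record_failure(self, channel_id):
        now = self.clock()
        self.cooldowns[channel_id] = now + self.CHANNEL_COOL
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.FAILURE_THRESHOLD:
            self.global_until = now + self.GLOBAL_COOL
            self.consecutive_failures = 0

    def record_success(self):
        self.consecutive_failures = 0

    def can_fetch(self, channel_id):
        now = self.clock()
        if now < self.global_until:
            return False  # global cooldown is active
        return now >= self.cooldowns.get(channel_id, 0.0)
```

A caller would check can_fetch(channel_id) before each request and skip the channel for this pass when it returns False.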
Personal tools are where the real engineering lives.
No PM. No spec. Just a real problem you actually care about solving.
What it's made of
The stack.
PLATFORM: macOS · SwiftUI native
DATABASE: SQLite · local-first
LLM (classify): Claude Haiku · fast + cheap
LLM (refine): Claude Sonnet · nuanced ops
CLASSIFY COST: $0.15 for 5,000 videos
TESTS: 147 across 25 suites
SCRAPING: scrapetube + urllib
FALLBACK: YouTube RSS feeds
SYNC: YouTube API + CDP
SUBPROCESS: Swift Process() → JSON
PYTHON ENV: .runtime/discovery-venv
SCRIPTS: 4 Python scripts · ~750 lines
Things worth stealing
Three takeaways.
1
Scrape first, API second. For read-heavy discovery, scrapetube + RSS is faster, cheaper, and more reliable than the official YouTube API.
2
Batch your LLM calls. 200 items per prompt vs. 1 per prompt is the difference between $0.15 and $150. Design for batching from day one.
3
When there's no API, drive the browser. The Chrome DevTools Protocol (CDP) is stable and powerful, and because it drives a real browser session, its traffic looks like an ordinary user's. It's not a hack: it's the same protocol DevTools uses.
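For takeaway 2, the request counts behind the batching claim are easy to check. The sketch below uses the 5,000 videos and 200-per-prompt figures from the talk and deliberately leaves prices out, since per-token costs are not given here.

```python
import math


def request_count(items, batch_size):
    """Number of LLM calls needed to cover every item once."""
    return math.ceil(items / batch_size)


VIDEOS = 5_000
batched = request_count(VIDEOS, 200)
unbatched = request_count(VIDEOS, 1)

print(batched, unbatched, unbatched // batched)  # 25 5000 200
```

Every call also repeats the full topic list, so batching sends that fixed overhead 25 times instead of 5,000.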
▶ Demo + Q&A
Let's watch something.
Questions welcome — especially about the scraping.