ADR 0001: Hybrid Scraper Architecture
Status
ACCEPTED
Context
We need to scrape e-commerce data (CJ Dropshipping, AliExpress) reliably.
- Problem 1: CJ Dropshipping has an aggressive WAF/shield (Cloudflare) that blocks standard datacenter IPs.
- Problem 2: Serverless platforms (Cloudflare Workers) cannot easily or cheaply hold the long-lived connections that headless browsers need.
- Problem 3: Third-party APIs (ZenRows) are expensive at high volume.
Decision
We implement a Hybrid Governed Architecture:
- Primary: Oracle Cloud VPS (Free Tier) running a Python/Playwright server (deploy/vps-scraper).
  - Why: Real browser fingerprint, resident IP, full control.
- Fallback: ZenRows (Serverless API).
  - Why: Reliable backup if VPS is detected/down.
- Governor: ScraperGovernor (Proxy Pattern) in Backend.
  - Logic: Circuit Breaker opens after 5 VPS failures -> Failover to ZenRows -> Penalty of 20% on confidence score.
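The governor logic above can be sketched roughly as follows. This is a minimal illustration, not the actual backend code. Only the class name ScraperGovernor, the threshold of 5 failures, and the 20% penalty come from this ADR. The ScrapeResult type, the fetch method, the callable-based scrapers, and the 1.0 baseline confidence are hypothetical. The sketch also assumes that a failed VPS request is retried on ZenRows before the breaker opens and that a VPS success resets the failure count. The ADR does not say how the breaker closes again, so recovery is left out.

```python
from dataclasses import dataclass
from typing import Callable

FAILURE_THRESHOLD = 5     # from the ADR: breaker opens after 5 VPS failures
FALLBACK_PENALTY = 0.20   # from the ADR: 20% penalty on the confidence score


@dataclass
class ScrapeResult:
    """Hypothetical result type; field names are illustrative."""
    html: str
    source: str        # "vps" or "zenrows"
    confidence: float  # assumed scale: 1.0 means full confidence


class ScraperGovernor:
    """Proxy in front of both scrapers; callers only talk to this class."""

    def __init__(self,
                 primary: Callable[[str], str],
                 fallback: Callable[[str], str]) -> None:
        self._primary = primary    # stand-in for the VPS Playwright client
        self._fallback = fallback  # stand-in for the ZenRows client
        self._failures = 0

    @property
    def circuit_open(self) -> bool:
        return self._failures >= FAILURE_THRESHOLD

    def fetch(self, url: str) -> ScrapeResult:
        if not self.circuit_open:
            try:
                html = self._primary(url)
                self._failures = 0  # assumption: a success resets the count
                return ScrapeResult(html, "vps", 1.0)
            except Exception:
                self._failures += 1
        # Either the VPS call just failed or the breaker is open:
        # fail over to the fallback and record the reduced confidence.
        html = self._fallback(url)
        return ScrapeResult(html, "zenrows", 1.0 - FALLBACK_PENALTY)
```

Once the breaker is open, the VPS is no longer called at all, which avoids paying its timeout on every request. Keeping the source and confidence on every result is what would provide the audit trail mentioned under Consequences.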
Consequences
- Positive: High reliability, low cost (Free VPS), audit trail of source confidence.
- Negative: Operational overhead of managing a VPS (SSH, updates, reboot).