Context
These tools serve as upstream building blocks in a prospecting and exploitation chain. The value lies not only in the interface, but in the ability to collect, qualify, and pass the right data downstream.
The need
The point was not to collect data for its own sake. The tools had to feed a broader chain involving prospecting, agents and the CRM.
What we built
Tools for launching, selecting and collecting data, built to support broader qualification and commercial exploitation workflows.
What this project shows in practice
Above all, these tools make collection and qualification genuinely usable inside a broader prospecting workflow.
What was hard, what we settled
Technical stakes
- Pick the right sources and balance public directories, paid APIs and browser scraping
- Hold up against anti-bot protections (Cloudflare, rate limits, fingerprinting) without getting banned
- Normalize heterogeneous schemas across sources to produce data usable downstream
- Manage data freshness — know when to re-scrape without reprocessing everything (see the sketch after this list)
- Keep paid enrichment costs (Apollo, Hunter) under control without degrading coverage
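On the freshness point, a minimal sketch of a staleness query, assuming a hypothetical `companies` table with `source` and `last_scraped_at` columns; the per-source thresholds are made up for illustration:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Assumed thresholds: slow-moving directories age slower than paid APIs.
const STALE_AFTER_DAYS: Record<string, number> = {
  public_directory: 30,
  paid_api: 7,
};

async function findStaleLeads(source: string): Promise<string[]> {
  const days = STALE_AFTER_DAYS[source] ?? 14;
  // Only rows older than the threshold get queued for re-scraping,
  // so fresh rows are never reprocessed.
  const { rows } = await pool.query(
    `SELECT id FROM companies
     WHERE source = $1
       AND last_scraped_at < now() - make_interval(days => $2::int)
     ORDER BY last_scraped_at ASC
     LIMIT 500`,
    [source, days]
  );
  return rows.map((r) => r.id);
}
```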
Stack choices
- Node.js + TypeScript
String/JSON manipulation everywhere, mature scraping ecosystem (Playwright, Cheerio), fast dev cycle. One language from scraper to dashboard.
- Playwright
For SPAs that serve nothing without JavaScript. More stable long-term than Puppeteer, multi-browser support handy for bypassing some anti-bots.
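A minimal sketch of what that looks like in practice; the URL, the `.result-card` selector and the extracted fields are placeholders, not the project's actual targets:

```typescript
import { chromium } from "playwright";

// Render a JS-only listing page and pull structured rows out of it.
async function scrapeListing(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });
  // Wait for the SPA to actually render its results before reading the DOM.
  await page.waitForSelector(".result-card");
  const rows = await page.$$eval(".result-card", (cards) =>
    cards.map((c) => ({
      name: c.querySelector(".name")?.textContent?.trim() ?? "",
      website: c.querySelector("a.site")?.getAttribute("href") ?? "",
    }))
  );
  await browser.close();
  return rows;
}
```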
- Postgres with jsonb columns
We store raw data as jsonb (replayable if parsing logic changes) and normalized data in typed columns (queryable fast). Best of both worlds.
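Roughly, the split could look like this; the `leads` table and its columns are illustrative, not the project's real schema:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical table: raw payload in jsonb next to normalized typed columns.
await pool.query(`
  CREATE TABLE IF NOT EXISTS leads (
    id         bigserial PRIMARY KEY,
    source     text NOT NULL,
    raw        jsonb NOT NULL,   -- untouched payload, replayable later
    email      text,             -- normalized, indexable, fast to query
    company    text,
    scraped_at timestamptz NOT NULL DEFAULT now()
  )
`);

// One write stores both: if parsing logic changes later,
// `raw` can be reparsed in place without re-scraping.
async function saveLead(
  source: string,
  raw: unknown,
  email: string | null,
  company: string | null
) {
  await pool.query(
    `INSERT INTO leads (source, raw, email, company) VALUES ($1, $2, $3, $4)`,
    [source, JSON.stringify(raw), email, company]
  );
}
```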
- BullMQ for the scraping queue
Parallelize without killing target sites. Per-source throttle, exponential retries, dead-letter queue to analyze failures.
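A sketch of that setup using BullMQ's standard primitives (rate limiter, exponential backoff, failed-event hook); the queue names, limits and dead-letter wiring are assumptions:

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// One queue per source so the limiter is scoped to one target site.
const queue = new Queue("scrape:directory", { connection });
const deadLetter = new Queue("scrape:dead-letter", { connection });

// Enqueue with exponential retries: 2s, 4s, 8s, ...
await queue.add(
  "page",
  { url: "https://example.com/page/1" },
  { attempts: 5, backoff: { type: "exponential", delay: 2000 } }
);

// The limiter caps this worker at 10 jobs per minute against the source.
const worker = new Worker(
  "scrape:directory",
  async (job) => {
    // ... fetch and parse job.data.url ...
  },
  { connection, limiter: { max: 10, duration: 60_000 } }
);

// After the final attempt, park the job so failures can be analyzed later.
worker.on("failed", async (job, err) => {
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    await deadLetter.add("failed", { data: job.data, error: err.message });
  }
});
```

One queue per source keeps throttling independent: a slow or hostile source never starves the others.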
- Self-hosted infrastructure
AWS/GCP datacenters are blocklisted by serious anti-bot solutions. Our infra with residential IPs + proxies dodges 80% of blocks from day one.
Difficulties faced
Sites change without warning
CSS selectors break overnight. We versioned adapters per source and set up alerts when scraping returns 0 results or an unusual schema.
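The adapter versioning and the zero-result alert could be sketched like this; the `SourceAdapter` shape and the `name` field check are invented for illustration:

```typescript
// Each source gets a versioned adapter; the version is bumped
// whenever its selectors change.
interface SourceAdapter {
  source: string;
  version: number;
  parse(html: string): Record<string, unknown>[];
}

async function runAdapter(
  adapter: SourceAdapter,
  html: string,
  alert: (msg: string) => Promise<void>
) {
  const rows = adapter.parse(html);
  if (rows.length === 0) {
    // Most selector breakages show up as a silent empty result, not an error.
    await alert(`${adapter.source} v${adapter.version}: 0 results, selectors may have broken`);
  } else if (!rows.every((r) => "name" in r)) {
    await alert(`${adapter.source} v${adapter.version}: unexpected schema`);
  }
  return rows;
}
```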
Aggressive anti-bot
Cloudflare, Datadome, Akamai. We added residential proxy rotation, realistic headers, adaptive throttle per source, and switched to headed Playwright for the hardest cases.
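For the headed-Playwright path, a sketch of the launch configuration; the proxy address and header values are placeholders:

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch({
  headless: false, // headed mode passes more fingerprinting checks
  proxy: { server: "http://proxy.example.net:8000" }, // residential exit
});
const context = await browser.newContext({
  // Realistic desktop-Chrome identity instead of the default automation UA.
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  locale: "en-US",
  extraHTTPHeaders: { "Accept-Language": "en-US,en;q=0.9" },
});
const page = await context.newPage();
```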
Dirty data
Mixed date formats, lost accents, inconsistent capitalization, hidden duplicates. Strict normalization pipeline with versioned rules and ability to replay on raw data.
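A sketch of what versioned rules plus replay can look like; the rule contents and field names are invented for illustration:

```typescript
// One normalizer per rule version, kept around so old raw rows
// can always be replayed with the latest logic.
type Normalizer = (raw: any) => { name: string; city: string };

const normalizers: Record<number, Normalizer> = {
  1: (raw) => ({ name: String(raw.name).trim(), city: String(raw.city) }),
  2: (raw) => ({
    // v2 also collapses whitespace and repairs casing/accent damage.
    name: String(raw.name).trim().replace(/\s+/g, " "),
    city: String(raw.city).normalize("NFC").toLowerCase(),
  }),
};

// Replaying = re-running a normalizer over stored raw data, no re-scrape.
function replay(raw: any, version: number) {
  const normalize = normalizers[version];
  if (!normalize) throw new Error(`unknown normalizer version ${version}`);
  return normalize(raw);
}
```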
Paid API enrichment costs
An Apollo call is expensive — multiplied by 50K leads it stings. 30-day cache on enrichments, nightly batches, upstream filtering so we only pay for what will actually be used.
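A sketch of the cache-first enrichment call; the `enrichment_cache` table (with a unique key on `email`) and the `enrichViaApollo` callback are hypothetical stand-ins for the real integration:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const TTL_DAYS = 30;

async function enrich(
  email: string,
  enrichViaApollo: (email: string) => Promise<unknown>
) {
  // 1. Serve from cache if a fresh result exists: zero API spend.
  const cached = await pool.query(
    `SELECT payload FROM enrichment_cache
     WHERE email = $1
       AND fetched_at > now() - make_interval(days => $2::int)`,
    [email, TTL_DAYS]
  );
  if (cached.rows.length > 0) return cached.rows[0].payload;

  // 2. Otherwise pay once and cache the answer for the next 30 days.
  const payload = await enrichViaApollo(email);
  await pool.query(
    `INSERT INTO enrichment_cache (email, payload, fetched_at)
     VALUES ($1, $2, now())
     ON CONFLICT (email) DO UPDATE SET payload = $2, fetched_at = now()`,
    [email, JSON.stringify(payload)]
  );
  return payload;
}
```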
What we learned
- Start with one well-mastered source rather than ten approximate ones. Pipeline quality beats connector quantity.
- Always store raw data BEFORE normalization. If parsing logic changes, replay without re-scraping.
- Detailed logs and alerts > automatic recovery in the early months. Better to know fast what breaks.
- Anti-bot budget must be planned from day one, not as a future feature. That's what separates a POC from a tool that holds.
A few useful views to show how the project comes together in practice.