Context
These tools serve as upstream building blocks in a prospecting and exploitation chain. The value lies not only in the interface, but in the ability to collect, qualify, and pass the right data downstream.
The need
The point was not to collect data for its own sake. The tools had to feed a broader chain involving prospecting, agents and the CRM.
What we built
Tools for launching, selecting and collecting data, built to support broader qualification and commercial exploitation workflows.
What this project shows in practice
Above all, these tools make collection and qualification genuinely usable inside a broader prospecting workflow.
What was hard, what we settled
Technical stakes
- Pick the right sources and balance public directories, paid APIs and browser scraping
- Hold up against anti-bot protections (Cloudflare, rate limits, fingerprinting) without getting banned
- Normalize heterogeneous schemas across sources to produce data usable downstream
- Manage data freshness — know when to re-scrape without reprocessing everything (see the sketch after this list)
- Keep paid enrichment costs (Apollo, Hunter) under control without degrading coverage
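On the freshness point, a minimal sketch of a staleness query, assuming a hypothetical `companies` table with `source` and `last_scraped_at` columns; the per-source thresholds are made up for illustration:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Assumed thresholds: slow-moving directories age slower than paid APIs.
const STALE_AFTER_DAYS: Record<string, number> = {
  public_directory: 30,
  paid_api: 7,
};

async function findStaleLeads(source: string): Promise<string[]> {
  const days = STALE_AFTER_DAYS[source] ?? 14;
  // Only rows older than the threshold get queued for re-scraping,
  // so fresh rows are never reprocessed.
  const { rows } = await pool.query(
    `SELECT id FROM companies
     WHERE source = $1
       AND last_scraped_at < now() - make_interval(days => $2::int)
     ORDER BY last_scraped_at ASC
     LIMIT 500`,
    [source, days]
  );
  return rows.map((r) => r.id);
}
```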
Stack choices
- Node.js + TypeScript
String/JSON manipulation everywhere, mature scraping ecosystem (Playwright, Cheerio), fast dev cycle. One language from scraper to dashboard.
- Playwright
For SPAs that serve nothing without JavaScript. More stable long-term than Puppeteer, multi-browser support handy for bypassing some anti-bots.
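A minimal sketch of what that looks like in practice; the URL, the `.result-card` selector and the extracted fields are placeholders, not the project's actual targets:

```typescript
import { chromium } from "playwright";

// Render a JS-only listing page and pull structured rows out of it.
async function scrapeListing(url: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: "networkidle" });
  // Wait for the SPA to actually render its results before reading the DOM.
  await page.waitForSelector(".result-card");
  const rows = await page.$$eval(".result-card", (cards) =>
    cards.map((c) => ({
      name: c.querySelector(".name")?.textContent?.trim() ?? "",
      website: c.querySelector("a.site")?.getAttribute("href") ?? "",
    }))
  );
  await browser.close();
  return rows;
}
```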
- Postgres with jsonb columns
We store raw data as jsonb (replayable if parsing logic changes) and normalized data in typed columns (queryable fast). Best of both worlds.
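Roughly, the split could look like this; the `leads` table and its columns are illustrative, not the project's real schema:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Hypothetical table: raw payload in jsonb next to normalized typed columns.
await pool.query(`
  CREATE TABLE IF NOT EXISTS leads (
    id         bigserial PRIMARY KEY,
    source     text NOT NULL,
    raw        jsonb NOT NULL,   -- untouched payload, replayable later
    email      text,             -- normalized, indexable, fast to query
    company    text,
    scraped_at timestamptz NOT NULL DEFAULT now()
  )
`);

// One write stores both: if parsing logic changes later,
// `raw` can be reparsed in place without re-scraping.
async function saveLead(
  source: string,
  raw: unknown,
  email: string | null,
  company: string | null
) {
  await pool.query(
    `INSERT INTO leads (source, raw, email, company) VALUES ($1, $2, $3, $4)`,
    [source, JSON.stringify(raw), email, company]
  );
}
```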
- BullMQ for the scraping queue
Parallelize without killing target sites. Per-source throttle, exponential retries, dead-letter queue to analyze failures.
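A sketch of that setup using BullMQ's standard primitives (rate limiter, exponential backoff, failed-event hook); the queue names, limits and dead-letter wiring are assumptions:

```typescript
import { Queue, Worker } from "bullmq";

const connection = { host: "localhost", port: 6379 };

// One queue per source so the limiter is scoped to one target site.
const queue = new Queue("scrape:directory", { connection });
const deadLetter = new Queue("scrape:dead-letter", { connection });

// Enqueue with exponential retries: 2s, 4s, 8s, ...
await queue.add(
  "page",
  { url: "https://example.com/page/1" },
  { attempts: 5, backoff: { type: "exponential", delay: 2000 } }
);

// The limiter caps this worker at 10 jobs per minute against the source.
const worker = new Worker(
  "scrape:directory",
  async (job) => {
    // ... fetch and parse job.data.url ...
  },
  { connection, limiter: { max: 10, duration: 60_000 } }
);

// After the final attempt, park the job so failures can be analyzed later.
worker.on("failed", async (job, err) => {
  if (job && job.attemptsMade >= (job.opts.attempts ?? 1)) {
    await deadLetter.add("failed", { data: job.data, error: err.message });
  }
});
```

One queue per source keeps throttling independent: a slow or hostile source never starves the others.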
- Self-hosted infrastructure
AWS/GCP datacenters are blocklisted by serious anti-bot solutions. Our infra with residential IPs + proxies dodges 80% of blocks from day one.
Difficulties faced
Sites change without warning
CSS selectors break overnight. We versioned adapters per source and set up alerts when scraping returns 0 results or an unusual schema.
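The adapter versioning and the zero-result alert could be sketched like this; the `SourceAdapter` shape and the `name` field check are invented for illustration:

```typescript
// Each source gets a versioned adapter; the version is bumped
// whenever its selectors change.
interface SourceAdapter {
  source: string;
  version: number;
  parse(html: string): Record<string, unknown>[];
}

async function runAdapter(
  adapter: SourceAdapter,
  html: string,
  alert: (msg: string) => Promise<void>
) {
  const rows = adapter.parse(html);
  if (rows.length === 0) {
    // Most selector breakages show up as a silent empty result, not an error.
    await alert(`${adapter.source} v${adapter.version}: 0 results, selectors may have broken`);
  } else if (!rows.every((r) => "name" in r)) {
    await alert(`${adapter.source} v${adapter.version}: unexpected schema`);
  }
  return rows;
}
```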
Aggressive anti-bot
Cloudflare, Datadome, Akamai. We added residential proxy rotation, realistic headers, adaptive throttle per source, and switched to headed Playwright for the hardest cases.
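For the headed-Playwright path, a sketch of the launch configuration; the proxy address and header values are placeholders:

```typescript
import { chromium } from "playwright";

const browser = await chromium.launch({
  headless: false, // headed mode passes more fingerprinting checks
  proxy: { server: "http://proxy.example.net:8000" }, // residential exit
});
const context = await browser.newContext({
  // Realistic desktop-Chrome identity instead of the default automation UA.
  userAgent:
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0 Safari/537.36",
  locale: "en-US",
  extraHTTPHeaders: { "Accept-Language": "en-US,en;q=0.9" },
});
const page = await context.newPage();
```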
Dirty data
Mixed date formats, lost accents, inconsistent capitalization, hidden duplicates. Strict normalization pipeline with versioned rules and ability to replay on raw data.
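A sketch of what versioned rules plus replay can look like; the rule contents and field names are invented for illustration:

```typescript
// One normalizer per rule version, kept around so old raw rows
// can always be replayed with the latest logic.
type Normalizer = (raw: any) => { name: string; city: string };

const normalizers: Record<number, Normalizer> = {
  1: (raw) => ({ name: String(raw.name).trim(), city: String(raw.city) }),
  2: (raw) => ({
    // v2 also collapses whitespace and repairs casing/accent damage.
    name: String(raw.name).trim().replace(/\s+/g, " "),
    city: String(raw.city).normalize("NFC").toLowerCase(),
  }),
};

// Replaying = re-running a normalizer over stored raw data, no re-scrape.
function replay(raw: any, version: number) {
  const normalize = normalizers[version];
  if (!normalize) throw new Error(`unknown normalizer version ${version}`);
  return normalize(raw);
}
```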
Paid API enrichment costs
An Apollo call is expensive — multiplied by 50K leads it stings. 30-day cache on enrichments, nightly batches, upstream filtering so we only pay for what will actually be used.
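A sketch of the cache-first enrichment call; the `enrichment_cache` table (with a unique key on `email`) and the `enrichViaApollo` callback are hypothetical stand-ins for the real integration:

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });
const TTL_DAYS = 30;

async function enrich(
  email: string,
  enrichViaApollo: (email: string) => Promise<unknown>
) {
  // 1. Serve from cache if a fresh result exists: zero API spend.
  const cached = await pool.query(
    `SELECT payload FROM enrichment_cache
     WHERE email = $1
       AND fetched_at > now() - make_interval(days => $2::int)`,
    [email, TTL_DAYS]
  );
  if (cached.rows.length > 0) return cached.rows[0].payload;

  // 2. Otherwise pay once and cache the answer for the next 30 days.
  const payload = await enrichViaApollo(email);
  await pool.query(
    `INSERT INTO enrichment_cache (email, payload, fetched_at)
     VALUES ($1, $2, now())
     ON CONFLICT (email) DO UPDATE SET payload = $2, fetched_at = now()`,
    [email, JSON.stringify(payload)]
  );
  return payload;
}
```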
What we learned
- Start with one well-mastered source rather than ten approximate ones. Pipeline quality beats connector quantity.
- Always store raw data BEFORE normalization. If parsing logic changes, replay without re-scraping.
- Detailed logs and alerts > automatic recovery in the early months. Better to know fast what breaks.
- Anti-bot budget must be planned from day one, not as a future feature. That's what separates a POC from a tool that holds.
A few useful views to show how the project comes together in practice.