Case study

Scraping tools

Scraping tools designed for prospecting, market monitoring, data enrichment and internal automation.

Automation / data collection

At a glance

Prospecting
Market watch
Enrichment
Internal automation

Context

These tools serve as upstream building blocks in a prospecting and exploitation chain. The value lies not only in the interface, but in the ability to collect, qualify and pass the right data forward.

The need

The point was not to collect data for its own sake. The tools had to feed a broader chain involving prospecting, agents and the CRM.

The built response

Launch, selection and collection tools built to support wider qualification and commercial exploitation workflows.

What this project shows in practice

Above all, these tools make collection and qualification usable inside a broader prospecting workflow.

Key features

Persona or target selection before launch
Tool launch flows with loading states and orchestration
Structured collection of useful data
Reusable base for monitoring, prospecting and enrichment

Interconnections

Scrapers collect and prepare the data
The data enriches the base used for prospecting
Prospecting agents consume those signals
The CRM then receives the useful leads or signals for sales follow-up
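The hand-off above can be sketched as a few typed steps. The types, field names and score threshold below are illustrative assumptions, not the actual schema:

```typescript
// Illustrative types for the scraper → enrichment → CRM hand-off.
// Field names and scoring are assumptions, not the real schema.
interface RawLead { source: string; payload: Record<string, unknown>; }
interface EnrichedLead { email: string; company: string; score: number; }

// A scraper collects and prepares raw records…
function collect(source: string): RawLead[] {
  return [{ source, payload: { email: "jane@example.com", company: "Acme" } }];
}

// …enrichment turns them into qualified signals…
function enrich(raw: RawLead): EnrichedLead {
  const p = raw.payload as { email: string; company: string };
  return { email: p.email, company: p.company, score: 0.8 };
}

// …and only leads above a threshold reach the CRM.
function forwardToCrm(leads: EnrichedLead[], threshold = 0.5): EnrichedLead[] {
  return leads.filter((l) => l.score >= threshold);
}
```

The point of the typed boundary is that each stage can evolve independently as long as the hand-off shape holds.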

What this changes

Less manual work on collection and triage
Better prospecting inputs through structured data
Stronger continuity between collection, qualification and sales action

Under the hood

What was hard, and what we settled on

Technical stakes

  • Pick the right sources and balance public directories, paid APIs and browser scraping
  • Hold up against anti-bot protections (Cloudflare, rate limits, fingerprinting) without getting banned
  • Normalize heterogeneous schemas across sources to produce data usable downstream
  • Manage data freshness — know when to re-scrape without reprocessing everything
  • Keep paid enrichment costs (Apollo, Hunter) under control without degrading coverage
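The freshness point, for instance, reduces to a per-source decision. A minimal sketch, assuming made-up TTL values:

```typescript
// Decide whether a record is stale enough to re-scrape,
// without reprocessing everything. TTLs here are illustrative.
const ttlHoursBySource: Record<string, number> = {
  directory: 24 * 30, // public directories move slowly
  jobs: 24,           // job boards churn daily
};

function shouldRescrape(source: string, lastScrapedAt: Date, now: Date): boolean {
  const ttlMs = (ttlHoursBySource[source] ?? 24 * 7) * 3600 * 1000;
  return now.getTime() - lastScrapedAt.getTime() > ttlMs;
}
```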

Stack choices

  • Node.js + TypeScript

    String/JSON manipulation everywhere, mature scraping ecosystem (Playwright, Cheerio), fast dev cycle. One language from scraper to dashboard.

  • Playwright

For SPAs that serve nothing without JavaScript. More stable long-term than Puppeteer, and its multi-browser support is handy for bypassing some anti-bot systems.

  • Postgres with jsonb columns

    We store raw data as jsonb (replayable if parsing logic changes) and normalized data in typed columns (queryable fast). Best of both worlds.
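A minimal sketch of the dual write, with a hypothetical row shape; the parser version lets old rows be found and replayed when the logic changes:

```typescript
// Keep the raw payload replayable and the normalized fields queryable.
// Table/column names are hypothetical.
const PARSER_VERSION = 2;

interface ContactRow {
  raw: string;            // goes into a jsonb column, untouched
  email: string | null;   // typed, indexed columns for fast queries
  company: string | null;
  parserVersion: number;  // lets us find rows parsed with old logic
}

function toRow(payload: Record<string, unknown>): ContactRow {
  const email = typeof payload.email === "string" ? payload.email.toLowerCase() : null;
  const company = typeof payload.company === "string" ? payload.company.trim() : null;
  return { raw: JSON.stringify(payload), email, company, parserVersion: PARSER_VERSION };
}
```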

  • BullMQ for the scraping queue

    Parallelize without killing target sites. Per-source throttle, exponential retries, dead-letter queue to analyze failures.
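The retry timing and per-source throttle can be sketched as plain config plus a delay function. The limiter values are illustrative; the formula mirrors BullMQ's built-in exponential backoff, where the delay doubles on each failed attempt:

```typescript
// Per-source limiter config in BullMQ style (values are illustrative):
// at most `max` jobs per `duration` ms, so target sites aren't hammered.
const limiterBySource = {
  directory: { max: 10, duration: 1000 },
  jobs: { max: 2, duration: 1000 },
};

// Exponential backoff as BullMQ computes it: base * 2^(attemptsMade - 1).
function backoffDelayMs(baseDelayMs: number, attemptsMade: number): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}
```

Jobs that exhaust their attempts land in the dead-letter queue for analysis instead of being silently dropped.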

  • Self-hosted infrastructure

    AWS/GCP datacenters are blocklisted by serious anti-bot solutions. Our infra with residential IPs + proxies dodges 80% of blocks from day one.

Difficulties faced

Sites change without warning

CSS selectors break overnight. We versioned adapters per source and set up alerts when scraping returns 0 results or an unusual schema.
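The alert condition itself can stay simple. A sketch, assuming a hypothetical adapter name and expected field list:

```typescript
// Flag a scraping run that returned nothing or an unusual schema.
// The adapter name and expected fields are illustrative.
const expectedFields: Record<string, string[]> = {
  "directory@v3": ["name", "email", "city"],
};

function runLooksBroken(adapter: string, results: Record<string, unknown>[]): boolean {
  if (results.length === 0) return true; // selectors probably broke overnight
  const fields = expectedFields[adapter] ?? [];
  // Alert if most rows are missing an expected field.
  return fields.some(
    (f) => results.filter((r) => r[f] === undefined).length > results.length / 2,
  );
}
```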

Aggressive anti-bot

Cloudflare, Datadome, Akamai. We added residential proxy rotation, realistic headers, adaptive throttle per source, and switched to headed Playwright for the hardest cases.

Dirty data

Mixed date formats, lost accents, inconsistent capitalization, hidden duplicates. Strict normalization pipeline with versioned rules and ability to replay on raw data.
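A simplified sketch of what versioned rules can look like; the rule set and version number are illustrative:

```typescript
// Versioned normalization rules, replayable against stored raw data.
const RULES_VERSION = 5;

function normalizeName(raw: string): string {
  return raw
    .normalize("NFD").replace(/[\u0300-\u036f]/g, "") // fold diacritics so "Émile" and mangled "Emile" dedupe together
    .trim()
    .toLowerCase()
    .replace(/\b\w/g, (c) => c.toUpperCase()); // consistent capitalization
}

function normalizeDate(raw: string): string | null {
  // Accept a couple of mixed formats and emit ISO dates.
  if (/^\d{4}-\d{2}-\d{2}$/.test(raw)) return raw;
  const fr = /^(\d{2})\/(\d{2})\/(\d{4})$/.exec(raw); // dd/mm/yyyy
  if (fr) return `${fr[3]}-${fr[2]}-${fr[1]}`;
  return null; // unknown format: leave null, fix the rule, replay on raw
}
```

Bumping `RULES_VERSION` and replaying raw rows is how a rule fix propagates without re-scraping anything.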

Paid API enrichment costs

An Apollo call is cheap on its own, but multiplied by 50K leads the bill stings. We added a 30-day cache on enrichments, nightly batches, and upstream filtering so we only pay for what will actually be used.
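The cache gate reduces to an age check before any paid call. A sketch, with an illustrative cache entry shape:

```typescript
// Only pay for a paid-API enrichment if the cached one is older than 30 days.
// The cache entry shape is illustrative.
const CACHE_TTL_DAYS = 30;

interface CacheEntry { enrichedAt: Date; data: Record<string, unknown>; }

function needsPaidCall(entry: CacheEntry | undefined, now: Date): boolean {
  if (!entry) return true; // never enriched: worth paying once
  const ageDays = (now.getTime() - entry.enrichedAt.getTime()) / 86_400_000;
  return ageDays > CACHE_TTL_DAYS;
}
```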

What we learned

  • Start with one well-mastered source rather than ten approximate ones. Pipeline quality beats connector quantity.
  • Always store raw data BEFORE normalization. If parsing logic changes, replay without re-scraping.
  • Detailed logs and alerts > automatic recovery in the early months. Better to know fast what breaks.
  • Anti-bot budget must be planned from day one, not as a future feature. That's what separates a POC from a tool that holds.

Project views

A few useful views to show how the project comes together in practice.

Discuss your project

If your need is similar, we can scope a showcase site, a landing page or a more specific tool depending on your context.