Insights & Engineering

Deep dives into web data infrastructure, extraction techniques, and the future of structured data at scale.

Latest Articles

Web Scraping with Regex: A Practical Guide

TL;DR: Web scraping with regex shines when you need short, predictable text patterns (prices, SKUs, emails, dates) out of HTML you already trust. Pair Python's re module with Beautiful Soup, scope your patterns to a parsed node instead of raw markup, and keep regex out of the way of full HTML tree parsing. This guide walks through a working title and price scraper, advanced regex features, and the pitfalls that bite real scrapers in production.

Mihai Maxim · 10 min read
May 7, 2026
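
As a rough illustration of the "scope your patterns to a parsed node" advice in the teaser above, here is a minimal sketch. It is not taken from the article. To keep it dependency-free it swaps in the standard library's html.parser for Beautiful Soup, and the markup, class names, helper names, and price pattern are all hypothetical.

```python
import re
from html.parser import HTMLParser

# Hypothetical product markup, invented for this sketch.
HTML = """
<div class="product">
  <h2 class="title">Blue Widget</h2>
  <span class="price">Now only $19.99!</span>
  <span class="old-price">Was $24.50</span>
</div>
"""


class ClassTextCollector(HTMLParser):
    """Collect the text inside the first element carrying a given class.

    Simplification: assumes no unclosed void tags (e.g. a bare <br>)
    inside the target element, since those would skew the depth counter.
    """

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.depth = 0      # > 0 while we are inside the target element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1
        elif not self.done:
            classes = (dict(attrs).get("class") or "").split()
            if self.wanted_class in classes:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def node_text(html, wanted_class):
    """Return the stripped text of the first element with wanted_class."""
    parser = ClassTextCollector(wanted_class)
    parser.feed(html)
    return "".join(parser.chunks).strip()


# The regex only ever sees one node's text, never the raw markup.
PRICE_RE = re.compile(r"\$(\d+(?:\.\d{2})?)")


def extract_price(html):
    match = PRICE_RE.search(node_text(html, "price"))
    return float(match.group(1)) if match else None


title = node_text(HTML, "title")
price = extract_price(HTML)
print(title, price)
```

Because the pattern runs on the text of the price node alone, the second dollar amount in the old-price element is never considered. With Beautiful Soup, the node lookup would presumably be a single select call, with the regex step unchanged.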

How to Use a Proxy with HttpClient in C#

TL;DR: To use a proxy with HttpClient in C#, build a WebProxy, attach it to an HttpClientHandler (or SocketsHttpHandler), and pass that handler to the HttpClient constructor. For production, swap manual loops for IHttpClientFactory, add NetworkCredential for authenticated proxies, and wrap calls in retries with Polly so dead IPs do not take your worker down.

Suciu Dan · 16 min read
May 8, 2026

How to Scrape HTML Tables Using Python

TL;DR: Most HTML tables can be scraped with a single line of pandas.read_html. When the table is paginated, JavaScript-rendered, or has merged headers, switch to Requests + BeautifulSoup or a headless browser like Playwright. This guide gives you a decision matrix, working code for all three approaches, and the cleaning steps that turn scraped rows into pipeline-ready data.

Andrei Ogiolan · 15 min read
May 7, 2026

Cheerio vs Puppeteer: How to Pick the Right Tool

TL;DR: Cheerio is a lightweight HTML parser; Puppeteer drives a real Chromium browser. Use Cheerio when the data is already in the raw HTML, Puppeteer when JavaScript renders it, and combine them when a JS-heavy page has many fields to extract per visit.

Sergiu Inizian · 8 min read
May 8, 2026

What Is Browser Automation? A Practical Guide

TL;DR: Browser automation is the practice of driving a real or headless web browser from code so it clicks, types, navigates, and reads pages on your behalf. This guide explains how browser automation works under the hood, compares Selenium, Playwright, Puppeteer, and Cypress, and shows when not to reach for a full browser.

Ștefan Răcilă · 10 min read
May 8, 2026