Scraping with Cheerio: How to Easily Collect Data from Web Pages
With Cheerio you can start collecting the data in a matter of minutes. No hassle, no learning curve required.
Deep dives into web data infrastructure, extraction techniques, and the future of structured data at scale.
With Cheerio you can start collecting the data in a matter of minutes. No hassle, no learning curve required.
TL;DR: Redfin exposes hidden API endpoints that return structured JSON for property listings, making it possible to skip fragile HTML parsing entirely. This guide walks you through building a Python scraper that extracts rental and sale data, searches by location, monitors new listings via XML sitemaps, and exports clean results to CSV or JSON.
TL;DR: XPath is a query language for navigating HTML/XML trees by path, attribute, or text content. This guide covers XPath syntax, axes, and functions, then shows working Python scrapers with lxml and Selenium. You will also get a consolidated cheat sheet and a troubleshooting section for the most common XPath mistakes.
TL;DR: cURL hides response headers by default. Use -i to see headers alongside the body, -I for a HEAD request that returns headers only, -v for full request/response debugging, and -D to save headers to a file. For modern scripting, cURL 7.83+ lets you extract individual headers or dump all of them as JSON with the -w write-out option.
TL;DR: A headless browser is a web browser that runs without a visible graphical interface, controlled entirely through code or command-line instructions. Developers use headless browsers for automated testing, web scraping, performance monitoring, and increasingly to power AI agents. This guide covers how they work internally, when to choose one over a regular browser, and which frameworks are worth your time.
TL;DR: Scrapy-Playwright lets you render JavaScript-heavy pages directly inside Scrapy spiders by controlling real Chromium, Firefox, or WebKit browsers through Playwright. This tutorial walks you through installation, configuration, page interactions, AJAX interception, anti-detection, and a production-ready project structure so you can scrape dynamic sites without leaving the Scrapy ecosystem.
Scrape Expedia hotel listings with Python using JS rendering, proxies, CSS selectors, and pagination, then clean and export data to CSV.