Raluca Penciuc | Last updated on May 13, 2026 | 12 min read

HTTP Headers Web Scraping: Stop Getting Blocked

TL;DR: HTTP headers are usually why your scraper gets a 403 while your browser loads the same URL fine. This guide shows which headers anti-bot systems actually inspect, how to capture a real browser's header set from DevTools, how to send and rotate them correctly in Python and Node.js, and when manual tuning stops paying off and a managed scraping API is the better move.

Most blocked scrapers are not blocked by their IP. They are blocked by the request they send before the body even starts. Tuning HTTP headers for web scraping means making your client's metadata look like a real browser instead of a default Python or Node.js library, and it is the cheapest, most underused lever you have against anti-bot detection.

In HTTP, a header is a colon-separated name-value pair that carries metadata about the request or response: the client identity, accepted languages, encoding, cookies, security context, and more. The MDN reference on HTTP headers and RFC 9110 define the canonical semantics. Detection systems compare your scraper's header set against the fingerprint of a real Chrome or Firefox session, and any mismatch in values, presence, casing, or order can flag the request.

This guide is for backend, data, and ops engineers whose scrapers are returning 403, 429, empty bodies, or a different page than the browser sees. You will leave knowing which headers matter, how to read them out of DevTools and replay them in Python or Node.js, how to deal with header order and TLS fingerprints, and when to stop tuning and offload the request layer to a managed service.

HTTP Headers Web Scraping 101: Headers vs. Cookies, Refreshed

Get the model right first. A request header is metadata your client attaches to an outgoing request: who you are, what you accept, where you came from. A response header is metadata the server sends back: status hints, content type, cache rules, and Set-Cookie directives.

Cookies are not a separate protocol; they are a stateful header. The server hands them out via Set-Cookie, and your client echoes them back in a Cookie header on every following request. Per RFC 6265, that round trip keeps sessions alive, carries auth tokens, locks in geo, and stores A/B buckets.
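To make the round trip concrete, here is a minimal sketch that turns Set-Cookie values into the Cookie header a client would echo back. It uses only the standard library; the cookie names and values are made up for illustration, and a real client such as requests.Session does this bookkeeping for you.

```python
from http.cookies import SimpleCookie


def cookie_header_from_set_cookie(set_cookie_values):
    """Turn Set-Cookie response values into one Cookie request value."""
    jar = {}
    for raw in set_cookie_values:
        parsed = SimpleCookie()
        parsed.load(raw)
        for name, morsel in parsed.items():
            # Only name=value is echoed back; attributes such as Path or
            # HttpOnly are instructions to the client, not part of Cookie.
            jar[name] = morsel.value
    return "; ".join(f"{name}={value}" for name, value in jar.items())


# Hypothetical Set-Cookie values, as a server might send them.
response_cookies = [
    "sessionid=abc123; Path=/; HttpOnly",
    "geo=US; Path=/; Max-Age=3600",
]
cookie_header = cookie_header_from_set_cookie(response_cookies)
print(cookie_header)
```

The sketch ignores expiry, domain, and path matching, which a real cookie jar has to honor.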

For HTTP headers web scraping, both layers matter. Mess up either and your client looks new on every hit, which is exactly what bot detection watches for. Our HTTP cookies explainer covers the underlying mechanics.

Which Headers Anti-Bot Systems Actually Watch

Servers do not score every header equally. A short list does the heavy lifting in real fingerprinting stacks, and the same names show up across detection vendors. The subsections below cover those headers roughly in order of impact, from the one almost everyone gets wrong (User-Agent) down to client hints and per-session cookies.

User-Agent: The Most-Checked and Most-Faked Header

User-Agent identifies the client software, OS, and browser engine, and it is the first header anti-bot systems test. Default values like python-requests/2.x or the axios default get blocked instantly because no real visitor sends them. The most common real-world profile is Chrome on Windows, so that is the safest target to mimic.

Two patterns reliably fail. The first is reusing a single UA across millions of requests. The second is editing a real UA by hand, bumping a digit, and ending up with a version that no actual browser ever shipped. At write time, verify against a current Chrome or Firefox release rather than copying a version string from any tutorial, including this one.

Accept, Accept-Language, and Accept-Encoding

These three say what content your client can handle. Accept lists MIME types, Accept-Language lists locales such as en-US,en;q=0.9, and Accept-Encoding lists compression algorithms (gzip, deflate, br). Real browsers send rich, ordered values here, and the exact strings differ between Chrome and Firefox.

The traps are subtle. A US residential IP paired with Accept-Language: ru is a fingerprint mismatch. Sending br (Brotli) but then failing to decode a Brotli body looks just as bot-like. Match Accept-Language to your proxy geo, and only advertise compression formats your HTTP client actually decompresses transparently.
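One way to keep these two values honest is to derive them instead of hard-coding them. The sketch below is an assumption-heavy illustration: the country-to-language table is hypothetical and should come from your own browser captures, and the Brotli check simply looks for the brotli or brotlicffi package that requests and urllib3 rely on for decoding.

```python
import importlib.util

# Hypothetical mapping from proxy exit country to a plausible Accept-Language.
ACCEPT_LANGUAGE_BY_COUNTRY = {
    "US": "en-US,en;q=0.9",
    "GB": "en-GB,en;q=0.9",
    "DE": "de-DE,de;q=0.9,en;q=0.8",
    "FR": "fr-FR,fr;q=0.9,en;q=0.8",
}


def negotiation_headers(proxy_country, brotli_available=None):
    """Build Accept-Language and Accept-Encoding for one proxy geo."""
    if brotli_available is None:
        # Advertise br only when a Brotli decoder is importable.
        brotli_available = any(
            importlib.util.find_spec(name) is not None
            for name in ("brotli", "brotlicffi")
        )
    encodings = ["gzip", "deflate"]
    if brotli_available:
        encodings.append("br")
    return {
        # Falls back to en-US for countries missing from the table.
        "Accept-Language": ACCEPT_LANGUAGE_BY_COUNTRY.get(
            proxy_country, "en-US,en;q=0.9"
        ),
        "Accept-Encoding": ", ".join(encodings),
    }


print(negotiation_headers("DE", brotli_available=False))
```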

Referer and Origin

Referer tells the server which page sent the user to this URL. A deep product page hit with no Referer at all looks suspicious because real visitors usually arrive from a search engine, a category listing, or an internal link. Set a plausible Referer such as a Google search or the site's own category page.

Origin is its stricter cousin: browsers attach it to cross-origin POST and fetch/XHR requests. If you are replaying an API call you observed in DevTools, copy the Origin header verbatim, or the server will treat the request as forged.
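When you cannot copy the values verbatim, a rough sketch like the following derives both headers from the page that would have issued the call. The URL is a placeholder, and the helper names are ours rather than part of any library.

```python
from urllib.parse import urlsplit


def origin_of(url):
    """Return scheme://host[:port], the form browsers use for Origin."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def xhr_context_headers(page_url):
    """Headers describing an XHR/fetch call issued from page_url."""
    return {
        "Origin": origin_of(page_url),
        "Referer": page_url,
    }


# Hypothetical page that issues the API call.
headers = xhr_context_headers("https://shop.example.com/category/shoes?page=2")
print(headers)
```

This assumes a same-origin call. For cross-origin requests, Chrome's default referrer policy (strict-origin-when-cross-origin) trims Referer down to the origin, so a DevTools capture remains the safer source.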

Sec-Fetch and Client Hints (Sec-CH-UA)

Modern Chromium browsers attach a family of security-context headers that older scraping guides ignore: Sec-Fetch-Site, Sec-Fetch-Mode, Sec-Fetch-Dest, and Sec-Fetch-User. They describe whether the request is a top-level navigation, an embedded resource, or a cross-origin XHR. A direct page load from a fresh tab typically sends Sec-Fetch-Site: none, Sec-Fetch-Mode: navigate, and Sec-Fetch-Dest: document.

Client hints (Sec-CH-UA, Sec-CH-UA-Mobile, Sec-CH-UA-Platform) repeat your UA profile in a structured way. The fingerprint check is simple: if your User-Agent claims Chrome on Windows but your Sec-CH-UA-Platform says "macOS", you are flagged. Always source User-Agent, Sec-CH-UA, and Sec-CH-UA-Platform from the same real browser capture so they stay internally consistent.
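The consistency rule is easy to check before a request leaves your machine. The sketch below is a deliberately rough illustration for desktop profiles only: the OS detection is a simple substring test of our own, and the header values use the same <current> placeholder as the rest of this guide.

```python
def platform_from_user_agent(user_agent):
    """Very rough OS guess from a desktop User-Agent string."""
    if "Windows NT" in user_agent:
        return "Windows"
    if "Macintosh" in user_agent:
        return "macOS"
    if "Linux" in user_agent:
        return "Linux"
    return None


def client_hint_mismatches(headers):
    """List contradictions between User-Agent and the client hints."""
    problems = []
    ua = headers.get("User-Agent", "")
    hinted = headers.get("Sec-CH-UA-Platform", "").strip('"')
    expected = platform_from_user_agent(ua)
    if hinted and expected and hinted != expected:
        problems.append(
            f"platform: UA implies {expected}, Sec-CH-UA-Platform says {hinted}"
        )
    if "Chrome/" in ua and "Sec-CH-UA" not in headers:
        problems.append("Chrome UA without Sec-CH-UA")
    return problems


WINDOWS_CHROME_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/<current> Safari/537.36"
)
bad_profile = {
    "User-Agent": WINDOWS_CHROME_UA,
    "Sec-CH-UA": '"Chromium";v="<current>"',
    "Sec-CH-UA-Platform": '"macOS"',
}
good_profile = dict(bad_profile, **{"Sec-CH-UA-Platform": '"Windows"'})
print(client_hint_mismatches(bad_profile))
```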

Cookies and Session Headers

After the first request, Cookie becomes the most identity-laden header you send. Anti-bot systems use it to track whether a session is warmed up, whether you accepted a consent banner, and whether your CSRF token came from the same render as the form you are submitting. Drop cookies between requests and you look like a brand-new visitor every time.

Custom headers like x-csrf-token, x-api-key, and Authorization show up on XHR endpoints behind modern SPAs. Pull them out of the HTML or a prior JSON response, then attach them to the actual data call. Without that step, the API returns 401 or empty results.
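As an illustration of the "pull it out of the HTML" step, the sketch below reads a token from a meta tag with the standard-library parser. The tag name csrf-token and the header name x-csrf-token are assumptions; sites differ, so confirm both in DevTools for your target.

```python
from html.parser import HTMLParser


class MetaTokenParser(HTMLParser):
    """Collects the content attribute of <meta name="csrf-token" ...>."""

    def __init__(self, meta_name="csrf-token"):
        super().__init__()
        self.meta_name = meta_name
        self.token = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and attributes.get("name") == self.meta_name:
            self.token = attributes.get("content")


def csrf_headers(html, header_name="x-csrf-token"):
    """Return the header to attach to the follow-up data call."""
    parser = MetaTokenParser()
    parser.feed(html)
    if parser.token is None:
        raise ValueError("no csrf-token meta tag found")
    return {header_name: parser.token}


# Hypothetical page markup.
sample_html = '<html><head><meta name="csrf-token" content="tok_123"></head></html>'
api_headers = csrf_headers(sample_html)
print(api_headers)
```

Fetch the page and make the data call through the same session so the token and the cookies come from the same render.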

Capture Real Browser Headers from DevTools

Stop guessing header values. Open the target site in a normal Chrome or Firefox session, right-click and choose Inspect, then switch to the Network tab. Reload the page, click the first document request (the HTML), and open the Headers panel. The Request Headers section is your ground truth.

Two tricks save time. Right-click the request and choose Copy as cURL to dump the full call, including headers, cookies, and body, into a shell-ready command. And before replaying, drop session-specific values like Cookie, x-client-data, and any single-use CSRF tokens.
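If you would rather not edit the copied command by hand, a small parser can do the stripping. This is a sketch that assumes the bash flavor of Copy as cURL (single-quoted -H arguments with backslash line continuations); the sample command and the set of names to drop are illustrative.

```python
import shlex

# Header names treated as session-specific in this sketch.
SESSION_SPECIFIC = {"cookie", "x-client-data", "x-csrf-token"}


def headers_from_curl(command, drop=SESSION_SPECIFIC):
    """Extract -H/--header pairs from a cURL command, minus session values."""
    # Join continuation lines before tokenizing.
    tokens = shlex.split(command.replace("\\\n", " "))
    headers = {}
    i = 0
    while i < len(tokens):
        if tokens[i] in ("-H", "--header") and i + 1 < len(tokens):
            name, _, value = tokens[i + 1].partition(":")
            if name.strip().lower() not in drop:
                headers[name.strip()] = value.strip()
            i += 2
        else:
            i += 1
    return headers


# Hypothetical capture; real ones are much longer.
sample = r"""curl 'https://example.com/products' \
  -H 'accept: text/html,application/xhtml+xml' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'cookie: sessionid=abc123' \
  -H 'x-client-data: CJa2yQE=' \
  --compressed"""

replay_headers = headers_from_curl(sample)
print(replay_headers)
```

Because Python dicts keep insertion order, the result also preserves the order in which the browser listed the headers.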

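Validate the replay by pointing it at httpbin.org/headers, which echoes back the header names and values it received. If the echo does not match the DevTools capture, your HTTP library is rewriting things. Because the echo is a JSON object, it will not show you the original order or casing; check those with a local capture or a debugging proxy. Our cURL response headers guide covers deeper inspection tactics.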

Send Custom Headers in Python and Node.js

The pattern is the same in every HTTP client: build a dictionary of header names and values, attach it to the request, and reuse a session so cookies persist. The two language-specific sections below show the gotchas that bite at scale.

Python (requests and httpx)

With requests, pass a headers dict to get, and use a Session so cookies persist across calls:

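import requests

# Header set modeled on a real Chrome capture. Replace <current> with the
# Chrome major version you actually captured.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/<current> Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    # Keep br only if the brotli (or brotlicffi) package is installed;
    # without it, requests cannot decode a Brotli body.
    "Accept-Encoding": "gzip, deflate, br",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
}

# A Session reuses the connection and keeps cookies between calls.
with requests.Session() as s:
    s.headers.update(headers)
    # requests has no default timeout, so set one explicitly.
    r = s.get("https://httpbin.org/headers", timeout=10)
    r.raise_for_status()
    print(r.json())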

The catch: requests merges your dict into its own session defaults (User-Agent, Accept-Encoding, Accept, Connection), so the order on the wire can differ from the order you wrote, which is a known fingerprinting weakness. At scale, prefer httpx, which keeps the insertion order from your dict and supports HTTP/2 when you install the http2 extra and create the client with http2=True. Build the dict in the same order DevTools showed, and the wire image stays consistent.

Node.js (axios and fetch)

In Node.js, both axios and native fetch accept a headers object. Reuse an axios instance to share defaults, and use a cookie jar (axios-cookiejar-support or similar) so sessions survive across requests:

import axios from "axios";

// Shared instance: every request made through it sends this header set.
// Replace <current> with the Chrome major version you actually captured.
const client = axios.create({
  headers: {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) " +
                  "AppleWebKit/537.36 (KHTML, like Gecko) " +
                  "Chrome/<current> Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Site": "none",
    "Sec-Fetch-Mode": "navigate",
    "Sec-Fetch-Dest": "document",
  },
});

// Top-level await needs an ES module (.mjs or "type": "module").
const { data } = await client.get("https://httpbin.org/headers");
console.log(data);

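Watch out for axios defaults in Node.js such as User-Agent: axios/<version> and Accept: application/json, text/plain, */*, which leak the library. Override or delete them explicitly. X-Requested-With: XMLHttpRequest is a related tell: it belongs only on XHR replays where DevTools shows the site's own JavaScript sending it, never on a page navigation. Our axios headers deep-dive covers detection-avoidance patterns at the request layer.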

Header Order, Casing, and TLS Fingerprints

Header names are case-insensitive by RFC, but their casing and ordering are very much a fingerprint. Over HTTP/1.1, real Chrome sends User-Agent with that exact casing in a specific slot relative to Accept and Sec-Fetch-*; over HTTP/2, every header name is lowercase on the wire, so order is the part that still varies. Python requests keeps the casing you give it but merges your headers into its own defaults, so it does not guarantee your order; httpx preserves whatever you give it; axios is generally faithful but adds its own defaults if you do not strip them.
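To see the order and casing a client really emits over HTTP/1.1, you can point it at a throwaway local socket and read the raw bytes. The sketch below uses only the standard library, with http.client standing in for whichever client you want to test; swap the body of send_with_http_client for your own call. It covers plain HTTP/1.1 only and says nothing about HTTP/2 or TLS.

```python
import http.client
import socket
import threading


def capture_raw_request(send_request):
    """Run send_request(port) against a one-shot local server; return raw bytes."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    port = server.getsockname()[1]
    captured = {}

    def serve():
        conn, _ = server.accept()
        with conn:
            data = b""
            while b"\r\n\r\n" not in data:  # read until the end of the headers
                chunk = conn.recv(4096)
                if not chunk:
                    break
                data += chunk
            captured["raw"] = data
            conn.sendall(
                b"HTTP/1.1 200 OK\r\nContent-Length: 0\r\nConnection: close\r\n\r\n"
            )

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    try:
        send_request(port)
    finally:
        thread.join(timeout=5)
        server.close()
    return captured["raw"]


def header_names_on_wire(raw):
    """Header names in the exact order and casing they were sent."""
    lines = raw.split(b"\r\n\r\n", 1)[0].split(b"\r\n")[1:]  # skip request line
    return [line.split(b":", 1)[0].decode("ascii") for line in lines]


def send_with_http_client(port):
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/", headers={
        "User-Agent": "Mozilla/5.0 (placeholder)",
        "Accept": "text/html",
        "Accept-Language": "en-US,en;q=0.9",
    })
    conn.getresponse().read()
    conn.close()


names = header_names_on_wire(capture_raw_request(send_with_http_client))
print(names)
```

Any header in the printed list that you did not set yourself was added by the library, which makes this a quick way to spot injected defaults as well.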

That is only half the picture. Even with a perfect header set, your TLS handshake is its own fingerprint. JA3 and the newer JA4 hash the cipher suites, extensions, and elliptic curves your client offers. A Python TLS stack claiming to be Chrome but offering OpenSSL's cipher order is an obvious lie; see tlsfingerprint.io for how detectable that is.

Mitigations: use an HTTP/2-capable client with realistic TLS settings (curl_cffi, tls-client in Python; undici with custom TLS in Node), or escalate to a stealth proxy or managed API that owns the TLS layer for you.

Rotate and Refresh Header Sets at Scale

At scale, header rotation works at the session level, not the request level. Hold one realistic header set for the lifetime of its cookies, and only swap when you start a new identity. Swapping User-Agent mid-session is itself a detection signal.

Build a pool of hundreds of real header sets from current Chrome, Firefox, Edge, and Safari builds. Refresh it as browsers update, otherwise you stand out for using last year's UA. Pair header rotation with proxy rotation so neither IP nor headers alone become a stable key. Our guide to avoiding IP bans and the Python requests proxy walkthrough cover the proxy side.
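Session-level rotation can be sketched as a small pool that hands out one header set per session and keeps returning it until the session ends. The class name, the session identifiers, and the two placeholder profiles below are all hypothetical; a real pool would hold full captured header sets.

```python
import random


class HeaderPool:
    """Assigns one header set per session and keeps it until the session ends."""

    def __init__(self, header_sets, seed=None):
        if not header_sets:
            raise ValueError("header_sets must not be empty")
        self._header_sets = list(header_sets)
        self._rng = random.Random(seed)
        self._assigned = {}

    def headers_for(self, session_id):
        # Pick once per session, then return the same set on every call.
        if session_id not in self._assigned:
            self._assigned[session_id] = self._rng.choice(self._header_sets)
        # Return a copy so callers cannot mutate the pool's entry.
        return dict(self._assigned[session_id])

    def end_session(self, session_id):
        # Forget the identity; the next session with this id gets a fresh pick.
        self._assigned.pop(session_id, None)


# Placeholder profiles; real entries would be complete browser captures.
pool = HeaderPool(
    [{"User-Agent": "profile-a"}, {"User-Agent": "profile-b"}],
    seed=1,
)
first = pool.headers_for("session-1")
print(first == pool.headers_for("session-1"))
```

Keying the proxy assignment on the same session identifier keeps the IP and the header set changing together.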

Default Library Headers to Strip Before You Send

Before any request goes out, audit what your library is sending without being asked. Common giveaways:

  • User-Agent: python-requests/<version> or the axios default UA
  • X-Requested-With: XMLHttpRequest on a page navigation, where no browser sends it
  • Accept: */* instead of a real MIME list
  • Connection: close when real browsers keep-alive
  • Proxy-injected Via, X-Forwarded-For, and Forwarded

Replace the first four with realistic values, and test your proxy against httpbin.org/headers to catch the last group.
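The audit above can be automated with a short check that runs on the header dict before sending, or on the echoed headers when testing a proxy. The patterns and labels below are our own rough choices rather than an exhaustive list.

```python
import re

# User-Agent prefixes that identify common HTTP libraries.
LIBRARY_UA = re.compile(
    r"^(python-requests|python-httpx|axios|node-fetch|curl|Go-http-client)/",
    re.IGNORECASE,
)
PROXY_HEADERS = {"via", "x-forwarded-for", "forwarded"}


def audit_headers(headers):
    """Return a list of giveaways found in an outgoing header dict."""
    lowered = {name.lower(): value for name, value in headers.items()}
    findings = []
    if LIBRARY_UA.match(lowered.get("user-agent", "")):
        findings.append("library User-Agent")
    if "x-requested-with" in lowered:
        # Legitimate on some XHR replays, suspicious on page navigations.
        findings.append("X-Requested-With present")
    if lowered.get("accept", "").strip() == "*/*":
        findings.append("generic Accept")
    if lowered.get("connection", "").lower() == "close":
        findings.append("Connection: close")
    for name in sorted(PROXY_HEADERS & lowered.keys()):
        findings.append(f"proxy header: {name}")
    return findings


# What python-requests sends when you set nothing yourself.
default_requests = {
    "User-Agent": "python-requests/2.x",
    "Accept-Encoding": "gzip, deflate",
    "Accept": "*/*",
    "Connection": "keep-alive",
}
print(audit_headers(default_requests))
```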

Debug Header-Based Blocks: A Checklist

The debug loop for HTTP headers web scraping is short. When you get 403, 429, or an empty body the browser renders fine, work this list in order:

  1. Diff the headers. Compare your DevTools capture against httpbin.org/headers line by line.
  2. Swap the User-Agent to a different current Chrome or Firefox string and retry.
  3. Check Accept-Encoding. If you advertise br, confirm your client decompresses it; otherwise drop it.
  4. Verify cookies. Confirm Cookie matches the session you warmed up.
  5. Replay with cURL from Copy as cURL. If cURL works and your code does not, the diff is in your client. If cURL also fails, it is IP or TLS, not headers.
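Step 1 of the checklist is easier with a small diff helper than by eye. The sketch below compares names case-insensitively and reports what the scraper is missing, what it adds, and which values differ; both sample dicts are made up for illustration.

```python
def diff_headers(browser, scraper):
    """Compare two header dicts by lowercase name."""
    b = {name.lower(): value for name, value in browser.items()}
    s = {name.lower(): value for name, value in scraper.items()}
    return {
        "missing": sorted(b.keys() - s.keys()),    # browser sends, scraper does not
        "extra": sorted(s.keys() - b.keys()),      # scraper sends, browser does not
        "different": sorted(
            name for name in b.keys() & s.keys() if b[name] != s[name]
        ),
    }


# Hypothetical captures.
browser_capture = {
    "User-Agent": "Mozilla/5.0 (placeholder)",
    "Accept-Language": "en-US,en;q=0.9",
    "Sec-Fetch-Mode": "navigate",
}
scraper_echo = {
    "user-agent": "python-requests/2.x",
    "accept-language": "en-US,en;q=0.9",
    "X-Forwarded-For": "203.0.113.7",
}
report = diff_headers(browser_capture, scraper_echo)
print(report)
```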

When to Switch from Manual Headers to a Managed Scraping API

Manual HTTP headers web scraping has a ceiling. Escalate when you hit any of these:

  • TLS or JA3/JA4 blocks that survive perfect headers. The fix is below HTTP.
  • Concurrency above a few hundred sessions, where fresh UA pools and cookie jars become their own service.
  • Rotation cost in engineer-hours that exceeds what a managed endpoint charges per success.
  • Hard targets behind enterprise bot management that pin sessions to a full browser fingerprint.

A managed Scraper API or stealth proxy owns headers, TLS, IPs, and retries behind one endpoint, so your code keeps the parsing logic. Our 2026 guide on web scraping without getting blocked covers the full escalation path.

Key Takeaways

  • Default library headers are the loudest tell. Replace python-requests/x.y UAs, axios defaults, and X-Requested-With before anything else.
  • Capture, do not invent. Pull header sets from DevTools or Copy as cURL, strip session-specific values, and validate against httpbin.org/headers.
  • Internal consistency beats sheer realism. User-Agent, Sec-CH-UA, Sec-CH-UA-Platform, and Accept-Language must all describe the same browser, OS, and geo.
  • Order and TLS matter as much as values. Prefer httpx over requests in Python, and use a TLS-faithful client (or a managed API) when JA3/JA4 fingerprinting is in play.
  • Rotate by session, not by request. Hold a header set for the cookie lifetime, refresh the pool as browsers update, and pair it with proxy rotation.

FAQ

Why do my requests still get blocked after I copied the exact headers from Chrome?

Usually because the headers are not the only signal. Your TLS handshake (JA3/JA4), HTTP/2 frame ordering, IP reputation, and missing session cookies all fingerprint your client independently. Even byte-perfect headers fail when the TLS stack underneath says "Python" instead of "Chrome." Replay the request with cURL: if cURL succeeds and your code does not, the difference lies in your client (header order, HTTP version, or TLS stack) rather than in the header values you copied.

How often should I rotate User-Agent strings and full header sets across requests?

Rotate per session, not per request. A real visitor keeps the same browser for an entire visit, so a session that swaps User-Agent mid-flight is itself suspicious. Hold one header set for the lifetime of its cookie jar, then pick a new one for the next session. Refresh the underlying pool every few weeks as Chrome and Firefox ship new stable versions.

Do I still need to set HTTP headers if I am using a headless browser like Playwright or Puppeteer?

Yes, partially. A headless browser sends realistic headers automatically, including Sec-CH-UA and Sec-Fetch-*, so you can skip most manual work. You still need to override the headless-mode User-Agent (which often includes HeadlessChrome), set a plausible Accept-Language for your proxy geo, and disable the navigator.webdriver flag through a stealth plugin or launch flag.

Can a managed scraping API or smart proxy handle HTTP headers automatically for me?

Yes. Managed scraping APIs and stealth proxy networks pick a realistic header set per target, match it to a residential IP and TLS profile, and rotate everything in lockstep. You send the target URL, they return the HTML or JSON. The tradeoff is per-request cost versus the engineering time of building and maintaining your own header, proxy, and fingerprinting stack.

How can I tell whether a block is caused by headers, by IP, or by a TLS fingerprint?

Isolate one variable at a time. First, replay your exact request with cURL (for example, from Copy as cURL) on the same machine or through the same proxy, so the IP stays constant; if cURL succeeds, the issue is in your HTTP client's headers or order. Next, swap to a different residential IP with the same headers; if the block clears, the IP was flagged. If neither helps, the TLS handshake is the most likely culprit.
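The same decision order can be written down as a tiny triage function, which is handy when you log block diagnoses across many targets. The function name and the returned labels are ours, and the result is a first guess rather than a verdict.

```python
def triage_block(curl_same_ip_ok, new_ip_same_headers_ok):
    """Map the two isolation tests to the most likely cause of a block."""
    if curl_same_ip_ok:
        # Same IP, same header values, cURL works: the client differs.
        return "client headers or header order"
    if new_ip_same_headers_ok:
        # Same headers, different IP works: the original IP was flagged.
        return "IP reputation"
    return "TLS fingerprint (most likely)"


print(triage_block(curl_same_ip_ok=False, new_ip_same_headers_ok=True))
```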

Wrapping Up

HTTP headers web scraping is not the most glamorous part of building a scraper, but it is the part with the highest return on a small amount of work. Capture real browser headers, send them in the order the browser used, keep User-Agent, Sec-CH-UA, and Accept-Language internally consistent, rotate per session instead of per request, and strip the library defaults that scream automation. That set of habits clears most of the easy 403s and 429s before you even reach for a proxy.

Beyond that, manual tuning runs into a wall. JA3/JA4 fingerprints, enterprise bot management, and large concurrency all push the work below the header layer, and that is the right moment to stop hand-rolling and let a managed service handle it. If your stack is at that point, WebScrapingAPI's Scraper API manages headers, TLS profiles, residential IPs, and retries behind a single endpoint, so you keep the parsing logic and drop the fingerprinting arms race. Start there when manual headers stop being the cheapest option.

About the Author
Raluca Penciuc, Full-Stack Developer @ WebScrapingAPI

Raluca Penciuc is a Full Stack Developer at WebScrapingAPI, building scrapers, improving evasions, and finding reliable ways to reduce detection across target websites.
