Raluca Penciuc | Last updated on May 12, 2026 | 11 min read

How to Scrape Walmart.com: 2026 End-to-End Guide

TL;DR: This guide walks through how to web scrape Walmart product data end-to-end in Python, from parsing the hidden __NEXT_DATA__ JSON to scaling with proxies, retries, and async fetches. It also draws an honest line for when a managed scraper API beats DIY.

Introduction: scraping Walmart at scale in 2026

Walmart is the kind of catalog that justifies a full pricing or competitive-intelligence pipeline. If you only need one product, a 10-line script gets you there. If you need thousands of SKUs refreshed daily, the picture changes fast: hidden JSON, pagination caps, ZIP-aware pricing, and an anti-bot stack that looks at far more than your User-Agent. This 2026 tutorial covers how to web scrape Walmart product data the way teams actually run it in production, including the point where it makes sense to stop fighting the anti-bot layer and switch to a managed API instead.

Why scrape Walmart product data

Before getting into code, it helps to be clear about why people scrape Walmart in the first place. The use cases tend to cluster around a handful of jobs: pricing intelligence and MAP monitoring across resellers, catalog and category mapping, restock alerts, review-sentiment analysis, and competitor SKU coverage. Walmart's first-party listings and third-party marketplace sellers share the same product-page schema, which makes the dataset particularly useful for understanding how an entire category prices and ships in real time. Beyond titles and prices, product pages also expose ratings, review counts, variant matrices, seller info, and per-ZIP fulfillment data: the fields that actually feed pricing models.

Quick disclaimer first: this is general guidance, not legal advice. Public product data on walmart.com is generally considered fair game when collected at slow, respectful rates that do not damage the service, but legal risk is jurisdiction-specific and contract-specific (Walmart's Terms of Use matter). Read walmart.com/robots.txt and respect its Disallow directives. Stay clear of anything that requires a sign-in or that contains personal data, including reviewer email addresses, order numbers, and payment details. GDPR and CCPA constrain how you handle PII even if it is technically reachable. A safe default: stick to product, price, review, and stock fields, throttle aggressively, and consult a lawyer before any commercial deployment. (If you want a deeper read, our broader explainer on whether web scraping is legal covers the case law.)

Tools and project setup

You need Python 3.11+ and a clean virtual environment. The minimum kit:

python -m venv .venv && source .venv/bin/activate
pip install requests httpx beautifulsoup4 pandas pyarrow loguru

  • requests (or httpx if you want async): the HTTP client
  • beautifulsoup4: HTML parsing for the visible DOM
  • pandas: tabular export and pd.json_normalize for nested JSON
  • pyarrow: the Parquet engine pandas needs for df.to_parquet
  • loguru: structured logs that survive long runs

A reasonable folder layout:

walmart-scraper/
├── walmart/
│   ├── fetch.py        # request + retry layer
│   ├── parse.py        # __NEXT_DATA__ extractor
│   ├── discover.py     # sitemap + search crawler
│   └── scaler.py       # async runner
├── data/
└── main.py

How Walmart serves product data: HTML shell + __NEXT_DATA__ JSON

Walmart.com is a Next.js application. When you request a product page, the server returns a minimal HTML shell plus a <script id="__NEXT_DATA__"> element that carries the entire pre-rendered state of the page as JSON. The browser then hydrates that state into the React tree you see; CSS selectors only catch what survives hydration, which on Walmart is a small subset of the underlying record.

That matters because most "my Walmart scraper broke" tickets come from CSS-only scrapers chasing class names that change with every release. The structured JSON in __NEXT_DATA__ is far more stable: it carries the canonical product object including price tiers, variants, ratings, sellers, fulfillment options, and ZIP-aware availability. Treat it as the primary parse target, and the rendered DOM as a fallback for fields you cannot find in the JSON.

Step 1: how to web scrape Walmart product pages with Python

Start small. Pick one canonical URL of the form https://www.walmart.com/ip/<slug>/<id> and get a single GET working before you touch concurrency or proxies.

import requests

URL = "https://www.walmart.com/ip/AT-T-iPhone-14-128GB-Midnight/1756765288"
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

resp = requests.get(URL, headers=HEADERS, timeout=20)
print(resp.status_code, len(resp.text))

Two responses are interesting here: a 200 with the product HTML, or a 200 with a "Robot or human?" interstitial. The interstitial is a soft block, not a 4xx, so always check the body, not just the status code. If the response contains the interstitial string or comes back unusually short (a few KB), treat the request as failed and back off.
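The body check described above can be sketched as a small helper. This is a minimal sketch, not a definitive detector: the function name looks_blocked and the 10 KB threshold are our own assumptions, so tune the threshold against real product pages before relying on it.

```python
BLOCK_MARKER = "Robot or human?"  # interstitial text mentioned above
MIN_BODY_BYTES = 10_000           # assumed threshold, tune on real pages


def looks_blocked(status_code: int, body: str) -> bool:
    """Return True when a response should be treated as a failed fetch."""
    if status_code != 200:
        return True
    # The soft block arrives with a 200, so the body has to be inspected.
    if BLOCK_MARKER.lower() in body.lower():
        return True
    # Real product pages are large; a very short 200 is suspicious.
    return len(body.encode("utf-8")) < MIN_BODY_BYTES


# Usage with the response fetched above:
# if looks_blocked(resp.status_code, resp.text):
#     ...back off and retry later...
```

Keeping the check in one function means the same rule applies to every fetch path you add later.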

Send realistic browser-like headers

A bare-bones Python request gets flagged immediately. At minimum send a current Chrome or Firefox User-Agent, plus Accept, Accept-Language, Sec-Ch-Ua, and a credible Referer (a Google search result or the relevant Walmart category page). Keep a small pool of UA strings and rotate per session, not per request. Also be aware that Walmart inspects the TLS handshake (JA3/JA4 fingerprints) on top of headers, so a perfect header stack from requests can still fail because the underlying TLS profile screams "Python." Tools like curl_cffi help mimic a real browser fingerprint when this becomes the bottleneck.
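One way to keep the header stack coherent is to build it once per session from a small UA pool. The sketch below is illustrative only: the pool entries, the Sec-Ch-Ua value, and the function name build_session_headers are assumptions, so swap in strings captured from a current browser. It does nothing about the TLS fingerprint, which needs a tool like curl_cffi.

```python
import random

# Hypothetical UA pool; replace with strings from browsers you have verified.
UA_POOL = [
    ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
     "AppleWebKit/537.36 (KHTML, like Gecko) "
     "Chrome/124.0.0.0 Safari/537.36"),
    ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
     "AppleWebKit/537.36 (KHTML, like Gecko) "
     "Chrome/124.0.0.0 Safari/537.36"),
]


def build_session_headers(referer="https://www.google.com/", rng=random):
    """Pick one UA and return a header set that is consistent with it."""
    ua = rng.choice(UA_POOL)
    # Keep the client-hint platform aligned with the chosen User-Agent.
    platform = "Windows" if "Windows" in ua else "macOS"
    return {
        "User-Agent": ua,
        "Accept": ("text/html,application/xhtml+xml,"
                   "application/xml;q=0.9,*/*;q=0.8"),
        "Accept-Language": "en-US,en;q=0.9",
        # Illustrative value; copy the real one from your browser's dev tools.
        "Sec-Ch-Ua": '"Chromium";v="124", "Google Chrome";v="124"',
        "Sec-Ch-Ua-Mobile": "?0",
        "Sec-Ch-Ua-Platform": f'"{platform}"',
        "Referer": referer,
    }
```

Call it once when you open a session and reuse the result for every request in that session, which matches the per-session rotation advice above.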

Step 2: parse core fields with BeautifulSoup

For quick checks and for fields the embedded JSON does not expose, BeautifulSoup is plenty.

from bs4 import BeautifulSoup

soup = BeautifulSoup(resp.text, "html.parser")

title = soup.find("h1", attrs={"itemprop": "name"})
price = soup.find("span", attrs={"itemprop": "price"})
images = [
    img.get("src")
    for img in soup.select("img[loading='lazy']")
    if img.get("src")
]

print(title.get_text(strip=True) if title else None)
print(price.get_text(strip=True) if price else None)

This works for the title and the visible price, but it is brittle. Walmart frequently swaps itemprop markup for utility classes, and the visible price often differs from the canonical price (subscriber pricing, rollback, store-local). Use this as a sanity layer, and treat the __NEXT_DATA__ JSON we extract next as the source of truth. If you want a deeper grounding in the BeautifulSoup query patterns used here, our dedicated BeautifulSoup tutorial is a solid companion read.

Pull the rich payload from the __NEXT_DATA__ script tag

If you are serious about scraping Walmart, make __NEXT_DATA__ the first stop. The full product record lives inside that single script tag.

import json

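raw = soup.find("script", id="__NEXT_DATA__")
if raw is None or not raw.string:
    # A missing tag usually means a block page or a layout change
    raise ValueError("__NEXT_DATA__ script tag not found")
payload = json.loads(raw.string)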

product = (
    payload["props"]["pageProps"]
    ["initialData"]["data"]["product"]
)

print(product["name"])
print(product["priceInfo"]["currentPrice"]["price"])
print(product["averageRating"], product["numberOfReviews"])

You now have one dictionary with name, price tiers, brand, model, image gallery, descriptions, average rating, review count, seller block, and a fulfillment node. Walk it once with pprint, then write the keys you actually need into a small extractor function. Wrap the indexed access in try/except KeyError, because Walmart re-shapes the tree (initialData.data.product versus initialData.data.contentLayout) without warning.
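The "small extractor function" could look roughly like this. It is a sketch under the assumption that the key path shown above is still current; the function name extract_product and the choice of output fields are ours.

```python
from typing import Optional


def extract_product(payload: dict) -> Optional[dict]:
    """Pull a few core fields out of a parsed __NEXT_DATA__ payload.

    Returns None when the expected path is missing, so the caller can
    save the raw HTML and investigate instead of crashing mid-run.
    """
    try:
        product = payload["props"]["pageProps"]["initialData"]["data"]["product"]
    except (KeyError, TypeError):
        return None

    # Nested nodes may be absent or null, so fall back to empty dicts.
    price_info = product.get("priceInfo") or {}
    current = price_info.get("currentPrice") or {}
    return {
        "name": product.get("name"),
        "price": current.get("price"),
        "rating": product.get("averageRating"),
        "review_count": product.get("numberOfReviews"),
    }
```

Returning None rather than raising keeps a reshaped page from stopping a long crawl, while still giving you a clear signal to log.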

Step 3: scale beyond a single product

One URL teaches you the parser; production needs many. The lightweight pattern is httpx.AsyncClient with a bounded asyncio.Semaphore (start at 5 to 10 concurrent), a 1 to 3 second jitter between requests, and per-host session reuse so cookies stick around. Keep concurrency conservative: Walmart prefers steady, slow callers over bursts. Put fetch and parse on separate task groups so a parsing exception does not kill the fetch loop. The same pattern shows up in our Amazon scraping walkthrough if you want a side-by-side reference for another large catalog.
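A minimal sketch of the bounded-concurrency pattern follows. To keep it runnable without network access, the HTTP call is passed in as a fetch coroutine; the function name crawl and its parameters are our own, and the commented lines show how an httpx.AsyncClient could be plugged in.

```python
import asyncio
import random


async def crawl(urls, fetch, concurrency=5, min_delay=1.0, max_delay=3.0):
    """Fetch urls with bounded concurrency and a jittered pause per request.

    `fetch` is any coroutine function that takes a URL and returns a body.
    Returns a dict mapping each URL to its body, or to the exception raised.
    """
    semaphore = asyncio.Semaphore(concurrency)
    results = {}

    async def worker(url):
        async with semaphore:
            # Jitter spreads requests out instead of sending them in bursts.
            await asyncio.sleep(random.uniform(min_delay, max_delay))
            try:
                results[url] = await fetch(url)
            except Exception as exc:  # one bad URL must not stop the run
                results[url] = exc

    await asyncio.gather(*(worker(u) for u in urls))
    return results


# Possible wiring with httpx (not executed here):
# async with httpx.AsyncClient(headers=HEADERS, timeout=20) as client:
#     async def fetch(url):
#         return (await client.get(url)).text
#     pages = await crawl(product_urls, fetch)
```

Parsing stays outside crawl on purpose: you run the parser over the returned bodies afterwards, so a parsing exception never touches the fetch loop.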

Discover product URLs via sitemaps and the search endpoint

You discover URLs in two complementary ways. First, walmart.com/robots.txt lists sitemap indexes; the category sitemap is the densest, with millions of /ip/ URLs grouped by department. Pull the index, fetch each child sitemap, and feed the URLs into your queue. Second, the on-site search endpoint accepts query parameters such as q, page, sort, and a long list of facets. The HTML response carries a JSON payload with the product list, so parse that JSON instead of scraping rendered cards. Combine the two: sitemaps for breadth, search for ranking-aware coverage of a specific category. (Our ultimate Walmart guide goes deeper on sitemap topology if you need a reference map.)
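The sitemap half of discovery needs only the standard library. The sketch below assumes the files follow the standard sitemaps.org schema; the function names are ours, and fetching is left out so the parsing logic stays testable.

```python
import xml.etree.ElementTree as ET

# Namespace used by the standard sitemap protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls_from_robots(robots_txt: str) -> list:
    """Return every URL declared on a 'Sitemap:' line of robots.txt."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, since the URL contains colons too.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


def extract_locs(xml_text: str) -> list:
    """Return all <loc> values from a sitemap index or a URL set."""
    root = ET.fromstring(xml_text)
    return [
        el.text.strip()
        for el in root.iterfind(".//sm:loc", SITEMAP_NS)
        if el.text
    ]
```

Run extract_locs on the index first to get child sitemap URLs, then again on each child to get the product URLs for your queue.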

Handle Walmart pagination and the 25-page cap

According to publicly reported testing, Walmart caps a single search query at roughly 25 result pages, regardless of how many results actually match. Re-test before you trust any specific number, since the cap has shifted over the years. The workaround is segmentation: split a broad query by category, brand facet, price band, and condition, then hit each segment under the per-query ceiling. Reverse-sorting (sort=price_high plus sort=price_low) and combining facets can roughly double reachable coverage to about 50 pages or 2,000 products per query, again per third-party tests. Plan your discovery as a tree of narrow queries, not a flat page-number loop.
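To make the "tree of narrow queries" concrete, here is a sketch that expands one broad query into price-band and sort segments. The q, page, and sort names come from the text above; min_price and max_price are assumed parameter names, so confirm them against a real search URL before use.

```python
from itertools import product as cartesian


def build_segments(query, price_bands,
                   sorts=("price_low", "price_high"), max_pages=25):
    """Expand one broad query into narrow (band, sort, page) requests.

    price_bands is a list of (low, high) tuples. min_price and max_price
    are assumed facet names, not confirmed Walmart parameters.
    """
    segments = []
    for (low, high), sort in cartesian(price_bands, sorts):
        # Each band/sort pair gets its own run of pages under the cap.
        for page in range(1, max_pages + 1):
            segments.append({
                "q": query,
                "min_price": low,
                "max_price": high,
                "sort": sort,
                "page": page,
            })
    return segments


# Example: two bands, two sorts, three pages each gives twelve requests.
# segments = build_segments("laptop", [(0, 25), (25, 50)], max_pages=3)
```

Deduplicate on product ID afterwards, because ascending and descending sorts overlap whenever a segment has fewer results than the cap.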

Capture reviews, variants, and fulfillment data

Once you can parse __NEXT_DATA__, the high-value fields are right there. Reviews and aggregate ratings sit under product.idmlInfo and product.reviews; the per-review array, when present, includes star count, title, body, and verified-purchase flag. Variants live under product.variantsMap keyed by SKU, with attributes like color and capacity. Fulfillment lives under product.fulfillmentOptions, with shipping ETAs, in-store pickup eligibility, and a storeId field. Pricing and availability vary by ZIP, so set the assortmentStoreId and customer-zip cookies before each request to lock in localized data.

# "or" fallbacks also cover keys that are present but set to null
reviews = (product.get("reviews") or {}).get("customerReviews") or []
variants = product.get("variantsMap") or {}
fulfillment = product.get("fulfillmentOptions") or []

Export results to CSV or JSON with pandas

Once you have a list of product dicts, pandas does the rest:

import pandas as pd

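# records: the list of product dicts built by your extractor
df = pd.json_normalize(records, sep="_")
df.to_csv("walmart_products.csv", index=False)
# to_parquet needs a Parquet engine such as pyarrow installed
df.to_parquet("walmart_products.parquet", index=False)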

json_normalize flattens nested keys into columns like priceInfo_currentPrice_price, which is friendly for SQL. Write reviews and variants to separate tables with the parent product ID as a foreign key, since flattening one-to-many fields into a single row almost always bites you later.
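Splitting the one-to-many fields can be done in plain Python before anything reaches pandas. In this sketch the usItemId key used as the product ID is an assumption, as is the function name split_tables; the reviews and variantsMap shapes follow the snippets earlier in this guide.

```python
def split_tables(products, id_key="usItemId"):
    """Split product dicts into product, review, and variant rows.

    id_key is an assumed name for the product ID field. Child rows carry
    it as product_id so they can be joined back to the parent table.
    """
    product_rows, review_rows, variant_rows = [], [], []
    for p in products:
        pid = p.get(id_key)
        reviews = (p.get("reviews") or {}).get("customerReviews") or []
        variants = p.get("variantsMap") or {}

        # The parent row keeps everything except the one-to-many fields.
        product_rows.append(
            {k: v for k, v in p.items() if k not in ("reviews", "variantsMap")}
        )
        for review in reviews:
            review_rows.append({"product_id": pid, **review})
        for sku, attrs in variants.items():
            variant_rows.append({"product_id": pid, "sku": sku, **attrs})
    return product_rows, review_rows, variant_rows
```

Each returned list can then go through pd.json_normalize or pd.DataFrame and be written to its own file.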

Bypass blocks: proxies, rotation, retries, and backoff

Most teams that scrape Walmart on a recurring basis run a layered anti-block stack with four moving parts.

  • Proxies. Residential IPs are nearly indistinguishable from regular Walmart shoppers; datacenter IPs get flagged at scale. Provider-published success rates are marketing numbers, so benchmark on your own URLs before you commit.
  • Rotation cadence. Rotate per session for crawl-style discovery, per request for high-volume monitoring. Keep sessions sticky for at least the duration of a multi-step flow (search, product, reviews) so cookies stay coherent.
  • Retries with exponential backoff. On a 403 or a 429 (Too Many Requests, the status code defined in RFC 6585), wait 2^n + jitter seconds, where n is the attempt number, for up to 5 attempts before parking the URL for a later run.
  • Header and cookie discipline. Rotate UA pools alongside IPs and persist cookies inside a requests.Session() so Walmart sees one coherent visitor.

proxies = {
    "http":  "http://USER:PASS@gate.example.com:7777",
    "https": "http://USER:PASS@gate.example.com:7777",
}
resp = requests.get(URL, headers=HEADERS, proxies=proxies, timeout=20)

A managed proxy pool with sticky sessions saves more time than DIY rotation once you cross a few hundred pages a day. Our deeper guide to rotating proxies in Python covers the exact rotation patterns we have seen survive Walmart's anti-bot updates.
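The retry rule from the list above (2^n plus jitter, up to 5 attempts on 403 or 429) can be sketched as follows. The request itself is passed in as a send callable so the logic stays independent of any HTTP client; the names fetch_with_retry and backoff_delay are ours, and attempts are counted from zero.

```python
import random
import time

RETRY_STATUSES = {403, 429}


def backoff_delay(attempt, base=2.0, max_jitter=1.0):
    """Seconds to wait after a failed attempt: base**attempt plus jitter."""
    return base ** attempt + random.uniform(0, max_jitter)


def fetch_with_retry(send, url, max_attempts=5, sleep=time.sleep):
    """Call send(url) until it succeeds or attempts run out.

    `send` must return a (status_code, body) tuple. Returns the body, or
    None when every attempt was blocked and the URL should be parked.
    """
    for attempt in range(max_attempts):
        status, body = send(url)
        if status not in RETRY_STATUSES:
            return body
        # No point sleeping after the final attempt.
        if attempt < max_attempts - 1:
            sleep(backoff_delay(attempt))
    return None


# Possible wiring with requests (not executed here):
# def send(url):
#     r = session.get(url, headers=HEADERS, proxies=proxies, timeout=20)
#     return r.status_code, r.text
```

A None result is the signal to push the URL onto a later-run queue rather than retrying it in a tight loop.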

Use a Walmart scraper API for production workloads

At a few thousand pages a day, the math flips. You spend more on engineering hours patching JA3 fingerprints, refreshing UA pools, and chasing layout changes than you would on a managed endpoint. A Walmart scraper API like the WebScrapingAPI Scraper API handles the proxy network, header stack, JavaScript rendering, and CAPTCHA solving behind one URL and bills you only for successful responses. You keep your __NEXT_DATA__ parsing code; you only swap the fetch layer. If you need login flows or interactive crawling (clicking through size variants, expanding review pages), a hosted Browser API extends the same model to a remote Chrome you script with Puppeteer or Playwright.

Common pitfalls and troubleshooting

Save the raw HTML for every failed parse, then diff key paths week over week, because Walmart shifts JSON keys quietly. If prices look off, check the ZIP cookie. If review arrays come back empty, you are probably blocked, not finished. Always log the response length: a 4 KB "Robot or human?" page is your earliest signal that something has changed.
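Diffing key paths is easy to automate once you have two parsed payloads, for example last week's and today's. The helper names key_paths and diff_paths below are our own; the sketch compares structure only and ignores values.

```python
def key_paths(obj, prefix=""):
    """Return the set of dotted key paths found in nested dicts and lists."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            paths.add(path)
            paths |= key_paths(value, path)
    elif isinstance(obj, list):
        # All list items share one "[]" marker so indexes do not add noise.
        for item in obj:
            paths |= key_paths(item, f"{prefix}[]")
    return paths


def diff_paths(old, new):
    """Report key paths that disappeared or appeared between two payloads."""
    old_paths, new_paths = key_paths(old), key_paths(new)
    return {
        "removed": sorted(old_paths - new_paths),
        "added": sorted(new_paths - old_paths),
    }
```

A non-empty "removed" list for a path your extractor reads is the cue to update the parser before the next scheduled run.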

Wrapping up and next steps

You now have a complete blueprint for how to web scrape Walmart end to end. Pick the smallest piece you do not already have (sitemap discovery, retries, the JSON parser) and ship that next.

Key Takeaways

  • Treat the __NEXT_DATA__ JSON, not the rendered DOM, as your primary parse target on Walmart product and search pages.
  • Discovery is two systems combined: walmart.com/robots.txt sitemaps for catalog breadth, the search endpoint for ranking-aware coverage.
  • Walmart's reported ~25-page search cap is solved by segmenting queries by category, facet, price band, and reverse sort.
  • A real anti-block stack is layered: residential proxies, rotation cadence, exponential backoff on 403/429, and coherent session cookies.
  • Once your daily volume crosses a few thousand pages, a managed scraper API usually wins on total cost over DIY anti-bot maintenance.

FAQ

Is it legal to scrape Walmart?

Generally yes for public product, price, and review data, with caveats. US case law (notably hiQ v. LinkedIn) has indicated that scraping public web data is not automatically a CFAA violation, but Walmart's Terms of Use, copyright on review text, and laws like GDPR and CCPA still apply. Avoid logged-in pages and personal data, throttle politely, and ask a lawyer before commercial use.

Does Walmart offer a public product API I can use instead of scraping?

Walmart runs an Affiliate API and a Marketplace Seller API, but neither is a general-purpose product-data API for the public. The Affiliate API is gated by program approval and limited in fields and rate, and the Marketplace API only exposes data for items you sell on Walmart yourself. For broad catalog, pricing, and review coverage, scraping is the practical option at the time of writing.

Why does my Walmart scraper get a 'Robot or human?' captcha page?

That page is Walmart's anti-bot challenge, triggered when one or more signals look bot-like: a datacenter IP, a Python TLS fingerprint, a missing Sec-Ch-Ua header, an unusual request cadence, or no first-party cookies. It is a soft 200, not a 403, so check the response body. The fix is layered: residential IP, browser-grade TLS, full header stack, and request pacing.

Do I need Selenium or Playwright to scrape Walmart, or are requests and BeautifulSoup enough?

For most product, search, and review pages, plain requests plus BeautifulSoup is enough, because the data lives in the server-rendered __NEXT_DATA__ JSON. Reach for Playwright or Puppeteer only when you need to click through interactive elements (size pickers, lazy-loaded review pages) or when the anti-bot challenge requires a real JavaScript environment to pass.

How do I scrape thousands of Walmart products without hitting the 25-page search limit?

Stop thinking page numbers and start thinking segments. Split the query by category, brand, price band ($0 to 25, $25 to 50, and so on), department, and condition, so each individual query fits under the per-query cap. Combine ascending and descending sorts to widen each segment. Cross-reference results against the category sitemap to backfill anything segmentation missed.

Conclusion

Scraping Walmart in 2026 is a solvable engineering problem if you respect how the site is built. Parse the embedded __NEXT_DATA__ JSON instead of fighting class-name churn. Discover URLs through sitemaps and the search endpoint together, segment your queries to slip past the page cap, and harden the fetch layer with residential proxies, sticky sessions, and exponential backoff on 403 and 429. Export through pd.json_normalize so the downstream analytics layer is happy, and keep raw HTML around so you can diff key paths after Walmart's next quiet change.

The honest cutover comes at scale. If you are spending more on engineering hours patching anti-bot logic than the data is worth, that is your signal. Our WebScrapingAPI Scraper API takes over the fetch, proxy, and CAPTCHA layer behind one endpoint, so you keep the parser you just built and only pay for successful responses. Whichever path you choose, you now have the playbook to ship a working Walmart pipeline this week.

About the Author
Raluca Penciuc, Full-Stack Developer @ WebScrapingAPI

Raluca Penciuc is a Full Stack Developer at WebScrapingAPI, building scrapers, improving evasions, and finding reliable ways to reduce detection across target websites.
