The Ultimate Web Scraping Tips & Tricks List
Anda Miuțescu on Jun 15 2021
While web scraping can sound easy in theory, there are plenty of pitfalls the uninitiated developer can run into. Instead of brute-forcing it until you run out of unbanned proxies, I'd argue it's better to scrape smart: get the data you need and get out without ever being noticed.
We've prepared twelve tips for you. Use them and you'll see that all the Internet's data is just a few clicks away.
1. Plan before you scrape
Like any project, web scraping jobs go a lot easier if you devise a clear strategy before you even begin. First off, why do you need data? It may seem like an obvious question, but the answer is instrumental in determining what information you should gather.
Next, where are you going to get the info? The scraping tool should be well suited for the sites it has to go to, so examining your data sources will help you build or purchase the right program for the job.
Question three: how are you going to use the gathered information? You could process it yourself, use software or even send it down a complex pipeline. The answer will be your first step in deciding the structure and file format for the gathered data.
There are a lot of other questions and ideas you have to sort out, most of which heavily depend on what you want to achieve. One thing is certain, “measure twice, cut once” holds true for web scraping.
2. Act more human
If you want to know whether a website visitor is a human or a bot, you only have to look at how it behaves. Bots move lightning fast and never interact with the page unless otherwise instructed. As a result, they’re easy to spot and block.
To help the scraper avoid detection, you have to teach it to act more like a normal visitor, a human. The beauty here is that people act in all sorts of different ways, so you have a lot of freedom while coding. Here are some actions we suggest you add:
- Add random intervals of inaction, as if a human is reading the content on the page. A 5 to 10 second delay works just fine.
- Navigate pages in a tree-like pattern. If you’re scraping several child pages, always go through the parent page when moving on. It will imitate a person clicking on a page, then going back, then clicking on the next, and so on.
- Make the bot click on random things from time to time. That’s something all people do and not just me, right?
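The tree-like navigation pattern from the list above can be sketched in a few lines. This is a minimal illustration with hypothetical URLs: the visit order returns to the parent page between children, like a human clicking a link, hitting back, and clicking the next one.

```python
def tree_visit_order(parent_url, child_urls):
    """Build a visit sequence that returns to the parent page between
    child pages, imitating a human clicking back and forth."""
    order = [parent_url]
    for child in child_urls:
        order.append(child)
        order.append(parent_url)  # "back" to the parent before the next click
    return order

# Hypothetical example URLs:
order = tree_visit_order(
    "https://example.com/products",
    ["https://example.com/products/1", "https://example.com/products/2"],
)
```

A real scraper would fetch each URL in `order`, pausing a random 5 to 10 seconds between visits as suggested above.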
Anyway, the reason your bot has to act a certain way is because the website might be monitoring and logging its activity. But what if the website couldn’t keep track of the bot? Read the next point to get your answer.
3. Rotate your proxies
Using a proxy to make sure your real IP doesn’t get blocked is a bit of a no-brainer. So is getting a proxy from a specific geographical location to access area-restricted content. But proxies can do so much more for you with the right tools!
Right now, the tool you need is a server to rotate your proxy pool. With it, each request you send is allocated to a random IP within the pool and sent to the target. This way, you can scrape a website as much as you want and each request will look like it’s coming from a different place and person.
Furthermore, rotating proxies make sure that if an IP does get blocked, you won’t be stuck until you manually change proxies. A request may fail but the others won’t. A good tool will also retry any failed attempts, for example, WebScrapingAPI retries failed API calls to ensure there are no holes in your database.
For the best possible results, you’ll want to use residential rotating proxies. Residential IPs are least likely to be noticed or blocked and by rotating them, you make the scraper even harder to detect. Keep in mind that this can be overkill sometimes. If you’re not up against difficult anti-bot countermeasures, datacenter rotating proxies can do the job just as well for cheaper.
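The rotation logic itself is simple, which is part of why it's such an effective tool. Here's a bare-bones sketch of a rotating pool that drops blocked IPs; the proxy addresses are hypothetical placeholders, and a real tool would also retry the failed request through another proxy.

```python
import random

class ProxyRotator:
    """Rotate requests over a pool of proxies, dropping any that get blocked."""

    def __init__(self, proxies):
        self.pool = list(proxies)

    def pick(self):
        """Return a random proxy so each request looks like it comes
        from a different place."""
        if not self.pool:
            raise RuntimeError("proxy pool exhausted")
        return random.choice(self.pool)

    def mark_blocked(self, proxy):
        """Remove a blocked IP so only healthy proxies keep being used."""
        if proxy in self.pool:
            self.pool.remove(proxy)

# Placeholder proxy addresses for illustration:
rotator = ProxyRotator(["203.0.113.1:8080", "203.0.113.2:8080", "203.0.113.3:8080"])
proxy = rotator.pick()        # a random proxy for the next request
rotator.mark_blocked(proxy)   # pretend it got blocked: the pool shrinks
```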
4. Use a headless browser
A headless browser is the same as a normal one, except that it has no graphical user interface. To surf the web with one, you drive it programmatically, typically from a script or a command-line interface.
If you’re building a scraper from scratch, I recommend trying Puppeteer: it’s a Node.js library that lets you control a headless Chrome or Chromium instance programmatically, rendering JavaScript just like a regular browser would.
5. Cycle User-Agent headers
User-Agent is the name of an HTTP request header that tells the website you're visiting what browser and operating system you’re using. In a way, websites use this header to learn more about who is visiting them. It’s great for analytics and it’s incidentally useful for catching bots.
Here’s what a user agent string can look like:
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0
There are three main scenarios that you need to avoid:
- Having no user-agent at all. Most scrapers don’t send one, which is suspicious and a good way to announce that they’re bots. Ensure that you’re sending normal-looking headers with each request.
- Having an outdated user-agent. Browsers generally change the header with each new release, so if your list still advertises a browser version that’s long out of date, the website can tell that something’s fishy.
- Using the same header for each request. You could copy the user-agent from your actual browser, but then hundreds of requests coming from different IPs would all share the exact same fingerprint. It’s a huge giveaway.
Here’s a bonus tip on the subject: try using the Googlebot user-agent. Of course, any website wants to be indexed, so they leave Google’s crawlers to their business. A word of caution, though: their bots also have specific IPs and behavior, so the user-agent alone isn’t a guarantee of success.
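Avoiding all three scenarios comes down to keeping a fresh pool of user-agent strings and picking one at random per request. A minimal sketch, with illustrative strings that you'd want to refresh regularly so the browser versions don't go stale:

```python
import random

# A small pool of realistic user-agent strings. These exact strings are
# illustrative; refresh the list regularly so versions don't go stale.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0",
]

def random_headers():
    """Normal-looking request headers with a rotated User-Agent, so
    consecutive requests don't share the exact same fingerprint."""
    return {
        "User-Agent": random.choice(USER_AGENTS),
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    }

headers = random_headers()  # pass these with each request
```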
6. Add anti-CAPTCHA functionalities
If you’re scraping well, you should only rarely run into CAPTCHAs. Still, if you do, it’s a big annoyance, since one can stop your scraper in its tracks or pollute your results with false data (the HTML of the CAPTCHA page itself).
The anti-CAPTCHA battle plan has two parts: prevention and treatment. You should mainly focus on prevention as it’s by far the more efficient option. Here’s what you should do:
- Use proxies to make it look like your requests are coming from different sources.
- Switch up your request headers (especially the user-agent). Otherwise, you can still be detected despite using several IPs.
So, essentially, use tips #3, #4, and #5.
If the scraper still runs into problems, you’ll need a CAPTCHA solver. When choosing a web scraping tool, make sure that your pick has this feature built-in. Otherwise, it’s technically possible to integrate a solver yourself, but that means extra coding and extra wasted time. Just so you know, our API has this functionality, not that it runs into many CAPTCHAs in the first place.
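One small safeguard worth adding yourself: detect when a response is a CAPTCHA challenge page, so your scraper retries (with a fresh proxy and headers) instead of saving the junk HTML as real data. The marker strings below are a heuristic assumption and worth tuning per target site:

```python
# Illustrative markers; tune this list for the sites you actually target.
CAPTCHA_MARKERS = (
    "g-recaptcha",
    "h-captcha",
    "cf-challenge",
    "please verify you are a human",
)

def looks_like_captcha(html):
    """Heuristic check: does this response body look like a CAPTCHA
    challenge page rather than the real content?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)

# A real scraper would retry with a fresh proxy instead of storing this page:
blocked = looks_like_captcha('<div class="g-recaptcha" data-sitekey="x"></div>')
clean = looks_like_captcha("<html><body><h1>Product page</h1></body></html>")
```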
7. Make a URL list
When you start a scraping job, you’ll probably compile a list of URLs with valuable information. Here’s an idea: instead of just sending the URLs to the scraper, keep a list and mark all links you’ve already crawled. For clarity, you can also keep the scraped data with the URL.
The reason behind this is simple: if your computer crashes or some other unexpected event occurs, you’ll still know what data you already have, preventing useless re-scraping.
Our advice is to write a script for data extraction record keeping. Updating the list manually means a lot of busy work, and you won’t be able to keep up with the bot anyway.
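Such a record-keeping script can be very small. Here's a sketch that keeps each URL together with its scraped data in a JSON file (the file name is a hypothetical choice), so a crash never loses track of what was already collected:

```python
import json
from pathlib import Path

RECORD_FILE = Path("scrape_record.json")  # hypothetical file name

def load_record():
    """Load the URL -> scraped-data record, or start fresh."""
    if RECORD_FILE.exists():
        return json.loads(RECORD_FILE.read_text())
    return {}

def mark_scraped(record, url, data):
    """Mark a URL as done, keep its data alongside it, and persist
    immediately so a crash can't erase the progress."""
    record[url] = {"done": True, "data": data}
    RECORD_FILE.write_text(json.dumps(record))

record = load_record()
if "https://example.com/page1" not in record:   # skip already-crawled links
    mark_scraped(record, "https://example.com/page1", "<html>...</html>")
```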
8. Learn page structures
Gathering more data takes more time, naturally. So, a way to increase efficiency is to pinpoint exactly what information the scraper should grab. Easy in theory, but each website and webpage is unique. To reduce overhead and save time, you have to learn a few things about how your target pages are structured.
Here’s how you do it:
- Go to the page;
- Right-Click on the text you want and hit Inspect Element;
- Note how that information is nested, what classes it’s in, and under what tag;
- Look for structural patterns among other pages of interest. Chances are, you can create a script that gathers all the needed info from the whole website.
After going through these steps, it will be a lot easier for you to extract only the details you need. The benefit is that you’ll no longer have to deal with irrelevant HTML cluttering up your documents.
Understanding the layout is especially useful for scraping product information. Product pages on the same website will all be structured similarly, if not identically. Find the logic and you can extract and parse much faster.
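Once Inspect Element has told you which class holds the data, extraction can be done even with just Python's standard library. A minimal sketch, where the `price` class name is a hypothetical example of what you might find on a product page (it handles simple, well-nested markup; a production scraper would use a proper HTML library):

```python
from html.parser import HTMLParser

class ClassExtractor(HTMLParser):
    """Collect the text of every element carrying a given class attribute."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self._depth = 0        # >0 while inside a matched element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            self._depth += 1   # nested tag inside a matched element
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self._depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.results[-1] += data

parser = ClassExtractor("price")  # "price" is a hypothetical class name
parser.feed('<div><span class="price">$19.99</span>'
            '<span class="price">$4.50</span></div>')
```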
9. Add a delay between requests
The point of a web scraper is to gather data faster than a human could. We get that, but speed has an undesired side-effect: it shows clear as day that the requests are being sent by a bot.
Remember what we said about acting like a normal visitor: if the scraper stands out, it’s much more likely to be blocked. Not to worry though, all you have to do is add random delays when using the same IP for several concurrent or successive visits.
Remember: do this when using the same proxy. If you’re changing IP and headers after every request, the delay shouldn’t be necessary. If you’re logged in with an IP, though, you should stick to that one, which means you’ll also need the delays.
Make sure each delay is slightly different, so the pattern looks random. Something between 5 and 10 seconds should work nicely.
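In code, that amounts to a one-line pause between successive requests from the same IP. A sketch (the default bounds follow the 5 to 10 second suggestion above):

```python
import random
import time

def polite_sleep(low=5.0, high=10.0):
    """Pause for a random interval (5-10 s by default) between successive
    requests from the same IP, so the timing never looks machine-regular.
    Returns the chosen delay so it can be logged."""
    delay = random.uniform(low, high)
    time.sleep(delay)
    return delay
```

Call `polite_sleep()` between requests whenever you're reusing the same proxy; skip it when every request goes out through a different IP and header set.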
10. Cache important pages
In some cases, you’ll want to revisit pages you previously scraped in order to get another bit of information. Instead of doing that, cache the page on the first visit and you’ll have all the data saved already.
Even with the most advanced web scraper, there’s a chance the bot won’t get the data on the first try, and even if it does, revisiting the page later wastes effort. Just grab all the HTML in one go, and you can then extract any info you need from the saved version.
For example, you can cache a product page so that it’s always handy. If you need the product specifications today, but tomorrow you might want the price, the data is already collected, waiting to be processed.
Be mindful though, this works for static information! If you want stock prices, you’ll have to keep extracting fresh data, since the cached version will become outdated fast.
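The "fetch once, extract many times" idea is just a cache keyed by URL. A minimal sketch, where `fake_fetch` is a stand-in for a real HTTP request (via urllib or your scraping API of choice):

```python
# A minimal in-memory HTML cache: fetch each URL once, reuse the raw page
# for any later extraction. Only safe for static content, as noted above.
_cache = {}

def get_page(url, fetch):
    """Return cached HTML if we've seen the URL before; otherwise fetch
    it once and remember the result."""
    if url not in _cache:
        _cache[url] = fetch(url)
    return _cache[url]

calls = []
def fake_fetch(url):          # pretend network call, for illustration
    calls.append(url)
    return "<html>product page</html>"

first = get_page("https://example.com/p/42", fake_fetch)
second = get_page("https://example.com/p/42", fake_fetch)  # served from cache
```

With this in place, asking for the specifications today and the price tomorrow costs only one network round trip in total.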
11. Be careful when logging in
The data you need might be hidden behind a login page. Social media platforms come to mind. Sure, you can get a few scraps of content without an account, but it’s more laborious and you might want something that is visible only if you're part of a group or friends list. In short, you might need to log in and that comes with some problems.
All website users with an account have to agree to its Terms of Service. In these terms, there may be a clause that forbids the use of bots, automated tools, or web scrapers specifically. In this case, extracting data would clearly be against rules that the user agreed to.
Another point to keep in mind is that while websites may not pay much attention to unregistered visitors, they watch the cookies sent by logged-in users much more closely. So, in essence, more eyes will be on your bot. Clear bot behavior or telltale cookies are even more likely to get your scraper blocked.
What you should do:
- Carefully read the Terms of Service and make sure you’re not going against them.
- Ensure that you’re following all the other tips in this article, especially the ones about proxies, human behavior, JS rendering, and request headers.
12. Avoid causing damage to the website
Most web admins don’t like having scrapers on their websites. For some, bots are a minor annoyance, for others, they’re a major danger. The simple fact is that hackers and other malcontents use bots to cause problems and mischief, like crashing websites or trying to steal confidential data.
Even if your intentions are completely friendly, you may accidentally cause trouble. Boatloads of concurrent requests could bring down the server, so here are a few best practices to ensure you don’t leave mayhem in your wake:
- Slow down the rate of requests to avoid crashing the whole website;
- Read the robots.txt file, which explains what actions bots are allowed to take. It’s not a legally binding document, but it does express the wishes of the site owner.
- Be mindful of how you use scraped data. Taking content and reposting it, for example, is both damaging and illegal, as that content is protected under copyright law.
- Whenever possible, ask the owner for permission to gather information from the website.
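Python's standard library can read robots.txt for you, so respecting the owner's wishes takes only a few lines. A sketch, parsing an illustrative rule set directly (normally you'd point `set_url` at the site's live `/robots.txt` and call `read()`); the `MyScraper` agent name is a hypothetical example:

```python
from urllib.robotparser import RobotFileParser

# Parse a robots.txt body directly; the rules below are illustrative.
robots = RobotFileParser()
robots.parse([
    "User-agent: *",
    "Disallow: /admin/",
    "Crawl-delay: 5",
])

allowed = robots.can_fetch("MyScraper", "https://example.com/products")
forbidden = robots.can_fetch("MyScraper", "https://example.com/admin/users")
delay = robots.crawl_delay("MyScraper")  # the owner asks for 5 s between hits
```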
Through friendly and ethical actions, you can do your part to make sure that bots are seen as the useful tool they are, instead of some sort of digital marauders.
Bonus tip: choose the right web scraping tool
There is no perfect formula for web scraping, but some factors reliably lead to better results in less time. This article was written to cover the common concerns, the written and unwritten rules, and the best practices of the trade. A good API handles many of the everyday scraping headaches for you, which is precisely why our first trick will always be automation.
Scrape wisely and enjoy the results of your work with 1000 free calls from us!
While some of the tips mentioned above have to do with how you use the scraping tool, many can be integrated and automated by the software itself, letting you focus on your own tasks and objectives. That’s why we think that choosing the right program is just as important as all the tips we discussed, if not more so.
I honestly think that WebScrapingAPI is an excellent option, especially since you can try the API out with a free plan and see for yourself how it handles itself before you invest any money.
If you’d like to browse a bit, we’ve written a huge buyer’s guide featuring 20 web scraping tools, so check that out!