Mastering Web Scraping: How to Rotate Proxies in Python
Raluca Penciuc on Feb 03 2023
In today's digital age, the use of web scraping has become increasingly popular. However, with this increased popularity comes increased scrutiny from websites and services that don't want their data scraped.
If you're looking to automate your data extraction tasks, you may have come across the need to use proxies to mask your IP address. However, using the same proxy for a long period of time can easily get you blocked. This is where proxy rotation enters the picture.
In this article, we'll explore how to rotate proxies in Python, including installing the necessary prerequisites, using a single proxy, and increasing the speed while rotating the proxies. We'll also discuss some tips on how to make the most out of your proxy rotation efforts.
By the end of this guide, you'll be able to rotate proxies like a pro and ensure your web scraping efforts stay under the radar. Let's get started!
Setting up the environment
Before we begin, let's make sure we have the necessary tools in place. First, download and install the latest version of Python from the official website.
For this tutorial, we will be using Visual Studio Code as our Integrated Development Environment (IDE) but you can use any other IDE of your choice. Create a new folder for your project and a new file index.py, where we’ll write our code.
Now open the terminal, and run the following command to install the package necessary for sending requests:
pip install requests
And that’s all! We can start coding.
Proxify a GET request
Let’s begin by sending a simple GET request. In the newly created index.py file, write the following code:
import requests

# api.ipify.org responds with the public IP address that made the request
response = requests.get('https://api.ipify.org/?format=json')
print(response.text)
And now run the script using the following command (use python index.py if the py launcher is not available on your system):
py index.py
You’ll notice that the result is your real IP address and that no matter how many times you run the script, it will always show the same result.
The idea of this guide is to show you how you can hide your real IP address and obtain a different IP on every script run. This is where proxies come into the picture. They typically follow this structure:
protocol://proxy_username:proxy_password@proxy_host:proxy_port
The protocol can be “http”, “https”, “socks4”, or “socks5”, while “proxy_username” and “proxy_password” are optional.
The updated code should look like this:
import requests
proxy_schema = {
    "http": "http://proxy_username:proxy_password@proxy_host:proxy_port",
    "https": "https://proxy_username:proxy_password@proxy_host:proxy_port"
}
URL = 'https://api.ipify.org/?format=json'
response = requests.get(URL, proxies=proxy_schema)
print(response.text)
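A note on the SOCKS variants: the requests library supports them only when the PySocks extra is installed (pip install requests[socks]). Assuming the same placeholder host and port, the dictionary would then look like this:

proxy_schema = {
    # using "socks5h" instead of "socks5" also resolves DNS through the proxy
    "http": "socks5://proxy_host:proxy_port",
    "https": "socks5://proxy_host:proxy_port"
}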
For the sake of testing, you can replace the proxy template with a real one provided by a free proxy service. It's worth mentioning, however, that free proxies are unreliable and should be used for testing purposes only.
Either way, considering that even the best proxies can sometimes be unstable, it’s a best practice to handle exceptions in our code:
try:
    URL = 'https://api.ipify.org/?format=json'
    response = requests.get(URL, proxies=proxy_schema, timeout=30)
    print(response.text)
except requests.exceptions.RequestException:
    print('Unable to connect to the proxy')
If you managed to find a working free proxy, after running the code you should notice that the result has changed: it's no longer your real IP address.
Rotate the proxies
Now let's see how we can use multiple proxies to anonymize our requests while also keeping our script fast: we'll send the requests concurrently, using the “concurrent.futures” module.
First, let’s say we have the following list of proxies (or proxy pool):
proxy_pool = [
    "http://191.5.0.79:53281",
    "http://202.166.202.29:58794",
    "http://51.210.106.217:443",
    "http://103.240.161.109:6666"
]
This list can be either hard-coded or read from a file, your choice. If you go the file route, here is a minimal sketch, assuming a hypothetical plain-text file named proxies.txt with one proxy URL per line:
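# proxies.txt is a hypothetical file: one proxy URL per line
with open('proxies.txt') as f:
    # strip newlines and skip empty lines
    proxy_pool = [line.strip() for line in f if line.strip()]

Either way, once the pool is ready, we can write the following code: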
from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() applies scrape_job to every proxy in the pool;
    # wrapping it in list() forces all the tasks to complete
    list(pool.map(scrape_job, proxy_pool))
We initialize a thread pool with a maximum of 8 workers to execute the “scrape_job” function (defined below), which receives an element of the “proxy_pool” list as a parameter.
Now, the “scrape_job” function will represent the code that we previously wrote, with a small adjustment:
def scrape_job(proxy):
    try:
        URL = 'https://api.ipify.org/?format=json'
        proxy_scheme = {
            "http": proxy,
            "https": proxy
        }
        response = requests.get(URL, proxies=proxy_scheme, timeout=30)
        print(response.text)
    except requests.exceptions.RequestException:
        print('Unable to connect to the proxy')
The “proxy_scheme” variable is now assigned dynamically, according to the proxy we pass as a parameter. The script can be further extended to handle multiple URLs, with a specific timeout value for each.
Running the script should send multiple GET requests at the same time, each printing a different IP address (provided the free proxies are working).
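As mentioned, the same pattern extends to multiple URLs. Here is a minimal sketch of one possible approach; the “urls” list below is a placeholder, and each request simply picks a random proxy from the pool:

import random
import requests
from concurrent.futures import ThreadPoolExecutor

# Hypothetical list of target URLs: replace it with your own
urls = [
    'https://api.ipify.org/?format=json',
    'https://httpbin.org/ip'
]

def scrape_url(url):
    # pick a random proxy from the pool (defined earlier) for every request
    proxy = random.choice(proxy_pool)
    try:
        response = requests.get(url, proxies={"http": proxy, "https": proxy}, timeout=30)
        print(response.text)
    except requests.exceptions.RequestException:
        print('Unable to connect to the proxy')

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(scrape_url, urls))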
Scaling up
When you start rotating proxies, a few key best practices can help make your rotation as effective as possible. Here are some tips to keep in mind:
Free proxies are a no-go
While free proxy services may seem like a cost-effective option, they are often unreliable and may even put your scraping efforts at risk. Consider investing in a premium proxy service that offers a higher level of security and reliability.
Test before you scrape
Before you begin scraping, it's a good idea to test your proxy rotation to make sure it's working as intended. This will help you identify and fix any issues before they cause problems down the line.
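One simple way to do this is to run every proxy in the pool against a known endpoint before the actual scraping starts and keep only the ones that respond. A minimal sketch, reusing the ipify endpoint and the “proxy_pool” list from earlier:

import requests

def test_proxy(proxy):
    # return True if the proxy answers within the timeout
    try:
        requests.get('https://api.ipify.org/?format=json',
                     proxies={"http": proxy, "https": proxy}, timeout=10)
        return True
    except requests.exceptions.RequestException:
        return False

# filter the pool down to the proxies that actually work
working_pool = [proxy for proxy in proxy_pool if test_proxy(proxy)]
print(f'{len(working_pool)} of {len(proxy_pool)} proxies are usable')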
Use user-agent rotation
Rotating your IP address is an effective way to bypass anti-scraping measures, but it's not the only technique you should rely on. Pairing IP rotation with user-agent rotation, which changes the browser signature, can make it even harder for websites to detect and block your scraper.
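A minimal sketch of combining the two techniques; the user-agent strings below are just examples, and in practice you'd maintain a larger, up-to-date list:

import random
import requests

# example user-agent strings: rotate them alongside the proxies
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.1 Safari/605.1.15'
]

# pick a random proxy and a random browser signature for each request
proxy = random.choice(proxy_pool)
headers = {'User-Agent': random.choice(user_agents)}
response = requests.get('https://api.ipify.org/?format=json',
                        proxies={"http": proxy, "https": proxy},
                        headers=headers, timeout=30)
print(response.text)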
Premium proxy services
Not all proxy services are created equal, so it's important to do your research and choose a provider that offers a high level of security and reliability. Look for a service that offers a wide range of IPs and locations, as well as features like automatic IP rotation and anonymous browsing.
Use a web scraping API
Using a web scraping API can be a great way to simplify the process of rotating proxies, especially if you're new to web scraping. A good scraper API will handle the proxy rotation for you and provide you with a clean and easy-to-use interface to access the data you need.
A good API's proxy rotation mechanism helps avoid blocks altogether, and its extensive knowledge base makes it possible to randomize the browser data, so your requests look like they come from a real user.
However, if you wish to stick with your own scraper, or your use case is too complex, you also have the option to access the API as a proxy. The request will simply be redirected to the API while your script enjoys the proxy rotation.
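As an illustration only (the host, port, and credentials below are placeholders, not a real endpoint), a scraping API in proxy mode typically plugs into the same “proxies” parameter you've been using:

import requests

# hypothetical proxy-mode endpoint and credentials: replace them with
# the values from your provider's documentation
api_proxy = 'http://proxy_username:API_KEY@proxy.example.com:8000'

response = requests.get('https://api.ipify.org/?format=json',
                        proxies={"http": api_proxy, "https": api_proxy},
                        timeout=60)
print(response.text)

Note that some providers re-encrypt HTTPS traffic with their own certificate in proxy mode, so check their documentation for any extra TLS settings.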
Conclusion
Summing it up, knowing how to rotate proxies is an essential technique for web scraping, and Python makes it easy to implement. By following the steps outlined in this guide, you've gone from proxying a single request to speeding up the rotation process with concurrent requests.
Additionally, you've discovered some extra tips on proxy rotation when it comes to scaling up your project.
Remember, effective proxy rotation can greatly increase the success of your web scraping projects, and this guide has provided you with the tools and knowledge to do it like a pro. Happy scraping!