What Are ISP Proxies And How To Use Them For Web Scraping
Mihnea-Octavian Manolache on Feb 22 2023
Proxies are a very important aspect of web scraping. There are three main types: datacenter, residential and ISP proxies. Each type has its own use cases, yet there is little information available on what they mean and when to use them in web scraping. ISP proxies in particular are rarely covered by tech writers. That is why today we will focus our attention on this type of proxy. By the end of today’s article, you should have a solid understanding of:
- What is a proxy in general and how does it work
- What is the definition of an ISP proxy and what are its particularities
- How and why to use ISP proxies for web scraping
What is a proxy server?
In short, a proxy is an intermediary between clients and servers. It relays requests from clients seeking resources from other servers. The flow of a client - proxy - server relationship looks something like this:
- A client connects to the proxy server, requesting some service from a destination server
- The proxy server evaluates the request, connects to the destination server and fetches the requested service
- Once received, it then transfers the service back to the client, unaltered.
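Seen from the client's side, the flow above can be sketched with Python's standard library. This is only a minimal sketch: `build_proxy_opener` is a hypothetical helper name, and the proxy address in the usage comment is a placeholder from a documentation-reserved range, not a working proxy.

```python
import urllib.request


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)


# Usage (placeholder proxy address, replace it with a real one):
#   opener = build_proxy_opener("http://203.0.113.10:8080")
#   with opener.open("https://api.ipify.org", timeout=10) as response:
#       print(response.read().decode())  # the IP address the target server saw
```

The client only talks to the proxy. The destination server sees the proxy's IP address, which is what makes the proxy type matter for scraping.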
Apart from that, proxies are also used to:
- increase security
- enhance network performance
- filter network traffic
- block unwanted websites
- bypass geographical restrictions on internet access
What is the definition of ISP proxies?
As we discussed in the introduction, there are three main types of proxies. For the first two, the definition is pretty straightforward. Datacenter proxies are proxies owned by data centers, which means that their IP addresses are associated with the data center. Residential proxies have IP addresses associated with a physical location. Moreover, these IP addresses are registered to a specific individual or organization.
Now when it comes to ISP proxies, there is a bit of confusion. First of all, ISP stands for Internet Service Provider. And as you can imagine, all residential IPs originate from ISPs. Well, this small aspect partially answers the question. ISP proxies fit somewhere between datacenter and residential proxies.
Most of the time, you’ll find that an ISP proxy is actually a residential proxy, hosted on a datacenter machine. Hence, these proxies inherit advantages from the other two. And the list of benefits includes (but is not limited to):
- IP legitimacy - using a residential IP address lowers the risk of bot detection
- Speed - hosting the proxy on a datacenter server increases the performance of the service
Why use ISP proxies for web scraping?
The use of proxies in web scraping is quite a common need. But before discussing ISP proxies in particular, let me tell you about why proxies are important for scraping. To begin with, let’s define what web scraping is. On a high level, web scraping is accessing a server with the aim of extracting resources. And that is usually done using automated software. Moreover, web scraping typically involves sending a lot of requests to the targeted server in a short period of time.
As you can imagine, this puts a lot of load on the server. And that is why web platforms are typically not ‘happy’ about scrapers accessing their servers. To prevent access from automated software, these platforms usually use some sort of detection and prevention system. And one of the methods of detection is as basic as it can be: checking the IP address. It’s common sense to think that IP addresses associated with data centers are more likely to host bots.
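To make the idea concrete, here is a rough sketch of what such an IP check could look like on the server side. It is an illustration under assumptions, not how any particular platform works: the ranges in `DATACENTER_NETWORKS` are documentation-reserved networks standing in for real hosting providers, and `is_datacenter_ip` is a hypothetical name.

```python
import ipaddress

# Hypothetical data center ranges. These are documentation-reserved networks,
# used here as stand-ins for the address blocks of real hosting providers.
DATACENTER_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("203.0.113.0/24"),
]


def is_datacenter_ip(address):
    """Return True if address falls inside one of the listed ranges."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in DATACENTER_NETWORKS)
```

A request coming from an address inside one of these ranges could then be challenged or blocked, while an address registered to an ISP would pass this particular check.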
And I think this answers the question quite well. If we account for the main advantages discussed previously, we’ll have a wider understanding of the answer. We primarily use ISP proxies in web scraping to increase our success rate while maintaining good performance. But that is not all. Let’s explore other scenarios:
#1: Accessing location specific websites
I am sure you’ve come across websites that target visitors from specific locations. In SEO, this concept is known as geo-location specific content. What happens is that websites first check the origin of the client’s IP address. And if it matches their pattern (say it’s a US website targeting US-only clients), it will allow the client to connect. If on the other hand the client is from a foreign country, the website will block access.
In web scraping, this is a very common scenario. So, as a workaround, we will use proxies from that specific country. You may first want to try out a datacenter proxy. If you still get blocked, you can then try ISP proxies, which again, offer a higher success rate.
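The "datacenter first, ISP second" approach can be sketched as a small fallback loop. This is a sketch under assumptions: `fetch_with_fallback` is a hypothetical helper, and the `fetch` callable is something you supply yourself, expected to return the page body on success and raise an exception when the request is blocked.

```python
def fetch_with_fallback(url, proxies, fetch):
    """Try each proxy in order and return the first successful response body.

    fetch(url, proxy) is supplied by the caller. It should return the body
    on success and raise an exception when the request is blocked.
    """
    errors = []
    for proxy in proxies:
        try:
            return fetch(url, proxy)
        except Exception as error:  # blocked or failed: move on to the next proxy
            errors.append((proxy, error))
    raise RuntimeError(f"all proxies failed for {url}: {errors}")
```

Passing the proxies in order of cost, for example a datacenter proxy from the target country followed by an ISP proxy from the same country, means the more expensive option is only used when the cheaper one is blocked.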
#2: Sending large numbers of requests
When we want to access many resources on a server, we may put a lot of load on that server. And servers will usually see it as abuse and block the IP address that is sending all these requests. In web scraping, to avoid getting blocked, we would use a rotating system that switches between ISP proxies. This way, the server will ‘think’ there are different residential users accessing it. Hence, the bulk requests won’t get blocked.
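A rotating system can be as simple as a round-robin over a pool of proxies. Here is a minimal sketch of that idea. `ProxyRotator` is a hypothetical class name, and real rotation services usually add health checks and per-proxy rate limits on top of this.

```python
import itertools


class ProxyRotator:
    """Hand out proxies from a fixed pool in round-robin order."""

    def __init__(self, proxies):
        pool = list(proxies)
        if not pool:
            raise ValueError("the proxy pool must not be empty")
        self._pool = itertools.cycle(pool)

    def next_proxy(self):
        """Return the next proxy, wrapping around at the end of the pool."""
        return next(self._pool)
```

Calling `next_proxy()` before each request spreads the requests evenly across the pool, so no single IP address carries the whole load.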
How to use ISP proxies for web scraping?
There are primarily two types of web scrapers:
- Based on simple HTTP clients
~ » curl https://<REACT_APP>.com
#1: Create a new project
First things first, let’s create a new directory that will hold our files. Next, open the project in your favorite IDE (mine is Visual Studio Code) and open a new terminal. To open a new terminal from within VSCode, go to Terminal > New terminal. We’ll create a new virtual environment inside the project and activate it:
~ » python3 -m venv env && source env/bin/activate
In your project, let’s create a new ‘scraper.py’ file and add some code to it. The basic structure of a scraper with Selenium, from a functional programming perspective is:
from selenium import webdriver
driver = webdriver.Chrome()
And that is it. In 5 lines of code:
- We’re firing up an automated browser
- We’re accessing our target
- And we’re collecting its resources.
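Here is a minimal sketch of those three steps as one function, assuming Selenium 4 and a local Chrome install. The `driver_factory` parameter is a hypothetical addition of this sketch, so the function can be tried without launching a real browser. It is not part of Selenium.

```python
def scrape_page(url, driver_factory=None):
    """Open url in an automated browser and return the rendered HTML."""
    if driver_factory is None:
        # Imported lazily so the module loads even where Selenium is missing.
        from selenium import webdriver
        driver_factory = webdriver.Chrome

    driver = driver_factory()      # fire up an automated browser
    try:
        driver.get(url)            # access the target
        return driver.page_source  # collect its resources
    finally:
        driver.quit()              # always close the browser
```

Closing the browser in a `finally` block keeps stray Chrome processes from piling up when a page fails to load.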
But remember, we want to use ISP proxies with Selenium, so that our browser, while not the stealthiest, is at least harder to detect. Luckily, things are quite simple in Python (and that’s why I love it). Here is how we introduce proxies in Selenium:
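from selenium import webdriver

def scrape_page(url, proxy):
    options = webdriver.ChromeOptions()
    options.add_argument('--proxy-server=%s' % proxy)  # proxy is host:port
    driver = webdriver.Chrome(options=options)
    driver.get(url)
    return driver.page_source

print(scrape_page('<TARGET_URL>', '<PROXY_IP>:<PORT>'))  # placeholders, replace with your own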
We only added more lines inside the function. The last one is to call the function. If you run the script now, we should probably be able to see that the request originates from 126.96.36.199. For the purpose of this example, I’ve used a free proxy server from here. But if you want to build a real scraper, I suggest you look into more reliable sources. Preferably, proxy providers that offer ISP proxies as well.
How to check the origin of an IP address?
There are a few ways you can check to see if an IP address originates from an ISP proxy. But since we’re talking DevOps today, you should probably get yourself comfortable with using the terminal. Today I will introduce you to `whois`.
On most Unix and Unix-like operating systems, `whois` is either preinstalled or available from the system’s package manager. It is a command line utility that allows users to look up information about targets. And the targets can be either domain names or IP addresses. So let’s fire up a new terminal window and test this command.
First of all, let’s send a `curl` command to the API offered by ipify.org. This way, you can get your own IP address and perform the test using it. If you’re not familiar with `curl`, just go over my article on how to use curl.
~ » curl api.ipify.org
Now that we have an IP address to test on, just send your `whois` command. I’ve used my own IP, but feel free to replace <IP_ADDRESS> with yours:
~ » whois <IP_ADDRESS>
inetnum: 82.78.XX.0 - 82.78.XX.XX
descr: RCS & RDS Residential CGN
descr: City: Bucuresti
The output is larger, but I wanted you to have an overview of how easily a residential IP is detected. There are also public APIs that track datacenter IPs, such as the one offered by incolumitas.
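If you want to run this check from a script, the output can be parsed in a few lines of Python. This is a sketch under assumptions: `parse_whois` and `looks_residential` are hypothetical helpers, and the field names (`inetnum`, `descr`) are the ones shown in the output above. Other registries use different field names, and many records never mention the word "residential", so treat the result as a hint rather than a verdict.

```python
def parse_whois(text):
    """Collect 'key: value' lines from whois output into a dict of lists."""
    record = {}
    for line in text.splitlines():
        if line.startswith(("%", "#")) or ":" not in line:
            continue  # skip comment lines and lines without a field
        key, _, value = line.partition(":")
        record.setdefault(key.strip().lower(), []).append(value.strip())
    return record


def looks_residential(record):
    """Heuristic: at least one description line mentions 'residential'."""
    return any("residential" in descr.lower() for descr in record.get("descr", []))
```

To feed it real data, you could capture the command's output with `subprocess.run(["whois", ip], capture_output=True, text=True).stdout` and pass that string to `parse_whois`.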
Today we have explored both the DevOps and the Coding skills of building a web scraper. To wrap up, I’ll ask you a simple question. Can we say that ISP proxies are nothing more than datacenter proxies, hidden behind a residential IP address? I think this is not the most accurate definition, but it sure describes these proxies pretty well.
At Web Scraping API, we’re using both residential and datacenter proxies. That is because some targets allow non-residential traffic, while others don’t. If you want to learn more about how you can use proxies with our API, check out our documentation.