How To Use CURL With Python For Web Scraping

Mihnea-Octavian Manolache on Nov 30 2022

The most basic action any web scraping app has to perform is to first gather the HTML file and only then focus on manipulating it. Of course, there are different ways you can achieve this. However, in today’s article, we will discover how to use cURL with Python to build a web scraper. Here is a preview of just a fraction of the things that you will learn after reading this article:

  • What is cURL and how to access it from the command-line
  • How to use cURL command to gather information from any website
  • How to use cURL in Python in order to build a simple web scraper

How To Use CURL With Python?

In short, cURL is mainly a command-line tool used to fetch data from a server. I know that when mentioning the command line, things may seem complicated. However, I can assure you that in practice, as you will discover throughout this article, cURL is maybe one of the easiest tools you will ever use as a programmer.

To use cURL from the command line, simply open a new terminal window and type `curl` followed by the URL you want to scrape. For example:

~ » curl 'https://api.ipify.org?format=json'

This simple command is accessing ipify’s API, requesting information from the server, just like a traditional browser would do. The output of this example will be a JSON object containing your IP address. Even though it might not seem like it, you have just built the infrastructure for a future web scraper. All in just one line of code.
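Once you move to Python, a JSON body like the one ipify returns can be handled with the standard library’s `json` module. A quick sketch (the IP address below is a documentation placeholder, not a real lookup):

```python
import json

# A body shaped like ipify's response (placeholder address, not a real lookup)
response_body = '{"ip":"203.0.113.7"}'

data = json.loads(response_body)
print(data["ip"])  # → 203.0.113.7
```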

cURL is actually a much more advanced tool. If you want to learn more about how to use curl, you can go over the official documentation. You can also use the `--help` switch and read about the various options available.

How to Use cURL to Fetch HTML Files

In the example above, the response we received from the ipify server was a JSON file. That is because this particular API endpoint returns data in JSON format. In terms of web scraping, you will usually come across traditional websites that serve HTML files, which you will then have to parse and extract data from.

However, for now, our focus is not data manipulation, but rather data extraction. We know we can use cURL to scrape websites, but how do we actually do it? Well, if you haven’t already been curious and tried, simply ask curl to access any generic URL that you know serves a traditional HTML-based website. Let’s take httpbin.org as an example:

curl 'https://httpbin.org/forms/post'

Type that command in your terminal and you will receive the plain HTML as a response:

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<!-- Example form from HTML5 spec http://www.w3.org/TR/html5/forms.html#writing-a-form's-user-interface -->
<form method="post" action="/post">
<p><label>Customer name: <input name="custname"></label></p>
<p><label>Telephone: <input type=tel name="custtel"></label></p>
<p><label>E-mail address: <input type=email name="custemail"></label></p>
<fieldset>
<legend> Pizza Size </legend>
<p><label> <input type=radio name=size value="small"> Small </label></p>
<p><label> <input type=radio name=size value="medium"> Medium </label></p>
<p><label> <input type=radio name=size value="large"> Large </label></p>
</fieldset>
<fieldset>
<legend> Pizza Toppings </legend>
<p><label> <input type=checkbox name="topping" value="bacon"> Bacon </label></p>
<p><label> <input type=checkbox name="topping" value="cheese"> Extra Cheese </label></p>
<p><label> <input type=checkbox name="topping" value="onion"> Onion </label></p>
<p><label> <input type=checkbox name="topping" value="mushroom"> Mushroom </label></p>
</fieldset>
<p><label>Preferred delivery time: <input type=time min="11:00" max="21:00" step="900" name="delivery"></label></p>
<p><label>Delivery instructions: <textarea name="comments"></textarea></label></p>
<p><button>Submit order</button></p>
</form>
</body>
</html>

How to Use cURL in Python 

As you saw, extracting data with cURL is a straightforward solution and requires no actual coding. It is simply a matter of sending a command and receiving some information. If you want to build a real web scraping project, you will need to somehow use the data you collected. And since we are programmers, we want to manipulate the data programmatically. Here is where Python comes into play.

Why Choose Python For a Web Scraping Project

Undoubtedly, Python is one of the most popular programming languages. Not only is it very powerful, but its simple syntax makes it perfect for beginner programmers. It also has a great community that is always ready to jump in and help. So if at any time you encounter an issue and get stuck, don’t hesitate to ask a question on Stack Overflow, for example, and someone will surely help you.

When it comes to web scraping in particular, Python is a great choice because of all the packages it comes with. As you will see later in this article, data manipulation requires parsing the HTML files, such that you can then ‘mine’ the elements and extract only the information you are targeting from that particular web page.

Web Scraping With cURL and Python

So far, we’ve seen how to use curl in the terminal, but how do we actually integrate it with Python? Well, there are actually multiple ways you can approach this. For instance, you could use Python’s `os` module and send terminal commands:

import os

# os.system prints curl's output to the terminal and returns the exit status
status = os.system('curl "https://httpbin.org/forms/post"')
print(status)

Or you can even build your own function around it and use it throughout the project:

import os

# Note: os.system returns the command's exit status, not its output
def curl(website):
    return os.system(f'curl "{website}"')

print(curl('https://httpbin.org/forms/post'))
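Keep in mind that `os.system` only hands back curl’s exit status, not the HTML itself. If you want the output in a variable, one possible sketch uses the standard library’s `subprocess` module instead:

```python
import subprocess

def curl(website):
    # -s silences curl's progress meter; capture_output collects stdout
    result = subprocess.run(
        ['curl', '-s', website],
        capture_output=True, text=True, check=True
    )
    return result.stdout

# Example (requires the curl binary and network access):
# html = curl('https://httpbin.org/forms/post')
```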

However, like I said, one of Python’s greatest strengths is its package diversity. Because cURL is so versatile, our function would need to be much more complex in order to accommodate all its features. So, instead of reinventing the wheel, I recommend we use an existing package for cURL integration in Python: PycURL.

What is PycURL and How To Install It

According to their website, PycURL is an interface to the cURL library, hence inheriting all libcURL capabilities. In short, PycURL is the means by which we will use cURL in Python. When it comes to installing it, like with any other Python package, we will use pip. If you’re not familiar with pip, it is a package management system for Python and Python developers use it all the time to quickly install dependencies.

This being said, to install PycURL, simply add the following command in your terminal:

~ » pip install pycurl 

Parsing HTML Using BeautifulSoup

Since we’re discussing dependencies and pip, it’s also worth mentioning that the Python community came up with quite a few solutions for HTML parsing. One of the most popular HTML parsing packages is BeautifulSoup. At WebScrapingAPI, we’ve actually dedicated an entire blog post to how to extract and parse web data with Python and BeautifulSoup.

Just like with PycURL, installing BeautifulSoup only takes one command:

~ » pip install beautifulsoup4

How to Build a Web Scraper With Python and cURL

Now that we’ve covered the theoretical part and we know how to use cURL both in the terminal and in Python, let’s jump right into coding. In this section, we will learn how to use curl in Python by building an actual web scraper. So, without further ado, let the coding game begin!

1. Setting up The Directory

As a software engineer, it is important to structure our projects such that they are easy to maintain and read by ourselves and by other developers as well. To keep everything organized, let’s start by creating a new directory that will hold all our project’s files. Open up a new terminal window, `cd` into Desktop and create a new folder named `py_scraper`:

~ » cd desktop && mkdir py_scraper && cd py_scraper

Let me briefly explain the commands we’ve used so far:

  1. `cd` - change current directory
  2. `&&` - execute the following command only if the previous one is successful
  3. `mkdir` - create new directory

Open up your project in your favorite IDE and create a new file named ‘scraper.py’ inside the `py_scraper` directory. Hint: You can also do it from the command-line by using this command:

~/desktop/py_scraper » touch scraper.py && code .

If you’re using VSCode (like I do), you will now be presented with a window that should look like this:

[Screenshot: the empty scraper.py open in VSCode inside the py_scraper folder]

2. Installing Packages

Your terminal should now be inside the `py_scraper` directory. The last thing we need to do before coding the actual scraper is to install the packages we’ve previously presented, plus one more: `certifi`, which provides an up-to-date bundle of CA certificates. However, we want to contain them only inside the `py_scraper` directory (and not have them installed globally). To do so, we will have to use Python’s virtual environments. These allow us to isolate the Python interpreter, libraries and scripts installed.

To set up a new virtual environment inside the `py_scraper` directory, use the following command:

~/desktop/py_scraper » python3 -m venv env

This will create a new `env` folder that we need to activate before installing the desired packages. Activate it by using this command:

~/desktop/py_scraper » source env/bin/activate

Now that you’ve created and activated your virtual environment, all there is left is to install the required packages by making use of the pip commands we’ve presented previously.

~/desktop/py_scraper » pip install pycurl beautifulsoup4 certifi
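Optionally, you can keep the dependency list in a `requirements.txt` file inside `py_scraper` and install everything at once with `pip install -r requirements.txt`. A minimal version of that file would be:

```
pycurl
beautifulsoup4
certifi
```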

3. Creating The Python and cURL Web Scraper

You are now all set up to use PycURL and BeautifulSoup. In order to use these packages, we need to first import them into our `scraper.py` file. Simply add this snippet at the top of the file:

import pycurl
import certifi
from io import BytesIO
from bs4 import BeautifulSoup

# All our logic will go underneath this line

Now that you’ve imported the packages, let’s handle the logic of our web scraper. From what we have discussed so far, we know that we need to cover two aspects: data extraction and data handling. The first is covered by PycURL and the second by BeautifulSoup. For a better structure, I suggest we treat each section separately.

3.1. Scraping Data With cURL and Python

When I say scraping, I am referring to the extraction part of the web scraper. Having this in mind and knowing how to use curl in Python by interacting with the PycURL interface, let’s write the code:

# Setting global variables
TARGET_URL = 'https://httpbin.org/forms/post'

# Using cURL and Python to gather data from a server via PycURL
buffer = BytesIO()
curl = pycurl.Curl()
curl.setopt(curl.URL, TARGET_URL)
curl.setopt(curl.WRITEDATA, buffer)
curl.setopt(curl.CAINFO, certifi.where())
curl.perform()
curl.close()

# Using BytesIO to retrieve the scraped data
body = buffer.getvalue()

# Saving the output and printing it in terminal
data = body.decode('iso-8859-1')
print(data)

In the above code, we start by declaring the `TARGET_URL` global variable that holds the URL of the website we want to extract data from. Next, we create a buffer using `BytesIO`, initialize PycURL, and set three options: one for the target URL, one for the data transfer, and one for the file holding the CA certificates. Last but not least, we perform the curl action and close the session afterwards.
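If the `WRITEDATA` part seems opaque: libcurl simply calls the buffer’s `write()` method for every chunk of data it receives, and `getvalue()` hands back everything concatenated. You can see that pattern in isolation with just the standard library (the two chunks below are made up for illustration):

```python
from io import BytesIO

buffer = BytesIO()

# libcurl would call write() once per received chunk; we simulate two chunks
buffer.write(b'<html><body>')
buffer.write(b'Hello</body></html>')

body = buffer.getvalue()           # every chunk, concatenated
data = body.decode('iso-8859-1')   # same decode step as in our scraper
print(data)  # → <html><body>Hello</body></html>
```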

That’s it, you have successfully used Python to make a cURL request and printed the HTML file in your console. Now all we need to do is to take care of the second section, namely the data handling.

3.2. Parsing HTML With Python and BeautifulSoup

Raw data is of little use in web scraping unless we perform some sort of action on it, since the most basic purpose of any web scraper is to extract data from HTML. For our example, let us assume that we want to scrape all text inside the `<p>` elements from the `data` variable (which currently holds all the scraped HTML). Here is how we do this using BeautifulSoup:

# Parsing data using BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')

# Finding elements using BeautifulSoup
paragraphs = soup.find_all("p")
for p in paragraphs:
    print(p.text)

As you can see, with BeautifulSoup, it only takes 4 lines of code to extract the desired result. Running the full script should now print the text inside each paragraph found in the HTML file we collected from our targeted website. 
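BeautifulSoup can of course go further than paragraphs. As a quick illustration on a small, made-up excerpt of the same form, here is how you could collect the `name` attribute of every `<input>` element:

```python
from bs4 import BeautifulSoup

# A small, made-up excerpt of the form we scraped earlier
html = (
    '<form>'
    '<p><label>Customer name: <input name="custname"></label></p>'
    '<p><label>Telephone: <input type="tel" name="custtel"></label></p>'
    '</form>'
)

soup = BeautifulSoup(html, 'html.parser')
names = [tag['name'] for tag in soup.find_all('input')]
print(names)  # → ['custname', 'custtel']
```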

So, assuming that you followed the instructions and that your `scraper.py` includes all the code we’ve written in this section, let’s head back to the terminal and run the script:

~/desktop/py_scraper » python3 scraper.py
Customer name:
Telephone:
E-mail address:
Small
Medium
Large
Bacon
Extra Cheese
Onion
Mushroom
Preferred delivery time:
Delivery instructions:
Submit order

Conclusion

Building a web scraper with Python and cURL is a very useful project and can be the starting point for a bigger web scraping app. The recommended approach to integrating the two technologies is by using PycURL. You can also write your own interface or function to interact with cURL in Python. It just takes a bit more time and effort :).

I hope this article was a good resource for learning how to use cURL with Python and building a basic web scraper. Moreover, I invite you to tweak the code and make it your own, such that you’ll have yet another project to add to your portfolio.

