We will split our code into two sections: one for data extraction and one for data manipulation. The first part is covered by the Requests package, while the second part is covered by BeautifulSoup. Without further ado, let's jump into coding, starting with the extraction part:
```python
import requests

def scrape(url=None):
    # If there is no URL, there is no need to use a Python HTTP client;
    # print a message and stop execution
    if url is None:
        print('[!] Please add a target!')
        return
    response = requests.get(url)
    return response
```
In this section, we are defining a function with only one parameter: the targeted URL. If the URL is not provided, we print a message and stop execution. Otherwise, we use Requests' `get` method to return the response. Now, we know that Python HTTP clients cover more methods, so let's add an optional `method` parameter:
```python
import requests

def scrape(method='get', url=None, data=None):
    # If there is no URL, there is no need to use a Python HTTP client;
    # print a message and stop execution
    if url is None:
        print('[!] Please add a target!')
        return
    if method.lower() == 'get':
        response = requests.get(url)
    elif method.lower() == 'post':
        if data is None:
            print('[!] Please add a payload to your POST request!')
            return
        response = requests.post(url, data)
    else:
        # Avoid returning an undefined variable for unsupported methods
        print('[!] Unsupported method:', method)
        return
    return response
```
As you can see, we added a couple more parameters to our function. The `method` parameter specifies which HTTP method to use for the request, and the `data` parameter represents the payload we are sending with a POST request. Since `method` defaults to GET, it can be omitted entirely.
Challenge: Add more methods to this function and enrich our scraper’s capabilities. Not only is it fun, but it’s also a good learning approach. Plus, you get to make the code your own so you can add it to your portfolio.
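As a sketch of where that challenge might lead, here is one way to extend the function with PUT and DELETE support. The structure mirrors the version above; the method names come from Requests itself, while the branching logic and error messages are just one possible design:

```python
import requests

def scrape(method='get', url=None, data=None):
    # No URL means there is nothing to request
    if url is None:
        print('[!] Please add a target!')
        return
    method = method.lower()
    if method == 'get':
        return requests.get(url)
    if method in ('post', 'put'):
        # Both POST and PUT carry a payload
        if data is None:
            print('[!] Please add a payload to your request!')
            return
        return requests.post(url, data) if method == 'post' else requests.put(url, data)
    if method == 'delete':
        return requests.delete(url)
    print('[!] Unsupported method:', method)
```

Grouping POST and PUT keeps the payload check in one place; any method we have not handled falls through to a clear error instead of crashing.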
So far we’ve covered the data extraction. Let’s parse the HTML and do something with it:
```python
from bs4 import BeautifulSoup

def extract_elements(data=None, el=None):
    if data is None:
        print('[!] Please add some data!')
        return
    if el is None:
        print('[!] Please specify which elements you are targeting!')
        return
    soup = BeautifulSoup(data.text, 'html.parser')
    elements = soup.find_all(el)
    return elements
```
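Since `extract_elements` only reads the `.text` attribute of whatever it receives, you can try it without making a real request. Here is a quick sketch using a stand-in object in place of a Requests response (the HTML snippet is made up for illustration):

```python
from types import SimpleNamespace
from bs4 import BeautifulSoup

def extract_elements(data=None, el=None):
    if data is None:
        print('[!] Please add some data!')
        return
    if el is None:
        print('[!] Please specify which elements you are targeting!')
        return
    soup = BeautifulSoup(data.text, 'html.parser')
    return soup.find_all(el)

# A stand-in for a Requests response; only the .text attribute is used
fake_response = SimpleNamespace(text='<h1>Title</h1><p>First</p><p>Second</p>')
paragraphs = extract_elements(fake_response, 'p')
```

This trick is handy for testing parsing logic in isolation, before wiring it up to live requests.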
But a web scraper should be able to extract more specific data. For example, it should be able to locate and return elements based on their attributes, such as a class or id, the same building blocks CSS selectors use. So let's add the logic that handles this part:
```python
from bs4 import BeautifulSoup

def extract_elements(data=None, el=None, attr=None, attr_value=None):
    if data is None:
        print('[!] Please add some data!')
        return
    if el is None:
        print('[!] Please specify which elements you are targeting!')
        return
    soup = BeautifulSoup(data.text, 'html.parser')
    # Only filter by attribute when one was actually provided
    attrs = {attr: attr_value} if attr is not None else {}
    elements = soup.find_all(el, attrs)
    return elements
```
BeautifulSoup allows us to filter matches by their attributes, so here we've added two new parameters that let us locate and extract elements by attribute name and value.
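For comparison, BeautifulSoup also ships a `select` method that accepts full CSS selector syntax and can express the same attribute filter more compactly. A small sketch, using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

html = '<ul class="menu"><li>Home</li><li>Blog</li></ul><ul><li>Other</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all with an attribute filter, as in our function
menus = soup.find_all('ul', {'class': 'menu'})

# the same match expressed as a CSS selector
same_menus = soup.select('ul.menu')
```

Both calls match only the first `ul`; which style you prefer is largely a matter of taste, though `select` shines once selectors get more complex.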
We now have everything we need. All that is left to do is to combine the two sections and we have our web scraper. Once you assemble your code, simply:
- Create a new variable that will hold the data extracted with Requests
- Print the elements returned by BeautifulSoup
Here are the two missing pieces of your code:
```python
data = scrape('get', 'https://webscrapingapi.com')
print(extract_elements(data, 'ul'))
```
I am sure you have already figured out what each line does, so no walkthrough is needed at this point. Just as with our scraper, I challenge you to play around with the `extract_elements` function and make it do more than simply return elements.
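As a starting point for that challenge, here is a hypothetical `extract_text` variant that returns the visible text of each match instead of the Tag objects themselves. Note that, unlike `extract_elements`, it accepts a raw HTML string rather than a response object, so you can experiment with it without making any requests:

```python
from bs4 import BeautifulSoup

def extract_text(html=None, el=None, attr=None, attr_value=None):
    # Hypothetical variation on extract_elements: same filtering,
    # but returns stripped text content for each matched element
    if html is None or el is None:
        print('[!] Please provide both some HTML and a target element!')
        return
    soup = BeautifulSoup(html, 'html.parser')
    attrs = {attr: attr_value} if attr is not None else {}
    return [tag.get_text(strip=True) for tag in soup.find_all(el, attrs)]
```

From here you could go further still: return attribute values, follow links, or build a list of dictionaries ready for export to CSV.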