Ok, now we can write some code!
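Begin by opening a terminal window in your IDE and running the following command. It installs BeautifulSoup, a library that helps us extract the data from the HTML, along with requests, which the crawler imports for its HTTP calls (in case it is not installed yet):
> pip install beautifulsoup4 requests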
Then, create a folder named “products”. It will help organize and store the scraping results in multiple CSV files.
Finally, create the “crawler.py” file. Here we are going to write all our code and crawling logic. When we are done, we can execute the file with the following command:
> py crawler.py
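If you would rather not create the folder by hand, the script can create it on startup. The snippet below is a minimal sketch using only the standard library; the helper name `ensure_output_dir` is an assumption made for illustration and is not part of the crawler built in this tutorial.

```python
from pathlib import Path


def ensure_output_dir(name="products"):
    """Create the output folder if it is missing and return its path."""
    # Hypothetical helper: not part of the tutorial's crawler code.
    output_dir = Path(name)
    # exist_ok=True makes repeated runs safe; parents=True allows nested paths.
    output_dir.mkdir(parents=True, exist_ok=True)
    return output_dir
```

Calling `ensure_output_dir()` once at the top of the script should be enough, since running it again on an existing folder does nothing.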
Moving forward, let’s import the libraries we need and then define some global variables:
import requests
from bs4 import BeautifulSoup
import csv
BASE_URL = "https://www.shopetee.com"
SECTION = "/collections/all-collections"
FULL_START_URL = BASE_URL + SECTION
ENDPOINT = "https://api.webscrapingapi.com/v1/"
API_KEY = "API_KEY"
Now, let’s define the entry point for our crawler:
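def crawl(url, filename):
    # Download the page, parse it, then hand it to the pagination logic.
    page_body = get_page_source(url, filename)
    soup = BeautifulSoup(page_body, 'html.parser')
    start_crawling(soup, filename)

# Keep this call at the very bottom of crawler.py, after every function is defined.
crawl(FULL_START_URL, 'etee-page1.txt')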
We implement the crawl function, which will extract the HTML documents through our get_page_source procedure. Then it will build the BeautifulSoup object that will make our parsing easier and call the start_crawling function, which will start navigating the website.
def get_page_source(url, filename):
    params = {
        "api_key": API_KEY,
        "url": url,
        "render_js": '1'
    }
    # Ask WebScrapingAPI to fetch (and render) the target page for us.
    page = requests.get(ENDPOINT, params=params)
    soup = BeautifulSoup(page.content, 'html.parser')
    body = soup.find('body')
    # Save the <body> markup so each page can be inspected later.
    with open(filename, mode='w', encoding='utf-8') as file_source:
        file_source.write(str(body))
    return str(body)
As stated earlier, the get_page_source function uses WebScrapingAPI to get the HTML content of the website and writes the <body> section to a text file, as it is the part that contains all the information we are interested in.
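To see roughly what the request looks like on the wire, it can help to build the query string by hand. The sketch below uses only the standard library and the same three parameters as get_page_source. `build_request_url` is a hypothetical helper for illustration, and the parameter names are copied from the code above rather than checked against the WebScrapingAPI documentation.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://api.webscrapingapi.com/v1/"


def build_request_url(api_key, target_url):
    """Return the GET URL for the API call (illustrative helper only)."""
    params = {
        "api_key": api_key,
        "url": target_url,   # the page we want the API to fetch for us
        "render_js": "1",    # same flag the tutorial passes
    }
    # urlencode percent-encodes the target URL so it travels as a single value.
    return ENDPOINT + "?" + urlencode(params)


request_url = build_request_url(
    "API_KEY", "https://www.shopetee.com/collections/all-collections"
)
print(request_url)

# Decoding the query string gives back the original target URL.
decoded = parse_qs(urlsplit(request_url).query)
print(decoded["url"][0])
```

The point to notice is that the address of the shop page is not requested directly: it is passed as the value of the `url` parameter, and the API fetches it on our behalf.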
Now, let’s take a step back and check how to achieve our objectives. The products are organized in pages, so we need to access each page repeatedly to extract them all.
This means that our crawler will follow some recursive steps as long as there are available pages. To put this logic in code, we need to look at how the HTML describes these conditions.
If you go back to the Developer Console, you can see that each page number is actually a link to a new page. Moreover, since we are on the first page and there is no page before it, the left arrow is disabled.
So, the algorithm has to:
- Access the page;
- Extract the data (we will implement this in the next step);
- Find the pagination container in the HTML document;
- Check whether the “Next Page” arrow is disabled: if it is, stop; if not, get the new link and call the crawl function for the new page.
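Before wiring these steps to real HTML, the control flow can be sketched on its own. The example below is an assumption-heavy illustration: it uses a loop instead of the recursive call described in the list, and the three callables (`fetch_page`, `extract_data`, `find_next_url`) as well as the in-memory `FAKE_SITE` are placeholders invented for this sketch.

```python
def crawl_all_pages(start_url, fetch_page, extract_data, find_next_url):
    """Visit pages one after another until there is no next page.

    All three callables are placeholders (assumptions for this sketch):
    fetch_page(url) -> page, extract_data(page) -> list of items,
    find_next_url(page) -> next URL, or None when the arrow is disabled.
    """
    results = []
    url = start_url
    while url is not None:
        page = fetch_page(url)               # access the page
        results.extend(extract_data(page))   # extract the data
        url = find_next_url(page)            # follow the next link, or stop
    return results


# A tiny in-memory "website" standing in for real HTTP responses.
FAKE_SITE = {
    "/p1": {"items": ["bag", "wrap"], "next": "/p2"},
    "/p2": {"items": ["soap"], "next": None},
}

products = crawl_all_pages(
    "/p1",
    fetch_page=FAKE_SITE.__getitem__,
    extract_data=lambda page: page["items"],
    find_next_url=lambda page: page["next"],
)
print(products)  # ['bag', 'wrap', 'soap']
```

Both the loop and the recursive form visit the same pages in the same order. The loop simply avoids growing the call stack, which may matter on sites with a very large number of pages.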
def start_crawling(soup, filename):
    extract_products(soup, filename)
    pagination = soup.find('ul', {'class': 'pagination-custom'})
    # The last <li> in the pagination bar is the "Next Page" arrow.
    next_page = pagination.find_all('li')[-1]
    if next_page.has_attr('class'):
        if next_page['class'] == ['disabled']:
            print("You reached the last page. Stopping the crawler...")
    else:
        next_page_link = next_page.find('a')['href']
        next_page_address = BASE_URL + next_page_link
        # Slice to the end of the link so page numbers above 9 are not cut off.
        next_page_index = next_page_link[next_page_link.find('=') + 1:]
        crawl(next_page_address, f'etee-page{next_page_index}.txt')
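String indexing on the link relies on the exact shape of the `href`. As an alternative sketch, the standard library's `urllib.parse` can resolve the link against the base URL and read the page number by name. `resolve_next_page` is a hypothetical helper, and the assumption that the link carries a `page` query parameter is ours; check the real markup in the Developer Console before relying on it.

```python
from urllib.parse import urljoin, urlsplit, parse_qs

BASE_URL = "https://www.shopetee.com"


def resolve_next_page(href, base_url=BASE_URL):
    """Return (absolute_url, page_number) for a pagination link.

    Hypothetical helper; assumes the link has a `page` query parameter.
    """
    absolute_url = urljoin(base_url, href)
    query = parse_qs(urlsplit(absolute_url).query)
    # parse_qs maps each parameter to a list of values; take the first one.
    page_number = query.get("page", [None])[0]
    return absolute_url, page_number


print(resolve_next_page("/collections/all-collections?page=12"))
# ('https://www.shopetee.com/collections/all-collections?page=12', '12')
```

If the parameter is missing, the helper returns `None` for the page number instead of a wrong value, which makes it easier to notice when the site's link format differs from what was assumed.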