Now we can begin writing the code. To build our crawler, we’ll follow a recursive flow so that we visit every link we encounter. But first, let’s define our entry point:
def crawl(url, filename):
    page_body = get_page_source(url, filename)
    soup = BeautifulSoup(page_body, 'html.parser')
    start_crawling(soup)

crawl(FULL_START_URL, 'ecoroots.txt')
We implement the crawl function, which extracts the page’s HTML through our get_page_source procedure. It then builds the BeautifulSoup object that makes our parsing easier and calls the start_crawling function, which starts navigating the website.
def get_page_source(url, filename):
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    body = soup.find('body')
    with open(filename, mode='w', encoding='utf-8') as file_source:
        file_source.write(str(body))
    return str(body)
As stated earlier, the get_page_source function uses selenium to get the HTML content of the page and writes its <body> section to a text file, as it’s the one containing all the internal links we are interested in.
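For reference, the snippets in this section assume that a Selenium driver and a few constants (BASE_URL, SECTION, FULL_START_URL) were defined earlier in the article. A minimal sketch of that setup could look like the following; the URL and section path used here are only illustrative:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative values only; point these at the section of the store you want to crawl
BASE_URL = 'https://ecoroots.us'
SECTION = '/collections/zero-waste-essentials'
FULL_START_URL = BASE_URL + SECTION

# Run Chrome headless so the crawler works without opening a browser window
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)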
unique_links = {}

def start_crawling(soup):
    links = soup.find_all(lambda tag: is_internal_link(tag))
    for link in links:
        link_href = link.get('href')
        # Only crawl links we haven't seen before, so we never revisit a page
        if link_href not in unique_links:
            unique_links[link_href] = 0  # mark as seen / crawl in progress
            link_url = BASE_URL + link_href
            link_filename = link_href.replace(SECTION + '/products/', '') + '.txt'
            crawl(link_url, link_filename)
            unique_links[link_href] = 1  # mark as fully crawled
This is the main logic of the crawler. Once it receives the BeautifulSoup object, it will extract all the internal links. We do that using a lambda function, with a few conditions that we defined in the is_internal_link function:
def is_internal_link(tag):
    if tag.name != 'a': return False
    if tag.get('href') is None: return False
    if not tag.get('href').startswith(SECTION + '/products'): return False
    return True
This means that for every HTML element we encounter, we first verify that it’s an <a> tag, then that it has an href attribute, and finally that the href value points to one of the website’s internal product links.
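To make the filter more concrete, here is a small, self-contained example (with a made-up SECTION value and markup) showing which tags the lambda keeps and which it discards:

from bs4 import BeautifulSoup

SECTION = '/collections/example'  # hypothetical value for this demo

def is_internal_link(tag):
    if tag.name != 'a': return False
    if tag.get('href') is None: return False
    if not tag.get('href').startswith(SECTION + '/products'): return False
    return True

html = '''
<body>
  <a href="/collections/example/products/bamboo-toothbrush">kept: internal product link</a>
  <a href="https://example.com/about">discarded: external link</a>
  <span href="/collections/example/products/ignored">discarded: not an anchor tag</span>
  <a>discarded: no href attribute</a>
</body>
'''

soup = BeautifulSoup(html, 'html.parser')
matches = soup.find_all(lambda tag: is_internal_link(tag))
print([a.get('href') for a in matches])
# ['/collections/example/products/bamboo-toothbrush']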
After we get the list of links, we iterate over each of them, build the complete URL, and extract the product’s name to use as a filename. With this new data, we have a new page that we pass to the crawl function from our entry point, so the process begins all over again.
But what if we encounter a link that we already visited? How do we avoid an endless cycle? For this situation, we have the unique_links dictionary. For every link we iterate over, we check whether it was seen before starting to crawl it. If it’s a new one, we record it, crawl it, and mark it as fully visited once the crawling is done.
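If you don’t need to distinguish between pages that are still being crawled (value 0) and pages that are done (value 1), a plain set works just as well. This is a minimal alternative sketch, not the code used above:

visited = set()

def start_crawling(soup):
    for link in soup.find_all(lambda tag: is_internal_link(tag)):
        link_href = link.get('href')
        if link_href in visited:
            continue  # already seen, skip it to avoid endless cycles
        visited.add(link_href)
        link_url = BASE_URL + link_href
        link_filename = link_href.replace(SECTION + '/products/', '') + '.txt'
        crawl(link_url, link_filename)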
Once you run the script, the crawler will start navigating through the website’s products. It may take a few minutes depending on the size of the website you chose. When it finishes, you should have a collection of text files holding the HTML of every page the crawler visited.