Web scraping with regular expressions can be a powerful way to extract data from websites, but it has its limitations. One of the main issues is that a regex can fail when the structure of the HTML changes or when the markup is not well formed.
For example, consider the following HTML sample, from which we want to extract the text of each h2 heading using regex:
<html>
<head>
<title>Example Title</title>
</head>
<body>
<h1>Page Title</h1>
<p>This is a paragraph under the title</p>
<h2>First Subtitle</h2>
<p>First paragraph under the subtitle</p>
<h2>Second Subtitle</p>
</body>
</html>
Compare the first <h2> tag with the second one. The second <h2> is not properly closed: the markup has </p> instead of </h2>. Let’s try to extract both headings from this markup with a regex:
import re
html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"
# Non-greedy match of the text between an opening <h2> and a closing </h2>
headingTags = re.findall(r"<h2>(.*?)</h2>", html)
print(*headingTags, sep="\n")
Let’s run the code and check the output:
First Subtitle
The text from the second heading tag is missing. This happens because the pattern requires a closing </h2>, and the second heading never has one, so the regex finds no match for it.
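Malformed markup is not the only way the pattern can miss a heading. The sketch below runs the same regex against a few hypothetical variations of a heading (the `variants` dictionary and its contents are made up for illustration and do not come from the page above) to show how small, valid changes in the markup can also produce an empty result:

```python
import re

# Hypothetical variations of the same heading markup, for illustration only.
variants = {
    "plain": "<h2>First Subtitle</h2>",
    "with_attribute": '<h2 class="section">First Subtitle</h2>',
    "multiline": "<h2>\nFirst Subtitle\n</h2>",
    "uppercase": "<H2>First Subtitle</H2>",
}

# The same pattern used in the snippet above.
pattern = re.compile(r"<h2>(.*?)</h2>")

# Collect the matches for each variant so they can be compared side by side.
results = {name: pattern.findall(markup) for name, markup in variants.items()}

for name, matches in results.items():
    print(f"{name}: {matches}")
```

Only the plain variant matches. The attribute breaks the literal `<h2>`, the dot does not match newlines by default, and the pattern is case sensitive. Each case can be patched with flags or a longer pattern, but every patch makes the regex harder to maintain.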
One solution to this problem is to use a library like BeautifulSoup, which allows you to navigate and search the HTML tree structure rather than relying on regular expressions. With BeautifulSoup, you can extract the h2 headings from the same page like this:
from bs4 import BeautifulSoup
html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"
soup = BeautifulSoup(html, 'html.parser')
# find_all returns every h2 element the parser recovered, including the unclosed one
for headingTag in soup.find_all('h2'):
    print(headingTag.text)
BeautifulSoup recovers the text even from the malformed tag, and the output looks like this:
First Subtitle
Second Subtitle
This approach is more robust to changes in the HTML structure, as it does not rely on specific patterns in the HTML code. If you’re interested in finding out more about BeautifulSoup, this article is a perfect read.
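As a rough illustration of that robustness, the sketch below feeds BeautifulSoup three hypothetical heading variations (an added attribute, line breaks inside the tag, and an uppercase tag name). The markup is invented for this example, and the behavior shown assumes Python's built-in `html.parser` backend:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: each heading is written differently.
html = (
    '<h2 class="section">First Subtitle</h2>'
    "<h2>\nSecond Subtitle\n</h2>"
    "<H2>Third Subtitle</H2>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text(strip=True) trims the surrounding whitespace and line breaks.
headings = [tag.get_text(strip=True) for tag in soup.find_all("h2")]
print(headings)
```

All three headings are found because the lookup is done by tag name on the parsed tree, not by matching the exact characters of the source markup.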
Another solution is to use a web scraping API such as WebScrapingAPI, which abstracts away the complexities of web scraping and allows you to easily extract the data you need without worrying about the underlying HTML structure.
With WebScrapingAPI, you can extract data from any website with a simple API call, and it automatically handles changes in the HTML structure.