Parse HTML like a Pro: Mastering Web Scraping with Python and Regex

Suciu Dan on Apr 13 2023


The amount of data available on the internet has grown in recent decades. Humans consume this data for a wide variety of purposes, from personal interests to business research.

Yet, if this data is not returned in a structured format, such as XML or JSON, it can be difficult or impossible for software applications to read. This is where the technique of web scraping comes in.

Web scraping is the process of gathering and processing raw data from the internet. This data is parsed and used for a variety of purposes, such as price intelligence, market research, training AI models, sentiment analysis, brand audits, and SEO audits.

One of the key aspects of web scraping is parsing HTML. This can be done using a variety of tools, such as BeautifulSoup for Python, Cheerio for NodeJS, and Nokogiri for Ruby.

Regular expressions (regex) are a sequence of characters that define a search pattern.

In this article, we will explore how to parse an HTML document using regex and Python. We will also discuss some of the challenges and alternative solutions that come with web scraping.

By the end of the article, you will have a comprehensive understanding of the topic and the different tools and techniques that are available.

Basic Regex Parsing

Most general-purpose programming languages support regex. You can use regex in a wide variety of programming languages, including Python, C, C++, Java, Rust, OCaml, and JavaScript.

Here’s what a regex rule for extracting the value from the <title> tag looks like:

<title>(.*?)</title>

Scary, isn’t it? Keep in mind this is just the beginning. We’ll go down the rabbit hole soon.

For this article, I’m using Python 3.11.1. Let’s take this rule and put it into code. Create a file called main.py and paste this snippet:

import re

html = "<html><head><title>Scraping</title></head></html>"

title_search = re.search("<title>(.*?)</title>", html)
title = title_search.group(1)

print(title)

You can execute this code by running the command `python main.py`. The output will be the word “Scraping”.

In this example, we are using the `re` module to work with regex. The `re.search()` function searches for a specific pattern within a string. The first argument is the regex pattern, and the second argument is the string in which we are searching.

The regex pattern in this example is "<title>(.*?)</title>". It consists of several parts:

  • <title>: This is a literal string; it will match the characters "<title>" exactly.
  • (.*?): This is a capturing group, denoted by parentheses. The . character matches any single character (except a newline), and the * quantifier matches 0 or more of the preceding element. The ? makes the * non-greedy, so it matches as few characters as possible and stops at the first closing tag it finds.
  • </title>: This is also a literal string; it will match the characters "</title>" exactly.

The re.search() function returns a match object if a match is found, and the group(1) method is used to extract the text matched by the first capturing group, which is the text between the opening and closing title tags.

This text will be assigned to the variable title, and the output will be "Scraping".
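To see why the non-greedy `?` matters, compare the two quantifiers on a string containing two title tags. This is a contrived snippet for illustration, not taken from a real page:

```python
import re

# Hypothetical markup with two <title> tags to contrast the quantifiers
html = "<title>First</title><title>Second</title>"

# Greedy: .* grabs as much as possible, so it runs to the LAST </title>
greedy = re.search("<title>(.*)</title>", html).group(1)

# Non-greedy: .*? grabs as little as possible, stopping at the FIRST </title>
lazy = re.search("<title>(.*?)</title>", html).group(1)

print(greedy)  # First</title><title>Second
print(lazy)    # First
```

The greedy version silently captures markup between the tags, which is rarely what you want when parsing HTML.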

Advanced Regex Parsing

Extracting the data from a single HTML tag is not that useful. It gives you a glimpse of what you can do with regular expressions, but you can’t use it in a real-world situation.

Let’s check the PyPI website, the Python Package Index. On the home page, they display four stats: the number of projects, the number of releases, the number of files, and the number of users.

We want to extract the number of projects. To accomplish this, we can use this regex:

([0-9,]+) projects

The regular expression matches any sequence of one or more digits and commas followed by the word "projects". This is how it works:

  • ([0-9,]+): This is a capturing group, denoted by the parentheses. The square brackets [0-9,] match any digit from 0 to 9 or the character `,`, and the + quantifier matches 1 or more of the preceding element.
  • projects: This is a literal string; it will match "projects" exactly.

Time to put the rule to the test. Update the `main.py` code with this snippet:

import urllib.request
import re

response = urllib.request.urlopen("https://pypi.org/")
html = response.read().decode("utf-8")

matches = re.search("([0-9,]+) projects", html)
projects = matches.group(1)

print(projects)

We are using the urlopen method from the urllib library to make a GET request to the pypi.org website and read the response into the html variable. We then run the regex rule against the HTML content and print the first captured group.

Run the code with the `python main.py` command and check the output: it will display the number of projects from the site.
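One caveat worth knowing: `re.search()` returns None when the pattern is not found, so calling `.group(1)` directly, as the snippet above does, would raise an AttributeError if the page layout ever changes. A minimal defensive sketch, using a hypothetical page body that lacks the stats string:

```python
import re

# Hypothetical page content that does not contain the stats string
html = "<html><body>No stats here</body></html>"

matches = re.search(r"([0-9,]+) projects", html)
if matches is None:
    print("pattern not found")  # avoids AttributeError on .group(1)
else:
    print(matches.group(1))
```

Guarding against a missing match keeps the scraper from crashing when a site redesign removes or renames the element you target.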

Extracting Links

Now that we have a simple scraper that can get the HTML document of a site, let’s play with the code a little.

We can extract all the links with this rule:

href=[\'"]?([^\'" >]+)

This regular expression consists of several parts:

  • href=: This is a literal string; it will match the characters "href=" exactly.
  • [\'"]?: The square brackets match any single character inside them, in this case ' or ". The ? quantifier matches zero or one of the preceding element, so the href value can be enclosed in double quotes, single quotes, or no quotes at all.
  • ([^\'" >]+): This is a capturing group, denoted by the parentheses. The ^ inside the square brackets means negation: it matches any character that is not ', ", >, or a space. The + quantifier matches 1 or more of the preceding element, so the group captures characters up to the end of the attribute value.
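We can exercise this rule with a short sketch. The sample markup below is hypothetical, chosen to cover double-quoted, single-quoted, and unquoted href values, but the pattern is the one described above:

```python
import re

# Hypothetical markup covering double-quoted, single-quoted,
# and unquoted href values
html = '<a href="https://pypi.org/">PyPI</a> <a href=\'/help/\'>Help</a> <a href=about>About</a>'

links = re.findall(r'href=[\'"]?([^\'" >]+)', html)
print(*links, sep="\n")
# https://pypi.org/
# /help/
# about
```

Note that re.findall returns only the text of the capturing group, so the output is a clean list of URL values without the surrounding quotes.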

Extracting Images

One more thing and we’re almost done with writing regex rules: we need to extract the images. Let’s use this rule:

<img.*?src="(.*?)"

This regular expression consists of several parts:

  • <img: This is a literal string; it will match the characters "<img" exactly.
  • .*?: The .* matches any character (except a newline) 0 or more times, and the ? quantifier makes it match as few characters as possible. This skips over any attributes that appear before src, so the pattern matches an <img> tag regardless of how many attributes it has.
  • src=": This is a literal string; it will match the characters src=" exactly, including the opening double quote.
  • (.*?): This is a capturing group, denoted by the parentheses. The non-greedy .*? matches as few characters as possible, so this group captures the src value of the <img> tag and stops at the first closing quote.
  • ": This is a literal string; it will match the closing double quote exactly.

Let’s put it to the test. Replace the previous snippet code with this one:

import urllib.request
import re

response = urllib.request.urlopen("https://pypi.org/")
html = response.read().decode("utf-8")

images = re.findall('<img.*?src="(.*?)"', html)

print(*images, sep="\n")

The output of this code will display a list of all the image links from the PyPI page.

Limitations

Web scraping with regular expressions can be a powerful tool for extracting data from websites, however, it also has its limitations. One of the main issues with using regex for web scraping is that it can fail when the structure of the HTML changes.

For example, consider the following code sample where we are trying to extract the text from the h2 using regex:

<html>
<head>
<title>Example Title</title>
</head>
<body>
<h1>Page Title</h1>
<p>This is a paragraph under the title</p>
<h2>First Subtitle</h2>
<p>First paragraph under the subtitle</p>
<h2>Second Subtitle</p>
</body>
</html>

Compare the first <h2> tag with the second one. You may notice the second <h2> is not properly closed, and the code has </p> instead of </h2>. Let’s update the snippet with this:

import re

html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"

headingTags = re.findall("<h2>(.*?)</h2>", html)

print(*headingTags, sep="\n")

Let’s run the code and check the output:

First Subtitle

The text from the second heading tag is missing. This happens because the regex rule is not matching the unclosed heading tag.

One solution to this problem is to use a library like BeautifulSoup, which allows you to navigate and search the HTML tree structure, rather than relying on regular expressions. With BeautifulSoup, you can extract the title of a webpage like this:

from bs4 import BeautifulSoup

html = "<html><head><title>Example Title</title></head><body><h1>Page Title</h1><p>This is a paragraph under the title</p><h2>First Subtitle</h2><p>First paragraph under the subtitle</p><h2>Second Subtitle</p></body></html>"

soup = BeautifulSoup(html, 'html.parser')

for headingTag in soup.findAll('h2'):
    print(headingTag.text)

BeautifulSoup manages to extract the text even from the malformed tag, and the output looks like this:

First Subtitle

Second Subtitle

This approach is more robust to changes in the HTML structure, as it does not rely on specific patterns in the HTML code. If you’re interested in finding out more about BeautifulSoup, this article is a perfect read.

Another solution is to use a web scraping API such as WebScrapingAPI, which abstracts away the complexities of web scraping and allows you to easily extract the data you need without worrying about the underlying HTML structure.

With WebScrapingAPI, you can extract data from any website with a simple API call, and it automatically handles changes in the HTML structure.

Final thoughts

Data parsing with regular expressions can be a powerful tool for extracting data from websites.

In this article, we've discussed the basics of regular expressions, how to use them to parse HTML, and some of the challenges you may encounter when using them. We've also seen how libraries like BeautifulSoup can be used as an alternative solution.

You've learned how to extract data from web pages using regular expressions, and how to improve the reliability of your code by using a more robust library such as BeautifulSoup.

Web scraping can be a time-consuming task, but with the right tools, it can be easy and efficient. If you're looking for a web scraping solution that will save you time and effort, give WebScrapingAPI a try.

We offer a free trial for 14 days, where you can test our service and see the benefits of using a web scraping API.
