Parsel: How to Extract Text From HTML in Python

Mihai Maxim on Jan 31 2023

Introduction

Web scraping is the automated process of collecting data from websites by using a script or program. It's used to extract information such as text, images, and other types of data that can be useful for different purposes like research, data analysis, or market analysis.

Nowadays, there are a ton of solutions when it comes to web scraping with Python. Selenium and Scrapy are some of the most widely used and popular libraries. While these tools are great for complicated scraping tasks, they can be a bit overwhelming for casual use.

Enter Parsel, the little scraping library. This lightweight and easy-to-learn library is perfect for small projects and is great for those who are new to web scraping. It is able to parse HTML and extract data using CSS and XPath selectors, making it a great tool for any data lover looking for a fast and easy way to collect information from the web.

Buckle up and get ready to learn how to use this library as you join me on this adventure of automated data collection. Let's get scraping!

Getting Started With Parsel

You can install the Parsel library with:

pip install parsel

Now let’s dive straight into an example project and scrape all the country data from this simple website: https://www.scrapethissite.com/pages/simple/.

To get the HTML from the website, you will need to make an HTTP GET request.

We will be making HTTP requests with the “requests” Python library, so make sure you install it with:

pip install requests

Now you can fetch the HTML and write it to a file:

import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")

with open("out.html", "w", encoding="utf-8") as f:
    f.write(response.text)

Open out.html and examine the structure. The country data is stored in blocks similar to this:

<div class="col-md-4 country">

<h3 class="country-name">

<i class="flag-icon flag-icon-af"></i>

Afghanistan

</h3>

<div class="country-info">

<strong>Capital:</strong> <span class="country-capital">Kabul</span><br>

<strong>Population:</strong> <span class="country-population">29121286</span><br>

<strong>Area (km<sup>2</sup>):</strong> <span class="country-area">647500.0</span><br>

</div>

</div><!--.col-->

In order to write selectors, you’ll need to pass the raw HTML to Parsel:

import parsel
import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text

parsel_dom = parsel.Selector(text=raw_html)

Now we’re ready to write some selectors.
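
Before writing anything elaborate, you can sanity-check that the page parsed by pulling out something simple, like the title (a minimal sketch; the printed value depends on the live page):

# If parsing succeeded, this prints the text of the page's <title> element.
print(parsel_dom.css("title::text").get())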

Extracting Text Using CSS Selectors

You can print the first country capital with:

parsel_dom = parsel.Selector(text=raw_html)

first_capital = parsel_dom.css(".country-capital::text").get()
print(first_capital)

// Output
Andorra la Vella

parsel_dom.css(".country-capital::text").get() selects the inner text of the first element that has the "country-capital" class.

You can print all the country names with:

countries_names = filter(lambda line: line.strip() != "",
                         parsel_dom.css(".country-name::text").getall())
for country_name in countries_names:
    print(country_name.strip())

// Output
Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
...

parsel_dom.css(".country-name::text").getall() selects the inner texts of all the elements that have the "country-name" class.

Notice that we had to clean up the output a bit. That's because every element with the "country-name" class also has an <i> tag nested inside it, and the country name itself is padded with leading and trailing whitespace.

<h3 class="country-name">

<i class="flag-icon flag-icon-ae"></i> //this is picked up as an empty string

United Arab Emirates // this is picked up as “ United Arab Emirates “

</h3>
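
If you'd rather not filter in Python, you can push the whitespace handling into the query itself. Here is a hedged alternative using XPath's normalize-space() to skip whitespace-only text nodes (a sketch, not the approach used above):

# Keep only text nodes that contain non-whitespace, then strip the padding.
names = parsel_dom.xpath('//h3[@class="country-name"]/text()[normalize-space()]').getall()
print([name.strip() for name in names[:3]])  # e.g. ['Andorra', 'United Arab Emirates', 'Afghanistan']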

Now let’s write a script to extract all the data with CSS selectors:

import parsel
import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)

countries = parsel_dom.css(".country")
countries_data = []
for country in countries:
    country_name = country.css(".country-name::text").getall()[1].strip()
    country_capital = country.css(".country-capital::text").get()
    country_population = country.css(".country-population::text").get()
    country_area = country.css(".country-area::text").get()
    countries_data.append({
        "name": country_name,
        "capital": country_capital,
        "population": country_population,
        "area": country_area
    })

for country_data in countries_data:
    print(country_data)

// Output
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
...
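
Note that every value comes back as a string. If you need numbers for analysis, a small post-processing pass works (a sketch, assuming the numeric formatting shown in the output above):

# Convert the numeric fields in place: population is an integer string,
# area a float string, per the output above.
for country_data in countries_data:
    country_data["population"] = int(country_data["population"])
    country_data["area"] = float(country_data["area"])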

Extracting Text Using XPath Selectors

XPath is a query language for selecting nodes from an XML document. It stands for XML Path Language, and it uses a path notation similar to that of URLs to navigate through the elements and attributes of a document. XPath expressions can select a single element, a set of elements, or a specific attribute of an element. XPath is primarily used in XSLT, but it can also be used to navigate the Document Object Model (DOM) of any XML-like document, such as HTML or SVG.

XPath can seem intimidating at first, but it is actually quite easy to get started with once you understand the basic concepts and syntax. One resource that can come in handy is our XPath selectors guide at https://www.webscrapingapi.com/the-ultimate-xpath-cheat-sheet.
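
To get a feel for the notation before diving in, here are a few illustrative one-liners against the page we already parsed (a sketch; each call returns parsel selectors or strings):

# // searches the whole document, / steps one level down, @ reads attributes.
parsel_dom.xpath("//h3").getall()          # every <h3> element, serialized as HTML
parsel_dom.xpath("//h3/@class").getall()   # their class attributes, as strings
parsel_dom.xpath("//h3/text()").getall()   # their direct child text nodes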

Now let’s try some selectors:

Here is how you can print the first capital:

parsel_dom = parsel.Selector(text=raw_html)

first_capital = parsel_dom.xpath('//*[@class="country-capital"]/text()').get()
print(first_capital)

// Output
Andorra la Vella

And all the country names:

countries_names = filter(lambda line: line.strip() != "",
                         parsel_dom.xpath('//*[@class="country-name"]//text()').getall())
for country_name in countries_names:
    print(country_name.strip())

// Output
Andorra
United Arab Emirates
Afghanistan
Antigua and Barbuda
Anguilla
...

Let’s reimplement the script with XPath selectors:

import parsel
import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)

countries = parsel_dom.xpath('//div[contains(@class,"country")][not(contains(@class,"country-"))]')
countries_data = []
for country in countries:
    country_name = country.xpath(".//h3/text()").getall()[1].strip()
    country_capital = country.xpath(".//span/text()").getall()[0]
    country_population = country.xpath(".//span/text()").getall()[1]
    country_area = country.xpath(".//span/text()").getall()[2]
    countries_data.append({
        "name": country_name,
        "capital": country_capital,
        "population": country_population,
        "area": country_area
    })

for country_data in countries_data:
    print(country_data)

// Output
{'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'}
{'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
{'name': 'Afghanistan', 'capital': 'Kabul', 'population': '29121286', 'area': '647500.0'}
...
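
One caveat: the span indexing above ([0], [1], [2]) relies on the spans appearing in a fixed order. A sturdier variant selects by class instead of position (a sketch against the same page structure):

# Selecting by class is robust to the spans being reordered.
country_capital = country.xpath('.//span[@class="country-capital"]/text()').get()
country_population = country.xpath('.//span[@class="country-population"]/text()').get()
country_area = country.xpath('.//span[@class="country-area"]/text()').get()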

Removing elements

Removing elements is simple. Just apply the drop function to a selector:

selector.css(".my_class").drop()

Let’s showcase this functionality by writing a script that removes the “population” field from each country:

import parsel
import requests

response = requests.get("https://www.scrapethissite.com/pages/simple/")
raw_html = response.text
parsel_dom = parsel.Selector(text=raw_html)

countries = parsel_dom.css(".country")
for country in countries:
    country.css(".country-population").drop()
    country.xpath(".//strong")[1].drop()
    country.xpath(".//br")[1].drop()

countries_without_population_html = parsel_dom.get()
with open("out.html", "w", encoding="utf-8") as f:
    f.write(countries_without_population_html)
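
Keep in mind that drop() mutates the parsed tree in place, which is why parsel_dom.get() above returns the modified HTML. You can verify the removal by re-querying (a quick check):

# After the loop, the population spans are gone from the tree.
print(parsel_dom.css(".country-population").get())  # prints: None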

Exporting the data

When you've finished scraping the data, it's important to think about how you want to save it. Two common formats for storing this kind of data are .json and .csv. Choose the one that works best for your project's needs.

Exporting the data to .json

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for humans to read and write and easy for machines to parse and generate. It is often used for exchanging data between a web application and a server, or between different parts of a web application. JSON is similar to a Python dictionary in that it stores data as key-value pairs, and the two map onto each other naturally.

Exporting a list of Python dictionaries to .json can be done with the json library:

import json

countries_dictionaries = [
    {'name': 'Andorra', 'capital': 'Andorra la Vella', 'population': '84000', 'area': '468.0'},
    {'name': 'United Arab Emirates', 'capital': 'Abu Dhabi', 'population': '4975593', 'area': '82880.0'}
]

json_data = json.dumps(countries_dictionaries, indent=4)
with open("data.json", "w") as outfile:
    outfile.write(json_data)

// data.json
[
    {
        "name": "Andorra",
        "capital": "Andorra la Vella",
        "population": "84000",
        "area": "468.0"
    },
    {
        "name": "United Arab Emirates",
        "capital": "Abu Dhabi",
        "population": "4975593",
        "area": "82880.0"
    }
]

Exporting the data to .csv

A CSV is a simple way to store data in a text file, where each line represents a row and each value is separated by a comma. It's often used with spreadsheet or database programs. Python has great built-in support for working with CSV files through its csv module. One of the most useful features of the csv module is the DictWriter class, which lets you write a Python dictionary to a CSV file in a straightforward way: the keys of the dictionary are used as the column headers, and the values are written as the corresponding data in the rows.

Here is how you can use the csv library to export a list of Python dictionaries to .csv:

import csv

countries_dictionaries = [
    {"name": "John Smith", "age": 35, "city": "New York"},
    {"name": "Jane Doe", "age": 28, "city": "San Francisco"}
]

# newline="" avoids blank rows on Windows, per the csv module docs
with open("data.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=countries_dictionaries[0].keys())
    writer.writeheader()
    for row in countries_dictionaries:
        writer.writerow(row)

// data.csv
name,age,city
John Smith,35,New York
Jane Doe,28,San Francisco

Wrapping up

In this article, we've explored the use of the Parsel library in Python. We've seen how easy it is to use the CSS and XPath selectors that Parsel provides to extract data from web pages. Overall, Parsel provides an efficient and versatile solution for web scraping. If you're interested in automating data collection, you should definitely give it a try.

Do you want to learn more about web scraping? Check out our product, WebScrapingAPI, and discover how you can take your data extraction skills to the next level. Our powerful API is specifically designed to help you conquer the most common challenges of web scraping, like avoiding IP bans or rendering JavaScript. And the best part? You can try it for free!
