Perfect! Now you can get it started.
For the moment I will illustrate some basic Cheerio features using a static HTML document. Simply copy and paste the content below into a new static.html file within your project:
<!DOCTYPE html>
<html lang="en">
<head>
<title>Page Name - Static HTML Example</title>
</head>
<body>
<div class="page-heading">
<h1>Page Heading</h1>
</div>
<div class="page-container">
<div class="page-content">
<ul>
<li>
<a href="#">Item 1</a>
<p class="price">$100</p>
<p class="stock">12</p>
</li>
<li>
<a href="#">Item 2</a>
<p class="price">$200</p>
<p class="stock">422</p>
</li>
<li>
<a href="#">Item 3</a>
<p class="price">$150</p>
<p class="stock">5</p>
</li>
</ul>
</div>
</div>
<footer class="page-footer">
<p>Last Updated: Friday, September 23, 2022</p>
</footer>
</body>
</html>
Next, you have to serve the HTML file as input to Cheerio, which will then return the resulting API:
import fs from 'fs'
import * as cheerio from 'cheerio'
const staticHTML = fs.readFileSync('static.html')
const $ = cheerio.load(staticHTML)
If you receive an error at this step, make sure that the input file contains a valid HTML document, as from Cheerio version 1.0.0 this criterion is verified as well.
Now you can start experimenting with what Cheerio has to offer. The NPM package is well-known for its jQuery-like syntax and the use of CSS selectors to extract the nodes you are looking for. You can check out their official documentation to get a better idea.
Let’s say you want to extract the page title:
const title = $('title').text()
console.log("Static HTML page title:", title)
We should test this, right? You are using Typescript, so you have to compile the code, which will create the dist directory, and then execute the associated index.js file. For simplicity, I will define the following script in the package.json file:
"scripts": {
"test": "npx tsc && node dist/index.js",
}
This way, all I have to do is to run:
npm run test
and the script will handle both steps that I just described.
Okay, but what if the selector matches more than one HTML element? Let’s try extracting the name and the stock value of the items presented in the unordered list:
const itemStocks = {}
$('li').each((index, element) => {
const name = $(element).find('a').text()
const stock = $(element).find('p.stock').text()
itemStocks[name] = stock
})
console.log("All items stock:", itemStocks)
Now run the shortcut script again, and the output from your terminal should look like this:
Static HTML page title: Page Name - Static HTML Example
All items stock: { 'Item 1': '12', 'Item 2': '422', 'Item 3': '5' }