Top 5 JavaScript Libraries For Web Scraping


Robert Sfichi
Full-Stack Software Developer @ WebScrapingAPI
Summary

Discover the 5 most popular JavaScript web scraping libraries: Axios, Nightmare, Cheerio, Puppeteer, Selenium. Which one is the best?

As the computer era has evolved, people have developed many useful techniques for creating gigantic datasets. One of these techniques is called web scraping, and it is most frequently used by data scientists, statisticians, computer scientists, and software developers to gather valuable insights about a specific issue.

Web scraping can be done in many ways, using various tools such as APIs, browser extensions, and libraries. This article focuses on the last category.

Being one of the most frequently used programming languages for developing software, JavaScript has a lot of libraries that can help you in the process of gathering the data you need.

In this regard, we would like to offer you our recommendations for the 5 most useful JavaScript libraries in terms of web scraping.

A brief introduction to web scraping

As the name suggests, a web scraper is a piece of software that crawls different web pages in the process of extracting data, such as specific strings, numbers, or attachments.

You can see it as a better alternative to copying and pasting information from a web page into a spreadsheet. Instead of doing this, a computer program can do it for you. It’s a lot quicker and probably more precise. It’s pretty clear that without it, we would have to do a lot more work in order to get the same results.

Web scraping use cases

People use web scrapers for all kinds of reasons. The most popular use cases are:

  • Lead generation
  • Price comparison
  • Market analysis
  • Academic research
  • SEO Audit & keyword search
  • Creating training and testing datasets for Machine Learning processes

An eCommerce business might use a web scraper to gather information about a competitor's product photos, features, or descriptions in order to get a better overview of the market.

GPT-3 is considered one of the most powerful language models available at the moment. Machine learning models work best when they are trained on large amounts of data, but gathering that data manually can take hours, even days. Data scientists can use web scrapers to assemble the datasets they need to train their models.

Prepare your workspace

To use the following libraries, you must make sure Node.js is installed on your machine. Check this by running the following command in a new terminal window:

node -v

If you have Node.js installed you should see the version appearing on the next line. It will look something like this:

v14.15.0

If you have received the confirmation of having Node.js installed, please jump to the next section. For those of you who have not installed Node.js before, let’s go through the process of doing it right now.

First, go to the Node.js website and get the latest version (14.16.0 at the moment of writing this article). Click on the button that says “Recommended For Most Users” and wait for the download to complete. Launch the installer once the download is finished.

Once the installation is complete you can check the version of Node.js by running the following command in a new terminal window:

node -v

After a couple of seconds, you should see the version of Node.js you have installed.

Top 5 JavaScript tools used for web scraping

Axios

Axios is one of the most popular JavaScript libraries for making HTTP requests directly from a Node.js environment. Because Axios parses JSON responses automatically, there is no need to pass the result of an HTTP request to a .json() method. Needless to say, Axios is a very powerful tool when it comes to web scraping.

In order to install Axios, run the following command in the project’s folder:

npm i axios

By using the next libraries, we will demonstrate the power of Axios more clearly.

Nightmare

Nightmare was created to help its users automate tasks across websites that don't offer an API.

At the moment most people use it to create a more realistic request when trying to scrape data from a web page. By using its core features, we can mimic a user’s action with an API that feels synchronous for each block of scripting.

Because Nightmare uses Electron instead of Chromium, its bundle size is a little smaller. Let's take a look at a real-world example of how someone could use it for web scraping. First, install it by running the following command:

npm i nightmare

We will try to take a screenshot of a random web page. Let's create a new index.js file and type or copy the following code:

const Nightmare = require('nightmare')

const nightmare = new Nightmare()

nightmare.goto('https://old.reddit.com/r/learnprogramming')
	.screenshot('./screenshot.png')
	.end()
	.then(() => {
		console.log('Done!')
	})
	.catch((err) => {
		console.error(err)
	})

First, we create a new Nightmare instance and point the browser at the web page we want to capture. We then take and save the screenshot and end the Nightmare session.

To run it, type the following command in the terminal and hit enter.

node index.js

You should see a new file called screenshot.png in the project's folder.

Cheerio

Cheerio implements a fast, flexible subset of core jQuery for parsing HTML on the server. It doesn't make HTTP requests itself, so we'll pair it with Axios. Install it by running the following command in the project's folder:

npm i cheerio

To test Cheerio's functionality, let's try to collect all the post titles on the same subreddit: /r/learnprogramming.

Let's create a new file called index.js and type or just copy the following lines:

const axios = require("axios");
const cheerio = require("cheerio");

const fetchTitles = async () => {
	const response = await axios.get('https://old.reddit.com/r/learnprogramming/');

	// Load the raw HTML into Cheerio so we can query it with CSS selectors.
	const $ = cheerio.load(response.data);

	const titles = [];

	// On old Reddit, each post title is an anchor inside a p.title element.
	$('div > p.title > a').each((_idx, el) => {
		titles.push($(el).text());
	});

	return titles;
};

fetchTitles()
	.then((titles) => console.log(titles))
	.catch((error) => console.error(error));

To be entirely sure that we only select the anchor tags containing post titles, the div > p.title > a selector also matches their parent elements. To get each title individually, we loop through the matched posts with the each() function; calling text() on each element returns the title of that specific post.

To run it, just type the following command in the terminal and hit enter.

node index.js

You should see an array containing all the titles of the posts.

Puppeteer

Puppeteer helps us automate the most basic tasks we normally do while using a web browser, like completing a form or taking screenshots of specific pages.

Let's try to better understand its functionality by taking a screenshot of the /r/learnprogramming Reddit community. Run the following command in the project's folder to install the dependency:

npm i puppeteer 

Now, create a new index.js file and type or copy the following code:

const puppeteer = require('puppeteer')

async function takeScreenshot() {
	try {
		const URL = 'https://old.reddit.com/r/learnprogramming/'
		const browser = await puppeteer.launch()
		const page = await browser.newPage()

		await page.goto(URL)
		await page.pdf({ path: 'page.pdf' })
		await page.screenshot({ path: 'screenshot.png' })

		await browser.close()
	} catch (error) {
		console.error(error)
	}
}

takeScreenshot()

Inside the asynchronous takeScreenshot() function, we launch a headless browser, open a new page, and navigate to the target URL. The pdf() and screenshot() methods then save the page as a PDF file and as an image.

To run it, run node index.js in a new terminal window. You should see two new files in the project's folder called page.pdf and screenshot.png.

Selenium

Selenium is used by automation specialists, data scientists, quality assurance engineers, and software developers alike. By simply installing it and writing fewer than 10 lines of code, we can feel the power of web scraping.

Just like Nightmare, Selenium produces more realistic traffic by driving a real browser through user actions such as opening pages, clicking buttons, or filling in forms.

We can install Selenium by running the following command in a new terminal window:

npm i selenium-webdriver

Now, let's give it a shot by running a Google search. First, create a new index.js file and type or copy the following code:

const {Builder, By, Key, until} = require('selenium-webdriver');

(async function example() {
  let driver = await new Builder().forBrowser('firefox').build();
  try {
    await driver.get('http://www.google.com/');
    // Find the search box, type the query, and submit it.
    await driver.findElement(By.name('q')).sendKeys('web scraping', Key.RETURN);
    // Wait up to one second for the results page title to appear.
    await driver.wait(until.titleIs('web scraping - Google Search'), 1000);
  } finally {
    await driver.quit();
  }
})();

Check the Selenium documentation for more information.

Conclusion

Web scraping is a very powerful technique to extract information from web pages. For any of the use cases presented above, web scraping can save a lot of money and time. If the script is programmed appropriately, the computer can extract and arrange much more information compared to a human being. This is why the right libraries matter.

Difficulties arise when doing web scraping. Problems are inevitable but they can usually be solved. In the end, it’s really just about your experience. If you’re more comfortable using Selenium instead of Nightmare, go ahead. There’s no perfect library and we know that. We just hope we managed to make your decision process a little bit less complicated.


Start scraping data with WebScrapingAPI

Get started with 5,000 free API calls.
No credit card required.