Raluca Penciuc · Last updated on May 8, 2026 · 13 min read

Web Scraping Booking.com: Hotels, Prices, and Reviews (2026 Guide)

TL;DR: This guide walks through web scraping Booking.com end to end in Python: pulling search listings, hotel pages, nightly prices, and guest reviews. You get two complementary methods: a Selenium Wire workflow for JS-rendered pages and a faster path that calls Booking.com's internal /dml/graphql endpoint directly, plus an anti-block playbook, currency handling, and a workaround for the roughly 1,000-result paging cap.

Booking.com is the kind of dataset travel and hospitality teams keep coming back to: live nightly rates, competitor positioning, supply by neighborhood, guest sentiment by property. The catch is that none of it is exposed through an open API for the general public, so if you want it programmatically you end up doing some form of web scraping Booking.com yourself. This tutorial shows two practical Python paths and ties them together with the production concerns that usually bite people on the second week.

At the time of writing, Booking.com is one of the largest accommodation platforms on the web, with millions of bookable properties across hotels, resorts, and short stays. (We'll keep specific listing counts approximate; the company's public numbers move around.) The platform is heavily JavaScript-driven and ships real anti-bot defenses, so naive requests.get scripts tend to fail before they get useful.

You'll see how to run a Selenium-based scraper for search results, how to reverse-engineer the same data out of the internal GraphQL endpoint, how to pull hotel detail pages, prices, and reviews, and how to scale past the result cap with sitemaps and query partitioning. Code is Python 3.10+ and assumes you're comfortable with DevTools and CSS selectors.

Why web scraping Booking.com is worth the effort

There are a handful of use cases where web scraping Booking.com pays for itself almost immediately. Rate intelligence teams compare nightly prices across competitor hotels in real time. Revenue managers track availability and discount patterns to time their own promotions. Market research and travel analytics teams use review volume, scores, and amenity coverage to benchmark a destination. And anyone building a metasearch or AI travel agent needs structured property data that the public site only renders inside JavaScript.

Across this guide we'll pull five concrete entity types: search-result listings (hotel cards on a query page), hotel detail pages (description, address, amenities, geolocation), per-night pricing and availability, guest reviews, and sitemap-based hotel inventory for bulk discovery. Each has its own quirks, and mixing them is what gives you a real dataset rather than a single screenshot of a SERP.

Picking a scraping approach: browser automation vs hidden API

There are two reasonable ways to do web scraping Booking.com at any kind of volume, and they're complementary rather than competing.

Selenium with Selenium Wire drives a real Chrome instance, executes the page's JavaScript, and lets you read the rendered DOM. It is the lowest-friction option when you don't yet know the page's hidden requests, and it tolerates layout drift well because you query the same DOM a user sees. The price is speed and resource usage: each page is a full browser tab. For curated lists of a few thousand hotels, that is fine. For continuous monitoring it gets expensive.

Calling the internal /dml/graphql endpoint with httpx skips the browser entirely. Booking.com's own front end fetches search results from this endpoint, so once you mirror the request shape you get the same JSON the site does, ten to fifty times faster than Selenium and with a tiny memory footprint. The trade-off is fragility: payloads and required headers change, and you must keep them in sync.

A solid default: prototype with Selenium, lock in the GraphQL request once you understand the data, and use the API path for production.

Setting up your Python environment

Use Python 3.10 or newer in a fresh virtualenv so the dependencies stay isolated:

mkdir booking_scraper && cd booking_scraper
python -m venv .venv && source .venv/bin/activate
pip install selenium selenium-wire webdriver-manager httpx parsel
touch app.py

selenium-wire is a drop-in replacement for selenium that exposes the underlying network requests, which we'll need for pagination synchronization. webdriver-manager auto-downloads the matching chromedriver binary, so you don't have to babysit driver versions across machines. httpx gives us an HTTP/2-capable client for Method 2, and parsel provides Scrapy-style CSS and XPath selectors for parsing hotel HTML. Our step-by-step Selenium tutorial is a useful warm-up if you've never used Selenium for scraping before.

Method 1: Scraping search results with Selenium and Selenium Wire

This is the friendliest entry point into web scraping Booking.com: open a search URL in a real Chrome session, let JavaScript render the property cards, and walk the DOM. We use Selenium Wire rather than vanilla Selenium because the search page loads results through background XHR/fetch calls. Selenium Wire lets us inspect those individual requests and wait until a specific response actually returns, which matters for paginating without race conditions.

Loading the search page and isolating property cards

Always include explicit check-in and check-out dates in the URL. Without them, Booking.com falls back to default availability and your price column will not line up with what a user would see for a real booking window.

from seleniumwire import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
url = ('https://www.booking.com/searchresults.html'
       '?ss=London&checkin=2026-05-10&checkout=2026-05-12&group_adults=2')
driver.get(url)

cards = driver.find_elements(By.CSS_SELECTOR, "div[data-testid='property-card']")
print(f'Found {len(cards)} property cards on page 1')

Booking.com is fairly consistent about using data-testid attributes on its result cards, which makes them more stable to target than class names that get auto-generated.

Extracting name, address, score, review count, price, and image

Each property card carries the same handful of data-testid hooks, so the per-card parser is mostly a small dictionary of selectors. CSS selectors are usually the right call here (concise and fast), but XPath is fine when you need a parent or sibling traversal. See our XPath vs CSS selectors guide if you're picking sides.

def parse_card(card):
    def text(sel):
        nodes = card.find_elements(By.CSS_SELECTOR, sel)
        return nodes[0].text.strip() if nodes else None

    def attr(sel, name):
        nodes = card.find_elements(By.CSS_SELECTOR, sel)
        return nodes[0].get_attribute(name) if nodes else None

    score_block = text("div[data-testid='review-score']") or ''
    score_lines = [s.strip() for s in score_block.split('\n') if s.strip()]
    score = score_lines[0] if score_lines else None
    review_count = next((l for l in score_lines if 'review' in l.lower()), None)

    return {
        'name':         text("div[data-testid='title']"),
        'url':          attr("a[data-testid='title-link']", 'href'),
        'address':      text("span[data-testid='address']"),
        'score':        score,
        'review_count': review_count,
        'price':        text("span[data-testid='price-and-discounted-price']"),
        'image':        attr("img[data-testid='image']", 'src'),
    }

listings = [parse_card(c) for c in cards]

Two things to flag about prices. First, the review-score block on Booking.com squashes the numeric score and the review-count text into one element, so we split it into lines and pick them out separately. Second, the price you scrape from a search card almost always excludes taxes and fees; the all-in total only appears once you progress further into the booking flow. Treat it as the headline rate, not the final charge, and document that downstream.
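Because that headline rate arrives as a display string (currency symbol, thousands separators, sometimes stray whitespace), it pays to normalize it immediately rather than downstream. A minimal sketch, assuming the English-locale display format where ',' is a thousands separator; locales that use ',' as the decimal mark need different handling:

```python
import re
from decimal import Decimal

def parse_price(raw):
    """Split a scraped display price like '€ 1,234' or 'US$256' into
    (currency_symbol, Decimal amount). Formats vary by locale, so this
    is a best-effort sketch, not a universal parser."""
    if not raw:
        return None, None
    # Grab the numeric run, tolerating thousands separators and spaces.
    m = re.search(r'([\d.,\s]+)', raw)
    if not m:
        return raw.strip(), None
    number = m.group(1).strip()
    symbol = raw.replace(m.group(1), '').strip() or None
    # Assumes ',' is a thousands separator (English-site display).
    amount = Decimal(number.replace(',', '').replace(' ', ''))
    return symbol, amount
```

Store the amount and the symbol in separate columns, and tag each row with the nightly pre-tax caveat from above.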

Clicking through pagination without race conditions

Each click on the next-page control fires a POST to /dml/graphql, and the page only re-renders once the JSON comes back. If you click and immediately scrape the DOM, you read the previous page. Selenium Wire fixes this by letting you block on the actual response.

from selenium.webdriver.common.by import By

def total_pages(driver):
    nums = driver.find_elements(By.CSS_SELECTOR, "div[data-testid='pagination'] li")
    return max((int(n.text) for n in nums if n.text.isdigit()), default=1)

pages = total_pages(driver)
all_listings = [parse_card(c) for c in cards]

for page in range(2, pages + 1):
    del driver.requests  # clear so the next wait does not match an old response
    next_btn = driver.find_element(
        By.CSS_SELECTOR, "button[aria-label='Next page']")
    next_btn.click()
    driver.wait_for_request(r'/dml/graphql', timeout=10)
    cards = driver.find_elements(
        By.CSS_SELECTOR, "div[data-testid='property-card']")
    all_listings.extend(parse_card(c) for c in cards)

del driver.requests is the important line. Without it, wait_for_request happily matches the previous page's GraphQL call and you advance before the new data arrives. Pull the total page count from the pagination control rather than hard-coding it; busy queries can paginate twenty pages deep, quiet ones two.

Method 2: Calling Booking.com's GraphQL search endpoint directly

Once Selenium has shown you that the search page is fed by /dml/graphql, the faster move is to call that endpoint yourself and skip the browser. This is where web scraping Booking.com becomes genuinely scalable.

The discovery process is the same one you'd use for any hidden JavaScript API: open DevTools (F12), switch to the Network tab, filter by Fetch/XHR, then trigger a real search and click into page two. You'll see a POST to /dml/graphql carrying a JSON body with an operationName, a variables object (with the destination, dates, guest count, and an offset), and a query or extensions field that pins the query hash. Right-click the request and choose Copy as cURL, and that's your starting point.

Re-verify the exact field names against your own DevTools capture before shipping; Booking.com renames GraphQL operations periodically, and the safest reference is whatever the front end is sending today.

import httpx

ENDPOINT = 'https://www.booking.com/dml/graphql'
HEADERS = {
    'content-type':    'application/json',
    'origin':          'https://www.booking.com',
    'referer':         'https://www.booking.com/searchresults.html',
    'user-agent':      'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                       'AppleWebKit/537.36 (KHTML, like Gecko) '
                       'Chrome/124.0 Safari/537.36',
    'accept-language': 'en-US,en;q=0.9',
}

def search_page(client, payload, offset):
    import copy
    # Deep-copy the captured payload: a shallow {**payload} would share
    # the nested 'variables' dict, so the pagination write below would
    # mutate the caller's template across offsets.
    body = copy.deepcopy(payload)
    body['variables']['input']['pagination'] = {'offset': offset, 'rowsPerPage': 25}
    r = client.post(ENDPOINT, json=body, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()

def search_all(payload, max_results=1000):
    results = []
    with httpx.Client(http2=True) as client:
        for offset in range(0, max_results, 25):
            page = search_page(client, payload, offset)
            hits = (page.get('data', {})
                        .get('searchQueries', {})
                        .get('search', {})
                        .get('results', []))
            if not hits:
                break
            results.extend(hits)
    return results

Two details that trip people up. The endpoint returns 25 results per call, controlled by an offset variable that you bump in 25-result steps. And the request must look like it came from the site itself: origin and referer set to booking.com, content-type: application/json, and an accept-language that matches your IP region. Strip those headers and you'll get a generic 400 or a soft block within a few requests. Use HTTP/2 (httpx does this when you pass http2=True) because Booking.com's edge appears to fingerprint clients that still negotiate HTTP/1.1 only.

Scraping individual hotel pages for description, address, and amenities

Search-result cards are only one slice of web scraping Booking.com; they give you a name and a price, but they don't give you the rich hotel detail travel teams actually want. For that, scrape the hotel URL directly. Hotel pages are mostly server-rendered, so a plain GET plus parsel is enough, no browser required.

import httpx
from parsel import Selector

def scrape_hotel(url):
    # http2 is a Client option, not a top-level httpx.get() argument
    with httpx.Client(http2=True, follow_redirects=True, headers=HEADERS) as client:
        html = client.get(url).text
    sel = Selector(text=html)
    latlng = sel.css("a[data-atlas-latlng]::attr(data-atlas-latlng)").get()
    lat, lng = (latlng.split(',') + [None])[:2] if latlng else (None, None)
    return {
        'name':        sel.css('h2.pp-header__title::text').get(default='').strip(),
        'description': ' '.join(sel.css("div[data-testid='property-description'] *::text").getall()).strip(),
        'address':     sel.css("span[data-testid='address']::text").get(default='').strip(),
        'lat':         lat,
        'lng':         lng,
        'amenities':   [a.strip() for a in sel.css("div[data-testid='facility-list-most-popular'] li::text").getall() if a.strip()],
    }

The latitude and longitude are usually embedded in a data-atlas-latlng attribute on the map link, which is more reliable than parsing them out of inline scripts. Amenities are grouped into feature blocks; iterate the groups if you want them categorized rather than flattened.

Fetching nightly prices and availability

Per-night pricing is not in the hotel HTML; it lives behind a separate GraphQL query that returns a calendar-shaped response. Capture the request the same way as the search call: open the hotel page in DevTools, change the dates, and watch for the pricing/availability POST to /dml/graphql. The body includes the hotel identifiers (numeric hotel_id, country code, and currency) and a date range.

Hotel pages also embed a CSRF-style token in the HTML that the pricing query expects in the body or in a header. Extract it from the page once per hotel, then reuse it for each pricing call.
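Extracting the token usually comes down to a regex over the hotel HTML. The variable names below ('b_csrf_token', 'csrf_token') are common historical spellings, not guarantees — inspect a fresh page to see where the token actually lives today:

```python
import re

def extract_csrf(hotel_html):
    """Pull the CSRF-style token out of a hotel page's inline scripts.
    The exact variable name drifts over time; the patterns below are
    illustrative candidates, ordered by how often they've been seen."""
    patterns = [
        r"b_csrf_token:\s*'([^']+)'",
        r'"csrf_token"\s*:\s*"([^"]+)"',
    ]
    for pat in patterns:
        m = re.search(pat, hotel_html)
        if m:
            return m.group(1)
    return None  # token not found: re-check the page source in DevTools
```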

def scrape_pricing(client, hotel_id, csrf, checkin, checkout, currency='EUR'):
    payload = {
        'operationName': 'AvailabilityCalendar',  # verify in DevTools
        'variables': {
            'input': {
                'hotelId': hotel_id,
                'checkIn': checkin,
                'checkOut': checkout,
                'currency': currency,
            }
        },
        'extensions': {'csrf': csrf},
    }
    r = client.post(ENDPOINT, json=payload, headers=HEADERS, timeout=30)
    r.raise_for_status()
    return r.json()

Pulling guest reviews from the hidden reviews endpoint

Guest reviews load through a separate XHR when you click the Reviews tab on a hotel page. Open DevTools, switch to Fetch/XHR, click the tab, and copy the request. It paginates through a skip (or offset) integer in batches of roughly 25, and returns review text, score, language, reviewer country, and date.

Once you have one working call, you can fan out concurrently by running batches in an httpx.AsyncClient:

import asyncio, httpx

async def fetch_reviews(client, hotel_id, skip):
    # review_payload(hotel_id, skip) builds the JSON body you captured
    # from DevTools, with the skip offset swapped in
    r = await client.post(ENDPOINT, json=review_payload(hotel_id, skip),
                          headers=HEADERS)
    r.raise_for_status()
    return r.json()

async def all_reviews(hotel_id, total):
    async with httpx.AsyncClient(http2=True) as c:
        tasks = [fetch_reviews(c, hotel_id, s) for s in range(0, total, 25)]
        return await asyncio.gather(*tasks)

Keep concurrency in single digits per hotel; reviews are aggressively rate-limited.
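The gather above fires every batch at once, which is exactly what the rate limiter punishes. A small semaphore wrapper caps in-flight requests without restructuring the calls — a sketch, with the limit as a tunable:

```python
import asyncio

async def bounded_gather(coros, limit=5):
    """Run coroutines with at most `limit` in flight, so the review
    fan-out stays in single digits per hotel."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:          # blocks when `limit` tasks are running
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))
```

Usage: swap the bare gather for `await bounded_gather([fetch_reviews(c, hotel_id, s) for s in range(0, total, 25)], limit=5)`.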

Discovering hotels through sitemaps and the location autocomplete API

For bulk web scraping Booking.com inventory, rather than one-query-at-a-time scraping, start at https://www.booking.com/robots.txt. Booking.com publishes its Sitemap: entries there, including hotel, attraction, and airport sitemap indexes. Each sitemap index points at sub-sitemaps capped at 50,000 URLs (per the sitemap protocol), which is why the hotel index is split across many files. Walking the index gives you tens of millions of hotel URLs, with duplicates, that you can deduplicate on the URL slug or a parsed hotel id. Our sitemap scraping guide has a reusable pattern for this.

For targeted searches, Booking.com's own location autocomplete endpoint resolves a city or neighborhood string into the destination identifiers the search GraphQL call expects, which beats hard-coding them by hand.

Avoiding blocks: headers, proxies, rate limiting, and captchas

Successful web scraping Booking.com at any volume comes down to looking like a normal browser and backing off when the site asks you to. As of 2026, Booking.com's anti-bot stack appears to fingerprint both TLS and HTTP/2 behavior, so the basics are non-negotiable: an HTTP/2-capable client (httpx with http2=True), a realistic header set including accept-language and sec-ch-ua-*, and a stable user-agent that matches a current Chrome version. (Re-verify HTTP/2 sensitivity periodically; this changes.)

Use residential or ISP proxies rather than datacenter ranges; datacenter IPs hitting Booking.com start tripping captchas within a few dozen requests. Keep concurrency conservative (5 to 10 per IP), add jitter, and back off exponentially on 429 and 403. WebScrapingAPI's residential proxy network and Scraper API both handle the rotation, retry, and TLS-fingerprint pieces if you'd rather not reinvent that infrastructure. Anti-detect browsers are a last resort for the hardest pages.

Handling currency, language, and the 1,000-result paging cap

Booking.com infers the displayed currency from the geolocation of your exit IP, so a US-based scraper will see USD and an EU-based one will see EUR by default. For consistent currency, route through a country-targeted proxy or pass a selected_currency query parameter on each request. (Re-test this behavior periodically; the parameter name and IP-inference logic are the kind of thing that quietly changes.)

The platform also caps any single search at approximately 1,000 results. To enumerate inventory in a busy city, partition the query: scrape London by neighborhood (Shoreditch, Camden, Kensington), then by star rating, then by price band, and union the results on hotel id.
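The partition-and-union step looks like this in outline. run_search is a placeholder for the Method 2 GraphQL search, and the filter keys are illustrative — map them onto whatever your captured payload actually accepts:

```python
from itertools import product

def partition_city(run_search, neighborhoods, star_ratings):
    """Slice a busy city into sub-queries that each fit under the
    ~1,000-result cap, then union the hits on hotel id.
    run_search(filters) -> list of {'hotel_id': ..., ...} dicts."""
    seen = {}
    for hood, stars in product(neighborhoods, star_ratings):
        for hit in run_search({'neighborhood': hood, 'stars': stars}):
            seen.setdefault(hit['hotel_id'], hit)   # first copy wins
    return list(seen.values())
```

If a single slice still returns ~1,000 hits, it's probably truncated too — split it further (by price band) before trusting the union.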

Wrapping up and next steps

For production runs, fold this code into Scrapy and let it handle retries, persistence, and distributed runs. Persist normalized output to Postgres or a columnar store, snapshot daily, and keep your scrapers honest with robots.txt and Booking.com's terms of service.

Key Takeaways

  • Web scraping Booking.com works best as two methods used together: Selenium Wire for prototyping and DOM stability, and the internal /dml/graphql endpoint via httpx for production speed.
  • Pull the full set of entities (search listings, hotel detail pages, nightly prices, and guest reviews), rather than just the search SERP, otherwise the dataset is too thin for rate intelligence.
  • Use data-testid selectors and wait_for_request on /dml/graphql to keep the search-page scraper resilient to layout drift and pagination race conditions.
  • Plan around platform constraints up front: residential proxies, HTTP/2 headers, IP-based currency selection, and the roughly 1,000-result paging cap that forces query partitioning.
  • Use sitemaps under /robots.txt for bulk hotel-URL discovery and the location autocomplete API for resolving destination identifiers.

FAQ

Is web scraping Booking.com legal?

In most jurisdictions, scraping publicly visible hotel listings, prices, and aggregated reviews is generally treated as permissible when done at respectful rates and without bypassing authentication. That said, terms of service, the EU Database Directive, and GDPR (for any reviewer-identifiable information) all matter. Have legal counsel review your specific use case before commercial deployment, and avoid storing personal data.

How do I control the currency Booking.com returns to my scraper?

Two reliable levers: route requests through a proxy located in the country whose currency you want (Booking.com infers currency from the exit IP), or pass a selected_currency=EUR-style query parameter on each request to override the inferred default. Combine both for consistency, since the override is occasionally ignored when the IP and parameter conflict for hotels priced in a fixed local currency.
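For the parameter lever, a small stdlib helper appends or overrides selected_currency on any Booking.com URL (the parameter name is the one observed at the time of writing — re-verify it periodically):

```python
from urllib.parse import urlencode, urlparse, parse_qsl, urlunparse

def with_currency(url, currency='EUR'):
    """Append or override the selected_currency query parameter."""
    parts = urlparse(url)
    q = dict(parse_qsl(parts.query))
    q['selected_currency'] = currency
    return urlunparse(parts._replace(query=urlencode(q)))
```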

How can I extract more than 1,000 results for a busy city like London or New York?

Partition the query. Booking.com caps any single search at roughly 1,000 results, so the workaround is to slice the city into smaller subsets that each fit under the cap: by neighborhood, then by star rating, then by price band if needed. Union the resulting hotel ids and dedupe. For full inventory enumeration, fall back to walking the hotel sitemap index instead of the search interface.

Should I use Selenium or call Booking.com's GraphQL endpoint directly?

Use Selenium for discovery and small jobs; use the GraphQL endpoint for scale. Selenium is more forgiving when the front end changes because you query the rendered DOM. GraphQL is far faster and cheaper per request, but you have to keep request payloads and headers in sync with the live site. A common pattern is to maintain both and fail over from API to browser when the API breaks.
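That failover pattern is a few lines once both methods exist as callables; the function names here are placeholders for your own Method 1 and Method 2 wrappers:

```python
def with_failover(primary, fallback, query):
    """Run the GraphQL path first; drop back to the browser path when
    the payload or schema drifts. primary/fallback are the two search
    functions from this guide, passed in as callables."""
    try:
        return primary(query)
    except Exception:
        # schema drift usually surfaces as HTTP errors or KeyErrors
        # while parsing the response; the browser path is slower but
        # tracks the live front end automatically
        return fallback(query)
```

Log every failover event: a sudden spike is your early warning that the GraphQL payload needs re-capturing.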

Why do the prices my scraper sees differ from what I see in the browser?

Almost always one of three things: your check-in/check-out dates are not pinned in the URL, your exit IP changed the currency or applied a regional discount, or the search-card price excludes taxes and fees that the browser displays only on the next step. Pin the dates, fix the currency, and label scraped prices clearly as nightly pre-tax rates.

Putting it all together

Web scraping Booking.com is a tractable problem once you stop treating it like a single page and start treating it like an ecosystem of endpoints. Selenium Wire gives you a forgiving on-ramp for search results and pagination, the internal /dml/graphql endpoint gives you the speed needed for continuous monitoring, and dedicated calls for hotel detail pages, nightly pricing, and reviews round out the dataset. Layer on sitemap discovery, query partitioning, and explicit currency control, and you have a scraper that scales beyond the toy single-query example.

The pieces most teams underestimate are the infrastructure ones: TLS and HTTP/2 fingerprinting, residential proxy quality, retry and backoff logic, and the patience to keep selectors and GraphQL payloads in sync as the site evolves.

If you'd rather not maintain that anti-block layer yourself, our team at WebScrapingAPI offers a Scraper API that returns raw HTML through rotating residential IPs with CAPTCHAs and TLS handling managed for you, plus a Browser API for the multi-step interactions Selenium handles in this guide. Drop either one in front of the code above and you can focus on the parsing and the data model, which is the part that actually differentiates your product.

About the Author
Raluca Penciuc, Full-Stack Developer @ WebScrapingAPI

Raluca Penciuc is a Full Stack Developer at WebScrapingAPI, building scrapers, improving evasions, and finding reliable ways to reduce detection across target websites.
