The Complete Guide to Web Scraping Real Estate Data

Robert Munceanu
Full-stack developer @ WebScrapingAPI
Summary

Web scraping is now an integral part of the real estate industry. Both agents and regular folks gain much from scrapers. Here's how you do it:

The property market is constantly expanding, and with it, real estate agents and businesses try to find new solutions to pinpoint what the future holds. While real estate, in general, doesn’t change drastically overnight, it’s affected by way too many factors for one person or even an organization to keep track of.

So, will the prices rise or will they go down? What neighborhoods are in high demand? Are there properties that just need a makeover to skyrocket their value? These are just a few of the questions real estate agents are frequently asking themselves.

To answer these questions, one needs loads of research data for comparison, and manually gathering that much information would be a wild goose chase. This is where web scraping comes in handy: it collects and structures data far faster than anyone could by hand.

As we all know by now, web scraping is the powerhouse of data extraction! So, if you want to know more about why anybody would want to scrape real estate data from the Internet and how to do it properly, let’s continue our journey together. We’ve prepared both a DIY solution and a step-by-step guide on how WebScrapingAPI can do it.

Why you should scrape real estate data

Scraping the web helps keep your real estate information precise and up to date, as long as the sources themselves are credible. With fresh data, one can better judge whether the market is about to take off or see in what price range a property will compete.

For businesses, web data is valuable because it leads to better decisions, better pricing, and a more significant profit margin. However, the catch is that each bit of information needs to be as fresh as possible, making web scraping the obvious solution.

The most commonly extracted types of real estate data are the following:

  • Property type
  • Sale price
  • Location
  • Size
  • Amenities
  • Monthly rental price
  • Parking spaces
  • Property agent

The information listed above can make or break a real estate agency. It makes a huge difference in communication, strategy, and efficiency, but the biggest advantage is how well agents get to know their properties and market. After that, it’s just a matter of finding the right client.

Let’s take a look at a few scenarios that illustrate the value of web scraping:

Real estate agencies

  • Decision-making: Taking risks is part of the job, but that doesn’t mean you must do it blindly. Researching before buying or selling is essential to the work, and more information means better deals.
  • Predicting the market: It is crucial to know when to buy and sell properties to get the most profitable outcome. Some types of properties soar in popularity while others lose their luster. Some areas flourish while others stagnate. Knowing what’s around the corner is the key to longevity as a business.

Regular folk

Web scraping isn’t all about helping businesses. Actually, part of what makes it so popular is how easy it is for a single person to use. Sure, you need some knowledge of computer science, but there are plenty of tutorials to help. Heck, this is one of them!

  • Buying and selling: You need to accurately deduce the property’s value before buying or selling it. It would be a shame to sell your childhood home and see it a week later on a real estate website at double the price, wouldn’t it?
  • Investing: If you like to invest in properties, either by buying at a low price to sell later for a profit or by simply renting the property out, you should know how fast you’ll break even and what returns to expect.

Ok, that’s enough on use cases. Let’s look at some code!

For starters, let’s assume we are searching for a new home in New York City. We want to buy a property with at least two bedrooms and, of course, a bathroom. So, we’ll start our search on Realtor, extract data from there and compare it to find the best deal.

There are various ways one can extract content from web pages. This article will explain two methods: one in which we create our web scraper from scratch and one in which we use an already existing tool.

First, let’s try to do it ourselves. The code will later prove helpful once we use a professional web scraping tool.  

Building a web scraper to extract real estate data

I chose to write in Python because of how popular it is in web scraping.  We have a general-purpose tutorial for extracting web data in Python that you should check out!

Inspect the website code

The data we need to extract can be found in the nested tags of the said webpage. Before we start scraping, we need to find it. To do this, simply right-click on the element and select “Inspect.”

A “Browser Inspector Box” window will pop up.

In this window, we will navigate to find the tags and classes under which our essential data can be found. It might seem a bit intimidating at first, but it only gets easier with experience!

Find the data you want to extract

We can see that everything we need to extract is within the <li> tag with the class ‘component_property-card’. If we go even deeper into the tag, we observe that the numbers of beds and bathrooms sit under elements whose ‘data-label’ attribute has the values ‘pc-meta-beds’ and ‘pc-meta-baths’, respectively. Knowing this, we can proceed with writing our code!
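To make that structure concrete, here is a minimal sketch that runs BeautifulSoup (installed in the next step) against a hand-written HTML fragment. The fragment is an assumption that only imitates the tags and attributes described above; it is not copied from Realtor, whose real markup may differ or change over time.

```python
from bs4 import BeautifulSoup

# Hand-written sample that imitates the structure described above
# (an assumption, not real Realtor markup).
SAMPLE_HTML = """
<ul>
  <li class="component_property-card">
    <span data-label="pc-price">$850,000</span>
    <ul>
      <li data-label="pc-meta-beds"><span data-label="meta-value">3</span> bed</li>
      <li data-label="pc-meta-baths"><span data-label="meta-value">2</span> bath</li>
    </ul>
  </li>
</ul>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
card = soup.find("li", attrs={"class": "component_property-card"})

# Drill down from the card to the nested value spans.
bed_value = card.find("li", attrs={"data-label": "pc-meta-beds"}).find(
    "span", attrs={"data-label": "meta-value"}
)
bath_value = card.find("li", attrs={"data-label": "pc-meta-baths"}).find(
    "span", attrs={"data-label": "meta-value"}
)

print(bed_value.text, bath_value.text)
```

The same two-step lookup (card first, then the labeled child) is what the full scraper later in this article relies on.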

Prepare the workspace

As mentioned before, we will use Python as our programming language, so you need to download and install it.

You can use whichever IDE you feel comfortable with, but I recommend using PyCharm.

After you’ve created a new project, make your work easier by using these libraries:

  • Selenium: Used for web testing and automating browser activities.
  • BeautifulSoup: Used for parsing HTML and XML documents.
  • Pandas: Used for data manipulation. The extracted data will be stored in a structured format.

Installing them within the project is quite simple. Just run this command in the project’s terminal:

python -m pip install selenium beautifulsoup4 pandas

Write the code

Let’s start by importing the libraries we’ve installed earlier:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

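To extract the data from the website, we have to load it by configuring the webdriver to use the Chrome browser. To do this, we need to specify the path where the chromedriver is located. Don’t forget to add the name of the executable at the end, not just its location! Note that Selenium 4 passes this path through a Service object; passing the path directly as the first argument only works in older releases.

from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('your/path/here/chromedriver'))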

Besides the number of beds and bathrooms, we can also extract the address, price, and, why not, the size of the property? The more information we have, the easier it will be to decide on a new home.

Declare the variables and set the URL of the website to be scraped.

prices = []
beds = []
baths = []
sizes = []
addresses = []
driver.get('https://www.realtor.com/realestateandhomes-search/New-York_NY')

We need to extract the data from the website, which is located in the nested tags as explained earlier. Find the tags with the previously mentioned attributes and store the data in the variables declared above. Remember that we only want to save properties with at least two beds and one bathroom!

content = driver.page_source
soup = BeautifulSoup(content, features='html.parser')

# Each listing is an <li> carrying the 'component_property-card' class
for element in soup.find_all('li', attrs={'class': 'component_property-card'}):
    price = element.find('span', attrs={'data-label': 'pc-price'})
    bed = element.find('li', attrs={'data-label': 'pc-meta-beds'})
    bath = element.find('li', attrs={'data-label': 'pc-meta-baths'})
    size = element.find('li', attrs={'data-label': 'pc-meta-sqft'})
    address = element.find('div', attrs={'data-label': 'pc-address'})

    # Skip cards that do not show both beds and baths
    if bed and bath:
        nr_beds = bed.find('span', attrs={'data-label': 'meta-value'})
        nr_baths = bath.find('span', attrs={'data-label': 'meta-value'})

        # Keep only listings with at least two beds and one bathroom
        if nr_beds and float(nr_beds.text) >= 2 and nr_baths and float(nr_baths.text) >= 1:
            beds.append(nr_beds.text)
            baths.append(nr_baths.text)

            # Use a placeholder so all five lists stay the same length
            if price and price.text:
                prices.append(price.text)
            else:
                prices.append('No display data')

            if size and size.text:
                sizes.append(size.text)
            else:
                sizes.append('No display data')

            if address and address.text:
                addresses.append(address.text)
            else:
                addresses.append('No display data')

Great! We have all the information we need, but where should we store it? This is where the pandas library comes in handy: it structures the data and writes it to a CSV file for us to use in the future.

df = pd.DataFrame({'Address': addresses, 'Price': prices, 'Beds': beds, 'Baths': baths, 'Sizes': sizes})
df.to_csv('listings.csv', index=False, encoding='utf-8')

If we run the code, a file named ‘listings.csv’ will be created, and in it, our precious data!
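Since the goal is to compare listings, the scraped strings usually need to be turned into numbers first. The sketch below is one possible way to do that with the standard library only. The sample rows, the `to_number` helper, and the price-per-square-foot ranking are illustrative assumptions, not part of the scraper above.

```python
import re


def to_number(text):
    """Return the first number found in a string such as '$850,000', or None."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if match is None:
        return None
    return float(match.group().replace(",", ""))


# Made-up rows in the same shape as listings.csv (assumed values).
rows = [
    {"Address": "Sample St 1", "Price": "$850,000", "Sizes": "1,000 sqft"},
    {"Address": "Sample St 2", "Price": "$600,000", "Sizes": "1,200 sqft"},
    {"Address": "Sample St 3", "Price": "No display data", "Sizes": "900 sqft"},
]

ranked = []
for row in rows:
    price = to_number(row["Price"])
    size = to_number(row["Sizes"])
    # Skip listings where either value is missing or the size is zero.
    if price is None or not size:
        continue
    ranked.append((row["Address"], price / size))

# Cheapest price per square foot first.
ranked.sort(key=lambda item: item[1])
print(ranked)
```

In practice the rows would come from ‘listings.csv’ rather than a hardcoded list, and the actual text format of the scraped price and size fields may need a different pattern.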

We did it! We created our own web scraping tool! Now let’s see which steps to follow and which lines of code to change in order to use a ready-made scraping tool instead.

Using a web scraping API

For this scenario, we will use WebScrapingAPI, of course.

Create a free WebScrapingAPI account

To make use of WebScrapingAPI, you need to create an account. Don’t worry, the first 5000 API calls are free, and you don’t need to share any personal data, like credit card info. After you successfully create your account and validate your email, we can move to the next step.

API Key

To use WebScrapingAPI, you will need to authenticate via the private API Key, which you can find on your account dashboard. Note that you mustn’t share this key with anyone, and if you suspect that it has been compromised, you can always reset the key by pressing the “Reset API Key” button.
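One common way to keep the key out of your source code is to read it from an environment variable. This is only a sketch: the variable name `WSA_API_KEY` and the `load_api_key` helper are hypothetical names chosen for illustration, not something WebScrapingAPI defines.

```python
import os


def load_api_key(env_var="WSA_API_KEY"):
    """Read the API key from the environment instead of hardcoding it."""
    # WSA_API_KEY is a hypothetical variable name used for this example.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key
```

With the variable set in your shell, the key can be loaded at runtime with `load_api_key()` rather than pasted into the script as a literal string.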

Modify the code

Perfect! Now that you have the API Key, let’s make the necessary changes.

We won’t be using a webdriver anymore. Instead, the ‘requests’ library will send the request to WebScrapingAPI and retrieve the website’s HTML code as a response.

import requests
from bs4 import BeautifulSoup
import pandas as pd

Next, we have to prepare a few parameters for the request: the url of the website we wish to extract data from (realtor) and our API Key.

url = "https://api.webscrapingapi.com/v1"
params = {
 "api_key": "XXXXXXX",
 "url": "https://www.realtor.com/realestateandhomes-search/New-York_NY"
}
response = requests.request("GET", url, params=params)

Don’t forget to change which content beautifulsoup is parsing. Instead of the source from the chromedriver, we will use the response received from the API.

content = response.text
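Before parsing, it can be worth checking that the request actually succeeded. The sketch below wraps the call in a hypothetical `fetch_html` helper that adds a timeout and raises on HTTP error codes, while `build_request_url` only shows, without sending anything, how `requests` encodes the target URL into the query string. Both helper names and the 30-second timeout are assumptions.

```python
import requests

API_ENDPOINT = "https://api.webscrapingapi.com/v1"


def build_request_url(api_key, target_url):
    """Prepare (but do not send) the GET request and return its final URL."""
    request = requests.Request(
        "GET", API_ENDPOINT, params={"api_key": api_key, "url": target_url}
    )
    return request.prepare().url


def fetch_html(api_key, target_url, timeout=30):
    """Send the request and return the HTML, raising on HTTP errors."""
    response = requests.get(
        API_ENDPOINT,
        params={"api_key": api_key, "url": target_url},
        timeout=timeout,
    )
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
    return response.text


# Only builds the URL; no network traffic happens here.
print(build_request_url(
    "XXXXXXX", "https://www.realtor.com/realestateandhomes-search/New-York_NY"
))
```

Failing early on a bad status code avoids silently parsing an error page and ending up with an empty CSV file.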

From this point on, you can use the same code from the previous scenario. The data will still be stored in a CSV file named ‘listings.csv’.

All done!

And that’s pretty much it; you can run the code. WebScrapingAPI will do the job, and you’ll get the necessary data to find the perfect home. But you might ask yourself: “What is the difference between using WebScrapingAPI and the scraper we built ourselves?”. Well, allow me to explain.

DIY vs. Pre-made

One of the most significant advantages of using WebScrapingAPI is its proxies. The service has a huge rotating proxy pool that ensures its users’ anonymity while surfing the web.

This feature is also helpful when someone wishes to scrape a website en masse. Making many requests to a website in a short amount of time can get your IP blocked, because the site may treat the traffic as an attack or a bot with bad intentions.

Using a rotating proxy pool will make the website think that multiple users are interacting with it, so you remain undetected and can scrape all day long.
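To illustrate the idea of rotation (not how WebScrapingAPI implements it internally), here is a minimal round-robin sketch. The proxy addresses are placeholders from an IP range reserved for documentation, and `next_proxies` is a hypothetical helper. The returned dictionary has the shape the `requests` library accepts in its `proxies` argument.

```python
from itertools import cycle

# Placeholder addresses (203.0.113.0/24 is reserved for documentation).
PROXY_POOL = [
    "http://203.0.113.10:8080",
    "http://203.0.113.11:8080",
    "http://203.0.113.12:8080",
]

_rotation = cycle(PROXY_POOL)


def next_proxies():
    """Return the next proxy in round-robin order as a requests-style mapping."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


# Each call hands out a different exit address until the pool wraps around.
for _ in range(4):
    print(next_proxies()["http"])
```

A real pool also has to deal with dead or banned proxies, which is part of what a managed service takes off your hands.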

Many more obstacles can come your way when scraping the web, such as CAPTCHAs or browser fingerprinting. As you might expect, we built WebScrapingAPI to side-step all those hurdles and make data extractions as easy as possible for you. If you want to know more about this topic, check out our article on the most common problems web scrapers encounter.

One tool, many use cases

We can all agree that scraping the web is an excellent solution for the real estate industry, but you can use it for other purposes as well. Here are just a few examples: monitoring your competition, comparing product prices, and training machine learning algorithms.

I could go on, but that’s already a whole new subject. I won’t drag this article on forever, so I recommend you check out these seven use cases for web scraping tools.

Creating a web scraping tool in your free time sounds pretty neat, but there are many things to consider, things that will burn a considerable amount of development time. Here you can find an in-depth discussion about DIY vs. pre-made web scraping tools.

If we are talking about scraping a few web pages, building the tool yourself can be a fast solution. Still, a professional job needs a professional tool, ideally an API such as WebScrapingAPI. Did I mention the free trial?

Start scraping data with WebScrapingAPI

Get started with 5,000 free API calls.
No credit card required.