Let’s start by importing the libraries we’ve installed earlier:
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
To extract the data from the website, we first have to load it by configuring the webdriver to use the Chrome browser. To do this, we specify the path where the chromedriver executable is located. Don’t forget to include the name of the executable at the end - not just its directory! Note that in Selenium 4 the path is passed through a Service object; the older form that took the path as a positional argument is deprecated.
from selenium.webdriver.chrome.service import Service

driver = webdriver.Chrome(service=Service('your/path/here/chromedriver'))
Besides the number of beds and bathrooms, we can also extract the address, price, and, why not, the size of the property. The more information we have, the easier it will be to decide on a new home.
Declare the variables that will hold the data and set the URL of the website to be scraped.
prices = []
beds = []
baths = []
sizes = []
addresses = []
driver.get('https://www.realtor.com/realestateandhomes-search/New-York_NY')
We need to extract the data from the website, which is located in nested tags, as explained earlier. Find the tags with the previously mentioned attributes and store the data in the variables declared above. Remember that we only want to save properties with at least two beds and one bathroom!
content = driver.page_source
soup = BeautifulSoup(content, features='html.parser')
content = driver.page_source
soup = BeautifulSoup(content, features='html.parser')
for element in soup.find_all('li', attrs={'class': 'component_property-card'}):
    price = element.find('span', attrs={'data-label': 'pc-price'})
    bed = element.find('li', attrs={'data-label': 'pc-meta-beds'})
    bath = element.find('li', attrs={'data-label': 'pc-meta-baths'})
    size = element.find('li', attrs={'data-label': 'pc-meta-sqft'})
    address = element.find('div', attrs={'data-label': 'pc-address'})
    if bed and bath:
        nr_beds = bed.find('span', attrs={'data-label': 'meta-value'})
        nr_baths = bath.find('span', attrs={'data-label': 'meta-value'})
        if nr_beds and float(nr_beds.text) >= 2 and nr_baths and float(nr_baths.text) >= 1:
            beds.append(nr_beds.text)
            baths.append(nr_baths.text)
            # Append to every list on each accepted listing so all columns stay the same length.
            if price and price.text:
                prices.append(price.text)
            else:
                prices.append('No display data')
            if size and size.text:
                sizes.append(size.text)
            else:
                sizes.append('No display data')
            if address and address.text:
                addresses.append(address.text)
            else:
                addresses.append('No display data')
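A quick way to sanity-check the parsing logic without launching a browser is to run the same selectors on a static snippet. The markup below is a hypothetical, simplified version of a listing card using the same attributes; the real page structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified listing card mimicking the attributes used above.
html = """
<li class="component_property-card">
  <span data-label="pc-price">$350,000</span>
  <ul>
    <li data-label="pc-meta-beds"><span data-label="meta-value">3</span> bed</li>
    <li data-label="pc-meta-baths"><span data-label="meta-value">2</span> bath</li>
    <li data-label="pc-meta-sqft"><span data-label="meta-value">1,200</span> sqft</li>
  </ul>
  <div data-label="pc-address">123 Example St, New York, NY</div>
</li>
"""

soup = BeautifulSoup(html, 'html.parser')
card = soup.find('li', attrs={'class': 'component_property-card'})
price = card.find('span', attrs={'data-label': 'pc-price'})
bed = card.find('li', attrs={'data-label': 'pc-meta-beds'})

print(price.text)  # $350,000
print(bed.find('span', attrs={'data-label': 'meta-value'}).text)  # 3
```

If these selectors work on the snippet but return nothing on the live page, the site has likely changed its markup and the attribute values need updating.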
Great! We have all the information we need, but where should we store it? This is where the pandas library comes in handy, helping us structure the data into a CSV file for future use.
df = pd.DataFrame({'Address': addresses, 'Price': prices, 'Beds': beds, 'Baths': baths, 'Sizes': sizes})
df.to_csv('listings.csv', index=False, encoding='utf-8')
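Since the scraped values are stored as strings, it helps to convert them to numbers before sorting or filtering. The sketch below works on a small sample frame shaped like the scraper's output (the values are hypothetical); the same two lines apply to the frame you get back from `pd.read_csv('listings.csv')`. It assumes prices look like '$350,000' and sizes like '1,200 sqft' - adjust the pattern if the site formats them differently:

```python
import pandas as pd

# Sample rows shaped like the scraper's output (hypothetical values).
df = pd.DataFrame({
    'Address': ['123 Example St', '456 Sample Ave'],
    'Price': ['$350,000', 'No display data'],
    'Sizes': ['1,200 sqft', '950 sqft'],
})

# Strip '$', ',' and unit text; placeholders like 'No display data' become NaN.
for col in ('Price', 'Sizes'):
    df[col] = pd.to_numeric(df[col].str.replace(r'[^0-9.]', '', regex=True),
                            errors='coerce')

print(df)
```

Using `errors='coerce'` means the 'No display data' placeholders turn into NaN instead of raising an error, so they are easy to drop or impute later.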
If we run the code, a file named ‘listings.csv’ will be created, and in it, our precious data!
We did it! We created our own web scraping tool! Now let’s see which steps to follow and which lines of code to modify to reuse this scraper on other websites.