A web crawler, or spider, is a type of bot that walks through a site and extracts data from user-defined fields via CSS selectors. A crawler can also extract all the links on a page and follow specific ones (such as pagination links) to crawl more data.
It’s time to set up the base for the crawler: create the ebay_scraper.ex file in the lib/elixir_spider folder and paste the following code into it:
# lib/elixir_spider/ebay_scraper.ex
defmodule EbayScraper do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: ""

  @impl Crawly.Spider
  def init() do
  end

  @impl Crawly.Spider
  def parse_item(response) do
  end
end
This is just the skeleton of the file; it won’t run or return any results yet. Let’s go over each function and then fill them in one by one.
The base_url() function is called once and returns the base URL of the target website the crawler will scrape. It’s also used to filter out external links and prevent the crawler from following them; you don’t want to scrape the entire Internet.
@impl Crawly.Spider
def base_url(), do: "https://www.ebay.com/"
The init() function is called once and is used to initialise the default state for the crawler; in this case, the function returns the start_urls list from which scraping will begin.
Replace your blank function with this one:
@impl Crawly.Spider
def init() do
  [start_urls: ["https://www.ebay.com/sch/i.html?_nkw=ps5"]]
end
All the data extraction magic happens in parse_item(). This function is called for each scraped URL. Inside it, we use the Floki HTML parser to extract the fields we need: title, url and price.
The function will look like this:
@impl Crawly.Spider
def parse_item(response) do
  # Parse the response body into a Floki document
  {:ok, document} = Floki.parse_document(response.body)

  # Create items (for pages where items exist)
  items =
    document
    |> Floki.find(".srp-results .s-item")
    |> Enum.map(fn x ->
      %{
        title: Floki.find(x, ".s-item__title span") |> Floki.text(),
        price: Floki.find(x, ".s-item__price") |> Floki.text(),
        url: Floki.find(x, ".s-item__link") |> Floki.attribute("href") |> Floki.text()
      }
    end)

  %{items: items}
end
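To see what those Floki calls actually return, here is a quick sketch you can paste into iex -S mix (the HTML fragment below is made up for illustration; it only mimics the structure of one eBay search result):

```elixir
# Minimal, made-up HTML fragment mimicking a single search result.
html = ~s(<div class="s-item">
  <a class="s-item__link" href="https://www.ebay.com/itm/123">
    <span class="s-item__title"><span>PS5 Console</span></span>
  </a>
  <span class="s-item__price">$499.99</span>
</div>)

{:ok, doc} = Floki.parse_document(html)

# Floki.find/2 returns a list of matching nodes; Floki.text/1
# concatenates the text inside them.
Floki.find(doc, ".s-item__title span") |> Floki.text()
# => "PS5 Console"

# Floki.attribute/2 returns a list of attribute values; piping it
# into Floki.text/1 collapses that list into a single string.
Floki.find(doc, ".s-item__link") |> Floki.attribute("href")
# => ["https://www.ebay.com/itm/123"]
```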
As you might have noticed, we’re using the classes we found in the Getting Started - Inspecting the target section to extract the data we need from the DOM elements.
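With all three callbacks filled in, you can try the spider from an iex -S mix session. Crawly.Engine.start_spider/1 is the standard entry point; where the scraped items end up depends on the pipelines configured for your project:

```elixir
# In an iex -S mix session inside the project:
# Crawly schedules the start_urls, calls parse_item/1 for each
# fetched page, and hands the returned items to your pipelines.
Crawly.Engine.start_spider(EbayScraper)

# Stop the crawl at any point with:
Crawly.Engine.stop_spider(EbayScraper)
```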