Scrapy is a powerful framework for creating web crawlers in Python. It provides a built-in way to follow links and extract information from web pages. You will need to create a new Scrapy project and a spider to define the behavior of your crawler.
Before starting to crawl a website like Amazon, it is important to check the website's robots.txt file to see which URL paths are allowed. Scrapy reads this file automatically and respects it when the ROBOTSTXT_OBEY setting is set to `True`, which is the default for projects created with the `startproject` command.
To create a new Scrapy project you need to run the following command:
$ scrapy startproject amazon_crawler
This command will generate a project with the following structure:
amazon_crawler/
├── scrapy.cfg
└── amazon_crawler
    ├── __init__.py
    ├── items.py
    ├── middlewares.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        └── __init__.py
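The robots.txt behavior mentioned earlier is controlled from `settings.py`. A minimal excerpt of the generated file looks roughly like this (exact contents may vary slightly between Scrapy versions):

BOT_NAME = 'amazon_crawler'

SPIDER_MODULES = ['amazon_crawler.spiders']
NEWSPIDER_MODULE = 'amazon_crawler.spiders'

# Obey robots.txt rules
ROBOTSTXT_OBEY = True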
To create a spider, use the `genspider` command from Scrapy's CLI. The command has the following syntax:
$ scrapy genspider [options] <name> <domain>
To generate a spider for this crawler we can run:
$ cd amazon_crawler
$ scrapy genspider baby_products amazon.com
This creates a file named `baby_products.py` inside the `spiders` folder with the following generated code:
import scrapy


class BabyProductsSpider(scrapy.Spider):
    name = 'baby_products'
    allowed_domains = ['amazon.com']
    start_urls = ['http://amazon.com/']

    def parse(self, response):
        pass
Scrapy also offers a variety of pre-built spider classes, such as CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider. The CrawlSpider class, which is built on top of the base Spider class, includes an extra "rules" attribute to define how to navigate through a website. Each rule utilizes a LinkExtractor to determine which links should be extracted from each page.
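As a side note, `genspider` can scaffold these pre-built classes for you: `scrapy genspider -l` lists the available templates (basic, crawl, csvfeed, xmlfeed), and we could have generated a CrawlSpider skeleton directly by passing the crawl template:

$ scrapy genspider -t crawl baby_products amazon.com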
For our use case we should inherit our Spider class from CrawlSpider. We will also need to make a LinkExtractor rule that tells the crawler to extract links only from Amazon’s pagination. Remember that our goal was to collect data from all baby products from Amazon, so we don’t actually want to follow all the links we find on the page.
Then we need to create another two methods in our class, `parse_item` and `parse_product`. `parse_item` will be given as a callback function to our LinkExtractor rule and it will be called with each link extracted. `parse_product` will parse each individual product listing found on the results page.
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from bs4 import BeautifulSoup


class BabyProductsSpider(CrawlSpider):
    name = 'baby_products'
    allowed_domains = ['amazon.com']
    start_urls = ['https://amazon.com/s?k=baby+products']

    # Only follow links found in the pagination strip of the search
    # results, so the crawler walks through the result pages instead
    # of every link on the page.
    rules = (
        Rule(
            LinkExtractor(restrict_css='.s-pagination-strip'),
            callback='parse_item',
            follow=True
        ),
    )

    def parse_item(self, response):
        # Parse the results page with BeautifulSoup and collect one
        # entry per product card.
        soup = BeautifulSoup(response.text, 'html.parser')
        products = soup.select('div[data-component-type="s-search-result"]')

        data = []
        for product in products:
            parsed_product = self.parse_product(product)
            if parsed_product != 'error':
                data.append(parsed_product)

        return {
            'url': response.url,
            'data': data
        }

    def parse_product(self, product):
        try:
            link = product.select_one('a.a-text-normal')
            price = product.select_one('span.a-price > span.a-offscreen').text
            return {
                'product_url': link['href'],
                'name': link.text,
                'price': price
            }
        except Exception:
            # Skip products that are missing a link or a price.
            return 'error'
To start the crawler you can run:
$ scrapy crawl baby_products
You will see lots of logs in the console (you can specify a log file with `--logfile [log_file_name]`).
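Because `parse_item` returns a dict for every results page, you can also let Scrapy's feed exports write the scraped items to a file, for example as JSON (the file name here is just an example):

$ scrapy crawl baby_products -o products.json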
I used Amazon Search as an example to demonstrate the basics of creating a web crawler in Python. However, the crawler does not find many links to follow, and the extracted data is not tailored to a specific use case. If you are looking to extract specific data from Amazon Search, you can consider using our Amazon Product Data API. We created custom parsers for Amazon Search, Product, and Category pages, and it returns data in JSON format, ready to be used in your application.