In this example, I fetched the URL “https://www.urbandictionary.com/define.php?term=YOLO” and saved the HTML to the file test_output.html.
(scrapy_env) mihai@DESKTOP-0RN92KH:~/myproject$ scrapy shell --nolog
[s] Available Scrapy objects:
[s] scrapy scrapy module (contains scrapy.Request, scrapy.Selector, etc)
[s] crawler <scrapy.crawler.Crawler object at 0x7f1eef80f6a0>
[s] item {}
[s] settings <scrapy.settings.Settings object at 0x7f1eef80f4c0>
[s] Useful shortcuts:
[s] fetch(url[, redirect=True]) Fetch URL and update local objects (by default, redirects are followed)
[s] fetch(req) Fetch a scrapy.Request and update local objects
[s] shelp() Shell help (print this help)
[s] view(response) View response in a browser
>>> response  # response is empty
>>> fetch('https://www.urbandictionary.com/define.php?term=YOLO')
>>> response
<200 https://www.urbandictionary.com/define.php?term=Yolo>
>>> with open('test_output.html', 'w') as f:
...     f.write(response.text)
...
118260
Now let’s inspect test_output.html and identify the selectors we would need in order to extract the data for our Urban Dictionary scraper.
We can observe that:
- Every word definition container has the “definition” class.
- The meaning of the word is found inside the div with the class “meaning”.
- Examples for the word are found inside the div with the class “example”.
- Information about the post author and date is found inside the div with the class “contributor”.
Now let’s test some selectors in the Scrapy Shell:
To get a reference to every definition container, we can use either a CSS or an XPath selector:
You can learn more about XPath selectors here: https://www.webscrapingapi.com/the-ultimate-xpath-cheat-sheet
definitions = response.css('div.definition')
definitions = response.xpath('//div[contains(@class,"definition")]')
We should extract the meaning, example and post information from every definition container. Let’s test some selectors with the first container:
>>> first_def = definitions[0]
>>> meaning = first_def.css('div.meaning').xpath(".//text()").extract()
>>> meaning
['Yolo ', 'means', ", '", 'You Only Live Once', "'."]
>>> meaning = "".join(meaning)
>>> meaning
"Yolo means, 'You Only Live Once'."
>>> example = first_def.css('div.example').xpath(".//text()").extract()
>>> example = "".join(example)
>>> example
'"Put your seatbelt on." Jessica said.\r"HAH, YOLO!" Replies Anna.\r(They then proceed to have a car crash. Long story short...Wear a seatbelt.)'
>>> post_data = first_def.css('div.contributor').xpath(".//text()").extract()
>>> post_data
['by ', 'Soy ugly', ' April 24, 2019']
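The extracted values still need some cleanup: the example text contains bare carriage returns (\r), and post_data mixes the “by ” prefix, the author and the date. Here is one possible way to tidy them up with the standard library only. The helper names normalize_newlines and parse_contributor are made up for this sketch, and it assumes that post_data always has exactly the three parts shown above and that the date uses an English month name.

```python
from datetime import datetime

# Values copied from the shell session above.
example = ('"Put your seatbelt on." Jessica said.\r"HAH, YOLO!" Replies Anna.\r'
           '(They then proceed to have a car crash. Long story short...Wear a seatbelt.)')
post_data = ['by ', 'Soy ugly', ' April 24, 2019']


def normalize_newlines(text):
    """Turn CRLF pairs and bare carriage returns into line feeds."""
    return text.replace('\r\n', '\n').replace('\r', '\n')


def parse_contributor(parts):
    """Split ['by ', author, date] into (author, date).

    Assumption: exactly this three-part layout and an English month name.
    """
    _, author, date_text = parts
    posted = datetime.strptime(date_text.strip(), '%B %d, %Y').date()
    return author.strip(), posted


clean_example = normalize_newlines(example)
author, posted = parse_contributor(post_data)

print(clean_example.splitlines()[1])  # "HAH, YOLO!" Replies Anna.
print(author, posted.isoformat())     # Soy ugly 2019-04-24
```

A real scraper would probably want to guard against contributor blocks that do not follow this layout, for example by checking the length of the list before unpacking it.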
By using the Scrapy shell, we were able to quickly find a general selector that suits our needs:
definition.css('div.<meaning|example|contributor>').xpath(".//text()").extract()
# returns a list with all the text nodes found inside the <meaning|example|contributor> div
# e.g. ['Yolo ', 'means', ", '", 'You Only Live Once', "'."]
To learn more about Scrapy selectors, check out the documentation: https://docs.scrapy.org/en/latest/topics/selectors.html