Before we get into the nuts and bolts of web scraping, we should go over a few key concepts.
Most of the written content you’ll encounter on a website is stored in a text-based markup language, most commonly HTML. To make processing and rendering easier for all browsers and devices, HTML follows a shared standard: content is wrapped in nested tags (headings, paragraphs, links, and so on) that every browser interprets the same way.
When humans visit a web page, they see the rendered result of that HTML code. Bots, such as Google’s indexing crawlers, read the code itself. Think of it as the same information, but in different forms.
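To make those two forms concrete, here is a minimal Python sketch using only the standard library. The sample page, the `TextExtractor` class, and the `visible_text` helper are made up for illustration; they are assumptions, not part of any particular scraping tool.

```python
from html.parser import HTMLParser

# Hypothetical sample page, written for this example only.
SAMPLE_HTML = """<html>
  <head><title>Example Shop</title></head>
  <body>
    <h1>Blue Widget</h1>
    <p class="price">$19.99</p>
  </body>
</html>"""


class TextExtractor(HTMLParser):
    """Collects the text between tags and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Skip the whitespace-only runs that sit between tags.
        text = data.strip()
        if text:
            self.chunks.append(text)


def visible_text(html):
    """Return the non-empty text fragments found in an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


# The bot starts from SAMPLE_HTML (the code); a person only reads these strings.
print(visible_text(SAMPLE_HTML))
```

The string in `SAMPLE_HTML` is roughly what a bot receives, while the list that gets printed is closer to what a person reads on the rendered page.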
If a person wants to copy all the information on a webpage, they would manually select all the content (most likely grabbing useless filler, too), hit “copy,” and then paste it to some local file. It doesn’t seem so bad, but imagine doing that two hundred times, several times a week. It’s going to become an unbelievable chore, and sorting all that data will be equally nightmarish.
Some websites make it hard for users to select content and copy it. While these sites aren’t prevalent, they can become the cherry on top of the sad sundae.
A web scraping tool is a bot that grabs HTML code from web pages. There are two significant differences compared to manual copying: the bot does the job for you, and it does it way faster. Harvesting the HTML from a single page can be nearly instantaneous. The limiting factor is usually your internet speed (along with how quickly the site responds), which would slow you down while manually copying too.
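As a rough illustration of "grabbing the HTML," the sketch below downloads a page with Python’s built-in `urllib`. The function name `fetch_html` and the 10-second timeout are assumptions chosen for this example; real scraping tools typically add retries, custom headers, and rate limiting on top of something like this.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text.

    A minimal sketch: no retries, no custom headers, no rate limiting.
    """
    with urllib.request.urlopen(url, timeout=timeout) as response:
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html("https://example.com")` would return the page’s markup as a single string, ready to be parsed or saved.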
Where scrapers genuinely shine, though, is when extracting data from multiple sources. For a powerful web scraper, there’s little difference between one webpage and a thousand. As long as you give it a list of URLs for the pages you want scraped, the bot will set to work collecting data.
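The "list of URLs in, data out" idea can be sketched as a simple loop. Everything named here (`scrape_all`, `fetch_html`, the one-second pause between requests) is an assumption made for illustration; a production scraper would likely add concurrency, retries, and respect for each site’s crawling rules.

```python
import time
import urllib.request


def fetch_html(url, timeout=10):
    """Download one page and return its HTML as text (minimal sketch)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def scrape_all(urls, fetch=fetch_html, delay_seconds=1.0):
    """Fetch every URL in turn and return a dict mapping URL -> HTML.

    Pages that fail to download are recorded as None so that one bad
    link does not stop the whole run.
    """
    results = {}
    for index, url in enumerate(urls):
        # Pause between requests so the target site is not hammered.
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        try:
            results[url] = fetch(url)
        except (OSError, ValueError) as error:
            # OSError covers network and HTTP failures; ValueError covers malformed URLs.
            results[url] = None
            print(f"Skipping {url}: {error}")
    return results
```

The loop body is the same whether the list holds one URL or a thousand; only the running time changes. Passing the download function in as `fetch` also makes it easy to swap in a different client later.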