Effortlessly Download Web Pages and Files with Python and wget
Suciu Dan on Apr 21 2023
Are you tired of manually downloading web pages and files? In this article, I will show you how to use Python and the command-line tool wget to automate the process of downloading from the web, while learning about wget's capabilities and limitations, as well as alternative tools for web scraping.
Python is a powerful and popular programming language used in various fields, such as automation, data science, and application development. It's easy to learn and has a large library of modules and frameworks that enhance its capabilities.
Wget is a command-line tool that allows you to download files from the internet. It's widely available on most Unix-like systems, including Linux and macOS. Wget is a versatile tool that can be used to download single files, multiple files, and even entire directories.
Here are some of the main features of wget:
- Download files from the internet: Wget can fetch web pages, images, and any other type of file served over the web.
- Customizable download options: Wget allows you to specify various options to customize your downloads, such as the directory to save the file, the file name, and whether to overwrite existing files.
- Resume interrupted downloads: If a download is interrupted, wget can resume it from where it left off, saving time and bandwidth.
- Recursive download: wget can be used to download an entire website or directory by following links recursively.
- Support for HTTP, HTTPS, and FTP: Wget can handle various types of internet protocols, including HTTP, HTTPS, and FTP, making it a versatile tool for downloading from different types of servers.
- Broad availability: Wget ships with, or is easily installed on, most Unix-like systems, including Linux and macOS, making it easy to use on a variety of platforms.
Why use wget with Python?
Python and wget can be combined to automate the process of downloading web pages and files, saving time and effort. Wget can be customized through Python, and web scraping or file downloading tasks can be integrated into existing Python scripts.
There are several reasons why one might choose to use wget with Python:
- Automation: By using wget with Python, you can automate the process of downloading files from the internet. This can save time and effort, especially if you need to download a large number of files or websites.
- Customization: Wget allows you to specify various options to customize your downloads, such as the directory to save the file, the file name, and whether to overwrite existing files. By using wget with Python, you can programmatically set these options and customize your downloads according to your needs.
- Ease of use: Python is known for its simplicity and readability, making it an easy language to learn and use. By using wget with Python, you can leverage the power of Python to make web scraping and file downloading tasks easier.
- Scalability: Python is a scalable language that is capable of handling large amounts of data. By using wget with Python, you can scale up your web scraping or file downloading tasks to handle larger datasets.
Now that we've discussed the individual and combined benefits of Python and wget, let's move on to the code writing part.
Make sure wget is installed on your computer:
- If you are a Linux user, you most likely already have it installed.
- If you are a Windows user, you can download the binary from this page. Make sure you add the binary's path to the PATH environment variable. Another option is to use WSL (Windows Subsystem for Linux). Read more about it here.
- If you are a Mac user, install wget using Homebrew: `brew install wget`.
Don’t forget to check wget’s extensive documentation here.
Get the latest version of Python from the official website and follow the install instructions for your platform. After installation, you can check the version with the `python3 --version` command.
Running System Commands in Python
The `subprocess` module in Python allows you to run system commands and capture their output. It's a powerful and flexible way to interact with the operating system from within your Python scripts.
To use the `subprocess` module, you'll first need to import it into your Python script. Then, you can use the `subprocess.run()` function to run a system command and capture its output.
The `run()` function takes the command and its arguments as a list of strings and returns a `CompletedProcess` object, which contains the command's exit code and, if you request it with `capture_output=True`, its stdout and stderr.
Here's an example of using the `subprocess.run()` function to run the ls command, which lists the files in a directory:
import subprocess

result = subprocess.run(['ls', '-l'])
Run this code with the python3 main.py command. The result should look like this.
-rw-r--r-- 1 dan dan 80 Jan 6 18:58 main.py
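To work with the command's output from Python instead of letting it print straight to the terminal, you can ask `run()` to capture it. A minimal sketch:

```python
import subprocess

# capture_output=True collects stdout/stderr; text=True decodes them to str
result = subprocess.run(['ls', '-l'], capture_output=True, text=True)

print(result.returncode)  # 0 means the command succeeded
print(result.stdout)      # the directory listing as a string
```

This is the pattern you will want whenever a later step of your script needs to inspect what the command printed.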
Downloading with wget
Download a file
Let’s start by downloading the WebScrapingAPI logo. Replace the arguments list with `wget` and the logo URL. The command will look like this:
result = subprocess.run(['wget', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
The script will return the following output:
--2023-01-06 19:06:32-- https://www.webscrapingapi.com/images/logo/logo-white.svg
Resolving www.webscrapingapi.com (www.webscrapingapi.com)... 126.96.36.199, 188.8.131.52
Connecting to www.webscrapingapi.com (www.webscrapingapi.com)|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5391 (5.3K) [image/svg+xml]
Saving to: 'logo-white.svg'
logo-white.svg 100%[====================================================================================================================================================================>] 5.26K --.-KB/s in 0.06s
2023-01-06 19:06:33 (91.6 KB/s) - 'logo-white.svg' saved [5391/5391]
From the output, we can see how `wget` resolves the domain name, connects to the domain, receives a `200 OK` response code, finds the file length (5.3k) and starts saving the file locally under the name `logo-white.svg`.
You can check the project folder for the `logo-white.svg` file.
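`wget` signals failure through its exit code, so it is worth checking it (or passing `check=True`) before relying on the downloaded file. A sketch, assuming wget is on the PATH and the URL is reachable:

```python
import subprocess

try:
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(
        ['wget', '-q', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'],
        check=True,
    )
    print('download succeeded')
except subprocess.CalledProcessError as err:
    print('wget failed with exit code', err.returncode)
except FileNotFoundError:
    print('wget is not installed or not on PATH')
```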
Download in a directory
You can download the file into a custom directory by using the `-P` flag. Let’s update the script and run it to see the results:
result = subprocess.run(['wget', '-P', 'images', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
The output is almost the same, with the only difference of the file being stored in the `./images/` directory.
Setting the downloaded file name
Using the `-O` flag, you can specify a new name for the downloaded file. Let’s give it a try:
result = subprocess.run(['wget', '-O', 'named-logo.svg', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
Check the project folder for the `named-logo.svg` file.
Download file if remote version is newer
You can use the `-N` flag to download the remote file only if the version is newer than the local file. The command will look like this:
result = subprocess.run(['wget', '-N', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
Resume interrupted downloads
If a download is interrupted, `wget` can resume it from where it left off, saving time and bandwidth. To do this, you'll need to use the `-c` flag, which tells `wget` to continue an interrupted download.
The command will look like this:
result = subprocess.run(['wget', '-c', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
If the file download was already completed, the output will contain the following message:
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
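Resuming pairs naturally with a retry loop. The sketch below wraps the `-c` call in a function; the `runner` parameter is a hypothetical injection point added here so the retry logic can be exercised without a network connection:

```python
import subprocess
import time

def download_with_retries(url, attempts=3, runner=subprocess.run):
    """Call wget with -c up to `attempts` times, resuming after each failure."""
    for attempt in range(attempts):
        result = runner(['wget', '-c', url])
        if result.returncode == 0:
            return True
        time.sleep(1)  # brief pause before resuming
    return False
```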
Recursive download
Wget can be used to download an entire website or directory by following links recursively. To do this, you'll need to use the `-r` and `-l` flags: the `-r` flag tells the tool to follow links recursively and the `-l` flag specifies the maximum depth of the recursion.
result = subprocess.run(['wget', '-r', '-l', '2', 'https://www.webscrapingapi.com'])
This command will download the website at "https://www.webscrapingapi.com" and follow links to other pages on the same website to a maximum depth of 2.
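The flags covered so far can be collected into a small helper that assembles the wget argument list. `build_wget_args` is a hypothetical name, and the flag mapping simply mirrors the examples above:

```python
import subprocess

def build_wget_args(url, directory=None, output_name=None,
                    resume=False, newer_only=False, depth=None):
    """Assemble a wget argument list from the options shown above."""
    args = ['wget']
    if directory:
        args += ['-P', directory]         # save into this directory
    if output_name:
        args += ['-O', output_name]       # rename the downloaded file
    if resume:
        args.append('-c')                 # continue an interrupted download
    if newer_only:
        args.append('-N')                 # only fetch if remote is newer
    if depth is not None:
        args += ['-r', '-l', str(depth)]  # recursive download to this depth
    args.append(url)
    return args

def download(url, **options):
    return subprocess.run(build_wget_args(url, **options))
```

Keeping the argument-building separate from the `subprocess.run()` call makes the option logic easy to test on its own.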
When is wget not the solution?
There are several situations where it might be more appropriate to use curl, Beautiful Soup, or Selenium instead of wget:
- When you need fine-grained control over HTTP requests: Curl gives you detailed control over HTTP headers, cookies, and request methods, which can be useful when interacting with APIs or accessing protected websites. Wget's options in this area are more limited.
- When you need to parse and extract data from HTML: Beautiful Soup is a Python library that makes it easy to parse and extract data from HTML documents. wget is not designed for parsing and extracting data from web pages.
- When you need to interact with a website as a user: Selenium is a tool that allows you to automate the process of interacting with a website as a user. It can be used to fill out forms, click buttons, and perform other actions that are not possible with wget.
Python and wget are powerful tools for automating the process of downloading files and web pages. By using wget with Python, you can customize your downloads, integrate web scraping or file downloading tasks into your existing Python scripts, and save time and effort.
However, it's important to respect the terms of service of the websites you're downloading from and avoid overloading servers.
If you're looking for an alternative to wget for web scraping, consider using WebScrapingAPI. WebScrapingAPI is a professional web scraping service that allows you to easily extract data from websites without the need to build and maintain your own web scraper.
It's a fast, reliable, and cost-effective solution that is suitable for businesses of all sizes.