Effortlessly Download Web Pages and Files with Python and wget
Suciu Dan on Apr 21 2023
Are you tired of manually downloading web pages and files? In this article, I will show you how to use Python and the command-line tool wget to automate the process of downloading from the web, while learning about wget's capabilities and limitations, as well as alternative tools for web scraping.
Python is a powerful and popular programming language used in various fields, such as automation, data science, and application development. It's easy to learn and has a large library of modules and frameworks that enhance its capabilities.
Wget is a command-line tool that allows you to download files from the internet. It's widely available on most Unix-like systems, including Linux and macOS. Wget is a versatile tool that can be used to download single files, multiple files, and even entire directories.
Here are some of the main features of wget:
- Download files from the internet: Wget can fetch web pages, images, and any other type of file served over the web.
- Customizable download options: Wget allows you to specify various options to customize your downloads, such as the directory to save the file, the file name, and whether to overwrite existing files.
- Resume interrupted downloads: If a download is interrupted, wget can resume it from where it left off, saving time and bandwidth.
- Recursive download: wget can be used to download an entire website or directory by following links recursively.
- Support for HTTP, HTTPS, and FTP: Wget can handle various types of internet protocols, including HTTP, HTTPS, and FTP, making it a versatile tool for downloading from different types of servers.
- Broad availability: Wget ships with, or is easily installed on, most Unix-like systems, including Linux and macOS, making it easy to use on a variety of platforms.
Why use wget with Python?
Python and wget can be combined to automate the process of downloading web pages and files, saving time and effort. Wget can be customized through Python, and web scraping or file downloading tasks can be integrated into existing Python scripts.
There are several reasons why one might choose to use wget with Python:
- Automation: By using wget with Python, you can automate the process of downloading files from the internet. This can save time and effort, especially if you need to download a large number of files or websites.
- Customization: Wget allows you to specify various options to customize your downloads, such as the directory to save the file, the file name, and whether to overwrite existing files. By using wget with Python, you can programmatically set these options and customize your downloads according to your needs.
- Ease of use: Python is known for its simplicity and readability, making it an easy language to learn and use. By using wget with Python, you can leverage the power of Python to make web scraping and file downloading tasks easier.
- Scalability: Python is a scalable language that is capable of handling large amounts of data. By using wget with Python, you can scale up your web scraping or file downloading tasks to handle larger datasets.
Now that we've discussed the individual and combined benefits of Python and wget, let's move on to the code writing part.
Make sure wget is installed on your computer:
- If you are a Linux user, you most likely already have it installed.
- If you are a Windows user, you can download the binary from this page. Make sure you add the binary's path to the PATH environment variable. Another option is to use WSL (Windows Subsystem for Linux). Read more about it here.
- If you are a Mac user, install wget using Homebrew: `brew install wget`.
Don’t forget to check wget’s extensive documentation here.
Get the latest version of Python from the official website and follow the install instructions for your platform. After installation, you can check the version with the `python3 --version` command.
Running System Commands in Python
The `subprocess` module in Python allows you to run system commands and capture their output. It's a powerful and flexible way to interact with the operating system from within your Python scripts.
To use the `subprocess` module, you'll first need to import it into your Python script. Then, you can use the `subprocess.run()` function to run a system command and capture its output.
The `run()` function takes the command and its arguments as a list of strings and returns a `CompletedProcess` object, which contains the command's exit code and, if you request it with `capture_output=True`, its stdout and stderr.
Here's an example of using the `subprocess.run()` function to run the ls command, which lists the files in a directory:
import subprocess

result = subprocess.run(['ls', '-l'])
Run this code with the python3 main.py command. The result should look like this.
-rw-r--r-- 1 dan dan 80 Jan 6 18:58 main.py
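To work with the command's output from Python instead of letting it print straight to the terminal, you can ask `run()` to capture it. A minimal sketch:

```python
import subprocess

# capture_output=True collects stdout/stderr; text=True decodes them to str
result = subprocess.run(['ls', '-l'], capture_output=True, text=True)

print(result.returncode)  # 0 means the command succeeded
print(result.stdout)      # the directory listing as a string
```

This is the pattern you will want whenever a later step of your script needs to inspect what the command printed.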
Downloading with wget
Download a file
Let’s start by downloading the WebScrapingAPI logo. Replace the arguments list with `wget` and the logo URL. The command will look like this:
result = subprocess.run(['wget', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
The script will return the following output:
--2023-01-06 19:06:32-- https://www.webscrapingapi.com/images/logo/logo-white.svg
Resolving www.webscrapingapi.com (www.webscrapingapi.com)... 126.96.36.199, 188.8.131.52
Connecting to www.webscrapingapi.com (www.webscrapingapi.com)|184.108.40.206|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5391 (5.3K) [image/svg+xml]
Saving to: 'logo-white.svg'
logo-white.svg 100%[====================================================================================================================================================================>] 5.26K --.-KB/s in 0.06s
2023-01-06 19:06:33 (91.6 KB/s) - 'logo-white.svg' saved [5391/5391]
From the output, we can see how `wget` resolves the domain name, connects to the domain, receives a `200 OK` response code, finds the file length (5.3k) and starts saving the file locally under the name `logo-white.svg`.
You can check the project folder for the `logo-white.svg` file.
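`wget` signals failure through its exit code, so it is worth checking it (or passing `check=True`) before relying on the downloaded file. A sketch, assuming wget is on the PATH and the URL is reachable:

```python
import subprocess

try:
    # check=True raises CalledProcessError on a non-zero exit code
    subprocess.run(
        ['wget', '-q', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'],
        check=True,
    )
    print('download succeeded')
except subprocess.CalledProcessError as err:
    print('wget failed with exit code', err.returncode)
except FileNotFoundError:
    print('wget is not installed or not on PATH')
```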
Download in a directory
You can download the file into a custom directory by using the `-P` flag. Let’s update the script and run it to see the results:
result = subprocess.run(['wget', '-P', 'images', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
The output is almost the same, with the only difference of the file being stored in the `./images/` directory.
Setting the downloaded file name
Using the `-O` flag, you can specify a new name for the downloaded file. Let’s give it a try:
result = subprocess.run(['wget', '-O', 'named-logo.svg', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
Check the project folder for the `named-logo.svg` file.
Download file if remote version is newer
You can use the `-N` flag to download the remote file only if the version is newer than the local file. The command will look like this:
result = subprocess.run(['wget', '-N', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
Resume interrupted downloads
If a download is interrupted, `wget` can resume it from where it left off, saving time and bandwidth. To do this, you'll need to use the `-c` flag, which tells `wget` to continue an interrupted download.
The command will look like this:
result = subprocess.run(['wget', '-c', 'https://www.webscrapingapi.com/images/logo/logo-white.svg'])
If the file download was already completed, the output will contain the following message:
HTTP request sent, awaiting response... 416 Requested Range Not Satisfiable
The file is already fully retrieved; nothing to do.
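Resuming pairs naturally with a retry loop. The sketch below wraps the `-c` call in a function; the `runner` parameter is a hypothetical injection point added here so the retry logic can be exercised without a network connection:

```python
import subprocess
import time

def download_with_retries(url, attempts=3, runner=subprocess.run):
    """Call wget with -c up to `attempts` times, resuming after each failure."""
    for attempt in range(attempts):
        result = runner(['wget', '-c', url])
        if result.returncode == 0:
            return True
        time.sleep(1)  # brief pause before resuming
    return False
```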
Recursive download
Wget can be used to download an entire website or directory by following links recursively. To do this, you'll need to use the `-r` and `-l` flags: the `-r` flag tells the tool to follow links recursively and the `-l` flag specifies the maximum depth of the recursion.
result = subprocess.run(['wget', '-r', '-l', '2', 'https://www.webscrapingapi.com'])
This command will download the website at "https://www.webscrapingapi.com" and follow links to other pages on the same website to a maximum depth of 2.
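The flags covered so far can be collected into a small helper that assembles the wget argument list. `build_wget_args` is a hypothetical name, and the flag mapping simply mirrors the examples above:

```python
import subprocess

def build_wget_args(url, directory=None, output_name=None,
                    resume=False, newer_only=False, depth=None):
    """Assemble a wget argument list from the options shown above."""
    args = ['wget']
    if directory:
        args += ['-P', directory]         # save into this directory
    if output_name:
        args += ['-O', output_name]       # rename the downloaded file
    if resume:
        args.append('-c')                 # continue an interrupted download
    if newer_only:
        args.append('-N')                 # only fetch if remote is newer
    if depth is not None:
        args += ['-r', '-l', str(depth)]  # recursive download to this depth
    args.append(url)
    return args

def download(url, **options):
    return subprocess.run(build_wget_args(url, **options))
```

Keeping the argument-building separate from the `subprocess.run()` call makes the option logic easy to test on its own.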
When is wget not the solution?
There are several situations where it might be more appropriate to use curl, Beautiful Soup, or Selenium instead of wget:
- When you need fine-grained control over HTTP requests: Curl gives you detailed control over HTTP headers, cookies, and request methods, which can be useful when interacting with APIs or accessing protected websites. Wget's options in this area are more limited.
- When you need to parse and extract data from HTML: Beautiful Soup is a Python library that makes it easy to parse and extract data from HTML documents. wget is not designed for parsing and extracting data from web pages.
- When you need to interact with a website as a user: Selenium is a tool that allows you to automate the process of interacting with a website as a user. It can be used to fill out forms, click buttons, and perform other actions that are not possible with wget.
Python and wget are powerful tools for automating the process of downloading files and web pages. By using wget with Python, you can customize your downloads, integrate web scraping or file downloading tasks into your existing Python scripts, and save time and effort.
However, it's important to respect the terms of service of the websites you're downloading from and avoid overloading servers.
If you're looking for an alternative to wget for web scraping, consider using WebScrapingAPI. WebScrapingAPI is a professional web scraping service that allows you to easily extract data from websites without the need to build and maintain your own web scraper.
It's a fast, reliable, and cost-effective solution that is suitable for businesses of all sizes.