
Creating A Threaded HTTP Downloader With Python and Beautiful Soup

 

A year ago I was in the process of learning Python. Tired of doing examples from books, I wanted to create something that was my own and useful (to me at least).

At the time I was working for a game host, where downloading large numbers of files was a common task. An example of this was downloading mods for a game called Arma to individual game servers. A single mod could be up to 4 GB in size and spread over thousands of files. We mirrored these files around the world across ~50 HTTP servers, and we would then use Wget to download an entire self-contained directory for the mod. However, an issue we frequently ran into was the speed of the downloads. Wget is great for grabbing an entire remote directory, but it only downloads one file at a time. In many cases this resulted in far less than the available bandwidth being used.

I wanted to come up with a tool that could scan a directory listing and start a download for each file in a new thread. This would allow us to specify how many simultaneous downloads to run, so we could tune the downloads to the available bandwidth.

The solution I came up with uses Beautiful Soup to parse the directory listing. It then spawns a download thread for each file it finds.

This project posed a few interesting problems. For me, the largest at the time was crawling all sub-directories and building a list of all files that needed to be downloaded.

The full solution can be seen here: https://github.com/barrycarey/Threaded_HTTP_Downloader

How does it work?

I’m only going to cover the scanning and downloading here. I’ll omit the threading aspect. If you check the GitHub repo it should be obvious how the threading is handled.

First, it requires a public directory listing for the directory we’re trying to download. We then need to recursively scan all directories and build a list of files to download.

Directory Listing Example (From an IIS Server)


We then open the page with urllib and pass it to Beautiful Soup for parsing:

import urllib.request
from bs4 import BeautifulSoup
response = urllib.request.urlopen('http://example.com/arma-mod/')
parsed_page = BeautifulSoup(response, 'html.parser')

Once we have the parsed page, we need to find all URLs on the page and determine whether each is a directory or a file. If it is a file, we add it to the list of files to download. If it’s a directory, we need to scan deeper into it.

To accomplish this we ask Beautiful Soup to find all HTML links in the page. We then loop over them and determine whether each is a file or a directory.

The simplest means of determining if a given URL was a file was to use os.path.splitext on the raw URL:

name, ext = os.path.splitext('some-dir/test.dll')  # name = 'some-dir/test', ext = '.dll'

If ext came back empty then it should be a directory. However, this turned out to be an issue if a directory had a . in the name. To work around this I created a method that appended a / to the URL and tried to open it with urllib. If we get an exception, we know it is a file. (IIS returns a 404 for a URL to a file with a / on the end.)
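A minimal sketch of that check might look something like this (check_is_file is a hypothetical name, not the method from the project, but the idea is the same):

import urllib.error
import urllib.request

def check_is_file(url):
    # IIS returns a 404 for a file URL with a trailing slash,
    # so an HTTPError here means the link points at a file
    try:
        with urllib.request.urlopen(url + '/'):
            return False
    except urllib.error.HTTPError:
        return True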

 

import os

files, dirs = [], []

for link in parsed_page.find_all('a'):
    # Avoid climbing back to the previous level
    if link.string == '[To Parent Directory]':
        continue
    # Get file name and extension. For a directory, ext will be empty
    name, ext = os.path.splitext(link.string)

    if ext:
        files.append(link.string)
    else:
        dirs.append(link.string)

Once we have all of the files and directories we can kick off the file download threads.

for file in files:
    # download_url and output_file are derived from each file name (built elsewhere in the project)
    threading.Thread(target=self._threaded_download, args=(download_url, output_file,)).start()

The method _threaded_download handles the download of the actual file.
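As a rough idea of what happens inside it, a simplified sketch might look like this (not the exact code from the project):

import shutil
import urllib.request

def _threaded_download(self, download_url, output_file):
    # Stream the remote file straight to disk so large files
    # never have to fit in memory
    with urllib.request.urlopen(download_url) as response, open(output_file, 'wb') as out:
        shutil.copyfileobj(response, out)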

Next, we have to deal with all of the directories we found. To do this, all of the logic outlined above was put into a method that can be called recursively. I loop through each directory we found, build a new URL based off the root URL, and pass it to the same method. It then scans the new URL and continues down the tree.
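Boiled down, the recursive version looks roughly like this (a simplified sketch, assuming self.files collects the results and every listing URL ends with a /; the actual method in the project differs):

import os
import urllib.request
from bs4 import BeautifulSoup

def scan_directory(self, url):
    # Parse the listing at this URL, exactly as shown above
    parsed_page = BeautifulSoup(urllib.request.urlopen(url), 'html.parser')
    for link in parsed_page.find_all('a'):
        if link.string == '[To Parent Directory]':
            continue
        name, ext = os.path.splitext(link.string)
        if ext:
            self.files.append(url + link.string)
        else:
            # Build the sub-directory URL and scan it the same way
            self.scan_directory(url + link.string + '/')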

That is the overly simplified version of it. Feel free to review the whole project on GitHub. Being my first ‘substantial’ Python project, I was happy with how it turned out. It allowed me to download large volumes of files from our HTTP mirrors much more quickly than Wget could.
