Preserving Web Sites using WinHTTrack

How to use WinHTTrack to create dehydrated, navigable, offline copies of web sites.

Step 2: Selecting Content Categories for Preservation

After selecting the Set Options button, you should see the following screen:

Proxy Screen

Figure 4 Proxy Screen

This is the proxy screen. We will not use it in this tutorial, but you may route WinHTTrack through a proxy to reach a site if you need to. The tabs that matter most for a web scrape performed by WinHTTrack are Scan Rules and Spider.

The Scan Rules page is where we specify which file types the program should archive and store on our computer for offline use. We do this by adding each file extension to the text box in the following format: “+*.FILEEXTENSION”. The “+” prefix tells the program to include files with that extension in the resulting archive, while the “-” prefix excludes them from the preservation process. For example, “+*.pdf” includes PDF files and “-*.zip” excludes ZIP files. You can use the same prefixes to include or exclude specific URLs if you choose. Common file types can be added in bulk with the check boxes you see on your screen.

Scan Rules

Figure 5 Scan Rules
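
To make the “+” and “-” rules more concrete, here is a minimal Python sketch of how a list of scan rules could be applied to a URL. It is an illustration only, not WinHTTrack's own matching code. It assumes that “*” matches any run of characters and that, when several rules match, the last one in the list decides. The function name is_included and the example.org URLs are hypothetical.

```python
from fnmatch import fnmatchcase


def is_included(url, rules, default=False):
    """Approximate a scan-rule decision: the last matching rule wins (assumption)."""
    decision = default
    for rule in rules:
        # Every rule must begin with "+" (include) or "-" (exclude).
        if not rule or rule[0] not in ("+", "-"):
            raise ValueError(f"rule must start with + or -: {rule!r}")
        pattern = rule[1:]
        # fnmatchcase treats "*" as "any run of characters", including "/".
        if fnmatchcase(url, pattern):
            decision = rule[0] == "+"
    return decision


# Hypothetical rule list in the same format as the Scan Rules text box.
rules = ["+*.png", "+*.pdf", "-*.zip"]
print(is_included("http://example.org/img/logo.png", rules))       # True
print(is_included("http://example.org/files/archive.zip", rules))  # False
```

Trying a rule list against a few sample URLs in this way can help you spot a rule that is too broad before you start a long download.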

After you’ve included or excluded the appropriate file types, move on to the Spider tab. The Spider tab needs fewer changes for this tutorial, but it is still critical if you want to scrape a whole web site properly. Make sure the “parse java files” and “accept cookies” boxes are checked, so that applets and similar components are included in the scrape and the parts of the web site that depend on them keep working. The robots.txt rules are followed by default, for privacy. If you are downloading a personal site, or a site whose possibly sensitive data you are legally allowed to harvest, you may change this setting, but you shouldn’t normally need to tell WinHTTrack to ignore the rules.

Spider Screen

Figure 6 Spider
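
If you want to see what a site's robots.txt rules would block before you change the Spider settings, the following Python sketch uses the standard library's urllib.robotparser module. It only illustrates how such rules are read and is not how WinHTTrack evaluates them internally. The robots.txt text, the "HTTrack" user-agent string, and the example.org URLs are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt that asks all crawlers to stay out of /private/.
sample = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(sample, "HTTrack", "http://example.org/index.html"))
print(allowed_by_robots(sample, "HTTrack", "http://example.org/private/a.html"))
```

With this sample, the first URL is allowed and the second is blocked, which is the kind of content you would miss while the robots.txt rules stay active.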

This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.