
Figure 2 Project Creation Screen
Once WinHTTrack launches you will be greeted by the screen above. This is your project creation screen, where you will name your project, give it a category (optional), and select where on your system you want the project to be saved. For the purposes of this tutorial I have selected the name “Blake’s Choice Test” and given it the category “Work.”
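A brief aside for command-line users: WinHTTrack is the Windows front end for the HTTrack engine, and the choices on this screen map onto command-line options. The project location, for example, corresponds to HTTrack’s -O flag, which sets the path where the mirror, cache, and log files are written. The sketch below is illustrative only, with a placeholder for the URLs you will enter on the next screen and an example path you would replace with your own:

    httrack <URLs to scrape> -O "C:\My Web Sites\Blakes Choice Test"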
Now that you’ve named your project, select Next to go on to the Definitions screen. This is the most important screen you will visit, and we will spend the majority of this tutorial here. The text field marked “Web Addresses” is where you input the URLs you want to scrape and preserve. You can enter multiple URLs if you want to scrape multiple sites for a single project.
Please note that it is critical to include the full URL, down to the most specific directory you want to capture. In this example, because I want to save one specific exhibit, I have specified that the scrape begins at /blakeschoice/ rather than at /exhibits/. Were we to end the URL at /exhibits/, WinHTTrack would scrape ALL exhibits attached to that URL rather than the specific one we selected. WinHTTrack is most effective when you are specific about what you want to crawl.
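Continuing the command-line aside, a roughly equivalent invocation would look like the sketch below. The domain example.org is a stand-in, since the exhibit’s actual host is not shown here; note that the URL reaches all the way into /blakeschoice/, and that multiple URLs can simply be listed one after another, separated by spaces:

    httrack "https://example.org/exhibits/blakeschoice/" -O "C:\My Web Sites\Blakes Choice Test"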
You should also notice the Action dropdown menu above the Add URL button. This menu dictates what kind of scrape you want WinHTTrack to perform. In the image below I have selected “Update Existing Download” because I have already scraped this webpage but wish to update my archive of it with images. When you are scraping a web page for the first time, you will select “Download Web site(s)” or “Download Web site(s) + Questions.”
Figure 3 Action Screen
The difference between the two is this: “+ Questions” will ask you periodically throughout the web scrape whether you want to include content from specific hyperlinks or other commonly mis-scraped content. If you want to truly preserve the feel of a site, links and all, you will not need this option, but it is useful to have if you need to save space. I will also briefly touch upon the other options. “Get separated files” will download all of the site’s assets, organized the way the site has structured them, but will not include HTML files of the webpages themselves. “Download all sites in pages (multiple mirror)” will additionally scrape the external sites that your downloaded pages link to. “Test links in pages (bookmark test)” merely tests the hyperlinks at a URL to ensure they are still live.
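For the curious, each of these actions has a command-line counterpart in the HTTrack engine. The sketches below reuse the stand-in URL from earlier and describe my assumed mapping rather than anything shown in WinHTTrack itself: a plain mirror is the default action, -W adds the “+ Questions” prompting, -g fetches files without mirroring the pages, --mirrorlinks also pulls in sites linked from the first-level pages, --testlinks only checks that links respond, and --update refreshes an existing mirror (run against the same project path) without asking for confirmation:

    httrack "https://example.org/exhibits/blakeschoice/" -O "C:\My Web Sites\Blakes Choice Test"
    httrack -W "https://example.org/exhibits/blakeschoice/" -O "C:\My Web Sites\Blakes Choice Test"
    httrack -g "https://example.org/exhibits/blakeschoice/"
    httrack --mirrorlinks "https://example.org/exhibits/blakeschoice/" -O "C:\My Web Sites\Blakes Choice Test"
    httrack --testlinks "https://example.org/exhibits/blakeschoice/"
    httrack --update -O "C:\My Web Sites\Blakes Choice Test"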
Once you’ve selected your URLs and defined the action you want, click the Set Options button and you can begin Selecting Content Categories for Preservation.