University of Texas Libraries

Preserving Web Sites using WinHTTrack

How to use WinHTTrack to create dehydrated, navigable, offline copies of web sites.

Vocabulary

Scraping – What HTTrack does to your web content.  HTTrack does not move or remove anything from your website.  Instead, it reads each page and downloads copies of the content it finds there (images, video, text, formatting, etc.) so that it can recreate the content as HTML copies of your web pages.

Dehydration – The process by which “live” web content is frozen and turned into “non-live,” or dehydrated, material.  Think of it as taking a snapshot of the web content, one that is fully navigable and that will represent the content once it is no longer “live.”

Mirror – HTTrack bills itself as “mirroring software.”  This means that HTTrack doesn’t take content away from the site; it creates copies with near-identical properties, a mirrored version of your content.  As such, the content you access once the process is complete is called a mirror.
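
To make the idea of a navigable mirror concrete, the sketch below shows one way a URL could be mapped to a file path on disk, and how a link between two mirrored pages could be rewritten so it still works offline. This is an illustration only, not HTTrack's code: the folder name `mirror`, the `example.org` URLs, and the helper names `local_path` and `relative_link` are all assumptions, and HTTrack's real file layout and link rewriting are configurable and more involved.

```python
import posixpath
from urllib.parse import urlparse


def local_path(url: str, root: str = "mirror") -> str:
    """Map a URL to a hypothetical file path inside the mirror folder."""
    parsed = urlparse(url)
    path = parsed.path
    # A URL that ends in "/" (or has no path) stands for a directory index.
    if path == "" or path.endswith("/"):
        path += "index.html"
    return posixpath.join(root, parsed.netloc, path.lstrip("/"))


def relative_link(from_url: str, to_url: str) -> str:
    """Rewrite a link so it points from one mirrored file to another."""
    start_dir = posixpath.dirname(local_path(from_url))
    return posixpath.relpath(local_path(to_url), start=start_dir)


home = local_path("https://example.org/")
link = relative_link(
    "https://example.org/exhibits/index.html",
    "https://example.org/images/logo.png",
)
print(home)
print(link)
```

Because the rewritten link is relative, the copied pages keep pointing at each other no matter where the mirror folder is stored.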

Robots.txt – A file on the website itself that tells bots which parts of the site they may and may not crawl.  HTTrack reads a site’s robots.txt and, by default, follows its rules, so the file determines what kinds of content HTTrack is not allowed to scrape.
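
As a rough illustration of how a bot can consult robots.txt before fetching a page, the sketch below uses Python's standard `urllib.robotparser` module. The robots.txt content, the bot name `ExampleBot`, and the `example.org` URLs are made up for the example; this is not how HTTrack itself is implemented.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real site serves its own at /robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Load robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a bot with the assumed name "ExampleBot" may fetch each URL.
allowed = parser.can_fetch("ExampleBot", "https://example.org/exhibits/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.org/private/notes.html")
print(allowed, blocked)
```

Under these rules the first URL may be fetched and the second may not, because it falls under the disallowed `/private/` path.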

Spider – HTTrack refers in several places to the spider, the bot that crawls through your web pages and scrapes your content.  It crawls from link to link and builds a web of copied content that matches the structure of your original site, hence the name.
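
The crawling behavior described above can be sketched as a simple breadth-first walk over links. The example below is a minimal sketch under stated assumptions, not HTTrack's actual algorithm: the three-page site is held in memory so the code runs offline, and the names `SITE`, `LinkCollector`, and `crawl` are invented for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Hypothetical three-page site held in memory so the sketch runs offline.
SITE = {
    "https://example.org/": '<a href="about.html">About</a> '
                            '<a href="https://other.example/">Elsewhere</a>',
    "https://example.org/about.html": '<a href="/">Home</a> '
                                      '<a href="team.html">Team</a>',
    "https://example.org/team.html": '<a href="about.html">About</a>',
}


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch):
    """Visit pages breadth-first, staying on the starting host."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    order = []
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:  # page could not be retrieved; skip it
            continue
        order.append(url)
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return order


visited = crawl("https://example.org/", SITE.get)
print(visited)
```

The `seen` set keeps the spider from visiting a page twice, and the host check keeps it from wandering onto other websites.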

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 2.0 Generic License.