University of Texas Libraries

Web Scraping


Welcome

Welcome to the UT Libraries guide to web scraping! Web scraping is a process of automated web browsing that lets you pull images, text, and other information from the internet. With web scraping, you can quickly gather large amounts of data, or extract data that is buried in a messy or broken website.

This guide will show you how to scrape the web, what tools to use, and which websites you may access.
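
To give a sense of what a scraping tool looks like in practice, here is a minimal sketch in Python. It assumes the third-party requests and beautifulsoup4 packages are installed, and it uses a placeholder URL; substitute a page whose terms permit scraping (see the guidelines below).

# Minimal sketch: fetch a page and pull out its title and paragraph text.
# Requires the third-party packages requests and beautifulsoup4.
# The URL is a placeholder; use a page you are permitted to scrape.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/"
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

print(soup.title.string)                   # the page title
for paragraph in soup.find_all("p"):       # every <p> element
    print(paragraph.get_text(strip=True))  # its visible text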

Ethics, Legality, and Robots.txt

When using a tool to automate data collection, it is crucial to keep legal and ethical factors in mind. Websites often contain material that is subject to copyright and may not be redistributed for commercial purposes. Users also share personal information that may be unethical to spread or to take out of context. Here are some guidelines to follow and questions to ask yourself when you are scraping the web:

  • Respect the terms of service / use

Websites often have a statement on acceptable access and use of the materials they host. Read through any terms of use or license agreements carefully before setting up a web scraping tool. If you are working on a project for educational purposes and wish to ignore terms that prevent you from scraping a website, consult with one of our librarians to make sure your work remains legal and ethical.

  • Consider copyright & the intellectual property of artists

Many websites host the works of artists, whether they be images, videos, or text pieces. By extracting these in bulk, you may put yourself at risk of copyright infringement, especially if you intend to present any of them unaltered in your own work. While educational works are often protected under fair use, this is not an ironclad defense. We strongly recommend that you schedule a consultation with a librarian before starting any project that works with data currently under copyright. Visit our Copyright Crash Course guide for more information.

  • Use a web scraper that respects robots.txt 

Robots.txt is a file hosted at the root of a website that tells automated tools what they are allowed to do. This could include limiting the rate at which your tool can make requests, limiting what kinds of content can be scraped, or even fully disallowing automated use of the site. Robots.txt relies entirely on voluntary compliance and can be ignored outright by a web scraper; however, it is best practice to respect the terms set by the web developer. A short example of checking robots.txt before scraping appears after this list. If robots.txt prevents you from accessing the data you need, consult with a librarian to discuss whether it is an appropriate circumstance to ignore the terms.

  • When in doubt, ask for help

Web scraping can provide a great deal of data for your research, but its extractive nature can pose legal and ethical questions. Strong research cannot be built on unethical data. If you are unsure of the best way to responsibly build a dataset, consult a librarian to ensure you follow best practices.
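
As a concrete illustration of the robots.txt guideline above, here is a minimal sketch using Python's built-in urllib.robotparser module. The URL and user-agent name are placeholders for illustration only.

# Minimal sketch: check robots.txt before fetching a page.
# Uses only the Python standard library; the URLs and the user-agent
# name below are placeholders.
from urllib.robotparser import RobotFileParser

USER_AGENT = "MyResearchScraper"  # hypothetical name; identify your project honestly

parser = RobotFileParser("https://example.com/robots.txt")
parser.read()  # download and parse the site's robots.txt

target = "https://example.com/some/page.html"
if parser.can_fetch(USER_AGENT, target):
    delay = parser.crawl_delay(USER_AGENT)  # None if no Crawl-delay directive
    print(f"Allowed to fetch {target}; requested crawl delay: {delay}")
else:
    print(f"robots.txt disallows fetching {target} for this user agent")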

Contact

scholarslab@austin.utexas

Acknowledgement

This guide was created by William Borkgren, Scholars Lab Graduate Assistant, in Spring 2025.

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.