Welcome to the UT Libraries guide to web scraping! Scraping is a process of automated web browsing that allows you to pull images, text, and other information from the internet. With web scraping, you can quickly gather large amounts of data or data that is obscured by a messy or broken website.
This guide will show you how to scrape the web, what tools to use, and which websites you may access.
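For a sense of what a scraper looks like in practice, here is a minimal sketch in Python. It is only an illustration: it assumes the third-party requests and beautifulsoup4 packages are installed, and the URL shown is a placeholder rather than a real data source.

```python
# A minimal scraping sketch using the requests and beautifulsoup4 packages.
# The URL below is a placeholder; substitute a page you have permission to scrape.
import requests
from bs4 import BeautifulSoup

url = "https://example.com/articles"  # hypothetical page

response = requests.get(url, headers={"User-Agent": "ut-libraries-guide-example"})
response.raise_for_status()  # stop if the request failed

soup = BeautifulSoup(response.text, "html.parser")

# Pull every second-level heading and every image address from the page.
headings = [h.get_text(strip=True) for h in soup.find_all("h2")]
image_urls = [img["src"] for img in soup.find_all("img") if img.get("src")]

print(headings)
print(image_urls)
```

Even a small script like this can make many requests very quickly, which is why the legal and ethical considerations below matter before you run one.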
When using a tool to automate data collection, it is crucial that you keep in mind the legal and ethical factors. Websites often contain material that is subject to copyright and not to be redistributed for commercial purposes. Users also will share personal information that may be unethical to spread or to take out of context. Here are some guidelines to follow and questions to ask yourself when you are scraping the web:
Websites often have a statement on acceptable access and use of the materials they host. Read through any terms of use or license agreements carefully before setting up a web scraping tool. If you are working on a project for educational purposes and wish to disregard terms that prevent you from scraping a website, consult with one of our librarians first to make sure your work remains legal and ethical.
Many websites host the works of artists, whether they be images, videos, or text pieces. By extracting these in bulk, you may put yourself at risk for copyright infringement, especially if you intend to present any of these unaltered in your own work. While educational works are often protected under fair use, this is not an ironclad defense. We strongly recommend that you schedule a consultation with a librarian before starting any project that works with data currently under copyright. Visit our Copyright Crash Course guide.
Robots.txt is a file placed at the root of a website that tells automated tools what they are allowed to do. This can include limiting the rate at which your tool makes requests, restricting what kinds of content can be scraped, or disallowing automated use of the site entirely. Robots.txt relies entirely on voluntary compliance and can be ignored by a web scraper; however, it is best practice to respect the site owner's terms. If robots.txt prevents you from accessing the data you need, consult with a librarian to discuss whether your circumstance makes it appropriate to ignore those terms.
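You can check a site's robots.txt from your own code before scraping. The sketch below uses Python's built-in urllib.robotparser to ask whether a given page may be fetched and to honor any crawl delay the site requests; the URL and user agent strings are placeholders for illustration only.

```python
# Checking robots.txt before scraping, using only the Python standard library.
# The URL and user agent below are placeholders for illustration.
import time
from urllib import robotparser

robots = robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()

user_agent = "ut-libraries-guide-example"
page = "https://example.com/articles"

if robots.can_fetch(user_agent, page):
    # Honor any Crawl-delay directive; fall back to a one-second pause.
    delay = robots.crawl_delay(user_agent) or 1
    time.sleep(delay)
    print(f"Allowed to fetch {page}; waiting {delay} second(s) between requests.")
else:
    print(f"robots.txt disallows fetching {page} for this user agent.")
```

Building a check like this into your scraper makes respectful behavior the default rather than something you have to remember each time.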
Web scraping can provide a great deal of data for your research, but its extractive nature raises legal and ethical questions. Strong research cannot be built on unethical data. If you are unsure of the best way to responsibly build a dataset, consult a librarian to ensure you follow best practices.
Contact:
scholarslab@austin.utexas
The guide was created by William Borkgren, Scholars Lab Graduate Assistant in Spring 2025.
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.