
Web Scraping with Python

Python is one of the most popular programming languages, known for its readable syntax and its wide range of supported libraries. Because of this broad support, there is almost no limit to what you can do with the data you gather.

Some recommended libraries for web scraping include:

BeautifulSoup is a tool for parsing the HTML and XML of a webpage. It builds a parse tree that reveals the structure and content of the parsed page, which helps you extract specific elements.

Quick-start guide
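As a minimal sketch, here is how BeautifulSoup might be used to parse a small HTML snippet and pull out specific elements; the snippet and its contents are made up for the example.

    from bs4 import BeautifulSoup

    # A small HTML snippet standing in for a downloaded page (contents are hypothetical)
    html = """
    <html><body>
      <h1>Course Catalog</h1>
      <ul class="courses">
        <li><a href="/ling101">Linguistics 101</a></li>
        <li><a href="/hist210">History 210</a></li>
      </ul>
    </body></html>
    """

    # Build the parse tree, then navigate it to reach specific elements
    soup = BeautifulSoup(html, "html.parser")
    print(soup.h1.get_text())                    # "Course Catalog"
    for link in soup.select("ul.courses a"):     # CSS selector for the course links
        print(link.get_text(), "->", link["href"])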

Requests enables you to make HTTP requests without building query strings by hand, making the process simpler and easier.

Quick-start guide
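For example, in this minimal sketch (the URL below is a placeholder, not a real endpoint), Requests encodes the query string for you from a dictionary:

    import requests

    # Requests builds the query string from a dict, so you never assemble it by hand
    response = requests.get(
        "https://example.com/search",            # placeholder URL for illustration
        params={"q": "web scraping", "page": 1},
        timeout=10,
    )

    print(response.status_code)   # e.g. 200 on success
    print(response.url)           # full URL, including the encoded query string
    html = response.text          # page body, ready to hand to a parser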

Originally built to test web applications, Selenium also works well for scraping. It automates browser access, so it can scrape dynamic pages that rely on JavaScript.

Getting started guide
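A minimal sketch of browser automation with Selenium, assuming Chrome is installed (recent Selenium releases can locate a matching driver automatically); the URL is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                  # launch a real browser session
    try:
        driver.get("https://example.com")        # placeholder URL
        # Once the browser has run the page's JavaScript, rendered elements can be read
        for heading in driver.find_elements(By.TAG_NAME, "h2"):
            print(heading.text)
    finally:
        driver.quit()                            # always close the browser when done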

Scrapy is a framework for web scraping and crawling. It lets you design “spiders” that collect URLs and follow links to a depth you set. Documentation.
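A minimal spider sketch in the style of the Scrapy tutorial, pointed at the quotes.toscrape.com practice site; the CSS selectors would need to match whatever site you actually crawl:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote on the current page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the "next" link so the spider keeps crawling
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this can be run with scrapy runspider quotes_spider.py -o quotes.json.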


For more information on using these four libraries, visit the guide on ScrapingDog.com.

Python Google Colab Notebook

Are you interested in trying web scraping with Python but don't know where to start? Try our Google Colab notebook for a walkthrough on using Requests and BeautifulSoup to parse the HTML of a page and store scraped data in a table.

Click here to visit the Colab Notebook.
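If you would like to see the overall flow before opening the notebook, here is a rough sketch of that kind of workflow, assuming a hypothetical page that lists books in <div class="book"> blocks; the notebook itself walks through each step in more detail.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page (the URL and page structure are hypothetical)
    response = requests.get("https://example.com/books", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect one row per item on the page
    rows = []
    for book in soup.select("div.book"):
        rows.append({
            "title": book.select_one("h3").get_text(strip=True),
            "price": book.select_one("span.price").get_text(strip=True),
        })

    # Store the scraped data as a table (CSV) for later analysis
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)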

Data Management Libraries

After gathering data with web scraping, you can manipulate and visualize that data within Python. Here are some suggested Python libraries for managing and interpreting your data:

Pandas is a popular library for managing and manipulating data in Python. It works with tabular data such as CSV, JSON, XLS, and SQL to create organized data frames, which can then be visualized to share and interpret your data.

Getting started guide

In-depth user guide
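As a small sketch of what that looks like, assuming a books.csv file like the one produced in the scraping example above:

    import pandas as pd

    # Load scraped results into a data frame (books.csv is the hypothetical
    # file written in the scraping sketch earlier on this page)
    df = pd.read_csv("books.csv")

    # Clean a column and summarize the table
    df["price"] = df["price"].str.replace("£", "", regex=False).astype(float)
    print(df.head())                             # first few rows
    print(df["price"].describe())                # summary statistics
    df.to_json("books.json", orient="records")   # export to another format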

An alternative to Pandas for working with data in Python. Its advantage is that it uses all of a computer's cores, so it can handle large datasets more quickly and efficiently than Pandas. However, it is a newer and more complicated library, so consider the size of your data and your comfort with Python when choosing between the two.

Documentation

Getting started guide

NumPy is essential for working with arrays and performing numerical calculations within Python. It can be used with Pandas data frames to manage quantitative data.

Getting started guide

User guide
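A brief sketch of the kind of numerical work NumPy handles; the values are made up for illustration:

    import numpy as np

    prices = np.array([51.77, 53.74, 50.10, 47.82, 54.23])   # hypothetical values

    print(prices.mean())     # average
    print(prices.std())      # standard deviation
    print(prices * 1.1)      # vectorized arithmetic across the whole array

    # NumPy functions also work directly on Pandas columns, e.g. np.log(df["price"])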

The Natural Language Toolkit (NLTK) is the classic Python library for natural language processing. With it, you can perform tokenization, stemming, and text parsing, all essential tasks when working with textual data.

NLTK wiki on GitHub

Detailed list of NLTK Modules
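A minimal sketch of tokenization and stemming with NLTK; the sentence is made up, and the tokenizer data must be downloaded once before first use:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    nltk.download("punkt")   # one-time download of tokenizer data
                             # (newer NLTK releases may also need "punkt_tab")

    text = "Libraries are scraping websites to build research datasets."

    tokens = word_tokenize(text)                 # split the sentence into word tokens
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]    # reduce each token to its stem

    print(tokens)
    print(stems)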

Python in the Library Catalog

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.