
Web Scraping with Python

Python is one of the most popular programming languages, known for its readable syntax and its wide range of supported libraries. Because of this broad support, there is almost no limit to what you can do with the data you gather.

Some recommended libraries for web scraping include:

BeautifulSoup is a tool for parsing the HTML and XML of a webpage. It builds a parse tree that reveals the structure and content of the parsed page, which helps you extract specific elements.

Quick-start guide
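As a minimal sketch, here is how BeautifulSoup might be used to parse a small HTML snippet and pull out specific elements; the snippet and its contents are made up for the example.

    from bs4 import BeautifulSoup

    # A small HTML snippet standing in for a downloaded page (contents are hypothetical)
    html = """
    <html><body>
      <h1>Course Catalog</h1>
      <ul class="courses">
        <li><a href="/ling101">Linguistics 101</a></li>
        <li><a href="/hist210">History 210</a></li>
      </ul>
    </body></html>
    """

    # Build the parse tree, then navigate it to reach specific elements
    soup = BeautifulSoup(html, "html.parser")
    print(soup.h1.get_text())                    # "Course Catalog"
    for link in soup.select("ul.courses a"):     # CSS selector for the course links
        print(link.get_text(), "->", link["href"])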

Requests enables you to make HTTP requests without building query strings by hand, making the process simpler and easier.

Quick-start guide
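For example, in this minimal sketch (the URL below is a placeholder, not a real endpoint), Requests encodes the query string for you from a dictionary:

    import requests

    # Requests builds the query string from a dict, so you never assemble it by hand
    response = requests.get(
        "https://example.com/search",            # placeholder URL for illustration
        params={"q": "web scraping", "page": 1},
        timeout=10,
    )

    print(response.status_code)   # e.g. 200 on success
    print(response.url)           # full URL, including the encoded query string
    html = response.text          # page body, ready to hand to a parser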

Originally built to test web applications, Selenium also works well for scraping. It automates browser access, so it can scrape dynamic pages that rely on JavaScript.

Getting started guide
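A minimal sketch of browser automation with Selenium, assuming Chrome is installed (recent Selenium releases can locate a matching driver automatically); the URL is a placeholder:

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()                  # launch a real browser session
    try:
        driver.get("https://example.com")        # placeholder URL
        # Once the browser has run the page's JavaScript, rendered elements can be read
        for heading in driver.find_elements(By.TAG_NAME, "h2"):
            print(heading.text)
    finally:
        driver.quit()                            # always close the browser when done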

Scrapy is a framework for web scraping and crawling. It lets you design “spiders” that collect URLs and follow links to a depth you set. Documentation.
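A minimal spider sketch in the style of the Scrapy tutorial, pointed at the quotes.toscrape.com practice site; the CSS selectors would need to match whatever site you actually crawl:

    import scrapy

    class QuotesSpider(scrapy.Spider):
        name = "quotes"
        start_urls = ["https://quotes.toscrape.com/"]

        def parse(self, response):
            # Yield one item per quote on the current page
            for quote in response.css("div.quote"):
                yield {
                    "text": quote.css("span.text::text").get(),
                    "author": quote.css("small.author::text").get(),
                }
            # Follow the "next" link so the spider keeps crawling
            next_page = response.css("li.next a::attr(href)").get()
            if next_page is not None:
                yield response.follow(next_page, callback=self.parse)

Saved as quotes_spider.py, this can be run with scrapy runspider quotes_spider.py -o quotes.json.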


For more information on using these four libraries, visit the guide on ScrapingDog.com.

Python Google Colab Notebook

Are you interested in trying web scraping with Python but don't know where to start? Try our Google Colab notebook for a walkthrough on using Requests and BeautifulSoup to parse the HTML of a page and store scraped data in a table.

Click here to visit the Colab Notebook.
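If you would like to see the overall flow before opening the notebook, here is a rough sketch of that kind of workflow, assuming a hypothetical page that lists books in <div class="book"> blocks; the notebook itself walks through each step in more detail.

    import csv
    import requests
    from bs4 import BeautifulSoup

    # Fetch the page (the URL and page structure are hypothetical)
    response = requests.get("https://example.com/books", timeout=10)
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect one row per item on the page
    rows = []
    for book in soup.select("div.book"):
        rows.append({
            "title": book.select_one("h3").get_text(strip=True),
            "price": book.select_one("span.price").get_text(strip=True),
        })

    # Store the scraped data as a table (CSV) for later analysis
    with open("books.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["title", "price"])
        writer.writeheader()
        writer.writerows(rows)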

Data Management Libraries

After gathering data with web scraping, you can manipulate and visualize that data within Python. Here are some suggested Python libraries for managing and interpreting your data:

Pandas is a popular library for managing and manipulating data in Python. It works with tabular data such as CSV, JSON, XLS, and SQL to create organized data frames, which can then be visualized to share and interpret your data.

Getting started guide

In-depth user guide
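As a small sketch of what that looks like, assuming a books.csv file like the one produced in the scraping example above:

    import pandas as pd

    # Load scraped results into a data frame (books.csv is the hypothetical
    # file written in the scraping sketch earlier on this page)
    df = pd.read_csv("books.csv")

    # Clean a column and summarize the table
    df["price"] = df["price"].str.replace("£", "", regex=False).astype(float)
    print(df.head())                             # first few rows
    print(df["price"].describe())                # summary statistics
    df.to_json("books.json", orient="records")   # export to another format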

An alternative to Pandas for working with data in Python. Its advantage is that it uses all of a computer's cores, so it can handle large datasets more quickly and efficiently than Pandas. However, it is a newer and more complicated library, so consider the size of your data and your comfort with Python when choosing between the two.

Documentation

Getting started guide

NumPy is essential for working with arrays and performing numerical calculations within Python. It can be used with Pandas data frames to manage quantitative data.

Getting started guide

User guide
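A brief sketch of the kind of numerical work NumPy handles; the values are made up for illustration:

    import numpy as np

    prices = np.array([51.77, 53.74, 50.10, 47.82, 54.23])   # hypothetical values

    print(prices.mean())     # average
    print(prices.std())      # standard deviation
    print(prices * 1.1)      # vectorized arithmetic across the whole array

    # NumPy functions also work directly on Pandas columns, e.g. np.log(df["price"])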

The Natural Language Toolkit (NLTK) is the classic Python library for natural language processing. With it, you can perform tokenization, stemming, and text parsing, all essential tasks when working with textual data.

NLTK wiki on GitHub

Detailed list of NLTK Modules
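A minimal sketch of tokenization and stemming with NLTK; the sentence is made up, and the tokenizer data must be downloaded once before first use:

    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import PorterStemmer

    nltk.download("punkt")   # one-time download of tokenizer data
                             # (newer NLTK releases may also need "punkt_tab")

    text = "Libraries are scraping websites to build research datasets."

    tokens = word_tokenize(text)                 # split the sentence into word tokens
    stemmer = PorterStemmer()
    stems = [stemmer.stem(t) for t in tokens]    # reduce each token to its stem

    print(tokens)
    print(stems)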

Python in the Library Catalog

Creative Commons License
This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.