University of Texas Libraries

Web Scraping

Web Scraping with R

The R programming language is a popular and widely supported tool for web scraping. Beyond scraping, R is used for data analysis and visualization, making it an excellent choice for gathering and researching data.

Recommended packages for web scraping in R include:

rvest is a Tidyverse package for scraping web pages. Using CSS selectors (or XPath expressions), users can extract specific elements from a page's HTML.

Getting started guide. Tutorial.
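
As a minimal sketch of selecting elements with CSS selectors, the example below parses a small inline HTML snippet rather than a live site. The page content and the "ul.courses a" selector are invented for illustration; to work with a real page you would typically call read_html() on its URL instead of minimal_html().

```r
library(rvest)

# A small inline page stands in for a real website (hypothetical content).
page <- minimal_html('
  <h1>Course Listings</h1>
  <ul class="courses">
    <li><a href="/r-intro">Intro to R</a></li>
    <li><a href="/scraping">Web Scraping</a></li>
  </ul>
')

# Select every link inside the list using a CSS selector.
links <- html_elements(page, "ul.courses a")

titles <- html_text2(links)          # visible text of each link
paths  <- html_attr(links, "href")   # value of each href attribute

# Combine the extracted pieces into a data frame for later analysis.
courses <- data.frame(title = titles, path = paths)
print(courses)
```

The same pattern (parse, select, extract) applies to most static pages; only the selector changes.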

The polite package provides the function bow(), which reads a site's robots.txt file to check whether you have permission to scrape a page. The package also limits the rate at which your web scraper sends requests and respects the terms set by the website's host. For more information on the importance of respecting the terms from robots.txt, consult the ethics section of the homepage of this guide.

polite on GitHub. polite documentation.
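
A rough sketch of this workflow follows. The helper name scrape_politely, the user agent string, and the five-second delay are assumptions chosen for illustration, and the URL in the usage line is a placeholder; substitute a site you have permission to scrape.

```r
library(polite)

# Hypothetical helper: introduce the scraper to the host, then fetch one page.
scrape_politely <- function(url,
                            agent = "research-scraper (contact: you@example.edu)") {
  # bow() consults the host's robots.txt and records the delay to use
  # between requests (5 seconds here, an assumed value).
  session <- bow(url, user_agent = agent, delay = 5)

  # scrape() requests the page through that session, so the rate limit
  # and the permissions found by bow() are applied.
  scrape(session)
}

# Example call (needs network access, so it only runs in an interactive session):
if (interactive()) {
  page <- scrape_politely("https://example.com/")
}
```

When the request succeeds, the result can be passed to rvest functions such as html_elements().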

Shiny is used to build web applications in R. It is useful for creating an interactive, graphical representation of your data.
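
To give a sense of what a Shiny application looks like, here is a minimal sketch that lets a user pick a category and see a summary of scraped results. The data frame and its column names are invented for illustration.

```r
library(shiny)

# Hypothetical scraped results: number of pages collected per site section.
scraped <- data.frame(
  section = c("news", "events", "blog"),
  pages   = c(42, 17, 8)
)

# User interface: a dropdown and a line of text.
ui <- fluidPage(
  titlePanel("Scraped pages by section"),
  selectInput("section", "Section", choices = scraped$section),
  textOutput("summary")
)

# Server logic: re-runs whenever the user picks a different section.
server <- function(input, output, session) {
  output$summary <- renderText({
    n <- scraped$pages[scraped$section == input$section]
    paste0("Collected ", n, " pages from '", input$section, "'.")
  })
}

# Build the app object; launch it interactively with runApp(app).
app <- shinyApp(ui, server)
```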

"A Step-by-Step Guide to Web-Scraping with R" by Better Data Science.

R Google Colab Notebook

Are you interested in trying web scraping with R but don't know where to start? Try our Google Colab notebook, where we will walk you through the process of identifying HTML elements and accessing web data using the rvest package!

Click here to visit the Colab Notebook.

Data Management Packages

R can help you to manipulate and visualize data you have scraped from the web. These packages are popular for making your data manageable and interpretable within an R environment.

Tidyverse is a meta package for R that makes working with data simple to learn and practice. It contains many packages that share a philosophy of data and similar language to maximize ease of use.

Packages within Tidyverse
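
Scraped data usually arrives as untidy text. The sketch below shows one way to clean it using the Tidyverse packages dplyr, stringr, and readr; the item names and prices are made-up sample data.

```r
library(dplyr)
library(stringr)
library(readr)

# Hypothetical raw scraper output: untrimmed names and prices stored as text.
raw <- tibble(
  item  = c("  Notebook ", "Pen", "Backpack  "),
  price = c("$4.50", "$1.25", "$32.00")
)

clean <- raw %>%
  mutate(
    item  = str_trim(item),       # drop stray leading/trailing whitespace
    price = parse_number(price)   # strip the "$" and convert to numeric
  ) %>%
  filter(price < 10) %>%          # keep only items under 10 dollars
  arrange(desc(price))            # most expensive of the remaining items first

print(clean)
```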

tidytext brings the Tidyverse principles of tidy data and usability to text mining. It provides functions for converting text to and from tidy formats, so text analysis can use the same tools as the rest of the Tidyverse.
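
As an illustration, the following sketch splits scraped text into one word per row and counts word frequencies. The two sample sentences are invented; unnest_tokens() and the stop_words dataset come from tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical scraped text, one row per page.
pages <- tibble(
  page = c(1, 2),
  text = c("Web scraping gathers data from the web.",
           "Tidy data makes scraping results easier to analyze.")
)

word_counts <- pages %>%
  unnest_tokens(word, text) %>%            # one lowercase word per row
  anti_join(stop_words, by = "word") %>%   # drop common words such as "the"
  count(word, sort = TRUE)                 # tally how often each word appears

print(word_counts)
```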

ggplot2 is the most popular visualization tool for R. It takes data and creates highly customizable charts and graphics to represent it. Using this tool can help you identify trends in your web scraping data and communicate the findings of your research. This package is included in Tidyverse, or can be installed independently.

Getting started guide
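
A brief sketch of a ggplot2 chart built from scraped data follows. The word counts are placeholder values, not real results.

```r
library(ggplot2)

# Hypothetical word counts from scraped text.
word_counts <- data.frame(
  word = c("scraping", "data", "web", "tidy"),
  n    = c(12, 9, 7, 4)
)

# Horizontal bar chart, with bars ordered by frequency.
p <- ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(
    x = NULL,
    y = "Occurrences",
    title = "Most frequent words in scraped pages"
  )

# print(p) draws the chart; ggsave("word_counts.png", p) writes it to a file.
```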

R in the Library Catalog

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.