LibGuides: Web Scraping: Scraping Social Media

Choosing a Tool - API or Web Scraping

While web scraping is a powerful tool for gathering data from the web, it is not the only method. An API (application programming interface) lets you request data directly from a website’s server instead of extracting it from the site’s HTML, as web scraping does. You can think of it as your computer communicating directly with another computer, rather than parsing a website designed for human interaction.

APIs are powerful but can be limited. They allow only certain requests for information, and the server decides how much information will be delivered. Take, for example, a weather service. Its API would allow you to request the forecasted weather for several cities, find the annual rainfall for a region, and find historical high and low temperatures. It likely wouldn’t serve you images from weather-related news articles, the icons used to visually display the weather, or information on the organization that can be found in the “about” section of its webpage. When choosing whether to use an API or build a web scraping tool, consider what information is important to you and whether the API serves all of that information or only a summary of it.
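The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not any real weather service: the JSON response, the HTML page, the city name, and the field and class names are all invented for the example. It uses only the standard library (`json` and `html.parser`) and reads inline sample data, so no network connection is needed.

```python
import json
from html.parser import HTMLParser

# Hypothetical API response: structured data, ready to use.
API_RESPONSE = '{"city": "Exampleville", "forecast": {"high_f": 71, "low_f": 54}}'

# Hypothetical web page showing the same forecast to human readers.
PAGE_HTML = """
<html><body>
  <h1>Exampleville</h1>
  <span class="high">71</span>
  <span class="low">54</span>
  <a href="/about">About us</a>
</body></html>
"""


def forecast_from_api(raw):
    """API route: decode the JSON body and pick out the fields of interest."""
    data = json.loads(raw)
    return {"high": data["forecast"]["high_f"], "low": data["forecast"]["low_f"]}


class ForecastParser(HTMLParser):
    """Scraping route: collect the text of <span class="high"> and <span class="low">."""

    def __init__(self):
        super().__init__()
        self.values = {}
        self._current = None  # class of the span currently open, if any

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("high", "low"):
            self._current = css_class

    def handle_data(self, data):
        if self._current:
            self.values[self._current] = int(data.strip())

    def handle_endtag(self, tag):
        if tag == "span":
            self._current = None


def forecast_from_html(html):
    parser = ForecastParser()
    parser.feed(html)
    return parser.values


print(forecast_from_api(API_RESPONSE))
print(forecast_from_html(PAGE_HTML))
```

Both functions recover the same two numbers, but the scraper depends on the page’s markup staying the same, while the API route depends on the service choosing to expose those fields. The scraper could also reach the “About us” link, which this hypothetical API does not return.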

It is important to note that there are two kinds of APIs:

Free public APIs: Public APIs are available for many government and information services and some social media sites like Reddit.
APIs that require a key: A key is a special code needed to access API functionality for some services and typically must be paid for. Other social media sites, such as X or Meta’s products (Instagram, Facebook), require API keys, and you must subscribe to a paid plan to access them.
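In practice, the two kinds of API often differ only in whether a key accompanies the request. The sketch below builds (but does not send) both kinds of request using Python’s standard library. The endpoint `api.example.com`, the query parameters, the `EXAMPLE_API_KEY` environment variable, and the use of a bearer-token `Authorization` header are all assumptions made for illustration; each real service documents its own URL and authentication scheme.

```python
import os
import urllib.request
from typing import Optional
from urllib.parse import urlencode

BASE_URL = "https://api.example.com/v1/posts"  # hypothetical endpoint


def build_request(query: str, api_key: Optional[str] = None) -> urllib.request.Request:
    """Build a GET request; attach the key only when the service needs one."""
    params = urlencode({"q": query, "limit": 10})
    request = urllib.request.Request(f"{BASE_URL}?{params}")
    if api_key:
        # Assumed scheme: many services expect the key in an Authorization header.
        request.add_header("Authorization", f"Bearer {api_key}")
    return request


# Public API: no credentials attached.
public_request = build_request("weather")

# Keyed API: read the key from the environment so it never appears in shared code.
# "demo-key" is a placeholder fallback so the example runs without configuration.
keyed_request = build_request(
    "weather", api_key=os.environ.get("EXAMPLE_API_KEY", "demo-key")
)

print(public_request.full_url)
print(keyed_request.get_header("Authorization") is not None)
```

Keeping the key in an environment variable rather than in the script is a common precaution, since research code is often shared or published.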

Because APIs share information on the provider’s terms, there can be limits on the rate at which your machine can make requests and a cap on how much you can access in a given period. Web scraping is not exempt from such limits: a site’s robots.txt file, which tells automated tools which pages they may visit, may also ask them to wait between requests. Consider the limitations of rate and data before deciding which approach better serves your research.
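As a rough sketch of what respecting robots.txt can look like, the example below uses Python’s standard `urllib.robotparser` module. The robots.txt content, the `example.com` URLs, and the `research-bot` user agent are invented for illustration, and the rules are parsed from an inline string rather than downloaded. Note that `Crawl-delay` is a widely used but non-standard directive, so not every site provides one.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real one lives at https://<site>/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""

USER_AGENT = "research-bot"  # hypothetical name for your scraper

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def allowed(url):
    """Return True if the robots.txt rules permit this user agent to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


def polite_delay(default=1.0):
    """Seconds to wait between requests: the site's Crawl-delay, else a default."""
    delay = parser.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


def crawl(urls, fetch, sleep=time.sleep):
    """Fetch each permitted URL in turn, pausing between requests.

    `fetch` is any function that takes a URL and returns its content.
    """
    results = []
    delay = polite_delay()
    for url in urls:
        if not allowed(url):
            continue  # skip pages the site asks crawlers to avoid
        results.append(fetch(url))
        sleep(delay)
    return results


print(allowed("https://example.com/posts"))
print(allowed("https://example.com/private/data"))
print(polite_delay())
```

With a two-second delay, collecting 10,000 pages takes well over five hours, which is the kind of estimate worth making before committing to scraping over an API.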

For more information, see the Working With APIs library guide.

Terms of Service - X (formerly Twitter)

X, formerly known as Twitter, is a social media platform where users share text and media posts. Its wide-scale adoption makes it a valuable resource for data on public discourse.

For details, see Scraping X (Twitter).

Terms of Service - Meta (Instagram & Facebook)

Meta, owner of Instagram, Facebook, and Threads, maintains the Meta Content Library for academic research. This library provides access to an archive of all publicly available information across these platforms and is maintained within the Social Media Archive (SOMAR) by the Inter-university Consortium for Political and Social Research (ICPSR) at the University of Michigan.

To gain API access to this database, you must apply to ICPSR. Applications are accepted only during certain windows, so consult the ICPSR webpage for dates. For more information on the application requirements and process, visit the SOMAR Application Guide.

Social Media Scraping Tools and Guides

Instagram Scraper by Senthilnathan Karuppaiah

This JavaScript application is built to scrape Instagram posts without an API key while respecting the terms of service. Installation requirements and first steps can be found in the README on GitHub.

“Web Scraping X” from ScrapingDog.com

This article provides a step-by-step guide to scraping X using Python. The method uses Beautiful Soup to parse the site’s HTML and XML, and Selenium to load dynamic content (elements that change with user interaction, as opposed to static pages).

Examples of Social Media Web Scraping in Research

Camargo-Henriquez, I., & Nunez-Bernal, Y. (2022). A Web Scraping based approach for data research through social media: An Instagram case. V Congreso Internacional En Inteligencia Ambiental, Ingeniería de Software y Salud Electrónica y Móvil (AmITIC), 1–4. https://doi.org/10.1109/AmITIC55733.2022.9941290

This paper proposes a practical means of data retrieval from Instagram and discusses the applications of data gathered from the platform.

Leveraging web scraping and stacking ensemble machine learning techniques to enhance detection of major depressive disorder from social media posts. 2024. Social Network Analysis and Mining, 14(1), 239-. https://doi.org/10.1007/s13278-024-01392-w

In this paper, web scraping is used to pull large amounts of text from social media for analysis to identify depression in users. It also provides a comparison and analysis of similar projects to assess the efficacy of this methodology.