Web Scraping and Computationally Intensive Theory Development: Practices, Ethics, and an Outlook to the Future

Nicolai Etienne Fabian, Edin Smailhodzic, Abayomi Baiyere

Research output: Chapter in Book/Report/Conference proceeding; Conference abstract in proceedings; Research; peer-review

Abstract

The Information Systems (IS) field is increasingly engaging in computationally intensive research (Berente et al., 2019; Miranda et al., 2022). These projects are often based on digital trace data, frequently collected with web scraping techniques from online environments such as social media or websites (Boegershausen et al., 2022; Miranda et al., 2022). This ever-growing trove of data offers unparalleled opportunities for researchers. Yet, we know little about the crucial link between web-scraped digital trace data and subsequent computationally intensive theory development. Researchers can use web scraping (web crawlers/web spiders) and/or application programming interfaces (APIs) to automatically collect information from websites. However, the use of web scraping is highly sensitive in terms of practices (Boegershausen et al., 2022), and little is known about its subsequent impact on theory development. Although the IS field was once at the forefront of discussing the use of data stemming from online environments (Allen et al., 2006), little research has followed since. Thus, questions of how to systematically use web scraping, and of how choices made during the scraping process may affect theory development, remain unanswered.

We set out to understand current web scraping practices in the IS field. To this end, we collected and analyzed 176 papers from the four leading IS journals. Our exploratory approach yielded challenging findings. First, the practices of web scraping are only vaguely described, limiting potential replications. Second, ethical challenges and data rights are barely covered. Additionally, our study revealed a strongly skewed distribution, with only a handful of data sources (e.g., Twitter, Amazon) accounting for the overwhelming majority of publications. As such, we see grounds for discussing these choices, offering guidelines, and contributing to the literature on computationally intensive theory development.
Original language: English
Title of host publication: ICIS 2024 TREOS
Editors: Hope Koch, Peter Ractham, Heinz-Theo Wagner
Number of pages: 1
Place of Publication: Atlanta, GA
Publisher: Association for Information Systems. AIS Electronic Library (AISeL)
Publication date: 2024
Article number: 85
Publication status: Published - 2024
Event: The 45th International Conference on Information Systems. ICIS 2024: Digital Platforms for Emerging Societies - Bangkok Marriott Marquis Queen’s Park, Bangkok, Thailand
Duration: 15 Dec 2024 to 18 Dec 2024
Conference number: 45
https://icis2024.aisconferences.org/

Conference

Conference: The 45th International Conference on Information Systems. ICIS 2024
Number: 45
Location: Bangkok Marriott Marquis Queen’s Park
Country/Territory: Thailand
City: Bangkok
Period: 15/12/2024 to 18/12/2024