Human Tests for Machine Models: What Lies “Beyond the Imitation Game”?

  • Noya Kohavi
  • Anna Weichselbraun*
  • *Corresponding author for this work

Research output: Contribution to journal › Journal article › Research › peer-review


Abstract

Benchmarking large language models (LLMs) is a key practice for evaluating their capabilities and risks. This paper considers the development of “BIG Bench,” a crowdsourced benchmark designed to test LLMs “Beyond the Imitation Game.” Drawing on linguistic anthropological and ethnographic analysis of the project's GitHub repository, we examine how contributors developed tasks based on their lay understandings of language, cognition, and intelligence. By tracing how contributors make implicit judgments about what constitutes a meaningful test of intelligence, we show how widespread language ideologies shape the evaluation of LLMs and the imaginaries that guide their development.
Original language: English
Journal: Journal of Linguistic Anthropology
Number of pages: 24
ISSN: 1055-1360
DOIs
Publication status: Published - 24 Nov 2025

Bibliographical note

Epub ahead of print. Published online: 24 November 2025.

Keywords

  • Benchmarking
  • Common sense
  • Intelligence
  • Language ideologies
  • Large language models
