Classifying Sluice Occurrences in Dialogue

Austin Baird, Anissa Hamza, Daniel Hardt

Research output: Chapter in Book/Report/Conference proceeding › Article in proceedings › Research › peer-review



Ellipsis is an important challenge for natural language processing systems, and addressing that challenge requires large collections of relevant data. The dataset described by Anand and McCloskey (2015), consisting of 4100 occurrences, is an important step towards addressing this issue. However, many NLP technologies require much larger collections of data. Furthermore, previous collections of ellipsis are primarily restricted to news data, although sluicing presents a particularly important challenge for dialogue systems. In this paper we classify sluices as Direct, Reprise, or Clarification. We perform manual annotation with acceptable inter-coder agreement. We build classifier models with Decision Trees and Naive Bayes, with an accuracy of 67%. We deploy a classifier to automatically classify sluice occurrences in OpenSubtitles, resulting in a corpus with 1.7 million occurrences. This will support empirical research into sluicing in dialogue, and it will also make it possible to build NLP systems using very large datasets. This is a noisy dataset; based on a small manually annotated sample, we found that only 80% of instances are in fact sluices, and the accuracy of sluice classification is lower. Despite this, the corpus can be of great use in research on sluicing and development of systems, and we are making the corpus freely available on request. Furthermore, we are in the process of improving the accuracy of sluice identification and annotation for the purpose of creating a subsequent version of this corpus.
Original language: English
Title of host publication: The LREC 2018 Proceedings: Eleventh International Conference on Language Resources and Evaluation
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 4
Place of Publication: Paris
Publisher: European Language Resources Association
Publication date: 2018
ISBN (Electronic): 9791095546009
Publication status: Published - 2018
Event: The 11th International Conference on Language Resources and Evaluation. LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 → 12 May 2018
Conference number: 11 (Conference Proceedings)


Conference: The 11th International Conference on Language Resources and Evaluation. LREC 2018


  • Sluicing
  • Ellipsis
  • Dialogue
