Classifying Sluice Occurrences in Dialogue

Austin Baird, Anissa Hamza, Daniel Hardt

Publication: Contribution to book/anthology/report > Conference contribution in proceedings > Research > Peer-reviewed

Abstract

Ellipsis is an important challenge for natural language processing systems, and addressing that challenge requires large collections of relevant data. The dataset described by Anand and McCloskey (2015), consisting of 4100 occurrences, is an important step towards addressing this issue. However, many NLP technologies require much larger collections of data. Furthermore, previous collections of ellipsis are primarily restricted to news data, although sluicing presents a particularly important challenge for dialogue systems. In this paper we classify sluices as Direct, Reprise, or Clarification. We perform manual annotation with acceptable inter-coder agreement. We build classifier models with Decision Trees and Naive Bayes, with accuracy of 67%. We deploy a classifier to automatically classify sluice occurrences in OpenSubtitles, resulting in a corpus with 1.7 million occurrences. This will support empirical research into sluicing in dialogue, and it will also make it possible to build NLP systems using very large datasets. This is a noisy dataset; based on a small manually annotated sample, we found that only 80% of instances are in fact sluices, and the accuracy of sluice classification is lower. Despite this, the corpus can be of great use in research on sluicing and development of systems, and we are making the corpus freely available on request. Furthermore, we are in the process of improving the accuracy of sluice identification and annotation for the purpose of creating a subsequent version of this corpus.
Original language: English
Title: The LREC 2018 Proceedings: Eleventh International Conference on Language Resources and Evaluation
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 4
Place of publication: Paris
Publisher: European Language Resources Association
Publication date: 2018
Pages: 1580-1583
ISBN (Electronic): 9791095546009
Status: Published - 2018
Event: The 11th International Conference on Language Resources and Evaluation. LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
Conference number: 11
http://lrec2018.lrec-conf.org/en/
http://www.lrec-conf.org/proceedings/lrec2018/index.html (Conference Proceedings)

Conference

Conference: The 11th International Conference on Language Resources and Evaluation. LREC 2018
Number: 11
Country/Territory: Japan
City: Miyazaki
Period: 07/05/2018 to 12/05/2018

Keywords

  • Sluicing
  • Ellipsis
  • Dialogue
