Classifying Sluice Occurrences in Dialogue

Austin Baird, Anissa Hamza, Daniel Hardt

Research output: Chapter in Book/Report/Conference proceeding; Article in proceedings; Research; peer-review


Abstract

Ellipsis is an important challenge for natural language processing systems, and addressing that challenge requires large collections of relevant data. The dataset described by Anand and McCloskey (2015), consisting of 4100 occurrences, is an important step towards addressing this issue. However, many NLP technologies require much larger collections of data. Furthermore, previous collections of ellipsis are primarily restricted to news data, although sluicing presents a particularly important challenge for dialogue systems. In this paper we classify sluices as Direct, Reprise, or Clarification. We perform manual annotation with acceptable inter-coder agreement. We build classifier models with Decision Trees and Naive Bayes, reaching an accuracy of 67%. We deploy a classifier to automatically classify sluice occurrences in OpenSubtitles, resulting in a corpus with 1.7 million occurrences. This will support empirical research into sluicing in dialogue, and it will also make it possible to build NLP systems using very large datasets. This is a noisy dataset: based on a small manually annotated sample, we found that only 80% of instances are in fact sluices, and the accuracy of sluice classification is lower. Despite this, the corpus can be of great use in research on sluicing and development of systems, and we are making the corpus freely available on request. Furthermore, we are in the process of improving the accuracy of sluice identification and annotation for the purpose of creating a subsequent version of this corpus.
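
The abstract names Naive Bayes as one of the classifier models. As a rough illustration of that kind of classifier, the following is a minimal sketch of a multinomial Naive Bayes with add-one smoothing. It is not the authors' implementation: the `featurize` function, the feature set (the wh-word plus the words of the preceding turn), the toy dialogue pairs, and their label assignments are all assumptions invented for this example.

```python
import math
from collections import Counter, defaultdict


def featurize(previous_turn, sluice):
    """Hypothetical feature set: the wh-word plus words of the preceding turn."""
    tokens = ["wh=" + sluice.lower().strip("?")]
    tokens += ["prev=" + word for word in previous_turn.lower().split()]
    return tokens


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in examples:
            self.label_counts[label] += 1
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each known token.
            score = math.log(count / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for token in tokens:
                if token in self.vocab:  # tokens unseen in training are skipped
                    score += math.log((self.token_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy data; the labels are illustrative assumptions, not corpus annotations.
TOY_DATA = [
    ("someone called for you", "who", "Direct"),
    ("i left it somewhere", "where", "Direct"),
    ("did you talk to marlow", "who", "Reprise"),
    ("we should go to ventnor", "where", "Reprise"),
    ("the thing is broken again", "what", "Clarification"),
    ("you need to leave now", "what", "Clarification"),
]

model = NaiveBayes().fit(
    [(featurize(prev, sluice), label) for prev, sluice, label in TOY_DATA]
)

if __name__ == "__main__":
    print(model.predict(featurize("someone left it here", "who")))
    print(model.predict(featurize("the thing is broken", "what?")))
```

Six examples are far too few to say anything about accuracy; the sketch only shows the shape of the computation (per-class token counts, smoothing, and an argmax over log scores).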
Original language: English
Title of host publication: The LREC 2018 Proceedings: Eleventh International Conference on Language Resources and Evaluation
Editors: Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga
Number of pages: 4
Place of Publication: Paris
Publisher: European Language Resources Association
Publication date: 2018
Pages: 1580-1583
ISBN (Electronic): 9791095546009
Publication status: Published - 2018
Event: The 11th International Conference on Language Resources and Evaluation. LREC 2018 - Miyazaki, Japan
Duration: 7 May 2018 to 12 May 2018
Conference number: 11
http://lrec2018.lrec-conf.org/en/
http://www.lrec-conf.org/proceedings/lrec2018/index.html (Conference Proceedings)

Conference

Conference: The 11th International Conference on Language Resources and Evaluation. LREC 2018
Number: 11
Country: Japan
City: Miyazaki
Period: 07/05/2018 to 12/05/2018

Keywords

  • Sluicing
  • Ellipsis
  • Dialogue

Cite this

Baird, A., Hamza, A., & Hardt, D. (2018). Classifying Sluice Occurrences in Dialogue. In N. Calzolari, K. Choukri, C. Cieri, T. Declerck, S. Goggi, K. Hasida, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis, ... T. Tokunaga (Eds.), The LREC 2018 Proceedings: Eleventh International Conference on Language Resources and Evaluation (pp. 1580-1583). Paris: European Language Resources Association.