Abstract
Ellipsis is an important challenge for natural language processing systems, and addressing that challenge requires large collections of relevant data. The dataset described by Anand and McCloskey (2015), consisting of 4100 occurrences, is an important step towards addressing this issue. However, many NLP technologies require much larger collections of data. Furthermore, previous collections of ellipsis are primarily restricted to news data, although sluicing presents a particularly important challenge for dialogue systems. In this paper we classify sluices as Direct, Reprise, or Clarification. We perform manual annotation with acceptable inter-coder agreement. We build classifier models with Decision Trees and Naive Bayes, reaching an accuracy of 67%. We deploy a classifier to automatically classify sluice occurrences in OpenSubtitles, resulting in a corpus with 1.7 million occurrences. This will support empirical research into sluicing in dialogue, and it will also make it possible to build NLP systems using very large datasets. This is a noisy dataset; based on a small manually annotated sample, we found that only 80% of instances are in fact sluices, and the accuracy of sluice classification is lower. Despite this, the corpus can be of great use in research on sluicing and the development of systems, and we are making the corpus freely available on request. Furthermore, we are in the process of improving the accuracy of sluice identification and annotation for the purpose of creating a subsequent version of this corpus.
Original language | English |
---|---|
Title | The LREC 2018 Proceedings : Eleventh International Conference on Language Resources and Evaluation |
Editors | Nicoletta Calzolari, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Koiti Hasida, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis, Takenobu Tokunaga |
Number of pages | 4 |
Place of publication | Paris |
Publisher | European Language Resources Association |
Publication date | 2018 |
Pages | 1580-1583 |
ISBN (Electronic) | 9791095546009 |
Status | Published - 2018 |
Event | The 11th International Conference on Language Resources and Evaluation. LREC 2018 - Miyazaki, Japan Duration: 7 May 2018 → 12 May 2018 Conference number: 11 http://lrec2018.lrec-conf.org/en/ http://www.lrec-conf.org/proceedings/lrec2018/index.html (Conference Proceedings) |
Conference
Conference | The 11th International Conference on Language Resources and Evaluation. LREC 2018 |
---|---|
Number | 11 |
Country/Territory | Japan |
City | Miyazaki |
Period | 7 May 2018 → 12 May 2018 |
Internet address | |
Keywords
- Sluicing
- Ellipsis
- Dialogue