Scheller Boos, R.A., Prestes, K.V. and Villavicencio, A. orcid.org/0000-0002-3731-9168 (2014) Identification of multiword expressions in the brWaC. In: Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J. and Piperidis, S. (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Ninth International Conference on Language Resources and Evaluation (LREC'14), 26-31 May 2014, Reykjavik, Iceland. European Language Resources Association (ELRA), pp. 728-735. ISBN 9782951740884
Abstract
Although corpus size is a well-known factor that affects the performance of many NLP tasks, large freely available corpora are still scarce for many languages. In this paper we describe one effort to build a very large corpus for Brazilian Portuguese, the brWaC, generated following the Web as Corpus Kool Yinitiative. To indirectly assess the quality of the resulting corpus, we examined the impact of corpus origin on a specific task, the identification of Multiword Expressions with association measures, against a standard corpus. Focusing on nominal compounds, the expressions obtained from each corpus are of comparable quality and indicate that corpus origin has no impact on this task.
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2014 European Language Resources Association (ELRA) |
Keywords: | Multiword Expressions; Corpora; Web-crawling |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 21 Nov 2019 14:43 |
Last Modified: | 21 Nov 2019 16:27 |
Published Version: | http://www.lrec-conf.org/proceedings/lrec2014/pdf/... |
Status: | Published |
Publisher: | European Language Resources Association (ELRA) |
Refereed: | Yes |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153570 |