Wagner Filho, J.A., Wilkens, R., Idiart, M. et al. (1 more author) (2019) The brWaC corpus: A new open resource for Brazilian Portuguese. In: Calzolari, N., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Hasida, K., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Moreno, A., Odijk, J., Piperidis, S. and Tokunaga, T. (eds.) Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 07-12 May 2018, Miyazaki, Japan. European Language Resources Association (ELRA), pp. 4339-4344. ISBN 9791095546009
Abstract
In this work, we present the construction process of a large Web corpus for Brazilian Portuguese, aiming to achieve a size comparable to the state of the art in other languages. We also discuss our updated sentence-level approach for the strict removal of duplicated content. Following the pipeline methodology, more than 60 million pages were crawled and filtered, with 3.5 million being selected. The resulting multi-domain corpus, named brWaC, is composed of 2.7 billion tokens and has been annotated with tagging and parsing information. The incidence of non-unique long sentences, an indication of replicated content, which reaches 9% in other Web corpora, was reduced to only 0.5%. Domain diversity was also maximized, with 120,000 different websites contributing content. We are making our new resource freely available to the research community, both for querying and downloading, in the expectation of aiding new advances in the processing of Brazilian Portuguese.
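As a rough illustration of the duplication measure mentioned in the abstract, the sketch below computes the share of long sentences that occur more than once and drops repeated long sentences. It is not the authors' pipeline: the function names, the exact-match comparison, and the `min_tokens` threshold for what counts as a "long" sentence are all assumptions made for this example.

```python
from collections import Counter


def non_unique_long_sentence_rate(sentences, min_tokens=10):
    """Return the share of long sentences whose text occurs more than once.

    `min_tokens` is a hypothetical threshold; tokens are whitespace-separated.
    """
    long_sentences = [s.strip() for s in sentences if len(s.split()) >= min_tokens]
    if not long_sentences:
        return 0.0
    counts = Counter(long_sentences)
    repeated = sum(1 for s in long_sentences if counts[s] > 1)
    return repeated / len(long_sentences)


def drop_duplicate_sentences(sentences, min_tokens=10):
    """Keep the first occurrence of each long sentence; short ones pass through."""
    seen = set()
    kept = []
    for s in sentences:
        key = s.strip()
        if len(key.split()) >= min_tokens:
            if key in seen:
                continue  # an identical long sentence was already kept
            seen.add(key)
        kept.append(s)
    return kept
```

Calling `non_unique_long_sentence_rate` before and after `drop_duplicate_sentences` on the same list of sentences gives a before/after comparison of the kind the abstract reports, under the assumptions stated above.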
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2018 European Language Resources Association (ELRA) |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 21 Nov 2019 16:14 |
Last Modified: | 21 Nov 2019 16:14 |
Published Version: | https://www.aclweb.org/anthology/L18-1686 |
Status: | Published |
Publisher: | European Language Resources Association (ELRA) |
Refereed: | Yes |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153554 |