Laranjeira, B.R., Moreira, V.P., Villavicencio, A. orcid.org/0000-0002-3731-9168 et al. (2 more authors) (2014) Comparing the quality of focused crawlers and of the translation resources obtained from them. In: Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J. and Piperidis, S., (eds.) Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Ninth International Conference on Language Resources and Evaluation (LREC'14), 26-31 May 2014, Reykjavik, Iceland. European Language Resources Association (ELRA), pp. 3572-3578. ISBN 9782951740884
Abstract
Comparable corpora have been used as an alternative to parallel corpora as resources for computational tasks that involve domain-specific natural language processing. One way to gather documents related to a specific topic of interest is to traverse a portion of the web graph in a targeted way, using focused crawling algorithms. In this paper, we compare several focused crawling algorithms by using them to collect comparable corpora in a specific domain. We then compare the evaluation of the focused crawling algorithms with the performance of linguistic processes trained on the corresponding generated corpora. We also propose a novel approach to focused crawling that exploits the expressive power of multiword expressions.
Metadata
Item Type: | Proceedings Paper |
---|---|
Copyright, Publisher and Additional Information: | © 2014 European Language Resources Association (ELRA) |
Keywords: | Focused Crawling; Comparable Corpora; Machine Translation |
Institution: | The University of Sheffield |
Academic Units: | The University of Sheffield > Faculty of Engineering (Sheffield) > Department of Computer Science (Sheffield) |
Depositing User: | Symplectic Sheffield |
Date Deposited: | 21 Nov 2019 14:54 |
Last Modified: | 21 Nov 2019 16:28 |
Published Version: | http://www.lrec-conf.org/proceedings/lrec2014/pdf/... |
Status: | Published |
Publisher: | European Language Resources Association (ELRA) |
Refereed: | Yes |
Open Archives Initiative ID (OAI ID): | oai:eprints.whiterose.ac.uk:153571 |