Web scraper work

The aim of this project was to write a Perl program that goes to an HTML page specified by the user (the personal website of a researcher), extracts the list of publications from it and uses this list to populate the institutional repository. The program works by removing all the HTML tags so that only the text remains. The text is then parsed to extract the authors’ names, the year of publication, the title of the article and the journal reference. This was quite difficult to program because the text did not follow a consistent format, so some manual intervention was required to overcome the inconsistencies. The program then outputs a text file in the same format as the EndNote text import file.

This file was opened in EndNote and saved with the EndNote file extension (.enl), and the resulting file was imported into EPrints using the EPrints EndNote plugin. However, the import into EPrints failed with an error warning, and the message was not useful in determining the cause. We considered writing a text import plugin instead, but this would have required more work, so we weighed its potential benefits while continuing to investigate the EndNote export. By trial and error we changed the publication style setting to EndNote output, saved the file again and tried importing it into EPrints. This worked.

A Perl program has been written to:

i) Get an HTML page from a user-specified URL.
ii) Strip all HTML tags so that only text remains.
iii) Parse the text to extract the authors’ names, the year of publication, the title of the article and the journal reference.
iv) Print the above information to a file in the EndNote format (suitable for text import).
v) Open the file in EndNote, export it as an EndNote file and use the EndNote plugin to import it into EPrints.
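
Steps i) and ii) can be illustrated with a short sketch. This is not the project's Perl code: it is a minimal Python illustration using only the standard library, and the names TextExtractor, strip_tags and fetch_page are hypothetical. It also assumes that each publication sits in its own block-level element (such as a list item or paragraph), which may not hold for a given researcher's page.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collects the text of a page and discards the tags (hypothetical helper)."""

    # Assumption: these tags mark the start of a new entry or line.
    LINE_BREAK_TAGS = ("p", "li", "br", "div", "tr")

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in self.LINE_BREAK_TAGS:
            self.chunks.append("\n")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_tags(html):
    """Step ii: return the non-empty text lines of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.chunks)
    # Collapse runs of whitespace inside each line and drop blank lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return [line for line in lines if line]


def fetch_page(url):
    """Step i: download the page at a user-specified URL (UTF-8 assumed)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling strip_tags(fetch_page(url)) would then give one text line per block element, ready for the parsing in step iii). Using a real HTML parser rather than a regular expression to remove tags is a design choice of this sketch, made so that script and style content is not mistaken for publication text.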

At step iv) manual intervention was required because the text format was very inconsistent, and automating the handling of every variation would have been too time-consuming. It was not considered worthwhile to spend much time on this, because the script would not be generic enough to apply to other web sites (unless they used a similar format) and would need tweaking every time.
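
The combination of automatic parsing and manual intervention can be sketched as follows. Again this is a hedged Python illustration rather than the project's Perl code. It assumes a single citation layout, "Authors (Year) Title. Journal reference.", which is only one of the formats such a page might use, and the names CITATION, parse_citation, to_endnote and convert are hypothetical. Lines that do not fit the pattern are set aside for manual editing instead of being guessed at. The output uses the %-prefixed tags of the EndNote tagged import format (%0 reference type, %A author, %D year, %T title, %J journal); placing the whole journal reference, including volume and pages, under %J is a simplification.

```python
import re

# Assumed layout: "Authors (Year) Title. Journal reference."
CITATION = re.compile(
    r"^(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.?\s*"
    r"(?P<title>[^.]+)\.\s*(?P<journal>.+?)\.?$"
)


def parse_citation(line):
    """Step iii: return a dict of fields, or None if the line does not fit."""
    match = CITATION.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Assumption: authors are separated by " and " or by semicolons.
    names = re.split(r"\s+and\s+|;\s*", record["authors"])
    record["authors"] = [name.strip() for name in names if name.strip()]
    return record


def to_endnote(record):
    """Step iv: format one record in the EndNote tagged import format."""
    lines = ["%0 Journal Article"]
    lines += ["%A " + author for author in record["authors"]]
    lines.append("%D " + record["year"])
    lines.append("%T " + record["title"])
    lines.append("%J " + record["journal"])
    return "\n".join(lines) + "\n"


def convert(lines):
    """Return the EndNote text plus the lines that need manual editing."""
    records, unparsed = [], []
    for line in lines:
        record = parse_citation(line)
        if record is None:
            unparsed.append(line)
        else:
            records.append(to_endnote(record))
    # A blank line separates consecutive records.
    return "\n".join(records), unparsed
```

In use, the first value returned by convert would be written to the text import file and the second reviewed by hand. Each new page format would need its own pattern, which is the tweaking described above.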

Issues surrounding EndNote

There was still one issue with using EndNote: it tries to append to another library. This is a minor point and was easily resolved.

Other difficulties arising from text inconsistency

Other issues

Potential re-use of code

Because publication lists on the web are formatted inconsistently, this code cannot be reused without changes. The changes needed depend on the format of the web page.

The Perl code is available from the White Rose Research Online repository.