Bulk import of RAE data

The RAE 2008 submission data for the University of York was held in a Microsoft Access database. The bibliographic data exported from the Access database was sent to us as an Excel formatted file which was then manipulated for import into EPrints. The Excel file provided contained the following column headings:

Cleaning up the data

There was some data tidying up required, including removing extra line spaces, strange characters, comments (?, n/a, other notes to readers) from the DOI field (done manually as very difficult to search automatically). Also the ISSN number was not always in the same format. The volume and the issue number from the short title field had to be separated and put into two separate columns, occasionally instead of an issue number there is a year and this had to be removed and put in a separate column. This was quite difficult to do automatically as the characters separating the numbers were not consistent. The volume and issue numbers were separated by a range of characters [brackets, comma, semi colon, vol., No, issue number, Volume, Number, NUMB, dots, PART 5.6, Suppl., (9 part 2)]. For conference papers there were no volume number or issue number and this field contained the conference title instead. Occasionally there was only volume number, volume or issue numbers contained roman numerals (XLV.1) or contained a hyphen indicating a volume or issue series (for e.g. 14 (3-4)) and also the volume or issue number contained alphanumeric characters (70 (8/081701(R)). In addition for a couple of cases the articles were “in press” so there was no volume or issue number available and therefore this column was filled with “in press”.

The author names were removed because there were not in any sort of order and only contained up to three authors.

The publisher details were added in the spreadsheet before importing into the database. This saved some time in that the data was ordered on the journal title and the same publisher information could be pasted into the spreadsheet for articles published in the same journal.

Import into Eprints

The cleaned up data was imported into the WRRO repository using the MultilineExcel import plug-in. There was no documentation about this plug-in regarding what format the data should be in. We therefore assumed that the heading of each column would have to be the field name and then test imported some data into the test repository. In addition from examining the MultilineExcel import plug-in perl code we realised that another column named eprintid had to be added to the spreadsheet. This worked however there were some items that did not import because they contained accented characters or contained characters that were not visible in word or excel (this must have been done when copying and pasting from different sources into the RAE database). It was quite hard to determine which record caused the error and the only way that could be done was by checking the records that had imported successfully against the original data set. The successfully imported records were then removed from the import file containing the original data set and then re-imported one by one. This was quite a time consuming process. We decided that it may be better to divide the data into smaller chunks of 10 records for each import file. Although this created lots of import files it saved some time because it was easier to determine which records failed to import. The records that failed to import were then modified (accented characters, extra spaces or line breaks were deleted) and when successfully imported into EPrints they were then modified again so that they contained the accented characters. In total there were approximately 40/2212 records that needed modification.

Manual Improvements