Guidelines for bulk import
Points to consider
Where is the data coming from?
Who owns the database? Are they happy for wholesale extraction of data? Will your actions slow down their server? Is there a relevant API to help you? What are the future plans for the service: are you potentially replacing the service, or feeding from/to it? What was the original purpose of the data?
Metadata quality and quantity
The quality and richness of the metadata depend on the purpose of its original use and on who collated the data. Completeness may not have been important for the original use, so key pieces of information (such as authors, DOI or date of publication) may be missing. Is it worth adding these manually?
Metadata format
- The format of the metadata may not be consistent, particularly if it has been collated by several people. For example, names and initials are often inconsistent where they have been copied and pasted from published journal articles.
- Some fields may be merged into one, such as volume number and issue number; however, the delimiter between the two numbers may not be consistent.
- Note any accented or other non-standard characters in the text, as these may affect the import if the import plug-in does not support them.
- Certain fields, such as DOI, may contain question marks where the person was unable to find the value, text explaining that the PDF of the work will be provided instead (for RAE data), or other instructions to the collator and examiner of the database.
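The format problems listed above can be detected with a few scripted checks before import. The following is a minimal sketch in Python, assuming each field value is available as a plain string. The function names, the set of delimiters tried and the DOI pattern are illustrative assumptions, not part of any repository's import plug-in.

```python
import re

# Assumed shape of a DOI: "10.", a 4-9 digit registrant code, "/", then a suffix
# with no spaces. This is a plausibility check only, not a lookup.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Assumed delimiters between volume and issue, e.g. "12(3)", "12/3", "12:3", "12, 3".
VOLUME_ISSUE = re.compile(r"^\s*(\d+)\s*[\(/:,;\-\s]\s*(\d+)\s*\)?\s*$")


def split_volume_issue(value):
    """Return (volume, issue); issue is None if the field cannot be split."""
    match = VOLUME_ISSUE.match(value)
    if match:
        return match.group(1), match.group(2)
    return value.strip(), None


def is_plausible_doi(value):
    """True if the value looks like a DOI rather than '?' or a note to the collator."""
    return bool(DOI_PATTERN.match(value.strip()))


def non_ascii_characters(value):
    """Return the distinct non-ASCII characters in a value, sorted."""
    return sorted({ch for ch in value if ord(ch) > 127})
```

Values that fail these checks are probably best set aside for manual review rather than corrected automatically.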
Metadata import
Is there an import plug-in already available from your repository that can be used to import the data in the format provided? This will save quite a bit of time.
- Check the data before importing
- Are the headings labelled correctly?
- You may need to clean the data prior to importing: make sure all data are consistent, that the data under each heading are correct, and that there is no other text.
- Remove data that are likely to cause problems with import and deal with them separately.
- Identify any fields that can be filled in with the same information for multiple records. For example, the publisher and ISSN fields may be better filled in prior to import by sorting the data on journal title and then copying and pasting the ISSN and publisher details to all records with that title.
- It may be useful to sort the data on, for example, journal title so that once imported it is easier to process and add data manually for several records from the same HTTP link.
- Run a test import if possible.
- Divide the data into smaller subsets before importing; this will make it easier to troubleshoot if there are any errors with importing.
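Several of the steps above (checking headings, filling in shared fields and splitting the data into subsets) can be scripted. The following is a minimal sketch, assuming the data have been exported as CSV and are read with Python's standard `csv` module. The column names (`title`, `journal`, `issn`, `publisher`), the function names and the batch size are hypothetical and should be replaced with whatever your import plug-in expects.

```python
import csv
import io

# Hypothetical headings; substitute the ones your import plug-in requires.
EXPECTED_HEADINGS = ["title", "journal", "issn", "publisher"]


def read_rows(text):
    """Parse CSV text into (fieldnames, list of row dictionaries)."""
    reader = csv.DictReader(io.StringIO(text))
    return reader.fieldnames, list(reader)


def missing_headings(fieldnames, expected=EXPECTED_HEADINGS):
    """Return the expected headings that are absent from the file."""
    present = {name.strip().lower() for name in (fieldnames or [])}
    return [name for name in expected if name not in present]


def _clean(row, field):
    """Return a stripped field value, treating missing values as empty."""
    return (row.get(field) or "").strip()


def fill_shared_fields(rows, key="journal", fields=("issn", "publisher")):
    """Fill blank fields from another row that has the same key value."""
    known = {}
    for row in rows:
        for field in fields:
            value = _clean(row, field)
            if value:
                known.setdefault((_clean(row, key).lower(), field), value)
    filled = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            if not _clean(new_row, field):
                new_row[field] = known.get((_clean(row, key).lower(), field), "")
        filled.append(new_row)
    return filled


def batches(rows, size=50):
    """Yield the rows in subsets of at most `size` records."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]
```

Matching on journal title only works where the title is spelled identically across records, so it is worth cleaning titles first. Importing one batch at a time means that a failure can be traced to a small, known set of records.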
Resource allocation
- Metadata enrichment (How important is metadata completeness?)
This may be a decision for the steering committee of the repository. It also depends on how much resource is available and how the imported data may be reused.
- How will full texts be obtained and added to metadata records? Will someone have to contact each author for full-text files?
- Is it necessary to associate each item with a department and/or add subject metadata?
- Is this a one-off exercise or an ongoing exercise? What resource is required to sustain this activity?
- Will bulk import create a data-processing bottleneck? Will it delay or distract from other activities?
Keeping track of imported data
- How will "In Press" items be handled? Is it possible to obtain a regular report of In Press items? How will you know when the publication status has changed? Who updates the record?
- Do you need provenance data to indicate where the data came from?
- What link, if any, will be put in place to the supplying system?
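If provenance and publication status are recorded as fields on each imported record, a regular report of In Press items becomes a simple filter. The following is a minimal sketch, assuming records are held as Python dictionaries. The field names (`source_system`, `source_id`, `imported_on`, `status`) and both function names are hypothetical and are not taken from any particular repository platform.

```python
from datetime import date


def add_provenance(record, source_system, source_id, imported_on=None):
    """Return a copy of the record tagged with where and when it was imported."""
    tagged = dict(record)
    tagged["source_system"] = source_system  # hypothetical field name
    tagged["source_id"] = source_id          # identifier in the supplying system
    tagged["imported_on"] = (imported_on or date.today()).isoformat()
    return tagged


def in_press_report(records):
    """Return the records whose status field reads 'In Press' (case-insensitive)."""
    return [
        record for record in records
        if (record.get("status") or "").strip().lower() == "in press"
    ]
```

Keeping the supplying system's identifier on each record also gives a way to link back to that system if such a link is wanted.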
