achArXiv

The four arXiv plugins we created are available from EPrints Files:
ArXiv Import plugin - ArXiv API URL,
ArXiv Import plugin - ArXivID,
ArXiv Import plugin - Author name and
ArXiv Import plugin - XML Feed.

For further information on these plugins please read below.

Potentially, content in arXiv could provide a "quick win" for repository population. No arXiv depositor we have talked to so far has objected to our importing their work into WRRO. From discussions with arXiv users, we are assuming that local deposit in WRRO with a "push" of data to arXiv may be difficult to achieve - we'd need to demonstrate some clear benefit to the depositor. arXiv serves its community well. A more likely model may be that arXiv users continue to deposit as now but IRs "harvest" data from arXiv (or perhaps arXiv will develop a facility to push material into local IRs).

Before undertaking this exercise, though, it is worth considering:

Our experience of writing an EPrints plugin for arXiv is described below. Consideration of metadata issues raised by arXiv import is included in our Research Publication Metadata report.

Writing arXiv Plug-in
The arXiv plug-in works in a similar way to the PubMedID import plug-in. The first step is to identify an author and then to use the arXiv API to obtain data. An arXiv ID is entered into the text box (multiple IDs can be entered, with a carriage return after each ID) or uploaded from a file. The plug-in then retrieves all the citations associated with that ID (which identifies an author) and outputs them in XML format. A second import plug-in, PubMedXML, which has been modified to parse the arXiv XML file, extracts the citation metadata output by arXiv and fills in the relevant EPrints fields.

The metadata extracted with ease from arXiv are the author list, article title, year of publication and, where available, the DOI. ArXiv also contains a summary field, which is equivalent to an abstract. The journal_ref field, which contains the journal title, volume and issue numbers, has been difficult to disambiguate. This is because journal_ref is a free-text field in arXiv, so there is no consistency in the way the journal reference is written. In addition, journal titles are often abbreviated and the abbreviations are not always consistent. To overcome these difficulties, the Perl module Biblio::Citation::Parser::Jiao (written by Zhuoan Jiao and ported to the Biblio interface by Mike Jewell) was installed; it is also available from CPAN. The module parses citation metadata from a given reference, using a reference-template parser to extract the metadata.
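To make the field mapping concrete, here is a minimal sketch in Python (not the plug-in's Perl code) of pulling the fields listed above out of an arXiv API Atom response. The sample entry is invented for illustration, and the function names and the output dictionary keys are our own assumptions. Note that the year is taken from the Atom published date, which is the arXiv submission date rather than a journal publication date.

```python
import xml.etree.ElementTree as ET

# Namespaces used in arXiv API Atom responses.
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Invented sample entry for illustration only; not a real arXiv record.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/0000.00000v1</id>
    <published>2008-03-14T12:00:00Z</published>
    <title>An Example
      Title</title>
    <summary>  An example abstract.  </summary>
    <author><name>A. Name</name></author>
    <author><name>B. Other</name></author>
    <arxiv:doi>10.0000/example</arxiv:doi>
    <arxiv:journal_ref>J. Ex. Res. 12 (2008) 345</arxiv:journal_ref>
  </entry>
</feed>"""


def text_of(entry, path):
    """Return the whitespace-normalised text of the first match, or None."""
    node = entry.find(path, NS)
    if node is None or node.text is None:
        return None
    return " ".join(node.text.split())


def extract_entries(feed_xml):
    """Turn each Atom <entry> into a flat dict of citation fields."""
    root = ET.fromstring(feed_xml)
    records = []
    for entry in root.findall("atom:entry", NS):
        published = text_of(entry, "atom:published")
        authors = [
            " ".join(name.text.split())
            for name in entry.findall("atom:author/atom:name", NS)
            if name.text
        ]
        records.append({
            "title": text_of(entry, "atom:title"),
            "authors": authors,
            # Submission year on arXiv; may differ from the journal year.
            "year": published[:4] if published else None,
            "abstract": text_of(entry, "atom:summary"),
            # Optional fields: None when the depositor left them blank.
            "doi": text_of(entry, "arxiv:doi"),
            "journal_ref": text_of(entry, "arxiv:journal_ref"),
        })
    return records
```

Calling extract_entries(SAMPLE_FEED) gives one record whose journal_ref is still the raw free-text string; splitting that string into journal title, volume and issue is the part that needs a citation parser.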

The quality of the metadata available from arXiv depends on the diligence of the authors entering data into arXiv. Some records are very good whilst others are very poor, so the quality is not consistent. In addition, some fields that would have been very useful are often not filled in because they are not required. One of these is the affiliation field, which is usually left blank but would be very useful for identifying all authors from a particular institution. This would have provided an efficient method for bulk importing.

There are different variations of the arXiv import plug-in that can import items using either an arXiv ID, arXiv Author name or the XML (arXiv XML) output from a particular search in the arXiv API.

ArXiv ID

An arXiv ID is entered into the text box (multiple IDs can be entered, with a carriage return after each ID) or uploaded from a file (Figure 1). The plug-in then retrieves all the metadata associated with that ID (Figure 2).


Figure 1. ArXiv ID import text box in EPrints.


Figure 2. Metadata imported by the ArXiv ID import plug-in.
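As an illustration of the ID import, the sketch below (in Python, not the plug-in's Perl) turns text-box input into a single arXiv API request. It assumes the request uses the API's id_list parameter; the text does not say which request the plug-in actually sends, and the function names are ours.

```python
from urllib.parse import urlencode

API_BASE = "http://export.arxiv.org/api/query"


def ids_from_text(text):
    """Split text-box input into IDs, one per line, dropping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def id_query_url(ids):
    """Build one API request covering all IDs (assumed: id_list parameter)."""
    if not ids:
        raise ValueError("at least one arXiv ID is required")
    params = {
        "id_list": ",".join(ids),
        # Ask for as many results as IDs, since the API default is 10.
        "max_results": len(ids),
    }
    # Keep "," and "/" readable; old-style IDs contain a slash.
    return API_BASE + "?" + urlencode(params, safe=",/")
```

For example, input containing 0704.0001 and hep-th/9901001 on separate lines would produce a URL ending in id_list=0704.0001,hep-th/9901001&max_results=2.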

ArXiv XML

The ArXiv XML plug-in may be of benefit to users who want to search using the arXiv API and then save the resulting XML file. This file can either be uploaded or cut and pasted into the text box for import into EPrints (Figure 3).


Figure 3. ArXiv XML import plug-in text box.

ArXiv Author

The arXiv Author plug-in can import all items associated with an author. The search uses the author’s surname and can be made more specific by adding their initials (e.g. Name_A [Figure 4]). The actual query to the arXiv API would be http://export.arxiv.org/api/query?search_query=au:Name_A. However, this can return many false hits, because many people share the same surname or even the same surname and initials. A further issue is that initials are not used consistently in publications: a broad search returns more hits, some of which will be false positives, while a search that is too specific can miss some publications.


Figure 4. ArXiv Author import plug-in.

The search can also combine the author’s name with the institution, to identify only authors from that institution (e.g. Name AND Sheffield [Figure 5]). The arXiv API query would be http://export.arxiv.org/api/query?search_query=au:Name+AND+au:Sheffield. However, if the authors leave the affiliation field blank there is no way to identify the author uniquely, and the only way to confirm that all the publications belong to that author is to email them.


Figure 5. Importing items from arXiv using the author’s name and institutional affiliation with the ArXiv Author plug-in.
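The two query URLs quoted above can be generated from the search terms with a few lines of code. The following is a sketch in Python rather than the plug-in's own Perl; the function name is hypothetical, and joining the words of a term with an underscore simply follows the Name_A example.

```python
from urllib.parse import quote

API_BASE = "http://export.arxiv.org/api/query"


def author_query_url(*terms):
    """Build an au: search, AND-ing the terms together.

    Each term is searched in the author field, mirroring the example
    queries in the text (including the institution term).
    """
    parts = []
    for term in terms:
        # "Name A" becomes "Name_A", as in the example query.
        joined = "_".join(term.split())
        if joined:
            parts.append("au:" + quote(joined, safe=""))
    if not parts:
        raise ValueError("at least one search term is required")
    return API_BASE + "?search_query=" + "+AND+".join(parts)
```

author_query_url("Name A") reproduces the surname-plus-initial query, and author_query_url("Name", "Sheffield") reproduces the combined query. Because both terms go into the au: field, the second form matches any paper that has an author called Sheffield.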

A further issue with a combined name and affiliation search (for example Authorname AND Sheffield) is that a false positive is returned if another author on the paper is named Sheffield. Furthermore, the XML feed is set to import a maximum of 20 items even when there are more hits. When this occurs, a warning message tells users the actual number of hits for that search and that they can import all of them, if desired, by pasting the URL suggested in the warning message into the ArXiv API URL text box.

ArXiv API URL

It is also possible to search the arXiv API directly from EPrints using the arXiv API URL plug-in. The user must type in a correct query URL to search the API (see http://export.arxiv.org/api_help/ and http://export.arxiv.org/api_help/docs/user-manual.html#query_details). This returns a list of hits. The default output is 10 results; if there are more than 10 hits, the total is shown on the results page. The user can then change the max_results setting to the number of hits (e.g. http://export.arxiv.org/api/query?search_query=all:electron&start=0&max_results=10), scan the results, discard the false positives and import the genuine items. This method removes the separate step of going to the arXiv API, carrying out the search there and then deciding whether the results are genuine. Further work is planned to enhance this plug-in so that users can modify their query in light of the results and select or discard citations (by ticking the check box next to each hit) for import into EPrints.
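Changing the max_results setting by hand is easy to get wrong in a long URL. As a sketch of the same step done programmatically (Python, with a hypothetical function name; this is not part of the plug-in), the query string can be parsed, the one parameter replaced and the URL rebuilt:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_max_results(url, max_results):
    """Return the query URL with max_results set to the given number."""
    parts = urlsplit(url)
    # Keep every parameter except any existing max_results.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key != "max_results"
    ]
    params.append(("max_results", str(max_results)))
    # safe=":" leaves field prefixes such as "all:" readable.
    return urlunsplit(parts._replace(query=urlencode(params, safe=":")))
```

Applied to the example query with 95 as the new value, this returns the same URL ending in max_results=95. Spaces in the search terms come back encoded as "+", which the API treats the same way.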

ArXiv API - additional notes

The API can be called with the parameter 'max_results=0'. Whilst this may seem nonsensical at first (and returns nothing visible in a browser), the result does contain a valuable piece of information:

<opensearch:totalResults xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">95</opensearch:totalResults>
This tells us how many results the query produces (95 in this case), which we can return to the user to check that we will import a sensible set of data. We make use of this in both the author and API URL import methods.
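A minimal sketch of reading that count and deciding whether to warn the user is given below, in Python rather than the plug-in's Perl. The sample feed is a cut-down, invented response, the wording of the warning is made up, and the limit of 20 is the import maximum mentioned for the XML feed.

```python
import xml.etree.ElementTree as ET

OPENSEARCH_NS = "http://a9.com/-/spec/opensearch/1.1/"
IMPORT_LIMIT = 20  # maximum number of items imported in one go

# Cut-down, invented response to a query sent with max_results=0.
SAMPLE_COUNT_FEED = (
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<opensearch:totalResults xmlns:opensearch="' + OPENSEARCH_NS + '">'
    "95</opensearch:totalResults></feed>"
)


def total_results(feed_xml):
    """Return the number of hits reported by the feed, or None if absent."""
    root = ET.fromstring(feed_xml)
    node = root.find("{%s}totalResults" % OPENSEARCH_NS)
    if node is None or node.text is None:
        return None
    return int(node.text.strip())


def limit_warning(total, limit=IMPORT_LIMIT):
    """Return a warning string when the query has more hits than the limit."""
    if total is not None and total > limit:
        return "This search has %d hits; only the first %d will be imported." % (
            total,
            limit,
        )
    return None
```

With the sample feed, total_results returns 95, which is above the limit, so limit_warning returns a message; for a count of 20 or fewer it returns None.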

Upgrade to EPrints version 3.1

Please note that for EPrints version 3.1 the code needs a slight modification to convert spaces in the URL to encoded values, as EPrints 3.1 doesn't seem to like them very much. You need to add this code below line 48:

$pmid =~ s/^\s*(.*?)\s*$/$1/; #remove space from start and end of URL

$pmid =~ s/\s+/%20/g; #change spaces in URL to encoded values (EPrints doesn't seem to like them too much!)

You then need to restart Apache.