Chemoinformatics: an application domain for information retrieval techniques

Willett, P. (2004) Chemoinformatics: an application domain for information retrieval techniques. In: Järvelin, K., Allan, J., Bruza, P. and Sanderson, M. (eds.) Proceedings of the 27th International Conference on Research and Development in Information Retrieval. Annual ACM Conference, 25-29 July 2004, Sheffield, UK. New York: ACM Press, p. 393. ISBN 1-58113-881-4

Full text not available from this repository.


Chemoinformatics is the generic name for the techniques used to represent, store and process information about the two-dimensional (2D) and three-dimensional (3D) structures of chemical molecules [1, 2]. Chemoinformatics has attracted much recent prominence as a result of developments in the methods that are used to synthesize new molecules and then to test them for biological activity. These developments have resulted in a massive increase in the amounts of structural and biological information that is available to support discovery programmes in the pharmaceutical and agrochemical industries.

Chemoinformatics may appear to be far removed from information retrieval (IR), and there are indeed many significant differences, most notably in the use of graph representations to encode chemical molecules, rather than the strings that are used to encode text; however, there are also many similarities between the two fields, and this paper will exemplify some of these relationships. The most obvious area of similarity is in the principal types of database search that are carried out, with both application domains making extensive use of exact match, partial match and best match searching procedures: in the IR context these are known-item searching, Boolean searching and ranked-output searching; in the chemical context, these are structure searching, substructure searching and similarity searching. In IR, there is a natural distinction between an initial ranked-output search and one in which relevance feedback can be employed, where the keywords in the query statement are assigned weights based on their differential occurrences in known-relevant and known-nonrelevant documents. In the chemoinformatics technique called substructural analysis, substructural fragments are assigned weights based on their occurrence in molecules that do possess, and molecules that do not possess, some desired biological activity [3]. The analogy between relevance and biological activity has also resulted in the development of measures to quantify the effectiveness of chemical searching procedures that are based on the standard IR concepts of recall and precision [4].

Analogies such as these have provided the basis for some of the chemoinformatics research carried out in Sheffield. The starting point was the recognition that techniques applicable to documents represented by keywords might also be applicable to molecules represented by substructural fragments. This led directly to the introduction of similarity searching, something that is now a standard tool in chemoinformatics software systems; in particular, its use for virtual screening, i.e., the ranking of a database in order of decreasing probability of activity so as to maximize the cost-effectiveness of biological testing [5]. Measures of inter-molecular structural similarity also lie at the heart of systems for clustering chemical databases: just as IR has the Cluster Hypothesis (similar documents tend to be relevant to the same requests) as a basis for document clustering, so the Similar Property Principle (similar molecules tend to have similar properties) has led to clustering becoming a well-established tool for the organization of large chemical databases [6]. More recently, we have applied another IR technique, the use of data fusion to combine different rankings of a database, to chemoinformatics and again found that it is equally applicable in this new domain [7].

The many similarities between IR and chemoinformatics that have already been identified suggest that chemoinformatics is a domain of which IR researchers should be aware when considering the applicability of new techniques that they have developed.
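
As an illustration only, the following Python sketch shows how the similarity searching and data fusion described in the abstract might look when molecules are represented as sets of substructural fragments. The abstract does not name a similarity coefficient or a fusion rule, so the Tanimoto coefficient and the rank-sum rule used here are assumptions, and the fragment labels, molecule names and function names are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) coefficient between two fragment sets (assumed measure)."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty sets
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def similarity_search(
    query: FrozenSet[str], database: Dict[str, FrozenSet[str]]
) -> List[Tuple[str, float]]:
    """Rank every database molecule in decreasing order of similarity to the query."""
    scored = [(name, tanimoto(query, frags)) for name, frags in database.items()]
    # Ties are broken by name so that the ranking is deterministic.
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def fuse_by_rank_sum(rankings: List[List[str]]) -> List[str]:
    """Fuse several rankings by summing rank positions (assumed fusion rule).

    Assumes every ranking lists the same set of molecules.
    """
    totals: Dict[str, int] = {}
    for ranking in rankings:
        for position, name in enumerate(ranking, start=1):
            totals[name] = totals.get(name, 0) + position
    # A lower summed rank means a better fused position.
    return sorted(totals, key=lambda name: (totals[name], name))


# Hypothetical toy database: each molecule is a set of fragment labels.
database = {
    "mol_a": frozenset({"C=O", "c1ccccc1", "N"}),
    "mol_b": frozenset({"C=O", "O"}),
    "mol_c": frozenset({"S", "Cl"}),
}
query = frozenset({"C=O", "c1ccccc1"})

ranked = similarity_search(query, database)
fused = fuse_by_rank_sum(
    [[name for name, _ in ranked], ["mol_b", "mol_a", "mol_c"]]
)
```

Here `ranked` plays the role of a ranked-output search over the database, and `fused` combines it with a second, hypothetical ranking of the same molecules. A real system would derive the fragment sets from the chemical structures themselves rather than from hand-written labels.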

Item Type: Proceedings Paper
Institution: The University of Sheffield
Academic Units: The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User: Information Studies
Date Deposited: 25 Mar 2009 09:49
Last Modified: 19 May 2009 17:19
Published Version: http://dx.doi.org/10.1145/1008992.1008994
Status: Published
Publisher: New York: ACM Press
Refereed: No
Identification Number: 10.1145/1008992.1008994
URI: http://eprints.whiterose.ac.uk/id/eprint/8398
