A fast algorithm for selecting sets of dissimilar molecules from large chemical databases

Abstract

Current algorithms for the selection of a set of n dissimilar molecules from a dataset of N molecules have an expected time complexity of O(n²N). This paper describes an improved algorithm that has an expected time complexity of O(nN) and that will identify exactly the same set of molecules as the normal algorithm if the cosine coefficient is used for the calculation of the inter-molecular (dis)similarities. The algorithm is applicable to any type of representation that characterises a molecule by a set of attribute values and to any procedure that involves calculating a sum of inter-molecular similarities. It is also both more effective and more efficient than our implementation of a genetic algorithm for the selection of maximally-dissimilar sets of molecules.
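
The abstract does not give the mechanics of the algorithm, so the following is only an illustrative sketch of why a sum of cosine similarities can be computed in O(nN) rather than O(n²N). It assumes that each molecule is a non-zero attribute vector, that selection is greedy, and that each step picks the candidate with the smallest summed cosine similarity to the molecules already chosen. Because the cosine coefficient of two unit-length vectors is their dot product, the sum over the selected set equals a single dot product with the running sum of the selected unit vectors. The function names, the seed choice and the example data are hypothetical and are not taken from the paper.

```python
import math


def normalise(v):
    """Scale a non-zero attribute vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def select_naive(vectors, n, seed=0):
    """Greedy selection that re-sums the cosine similarity to every
    selected molecule at each step: O(n^2 N) similarity calculations."""
    unit = [normalise(v) for v in vectors]
    selected = [seed]
    while len(selected) < n:
        best, best_score = None, None
        for i, u in enumerate(unit):
            if i in selected:
                continue
            score = sum(dot(u, unit[j]) for j in selected)
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


def select_centroid(vectors, n, seed=0):
    """Same greedy rule, but one dot product per candidate per step
    against the running sum of selected unit vectors: O(nN)."""
    unit = [normalise(v) for v in vectors]
    selected = [seed]
    chosen = {seed}
    running_sum = list(unit[seed])  # sum of the selected unit vectors
    while len(selected) < n:
        best, best_score = None, None
        for i, u in enumerate(unit):
            if i in chosen:
                continue
            score = dot(u, running_sum)  # equals the sum of cosines
            if best_score is None or score < best_score:
                best, best_score = i, score
        selected.append(best)
        chosen.add(best)
        running_sum = [c + x for c, x in zip(running_sum, unit[best])]
    return selected


# Hypothetical attribute vectors, one per molecule.
example = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.2],
    [0.1, 0.0, 1.0],
    [0.5, 0.5, 0.5],
]
```

Calling `select_naive(example, 3)` and `select_centroid(example, 3)` on this data yields the same selection, which mirrors the abstract's claim that the faster procedure identifies the same set when the cosine coefficient is used. With real data, floating-point rounding could break exact ties differently in the two versions, so the agreement holds up to tie-breaking.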

Metadata

Item Type:	Article
Authors/Creators:	Holliday, J.D.; Ranade, S.S.; Willett, P.
Keywords:	Algorithmic complexity; Compound selection; Dissimilarity selection; Random screening; Similarity coefficient
Dates:	Published: 1995
Institution:	The University of Sheffield
Academic Units:	The University of Sheffield > Faculty of Social Sciences (Sheffield) > Information School (Sheffield)
Depositing User:	Information Studies
Date Deposited:	26 Aug 2009 11:04
Last Modified:	26 Aug 2009 11:04
Published Version:	http://www3.interscience.wiley.com/journal/1133236...
Status:	Published
Publisher:	Wiley
Open Archives Initiative ID (OAI ID):	oai:eprints.whiterose.ac.uk:9237
