What is Skoal

Skoal is a demo version of a web-service that allows you to perform similarity search in a database of chemical compounds. This web-service is created on the basis of Similarity Searchable DBMS CloudMSW (pdf) technology developed by MeraLabs company.

The main distinctive feature of Skoal system is the ability to find chemical compounds by the size of the common substructure.

The purpose of Skoal

The main purpose of Skoal is to demonstrate the ability to store and find objects described by non-trivial structures by defining distance metrics between the objects in the CloudMSW system.

The web-server allows you to use a fragment of a chemical compound to find all chemical compounds which contain this fragment or the compounds which are structurally similar to the given fragment.

No other search system is able to perform such search today and there is a number of reasons why they can’t.
Firstly, the problem of measuring structural similarity has a high computational complexity. Secondly, a set of molecules along with a distance function defined on this set produce a high dimension metric space. Finally, the computational complexity of creating a search index for a non-linear metric space is in exponential dependence on the space dimension. With this index, however. the search process would degenerate into an exhaustive (or “brute-force”) search when the algorithm has to calculate the distance between the search query and every data element in the database.

Therefore Skoal system doesn’t create traditional indexes to perform search, instead it uses CloudMSW technology which allows finding and adding data elements to the metric space using the algorithms which have logarithmic complexity dependence on the number of data elements and linear complexity dependence on the metric space dimension.

How Skoal works

To work with chemical compound data in Skoal, we defined two types of objects and two types of distance metrics in CloudMSW.

The first type of Skoal objects are structural molecular graphs, which are compared using a distance metrics based on selection of the maximum common isomorphic subgraph structure for a pair of molecules.

The distance between two molecular graphs G1 and G2 is defined by the following function-metric:


where V(G1,G2) and E(G1,G2) are the number of vertices and edges in the maximum common subgraph respectively.
Figure 1 shows the molecular structures and the distance between them, calculated according to the above-defined metric.

Figure 1

The second type of objects is molecular fingerprints – a bit string, where every bit corresponds to a fixed structural invariant (Fig. 2):

Figure 2

In order to determine the distance between two bit strings (fingerprints) the Tanimoto metric is calculated:
, where

  • “a” – the number of nonzero bits in the first molecule’s fingerprint

  • “b” – the number of nonzero bits in the second molecule’s fingerprint

  • “c” – the number of common nonzero bits for both molecules.


Figure 3 shows the results of tanimoto distance calculation for different molecules.

Figure 3

Conclusions

The Skoal system uses objects and metrics created specially to operate with chemical compound data. If you want to create a high-tech product which manipulates non-trivial objects using a notion of distance you would want to use Similarity Searchable DBMS CloudMSW (pdf) .