The Biopool platform is a biological sample search and location service for a network of biobanks. Biopool allows member biobanks to share their collections of histological images together with the associated clinical and histological information. This interconnected network of information sources has great potential in medical research, in education and as a tool for diagnosis support.
At present, researchers must contact several biobanks before finding one that can provide samples of the diseases they want to study, and there is no agile mechanism to ease these searches. Biopool's aim is to change this procedure by providing a tool that searches for biological samples similar to a reference sample submitted to the search engine. Biopool will return the set of relevant samples ranked by similarity, together with their location and diagnosis. In this way, researchers will have quicker access to many more samples, which will help shorten both research and diagnosis times while reducing the associated costs.
The Biopool platform is accessed through the Web and supports searching both by image content and by text. The image search allows users to select areas of interest and, for some types of cancer, extracts pathological information automatically.
To bring the network into operation, the different biobanks are currently being interconnected via standardised communication protocols. The reference standard eases the acquisition, storage and exchange of images (and digitised samples) and their associated data. It is aligned with the international effort to standardise the image formats that make up the image repository on the Web.
Tecnalia has a two-fold role: it is participating in the definition of the system's architecture, and it is responsible for the image search engine.
How does Biopool work?
The Biopool system is made up of the following elements:
- Image and text search engines: these are the brain of the platform. The image search system is responsible for obtaining a complete set of image features that describe the image at different levels of detail, so that they can later be used in image similarity searches. For text searches, the aim is to organise the text information so that it can be searched.
- Software infrastructure: this is the platform's backbone, comprising distributed storage, the communication protocols and standards to be used, a single indexing mechanism and the management of Web services.
- User interface: its aim is to give users access to the set of tools provided by Biopool, such as search, visualisation, image processing and the uploading of new data. It is also the access portal through which members of the biobank network upload sample information to the Biopool system.
What makes an image search possible?
An image similarity search requires two different procedures: the indexing/description of the image and the retrieval for the search.
The indexing/description of an image is a complex, multi-step process that tries to find the set of features that best captures the information contained in the image. To do so, the image is divided into smaller, same-sized regions, which permits the features and variations of each area to be analysed separately. Different algorithms are then applied to each region to extract features and create descriptors. These algorithms analyse different image properties, such as colour, texture, or the borders and contours of the elements. To analyse colour characteristics, the image is converted between colour spaces (RGB, grey level, HSI, L*a*b*, etc.) and values such as the mean, variance or histogram are calculated. Texture features are measured using operators such as the Local Binary Pattern (LBP), co-occurrence matrices or Tamura features, which take into account the pixels neighbouring the one being calculated and weight them according to the value being measured. Border features are calculated using operators based on image gradients, such as Sobel, Kirsch or Canny, as well as others based on filter functions or wavelets, such as Gabor, Haar or Daubechies. Additionally, there are other approaches, such as SIFT, SURF and their variants, which describe the region in a way that is invariant to scale and rotation and robust to changes in lighting or noise.
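The region-based description above can be illustrated with a minimal sketch. It is not Biopool's implementation: it assumes a grey-level image stored as a list of rows, uses only the Python standard library for clarity, and the function names (split_into_tiles, lbp_code, describe_region) are hypothetical. It computes two of the feature families mentioned, colour statistics (mean and variance) and a basic 8-neighbour LBP texture histogram.

```python
from statistics import fmean, pvariance

def split_into_tiles(image, tile):
    """Yield tile x tile regions of a 2-D grey-level image (list of rows).
    Pixels at the edges that do not fill a whole tile are ignored."""
    rows, cols = len(image), len(image[0])
    for r in range(0, rows - tile + 1, tile):
        for c in range(0, cols - tile + 1, tile):
            yield [row[c:c + tile] for row in image[r:r + tile]]

def lbp_code(region, r, c):
    """Basic 8-neighbour Local Binary Pattern code of interior pixel (r, c)."""
    centre = region[r][c]
    # Neighbours are visited clockwise, starting at the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if region[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code

def describe_region(region):
    """Return [mean, variance] followed by a normalised 256-bin LBP histogram."""
    pixels = [p for row in region for p in row]
    hist = [0.0] * 256
    count = 0
    # Border pixels are skipped because they lack a full neighbourhood.
    for r in range(1, len(region) - 1):
        for c in range(1, len(region[0]) - 1):
            hist[lbp_code(region, r, c)] += 1
            count += 1
    if count:
        hist = [h / count for h in hist]
    return [fmean(pixels), pvariance(pixels)] + hist

# Usage: describe every 4 x 4 tile of a small synthetic image.
image = [[(r * c) % 7 for c in range(8)] for r in range(8)]
descriptors = [describe_region(t) for t in split_into_tiles(image, 4)]
```

Each tile yields one fixed-length descriptor (258 numbers here), so regions can be compared with one another regardless of where they sit in the image. Gradient-based or wavelet-based features would be appended to the same vector in the same way.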
A second phase uses this information to create a dictionary of visual words (Bag of Words, BoW) based on clustering techniques. Finally, the information obtained in the earlier step is quantised by means of histograms that count the occurrences of each type of word in each image region. As a result, each image is described by a series of numerical indices that are ultimately stored in a vector of N numbers (indexing).
Figure 1: Extraction process of image descriptors
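The dictionary and histogram steps can be sketched as follows. The text only says that clustering techniques are used, so k-means is an assumption here, as is the deterministic farthest-point seeding (chosen so that the example is reproducible). All function names are hypothetical, and the descriptors are toy two-dimensional values rather than real image features.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_word(centres, v):
    """Index of the visual word (cluster centre) closest to descriptor v."""
    return min(range(len(centres)), key=lambda i: sq_dist(centres[i], v))

def build_dictionary(descriptors, k, iterations=20):
    """Cluster descriptors into k visual words with plain k-means."""
    # Assumed seeding: start from the first descriptor, then repeatedly
    # add the descriptor farthest from the centres chosen so far.
    centres = [list(descriptors[0])]
    while len(centres) < k:
        far = max(descriptors,
                  key=lambda d: min(sq_dist(c, d) for c in centres))
        centres.append(list(far))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest_word(centres, d)].append(d)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centre
                centres[i] = [sum(col) / len(group) for col in zip(*group)]
    return centres

def bow_histogram(centres, descriptors):
    """Normalised count of how often each visual word occurs."""
    hist = [0.0] * len(centres)
    for d in descriptors:
        hist[nearest_word(centres, d)] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Usage: learn a 2-word dictionary, then index one image's descriptors.
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vocabulary = build_dictionary(training, k=2)
index_vector = bow_histogram(vocabulary, [[0, 0], [1, 1], [0, 1], [10, 10]])
```

The resulting index_vector is the "vector of N numbers" for one image, with N equal to the dictionary size. In practice the dictionary would be much larger and would be learned once from descriptors sampled across the whole collection.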
The retrieval process is launched when the user submits an image search in order to obtain the images that are most similar to the reference. A good image description/indexing is key to generating correct search results, as the search engine works directly with the previously generated indices. When a new search is carried out, the query image is analysed in the same manner as the database images, yielding a series of numerical indices that describe it. The vectors that represent the image are then compared with the whole set of vectors available in the database. Several techniques, with different objectives, exist for doing so. The most classical one calculates a distance (Chi-Square, Mahalanobis, …) to each of the database images in order to determine which are the closest. This is one of the most accurate approaches, as it analyses the whole space exhaustively and can therefore find similar images very precisely. This precision can be increased further by adapting the metric that is used, by means of a technique known as metric learning. Algorithms of this type (for example ITML, Information-Theoretic Metric Learning) adapt a generic Mahalanobis metric to a particular case and to the visual descriptors being used. Even so, in both cases the complexity of the search is linear, O(n), so the approach is not very practical for large quantities of images. In those cases, other techniques such as k-d trees must be used, which calculate these distances approximately or by more efficient means, so that the complexity of the search is sublinear and the system can scale.
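The difference between an exhaustive scan and a tree-based search can be shown with a small sketch. It assumes Euclidean distance and a simple tuple-based k-d tree; the function names and the toy two-dimensional vectors are illustrative only. This particular tree search is exact rather than approximate: it saves work by skipping branches that cannot contain a closer vector.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def linear_nearest(points, query):
    """Exhaustive O(n) scan: compare the query with every stored vector."""
    return min((sq_dist(vec, query), label) for vec, label in points)

def build_kdtree(points, depth=0):
    """Build a k-d tree from (vector, label) pairs as nested tuples."""
    if not points:
        return None
    axis = depth % len(points[0][0])  # cycle through the dimensions
    points = sorted(points, key=lambda p: p[0][axis])
    mid = len(points) // 2  # the median becomes the splitting node
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1),
            axis)

def kd_nearest(node, query, best=None):
    """Return (squared_distance, label) of the stored vector closest to query."""
    if node is None:
        return best
    (vec, label), left, right, axis = node
    d = sq_dist(vec, query)
    if best is None or d < best[0]:
        best = (d, label)
    diff = query[axis] - vec[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = kd_nearest(near, query, best)
    # The far side is visited only if the splitting plane lies closer
    # than the best match found so far; otherwise it is pruned.
    if diff ** 2 < best[0]:
        best = kd_nearest(far, query, best)
    return best

# Usage with toy 2-D index vectors (labels are placeholders).
points = [([0.1, 0.9], "a"), ([0.8, 0.2], "b"), ([0.5, 0.5], "c"),
          ([0.9, 0.9], "d"), ([0.2, 0.1], "e")]
tree = build_kdtree(points)
match = kd_nearest(tree, [0.75, 0.25])
```

Both functions return the same neighbour, but the tree version inspects only part of the data on average. The pruning becomes less effective as the number of dimensions grows, which is one reason approximate methods are often preferred for long descriptor vectors.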
As previously mentioned, the query image is compared with the database by means of its mathematical description in order to find the closest values. The closer the values, the more similar the images, and so the most similar images appear first in the search results.
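A minimal sketch of this ranking step, assuming each database image is stored as a normalised histogram and that the Chi-Square distance is used. The form of the distance shown is one common variant, and the index contents and sample names are invented purely for illustration.

```python
def chi_square(h1, h2, eps=1e-12):
    """Chi-Square distance between two histograms (0 means identical).
    eps avoids division by zero when a bin is empty in both histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b + eps) for a, b in zip(h1, h2))

def rank_by_similarity(query, index, top_k=3):
    """Return the top_k (name, distance) pairs, closest first."""
    scored = sorted((chi_square(query, vec), name)
                    for name, vec in index.items())
    return [(name, dist) for dist, name in scored[:top_k]]

# Usage: a hypothetical index of three samples with 3-bin histograms.
index = {
    "sample_A": [0.7, 0.2, 0.1],
    "sample_B": [0.1, 0.2, 0.7],
    "sample_C": [0.6, 0.3, 0.1],
}
results = rank_by_similarity([0.65, 0.25, 0.10], index)
```

The first entry of results is the stored image whose description is closest to the query, which is the order in which results would be presented to the user.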
As has been shown, visual descriptors and their use both require a precise understanding of how they operate and of their advantages over the alternatives. Only in this way can a similarity search system be obtained that is efficient and valid for real problems in industry. This area of expertise is precisely the key strength of the Computer Vision group at Tecnalia.
NOTE: The Biopool project (grant agreement 296162) is being carried out under the Seventh Framework Programme of the European Commission, with the collaboration of 7 partners from four different countries (Spain, France, the United Kingdom and the Netherlands). The project's web page, where the latest advances can be seen, is http://www.biopoolproject.eu. The project began in September 2012 and is due to end in August 2014, an expected duration of two years.