Some efficient access methods for point objects include hB-tree [9] and TV-tree [8]. R-tree [6] and its variants, though originally designed as spatial access methods, work with point data as well. All these structures use a hierarchical structure and all data objects are stored in the leaf level of the tree. The fundamental idea for spatial indexing of non-point objects is the use of approximations. The most common approximation for non-point objects used by spatial access methods is the Minimum Bounding Rectangle. Here in this paper, we mainly concentrate on point access methods.
In context of content based image retrieval, image features such as color, texture, shape, etc, are extracted and mapped to a feature vector, say
. For example, color feature of an image can be represented as a histogram, where number of pixels in each color level (say 256 levels) of each primary color (red, green and blue) are represented. For storage and indexing purposes, we use n-dimensional vector notation for histograms, where n is the number of color levels in the histogram. For efficiency, these feature vectors are pre-computed and stored. Then, each feature vector is treated as a point in n-dimensional space, and an n-dimensional point access method is used for indexing. Most of the existing point access methods work satisfactorily for lower dimensions, but suffer from ``dimensionality curse" when applied to high dimensional feature vectors. The main cause for ``dimensionality curse" can be described as follows. When the dimensionality increases the size of the feature vector grows rapidly. As a result, the fan-out of the non-leaf nodes of the index structure decreases, and it reduces to sequential scanning. In this paper, we propose a point access method for spatial databases, which could be easily adopted for high dimensional feature vectors in image databases and other similar high dimensional applications.
In the proposed method, we use a mapping function to linearize non-overlapping regions of n-dimensional space onto linear space. We utilize a class of mapping functions, each of which is used in different levels of the structure. A member function of this class is defined in terms of several parameters. These mapping functions appear at non-leaf branches of the structure. Data objects or pointers to data objects are kept at leaf level. This linearization method encompasses the ``dimensionality curse" problem by introducing a novel architecture for the non-leaf nodes of the structure, where subsequent mapping functions are kept. Unlike any other point access method, the size of the mapping function does not depend on the dimensionality, hence the fan-out is kept as high as possible, regardless of the dimension of the feature vector.
When designing this multi-dimensional linearization method, we give our attention to two factors. First, we design a hibrid technique for overflow handling, so the structure can grow vertically and horizontally in a dynamic way. Second, we make our mapping functions adaptable to a certain extent, according to the distribution of the objects to provide a uniform distribution of objects among the buckets.
We may regard the proposed method as a multi-level dynamic linearization structure. At the beginning, it contains only one mapping function and one set of data buckets. When, one of these data buckets is overflowed due to insertions, an overflow bucket is attached to the data bucket. However, if further overflows occur, the system tries to reorganize the structure. In such a reorganization, the mapping function is replaced with a new mapping function, which maps data into more buckets than earlier. Data in the old buckets are mapped using the new function to a newly created set of buckets. The old mapping function is replaced with the new one, and old buckets are disposed from the structure. Such an expansion is called a longitudinal expansion. However, longitudinal expansion becomes expensive, when the volume of data movements exceed a certain limit. If it is not feasible, a new mapping function is introduced to the structure to re-map data in the overfilled bucket only. The overfilled bucket is replaced with a new mapping function and a new set of data buckets. The new mapping function is stored in the old data bucket. This type of growth is referred to as lateral expansion. Lateral expansion always results in an increase of the search length for objects, whereas longitudinal expansion does not. The system always prefers to use longitudinal expansion over lateral expansion as far as the cost of the expansion is under the bounds.
When an overflow occurs, it has to decide whether to apply a lateral expansion or a longitudinal expansion. It depends on the factors such as number of buckets in the sequence already expanded, total number of buckets in the sequence and depth of overfilled bucket. A decision metric is used to choose the most suitable expansion. An illustration on how this structure grows dynamically will appear later in the paper.
If objects are distributed uniformly, the structure tends to grow horizontally, resulting in shallow search paths. On the other hand, if the distribution is skewed, the structure tends to grow vertically on highly populated branches. The mapping function, being adaptable, absorbs deviations in data distribution. Hence, we can expect a uniform distribution in mapped data, though original data are skewed.
The remainder of the paper is organized as follows. Section 2 discusses some existing indexing structures for spatial databases. The mechanism of the proposed indexing structure is described in section 3. Section 4 contains the algorithms developed for manipulating it and section 5 presents performance results. Section 6 gives conclusions and directions for further research.