random() function in C is used for generating these data. For all the experiments, we kept the bucket size fixed to 5 objects per bucket and maximum number of buckets in a sequence to 64.
In the insertions experiments, we measure number of disk accesses made and number of buckets created during insertion process. When counting disk accesses for sequence of data buckets, we count them as individual bucket accesses, though they may be stored in contiguous disk blocks. Therefore, the real access time could be much less than the results recorded. These measurements were taken after each 10 000 insertions for first 100 000 insertions and then after each 100 000 insertions up to 3 million insertions. In Figure 4 we plot the cost of insertion (number of disk accesses) as a function of number of objects.
Then, the percent storage utilization is calculated and plotted against number of objects in Figure 5.
For the exact match search performance analysis, we consider two types of queries; successful searches and unsuccessful searches. For the first type of queries, we chose a sample of 1000 points from the database population and used them as query points. The cost (number of disk accesses) of each of these 1000 queries were measured and averaged. For the second type of queries, we generated the data explicitly and ran the queries. This process was repeated for 1000 unsuccessful searches. The results were plotted on Figure 6. We see that, the cost of unsuccessful exact match searches are always higher than the cost of successful searches. In unsuccessful queries, the overflow buckets (if exists) have to be accessed for any search whereas, in successful queries, the search terminates as soon as the object is found. The difference between these two results depends on the number of immediate overflow buckets used in the system. (In these experiments, we limited it to 1). The minimum insertion cost is 3 (disk accesses) and it does not increase rapidly with the database size, even for 3 million objects, it remains under 4.
We tested the performance of the k-nearest neighbor algorithm with neighborhood size(k) is set to 5, 10, 20 and 40. We used randomly generated points as queries. For each database size and for each neighborhood size, we did the experiment 1000 times, and measured the number of disk accesses required during the search and averaged the results. The results are presented in Figure 7.
The oscillating behaviour of these curves can be well described as a consequence of aging of the structure over its dynamic expansion [10]. It is an interesting problem to study: the aging behaviour of the structure over the lateral and longitudinal expansions.
On comparison with well known R*-tree structure [1], our results indicate that the structure performs better both in terms of search performance and storage utilization. Another remarkable feature is the insertion performance. For small database sizes, the search performance shows an oscillating behaviour, as a result of aging of the structure. However, as database grows, the impact of aging become less important and search cost starts to increase gradually. The degree of the increase of search costs is reasonably low (in linear order), as compared to other access methods. This is a good indicator for the absence of ``dimensionality curse" in our structure. The same phenomena applies to storage utilization as well. As database size grows, the probability of longitudinal expansions tend to decrease and as a result, the structure starts to grow vertically. Creation of more branches leads to leave more under-filled nodes, forcing a low storage utilization. Anyway, all these features depend on the decision metric used and both storage utilization and search costs are good indicators for the goodness of the decision metric used.