How good are similarity measures across distributed collections?

Information retrieval techniques are being used across distributed document collections. Queries are dispatched to the collections most likely to hold relevant documents, and the documents returned by each of these collections are then marshalled in some coherent fashion before the results are presented to the user. The aim of this empirical study is to gauge the usefulness of similarity measures computed on separate collections for two tasks: identifying the collections that are likely to satisfy the user's information need, and subsequently merging the ranked lists of documents returned by these collections. Our results indicate that directly comparing similarity measures computed on separate collections using collection-dependent weights can be very undesirable.
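
As an illustration of why collection-dependent weights can make scores hard to compare, the following is a minimal sketch, assuming a simple tf-idf weighting with a smoothed inverse document frequency. This is not necessarily the weighting used in the study. The toy collections and the helpers `idf` and `score` are hypothetical and exist only for this example.

```python
import math

def idf(term, collection):
    """Smoothed inverse document frequency of `term` within one collection."""
    n_docs = len(collection)
    doc_freq = sum(1 for doc in collection if term in doc)
    return math.log((n_docs + 1) / (doc_freq + 1)) + 1.0

def score(query, doc, weights):
    """Sum, over query terms, of term frequency in `doc` times the term weight."""
    return sum(doc.count(term) * weights[term] for term in query)

# Two toy collections (hypothetical data); each document is a list of tokens.
# The query term is common in coll_a and rare in coll_b.
coll_a = [["retrieval", "ranking"], ["retrieval", "index"], ["retrieval", "query"]]
coll_b = [["retrieval", "ranking"], ["football", "score"], ["weather", "report"]]

query = ["retrieval"]
doc_a, doc_b = coll_a[0], coll_b[0]  # textually identical documents

# Collection-dependent weights: each collection uses its own statistics.
local_a = score(query, doc_a, {t: idf(t, coll_a) for t in query})
local_b = score(query, doc_b, {t: idf(t, coll_b) for t in query})

# Global weights: statistics taken over the union of both collections.
merged = coll_a + coll_b
global_weights = {t: idf(t, merged) for t in query}
global_a = score(query, doc_a, global_weights)
global_b = score(query, doc_b, global_weights)

print("local: ", local_a, local_b)
print("global:", global_a, global_b)
```

In this toy setup the two identical documents receive different scores under collection-dependent weights and equal scores under global weights, so a merged ranking built directly from the local scores would separate them for reasons unrelated to their content.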