A Testbed for Evaluating Indonesian Text Retrieval


Jelita Asian, Hugh E. Williams and S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia
jelita@cs.rmit.edu.au, hugh@cs.rmit.edu.au, saied@cs.rmit.edu.au


This website is published in support of our recent short paper with the same title. This page contains the document collection, the queries, and the relevance judgements used in the paper.

A Brief Abstract

Indonesia is the fourth most populous country and a close neighbour of Australia. However, despite media and intelligence interest in Indonesia, little work has been done on evaluating Information Retrieval techniques for Indonesian, and no standard testbed exists for such a purpose. An effective testbed should include a collection of documents, realistic queries, and relevance judgements. The TREC and TDT testbeds have provided such an environment for the evaluation of English, Mandarin, and Arabic text retrieval techniques. The NTCIR testbed provides a similar environment for Chinese, Korean, Japanese, and English. This paper describes an Indonesian TREC-like testbed we have constructed and made available for the evaluation of ad hoc retrieval techniques. To illustrate how the test collection is used, we briefly report the effect of stemming for Indonesian text retrieval, showing that, as for English, it has little effect on accuracy.

The Corpus

The documents come from the popular online newspaper Kompas and were published between January and June 2002 inclusive. The collection contains 3,000 documents, and we formed twenty queries covering a range of the topics that appear in it. Both the document collection and the queries follow the standard TREC-like ad hoc formats. Relevance judgements were made manually by reading every document and deciding whether it matches any of the queries. Relevance is binary, so a document is either relevant (1) or not relevant (0) to a query. The ground truth is stored in the trec_eval format.
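
As an illustration of how binary judgements of this kind are typically consumed, the following minimal Python sketch parses qrels lines in the usual trec_eval layout (query ID, iteration, document ID, relevance) and computes average precision for a single ranked list. The document IDs, the sample judgements, and the function names are hypothetical and are not taken from the collection; for real experiments the trec_eval tool itself should be used.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse trec_eval-style qrels lines: 'query_id iteration doc_id relevance'."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, doc_id, relevance = parts
        if int(relevance) > 0:  # binary judgements: 1 = relevant, 0 = not relevant
            relevant[query_id].add(doc_id)
    return relevant


def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision of one ranked list against a set of relevant documents."""
    if not relevant_doc_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / len(relevant_doc_ids)


# Hypothetical judgements and ranking, for illustration only.
sample_qrels = [
    "1 0 DOC-A 1",
    "1 0 DOC-B 0",
    "1 0 DOC-C 1",
]
qrels = parse_qrels(sample_qrels)
ranking = ["DOC-A", "DOC-B", "DOC-C"]
ap = average_precision(ranking, qrels["1"])
print(f"Average precision for query 1: {ap:.4f}")
```

Averaging this value over all twenty queries would give mean average precision, one of the measures commonly reported for ad hoc retrieval runs.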


This corpus was compiled by Jelita Asian in collaboration with Hugh E. Williams and S.M.M. Tahaghoghi. Permission to use this corpus is granted as long as the names of these authors are mentioned in any publications derived from work that uses it. The citation for this work is:

Jelita Asian, Hugh E. Williams, and S.M.M. Tahaghoghi. A Testbed for Indonesian Text Retrieval. In Peter Bruza, Alistair Moffat, and Andrew Turpin (editors), Proceedings of the 9th Australasian Document Computing Symposium (ADCS 2004), Melbourne, Australia, 55-58, 13 December 2004. ISBN: 0 975 71720 0

Jelita Asian, 9 July 2006.