A Testbed for Evaluating Indonesian Text Retrieval
Jelita Asian, Hugh E. Williams and S.M.M. Tahaghoghi
School of Computer Science and Information Technology
RMIT University, GPO Box 2476V, Melbourne 3001, Australia
jelita@cs.rmit.edu.au, hugh@cs.rmit.edu.au, saied@cs.rmit.edu.au
This website is published in support of our recent short paper with the
same title. This page contains the document collection, the queries, and
the relevance judgements used in the paper.
A Brief Abstract
Indonesia is the world's fourth most populous country and a close
neighbour of Australia.
However, despite media and intelligence interest in Indonesia, little
work
has been done on evaluating Information Retrieval techniques for
Indonesian, and no standard testbed exists for such a purpose.
An effective testbed should include a collection of documents,
realistic
queries, and relevance judgements.
The TREC and TDT testbeds have provided such an environment for the
evaluation of English, Mandarin, and Arabic text retrieval techniques.
The NTCIR testbed provides a similar environment for Chinese, Korean,
Japanese, and English.
This paper describes an Indonesian TREC-like testbed we have
constructed and
made available for the evaluation of ad hoc retrieval techniques.
To illustrate how the test collection is used, we briefly report the
effect of stemming on Indonesian text retrieval, showing that, as with
English, it has little effect on accuracy.
The Corpus
The documents come from the popular online newspaper Kompas and are
dated from January to June 2002 inclusive. The collection contains 3,000
documents, and twenty queries were formed covering a range of topics
found in the collection. The document collection and the queries are
formatted in standard TREC-like ad hoc formats. Relevance judgements
were made manually by reading each document to see whether it matches
any of the queries. Relevance is considered to be binary, so a document
is either relevant (1) or not relevant (0) to a query. The ground truth
is formatted in the trec_eval format.
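As a minimal sketch of how binary judgements in the trec_eval format can be used, the Python below parses qrels lines of the form "query_id iteration docno relevance" and computes average precision for one ranked list. The function names, document identifiers, and sample judgements are hypothetical and are not taken from the collection; trec_eval itself remains the reference tool for evaluation.

```python
from collections import defaultdict


def parse_qrels(lines):
    """Parse qrels lines of the form '<query_id> <iteration> <docno> <relevance>'.

    Returns a dict mapping each query id to the set of docnos judged relevant.
    """
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        query_id, _iteration, docno, relevance = parts
        if int(relevance) > 0:  # binary judgements: 1 = relevant, 0 = not
            relevant[query_id].add(docno)
    return dict(relevant)


def average_precision(ranked_docnos, relevant_docnos):
    """Average precision of one ranked list against a set of relevant docnos."""
    if not relevant_docnos:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, docno in enumerate(ranked_docnos, start=1):
        if docno in relevant_docnos:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant rank
    return precision_sum / len(relevant_docnos)


# Hypothetical sample data (assumed identifiers, not from the real qrels file).
sample_qrels = [
    "1 0 DOC-A 1",
    "1 0 DOC-B 0",
    "1 0 DOC-C 1",
    "2 0 DOC-B 1",
]
qrels = parse_qrels(sample_qrels)
ranking_for_query_1 = ["DOC-A", "DOC-B", "DOC-C"]
ap = average_precision(ranking_for_query_1, qrels["1"])
print(f"Average precision for query 1: {ap:.4f}")
```

Averaging this value over all twenty queries would give mean average precision for a run, which is one way to compare, for example, stemmed and unstemmed retrieval.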
This corpus was compiled by Jelita Asian in collaboration with
Hugh E. Williams and S.M.M. Tahaghoghi. Permission to use this corpus is
granted provided that the names of these authors are mentioned in any
publications derived from work that uses it. The citation for this work is:
Jelita Asian, Hugh E. Williams, and S.M.M. Tahaghoghi.
A Testbed for Indonesian Text Retrieval.
In Peter Bruza, Alistair Moffat, and Andrew Turpin (editors),
Proceedings of the 9th Australasian Document Computing Symposium (ADCS 2004),
pages 55-58, Melbourne, Australia, 13 December 2004.
ISBN: 0 975 71720 0
Jelita Asian, 9 July 2006.