Graphing. I have assembled a collection of examples of jgraph scripts and other information about jgraph.
Random numbers. The code in this zip file generates random numbers according to an arbitrary distribution, from random numbers generated according to a uniform distribution. It is handy for simulations.
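One common way to turn uniform random numbers into draws from an arbitrary discrete distribution is inverse-transform sampling over the cumulative weights. The sketch below illustrates that idea only; it is not the code in the zip file, and the names make_sampler, values, and weights are assumptions made for the example.

```python
import bisect
import itertools
import random


def make_sampler(values, weights, rng=random):
    """Return a function drawing from `values` with probability
    proportional to the matching entry in `weights`."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and equal in length")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")

    # Running totals play the role of an unnormalised cumulative distribution.
    cumulative = list(itertools.accumulate(weights))
    total = cumulative[-1]

    def sample():
        u = rng.random() * total  # uniform draw scaled to [0, total)
        # First position whose cumulative weight exceeds u.
        index = bisect.bisect_right(cumulative, u)
        # Guard against floating-point rounding at the top of the range.
        return values[min(index, len(values) - 1)]

    return sample
```

For example, make_sampler(["short", "long"], [3, 1]) returns a function that yields "short" about three times as often as "long". Passing a seeded random.Random instance as rng makes a simulation repeatable.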
Another handy utility is unsort, which randomises the order of lines in the input.
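The behaviour described for unsort can be approximated with a Fisher-Yates shuffle over the input lines. This is a minimal sketch of that technique, not the utility's actual source; the function name and the rng parameter are assumptions.

```python
import random


def unsort(lines, rng=random):
    """Return a new list holding `lines` in uniformly random order."""
    shuffled = list(lines)
    # Fisher-Yates: walk from the end, swapping each slot with a
    # randomly chosen slot at or before it.
    for i in range(len(shuffled) - 1, 0, -1):
        j = rng.randrange(i + 1)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled
```

To use it as a filter, read the lines from sys.stdin, pass them to unsort, and write the result to sys.stdout.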
Efficient data structures. Several of my papers rely on code for high-efficiency hashing and tree structures. Some of the code for these structures was put on a web page to support a paper on efficient hashing (full details in my list of publications). The full set of code is in this zip file. Code for burst tries is available on request. Alternatively, check out the excellent Judy package. Judy tries have many of the same characteristics as burst tries.
Search engine. The Lucy system is a high-performance implementation of a scalable text search engine. Other SEG-related projects include useful code for tasks such as index compression.
String sets. Several of the large sets of strings used in our experiments are available on the string sorting page developed by Ranjan Sinha, based on our papers. Some of this data is derived from the TREC web data collections.
String sorting. A simple implementation of ternary quick sort for sorting an array of strings is available. Faster string sorting routines are available on our string sorting page.
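For readers unfamiliar with the technique, ternary quick sort (also known as multikey quicksort) partitions the strings three ways on a single character position, then moves to the next position only within the group that matched the pivot character. The sketch below is an illustration under the assumption that strings are compared by code point; it is not the implementation offered on this page, and it uses an explicit stack rather than recursion so that long shared prefixes do not exhaust Python's recursion limit.

```python
def char_at(s, d):
    """Code point of s[d], or -1 once the string is exhausted."""
    return ord(s[d]) if d < len(s) else -1


def ternary_quicksort(strings):
    """Return a new list of the strings in ascending code-point order."""
    a = list(strings)
    # Each entry is (lo, hi, d): sort a[lo:hi], whose members share
    # their first d characters.
    stack = [(0, len(a), 0)]
    while stack:
        lo, hi, d = stack.pop()
        if hi - lo < 2:
            continue
        pivot = char_at(a[(lo + hi) // 2], d)
        lt, i, gt = lo, lo, hi - 1
        # Three-way partition on the character at position d.
        while i <= gt:
            c = char_at(a[i], d)
            if c < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif c > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        stack.append((lo, lt, d))               # smaller character
        if pivot >= 0:
            stack.append((lt, gt + 1, d + 1))   # same character: go deeper
        stack.append((gt + 1, hi, d))           # larger character
    return a
```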
Integer coding. Many of my papers make use of integer coding techniques. Source for some of these techniques, including Elias and Golomb coding, is in this zip file.
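As a rough illustration of the two named codes, the sketch below encodes integers as strings of "0" and "1" characters for readability. The conventions are assumptions and may differ from the source in the zip file: the Elias gamma code here covers integers from 1 and uses a prefix of zeros, while the Golomb code covers integers from 0, writes the quotient in unary as ones ended by a zero, and writes the remainder in truncated binary.

```python
def elias_gamma_encode(n):
    """Elias gamma code of n >= 1: floor(log2 n) zeros, then n in binary."""
    if n < 1:
        raise ValueError("Elias gamma codes integers from 1")
    binary = bin(n)[2:]
    return "0" * (len(binary) - 1) + binary


def elias_gamma_decode(bits, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    zeros = 0
    while bits[pos + zeros] == "0":
        zeros += 1
    end = pos + 2 * zeros + 1
    return int(bits[pos + zeros:end], 2), end


def golomb_encode(n, b):
    """Golomb code of n >= 0 with parameter b >= 1."""
    if n < 0 or b < 1:
        raise ValueError("need n >= 0 and b >= 1")
    q, r = divmod(n, b)
    k = (b - 1).bit_length()      # ceil(log2 b); 0 when b == 1
    cutoff = (1 << k) - b         # remainders below this use k - 1 bits
    if r < cutoff:
        width, value = k - 1, r
    else:
        width, value = k, r + cutoff
    tail = format(value, "b").zfill(width) if width > 0 else ""
    return "1" * q + "0" + tail


def golomb_decode(bits, b, pos=0):
    """Decode one value starting at pos; return (value, next position)."""
    q = 0
    while bits[pos] == "1":
        q += 1
        pos += 1
    pos += 1                      # skip the zero that ends the unary part
    k = (b - 1).bit_length()
    cutoff = (1 << k) - b
    r = 0
    if k > 0:
        if k > 1:
            r = int(bits[pos:pos + k - 1], 2)
            pos += k - 1
        if r >= cutoff:           # long form: one more bit follows
            r = 2 * r + int(bits[pos]) - cutoff
            pos += 1
    return q * b + r, pos
```

Under these conventions elias_gamma_encode(9) gives "0001001" and golomb_encode(9, 4) gives "11001". Because each decoder returns the position after the code it read, a stream of concatenated codes can be decoded by calling it repeatedly.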
Approximate string matching. Code for searching databases of strings, such as names, is in the vrank suite, which implements string-based techniques such as edit distance. Code for phonetic methods is in the ipa suite. Some collections of data are also available.
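To make the idea of ranking by edit distance concrete, here is a minimal sketch that computes Levenshtein distance (unit-cost insertions, deletions, and substitutions) and orders candidate names by their distance from a query. It does not reproduce the vrank suite; the function names and the tie-breaking rule are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a               # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        current = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def rank_by_edit_distance(query, names, limit=5):
    """Return up to `limit` names, closest to `query` first.
    Ties are broken alphabetically (an arbitrary choice)."""
    ranked = sorted(names, key=lambda name: (edit_distance(query, name), name))
    return ranked[:limit]
```

This scans every candidate, which is adequate for small lists; large name databases would normally need an index or a cheaper filter in front of the distance computation.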
Synthetic text databases. The finnegan suite can be used to generate artificial text databases that are useful for retrieval efficiency experiments; the suite includes the quangle code for generating queries.
Stopping and text processing. The routine rmstop is a simple utility for removing stop words from text; the source includes some stop lists. The awk script double is a simple utility for detecting repeated words in text files.
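The two tasks described here, removing stop words and spotting repeated words, can be sketched in a few lines. The code below is an illustration only, not the rmstop source or the double script; the tiny stop list and the rule that a word is a run of letters and apostrophes are assumptions.

```python
import re
import string

# Tiny illustrative stop list (an assumption, not one of rmstop's lists).
STOP_WORDS = frozenset({"a", "an", "and", "in", "is", "of", "the", "to"})


def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop whitespace-separated tokens whose lower-cased form,
    ignoring surrounding punctuation, is a stop word."""
    kept = []
    for token in text.split():
        bare = token.strip(string.punctuation).lower()
        if bare not in stop_words:
            kept.append(token)
    return " ".join(kept)


def repeated_words(lines):
    """Return (line number, word) for each word that immediately
    repeats the previous word, including across a line break."""
    found = []
    previous = None
    for number, line in enumerate(lines, 1):
        for word in re.findall(r"[A-Za-z']+", line):
            lowered = word.lower()
            if lowered == previous:
                found.append((number, lowered))
            previous = lowered
    return found
```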
The C program getstat reads a file of text (with one word per line) and counts how often each distinct word occurs. Test it on a book; the output should look like this.
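The counting step itself is straightforward to sketch. The following is not getstat; it assumes the same one-word-per-line input, treats words as case-sensitive, and picks its own ordering (most frequent first, ties alphabetical), since the original's output format is not described here.

```python
from collections import Counter


def count_words(lines):
    """Count occurrences of each distinct word, one word per input line.
    Blank lines are ignored."""
    stripped = (line.strip() for line in lines)
    return Counter(word for word in stripped if word)


def by_frequency(counts):
    """Return (word, count) pairs, most frequent first, ties alphabetical."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Opening a file and passing the file object to count_words is enough, because iterating over a text file yields its lines.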
Statistics. The programs anova (a shell script), t-test (a shell script), and wilcoxon (in C) are for testing significance of hypotheses. All three operate on paired columns of numbers, and should be used in conjunction with statistical tables.
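As an indication of what such tests compute from paired columns, the sketch below derives the paired t statistic and the Wilcoxon signed-rank statistic. It is written from the textbook definitions, not from the programs listed here, so their exact output may differ; the function names are assumptions. In both cases the returned statistic is then compared against a statistical table at the chosen significance level.

```python
import math


def paired_t(xs, ys):
    """Return (t, degrees of freedom) for paired samples xs and ys."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two columns of equal length, at least 2 rows")
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean / math.sqrt(variance / n), n - 1


def wilcoxon_signed_rank(xs, ys):
    """Return (W, n): the smaller of the positive and negative rank sums,
    and the number of non-zero differences. Tied magnitudes share the
    average of the ranks they span; zero differences are dropped."""
    if len(xs) != len(ys):
        raise ValueError("need two columns of equal length")
    ordered = sorted((x - y for x, y in zip(xs, ys) if x != y), key=abs)
    ranks = [0.0] * len(ordered)
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        average = (i + 1 + j) / 2     # mean of ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = average
        i = j
    w_plus = sum(r for d, r in zip(ordered, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ordered, ranks) if d < 0)
    return min(w_plus, w_minus), len(ordered)
```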
Scripting. Hints on basic scripting activities, such as awk and jgraph, were assembled by Hugh Williams and are available here.
Return to Justin Zobel's home page.