This is the code that generated the results for the paper 'Active Learning using Dirichlet Processes for Rare Class Discovery and Classification'. Note that it has since been edited, so there is the potential for some bug to have been introduced (the editing consisted of removing code for an unpublished improvement, which never worked anyway, so the chances are pretty slim). A quick test run completed successfully, suggesting everything is fine, though it only exercised a tiny subset of the possible code paths.

The algorithm in the paper is 'p_wrong_soft'. Note that there is also 'p_wrong_hard', which selects the sample most likely to contribute the most information, rather than using that quantity as a weighting (and there are further variants still, if you look inside the dp_al code).
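To illustrate the distinction, here is a minimal hypothetical sketch (not the repo's actual code): assume each unlabelled sample has an estimated probability of being misclassified, `p_wrong`. The 'hard' variant queries the maximiser, while the 'soft' variant treats the probabilities as weights and draws a sample proportionally.

```python
# Hypothetical sketch of 'hard' vs 'soft' query selection, assuming a
# list p_wrong of per-sample misclassification probabilities. The real
# dp_al code computes these quantities from the Dirichlet process model.
import random

def select_hard(p_wrong):
    """'Hard' selection: query the single sample with the highest p_wrong."""
    return max(range(len(p_wrong)), key=lambda i: p_wrong[i])

def select_soft(p_wrong, rng=random):
    """'Soft' selection: use p_wrong as a weighting, drawing a sample
    with probability proportional to its value."""
    total = sum(p_wrong)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for i, p in enumerate(p_wrong):
        acc += p
        if r <= acc:
            return i
    return len(p_wrong) - 1

p = [0.1, 0.7, 0.2]
print(select_hard(p))  # -> 1
```

The soft variant still tends to pick informative samples, but keeps some exploration by occasionally querying lower-weight ones.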

There is a single executable Python script for each of the data sets - running a script generates and stores runs in a directory that it creates. The directory name includes a randomly generated component, so the same script can be run multiple times simultaneously, to make use of multi-core machines or of multiple machines (if a directory name collision occurs, the script exits).
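The directory-creation behaviour can be sketched as follows (a hypothetical illustration, not the repo's actual code; the naming scheme here is an assumption):

```python
# Hypothetical sketch: create a run directory whose name includes a
# random component, so several instances of the same script can run at
# once. On a name collision the process simply exits, as described above.
import os
import sys
import random

def make_run_dir(base='runs'):
    # Append a random hex suffix to the base name.
    name = '%s_%06x' % (base, random.randrange(16 ** 6))
    try:
        os.makedirs(name)
    except OSError:
        # Another instance grabbed the same name - bail out.
        sys.exit('Directory name collision: %s' % name)
    return name
```

Exiting on collision is a simple design choice: with a large random suffix collisions are rare, and restarting the script is cheaper than coordinating between instances.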

When enough runs have been generated, the script collate.py collects the information from all of the runs, averages it, and writes the results out as csv files in subdirectories. The numbers from these files are the ones plotted in the paper. Note that the number of queries is hard coded into this script, so it can be run while the result-generating scripts continue to run - it will cull incomplete result files.
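The collation step can be sketched like this (a hypothetical illustration with an assumed file layout of one value per line; collate.py's actual format may differ):

```python
# Hypothetical sketch of collation: read the per-query result curve from
# every run file matching a pattern, cull runs that have not yet reached
# the hard coded query count, average the rest, and write a csv.
import csv
import glob

NUM_QUERIES = 250  # hard coded query count, as in collate.py

def collate(pattern, out_path, num_queries=NUM_QUERIES):
    curves = []
    for fn in glob.glob(pattern):
        with open(fn) as f:
            values = [float(row[0]) for row in csv.reader(f) if row]
        if len(values) >= num_queries:  # cull incomplete result files
            curves.append(values[:num_queries])
    if not curves:
        return 0
    with open(out_path, 'w', newline='') as f:
        writer = csv.writer(f)
        for q in range(num_queries):
            writer.writerow([sum(c[q] for c in curves) / len(curves)])
    return len(curves)
```

Because incomplete runs are simply skipped, this can safely be run while the result-generating scripts are still producing output.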


Note that the digits and shuttle data files were provided as .mat files by Tim Hospedales, and are loaded using Python's .mat file support.
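For reference, .mat files are typically loaded in Python via scipy (assuming the usual scipy.io interface; the repo's actual loading code and the variable names inside the provided files may differ). This sketch round-trips a tiny array so it is self-contained:

```python
# Hypothetical sketch of .mat file loading with scipy.io. The real
# scripts would call loadmat on the provided digits/shuttle files; here
# we write and re-read a small example file instead.
import numpy as np
from scipy.io import savemat, loadmat

savemat('example.mat', {'X': np.arange(6).reshape(2, 3)})

data = loadmat('example.mat')  # dict of variable name -> numpy array,
                               # plus metadata keys like '__header__'
X = data['X']
print(X.shape)  # -> (2, 3)
```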

