Software for credal classification
Posted on February 6, 2014 by Giorgio Corani
A classifier is a statistical model of the relationship between the attributes (features) of an object and its category (class). Classifiers are learned from a training set and then used on a test set to predict the class of a new object given its features. Credal classifiers extend traditional classifiers by allowing set-valued (or indeterminate) predictions of classes. The output set is typically larger when the data set is small or contains many missing values. Credal classifiers thus aim to produce reliable classifications even when information is poor. I am aware of only two software packages suitable for credal classification.
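To make the idea of a set-valued prediction concrete, here is a minimal Python sketch. It assumes that a credal classifier has already produced a lower and an upper posterior probability for each class, and it keeps every class that is not beaten under interval dominance. Interval dominance is used here only because it is the simplest criterion to show; the classifiers discussed in this post may use a different dominance test. The function name and the numbers are made up for illustration.

```python
from typing import Dict, Set, Tuple


def interval_dominance(bounds: Dict[str, Tuple[float, float]]) -> Set[str]:
    """Return the classes that are not interval-dominated.

    `bounds` maps each class to its (lower, upper) posterior probability.
    A class is dropped when some other class has a lower probability
    strictly greater than its upper probability.
    """
    kept = set()
    for c, (_, upper_c) in bounds.items():
        dominated = any(
            lower_o > upper_c
            for o, (lower_o, _) in bounds.items()
            if o != c
        )
        if not dominated:
            kept.add(c)
    return kept


# Hypothetical intervals learned from plenty of data: narrow, one class survives.
rich = {"a": (0.70, 0.75), "b": (0.15, 0.20), "c": (0.08, 0.12)}
# Hypothetical intervals learned from scarce data: wide and overlapping.
poor = {"a": (0.30, 0.70), "b": (0.20, 0.60), "c": (0.02, 0.15)}

print(sorted(interval_dominance(rich)))  # ['a']
print(sorted(interval_dominance(poor)))  # ['a', 'b']
```

With narrow intervals the prediction is determinate (a single class); with wide intervals the classifier returns two classes rather than committing to one.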
The first is JNCC2, which I authored together with M. Zaffalon. It is implemented in Java and runs from the command line. The code is open source. It reads the data from an ARFF file; this is the open format developed for WEKA (one of the most important open source packages for data mining). The downloadable zip file contains the source code and the user manual, with worked examples. Once a data set is provided, the software performs cross-validation, comparing Naive Bayes and the Naive Credal Classifier. Continuous features are discretized using the algorithm of Fayyad and Irani (1993). JNCC2 enables the conservative treatment of missing data. One can declare whether each attribute of the classification problem is subject to a MAR (missing at random) or to a non-MAR missingness process. See here for an introductory discussion of MAR and non-MAR missing data; here for a practical example of non-MAR data; here for a theoretical discussion of how to perform conservative inference in the presence of non-MAR missing data. JNCC2 reports the most traditional indicators of performance for credal classification; for instance, it measures the accuracy of Naive Bayes separately on the instances on which the Naive Credal Classifier is determinate and on those on which it is indeterminate. It also computes other typical indicators of performance, such as the percentage of indeterminate classifications, the average number of classes returned when indeterminate, and so on. The results are written to a text file. The software has been published in the JMLR special track on open source software.
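The indicators listed above can be sketched in a few lines of Python. This is an illustration of the definitions only, not JNCC2's code or output format: the function name, the dictionary keys and the toy data are all hypothetical. Each Naive Credal Classifier prediction is represented as a set of classes, and a prediction is determinate when that set has exactly one element.

```python
from typing import Dict, Iterable, Sequence, Set


def credal_metrics(
    truth: Sequence[str],
    nb_pred: Sequence[str],
    ncc_pred: Sequence[Set[str]],
) -> Dict[str, float]:
    """Compare single-class (Naive Bayes) and set-valued (credal) predictions."""
    n = len(truth)
    det = [i for i in range(n) if len(ncc_pred[i]) == 1]
    indet = [i for i in range(n) if len(ncc_pred[i]) > 1]

    def mean(values: Iterable[float]) -> float:
        values = list(values)
        # NaN signals that the indicator is undefined (no such instances).
        return sum(values) / len(values) if values else float("nan")

    return {
        # Fraction of instances on which the credal classifier returns one class.
        "determinacy": len(det) / n,
        # Naive Bayes accuracy, split by whether the credal classifier is determinate.
        "nb_acc_when_determinate": mean(nb_pred[i] == truth[i] for i in det),
        "nb_acc_when_indeterminate": mean(nb_pred[i] == truth[i] for i in indet),
        # How often an indeterminate output set contains the true class.
        "set_accuracy": mean(truth[i] in ncc_pred[i] for i in indet),
        # Average number of classes returned when indeterminate.
        "indeterminate_output_size": mean(len(ncc_pred[i]) for i in indet),
    }


# Toy data, invented for the example.
truth = ["a", "b", "a", "c"]
nb_pred = ["a", "b", "b", "a"]
ncc_pred = [{"a"}, {"b"}, {"a", "b"}, {"a", "b", "c"}]

for name, value in credal_metrics(truth, nb_pred, ncc_pred).items():
    print(f"{name}: {value}")
```

In this toy example Naive Bayes is always right where the credal classifier is determinate and always wrong where it is indeterminate, which is the pattern one hopes to see: the indeterminate instances are the hard ones.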
The second is the Weka-IP plugin, which was developed while preparing the classification chapter of the book Introduction to Imprecise Probabilities. I had the opportunity to spend some time in Granada with J. Abellan, A. Masegosa and S. Moral. Their research group has extensive experience in credal classification based on decision trees. We thus decided to make a number of credal classifiers available under the WEKA interface. Andres linked the code of WEKA with that of credal decision trees and JNCC2. He is the maintainer of the package. Weka-IP contains the following credal classifiers:
- credal decision trees (paper);
- naïve credal classifier (paper);
- lazy naïve credal classifier (paper);
- credal model averaging (paper).

I developed the last two algorithms by extending the JNCC2 codebase, but I had not previously released the code. The credal classifiers are available in a separate folder of classifiers, which is not present in the standard WEKA interface. Weka-IP also computes the utility-based metrics for scoring credal classifiers; in this respect it is more up to date than JNCC2. I recommend using the Experimenter interface (manual) of Weka-IP. The Experimenter allows the performance of different credal classifiers to be compared via statistical tests, based on the results of cross-validation. Moreover, it allows several credal classifiers to be run on multiple data sets. Installation and usage details are discussed in the user manual. A few details require attention. Credal decision trees require missing data to be removed, which is done by the filter E_FilteredClassifier. Credal classifiers derived from JNCC2 require the features to be discrete, which is done by the filter E_Discretize. Because of interface limitations, it is not possible to deal with non-MAR missing data. Another advantage of Weka-IP is that one can exploit the powerful WEKA functionalities for data pre-processing, such as feature selection.
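As an illustration of what a utility-based score for a set-valued prediction can look like, the Python sketch below computes discounted accuracy and the two utility functions usually called u65 and u80 in the credal classification literature. The post does not say which metrics Weka-IP implements, so treating them as these is an assumption; the function names and the toy data are likewise hypothetical.

```python
from typing import Sequence, Set


def discounted_accuracy(truth: str, predicted: Set[str]) -> float:
    """1/|predicted| if the true class is in the output set, else 0."""
    return 1.0 / len(predicted) if truth in predicted else 0.0


def u65(x: float) -> float:
    """Quadratic utility that maps a discounted accuracy of 0.5 to 0.65."""
    return -0.6 * x * x + 1.6 * x


def u80(x: float) -> float:
    """More risk-averse utility that maps a discounted accuracy of 0.5 to 0.80."""
    return -1.2 * x * x + 2.2 * x


def average_score(truth: Sequence[str], predicted: Sequence[Set[str]], utility) -> float:
    """Average utility of the discounted accuracy over all instances."""
    scores = [utility(discounted_accuracy(t, p)) for t, p in zip(truth, predicted)]
    return sum(scores) / len(scores)


# Toy data, invented for the example.
truth = ["a", "b", "a", "c"]
predicted = [{"a"}, {"b"}, {"a", "b"}, {"a", "b"}]

print("discounted accuracy:", average_score(truth, predicted, lambda x: x))
print("u65:", average_score(truth, predicted, u65))
print("u80:", average_score(truth, predicted, u80))
```

Both utilities leave fully correct (1) and fully wrong (0) predictions unchanged, and reward a correct two-class output with more than the 0.5 given by discounted accuracy. This lets a cautious set-valued classifier be compared on a single scale with a classifier that always returns one class.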
Weka-IP makes it easy to become acquainted with different credal classifiers, to run them from a graphical user interface and to compare their performance statistically. The code might be regarded as preliminary in several respects (see my comment above on the need for filters), but its interface is generally very easy to use. If you plan to develop your own algorithm for credal classification, it might be a good idea to try to add it to the Weka-IP package. In this way, you could quickly compare your new algorithm with the existing ones.
About the author
G. Corani is a senior researcher at the Imprecise Probability Group of IDSIA (Lugano, Switzerland). He obtained his PhD in Information Technology from the Politecnico di Milano in 2005. His research interests are mainly data mining and applied statistics. Most of his work with imprecise probability concerns credal classification. He is the author of about 50 international publications.