AutoClass solves the problem of automatic discovery of classes in data (sometimes called clustering or unsupervised learning), as distinct from the generation of class descriptions from labeled examples (called supervised learning). It aims to discover the 'natural' classes in the data.
AutoClass applies to observations of things that can each be described by a set of attributes, without reference to other things.
The values of each attribute must be either numbers or elements of a fixed set of symbols. For numeric data, a measurement error must also be provided.
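As a rough illustration of this input format, the Python sketch below describes each attribute as either numeric (with its measurement error) or symbolic (with its fixed symbol set) and checks a case against that description. The names `RealAttribute`, `DiscreteAttribute`, and `validate_case` are hypothetical and are not part of AutoClass; using `None` to mark a missing value is likewise an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Union


@dataclass(frozen=True)
class RealAttribute:
    """A numeric attribute; the measurement error is mandatory."""
    name: str
    error: float


@dataclass(frozen=True)
class DiscreteAttribute:
    """A symbolic attribute restricted to a fixed set of symbols."""
    name: str
    symbols: FrozenSet[str]


Attribute = Union[RealAttribute, DiscreteAttribute]


def validate_case(schema: Sequence[Attribute], case: Sequence[object]) -> List[str]:
    """Return a list of problems with one case; an empty list means it is valid.

    None is treated as a missing value (an assumption of this sketch).
    """
    if len(case) != len(schema):
        return [f"expected {len(schema)} values, got {len(case)}"]
    problems = []
    for attr, value in zip(schema, case):
        if value is None:
            continue  # missing values are allowed
        if isinstance(attr, RealAttribute):
            # bool is a subclass of int, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                problems.append(f"{attr.name}: {value!r} is not a number")
        elif value not in attr.symbols:
            problems.append(f"{attr.name}: {value!r} is not an allowed symbol")
    return problems
```

For example, with a schema of one numeric attribute and one symbolic attribute, `validate_case(schema, (1.2, "H"))` returns an empty list when `"H"` is among the allowed symbols.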
In previous years, the Bayes group at Ames Research Center developed the basic theory and associated algorithms for several kinds of general data analysis. Our earliest efforts addressed the problem of automatic classification of data, and we implemented this theory in the AutoClass series of programs.
AutoClass takes a database of cases described by a combination of real and discrete valued attributes, and automatically finds the natural classes in that data. It does not need to be told how many classes are present or what they look like -- it extracts this information from the data itself.
The classes are described probabilistically, so that an object can have partial membership in the different classes, and the class definitions can overlap. AutoClass generates reports on the classes it has found at the end of its search. AutoClass has been used and tested on many data sets, both within NASA and by industry, academia and other agencies. These applications typically find surprising classifications that show patterns in the data unknown to the user.
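To make partial membership concrete, here is a minimal Python sketch that assumes each class is a one-dimensional Gaussian with a weight, a mean, and a standard deviation. This is an illustrative simplification, not AutoClass's actual class model, and the function names are hypothetical. The membership of a case in each class is the class weight times the class density at that value, normalized so the memberships sum to one.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def memberships(x, classes):
    """Return the probability that x belongs to each class.

    `classes` is a list of (weight, mean, std) tuples. The result sums to 1.
    Values extremely far from every class can underflow to a zero total;
    a production version would work with log densities instead.
    """
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in classes]
    total = sum(joint)
    return [j / total for j in joint]
```

With two equally weighted, overlapping classes centred at 0 and 4, a case at 2 sits exactly between them and gets equal membership in both, while a case at 0 belongs almost entirely to the first class.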
Examples include: discovery of new classes of infra-red stars in the IRAS Low Resolution Spectral catalogue (see figure below), new classes of airports in a database of all USA airports, discovery of classes of proteins, introns and other patterns in DNA/protein sequence data, and others.
Key features are:
- determines the number of classes automatically;
- can use mixed discrete and real valued data;
- can handle missing values;
- processing time is roughly linear in the amount of the data;
- cases have probabilistic class membership;
- allows correlation between attributes within a class;
- generates reports describing the classes found; and
- predicts "test" case class memberships from a "training" classification.
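The first feature in the list, choosing the number of classes from the data, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it fits one-dimensional Gaussian mixtures by EM and compares them with the Bayesian Information Criterion (BIC), which is swapped in here as a simple, widely used stand-in. The text above does not say which criterion AutoClass uses, and all function names are hypothetical.

```python
import math


def log_normal(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_joint(x, weights, means, variances):
    """log(weight_j * density_j(x)) for every class j."""
    return [math.log(w) + log_normal(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_mixture(data, k, iters=100, min_var=1e-3):
    """Fit a k-class 1-D Gaussian mixture by EM; return its log-likelihood.

    Requires len(data) >= k. Variances are floored at min_var so that a
    class cannot collapse onto a single point.
    """
    xs = sorted(data)
    n = len(xs)
    # Deterministic start: split the sorted data into k roughly equal chunks.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [len(c) / n for c in chunks]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(min_var, sum((x - m) ** 2 for x in c) / len(c))
                 for c, m in zip(chunks, means)]
    for _ in range(iters):
        # E-step: probabilistic membership of each case in each class.
        resp = []
        for x in xs:
            logs = log_joint(x, weights, means, variances)
            norm = log_sum_exp(logs)
            resp.append([math.exp(l - norm) for l in logs])
        # M-step: re-estimate each class from its weighted cases.
        counts = [max(1e-9, sum(r[j] for r in resp)) for j in range(k)]
        total = sum(counts)
        weights = [c / total for c in counts]
        means = [sum(r[j] * x for r, x in zip(resp, xs)) / counts[j]
                 for j in range(k)]
        variances = [max(min_var,
                         sum(r[j] * (x - means[j]) ** 2
                             for r, x in zip(resp, xs)) / counts[j])
                     for j in range(k)]
    return sum(log_sum_exp(log_joint(x, weights, means, variances)) for x in xs)


def choose_k(data, max_k):
    """Return (best_k, scores), where scores maps k to its BIC (lower is better)."""
    n = len(data)
    scores = {}
    for k in range(1, max_k + 1):
        n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
        scores[k] = -2.0 * fit_mixture(data, k) + n_params * math.log(n)
    return min(scores, key=scores.get), scores
```

Calling `choose_k(data, 3)` on data that forms two well-separated groups gives a much lower BIC for two classes than for one, because the penalty for the extra parameters is small compared with the gain in fit. The penalty term is what keeps the search from simply adding classes without limit.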