5.3 Comparison of Seriation Algorithms with Iris Data

In this section, Iris data (Fisher 1936) is used to compare the performance of the proposed seriations with several commonly used sorting algorithms. The target proximity matrix is the Euclidean distance matrix of the 150 iris flowers on four variables. Two conventional seriation algorithm sets,

(a).farthest and nearest insertion spanning tour,

(b). single, complete, and average linkage clustering trees, are compared with the proposed set,

(c).rank-2 ellipse, rank-1 tree, and rank-1&2 double ellipse.

The conventional algorithms are discussed in Minnotte and West (1998). The three permuted distance maps are displayed in Figure 7. A clear near Robinson pattern can be identified in Figure 7h, using the proposed rank-1&2 algorithms. Figure 7a with farthest insertion spanning algorithm on the other hand only approach the Robinson pattern in local regions. Performance of the average clustering tree in Figure 7f stands in between the spanning tours and the proposed algorithms.

Figure 7. Permuted Euclidean Distance Maps for Iris Data with Conventional and Proposed Seriation Algorithms.

For a numerical comparison, three anti-Robinson loss functions (Streng, 1978) are calculated for each permuted matrix, D=[], for the amount of deviation from a Robinson form with distance-type poximity:

AR(i) counts only the number of anti-Robinson events occur in the permuted matrix. AR(s) sums up all the absolute value of anti-Robinson deviations. AR(w) is a weighted version of AR(s) penalized by the difference of column indices of two entries. The result is summarized in Table 1. Clearly the proposed algorithms outperform the conventional methods with a significan margin in the Robinsonian sense.

Seriation Algorithm \ Loss Fun.

AR(i)

AR(s)

AR(w)

MS

(a) Farthest Insertion Spanning

339,392

2,391,228.7

75,265,472.0

530.5

(b) Nearest Insertion Spanning

344,117

2,594,751.6

86,392,680.8

557.4

(c) Single Linkage Clustering

192,180

603,196.8

9,822,063.2

673.8

(d) Complete Linkage Clustering

153,508

405,983.3

6,233,901.7

568.8

(e) Average Linkage Clustering

148,950

381,679.8

4,576,913.2

558.8

(f) Rank-OneTree

86,367

166,953.6

1,613,008.1

625.5

(g) Rank-TwoEllipse

91,411

186,148.0

2,098,849.9

883.1

(h) Rank-1&2 Double Ellipse

83,217

146,115.5

1,602,892.1

789.5

Table 1. Anti-Robinson Deviations for the Permuted Distance Matrices for Iris data.

The last column in Table 1 displays the amount of minimal span loss function for each permuted matrix, . This is the object to be minimized in a minimal spanning algorithm such as the traveling salesman problem. The idea is to find a shortest path through all data points. The major concern is on the optimization for local structures only which is different from the search of global structure in the Robinson problem. If only the local pattern is of concern then the conventional methods permute the matrix better than the proposed algorithms, but the difference is not as significant as that in the Robinson setup.

Reference:

  • Fisher, R. A. (1936), "The Use of Multiple Measurements in Taxonomic Problems," Ann. Eugen., 7, 179-188.

  • Minnotte, M., and West, R. W. (1998), "The Data Image: A tool for exploring high dimensional data sets," 1998 Proceedings of the Section on Statistical Graphics, American Statistical Association.

  • Streng, R. (1991) "Classification and Seriation by Iterative Reordering of a Data Matrix," In Classification, Data Analysis, and Knowledge Organization: Models and Methods with Applications, 121-130, edited by Bock, H. H., and