6 More Features of GAP
We have shown that the converging sequence has many useful properties that can be applied in many areas of multivariate statistical analysis. In this section, we use the psychosis disorder data, with 95 patients scored on 50 symptoms, to illustrate the framework of a complete GAP analysis (Figure 9). GAP integrates the following four major steps to extract and summarize the information embedded in a multivariate data set with n subjects and p variables.
6.1 Raw Data and Proximity Matrix Maps with Suitable Color Projection
The raw data matrix is denoted as
. A gray spectrum is applied to project ordinal numbers into gray dots with different intensities. The correlation matrix is calculated as the proximity matrix
for the 50 symptoms. For the 95 patients, the correlation matrix is also used as the proximity matrix
(we also tried the Euclidean (standardized) distance). The diverging blue-red color scheme is used to represent the bidirectional property of the correlation coefficients. For a data profile whose variables are measured on different scales, the variables can be transformed (standardized) and projected through a suitable color spectrum to represent the characteristics of the scales. The covariance matrix and the Euclidean (standardized) distance matrix can also be calculated as
and .
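The proximity matrices described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the array `X` is randomly generated stand-in data (not the psychosis data), the 0 to 5 score range is assumed, and the names `R_vars`, `R_subj`, and `D_subj` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for the raw data: 95 subjects x 50 ordinal variables.
# The 0..5 score range is an assumption made only for this sketch.
X = rng.integers(0, 6, size=(95, 50)).astype(float)

# Correlation as the proximity between variables (columns) and between subjects (rows).
R_vars = np.corrcoef(X, rowvar=False)  # 50 x 50
R_subj = np.corrcoef(X, rowvar=True)   # 95 x 95

# Alternative proximity for subjects: Euclidean distance after standardizing
# each variable to zero mean and unit (sample) standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
diff = Z[:, None, :] - Z[None, :, :]
D_subj = np.sqrt((diff ** 2).sum(axis=-1))  # 95 x 95
```

A diverging color map suits `R_vars` and `R_subj` because their entries range over [-1, 1], whereas a sequential map would suit the nonnegative distances in `D_subj`.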
6.2 The Sorted Matrix Maps with the Principle of Geometry
The next step is to form the sequences of correlation matrices for
and
to identify the ellipses and
at iterations 7 and 5, respectively. The elliptical seriations for the patients and the symptoms are then applied to arrange the two correlation matrices
and
into and . The same seriations are also used to reshape the raw data matrix
into . The difference between
and
is small, since the symptoms are already grouped by the SAPS and SANS symptom tables. However, there is a dramatic change from
to
since the patients are admitted in a random order. There is a clear latent structure in
. A band of dark gray dots moves from the upper right corner to the lower left corner. Since the seriations for
are identical to those for
and , these three maps are closely related to one another and should be cross-examined to find the information embedded in the raw data matrix and the two proximity matrices.
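Applying one pair of seriations to all three maps amounts to permuting rows and columns consistently. The sketch below assumes the two orderings are already available; here they are random permutations used purely as placeholders, not elliptical seriations, and the function name `apply_seriation` is hypothetical.

```python
import numpy as np

def apply_seriation(X, R_rows, R_cols, row_order, col_order):
    """Permute the raw data and both proximity matrices with the same orderings."""
    row_order = np.asarray(row_order)
    col_order = np.asarray(col_order)
    X_sorted = X[np.ix_(row_order, col_order)]
    R_rows_sorted = R_rows[np.ix_(row_order, row_order)]
    R_cols_sorted = R_cols[np.ix_(col_order, col_order)]
    return X_sorted, R_rows_sorted, R_cols_sorted

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))            # toy data: 8 subjects, 5 variables
R_rows = np.corrcoef(X)                # subject-by-subject proximity
R_cols = np.corrcoef(X, rowvar=False)  # variable-by-variable proximity

# Placeholder orderings; in a GAP analysis these would come from the seriation step.
row_order = rng.permutation(8)
col_order = rng.permutation(5)

X_sorted, R_rows_sorted, R_cols_sorted = apply_seriation(
    X, R_rows, R_cols, row_order, col_order
)
```

Because the same permutation indexes the rows of the raw data map and both axes of the subject proximity map (and likewise for the variables), row i of one sorted map always refers to the same subject as row i of the other.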
Figure 9. Complete GAP Procedure for the Psychosis Disorder Data Set with Ninety-five Patients and Fifty Symptoms.
Without a suitable seriation that places rows and columns with similar (distinct) properties at closer (farther) positions, these matrix maps are useless. We shall call this concept of placing similar (distinct) objects at positions close to (far away from) each other in a plot, so as to represent the association structure, the principle of geometry. In the graphic tools for continuous variables (histogram, line plot, scatterplot, etc.) this principle of geometry always holds, because the nature of a continuous variable plot is to represent the metric structure of subjects geometrically. In a raw data matrix map or a proximity matrix map, the principle of geometry has to be enforced with appropriate seriations.
6.3 Partitioned Matrix Maps with near Stationary Iterations
In Section 5.6, Figure 8a partitions the correlation matrix of the fifty symptoms into five major groups. In this section we look at the possible patient-clusters and the general behavior of patient-clusters on symptom-groups. There seems to be no clear patient-cluster structure in
except the negative between-group correlations in the off-diagonal area. It takes nine iterations for
to converge and to split the 95 patients into two groups. The first is a mixed group of 26 bipolar-disorder patients and 12 schizophrenia patients, and the second is a pure group of 57 schizophrenia patients.
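The two-group split at convergence can be illustrated with a minimal sketch of the iteration, in which each step replaces the matrix with the correlation matrix of its own rows until every entry is numerically +1 or -1. The tolerance, the iteration cap, the function names, and the toy data are assumptions made for this illustration; they are not taken from the analysis above.

```python
import numpy as np

def iterate_correlation(R, tol=1e-8, max_iter=100):
    """Repeatedly take the correlation matrix of the rows of R.

    Returns the last matrix and the number of iterations performed.
    Stops once every entry is within `tol` of +1 or -1 (assumed stopping rule).
    """
    R = np.asarray(R, dtype=float)
    for k in range(1, max_iter + 1):
        R = np.corrcoef(R)
        if np.max(np.abs(np.abs(R) - 1.0)) < tol:
            return R, k
    return R, max_iter

def two_group_split(R_limit):
    """Label objects by the sign of their limiting correlation with object 0."""
    return R_limit[0] > 0

# Toy data: two groups of six subjects, each group a noisy copy of its own profile.
rng = np.random.default_rng(2)
profile_a = rng.normal(size=20)
profile_b = rng.normal(size=20)
X = np.vstack([
    profile_a + 0.3 * rng.normal(size=(6, 20)),
    profile_b + 0.3 * rng.normal(size=(6, 20)),
])

R_limit, n_iter = iterate_correlation(np.corrcoef(X))
labels = two_group_split(R_limit)
```

In this sketch the first six entries of `labels` share one value and the last six share the other, mirroring how the converged matrix divides the subjects into two groups.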
In Figure 9c(=
), a coherent group with sharp edges is easily identified in the upper left corner. This group, denoted S0, contains all 26 bipolar-disorder patients and only 4 schizophrenia patients. In the lower right corner there is a large group, S1, of pure schizophrenia patients, but its structure is not as tight as that of S0. Between S1 and S0 lies a group of pure schizophrenia patients whose between-group relationships with S1 and S0 are about equal. We shall use S01 to denote this group of patients.
We then plot the two-way sorted raw data map again, with the sorted correlation maps for patients at
and for symptoms at
attached to it, in Figure 9c. The green lines represent the partitions for symptom-groups and for patient-clusters. The general behavior of patient-clusters on the symptom-groups can be easily identified in
.
6.4 The Sufficient Graph with Three Multivariate Linkages
In order to extract and summarize the visualized information in Figure 9b, we can further convert these matrix maps into a simplified version. Figure 9d shows the mean-structure maps of the three matrices for the raw data and the proximities. The original proximity matrices for variables and subjects are represented by squares with different mean intensities on the diagonal for the within-group structure, and by rectangles off the diagonal for the between-group relationships. The double-sorted raw data matrix map is also represented by rectangles with various mean gray intensities, expressing the interaction effect of each subject-cluster on every variable-group. These three mosaic displays in Figure 9d contain the principal structural information embedded in the original data set. The mean function in Figure 9d can be replaced with any statistic that displays the desired information structure. We shall name these three mosaic displays the sufficient graph for a multivariate data set. The sufficient graph is then used to answer the three multivariate problems raised by the psychiatrist. The fifty symptoms are divided into five symptom-groups with different within-group and between-group structure. The ninety-five patients are grouped into three clusters. The general behavior of these three patient-clusters on each of the five symptom-groups can now be easily comprehended. If there is concern about losing too much information, one can always go back to consult the three original sorted matrix maps (Figure 9b).
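The mean-structure maps can be sketched as block summaries of a sorted matrix, given a group label for each row and each column. The function name `block_summary`, the labels, and the small matrix below are hypothetical and serve only to show the computation; the `stat` argument reflects the remark that the mean can be replaced with another statistic.

```python
import numpy as np

def block_summary(M, row_labels, col_labels, stat=np.mean):
    """Summarize each (row-group, column-group) block of M with `stat`."""
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    row_ids = np.unique(row_labels)
    col_ids = np.unique(col_labels)
    out = np.empty((len(row_ids), len(col_ids)))
    for i, r in enumerate(row_ids):
        for j, c in enumerate(col_ids):
            block = M[np.ix_(row_labels == r, col_labels == c)]
            out[i, j] = stat(block)
    return out

# Toy example: a 4 x 4 matrix with two row groups and two column groups.
M = np.arange(16, dtype=float).reshape(4, 4)
row_labels = [0, 0, 1, 1]
col_labels = [0, 0, 1, 1]

means = block_summary(M, row_labels, col_labels)
# means == [[ 2.5,  4.5],
#           [10.5, 12.5]]
medians = block_summary(M, row_labels, col_labels, stat=np.median)
```

For a raw data map, `row_labels` would hold the subject-cluster of each row and `col_labels` the variable-group of each column. For a proximity map, the same label vector would be passed for both arguments, so that diagonal blocks summarize within-group structure and off-diagonal blocks summarize between-group relationships.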
