Revelle (1979) proposed that hierarchical cluster analysis could be used to estimate a new coefficient (beta) that is an estimate of the general factor saturation of a test. More recently, Zinbarg, Revelle, Yovel and Li (2005) compared McDonald's omega to Cronbach's alpha and Revelle's beta, and concluded that omega is the best estimate of general factor saturation. An algorithm for estimating omega using R is available as part of the "psych" package at http://personality-project.org/r.
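Once the package is installed, the estimate can be obtained in a line or two (a minimal sketch, assuming the omega function in the current psych release):
library(psych)
omega(Harman74.cor$cov)   #estimate the general factor saturation of the 24 mental tests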
Although ICLUST was originally written in FORTRAN for a CDC 6400 and later converted for IBM mainframes, Cooksey and Soutar (2005) report using a PC adaptation of the original FORTRAN program. A modified version, written in Lightspeed Pascal, has been available for the Mac for several years, but it lacks some of the graphic features of the original program.
I have now written a completely new version of ICLUST in R. It is still under development but is included as part of the "psych" package. Early testing suggests it is stable, but please let me know if you run into problems. Email me if you want help with this version of ICLUST or would like additional features.
The program currently has three primary functions: cluster, loadings, and graphics. There is no documentation yet for the program, although one can figure out most requirements by reading the code.
A sample output using Harman's 24 variable problem can be represented both graphically and in terms of the cluster order. Note that the graphic is created with Graphviz using the dot language: ICLUST.graph produces the dot code for Graphviz. I have also tried doing this with Rgraphviz, with less than wonderful results. The dot code can be viewed directly in Graphviz or tweaked using commercial software packages (e.g., OmniGraffle).
Note that for this problem, with these parameters, the data formed one large cluster. (This is consistent with the Very Simple Structure (VSS) output as well, which shows a clear one-factor solution at complexity 1.) See below for an example with this same data set, but with more stringent parameter settings.
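One way to check this claim (a sketch, assuming the VSS function from the same package; the Harman74.cor data set reports n.obs = 145):
library(psych)
VSS(Harman74.cor$cov,n.obs=145)   #the complexity 1 criterion favors the one factor solution noted above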
r.mat <- Harman74.cor$cov   #the correlation matrix for the 24 mental tests
iq.clus <- ICLUST(r.mat)    #cluster with the default settings
ICLUST.graph(iq.clus,title="ICLUST of 24 mental variables",out.file="iclust.dot")   #write the dot code
$title
[1] "ICLUST"
$clusters
VisualPerception Cubes PaperFormBoard Flags GeneralInformation PargraphComprehension
1 1 1 1 1 1
SentenceCompletion WordClassification WordMeaning Addition Code CountingDots
1 1 1 1 1 1
StraightCurvedCapitals WordRecognition NumberRecognition FigureRecognition ObjectNumber NumberFigure
1 1 1 1 1 1
FigureWord Deduction NumericalPuzzles ProblemReasoning SeriesCompletion ArithmeticProblems
1 1 1 1 1 1
$corrected
[,1]
[1,] 1
$loadings
[,1]
VisualPerception 0.55
Cubes 0.34
PaperFormBoard 0.38
Flags 0.44
GeneralInformation 0.60
PargraphComprehension 0.59
SentenceCompletion 0.57
WordClassification 0.60
WordMeaning 0.59
Addition 0.41
Code 0.51
CountingDots 0.43
StraightCurvedCapitals 0.55
WordRecognition 0.39
NumberRecognition 0.37
FigureRecognition 0.48
ObjectNumber 0.44
NumberFigure 0.49
FigureWord 0.42
Deduction 0.56
NumericalPuzzles 0.55
ProblemReasoning 0.56
SeriesCompletion 0.63
ArithmeticProblems 0.60
$fit
$fit$clusterfit
[1] 0.76
$fit$factorfit
[1] 0.76
$results
Item/Cluster Item/Cluster similarity correlation alpha1 alpha2 beta1 beta2 size1 size2 rbar1 rbar2 r1 r2 alpha beta rbar size
C1 V9 V5 1.00 0.72 0.72 0.72 0.72 0.72 1 1 0.72 0.72 0.78 0.78 0.84 0.84 0.72 2
C2 V12 V10 1.00 0.58 0.58 0.58 0.58 0.58 1 1 0.58 0.58 0.66 0.66 0.74 0.74 0.58 2
C3 V18 V17 1.00 0.45 0.45 0.45 0.45 0.45 1 1 0.45 0.45 0.53 0.53 0.62 0.62 0.45 2
C4 V23 V20 1.00 0.51 0.51 0.51 0.51 0.51 1 1 0.51 0.51 0.59 0.59 0.67 0.67 0.51 2
C5 V13 V11 1.00 0.54 0.54 0.54 0.54 0.54 1 1 0.54 0.54 0.61 0.61 0.70 0.70 0.54 2
C6 V7 V6 1.00 0.72 0.72 0.72 0.72 0.72 1 1 0.72 0.72 0.78 0.78 0.84 0.84 0.72 2
C7 V4 V1 0.98 0.47 0.47 0.48 0.47 0.48 1 1 0.47 0.48 0.55 0.55 0.64 0.64 0.47 2
C8 V16 V14 0.98 0.41 0.43 0.41 0.43 0.41 1 1 0.43 0.41 0.50 0.49 0.58 0.58 0.41 2
C9 C1 C6 0.93 0.78 0.84 0.84 0.84 0.84 2 2 0.72 0.72 0.86 0.86 0.90 0.87 0.69 4
C10 C4 V22 0.91 0.56 0.67 0.56 0.67 0.56 2 1 0.51 0.56 0.71 0.62 0.74 0.71 0.49 3
C11 V21 V24 0.86 0.45 0.51 0.53 0.51 0.53 1 1 0.51 0.53 0.56 0.58 0.62 0.62 0.45 2
C12 C10 C11 0.86 0.58 0.74 0.62 0.71 0.62 3 2 0.49 0.45 0.76 0.67 0.79 0.74 0.43 5
C13 C9 V8 0.85 0.64 0.90 0.64 0.87 0.64 4 1 0.69 0.64 0.90 0.69 0.90 0.78 0.64 5
C14 C8 V15 0.84 0.41 0.58 0.41 0.58 0.41 2 1 0.41 0.41 0.61 0.49 0.64 0.59 0.37 3
C15 C2 C5 0.82 0.59 0.74 0.70 0.74 0.70 2 2 0.58 0.54 0.74 0.72 0.79 0.74 0.49 4
C16 V3 C7 0.81 0.41 0.41 0.64 0.41 0.64 1 2 0.41 0.47 0.48 0.65 0.66 0.58 0.39 3
C17 V19 C3 0.80 0.40 0.41 0.62 0.41 0.62 1 2 0.41 0.45 0.48 0.64 0.65 0.57 0.38 3
C18 C12 C16 0.78 0.57 0.79 0.66 0.74 0.58 5 3 0.43 0.39 0.79 0.68 0.82 0.72 0.37 8
C19 C17 C14 0.76 0.49 0.64 0.64 0.57 0.59 3 3 0.38 0.37 0.66 0.65 0.74 0.65 0.32 6
C20 C18 C19 0.76 0.59 0.82 0.74 0.72 0.65 8 6 0.37 0.32 0.81 0.73 0.86 0.74 0.30 14
C21 C20 C13 0.71 0.62 0.86 0.90 0.74 0.78 14 5 0.30 0.64 0.85 0.78 0.90 0.77 0.32 19
C22 C15 C21 0.65 0.55 0.79 0.90 0.74 0.77 4 19 0.49 0.32 0.65 0.89 0.91 0.71 0.31 23
C23 C22 V2 0.62 0.35 0.91 0.35 0.71 0.35 23 1 0.31 0.35 0.91 0.37 0.91 0.52 0.30 24
$cor
[,1]
[1,] 1
$alpha
[1] 0.91
$size
[1] 24
Graphviz (or OmniGraffle) can open the dot file (iclust.dot) and will show the figure (see above). The dot file can be edited with a normal text editor to make the figure cleaner.
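If the Graphviz command line tools are installed, the figure can also be rendered directly from R (a sketch; assumes the dot executable is on the search path):
system("dot -Tpng iclust.dot -o iclust.png")   #render the dot code to a PNG figure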
The previous solution showed that all 24 mental tests formed one high-level cluster (g). An alternative is to form smaller, tighter clusters. These were formed, of course, along the way in the previous analysis, but it is useful to examine the correlations, loadings, and fit statistics for a multiple cluster solution.
Note that the cluster fit and the more stringent factor fit are both worse for this solution than for the previous one. Also note that one of the "clusters" is a single variable (Cubes).
r.mat <- Harman74.cor$cov
iq.clus <- ICLUST(r.mat,nclusters=4)   #stop when four clusters remain
graph.out <- file.choose()             #locate the output file for the dot code
ICLUST.graph(iq.clus,graph.out,title = "ICLUST of 24 mental variables")
iq.clus
$title
[1] "ICLUST"
$clusters
V2 C13 C15 C20
VisualPerception 0 0 0 1
Cubes 1 0 0 0
PaperFormBoard 0 0 0 1
Flags 0 0 0 1
GeneralInformation 0 1 0 0
PargraphComprehension 0 1 0 0
SentenceCompletion 0 1 0 0
WordClassification 0 1 0 0
WordMeaning 0 1 0 0
Addition 0 0 1 0
Code 0 0 1 0
CountingDots 0 0 1 0
StraightCurvedCapitals 0 0 1 0
WordRecognition 0 0 0 1
NumberRecognition 0 0 0 1
FigureRecognition 0 0 0 1
ObjectNumber 0 0 0 1
NumberFigure 0 0 0 1
FigureWord 0 0 0 1
Deduction 0 0 0 1
NumericalPuzzles 0 0 0 1
ProblemReasoning 0 0 0 1
SeriesCompletion 0 0 0 1
ArithmeticProblems 0 0 0 1
$corrected
V2 C13 C15 C20
V2 1.00 0.26 0.21 0.41
C13 0.24 0.90 0.47 0.71
C15 0.19 0.40 0.79 0.67
C20 0.38 0.62 0.55 0.86
$loadings
V2 C13 C15 C20
VisualPerception 0.32 0.38 0.39 0.57
Cubes 1.00 0.24 0.19 0.38
PaperFormBoard 0.32 0.31 0.15 0.42
Flags 0.23 0.38 0.22 0.45
GeneralInformation 0.28 0.77 0.39 0.50
PargraphComprehension 0.23 0.77 0.31 0.53
SentenceCompletion 0.16 0.80 0.32 0.49
WordClassification 0.16 0.67 0.40 0.56
WordMeaning 0.20 0.79 0.27 0.54
Addition 0.06 0.29 0.63 0.35
Code 0.15 0.36 0.61 0.48
CountingDots 0.14 0.21 0.65 0.40
StraightCurvedCapitals 0.24 0.40 0.62 0.51
WordRecognition 0.10 0.31 0.27 0.41
NumberRecognition 0.13 0.26 0.22 0.41
FigureRecognition 0.27 0.28 0.27 0.56
ObjectNumber 0.00 0.29 0.36 0.47
NumberFigure 0.26 0.25 0.43 0.54
FigureWord 0.11 0.29 0.27 0.46
Deduction 0.29 0.51 0.27 0.58
NumericalPuzzles 0.31 0.36 0.50 0.54
ProblemReasoning 0.23 0.49 0.30 0.58
SeriesCompletion 0.35 0.54 0.40 0.62
ArithmeticProblems 0.21 0.50 0.54 0.54
$fit
$fit$clusterfit
[1] 0.51
$fit$factorfit
[1] 0.16
$results
Item/Cluster Item/Cluster similarity correlation alpha1 alpha2 beta1 beta2 size1 size2 rbar1 rbar2 r1 r2 alpha beta rbar size
C1 V9 V5 1.00 0.72 0.72 0.72 0.72 0.72 1 1 0.72 0.72 0.78 0.78 0.84 0.84 0.72 2
C2 V12 V10 1.00 0.58 0.58 0.58 0.58 0.58 1 1 0.58 0.58 0.66 0.66 0.74 0.74 0.58 2
C3 V18 V17 1.00 0.45 0.45 0.45 0.45 0.45 1 1 0.45 0.45 0.53 0.53 0.62 0.62 0.45 2
C4 V23 V20 1.00 0.51 0.51 0.51 0.51 0.51 1 1 0.51 0.51 0.59 0.59 0.67 0.67 0.51 2
C5 V13 V11 1.00 0.54 0.54 0.54 0.54 0.54 1 1 0.54 0.54 0.61 0.61 0.70 0.70 0.54 2
C6 V7 V6 1.00 0.72 0.72 0.72 0.72 0.72 1 1 0.72 0.72 0.78 0.78 0.84 0.84 0.72 2
C7 V4 V1 0.98 0.47 0.47 0.48 0.47 0.48 1 1 0.47 0.48 0.55 0.55 0.64 0.64 0.47 2
C8 V16 V14 0.98 0.41 0.43 0.41 0.43 0.41 1 1 0.43 0.41 0.50 0.49 0.58 0.58 0.41 2
C9 C1 C6 0.93 0.78 0.84 0.84 0.84 0.84 2 2 0.72 0.72 0.86 0.86 0.90 0.87 0.69 4
C10 C4 V22 0.91 0.56 0.67 0.56 0.67 0.56 2 1 0.51 0.56 0.71 0.62 0.74 0.71 0.49 3
C11 V21 V24 0.86 0.45 0.51 0.53 0.51 0.53 1 1 0.51 0.53 0.56 0.58 0.62 0.62 0.45 2
C12 C10 C11 0.86 0.58 0.74 0.62 0.71 0.62 3 2 0.49 0.45 0.76 0.67 0.79 0.74 0.43 5
C13 C9 V8 0.85 0.64 0.90 0.64 0.87 0.64 4 1 0.69 0.64 0.90 0.69 0.90 0.78 0.64 5
C14 C8 V15 0.84 0.41 0.58 0.41 0.58 0.41 2 1 0.41 0.41 0.61 0.49 0.64 0.59 0.37 3
C15 C2 C5 0.82 0.59 0.74 0.70 0.74 0.70 2 2 0.58 0.54 0.74 0.72 0.79 0.74 0.49 4
C16 V3 C7 0.81 0.41 0.41 0.64 0.41 0.64 1 2 0.41 0.47 0.48 0.65 0.66 0.58 0.39 3
C17 V19 C3 0.80 0.40 0.41 0.62 0.41 0.62 1 2 0.41 0.45 0.48 0.64 0.65 0.57 0.38 3
C18 C12 C16 0.78 0.57 0.79 0.66 0.74 0.58 5 3 0.43 0.39 0.79 0.68 0.82 0.72 0.37 8
C19 C17 C14 0.76 0.49 0.64 0.64 0.57 0.59 3 3 0.38 0.37 0.66 0.65 0.74 0.65 0.32 6
C20 C18 C19 0.76 0.59 0.82 0.74 0.72 0.65 8 6 0.37 0.32 0.81 0.73 0.86 0.74 0.30 14
C21 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0
C22 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0
C23 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0
$cor
V2 C13 C15 C20
V2 1.00 0.24 0.19 0.38
C13 0.24 1.00 0.40 0.62
C15 0.19 0.40 1.00 0.55
C20 0.38 0.62 0.55 1.00
$alpha
V2 C13 C15 C20
1.00 0.90 0.79 0.86
$size
V2 C13 C15 C20
1 5 4 14
#ICLUST - a function to form homogeneous item composites
# based upon Revelle, W. (1979). Hierarchical cluster analysis and the internal structure of tests. Multivariate Behavioral Research, 14, 57-74.
#
# pseudo code
# find similarity matrix
# original is either covariance or correlation
# corrected is disattenuated
#find most similar pair
#if size of pair > min size, apply beta criterion
# if beta new > min(beta 1, beta 2) combine pair
#update similarity matrix
#repeat until finished
#then report various summary statistics
#example code
#r.mat<- Harman74.cor$cov
# print(ICLUST(r.mat),digits=2)
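#a worked example of the beta criterion (a sketch added for clarity, not original code):
# combining two clusters that correlate r gives an unweighted beta of 2*r/(1+r)
# (the beta.combined line in ICLUST.cluster below); e.g., for r = .72,
# 2*.72/(1+.72) = 0.84, matching row C1 of the $results output above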
#ICLUST is the main function and calls other routines
"ICLUST" <-
function (r.mat,nclusters=1,alpha=3,beta=1,beta.size=4,alpha.size=3,correct=TRUE,reverse=TRUE,beta.min=.5,output=1,digits=2,labels=NULL,cut=0,n.iterations=0,title="ICLUST") {#should allow for raw data, correlation or covariances
#ICLUST.options <- list(n.clus=1,alpha=3,beta=2,beta.size=4,alpha.size=3,correct=TRUE,reverse=TRUE,beta.min=.5,output=1,digits=2)
ICLUST.options <- list(n.clus=nclusters,alpha=alpha,beta=beta,beta.size=beta.size,alpha.size=alpha.size,correct=correct,reverse=reverse,beta.min=beta.min,output=output,digits=digits)
if(dim(r.mat)[1]!=dim(r.mat)[2]) {r.mat <- cor(r.mat,use="pairwise") } #cluster correlation matrices, find correlations if not square matrix
iclust.results <- ICLUST.cluster(r.mat,ICLUST.options)
loading <- cluster.loadings(iclust.results$clusters,r.mat,digits=digits)
fits <- cluster.fit(r.mat,loading$loadings,iclust.results$clusters,digits=digits)
sorted <- ICLUST.sort(ic.load=loading$loadings,labels=labels,cut=cut) #sort the loadings
#now, iterate the cluster solution to clean it up (if desired)
clusters <- iclust.results$clusters
old.clusters <- clusters
old.fit <- fits$clusterfit
clusters <- factor2cluster(loading$loadings,cut=cut,loading=FALSE)
load <- cluster.loadings(clusters,r.mat,digits=digits)
if (n.iterations > 0) { #it is possible to iterate the solution to perhaps improve it
for (steps in 1:n.iterations) { #
load <- cluster.loadings(clusters,r.mat,digits=digits)
clusters <- factor2cluster(load$loadings,cut=cut,loading=FALSE) #use the loadings from the current iteration
if(dim(clusters)[2]!=dim(old.clusters)[2]) {change <- 999
load <- cluster.loadings(clusters,r.mat,digits=digits) } else {
change <- sum(abs(clusters)-abs(old.clusters)) } #how many items are changing?
fit <- cluster.fit(r.mat,load$loadings,clusters,digits=digits)
old.clusters <- clusters
print(paste("iterations ",steps," change in clusters ", change, "current fit " , fit$clusterfit))
if ((abs(change) < 1) | (fit$clusterfit <= old.fit)) {break} #stop iterating if it gets worse or there is no change in cluster definitions
old.fit <- fit$clusterfit
}
}
p.fit <- cluster.fit(r.mat,load$loadings,clusters,digits=digits)
p.sorted <- ICLUST.sort(ic.load=load$loadings,labels=labels,cut=cut)
purified <- cluster.cor(clusters,r.mat,digits=digits)
list(title=title,clusters=iclust.results$clusters,corrected=loading$corrected,loadings=loading$loadings,fit=fits,results=iclust.results$results,cor=loading$cor,alpha=loading$alpha,size=loading$size,sorted=sorted,p.fit = p.fit,p.sorted = p.sorted,purified=purified)
}
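#example usage (a sketch): a four cluster solution, iterated twice, with its purified fit
# iq.clus <- ICLUST(Harman74.cor$cov,nclusters=4,n.iterations=2)
# iq.clus$p.fit #fit statistics for the purified clusters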
"ICLUST.sort"<- function (ic.load,cut=0,labels=NULL,loading=FALSE) {
nclust <- dim(ic.load)[2]
nitems <- dim(ic.load)[1]
if (length(labels)==0) {
var.labels <- rownames(ic.load)} else {var.labels=labels}
if (length(var.labels)==0) {var.labels =paste('V',seq(1:nitems),sep='')} #unlabeled variables
if(loading) {loadings <- ic.load$loadings} else {loadings <- ic.load}
loads <- data.frame(item=seq(1:nitems),content=var.labels,cluster=rep(0,nitems),loadings)
#first find the maximum for each row and assign it to that cluster
loads$cluster <- apply(abs(loadings),1,which.max)
for (i in 1:nitems) {if (abs(loadings[i,loads$cluster[i]]) < cut) {loads$cluster[i] <- nclust+1}} #assign the ones that missed the cut a location
ord <- sort(loads$cluster,index.return=TRUE)
loads[1:nitems,] <- loads[ord$ix,]
rownames(loads)[1:nitems] <- rownames(loads)[ord$ix]
items <- c(table(loads$cluster),1) #how many items are in each cluster?
if(length(items) < (nclust+1)) {items <- rep(0,(nclust+1)) #this is a rare case where some clusters don't have anything in them
for (i in 1:(nclust+1)) {items[i] <- sum(loads$cluster==i) } }
#now sort the loadings that have their highest loading on each cluster
first <- 1
for (i in 1:nclust) {
if(items[i]>0 ) {
last <- first + items[i]- 1
ord <- sort(abs(loads[first:last,i+3]),decreasing=TRUE,index.return=TRUE)
loads[first:last,] <- loads[ord$ix+first-1,]
rownames(loads)[first:last] <- rownames(loads)[ord$ix+first-1]
first <- first + items[i]}
}
if (first < nitems) loads[first:nitems,"cluster"] <- 0 #assign items less than cut to 0
ICLUST.sort <- list(sorted=loads) }
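#example usage (a sketch): sort the loadings from a previous run, flagging loadings below .3
# ICLUST.sort(ic.load=iq.clus$loadings,cut=.3)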
ICLUST.cluster <- function (r.mat,ICLUST.options) {#should allow for raw data, correlation or covariances
#options: alpha =1 (minimum alpha) 2 (average alpha) 3 (maximum alpha)
# beta =1 (minimum beta) 2 (average beta) 3 (maximum beta)
# correct for reliability
# reverse score items if negative correlations
# stop clustering if beta for new clusters < beta.min
# output =1 (short) 2 (show steps) 3 show rejects as we go
#initialize various arrays and get ready for the first pass
output <- ICLUST.options$output
num.var <- nrow(r.mat)
keep.clustering <- TRUE #used to determine when we are finished clustering
results <- data.frame(matrix(rep(0,18*(num.var-1)),ncol=18))
names(results) <- c("Item/Cluster", "Item/Cluster","similarity","correlation","alpha1","alpha2",
"beta1","beta2","size1","size2","rbar1","rbar2","r1","r2","alpha","beta","rbar","size")
rownames(results) <- paste("C",1:(num.var-1),sep="")
clusters <- diag(1,nrow =nrow(r.mat)) #original cluster structure is 1 item clusters
rownames(clusters) <- rownames(r.mat)
colnames(clusters) <- paste("V",1:num.var,sep="")
count=1
#master loop
while (keep.clustering) { #loop until we figure out we should stop
#find similarities
#we will do most of the work on a copy of the r.mat
cluster.stats <- cluster.cor(clusters,r.mat,FALSE)
sim.mat <- cluster.stats$cor #the correlation matrix
diag(sim.mat) <- 0 #we don't want 1's on the diagonal to mess up the maximum
#two ways to estimate reliability -- for 1 item clusters, max correlation, for >1, alpha
#this use of initial max should be an option
if (ICLUST.options$correct) { #find the largest and smallest similarities for each variable
row.range <- apply(sim.mat,1,range,na.rm=TRUE)
row.max <- pmax(abs(row.range[1,]),abs(row.range[2,])) #find the largest absolute similarity
} else {row.max <- rep(1, nrow(sim.mat)) } #don't correct for largest similarity
item.rel <- cluster.stats$alpha
for (i in 1: length(item.rel)) { if (cluster.stats$size[i]<2) {
item.rel[i] <- row.max[i]
#figure out item betas here?
}}
sq.max <- diag(1/sqrt(item.rel)) #used to correct for reliabilities
#this is the corrected for maximum r similarities
if (ICLUST.options$correct) {sim <- sq.max %*% sim.mat %*% sq.max
} else {sim <- sim.mat}
diag(sim) <- NA #we need to not consider the diagonal when looking for maxima
#find the most similar pair and apply tests if we should combine
test.alpha <- FALSE
test.beta <- FALSE
while(!(test.alpha&test.beta)){
max.cell <- which.max(sim) #global maximum
if (length(max.cell) < 1) {
keep.clustering <- FALSE
break} #there are no non-NA values left
sign.max <- 1
if ( ICLUST.options$reverse ) { #normal case is to reflect if necessary
min.cell <- which.min(sim) #location of global minimum
if (sim[max.cell] < abs(sim[min.cell] )) {
sign.max <- -1
max.cell <- min.cell }
if (sim[max.cell] < 0.0) {sign.max <- -1 }}
#this is a weird case where all the similarities are negative
max.col <- trunc(max.cell/nrow(sim))+1 #which row and column is the maximum in?
max.row <- max.cell - (max.col-1)*nrow(sim) #need to fix the case of first column
if (max.row < 1) {max.row <- nrow(sim)
max.col <- max.col-1 }
#combine these two rows if the various criterion are passed
beta.combined <- 2* sign.max*sim.mat[max.cell]/(1+sign.max* sim.mat[max.cell]) #unweighted beta
size1 <- cluster.stats$size[max.row]
if(size1 < 2) {V1 <- 1
beta1 <- item.rel[max.row]
alpha1 <- item.rel[max.row]
rbar1 <- item.rel[max.row] } else {
rbar1 <- results[cluster.names[max.row],"rbar"]
beta1 <- results[cluster.names[max.row],"beta"]
alpha1 <- results[cluster.names[max.row],"alpha"]}
V1 <- size1 + size1*(size1-1) * rbar1
size2 <- cluster.stats$size[max.col]
if(size2 < 2) {V2 <- 1
beta2 <- item.rel[max.col]
alpha2 <- item.rel[max.col]
rbar2 <- item.rel[max.col] } else {
rbar2 <- results[cluster.names[max.col],"rbar"]
beta2 <- results[cluster.names[max.col],"beta"]
alpha2 <- results[cluster.names[max.col],"alpha"]}
V2 <- size2 + size2 * (size2-1) * rbar2
Cov12 <- sign.max* sim.mat[max.cell] * sqrt(V1*V2)
V12 <- V1 + V2 + 2 * Cov12
size12 <- size1 + size2
alpha <- (V12 - size12)*(size12/(size12-1))/V12
rbar <- alpha/(size12-alpha*(size12-1))
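#alpha above is the standard formula applied to the composite variance of unit variance items:
# alpha = (n/(n-1)) * (V - n)/V; rbar then comes from inverting the Spearman-Brown formula:
# rbar = alpha/(n - alpha*(n-1))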
#what is the correlation of this new cluster with the two subclusters?
#this considers item overlap problems
c1 <- sign.max*rbar1*size1*size1 + sign.max* Cov12 #corrects for item overlap
c2 <- rbar2*size2*size2 + Cov12 #only flip one of the two correlations with the combined cluster
if(size1 > size2) {r1 <- c1/sqrt(V1*V12)
r2 <- sign.max* c2/sqrt(V2*V12) } else {r1 <-sign.max* c1/sqrt(V1*V12)
#flip the smaller of the two clusters
r2 <- c2/sqrt(V2*V12) }
#test if we should combine these two clusters
#first, does alpha increase?
test.alpha <- TRUE
if (ICLUST.options$alpha.size < min(size1,size2)) {
switch(ICLUST.options$alpha, {if (alpha < min(alpha1,alpha2)) {if (output>2) {print(
paste ('do not combine ', cluster.names[max.row],"with", cluster.names[max.col],
'new alpha =', round (alpha,2),'old alpha1 =',round( alpha1,2),"old alpha2 =",round(alpha2,2)))}
test.alpha <- FALSE }},
{if (alpha < mean(c(alpha1,alpha2))) {if (output>2) {print(paste ('do not combine ',
cluster.names[max.row],"with", cluster.names[max.col],'new alpha =', round (alpha,2),
'old alpha1 =',round( alpha1,2),"old alpha2 =",round(alpha2,2)))}
test.alpha <- FALSE }},
{if (alpha < max(alpha1,alpha2)) {if (output>2) {print(paste ('do not combine ',
cluster.names[max.row],"with", cluster.names[max.col],'new alpha =', round (alpha,2),
'old alpha1 =',round( alpha1,2),"old alpha2 =",round(alpha2,2)))}
test.alpha <- FALSE }}) #end switch
} #end if options$alpha.size
#second, does beta increase ?
test.beta <- TRUE
if (ICLUST.options$beta.size < min(size1,size2)) {
switch(ICLUST.options$beta, {if (beta.combined < min(beta1,beta2)) {if (output>2) {print(
paste ('do not combine ', cluster.names[max.row],"with", cluster.names[max.col],'new beta =',
round (beta.combined,2),'old beta1 =',round( beta1,2),"old beta2 =",round(beta2,2)))}
test.beta <- FALSE }},
{if (beta.combined < mean(c(beta1,beta2))) {if (output>2) {print(paste ('do not combine ',
cluster.names[max.row],"with", cluster.names[max.col],'new beta =', round (beta.combined,2),
'old beta1 =',round( beta1,2),"old beta2 =",round(beta2,2)))}
test.beta <- FALSE }},
{if (beta.combined < max(beta1,beta2)) {if (output>2) {print(paste ('do not combine ',
cluster.names[max.row],"with", cluster.names[max.col],'new beta =', round (beta.combined,2),
'old beta1 =',round( beta1,2),"old beta2 =",round(beta2,2)))}
test.beta <- FALSE }}) #end switch
} #end if options$beta.size
if(test.beta&test.alpha) {break } else {
if (beta.combined < ICLUST.options$beta.min) {
keep.clustering <- FALSE #the most similar pair is not very similar, we should quit
break} else {sim[max.row,max.col] <- NA
sim[max.col,max.row] <- NA }
} #end of test.beta&test.alpha
} #end of while test.alpha&test.beta.loop
#combine and summarize
if (keep.clustering)
{ # we have passed the alpha and beta tests, now combine these two variables
clusters[,max.row] <- clusters[,max.row] + sign.max * clusters[,max.col]
cluster.names <- colnames(clusters)
#summarize the results
results[count,1] <- cluster.names[max.row]
results[count,2] <- cluster.names[max.col]
results[count,"similarity"] <- sim[max.cell]
results[count,"correlation"] <- sim.mat[max.cell]
results[count,"alpha1"] <- item.rel[max.row]
results[count,"alpha2"] <- item.rel[max.col]
size1 <- cluster.stats$size[max.row]
size2 <- cluster.stats$size[max.col]
results[count,"size1"] <- size1
results[count,"size2"] <- size2
results[count,"beta1"] <- beta1
results[count,"beta2"] <- beta2
results[count,"rbar1"] <- rbar1
results[count,"rbar2"] <- rbar2
results[count,"r1"] <- r1
results[count,"r2"] <- r2
results[count,"beta"] <- beta.combined
results[count,'alpha'] <- alpha
results[count,'rbar'] <- rbar
results[count,"size"] <- size12
results[count,3:18] <- round(results[count,3:18],ICLUST.options$digits)
#update
cluster.names[max.row] <- paste("C",count,sep="")
colnames(clusters) <- cluster.names
clusters <- clusters[,-max.col]
cluster.names<- colnames(clusters)
#row.max <- row.max[-max.col]
} #end of combine section
if(output > 1) print(results[count,],digits=2)
count=count+1
if ((num.var - count) < ICLUST.options$n.clus) {keep.clustering <- FALSE}
if(num.var - count < 1) {keep.clustering <- FALSE} #only one cluster left
} #end of keep clustering loop
ICLUST.cluster <- list(results=results,clusters=clusters,number=num.var-count)
} # end ICLUST.cluster
cluster.loadings <-
function (keys, r.mat, correct = TRUE,digits=2)
{
if (!is.matrix(keys))
keys <- as.matrix(keys)
item.covar <- r.mat %*% keys #item by cluster covariances
covar <- t(keys) %*% item.covar #variance/covariance of clusters
var <- diag(covar)
sd.inv <- 1/sqrt(var)
key.count <- diag(t(keys) %*% keys) #how many items in each cluster?
if (correct) {
cluster.correct <- diag((key.count/(key.count - 1)))
for (i in 1:ncol(keys)) {
if (key.count[i]==1) { #fix the case of 1 item keys
cluster.correct[i,i] <- 1
} else { cluster.correct[i,i] <- key.count[i]/(key.count[i]-1)
item.covar[,i] <- item.covar[,i] - keys[,i]}
} #i loop
correction.factor <- keys %*% cluster.correct
correction.factor[ correction.factor < 1] <- 1
item.covar <- item.covar * correction.factor
}
ident.sd <- diag(sd.inv, ncol = length(sd.inv))
cluster.loading <- item.covar %*% ident.sd
cluster.correl <- ident.sd %*% covar %*% ident.sd
key.alpha <- ((var - key.count)/var) * (key.count/(key.count - 1))
key.alpha[is.nan(key.alpha)] <- 1
key.alpha[!is.finite(key.alpha)] <- 1
colnames(cluster.loading) <- colnames(keys)
colnames(cluster.correl) <- colnames(keys)
rownames(cluster.correl) <- colnames(keys)
rownames(cluster.loading) <- rownames(r.mat)
if( ncol(keys) >1) {cluster.corrected <- correct.cor(cluster.correl, t(key.alpha))} else {cluster.corrected <- cluster.correl}
return(list(loadings=round(cluster.loading,digits), cor=round(cluster.correl,digits),corrected=round(cluster.corrected,digits), sd = round(sqrt(var),digits), alpha = round(key.alpha,digits),
size = key.count))
}
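#example usage (a sketch): loadings and alphas for keys derived from a previous solution
# keys <- factor2cluster(iq.clus$loadings) #factor2cluster is used the same way inside ICLUST above
# cluster.loadings(keys,r.mat)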
cluster.fit <- function(original,load,clusters,diagonal=FALSE,digits=2) {
sqoriginal <- original*original #squared correlations
totaloriginal <- sum(sqoriginal) - diagonal*sum(diag(sqoriginal) ) #sum of squared correlations - the diagonal
load <- as.matrix(load)
clusters <- as.matrix(clusters)
model <- load %*% t(load) #reproduce the correlation matrix by the factor law R= FF'
residual <- original-model #find the residual R* = R - FF'
sqresid <- residual*residual #square the residuals
totalresid <- sum(sqresid)- diagonal * sum(diag(sqresid) ) #sum squared residuals - the main diagonal
fit <- 1-totalresid/totaloriginal #fit is 1 - sum squared residuals/sum squared original (of off-diagonal elements)
clusters <- abs(clusters)
model.1 <- (load * clusters) %*% t(load*clusters)
residual <- original - model.1
sqresid <- residual*residual #square the residuals
totalresid <- sum(sqresid)- diagonal * sum(diag(sqresid) ) #sum squared residuals - the main diagonal
fit.1 <- 1-totalresid/totaloriginal #fit is 1 - sum squared residuals/sum squared original (of off-diagonal elements)
cluster.fit <- list(clusterfit=round(fit.1,digits),factorfit=round(fit,digits))
}
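#example usage (a sketch): how well do the loadings from a previous run reproduce the correlations?
# cluster.fit(r.mat,iq.clus$loadings,iq.clus$clusters)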