Compute a hierarchical or k-means cluster analysis and return the group assignment for each observation as a vector.

cluster_analysis(
  x,
  n_clusters = NULL,
  method = c("hclust", "kmeans"),
  distance = c("euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski"),
  agglomeration = c("ward", "ward.D", "ward.D2", "single", "complete", "average",
    "mcquitty", "median", "centroid"),
  iterations = 20,
  algorithm = c("Hartigan-Wong", "Lloyd", "MacQueen"),
  force = TRUE,
  package = c("NbClust", "mclust"),
  verbose = TRUE
)

Arguments

x

A data frame.

n_clusters

Number of clusters used for the cluster solution. By default (NULL), the number of clusters to extract is determined by calling n_clusters().

method

Method for computing the cluster analysis. By default ("hclust"), a hierarchical cluster analysis is computed. Use "kmeans" to compute a k-means cluster analysis. You can specify the initial letters only.

distance

Distance measure to be used when method = "hclust" (for hierarchical clustering). Must be one of "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See dist. If method = "kmeans", this argument is ignored.

agglomeration

Agglomeration method to be used when method = "hclust" (for hierarchical clustering). This should be one of "ward", "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median" or "centroid". Default is "ward" (see hclust). If method = "kmeans", this argument is ignored.
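As a rough illustration of what the distance and agglomeration arguments control, the following sketch runs a hierarchical clustering with base R functions only. Standardizing the data first and choosing "ward.D2" as linkage are assumptions made for this example, not a description of what cluster_analysis() does internally.

```r
# Sketch with base R (stats) only; not the internals of cluster_analysis()
x <- scale(iris[, 1:4])                   # standardize (assumption for this sketch)
d <- dist(x, method = "euclidean")        # plays the role of the `distance` argument
hc <- hclust(d, method = "ward.D2")       # plays the role of the `agglomeration` argument
groups_hc <- cutree(hc, k = 3)            # one group label per observation
table(groups_hc)                          # cluster sizes
```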

iterations

Maximum number of iterations allowed. Only applies if method = "kmeans". See kmeans for details on this argument.

algorithm

Algorithm used for calculating the k-means clusters. Only applies if method = "kmeans". May be one of "Hartigan-Wong" (default), "Lloyd" (used by SPSS), or "MacQueen". See kmeans for details on this argument.
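For orientation, iterations and algorithm have direct counterparts in stats::kmeans(), namely iter.max and algorithm. The sketch below calls kmeans() directly; the seed, the standardization step and the choice of three centers are assumptions for this example.

```r
# Sketch with base R (stats) only; not the internals of cluster_analysis()
set.seed(123)                             # k-means starts from random centers
x <- scale(iris[, 1:4])                   # standardize (assumption for this sketch)
km <- kmeans(
  x,
  centers = 3,                            # number of clusters
  iter.max = 20,                          # counterpart of `iterations`
  algorithm = "Hartigan-Wong"             # counterpart of `algorithm`
)
groups_km <- km$cluster                   # one group label per observation
table(groups_km)                          # cluster sizes
```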

force

Logical, if TRUE, ordered factors (ordinal variables) are converted to numeric values, while character vectors and factors are converted to dummy-variables (numeric 0/1) and are included in the cluster analysis. If FALSE, factors and character vectors are removed before computing the cluster analysis.
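The kind of conversion described for force = TRUE can be sketched with base R as follows. The data frame and its column names are hypothetical, and the exact dummy coding used by cluster_analysis() (for example, whether every level gets its own column) is not stated here, so treat this as an illustration only.

```r
# Hypothetical example data: one ordered factor, one unordered factor, one numeric
d <- data.frame(
  size  = factor(c("low", "high", "mid", "low"),
                 levels = c("low", "mid", "high"), ordered = TRUE),
  color = factor(c("red", "blue", "red", "green")),
  value = c(1.5, 2.0, 0.5, 3.0)
)

# Ordered factor -> numeric values following the level order
d$size <- as.numeric(d$size)

# Unordered factor -> 0/1 dummy variables, one column per level (assumption)
dummies <- model.matrix(~ color - 1, data = d)

# All-numeric data that could be passed to a clustering routine
d_numeric <- cbind(d[c("size", "value")], dummies)
d_numeric
```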

package

Package(s) whose methods are called to determine the number of clusters. Can be "all" or a character vector containing any of "NbClust", "mclust", "cluster" and "M3C".

verbose

Toggle warnings and messages.

Value

The group classification for each observation, as a vector. Observations with missing values are kept in the returned vector (as missing), so it has the same length as nrow(x).
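One way to obtain a result vector of this shape with base R is to cluster only the complete rows and write the labels back into a vector of length nrow(x). This is a sketch under the assumption that incomplete observations receive NA; the injected missing values and the clustering settings are chosen for illustration.

```r
# Sketch with base R (stats) only; not the internals of cluster_analysis()
x <- iris[, 1:4]
x[c(5, 50), 1] <- NA                      # inject two missing values for the demo

complete <- complete.cases(x)             # rows without any missing value
hc <- hclust(dist(scale(x[complete, ])), method = "ward.D2")

groups <- rep(NA_integer_, nrow(x))       # start with NA for every observation
groups[complete] <- cutree(hc, k = 3)     # fill in labels for complete rows only

length(groups) == nrow(x)                 # vector lines up with the rows of x
```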

Details

The print() and plot() methods show the (standardized) mean value for each variable within each cluster. Thus, a higher absolute value indicates that a certain variable characteristic is more pronounced within that specific cluster (as compared to other cluster groups with lower absolute mean values).
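The idea behind such a table of standardized means can be reproduced approximately with base R: z-standardize each variable, then average within each group. The grouping below comes from a plain hclust() call with "ward.D2" linkage, which is an assumption for this sketch, so the numbers need not match the output of cluster_analysis().

```r
# Sketch with base R (stats) only; not the internals of the print() method
z <- as.data.frame(scale(iris[, 1:4]))    # z-scores: mean 0, sd 1 per variable
grp <- cutree(hclust(dist(z), method = "ward.D2"), k = 3)

# Mean z-score of every variable within every group
cluster_means <- aggregate(z, by = list(Group = grp), FUN = mean)
round(cluster_means, 2)
```

Values far from zero mark variables on which a group differs most from the overall sample mean.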

Note

There is also a plot()-method implemented in the see-package.

References

Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K (2014) cluster: Cluster Analysis Basics and Extensions. R package.

See also

n_clusters to determine the number of clusters to extract, cluster_discrimination to determine the accuracy of cluster group classification and check_clusterstructure to check suitability of data for clustering.

Examples

# Hierarchical clustering of the iris dataset
groups <- cluster_analysis(iris[, 1:4], 3)
groups
#> # Cluster Analysis (mean z-score by cluster)
#> 
#>          Term Group 1 Group 2 Group 3
#>  Sepal.Length   -1.01    0.09    1.24
#>   Sepal.Width    0.85   -0.70    0.07
#>  Petal.Length   -1.30    0.38    1.14
#>   Petal.Width   -1.25    0.31    1.19
#> 
#> # Accuracy of Cluster Group Classification
#> 
#>  Group Accuracy
#>      1  100.00%
#>      2   95.31%
#>      3   97.22%
#> 
#> Overall accuracy of classification: 97.33%
#> 

# K-means clustering of the iris dataset, auto-detection of cluster groups
if (FALSE) {
groups <- cluster_analysis(iris[, 1:4], method = "k")
groups
}