# Principal Component Analysis (PCA) and Factor Analysis (FA)

Source: `R/factor_analysis.R`, `R/principal_components.R`, `R/utils_pca_efa.R`


The functions `principal_components()` and `factor_analysis()` can be used
to perform a principal component analysis (PCA) or a factor analysis (FA).
They return the loadings as a data frame, and various methods and functions
are available to access / display other information (see the Details
section).

## Usage

```
factor_analysis(
  x,
  n = "auto",
  rotation = "none",
  sort = FALSE,
  threshold = NULL,
  standardize = TRUE,
  cor = NULL,
  ...
)

principal_components(
  x,
  n = "auto",
  rotation = "none",
  sparse = FALSE,
  sort = FALSE,
  threshold = NULL,
  standardize = TRUE,
  ...
)

rotated_data(pca_results)

# S3 method for parameters_efa
predict(object, newdata = NULL, names = NULL, keep_na = TRUE, ...)

# S3 method for parameters_efa
print(x, digits = 2, sort = FALSE, threshold = NULL, labels = NULL, ...)

# S3 method for parameters_efa
sort(x, ...)

closest_component(pca_results)
```

## Arguments

- x
A data frame or a statistical model.

- n
Number of components to extract. If `n = "all"`, then `n` is set to the number of variables minus 1 (`ncol(x) - 1`). If `n = "auto"` (default) or `n = NULL`, the number of components is selected through `n_factors()` resp. `n_components()`. In `reduce_parameters()`, can also be `"max"`, in which case it will select all the components that are maximally pseudo-loaded (i.e., correlated) by at least one variable.

- rotation
If not `"none"`, the PCA / FA will be computed using the **psych** package. Possible options include `"varimax"`, `"quartimax"`, `"promax"`, `"oblimin"`, `"simplimax"`, or `"cluster"` (and more). See `psych::fa()` for details.

- sort
Whether to sort the loadings.

- threshold
A value between 0 and 1 indicates which (absolute) values from the loadings should be removed. An integer higher than 1 indicates the n strongest loadings to retain. Can also be `"max"`, in which case it will only display the maximum loading per variable (the most simple structure).

- standardize
A logical value indicating whether the variables should be standardized (centered and scaled) to have unit variance before the analysis (in general, such scaling is advisable).

- cor
An optional correlation matrix that can be used (note that the data must still be passed as the first argument). If `NULL`, will compute it by running `cor()` on the passed data.

- ...
Arguments passed to or from other methods.

- sparse
Whether to compute sparse PCA (SPCA, using `sparsepca::spca()`). SPCA attempts to find sparse loadings (with few nonzero values), which improves interpretability and avoids overfitting. Can be `TRUE` or `"robust"` (see `sparsepca::robspca()`).

- pca_results
The output of the `principal_components()` function.

- object
An object of class `parameters_pca` or `parameters_efa`.

- newdata
An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.

- names
Optional character vector to name the columns of the returned data frame.

- keep_na
Logical, if `TRUE`, predictions also return observations with missing values from the original data, so that the predicted data has the same number of rows as the original data.

- digits, labels
Arguments for `print()`.

## Details

### Methods and Utilities

- `n_components()` and `n_factors()` automatically estimate the optimal number of dimensions to retain.
- `performance::check_factorstructure()` checks the suitability of the data for factor analysis using sphericity (see `performance::check_sphericity_bartlett()`) and the KMO (see `performance::check_kmo()`) measure.
- `performance::check_itemscale()` computes various measures of internal consistency applied to the (sub)scales (i.e., components) extracted from the PCA.
- Running `summary()` returns information related to each component/factor, such as the explained variance and the eigenvalues.
- Running `get_scores()` computes scores for each subscale.
- Running `closest_component()` will return a numeric vector with the assigned component index for each column from the original data frame.
- Running `rotated_data()` will return the rotated data, including missing values, so it matches the original data frame.
- Running `plot()` visually displays the loadings (this requires the **see** package to work).

### Complexity

Complexity represents the number of latent components needed to account
for the observed variables. Whereas a perfect simple structure solution
has a complexity of 1 in that each item would only load on one factor,
a solution with evenly distributed items has a complexity greater than 1
(*Hofmann, 1978; Pettersson and Turkheimer, 2010*).
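
Hofmann's (1978) item complexity can be computed from a loadings matrix as the squared sum of squared loadings divided by the sum of fourth-power loadings. A minimal base-R sketch, using a made-up two-factor loadings matrix for illustration:

```r
# Hofmann (1978) complexity per item: (sum of squared loadings)^2 /
# (sum of loadings^4), computed row-wise on an illustrative matrix.
loadings <- rbind(
  item1 = c(0.8, 0.0),  # perfect simple structure -> complexity 1
  item2 = c(0.5, 0.5)   # evenly split loadings    -> complexity 2
)
complexity <- rowSums(loadings^2)^2 / rowSums(loadings^4)
round(complexity, 2)
#> item1 item2
#>     1     2
```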

### Uniqueness

Uniqueness represents the variance that is 'unique' to the variable and
not shared with other variables. It is equal to `1 - communality`
(variance that is shared with other variables). A uniqueness of `0.20`
suggests that `20%` of that variable's variance is not shared with other
variables in the overall factor model. The greater the 'uniqueness', the
lower the relevance of the variable in the factor model.
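
This relationship can be sketched in base R: the communality of an item is the sum of its squared loadings across all factors, and uniqueness is its complement (made-up loadings for illustration):

```r
# Uniqueness = 1 - communality, where communality is the sum of squared
# loadings of an item across all factors (illustrative values).
loadings <- c(0.6, 0.5, 0.4)   # one item's loadings on three factors
communality <- sum(loadings^2) # 0.36 + 0.25 + 0.16 = 0.77
uniqueness <- 1 - communality
uniqueness
#> [1] 0.23
```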

### MSA

MSA represents the Kaiser-Meyer-Olkin Measure of Sampling Adequacy
(*Kaiser and Rice, 1974*) for each item. It indicates whether there is
enough data for each factor to give reliable results for the PCA. The value
should be > 0.6, and desirable values are > 0.8 (*Tabachnick and Fidell, 2013*).

### PCA or FA?

There is a simplified rule of thumb that may help to decide whether to run
a factor analysis or a principal component analysis:

- Run *factor analysis* if you assume or wish to test a theoretical model
  of *latent factors* causing observed variables.
- Run *principal component analysis* if you want to simply *reduce* your
  correlated observed variables to a smaller set of important independent
  composite variables.

(Source: CrossValidated)

### Computing Item Scores

Use `get_scores()` to compute scores for the "subscales" represented by the
extracted principal components. `get_scores()` takes the results from
`principal_components()` and extracts the variables for each component found
by the PCA. Then, for each of these "subscales", raw means are calculated
(which equals adding up the single items and dividing by the number of items).
This results in a sum score for each component from the PCA, which is on the
same scale as the original, single items that were used to compute the PCA.
One can also use `predict()` to back-predict scores for each component, to
which one can provide `newdata` or a vector of `names` for the components.
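
The raw-mean computation can be mimicked in base R. The sketch below assumes the item-to-component assignment shown in the Examples (`closest_component()` on `mtcars[, 1:5]`) and simply averages the original items within each component:

```r
# Raw mean "subscale" scores: average the original items assigned to each
# component. The assignment vector here mirrors the closest_component()
# output from the Examples section.
assignment <- c(mpg = 1, cyl = 1, disp = 1, hp = 1, drat = 2)
scores <- sapply(split(names(assignment), assignment), function(items) {
  rowMeans(mtcars[, items, drop = FALSE])
})
head(scores, 3)  # one column of scores per component
```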

### Explained Variance and Eigenvalues

Use `summary()` to get the eigenvalues and the explained variance for each
extracted component. The eigenvectors and eigenvalues represent the "core"
of a PCA: The eigenvectors (the principal components) determine the
directions of the new feature space, and the eigenvalues determine their
magnitude. In other words, the eigenvalues explain the variance of the
data along the new feature axes.
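
For standardized data, these quantities come straight from the eigendecomposition of the correlation matrix; a minimal base-R sketch reproducing the proportions of explained variance reported in the Examples:

```r
# Eigenvalues of the correlation matrix and the share of total variance
# each principal component explains (PC1 ~ 72.66% for mtcars[, 1:7]).
ev <- eigen(cor(mtcars[, 1:7]))$values
explained <- ev / sum(ev)
round(cumsum(explained), 4)  # cumulative proportion of explained variance
```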

## References

Kaiser, H. F., and Rice, J. (1974). Little Jiffy, Mark IV. Educational and Psychological Measurement, 34(1), 111-117

Hofmann, R. (1978). Complexity and simplicity as objective indices descriptive of factor solutions. Multivariate Behavioral Research, 13:2, 247-250, doi:10.1207/s15327906mbr1302_9

Pettersson, E., & Turkheimer, E. (2010). Item selection, evaluation, and simple structure in personality data. Journal of research in personality, 44(4), 407-420, doi:10.1016/j.jrp.2010.03.002

Tabachnick, B. G., and Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston: Pearson Education.

## Examples

```
library(parameters)
# \donttest{
# Principal Component Analysis (PCA) -------------------
principal_components(mtcars[, 1:7], n = "all", threshold = 0.2)
#> # Loadings from Principal Component Analysis (no rotation)
#>
#> Variable | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | Complexity
#> -------------------------------------------------------------------
#> mpg | -0.93 | | | -0.30 | | | 1.30
#> cyl | 0.96 | | | | | -0.21 | 1.18
#> disp | 0.95 | | | -0.23 | | | 1.16
#> hp | 0.87 | 0.36 | | | 0.30 | | 1.64
#> drat | -0.75 | 0.48 | 0.44 | | | | 2.47
#> wt | 0.88 | -0.35 | 0.26 | | | | 1.54
#> qsec | -0.54 | -0.81 | | | | | 1.96
#>
#> The 6 principal components accounted for 99.30% of the total variance of the original data (PC1 = 72.66%, PC2 = 16.52%, PC3 = 4.93%, PC4 = 2.26%, PC5 = 1.85%, PC6 = 1.08%).
#>
# Automated number of components
if (require("nFactors")) {
  principal_components(mtcars[, 1:4], n = "auto")
}
#> # Loadings from Principal Component Analysis (no rotation)
#>
#> Variable | PC1 | Complexity
#> -----------------------------
#> mpg | -0.93 | 1.00
#> cyl | 0.96 | 1.00
#> disp | 0.95 | 1.00
#> hp | 0.91 | 1.00
#>
#> The unique principal component accounted for 87.55% of the total variance of the original data.
#>
# Sparse PCA
if (require("sparsepca")) {
  principal_components(mtcars[, 1:7], n = 4, sparse = TRUE)
  principal_components(mtcars[, 1:7], n = 4, sparse = "robust")
}
#> Loading required package: sparsepca
#> # Loadings from Principal Component Analysis (no rotation)
#>
#> Variable | PC1 | PC2 | PC3 | PC4 | Complexity
#> -----------------------------------------------------
#> mpg | -0.92 | 0.03 | -0.11 | -0.31 | 1.27
#> cyl | 1.00 | 0.07 | -0.07 | -0.05 | 1.03
#> disp | 0.96 | -0.06 | 0.08 | -0.23 | 1.14
#> hp | 0.74 | 0.32 | 0.07 | 0.00 | 1.38
#> drat | -0.68 | 0.46 | 0.47 | -0.03 | 2.62
#> wt | 1.03 | -0.32 | 0.24 | -0.03 | 1.31
#> qsec | -0.49 | -0.85 | 0.17 | 0.00 | 1.69
#>
#> The 4 principal components accounted for 96.42% of the total variance of the original data (PC1 = 72.75%, PC2 = 16.53%, PC3 = 4.91%, PC4 = 2.24%).
#>
# Rotated PCA
if (require("psych")) {
  principal_components(mtcars[, 1:7],
    n = 2, rotation = "oblimin",
    threshold = "max", sort = TRUE
  )
  principal_components(mtcars[, 1:7], n = 2, threshold = 2, sort = TRUE)
  pca <- principal_components(mtcars[, 1:5], n = 2, rotation = "varimax")
  pca # Print loadings
  summary(pca) # Print information about the factors
  predict(pca, names = c("Component1", "Component2")) # Back-predict scores
  # which variables from the original data belong to which extracted component?
  closest_component(pca)
}
#> mpg cyl disp hp drat
#> 1 1 1 1 2
# }
# Factor Analysis (FA) ------------------------
if (require("psych")) {
  factor_analysis(mtcars[, 1:7], n = "all", threshold = 0.2)
  factor_analysis(mtcars[, 1:7], n = 2, rotation = "oblimin", threshold = "max", sort = TRUE)
  factor_analysis(mtcars[, 1:7], n = 2, threshold = 2, sort = TRUE)
  efa <- factor_analysis(mtcars[, 1:5], n = 2)
  summary(efa)
  predict(efa)
  # \donttest{
  # Automated number of components
  factor_analysis(mtcars[, 1:4], n = "auto")
  # }
}
#> Warning: Could not retrieve information about missing data. Returning only
#> complete cases.
#> # Loadings from Factor Analysis (no rotation)
#>
#> Variable | MR1 | Complexity | Uniqueness
#> ------------------------------------------
#> mpg | -0.90 | 1.00 | 0.19
#> cyl | 0.96 | 1.00 | 0.08
#> disp | 0.93 | 1.00 | 0.13
#> hp | 0.86 | 1.00 | 0.26
#>
#> The unique latent factor accounted for 83.55% of the total variance of the original data.
#>
```