R Biplot Example Csv

  • Preliminaries
    • Extracting the variables
  • Principal Component Analysis
    • Principal component scores
  • Multidimensional scaling
    • Visualization


Principal Component Analysis (PCA) is a useful technique for exploratory data analysis, allowing you to better visualize the variation present in a dataset with many variables. It is particularly helpful for 'wide' datasets, where you have many variables for each sample. In this tutorial, you'll discover PCA in R. Data standardization: in principal component analysis, variables are often scaled (i.e., standardized). This is particularly recommended when variables are measured on different scales (e.g., kilograms, kilometers, centimeters); otherwise, the PCA output will be severely affected.

Introduction

  • Of the variance of the data.

        # summary method
        summary(ir.pca)

        Importance of components:
                                  PC1    PC2     PC3     PC4
        Standard deviation     1.7125 0.9524 0.36470 0.16568
        Proportion of Variance 0.7331 0.2268 0.03325 0.00686
        Cumulative Proportion  0.7331 0.9599 0.99314 1.00000

We will consider principal components analysis (PCA) and multidimensional scaling (MDS) as examples of multivariate dimension reduction. Both techniques are included in the base R installation, respectively as prcomp and cmdscale. We will also use the (best practice) graphics package ggplot2 for our plots.

We will use the Guerry_85 file that contains observations on socio-economic characteristics for the 85 French departments in 1830.

Note: this is written with R beginners in mind; more seasoned R users can probably skip most of the comments on data structures and other R particulars. Also, as always in R, there are typically several ways to achieve a specific objective, so what is shown here is just one way that works. There are often others (that may even work faster, or scale better).

Items covered:

  • scaling a multivariate data set (i.e., standardizing to mean zero and variance one) using scale

  • computing principal components using prcomp

  • extracting loadings, scores and the proportion of explained variance

  • creating a scree plot to assess the proportion of variance explained and to select the number of meaningful components

  • using ggplot2 to create a scatter plot with meaningful labels

  • creating a biplot to interpret the relative contribution of two principal components

  • computing multivariate distance using dist

  • carrying out multidimensional scaling using cmdscale

  • plotting the results of multi-dimensional scaling using ggplot2

Packages used:

  • foreign

  • ggplot2

Preliminaries

As is customary by now, we start by loading the required packages (installing them first if necessary), reading in the data, and providing a summary of the data. The input file is Guerry_85.dbf, and we use read.dbf (from the foreign package) to turn it into an R data frame (the contents of your Guerry file may be slightly different).

Extracting the variables

We will extract a subset of the variables included in the data frame for use in the PCA. First, we define a list with all the variable names, and then we use the standard column subsetting of the initial data frame (note the empty space before the comma to specify that we select all observations or rows). We also summarize the new data frame.

Standardizing the variables

We standardize the variables by means of the scale command. We again also provide a summary. Obviously, the mean is zero for all variables. We check the variance for one selected variable (Crm_prs) and it is indeed one.

Note that the resulting object is a matrix and not a data.frame (the $ notation does not work to extract the Crm_prs column). We specify the column explicitly by giving its name (in quotes), preceded by a comma with empty space before the comma (meaning all the rows are selected).
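The R code for this step did not survive in this copy of the page. As a rough illustration of what the standardization does, here is a minimal NumPy sketch. It assumes that scale centers each column and divides by the sample standard deviation (divisor n - 1), which is also the divisor var uses. The data are made up and only mimic the shape of the Guerry subset (85 rows, 6 columns); the name standardize is hypothetical.

```python
import numpy as np

# Made-up stand-in for the selected Guerry variables: 85 rows, 6 columns
# measured on very different scales.
rng = np.random.default_rng(0)
raw = rng.normal(loc=[50, 10, 200, 5, 30, 1000],
                 scale=[5, 2, 40, 1, 6, 300],
                 size=(85, 6))


def standardize(x):
    """Center each column and divide by its sample standard deviation (n - 1)."""
    return (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)


vds = standardize(raw)

col_means = vds.mean(axis=0)        # numerically zero for every column
col_vars = vds.var(axis=0, ddof=1)  # numerically one for every column
print(np.round(col_means, 12))
print(np.round(col_vars, 12))
```

The result is a plain two-dimensional array, so a column is selected by position (for example vds[:, 0]), much like the matrix indexing described above.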

Principal Component Analysis

The computations for PCA are carried out by means of the prcomp function. Since we already scaled our variables, we do not need to specify this as an argument and the only item passed to the function is the name of the matrix containing the scaled variables, vds in our example (see the help file for other options).

The result of this computation is an object of the special class prcomp. It contains lots of information, which we can check with the usual str command.

A summary of the principal component object yields the standard deviation associated with each component (the variance corresponds to the eigenvalue), and the corresponding proportion and cumulative proportion of the explained variance. The standard deviations are also contained in the sdev attribute.

The standard deviations can also be extracted separately from the sdev element of the object (e.g., prc$sdev).

The results are not that great. Three components (out of six!) are needed to explain 75% of the variance, and even five do not explain 95%. This is due largely to the low correlation among the variables (cor).
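To make the link between standard deviations, eigenvalues and explained variance concrete, the sketch below mimics the computation with NumPy on made-up data. It relies on the documented behavior of prcomp (a singular value decomposition of the centered data, with variances using the divisor n - 1); everything else, including the variable names, is an assumption for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
# Made-up correlated data standing in for the scaled matrix vds (85 x 6).
raw = rng.normal(size=(85, 6)) @ rng.normal(size=(6, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Singular value decomposition of the standardized data: z = U diag(d) Vt.
u, d, vt = np.linalg.svd(z, full_matrices=False)
n = z.shape[0]
sdev = d / np.sqrt(n - 1)   # standard deviation of each component
variances = sdev ** 2       # these are the eigenvalues of the correlation matrix

# Cross-check with a direct eigendecomposition of the correlation matrix
# (eigvalsh returns ascending order, so reverse it).
eigvals = np.linalg.eigvalsh(np.corrcoef(z, rowvar=False))[::-1]

prop = variances / variances.sum()   # proportion of variance per component
print(np.round(sdev, 4))
print(np.round(prop, 4))
```

Because the variables are standardized, the variances add up to the number of variables (six here), so each proportion is simply the eigenvalue divided by six.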

Scree plot

The scree plot shows the proportion of variance explained as a decreasing function of the principal components (each component explains a little less than the previous component). This is used to “eyeball” a reasonable number of components to use in further analysis. Note that there is no point in using all the principal components because then there would be no dimension reduction. The whole objective is to capture most of the variance in the data by means of a small number of components.

To select this number, one looks for an elbow or kink in the scree plot, i.e., a meaningful change in the slope such that the additional variance explained is small relative to the previous PC.

The prcomp function does not include an explicit function to create a scree plot, but it is relatively straightforward to write a small function to accomplish this goal. The function scree_plot below does just that.

The function takes as argument the principal component object. The default is to create a scree plot. However, setting the option cumulative=TRUE creates the complement, i.e., a curve showing the cumulative variance explained. Note how that is computed using the cumsum command in the function. The plot is a simple line plot (type = “b”) with titles appropriate for each plot (this illustrates the use of if).
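The original scree_plot function is not reproduced in this copy. The sketch below is not that function; it only shows, in NumPy, the values such a plot would draw. The name scree_values is hypothetical, and the standard deviations are the ones from the summary(ir.pca) output quoted near the top of this page, used purely as sample input.

```python
import numpy as np


def scree_values(sdev, cumulative=False):
    """Return the proportion of variance per component, or its running total."""
    variances = np.asarray(sdev, dtype=float) ** 2
    prop = variances / variances.sum()
    if cumulative:
        # Same idea as cumsum in R: running total of the proportions.
        return np.cumsum(prop)
    return prop


# Standard deviations taken from the summary(ir.pca) output quoted earlier.
sdev = [1.7125, 0.9524, 0.36470, 0.16568]

prop = scree_values(sdev)
cum = scree_values(sdev, cumulative=True)
print(np.round(prop, 4))
print(np.round(cum, 4))
```

Plotting prop against the component number gives the scree plot; plotting cum gives the cumulative curve. Both are the same information in two forms.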

In our example, the scree plot is created using scree_plot(prc), as shown below.

Unlike most textbook examples, this plot does not have a clear kink. In part, this is due to the low correlations between the variables, which do not lend themselves to identifying common dimensions that explain a lot of the underlying variance.

The graph with the cumulative proportion of the explained variance is obtained by setting cumulative=TRUE, as shown below.

Here again, there is no clear kink. This is a graphical description of the variance proportions we saw above.

Loadings

The loadings (i.e., the coefficients that apply to each of the original variables to obtain the principal component score for an observation) are contained in the rotation attribute of the PC object. We extract it in the usual fashion. The loadings are the row elements for each of the columns of the matrix that correspond to a principal component.

The interpretation of the loadings can be tricky, but sometimes there is a clear interpretation when only a subset of the variables shows high values for the coefficients, or when the signs for a given variable are very different between the components. We return to this below when we consider the biplot.

More informative than the loadings is the matrix of squared correlations between the original variables and the principal components. We return to this below.

Principal component scores

The score for a principal component for each observation is obtained by multiplying the original values for the variables that went into the components by the matching loading (remember that all the variables were standardized, so it is the standardized version that gets multiplied by the loadings). A small number of these principal component scores can then be used instead of the full set of variables to represent a sizeable fraction of the variance in the data.

The component scores can be used as regular variables, in that they can be plotted, mapped, etc. However, keep in mind that they are orthogonal by construction, so the slope in a bivariate scatter plot of two components will always be zero (i.e., the linear fit will be a horizontal line).

The scores are contained in the x attribute of the principal component object.

They are stored as an object of class matrix.

And, of course, they are uncorrelated.
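As a sketch of the computation just described, the NumPy snippet below multiplies standardized (made-up) data by a loadings matrix and checks that the resulting scores are uncorrelated. The loadings come from a singular value decomposition, which is an assumption about how one might reproduce the rotation matrix outside R.

```python
import numpy as np

rng = np.random.default_rng(2)
# Made-up correlated data standing in for the Guerry variables (85 x 6).
raw = rng.normal(size=(85, 6)) @ rng.normal(size=(6, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Loadings: right singular vectors of z, one column per component.
_, _, vt = np.linalg.svd(z, full_matrices=False)
rotation = vt.T

# Score of observation i on component k:
# sum over variables j of z[i, j] * rotation[j, k].
scores = z @ rotation

# Correlations between different components should vanish.
score_corr = np.corrcoef(scores, rowvar=False)
off_diag = score_corr - np.diag(np.diag(score_corr))
print(np.abs(off_diag).max())   # numerically zero

# Slope of a linear fit of the second component on the first.
slope = np.polyfit(scores[:, 0], scores[:, 1], 1)[0]
print(slope)                    # numerically zero as well
```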

A more informative way to interpret the connection between the original variables and the scores is the squared correlation matrix.

The elements along each row give the proportion of the variance of the variable in that row explained by the respective principal component. The values in each column for a principal component give the squared correlation between the original variable and that component, suggesting the relative importance of the former in interpreting the latter.

The sum across each of the rows (i.e., variables) equals 1.
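The row-sum property can be checked numerically. The sketch below, again on made-up data and with NumPy standing in for the R commands, builds the matrix of squared correlations between the standardized variables (rows) and all six components (columns).

```python
import numpy as np

rng = np.random.default_rng(3)
# Made-up correlated data standing in for the Guerry variables (85 x 6).
raw = rng.normal(size=(85, 6)) @ rng.normal(size=(6, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

_, _, vt = np.linalg.svd(z, full_matrices=False)
scores = z @ vt.T

p = z.shape[1]
# corrcoef on the two blocks gives a (2p x 2p) matrix; the upper-right block
# holds the correlations between variables (rows) and components (columns).
corr = np.corrcoef(z, scores, rowvar=False)[:p, p:]
sq_corr = corr ** 2

row_sums = sq_corr.sum(axis=1)
print(np.round(sq_corr, 3))
print(np.round(row_sums, 6))
```

The rows add up to one only when all components are kept; with fewer components the row sum is the share of that variable's variance captured by the retained components.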

Converting the scores to a data frame

As we saw above, the scores are a matrix object, not a data frame. If we want to output these results (e.g., to join with a map in GeoDa), we need to turn them into a data frame.

In the few lines below, we first create a data frame from the matrix and then add the department ID (dept) as an additional variable. The resulting data frame can be written out to a csv file using write.csv and it can also be used by the plotting commands in ggplot (see further below).

Plotting the PC scores

We can now construct a scatter plot for any pair of principal component scores. For example, below we use ggplot to easily add the labels corresponding to the departments to the plot. It turns out this is a bit tricky in the standard plot command, and it gives us an excuse to start exploring the functionality of ggplot.

The ggplot grammar follows Wilkinson’s grammar of graphics, which is an elegant way to abstract the construction of a wide array of statistical graphs. A characteristic of the ggplot approach is that a graph is created incrementally. In its bare bones essentials, there are two commands. First is the specification of the data and variables for the plot, entered as arguments to the ggplot command. In our case, the data set is the just-created pcs1, and the two components PC1 and PC2 correspond to the x and y coordinates in the plot. In ggplot, this is specified through the aes attribute (for aesthetics).

Beyond this first command, a graph is built up by adding (literally, using the + sign) different geometric objects. In our case, we specify that the graph is a point graph (for x-y coordinates, this will give a scatter plot) by means of geom_point (all the different types are prefixed with geom_). In addition, we add the labels as text using geom_text. We specify the AREA_ID as the label (as part of an aes attribute) and we use nudge_y to move the label above the point (the default is to draw it on top of the point). For further specifics and options to make this look really fancy, check out the documentation of ggplot2 (or the excellent book on this package by its creator, Hadley Wickham).

We can write the data frame to a csv file (using write.csv) and then merge this with a layer for the French departments in the Guerry layer in GeoDa. Using linking and brushing, we can examine the extent to which neighbors in multivariate space (points close together in the scatter plot) are also geographical neighbors. We will revisit this issue when we deal with clusters.

Biplot

An alternative way to visualize the results of a principal components analysis is by means of a so-called biplot. For any pair of components, this combines the scatter plot with a set of vectors showing the loadings for each variable. The vector is centered at zero. Its x dimension shows the importance of the loading for that variable for the principal component on the x-axis. The y dimension does the same for the component on the y-axis.

An easy case to interpret is when the loading is large for one component and small for the other, which will result in a vector that runs almost parallel to the axis of the dominant component. Other easy cases are when the signs of the loadings are opposite. For example, a positive sign for PC1 and a negative sign for PC2 would give a vector pointing down and to the right. The information in the biplot confirms what we saw earlier in the matrix of squared correlations.

The biplot is invoked by the biplot command. The default is to plot the first two components, so that the only argument to the function is the principal component object. In our example, this is prc. We add the option scale=0 to make sure the arrows are scaled such that they reflect the loadings.
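To illustrate how the arrows relate to the loadings, the following NumPy sketch lists the arrow end point for each variable on the first two components and reports the direction it points in. It assumes, as the text states for scale=0, that the arrows are drawn at the loadings themselves. The data are made up and the variable names V1 to V6 are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
# Made-up correlated data standing in for the Guerry variables (85 x 6).
raw = rng.normal(size=(85, 6)) @ rng.normal(size=(6, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

_, _, vt = np.linalg.svd(z, full_matrices=False)
rotation = vt.T

# The arrow for variable j runs from (0, 0) to
# (loading on PC1, loading on PC2).
arrows = rotation[:, :2]
names = ["V{}".format(j + 1) for j in range(arrows.shape[0])]  # hypothetical names

for name, (dx, dy) in zip(names, arrows):
    horizontal = "right" if dx >= 0 else "left"
    vertical = "up" if dy >= 0 else "down"
    print("{}: ({:+.3f}, {:+.3f}) points {} and to the {}".format(
        name, dx, dy, vertical, horizontal))
```

Since the loadings of a variable across all components have unit length, no arrow in this setting can be longer than one; a short arrow means the variable loads mostly on components other than the two being plotted.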

Multidimensional scaling

Multidimensional scaling consists of finding a lower dimensional representation of the data that respects the multidimensional distance between observation pairs as much as possible.

In R, this is computed using the cmdscale command. It takes as input a dissimilarity or distance matrix, computed using the dist command. By default, this uses p-dimensional Euclidean distance, but several other options are available as well (see the documentation). In most circumstances, it is most appropriate to base the distance computation on standardized variables (use scale first).

The output of the MDS procedure is a matrix (not a data frame) with the coordinates in the lower-dimensional space (typically two dimensional) for each observation. These can be readily plotted. Note that unlike what holds for the principal components, it does not make sense to plot or map the coordinates in the MDS plot by themselves. However, similar to what holds for principal components, points that are close in the MDS plot are close in multivariate space (but not necessarily in geographical space).

Creating the distance matrix

In our example, we first create the distance matrix by passing the standardized values in vds to the dist function. We will then use the resulting object as input into the MDS procedure.
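For readers who want to see what the distance computation amounts to, here is a NumPy sketch of pairwise Euclidean distances between standardized observations. It builds a full symmetric matrix, whereas the object returned by dist in R stores only the lower triangle; the data are made up.

```python
import numpy as np

rng = np.random.default_rng(5)
raw = rng.normal(size=(85, 6))   # made-up data, 85 observations on 6 variables
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

# Pairwise differences via broadcasting: shape (n, n, p).
diff = z[:, None, :] - z[None, :, :]

# Euclidean distance between every pair of observations: shape (n, n).
vdiss = np.sqrt((diff ** 2).sum(axis=2))

print(vdiss.shape)
print(np.round(vdiss[:3, :3], 3))
```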

MDS calculation

In the default case (considered here), the cmdscale command takes vdiss as the only argument.

The result is an n by 2 matrix.
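The classical scaling algorithm behind cmdscale can be sketched in a few lines of NumPy: double-center the squared distances, take the leading eigenvectors, and scale them by the square roots of the eigenvalues. This is an illustration of the technique on made-up data, not the R implementation, and the function name classical_mds is hypothetical.

```python
import numpy as np


def classical_mds(dist, k=2):
    """Classical (Torgerson) scaling of a full symmetric distance matrix."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-centered matrix of squared distances.
    b = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]     # indices of the k largest
    return eigvecs[:, order] * np.sqrt(eigvals[order])


rng = np.random.default_rng(6)
# Made-up correlated data standing in for the Guerry variables (85 x 6).
raw = rng.normal(size=(85, 6)) @ rng.normal(size=(6, 6))
z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

diff = z[:, None, :] - z[None, :, :]
vdiss = np.sqrt((diff ** 2).sum(axis=2))

mds = classical_mds(vdiss)
print(mds.shape)   # (85, 2)
```

With Euclidean distances on standardized variables, the two MDS coordinates coincide with the first two principal component scores up to a sign flip, which is why the MDS plot and the PC1/PC2 scatter plot look alike in this setting.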

Visualization

Convert matrix to data frame

We follow the same procedure as for the principal components to convert the matrix of coordinates into a data frame and to add the department identifiers. We can use the result to write out to a csv file (to merge with a GeoDa layer) or to plot using ggplot.


  1. University of Chicago, Center for Spatial Data Science – anselin@uchicago.edu↩

This is an example of using PCA and biplot. The observations are colored by k-means clustering.

pca_kmeans_biplot.py
#!/usr/bin/env python2.7
# -*- coding: utf-8 -*-
'''Biplot example using pcasvd from statsmodels and matplotlib.

This is an example of how a biplot (like that in R) can be produced
using pcasvd and matplotlib. Additionally, this example does k-means
clustering and colors observations by which cluster they belong to.
'''

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq
from statsmodels.sandbox.tools.tools_pca import pcasvd


def biplot(plt, pca, labels=None, colors=None,
           xpc=1, ypc=2, scale=1):
    '''Generate biplot from the result of pcasvd of statsmodels.

    Parameters
    ----------
    plt : object
        An existing pyplot module reference.
    pca : tuple
        The result from statsmodels.sandbox.tools.tools_pca.pcasvd.
    labels : array_like, optional
        Labels for each observation.
    colors : array_like, optional
        Colors for each observation.
    xpc, ypc : int, optional
        The principal component number for x- and y-axis. Defaults to
        (xpc, ypc) = (1, 2).
    scale : float
        The variables are scaled by lambda ** scale, where lambda =
        singular value = sqrt(eigenvalue), and the observations are
        scaled by lambda ** (1 - scale). Must be in [0, 1].

    Returns
    -------
    None.
    '''
    # Convert 1-based component numbers to 0-based column indices.
    xpc, ypc = (xpc - 1, ypc - 1)
    xreduced, factors, evals, evecs = pca
    singvals = np.sqrt(evals)

    # data
    xs = factors[:, xpc] * singvals[xpc]**(1. - scale)
    ys = factors[:, ypc] * singvals[ypc]**(1. - scale)

    if labels is not None:
        # Draw each observation as a text label; set the axis limits by
        # hand, padded by 10% of the data range on each side.
        for i, (t, x, y) in enumerate(zip(labels, xs, ys)):
            c = 'k' if colors is None else colors[i]
            plt.text(x, y, t, color=c, ha='center', va='center')
        xmin, xmax = xs.min(), xs.max()
        ymin, ymax = ys.min(), ys.max()
        xpad = (xmax - xmin) * 0.1
        ypad = (ymax - ymin) * 0.1
        plt.xlim(xmin - xpad, xmax + xpad)
        plt.ylim(ymin - ypad, ymax + ypad)
    else:
        colors = 'k' if colors is None else colors
        plt.scatter(xs, ys, c=colors, marker='.')

    # variables
    tvars = np.dot(np.eye(factors.shape[0], factors.shape[1]),
                   evecs) * singvals**scale

    for i, col in enumerate(xreduced.columns.values):
        x, y = tvars[i][xpc], tvars[i][ypc]
        plt.arrow(0, 0, x, y, color='r',
                  width=0.002, head_width=0.05)
        plt.text(x * 1.4, y * 1.4, col, color='r', ha='center', va='center')

    plt.xlabel('PC{}'.format(xpc + 1))
    plt.ylabel('PC{}'.format(ypc + 1))


def main():
    '''Run a PCA on state.x77 from R and generate its biplot. Color
    observations by k-means clustering.'''
    df = pd.read_csv('data/state.x77')
    print(df.describe())
    print(df.head())

    columns = ['Population', 'Income', 'Illiteracy',
               'Life Exp', 'Murder', 'HS Grad']
    data = df[columns]
    # Standardize each column to mean zero and unit variance.
    data = (data - data.mean()) / data.std()
    pca = pcasvd(data, keepdim=0, demean=False)

    # k-means with 3 clusters; idx holds the cluster of each observation.
    values = data.values
    centroids, _ = kmeans(values, 3)
    idx, _ = vq(values, centroids)
    colors = ['gby'[i] for i in idx]

    plt.figure(1)
    biplot(plt, pca, labels=data.index, colors=colors,
           xpc=1, ypc=2)
    plt.show()


if __name__ == '__main__':
    main()