Covariance Matrix and PCA

Since you are here, you probably know that PCA stands for Principal Component Analysis, and that its primary goal is to reduce the dimensionality of data (to compress it or simplify the feature space) while keeping as much useful information as possible. And you wonder:

Why and How does PCA work? What's the intuition behind it?

Now, I've waded through a pile of material, and let me just say: first read these, preferably in the given order:

  1. A tutorial on Principal Components Analysis
  2. A One-Stop Shop for Principal Component Analysis
  3. A Tutorial on Principal Component Analysis


If you do go through these awesome materials (possibly branching out into a few Stack Exchange questions along the way), welcome back! Let's review some important things:

This section provides a summary of the assumptions behind PCA and hints at when these assumptions might perform poorly.

  1. Linearity. Linearity frames the problem as a change of basis. Several areas of research have explored how to extend these notions to nonlinear regimes (see Discussion).
  2. Large variances have important structure. This assumption also encompasses the belief that the data has a high SNR. Hence, principal components with larger associated variances represent interesting structure, while those with lower variances represent noise. Note that this is a strong, and sometimes incorrect, assumption (see Discussion).
  3. The principal components are orthogonal. This assumption provides an intuitive simplification that makes PCA soluble with linear algebra decomposition techniques. These techniques are highlighted in the two following sections.

Jonathon Shlens gives a vivid story about PCA. To me it is deeply enlightening, except that with my limited math background, two things still did not seem intuitive after reading section V, SOLVING PCA USING EIGENVECTOR DECOMPOSITION:

  1. How does the covariance matrix come into play with PCA?
  2. Why is the eigenvector of the covariance matrix the principal component?


PCA is about finding a new set of basis vectors (w1, w2, ...) that:

  1. maximize the variance of the data under the new basis
  2. are orthogonal to each other


Let's first consider finding the single best principal component, and rephrase PCA:

Given a (zero-meaned) data matrix X and a unit-length projection vector w, how do we choose w to maximize the variance of the projection Xw?

$$(Xw)^\top (Xw) = w^\top X^\top X w = w^\top (X^\top X)\, w$$

(notice that XᵀX, which is the covariance matrix up to a constant factor of 1/(n−1), comes out!). Now let A = XᵀX and rephrase PCA once more:

Given the transformation matrix A, find the unit-length projection vector w that maximizes wᵀ(Aw).
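As a quick sanity check of the identity above, here is a minimal NumPy sketch (the data, seed, and variable names are made up for illustration) showing that the variance of the projection Xw really is the quadratic form wᵀ(XᵀX)w, up to the 1/n scaling:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))      # toy data: 500 samples, 3 features
X = X - X.mean(axis=0)             # zero-mean the columns, as assumed above

w = rng.normal(size=3)
w = w / np.linalg.norm(w)          # unit-length projection vector

A = X.T @ X                        # the matrix that popped out of the derivation

var_proj = np.var(X @ w)                 # variance of the projected data (np.var divides by n)
quad_form = (w @ A @ w) / X.shape[0]     # w^T (X^T X) w, also divided by n
print(np.isclose(var_proj, quad_form))   # True
```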

Now it becomes clear how to find the w that maximizes wᵀ(Aw), because the geometric interpretation of wᵀ(Aw) is simple: take a vector w, push it through the transformation A, and compute the dot product of the result with the original w. (Here w has a fixed length, say 1.)

$$
\begin{aligned}
w^\top (Aw) &= \operatorname{len}(w)\cdot\operatorname{len}(Aw)\cdot\cos(w, Aw)\\
&= 1\cdot\operatorname{len}(Aw)\cdot\cos(w, Aw)\\
&= \operatorname{len}(Aw)\cdot\cos(w, Aw)
\end{aligned}
$$

 
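Numerically, that factorization looks like this: for a random unit vector w, the dot product wᵀ(Aw) is len(Aw) scaled down by exactly the cosine of the angle between w and Aw. A minimal sketch (the random A and w here are purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(3, 3))
A = M.T @ M                        # symmetric PSD matrix, playing the role of X^T X

for _ in range(3):
    w = rng.normal(size=3)
    w = w / np.linalg.norm(w)      # len(w) = 1
    Aw = A @ w
    cos_w_Aw = (w @ Aw) / np.linalg.norm(Aw)   # cos(w, Aw), since len(w) = 1
    # w . (Aw) falls short of len(Aw) by exactly the cosine factor
    print(f"w.(Aw) = {w @ Aw:.3f}   len(Aw) = {np.linalg.norm(Aw):.3f}   cos = {cos_w_Aw:.3f}")
```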

If we could make the vector Aw as long as possible while it comes out (after the transformation w → Aw) pointing in the same direction as w, then we would maximize wᵀ(Aw). I claim that if we assign the eigenvector with the largest eigenvalue to w, then we maximize len(Aw) * cos(w, Aw) (see the quick numerical check after this list):

  1. By definition of an eigenvector, Aw points in the same direction as w, so cos(w, Aw) = 1, which is the maximum.
  2. Because w is associated with the largest eigenvalue, the length of Aw is the largest achievable by any unit vector (not hard to prove for a symmetric matrix like XᵀX).
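Here is a small numerical check of the claim (a sketch, not a proof; the toy data and the 10,000 random directions are arbitrary choices): the top eigenvector of A attains wᵀAw equal to the largest eigenvalue, and no random unit vector beats it.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
X = X - X.mean(axis=0)
A = X.T @ X                                  # covariance matrix, up to 1/(n-1)

vals, vecs = np.linalg.eigh(A)               # eigh, since A is symmetric
top = vecs[:, -1]                            # eigenvector with the largest eigenvalue

best_random = -np.inf
for _ in range(10_000):
    w = rng.normal(size=3)
    w = w / np.linalg.norm(w)                # random unit direction
    best_random = max(best_random, w @ A @ w)

print(np.isclose(top @ A @ top, vals[-1]))   # True: the maximum is the largest eigenvalue
print(best_random <= top @ A @ top)          # True: no random direction does better
```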

What this says is that by assigning the best eigenvector (the one with the largest eigenvalue) to w, w becomes the best principal component. Now take the remaining eigenvectors in descending order of their eigenvalues, and we have our whole set of principal components! Because:

  1. From the argument above, each one captures the largest remaining variance
  2. eigenvectors of a symmetric matrix are orthogonal, so our principal components are too

Now we have a nice geometric understanding and intuition of why the covariance matrix relates to PCA :D
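To close the loop, here is a minimal end-to-end sketch (toy data; the names are made up) showing that the principal components obtained by eigendecomposing the covariance matrix agree, up to sign, with the ones you get from the SVD of the centered data, which is how PCA is commonly implemented in practice:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))   # toy correlated data
X = X - X.mean(axis=0)                                    # zero-mean, as assumed

# principal components as eigenvectors of the covariance matrix, largest first
cov = X.T @ X / (X.shape[0] - 1)
vals, vecs = np.linalg.eigh(cov)
order = np.argsort(vals)[::-1]
components_eig = vecs[:, order].T            # one principal component per row

# the same components from the SVD of the centered data
_, s, Vt = np.linalg.svd(X, full_matrices=False)
components_svd = Vt

print(np.allclose(np.abs(components_eig), np.abs(components_svd)))  # True (up to sign)
print(vals[order])                            # variances along each component ...
print(s**2 / (X.shape[0] - 1))                # ... match the squared singular values / (n-1)
```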