This is part 2 of Introduction to Dimensionality Reduction. In this blog post, we will cover several mathematical prerequisites that one must know before trying to understand machine learning.
Mean Vector
The sample mean vector is a vector each of whose elements is the sample mean of one of the random variables – that is, the arithmetic average of the observed values of that variable.
Let's say we have two vectors
x1 = [2.2, 4.2, ...]
x2 = [1.2, 3.2, ...]
x_mean = (1/2)(x1 + x2)
       = 0.5 * [3.4, 7.4, ...]
       = [1.7, 3.7, ...]
So essentially we summed the elements at the i-th index of the first array with those at the corresponding index of the second array, and divided by the number of arrays.
So every array can be considered as a vector, with each of its indices acting as one of the dimensions.
If we plot these arrays as points in a multidimensional space, they form a scattered cloud of points (a 3D scatter plot in the case of three-dimensional vectors).
Geometrically, the mean vector of that scattered group is its centroid: the point that sits at the center of the cloud.
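As a quick illustration, here is a minimal NumPy sketch (the data values are made up) that computes the mean vector of a small dataset by averaging each coordinate over all the points:

```python
import numpy as np

# Each row is one observation (a vector); each column is one variable/dimension.
# The values are hypothetical, chosen only for illustration.
X = np.array([
    [2.2, 4.2, 1.0],
    [1.2, 3.2, 5.0],
    [3.0, 1.6, 2.4],
])

# The mean vector averages over the rows, giving one sample mean per column.
mean_vector = X.mean(axis=0)
print(mean_vector)  # [2.1333... 3.  2.8]
```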
Data preprocessing: Column Standardization
Column standardization is a type of feature normalization where we transform each feature (column) of the data so that its mean becomes 0 and its standard deviation becomes 1.
feature | f1 | f2 | f3 | f4 |
---|---|---|---|---|
X=1 | 10 | a1 | 2 | 3 |
X=2 | 20 | a2 | 1 | 4 |
X=3 | 30 | a3 | 4 | 4 |
... | ... | ... | ... | ... |
X=n | x | an | y | z |
Let a1, a2, ..., an represent the n values of a feature fj.
Let's say we apply column standardization to these values and obtain a new, standardized feature fj'.
Let a1', a2', ..., an' represent the n values of the standardized feature fj'.
Then the mean of the standardized values is
mean(a1', a2', ..., an') = 0
and their standard deviation is
std(a1', a2', ..., an') = 1
The way we do it is by subtracting the mean from each element and dividing by the standard deviation. On doing so, the resulting values have mean 0 and standard deviation 1.
So how do we obtain a1', a2', ..., an'?
Let the mean of a1, a2, ..., an be a_mean
and the standard deviation of a1, a2, ..., an be a_std.
Then
ai' = (ai - a_mean) / a_std
Geometrically speaking, we shift the distribution so that it is centered at the origin (which is why this step is also called mean centering) and then rescale each axis so that the spread along it becomes 1. We may need to squish or expand the data along an axis depending on whether the standard deviation of the original data is greater or less than 1.
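Putting the formula above into code, here is a minimal NumPy sketch (the feature values are hypothetical) that column-standardizes a small matrix and verifies that every column ends up with mean 0 and standard deviation 1:

```python
import numpy as np

# Hypothetical feature matrix: rows are points x1..xn, columns are features f1..f4.
X = np.array([
    [10.0, 1.5, 2.0, 3.0],
    [20.0, 2.5, 1.0, 4.0],
    [30.0, 3.5, 4.0, 4.0],
])

# Column standardization: subtract each column's mean and divide by its standard deviation.
col_mean = X.mean(axis=0)
col_std = X.std(axis=0)
X_std = (X - col_mean) / col_std

# After standardization, every column has mean ~0 and standard deviation ~1.
print(X_std.mean(axis=0))  # close to [0. 0. 0. 0.]
print(X_std.std(axis=0))   # [1. 1. 1. 1.]
```

In practice, scikit-learn's StandardScaler performs this same per-column transformation.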
Covariance Matrix
Let's say we have a matrix X
features | f1 | f2 | f3 | f4 |
---|---|---|---|---|
X=1 | x11 | x12 | x13 | x14 |
X=2 | x21 | x22 | x23 | x24 |
X=3 | x31 | x32 | x33 | x34 |
We can define its covariance matrix S as
features | f1 | f2 | f3 | f4 |
---|---|---|---|---|
f1 | s11 | s12 | s13 | s14 |
f2 | s21 | s22 | s23 | s24 |
f3 | s31 | s32 | s33 | s34 |
f4 | s41 | s42 | s43 | s44 |
where x(i,j) is the entry in the i-th row and j-th column of X, and s(i,j) is the entry in the i-th row and j-th column of S. Each entry s(i,j) is the covariance between features fi and fj, i.e. s(i,j) = Cov(fi, fj).
The dimensions of X here are n × d, where n is the number of points and d is the number of dimensions (or features).
The covariance matrix, on the other hand, is of size d × d. Hence the covariance matrix is always a square matrix.
Covariance has two useful properties:
1) Cov(X, Y) = Cov(Y, X)
2) Cov(X, X) = Var(X)
The first property means s(i,j) = s(j,i), i.e. the covariance matrix is always symmetric.
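Both properties are easy to check numerically. Here is a small NumPy sketch with made-up values (note that np.cov uses the n-1 convention by default, so we pass ddof=1 to np.var for the comparison):

```python
import numpy as np

# Two made-up 1-D variables, used only to check the properties numerically.
x = np.array([1.0, 2.0, 4.0, 7.0])
y = np.array([3.0, 1.0, 5.0, 2.0])

# np.cov(x, y) returns the 2x2 covariance matrix of x and y.
cov_xy = np.cov(x, y)[0, 1]
cov_yx = np.cov(y, x)[0, 1]
print(np.isclose(cov_xy, cov_yx))                          # True: Cov(X, Y) == Cov(Y, X)
print(np.isclose(np.cov(x, x)[0, 1], np.var(x, ddof=1)))   # True: Cov(X, X) == Var(X)
```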
Another interesting property: for a column-standardized data matrix X, the covariance matrix is
S = (1/n) (Xᵀ X)
We will leave the proof as an exercise; it follows directly from the definition of covariance once every column of X has mean 0.
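To see this in action, here is a minimal NumPy sketch with a hypothetical data matrix: we column-standardize X, form (1/n)·XᵀX, and compare the result against NumPy's own covariance computation:

```python
import numpy as np

# Hypothetical data matrix: n points (rows) and d features (columns).
X = np.array([
    [10.0, 1.5, 2.0, 3.0],
    [20.0, 2.5, 1.0, 4.0],
    [30.0, 3.5, 4.0, 4.0],
    [25.0, 0.5, 3.0, 5.0],
])
n, d = X.shape

# Column-standardize X (mean 0 and standard deviation 1 per column).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# Covariance matrix as (1/n) * X^T X; its shape is d x d.
S = (X_std.T @ X_std) / n
print(S.shape)              # (4, 4)
print(np.allclose(S, S.T))  # True: the covariance matrix is symmetric

# Cross-check against NumPy's covariance (ddof=0 matches the 1/n convention).
print(np.allclose(S, np.cov(X_std, rowvar=False, ddof=0)))  # True
```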
Stay tuned for the next post on dimensionality reduction, where we will describe a classical technique called Principal Component Analysis.