Data Preprocessing and Cleaning: Part 1: Column Normalization

Before applying any dimensionality reduction technique sometimes it is important to preprocess the data.

There are several ways which we can use for preprocessing data.

In this post, we will explore one of the common ways to do data preprocessing which is column normalization (or feature scaling).

Feature scaling

Feature scaling is a method used to standardize the range of independent variables or features of data.

In data processing, it is also known as data normalization(or column normalization) and is generally performed during the data preprocessing step.

Column normalization is defined as the process of scaling all the features of a column within values of 0 and 1.

Calculations

Let’s say there is a feature with values [a1,a2….an] for n items. This could be all the height of students in a class.

The way to normalize is we find the min(a_min) and max(a_max) of all the features and then normalize it by substracting each element with a_min and then dividing it by (a_max-a_min)

Let

a₁,a₂…. a_n –> be n-values of a feature vector f_j.

We then find a_maxand a_min

max(a_i) = a_max >= a_i (i: 1->n)

min(a_i) = a_min <= a_i (i: 1->n)

In this way we can define a new feature which is a normalized version f_j‘

a₁‘,a₂‘…. a_n‘ will be the normalized values of this new feature

Here a_i‘ = (a_i– a_min)/ (a_max– a_min)

As a result, a_i‘ lies in the range [0,1]

Needless to say as a_min‘ =0 and a_max‘ =1

In this way, we created a new normalized vector from our original vector and the values of that normalized vector lie between 0 and 1.

Why do we need to do feature/column normalization?

We may have multiple sources of data of which we want to combine all of it, however not all of it is in the same unit system. For eg two datasets one with length in cm and one with length in meters.

So to put everything on the same scale. The second advantage is when it comes to regression as it converges faster to a minimum.

The normalized values would lie between 0 to 1 inclusive.

Geometrically after normalization, if we plot the features before(on the left) and after normalization(on the right) we will see all the points just shift to a small square between 0 to 1.

The one thing to however notice here is it doesn’t change the relationship between the different features.

For eg for two features if we plot it and they lie on y=mx+c, even after normalization they will still lie on the same line.

As one can see, The relationship between X and Y is not affected because of feature scaling.

This is it for this part. We learned about column normalization. Hope you liked it. Comment what you would like to read about and keep sharing.

Data Preprocessing and Cleaning: Part 1: Column Normalization

Published by Deepanshu Lulla on April 20, 2019April 20, 2019

Feature scaling

Calculations

Why do we need to do feature/column normalization?

Like this:

0 Comments

What do you think?Cancel reply

Machine learning

Coursera Week 1: Intro to Neural Networks Notes

Distributed systems

Exploring Vector Databases

Machine learning

Best Practices for Building Machine Learning Applications

Data Preprocessing and Cleaning: Part 1: Column Normalization

Published by Deepanshu Lulla on April 20, 2019April 20, 2019

Feature scaling

Calculations

Why do we need to do feature/column normalization?

Like this:

0 Comments

What do you think?Cancel reply

Related Posts

Machine learning

Coursera Week 1: Intro to Neural Networks Notes

Distributed systems

Exploring Vector Databases

Machine learning

Best Practices for Building Machine Learning Applications

Subscribe For Latest Updates