How To Do Data Preprocessing in Machine Learning?

Manik Soni
4 min read · Sep 24, 2020

Data Preprocessing for Machine Learning

Data preprocessing is important in machine learning because it gets a raw dataset into a shape a model can learn from, so that a data scientist can make predictions with it.

Preprocessing a machine learning dataset involves a sequence of steps.

Step 1: Import the important libraries. Three libraries are used throughout, both for preprocessing the dataset and for making predictions.

Libraries

A) NumPy: NumPy stands for Numerical Python. The library handles the numerical calculations that machine learning depends on, and it has many built-in functions for working with arrays (for example linear algebra, Fourier transforms, and matrices). NumPy arrays aim to provide faster computation than traditional Python lists.

B) Pandas: The pandas library provides high-performance, easy-to-use data structures and data analysis tools for Python. It is built on the NumPy package, and its key data structure is the DataFrame, which lets you store and manipulate data in rows and columns.

C) Matplotlib: Matplotlib is a plotting library for visualizing data. It has many built-in functions and offers a variety of plot types.
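As a minimal sketch of Step 1, the three libraries are usually imported under the short aliases below. The ages and salaries are invented purely to show each library doing one small job; they are not taken from the dataset in this article.

```python
import numpy as np                 # numerical arrays
import pandas as pd                # tabular data (DataFrame)
import matplotlib.pyplot as plt    # plotting

# Invented values, used only for illustration
ages = np.array([44, 27, 30, 38])
salaries = np.array([72000, 48000, 54000, 61000])

# NumPy: fast arithmetic on whole arrays
mean_age = ages.mean()

# pandas: the same values stored as rows and columns
frame = pd.DataFrame({"Age": ages, "Salary": salaries})

# Matplotlib: a simple scatter plot of the two columns
fig, ax = plt.subplots()
ax.scatter(frame["Age"], frame["Salary"])
ax.set_xlabel("Age")
ax.set_ylabel("Salary")
# In an interactive session, plt.show() would display the figure
```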

Step 2: Import the dataset in CSV format.

Importing dataset in CSV format
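A short sketch of loading a CSV with pandas follows. The article does not name its file, so the example reads a small invented CSV from an in-memory string; the column names (Country, Age, Salary, Purchased) and all values are assumptions made for illustration. With a real file you would pass its path to `pd.read_csv` instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,,No
Germany,40,61000,Yes
France,,58000,Yes
"""

# pd.read_csv accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.head())
print(dataset.shape)  # (rows, columns)
```

Empty fields in the CSV are read as NaN, which is how missing values show up later in the pipeline.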

Step 3: Divide the data into the ‘X’ and ‘y’ variables. ‘X’ stands for the matrix of features and ‘y’ stands for the dependent variable.

‘X’ is the matrix of features and ‘y’ is the dependent variable
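One common way to make this split, sketched here on an invented DataFrame, is to take every column except the last as ‘X’ and the last column as ‘y’. This assumes the dependent variable is stored in the last column, which may not hold for your own data.

```python
import pandas as pd

# Invented data; the last column is assumed to be the dependent variable
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

# All rows, every column except the last -> matrix of features
X = dataset.iloc[:, :-1].values

# All rows, last column only -> dependent variable
y = dataset.iloc[:, -1].values

print(X.shape, y.shape)
```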

Step 4: Find the missing values and replace them with a value such as the mean, the median, or the most frequent value of the column in question.

Dataset with missing values (NaN)
Replacing the missing values with the mean of the column
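Mean replacement can be sketched with scikit-learn's `SimpleImputer`. The article's own code was in an image that is not available here, so this is one reasonable way to do it and not necessarily the author's; the numbers are invented.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented numeric features (age, salary) with two missing entries
X = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [np.nan, 54000.0],
    [38.0, np.nan],
])

# strategy can also be "median" or "most_frequent"
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

# Learn each column's mean, then fill the gaps with it
X_imputed = imputer.fit_transform(X)

print(X_imputed)
```

The missing salary is filled with the mean of the three known salaries, and the missing age with the mean of the three known ages.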

Step 5: Split the data into training and testing sets. The training set is used to train the machine learning model, and the testing set is used to check how well the trained model predicts on data it has not seen.

Splitting the dataset into Testing and Training Parts
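A minimal sketch of the split using scikit-learn's `train_test_split`. The 80/20 ratio and `random_state=0` are assumptions chosen for the example, not values stated in the article, and the arrays are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```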

Step 6: Feature scaling (optional): This step brings all the features onto the same scale. It is optional because not every model needs it.

The sum of the squares of the distances

We scale because the sum of squared distances between points should not become large just because one column holds much bigger numbers than the others.

Notice how large the difference between the points is
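To make the point concrete, here is a small worked example with two invented points, each described by an age and a salary. It simply computes the squared difference per column and adds them up.

```python
import numpy as np

# Two invented people: (age in years, salary)
a = np.array([27.0, 48000.0])
b = np.array([48.0, 52000.0])

# Squared difference contributed by each column
squared_diffs = (a - b) ** 2

# Sum of the squared differences
total = squared_diffs.sum()

print(squared_diffs)
print(total)
```

The age column contributes 441 while the salary column contributes 16,000,000, so without scaling the salary almost entirely decides how far apart the two points appear.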

After applying feature scaling to the two columns, we can go on to make predictions.

Code to apply feature scaling to the dataset
Data after feature scaling (the columns are age and salary)
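The original code was shown as an image, so what follows is a sketch under assumptions: it uses scikit-learn's `StandardScaler` (standardization), which is one common choice, on invented age and salary values. The scaler is fitted on the training rows only and then reused on the test rows.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented (age, salary) rows
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data, then scale it
X_train_scaled = scaler.fit_transform(X_train)

# Apply the same training statistics to the test data
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1, so age and salary contribute on a comparable scale.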

Step 7: Encode the data: Encoding comes into play when a column holds categorical values such as ‘Yes’ and ‘No’ instead of numbers.

‘Purchased’ is the column which needs encoding

We need to encode the column above. When a column has exactly two categories (here ‘Yes’ and ‘No’), a plain label encoder is enough.

Code to encode the data
Result after encoding
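A minimal sketch of label encoding with scikit-learn's `LabelEncoder`, using an invented ‘Purchased’ column. The encoder sorts the category names, so ‘No’ becomes 0 and ‘Yes’ becomes 1.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Invented 'Purchased' values
y = np.array(["No", "Yes", "No", "Yes", "Yes"])

encoder = LabelEncoder()

# Map each category to an integer
y_encoded = encoder.fit_transform(y)

print(encoder.classes_)  # categories in the order they were numbered
print(y_encoded)
```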

If a column has more than two categories, we use OneHotEncoder instead, which adds one new column for each category.

This column has 3 categories.
Encoding a particular column, number 3
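One way to one-hot encode a single column is to wrap `OneHotEncoder` in a `ColumnTransformer`, as sketched below. The data is invented, and in this made-up array the categorical column sits at index 0; adjust the index to match your own data. Setting `sparse_threshold=0` is a choice made here so that the result is always a dense array.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented features: a 3-category column followed by age and salary
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

# Encode column 0 and pass the remaining columns through unchanged
ct = ColumnTransformer(
    transformers=[("onehot", OneHotEncoder(), [0])],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense array
)

X_encoded = np.array(ct.fit_transform(X))

print(X_encoded)
```

The single category column is replaced by three 0/1 columns (one each for France, Germany, and Spain, in sorted order), followed by the untouched age and salary columns.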
