How To Do Data Preprocessing in Machine Learning?
Data preprocessing is important in machine learning because it puts the raw data into a form that a model can use, which helps the data scientist make machine learning predictions.
Preprocessing a machine learning dataset involves the following sequential steps.
Step 1: Importing the important libraries: There are 3 libraries that are commonly used both for preprocessing the dataset and for prediction.
A) NumPy library: NumPy stands for Numerical Python. This library helps with the numerical calculations that are important in machine learning. It has several built-in functions for working with arrays (for example linear algebra, Fourier transforms, and matrices). NumPy arrays aim to provide faster computation than traditional Python lists.
B) Pandas library: The Pandas library provides high-performance, easy-to-use data structures and data analysis tools for Python programming. It is built on the NumPy package and its key data structure is called a DataFrame, which allows you to store and manipulate data in rows and columns.
C) Matplotlib library: It is a plotting library for visualizing data. It has various built-in functions and plot types for visualization.
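As a minimal sketch of Step 1, the three libraries are usually imported under the short aliases shown here. The sample array, the column names "x" and "y", and the line plot are illustrative assumptions, not part of any particular dataset.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: element-wise arithmetic on a whole array without a Python loop.
values = np.array([1.0, 2.0, 3.0])
doubled = values * 2

# Pandas: a DataFrame stores the data in labelled rows and columns.
frame = pd.DataFrame({"x": values, "y": doubled})

# Matplotlib: draw a simple line plot of the two columns.
fig, ax = plt.subplots()
ax.plot(frame["x"], frame["y"])
ax.set_xlabel("x")
ax.set_ylabel("y")
# plt.show() would display the figure in an interactive session.
```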
Step 2: Importing the dataset in CSV format.
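A dataset in CSV format can be loaded with the pandas read_csv function. The sketch below reads from an in-memory string so that it runs on its own. The columns (Country, Age, Salary, Purchased) and their values are hypothetical, and in practice you would pass the path to your own CSV file instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,54000,No
Spain,38,61000,No
"""

# read_csv accepts a file path or any file-like object.
dataset = pd.read_csv(io.StringIO(csv_text))
print(dataset.head())
```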
Step 3: Divide the data into the ‘X’ and ‘y’ variables. ‘X’ stands for the matrix of features and ‘y’ stands for the dependent variable.
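One common way to do this split, assuming the dependent variable is the last column of the table, is positional indexing with iloc. The small DataFrame here is a hypothetical stand-in for the imported dataset.

```python
import pandas as pd

# Hypothetical dataset; the last column is assumed to be the dependent variable.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain"],
    "Age": [44, 27, 30, 38],
    "Salary": [72000, 48000, 54000, 61000],
    "Purchased": ["No", "Yes", "No", "No"],
})

# X: every column except the last (the matrix of features).
X = dataset.iloc[:, :-1].values
# y: only the last column (the dependent variable).
y = dataset.iloc[:, -1].values
```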
Step 4: Find the missing values and replace them with a value such as the mean, median, or most frequent value of that particular column.
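A minimal sketch of this step using scikit-learn's SimpleImputer, assuming missing entries are stored as NaN. The two numeric columns (age and salary) are hypothetical. The strategy argument can be "mean", "median", or "most_frequent", matching the three options mentioned above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric features (age, salary) with two missing entries.
X = np.array([
    [44.0, 72000.0],
    [27.0, np.nan],
    [np.nan, 54000.0],
    [38.0, 61000.0],
])

# Replace each NaN with the mean of the known values in its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
X_filled = imputer.fit_transform(X)
print(X_filled)
```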
Step 5: Splitting the data into training and testing datasets. The training part is used to train the machine learning model, and the testing part is used to test the trained model's predictions on data it did not see during training.
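The split can be sketched with scikit-learn's train_test_split. The 80/20 ratio and the fixed random_state are assumptions chosen for illustration, and the arrays are placeholder data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus a binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Hold out 20% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)
```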
Step 6: Feature Scaling (optional): This part of preprocessing is optional. Its purpose is to bring the features onto the same scale, so that a feature with a large numeric range does not dominate distance-based calculations such as the sum of squared distances between points.
Once feature scaling has been applied to the 2 columns that need it, we can perform the predictions.
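One way to do feature scaling is standardization, where each value x becomes (x - mean) / standard deviation of its column. The sketch below uses scikit-learn's StandardScaler on two hypothetical columns (age and salary). Standardization is only one possible choice and is assumed here for illustration. The scaler is fitted on the training data only and then reused on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test features: age and salary on very different scales.
X_train = np.array([
    [44.0, 72000.0],
    [27.0, 48000.0],
    [30.0, 54000.0],
    [38.0, 61000.0],
])
X_test = np.array([[35.0, 58000.0]])

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data, then scale it.
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same training statistics to the test data.
X_test_scaled = scaler.transform(X_test)
print(X_train_scaled)
```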
Step 7: Encode the Data: Encoding comes into play when a column contains categorical values, such as ‘Yes’ and ‘No’, instead of numbers.
Such a column needs to be encoded before it can be used. If the column has exactly 2 categories (that is, ‘Yes’ and ‘No’), then we use a normal label encoder.
If the column has more than 2 categories, then we use OneHotEncoder, which creates one new column for each category.
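Both encoders are available in scikit-learn's preprocessing module. This sketch assumes a hypothetical two-category ‘Purchased’ column and a hypothetical three-category ‘Country’ column. Both encoders order the categories alphabetically.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Two categories: a label encoder maps them to 0 and 1.
purchased = np.array(["No", "Yes", "No", "Yes"])
label_encoder = LabelEncoder()
purchased_encoded = label_encoder.fit_transform(purchased)

# More than two categories: one-hot encoding creates one column per category.
# OneHotEncoder expects a 2D input, hence one value per row.
country = np.array([["France"], ["Spain"], ["Germany"], ["Spain"]])
onehot_encoder = OneHotEncoder()
country_encoded = onehot_encoder.fit_transform(country).toarray()

print(purchased_encoded)
print(country_encoded)
```

One-hot encoding avoids implying an order between categories, which a single integer code per category would do.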