## What Is StandardScaler? How and Why We Use It

Ensuring consistency in the numerical input data is crucial to enhancing the performance of machine learning algorithms. To achieve this uniformity, it is necessary to adjust the data to a standardized range.

Standardization and Normalization are both widely used techniques for adjusting data before feeding it into machine learning models.

In this article, you will learn how to utilize the `StandardScaler` class to scale the input data.

## What is Standardization?

Before diving into the fundamentals of the `StandardScaler` class, you need to understand the standardization of the data.

Standardization is a data preparation method that involves adjusting the input (features) by first centering them (subtracting the mean from each data point) and then dividing them by the standard deviation, resulting in the data having a mean of 0 and a standard deviation of 1.

The formula for standardization can be written like the following:

• standardized_val = ( input_value - mean ) / standard_deviation

Assume you have a mean value of 10.4 and a standard deviation value of 4. To standardize the value of 15.9, put the given values into the equation as follows:

• standardized_val = ( 15.9 - 10.4 ) / 4
• standardized_val = 5.5 / 4
• standardized_val = 1.375

The `StandardScaler` stands out as a widely used tool for implementing data standardization.

## What is StandardScaler?

The `StandardScaler` class provided by scikit-learn (in the `sklearn.preprocessing` module) applies standardization to the input (feature) variables, making sure they have a mean of approximately 0 and a standard deviation of approximately 1.

It adjusts the data to have a standardized distribution, making it suitable for modeling and ensuring that no single feature disproportionately influences the algorithm due to differences in scale.

## Why Bother Using It?

You've already seen the idea behind StandardScaler, but to highlight, here are the primary reasons to use it:

• It can improve the performance of machine learning models
• It keeps the features of the data on a consistent scale
• It is especially useful for algorithms that are sensitive to differences in feature scale, such as distance-based models like K-nearest neighbors

## How to Use StandardScaler?

First, import the `StandardScaler` class from the `sklearn.preprocessing` module. Then create an instance of the class with `StandardScaler()`. Finally, call the instance's `fit_transform` method on the input data.
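The steps above can be sketched as follows; the array values are illustrative, and scikit-learn expects a 2-D array, so a single feature is written as one column:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative input data: one feature, six samples (2-D shape required by sklearn).
arr = np.array([[4.0], [8.0], [15.0], [16.0], [23.0], [42.0]])

# Create an instance of the scaler.
scaler = StandardScaler()

# Fit the scaler to the data (computes mean and std), then transform it.
arr_scaled = scaler.fit_transform(arr)

print(arr)         # original array
print(arr_scaled)  # standardized array: mean ~0, std ~1
```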

An instance of the `StandardScaler` class is created and stored in the variable `scaler`. This instance will be used to standardize the data.

The `fit_transform` method of the `StandardScaler` object (`scaler`) is called with the original data `arr` as the input.

The `fit_transform` method will compute the mean and standard deviation of each feature (column) in the input data `arr` and then apply the standardization to it.

Printing both arrays shows the original values alongside their standardized counterparts.

## Does Standardization Affect the Accuracy of the Model?

In this section, you’ll see how the model’s performance is affected after applying standardization to features of the dataset.

Let’s see how the model will perform on the raw dataset without standardizing the feature variables.

The breast cancer dataset is loaded from `sklearn.datasets`, and then the features (`df.data`) and target (`df.target`) are stored in the `X` and `y` variables.

The K-nearest neighbors (KNN) classifier is instantiated using the `KNeighborsClassifier` class and stored in the `model` variable.

The `cross_val_score` function is used to evaluate the KNN model's performance. The model (`model`), features (`X`), and target (`y`) are passed in, and accuracy (`scoring='accuracy'`) is specified as the evaluation metric.

This splits the dataset into 10 folds (`cv=10`), so the model is trained and evaluated 10 times, with each fold serving once as the test set. Here, `n_jobs=-1` means all available CPU cores are used for faster cross-validation.

Finally, the average of the accuracy scores (`mean(scores)`) is printed.
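A minimal sketch of this baseline evaluation, putting the steps above together (the exact score can vary slightly with scikit-learn versions):

```python
from numpy import mean
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Load the breast cancer dataset: features into X, target labels into y.
df = load_breast_cancer()
X, y = df.data, df.target

# KNN with default hyperparameters, evaluated on the raw (unscaled) features.
model = KNeighborsClassifier()

# 10-fold cross-validated accuracy, using all CPU cores.
scores = cross_val_score(model, X, y, scoring='accuracy', cv=10, n_jobs=-1)
print(mean(scores))
```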

Without standardizing the dataset’s feature variables, the average accuracy score is 93%.

### Using StandardScaler for Applying Standardization

The dataset’s features undergo scaling with the `StandardScaler()`, and the resulting scaled dataset is stored in the `X_scaled` variable.

Next, this scaled dataset is used as input for the `cross_val_score` function to compute and subsequently display the accuracy.
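The scaled variant differs from the baseline only in the one extra scaling step; a sketch, reusing the same dataset and model:

```python
from numpy import mean
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

df = load_breast_cancer()
X, y = df.data, df.target

# Standardize every feature to mean ~0 and std ~1 before cross-validation.
X_scaled = StandardScaler().fit_transform(X)

# Same 10-fold evaluation as before, now on the scaled features.
scores = cross_val_score(KNeighborsClassifier(), X_scaled, y,
                         scoring='accuracy', cv=10, n_jobs=-1)
print(mean(scores))
```

Note that for a fully leak-free evaluation the scaler would be fit inside each fold (e.g. via a `Pipeline`), but scaling up front keeps the comparison with the baseline simple.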

It is noticeable that the accuracy score has significantly increased to 97% when compared to the previous accuracy score of 93%.

The application of `StandardScaler()`, which standardized the data’s features, has notably improved the model’s performance.

## Conclusion

StandardScaler is used to standardize the input data in a way that ensures that the data points have a balanced scale, which is crucial for machine learning algorithms, especially those that are sensitive to differences in feature scales.

Standardization transforms the data such that the mean of each feature becomes zero (centered at zero), and the standard deviation becomes one.

Let’s recall what you’ve learned:

• What actually is StandardScaler
• What is standardization and how it is applied to the data points
• Impact of StandardScaler on the model’s performance


That’s all for now

Keep Coding✌✌