
Data standardization is crucial for many machine learning algorithms. For instance, suppose two features have different scales or units:

  • Range of Feature A: 0 - 25
  • Range of Feature B: 500 - 25000

Feature B, with its much larger range, will most likely dominate the model. Data standardization rescales the data to ensure that features contribute proportionately. In scikit-learn, the sklearn.preprocessing module provides various scalers.

from sklearn.preprocessing import (
    StandardScaler,
    MinMaxScaler,
    MaxAbsScaler,
    RobustScaler,
    PowerTransformer,
    QuantileTransformer,
    )
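
To see why scaling matters, here is a minimal sketch with hypothetical toy values in the ranges above, showing how Feature B dominates a Euclidean distance (the quantity behind k-NN, k-means, SVMs, etc.) until both features are standardized:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Two samples with hypothetical values: [Feature A, Feature B]
X = np.array([[ 5.0,  600.0],
              [20.0,  800.0]])

# Unscaled: the distance is driven almost entirely by Feature B
print(np.linalg.norm(X[0] - X[1]))  # ~200.6

# Standardized: both features contribute comparably
X_std = StandardScaler().fit_transform(X)
print(np.linalg.norm(X_std[0] - X_std[1]))  # ~2.83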

Linear scalers

StandardScaler: StandardScaler()

StandardScaler scales a feature by removing the mean and scaling to unit variance.

\[X_{std} = \frac{X - \overline{X}} {\sigma}\]

where $ \overline{X} $ is the mean of the feature and $\sigma$ is the standard deviation.

import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[4, -4, -9],
              [2, -1,  0],
              [4,  2,  2],
              [2,  5,  3]])
              
StandardScaler().fit_transform(X)
[[ 1.         -1.34164079 -1.68654809]
 [-1.         -0.4472136   0.21081851]
 [ 1.          0.4472136   0.63245553]
 [-1.          1.34164079  0.84327404]]
  • Mean-based: It uses the mean and standard deviation to center and scale the data.
  • Assumes normal distribution: It works best when the data is approximately normally distributed.
  • Sensitive to outliers: Outliers can significantly affect the mean and standard deviation, leading to suboptimal scaling.
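
As a sanity check of the formula, the fitted scaler exposes the learned statistics through its mean_ and scale_ attributes, so the transform can be reproduced by hand (a quick sketch, reusing X from above):

scaler = StandardScaler().fit(X)

print(scaler.mean_)   # per-feature mean: [ 3.   0.5 -1. ]
print(scaler.scale_)  # per-feature standard deviation
print((X - scaler.mean_) / scaler.scale_)  # matches fit_transform(X)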

MinMaxScaler: MinMaxScaler()

MinMaxScaler rescales a feature to a specific range, typically between 0 and 1.

\[X_{std} = \frac{X - min(X)} {max(X) - min(X)}\]
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[4, -4, -9],
              [2, -1,  0],
              [4,  2,  2],
              [2,  5,  3]])
              
MinMaxScaler().fit_transform(X)
[[1.         0.         0.        ]
 [0.         0.33333333 0.75      ]
 [1.         0.66666667 0.91666667]
 [0.         1.         1.        ]]
  • Preserves original distribution: MinMaxScaler doesn’t change the shape of the original distribution.
  • Sensitive to outliers: Outliers can significantly affect the scaling, as they can influence the minimum and maximum values.
  • Suitable for neural networks: It’s often used in neural networks where inputs are expected to be in a specific range.
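
The target interval is configurable through the feature_range parameter, e.g. when a network expects inputs in [-1, 1]; a brief sketch on the same X:

# Rescale to [-1, 1] instead of the default [0, 1]
MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)
[[ 1.         -1.         -1.        ]
 [-1.         -0.33333333  0.5       ]
 [ 1.          0.33333333  0.83333333]
 [-1.          1.          1.        ]]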

MaxAbsScaler: MaxAbsScaler()

MaxAbsScaler rescales a feature by its maximum absolute value.

\[X_{std} = \frac{X} {max(|X|)}\]
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

X = np.array([[4, -4, -9],
              [2, -1,  0],
              [4,  2,  2],
              [2,  5,  3]])
              
MaxAbsScaler().fit_transform(X)
[[ 1.         -0.8        -1.        ]
 [ 0.5        -0.2         0.        ]
 [ 1.          0.4         0.22222222]
 [ 0.5         1.          0.33333333]]
  • Preserves sparsity: It doesn’t shift or center the data, so it’s ideal for sparse datasets.
  • Simple scaling: It’s a straightforward scaling technique that doesn’t involve complex calculations.
  • Suitable for data centered at zero: It's particularly useful when you want to scale features without shifting them, so zero entries and signs are preserved.
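
Because it neither shifts nor centers the data, MaxAbsScaler also accepts scipy sparse matrices directly, where centering would destroy sparsity; a small sketch with hypothetical sparse data:

from scipy.sparse import csr_matrix
from sklearn.preprocessing import MaxAbsScaler

# Sparse input: zero entries stay zero after scaling
X_sparse = csr_matrix([[4,  0, -9],
                       [0, -1,  0],
                       [4,  0,  2],
                       [0,  5,  3]])

X_scaled = MaxAbsScaler().fit_transform(X_sparse)  # result is still sparse
print(X_scaled.toarray())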

RobustScaler: RobustScaler()

RobustScaler rescales a feature by removing the median and scaling it by the interquartile range.

\[X_{std} = \frac{X - \tilde{X}} {IQR}\]

where $\tilde{X}$ is the median of the feature and $IQR$ is the interquartile range (Q3 - Q1).

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[4, -4, -9],
              [2, -1,  0],
              [4,  2,  2],
              [2,  5,  3]])
              
RobustScaler().fit_transform(X)
[[ 0.5        -1.         -2.22222222]
 [-0.5        -0.33333333 -0.22222222]
 [ 0.5         0.33333333  0.22222222]
 [-0.5         1.          0.44444444]]
  • Median-based: It uses the median and interquartile range (IQR) to center and scale the data.
  • Robust to outliers: The median and IQR are far less sensitive to extreme values than the mean and standard deviation.
  • Suitable for non-normally distributed data: It can handle skewed distributions more effectively.
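
To make the robustness concrete, here is a small sketch comparing both scalers on a single feature with one hypothetical extreme outlier:

import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# One feature, one extreme outlier (1000)
X_out = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])

# StandardScaler: the outlier inflates the mean and standard deviation,
# squeezing the inliers into a narrow band around -0.5
print(StandardScaler().fit_transform(X_out).ravel())

# RobustScaler: the median and IQR ignore the extreme value,
# so the inliers keep a meaningful spread
print(RobustScaler().fit_transform(X_out).ravel())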

Non-linear scalers

PowerTransformer: PowerTransformer()

PowerTransformer transforms features to a more Gaussian-like distribution.

  1. Skewness and Kurtosis: The distribution of the features is analyzed to identify skewness and kurtosis.
  2. Parameter Estimation: The optimal power parameter (lambda) is estimated using maximum likelihood estimation.
  3. Transformation: The data is transformed using the estimated lambda.
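
A minimal sketch on a hypothetical right-skewed feature (the default method, 'yeo-johnson', also handles zero and negative values, while 'box-cox' requires strictly positive data); the fitted lambdas_ attribute holds the estimated power parameter:

import numpy as np
from sklearn.preprocessing import PowerTransformer

# A right-skewed feature
X_skew = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [13.0], [21.0], [90.0]])

pt = PowerTransformer(method='yeo-johnson')  # 'box-cox' would require X > 0
X_gauss = pt.fit_transform(X_skew)

print(pt.lambdas_)      # estimated lambda per feature
print(X_gauss.ravel())  # more symmetric, standardized output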

  • Parametric: It applies a specific parametric transformation (Box-Cox or Yeo-Johnson) whose power parameter is fitted to the data.
  • Distribution-based: It transforms data to a specific distribution, often a normal distribution.
  • Sensitive to Outliers: While less sensitive than some other methods, outliers can still influence the transformation.
  • Changes spacing, not rank: The transformation is monotonic, so the rank order of data points is preserved while the relative distances between them change.

QuantileTransformer: QuantileTransformer()

QuantileTransformer transforms features to a specified distribution, typically a uniform or normal distribution.

  1. Rank-based Transformation: Each data point is assigned a rank based on its position in the sorted list.
  2. Quantile Mapping: The ranks are then mapped to quantiles of the desired distribution (uniform or normal).
  3. Transformation: Each data point is replaced by the corresponding quantile of the target distribution.
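
A minimal sketch mapping the same kind of skewed feature onto a normal distribution (n_quantiles is lowered to the sample count here, since it should not exceed the number of samples):

import numpy as np
from sklearn.preprocessing import QuantileTransformer

X_skew = np.array([[1.0], [2.0], [3.0], [5.0], [8.0], [13.0], [21.0], [90.0]])

qt = QuantileTransformer(output_distribution='normal',  # default is 'uniform'
                         n_quantiles=8)                 # <= number of samples
print(qt.fit_transform(X_skew).ravel())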

  • Non-parametric: It doesn’t assume a specific distribution for the data.
  • Rank-based: It maps data points to quantiles of a desired distribution (uniform or normal).
  • Robust to Outliers: It’s less sensitive to outliers as it focuses on the relative ranks of data points.
  • Preserves Rank Order: The relative order of data points remains unchanged after transformation.

Jupyter notebook
