Handle Missing Values

class handle_missing_values.MissingValueHandler

A class to handle missing values in datasets using various strategies such as simple imputation, KNN-based imputation, iterative imputation, and machine learning models.

static add_missing_indicator(data: DataFrame, columns: List[str] | None = None, inplace: bool = False) → DataFrame

Adds a binary indicator column for each selected feature, marking the rows where that feature's value is missing.

Parameters:

data (pd.DataFrame) – The input DataFrame.
columns (List[str], optional) – List of columns to create indicators for. If None, only columns with missing values are used.
inplace (bool, optional) – If True, perform operation in-place.

Returns:

The original DataFrame with additional indicator columns for missing values.

Return type:

pd.DataFrame
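The behavior described above can be sketched in plain pandas. This is an illustration, not the library's implementation: the helper name `add_indicator_sketch` and the `_missing` column suffix are assumptions, and the class may name its indicator columns differently.

```python
import numpy as np
import pandas as pd


def add_indicator_sketch(data, columns=None):
    """Return a copy of `data` with one 0/1 indicator column per selected column."""
    result = data.copy()
    if columns is None:
        # Default described above: only columns that contain missing values.
        columns = [col for col in data.columns if data[col].isna().any()]
    for col in columns:
        # The "_missing" suffix is hypothetical, chosen for this sketch.
        result[f"{col}_missing"] = data[col].isna().astype(int)
    return result


df = pd.DataFrame({"age": [25.0, np.nan, 40.0], "city": ["Oslo", "Rome", "Lima"]})
flagged = add_indicator_sketch(df)
```

Keeping an indicator alongside an imputed column lets a downstream model distinguish observed values from filled ones.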

static drop_missing(data: DataFrame, axis: int = 0, how: str = 'any', thresh: int | None = None, subset: List[str] | None = None, inplace: bool = False) → DataFrame

Drops rows or columns with missing values.

Parameters:

data (pd.DataFrame) – The input DataFrame.
axis (int, optional) – Specifies whether to drop rows (0) or columns (1). Default is 0 (drop rows).
how (str, optional) – ‘any’ or ‘all’. If ‘any’, drop if any NA values are present. If ‘all’, drop if all values are NA.
thresh (int, optional) – Minimum number of non-NA values a row or column must contain to be kept. Overrides ‘how’.
subset (List[str], optional) – Labels along the other axis to consider, for example the column names to check when dropping rows.
inplace (bool, optional) – If True, perform operation in-place.

Returns:

The DataFrame with missing rows or columns dropped.

Return type:

pd.DataFrame
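The parameters mirror those of pandas' `DataFrame.dropna`, so their effect can be illustrated with pandas directly. That this method forwards to `dropna` is an assumption based on the matching names, not something the page states.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, np.nan],
        "b": [1.0, 2.0, np.nan, np.nan],
        "c": [1.0, 2.0, 3.0, np.nan],
    }
)

# how="any": keep only rows with no missing values.
rows_any = df.dropna(axis=0, how="any")

# how="all": drop only rows in which every value is missing.
rows_all = df.dropna(axis=0, how="all")

# thresh=2: keep rows that have at least two non-NA values.
rows_thresh = df.dropna(axis=0, thresh=2)

# subset: only column "c" is inspected when deciding which rows to drop.
rows_subset = df.dropna(axis=0, subset=["c"])
```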

static drop_missing_threshold(data: DataFrame, threshold: float = 0.5, axis: int = 0, inplace: bool = False) → DataFrame

Drops rows or columns whose proportion of missing values exceeds a specified threshold.

Parameters:

data (pd.DataFrame) – The input DataFrame.
threshold (float) – The maximum allowed proportion of missing values (between 0 and 1). Default is 0.5.
axis (int) – Specifies whether to drop rows (0) or columns (1). Default is 0 (drop rows).
inplace (bool, optional) – If True, perform operation in-place.

Returns:

The DataFrame with rows or columns dropped based on the missing value threshold.

Return type:

pd.DataFrame
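A minimal sketch of the thresholding rule in plain pandas. The helper name `drop_by_missing_share` is hypothetical, and the sketch assumes a strict comparison: a row or column is dropped only when its missing share is greater than `threshold`.

```python
import numpy as np
import pandas as pd


def drop_by_missing_share(data, threshold=0.5, axis=0):
    """Drop rows (axis=0) or columns (axis=1) whose missing share exceeds `threshold`."""
    if axis == 0:
        # Share of missing values within each row.
        share = data.isna().mean(axis=1)
        return data.loc[share <= threshold]
    # Share of missing values within each column.
    share = data.isna().mean(axis=0)
    return data.loc[:, share <= threshold]


df = pd.DataFrame(
    {
        "mostly_empty": [np.nan, np.nan, np.nan, 4.0],
        "complete": [1.0, 2.0, 3.0, 4.0],
    }
)
kept_columns = drop_by_missing_share(df, threshold=0.5, axis=1)
kept_rows = drop_by_missing_share(df, threshold=0.4, axis=0)
```

Here `mostly_empty` is 75% missing and is removed when `axis=1`, while with `axis=0` and a threshold of 0.4 only the fully observed row remains.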

static fill_missing(data: DataFrame, strategy: str | Dict[str, str] = 'mean', fill_value: Any | None = None, columns: List[str] | None = None, inplace: bool = False) → DataFrame

Fills missing values in the DataFrame using specified strategies.

Parameters:

data (pd.DataFrame) – The input DataFrame.
strategy (Union[str, Dict[str, str]]) – The imputation strategy (‘mean’, ‘median’, ‘most_frequent’, ‘constant’) or a dictionary mapping column names to strategies.
fill_value (Any, optional) – The value used to fill missing entries when strategy is ‘constant’.
columns (List[str], optional) – List of columns to impute. If None, all columns are imputed.
inplace (bool, optional) – If True, perform operation in-place.

Returns:

The DataFrame with missing values filled according to the strategy.

Return type:

pd.DataFrame
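The per-column strategy mapping can be illustrated with a small pandas-only sketch. The helper `fill_sketch` is hypothetical and is not the class's implementation; it only shows how a single strategy string and a column-to-strategy dictionary might be resolved.

```python
import numpy as np
import pandas as pd


def fill_sketch(data, strategy="mean", fill_value=None, columns=None):
    """Fill missing values column by column and return a new DataFrame."""
    result = data.copy()
    if columns is None:
        # With a dict, only the columns it names are touched in this sketch.
        columns = list(strategy) if isinstance(strategy, dict) else list(data.columns)
    for col in columns:
        # A dict maps column names to strategies; a string applies to every column.
        chosen = strategy[col] if isinstance(strategy, dict) else strategy
        if chosen == "mean":
            value = result[col].mean()
        elif chosen == "median":
            value = result[col].median()
        elif chosen == "most_frequent":
            value = result[col].mode().iloc[0]
        elif chosen == "constant":
            value = fill_value
        else:
            raise ValueError(f"Unknown strategy: {chosen}")
        result[col] = result[col].fillna(value)
    return result


df = pd.DataFrame(
    {
        "income": [10.0, np.nan, 30.0],
        "color": ["red", None, "red"],
    }
)
filled = fill_sketch(df, strategy={"income": "mean", "color": "most_frequent"})
```

Mean and median only make sense for numeric columns, which is why a dictionary is useful for frames that mix numeric and categorical data.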

static fill_missing_bfill(data: DataFrame, columns: List[str] | None = None, inplace: bool = False, limit: int | None = None) → DataFrame

Fills missing values using backward fill method.

Parameters:

data (pd.DataFrame) – The input DataFrame.
columns (List[str], optional) – List of columns to backward fill. If None, all columns are used.
inplace (bool, optional) – If True, perform operation in-place.
limit (int, optional) – The maximum number of consecutive NaNs to fill.

Returns:

The DataFrame with missing values filled using backward fill.

Return type:

pd.DataFrame
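Backward fill and the effect of `limit` can be seen with pandas' own `DataFrame.bfill`. Whether this method calls `bfill` internally is an assumption; the example only illustrates the technique.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reading": [np.nan, np.nan, 3.0, np.nan, 5.0, np.nan]})

# Backward fill: each gap takes the next valid value that follows it.
filled = df.bfill()

# limit=1: at most one NaN per gap is filled, the one nearest the valid value.
limited = df.bfill(limit=1)
```

A trailing NaN has no later value to copy from, so backward fill leaves it missing.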

static fill_missing_ffill(data: DataFrame, columns: List[str] | None = None, inplace: bool = False, limit: int | None = None) → DataFrame

Fills missing values using forward fill method.

Parameters:

data (pd.DataFrame) – The input DataFrame.
columns (List[str], optional) – List of columns to forward fill. If None, all columns are used.
inplace (bool, optional) – If True, perform operation in-place.
limit (int, optional) – The maximum number of consecutive NaNs to fill.

Returns:

The DataFrame with missing values filled using forward fill.

Return type:

pd.DataFrame
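Forward fill restricted to selected columns can be sketched with pandas' `DataFrame.ffill`. The column names are made up for the example, and the use of `ffill` here is an assumption about the technique, not a description of the class's internals.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "price": [1.0, np.nan, np.nan, 4.0],
        "volume": [np.nan, 10.0, np.nan, 40.0],
    }
)

# Work on a copy and forward fill only the chosen columns.
result = df.copy()
columns = ["price"]
# limit=1: carry the last valid value forward across at most one NaN.
result[columns] = result[columns].ffill(limit=1)
```

Forward fill suits ordered data such as time series, where the most recent observation is a reasonable stand-in for a short gap.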

static fill_missing_iterative(data: DataFrame, estimator: RegressorMixin | None = None, columns: List[str] | None = None, inplace: bool = False, **kwargs: Any) → DataFrame

Fills missing values using Iterative Imputer.

Parameters:

data (pd.DataFrame) – The input DataFrame.
estimator (RegressorMixin, optional) – The estimator to use at each step of the imputation. If None, BayesianRidge is used.
columns (List[str], optional) – List of columns to impute. If None, all columns are imputed.
inplace (bool, optional) – If True, perform operation in-place.
**kwargs – Additional keyword arguments to pass to IterativeImputer.

Returns:

The DataFrame with missing values filled using Iterative Imputer.

Return type:

pd.DataFrame
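The technique named above corresponds to scikit-learn's `IterativeImputer`, which models each feature with missing values as a function of the others. The following is a sketch using scikit-learn directly on made-up data; how the class wraps the imputer is not shown on this page.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

df = pd.DataFrame(
    {
        "x": [1.0, 2.0, 3.0, 4.0, 5.0],
        "y": [2.0, 4.0, np.nan, 8.0, 10.0],
    }
)

# BayesianRidge is the estimator the parameter description names as the default.
imputer = IterativeImputer(estimator=BayesianRidge(), max_iter=10, random_state=0)

# fit_transform returns a NumPy array, so the frame is rebuilt with the original labels.
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
```

`IterativeImputer` is still marked experimental in scikit-learn, which is why the `enable_iterative_imputer` import is required.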

static fill_missing_knn(data: DataFrame, n_neighbors: int = 5, weights: str = 'uniform', metric: str = 'nan_euclidean', columns: List[str] | None = None, inplace: bool = False) → DataFrame

Fills missing values using K-Nearest Neighbors (KNN) imputation.

Parameters:

data (pd.DataFrame) – The input DataFrame.
n_neighbors (int) – Number of neighboring samples to use for imputation.
weights (str) – Weight function used in prediction (‘uniform’ or ‘distance’).
metric (str) – Distance metric for searching neighbors.
columns (List[str], optional) – List of columns to impute. If None, all columns are imputed.
inplace (bool, optional) – If True, perform operation in-place.

Returns:

The DataFrame with missing values filled using KNN imputation.

Return type:

pd.DataFrame
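The parameter names match scikit-learn's `KNNImputer`, so the idea can be sketched with that class. This is an illustration on made-up data and assumes, without confirmation from this page, that the method builds on `KNNImputer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame(
    {
        "height": [150.0, 152.0, 180.0, 182.0],
        "weight": [50.0, np.nan, 80.0, 82.0],
    }
)

# Each missing value is replaced by the mean of that feature over the
# n_neighbors closest rows, measured with a NaN-aware Euclidean distance.
imputer = KNNImputer(n_neighbors=2, weights="uniform", metric="nan_euclidean")
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
```

Because the distance depends on feature scale, scaling the columns before KNN imputation is usually worth considering.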

static fill_missing_ml(data: DataFrame, target_column: str, model: RegressorMixin | ClassifierMixin | None = None, search_type: str = 'grid', param_grid: Dict[str, List[Any]] | None = None, cv: int = 5, inplace: bool = False, **kwargs: Any) → DataFrame

Fills missing values in the target column using a machine learning model trained on the other columns, with hyperparameter tuning using cross-validation.

Parameters:

data (pd.DataFrame) – The input DataFrame.
target_column (str) – The name of the column with missing values to impute.
model (Union[RegressorMixin, ClassifierMixin], optional) – The machine learning model to use. If None, RandomForestRegressor or RandomForestClassifier is used.
search_type (str) – Type of search for hyperparameter tuning (‘grid’ or ‘random’).
param_grid (Dict[str, List[Any]], optional) – The hyperparameter grid for tuning.
cv (int) – Number of cross-validation folds for hyperparameter tuning.
inplace (bool) – If True, perform operation in-place.
**kwargs – Additional keyword arguments to pass to the model.

Returns:

The DataFrame with missing values in the target column filled using the tuned model.

Return type:

pd.DataFrame
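The general approach (train on rows where the target is known, tune with cross-validation, predict the rows where it is missing) can be sketched with scikit-learn. The data, feature names, and parameter grid below are invented for illustration, and the sketch covers only the regression case with a grid search; it is not the class's implementation.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic data: the target depends on two fully observed features.
rng = np.random.default_rng(0)
df = pd.DataFrame({"x1": rng.normal(size=40), "x2": rng.normal(size=40)})
df["target"] = 3.0 * df["x1"] - df["x2"]
df.loc[[3, 17, 29], "target"] = np.nan

features = ["x1", "x2"]
known = df[df["target"].notna()]    # rows used for training
unknown = df[df["target"].isna()]   # rows to impute

# Grid search over a small, hypothetical parameter grid with 5-fold CV.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [10, 25], "max_depth": [3, None]},
    cv=5,
)
search.fit(known[features], known["target"])

# Fill only the missing target entries with the tuned model's predictions.
result = df.copy()
result.loc[unknown.index, "target"] = search.predict(unknown[features])
```

This sketch assumes the feature columns themselves contain no missing values; in practice they would need to be imputed or dropped first.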

static identify_missing(data: DataFrame) → DataFrame

Identifies missing values in the dataset.

Parameters:

data (pd.DataFrame) – The input DataFrame.

Returns:

A boolean DataFrame of the same shape as the input, in which True marks a missing value.

Return type:

pd.DataFrame
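The return value described above has the same form as the output of pandas' `DataFrame.isna`, shown here on made-up data. That the method delegates to `isna` is an assumption.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan], "b": [None, "x"]})

# Same shape as df; True wherever the original value is missing.
mask = df.isna()
```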

static interpolate_missing(data: DataFrame, method: str = 'linear', axis: int = 0, limit: int | None = None, inplace: bool = False, **kwargs: Any) → DataFrame

Fills missing values using interpolation.

Parameters:

data (pd.DataFrame) – The input DataFrame.
method (str, optional) – Interpolation method. Defaults to ‘linear’.
axis (int, optional) – Axis along which to interpolate. Defaults to 0.
limit (int, optional) – Maximum number of consecutive NaNs to fill.
inplace (bool, optional) – If True, perform operation in-place.
**kwargs – Additional keyword arguments to pass to interpolate.

Returns:

The DataFrame with missing values filled using interpolation.

Return type:

pd.DataFrame
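The parameters correspond to those of pandas' `DataFrame.interpolate`, so linear interpolation and the `limit` argument can be illustrated with pandas directly. Treat this as a sketch of the technique on made-up data, not as the method's source.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"temp": [10.0, np.nan, np.nan, 16.0]})

# Linear interpolation spaces the filled values evenly between 10 and 16.
linear = df.interpolate(method="linear", axis=0)

# limit=1 fills only the first NaN of the gap and leaves the second missing.
limited = df.interpolate(method="linear", axis=0, limit=1)
```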

static missing_summary(data: DataFrame) → DataFrame

Provides a summary of missing values for each column in the dataset.

Parameters:

data (pd.DataFrame) – The input DataFrame.

Returns:

A DataFrame with columns ‘missing_count’ and ‘missing_percentage’.

Return type:

pd.DataFrame
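A summary of this shape can be sketched in plain pandas. The helper name `summary_sketch` is hypothetical, and expressing the percentage on a 0 to 100 scale is an assumption; the class may scale or round it differently.

```python
import numpy as np
import pandas as pd


def summary_sketch(data):
    """Return per-column missing counts and percentages, indexed by column name."""
    missing_count = data.isna().sum()
    # Assumed scale: 0-100 rather than a 0-1 fraction.
    missing_percentage = data.isna().mean() * 100
    return pd.DataFrame(
        {"missing_count": missing_count, "missing_percentage": missing_percentage}
    )


df = pd.DataFrame({"a": [1.0, np.nan, np.nan, 4.0], "b": [1.0, 2.0, 3.0, 4.0]})
summary = summary_sketch(df)
```

Such a table is a convenient first step for choosing between dropping a column and imputing it.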