Solve Regression

class solve_regression.RegressionSolver(models: Dict[str, BaseEstimator] | None = None, random_state: int = 42)

A comprehensive class for solving regression problems using various machine learning models. Includes methods for data preprocessing, model training, evaluation, hyperparameter tuning, cross-validation, model merging, and model persistence.

auto_select_best_model(X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'neg_mean_squared_error') → Tuple[str, float]

Automatically selects the best model based on its cross-validated score. If a hyperparameter-tuned version of a model is available, that version is used in place of the untuned one.

Parameters:

X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target.
cv (int) – Number of cross-validation folds (default: 5).
scoring (str) – Scoring metric for evaluation.

Returns:

The name of the best performing model and its score based on cross-validation.

Return type:

Tuple[str, float]
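
As a rough illustration of the selection logic described above, the sketch below calls scikit-learn directly rather than RegressionSolver. The `models` and `tuned_models` dictionaries, their contents, and the synthetic data are assumptions made for the example; the class's real internals may differ.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for X_train / y_train.
X, y = make_regression(n_samples=120, n_features=5, noise=5.0, random_state=42)

# Hypothetical candidate models and a hypothetical store of tuned versions.
models = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=42),
}
tuned_models = {"ridge": Ridge(alpha=10.0)}  # pretend 'ridge' was tuned earlier

mean_scores = {}
for name, base_model in models.items():
    # Prefer the tuned version when one exists, otherwise use the base model.
    estimator = tuned_models.get(name, base_model)
    scores = cross_val_score(estimator, X, y, cv=5, scoring="neg_mean_squared_error")
    mean_scores[name] = float(scores.mean())

# With 'neg_mean_squared_error' every score is <= 0 and higher is better,
# so the best model is the one with the maximum mean score.
best_name = max(mean_scores, key=mean_scores.get)
best_score = mean_scores[best_name]
print(best_name, best_score)
```

Because scikit-learn negates error metrics so that larger always means better, the same `max` works unchanged for scorers such as 'r2'.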

compare_models(X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'neg_mean_squared_error') → DataFrame

Compares multiple models based on cross-validation scores.

Parameters:

X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target.
cv (int) – Number of cross-validation folds.
scoring (str) – Scoring metric for evaluation.

Returns:

DataFrame containing models and their scores.

Return type:

pd.DataFrame
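
One plausible shape for such a comparison table is sketched below with plain scikit-learn and pandas. The column names (`model`, `mean_score`, `std_score`) and the two candidate models are assumptions for illustration, not the documented output of compare_models.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data stands in for X_train / y_train.
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Hypothetical set of models to compare.
models = {"linear": LinearRegression(), "lasso": Lasso(alpha=0.1)}

rows = []
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    rows.append({"model": name, "mean_score": scores.mean(), "std_score": scores.std()})

# Sort so the best (highest) score comes first.
comparison = (
    pd.DataFrame(rows)
    .sort_values("mean_score", ascending=False)
    .reset_index(drop=True)
)
print(comparison)
```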

cross_validate_model(model: BaseEstimator, X: DataFrame, y: Series, cv: int = 5, scoring: List[str] | None = None) → Dict[str, Any]

Cross-validates the model using the specified number of folds and returns detailed metrics.

Parameters:

model (BaseEstimator) – The model to cross-validate.
X (pd.DataFrame) – Feature matrix.
y (pd.Series) – Target variable.
cv (int) – Number of cross-validation folds.
scoring (Optional[List[str]]) – List of scoring metrics to evaluate (default: None uses common regression metrics).

Returns:

Cross-validation metrics including R2, MAE, MSE, RMSE, etc.

Return type:

Dict[str, Any]
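
The sketch below shows how several regression metrics can be gathered in one cross-validation pass with scikit-learn's `cross_validate`. The choice of scorers and the keys of the resulting dictionary are assumptions; the defaults used by cross_validate_model are not listed in this reference.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_validate

# Synthetic data stands in for X / y.
X, y = make_regression(n_samples=100, n_features=4, noise=10.0, random_state=0)

# Assumed set of "common regression metrics".
scoring = [
    "r2",
    "neg_mean_absolute_error",
    "neg_mean_squared_error",
    "neg_root_mean_squared_error",
]
raw = cross_validate(Ridge(), X, y, cv=5, scoring=scoring)

# scikit-learn reports error metrics negated; flip the sign so that
# MAE, MSE and RMSE read as ordinary positive errors.
metrics = {
    "r2": float(raw["test_r2"].mean()),
    "mae": float(-raw["test_neg_mean_absolute_error"].mean()),
    "mse": float(-raw["test_neg_mean_squared_error"].mean()),
    "rmse": float(-raw["test_neg_root_mean_squared_error"].mean()),
}
print(metrics)
```

Note that the averaged RMSE is the mean of the per-fold RMSE values, which is generally not equal to the square root of the averaged MSE.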

evaluate_model(model: BaseEstimator, X_test: DataFrame, y_test: Series) → Dict[str, Any]

Evaluates the regression model on test data.

Parameters:

model (BaseEstimator) – The trained model.
X_test (pd.DataFrame) – Testing features.
y_test (pd.Series) – Testing target.

Returns:

A dictionary containing evaluation metrics.

Return type:

Dict[str, Any]
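
The exact keys of the returned dictionary are not specified here. As a minimal sketch, assuming the usual regression metrics (R2, MAE, MSE, RMSE), a held-out evaluation with scikit-learn could look like this:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data and a fitted model stand in for the solver's inputs.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Predict on the held-out set and compute the metrics.
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
metrics = {
    "r2": float(r2_score(y_test, y_pred)),
    "mae": float(mean_absolute_error(y_test, y_pred)),
    "mse": float(mse),
    "rmse": float(np.sqrt(mse)),  # RMSE derived from MSE
}
print(metrics)
```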

hyperparameter_tuning(model_name: str, X_train: DataFrame, y_train: Series, param_grid: Dict[str, List[Any]] | None = None, cv: int = 5, search_type: str = 'grid', n_iter: int = 50, scoring: str = 'neg_mean_squared_error') → None

Performs hyperparameter tuning using GridSearchCV, RandomizedSearchCV, or Bayesian Optimization for one or all models and stores the best models.

Parameters:

model_name (str) – The name of the model to tune. If 'all', tunes all models in self.models.
X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target.
param_grid (Optional[Dict[str, List[Any]]]) – Parameter grid for hyperparameter tuning. If None, uses default.
cv (int) – Number of cross-validation folds.
search_type (str) – Type of search ('grid', 'random', or 'bayesian').
n_iter (int) – Number of iterations for RandomizedSearchCV or Bayesian Optimization.
scoring (str) – Scoring metric for evaluation.

Returns:

The best models are stored in self.tuned_models.

Return type:

None
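
To make the difference between the search types concrete, here is a minimal sketch using scikit-learn's `GridSearchCV` and `RandomizedSearchCV` on a single Ridge model. The helper `tune`, the parameter grid, and the local `tuned_models` dictionary are hypothetical. Bayesian search is left out because it depends on a third-party package that this reference does not name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic data stands in for X_train / y_train.
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=42)

# Hypothetical grid; the solver's default grids are not documented here.
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0]}


def tune(search_type: str, n_iter: int = 3):
    """Run a grid or randomized search over param_grid (illustrative helper)."""
    if search_type == "grid":
        # Tries every combination in the grid.
        search = GridSearchCV(
            Ridge(), param_grid, cv=5, scoring="neg_mean_squared_error"
        )
    elif search_type == "random":
        # Samples n_iter combinations instead of trying them all.
        search = RandomizedSearchCV(
            Ridge(), param_grid, n_iter=n_iter, cv=5,
            scoring="neg_mean_squared_error", random_state=42,
        )
    else:
        raise ValueError(f"search_type {search_type!r} is not covered in this sketch")
    return search.fit(X, y)


# Keep the refitted best estimator, mirroring the idea of self.tuned_models.
tuned_models = {}
grid_search = tune("grid")
tuned_models["ridge"] = grid_search.best_estimator_
print(grid_search.best_params_, grid_search.best_score_)
```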

load_model(filename: str) → BaseEstimator

Loads a trained model from disk.

Parameters:

filename (str) – The path and filename to load the model from.

Returns:

The loaded model.

Return type:

BaseEstimator

model_merging(base_models: List[str], X_train: DataFrame, y_train: Series, method: str = 'stacking', final_estimator: BaseEstimator | None = None, passthrough: bool = False, cv: int = 5, n_estimators: int = 10) → BaseEstimator

Creates an ensemble model by merging multiple base models using different ensemble techniques. Supports stacking, bagging, boosting, and voting.

Parameters:

base_models (List[str]) – List of model names to be used as base models.
X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target.
method (str) – The ensemble method to use ('stacking', 'bagging', 'boosting', or 'voting').
final_estimator (Optional[BaseEstimator]) – The final estimator to combine base models for stacking. Defaults to Ridge.
passthrough (bool) – If True, pass the original features to the final estimator (only for stacking).
cv (int) – Number of cross-validation folds for stacking.
n_estimators (int) – Number of estimators for bagging or boosting.

Returns:

The ensemble model.

Return type:

BaseEstimator
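
For orientation, the sketch below builds a stacking ensemble and a voting ensemble directly with scikit-learn. Mapping `method='stacking'` to `StackingRegressor` and `method='voting'` to `VotingRegressor` is an assumption, as are the two base models chosen here. Bagging and boosting are not shown.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic data stands in for X_train / y_train.
X, y = make_regression(n_samples=150, n_features=5, noise=10.0, random_state=42)

# Hypothetical base models, given as (name, estimator) pairs.
estimators = [
    ("linear", LinearRegression()),
    ("tree", DecisionTreeRegressor(max_depth=3, random_state=42)),
]

# Stacking: out-of-fold predictions of the base models become the inputs
# of the final estimator (Ridge here, matching the documented default).
stacked = StackingRegressor(
    estimators=estimators, final_estimator=Ridge(), passthrough=False, cv=5
).fit(X, y)

# Voting: the ensemble prediction is the average of the base predictions.
voted = VotingRegressor(estimators=estimators).fit(X, y)

print(stacked.predict(X[:3]), voted.predict(X[:3]))
```

With `passthrough=True`, the final estimator would also receive the original features in addition to the base-model predictions.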

plot_feature_importance(model: BaseEstimator, feature_names: List[str]) → None

Plots feature importance for models that support it.

Parameters:

model (BaseEstimator) – The trained model.
feature_names (List[str]) – List of feature names.
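
Which models count as supporting feature importance is not stated here. The sketch below assumes it means estimators exposing `feature_importances_` (tree ensembles) or `coef_` (linear models), and computes the ranked values such a plot would display. The helper `ranked_importances` is hypothetical, and the drawing step is omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data with hypothetical feature names.
X, y = make_regression(
    n_samples=200, n_features=4, n_informative=2, noise=1.0, random_state=0
)
feature_names = ["f0", "f1", "f2", "f3"]
model = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)


def ranked_importances(model, feature_names):
    """Return (name, importance) pairs sorted from most to least important."""
    if hasattr(model, "feature_importances_"):
        values = np.asarray(model.feature_importances_)
    elif hasattr(model, "coef_"):
        # For linear models, use absolute coefficient size as a rough proxy.
        values = np.abs(np.ravel(model.coef_))
    else:
        raise ValueError("model exposes neither feature_importances_ nor coef_")
    order = np.argsort(values)[::-1]
    return [(feature_names[i], float(values[i])) for i in order]


ranking = ranked_importances(model, feature_names)
for name, value in ranking:
    print(f"{name}: {value:.3f}")
```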

plot_learning_curve(model: BaseEstimator, X_train: DataFrame, y_train: Series, cv: int = 5, scoring: str = 'neg_mean_squared_error') → None

Plots the learning curve of the model.

Parameters:

model (BaseEstimator) – The model to plot learning curve for.
X_train (pd.DataFrame) – Feature matrix.
y_train (pd.Series) – Target variable.
cv (int) – Number of cross-validation folds.
scoring (str) – Scoring metric.
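
A learning curve plots training and validation scores against the number of training samples. The sketch below computes those values with scikit-learn's `learning_curve`. The five training-set fractions are an assumption, and the plotting step is omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Synthetic data stands in for X_train / y_train.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)

# Evaluate the model on growing fractions of the training folds.
train_sizes, train_scores, val_scores = learning_curve(
    Ridge(), X, y, cv=5, scoring="neg_mean_squared_error",
    train_sizes=np.linspace(0.2, 1.0, 5),
)

# One row per training size, one column per fold; average over folds
# to get the two curves that would be drawn.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(size, round(tr, 2), round(va, 2))
```

A large, persistent gap between the two curves usually points to overfitting, while two curves that converge at a poor score suggest underfitting.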

plot_residual_distribution(model: BaseEstimator, X_test: DataFrame, y_test: Series) → None

Plots the distribution of residuals (prediction errors).

Parameters:

model (BaseEstimator) – The trained model.
X_test (pd.DataFrame) – Testing features.
y_test (pd.Series) – Testing target.

plot_residuals(model: BaseEstimator, X_test: DataFrame, y_test: Series) → None

Plots residuals of the regression model.

Parameters:

model (BaseEstimator) – The trained model.
X_test (pd.DataFrame) – Testing features.
y_test (pd.Series) – Testing target.
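
Both residual plots are built from the same quantity. The sketch below assumes the common convention of residual = actual minus predicted, and computes the residuals along with the histogram counts a distribution plot would show. The drawing itself is omitted.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data and a fitted model stand in for the solver's inputs.
X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
model = LinearRegression().fit(X_train, y_train)

# Residuals: what a residuals-vs-predicted scatter plot puts on the y-axis.
y_pred = model.predict(X_test)
residuals = y_test - y_pred

# Binned residuals: what a residual distribution plot would display.
counts, bin_edges = np.histogram(residuals, bins=10)
print(residuals.mean(), residuals.std())
```

For a well-specified model the residuals should scatter around zero with no visible trend against the predictions.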

save_model(model: BaseEstimator, filename: str) → None

Saves the trained model to disk.

Parameters:

model (BaseEstimator) – The trained model.
filename (str) – The path and filename to save the model.
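
The serialization format behind save_model and load_model is not stated in this reference. As a sketch of the round trip, assuming joblib (a common choice for scikit-learn estimators), saving and restoring a fitted model looks like this:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Fit a tiny model on toy data.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])
model = LinearRegression().fit(X, y)

# Write the model to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    filename = os.path.join(tmp_dir, "model.joblib")  # hypothetical file name
    joblib.dump(model, filename)
    restored = joblib.load(filename)

# The restored model should give the same predictions as the original.
print(np.allclose(restored.predict(X), model.predict(X)))
```

Files produced this way are pickle-based, so they should only be loaded from trusted sources.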

split_data(X: DataFrame, y: Series, test_size: float = 0.2, random_state: int | None = None) → Tuple[DataFrame, DataFrame, Series, Series]

Splits the data into training and testing sets.

Parameters:

X (pd.DataFrame) – Feature matrix.
y (pd.Series) – Target variable.
test_size (float) – Proportion of the dataset to include in the test split.
random_state (Optional[int]) – Random seed.

Returns:

Training and testing sets for features and target.

Return type:

Tuple[pd.DataFrame, pd.DataFrame, pd.Series, pd.Series]
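
The return type matches the output order of scikit-learn's `train_test_split` (features first, then targets). Assuming split_data wraps that function, which this reference does not confirm, the equivalent direct call is sketched below on toy data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy feature matrix and target with 10 rows.
X = pd.DataFrame({"a": range(10), "b": range(10, 20)})
y = pd.Series(range(10), name="target")

# test_size=0.2 holds out 2 of the 10 rows; random_state fixes the shuffle.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)
```

Rows stay aligned after the split: `X_train` and `y_train` share the same index, as do `X_test` and `y_test`.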

train_model(model_name: str, X_train: DataFrame, y_train: Series, use_pipeline: bool = False) → BaseEstimator

Trains a given regression model.

Parameters:

model_name (str) – The name of the model to train.
X_train (pd.DataFrame) – Training features.
y_train (pd.Series) – Training target.
use_pipeline (bool) – Whether to use a pipeline with scaling.

Returns:

The trained model.

Return type:

BaseEstimator
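
The scaler used when `use_pipeline=True` is not specified. The sketch below assumes a `StandardScaler` placed in front of the model inside a scikit-learn `Pipeline`. The helper `train` is hypothetical and takes an estimator instance rather than a model name.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data stands in for X_train / y_train.
X_train, y_train = make_regression(
    n_samples=100, n_features=4, noise=5.0, random_state=42
)


def train(model, X_train, y_train, use_pipeline=False):
    """Fit the model, optionally behind a scaling step (illustrative helper)."""
    if use_pipeline:
        # The scaler is fitted on the training data only, then applied
        # automatically whenever the pipeline predicts.
        estimator = Pipeline([("scaler", StandardScaler()), ("model", model)])
    else:
        estimator = model
    return estimator.fit(X_train, y_train)


plain = train(Ridge(), X_train, y_train)
scaled = train(Ridge(), X_train, y_train, use_pipeline=True)
print(type(plain).__name__, type(scaled).__name__)
```

Keeping the scaler inside the pipeline avoids leaking test-set statistics into training when the same object is later cross-validated.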