Visualize Data

class visualize_data.DataVisualizer

A class for visualizing different aspects of the dataset, including distributions, feature interactions, outlier detection, temporal data, dimensionality reduction, and more.

Methods:

- plot_distribution: Plot the distribution of specified columns.
- plot_missing_data: Visualize missing data in the dataframe.
- plot_correlation_heatmap: Plot a heatmap of correlations between numerical features.
- plot_swarmplot: Create a swarmplot to visualize data distribution across categories.
- plot_3d_scatter: Create a 3D scatter plot for three numerical features.
- plot_pairwise_relationships: Plot pairwise relationships between features.
- plot_scatter_with_outliers: Plot scatter plot with outliers highlighted.
- plot_boxplot_with_outliers: Plot boxplots for columns to visualize potential outliers.
- plot_isolation_forest_outliers: Highlight outliers detected by Isolation Forest.
- plot_time_series: Plot time series data with optional rolling window.
- plot_pca: Plot the results of Principal Component Analysis.
- plot_tsne: Plot the results of t-SNE dimensionality reduction.
- plot_umap: Plot the results of UMAP dimensionality reduction.
- plot_clusters: Plot data points color-coded by cluster labels.
- plot_interactive_histogram: Create an interactive histogram using Plotly.
- plot_interactive_correlation: Create an interactive correlation heatmap using Plotly.
- plot_interactive_scatter: Create an interactive scatter plot using Plotly.
- plot_feature_importance: Plot feature importance from a machine learning model.
- plot_barplot: Create a barplot for aggregated numerical values across categories.
- plot_boxplot_categorical: Create a boxplot for numerical distribution across categories.
- plot_categorical_distribution: Plot the distribution of a categorical feature.
- plot_categorical_heatmap: Create a heatmap for co-occurrences between two categorical features.
- plot_target_distribution: Plot the distribution of a target variable.
- display_basic_data: Display basic data such as the number of unique elements in each column and the number of missing values.

display_basic_data(df: DataFrame) → None

Display basic data such as the number of unique elements in each column and the number of missing values.

Parameters:

df (pd.DataFrame) – Input dataframe.
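The summary described for this method can be approximated directly with pandas. The sketch below is only an illustration: the `basic_summary` helper and the sample data are hypothetical, and how `display_basic_data` actually formats or prints its output is not documented here.

```python
import pandas as pd


def basic_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: per-column unique and missing counts."""
    return pd.DataFrame({
        "n_unique": df.nunique(),      # distinct non-null values per column
        "n_missing": df.isna().sum(),  # missing values per column
    })


# Small illustrative dataframe (made-up values).
df = pd.DataFrame({
    "city": ["Oslo", "Rome", "Oslo", None],
    "temp": [3.5, 18.0, None, None],
})

summary = basic_summary(df)
print(summary)
```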

plot_3d_scatter(df: DataFrame, x: List[str] | None = None, y: List[str] | None = None, z: List[str] | None = None, color: str | None = None) → None

Create a 3D scatter plot for visualizing relationships between three numerical features.

Parameters:

df (pd.DataFrame) – Input dataframe.
x (List[str], optional) – X-axis column(s).
y (List[str], optional) – Y-axis column(s).
z (List[str], optional) – Z-axis column(s).
color (str, optional) – Column for coloring the points.

plot_barplot(df: DataFrame, x: List[str] | None = None, y: List[str] | None = None, hue: str | None = None) → None

Create barplots for visualizing the aggregated values of numerical features across categories.

Parameters:

df (pd.DataFrame) – Input dataframe.
x (List[str], optional) – The categorical feature(s) to plot on the x-axis.
y (List[str], optional) – The numerical feature(s) to aggregate and plot on the y-axis.
hue (str, optional) – Column name for adding a hue to the plot.

plot_boxplot_categorical(df: DataFrame, x: List[str] | None = None, y: List[str] | None = None, hue: str | None = None, max_unique: int = 10) → None

Create boxplots to visualize the distribution of numerical features across different categories.

Parameters:

df (pd.DataFrame) – Input dataframe.
x (List[str], optional) – The categorical feature(s) to plot on the x-axis.
y (List[str], optional) – The numerical feature(s) to plot on the y-axis. If None, only columns with more than ‘max_unique’ unique elements are considered.
hue (str, optional) – Column name for adding a hue to the plot.
max_unique (int) – Maximum number of unique values to consider a column categorical.
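The `max_unique` threshold appears in several methods as the cut-off between categorical and numerical columns. One plausible reading, sketched below, is that a column with at most `max_unique` distinct values counts as categorical, while numeric columns above the threshold are candidates for the y-axis. The `split_by_cardinality` helper is hypothetical, and the exact boundary handling inside the library is an assumption.

```python
import pandas as pd


def split_by_cardinality(df: pd.DataFrame, max_unique: int = 10):
    """Hypothetical helper: classify columns by their number of unique values."""
    counts = df.nunique()
    # Assumption: "at most max_unique distinct values" means categorical.
    categorical = [c for c in df.columns if counts[c] <= max_unique]
    # Assumption: numeric columns above the threshold are treated as numerical.
    numerical = [
        c for c in df.columns
        if counts[c] > max_unique and pd.api.types.is_numeric_dtype(df[c])
    ]
    return categorical, numerical


df = pd.DataFrame({
    "group": ["a", "b", "c"] * 10,          # 3 unique values
    "score": [i * 0.5 for i in range(30)],  # 30 unique values
})

categorical, numerical = split_by_cardinality(df, max_unique=10)
print(categorical, numerical)
```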

plot_boxplot_with_outliers(df: DataFrame, columns: List[str] | None = None) → None

Plot boxplots for columns to visualize potential outliers.

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – List of column names to plot.

plot_categorical_distribution(df: DataFrame, columns: List[str] | None = None, hue: str | None = None, max_unique: int = 10) → None

Plot the distribution of categorical features.

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – Names of the categorical columns.
hue (str, optional) – Column name for adding a hue to the plot.
max_unique (int) – Maximum number of unique values to consider a column categorical.

plot_categorical_heatmap(df: DataFrame, cols: List[str] | None = None, max_unique: int = 10) → None

Create heatmaps for visualizing the frequency of co-occurrences between categorical features.

Parameters:

df (pd.DataFrame) – Input dataframe.
cols (List[str], optional) – List of categorical columns.
max_unique (int) – Maximum number of unique values to consider a column categorical.
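The co-occurrence frequencies that such a heatmap displays can be computed for one pair of columns with `pandas.crosstab`. This is a sketch of the underlying table only; whether the library builds it this way is an assumption, and the column names and data are made up.

```python
import pandas as pd

# Illustrative data: two categorical columns.
df = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red"],
    "size": ["S", "M", "S", "S", "M"],
})

# Rows are values of "color", columns are values of "size",
# cells count how often each pair occurs together.
co_occurrence = pd.crosstab(df["color"], df["size"])
print(co_occurrence)
```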

plot_clusters(df: DataFrame, cluster_labels: Series, method: str = 'pca', n_components: int = 2) → None

Plot data points color-coded by cluster labels using dimensionality reduction.

Parameters:

df (pd.DataFrame) – The input dataframe containing the features.
cluster_labels (pd.Series) – The cluster labels for each data point.
method (str) – The dimensionality reduction method (‘pca’, ‘umap’, ‘tsne’, or ‘identity’). Default is ‘pca’.
n_components (int) – Number of dimensions to reduce to. Default is 2.

plot_correlation_heatmap(df: DataFrame, method: str = 'pearson') → None

Plot a heatmap of correlations between numerical features in the dataframe.

Parameters:

df (pd.DataFrame) – Input dataframe.
method (str) – Correlation method (‘pearson’, ‘spearman’, ‘kendall’). Default is ‘pearson’.
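The three `method` values match the options accepted by `pandas.DataFrame.corr`, so the matrix behind the heatmap can be previewed as below. That the library delegates to `DataFrame.corr` is an assumption; the data is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],       # perfectly linear in x
    "z": [1, 8, 27, 64, 125],    # monotonic in x, but nonlinear
    "label": list("abcde"),      # non-numeric, excluded below
})

# Keep only numerical features, as the method description states.
numeric = df.select_dtypes(include="number")

pearson = numeric.corr(method="pearson")    # linear association
spearman = numeric.corr(method="spearman")  # rank-based association

print(pearson.round(3))
print(spearman.round(3))
```

Pearson measures linear association, so `x` and `z` correlate below 1, while Spearman works on ranks and reports a perfect monotonic relationship for the same pair.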

plot_distribution(df: DataFrame, columns: List[str] | None = None, kind: str = 'histogram') → None

Plot the distribution of the specified columns, or of all numeric columns in the dataframe when none are specified.

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – List of column names to plot. If None, all numeric columns are considered.
kind (str) – Type of plot (‘histogram’, ‘kde’, or ‘box’). Default is ‘histogram’.

plot_feature_importance(feature_importances: ndarray, feature_names: List[str]) → None

Plot feature importance from a machine learning model.

Parameters:

feature_importances (np.ndarray) – Array of feature importance values.
feature_names (List[str]) – List of feature names.
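The two arguments are parallel sequences: the i-th importance belongs to the i-th name. A minimal sketch of pairing and ranking them with NumPy follows. The values are made up, and whether the method itself sorts before plotting is an assumption.

```python
import numpy as np

# Illustrative inputs, e.g. taken from a fitted tree-based model.
feature_importances = np.array([0.10, 0.55, 0.35])
feature_names = ["age", "income", "tenure"]

# Indices that sort the importances from largest to smallest.
order = np.argsort(feature_importances)[::-1]
ranked = [(feature_names[i], float(feature_importances[i])) for i in order]

for name, value in ranked:
    print(f"{name}: {value:.2f}")
```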

plot_interactive_correlation(df: DataFrame) → None

Create an interactive correlation heatmap using Plotly.

Parameters:

df (pd.DataFrame) – Input dataframe.

plot_interactive_histogram(df: DataFrame, columns: List[str] | None = None) → None

Create interactive histograms for specified columns.

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – List of columns to visualize.

plot_interactive_scatter(df: DataFrame, x: List[str] | None = None, y: List[str] | None = None, color: str | None = None, size: str | None = None, max_unique: int = 10) → None

Create interactive scatter plots for all possible combinations of x and y columns.

Parameters:

df (pd.DataFrame) – Input dataframe.
x (List[str], optional) – X-axis column(s).
y (List[str], optional) – Y-axis column(s).
color (str, optional) – Column for color encoding.
size (str, optional) – Column for size encoding.
max_unique (int) – Maximum number of unique values to consider a column categorical.

plot_isolation_forest_outliers(df: DataFrame, outliers: Series) → None

Highlight outliers detected by Isolation Forest in a scatter plot.

Parameters:

df (pd.DataFrame) – Input dataframe (should have at least two columns).
outliers (pd.Series) – Boolean series indicating outliers.

plot_missing_data(df: DataFrame) → None

Visualize missing data in the dataframe using a heatmap.

Parameters:

df (pd.DataFrame) – Input dataframe.

plot_pairwise_relationships(df: DataFrame, columns: List[str] | None = None) → None

Plot pairwise relationships between features in the dataframe.

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – List of column names to plot pairwise relationships.

plot_pca(df: DataFrame, columns: List[str] | None = None, n_components: int = 2, color: str | None = None) → None

Plot the results of Principal Component Analysis (PCA).

Parameters:

df (pd.DataFrame) – Input dataframe.
columns (List[str], optional) – List of columns to use for PCA. If None, all numeric columns are used.
n_components (int) – Number of components to reduce to. Default is 2.
color (str, optional) – Column name to use for coloring the points.
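To show what "reducing to `n_components`" means, the sketch below projects the numeric columns onto their leading principal components using a plain NumPy SVD. It is not the library's implementation, which is not shown here: the `pca_project` helper is hypothetical, the data is synthetic, and no feature scaling is applied.

```python
import numpy as np
import pandas as pd


def pca_project(df: pd.DataFrame, n_components: int = 2):
    """Hypothetical helper: PCA via SVD on mean-centered numeric columns."""
    values = df.select_dtypes(include="number").to_numpy(dtype=float)
    centered = values - values.mean(axis=0)
    # Rows of vt are the principal directions, ordered by singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    components = centered @ vt[:n_components].T
    # Fraction of total variance captured by each component.
    explained = singular_values**2 / np.sum(singular_values**2)
    return components, explained[:n_components]


# Synthetic data: "b" is almost a multiple of "a", "c" is independent noise.
rng = np.random.default_rng(0)
base = rng.normal(size=100)
df = pd.DataFrame({
    "a": base,
    "b": 2 * base + rng.normal(scale=0.1, size=100),
    "c": rng.normal(size=100),
})

components, explained = pca_project(df, n_components=2)
print(components.shape, explained)
```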

plot_scatter_with_outliers(df: DataFrame, outliers: Series, x: List[str] | None = None, y: List[str] | None = None) → None

Plot scatter plots with outliers highlighted.

Parameters:

df (pd.DataFrame) – Input dataframe.
outliers (pd.Series) – Boolean series indicating outliers.
x (List[str], optional) – X-axis column(s).
y (List[str], optional) – Y-axis column(s).
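The `outliers` argument is a Boolean series aligned with the rows of `df`. One simple way to build such a series is the 1.5 × IQR rule, sketched below. The rule is swapped in purely for illustration, since the documentation does not prescribe how outliers are detected, and the `iqr_outliers` helper and the data are hypothetical.

```python
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Hypothetical helper: flag points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [10, 11, 9, 10, 12, 95],  # the last value is far from the rest
})

# Boolean series sharing df's index, True where "y" is an outlier.
outliers = iqr_outliers(df["y"])
print(outliers)
```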

plot_swarmplot(df: DataFrame, x: List[str] | None = None, y: List[str] | None = None, hue: str | None = None, marker_size: int = 5, max_unique: int = 10) → None

Create a swarmplot to visualize the distribution of data points across different categories.

Parameters:

df (pd.DataFrame) – Input dataframe.
x (List[str], optional) – The categorical feature(s) to plot on the x-axis.
y (List[str], optional) – The numerical feature(s) to plot on the y-axis.
hue (str, optional) – Column name for adding a hue to the plot.
marker_size (int) – Marker size of the plot.
max_unique (int) – Maximum number of unique values to consider a column categorical.

plot_target_distribution(df: DataFrame, target_columns: List[str] | None = None) → None

Plot the distribution of target variable(s).

Parameters:

df (pd.DataFrame) – Input dataframe.
target_columns (List[str], optional) – Names of the target columns.

plot_time_series(df: DataFrame, date_col: str | None = None, value_cols: List[str] | None = None, rolling_window: int | None = None) → None

Plot time series data with an optional rolling window.

Parameters:

df (pd.DataFrame) – Input dataframe.
date_col (str, optional) – Name of the datetime column. If None, uses the first datetime column.
value_cols (List[str], optional) – Names of the value columns to plot.
rolling_window (int, optional) – Optional rolling window size.
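The sketch below illustrates two behaviours described above using plain pandas: falling back to the first datetime column when `date_col` is None, and smoothing a value column with a rolling window. Using a rolling mean as the aggregation is an assumption, since the documentation only says "rolling window". The data is made up.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2024-01-01", periods=6, freq="D"),
    "sales": [10, 12, 14, 16, 18, 20],
})

# Fallback when no date column is given: take the first datetime column.
date_col = df.select_dtypes(include="datetime").columns[0]

# Assumed aggregation: mean over a window of 3 consecutive observations.
rolling_window = 3
smoothed = df.set_index(date_col)["sales"].rolling(window=rolling_window).mean()
print(smoothed)
```

The first `rolling_window - 1` entries are NaN because a full window is not yet available at those positions.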

plot_tsne(df: DataFrame, n_components: int = 2, perplexity: int = 30, color: str | None = None) → None

Plot the results of t-SNE dimensionality reduction.

Parameters:

df (pd.DataFrame) – Input dataframe.
n_components (int) – Number of components to reduce to. Default is 2.
perplexity (int) – Perplexity parameter for t-SNE. Default is 30.
color (str, optional) – Column name to use for coloring the points.

plot_umap(df: DataFrame, n_components: int = 2, n_neighbors: int = 15, min_dist: float = 0.1, color: str | None = None) → None

Plot the results of UMAP dimensionality reduction.

Parameters:

df (pd.DataFrame) – Input dataframe.
n_components (int) – Number of components to reduce to. Default is 2.
n_neighbors (int) – The size of the local neighborhood.
min_dist (float) – Minimum distance between points in the low-dimensional space.
color (str, optional) – Column name to use for coloring the points.