Cluster Data

class cluster_data.AgglomerativeClusteringModel(n_clusters: int = 2, affinity: str = 'euclidean', linkage: str = 'ward', distance_threshold: float | None = None)

Agglomerative Clustering algorithm.

class cluster_data.BaseClustering

Base class for clustering algorithms.

evaluate(X: DataFrame, metrics: List[str] | None = None) → Dict[str, float]

Evaluate the clustering result using the specified metrics.

Parameters:
  • X (pd.DataFrame) – The input data.

  • metrics (List[str], optional) – The evaluation metrics to use. Defaults to ['silhouette', 'davies_bouldin', 'calinski_harabasz'].

Returns:

A dictionary of evaluation scores.

Return type:

Dict[str, float]

fit_predict(X: DataFrame, y: Series | None = None) → ndarray | None

Perform clustering on X and return the cluster labels.

Parameters:
  • X (array-like of shape (n_samples, n_features)) – Input data.

  • y (Ignored) – Not used, present for API consistency by convention.

  • **kwargs (dict) –

    Arguments to be passed to fit.

    Added in version 1.4.

Returns:

labels – Cluster labels.

Return type:

ndarray of shape (n_samples,), dtype=np.int64
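
The signature above allows fit_predict to return None as well as a label array, so calling code may want to guard against both cases. The following is a minimal sketch of such a guard. The helper summarize_labels and the NOISE_LABEL constant are hypothetical names, and treating -1 as noise is an assumption borrowed from scikit-learn's DBSCAN convention, not something this package documents.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Assumption: noise points are labelled -1, as in scikit-learn's DBSCAN.
NOISE_LABEL = -1


def summarize_labels(labels: Optional[Iterable[int]]) -> Dict[str, object]:
    """Count points per cluster, tolerating a None result from fit_predict."""
    if labels is None:
        return {"n_clusters": 0, "n_noise": 0, "sizes": {}}
    # int() also accepts NumPy integer scalars, so an ndarray works here.
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(NOISE_LABEL, 0)
    return {
        "n_clusters": len(counts),
        "n_noise": n_noise,
        "sizes": dict(sorted(counts.items())),
    }


# Hypothetical labels standing in for the output of model.fit_predict(X).
example_labels = [0, 0, 1, -1, 1, 1, 2]
summary = summarize_labels(example_labels)
print(summary)
```

Separating the noise count from the cluster sizes keeps n_clusters from being inflated by unassigned points.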

class cluster_data.DBSCANClustering(eps: float = 0.5, min_samples: int = 5, metric: str = 'euclidean', algorithm: str = 'auto', leaf_size: int = 30, p: float | None = None, n_jobs: int | None = None)

DBSCAN clustering algorithm.

class cluster_data.GaussianMixtureClustering(n_components: int = 1, covariance_type: str = 'full', tol: float = 0.001, reg_covar: float = 1e-06, max_iter: int = 100, n_init: int = 1, init_params: str = 'kmeans', random_state: int | None = None, warm_start: bool = False, verbose: int = 0)

Gaussian Mixture Model clustering algorithm.

class cluster_data.KMeansClustering(n_clusters: int = 8, init: str = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, algorithm: str = 'auto')

K-Means clustering algorithm.

cluster_data.evaluate_clustering(X: DataFrame, labels: ndarray, metrics: List[str] | None = None) → Dict[str, float]

Evaluate clustering performance using specified metrics.

Parameters:
  • X (pd.DataFrame) – The input data.

  • labels (np.ndarray) – Cluster labels.

  • metrics (List[str], optional) – List of evaluation metrics. Defaults to ['silhouette', 'davies_bouldin', 'calinski_harabasz'].

Returns:

Dictionary of evaluation metric scores.

Return type:

Dict[str, float]
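
To make the 'silhouette' entry of the returned dictionary concrete, here is a dependency-free sketch of the standard mean silhouette coefficient with Euclidean distance. It illustrates the metric's textbook definition only. The source does not say how evaluate_clustering computes its scores, and the function name silhouette is a hypothetical helper, not part of cluster_data.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence


def silhouette(points: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient, Euclidean distance, O(n^2) sketch."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    def mean_dist(i: int, members: List[int]) -> float:
        # Average distance from point i to the members, excluding i itself.
        others = [j for j in members if j != i]
        return sum(math.dist(points[i], points[j]) for j in others) / len(others)

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        a = mean_dist(i, own)  # cohesion: mean distance within own cluster
        b = min(  # separation: mean distance to the nearest other cluster
            mean_dist(i, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


# Two well-separated pairs of points: a good and a deliberately bad labelling.
pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])
bad = silhouette(pts, [0, 1, 0, 1])
print(round(good, 3), round(bad, 3))
```

Scores near 1 indicate compact, well-separated clusters, while negative scores suggest points sit closer to a neighbouring cluster than to their own.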

cluster_data.find_optimal_k(X: DataFrame, max_k: int = 10, method: str = 'silhouette') → Dict[int, float]

Find the optimal number of clusters for KMeans clustering.

Parameters:
  • X (pd.DataFrame) – The input data.

  • max_k (int) – Maximum number of clusters to try.

  • method (str) – Evaluation metric to use ('silhouette', 'davies_bouldin', or 'calinski_harabasz').

Returns:

Dictionary mapping number of clusters to evaluation score.

Return type:

Dict[int, float]
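
Because the documented return value is a mapping from k to score rather than a single k, the caller still has to choose the best entry. The sketch below shows one way to do that. The helper pick_best_k and the sample scores are hypothetical. The direction of each metric follows its standard definition: higher is better for silhouette and Calinski-Harabasz, lower is better for Davies-Bouldin.

```python
from typing import Dict

# Direction of improvement for each documented metric name.
HIGHER_IS_BETTER = {
    "silhouette": True,
    "davies_bouldin": False,  # a lower Davies-Bouldin index is better
    "calinski_harabasz": True,
}


def pick_best_k(scores: Dict[int, float], method: str = "silhouette") -> int:
    """Return the k whose score is best for the given metric."""
    if not scores:
        raise ValueError("scores is empty")
    chooser = max if HIGHER_IS_BETTER[method] else min
    return chooser(scores, key=scores.get)


# Hypothetical scores, shaped like the documented Dict[int, float] result.
silhouette_scores = {2: 0.41, 3: 0.58, 4: 0.52, 5: 0.47}
davies_bouldin_scores = {2: 1.10, 3: 0.72, 4: 0.85, 5: 0.91}

best_by_silhouette = pick_best_k(silhouette_scores, "silhouette")
best_by_davies_bouldin = pick_best_k(davies_bouldin_scores, "davies_bouldin")
print(best_by_silhouette, best_by_davies_bouldin)
```

Taking a plain maximum across metrics would pick the worst k for 'davies_bouldin', which is why the direction table is kept explicit.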