Cluster Data
- class cluster_data.AgglomerativeClusteringModel(n_clusters: int = 2, affinity: str = 'euclidean', linkage: str = 'ward', distance_threshold: float | None = None)[source]
Agglomerative Clustering algorithm.
- class cluster_data.BaseClustering[source]
Base class for clustering algorithms.
- evaluate(X: DataFrame, metrics: List[str] | None = None) → Dict[str, float][source]
Evaluate the clustering result using the specified metrics.
- Parameters:
X (pd.DataFrame) – The input data.
metrics (List[str], optional) – The evaluation metrics to use. Defaults to ['silhouette', 'davies_bouldin', 'calinski_harabasz'].
- Returns:
A dictionary of evaluation scores.
- Return type:
Dict[str, float]
- fit_predict(X: DataFrame, y: Series | None = None) → ndarray | None[source]
Perform clustering on X and return cluster labels.
- Parameters:
X (array-like of shape (n_samples, n_features)) – Input data.
y (Ignored) – Not used, present for API consistency by convention.
**kwargs (dict) – Arguments to be passed to fit.
Added in version 1.4.
- Returns:
labels – Cluster labels.
- Return type:
ndarray of shape (n_samples,), dtype=np.int64
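The fit_predict contract above follows the scikit-learn convention of one integer label per input row. The sketch below illustrates that contract with scikit-learn's own KMeans, not the cluster_data wrappers, whose internals are not shown in this reference. The synthetic data and column names are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Two well-separated synthetic blobs (illustrative data, not from this reference).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    np.vstack([
        rng.normal(0.0, 0.1, size=(20, 2)),
        rng.normal(5.0, 0.1, size=(20, 2)),
    ]),
    columns=["f0", "f1"],
)

# fit_predict fits the model and returns one integer label per row of X.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(labels.shape)  # (40,)
```

The label values themselves are arbitrary cluster indices. Only the grouping they induce is meaningful.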
- class cluster_data.DBSCANClustering(eps: float = 0.5, min_samples: int = 5, metric: str = 'euclidean', algorithm: str = 'auto', leaf_size: int = 30, p: float | None = None, n_jobs: int | None = None)[source]
DBSCAN clustering algorithm.
- class cluster_data.GaussianMixtureClustering(n_components: int = 1, covariance_type: str = 'full', tol: float = 0.001, reg_covar: float = 1e-06, max_iter: int = 100, n_init: int = 1, init_params: str = 'kmeans', random_state: int | None = None, warm_start: bool = False, verbose: int = 0)[source]
Gaussian Mixture Model clustering algorithm.
- class cluster_data.KMeansClustering(n_clusters: int = 8, init: str = 'k-means++', n_init: int = 10, max_iter: int = 300, tol: float = 0.0001, random_state: int | None = None, algorithm: str = 'auto')[source]
K-Means clustering algorithm.
- cluster_data.evaluate_clustering(X: DataFrame, labels: ndarray, metrics: List[str] | None = None) → Dict[str, float][source]
Evaluate clustering performance using specified metrics.
- Parameters:
X (pd.DataFrame) – The input data.
labels (np.ndarray) – Cluster labels.
metrics (List[str], optional) – List of evaluation metrics. Defaults to ['silhouette', 'davies_bouldin', 'calinski_harabasz'].
- Returns:
Dictionary of evaluation metric scores.
- Return type:
Dict[str, float]
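The three default metric names correspond to scorers in scikit-learn's sklearn.metrics module. As a minimal sketch of how such an evaluation could work, assuming the function simply dispatches each name to the matching scikit-learn scorer, the helper below reproduces the documented signature. The name evaluate_clustering_sketch, the _SCORERS table, and the error handling are hypothetical and are not taken from the cluster_data source.

```python
from typing import Dict, List, Optional

import numpy as np
import pandas as pd
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical mapping from the documented metric names to scikit-learn scorers.
_SCORERS = {
    "silhouette": silhouette_score,
    "davies_bouldin": davies_bouldin_score,
    "calinski_harabasz": calinski_harabasz_score,
}


def evaluate_clustering_sketch(
    X: pd.DataFrame,
    labels: np.ndarray,
    metrics: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Score a labelling of X with each requested metric (illustrative only)."""
    if metrics is None:
        metrics = list(_SCORERS)
    unknown = [name for name in metrics if name not in _SCORERS]
    if unknown:
        raise ValueError(f"Unknown metrics: {unknown}")
    return {name: float(_SCORERS[name](X, labels)) for name in metrics}


# Two compact, well-separated groups of four points each.
X = pd.DataFrame(
    [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]],
    columns=["f0", "f1"],
    dtype=float,
)
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])

scores = evaluate_clustering_sketch(X, labels)
print(scores)
```

When comparing results, note that higher is better for the silhouette and Calinski-Harabasz scores, while lower is better for Davies-Bouldin.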
- cluster_data.find_optimal_k(X: DataFrame, max_k: int = 10, method: str = 'silhouette') → Dict[int, float][source]
Find the optimal number of clusters for KMeans clustering.
- Parameters:
X (pd.DataFrame) – The input data.
max_k (int) – Maximum number of clusters to try.
method (str) – Evaluation metric to use ('silhouette', 'davies_bouldin', 'calinski_harabasz').
- Returns:
Dictionary mapping number of clusters to evaluation score.
- Return type:
Dict[int, float]
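The returned dictionary holds one score per candidate k, so the caller still has to pick the best entry. The sketch below shows one way such a search could be written with scikit-learn. It assumes the search starts at k = 2 (these metrics are undefined for a single cluster) and that each k is fitted with KMeans. The name find_optimal_k_sketch, the random_state argument, and the synthetic data are hypothetical, not part of cluster_data.

```python
from typing import Dict

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Hypothetical mapping from the documented method names to scikit-learn scorers.
_SCORERS = {
    "silhouette": silhouette_score,
    "davies_bouldin": davies_bouldin_score,
    "calinski_harabasz": calinski_harabasz_score,
}


def find_optimal_k_sketch(
    X: pd.DataFrame,
    max_k: int = 10,
    method: str = "silhouette",
    random_state: int = 0,
) -> Dict[int, float]:
    """Map each candidate k to its score (illustrative only)."""
    scorer = _SCORERS[method]
    # The metrics require 2 <= k <= n_samples - 1.
    upper = min(max_k, len(X) - 1)
    scores: Dict[int, float] = {}
    for k in range(2, upper + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        scores[k] = float(scorer(X, model.fit_predict(X)))
    return scores


# Three well-separated synthetic blobs of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = pd.DataFrame(
    np.vstack([c + rng.normal(0.0, 0.3, size=(20, 2)) for c in centers]),
    columns=["f0", "f1"],
)

scores = find_optimal_k_sketch(X, max_k=6, method="silhouette")
best_k = max(scores, key=scores.get)  # use min(...) for 'davies_bouldin'
print(best_k)
```

Because the direction of "better" depends on the metric, selecting the best k with max is only appropriate for 'silhouette' and 'calinski_harabasz'.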