Perform clustering on the given dataset using the specified method.
Available clustering methods are:
- **'kmeans'**: K-Means clustering algorithm, which partitions the data into `k` clusters.
- **'ap'**: Affinity Propagation clustering, a graph-based algorithm that identifies clusters without predefining the number of clusters.
- **'hdbscan'**: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a method that identifies clusters of varying densities and can detect noise points.
Parameters
----------
X : DataFrame or array-like, shape (n_samples, n_features)
The input data for clustering. It can be a pandas DataFrame or a numpy array where
each row represents a sample, and each column represents a feature.
method : str, default='kmeans'
The clustering method to use. The following clustering methods are available:
- 'kmeans': K-Means clustering. A method for partitioning data into `k` clusters.
- 'ap': Affinity Propagation clustering. A graph-based clustering method.
- 'hdbscan': HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). A method that works well for data with varying densities.
kwargs : dict, optional
Additional keyword arguments that can be passed to the specific clustering method.
The following parameters are accepted for each method:
- For 'kmeans':
  - 'max_k': int, the maximum number of clusters to consider when finding the optimal number of clusters using the Silhouette Score (default is 10).
- For 'hdbscan':
  - 'min_samples': int, the number of samples in a neighborhood for a point to be considered a core point (default is 8).
  - 'min_cluster_size': int, the minimum size of clusters (default is 10).
  - 'metric': str, the distance metric to use (default is 'euclidean').
Returns
-------
tuple
A tuple containing:
- A pandas DataFrame with the cluster assignments for each sample. The index holds the sample names (from X),
and the column "names" holds the cluster labels.
- An integer giving the number of clusters found.
Raises
------
ValueError
If `method` is not one of 'kmeans', 'ap', or 'hdbscan'.
Notes
-----
- For 'kmeans', the number of clusters is chosen using the Silhouette Score, considering up to `max_k` clusters.
- Affinity Propagation uses a preference value (default -50) to determine cluster centers.
- HDBSCAN labels points it considers noise or outliers as -1; these are reported in a 'Non clustered' category.
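Examples
--------
The sketch below illustrates how a Silhouette-based choice of `k` for 'kmeans' might work. It is not this function's actual implementation: the helper name `kmeans_by_silhouette`, the `random_state` argument, and the synthetic data are assumptions made for illustration, and only the `max_k` default and the "names" column follow the description above. It relies on scikit-learn's `KMeans` and `silhouette_score`.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def kmeans_by_silhouette(X, max_k=10, random_state=0):
    """Hypothetical helper: pick k in [2, max_k] with the highest Silhouette Score."""
    X = pd.DataFrame(X)
    # The Silhouette Score is defined only for 2 <= k <= n_samples - 1.
    upper = min(max_k, len(X) - 1)
    if upper < 2:
        raise ValueError("At least 3 samples are needed to compare cluster counts.")

    best_k, best_score, best_labels = None, float("-inf"), None
    for k in range(2, upper + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X.values)
        score = silhouette_score(X.values, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels

    # Mirror the documented return shape: (DataFrame with a "names" column, cluster count).
    assignments = pd.DataFrame({"names": best_labels}, index=X.index)
    return assignments, best_k


# Synthetic, well-separated data (an assumption for this example only).
points, _ = make_blobs(
    n_samples=150,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=0.5,
    random_state=0,
)
X = pd.DataFrame(points, index=[f"sample_{i}" for i in range(len(points))])
assignments, n_clusters = kmeans_by_silhouette(X, max_k=6)
print(n_clusters)
print(assignments.head())
```

Fitting one model per candidate `k` keeps the sketch simple; for large datasets the repeated fits and the pairwise-distance cost of the Silhouette Score can dominate the runtime, so a smaller `max_k` may be preferable.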