documentation

Commit d836044e authored 3 months ago by DIANE
--- a/docs/Clustering.md
+++ b/docs/Clustering.md
 # Clustering Methods
+::: src.utils.clustering.clustering
-## K-Means clustering
-::: src.utils.clustering.Sk_Kmeans
\ No newline at end of file
-## HDBSCAN clustering
-::: src.utils.clustering.Hdbscan
--- a/docs/index.md
+++ b/docs/index.md
@@ -2,13 +2,23 @@
 This workflow aims at ...
-## Samples Selection
+## I - Samples Selection
-## Dimension Reduction
+### Dimension Reduction
-## Clustering
+### Clustering analysis
 [K-Means](Clustering.md#k-means-clustering)  
 [HDBSCAN](Clustering.md#hdbscan-clustering)
-## Models Creation
+### Representative subset selection
-[lwPlsR from Jchemo (Julia)](model_creation.md)
\ No newline at end of file
+## II - Models Creation
+### Data split into train/test subsets
+### Predictive model creation
+[lwPlsR from Jchemo (Julia)](model_creation.md)
+### Predictive model evaluation
+## III - Prediction making
+## IV - Reporting
\ No newline at end of file
--- a/docs/prediction.md
+++ b/docs/prediction.md
+# Prediction making
+## This section uses the result files obtained from the model creation step to make predictions on a new set of data
\ No newline at end of file
--- a/src/utils/clustering.py
+++ b/src/utils/clustering.py
@@ -12,28 +12,65 @@ import pandas as pd
 def clustering(X, method='kmeans', **kwargs):
    """
-    Perform clustering on the given dataset using the specified method.
+    Perform clustering on the given dataset using the specified method. 
+    Available clustering methods are:
+    - **'kmeans'**: K-Means clustering algorithm, which partitions the data into `k` clusters.
+    - **'ap'**: Affinity Propagation clustering, a graph-based algorithm that identifies clusters without requiring the number of clusters to be specified in advance.
+    - **'hdbscan'**: HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise), a method that identifies clusters of varying densities and can detect noise points.
    Parameters
    ----------
    X : DataFrame or array-like, shape (n_samples, n_features)
-        The input data for clustering.
+        The input data for clustering. It can be a pandas DataFrame or a numpy array where 
+        each row represents a sample, and each column represents a feature.
-    method : str, optional, default='kmeans'
+    method : str, default='kmeans'
-        The clustering method to use. Options are:
+        The clustering method to use. The following clustering methods are available:
-        - 'kmeans': K-Means clustering.
+        - 'kmeans': K-Means clustering. A method for partitioning data into `k` clusters.
-        - 'ap': Affinity Propagation clustering.
+        - 'ap': Affinity Propagation clustering. A graph-based clustering method.
-        - 'hdbscan': HDBSCAN clustering.
+        - 'hdbscan': HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise). A method that works well for data with varying densities.
-    kwargs : dict
+    kwargs : dict, optional
-        Additional arguments specific to the clustering method.
+        Additional keyword arguments that can be passed to the specific clustering method. 
+        The following parameters are accepted for each method:
+        - For 'kmeans': 
+            - 'max_k': int, the maximum number of clusters to consider when finding the optimal number of clusters using the Silhouette Score (default is 10).
+        - For 'hdbscan': 
+            - 'min_samples': int, the number of samples in a neighborhood for a point to be considered a core point (default is 8).
+            - 'min_cluster_size': int, the minimum size of clusters (default is 10).
+            - 'metric': str, the distance metric to use (default is 'euclidean').
    Returns
    -------
-    DataFrame
+    tuple
-        A DataFrame containing the cluster assignments for each sample. The index corresponds
+        A tuple containing:
-        to the sample names (from X), and a column "names" lists the cluster labels.
+        - A pandas DataFrame with the cluster assignments for each sample. The index corresponds to the sample names (from X), 
+          and a column "names" lists the cluster labels.
+        - An integer representing the number of clusters found.
+    Raises
+    ------
+    ValueError
+        If an unknown clustering method is specified.
+    Notes
+    -----
+    - The KMeans method uses Silhouette Score to determine the optimal number of clusters.
+    - Affinity Propagation uses a preference value (default -50) to determine cluster centers.
+    - HDBSCAN can assign some points to a 'Non clustered' category if they are considered noise or outliers (denoted by -1).
+    Examples
+    --------
+    # Example using K-Means clustering:
+    result, num_clusters = clustering(X, method='kmeans', max_k=12)
+    # Example using Affinity Propagation clustering:
+    result, num_clusters = clustering(X, method='ap')
+    # Example using HDBSCAN clustering:
+    result, num_clusters = clustering(X, method='hdbscan', min_samples=10, min_cluster_size=15)
    """
    if method.lower() == 'kmeans':
        max_k = kwargs.get('max_k', 10)
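
Note on the `max_k` option: the docstring states that the K-Means branch picks the number of clusters with the Silhouette Score, but the diff does not show how. The sketch below is one plausible reading, not the repository's implementation. It assumes scikit-learn is available, and the helper name `best_k_by_silhouette` and the synthetic data are hypothetical, introduced only for illustration.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def best_k_by_silhouette(X, max_k=10, random_state=0):
    """Hypothetical helper: try k = 2..max_k and keep the k with the
    highest mean silhouette score. Returns (DataFrame, n_clusters),
    mirroring the tuple described in the docstring."""
    best_k, best_score, best_labels = None, -1.0, None
    for k in range(2, max_k + 1):
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score, best_labels = k, score, labels
    # One row per sample, with the cluster label in a "names" column.
    index = X.index if isinstance(X, pd.DataFrame) else None
    result = pd.DataFrame({"names": best_labels}, index=index)
    return result, best_k


# Synthetic data with three well-separated groups (illustration only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-10, -10], [0, 0], [10, 10]],
    cluster_std=0.5,
    random_state=0,
)
result, num_clusters = best_k_by_silhouette(X, max_k=6)
```

The silhouette score is undefined for a single cluster, which is why the search starts at `k = 2`; `max_k` must also stay below the number of samples.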