Commit 53510d15
authored 3 months ago by DIANE
reformat file
parent 5ad7e157
Showing 1 changed file with 62 additions and 56 deletions
src/utils/dim_reduction.py +62 −56
...
...
@@ -8,13 +8,13 @@ import numpy as np
 from pandas import DataFrame
 from sklearn.preprocessing import LabelEncoder
 from umap import UMAP
-# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pca ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
+# ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pca ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
 class LinearPCA:
     """
     A class for performing Principal Component Analysis (PCA) on a given data matrix X.
     This class applies PCA for dimensionality reduction, providing the projections of the data onto the principal components
     (scores), the contribution of each feature to the principal components (loadings),
     and the residuals (reconstruction errors) after dimensionality reduction.
...
...
@@ -38,88 +38,89 @@ class LinearPCA:
     def __init__(self, X, Ncomp=10):
         """
         Initialize the LinearPCA class with the data matrix X and the number of principal components Ncomp.
         Parameters:
             X (pandas.DataFrame): The input data matrix where rows represent samples and columns represent features.
             Ncomp (int): The number of principal components to compute (default is 10).
         """
         # Store input data matrix
         self.__x = X
         # Set the number of principal components
         self.__ncp = Ncomp
         # Initialize and fit the PCA model
         from sklearn.decomposition import PCA
         self.model = PCA(n_components=self.__ncp)
         self.model.fit(X)
     @property
     def eig_val(self):
         """
         Returns the eigenvalues and the diagonal matrix (Lambda) of eigenvalues.
         Eigenvalues are the square of the singular values obtained from the PCA model.
         Returns:
             tuple: A tuple containing eigenvalues (eigvals) and the Lambda matrix (diagonal matrix of eigenvalues).
         """
-        eigvals = self.model.singular_values_**2/self.__x.shape[0]
-        labels = [f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)]
-        Lambda = DataFrame(np.diag(eigvals), index=labels, columns=labels)
+        eigvals = self.model.singular_values_**2/self.__x.shape[0]
+        labels = [f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)]
+        Lambda = DataFrame(np.diag(eigvals), index=labels, columns=labels)
         return eigvals, Lambda
     @property
     def qexp_ratio(self):
         """
         Returns the explained variance ratio for each principal component.
         This shows the percentage of variance explained by each principal component.
         Returns:
             pandas.DataFrame: DataFrame containing the explained variance ratio for each principal component.
         """
         Qexp_ratio = pd.DataFrame(100 * self.model.explained_variance_ratio_,
-                                  columns=["Qexp"], index=[f'PC{i+1}' for i in range(self.__ncp)])
+                                  columns=["Qexp"], index=[f'PC{i+1}' for i in range(self.__ncp)])
         return Qexp_ratio
     @property
     def scores_(self):
         """
         Returns the scores matrix, which is the projection of the original data onto the principal components.
         The scores are the transformed data in the lower-dimensional space after applying PCA.
         Returns:
             pandas.DataFrame: The scores matrix (transformed data).
         """
-        scores = pd.DataFrame(self.model.transform(self.__x), index=self.__x.index, columns=[f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)])
+        scores = pd.DataFrame(self.model.transform(self.__x), index=self.__x.index, columns=[f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)])
         return scores
     @property
     def loadings_(self):
         """
         Returns the loadings matrix, which contains the contribution of each feature to the principal components.
         The loadings describe how much each original feature contributes to each principal component.
         Returns:
             pandas.DataFrame: The loadings matrix.
         """
         p = pd.DataFrame(self.model.components_, columns=self.__x.columns,
-                         index=[f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)])
+                         index=[f'PC{i+1}({100 * self.model.explained_variance_ratio_[i].round(2)}%)' for i in range(self.__ncp)])
         return p
     def residuals_(self, components):
         """
         Returns the residuals (reconstruction errors) between the original data and its reconstruction
         using a subset of principal components.
         Parameters:
             components (list): A list of principal component names (e.g., 'PC1', 'PC2') used to reconstruct the data.
         Returns:
             pandas.DataFrame: The residuals matrix, showing the difference between the original data and its reconstruction.
         """
...
...
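A note on the `eig_val` hunk above: the following is a minimal NumPy-only sketch of how the eigenvalues of the covariance matrix relate to the singular values of the centered data. It is not the code from this commit. The helper names `pca_eigvals` and `pc_labels` are hypothetical. The hunk divides the squared singular values by the number of samples, whereas scikit-learn's own `explained_variance_` divides by the number of samples minus one; the `ddof` argument below is an assumption added to show both conventions. The label helper formats the percentage inside the f-string, which avoids float artifacts that can appear when an already rounded ratio is multiplied by 100.

```python
import numpy as np


def pca_eigvals(X, n_components, ddof=0):
    """Eigenvalues of the covariance of X from the SVD of the centered data.

    ddof=0 divides by n_samples (as the eig_val hunk does);
    ddof=1 divides by n_samples - 1 (the unbiased estimate).
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # PCA works on centered data
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    eigvals = s[:n_components] ** 2 / (X.shape[0] - ddof)
    ratio = s[:n_components] ** 2 / np.sum(s ** 2)   # explained variance ratio
    return eigvals, ratio


def pc_labels(ratio):
    """Build labels such as 'PC1 (93.21%)' with two-decimal formatting."""
    return [f"PC{i + 1} ({100 * r:.2f}%)" for i, r in enumerate(ratio)]


rng = np.random.default_rng(0)
# Synthetic data with unequal column variances (illustrative only)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])
eigvals, ratio = pca_eigvals(X, n_components=3)
Lambda = np.diag(eigvals)          # diagonal matrix of eigenvalues
labels = pc_labels(ratio)
```

With `ddof=0` the values match the leading eigenvalues of `np.cov(X, rowvar=False, ddof=0)`; with `ddof=1` they would match what scikit-learn reports as `explained_variance_`.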
@@ -133,29 +134,32 @@ class LinearPCA:
         # Reconstruct the data using the selected components
         for i in range(self.__ncp):
             # Reconstruct the data using the first i+1 principal components
-            xp = np.dot(self.model.transform(self.__x)[:, axis], self.model.components_[axis, :])
+            xp = np.dot(self.model.transform(self.__x)[
+                        :, axis], self.model.components_[axis, :])
             # Calculate residuals (difference between original and reconstructed data)
             qres = self.__x - xp
         return qres
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ umap ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
 class Umap:
     """
     The UMAP (Uniform Manifold Approximation and Projection) algorithm for dimensionality reduction.
     This class implements the UMAP algorithm to reduce the dimensionality of numerical data, with an option to include
     categorical data (if provided) which will be encoded using `LabelEncoder`. The inclusion of categorical data helps in
     improving clustering and visualization, especially when working with mixed data types.
     Attributes:
     -----------
     numerical_data : pandas DataFrame
         The numerical features (data) on which the dimensionality reduction will be performed.
     categorical_data : list or None, optional
         A list of categorical values that can be included for improved structure of the UMAP embedding. Default is None.
     categorical_data_encoded : list or None
         The encoded version of `categorical_data`, processed using `LabelEncoder` for model fitting. This is used only if
         `categorical_data` is provided.
...
...
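A note on the `residuals_` hunk above: the sketch below shows one way to compute PCA reconstruction residuals from a chosen subset of components, using NumPy only. It is an illustration under stated assumptions, not the method from this commit; the function name `pca_residuals` and the integer index list `component_idx` are hypothetical. Because PCA scores are computed on mean-centered data, the reconstruction here adds the column means back before subtracting from the original data. Whether the elided part of `residuals_` does the same is not visible in the hunk.

```python
import numpy as np


def pca_residuals(X, component_idx):
    """Residuals of X after reconstruction from the selected components."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    Xc = X - mean                     # center, as PCA does before projecting
    # Rows of Vt are the principal axes (analogous to PCA.components_)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[component_idx, :]          # selected loadings, shape (k, n_features)
    T = Xc @ P.T                      # scores on the selected components
    X_hat = T @ P + mean              # reconstruction in the original space
    return X - X_hat


rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))          # synthetic data, illustrative only
res_two = pca_residuals(X, [0, 1])                # keep the first two components
res_three = pca_residuals(X, [0, 1, 2])           # keep the first three
res_all = pca_residuals(X, [0, 1, 2, 3, 4])       # keep every component
```

Using all components reproduces the data up to floating-point error, and adding components can only shrink the residual norm.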
@@ -174,16 +178,16 @@ class Umap:
     scores_ : property
         Returns the transformed data (embedding) in the lower-dimensional space.
     """
     def __init__(self, numerical_data, cat_data=None):
         """
         Initializes and fits the UMAP model using the provided numerical data and optional categorical data.
         Parameters:
         -----------
         numerical_data : pandas DataFrame
             The numerical data (features) to be used for dimensionality reduction.
         cat_data : list or None, optional
             A list of categorical values associated with the data. If provided, this will be encoded and used during fitting.
             Default is None.
...
...
@@ -191,10 +195,10 @@ class Umap:
         # Ensure that the numerical data is a pandas DataFrame
         if not isinstance(numerical_data, pd.DataFrame):
             raise TypeError("numerical_data must be a pandas DataFrame")
         # Store numerical data
         self.numerical_data = numerical_data
         # Process categorical data if provided
         if cat_data is None:
             self.categorical_data_encoded = cat_data
...
...
@@ -203,50 +207,52 @@ class Umap:
             # Use LabelEncoder to encode categorical data
             from sklearn.preprocessing import LabelEncoder
             self.le = LabelEncoder()
-            self.categorical_data_encoded = self.le.fit_transform(self.categorical_data)
+            self.categorical_data_encoded = self.le.fit_transform(self.categorical_data)
         else:
             self.categorical_data_encoded = None
         # Initialize the UMAP model with hyperparameters
         from umap import UMAP
         self.model = UMAP(n_neighbors=20, n_components=3, min_dist=0.0)
         # Fit the model using the numerical data, with optional categorical data encoding
         self.model.fit(self.numerical_data, y=self.categorical_data_encoded)
     @property
     def scores_(self):
         """
         Returns the lower-dimensional representation (embedding) of the numerical data in the transformed space.
         The transformed data is represented in the lower-dimensional UMAP embedding. The data is presented as a
         pandas DataFrame with columns labeled 'UMAP_1', 'UMAP_2', ..., for each component in the embedding.
         Returns:
         --------
         pandas DataFrame
             The transformed data (embedding) in the lower-dimensional space, with the original rows as the index.
         """
         # Apply the UMAP transformation and store the transformed data in a DataFrame
-        scores = pd.DataFrame(self.model.transform(self.numerical_data), index=self.numerical_data.index,
+        scores = pd.DataFrame(self.model.transform(self.numerical_data), index=self.numerical_data.index,
                               columns=[f'UMAP_{i+1}' for i in range(self.model.n_components)])
         return scores
 # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ nmf ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~#
 class Nmf:
     """
     A class that performs Non-negative Matrix Factorization (NMF) on an input matrix (pandas DataFrame) to extract
     latent components and their associated scores. The NMF model is fitted using scikit-learn's NMF implementation,
     and the results include the transformed data (scores) and the components (loadings).
     Parameters:
     -----------
     X : pandas DataFrame
         The input matrix (data) to be decomposed. All values in the matrix should be non-negative. Rows represent
         samples, and columns represent features.
     Ncomp : int, optional, default=3
         The number of components to compute. This is the rank of the factorization.
...
...
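A note on the categorical encoding in the `Umap.__init__` hunk above: scikit-learn's `LabelEncoder.fit_transform` maps each distinct label to an integer code, with the classes taken in sorted order. The pure-Python sketch below mimics that mapping so the behaviour is easy to inspect. The function name `encode_labels` and the sample labels are hypothetical and only for illustration.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, classes in sorted order."""
    classes = sorted(set(values))
    index = {label: code for code, label in enumerate(classes)}
    return [index[v] for v in values], classes


# Hypothetical categorical values, one per sample
codes, classes = encode_labels(["oak", "pine", "oak", "beech"])
```

The resulting integer codes are what the hunk passes as `y` to `UMAP.fit`, which umap-learn treats as target labels for supervised dimension reduction.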
@@ -255,11 +261,11 @@ class Nmf:
     scores_ : pandas DataFrame
         A DataFrame containing the transformed data (scores) for each of the components. Rows correspond to the
         samples, and columns represent the components (e.g., 'H1', 'H2', ...).
     loadings_ : pandas DataFrame
         A DataFrame containing the components (loadings). Rows represent the components (e.g., 'H1', 'H2', ...),
         and columns correspond to the original features in the input matrix.
     Methods:
     --------
     scores_ : pandas DataFrame
...
...
@@ -279,13 +285,13 @@ class Nmf:
     def __init__(self, X: pd.DataFrame, Ncomp: int = 3):
         """
         Initializes the NMF model and fits it to the input pandas DataFrame.
         Parameters:
         -----------
         X : pandas DataFrame
             The input matrix (data) to be decomposed. All values in the matrix should be non-negative.
             Rows represent samples, and columns represent features.
         Ncomp : int, optional, default=3
             The number of components to compute. This is the rank of the factorization.
         """
...
...
@@ -309,9 +315,9 @@ class Nmf:
     def scores_(self) -> pd.DataFrame:
         """
         Returns the transformed data (scores) from the NMF model as a pandas DataFrame.
         The rows correspond to the samples, and the columns correspond to the latent components.
         Returns:
         --------
         pandas DataFrame
...
...
@@ -326,9 +332,9 @@ class Nmf:
     def loadings_(self) -> pd.DataFrame:
         """
         Returns the loadings (components) from the NMF model as a pandas DataFrame.
         The rows correspond to the components, and the columns correspond to the original features.
         Returns:
         --------
         pandas DataFrame
...
...
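A note on the `Nmf` docstrings above: the class delegates the factorization to scikit-learn, whose default solver is coordinate descent. To make the roles of scores and loadings concrete, the sketch below swaps in a simpler algorithm, the classic multiplicative-update rule for the Frobenius objective, written with NumPy only. It is an illustrative stand-in, not the code from this commit; `nmf_multiplicative`, its parameters, and the synthetic matrix are all assumptions.

```python
import numpy as np


def nmf_multiplicative(X, n_components=3, n_iter=200, eps=1e-10, seed=0):
    """Factor a non-negative X as W @ H using multiplicative updates.

    W (n_samples, n_components) plays the role of the scores;
    H (n_components, n_features) plays the role of the loadings.
    """
    X = np.asarray(X, dtype=float)
    if (X < 0).any():
        raise ValueError("X must be non-negative")
    rng = np.random.default_rng(seed)
    W = rng.random((X.shape[0], n_components)) + eps
    H = rng.random((n_components, X.shape[1])) + eps
    for _ in range(n_iter):
        # Each update rescales entries by a non-negative ratio,
        # so W and H stay non-negative throughout
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


rng = np.random.default_rng(2)
X = rng.random((20, 3)) @ rng.random((3, 8))   # non-negative, rank 3 by construction
W, H = nmf_multiplicative(X, n_components=3)
component_names = [f"H{i + 1}" for i in range(H.shape[0])]   # 'H1', 'H2', 'H3'
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

Wrapping `W` and `H` in DataFrames indexed by sample and labeled with `component_names` would give objects shaped like the `scores_` and `loadings_` described in the docstrings.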