Searching Parsimonious Models in Small and High-Dimensional Datasets with HYBparsimony Python Package

The HYBparsimony package can assist in finding models that are less complex while maintaining comparable levels of accuracy

Principle of Parsimony in Machine Learning

In many real-world machine learning scenarios, small but high-dimensional (SHHD) datasets are frequently found.

It’s important to clarify that by SHHD we refer to tabular datasets that typically contain hundreds or a few thousand instances and tens to hundreds of features. They should also be distinguished from other high-dimensional datasets in which the number of features greatly exceeds the number of observations (as in genetics, for example).

AutoML frameworks like Autogluon, MLJar, or H2O AutoML address modeling challenges with SHHD by creating ensemble models that combine artificial neural networks with tree-based methods such as RandomForest, LightGBM, XGBoost, and CatBoost. These frameworks employ robust training and validation methods to automatically generate highly accurate models without the need for prior feature selection.

However, these complex models may contain biases that are difficult to detect. As a result, there is a growing demand for explainable models that facilitate bias detection in decision-making processes. In many cases, linear models or decision trees are more valuable for decision-making.

In addition, many companies aim to mitigate the risk of overfitting by choosing the simplest (most parsimonious) model with similar accuracy among the available options.

It is well known that a less complex model will yield more stable predictions, exhibit greater resistance to noise and disturbances, and be easier to maintain and analyze. In addition, decreasing the number of features can lead to further savings by reducing the use of sensors, energy consumption, information acquisition costs, maintenance requirements and the need to retrain models due to feature fluctuations caused by noise, outliers, data drift, etc.

Even a black-box model or an ensemble built with a restricted set of the original features can be readily analyzed with techniques such as ELI5 and SHAP. For example, the following code shows how the HYBparsimony package reduces the input features of an AutoGluon model for the COIL2000 dataset (from openml.org) from 85 to 44 (51.7% of the original features). HYBparsimony searches for low-complexity models with similar accuracy: the difference in log_loss on a new test dataset between the model trained with 85 features and the one trained with 44 features was only 0.000312.

...
import numpy as np
from hybparsimony import HYBparsimony
...
# See complete code in:
# https://github.com/jodivaso/hybparsimony/blob/master/examples/Autogluon_with_SHDD.ipynb
HYBparsimony_model = HYBparsimony(fitness=fitness_custom,
                                  features=input_names,
                                  rerank_error=0.001,
                                  gamma_crossover=0.50,
                                  seed_ini=0,
                                  npart=15,
                                  maxiter=100,
                                  early_stop=20,
                                  verbose=1,
                                  n_jobs=1)
HYBparsimony_model.fit(train_data[input_names], train_data[label].values)

# The last len(input_names) values of the best configuration are the
# feature-selection probabilities; features with probability >= 0.50 are kept.
best_model_probsfeats = HYBparsimony_model.best_model_conf[-len(input_names):]
selec_feats = np.array(input_names)[best_model_probsfeats >= 0.50]
print(f'Selected feats with HYB-PARSIMONY num={len(selec_feats)}:{selec_feats}')
print('######################################################')

...
Running iteration 0
Best model -> Score = -0.211364 Complexity = 76.0
Iter = 0 -> MeanVal = -0.21634 ValBest = -0.211364 ComplexBest = 76.0 Time(min) = 7.386587

Running iteration 1
Best model -> Score = -0.211364 Complexity = 76.0
Iter = 1 -> MeanVal = -0.218023 ValBest = -0.212938 ComplexBest = 73.0 Time(min) = 7.295187

...

Running iteration 39
Best model -> Score = -0.20586 Complexity = 44.0
Iter = 39 -> MeanVal = -0.215344 ValBest = -0.210025 ComplexBest = 43.0 Time(min) = 5.697854

Running iteration 40
Best model -> Score = -0.20586 Complexity = 44.0
Iter = 40 -> MeanVal = -0.213437 ValBest = -0.208561 ComplexBest = 41.0 Time(min) = 5.605442

Early stopping reached. Stopped.
Selected feats with HYB-PARSIMONY num=44:['MAANTHUI' 'MGEMOMV' 'MGEMLEEF' 'MGODPR' 'MGODOV' 'MGODGE' 'MRELSA'
'MFALLEEN' 'MFWEKIND' 'MBERZELF' 'MBERBOER' 'MBERARBO' 'MSKB1' 'MSKB2'
'MSKC' 'MZFONDS' 'MZPART' 'MINKM30' 'MINKGEM' 'PWAPART' 'PWALAND'
'PPERSAUT' 'PBESAUT' 'PVRAAUT' 'PAANHANG' 'PTRACTOR' 'PGEZONG' 'PBRAND'
'PZEILPL' 'PINBOED' 'PBYSTAND' 'AWAPART' 'AWABEDR' 'AWALAND' 'APERSAUT'
'AMOTSCO' 'ATRACTOR' 'ABROM' 'APERSONG' 'ABRAND' 'AZEILPL' 'APLEZIER'
'AINBOED' 'ABYSTAND']
######################################################
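Once the feature subset has been reduced, the resulting model is much easier to inspect. As a minimal sketch (not part of the linked notebook), a tree-based model could be retrained on selec_feats and analyzed with SHAP; here a LightGBM classifier is assumed as a stand-in for the AutoGluon ensemble, which would otherwise require a prediction wrapper.

# Hypothetical sketch: retrain a simpler tree-based model on the 44 selected
# features and inspect it with SHAP (lightgbm and shap are assumed installed).
import shap
from lightgbm import LGBMClassifier

reduced_model = LGBMClassifier(random_state=0)
reduced_model.fit(train_data[selec_feats], train_data[label])

explainer = shap.TreeExplainer(reduced_model)
shap_values = explainer.shap_values(train_data[selec_feats])

# Global feature-importance summary over the selected features
shap.summary_plot(shap_values, train_data[selec_feats])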

In conclusion, when working with high-dimensional datasets, it is often more productive to focus on finding a model with a manageable number of features that provides acceptable performance, rather than striving for maximum accuracy. This approach is valid as long as the performance difference between models can be tolerated.

HYBparsimony is a Python package that simultaneously and automatically performs feature selection (FS), model hyperparameter optimization (HO), and parsimonious model selection (PMS).

HYB-PARSIMONY methodology

The simultaneous optimization of hyperparameters (HO) and feature selection (FS) to achieve Parsimonious Model Selection (PMS) remains an active research area. However, effectively choosing the right hyperparameters and feature subset poses a challenging combinatorial problem, often necessitating the use of efficient heuristic methods.

In 2022, we introduced HYB-PARSIMONY as a continuation of two previous methods: GA-PARSIMONY and PSO-PARSIMONY. The new method is a hybrid approach that combines GA operations (selection, crossover and mutation) with PSO optimization for automatic HO, FS and PMS.

HYB-PARSIMONY dynamically determines the percentage of particles replaced by GA crossover in each iteration, initially favoring GA-based mechanisms with a higher replacement percentage (see figure). As the optimization progresses, this percentage exponentially decreases until it stabilizes at 10%. Consequently, HYB-PARSIMONY starts by emphasizing the facilitation of parsimonious model discovery using GA and gradually shifts its focus toward optimizing accuracy with PSO.

Example of thirteen curves created with different Gamma values.
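As an illustration only, the sketch below reproduces curves of that shape. The initial replacement fraction, the exact decay expression, and the interpretation of gamma here are assumptions for plotting purposes; the package's internal formula may differ.

# Illustrative sketch (not the package's internal code): an exponentially
# decaying fraction of particles replaced by GA crossover, floored at 10%.
import numpy as np
import matplotlib.pyplot as plt

iterations = np.arange(0, 100)
for gamma in np.linspace(0.05, 0.65, 13):  # thirteen assumed gamma values
    pct_crossover = np.maximum(0.9 * np.exp(-gamma * iterations), 0.10)  # assumed 0.9 start
    plt.plot(iterations, pct_crossover, label=f'gamma={gamma:.2f}')
plt.xlabel('Iteration')
plt.ylabel('Fraction replaced by GA crossover')
plt.legend(fontsize=6)
plt.show()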

Experimental results (see last section) demonstrate that the HYB-PARSIMONY methodology consistently produces superior, more parsimonious, and robust models compared to alternatives. Additionally, it effectively reduces the number of iterations, minimizing computational efforts.

How to use HYBparsimony Python package

The HYB-PARSIMONY methodology has been implemented in the HYBparsimony Python package under the MIT license. The repository with more examples and information can be found at: https://github.com/jodivaso/hybparsimony/.

Installation is straightforward:

pip install hybparsimony

The next example demonstrates how to use the HYBparsimony package to search for a parsimonious (low-complexity) KernelRidge model (with RBF kernel) on the diabetes dataset. HYBparsimony searches for the optimal input features and KernelRidge hyperparameters (specifically alpha and gamma). By default, models are evaluated with 5-fold cross-validation using the negative mean squared error (neg_MSE), which is maximized. In addition, the root mean squared error (RMSE) is computed on an independent test dataset to measure the model's generalization performance.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.datasets import load_diabetes
from sklearn.preprocessing import StandardScaler
from hybparsimony import HYBparsimony

# Load 'diabetes' dataset
diabetes = load_diabetes()

X, y = diabetes.data, diabetes.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=1234)

# Standardize X and y
scaler_X = StandardScaler()
X_train = scaler_X.fit_transform(X_train)
X_test = scaler_X.transform(X_test)
scaler_y = StandardScaler()
y_train = scaler_y.fit_transform(y_train.reshape(-1,1)).flatten()
y_test = scaler_y.transform(y_test.reshape(-1,1)).flatten()

algo = 'KernelRidge'
HYBparsimony_model = HYBparsimony(algorithm=algo,
                                  features=diabetes.feature_names,
                                  rerank_error=0.001,
                                  n_jobs=1,
                                  verbose=1)

# Search the best hyperparameters and features
# (increasing 'time_limit' to improve RMSE with high consuming algorithms)
HYBparsimony_model.fit(X_train, y_train, time_limit=0.20)

In this example, rerank_error represents the maximum allowable difference between the performance metrics (in this case, MSE) of two models to consider them as having similar accuracy. For this dataset, rerank_error is set to 0.001, but different values can be employed to strike a balance between model complexity and accuracy.
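Conceptually, rerank_error drives a small re-ranking step like the one sketched below (a simplified illustration, not the package's actual implementation): among the candidates whose scores differ from the best score by less than rerank_error, the least complex model is promoted.

# Simplified illustration of the parsimony re-ranking idea (not the package's
# actual code). Candidates are (score, complexity) pairs, where a higher score
# is better and a lower complexity is more parsimonious.
def parsimony_rerank(candidates, rerank_error=0.001):
    best_score = max(score for score, _ in candidates)
    similar = [c for c in candidates if best_score - c[0] <= rerank_error]
    return min(similar, key=lambda c: c[1])  # least complex among similar models

print(parsimony_rerank([(-0.4893, 8.0e9), (-0.4897, 7.0e9), (-0.4991, 6.0e9)]))
# -> (-0.4897, 7000000000.0): within 0.001 of the best score, but less complex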

The quest for the most parsimonious model, characterized by the smallest number of features, relies on the model complexity measure (Mc), which is defined as:

Mc = 10^9 * N_FS + internal_comp

where N_FS is the number of selected input features and internal_comp is the internal measure of model complexity, which depends on the algorithm used for training. In this example, internal_comp for KernelRidge is measured by the sum of the squared coefficients. Therefore, between two models with the same number of features, the one with the smaller sum of squared weights is the more parsimonious model (smaller weights reduce the propagation of perturbations and improve robustness).
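For KernelRidge, the measure can be reproduced approximately as in the sketch below. The hyperparameter values are rounded versions of those found later, the model is fitted on all 10 diabetes features rather than the selected subset, and dual_coef_ is assumed to be the coefficient vector whose squared sum gives internal_comp.

# Sketch of the complexity measure Mc = 1e9 * N_FS + internal_comp for a
# fitted KernelRidge model (all assumptions noted above).
import numpy as np
from sklearn.kernel_ridge import KernelRidge

model = KernelRidge(alpha=0.27, gamma=0.01, kernel='rbf').fit(X_train, y_train)
n_fs = X_train.shape[1]                        # number of input features used
internal_comp = np.sum(model.dual_coef_ ** 2)  # sum of squared coefficients
Mc = 1e9 * n_fs + internal_comp
print(f'N_FS={n_fs}, internal_comp={internal_comp:.2f}, Mc={Mc:,.2f}')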

Detected a regression problem. Using 'neg_mean_squared_error' as default scoring function.
Running iteration 0
Best model -> Score = -0.510786 Complexity = 9,017,405,352.5
Iter = 0 -> MeanVal = -0.88274 ValBest = -0.510786 ComplexBest = 9,017,405,352.5 Time(min) = 0.014713

Running iteration 1
Best model -> Score = -0.499005 Complexity = 8,000,032,783.88
Iter = 1 -> MeanVal = -0.659969 ValBest = -0.499005 ComplexBest = 8,000,032,783.88 Time(min) = 0.007782

...
...
...

Running iteration 45
Best model -> Score = -0.489457 Complexity = 8,000,002,199.12
Iter = 45 -> MeanVal = -0.531697 ValBest = -0.490061 ComplexBest = 8,000,000,443.89 Time(min) = 0.003815

Running iteration 46
Best model -> Score = -0.489457 Complexity = 8,000,002,199.12
Iter = 46 -> MeanVal = -0.541818 ValBest = -0.493126 ComplexBest = 7,000,030,081.16 Time(min) = 0.003704

Time limit reached. Stopped.
KernelRidge(alpha=0.26747155972470016, gamma=0.010478997542788611, kernel='rbf')

In each iteration, the first row shows the score and complexity of the best model found so far. The second row shows the mean validation score of the iteration, together with the score and complexity of the best model obtained in that iteration. In the complexity value, the digits to the left of the first comma correspond to the number of selected features (NFS); the rest is the internal complexity of the model.

Show the final results on the test dataset:

Note: The results of the examples may vary depending on the hardware available.

preds = HYBparsimony_model.predict(X_test)
print(f'\n\nBest Model = {HYBparsimony_model.best_model}')
print(f'Selected features:{HYBparsimony_model.selected_features}')
print(f'Complexity = {round(HYBparsimony_model.best_complexity, 2):,}')
print(f'5-CV MSE = {-round(HYBparsimony_model.best_score,6)}')
print(f'RMSE test = {round(mean_squared_error(y_test, preds, squared=False),6)}')
Best Model = KernelRidge(alpha=0.26747155972470016, gamma=0.010478997542788611, kernel='rbf')
Selected features:['age' 'sex' 'bmi' 'bp' 's1' 's4' 's5' 's6']
Complexity = 8,000,002,199.12
5-CV MSE = 0.489457
RMSE test = 0.681918

We can also compare different algorithms, using more time (time_limit=5 minutes).

# Note: The results of the examples may vary depending on the hardware available.
algorithms_reg = ['Ridge', 'Lasso', 'KernelRidge', 'KNeighborsRegressor', 'MLPRegressor', 'SVR',
                  'DecisionTreeRegressor', 'RandomForestRegressor']
res = []
for algo in algorithms_reg:
    print('#######################')
    print('Searching best: ', algo)
    HYBparsimony_model = HYBparsimony(algorithm=algo,
                                      features=diabetes.feature_names,
                                      rerank_error=0.001,
                                      verbose=1)
    # Search the best hyperparameters and features
    # (increasing 'time_limit' to improve RMSE with high consuming algorithms)
    HYBparsimony_model.fit(X_train, y_train, time_limit=5)
    # Check results with test dataset
    preds = HYBparsimony_model.predict(X_test)
    print(algo, "RMSE test", mean_squared_error(y_test, preds, squared=False))
    print('Selected features:', HYBparsimony_model.selected_features)
    print(HYBparsimony_model.best_model)
    print('#######################')
    # Append results
    res.append(dict(algo=algo,
                    MSE_5CV=-round(HYBparsimony_model.best_score, 6),
                    RMSE=round(mean_squared_error(y_test, preds, squared=False), 6),
                    NFS=HYBparsimony_model.best_complexity // 1e9,
                    selected_features=HYBparsimony_model.selected_features,
                    best_model=HYBparsimony_model.best_model))

res = pd.DataFrame(res).sort_values('RMSE')
# Visualize results
print(res[['best_model', 'MSE_5CV', 'RMSE', 'NFS', 'selected_features']])

We obtain the following results:

     best_model   MSE_5CV      RMSE  NFS                   selected_features
4  MLPRegressor  0.491424  0.672799  7.0      [sex, bmi, bp, s1, s4, s5, s6]
2   KernelRidge  0.488908  0.679108  7.0     [age, sex, bmi, bp, s3, s5, s6]
1         Lasso  0.495795  0.694631  8.0  [sex, bmi, bp, s1, s2, s4, s5, s6]
0         Ridge  0.495662  0.694885  8.0  [sex, bmi, bp, s1, s2, s4, s5, s6]
5           SVR  0.487899  0.696137  7.0      [sex, bmi, bp, s1, s4, s5, s6]
3    KNeighbors  0.523190  0.705371  6.0          [sex, bmi, bp, s3, s5, s6]
7  RandomForest  0.535958  0.760138  6.0          [sex, bmi, s1, s4, s5, s6]
6  DecisionTree  0.625424  0.847182  3.0                       [bmi, s4, s6]

To speed up the example, we selected a dataset with a reduced number of rows. With such a small dataset, more time and more robust validation methods, such as bootstrapping or repeated cross-validation, are required to obtain solid conclusions. It is also recommended to repeat the HYBparsimony search with different random seeds.

Therefore, improvements in the test RMSE and in the number of selected features (NFS) can be achieved by increasing the time limit to 60 minutes (time_limit), setting the maximum number of iterations (maxiter) to 1000, and employing a more robust validation approach, such as 10-repeated 5-fold cross-validation.

from sklearn.model_selection import RepeatedKFold

HYBparsimony_model = HYBparsimony(algorithm=algo,
                                  features=diabetes.feature_names,
                                  rerank_error=0.001,
                                  cv=RepeatedKFold(n_repeats=10, n_splits=5),
                                  n_jobs=10,  # each job executes one fold
                                  maxiter=1000,
                                  verbose=1)
HYBparsimony_model.fit(X_train, y_train, time_limit=60)

Note that n_jobs sets the number of CPU cores used by the cross_val_score() function called inside the internal default_cv_score(). It is also important to mention that certain scikit-learn algorithms inherently use parallelization, so with some algorithms the available cores are shared between the algorithm itself and the cross_val_score() function.
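A common way to avoid oversubscribing the CPU in that situation is sketched below. threadpoolctl is assumed to be installed (it is not a HYBparsimony dependency); it caps the threads that numerical libraries spawn internally while the n_jobs=10 folds run in parallel.

# Hedged sketch: cap BLAS/OpenMP threads so the parallel folds launched by
# cross_val_score() do not compete with the algorithm's own threads.
from threadpoolctl import threadpool_limits

with threadpool_limits(limits=1):
    HYBparsimony_model.fit(X_train, y_train, time_limit=60)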

The following table shows the best model found for each algorithm. In this case, the model that best generalizes the problem is an MLPRegressor with only 6 of the 10 features and a single neuron in the hidden layer!

Best parsimonious regression models obtained with more iterations, more time and using a robust validation.

More Examples

You can find additional examples in the package’s GitHub repository at https://github.com/jodivaso/hybparsimony:

  1. Examples for binary and multiclass problems.
  2. Custom evaluation: Selecting alternative scoring metrics, creating new validation methods, and more.
  3. Customized search methods.
  4. Guidance on searching for a parsimonious Autogluon ensemble model.

Conclusions

In many real-world machine learning scenarios, small yet high-dimensional (SHHD) datasets are common, typically comprising tabular data with hundreds or even thousands of instances and tens to hundreds of features. Complex models, while accurate, can harbor hidden biases, driving the need for more interpretable models that aid in bias detection and reduce the risk of overfitting. Also, less complex models offer stability, resilience to noise, and ease of maintenance and analysis.

To address the challenge of simultaneously optimizing hyperparameters and selecting features for Parsimonious Model Selection (PMS), we present the HYBparsimony Python package, which combines genetic algorithms (GA) and particle swarm optimization (PSO) to perform hyperparameter optimization, feature selection, and PMS automatically and simultaneously on SHHD datasets.

Reference

Divasón, J., Pernia-Espinoza, A., Martinez-de-Pison, F.J. (2023). HYB-PARSIMONY: A hybrid approach combining Particle Swarm Optimization and Genetic Algorithms to find parsimonious models in high-dimensional datasets. Neurocomputing, 560, 126840. https://doi.org/10.1016/j.neucom.2023.126840
