Welcome to ITMO_FS!¶
Install and contribution¶
Prerequisites¶
The feature selection library requires the following dependencies:
- python (>=3.6)
- numpy (>=1.13.3)
- scipy (>=0.19.1)
- scikit-learn (>=0.22)
- imblearn (>=0.0)
- qpsolvers (>=1.0.1)
Install¶
ITMO_FS is currently available on PyPI and you can install it via pip:
pip install -U ITMO_FS
If you prefer, you can clone the repository and run the setup.py file. Use the following commands to get a copy from GitHub and install all dependencies:
git clone https://github.com/LastShekel/ITMO_FS.git
cd ITMO_FS
pip install .
Or install using pip and GitHub:
pip install -U git+https://github.com/LastShekel/ITMO_FS.git
Test and coverage¶
If you want to test the code before installing:
$ make test
If you wish to check the test coverage of your version:
$ make coverage
You can also use pytest:
$ pytest ITMO_FS -v
User Guide¶
Introduction¶
API of feature selectors¶
Available selectors follow the scikit-learn API using the base estimator and selector mixin:
Transformer
: The base object; implements selector.fit(data, targets).
To select features from a data set after learning, each selector implements: data_selected = selector.transform(data)
To learn from data and select features from the same data set at once, each selector implements: data_selected = selector.fit_transform(data, targets)
To reverse the selection operation, each selector implements: data_reversed = selector.inverse_transform(data_selected)
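A minimal sketch of this interface, using UnivariateFilter with the pearson_corr measure (both appear in the API reference below); that the measure is the only required constructor argument is read off the reference, and other defaults (e.g. the cutting rule) are an assumption:

from sklearn.datasets import make_classification
from ITMO_FS.filters.univariate import UnivariateFilter, pearson_corr

data, targets = make_classification(n_samples=200, n_features=15,
                                    n_informative=4, random_state=0)

selector = UnivariateFilter(pearson_corr)   # measure is the only required argument
selector.fit(data, targets)                 # learn per-feature scores
data_selected = selector.transform(data)    # select features using fitted scores
data_selected = selector.fit_transform(data, targets)  # fit and select in one call
data_reversed = selector.inverse_transform(data_selected)  # undo the selection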
Feature selectors accept the same inputs as in scikit-learn:
data
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;
targets
: array-like (1-D list, pandas.Series, numpy.array).
The output will be of the following types:
data_selected
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;
data_reversed
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices.
Sparse input
For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to the selector. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
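For instance (a hedged sketch; whether every measure accepts sparse input is an assumption, but pre-converting to CSR will at least avoid the internal conversion copy):

import numpy as np
from scipy.sparse import csr_matrix
from ITMO_FS.filters.univariate import UnivariateFilter, pearson_corr

rng = np.random.RandomState(0)
X = csr_matrix(rng.binomial(1, 0.05, size=(100, 30)).astype(float))  # mostly zeros
y = rng.randint(0, 2, size=100)

selector = UnivariateFilter(pearson_corr)
selector.fit(X, y)   # CSR input is consumed as-is, with no extra re-conversion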
Problem statement regarding data sets with redundant features¶
Feature selection methods can be used to identify and remove unneeded, irrelevant, and redundant attributes that do not contribute to the accuracy of a predictive model, or may in fact decrease it. Fewer attributes are desirable because they reduce model complexity, and a simpler model is easier to understand and explain.
Here is an example of feature selection improving classification quality:
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS
>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)
>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333
>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334
As expected, the quality of the SGDClassifier’s results is impacted by the presence of redundant features in the data set. We can see that after applying feature selection the mean accuracy increases from 0.903 to 0.943.
ITMO_FS API¶
This is the full API documentation of the ITMO_FS toolbox.
ITMO_FS.filters
: Filter methods¶
ITMO_FS.filters.univariate
: Univariate filter methods¶
filters.univariate.VDM([weighted])
: Creates a Value Difference Metric builder. References: http://aura.abdn.ac.uk/bitstream/handle/2164/10951/payne_ecai_98.pdf?sequence=1 and https://www.jair.org/index.php/jair/article/view/10182
filters.univariate.UnivariateFilter(measure)
: Basic interface for using univariate measures for feature selection.
Measures for univariate filters¶
filters.univariate.fit_criterion_measure(X, y)
: Calculates the fit criterion measure for each feature.
filters.univariate.f_ratio_measure(X, y)
: Calculates the Fisher score for each feature.
filters.univariate.gini_index(X, y)
: Calculates the Gini index, a measure of statistical dispersion, for each feature.
filters.univariate.su_measure(X, y)
: Calculates symmetric uncertainty (SU), a correlation measure between the features and the class, via the formula SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)).
filters.univariate.spearman_corr(X, y)
: Calculates Spearman correlation for each feature.
filters.univariate.pearson_corr(X, y)
: Calculates Pearson correlation for each feature.
filters.univariate.fechner_corr(X, y)
: Calculates sample sign correlation (Fechner correlation) for each feature.
filters.univariate.kendall_corr(X, y)
: Calculates Kendall rank correlation for each feature.
filters.univariate.reliefF_measure(X, y[, …])
: Calculates the ReliefF measure for each feature.
filters.univariate.chi2_measure(X, y)
: Calculates the chi-squared test statistic score for each feature of X.
filters.univariate.information_gain(X, y)
: Calculates mutual information for each feature via the formula I(X, Y) = H(X) - H(X|Y).
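Each measure above is a plain function of (X, y). A hedged sketch of calling one directly; that the return value is an array of per-feature scores is an assumption based on the descriptions above:

import numpy as np
from sklearn.datasets import make_classification
from ITMO_FS.filters.univariate import pearson_corr

X, y = make_classification(n_samples=150, n_features=8,
                           n_informative=3, random_state=0)

scores = pearson_corr(X, y)            # assumed: one score per feature
ranking = np.argsort(-np.abs(scores))  # strongest correlations first
print(ranking[:3])                     # indices of the three top-ranked features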
Cutting rules for univariate filters¶
ITMO_FS.filters.multivariate
: Multivariate filter methods¶
filters.multivariate.DISRWithMassive([…])
: Creates the DISR (Double Input Symmetric Relevance) feature selection filter, based on the kASSI criterion, which aims at maximizing mutual information while avoiding large multivariate density estimation.
filters.multivariate.FCBFDiscreteFilter()
: Creates the FCBF (Fast Correlation Based Filter) feature selection filter, based on mutual information criteria, for data with discrete features. At each step the filter searches for the feature that provides the most information about the classification problem on the given dataset, then eliminates features that are less relevant than redundant (see the sketch after this table).
filters.multivariate.MultivariateFilter(…)
: Provides basic functionality for multivariate filters.
filters.multivariate.STIR([n_features_to_keep])
: Feature selection using the STIR algorithm.
filters.multivariate.TraceRatioFisher(…)
: Creates the TraceRatio (similarity-based) feature selection filter, performed in a supervised way, i.e. the Fisher version.
filters.multivariate.MIMAGA(mim_size, …)
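A hedged sketch of FCBF on discrete data; the no-argument constructor comes from the table above, while the fit_transform interface is assumed to follow the common selector API:

import numpy as np
from ITMO_FS.filters.multivariate import FCBFDiscreteFilter

rng = np.random.RandomState(0)
X = rng.randint(0, 3, size=(100, 12))      # discrete-valued features
y = (X[:, 0] + X[:, 1] > 2).astype(int)    # labels driven by two features

fcbf = FCBFDiscreteFilter()                # no constructor arguments
X_selected = fcbf.fit_transform(X, y)      # assumed selector API
print(X_selected.shape)                    # fewer, less redundant features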
Measures for multivariate filters¶
filters.multivariate.MIM(selected_features, …)
: Mutual Information Maximization feature scoring criterion.
filters.multivariate.MRMR(selected_features, …)
: Minimum-Redundancy Maximum-Relevance feature scoring criterion.
filters.multivariate.JMI(selected_features, …)
: Joint Mutual Information feature scoring criterion.
filters.multivariate.CIFE(selected_features, …)
: Conditional Infomax Feature Extraction feature scoring criterion.
filters.multivariate.MIFS(selected_features, …)
: Mutual Information Feature Selection feature scoring criterion.
filters.multivariate.CMIM(selected_features, …)
: Conditional Mutual Info Maximisation feature scoring criterion.
filters.multivariate.ICAP(selected_features, …)
: Interaction Capping feature scoring criterion.
filters.multivariate.DCSF(selected_features, …)
: Dynamic Change of Selected Feature with the class scoring criterion.
filters.multivariate.CFR(selected_features, …)
: The CFR criterion maximizes correlation and minimizes redundancy.
filters.multivariate.MRI(selected_features, …)
: Max-Relevance and Max-Independence feature scoring criterion.
filters.multivariate.IWFS(selected_features, …)
: Interaction Weight based feature scoring criterion.
filters.multivariate.generalizedCriteria(…)
: A linear combination of relevance, redundancy, and conditional dependency; given a set of already selected features and a set of remaining features on dataset X with labels y, selects the next feature.
ITMO_FS.filters.unsupervised
: Unsupervised filter methods¶
filters.unsupervised.TraceRatioLaplacian(…)
: Creates the TraceRatio (similarity-based) feature selection filter, performed in an unsupervised way, i.e. the Laplacian version.
ITMO_FS.filters.sparse
: Sparse filter methods¶
filters.sparse.MCFS(d[, k, p, scheme, sigma])
: Performs the Unsupervised Feature Selection for Multi-Cluster Data algorithm (see the sketch after this table).
filters.sparse.NDFS(p[, c, k, alpha, beta, …])
: Performs the Nonnegative Discriminative Feature Selection algorithm.
filters.sparse.RFS(p[, gamma, …])
: Performs the Robust Feature Selection via Joint L2,1-Norms Minimization algorithm.
filters.sparse.SPEC(p[, k, gamma, sigma, …])
: Performs the Spectral Feature Selection algorithm.
filters.sparse.UDFS(p[, c, k, gamma, l, …])
: Performs the Unsupervised Discriminative Feature Selection algorithm.
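A hedged sketch for the sparse filters; that d in MCFS is the number of features to keep is an assumption read off the signature, not a documented fact:

import numpy as np
from ITMO_FS.filters.sparse import MCFS

X = np.random.RandomState(0).rand(80, 15)  # unsupervised: no labels needed

mcfs = MCFS(5)                      # assumed: keep 5 features
X_selected = mcfs.fit_transform(X)  # assumed selector API
print(X_selected.shape)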
ITMO_FS.ensembles
: Ensemble methods¶
ITMO_FS.ensembles.measure_based
: Measure based ensemble methods¶
ensembles.measure_based.WeightBased(filters)
ITMO_FS.ensembles.model_based
: Model based ensemble methods¶
ensembles.model_based.BestSum(models, …)
ITMO_FS.ensembles.ranking_based
: Ranking based ensemble methods¶
ensembles.ranking_based.Mixed(filters)
: Performs feature selection based on several filters, selecting features as follows: get ranks from every filter in the input (see the sketch below).
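A hedged sketch; that the filters argument accepts univariate measure functions is an assumption made to illustrate the rank-mixing idea, not a documented contract:

from sklearn.datasets import make_classification
from ITMO_FS.ensembles.ranking_based import Mixed
from ITMO_FS.filters.univariate import pearson_corr, spearman_corr

X, y = make_classification(n_samples=150, n_features=10,
                           n_informative=3, random_state=0)

mixed = Mixed([pearson_corr, spearman_corr])  # assumed filter type
mixed.fit(X, y)                  # each filter produces a feature ranking
X_selected = mixed.transform(X)  # selection from the combined ranks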
ITMO_FS.embedded
: Embedded methods¶
embedded.MOS([model, loss, seed])
: Performs the Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.
ITMO_FS.hybrid
: Hybrid methods¶
hybrid.FilterWrapperHybrid(filter_, wrapper)
hybrid.Melif(filter_ensemble[, scorer, verbose])
ITMO_FS.wrappers
: Wrapper methods¶
ITMO_FS.wrappers.deterministic
: Deterministic wrapper methods¶
wrappers.deterministic.AddDelWrapper(…[, …])
: Creates the add-del feature wrapper.
wrappers.deterministic.BackwardSelection(…)
: Backward Selection removes one feature at a time until the number of features to be removed is reached.
wrappers.deterministic.RecursiveElimination(…)
: Performs recursive feature elimination until the required number of features is reached.
wrappers.deterministic.SequentialForwardSelection(…)
: Sequentially adds the feature that maximizes the classifying function when combined with the features already selected.
Deterministic wrapper function¶
wrappers.deterministic.qpfs_wrapper(X, y, alpha)
: Performs the Quadratic Programming Feature Selection algorithm.
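Given the (X, y, alpha) signature above, a hedged sketch; what exactly the function returns (assumed here to identify the selected features) should be checked against the docstring:

from sklearn.datasets import make_classification
from ITMO_FS.wrappers.deterministic import qpfs_wrapper

X, y = make_classification(n_samples=120, n_features=10,
                           n_informative=3, random_state=0)

# alpha balances feature relevance against redundancy in the quadratic program
result = qpfs_wrapper(X, y, alpha=0.5)
print(result)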
ITMO_FS.wrappers.randomized
: Randomized wrapper methods¶
wrappers.randomized.HillClimbingWrapper(…)
wrappers.randomized.SimulatedAnnealing(…)
: Performs feature selection using simulated annealing.
wrappers.randomized.TPhMGWO([wolfNumber, …])
: Performs Grey Wolf optimization with two-phase mutation.
Getting started¶
Information on how to install, test, and contribute to the package.
User Guide¶
User guide of ITMO_FS
API¶
The main documentation. This contains an in-depth description of all algorithms and how to apply them.
API Documentation¶
The exact API of all functions and classes, as given in the docstrings. The API documents expected types and allowed features for all functions, and all parameters available for the algorithms.