Welcome to ITMO_FS!¶
Install and contribution¶
Prerequisites¶
The feature selection library requires the following dependencies:
- python (>=3.6)
- numpy (>=1.13.3)
- scipy (>=0.19.1)
- scikit-learn (>=0.22)
- imblearn (>=0.0)
- qpsolvers (>=1.0.1)
Install¶
ITMO_FS is currently available on PyPI and you can install it via pip:
pip install -U ITMO_FS
If you prefer, you can clone the repository and run the setup.py file. Use the following commands to get a copy from GitHub and install all dependencies:
git clone https://github.com/LastShekel/ITMO_FS.git
cd ITMO_FS
pip install .
Or install using pip and GitHub:
pip install -U git+https://github.com/LastShekel/ITMO_FS.git
Test and coverage¶
If you want to test the code before installing:
$ make test
If you wish to test the coverage of your version:
$ make coverage
You can also use pytest:
$ pytest ITMO_FS -v
User Guide¶
Introduction¶
APIs of feature selectors¶
Available selectors follow the scikit-learn API using the base estimator and selector mixin:

Transformer
: The base object; implements a fit method: selector.fit(data, targets).
To select features from a data set after learning, each selector implements: data_selected = selector.transform(data)
To learn from data and select features from the same data set at once, each selector implements: data_selected = selector.fit_transform(data, targets)
To reverse the selection operation, each selector implements: data_reversed = selector.inverse_transform(data_selected)
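For instance, a univariate filter can be fitted and applied like any scikit-learn transformer. Here is a minimal sketch, assuming the select_k_best cutting rule from ITMO_FS.filters.univariate (the dataset is synthetic and purely illustrative):

>>> from sklearn.datasets import make_classification
>>> from ITMO_FS.filters.univariate import UnivariateFilter, f_ratio_measure, select_k_best
>>> X, y = make_classification(n_samples=200, n_features=10, n_informative=4, random_state=0)
>>> selector = UnivariateFilter(f_ratio_measure, select_k_best(4))
>>> X_selected = selector.fit_transform(X, y)
>>> X_selected.shape
(200, 4)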
Feature selectors accept the same inputs as in scikit-learn:

data
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;

targets
: array-like (1-D list, pandas.Series, numpy.array).

The output will be of the following type:

data_selected
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;

data_reversed
: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices.
Sparse input

For sparse input the data is converted to the Compressed Sparse Rows representation (see scipy.sparse.csr_matrix) before being fed to the selector. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
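For example, you can build the CSR representation yourself before calling a selector's fit; a minimal sketch (the data here is random and purely illustrative):

>>> import numpy as np
>>> from scipy.sparse import csr_matrix
>>> rng = np.random.RandomState(0)
>>> X_csr = csr_matrix(rng.rand(50, 8))  # built once, upstream
>>> X_csr.format
'csr'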
Problem statement regarding data sets with redundant features¶
Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease it. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.
Here is an example of feature selection improving the classification quality:
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS
>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)
>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333
>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334
As expected, the quality of the SGDClassifier’s results is impacted by the presence of redundant features in the data set. We can see that after applying feature selection the mean accuracy increases from 0.903 to 0.943.
ITMO_FS API¶
This is the full API documentation of the ITMO_FS toolbox.
ITMO_FS.filters
: Filter methods¶
ITMO_FS.filters.univariate
: Univariate filter methods¶
filters.univariate.VDM([weighted, q])
: Creates Value Difference Metric builder.

filters.univariate.UnivariateFilter(measure)
: Basic interface for using univariate measures for feature selection.
Measures for univariate filters¶
filters.univariate.fit_criterion_measure(x, y)
: Calculate the FitCriterion score for features.

filters.univariate.f_ratio_measure(x, y)
: Calculate Fisher score for features.

filters.univariate.gini_index(x, y)
: Calculate Gini index for features.
filters.univariate.su_measure(x, y)
: SU is a correlation measure between the features and the class, calculated via the formula SU(X,Y) = 2 * I(X,Y) / (H(X) + H(Y)).
filters.univariate.spearman_corr(x, y)
: Calculate Spearman’s correlation for each feature.

filters.univariate.pearson_corr(x, y)
: Calculate Pearson’s correlation for each feature.

filters.univariate.fechner_corr(x, y)
: Calculate sample sign correlation (Fechner correlation) for each feature.
filters.univariate.kendall_corr(x, y)
: Calculate Kendall’s rank correlation for each feature.
filters.univariate.reliefF_measure(x, y[, …])
: Calculate ReliefF measure for each feature.

filters.univariate.chi2_measure(x, y)
: Calculate the chi-squared measure for each feature.

filters.univariate.information_gain(x, y)
: Calculate mutual information for each feature by the formula I(X,Y) = H(Y) - H(Y|X).
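Each measure is a plain function that returns one score per feature, which makes it easy to inspect rankings directly; a minimal sketch (the dataset is synthetic and illustrative):

>>> from sklearn.datasets import make_classification
>>> from ITMO_FS.filters.univariate import pearson_corr
>>> X, y = make_classification(n_samples=100, n_features=5, random_state=0)
>>> scores = pearson_corr(X, y)  # one score per feature
>>> len(scores)
5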
Cutting rules for univariate filters¶
ITMO_FS.filters.multivariate
: Multivariate filter methods¶
filters.multivariate.DISRWithMassive(n_features)
: Create DISR (Double Input Symmetric Relevance) feature selection filter based on the kASSI criterion, which aims at maximizing the mutual information while avoiding large multivariate density estimation.
filters.multivariate.FCBFDiscreteFilter([delta])
: Create FCBF (Fast Correlation Based Filter) feature selection filter based on mutual information criteria for data with discrete features.

filters.multivariate.MultivariateFilter(…)
: Provides basic functionality for multivariate filters.

filters.multivariate.STIR(n_features[, …])
: Feature selection using the STIR algorithm.
filters.multivariate.TraceRatioFisher(n_features)
: Creates a TraceRatio (similarity-based) feature selection filter performed in a supervised way, i.e. the Fisher version.

filters.multivariate.MIMAGA(mim_size, pop_size)
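Multivariate filters expose the same fit/transform API as the rest of the library; a minimal sketch with FCBFDiscreteFilter, assuming integer-valued features since the filter expects discrete data (the data is random and illustrative):

>>> import numpy as np
>>> from ITMO_FS.filters.multivariate import FCBFDiscreteFilter
>>> rng = np.random.RandomState(0)
>>> X = rng.randint(0, 3, (100, 6))  # discrete features
>>> y = rng.randint(0, 2, 100)
>>> X_selected = FCBFDiscreteFilter().fit_transform(X, y)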
Measures for multivariate filters¶
filters.multivariate.MIM(selected_features, …)
: Mutual Information Maximization feature scoring criterion.

filters.multivariate.MRMR(selected_features, …)
: Minimum-Redundancy Maximum-Relevance feature scoring criterion.

filters.multivariate.JMI(selected_features, …)
: Joint Mutual Information feature scoring criterion.

filters.multivariate.CIFE(selected_features, …)
: Conditional Infomax Feature Extraction feature scoring criterion.

filters.multivariate.MIFS(selected_features, …)
: Mutual Information Feature Selection feature scoring criterion.

filters.multivariate.CMIM(selected_features, …)
: Conditional Mutual Info Maximisation feature scoring criterion.

filters.multivariate.ICAP(selected_features, …)
: Interaction Capping feature scoring criterion.

filters.multivariate.DCSF(selected_features, …)
: Dynamic change of selected feature with the class scoring criterion.

filters.multivariate.CFR(selected_features, …)
: The criterion of CFR maximizes the correlation and minimizes the redundancy.

filters.multivariate.MRI(selected_features, …)
: Max-Relevance and Max-Independence feature scoring criterion.
filters.multivariate.IWFS(selected_features, …)
: Interaction Weight based feature scoring criterion.
filters.multivariate.generalizedCriteria(…)
: This feature scoring criterion is a linear combination of relevance, redundancy, and conditional dependency. Given a set of already selected features and a set of remaining features on dataset X with labels y, it selects the next feature.
ITMO_FS.filters.unsupervised
: Unsupervised filter methods¶
filters.unsupervised.TraceRatioLaplacian(…)
: TraceRatio (similarity-based) feature selection filter performed in an unsupervised way, i.e. the Laplacian version.
ITMO_FS.filters.sparse
: Sparse filter methods¶
filters.sparse.MCFS
filters.sparse.NDFS
filters.sparse.RFS
filters.sparse.SPEC
filters.sparse.UDFS
ITMO_FS.ensembles
: Ensemble methods¶
ITMO_FS.ensembles.measure_based
: Measure based ensemble methods¶
ensembles.measure_based.WeightBased(filters)
: Weight-based filter ensemble.
ITMO_FS.ensembles.model_based
: Model based ensemble methods¶
ensembles.model_based.BestSum(models, …[, …])
: Best weighted sum ensemble.
ITMO_FS.ensembles.ranking_based
: Ranking based ensemble methods¶
ensembles.ranking_based.Mixed(filters, …)
: Perform feature selection based on several filters, selecting features by combining the ranks obtained from every input filter.
ITMO_FS.embedded
: Embedded methods¶
embedded.MOS(model, weight_func[, loss, …])
: Perform Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.
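A hedged sketch of constructing MOS with explicit arguments, per the signature above; the particular weight function (summed squared linear coefficients) is an illustrative assumption, not the library's documented default:

>>> import numpy as np
>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS
>>> X, y = make_classification(n_samples=300, n_features=10, n_informative=2, random_state=0)
>>> sel = MOS(model=SGDClassifier(), weight_func=lambda m: np.square(m.coef_).sum(axis=0))
>>> X_sel = sel.fit_transform(X, y, smote=False)  # smote=False runs MOSNS, as in the example above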
ITMO_FS.hybrid
: Hybrid methods¶
hybrid.FilterWrapperHybrid(filter_, wrapper)
: Perform the filter + wrapper hybrid algorithm: first run the filter algorithm on the full dataset, keep the selected features, then run the wrapper algorithm on the reduced dataset.
hybrid.Melif(estimator, measure, …[, …])
: MeLiF algorithm.
ITMO_FS.wrappers
: Wrapper methods¶
ITMO_FS.wrappers.deterministic
: Deterministic wrapper methods¶
wrappers.deterministic.AddDelWrapper(…[, …])
: Add-Del feature wrapper.

wrappers.deterministic.BackwardSelection(…)
: Backward Selection removes one feature at a time until the number of features to be removed is reached.

wrappers.deterministic.RecursiveElimination(…)
: Recursive feature elimination algorithm.

wrappers.deterministic.SequentialForwardSelection(…)
: Sequentially add features that maximize the classifying function when combined with the features already used.
Deterministic wrapper function¶
wrappers.deterministic.qpfs_wrapper
ITMO_FS.wrappers.randomized
: Randomized wrapper methods¶
wrappers.randomized.HillClimbingWrapper(…)
: Hill Climbing algorithm.

wrappers.randomized.SimulatedAnnealing(…)
: Simulated Annealing algorithm.

wrappers.randomized.TPhMGWO(estimator, measure)
: Grey Wolf optimization with Two-Phase Mutation.
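A hedged sketch of a randomized wrapper, assuming TPhMGWO accepts a scikit-learn estimator together with a metric function such as sklearn.metrics.f1_score; the exact expected measure format is an assumption:

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from sklearn.metrics import f1_score
>>> from ITMO_FS.wrappers.randomized import TPhMGWO
>>> X, y = make_classification(n_samples=100, n_features=10, random_state=0)
>>> selector = TPhMGWO(SGDClassifier(), f1_score)
>>> X_selected = selector.fit_transform(X, y)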
Getting started¶
Information to install, test, and contribute to the package.
User Guide¶
User guide of ITMO_FS
API¶
The main documentation. This contains an in-depth description of all algorithms and how to apply them.
API Documentation¶
The exact API of all functions and classes, as given in the docstring. The API documents expected types and allowed features for all functions, and all parameters available for the algorithms.