Welcome to ITMO_FS!

Installation and contribution

Prerequisites

The feature selection library requires the following dependencies:

  • python (>=3.6)
  • numpy (>=1.13.3)
  • scipy (>=0.19.1)
  • scikit-learn (>=0.22)
  • imblearn (>=0.0)
  • qpsolvers (>=1.0.1)

Install

ITMO_FS is currently available on PyPI and you can install it via pip:

pip install -U ITMO_FS

If you prefer, you can clone the repository and install it from source. Use the following commands to get a copy from GitHub and install all dependencies:

git clone https://github.com/LastShekel/ITMO_FS.git
cd ITMO_FS
pip install .

Or install using pip and GitHub:

pip install -U git+https://github.com/LastShekel/ITMO_FS.git

Test and coverage

If you want to test the code before installing:

$ make test

If you wish to check the test coverage of your version:

$ make coverage

You can also use pytest:

$ pytest ITMO_FS -v

Contribute

You can contribute to this code through pull requests on GitHub. Please make sure that your code comes with unit tests to ensure full coverage and continuous integration in the API.

User Guide

Introduction

APIs of feature selectors

Available selectors follow the scikit-learn API using the base estimator and selector mixin:

Transformer:

The base object implements a fit method to learn from data:

selector.fit(data, targets)

To select features from a data set after learning, each selector implements:

data_selected = selector.transform(data)

To learn from data and select features from the same data set at once, each selector implements:

data_selected = selector.fit_transform(data, targets)

To reverse the selection operation, each selector implements:

data_reversed = selector.inverse_transform(data_selected)
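
Putting these calls together, here is a minimal sketch of this API using the MOS selector from the example further below; any other ITMO_FS selector that follows the scikit-learn API can be substituted, and the toy data set is only for illustration.

from sklearn.datasets import make_classification
from ITMO_FS.embedded import MOS

# toy data with a few informative and several redundant features
data, targets = make_classification(n_samples=300, n_features=10,
                                    n_informative=2, random_state=0)

selector = MOS()
selector.fit(data, targets)                   # learn which features to keep
data_selected = selector.transform(data)      # keep only the selected columns

# or learn and select in a single call
data_selected = selector.fit_transform(data, targets)
print(data.shape, "->", data_selected.shape)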

Feature selectors accept the same inputs as in scikit-learn:

  • data: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;
  • targets: array-like (1-D list, pandas.Series, numpy.array).

The output will be of the following type:

  • data_selected: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;
  • data_reversed: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices.
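
For instance, pandas objects can be passed in place of plain arrays. This is a minimal sketch assuming, as the list above states, that pandas inputs are handled like their numpy counterparts; the smote=False argument mirrors the MOS example shown later in this guide.

import pandas as pd
from sklearn.datasets import make_classification
from ITMO_FS.embedded import MOS

X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=2, random_state=0)
X_df = pd.DataFrame(X, columns=[f"f{i}" for i in range(X.shape[1])])
y_s = pd.Series(y)

# the selector is expected to accept DataFrame/Series just like numpy arrays
data_selected = MOS().fit_transform(X_df, y_s, smote=False)
print(data_selected.shape)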

Sparse input

For sparse input the data is converted to the Compressed Sparse Row representation (see scipy.sparse.csr_matrix) before being fed to the selector. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
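
Below is a minimal sketch of choosing the CSR representation upstream so the selector does not have to convert the data itself; it assumes, as stated above, that sparse input is accepted, and reuses the MOS selector for illustration.

import numpy as np
from scipy.sparse import csr_matrix
from ITMO_FS.embedded import MOS

rng = np.random.default_rng(0)
X_dense = rng.random((300, 10))
X_dense[X_dense < 0.7] = 0.0                     # make the matrix mostly zeros
y = (X_dense[:, 0] + X_dense[:, 1] > 1.0).astype(int)

X_sparse = csr_matrix(X_dense)                   # CSR representation upstream
data_selected = MOS().fit_transform(X_sparse, y, smote=False)
print(data_selected.shape)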

Problem statement regarding data sets with redundant features

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease it. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

Here is an example of feature selection improving classification quality:

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS

>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)

>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333

>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334

As expected, the quality of the SGDClassifier’s results is affected by the presence of redundant features in the data set. We can see that after applying feature selection the mean accuracy increases from 0.903 to 0.943.

ITMO_FS API

This is the full API documentation of the ITMO_FS toolbox.

ITMO_FS.filters: Filter methods

ITMO_FS.filters.univariate: Univariate filter methods

filters.univariate.VDM([weighted]) Creates a Value Difference Metric builder. References: http://aura.abdn.ac.uk/bitstream/handle/2164/10951/payne_ecai_98.pdf?sequence=1, https://www.jair.org/index.php/jair/article/view/10182
filters.univariate.UnivariateFilter(measure) Basic interface for using univariate measures for feature selection.
Measures for univariate filters
filters.univariate.fit_criterion_measure(X, y)
filters.univariate.f_ratio_measure(X, y) Calculates Fisher score for features.
filters.univariate.gini_index(X, y) Gini index is a measure of statistical dispersion.
filters.univariate.su_measure(X, y) SU is a correlation measure between the features and the class, calculated via the formula SU(X, Y) = 2 * I(X, Y) / (H(X) + H(Y)).
filters.univariate.spearman_corr(X, y) Calculates Spearman correlation for each feature.
filters.univariate.pearson_corr(X, y) Calculates Pearson correlation for each feature.
filters.univariate.fechner_corr(X, y) Calculates sample sign correlation (Fechner correlation) for each feature.
filters.univariate.kendall_corr(X, y) Calculates Kendall rank correlation for each feature.
filters.univariate.reliefF_measure(X, y[, …]) Computes the ReliefF measure for each feature.
filters.univariate.chi2_measure(X, y) Calculates the chi-squared statistic score for each feature of X.
filters.univariate.information_gain(X, y) Calculates mutual information for each feature by the formula I(X, Y) = H(X) - H(X|Y).
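
The measures above share a common calling convention: each takes (X, y) and returns one score per feature, which UnivariateFilter then combines with a cutting rule to perform the actual selection. A minimal sketch, assuming only the per-feature score output described in the table:

import numpy as np
from sklearn.datasets import make_classification
from ITMO_FS.filters.univariate import pearson_corr, spearman_corr

X, y = make_classification(n_samples=200, n_features=8,
                           n_informative=3, random_state=0)

for measure in (pearson_corr, spearman_corr):
    scores = measure(X, y)                       # one score per feature
    print(measure.__name__, np.round(scores, 3))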

ITMO_FS.filters.multivariate: Multivariate filter methods

filters.multivariate.DISRWithMassive([…]) Creates the DISR (Double Input Symmetric Relevance) feature selection filter, based on the kASSI criterion, which aims at maximizing the mutual information while avoiding large multivariate density estimation.
filters.multivariate.FCBFDiscreteFilter() Creates the FCBF (Fast Correlation Based Filter) feature selection filter, based on mutual information criteria, for data with discrete features. At each step the filter searches for the feature that provides the most information about the classification problem on the given dataset and then eliminates features whose redundancy outweighs their relevance.
filters.multivariate.MultivariateFilter(…) Provides basic functionality for multivariate filters.
filters.multivariate.STIR([n_features_to_keep]) Feature selection using STIR algorithm.
filters.multivariate.TraceRatioFisher(…) Creates the TraceRatio (similarity-based) feature selection filter in its supervised (Fisher) version.
filters.multivariate.MIMAGA(mim_size, …)
Measures for multivariate filters
filters.multivariate.MIM(selected_features, …) Mutual Information Maximization feature scoring criterion.
filters.multivariate.MRMR(selected_features, …) Minimum-Redundancy Maximum-Relevance feature scoring criterion.
filters.multivariate.JMI(selected_features, …) Joint Mutual Information feature scoring criterion.
filters.multivariate.CIFE(selected_features, …) Conditional Infomax Feature Extraction feature scoring criterion.
filters.multivariate.MIFS(selected_features, …) Mutual Information Feature Selection feature scoring criterion.
filters.multivariate.CMIM(selected_features, …) Conditional Mutual Info Maximisation feature scoring criterion.
filters.multivariate.ICAP(selected_features, …) Interaction Capping feature scoring criterion.
filters.multivariate.DCSF(selected_features, …) Dynamic change of selected feature with the class scoring criterion.
filters.multivariate.CFR(selected_features, …) The criterion of CFR maximizes the correlation and minimizes the redundancy.
filters.multivariate.MRI(selected_features, …) Max-Relevance and Max-Independence feature scoring criteria.
filters.multivariate.IWFS(selected_features, …) Interaction Weight base feature scoring criteria.
filters.multivariate.generalizedCriteria(…) This feature scoring criterion is a linear combination of relevance, redundancy and conditional dependency. Given the set of already selected features and the set of remaining features on dataset X with labels y, it selects the next feature.
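
As a minimal sketch of one of the filters above, FCBF can be applied through the generic fit/transform API described earlier; discrete, integer-encoded features are assumed, since that is what this filter is designed for.

import numpy as np
from ITMO_FS.filters.multivariate import FCBFDiscreteFilter

rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 6))            # six discrete features
y = ((X[:, 0] + X[:, 1]) > 2).astype(int)        # target driven by two of them

fcbf = FCBFDiscreteFilter()
X_selected = fcbf.fit_transform(X, y)
print(X.shape, "->", X_selected.shape)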

ITMO_FS.filters.unsupervised: Unsupervised filter methods

filters.unsupervised.TraceRatioLaplacian(…) Creates the TraceRatio (similarity-based) feature selection filter in its unsupervised (Laplacian) version.

ITMO_FS.filters.sparse: Sparse filter methods

filters.sparse.MCFS(d[, k, p, scheme, sigma]) Performs the Unsupervised Feature Selection for Multi-Cluster Data algorithm.
filters.sparse.NDFS(p[, c, k, alpha, beta, …]) Performs the Nonnegative Discriminative Feature Selection algorithm.
filters.sparse.RFS(p[, gamma, …]) Performs the Robust Feature Selection via Joint L2,1-Norms Minimization algorithm.
filters.sparse.SPEC(p[, k, gamma, sigma, …]) Performs the Spectral Feature Selection algorithm.
filters.sparse.UDFS(p[, c, k, gamma, l, …]) Performs the Unsupervised Discriminative Feature Selection algorithm.

ITMO_FS.ensembles: Ensemble methods

ITMO_FS.ensembles.measure_based: Measure based ensemble methods

ensembles.measure_based.WeightBased(filters)

ITMO_FS.ensembles.model_based: Model based ensemble methods

ensembles.model_based.BestSum(models, …)

ITMO_FS.ensembles.ranking_based: Ranking based ensemble methods

ensembles.ranking_based.Mixed(filters) Performs feature selection based on several filters, selecting features as follows: get ranks from every filter in the input.

ITMO_FS.embedded: Embedded methods

embedded.MOS([model, loss, seed]) Performs the Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.

ITMO_FS.hybrid: Hybrid methods

hybrid.FilterWrapperHybrid(filter_, wrapper)
hybrid.Melif(filter_ensemble[, scorer, verbose])

ITMO_FS.wrappers: Wrapper methods

ITMO_FS.wrappers.deterministic: Deterministic wrapper methods

wrappers.deterministic.AddDelWrapper(…[, …]) Creates an add-del feature wrapper.
wrappers.deterministic.BackwardSelection(…) Backward Selection removes one feature at a time until the number of features to be removed is reached.
wrappers.deterministic.RecursiveElimination(…) Performs a recursive feature elimination until the required number of features is reached.
wrappers.deterministic.SequentialForwardSelection(…) Sequentially adds features that maximize the classifying function when combined with the features already selected.
Deterministic wrapper function
wrappers.deterministic.qpfs_wrapper(X, y, alpha) Performs Quadratic Programming Feature Selection algorithm.

ITMO_FS.wrappers.randomized: Randomized wrapper methods

wrappers.randomized.HillClimbingWrapper(…)
wrappers.randomized.SimulatedAnnealing(…) Performs feature selection using simulated annealing
wrappers.randomized.TPhMGWO([wolfNumber, …]) Performs Grey Wolf optimization with Two-Phase Mutation

Getting started

Information on how to install, test, and contribute to the package.

User Guide

User guide of ITMO_FS

API

The main documentation. This contains an in-depth description of all algorithms and how to apply them.

API Documentation

The exact API of all functions and classes, as given in the docstring. The API documents expected types and allowed features for all functions, and all parameters available for the algorithms.