Welcome to ITMO_FS!

Installation and contribution

Prerequisites

The feature selection library requires the following dependencies:

  • python (>=3.6)
  • numpy (>=1.13.3)
  • scipy (>=0.19.1)
  • scikit-learn (>=0.22)
  • imblearn (>=0.0)
  • qpsolvers (>=1.0.1)

Install

ITMO_FS is currently available on PyPI and you can install it via pip:

pip install -U ITMO_FS

If you prefer, you can clone the repository and install it from source. Use the following commands to get a copy from GitHub and install all dependencies:

git clone https://github.com/LastShekel/ITMO_FS.git
cd ITMO_FS
pip install .

Or install using pip and GitHub:

pip install -U git+https://github.com/LastShekel/ITMO_FS.git

Test and coverage

If you want to test the code before installing:

$ make test

If you wish to check the test coverage of your version:

$ make coverage

You can also use pytest:

$ pytest ITMO_FS -v

Contribute

You can contribute to this code through pull requests on GitHub. Please make sure that your code comes with unit tests to ensure full coverage and continuous integration in the API.

User Guide

Introduction

API’s of feature selectors

Available selectors follow the scikit-learn API using the base estimator and selector mixin:

Transformer:

The base object implements a fit method to learn from data:

selector.fit(data, targets)

To select features from a data set after learning, each selector implements:

data_selected = selector.transform(data)

To learn from data and select features from the same data set at once, each selector implements:

data_selected = selector.fit_transform(data, targets)

To reverse the selection operation, each selector implements:

data_reversed = selector.inverse_transform(data_selected)

Feature selectors accept the same inputs as scikit-learn estimators:

  • data: array-like (2-D list, pandas.DataFrame, numpy.array) or sparse matrices;
  • targets: array-like (1-D list, pandas.Series, numpy.array).

The output will be of the following type:

  • data_selected: array-like (2-D list, pandas.DataFrame, numpy.array) or
    sparse matrices;
  • data_reversed: array-like (2-D list, pandas.DataFrame, numpy.array) or
    sparse matrices.
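
Putting these pieces together, here is a minimal sketch of the full cycle with a univariate filter. The cutting rule select_k_best is an assumption about the helpers shipped with your release; substitute whichever cutting rule your version provides:

>>> from sklearn.datasets import make_classification
>>> from ITMO_FS.filters.univariate import UnivariateFilter, spearman_corr, select_k_best
>>> X, y = make_classification(n_samples=100, n_features=20, n_informative=4, random_state=0)
>>> selector = UnivariateFilter(spearman_corr, select_k_best(5))  # keep the 5 best-scoring features
>>> selector.fit(X, y)
>>> X_selected = selector.transform(X)
>>> X_selected.shape
(100, 5)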

Sparse input

For sparse input the data is converted to the Compressed Sparse Row representation (see scipy.sparse.csr_matrix) before being fed to the selector. To avoid unnecessary memory copies, it is recommended to choose the CSR representation upstream.
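
For example, converting once to CSR upstream and passing the sparse matrix directly avoids repeated copies (a sketch reusing the selector fitted above):

>>> from scipy.sparse import csr_matrix
>>> X_sparse = csr_matrix(X)  # convert to CSR once, upstream
>>> X_selected = selector.transform(X_sparse)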

Problem statement regarding data sets with redundant features

Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease it. Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is easier to understand and explain.

Here is an example of feature selection improving classification quality:

>>> from sklearn.datasets import make_classification
>>> from sklearn.linear_model import SGDClassifier
>>> from ITMO_FS.embedded import MOS

>>> X, y = make_classification(n_samples=300, n_features=10, random_state=0, n_informative=2)
>>> sel = MOS()
>>> trX = sel.fit_transform(X, y, smote=False)

>>> cl1 = SGDClassifier()
>>> cl1.fit(X, y)
>>> cl1.score(X, y)
0.9033333333333333

>>> cl2 = SGDClassifier()
>>> cl2.fit(trX, y)
>>> cl2.score(trX, y)
0.9433333333333334

As expected, the quality of the SGDClassifier’s results is impacted by the presence of redundant features in the data set. We can see that after applying feature selection the mean accuracy increases from 0.903 to 0.943.

ITMO_FS API

This is the full API documentation of the ITMO_FS toolbox.

ITMO_FS.filters: Filter methods

ITMO_FS.filters.univariate: Univariate filter methods

filters.univariate.VDM([weighted, q]) Creates Value Difference Metric builder.
filters.univariate.UnivariateFilter(measure) Basic interface for using univariate measures for feature selection.

Measures for univariate filters

filters.univariate.fit_criterion_measure(x, y) Calculate the FitCriterion score for features.
filters.univariate.f_ratio_measure(x, y) Calculate Fisher score for features.
filters.univariate.gini_index(x, y) Calculate Gini index for features.
filters.univariate.su_measure(x, y) SU is a correlation measure between the features and the class, calculated via the formula SU(X,Y) = 2 * I(X,Y) / (H(X) + H(Y)).
filters.univariate.spearman_corr(x, y) Calculate Spearman’s correlation for each feature.
filters.univariate.pearson_corr(x, y) Calculate Pearson’s correlation for each feature.
filters.univariate.fechner_corr(x, y) Calculate Sample sign correlation (Fechner correlation) for each feature.
filters.univariate.kendall_corr(x, y) Calculate Kendall’s rank correlation for each feature.
filters.univariate.reliefF_measure(x, y[, …]) Calculate ReliefF measure for each feature.
filters.univariate.chi2_measure(x, y) Calculate the Chi-squared measure for each feature.
filters.univariate.information_gain(x, y) Calculate mutual information for each feature by formula I(X,Y) = H(Y) - H(Y|X).
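
The measure functions listed above can also be called standalone: each takes the data matrix and the targets and returns one relevance score per feature. A minimal sketch, reusing X and y from the earlier sketch:

>>> from ITMO_FS.filters.univariate import pearson_corr
>>> scores = pearson_corr(X, y)  # one score per feature
>>> scores.shape
(20,)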

ITMO_FS.filters.multivariate: Multivariate filter methods

filters.multivariate.DISRWithMassive(n_features) Create the DISR (Double Input Symmetric Relevance) feature selection filter, based on the kASSI criterion, which aims at maximizing the mutual information while avoiding large multivariate density estimation.
filters.multivariate.FCBFDiscreteFilter([delta]) Create FCBF (Fast Correlation Based filter) feature selection filter based on mutual information criteria for data with discrete features.
filters.multivariate.MultivariateFilter(…) Provides basic functionality for multivariate filters.
filters.multivariate.STIR(n_features[, …]) Feature selection using STIR algorithm.
filters.multivariate.TraceRatioFisher(n_features) Creates the TraceRatio (similarity-based) feature selection filter, performed in a supervised way.
filters.multivariate.MIMAGA(mim_size, pop_size)

Measures for multivariate filters

filters.multivariate.MIM(selected_features, …) Mutual Information Maximization feature scoring criterion.
filters.multivariate.MRMR(selected_features, …) Minimum-Redundancy Maximum-Relevance feature scoring criterion.
filters.multivariate.JMI(selected_features, …) Joint Mutual Information feature scoring criterion.
filters.multivariate.CIFE(selected_features, …) Conditional Infomax Feature Extraction feature scoring criterion.
filters.multivariate.MIFS(selected_features, …) Mutual Information Feature Selection feature scoring criterion.
filters.multivariate.CMIM(selected_features, …) Conditional Mutual Info Maximisation feature scoring criterion.
filters.multivariate.ICAP(selected_features, …) Interaction Capping feature scoring criterion.
filters.multivariate.DCSF(selected_features, …) Dynamic change of selected feature with the class scoring criterion.
filters.multivariate.CFR(selected_features, …) The criterion of CFR maximizes the correlation and minimizes the redundancy.
filters.multivariate.MRI(selected_features, …) Max-Relevance and Max-Independence feature scoring criteria.
filters.multivariate.IWFS(selected_features, …) Interaction Weight base feature scoring criteria.
filters.multivariate.generalizedCriteria(…) This feature scoring criterion is a linear combination of relevance, redundancy and conditional dependency; given a set of already selected features and a set of remaining features on dataset X with labels y, it selects the next feature.
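
A minimal sketch of the multivariate interface, assuming MultivariateFilter takes a measure name and the number of features to select (check the signature of your installed version):

>>> from ITMO_FS.filters.multivariate import MultivariateFilter
>>> sel = MultivariateFilter('MRMR', 5)  # measure name and feature count are assumptions
>>> sel.fit(X, y)
>>> X_sel = sel.transform(X)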

ITMO_FS.filters.unsupervised: Unsupervised filter methods

filters.unsupervised.TraceRatioLaplacian(…) TraceRatio (similarity-based) feature selection filter performed in an unsupervised way, i.e. the Laplacian version.

ITMO_FS.filters.sparse: Sparse filter methods

filters.sparse.MCFS
filters.sparse.NDFS
filters.sparse.RFS
filters.sparse.SPEC
filters.sparse.UDFS

ITMO_FS.ensembles: Ensemble methods

ITMO_FS.ensembles.measure_based: Measure based ensemble methods

ensembles.measure_based.WeightBased(filters) Weight-based filter ensemble.
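
A hedged sketch of a weight-based ensemble, assuming WeightBased accepts a list of univariate filters and follows the fit/transform API described in the User Guide (select_k_best as before is an assumed helper):

>>> from ITMO_FS.ensembles.measure_based import WeightBased
>>> from ITMO_FS.filters.univariate import UnivariateFilter, gini_index, pearson_corr
>>> ens = WeightBased([UnivariateFilter(gini_index, select_k_best(5)),
...                    UnivariateFilter(pearson_corr, select_k_best(5))])  # list-of-filters constructor is an assumption
>>> ens.fit(X, y)
>>> X_sel = ens.transform(X)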

ITMO_FS.ensembles.model_based: Model based ensemble methods

ensembles.model_based.BestSum(models, …[, …]) Best weighted sum ensemble.

ITMO_FS.ensembles.ranking_based: Ranking based ensemble methods

ensembles.ranking_based.Mixed(filters, …) Perform feature selection based on several filters, by first getting ranks from every input filter.

ITMO_FS.embedded: Embedded methods

embedded.MOS(model, weight_func[, loss, …]) Perform the Minimizing Overlapping Selection under SMOTE (MOSS) or under No-Sampling (MOSNS) algorithm.

ITMO_FS.hybrid: Hybrid methods

hybrid.FilterWrapperHybrid(filter_, wrapper) Perform the filter + wrapper hybrid algorithm by first running the filter algorithm on the full dataset, leaving the selected features and running the wrapper algorithm on the cut dataset.
hybrid.Melif(estimator, measure, …[, …]) MeLiF algorithm.
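
A hedged sketch of the hybrid interface, assuming FilterWrapperHybrid takes a filter and a wrapper object as the parameter names above suggest; RecursiveElimination’s constructor arguments here are assumptions, so check your version:

>>> from ITMO_FS.hybrid import FilterWrapperHybrid
>>> from ITMO_FS.wrappers.deterministic import RecursiveElimination
>>> from sklearn.linear_model import LogisticRegression
>>> flt = UnivariateFilter(spearman_corr, select_k_best(10))  # filter step, helpers as before
>>> wrp = RecursiveElimination(LogisticRegression(), 5, 'f1_macro')  # assumed signature
>>> hyb = FilterWrapperHybrid(flt, wrp)
>>> X_sel = hyb.fit_transform(X, y)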

ITMO_FS.wrappers: Wrapper methods

ITMO_FS.wrappers.deterministic: Deterministic wrapper methods

wrappers.deterministic.AddDelWrapper(…[, …]) Add-Del feature wrapper.
wrappers.deterministic.BackwardSelection(…) Backward Selection removes one feature at a time until the number of features to be removed is reached.
wrappers.deterministic.RecursiveElimination(…) Recursive feature elimination algorithm.
wrappers.deterministic.SequentialForwardSelection(…) Sequentially add features that maximize the classifying function when combined with the features already used.

Deterministic wrapper function

wrappers.deterministic.qpfs_wrapper

ITMO_FS.wrappers.randomized: Randomized wrapper methods

wrappers.randomized.HillClimbingWrapper(…) Hill Climbing algorithm.
wrappers.randomized.SimulatedAnnealing(…) Simulated Annealing algorithm.
wrappers.randomized.TPhMGWO(estimator, measure) Grey Wolf optimization with Two-Phase Mutation.

Getting started

Information on how to install, test, and contribute to the package.

User Guide

User guide of ITMO_FS

API

The main documentation. This contains an in-depth description of all algorithms and how to apply them.

API Documentation

The exact API of all functions and classes, as given in the docstrings. The API documents expected types and allowed features for all functions, and all parameters available for the algorithms.