ITMO_FS.filters.multivariate.STIR

class ITMO_FS.filters.multivariate.STIR(n_features_to_keep=10)

Feature selection using the STIR (STatistical Inference Relief) algorithm.

The algorithm is taken from the paper:

STatistical Inference Relief (STIR) feature selection (https://academic.oup.com/bioinformatics/article/35/8/1358/5100883).

__init__(n_features_to_keep=10)

Sets up STIR to perform feature selection. n_features_to_keep is the number of top features that transform retains (default 10).

distance_matrix(X)

Computes the distance matrix.

The matrix is centered and normalized before the distances are calculated.
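The text does not say which centering, normalization, or metric is used, so the following is only a minimal sketch under stated assumptions: each column is centered on its mean and scaled by its range, and rows are compared with the Manhattan distance. The name distance_matrix_sketch is hypothetical and is not part of ITMO_FS.

```python
def distance_matrix_sketch(X):
    """Pairwise Manhattan distances between rows after column-wise scaling.

    Assumptions (not confirmed by the documentation): centering subtracts
    the column mean, normalizing divides by the column range.
    """
    columns = list(zip(*X))
    means = [sum(col) / len(col) for col in columns]
    # Guard against constant columns, whose range is zero.
    ranges = [(max(col) - min(col)) or 1.0 for col in columns]
    # Center each column on its mean, then scale by its range.
    scaled = [
        [(value - means[j]) / ranges[j] for j, value in enumerate(row)]
        for row in X
    ]
    # Manhattan (L1) distance between every pair of rows.
    return [
        [sum(abs(a - b) for a, b in zip(r1, r2)) for r2 in scaled]
        for r1 in scaled
    ]


X = [[0.0, 10.0], [1.0, 20.0], [2.0, 40.0]]
D = distance_matrix_sketch(X)  # 3 x 3, symmetric, zero diagonal
```

In this sketch the centering step cancels out in the pairwise differences, so only the scaling changes the result; it is kept to mirror the description above.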

Parameters: X (array-like, shape (n_samples, n_features)) – matrix to compute pairwise sample distances for.
Returns: X_distances – distance matrix.
Return type: array-like, shape (n_samples, n_samples)

find_neighbors(X, y, k=1)

Finds the nearest hit/miss matrices.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – matrix to compute neighbors of.
  • y (array-like, shape (n_samples, )) – vector of binary class status (usually -1/1).
  • k (int, optional) – number of nearest hits and nearest misses to find for each instance; the same k is used for every instance.
Returns:

hitmiss – a pair of two-column lists, hitmiss[1] (hits) and hitmiss[2] (misses). In both lists the first column is the instance index. The second column is hit_index (the nearest hits of that instance) in hitmiss[1] and miss_index (its nearest misses) in hitmiss[2].

Return type:

array-like, shape (2, )
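To illustrate the two-column structure described above, here is a minimal sketch that derives hit and miss pairs from a precomputed distance matrix. It is an illustration rather than the library's implementation: find_neighbors_sketch is a hypothetical name, it takes a distance matrix instead of X, and it returns a zero-based (hits, misses) tuple rather than the hitmiss[1]/hitmiss[2] indexing used in the description.

```python
def find_neighbors_sketch(D, y, k=1):
    """Return (hits, misses) as lists of (instance, neighbor) index pairs."""
    hits, misses = [], []
    n = len(y)
    for i in range(n):
        # All other instances, ordered from nearest to farthest.
        others = sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])
        same_class = [j for j in others if y[j] == y[i]]
        other_class = [j for j in others if y[j] != y[i]]
        # Keep the k nearest neighbors of each kind.
        hits.extend((i, j) for j in same_class[:k])
        misses.extend((i, j) for j in other_class[:k])
    return hits, misses


# Four points on a line at 0, 1, 5 and 6; distances are absolute differences.
points = [0.0, 1.0, 5.0, 6.0]
D = [[abs(a - b) for b in points] for a in points]
y = [-1, -1, 1, 1]
hits, misses = find_neighbors_sketch(D, y, k=1)
# hits   -> [(0, 1), (1, 0), (2, 3), (3, 2)]
# misses -> [(0, 2), (1, 2), (2, 1), (3, 1)]
```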

fit(X, y, feature_names=None, k=1)

Computes the feature importance scores from the training data.
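The documentation does not spell out how the scores are computed. As a rough sketch, assuming the general form of the statistic in the cited paper (a pooled two-sample t-statistic comparing range-scaled feature differences to nearest misses against those to nearest hits), one feature's score could look like the following. The function stir_t_sketch and the sample hit/miss pairs are hypothetical, and the library's fit may differ in detail.

```python
import math
from statistics import mean, variance


def stir_t_sketch(X, hits, misses, feature):
    """Pooled t-statistic for one feature (assumed form, see lead-in)."""
    col = [row[feature] for row in X]
    span = (max(col) - min(col)) or 1.0  # avoid dividing by zero
    # Range-scaled absolute differences to nearest hits and nearest misses.
    hit_diffs = [abs(col[i] - col[j]) / span for i, j in hits]
    miss_diffs = [abs(col[i] - col[j]) / span for i, j in misses]
    n_h, n_m = len(hit_diffs), len(miss_diffs)
    # Pooled sample variance of the two groups of differences.
    pooled_var = (
        (n_m - 1) * variance(miss_diffs) + (n_h - 1) * variance(hit_diffs)
    ) / (n_m + n_h - 2)
    # Large positive values: the feature separates classes better than chance.
    return (mean(miss_diffs) - mean(hit_diffs)) / math.sqrt(
        pooled_var * (1 / n_m + 1 / n_h)
    )


# Feature 0 tracks the class (rows 0-1 vs rows 2-3); feature 1 does not.
X = [[0.0, 3.0], [1.0, 0.0], [5.0, 1.0], [7.0, 2.0]]
hits = [(0, 1), (1, 0), (2, 3), (3, 2)]    # hypothetical nearest hits
misses = [(0, 2), (1, 2), (2, 1), (3, 1)]  # hypothetical nearest misses
t0 = stir_t_sketch(X, hits, misses, feature=0)
t1 = stir_t_sketch(X, hits, misses, feature=1)
```

Here t0 comes out positive and t1 negative, so feature 0 would be ranked above feature 1. The sketch needs at least two hit pairs and two miss pairs with non-identical differences, otherwise the pooled variance is undefined or zero.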

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training instances to compute the feature importance scores from.
  • y (array-like, shape (n_samples, )) – Training labels.
  • feature_names (list of strings, optional) – names to assign to the features, if desired.
  • k (int, optional) – number of nearest hits and nearest misses to find for each instance; the same k is used for every instance.
Return type:

None

fit_transform(X, y, feature_names=None, k=1)

Fits and transforms data.

Computes the feature importance scores from the training data, then reduces the feature set down to the top ‘n_features_to_keep’ features.

Parameters:
  • X (array-like, shape (n_samples, n_features)) – Training instances to compute the feature importance scores from.
  • y (array-like, shape (n_samples, )) – Training labels.
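  • feature_names (list of strings, optional) – names to assign to the features, if desired.
  • k (int, optional) – number of nearest hits and nearest misses to find for each instance; the same k is used for every instance.
Returns:

The input reduced to the top n_features_to_keep features.

Return type:

2D numpy array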

max_diff(X)

Computes the maximum difference between values in each column.
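As a small illustration, assuming that the maximum difference in a column means its largest value minus its smallest, the computation can be sketched in plain Python. The name max_diff_sketch is hypothetical, and nested lists stand in for an array.

```python
def max_diff_sketch(X):
    """Return max - min for every column of a row-major matrix."""
    # zip(*X) transposes the rows into columns.
    return [max(col) - min(col) for col in zip(*X)]


X = [[1.0, 10.0], [4.0, 2.0], [2.0, 6.0]]
diff_vector = max_diff_sketch(X)  # -> [3.0, 8.0]
```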

Parameters: X (array-like, shape (n_samples, n_features)) – matrix to compute column differences of.
Returns: diff_vector – column difference vector.
Return type: array-like, shape (n_features, )

transform(X)

Reduces the feature set down to the top n_features_to_keep features.
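The reduction step can be pictured as ranking features by score and keeping the best columns. The sketch below is an assumption-laden illustration, not the library's code: keep_top_features is a hypothetical helper, higher scores are taken to be better, and the kept columns are returned in their original order, which the documentation does not specify.

```python
def keep_top_features(X, scores, n_features_to_keep):
    """Keep the columns of X with the highest scores."""
    # Feature indices ordered from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    # Restore the original column order among the kept features.
    kept = sorted(ranked[:n_features_to_keep])
    return [[row[j] for j in kept] for row in X], kept


X = [[1, 2, 3], [4, 5, 6]]
scores = [0.2, 3.1, 1.5]  # hypothetical feature importance scores
X_reduced, kept = keep_top_features(X, scores, n_features_to_keep=2)
# kept      -> [1, 2]
# X_reduced -> [[2, 3], [5, 6]]
```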

Parameters: X (array-like, shape (n_samples, n_features)) – feature matrix to perform feature selection on.
Returns: the input reduced to the top n_features_to_keep features.
Return type: 2D numpy array