unipy.stats.feature_selection module

Feature selection.

unipy.stats.feature_selection.lasso_rank(formula=None, X=None, y=None, data=None, alpha=array([1e-05, 0.00011, 0.00021, ..., 0.00991]), k=2, plot=False, *args, **kwargs)

Feature selection by LASSO regression.

Parameters

formula – R-style formula string
X (list-like) – Column values for X.
y (list-like) – A column value for y.
data (pandas.DataFrame) – A DataFrame.
alpha (Iterable) – An iterable of alpha values. The default is 100 evenly spaced values from 1e-05 to 0.00991 (step 0.0001).
k (int) – Threshold applied to the coefficient matrix (default: 2).
plot (bool, default: False) – Whether to plot the result.
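The default alpha grid is evenly spaced, so an equivalent grid, or a custom one, can be built with NumPy. The following is a minimal sketch; whether unipy constructs its default this way is an assumption, and `coarse_alphas` is a hypothetical name used only for illustration.

```python
import numpy as np

# 100 values starting at 1e-05 in steps of 0.0001; this matches the
# default in the signature up to floating-point rounding.
alphas = np.arange(1e-05, 0.01, 0.0001)

# A coarser custom grid (hypothetical values) that could be passed as `alpha=`.
coarse_alphas = np.linspace(0.001, 0.01, 10)
```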

Returns

rankTbl (pandas.DataFrame) – Feature ranking by given k.
minIntercept (pandas.DataFrame) – The minimum intercept row in the coefficient matrix.
coefMatrix (pandas.DataFrame) – A coefficient matrix.
kBest (pandas.DataFrame) – The best intercept row in the coefficient matrix for the given k.
kBestPredY (dict) – A predicted Y with kBest alpha.

Example

>>> import unipy.dataset.api as dm
>>> from unipy.stats.feature_selection import lasso_rank
>>> dm.init()
['cars', 'anscombe', 'iris', 'nutrients', 'german_credit_scoring_fars2008', 'winequality_red', 'winequality_white', 'titanic', 'car90', 'diabetes', 'adult', 'tips', 'births_big', 'breast_cancer', 'air_quality', 'births_small']
>>> wine_red = dm.load('winequality_red')
Dataset : winequality_red
>>>
>>> ranked, best_by_intercept, coefTbl, kBest, kBestPred = lasso_rank(X=wine_red.columns.drop('quality'), y=['quality'], data=wine_red)
>>> ranked
                  rank  lasso_coef  abs_coef
volatile_acidity     1   -0.675725  0.675725
alcohol              2    0.194865  0.194865
>>> best_by_intercept
                      RSS  Intercept  fixed_acidity  volatile_acidity  \
alpha_0.00121  691.956364   3.134874       0.002374         -1.023793

               citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  \
alpha_0.00121          0.0             0.0  -0.272912                 -0.0

               total_sulfur_dioxide  density    pH  sulphates   alcohol  \
alpha_0.00121             -0.000963     -0.0  -0.0   0.505956  0.264552

               var_count
alpha_0.00121          6
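To illustrate the general idea behind ranking features from a LASSO coefficient matrix, here is a minimal sketch on synthetic data. It uses scikit-learn's `Lasso` directly and is not unipy's implementation. The `alpha_<value>` row labels and the `var_count` column imitate the example output above. The data, the alpha values, and the rule of taking the first row with exactly `k` nonzero coefficients are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

# Synthetic data: only x0 and x1 drive y; x2..x4 are pure noise.
rng = np.random.default_rng(0)
features = ["x0", "x1", "x2", "x3", "x4"]
X = pd.DataFrame(rng.normal(size=(200, 5)), columns=features)
y = 3.0 * X["x0"] - 2.0 * X["x1"] + rng.normal(scale=0.1, size=200)

# Fit one LASSO model per alpha and collect one row of coefficients each.
rows = {}
for alpha in [0.01, 0.1, 1.0, 5.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    row = {"Intercept": model.intercept_}
    row.update(zip(features, model.coef_))
    row["var_count"] = int(np.count_nonzero(model.coef_))
    rows[f"alpha_{alpha}"] = row

coef_matrix = pd.DataFrame.from_dict(rows, orient="index")

# Assumed selection rule: take the first row that keeps exactly k features,
# then rank those features by absolute coefficient.
k = 2
k_best = coef_matrix[coef_matrix["var_count"] == k]
coefs = k_best.iloc[0][features]
ranked = coefs[coefs != 0].abs().sort_values(ascending=False)
print(ranked)
```

Larger alpha values shrink more coefficients to exactly zero, which is why the number of surviving features can serve as a selection criterion.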

unipy.stats.feature_selection.feature_selection_vif(data, thresh=5.0)

Stepwise Feature Selection for multivariate analysis.

It fits OLS regressions and computes the variance inflation factor (VIF) of every explanatory variable. If the largest VIF exceeds the given threshold, that variable is dropped. This process repeats until all VIFs are lower than the given threshold.

A threshold no higher than 5 is recommended: a VIF greater than 5 indicates that the explanatory variable is highly collinear with the other explanatory variables, so the parameter estimates will have large standard errors.
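For reference, the standard definition of the VIF for the j-th explanatory variable is given below, where R_j^2 is the coefficient of determination from regressing that variable on all the other explanatory variables:

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

Under this definition, a VIF of 5 corresponds to R_j^2 = 0.8, meaning that 80% of the variable's variance is explained by the other explanatory variables.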

Parameters

data (DataFrame; rows: observations, columns: variables) – Design DataFrame containing all explanatory variables, as used in a regression.
thresh (int, float) – VIF threshold (default: 5.0).

Returns

Filtered_data (DataFrame) – A subset of the input DataFrame.
dropped_List (DataFrame) – The dropped variables, in two columns: 'var' holds the names of the variables dropped from the input data columns, and 'vif' holds their variance inflation factors.

Notes

This function does not save the auxiliary regression.
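The stepwise procedure described above can be sketched with NumPy and pandas alone. This is an illustration under stated assumptions, not unipy's implementation. The helper names `vif` and `drop_high_vif` are hypothetical, and an intercept is added to each auxiliary regression, which may differ from what the library does. The sketch also assumes no column is constant.

```python
import numpy as np
import pandas as pd


def vif(data: pd.DataFrame, column: str) -> float:
    """VIF of one column: 1 / (1 - R^2) from regressing it on the others."""
    y = data[column].to_numpy(dtype=float)
    others = data.drop(columns=column).to_numpy(dtype=float)
    # Assumption: include an intercept so R^2 is measured around the mean.
    design = np.column_stack([np.ones(len(data)), others])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    r2 = 1.0 - float(resid @ resid) / float(((y - y.mean()) ** 2).sum())
    return float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)


def drop_high_vif(data: pd.DataFrame, thresh: float = 5.0):
    """Drop the highest-VIF column repeatedly until no VIF exceeds thresh."""
    kept = data.copy()
    dropped = []
    while kept.shape[1] > 1:
        vifs = {col: vif(kept, col) for col in kept.columns}
        worst = max(vifs, key=vifs.get)
        if vifs[worst] <= thresh:
            break
        dropped.append({"var": worst, "vif": vifs[worst]})
        kept = kept.drop(columns=worst)
    return kept, pd.DataFrame(dropped, columns=["var", "vif"])


# Synthetic demo: "a_copy" is nearly identical to "a", so one of them is dropped.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "a_copy": a + rng.normal(scale=0.01, size=200)})
filtered, dropped = drop_high_vif(demo, thresh=5.0)
print(filtered.columns.tolist())
print(dropped)
```

Only the single worst variable is dropped per pass, because removing one collinear variable changes the VIFs of all the remaining ones.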