unipy.stats.feature_selection module
Feature selection.
unipy.stats.feature_selection.lasso_rank(formula=None, X=None, y=None, data=None, alpha=array([1e-05, 0.00011, 0.00021, ..., 0.00991]), k=2, plot=False, *args, **kwargs)
Feature selection by LASSO regression.
The default alpha grid holds 100 values from 1e-05 to 0.00991 in steps of 0.0001.
- Parameters
formula – R-style formula string
X (list-like) – Column values for X.
y (list-like) – A column value for y.
data (pandas.DataFrame) – A DataFrame.
alpha (Iterable) – An iterable of alpha (regularization strength) values to try.
k (int) – Threshold of coefficient matrix.
plot (bool (default: False)) – Set to True to plot the result.
- Returns
rankTbl (pandas.DataFrame) – Feature ranking by the given k.
minIntercept (pandas.DataFrame) – The minimum intercept row in the coefficient matrix.
coefMatrix (pandas.DataFrame) – A coefficient matrix.
kBest (pandas.DataFrame) – When k is given, the best intercept row in the coefficient matrix.
kBestPredY (dict) – A predicted Y with the kBest alpha.
Example
>>> import unipy.dataset.api as dm
>>> from unipy.stats.feature_selection import lasso_rank
>>> dm.init()
['cars', 'anscombe', 'iris', 'nutrients', 'german_credit_scoring_fars2008', 'winequality_red', 'winequality_white', 'titanic', 'car90', 'diabetes', 'adult', 'tips', 'births_big', 'breast_cancer', 'air_quality', 'births_small']
>>> wine_red = dm.load('winequality_red')
Dataset : winequality_red
>>> ranked, best_by_intercept, coefTbl, kBest, kBestPred = lasso_rank(X=wine_red.columns.drop('quality'), y=['quality'], data=wine_red)
>>> ranked
                  rank  lasso_coef  abs_coef
volatile_acidity     1   -0.675725  0.675725
alcohol              2    0.194865  0.194865
>>> best_by_intercept
                      RSS  Intercept  fixed_acidity  volatile_acidity
alpha_0.00121  691.956364   3.134874       0.002374         -1.023793

               citric_acid  residual_sugar  chlorides  free_sulfur_dioxide
alpha_0.00121          0.0             0.0  -0.272912                 -0.0

               total_sulfur_dioxide  density   pH  sulphates   alcohol
alpha_0.00121             -0.000963     -0.0 -0.0   0.505956  0.264552

               var_count
alpha_0.00121          6
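The idea of ranking features by their LASSO coefficients over a grid of alpha values can be sketched with NumPy alone. This is an illustration, not unipy's implementation. The helper names `lasso_cd` and `rank_by_lasso` are hypothetical, the solver is a plain cyclic coordinate descent standing in for a library LASSO routine, and the selection rule (keep the fit with exactly k non-zero coefficients and the lowest RSS) is an assumption about how such a ranking might be built.

```python
import numpy as np


def lasso_cd(X, y, alpha, n_iter=100):
    """Fit LASSO by cyclic coordinate descent (hypothetical helper).

    Minimises (1 / 2n) * ||y - b0 - X b||^2 + alpha * ||b||_1.
    """
    n, p = X.shape
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring absorbs the intercept
    col_sq = (Xc ** 2).sum(axis=0) / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial = yc - Xc @ beta + Xc[:, j] * beta[j]
            rho = Xc[:, j] @ partial / n
            # Soft-thresholding: weak correlations are set exactly to zero.
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / col_sq[j]
    return y_mean - x_mean @ beta, beta


def rank_by_lasso(X, y, names, alphas, k=2):
    """Return (rank, name, coef) tuples from the k-feature fit with the lowest RSS.

    The selection rule is an assumption made for this sketch.
    """
    best = None
    for alpha in alphas:
        intercept, beta = lasso_cd(X, y, alpha)
        if np.count_nonzero(beta) != k:
            continue
        rss = float(((y - intercept - X @ beta) ** 2).sum())
        if best is None or rss < best[0]:
            best = (rss, alpha, beta)
    if best is None:
        return []                             # no alpha left exactly k features
    beta = best[2]
    order = np.argsort(-np.abs(beta))[:k]     # largest absolute coefficient first
    return [(r + 1, names[i], float(beta[i])) for r, i in enumerate(order)]


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
# Only 'a' and 'c' drive y; 'b' and 'd' are noise features.
y = 3.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(scale=0.1, size=200)
ranking = rank_by_lasso(X, y, ["a", "b", "c", "d"],
                        alphas=np.arange(0.05, 1.0, 0.05), k=2)
print(ranking)
```

On this synthetic data the two informative features should come out ranked first and second, with the sign of each coefficient matching the sign used to generate y.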
unipy.stats.feature_selection.feature_selection_vif(data, thresh=5.0)
Stepwise Feature Selection for multivariate analysis.
It fits an OLS regression for each explanatory variable against the others and computes the variance inflation factor (VIF) of every variable. If the largest VIF exceeds the given threshold, the corresponding variable is dropped. This process is repeated until all VIFs are lower than the given threshold.
A threshold of 5 or lower is recommended: a VIF greater than 5 indicates that the explanatory variable is highly collinear with the other explanatory variables, so its parameter estimate will have a large standard error.
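- Parameters
data (DataFrame) – The input data containing the explanatory variables.
thresh (float (default: 5.0)) – The VIF threshold above which a variable is dropped.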
- Returns
Filtered_data (DataFrame) – A subset of the input DataFrame.
dropped_List (DataFrame) – The dropped variables, in two columns: ‘var’ (names of the variables dropped from the input data columns) and ‘vif’ (variance inflation factors of the dropped variables).
Notes
This function does not save the auxiliary regression.
See also
statsmodels.stats.outliers_influence.variance_inflation_factor()
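The stepwise procedure described for feature_selection_vif can be sketched with NumPy alone. This is a minimal illustration under stated assumptions, not unipy's code: the helper names `vif` and `drop_high_vif` are hypothetical, each auxiliary regression includes an intercept (an assumption of this sketch), and the VIF is computed as 1 / (1 - R²), so a threshold of 5 corresponds to an auxiliary R² of 0.8.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing it on the other columns."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(X)), others])   # intercept + other columns
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ coef
    ss_res = resid @ resid
    ss_tot = ((target - target.mean()) ** 2).sum()
    return ss_tot / ss_res        # algebraically equal to 1 / (1 - R^2)


def drop_high_vif(X, names, thresh=5.0):
    """Drop the column with the largest VIF until every VIF is <= thresh."""
    names = list(names)
    dropped = []                  # (name, vif) pairs, in the order they were dropped
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= thresh:
            break
        dropped.append((names.pop(worst), float(vifs[worst])))
        X = np.delete(X, worst, axis=1)
    return X, names, dropped


rng = np.random.default_rng(0)
a = rng.normal(size=300)
b = rng.normal(size=300)
c = a + b + rng.normal(scale=0.05, size=300)   # nearly a linear combination of a and b
X = np.column_stack([a, b, c])
kept, kept_names, dropped = drop_high_vif(X, ["a", "b", "c"], thresh=5.0)
print(kept_names, dropped)
```

Here the nearly redundant column c should be removed in the first pass, after which the remaining two independent columns have VIFs close to 1 and the loop stops.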
References