unipy.stats package
Submodules
Module contents
Utility Objects.
This module provides a number of utility functions and objects.
hypo_test
f_test – F-Test.
f_test_formula – F-Test by formula.
anova_test – ANOVA Test.
anova_test_formula – ANOVA Test by formula.
chisq_test – Chi-square Test.
fisher_test – Fisher’s Exact Test.
feature_selection
lasso_rank – Feature selection by LASSO regression.
feature_selection_vif – VIF based stepwise feature selection for multivariate analysis.
metrics
deviation – Deviation.
vif – Variance inflation factor.
mean_absolute_percentage_error – Mean Absolute Percentage Error.
average_absolute_deviation – Average Absolute Deviation.
median_absolute_deviation – Median Absolute Deviation.
calculate_interaction – Feature interaction calculation.
formula
from_formula – R-style Formula Formatting.
unipy.stats.mean_absolute_percentage_error(measure, predict, thresh=3.0)
Mean Absolute Percentage Error. It expresses prediction error as a percentage. It measures the prediction accuracy of a forecasting method, for example in trend estimation, by comparing the real measured values with the predicted values. A MAPE of 5 means the predictions deviate from the measured values by 5% on average. It cannot be used if any measured value is zero, because that would cause a division by zero.
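As a rough illustration of the definition above, the following is a minimal sketch of the textbook MAPE formula in plain Python. The function name `mape` is hypothetical, and this is not necessarily how `unipy.stats.mean_absolute_percentage_error` is implemented; in particular, the `thresh` argument is not modeled here because its meaning is not documented on this page.

```python
def mape(measure, predict):
    """Textbook mean absolute percentage error, returned in percent.

    Hypothetical helper for illustration only; it ignores the `thresh`
    argument of the unipy function, whose behavior is not documented here.
    """
    if len(measure) != len(predict):
        raise ValueError("measure and predict must have the same length")
    if any(m == 0 for m in measure):
        # Each error is divided by the measured value, so zeros are not allowed.
        raise ZeroDivisionError("MAPE is undefined when a measured value is zero")
    ratios = [abs((m - p) / m) for m, p in zip(measure, predict)]
    return 100.0 * sum(ratios) / len(ratios)


measured = [100.0, 200.0, 400.0]
predicted = [95.0, 210.0, 380.0]
# Every prediction is off by 5% of its measured value, so the result is about 5.
print(mape(measured, predicted))
```

Raising an error on zero measured values mirrors the restriction stated above rather than silently skipping those points.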
unipy.stats.average_absolute_deviation(measure, predict, thresh=2)
Average Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method, for example in trend estimation, by comparing the real measured values with the predicted values. If MAD is 5, it means this prediction method potentially has…
unipy.stats.median_absolute_deviation(measure, predict, thresh=2)
Median Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method, for example in trend estimation, by comparing the real measured values with the predicted values. If MAD is 5, it means this prediction method potentially has…
unipy.stats.calculate_interaction(rankTbl, pvTbl, target, ranknum=10)
Feature interaction calculation.
unipy.stats.f_test(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)
F-Test.
unipy.stats.f_test_formula(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)
F-Test by formula.
unipy.stats.chisq_test(data, x=None, y=None, correction=None, lambda_=None, margin=True, print_ok=True)
Chi-square Test.
lambda_ gives the power in the Cressie-Read power divergence statistic. The default is 1. For convenience, lambda_ may also be assigned one of several predefined strings, in which case the corresponding numerical value is used.
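To make the role of lambda_ concrete, here is a minimal sketch of the Cressie-Read power divergence statistic for a two-way contingency table, written in plain Python. The function name `power_divergence_stat` is hypothetical and this is not claimed to be how `chisq_test` computes its result; it only shows that a power of 1 reproduces Pearson's chi-square statistic, while a power of 0 is handled as the limiting log-likelihood ratio (G) statistic. The sketch assumes every observed count is positive.

```python
import math


def power_divergence_stat(observed, lambda_=1.0):
    """Cressie-Read statistic for a 2-D table of positive counts (illustrative)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            if lambda_ == 0:
                # Limit as lambda -> 0: log-likelihood ratio (G) statistic.
                stat += 2.0 * obs * math.log(obs / expected)
            elif lambda_ == -1:
                # Limit as lambda -> -1: modified log-likelihood ratio.
                stat += 2.0 * expected * math.log(expected / obs)
            else:
                scale = 2.0 / (lambda_ * (lambda_ + 1.0))
                stat += scale * obs * ((obs / expected) ** lambda_ - 1.0)
    return stat


table = [[10, 20], [30, 40]]
print(power_divergence_stat(table, lambda_=1.0))  # Pearson chi-square statistic
print(power_divergence_stat(table, lambda_=0.0))  # G statistic
```

With lambda_ set to 1 the general term simplifies to the familiar sum of (observed - expected)^2 / expected, which is why 1 is the natural default.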
unipy.stats.fisher_test(data, x=None, y=None, alternative='two-sided', margin=True, print_ok=True)
Fisher’s Exact Test.
unipy.stats.lasso_rank(formula=None, X=None, y=None, data=None, alpha=array([1e-05, 0.00011, 0.00021, ..., 0.00991]), k=2, plot=False, *args, **kwargs)
Feature selection by LASSO regression. The default alpha holds 100 values, from 1e-05 to 0.00991 in steps of 0.0001.
- Parameters
formula – R-style formula string.
X (list-like) – Column values for X.
y (list-like) – A column value for y.
data (pandas.DataFrame) – A DataFrame.
alpha (Iterable) – An iterable containing alpha values.
k (int) – Threshold of coefficient matrix.
plot (Boolean (default: False)) – True to plot the result.
- Returns
rankTbl (pandas.DataFrame) – Feature ranking by given k.
minIntercept (pandas.DataFrame) – The minimum intercept row in coefficient matrix.
coefMatrix (pandas.DataFrame) – A coefficient matrix.
kBest (pandas.DataFrame) – When given k, the best intercept row in coefficient matrix.
kBestPredY (dict) – A predicted Y with kBest alpha.
Example
>>> from unipy.stats import lasso_rank
>>> import unipy.dataset.api as dm
>>> dm.init()
['cars', 'anscombe', 'iris', 'nutrients', 'german_credit_scoring_fars2008', 'winequality_red', 'winequality_white', 'titanic', 'car90', 'diabetes', 'adult', 'tips', 'births_big', 'breast_cancer', 'air_quality', 'births_small']
>>> wine_red = dm.load('winequality_red')
Dataset : winequality_red
>>> ranked, best_by_intercept, coefTbl, kBest, kBestPred = lasso_rank(X=wine_red.columns.drop('quality'), y=['quality'], data=wine_red)
>>> ranked
                  rank  lasso_coef  abs_coef
volatile_acidity     1   -0.675725  0.675725
alcohol              2    0.194865  0.194865
>>> best_by_intercept
                      RSS  Intercept  fixed_acidity  volatile_acidity
alpha_0.00121  691.956364   3.134874       0.002374         -1.023793

               citric_acid  residual_sugar  chlorides  free_sulfur_dioxide
alpha_0.00121          0.0             0.0  -0.272912                 -0.0

               total_sulfur_dioxide  density   pH  sulphates   alcohol
alpha_0.00121             -0.000963     -0.0 -0.0   0.505956  0.264552

               var_count
alpha_0.00121          6
unipy.stats.feature_selection_vif(data, thresh=5.0)
Stepwise Feature Selection for multivariate analysis.
It fits OLS regressions and calculates the variance inflation factor (VIF) for each explanatory variable in turn. If the largest VIF exceeds the given threshold, the corresponding variable is dropped. This process is repeated until all VIFs are lower than the given threshold.
The recommended threshold is 5 or lower: a VIF greater than 5 indicates that the explanatory variable is highly collinear with the other explanatory variables, so its parameter estimates will have large standard errors.
- Parameters
- Returns
Filtered_data (DataFrame) – A subset of the input DataFrame.
dropped_List (DataFrame) – The dropped variables. The ‘var’ column holds the names of the variables dropped from the input data columns; the ‘vif’ column holds their variance inflation factors.
Notes
This function does not save the auxiliary regression.
See also
statsmodels.stats.outliers_influence.variance_inflation_factor()
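The stepwise procedure described for feature_selection_vif can be sketched roughly as follows. This is an illustrative NumPy version, not the library's implementation: the helper names `vif` and `drop_by_vif` are hypothetical, each auxiliary regression here includes an intercept (an assumption), and the inputs are a plain array plus a list of column names rather than a DataFrame.

```python
import numpy as np


def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the others."""
    y = X[:, j]
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(y)), others])  # assumed intercept term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    centered = y - y.mean()
    r2 = 1.0 - (resid @ resid) / (centered @ centered)
    return 1.0 / (1.0 - r2)


def drop_by_vif(X, names, thresh=5.0):
    """Drop the column with the largest VIF until every VIF is <= thresh."""
    names = list(names)
    dropped = []  # (name, vif) pairs, in the order they were removed
    while X.shape[1] > 1:
        vifs = [vif(X, j) for j in range(X.shape[1])]
        worst = int(np.argmax(vifs))
        if vifs[worst] <= thresh:
            break
        dropped.append((names.pop(worst), vifs[worst]))
        X = np.delete(X, worst, axis=1)
    return X, names, dropped


# Synthetic data: column c is almost exactly a + b, so it is highly collinear.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + 0.01 * rng.normal(size=200)
X = np.column_stack([a, b, c])

filtered, kept, dropped = drop_by_vif(X, ["a", "b", "c"], thresh=5.0)
print(kept)
print(dropped)
```

Only one variable is removed per pass because dropping a single collinear column changes the VIFs of all remaining columns, so they have to be recomputed before the next decision.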
References