unipy.stats package¶

Submodules¶

Module contents¶

Utility Objects.

This module provides a number of utility functions and objects.

hypo_test¶

f_test – F-Test.
f_test_formula – F-Test by formula.
anova_test – ANOVA Test.
anova_test_formula – ANOVA Test by formula.
chisq_test – Chi-square Test.
fisher_test – Fisher’s Exact Test.

feature_selection¶

lasso_rank – Feature selection by LASSO regression.
feature_selection_vif – VIF-based stepwise feature selection for multivariate analysis.

metrics¶

deviation – Deviation.
vif – Variance inflation factor.
mean_absolute_percentage_error – Mean Absolute Percentage Error.
average_absolute_deviation – Average Absolute Deviation.
median_absolute_deviation – Median Absolute Deviation.
calculate_interaction – Feature interaction calculation.

formula¶

from_formula – R-style Formula Formatting.

unipy.stats.deviation(container, method='mean', if_abs=True)[source]¶: Deviation.

unipy.stats.vif(y, X)[source]¶: Variance inflation factor.

unipy.stats.mean_absolute_percentage_error(measure, predict, thresh=3.0)[source]¶: Mean Absolute Percentage Error (MAPE). It expresses the prediction error as a percentage of the measured values. It is used in statistics to assess the accuracy of a forecasting method, for example in trend estimation, by comparing the real measured values with the predicted values. A MAPE of 5 means that the predictions deviate from the measured values by 5% on average. It cannot be used if any measured value is zero, because that would cause a division by zero.
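The MAPE definition above can be sketched with NumPy as follows. This is a minimal illustration of the standard formula, not the unipy implementation: the helper name `mape_sketch` is hypothetical, and the `thresh` parameter of the real function is not modelled because this section does not document its meaning.

```python
import numpy as np


def mape_sketch(measure, predict):
    """Mean absolute percentage error, in percent (hypothetical helper)."""
    measure = np.asarray(measure, dtype=float)
    predict = np.asarray(predict, dtype=float)
    # MAPE divides by the measured values, so zeros are rejected up front.
    if np.any(measure == 0):
        raise ValueError("MAPE is undefined when a measured value is zero.")
    return float(np.mean(np.abs((measure - predict) / measure)) * 100)


# Illustrative values: relative errors of 10%, 5% and 0% average to about 5%.
measured = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mape_sketch(measured, predicted))
```

Raising an error on zero measured values is one possible design choice for the division-by-zero case; masking or skipping those observations would be another.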

unipy.stats.average_absolute_deviation(measure, predict, thresh=2)[source]¶: Average Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method in statistics by comparing the real measured values with the predicted values, for example in trend estimation. If MAD is 5, it means this prediction method potentially has…

unipy.stats.median_absolute_deviation(measure, predict, thresh=2)[source]¶: Median Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method in statistics by comparing the real measured values with the predicted values, for example in trend estimation. If MAD is 5, it means this prediction method potentially has…

unipy.stats.calculate_interaction(rankTbl, pvTbl, target, ranknum=10)[source]¶: Feature interaction calculation.

unipy.stats.f_test(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)[source]¶: F-Test.

unipy.stats.f_test_formula(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)[source]¶: F-Test by formula.

unipy.stats.anova_test(formula, data=None, typ=1)[source]¶: ANOVA Test.

unipy.stats.anova_test_formula(formula, data=None, typ=1)[source]¶: ANOVA Test by formula.

unipy.stats.chisq_test(data, x=None, y=None, correction=None, lambda_=None, margin=True, print_ok=True)[source]¶

Chi-square Test.

lambda_ gives the power in the Cressie-Read power divergence statistic. The default is 1. For convenience, lambda_ may also be assigned a string that names a statistic, in which case the corresponding numerical value is used.

Parameters

data (pandas.DataFrame) –
x (str (default: None)) –
y (str (default: None)) –
correction (default: None) –
lambda_ (lambda (default: None)) –
margin (Boolean (default: True)) –
print_ok (Boolean (default: True)) –
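The wording of the `lambda_` description matches SciPy's power divergence documentation, so the sketch below uses `scipy.stats.chi2_contingency` directly to show how a string and its numerical value select the same statistic. It assumes, without confirmation from this section, that `chisq_test` forwards `lambda_` to SciPy; the contingency table is made up for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x3 contingency table of observed counts.
observed = np.array([[12, 30, 18],
                     [24, 20, 16]])

# lambda_=None (the SciPy default) computes Pearson's chi-squared statistic.
pearson = chi2_contingency(observed, correction=False)

# The string "pearson" maps to the power 1, so these calls agree with the default.
by_name = chi2_contingency(observed, correction=False, lambda_="pearson")
by_value = chi2_contingency(observed, correction=False, lambda_=1)

# The string "log-likelihood" maps to the power 0 (the G-test).
g_test = chi2_contingency(observed, correction=False, lambda_="log-likelihood")

# Index 0 is the test statistic, index 1 the p-value, index 2 the degrees of freedom.
print("Pearson statistic:", pearson[0])
print("G-test statistic: ", g_test[0])
print("Degrees of freedom:", pearson[2])
```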

unipy.stats.fisher_test(data, x=None, y=None, alternative='two-sided', margin=True, print_ok=True)[source]¶: Fisher’s Exact Test.

unipy.stats.lasso_rank(formula=None, X=None, y=None, data=None, alpha=array([1e-05, 0.00011, 0.00021, ..., 0.00981, 0.00991]), k=2, plot=False, *args, **kwargs)[source]¶

Feature selection by LASSO regression.

Parameters

formula – R-style formula string
X (list-like) – Column values for X.
y (list-like) – A column value for y.
data (pandas.DataFrame) – A DataFrame.
alpha (Iterable) – An iterable of alpha values. The default is 100 evenly spaced values from 1e-05 to 0.00991 (step 0.0001).
k (int) – Threshold of the coefficient matrix.
plot (Boolean (default: False)) – Set to True to plot the result.

Returns

rankTbl (pandas.DataFrame) – Feature ranking by given k.
minIntercept (pandas.DataFrame) – The minimum intercept row in coefficient matrix.
coefMatrix (pandas.DataFrame) – A coefficient matrix.
kBest (pandas.DataFrame) – The best intercept row in the coefficient matrix for the given k.
kBestPredY (dict) – The predicted y values obtained with the kBest alpha.

Example

>>> import unipy.dataset.api as dm
>>> dm.init()
['cars', 'anscombe', 'iris', 'nutrients', 'german_credit_scoring_fars2008', 'winequality_red', 'winequality_white', 'titanic', 'car90', 'diabetes', 'adult', 'tips', 'births_big', 'breast_cancer', 'air_quality', 'births_small']
>>> wine_red = dm.load('winequality_red')
Dataset : winequality_red
>>> from unipy.stats import lasso_rank
>>> ranked, best_by_intercept, coefTbl, kBest, kBestPred = lasso_rank(
...     X=wine_red.columns.drop('quality'), y=['quality'], data=wine_red)
>>> ranked
                  rank  lasso_coef  abs_coef
volatile_acidity     1   -0.675725  0.675725
alcohol              2    0.194865  0.194865
>>> best_by_intercept
                      RSS  Intercept  fixed_acidity  volatile_acidity  \
alpha_0.00121  691.956364   3.134874       0.002374         -1.023793

               citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  \
alpha_0.00121          0.0             0.0  -0.272912                 -0.0

               total_sulfur_dioxide  density   pH  sulphates   alcohol  \
alpha_0.00121             -0.000963     -0.0 -0.0   0.505956  0.264552

               var_count
alpha_0.00121          6
>>>
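The idea behind ranking features by LASSO coefficients can be sketched without unipy. The code below is an illustration under stated assumptions, not the library's implementation: the helpers `lasso_ista` and `rank_features` are hypothetical, the solver is a plain proximal gradient method (ISTA) rather than whatever solver unipy uses, the data are synthetic, and the alpha grid and `k` threshold of `lasso_rank` are not modelled.

```python
import numpy as np


def lasso_ista(X, y, alpha, n_iter=2000):
    """Fit a LASSO by proximal gradient descent (ISTA); hypothetical helper.

    Minimises (1 / (2 n)) * ||y - X w - b||^2 + alpha * ||w||_1.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean          # centring absorbs the intercept
    step = n / np.linalg.norm(Xc, 2) ** 2    # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = -Xc.T @ (yc - Xc @ w) / n
        z = w - step * grad
        # Soft-thresholding sets small coefficients exactly to zero.
        w = np.sign(z) * np.maximum(np.abs(z) - step * alpha, 0.0)
    return w, y_mean - x_mean @ w


def rank_features(names, coefs):
    """Rank features with non-zero coefficients by absolute coefficient."""
    order = np.argsort(-np.abs(coefs))
    return [(names[i], float(coefs[i])) for i in order if coefs[i] != 0.0]


# Synthetic data: only x0 and x2 influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=200)
names = ["x0", "x1", "x2", "x3"]

coefs, intercept = lasso_ista(X, y, alpha=0.1)
for rank, (name, coef) in enumerate(rank_features(names, coefs), start=1):
    print(rank, name, round(coef, 3))
```

Because the L1 penalty drives the coefficients of irrelevant features to exactly zero, only the informative features remain in the ranking, ordered by the magnitude of their coefficients.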

unipy.stats.feature_selection_vif(data, thresh=5.0)[source]¶

Stepwise Feature Selection for multivariate analysis.

It fits an OLS regression for each explanatory variable against all the others and calculates the resulting variance inflation factors (VIFs). If the largest VIF is over the given threshold, the corresponding variable is dropped. This process is repeated until all VIFs are lower than the given threshold.

A threshold of 5 or lower is recommended: a VIF greater than 5 indicates that the explanatory variable is highly collinear with the other explanatory variables, so its parameter estimates will have large standard errors.

Parameters

data (DataFrame, (rows: observed values, columns: multivariate variables)) – design dataframe with all explanatory variables, as for example used in regression
thresh (int, float) – A threshold of VIF

Returns

Filtered_data (DataFrame) – A subset of the input DataFrame.
dropped_List (DataFrame) – The dropped variables. The ‘var’ column holds the names of the variables dropped from the input data columns, and the ‘vif’ column holds the variance inflation factors of those variables.

Notes

This function does not save the auxiliary regression.
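The stepwise procedure described for `feature_selection_vif` can be sketched with NumPy alone. This is a minimal illustration under stated assumptions rather than the unipy implementation: the helpers `vif_values` and `drop_by_vif` are hypothetical, they work on plain arrays instead of a DataFrame, and the synthetic data are built so that one column is nearly a linear combination of two others.

```python
import numpy as np


def vif_values(X):
    """VIF of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    vifs = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(len(target)), others])  # add intercept
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ beta
        r2 = 1.0 - resid @ resid / np.sum((target - target.mean()) ** 2)
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


def drop_by_vif(X, names, thresh=5.0):
    """Repeatedly drop the column with the largest VIF above thresh."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    dropped = []
    while X.shape[1] > 1:
        vifs = vif_values(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= thresh:
            break
        dropped.append((names.pop(worst), float(vifs[worst])))
        X = np.delete(X, worst, axis=1)
    return X, names, dropped


# Synthetic data: c is almost exactly a + b, so a, b and c are collinear.
rng = np.random.default_rng(0)
a, b, d = rng.normal(size=(3, 300))
c = a + b + 0.01 * rng.normal(size=300)
data = np.column_stack([a, b, c, d])

filtered, kept, dropped = drop_by_vif(data, ["a", "b", "c", "d"], thresh=5.0)
print("kept:", kept)
print("dropped:", dropped)
```

Recomputing every VIF after each removal matters here: dropping a single member of the collinear group brings the remaining VIFs below the threshold, so no further variables need to be removed.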