# unipy.stats package

## Module contents

Utility Objects.

This module provides a number of functions and objects for utility.

### hypo_test

• f_test – F-Test.

• f_test_formula – F-Test by formula.

• anova_test – ANOVA Test.

• anova_test_formula – ANOVA Test by formula.

• chisq_test – Chi-square Test.

• fisher_test – Fisher’s Exact Test.

### feature_selection

• lasso_rank – Feature selection by LASSO regression.

• feature_selection_vif – VIF-based stepwise feature selection for multivariate analysis.

### metrics

• deviation – Deviation.

• vif – Variance inflation factor.

• mean_absolute_percentage_error – Mean Absolute Percentage Error.

• average_absolute_deviation – Average Absolute Deviation.

• median_absolute_deviation – Median Absolute Deviation.

• calculate_interaction – Feature interaction calculation.

### formula

• from_formula – R-style Formula Formatting.

`unipy.stats.deviation(container, method='mean', if_abs=True)`

Deviation.

`unipy.stats.vif(y, X)`

Variance inflation factor.

`unipy.stats.mean_absolute_percentage_error(measure, predict, thresh=3.0)`

Mean Absolute Percentage Error (MAPE). It expresses error as a percentage and measures the prediction accuracy of a forecasting method by comparing the measured values with the predicted values, for example in trend estimation. A MAPE of 5 means the predictions are off by about 5% on average. It cannot be used if any measured value is zero, because that would cause a division by zero.
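The definition above can be sketched with NumPy as follows. This is a minimal illustration of the standard MAPE formula, not the `unipy` implementation: the helper name `mape_sketch` is hypothetical, and the `thresh` parameter from the signature is not modelled because its meaning is not documented here.

```python
import numpy as np


def mape_sketch(measure, predict):
    """Standard MAPE: mean of |(measured - predicted) / measured|, in percent.

    Hypothetical helper for illustration; not part of unipy.
    """
    measure = np.asarray(measure, dtype=float)
    predict = np.asarray(predict, dtype=float)
    # A zero measured value would cause a division by zero, so reject it.
    if np.any(measure == 0):
        raise ValueError("MAPE is undefined when a measured value is zero")
    return float(np.mean(np.abs((measure - predict) / measure)) * 100)


# Relative errors are 10%, 5% and 0%, so the result is about 5 (percent).
result = mape_sketch([100, 200, 400], [110, 190, 400])
print(result)
```

Raising an error on zero measured values is one possible design choice; dropping or masking those rows would be another.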

`unipy.stats.average_absolute_deviation(measure, predict, thresh=2)`

Average Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method by comparing the measured values with the predicted values, for example in trend estimation. If MAD is 5, it means this prediction method potentially has…

`unipy.stats.median_absolute_deviation(measure, predict, thresh=2)`

Median Absolute Deviation. It is … It measures the prediction accuracy of a forecasting method by comparing the measured values with the predicted values, for example in trend estimation. If MAD is 5, it means this prediction method potentially has…

`unipy.stats.calculate_interaction(rankTbl, pvTbl, target, ranknum=10)`

Feature interaction calculation.

`unipy.stats.f_test(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)`

F-Test.

`unipy.stats.f_test_formula(a, b, scale=1, alternative='two-sided', conf_level=0.95, *args, **kwargs)`

F-Test by formula.

`unipy.stats.anova_test(formula, data=None, typ=1)`

ANOVA Test.

`unipy.stats.anova_test_formula(formula, data=None, typ=1)`

ANOVA Test by formula.

`unipy.stats.chisq_test(data, x=None, y=None, correction=None, lambda_=None, margin=True, print_ok=True)`

Chi-square Test.

`lambda_` gives the power in the Cressie-Read power divergence statistic. The default is 1. For convenience, lambda_ may be assigned one of the following strings, in which case the corresponding numerical value is used:

Parameters
• data (pandas.DataFrame) –

• x (str (default: None)) –

• y (str (default: None)) –

• correction ((default: None)) –

• lambda_ (lambda (default: None)) –

• margin (Boolean (default: True)) –

• print_ok (Boolean (default: True)) –
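To illustrate what the power `lambda_` controls, here is a minimal NumPy sketch of the Cressie-Read power divergence statistic for a contingency table. It is an illustration under stated assumptions, not the `unipy` code: the helper name `power_divergence_sketch` is hypothetical, all observed cells are assumed to be positive, no continuity correction is applied, and the limiting case `lambda_ = -1` is not handled.

```python
import numpy as np


def power_divergence_sketch(observed, lambda_=1.0):
    """Cressie-Read statistic for a two-way contingency table.

    Hypothetical helper for illustration; assumes every cell count is positive.
    """
    observed = np.asarray(observed, dtype=float)
    # Expected counts under independence: (row total * column total) / grand total.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    if lambda_ == 0:
        # Limit as lambda_ -> 0: the log-likelihood ratio (G) statistic.
        return float(2.0 * np.sum(observed * np.log(observed / expected)))
    terms = observed * ((observed / expected) ** lambda_ - 1.0)
    return float(2.0 / (lambda_ * (lambda_ + 1.0)) * terms.sum())


table = [[10, 20], [30, 40]]
pearson = power_divergence_sketch(table, lambda_=1.0)  # Pearson chi-square
g_stat = power_divergence_sketch(table, lambda_=0.0)   # log-likelihood ratio
print(pearson, g_stat)
```

With `lambda_ = 1` the expression reduces to the familiar Pearson statistic, the sum of (observed - expected)² / expected over all cells, which matches the documented default.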

`unipy.stats.fisher_test(data, x=None, y=None, alternative='two-sided', margin=True, print_ok=True)`

Fisher’s Exact Test.

`unipy.stats.lasso_rank(formula=None, X=None, y=None, data=None, alpha=array([1e-05, 0.00011, 0.00021, ..., 0.00981, 0.00991]), k=2, plot=False, *args, **kwargs)`

The default `alpha` holds 100 values, from 1e-05 to 0.00991 in steps of 0.0001.

Feature selection by LASSO regression.

Parameters
• formula – R-style formula string

• X (list-like) – Column values for X.

• y (list-like) – A column value for y.

• data (pandas.DataFrame) – A DataFrame.

• alpha (Iterable) – An iterable containing alpha values.

• k (int) – Threshold of coefficient matrix

• plot (Boolean (default: False)) – Set to True to plot the result.

Returns

• rankTbl (pandas.DataFrame) – Feature ranking by given `k`.

• minIntercept (pandas.DataFrame) – The minimum intercept row in coefficient matrix.

• coefMatrix (pandas.DataFrame) – A coefficient matrix.

• kBest (pandas.DataFrame) – The best intercept row in the coefficient matrix for the given `k`.

• kBestPredY (dict) – The predicted `Y` values at the `kBest` alpha.

Example

```
>>> import unipy.dataset.api as dm
>>> dm.init()
['cars', 'anscombe', 'iris', 'nutrients', 'german_credit_scoring_fars2008', 'winequality_red', 'winequality_white', 'titanic', 'car90', 'diabetes', 'adult', 'tips', 'births_big', 'breast_cancer', 'air_quality', 'births_small']
Dataset : winequality_red
>>>
>>> ranked, best_by_intercept, coefTbl, kBest, kBestPred = lasso_rank(X=wine_red.columns.drop('quality'), y=['quality'], data=wine_red)
>>> ranked
                  rank  lasso_coef  abs_coef
volatile_acidity     1   -0.675725  0.675725
alcohol              2    0.194865  0.194865
>>> best_by_intercept
                      RSS  Intercept  fixed_acidity  volatile_acidity  \
alpha_0.00121  691.956364   3.134874       0.002374         -1.023793

               citric_acid  residual_sugar  chlorides  free_sulfur_dioxide  \
alpha_0.00121          0.0             0.0  -0.272912                 -0.0

               total_sulfur_dioxide  density   pH  sulphates   alcohol  \
alpha_0.00121             -0.000963     -0.0 -0.0   0.505956  0.264552

               var_count
alpha_0.00121          6
>>>
```
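The default `alpha` values shown in the `lasso_rank` signature are evenly spaced, so a custom grid of the same shape can be built with NumPy. The snippet below is a sketch that assumes a regular grid; whether `unipy` constructs its default this way is not stated in the documentation.

```python
import numpy as np

# Assumption: a regular grid starting at 1e-05 with step 1e-04,
# stopping before 1e-02. This reproduces the 100 default values.
alphas = np.arange(1e-5, 1e-2, 1e-4)

print(len(alphas))
print(alphas[:3], alphas[-1])
```

An array built like this could be passed as the `alpha` argument, for example with a coarser step to shorten the search.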

`unipy.stats.feature_selection_vif(data, thresh=5.0)`

Stepwise Feature Selection for multivariate analysis.

It fits an OLS regression of each explanatory variable on the others and computes the variance inflation factor (VIF) of each variable. If the largest VIF exceeds the given threshold, that variable is dropped. This process is repeated until all VIFs are lower than the given threshold.

The recommended threshold is 5 or lower: a VIF greater than 5 indicates that the explanatory variable is highly collinear with the other explanatory variables, so its parameter estimates will have large standard errors.
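The stepwise procedure described above can be sketched with NumPy alone. This is an illustration, not the `unipy` implementation: the helper names `vif_values` and `stepwise_vif_sketch` are hypothetical, plain least squares stands in for the `statsmodels` routine, and the sketch assumes an intercept is included in every auxiliary regression.

```python
import numpy as np


def vif_values(X):
    """VIF per column: regress it on the other columns, VIF = 1 / (1 - R^2)."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Auxiliary design matrix: intercept plus all remaining columns.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
        out[j] = np.inf if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return out


def stepwise_vif_sketch(X, names, thresh=5.0):
    """Drop the column with the largest VIF until none exceeds thresh."""
    X = np.asarray(X, dtype=float)
    names = list(names)
    dropped = []  # (name, VIF at the time of dropping)
    while X.shape[1] > 1:
        vifs = vif_values(X)
        worst = int(np.argmax(vifs))
        if vifs[worst] <= thresh:
            break
        dropped.append((names.pop(worst), float(vifs[worst])))
        X = np.delete(X, worst, axis=1)
    return names, dropped


rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + b + rng.normal(scale=0.01, size=200)  # nearly collinear with a and b
kept, dropped = stepwise_vif_sketch(np.column_stack([a, b, c]), ["a", "b", "c"])
print(kept, dropped)
```

Dropping one variable per iteration, rather than every variable above the threshold at once, matters because removing a single collinear column can bring the remaining VIFs back under the threshold.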

Parameters
• data (DataFrame, (rows: observed values, columns: multivariate variables)) – design dataframe with all explanatory variables, as for example used in regression

• thresh (int, float) – A threshold of VIF

Returns

• Filtered_data (DataFrame) – A subset of the input DataFrame

• dropped_List (DataFrame) – The dropped variables. The ‘var’ column holds the names of the variables dropped from the input data columns; the ‘vif’ column holds the variance inflation factor of each dropped variable.

Notes

This function does not save the auxiliary regression.

See also

`statsmodels.stats.outliers_influence.variance_inflation_factor()`

`unipy.stats.from_formula(formula)`

R-style Formula Formatting.