In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: import matplotlib.pyplot as plt
In [4]: from statsmodels.sandbox.regression.predstd import wls_prediction_std
In [5]: np.random.seed(9876789)
In [6]: nsample = 100
In [7]: x = np.linspace(0, 10, 100)
In [8]: X = np.column_stack((x, x**2))
In [9]: beta = np.array([1, 0.1, 10])
In [10]: e = np.random.normal(size=nsample)
Our model needs an intercept so we add a column of 1s:
In [11]: X = sm.add_constant(X, prepend=False)
In [12]: y = np.dot(X, beta) + e
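For reference, sm.add_constant with prepend=False simply appends the column of ones; a minimal hand-built equivalent (a sketch using the x and nsample defined above; X_manual is an illustrative name, not part of the example):

# Hand-built design matrix [x, x**2, constant], matching add_constant(..., prepend=False).
X_manual = np.column_stack((x, x**2, np.ones(nsample)))
# prepend=True would place the constant as the first column instead.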
Inspect the data
In [13]: print X[:5, :]
[[ 0. 0. 1. ]
[ 0.101 0.0102 1. ]
[ 0.202 0.0408 1. ]
[ 0.303 0.0918 1. ]
[ 0.404 0.1632 1. ]]
In [14]: print y[:5]
[ 9.1595 11.6995 10.6716 9.8041 13.3547]
In [15]: model = sm.OLS(y, X)
In [16]: results = model.fit()
In [17]: print results.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.968
Model: OLS Adj. R-squared: 0.968
Method: Least Squares F-statistic: 1479.
Date: Fri, 16 Aug 2013 Prob (F-statistic): 2.19e-73
Time: 08:00:02 Log-Likelihood: -146.51
No. Observations: 100 AIC: 299.0
Df Residuals: 97 BIC: 306.8
Df Model: 2
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.8598 0.145 5.948 0.000 0.573 1.147
x2 0.1103 0.014 7.883 0.000 0.082 0.138
const 10.3423 0.313 33.072 0.000 9.722 10.963
==============================================================================
Omnibus: 2.042 Durbin-Watson: 2.274
Prob(Omnibus): 0.360 Jarque-Bera (JB): 1.875
Skew: 0.234 Prob(JB): 0.392
Kurtosis: 2.519 Cond. No. 144.
==============================================================================
Quantities of interest can be extracted directly from the fitted model. Type dir(results) for a full list. Here are some examples:
In [18]: print results.params
[ 0.8598 0.1103 10.3423]
In [19]: print results.rsquared
0.968242518536
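A few further quantities that dir(results) exposes; a short, non-exhaustive sketch (all attribute names below are standard on fitted OLS results objects):

print(results.bse)               # standard errors of the coefficient estimates
print(results.conf_int())        # 95% confidence intervals, one row per coefficient
print(results.fittedvalues[:5])  # in-sample predictions X @ params
print(results.resid[:5])         # residuals: y minus the fitted values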
Now we simulate a non-linear relationship between x and y; the model remains linear in the parameters:
In [20]: nsample = 50
In [21]: sig = 0.5
In [22]: x = np.linspace(0, 20, nsample)
In [23]: X = np.c_[x, np.sin(x), (x - 5)**2, np.ones(nsample)]
In [24]: beta = [0.5, 0.5, -0.02, 5.]
In [25]: y_true = np.dot(X, beta)
In [26]: y = y_true + sig * np.random.normal(size=nsample)
In [27]: res = sm.OLS(y, X).fit()
In [28]: print res.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.933
Model: OLS Adj. R-squared: 0.928
Method: Least Squares F-statistic: 211.8
Date: Fri, 16 Aug 2013 Prob (F-statistic): 6.30e-27
Time: 08:00:02 Log-Likelihood: -34.438
No. Observations: 50 AIC: 76.88
Df Residuals: 46 BIC: 84.52
Df Model: 3
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 0.4687 0.026 17.751 0.000 0.416 0.522
x2 0.4836 0.104 4.659 0.000 0.275 0.693
x3 -0.0174 0.002 -7.507 0.000 -0.022 -0.013
const 5.2058 0.171 30.405 0.000 4.861 5.550
==============================================================================
Omnibus: 0.655 Durbin-Watson: 2.896
Prob(Omnibus): 0.721 Jarque-Bera (JB): 0.360
Skew: 0.207 Prob(JB): 0.835
Kurtosis: 3.026 Cond. No. 221.
==============================================================================
Extract other quantities of interest
In [29]: print res.params
[ 0.4687 0.4836 -0.0174 5.2058]
In [30]: print res.bse
[ 0.0264 0.1038 0.0023 0.1712]
In [31]: print res.predict()
[ 4.7707 5.2221 5.6362 5.9866 6.2564 6.4412 6.5493 6.6009
6.6243 6.6518 6.7138 6.8341 7.0262 7.2905 7.6149 7.9763
8.3446 8.6876 8.9764 9.19 9.3187 9.3659 9.3474 9.2889
9.2217 9.1775 9.1834 9.2571 9.4044 9.6181 9.879 10.1591
10.4266 10.6505 10.8063 10.8795 10.8683 10.7838 10.6483 10.4913
10.3452 10.2393 10.1957 10.2249 10.3249 10.4808 10.6678 10.8549
11.0101 11.1058]
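Because y_true is known in this simulation, we can also measure how far the fitted curve sits from the truth; a small sketch (the exact value depends on the random draw):

# Root-mean-square deviation of the fit from the known true curve; with 4
# parameters and 50 observations it is typically well below sig = 0.5.
rmse_vs_true = np.sqrt(np.mean((res.fittedvalues - y_true) ** 2))
print(rmse_vs_true)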
Draw a plot to compare the true relationship to OLS predictions. Confidence intervals around the predictions are built using the wls_prediction_std command.
In [32]: plt.figure();
In [33]: plt.plot(x, y, 'o', x, y_true, 'b-');
In [34]: prstd, iv_l, iv_u = wls_prediction_std(res);
In [35]: plt.plot(x, res.fittedvalues, 'r--.');
In [36]: plt.plot(x, iv_u, 'r--');
In [37]: plt.plot(x, iv_l, 'r--');
In [38]: plt.title('blue: true, red: OLS');
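More recent statsmodels releases (0.8 and later) expose the same intervals through results.get_prediction; a sketch, assuming such a version is installed:

# get_prediction returns both confidence and prediction intervals.
pred = res.get_prediction()
frame = pred.summary_frame(alpha=0.05)
# obs_ci_lower / obs_ci_upper play the role of iv_l / iv_u above.
plt.plot(x, frame['obs_ci_lower'], 'g--')
plt.plot(x, frame['obs_ci_upper'], 'g--')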
We create 3 groups which will be modelled using dummy variables. Group 0 is the omitted/benchmark category.
In [39]: nsample = 50
In [40]: groups = np.zeros(nsample, int)
In [41]: groups[20:40] = 1
In [42]: groups[40:] = 2
In [43]: dummy = (groups[:, None] == np.unique(groups)).astype(float)
In [44]: x = np.linspace(0, 20, nsample)
In [45]: X = np.c_[x, dummy[:, 1:], np.ones(nsample)]
In [46]: beta = [1., 3, -3, 10]
In [47]: y_true = np.dot(X, beta)
In [48]: e = np.random.normal(size=nsample)
In [49]: y = y_true + e
Inspect the data
In [50]: print X[:5, :]
[[ 0. 0. 0. 1. ]
[ 0.4082 0. 0. 1. ]
[ 0.8163 0. 0. 1. ]
[ 1.2245 0. 0. 1. ]
[ 1.6327 0. 0. 1. ]]
In [51]: print y[:5]
[ 9.2822 10.5048 11.8439 10.3851 12.3794]
In [52]: print groups
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 2 2 2 2 2 2 2 2 2 2]
In [53]: print dummy[:5, :]
[[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]
[ 1. 0. 0.]]
In [54]: res2 = sm.OLS(y, X).fit()
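An equivalent fit can be written with the formula interface, which builds the indicator columns and drops the benchmark level automatically; a sketch, assuming pandas and statsmodels.formula.api are available (df and res_formula are illustrative names):

import pandas as pd
import statsmodels.formula.api as smf

df = pd.DataFrame({'y': y, 'x': x, 'group': groups})
# C(group) expands into dummies for groups 1 and 2; group 0 is the benchmark.
res_formula = smf.ols('y ~ x + C(group)', data=df).fit()
print(res_formula.params)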
In [56]: print res2.params
[ 0.9999 2.8909 -3.2232 10.1031]
In [57]: print res2.bse
[ 0.0599 0.5689 0.927 0.3102]
Draw a plot to compare the true relationship to OLS predictions.
In [59]: prstd, iv_l, iv_u = wls_prediction_std(res2);
In [60]: plt.figure();
In [61]: plt.plot(x, y, 'o', x, y_true, 'b-');
In [62]: plt.plot(x, res2.fittedvalues, 'r--.');
In [63]: plt.plot(x, iv_u, 'r--');
In [64]: plt.plot(x, iv_l, 'r--');
In [65]: plt.title('blue: true, red: OLS');
We want to test the hypothesis that both coefficients on the dummy variables are equal to zero, that is, R × β = 0. An F test leads us to strongly reject the null hypothesis of an identical constant term across the 3 groups:
In [66]: R = [[0, 1, 0, 0], [0, 0, 1, 0]]
In [67]: print np.array(R)
[[0 1 0 0]
[0 0 1 0]]
In [68]: print res2.f_test(R)
<F test: F=array([[ 145.4927]]), p=[[ 0.]], df_denom=46, df_num=2>
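The same F statistic can be reproduced by hand from the residual sums of squares of the restricted model (both dummy coefficients forced to zero) and the unrestricted one; a sketch using the x, y and res2 defined above (the *_restricted names are illustrative):

# Restricted model: drop the two dummy columns, keep x and the constant.
X_restricted = np.c_[x, np.ones(nsample)]
res_restricted = sm.OLS(y, X_restricted).fit()

q = 2                      # number of restrictions being tested
F_manual = ((res_restricted.ssr - res2.ssr) / q) / (res2.ssr / res2.df_resid)
print(F_manual)            # should match the F value from res2.f_test(R)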
We want to test the null hypothesis that the effects of the 2nd and 3rd groups are equal, i.e. that their difference is zero. The t test allows us to reject this null (but note the one-sided p-value):
In [69]: R = [0, 1, -1, 0]
In [70]: print res2.t_test(R)
<T test: effect=array([ 6.114]), sd=array([[ 0.5111]]), t=array([[ 11.9616]]), p=array([[ 0.]]), df_denom=46>
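For a single linear restriction the F statistic is just the square of the t statistic; a quick cross-check, assuming a statsmodels version whose test results expose the tvalue and fvalue attributes (recent releases do):

t_res = res2.t_test([0, 1, -1, 0])
f_res = res2.f_test([[0, 1, -1, 0]])
# With one restriction, F = t**2 (up to floating-point noise).
print(t_res.tvalue ** 2)
print(f_res.fvalue)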
If we generate artificial data with smaller group effects, the same test (run here as an F test on the single restriction) can no longer reject the null hypothesis:
In [71]: beta = [1., 0.3, -0.0, 10]
In [72]: y_true = np.dot(X, beta)
In [73]: y = y_true + np.random.normal(size=nsample)
In [74]: res3 = sm.OLS(y, X).fit()
In [75]: print res3.f_test(R)
<F test: F=array([[ 0.0151]]), p=[[ 0.9028]], df_denom=46, df_num=1>
The Longley dataset is well known to have high multicollinearity, that is, the exogenous predictors are highly correlated. This is problematic because it can affect the stability of our coefficient estimates as we make minor changes to model specification.
In [76]: from statsmodels.datasets.longley import load
In [77]: y = load().endog
In [78]: X = load().exog
In [79]: X = sm.tools.add_constant(X, prepend=False)
In [80]: ols_model = sm.OLS(y, X)
In [81]: ols_results = ols_model.fit()
In [82]: print ols_results.summary()
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.995
Model: OLS Adj. R-squared: 0.992
Method: Least Squares F-statistic: 330.3
Date: Fri, 16 Aug 2013 Prob (F-statistic): 4.98e-10
Time: 08:00:04 Log-Likelihood: -109.62
No. Observations: 16 AIC: 233.2
Df Residuals: 9 BIC: 238.6
Df Model: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 15.0619 84.915 0.177 0.863 -177.029 207.153
x2 -0.0358 0.033 -1.070 0.313 -0.112 0.040
x3 -2.0202 0.488 -4.136 0.003 -3.125 -0.915
x4 -1.0332 0.214 -4.822 0.001 -1.518 -0.549
x5 -0.0511 0.226 -0.226 0.826 -0.563 0.460
x6 1829.1515 455.478 4.016 0.003 798.788 2859.515
const -3.482e+06 8.9e+05 -3.911 0.004 -5.5e+06 -1.47e+06
==============================================================================
Omnibus: 0.749 Durbin-Watson: 2.559
Prob(Omnibus): 0.688 Jarque-Bera (JB): 0.684
Skew: 0.420 Prob(JB): 0.710
Kurtosis: 2.434 Cond. No. 4.86e+09
==============================================================================
Warnings:
[1] The condition number is large, 4.86e+09. This might indicate that there are
strong multicollinearity or other numerical problems.
One way to assess multicollinearity is to compute the condition number. Values over 20 are worrisome (see Greene 4.9). The first step is to normalize the independent variables to have unit length:
In [83]: norm_x = np.ones_like(X)
In [84]: for i in range(int(ols_model.df_model)):
....: norm_x[:, i] = X[:, i] / np.linalg.norm(X[:, i])
....:
In [85]: norm_xtx = np.dot(norm_x.T, norm_x)
Then we take the square root of the ratio of the largest to the smallest eigenvalue:
In [86]: eigs = np.linalg.eigvals(norm_xtx)
In [87]: condition_number = np.sqrt(eigs.max() / eigs.min())
In [88]: print condition_number
56240.8685449
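As a cross-check, numpy's cond function gives the same number directly, and variance inflation factors are another standard collinearity diagnostic; a sketch, assuming statsmodels.stats.outliers_influence is available:

# 2-norm condition number of the normalized design; equals the eigenvalue-ratio figure above.
print(np.linalg.cond(norm_x))

# Variance inflation factor for each regressor (constant column excluded);
# a common rule of thumb treats values above 10 as a sign of strong collinearity.
from statsmodels.stats.outliers_influence import variance_inflation_factor
for i in range(X.shape[1] - 1):
    print(variance_inflation_factor(X, i))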
Greene also points out that dropping a single observation can have a dramatic effect on the coefficient estimates:
In [89]: ols_results2 = sm.OLS(y[:-1], X[:-1, :]).fit()
In [90]: res_dropped = ols_results.params / ols_results2.params * 100 - 100
In [91]: print 'Percentage change %4.2f%%\n' * 7 % tuple(i for i in res_dropped)
Percentage change -173.43%
Percentage change 31.04%
Percentage change 3.48%
Percentage change 7.83%
Percentage change -199.54%
Percentage change 15.39%
Percentage change 15.40%
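The same sensitivity can be examined observation by observation with the influence diagnostics; a sketch, assuming statsmodels.stats.outliers_influence is available:

from statsmodels.stats.outliers_influence import OLSInfluence

influence = OLSInfluence(ols_results)
# dfbetas[i, j]: standardized change in coefficient j when observation i is left out.
print(influence.dfbetas.shape)   # (16, 7): one row per observation, one column per coefficient
print(influence.dfbetas[-1])     # influence of the last observation on each coefficient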