In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: from scipy import stats
In [4]: from matplotlib import pyplot as plt
In this example, we use the Star98 dataset, which was taken with permission from Jeff Gill (2000), Generalized Linear Models: A Unified Approach. Codebook information can be obtained by typing:
In [5]: print sm.datasets.star98.NOTE
Number of Observations - 303 (counties in California).
Number of Variables - 13 and 8 interaction terms.
Definition of variable names::
NABOVE - Total number of students above the national median for the math
section.
NBELOW - Total number of students below the national median for the math
section.
LOWINC - Percentage of low income students
PERASIAN - Percentage of Asian students
PERBLACK - Percentage of black students
PERHISP - Percentage of Hispanic students
PERMINTE - Percentage of minority teachers
AVYRSEXP - Sum of teachers' years in educational service divided by the
number of teachers.
AVSALK - Total salary budget including benefits divided by the number of
full-time teachers (in thousands)
PERSPENK - Per-pupil spending (in thousands)
PTRATIO - Pupil-teacher ratio.
PCTAF - Percentage of students taking UC/CSU prep courses
PCTCHRT - Percentage of charter schools
PCTYRRND - Percentage of year-round schools
The variables below are interaction terms of the variables defined above.
PERMINTE_AVYRSEXP
PERMINTE_AVSAL
AVYRSEXP_AVSAL
PERSPEN_PTRATIO
PERSPEN_PCTAF
PTRATIO_PCTAF
PERMINTE_AVYRSEXP_AVSAL
PERSPEN_PTRATIO_PCTAF
Load the data and add a constant to the exogenous (independent) variables:
In [6]: data = sm.datasets.star98.load()
In [7]: data.exog = sm.add_constant(data.exog, prepend=False)
The dependent variable is N by 2 (Success: NABOVE, Failure: NBELOW):
In [8]: print data.endog[:5, :]
[[ 452. 355.]
[ 144. 40.]
[ 337. 234.]
[ 395. 178.]
[ 8. 57.]]
The independent variables include all the other variables described above, as well as the interaction terms:
In [9]: print data.exog[:2, :]
[[ 34.3973 23.2993 14.2353 11.4111 15.9184 14.7065
59.1573 4.4452 21.7102 57.0328 0. 22.2222
234.1029 941.6881 869.9948 96.5066 253.5224 1238.1955
13848.8985 5504.0352 1. ]
[ 17.3651 29.3284 8.2349 9.3149 13.6364 16.0832
59.504 5.2676 20.4428 64.6226 0. 0.
219.3169 811.4176 957.0166 107.6843 340.4061 1321.0664
13050.2233 6958.8468 1. ]]
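As a quick sanity check (not part of the original session), one can verify that the interaction columns are element-wise products of the corresponding base columns. The column indices below are inferred from the variable order in the codebook (PERSPENK is column 7, PTRATIO column 8, PERSPEN_PTRATIO column 15), and the check assumes the interaction columns were computed directly from the base columns; the constant added above sits in the last column, so these indices are unchanged:

# PERSPEN_PTRATIO (column 15) should equal PERSPENK * PTRATIO
# (columns 7 and 8).
print(np.allclose(data.exog[:, 7] * data.exog[:, 8], data.exog[:, 15]))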
In [10]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())
In [11]: res = glm_binom.fit()
In [12]: print res.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: ['y1', 'y2'] No. Observations: 303
Model: GLM Df Residuals: 282
Model Family: Binomial Df Model: 20
Link Function: logit Scale: 1.0
Method: IRLS Log-Likelihood: -2998.6
Date: Fri, 16 Aug 2013 Deviance: 4078.8
Time: 07:19:37 Pearson chi2: 4.05e+03
No. Iterations: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0168 0.000 -38.749 0.000 -0.018 -0.016
x2 0.0099 0.001 16.505 0.000 0.009 0.011
x3 -0.0187 0.001 -25.182 0.000 -0.020 -0.017
x4 -0.0142 0.000 -32.818 0.000 -0.015 -0.013
x5 0.2545 0.030 8.498 0.000 0.196 0.313
x6 0.2407 0.057 4.212 0.000 0.129 0.353
x7 0.0804 0.014 5.775 0.000 0.053 0.108
x8 -1.9522 0.317 -6.162 0.000 -2.573 -1.331
x9 -0.3341 0.061 -5.453 0.000 -0.454 -0.214
x10 -0.1690 0.033 -5.169 0.000 -0.233 -0.105
x11 0.0049 0.001 3.921 0.000 0.002 0.007
x12 -0.0036 0.000 -15.878 0.000 -0.004 -0.003
x13 -0.0141 0.002 -7.391 0.000 -0.018 -0.010
x14 -0.0040 0.000 -8.450 0.000 -0.005 -0.003
x15 -0.0039 0.001 -4.059 0.000 -0.006 -0.002
x16 0.0917 0.015 6.321 0.000 0.063 0.120
x17 0.0490 0.007 6.574 0.000 0.034 0.064
x18 0.0080 0.001 5.362 0.000 0.005 0.011
x19 0.0002 2.99e-05 7.428 0.000 0.000 0.000
x20 -0.0022 0.000 -6.445 0.000 -0.003 -0.002
const 2.9589 1.547 1.913 0.056 -0.073 5.990
==============================================================================
Total number of trials for the first observation:
In [13]: print data.endog[0].sum()
807.0
Parameter estimates:
In [14]: print res.params
[-0.0168 0.0099 -0.0187 -0.0142 0.2545 0.2407 0.0804 -1.9522 -0.3341
-0.169 0.0049 -0.0036 -0.0141 -0.004 -0.0039 0.0917 0.049 0.008
0.0002 -0.0022 2.9589]
The corresponding t-values:
In [15]: print res.tvalues
[-38.7491 16.5047 -25.1822 -32.8179 8.4983 4.2125 5.775 -6.1619
-5.4532 -5.1687 3.9212 -15.8783 -7.3909 -8.4496 -4.0592 6.3211
6.5743 5.3623 7.4281 -6.4451 1.913 ]
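Because the model uses a logit link, exponentiated coefficients can be read as odds ratios: a one-unit increase in a regressor multiplies the odds of a student scoring above the national median by exp(coef). A minimal sketch continuing from the fitted model above (not part of the original session):

# Odds ratios for the first few regressors; for example, each additional
# percentage point of low income students (x1 = LOWINC) multiplies the
# odds of scoring above the median by about exp(-0.0168) ~ 0.983.
odds_ratios = np.exp(res.params)
print(odds_ratios[:5])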
First differences: we hold all other explanatory variables constant at their means and vary the percentage of low income students (LOWINC) to assess its impact on the response:
In [16]: means = data.exog.mean(axis=0)
In [17]: means25 = means.copy()
In [18]: means25[0] = stats.scoreatpercentile(data.exog[:, 0], 25)
In [19]: means75 = means.copy()
In [20]: means75[0] = lowinc_75per = stats.scoreatpercentile(data.exog[:, 0], 75)
In [21]: resp_25 = res.predict(means25)
In [22]: resp_75 = res.predict(means75)
In [23]: diff = resp_75 - resp_25
The interquartile first difference for the percentage of low income students in a school district is:
In [24]: print '%2.4f%%' % (diff * 100)
-11.8753%
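The same calculation can be wrapped in a small helper to compute the interquartile first difference for any regressor; a sketch continuing from the session above (the function name is ours, not part of statsmodels):

def interquartile_first_difference(results, exog, col):
    # Hold every regressor at its mean, move column `col` from its 25th
    # to its 75th percentile, and return the change in the predicted
    # probability of success.
    lo = exog.mean(axis=0)
    hi = exog.mean(axis=0)
    lo[col] = stats.scoreatpercentile(exog[:, col], 25)
    hi[col] = stats.scoreatpercentile(exog[:, col], 75)
    return results.predict(hi) - results.predict(lo)

print(interquartile_first_difference(res, data.exog, 0))  # LOWINC, as above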
We extract information that will be used to draw some interesting plots:
In [25]: nobs = res.nobs
In [26]: y = data.endog[:, 0] / data.endog.sum(1)
In [27]: yhat = res.mu
Plot yhat vs y:
In [28]: plt.figure();
In [29]: plt.scatter(yhat, y);
In [30]: line_fit = sm.OLS(y, sm.add_constant(yhat, prepend=False)).fit().params
In [31]: fit = lambda x: line_fit[1] + line_fit[0] * x # better way in scipy?
In [32]: plt.plot(np.linspace(0, 1, nobs), fit(np.linspace(0, 1, nobs)));
In [33]: plt.title('Model Fit Plot');
In [34]: plt.ylabel('Observed values');
In [35]: plt.xlabel('Fitted values');
Plot yhat vs. Pearson residuals:
In [36]: plt.figure();
In [37]: plt.scatter(yhat, res.resid_pearson);
In [38]: plt.plot([0.0, 1.0], [0.0, 0.0], 'k-');
In [39]: plt.title('Residual Dependence Plot');
In [40]: plt.ylabel('Pearson Residuals');
In [41]: plt.xlabel('Fitted values');
Histogram of standardized deviance residuals:
In [42]: plt.figure();
In [43]: resid = res.resid_deviance.copy()
In [44]: resid_std = (resid - resid.mean()) / resid.std()
In [45]: plt.hist(resid_std, bins=25);
In [46]: plt.title('Histogram of standardized deviance residuals');
QQ plot of the deviance residuals:
In [47]: from statsmodels import graphics
In [48]: graphics.gofplots.qqplot(resid, line='r');
In the example above, we printed the NOTE attribute to learn about the Star98 dataset. Statsmodels datasets ship with other useful information. For example:
In [49]: print sm.datasets.scotland.DESCRLONG
This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts. This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.
The original source files and variable information are included in
/scotland/src/
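Other metadata attributes, such as DESCRSHORT and SOURCE, are available in the same way (the exact set of attributes may vary slightly between statsmodels versions):

# Shorter description and data source for the same dataset.
print(sm.datasets.scotland.DESCRSHORT)
print(sm.datasets.scotland.SOURCE)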
Load the data and add a constant to the exogenous variables:
In [50]: data2 = sm.datasets.scotland.load()
In [51]: data2.exog = sm.add_constant(data2.exog, prepend=False)
In [52]: print data2.exog[:5, :]
[[ 712. 21. 105. 82.4 13566. 12.3 14952. 1. ]
[ 643. 26.5 97. 80.2 13566. 15.3 17039.5 1. ]
[ 679. 28.3 113. 86.3 9611. 13.9 19215.7 1. ]
[ 801. 27.1 109. 80.4 9483. 13.6 21707.1 1. ]
[ 753. 22. 115. 64.7 9265. 14.6 16566. 1. ]]
In [53]: print data2.endog[:5]
[ 60.3 52.3 53.4 57. 68.7]
In [54]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())
In [55]: glm_results = glm_gamma.fit()
In [56]: print glm_results.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 32
Model: GLM Df Residuals: 24
Model Family: Gamma Df Model: 7
Link Function: inverse_power Scale: 0.00358428317349
Method: IRLS Log-Likelihood: -83.017
Date: Fri, 16 Aug 2013 Deviance: 0.087389
Time: 07:19:39 Pearson chi2: 0.0860
No. Iterations: 5
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 4.962e-05 1.62e-05 3.060 0.002 1.78e-05 8.14e-05
x2 0.0020 0.001 3.824 0.000 0.001 0.003
x3 -7.181e-05 2.71e-05 -2.648 0.008 -0.000 -1.87e-05
x4 0.0001 4.06e-05 2.757 0.006 3.23e-05 0.000
x5 -1.468e-07 1.24e-07 -1.187 0.235 -3.89e-07 9.56e-08
x6 -0.0005 0.000 -2.159 0.031 -0.001 -4.78e-05
x7 -2.427e-06 7.46e-07 -3.253 0.001 -3.89e-06 -9.65e-07
const -0.0178 0.011 -1.548 0.122 -0.040 0.005
==============================================================================
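To see how the Gamma fit tracks the data on the response scale, the fitted means can be compared with the observed percentages; a quick check continuing from the session above (not part of the original output):

# Fitted mean response vs. observed percentage of Yes votes for the
# first five council districts.
print(glm_results.mu[:5])
print(data2.endog[:5])

Finally, we generate some artificial data and fit a Gaussian GLM with a (noncanonical) log link: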
In [57]: nobs2 = 100
In [58]: x = np.arange(nobs2)
In [59]: np.random.seed(54321)
In [60]: X = np.column_stack((x, x**2))
In [61]: X = sm.add_constant(X, prepend=False)
In [62]: lny = np.exp(-(.03 * x + .0001 * x**2 - 1.0)) + .001 * np.random.rand(nobs2)
In [63]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))
In [64]: gauss_log_results = gauss_log.fit()
In [65]: print gauss_log_results.summary()
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: y No. Observations: 100
Model: GLM Df Residuals: 97
Model Family: Gaussian Df Model: 2
Link Function: log Scale: 1.05311425588e-07
Method: IRLS Log-Likelihood: 662.92
Date: Fri, 16 Aug 2013 Deviance: 1.0215e-05
Time: 07:19:39 Pearson chi2: 1.02e-05
No. Iterations: 6
==============================================================================
coef std err t P>|t| [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1 -0.0300 5.6e-06 -5361.316 0.000 -0.030 -0.030
x2 -9.939e-05 1.05e-07 -951.091 0.000 -9.96e-05 -9.92e-05
const 1.0003 5.39e-05 1.86e+04 0.000 1.000 1.000
==============================================================================
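As a quick check (not part of the original session), the fitted coefficients can be compared with the values used to simulate the data: lny was generated as exp(-(0.03 * x + 0.0001 * x**2 - 1.0)) plus a small amount of noise, so on the log scale the coefficients should be close to -0.03, -0.0001, and 1.0.

# The recovered coefficients are close to the generating values:
# x1 ~ -0.03, x2 ~ -0.0001, const ~ 1.0.
print(gauss_log_results.params)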