Background
Scipy.stats.models was originally written by Jonathan Taylor.
For some time it was part of scipy but then removed from it. During
the Google Summer of Code 2009, stats.models was corrected, tested and
enhanced and released as a new package. Since then we have continued to
improve the existing models and added new statistical methods.
Main Features and Current Status
statsmodels 0.4 is a pure python package, with one optional cython based
extension that provides a considerable speed improvement for ARIMA estimation.
Future releases will depend on cython generated extensions.
statsmodels includes:
- regression: Generalized least squares (including weighted least squares and
least squares with autoregressive errors), ordinary least squares.
- glm: Generalized linear models with support for all of the one-parameter
exponential family distributions.
- discrete: regression with discrete dependent variables, including Logit, Probit,
MNLogit, Poisson, based on maximum likelihood estimators
- rlm: Robust linear models with support for several M-estimators.
- tsa: models for time series analysis
- univariate time series analysis: AR, ARIMA
- vector autoregressive models, VAR and structural VAR
- descriptive statistics and process models for time series analysis
- nonparametric : (Univariate) kernel density estimators
- datasets: Datasets to be distributed and used for examples and in testing.
- stats: a wide range of statistical tests
- diagnostics and specification tests
- goodness-of-fit and normality tests
- functions for multiple testing
- various additional statistical tests
- iolib: Tools for reading Stata .dta files into numpy arrays. (not yet ported to Python 3)
- iolob: printing table output to ascii, latex, and html
- miscellaneous models
statsmodels contains a sandbox folder, which includes some of the original
stats.models code that has not yet been rewritten and tested. The sandbox also
contains models and functions that we are currently developing. This code is
in various stages of development from early stages to almost finished, but
not sufficiently tested or with an API that is still in flux. Some of the code
in the advanced state covers among others Mixed (repeated measures) Models,
GARCH models, general method of moments (GMM) estimators, kernel regression and
kernel density estimation, and various extensions to scipy.stats.distributions.
The code is written for plain NumPy arrays so that statsmodels can be used
as a library for any kind of data structure users might have. However, in
order to make the data handling easier, some time series specific models
rely on pandas, and we have plans to integrate pandas in future releases of
statsmodels.
We have also included several datasets from the public domain and by
permission for tests and examples. The datasets are set up so that it is
easy to add more datasets.
Testing
Most results have been verified with at least one other statistical package: R,
Stata or SAS. The guiding principal for the initial rewrite and for continued
development is that all numbers have to be verified. Some statistical
methods are tested with Monte Carlo studies. While we strife to follow this
test driven approach, there is no guarantee that the code is bug-free and
always works. Some auxiliary function are still insufficiently tested, some
edge cases might not be correctly taken into account, and the possibility of
numerical problems is inherent to many of the statistical models. We
especially appreciate any help and reports for these kind of problems so we
can keep improving the existing models.
Looking Forward
We would like to invite everyone to give statsmodels a test drive, use it, and
report comments, possibilities for improvement and bugs to the statsmodels
mailing list http://groups.google.com/group/pystatsmodels or file tickets on our
issue tracker at https://github.com/statsmodels/statsmodels/issues
The source code is available from https://github.com/statsmodels/statsmodels.
Our plans for the future include improving the coverage of statistical
models, methods and tests that any basic statistics package should provide.
But the main direction for the expansion of statsmodels depends on the
requirements and interests of the developers and contributers.
The current maintainers are mostly interested in econometrics and time series
analysis, but we would like to invite any users or developers to contribute
their own extensions to existing models, or new models. To speed up
improvements that are waiting in the sandbox, any help with providing test
cases, reviewing or improving the code would be very appreciated.
Planned Extensions
Big changes that are planned for the next release will improve the
usability of statsmodels especially for interactive work.
- Metainformation about data and models: Currently the models essentially
use no information about the design matrix and just treat it as numpy
array. Some information like variable names are included with the wrapper
for use with Pandas or other data structures.
- Formulas similar to R: This will provide a faster way to interactively
define models and contrast matrices, and will provide additional
information especially for categorical variables. (Nathaniel Smith)
Various models that are work in progress where the time to inclusion in
statsmodels proper will depend on the available developer time and interests:
Bayesian dynamic linear models (Wes)
more Kalman filter based time series analysis (Skipper)
New models (roughly in order of completeness):
general method of moments (GMM) estimators, kernel regression,
kernel density estimation, various extensions to scipy.stats.distributions,
GARCH models, copulas, system of equation models, panel data models,
more discrete choice models, mixed effects models, survival models.
Resampling approaches like bootstrap and permutation for tests and estimator
statistics.
Code Stability
The existing models are mostly settled in their user interface and we do not
expect many changes anymore. One area that will need adjustment is how
formulas and meta information are included. New models that have just been
included might require adjustments as we gain more experience and obtain
feedback by users. As we expand the range of models, we keep improving the
framework for different estimators and statistical tests, so further changes
will be necessary.
In 0.3 we reorganized the internal location of the code and
import paths which will make future enhancements less interruptive. In 0.4
most models obtained a wrapper that stores and returns additional information
from richer data structures like data structures in Pandas and structured
arrays. In 0.4 also prediction has been improved in many cases and made more
consistent across models.
Although there is no guarantee yet on API stability, we try to keep changes
that require adjustments by existing users to a minimal level.
Financial Support
We are grateful for the financial support that we obtained for the
developement of statsmodels:
Google www.google.com : two Google Summer of Code,
GSOC 2009, GSOC 2010 and GSOC 2011
AQR www.aqr.com : financial sponsor for the work on
Vector Autoregressive Models (VAR) by Wes McKinney
We would also like to thank our hosting providers, github for the public code repository, sourceforge for hosting our documentation and python.org for making our downloads available on pypi.
Josef Perktold and Skipper Seabold
(maintainers)