Main Usage
To load a dataset, do the following:
>>> import scikits.statsmodels.api as sm
>>> data = sm.datasets.longley.load()
The Dataset object follows the bunch pattern as explained in the proposal.
Most datasets have two attributes that are of particular interest for examples, endog and exog:
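The bunch pattern mentioned above can be illustrated with a minimal sketch: a dictionary whose keys are also readable as attributes. The `Bunch` class and the `demo` object below are hypothetical stand-ins written for illustration, not the library's actual Dataset implementation; the values are the first three TOTEMP observations of the Longley data.

```python
class Bunch(dict):
    """Dictionary whose keys can also be read and set as attributes."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Point the instance namespace at the dict itself so that
        # attribute access and key access share the same storage.
        self.__dict__ = self


# Hypothetical miniature dataset, built by hand for illustration only.
demo = Bunch(endog=[60323.0, 61122.0, 60171.0], endog_name="TOTEMP")

print(demo.endog_name)     # attribute access
print(demo["endog_name"])  # equivalent key access
```

Because both access styles share one storage, an attribute assigned later (for example `demo.names = [...]`) also becomes visible as a dictionary key.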
>>> data.endog
array([ 60323.,  61122.,  60171.,  61187.,  63221.,  63639.,  64989.,
        63761.,  66019.,  67857.,  68169.,  66513.,  68655.,  69564.,
        69331.,  70551.])
>>> data.exog
array([[  83. , 234289. , 2356. , 1590. , 107608. , 1947. ],
       [  88.5, 259426. , 2325. , 1456. , 108632. , 1948. ],
       [  88.2, 258054. , 3682. , 1616. , 109773. , 1949. ],
       [  89.5, 284599. , 3351. , 1650. , 110929. , 1950. ],
       [  96.2, 328975. , 2099. , 3099. , 112075. , 1951. ],
       [  98.1, 346999. , 1932. , 3594. , 113270. , 1952. ],
       [  99. , 365385. , 1870. , 3547. , 115094. , 1953. ],
       [ 100. , 363112. , 3578. , 3350. , 116219. , 1954. ],
       [ 101.2, 397469. , 2904. , 3048. , 117388. , 1955. ],
       [ 104.6, 419180. , 2822. , 2857. , 118734. , 1956. ],
       [ 108.4, 442769. , 2936. , 2798. , 120445. , 1957. ],
       [ 110.8, 444546. , 4681. , 2637. , 121950. , 1958. ],
       [ 112.6, 482704. , 3813. , 2552. , 123366. , 1959. ],
       [ 114.2, 502601. , 3931. , 2514. , 125368. , 1960. ],
       [ 115.7, 518173. , 4806. , 2572. , 127852. , 1961. ],
       [ 116.9, 554894. , 4007. , 2827. , 130081. , 1962. ]])
Univariate datasets, however, do not have an exog attribute. You can find out the variable names as follows:
>>> data.endog_name
'TOTEMP'
>>> data.exog_name
['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
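Since exog_name lists the names in column order, it can be used to label a row of exog or to find a column's position. The sketch below uses plain Python lists with values copied from the output above, so it runs without the library; the variable names `first_row` and `labeled` are illustrative only.

```python
# Names as reported by data.exog_name above.
exog_name = ['GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

# First row of data.exog, copied from the output shown earlier.
first_row = [83.0, 234289.0, 2356.0, 1590.0, 107608.0, 1947.0]

# Pair each column name with its value in that row.
labeled = dict(zip(exog_name, first_row))
print(labeled['GNP'])
print(labeled['YEAR'])

# Position of a named column, usable as a column index into exog.
unemp_col = exog_name.index('UNEMP')
print(unemp_col)
```

With the real dataset loaded, the same index could be used as `data.exog[:, unemp_col]` to select the UNEMP column.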
If a dataset has no clear interpretation of which variables should be endog and exog, you can always access the data or raw_data attributes instead. This is the case for the macrodata dataset, which is a collection of US macroeconomic data rather than a dataset built around a specific example. The data attribute contains a record array of the full dataset, and the raw_data attribute contains an ndarray of the same data; the column names are given by the names attribute.
>>> type(data.data)
numpy.core.records.recarray
>>> type(data.raw_data)
numpy.ndarray
>>> data.names
['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']
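The practical difference between the two containers is how a column is selected: a record array exposes each column as a named field, while a plain ndarray needs a positional index looked up through names. The following is a small sketch using NumPy alone, with the first two Longley observations copied from the output above; `raw` and `rec` are illustrative stand-ins for raw_data and data, not the library's own objects.

```python
import numpy as np

names = ['TOTEMP', 'GNPDEFL', 'GNP', 'UNEMP', 'ARMED', 'POP', 'YEAR']

# First two observations (endog followed by the exog columns).
raw = np.array([
    [60323., 83.0, 234289., 2356., 1590., 107608., 1947.],
    [61122., 88.5, 259426., 2325., 1456., 108632., 1948.],
])

# Build a record array with one named field per column of raw.
rec = np.rec.fromarrays(raw.T, names=names)

# Record array: select a column by field name.
print(rec.GNP)

# Plain ndarray: select the same column by its position in names.
print(raw[:, names.index('GNP')])
```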