2025-07-21
statsmodels is a package for traditional frequentist statistical models
scikit-learn is a workhorse package for general statistical modeling (inclusive of ML)
matplotlib and seaborn are packages for visualizing data and results
We import statsmodels the following way (its formula interface will feel familiar if you have used R's lm), and simulate some data to work with:

import numpy as np
import statsmodels.api as sm  # conventional import; used below as sm.OLS

rng = np.random.default_rng(seed=12345)

def dnorm(mean, variance, rng, size=1):
    # draw `size` values from a normal with the given mean and variance
    if isinstance(size, int):
        size = (size,)
    return mean + np.sqrt(variance) * rng.standard_normal(*size)

N = 100
X = np.c_[dnorm(0, 0.4, rng, size=N),
          dnorm(0, 0.6, rng, size=N),
          dnorm(0, 0.2, rng, size=N)]
eps = dnorm(0, 0.1, rng, size=N)
beta = [0.1, 0.3, 0.5]
y = np.dot(X, beta) + eps

What are the dimensions of X and y?
Note that this is a linear, homoskedastic model.
array([[-0.90050602, -0.18942958, -1.0278702 ],
       [ 0.79925205, -1.54598388, -0.32739708],
       [-0.55065483, -0.12025429,  0.32935899],
       [-0.16391555,  0.82403985,  0.20827485],
       [-0.04765129, -0.21314698, -0.04824364]])
As in our numpy lecture, we also want to fit an intercept – statsmodels makes this easy
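A minimal sketch using statsmodels' add_constant helper (the name X_model is my own):

```python
# prepend a column of ones so that an intercept is estimated
X_model = sm.add_constant(X)
X_model[:5]
```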
array([[ 1.        , -0.90050602, -0.18942958, -1.0278702 ],
       [ 1.        ,  0.79925205, -1.54598388, -0.32739708],
       [ 1.        , -0.55065483, -0.12025429,  0.32935899],
       [ 1.        , -0.16391555,  0.82403985,  0.20827485],
       [ 1.        , -0.04765129, -0.21314698, -0.04824364]])
We begin by creating a sm.OLS object using our data (this is the array-based interface):
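A hedged sketch of the fit (the variable names are my own):

```python
model = sm.OLS(y, X_model)   # outcome first, then the design matrix
results = model.fit()
print(results.summary())
```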
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.470
Model:                            OLS   Adj. R-squared:                  0.453
Method:                 Least Squares   F-statistic:                     28.36
Date:                Fri, 18 Jul 2025   Prob (F-statistic):           3.23e-13
Time:                        21:25:33   Log-Likelihood:                -25.390
No. Observations:                 100   AIC:                             58.78
Df Residuals:                      96   BIC:                             69.20
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0208      0.032     -0.653      0.516      -0.084       0.042
x1             0.0658      0.054      1.220      0.226      -0.041       0.173
x2             0.2690      0.043      6.312      0.000       0.184       0.354
x3             0.4494      0.068      6.567      0.000       0.314       0.585
==============================================================================
Omnibus:                        0.429   Durbin-Watson:                   1.878
Prob(Omnibus):                  0.807   Jarque-Bera (JB):                0.296
Skew:                           0.133   Prob(JB):                        0.863
Kurtosis:                       2.995   Cond. No.                         2.16
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Let’s rerun with heteroskedasticity-robust standard errors:
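Robust covariance can be requested directly in fit; a minimal sketch:

```python
# HC1 heteroskedasticity-robust standard errors
results_robust = model.fit(cov_type="HC1")
print(results_robust.summary())
```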
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.470
Model:                            OLS   Adj. R-squared:                  0.453
Method:                 Least Squares   F-statistic:                     25.90
Date:                Fri, 18 Jul 2025   Prob (F-statistic):           2.32e-12
Time:                        21:25:33   Log-Likelihood:                -25.390
No. Observations:                 100   AIC:                             58.78
Df Residuals:                      96   BIC:                             69.20
Df Model:                           3                                         
Covariance Type:                  HC1                                         
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
const         -0.0208      0.032     -0.657      0.511      -0.083       0.041
x1             0.0658      0.060      1.097      0.273      -0.052       0.183
x2             0.2690      0.046      5.795      0.000       0.178       0.360
x3             0.4494      0.072      6.231      0.000       0.308       0.591
==============================================================================
Omnibus:                        0.429   Durbin-Watson:                   1.878
Prob(Omnibus):                  0.807   Jarque-Bera (JB):                0.296
Skew:                           0.133   Prob(JB):                        0.863
Kurtosis:                       2.995   Cond. No.                         2.16
==============================================================================
Notes:
[1] Standard Errors are heteroscedasticity robust (HC1)
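statsmodels can also take a pd.DataFrame as input. A sketch that collects the simulated data into a DataFrame (the column names col0–col2 match the output below; the variable name data is my own):

```python
import pandas as pd

data = pd.DataFrame(X, columns=["col0", "col1", "col2"])
data["y"] = y
print(data.head())
```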
       col0      col1      col2         y
0 -0.900506 -0.189430 -1.027870 -0.599527
1  0.799252 -1.545984 -0.327397 -0.588454
2 -0.550655 -0.120254  0.329359  0.185634
3 -0.163916  0.824040  0.208275 -0.007477
4 -0.047651 -0.213147 -0.048244 -0.015374
This can now be used much like R’s lm – we use a formula to access the API:
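A sketch using the formula API (assuming the conventional import of statsmodels.formula.api):

```python
import statsmodels.formula.api as smf

results_formula = smf.ols("y ~ col0 + col1 + col2", data=data).fit()
results_formula.params
```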
Intercept   -0.020799
col0         0.065813
col1         0.268970
col2         0.449419
dtype: float64
Note the intercept is fit by default
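Formulas can also include interactions and numpy transformations. The exact call behind the next output isn't shown in the notes; a hedged guess at its shape:

```python
# "- 1" drops the intercept; col0:col1 is an interaction; np.exp(col1) is a transformation
smf.ols("y ~ col0 + col1 + col2 + col0:col1 + np.exp(col1) - 1", data=data).fit().params
```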
col0            0.069201
col1            0.341687
col2            0.444801
col0:col1      -0.121129
np.exp(col1)   -0.032190
dtype: float64
If you’re doing something in your formula that patsy doesn’t recognize by default, you can pass it within I() in the formula, which tells patsy to let python handle it.
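For example (an illustrative formula of my own):

```python
# I(col1 + col2) is evaluated as ordinary addition, not as the formula "+" operator
smf.ols("y ~ col0 + I(col1 + col2)", data=data).fit().params
```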
We can also use the results object to predict on out-of-sample data
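A minimal sketch of predict; the original likely used genuinely new data, so here the first few rows simply stand in for it:

```python
results_formula.predict(data[:5])
```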
0   -0.631536
1   -0.475749
2    0.030740
3    0.305839
4   -0.124827
dtype: float64
In fact, patsy is quite useful for creating design matrices from formulae (again, very similar to R’s model.matrix):
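A hedged guess at the call behind the design matrix below (its third column looks like col2 squared):

```python
import patsy

X_design = patsy.dmatrix("col0 + I(col2 ** 2)", data)
np.asarray(X_design)[:5]
```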
array([[ 1.        , -0.90050602, -1.0278702 ,  1.05651715],
       [ 1.        ,  0.79925205, -0.32739708,  0.10718885],
       [ 1.        , -0.55065483,  0.32935899,  0.10847735],
       [ 1.        , -0.16391555,  0.20827485,  0.04337841],
       [ 1.        , -0.04765129, -0.04824364,  0.00232745]])
Exercise (statsmodels): simulate 1,000 observations from
\[ y = 2 + 3\times\sin(x) + \varepsilon \]
where \(x\sim\mathcal{N}(0,1)\) and \(\varepsilon\sim\mathcal{N}(0,2)\). Then fit two statsmodels objects reporting the coefficients from a correctly specified regression (using \(\sin(x)\)) and from an incorrectly specified one (just using \(x\)).
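One possible solution sketch:

```python
x = rng.standard_normal(1000)
eps = np.sqrt(2) * rng.standard_normal(1000)   # variance 2
sim = pd.DataFrame({"x": x, "y": 2 + 3 * np.sin(x) + eps})

correct = smf.ols("y ~ np.sin(x)", data=sim).fit()   # correctly specified
incorrect = smf.ols("y ~ x", data=sim).fit()         # misspecified
print(correct.summary())
print(incorrect.summary())
```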
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.678
Model:                            OLS   Adj. R-squared:                  0.678
Method:                 Least Squares   F-statistic:                     2100.
Date:                Fri, 18 Jul 2025   Prob (F-statistic):          9.48e-248
Time:                        21:25:33   Log-Likelihood:                -1762.2
No. Observations:                1000   AIC:                             3528.
Df Residuals:                     998   BIC:                             3538.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0459      0.045     45.830      0.000       1.958       2.133
np.sin(x)      3.0660      0.067     45.830      0.000       2.935       3.197
==============================================================================
Omnibus:                        0.087   Durbin-Watson:                   2.064
Prob(Omnibus):                  0.958   Jarque-Bera (JB):                0.055
Skew:                           0.017   Prob(JB):                        0.973
Kurtosis:                       3.012   Cond. No.                         1.50
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.579
Model:                            OLS   Adj. R-squared:                  0.578
Method:                 Least Squares   F-statistic:                     1371.
Date:                Fri, 18 Jul 2025   Prob (F-statistic):          1.37e-189
Time:                        21:25:33   Log-Likelihood:                -1896.3
No. Observations:                1000   AIC:                             3797.
Df Residuals:                     998   BIC:                             3806.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      2.0309      0.051     39.791      0.000       1.931       2.131
x              1.8869      0.051     37.033      0.000       1.787       1.987
==============================================================================
Omnibus:                       14.047   Durbin-Watson:                   2.019
Prob(Omnibus):                  0.001   Jarque-Bera (JB):               23.471
Skew:                           0.020   Prob(JB):                     8.01e-06
Kurtosis:                       3.750   Cond. No.                         1.03
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Why scikit-learn? It is the workhorse for general-purpose machine learning in Python, and its API style is consistent and very accessible. Our running example uses data from a Kaggle competition about passenger survival rates on the Titanic:
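A hedged sketch of loading the data (the file name train.csv is the Kaggle convention and an assumption here; the variable name train is my own):

```python
import pandas as pd

train = pd.read_csv("train.csv")
train.head()
```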
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   
                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   
   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500   NaN        S  
With a few exceptions, you can’t feed ML algorithms missing data, so we need to check the prevalence of missing values in the predictors:
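For example:

```python
# fraction of missing values in each column
train.isna().mean()
```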
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.198653
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
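A common fix, and the one implied by the median imputation used later, is to fill missing ages with the median (a hedged sketch; the original code isn't shown):

```python
train["Age"] = train["Age"].fillna(train["Age"].median())
train.isna().mean()   # Age should now show no missing values
```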
PassengerId    0.000000
Survived       0.000000
Pclass         0.000000
Name           0.000000
Sex            0.000000
Age            0.000000
SibSp          0.000000
Parch          0.000000
Ticket         0.000000
Fare           0.000000
Cabin          0.771044
Embarked       0.002245
dtype: float64
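scikit-learn models need numeric features, so we encode Sex as a 0/1 IsFemale indicator; a sketch:

```python
train["IsFemale"] = (train["Sex"] == "female").astype(int)
train[["IsFemale", "Sex"]].head(6)
```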
| | IsFemale | Sex |
|---|---|---|
| 0 | 0 | male |
| 1 | 1 | female |
| 2 | 1 | female |
| 3 | 1 | female |
| 4 | 0 | male |
| 5 | 0 | male |
We have to pass numpy arrays to sklearn, so we pick our features and convert them to numpy:
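A sketch of the conversion (the feature choice is inferred from the columns shown below):

```python
predictors = ["Pclass", "IsFemale", "Age"]
X_train = train[predictors].to_numpy()
y_train = train["Survived"].to_numpy()
X_train[:5]
```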
array([[ 3.,  0., 22.],
       [ 1.,  1., 38.],
       [ 3.,  1., 26.],
       [ 1.,  1., 35.],
       [ 3.,  0., 35.]])
We will fit a logistic regression – this means we import the model class from sklearn.linear_model and create an instance of it.
We can then fit this model using the data and the fit() method, which is consistent across all of sklearn’s models:
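A minimal sketch:

```python
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
```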
LogisticRegression()
Recall that the probability of survival in this case is being modeled as
\[ \mathbb{P}(\text{Survival}_i) = \frac{1}{1 + \exp(-(\beta^{\top}x_i))} = \frac{\exp(\beta^{\top}x_i)}{1 + \exp(\beta^{\top}x_i)} \]
Are the coefficients interpretable?
Finally, we can make predictions using the fitted model:
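The original appears to predict on the Kaggle test set; a hedged sketch (the file name test.csv and the preprocessing steps are assumptions):

```python
test = pd.read_csv("test.csv")
test["Age"] = test["Age"].fillna(train["Age"].median())
test["IsFemale"] = (test["Sex"] == "female").astype(int)
X_test = test[predictors].to_numpy()
model.predict(X_test[:10])
```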
array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0])
We can also assess the fit using the score method (here we have no test data outcomes, so we reuse the training data)
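A sketch:

```python
model.score(X_train, y_train)   # mean accuracy on the training data
```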
0.7878787878787878
This is the percentage of observations for which the model makes the correct prediction.
np.float64(0.7878787878787878)
Since we’re fitting a logistic regression model, we can also retrieve the coefficients:
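A sketch:

```python
model.coef_   # one coefficient per feature: Pclass, IsFemale, Age
```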
array([[-1.14157583,  2.51975386, -0.03272378]])
How should we interpret these coefficients (on ticket class, female, and age, respectively)?
Using the sklearn API more fully: categorical variables like Sex need to be encoded numerically, and we have two options – pandas (pd.get_dummies) or scikit-learn (preprocessing.OneHotEncoder). I’ll use the former here, but the latter is also fine:
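A hedged sketch (presumably starting from data in which Age still has missing values, since the Pipeline below imputes medians; the variable names are my own):

```python
raw = pd.read_csv("train.csv")
features = pd.get_dummies(raw[["Pclass", "Age", "Sex"]], drop_first=True)
X_raw = features.to_numpy()   # columns: Pclass, Age, Sex_male
X_raw[:5]
```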
array([[3, 22.0, True],
       [1, 38.0, False],
       [3, 26.0, False],
       [1, 35.0, False],
       [3, 35.0, True]], dtype=object)
Let’s say that we’re comfortable applying our median imputation to any features that we pass to our model – then we can use a Pipeline:
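A sketch (the step names match the output below; the variable name pipe is my own):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

pipe = Pipeline(steps=[("impute_median", SimpleImputer(strategy="median")),
                       ("logit", LogisticRegression())])
pipe.fit(X_raw, y_train)
```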
Pipeline(steps=[('impute_median', SimpleImputer(strategy='median')),
                ('logit', LogisticRegression())])
To get at the fitted coefficients, we now have to specify which step of the pipeline we want to access:
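A sketch:

```python
pipe.named_steps["logit"].coef_
```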
array([[-1.14120327, -0.03271306, -2.51954002]])
scikit-learn also makes regularization and cross-validation straightforward (see, e.g., sklearn.linear_model.LassoCV). We’ll work with simulated data from a sparse linear model; the true coefficient vector is:
array([1., 2., 3., 4., 5., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Note that beta is “sparse”. As a baseline, we first fit ordinary least squares via LinearRegression:
LinearRegression()
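A hedged sketch of the simulation and the baseline fit. The original sample size and noise scale aren't shown; the placeholders below (n only modestly larger than p) would at least be consistent with the poor out-of-sample OLS fit reported next:

```python
from sklearn.linear_model import LinearRegression

n, p = 120, 100                      # placeholder dimensions
beta_true = np.zeros(p)
beta_true[:5] = np.arange(1, 6)      # sparse: only the first five coefficients are nonzero

X_sim = rng.standard_normal((n, p))
y_sim = X_sim @ beta_true + 10 * rng.standard_normal(n)   # placeholder noise scale

ols = LinearRegression().fit(X_sim, y_sim)
```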
Let’s generate unseen data:
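Continuing the sketch, we draw fresh data from the same process and score the OLS fit (the names X_unseen, y_unseen, and ols reappear in the plotting exercise later):

```python
X_unseen = rng.standard_normal((n, p))
y_unseen = X_unseen @ beta_true + 10 * rng.standard_normal(n)
ols.score(X_unseen, y_unseen)   # out-of-sample R^2
```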
-0.4090791143319581
Implementation of five-fold CV for a parameter value of \(\alpha=2\)
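A sketch using cross_val_score:

```python
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

cross_val_score(Lasso(alpha=2.0), X_sim, y_sim, cv=5)
```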
array([0.05229446, 0.110592  , 0.1352647 , 0.06777857, 0.17660835])
What does each value in this array mean?
We want to look for the best parameter over a grid of possibilities. Let’s say we restrict ourselves to \(\alpha \in \{1,...,10\}\):
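A sketch using GridSearchCV (the variable name grid reappears later):

```python
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(estimator=Lasso(alpha=2.0),
                    param_grid={"alpha": np.arange(1, 11)})
grid.fit(X_sim, y_sim)
```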
GridSearchCV(estimator=Lasso(alpha=2.0),
             param_grid={'alpha': array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10])})
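The detailed fold-by-fold results can be inspected as a DataFrame:

```python
pd.DataFrame(grid.cv_results_)
```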
   mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_alpha  \
0       0.001251  3.612121e-04         0.000612    4.559157e-05            1   
1       0.000496  2.208939e-04         0.000253    5.957458e-05            2   
2       0.000309  1.018089e-05         0.000186    4.199417e-06            3   
3       0.000292  3.423943e-06         0.000179    1.457282e-06            4   
4       0.000288  2.423909e-06         0.000177    1.134432e-06            5   
5       0.000285  9.608003e-07         0.000175    3.234067e-07            6   
6       0.000284  9.122432e-07         0.000174    5.519789e-07            7   
7       0.000569  3.575465e-04         0.000330    1.928284e-04            8   
8       0.000781  1.850838e-04         0.000487    1.307984e-04            9   
9       0.000301  1.084152e-05         0.000188    6.319847e-06           10   
          params  split0_test_score  split1_test_score  split2_test_score  \
0   {'alpha': 1}           0.055458           0.219426           0.062059   
1   {'alpha': 2}           0.052294           0.110592           0.135265   
2   {'alpha': 3}           0.030939           0.059064           0.075362   
3   {'alpha': 4}          -0.002265           0.034352           0.051328   
4   {'alpha': 5}          -0.031806           0.003537           0.021333   
5   {'alpha': 6}          -0.047611          -0.006580          -0.001169   
6   {'alpha': 7}          -0.047611          -0.006580          -0.001169   
7   {'alpha': 8}          -0.047611          -0.006580          -0.001169   
8   {'alpha': 9}          -0.047611          -0.006580          -0.001169   
9  {'alpha': 10}          -0.047611          -0.006580          -0.001169   
   split3_test_score  split4_test_score  mean_test_score  std_test_score  \
0           0.084018           0.204363         0.125065        0.071683   
1           0.067779           0.176608         0.108508        0.045115   
2          -0.041256           0.074601         0.039742        0.043579   
3          -0.135167          -0.028565        -0.016063        0.065751   
4          -0.134277          -0.075044        -0.043251        0.056192   
5          -0.134277          -0.075044        -0.052936        0.048913   
6          -0.134277          -0.075044        -0.052936        0.048913   
7          -0.134277          -0.075044        -0.052936        0.048913   
8          -0.134277          -0.075044        -0.052936        0.048913   
9          -0.134277          -0.075044        -0.052936        0.048913   
   rank_test_score  
0                1  
1                2  
2                3  
3                4  
4                5  
5                6  
6                6  
7                6  
8                6  
9                6  
The best estimator found by the grid search:
Lasso(alpha=np.int64(1))
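Its coefficient vector shows the (approximately) sparse structure that LASSO recovers; a sketch of how to retrieve it:

```python
grid.best_estimator_.coef_
```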
array([ 0.18018531,  0.19411377,  2.31230729,  2.60385006,  3.9903116 ,
        0.        , -0.        , -0.        , -0.        , -0.55286155,
       -0.        ,  0.        ,  0.89657092,  0.        , -0.        ,
        0.        ,  0.        , -0.        , -0.        ,  0.        ,
        0.        ,  0.        ,  0.        , -0.        , -0.27817011,
       -0.        , -0.        , -0.        , -0.        , -0.        ,
       -0.        ,  0.        , -0.        ,  0.        , -0.        ,
       -0.        , -0.        ,  0.41305422, -0.        ,  0.        ,
       -0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        , -0.        , -0.        ,  0.45830332,
       -0.        ,  0.        , -0.        , -0.        ,  0.        ,
       -0.        , -0.19840069,  0.        , -0.        ,  0.        ,
       -0.        ,  0.        ,  0.        ,  0.        , -0.        ,
        0.        , -0.        , -0.        ,  0.        , -0.        ,
       -0.        , -0.        ,  0.        ,  0.19424094,  0.        ,
       -0.99068853,  0.        , -0.        , -0.20650089, -0.        ,
       -0.        ,  0.        ,  0.        , -0.        ,  0.        ,
        0.        , -0.        , -0.        ,  0.08357251, -0.        ,
       -0.        ,  0.0501038 ,  0.        , -0.        , -0.        ,
        0.        , -0.        ,  0.        ,  0.        ,  0.        ])
Exercise (LassoCV): look up the documentation for sklearn.linear_model.LassoCV and fit a cross-validated model directly with that API (feel free to just use the package defaults). What is the optimal value of \(\alpha\), and what is the score on our unseen data?
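One possible solution sketch:

```python
from sklearn.linear_model import LassoCV

lasso_cv = LassoCV().fit(X_sim, y_sim)
lasso_cv.alpha_                      # optimal penalty chosen by CV
lasso_cv.score(X_unseen, y_unseen)   # R^2 on the unseen data
```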
Optimal alpha: 1.2160551211595514
Score on unseen data: 0.23526655630512117
We will always import matplotlib like so:
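That is, the conventional pyplot import (plt is used throughout the plotting code below):

```python
import matplotlib.pyplot as plt
```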
Simple plots are very easy! Let’s generate some data and pass it directly to matplotlib:
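An illustrative sketch (the data here is my own):

```python
values = np.arange(10)
plt.plot(values)
```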
Using the matplotlib API more broadly – we start by creating a figure:
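A sketch; the Figure repr below suggests a 960×480-pixel canvas, i.e. figsize=(9.6, 4.8) at the default dpi of 100 (an inference, not shown in the original):

```python
fig = plt.figure(figsize=(9.6, 4.8))
```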
<Figure size 960x480 with 0 Axes>
Plenty more in the documentation
Using our OLS model and LASSO model from before, plot actual vs. predictions on unseen data from both models.
preds_lasso = grid.best_estimator_.predict(X_unseen)
preds_ols = ols.predict(X_unseen)
fig, axes = plt.subplots(1, 2)
axes[0].scatter(y_unseen, preds_ols)
axes[1].scatter(y_unseen, preds_lasso)
axes[0].set_title('OLS')
axes[1].set_title('LASSO')
for i in range(2):
    axes[i].set_xlabel('Actual')
    axes[i].set_ylabel('Predicted')

To illustrate how to annotate, we’ll create a plot of the closing S&P 500 price over the course of the financial crisis, with important dates called out. First, let’s get the data:
               SPX
Date              
1990-02-01  328.79
1990-02-02  330.92
1990-02-05  331.85
1990-02-06  329.66
1990-02-07  333.75
This is enough to create a basic plot of the S&P 500 closing price
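A sketch (assuming the data above is in a DataFrame called spx, indexed by Date):

```python
fig, ax = plt.subplots()
spx["SPX"].plot(ax=ax)
```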
We need to know what happened and when – we can store these events in a list of (date, label) pairs.
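A hedged sketch of the annotated plot; the event list below is hypothetical (the dates follow the classic financial-crisis examples) and the vertical offsets are arbitrary:

```python
from datetime import datetime

crisis_data = [
    (datetime(2007, 10, 11), "Peak of bull market"),
    (datetime(2008, 3, 12), "Bear Stearns fails"),
    (datetime(2008, 9, 15), "Lehman bankruptcy"),
]

fig, ax = plt.subplots()
spx["SPX"].plot(ax=ax)
for date, label in crisis_data:
    ax.annotate(label,
                xy=(date, spx["SPX"].asof(date) + 75),
                xytext=(date, spx["SPX"].asof(date) + 225),
                arrowprops=dict(facecolor="black", headwidth=4, width=2),
                horizontalalignment="left", verticalalignment="top")
```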
The matplotlib API can be inconvenient – you can do all of this directly in matplotlib, but it’s not ideal. This is where seaborn and pandas’ own plotting methods come in.
seaborn works directly with pandas to leverage DataFrame and Series objects. Series have a plot method (with plenty of formatting options), and since a DataFrame is made up of Series, DataFrames have one too – see the sketch after the table below. For example:
           A         B         C         D
0   1.106728 -0.866214  0.235206 -1.853799
10  2.948470 -0.725233  0.686533 -1.808754
20  5.147404 -2.057531 -0.643525 -2.597403
30  4.132320 -1.565383 -0.225219 -4.784608
40  2.902674 -2.789219  0.611306 -6.792956
50  3.214318 -2.902681  1.248224 -5.860315
60  4.613268 -2.930443  0.319327 -5.002226
70  5.384007 -2.878670  1.494778 -4.990756
80  4.741819 -1.049899  0.849258 -5.290131
90  5.164763 -2.133666  1.681957 -4.424057
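A hedged reconstruction of the frame above (a cumulative random walk, reusing the rng from earlier) and the plot call:

```python
df = pd.DataFrame(rng.standard_normal((10, 4)).cumsum(axis=0),
                  columns=["A", "B", "C", "D"],
                  index=np.arange(0, 100, 10))
df.plot()   # one line per column, index on the x-axis
```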
Calling plot on a DataFrame draws one line per column; plot.bar gives bar charts, including stacked ones. To illustrate with real data, here is a dataset of NBA shot attempts:
| | Unnamed: 0.2 | Unnamed: 0.1 | Unnamed: 0 | match_id | shotX | shotY | quarter | time_remaining | player | team | made | shot_type | distance | score | opp | status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 202203210BRK | 14.8 | 28.9 | 1st quarter | 11:34.0 | Donovan Mitchell | UTA | False | 3-pointer | 27 | 0-0 | 'UTA' | tied | 
| 1 | 1 | 1 | 1 | 202203210BRK | 24.8 | 7.3 | 1st quarter | 11:06.0 | Donovan Mitchell | UTA | False | 2-pointer | 4 | 0-0 | 'UTA' | tied | 
| 2 | 2 | 2 | 2 | 202203210BRK | 20.0 | 13.7 | 1st quarter | 10:26.0 | Mike Conley | UTA | True | 2-pointer | 11 | 2-2 | 'UTA' | tied | 
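Counting makes and misses by shot type (a sketch, assuming the data above is loaded in a DataFrame called shots):

```python
shots[["shot_type", "made"]].value_counts()
```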
shot_type  made 
2-pointer  True     526
           False    410
3-pointer  False    400
           True     217
Name: count, dtype: int64
Using unstack, we can reshape these counts into a DataFrame:

| made | False | True |
|---|---|---|
| shot_type | | |
| 2-pointer | 410 | 526 |
| 3-pointer | 400 | 217 |
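A sketch of the reshaping and of the bar-chart variants mentioned above:

```python
counts = shots[["shot_type", "made"]].value_counts().unstack("made")
counts.plot.bar()                 # grouped bars: makes vs. misses by shot type
counts.plot.bar(stacked=True)     # stacked version
```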
Let’s create a heatmap of expected points by location on the floor:
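A hedged sketch of how the pivot table and heatmap might be built (the points construction is my own assumption):

```python
import seaborn as sns

# 2 or 3 points if the shot was made, 0 otherwise
shots["points"] = shots["made"] * np.where(shots["shot_type"] == "3-pointer", 3, 2)
pts = shots.pivot_table(values="points", index="shotY", columns="shotX", aggfunc="mean")
sns.heatmap(pts)
```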
[Head of the points pivot table: rows indexed by shotY, one column per shotX value (383 columns); the three displayed rows are almost entirely NaN.]
Note that pd.Series objects have a plot.hist() method
Note that pd.Series objects have a plot.density() method
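For example, for the distribution of shot distances (a sketch):

```python
shots["distance"].plot.hist(bins=30)
shots["distance"].plot.density()
```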
[Figures courtesy of the scikit-learn documentation.]
\(\beta_{LASSO}\) solves the problem
\[ \min_{\hat{\beta}}\; \underbrace{\sum_{i=1}^n\left(y_i-x_i^{\top}\hat{\beta}\right)^2}_{\text{Sum of Squared Errors}} + \underbrace{\alpha\lVert\hat{\beta}\rVert_1}_{\text{Regularization Penalty}} \]