This post was written in a Jupyter notebook (.ipynb) and converted to Markdown for posting on my GitHub blog. I followed the instructions found here.

  1. Go to the location of your notebook.ipynb file in a terminal and run jupyter nbconvert --to markdown notebook.ipynb. This creates notebook.md and a notebook_files folder in the same directory.
  2. Copy notebook.md to your _posts folder and the contents of the notebook_files folder to your assets folder, or wherever linked images are typically stored.
  3. The notebook_files folder contains plots and charts in PNG format, which need to be linked using the img src tag.
  4. The CSS formatting mentioned in the post is quite important since the pandas DataFrame tables can render at unusual sizes. Make sure class="dataframe" is set on every table (if it isn't already) so that the CSS formatting is applied.

Linear Regression is a supervised learning approach for predicting a quantitative response. While it may be less exciting than modern statistical learning approaches, it serves as a good starting point for more sophisticated techniques, and a good understanding of this algorithm is crucial. It helps answer questions such as:

  • Is there a relationship between the predictors and my dependent variable?
  • How strong is this relationship and in what direction?
  • How confident are we of this impact?
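Concretely, the model assumes the response is (approximately) a linear combination of the predictors plus an error term; a quick sketch in generic notation, not specific to this dataset:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon$$

Fitting the model amounts to estimating the coefficients, typically by least squares, and the questions above are answered by looking at the size, sign and statistical significance of those estimates.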

We will use the dataset obtained from this Kaggle competition - https://www.kaggle.com/c/house-prices-advanced-regression-techniques

Let’s start by reading in the dataset and looking at its contents.

import numpy as np
import pandas as pd

house_data = pd.read_csv('house-prices-advanced-regression-techniques/train.csv')
print(house_data.shape)
house_data.head()
(1460, 81)
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000

5 rows × 81 columns

It is good practice to have at least 30 observations per variable, and with only about 18 observations per variable here (1460/81 ≈ 18) we clearly need to cut down on the number of predictors in our model. One simple way to prune less helpful predictors is to remove those with a large number of missing values.

Looking at the variable names as well as the number of missing (NA) values, it is evident that the rows with ‘Garage’ or ‘Bsmt’ values missing are houses that simply do not have those features, so the values are not actually missing, just not applicable. To make sure these columns stay in the dataset, we choose a threshold of 6%.

# drop columns with more than 6% missing values
house_data = house_data.loc[:, house_data.isna().mean()<0.06]
house_data.isna().sum().sort_values(ascending=False).head(20)
GarageType      81
GarageYrBlt     81
GarageFinish    81
GarageCond      81
GarageQual      81
BsmtExposure    38
BsmtFinType2    38
BsmtFinType1    37
BsmtCond        37
BsmtQual        37
MasVnrType       8
MasVnrArea       8
Electrical       1
RoofMatl         0
RoofStyle        0
SalePrice        0
Exterior1st      0
Exterior2nd      0
YearBuilt        0
ExterQual        0
dtype: int64

Another way to reduce the number of variables, and improve model quality at the same time, is to keep only one variable from each group of correlated variables. I created a correlation matrix to identify correlated variables and kept the one with the higher correlation to SalePrice.

corr_matrix = house_data.corr().style.apply(lambda x: ["background: red" if v > 0.7 or v < -0.7 else "" for v in x], axis = 1)
house_data = house_data.drop(['GarageYrBlt', 'GarageCars', 'TotRmsAbvGrd'], axis=1)
corr_matrix
Id MSSubClass LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF 1stFlrSF 2ndFlrSF LowQualFinSF GrLivArea BsmtFullBath BsmtHalfBath FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold SalePrice
Id 1.000000 0.011156 -0.033226 -0.028365 0.012609 -0.012713 -0.021998 -0.050298 -0.005024 -0.005968 -0.007940 -0.015415 0.010496 0.005590 -0.044230 0.008273 0.002289 -0.020155 0.005587 0.006784 0.037719 0.002951 0.027239 -0.019772 0.000072 0.016570 0.017634 -0.029643 -0.000477 0.002889 -0.046635 0.001330 0.057044 -0.006242 0.021172 0.000712 -0.021917
MSSubClass 0.011156 1.000000 -0.139781 0.032628 -0.059316 0.027850 0.040581 0.022936 -0.069836 -0.065649 -0.140759 -0.238518 -0.251758 0.307886 0.046474 0.074853 0.003491 -0.002333 0.131608 0.177354 -0.023438 0.281721 0.040380 -0.045569 0.085072 -0.040110 -0.098672 -0.012579 -0.006100 -0.012037 -0.043825 -0.026030 0.008283 -0.007683 -0.013585 -0.021407 -0.084284
LotArea -0.033226 -0.139781 1.000000 0.105806 -0.005636 0.014228 0.013788 0.104160 0.214103 0.111170 -0.002618 0.260833 0.299475 0.050986 0.004779 0.263116 0.158155 0.048046 0.126031 0.014259 0.119690 -0.017784 0.190015 0.271364 -0.024947 0.154871 0.180403 0.171698 0.084774 -0.018340 0.020423 0.043160 0.077672 0.038068 0.001205 -0.014261 0.263843
OverallQual -0.028365 0.032628 0.105806 1.000000 -0.091932 0.572323 0.550684 0.411876 0.239666 -0.059119 0.308159 0.537808 0.476224 0.295493 -0.030429 0.593007 0.111098 -0.040150 0.550600 0.273458 0.101676 -0.183882 0.427452 0.396765 0.547766 0.600671 0.562022 0.238923 0.308819 -0.113937 0.030371 0.064886 0.065166 -0.031406 0.070815 -0.027347 0.790982
OverallCond 0.012609 -0.059316 -0.005636 -0.091932 1.000000 -0.375983 0.073741 -0.128101 -0.046231 0.040229 -0.136841 -0.171098 -0.144203 0.028942 0.025494 -0.079686 -0.054942 0.117821 -0.194149 -0.060769 0.012980 -0.087001 -0.057583 -0.023820 -0.324297 -0.185758 -0.151521 -0.003334 -0.032589 0.070356 0.025504 0.054811 -0.001985 0.068777 -0.003511 0.043950 -0.077856
YearBuilt -0.012713 0.027850 0.014228 0.572323 -0.375983 1.000000 0.592855 0.315707 0.249503 -0.049107 0.149040 0.391452 0.281986 0.010308 -0.183784 0.199010 0.187599 -0.038162 0.468271 0.242656 -0.070651 -0.174800 0.095589 0.147716 0.825667 0.537850 0.478954 0.224880 0.188686 -0.387268 0.031355 -0.050364 0.004950 -0.034383 0.012398 -0.013618 0.522897
YearRemodAdd -0.021998 0.040581 0.013788 0.550684 0.073741 0.592855 1.000000 0.179618 0.128451 -0.067759 0.181133 0.291066 0.240379 0.140024 -0.062419 0.287389 0.119470 -0.012337 0.439046 0.183331 -0.040581 -0.149598 0.191740 0.112581 0.642277 0.420622 0.371600 0.205726 0.226298 -0.193919 0.045286 -0.038740 0.005829 -0.010286 0.021490 0.035743 0.507101
MasVnrArea -0.050298 0.022936 0.104160 0.411876 -0.128101 0.315707 0.179618 1.000000 0.264736 -0.072319 0.114442 0.363936 0.344501 0.174561 -0.069071 0.390857 0.085310 0.026673 0.276833 0.201444 0.102821 -0.037610 0.280682 0.249070 0.252691 0.364204 0.373066 0.159718 0.125703 -0.110204 0.018796 0.061466 0.011723 -0.029815 -0.005965 -0.008201 0.477493
BsmtFinSF1 -0.005024 -0.069836 0.214103 0.239666 -0.046231 0.249503 0.128451 0.264736 1.000000 -0.050117 -0.495251 0.522396 0.445863 -0.137079 -0.064503 0.208171 0.649212 0.067418 0.058543 0.004262 -0.107355 -0.081007 0.044316 0.260011 0.153484 0.224054 0.296970 0.204306 0.111761 -0.102303 0.026451 0.062021 0.140491 0.003571 -0.015727 0.014359 0.386420
BsmtFinSF2 -0.005968 -0.065649 0.111170 -0.059119 0.040229 -0.049107 -0.067759 -0.072319 -0.050117 1.000000 -0.209294 0.104810 0.097117 -0.099260 0.014807 -0.009640 0.158678 0.070948 -0.076444 -0.032148 -0.015728 -0.040751 -0.035227 0.046921 -0.088011 -0.038264 -0.018227 0.067898 0.003093 0.036543 -0.029993 0.088871 0.041709 0.004940 -0.015211 0.031706 -0.011378
BsmtUnfSF -0.007940 -0.140759 -0.002618 0.308159 -0.136841 0.149040 0.181133 0.114442 -0.495251 -0.209294 1.000000 0.415360 0.317987 0.004469 0.028167 0.240257 -0.422900 -0.095804 0.288886 -0.041118 0.166643 0.030086 0.250647 0.051575 0.190708 0.214175 0.183303 -0.005316 0.129005 -0.002538 0.020764 -0.012579 -0.035092 -0.023837 0.034888 -0.041258 0.214479
TotalBsmtSF -0.015415 -0.238518 0.260833 0.537808 -0.171098 0.391452 0.291066 0.363936 0.522396 0.104810 0.415360 1.000000 0.819530 -0.174512 -0.033245 0.454868 0.307351 -0.000315 0.323722 -0.048804 0.050450 -0.068901 0.285573 0.339519 0.322445 0.434585 0.486665 0.232019 0.247264 -0.095478 0.037384 0.084489 0.126053 -0.018479 0.013196 -0.014969 0.613581
1stFlrSF 0.010496 -0.251758 0.299475 0.476224 -0.144203 0.281986 0.240379 0.344501 0.445863 0.097117 0.317987 0.819530 1.000000 -0.202646 -0.014241 0.566024 0.244671 0.001956 0.380637 -0.119916 0.127401 0.068101 0.409516 0.410531 0.233449 0.439317 0.489782 0.235459 0.211671 -0.065292 0.056104 0.088758 0.131525 -0.021096 0.031372 -0.013604 0.605852
2ndFlrSF 0.005590 0.307886 0.050986 0.295493 0.028942 0.010308 0.140024 0.174561 -0.137079 -0.099260 0.004469 -0.174512 -0.202646 1.000000 0.063353 0.687501 -0.169494 -0.023855 0.421378 0.609707 0.502901 0.059306 0.616423 0.194561 0.070832 0.183926 0.138347 0.092165 0.208026 0.061989 -0.024358 0.040606 0.081487 0.016197 0.035164 -0.028700 0.319334
LowQualFinSF -0.044230 0.046474 0.004779 -0.030429 0.025494 -0.183784 -0.062419 -0.069071 -0.064503 0.014807 0.028167 -0.033245 -0.014241 0.063353 1.000000 0.134683 -0.047143 -0.005842 -0.000710 -0.027080 0.105607 0.007522 0.131185 -0.021272 -0.036363 -0.094480 -0.067601 -0.025444 0.018251 0.061081 -0.004296 0.026799 0.062157 -0.003793 -0.022174 -0.028921 -0.025606
GrLivArea 0.008273 0.074853 0.263116 0.593007 -0.079686 0.199010 0.287389 0.390857 0.208171 -0.009640 0.240257 0.454868 0.566024 0.687501 0.134683 1.000000 0.034836 -0.018918 0.630012 0.415772 0.521270 0.100063 0.825489 0.461679 0.231197 0.467247 0.468997 0.247433 0.330224 0.009113 0.020643 0.101510 0.170205 -0.002416 0.050240 -0.036526 0.708624
BsmtFullBath 0.002289 0.003491 0.158155 0.111098 -0.054942 0.187599 0.119470 0.085310 0.649212 0.158678 -0.422900 0.307351 0.244671 -0.169494 -0.047143 0.034836 1.000000 -0.147871 -0.064512 -0.030905 -0.150673 -0.041503 -0.053275 0.137928 0.124553 0.131881 0.179189 0.175315 0.067341 -0.049911 -0.000106 0.023148 0.067616 -0.023047 -0.025361 0.067049 0.227122
BsmtHalfBath -0.020155 -0.002333 0.048046 -0.040150 0.117821 -0.038162 -0.012337 0.026673 0.067418 0.070948 -0.095804 -0.000315 0.001956 -0.023855 -0.005842 -0.018918 -0.147871 1.000000 -0.054536 -0.012340 0.046519 -0.037944 -0.023836 0.028976 -0.077464 -0.020891 -0.024536 0.040161 -0.025324 -0.008555 0.035114 0.032121 0.020025 -0.007367 0.032873 -0.046524 -0.016844
FullBath 0.005587 0.131608 0.126031 0.550600 -0.194149 0.468271 0.439046 0.276833 0.058543 -0.076444 0.288886 0.323722 0.380637 0.421378 -0.000710 0.630012 -0.064512 -0.054536 1.000000 0.136381 0.363252 0.133115 0.554784 0.243671 0.484557 0.469672 0.405656 0.187703 0.259977 -0.115093 0.035353 -0.008106 0.049604 -0.014290 0.055872 -0.019669 0.560664
HalfBath 0.006784 0.177354 0.014259 0.273458 -0.060769 0.242656 0.183331 0.201444 0.004262 -0.032148 -0.041118 -0.048804 -0.119916 0.609707 -0.027080 0.415772 -0.030905 -0.012340 0.136381 1.000000 0.226651 -0.068263 0.343415 0.203649 0.196785 0.219178 0.163549 0.108080 0.199740 -0.095317 -0.004972 0.072426 0.022381 0.001290 -0.009050 -0.010269 0.284108
BedroomAbvGr 0.037719 -0.023438 0.119690 0.101676 0.012980 -0.070651 -0.040581 0.102821 -0.107355 -0.015728 0.166643 0.050450 0.127401 0.502901 0.105607 0.521270 -0.150673 0.046519 0.363252 0.226651 1.000000 0.198597 0.676620 0.107570 -0.064518 0.086106 0.065253 0.046854 0.093810 0.041570 -0.024478 0.044300 0.070703 0.007767 0.046544 -0.036014 0.168213
KitchenAbvGr 0.002951 0.281721 -0.017784 -0.183882 -0.087001 -0.174800 -0.149598 -0.037610 -0.081007 -0.040751 0.030086 -0.068901 0.068101 0.059306 0.007522 0.100063 -0.041503 -0.037944 0.133115 -0.068263 0.198597 1.000000 0.256045 -0.123936 -0.124411 -0.050634 -0.064433 -0.090130 -0.070091 0.037312 -0.024600 -0.051613 -0.014525 0.062341 0.026589 0.031687 -0.135907
TotRmsAbvGrd 0.027239 0.040380 0.190015 0.427452 -0.057583 0.095589 0.191740 0.280682 0.044316 -0.035227 0.250647 0.285573 0.409516 0.616423 0.131185 0.825489 -0.053275 -0.023836 0.554784 0.343415 0.676620 0.256045 1.000000 0.326114 0.148112 0.362289 0.337822 0.165984 0.234192 0.004151 -0.006683 0.059383 0.083757 0.024763 0.036907 -0.034516 0.533723
Fireplaces -0.019772 -0.045569 0.271364 0.396765 -0.023820 0.147716 0.112581 0.249070 0.260011 0.046921 0.051575 0.339519 0.410531 0.194561 -0.021272 0.461679 0.137928 0.028976 0.243671 0.203649 0.107570 -0.123936 0.326114 1.000000 0.046822 0.300789 0.269141 0.200019 0.169405 -0.024822 0.011257 0.184530 0.095074 0.001409 0.046357 -0.024096 0.466929
GarageYrBlt 0.000072 0.085072 -0.024947 0.547766 -0.324297 0.825667 0.642277 0.252691 0.153484 -0.088011 0.190708 0.322445 0.233449 0.070832 -0.036363 0.231197 0.124553 -0.077464 0.484557 0.196785 -0.064518 -0.124411 0.148112 0.046822 1.000000 0.588920 0.564567 0.224577 0.228425 -0.297003 0.023544 -0.075418 -0.014501 -0.032417 0.005337 -0.001014 0.486362
GarageCars 0.016570 -0.040110 0.154871 0.600671 -0.185758 0.537850 0.420622 0.364204 0.224054 -0.038264 0.214175 0.434585 0.439317 0.183926 -0.094480 0.467247 0.131881 -0.020891 0.469672 0.219178 0.086106 -0.050634 0.362289 0.300789 0.588920 1.000000 0.882475 0.226342 0.213569 -0.151434 0.035765 0.050494 0.020934 -0.043080 0.040522 -0.039117 0.640409
GarageArea 0.017634 -0.098672 0.180403 0.562022 -0.151521 0.478954 0.371600 0.373066 0.296970 -0.018227 0.183303 0.486665 0.489782 0.138347 -0.067601 0.468997 0.179189 -0.024536 0.405656 0.163549 0.065253 -0.064433 0.337822 0.269141 0.564567 0.882475 1.000000 0.224666 0.241435 -0.121777 0.035087 0.051412 0.061047 -0.027400 0.027974 -0.027378 0.623431
WoodDeckSF -0.029643 -0.012579 0.171698 0.238923 -0.003334 0.224880 0.205726 0.159718 0.204306 0.067898 -0.005316 0.232019 0.235459 0.092165 -0.025444 0.247433 0.175315 0.040161 0.187703 0.108080 0.046854 -0.090130 0.165984 0.200019 0.224577 0.226342 0.224666 1.000000 0.058661 -0.125989 -0.032771 -0.074181 0.073378 -0.009551 0.021011 0.022270 0.324413
OpenPorchSF -0.000477 -0.006100 0.084774 0.308819 -0.032589 0.188686 0.226298 0.125703 0.111761 0.003093 0.129005 0.247264 0.211671 0.208026 0.018251 0.330224 0.067341 -0.025324 0.259977 0.199740 0.093810 -0.070091 0.234192 0.169405 0.228425 0.213569 0.241435 0.058661 1.000000 -0.093079 -0.005842 0.074304 0.060762 -0.018584 0.071255 -0.057619 0.315856
EnclosedPorch 0.002889 -0.012037 -0.018340 -0.113937 0.070356 -0.387268 -0.193919 -0.110204 -0.102303 0.036543 -0.002538 -0.095478 -0.065292 0.061989 0.061081 0.009113 -0.049911 -0.008555 -0.115093 -0.095317 0.041570 0.037312 0.004151 -0.024822 -0.297003 -0.151434 -0.121777 -0.125989 -0.093079 1.000000 -0.037305 -0.082864 0.054203 0.018361 -0.028887 -0.009916 -0.128578
3SsnPorch -0.046635 -0.043825 0.020423 0.030371 0.025504 0.031355 0.045286 0.018796 0.026451 -0.029993 0.020764 0.037384 0.056104 -0.024358 -0.004296 0.020643 -0.000106 0.035114 0.035353 -0.004972 -0.024478 -0.024600 -0.006683 0.011257 0.023544 0.035765 0.035087 -0.032771 -0.005842 -0.037305 1.000000 -0.031436 -0.007992 0.000354 0.029474 0.018645 0.044584
ScreenPorch 0.001330 -0.026030 0.043160 0.064886 0.054811 -0.050364 -0.038740 0.061466 0.062021 0.088871 -0.012579 0.084489 0.088758 0.040606 0.026799 0.101510 0.023148 0.032121 -0.008106 0.072426 0.044300 -0.051613 0.059383 0.184530 -0.075418 0.050494 0.051412 -0.074181 0.074304 -0.082864 -0.031436 1.000000 0.051307 0.031946 0.023217 0.010694 0.111447
PoolArea 0.057044 0.008283 0.077672 0.065166 -0.001985 0.004950 0.005829 0.011723 0.140491 0.041709 -0.035092 0.126053 0.131525 0.081487 0.062157 0.170205 0.067616 0.020025 0.049604 0.022381 0.070703 -0.014525 0.083757 0.095074 -0.014501 0.020934 0.061047 0.073378 0.060762 0.054203 -0.007992 0.051307 1.000000 0.029669 -0.033737 -0.059689 0.092404
MiscVal -0.006242 -0.007683 0.038068 -0.031406 0.068777 -0.034383 -0.010286 -0.029815 0.003571 0.004940 -0.023837 -0.018479 -0.021096 0.016197 -0.003793 -0.002416 -0.023047 -0.007367 -0.014290 0.001290 0.007767 0.062341 0.024763 0.001409 -0.032417 -0.043080 -0.027400 -0.009551 -0.018584 0.018361 0.000354 0.031946 0.029669 1.000000 -0.006495 0.004906 -0.021190
MoSold 0.021172 -0.013585 0.001205 0.070815 -0.003511 0.012398 0.021490 -0.005965 -0.015727 -0.015211 0.034888 0.013196 0.031372 0.035164 -0.022174 0.050240 -0.025361 0.032873 0.055872 -0.009050 0.046544 0.026589 0.036907 0.046357 0.005337 0.040522 0.027974 0.021011 0.071255 -0.028887 0.029474 0.023217 -0.033737 -0.006495 1.000000 -0.145721 0.046432
YrSold 0.000712 -0.021407 -0.014261 -0.027347 0.043950 -0.013618 0.035743 -0.008201 0.014359 0.031706 -0.041258 -0.014969 -0.013604 -0.028700 -0.028921 -0.036526 0.067049 -0.046524 -0.019669 -0.010269 -0.036014 0.031687 -0.034516 -0.024096 -0.001014 -0.039117 -0.027378 0.022270 -0.057619 -0.009916 0.018645 0.010694 -0.059689 0.004906 -0.145721 1.000000 -0.028923
SalePrice -0.021917 -0.084284 0.263843 0.790982 -0.077856 0.522897 0.507101 0.477493 0.386420 -0.011378 0.214479 0.613581 0.605852 0.319334 -0.025606 0.708624 0.227122 -0.016844 0.560664 0.284108 0.168213 -0.135907 0.533723 0.466929 0.486362 0.640409 0.623431 0.324413 0.315856 -0.128578 0.044584 0.111447 0.092404 -0.021190 0.046432 -0.028923 1.000000
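For instance, here is an illustrative check (not part of the original notebook, and it assumes it is run before the drop above so that all of these columns still exist): for a highly correlated pair flagged in the matrix, compare each member's correlation with SalePrice and keep the stronger one.

corr = house_data.corr()
for a, b in [('GarageYrBlt', 'YearBuilt'), ('TotRmsAbvGrd', 'GrLivArea')]:
    keep = a if abs(corr.loc[a, 'SalePrice']) > abs(corr.loc[b, 'SalePrice']) else b
    print('{}: {:.2f}, {}: {:.2f} -> keep {}'.format(
        a, corr.loc[a, 'SalePrice'], b, corr.loc[b, 'SalePrice'], keep))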

We still need to prune further, and speaking of pruning, one way to do this is with a Decision Tree. A requirement for using sklearn’s Decision Tree Classifier/Regressor is that categorical variables be encoded, so we start by separating our categorical and numerical variables and applying an encoding to the categorical ones using the LabelEncoder.

df = house_data.copy()
df['MasVnrArea'] = df['MasVnrArea'].fillna(0)         # treat missing veneer area as 0
num_cols = df._get_numeric_data().columns             # numeric columns
factor_cols = list(set(df.columns) - set(num_cols))   # the rest are categorical

for fc in factor_cols: 
    df[fc] = df[fc].fillna('NP').astype('category')   # 'NP' = feature not present
    
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

encoding = {}   # store each column's label -> integer mapping

for col in factor_cols:
    df[col] = le.fit_transform(df[col])
    le_name_mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    encoding[col] = le_name_mapping
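As a quick sanity check (purely illustrative; the exact integers depend on the alphabetical order LabelEncoder assigns), you can inspect the stored mapping for any one column:

print(encoding['Street'])   # expected to look something like {'Grvl': 0, 'Pave': 1}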

Since this post is focused on Linear Regression, we won’t go into detail about how a decision tree is fitted, but we can use its feature importances, i.e. the weighted impurity decrease contributed by splits on each variable. Due to the high variance inherent in decision trees, try fitting the model a number of times or, better yet, opt for an ensemble such as Bagging, Boosting or Random Forest.

from sklearn.tree import DecisionTreeRegressor

X = df.iloc[:,1:-1]   # drop the Id column and the SalePrice target
y = df['SalePrice']   # target variable

dt = DecisionTreeRegressor()
dt.fit(X,y)
pd.Series(dt.feature_importances_, index=X.columns).sort_values(ascending=False).plot.bar(color='red', figsize=(20,6))
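For more stable importances than a single, high-variance tree provides, a minimal sketch of the ensemble route mentioned above (using a Random Forest; the hyperparameters here are arbitrary) could look like this:

from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=200, random_state=0)
rf.fit(X, y)
pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False).head(10)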

OverallQual is assigned a very high feature importance relative to the other variables, and beyond 9-10 variables the importance is negligible. Let’s take the top 10 variables for modeling using Linear Regression.

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

top = pd.Series(dt.feature_importances_, index=X.columns).sort_values(ascending=False)[:10].index.to_list()
reg = LinearRegression().fit(X_train[top],y_train)
print('Training score: {:.2f}'.format(reg.score(X_train[top], y_train)))
print('Testing score: {:.2f}'.format(reg.score(X_test[top], y_test)))

Training score: 0.81
Testing score: 0.62
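To express the error in more tangible units than R², a small illustrative addition (not in the original notebook) is to look at the test-set RMSE in dollars:

from sklearn.metrics import mean_squared_error

rmse = np.sqrt(mean_squared_error(y_test, reg.predict(X_test[top])))
print('Test RMSE: {:,.0f}'.format(rmse))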

As you can see, the test R² is quite low: our simplified model does not predict house prices well, and it does not improve significantly even when we increase the number of predictors. Let’s look at the t-statistics and p-values to see whether our coefficient estimates are statistically significant. Unfortunately, sklearn’s LinearRegression class does not expose a statistical summary, so we use the statsmodels package instead.

import statsmodels.api as sm
from scipy import stats

X2 = sm.add_constant(X_train[top])
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              SalePrice   R-squared:                       0.814
Model:                            OLS   Adj. R-squared:                  0.812
Method:                 Least Squares   F-statistic:                     506.4
Date:                Fri, 06 Nov 2020   Prob (F-statistic):               0.00
Time:                        18:22:44   Log-Likelihood:                -13839.
No. Observations:                1168   AIC:                         2.770e+04
Df Residuals:                    1157   BIC:                         2.776e+04
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
const         -1.07e+05   5017.363    -21.335      0.000   -1.17e+05   -9.72e+04
OverallQual   2.369e+04   1054.524     22.469      0.000    2.16e+04    2.58e+04
GrLivArea       -0.7216     19.581     -0.037      0.971     -39.141      37.697
2ndFlrSF        48.3502     19.957      2.423      0.016       9.193      87.507
TotalBsmtSF     22.6861      4.453      5.094      0.000      13.948      31.424
BsmtFinSF1      24.8870      2.670      9.322      0.000      19.649      30.125
1stFlrSF        51.2961     20.315      2.525      0.012      11.438      91.154
GarageArea      51.5652      6.202      8.314      0.000      39.396      63.734
Neighborhood    67.3986    170.479      0.395      0.693    -267.085     401.882
LotArea          0.4956      0.098      5.042      0.000       0.303       0.688
WoodDeckSF      30.0592      8.263      3.638      0.000      13.848      46.271
==============================================================================
Omnibus:                      312.087   Durbin-Watson:                   2.028
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            21443.147
Skew:                           0.152   Prob(JB):                         0.00
Kurtosis:                      23.989   Cond. No.                     7.71e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.71e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

We see that some of our variables are not significant, which means we may need to try other methods of feature selection. The fact that our training R² is noticeably higher than our test R² indicates that we may be overfitting. This calls for regularization, which we will cover in depth in the next post.
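As a brief preview (a minimal sketch only; the alpha value is arbitrary and the next post covers this properly), ridge regression adds an L2 penalty that shrinks the coefficients and can narrow the gap between training and testing scores:

from sklearn.linear_model import Ridge

ridge = Ridge(alpha=10.0).fit(X_train[top], y_train)
print('Ridge training score: {:.2f}'.format(ridge.score(X_train[top], y_train)))
print('Ridge testing score: {:.2f}'.format(ridge.score(X_test[top], y_test)))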