Housing Prices: Advanced Regression Techniques

By: Benjamin Podell, Blake Freeman, Edel Farah, Sarah Delia, Tomeka Morrison, Valentina Delia


      
      Ask a home buyer to describe their dream house, and they probably won't begin with the height of the 
      basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset 
      proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

      With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this
      competition challenges you to predict the final price of each home. 
      
Housing Prices

      File descriptions

      train.csv - the training set
      test.csv - the test set
      data_description.txt - full description of each column, originally prepared by 
      Dean De Cock but lightly edited to match the column names used here
        
      sample_submission.csv - a benchmark submission from a linear regression on year 
      and month of sale, lot square footage, and number of bedrooms


        
Housing Prices

      ### Read Data and Load Packages 

      # Importing dependencies
      %matplotlib inline
      import pandas as pd
      import numpy as np
      import matplotlib.pyplot as plt
      import seaborn as sns
      
      from sklearn import ensemble, tree, linear_model
      from sklearn.model_selection import train_test_split, cross_val_score
      from sklearn.metrics import r2_score
      from sklearn.utils import shuffle
      from sklearn.ensemble import AdaBoostRegressor
      


      # Read the data
      train_df = pd.read_csv('Resources/train.csv')
      test_df = pd.read_csv('Resources/test.csv')



      # Create Total Living Area SF and Price Per SF for lot and Living Area 
      basement1 = train_df['BsmtFinSF1']
      basement2 = train_df['BsmtFinSF2']
      living_space = train_df['GrLivArea']
      sale_price = train_df['SalePrice']

      total_square_feet = living_space + basement1 + basement2
      cost_per_square_feet = sale_price / total_square_feet

      # Train Set Up 
      train_df['Total Square Feet (ft)'] = total_square_feet
      train_df['Total Cost Per Square Feet ($)'] = cost_per_square_feet.round(2)

      # Test Set Up
      Test_basement1 = test_df['BsmtFinSF1']
      Test_basement2 = test_df['BsmtFinSF2']
      Test_living_space = test_df['GrLivArea']

      Test_total_square_feet = Test_living_space + Test_basement1 + Test_basement2
      test_df['Total Square Feet (ft)'] = total_square_feet






      # Generating a Features Heatmap Prior to One Hot Encoding
      sns.set_style('whitegrid')
      plt.subplots(figsize = (30,20))
      ## Plotting heatmap. 

      # Generate a mask for the upper triangle (taken from seaborn example gallery)
      mask = np.zeros_like(train_df.corr(), dtype=np.bool)
      mask[np.triu_indices_from(mask)] = True

      sns.heatmap(train_df.corr(), cmap=sns.diverging_palette(20, 220, n=200), mask = mask, annot=True, center = 0, );
      plt.xticks(size = 14)
      ## Give title. 
      plt.title("Features Heatmap", fontsize = 30);
      
      
Heatmap #Train Encoding, created multiple additional columns train_df_hot = pd.DataFrame(train_df, columns=['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood', 'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'Utilities'] ) train_df_hot = pd.get_dummies(train_df_hot,drop_first=True) #Test DF Hot Encoding test_df_hot = pd.DataFrame(test_df, columns=['Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood', 'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'Utilities'] ) test_df_hot = pd.get_dummies(test_df_hot,drop_first=True) # Combine the Dataframes and Drop the String Values train_df = pd.concat([train_df, train_df_hot], axis=1) train_df = train_df.drop(columns=['Id','Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood', 'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'Utilities'] ) test_df = pd.concat([test_df, test_df_hot], axis=1) test_df = test_df.drop(columns=['Id','Alley', 'BldgType', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'BsmtQual', 'CentralAir', 'Condition1', 'Condition2', 'Electrical', 'ExterCond', 'ExterQual', 'Exterior1st', 'Exterior2nd', 'Fence', 'FireplaceQu', 'Foundation', 'Functional', 'GarageCond', 'GarageFinish', 'GarageQual', 'GarageType', 'Heating', 'HeatingQC', 'HouseStyle', 'KitchenQual', 'LandContour', 'LandSlope', 'LotConfig', 'LotShape', 'MSZoning', 'MasVnrType', 'MiscFeature', 'Neighborhood', 'PavedDrive', 'PoolQC', 'RoofMatl', 'RoofStyle', 'SaleCondition', 'SaleType', 'Street', 'Utilities'] ) #Set Everything to Numeric train_df = train_df.apply(pd.to_numeric) test_df = test_df.apply(pd.to_numeric) train_df = train_df.fillna(0) test_df = test_df.fillna(0) train_df = train_df.drop(columns=['SalePrice']) # Exploration corrmat = train_df.corr() top_corr_features = corrmat.index[abs(corrmat['Total Cost Per Square Feet ($)'])>0.45] plt.figure(figsize=(8,8)) g = sns.heatmap(train_df[top_corr_features].corr(),annot=True,cmap="RdYlGn") g.set_title('Correlation Matrix', size = 24, fontweight = 'bold', color = 'hotpink') bottom, top = g.get_ylim() g.set_ylim(bottom + 0.5, top - 0.5) plt.savefig('./templates/Images/Heatmap.png')
Heatmap #box plot overallqual/saleprice var = 'OverallQual' data = pd.concat([train_df['Total Cost Per Square Feet ($)'], train_df[var]], axis=1) f, ax = plt.subplots(figsize=(16, 10)) fig = sns.boxplot(x=var, y='Total Cost Per Square Feet ($)', data=data) plt.xticks(size = 14, fontweight = 'bold', color = 'hotpink') ax.set_title('Total Cost Per Square Feet ($) VS Overall Quality', size = 24, fontweight = 'bold', color = 'hotpink') fig.axis(ymin=0,); ax.figure.savefig('./templates/OverallQuality.png')
Overall Quality var = 'BsmtUnfSF' data = pd.concat([train_df['Total Cost Per Square Feet ($)'], train_df[var]], axis=1) data.plot.scatter(x=var, y='Total Cost Per Square Feet ($)',figsize=(16, 10)); plt.title('Total Cost Per Square Feet ($) VS Unfinished SF of Basement Area', size = 24, fontweight = 'bold', color = 'hotpink') plt.savefig('./templates/Images/BsmtUnfSF.png')
Basement #box plot overallqual/saleprice var = 'Foundation_PConc' data = pd.concat([train_df['Total Cost Per Square Feet ($)'], train_df[var]], axis=1) f, ax = plt.subplots(figsize=(16, 10)) fig = sns.boxplot(x=var, y='Total Cost Per Square Feet ($)', data=data) plt.xticks(size = 14, fontweight = 'bold', color = 'hotpink') ax.set_title('Total Cost Per Sq. Ft ($) VS Type of Foundation With Poured Concrete', size = 24, fontweight = 'bold', color = 'hotpink') fig.axis(ymin=0,); ax.figure.savefig('./templates/Images/Foundation_PConc.png')
Basement var = 'YearRemodAdd' data = pd.concat([train_df['Total Cost Per Square Feet ($)'], train_df[var]], axis=1) data.plot.scatter(x=var, y='Total Cost Per Square Feet ($)',figsize=(16, 10)); plt.title('Total Cost Per Sq. Ft ($) VS Remodel Date', size = 24, fontweight = 'bold', color = 'hotpink') plt.savefig('./templates/Images/YearRemodAdd.png')
Year Remodel #box plot overallqual/saleprice var = 'ExterQual_TA' data = pd.concat([train_df['Total Cost Per Square Feet ($)'], train_df[var]], axis=1) f, ax = plt.subplots(figsize=(16, 10)) fig = sns.boxplot(x=var, y='Total Cost Per Square Feet ($)', data=data) plt.xticks(size = 14, fontweight = 'bold', color = 'hotpink') ax.set_title('Total Cost Per Sq. Ft ($) VS Exterior Quality', size = 24, fontweight = 'bold', color = 'hotpink') fig.axis(ymin=0,); ax.figure.savefig('./templates/Images/ExterQual_TA.png')
Year Remodel

      # Obtain target and predictors
      strong_cor = ["OverallQual", "YearRemodAdd","BsmtUnfSF","ExterQual_TA","Foundation_PConc"]
      
      X = train_df[strong_cor]
      y = train_df["Total Cost Per Square Feet ($)"].values.reshape(-1, 1)
      print(X.shape, y.shape)

      Output: (1460, 5) (1460, 1)
   

      from sklearn.model_selection import train_test_split

      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

      from sklearn.linear_model import LinearRegression
      # Create a linear model
      model = LinearRegression()

      # Fit (Train) our model to the data
      model.fit(X, y)
      
      Output: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)


      #calculate R2 Score and Mean Squared Error (MSE)
      from sklearn.metrics import  r2_score

      # Use our model to predict a value
      predicted = model.predict(X)

      # Score the prediction with mse and r2
      r2 = r2_score(y, predicted)

      print(f"R-squared: {r2}")

      R-squared: 0.531143673282861
   
  
    

Modeling The Data


      
      # Create the model using LinearRegression

      from sklearn.linear_model import LinearRegression
      model = LinearRegression()


      # Fit the model to the training data and calculate the scores for the training and testing data

      model.fit(X_train, y_train)
      training_score = model.score(X_train, y_train)
      testing_score = model.score(X_test, y_test)

      print(f"Training Score: {training_score}")
      print(f"Testing Score: {testing_score}")

      Output: Training Score: 0.5427163893247156, Testing Score: 0.48929751770099894

      # Plot the Residuals for the Training and Testing data

      plt.scatter(model.predict(X_train), model.predict(X_train) - y_train, c="blue", label="Training Data")
      plt.scatter(model.predict(X_test), model.predict(X_test) - y_test, c="orange", label="Testing Data")
      plt.legend()
      plt.hlines(y=0, xmin=y.min(), xmax=y.max())
      plt.title("Residual Plot")
      
Year Remodel from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler(feature_range=(0, 1)) x_train_scaled = scaler.fit_transform(X_train) x_train = pd.DataFrame(x_train_scaled) x_test_scaled = scaler.fit_transform(X_test) x_test = pd.DataFrame(x_test_scaled) from sklearn import neighbors from sklearn.metrics import mean_squared_error from math import sqrt import matplotlib.pyplot as plt %matplotlib inline r_score = [] #to store rmse values for different k for K in range(20): K = K+1 model = neighbors.KNeighborsRegressor(n_neighbors = K) model.fit(x_train, y_train) #fit the model pred=model.predict(x_test) #make prediction on test set error = r2_score(y_test, pred) #calculate rmse r_score.append(error) #store r_score values print('r2 Score ' , K , 'is:', error) r2 Score 1 is: -0.006175268194475558 r2 Score 2 is: 0.19013128958412795 r2 Score 3 is: 0.32231075018098765 r2 Score 4 is: 0.3851921121030757 r2 Score 5 is: 0.42108552230258456 r2 Score 6 is: 0.4549242549693332 r2 Score 7 is: 0.4617540588481075 r2 Score 8 is: 0.4696218381838856 r2 Score 9 is: 0.4743936874630249 r2 Score 10 is: 0.4773475366958242 r2 Score 11 is: 0.48010535045491465 r2 Score 12 is: 0.48523541291715866 r2 Score 13 is: 0.48762564644094286 r2 Score 14 is: 0.4841587948967565 r2 Score 15 is: 0.4849015514070848 r2 Score 16 is: 0.4834942200298865 r2 Score 17 is: 0.4817999168275017 r2 Score 18 is: 0.48605288191498774 r2 Score 19 is: 0.4871079989600001 r2 Score 20 is: 0.4890283304321088 curve = pd.DataFrame(r_score) #elbow curve curve.plot()
Year Remodel from sklearn.preprocessing import MinMaxScaler scaler = MinMaxScaler(feature_range=(0, 1)) x_train_scaled = scaler.fit_transform(X_train) x_train = pd.DataFrame(x_train_scaled) x_test_scaled = scaler.fit_transform(X_test) x_test = pd.DataFrame(x_test_scaled) from sklearn.tree import DecisionTreeRegressor # create a regressor object regressor = DecisionTreeRegressor(random_state = 42) # fit the regressor with X and Y data regressor.fit(X, y) training_score = regressor.score(X_train, y_train) testing_score = regressor.score(X_test, y_test) print(f"Training Score: {training_score}") print(f"Testing Score: {testing_score}") Output: Training Score: 0.9897327659059136, Testing Score: 0.9928085181043863
Another Image zoom-on-hover effect

Summary Report


      In summmary, when it came to initially cleaning our data, we used pandas and numpy. Then we determined the cost per square 
      feet by dividing  the sale price by total square feet to begin our Machine Learning Model. We used matplotlib and seaborn 
      for the exploration aspect of our data. We used Sklearn to make a linear regression model, and ensemble a decision tree. 
      We attempted to use K Nearest Neighbors but ultimately decided to use the decision tree for a more accurate prediction. 
      Deriving relevant variables, we made a heatmap only including things with over 45% relevance to total price per square foot. 
      We created a residual plot to compare the relationship from our training and testing data. And various plots compared to the 
      price per square foot; A scatter plot with Unfinished Square Feet of Basement Area; a scatter plot with year remodeled or 
      construction date if it hasn’t been remodeled; a boxplot comparing whether or not a house has poured concrete; a boxplot 
      categorizing whether the exterior quality is Typical or Average or not; and finally, an elbow curve pertaining to our KNN model.

      In conclusion, after testing several supervised ML models we received the best results from the Decision Tree Regressor 
      (Random Forest). Though the size of the model can be cumbersome and can use a large amount of computing power this did 
      preform the best with a R score of .9897 for Training and .9928 for testing.