Calgary Housing Pricing Predictor

Developed a scalable scraping and preprocessing pipeline using PySpark, aggregating data from 9 public Canadian datasets and engineering 16 unique features. Cleaned and filtered the data to produce 1,000,000+ valid records for training. Evaluated and tuned multiple regression models, achieving 54% test accuracy with a linear regression baseline for predicting housing prices from novel inputs.
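
As a rough illustration of how a linear regression baseline can be fit and scored on held-out data, here is a minimal sketch in plain Python. It is not the project's code: the single square-footage feature, the synthetic prices, the 80/20 split, and the helper names fit_line and r_squared are all assumptions made for the example. The description above does not say which metric the 54% figure refers to, so the sketch assumes a score along the lines of R^2 (the fraction of price variance explained), which is a common way to report a regression result as a percentage.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares for a single feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, preds):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic stand-in data (hypothetical, not from the project's datasets):
# price grows with square footage, plus Gaussian noise.
rng = random.Random(0)
sqft = [rng.uniform(600, 3500) for _ in range(1000)]
price = [150_000 + 200 * s + rng.gauss(0, 100_000) for s in sqft]

# Hold out the last 20% of records as a test set.
split = 800
slope, intercept = fit_line(sqft[:split], price[:split])

# Score the fitted line on records it has not seen.
test_preds = [slope * s + intercept for s in sqft[split:]]
test_r2 = r_squared(price[split:], test_preds)
print(f"slope={slope:.1f}, intercept={intercept:.0f}, test R^2={test_r2:.2f}")
```

With the stack listed below, the same steps would more likely use scikit-learn's LinearRegression and r2_score over all 16 engineered features; the hand-written version only keeps the example free of dependencies.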

Technologies used: Python, PySpark, pandas, numpy, scikit-learn

Collaborators: Mehreen Akmal, Jenn Bushey, Hao Liu, Eric Diep