52 Weeks of Data Science: My First Machine Learning Algorithm!
Preface
The idea of becoming a data scientist has popped into my head every now and then for a couple of years. However, settling into full-time work life, the fear of not making it as a business undergrad, and every other excuse you could think of have kept me away.
But not anymore.
For the next 52 weeks, I’m making it my goal to consistently learn and post data science-related content. Why? The same reasons I paid for a Medium subscription — to prove my commitment and keep myself accountable.
So if you’re like me and want to get into data science but don’t know where to start, come ride along! And if you’re a data wiz, please critique my work and help me learn faster :)
Let's dive into my first machine learning algorithm!
My First Machine Learning Algorithm
For my first algorithm, I wanted to create something that I thought would be relevant later in my life. I decided to use a “Used Car Dataset” from Kaggle, which has over 600,000 used car listings.
This algorithm aims to predict the price of a used car based on a number of features, including the year it was built, the manufacturer, the odometer reading (in kilometers), and more. I chose a Random Forest model because it handles a large number of features well, reduces variance compared to a single decision tree, and can rank feature importance for attribute selection.
Through multiple iterations, I was able to reduce the mean absolute error (MAE) from $4,375 to $1,787, against an average price of $14,254.90!
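To give a sense of the approach, here’s a rough sketch of training a Random Forest and measuring MAE with scikit-learn. The data below is synthetic (I can’t reproduce the Kaggle listings here), and the column names and price formula are made up for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Small synthetic stand-in for the used car listings (columns are assumptions)
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "year": rng.integers(2000, 2022, n),
    "odometer": rng.integers(10_000, 250_000, n),
    "manufacturer": rng.integers(0, 5, n),  # pretend this is already encoded
})
# Fake price: newer cars and lower mileage cost more, plus noise
df["price"] = (
    8000 + 300 * (df["year"] - 2000) - 0.02 * df["odometer"]
    + 500 * df["manufacturer"] + rng.normal(0, 500, n)
)

X_train, X_test, y_train, y_test = train_test_split(
    df[["year", "odometer", "manufacturer"]], df["price"], random_state=0
)

# Fit the Random Forest and evaluate with mean absolute error
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"MAE: ${mae:,.0f}")
```

The nice part of this setup is that swapping in different feature sets or cleaning rules only changes the DataFrame, so iterating on the MAE is quick.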
See data and code here
Takeaways
- Through creating this model, I learned that an algorithm can only be as good as the quality of its data. Exploring the data was so important because there’s almost always some sort of human error in the listings. Even though I accounted for unrealistic odometer and asking price inputs, it’s likely that there were other ‘unclean’ variables I didn’t account for.
- Intuition and a strong understanding of the data are just as important as, if not more important than, a strong set of technical skills. I noticed that choosing the right features, setting the right thresholds, and deciding how to handle missing data significantly changed the model’s performance (hence my improvement in MAE!)
- There’s so much more I could’ve done to improve my model. After consulting a friend, a couple of pieces of advice included: making sure the data is properly scaled, doing more exploratory data analysis, and looking into using dummy variables (one-hot encoding) instead of label encoding.
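The cleaning and encoding ideas above might look something like this in pandas. The thresholds, column names, and toy listings here are illustrative assumptions, not the actual values from my notebook:

```python
import pandas as pd

# Toy listings with the kinds of human-error values the takeaways mention
listings = pd.DataFrame({
    "price": [14_000, 1, 9_500, 950_000, 22_000],
    "odometer": [80_000, 120_000, 2_500_000, 60_000, 45_000],
    "manufacturer": ["toyota", "ford", "toyota", "bmw", None],
})

# Drop unrealistic asking prices and odometer readings (assumed thresholds)
clean = listings[listings["price"].between(500, 200_000)]
clean = clean[clean["odometer"] < 1_000_000]

# Fill missing categoricals, then one-hot encode instead of label encoding
clean = clean.assign(manufacturer=clean["manufacturer"].fillna("unknown"))
encoded = pd.get_dummies(clean, columns=["manufacturer"])
print(encoded.columns.tolist())
```

One-hot encoding avoids the artificial ordering that label encoding imposes on manufacturers, which can mislead some models (though tree-based models like Random Forests are relatively tolerant of label encoding).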
And that’s it!
Hopefully, the next time I tackle this dataset, I’ll be able to build a much more accurate model.