
This data mining project was one of the most difficult and time-consuming projects. The professor provided us with the milestones and objectives we needed to accomplish.
Project Milestones
MILESTONE #1 – THE PITCH
Your team is looking to pitch a new idea/concept/company and is looking for funding. Think of this as Dragon’s Den or Shark Tank, but with data. Your job is to make the pitch for your idea.
Deliverables: Create and present a pitch deck. The deck must include:
- Team name and members
- Background of your colleagues
- What are they bringing to the table
- Audience for your pitch
- Who are they
- Why should they care?
- What data are you going to be mining to prove your pitch?
*Note that this must be a quantitative and a qualitative dataset

MILESTONE #2 – BUSINESS PLAN
You’ve settled on an idea. What is the business impact of your new idea/concept/company? Your job is to contextualize to prove or disprove your idea. It is possible you may want to re-pitch at this point.
Deliverables: Create and present a data validation deck with a supporting code document. You will be limited to 1,000 records. The deck must include:
- What your hypothesis was
- Data outputs and metrics that prove or disprove your model
- Data visualizations that create the narrative supporting your case

MILESTONE #3 – DATA COLLECTION
You’ve settled on an idea and shown the business plan. Do the hard work to analyze the data and determine the business impact of your new idea/concept/company. Your job is to contextualize to prove or disprove your idea. It is possible you may want to re-pitch at this point.
Deliverables: Create and present a summary findings deck with a supporting code document. You will be limited to 100,000 records. The deck must include:
- What your hypothesis was
- Data outputs and metrics that prove or disprove your model
- Must use a minimum of 3 different methodologies of classification and clustering
- Data visualizations that create the narrative supporting your case

MILESTONE #4 – FINAL PRESENTATIONS
You’re now ready to bring everything together. The data, the story, the business plan, and the passion. Time to present.
For this project, we imported raw data from New York City's open data website. We used the NYC Parking Violations dataset, which consisted of more than 10 million rows and more than 50 columns.
Challenges faced during the project:
- Car makes and car colors were recorded inconsistently, which made the data hard to analyze. For example, the color grey appeared as "Grey", "Gray", "Gry", "Gy", etc., and the make Honda appeared as "Honda", "Hnd", "Hd", "Hon", etc. We had to correct each name in order to have a clean and accurate dataset.
- Because the dataset was so large (more than 10 million rows and 50 columns), our systems weren't powerful enough to clean and load it, and they would constantly crash.
- Making predictions was difficult for the same reason: our systems weren't powerful enough to model such a vast dataset, so we were restricted to predicting on 100,000 to 150,000 records.
- The majority of the attributes in our dataset were categorical rather than numerical, which limited the prediction models we could use.
- The attributes we needed for our predictions weren't available, so we had to alter our business model.
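The label cleanup described in the first challenge can be illustrated with a simple lookup table. The following is a minimal sketch, not necessarily how our team implemented it: the alias tables `COLOR_ALIASES` and `MAKE_ALIASES`, the helper names, and the field names `color` and `make` are all hypothetical, and the tables contain only the variants quoted above (the real dataset has many more).

```python
# Hypothetical alias tables: only the variants mentioned in the text above.
COLOR_ALIASES = {
    "grey": "Grey",
    "gray": "Grey",
    "gry": "Grey",
    "gy": "Grey",
}
MAKE_ALIASES = {
    "honda": "Honda",
    "hnd": "Honda",
    "hd": "Honda",
    "hon": "Honda",
}


def normalize(value, aliases):
    """Map a raw label to its canonical name; unknown labels pass through."""
    if value is None:
        return None
    stripped = value.strip()
    # Compare case-insensitively so "GY", "gy" and " Gy " all match.
    return aliases.get(stripped.lower(), stripped)


def clean_record(record):
    """Return a copy of a record with the color and make fields normalized."""
    cleaned = dict(record)
    # "color" and "make" are assumed field names, not the dataset's columns.
    cleaned["color"] = normalize(record.get("color"), COLOR_ALIASES)
    cleaned["make"] = normalize(record.get("make"), MAKE_ALIASES)
    return cleaned


rows = [
    {"color": "Gry", "make": "Hnd"},
    {"color": " GY ", "make": "hon"},
    {"color": "Blue", "make": "Honda"},
]
cleaned_rows = [clean_record(r) for r in rows]
```

Labels that are not in a table are left unchanged (apart from trimmed whitespace), so unmapped variants stay visible and can be added to the table as they are discovered.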
Milestone 1 Presentation
Milestone 2 Presentation
Milestone 3 Presentation
Milestone 4 Presentation
Team Members:
Afaqul Haque
Urvaksh Irani
Eric Chow
Shannon Paes
Paul LeeFoon
Tony Tan