Posted 17 Sep 2014
There are some excellent resources here. But I thought a plan might be the
more helpful approach, and hence I am adding one more answer.
My goal is to create a plan that gets you to the level of an average industry practitioner.
Skills you need: the ability to take Excel/CSV data sets, pre-process and visualize them, build a model, and visualize the results.
1. Download one data set from Kaggle/UCI or anywhere from the
Internet. I am deliberately not giving a link as I want you to search
through multiple sets. Create a deck of slides describing the business
problem, ROI, current practices, their weaknesses etc.
Milestone 1: Creating a business context for a problem is a crucial
step in becoming a practitioner. Congrats, you have done that! You
should spend a week on this, provided you put in 20 hours a week.
2. Look at the attributes given. Brainstorm whether you can create
more attributes from them. If transactions are given, you can create
the average number of transactions per day, the average value of a
transaction etc. Think and create as many new attributes as you can.
2. Download R and Deducer (my preference). Both are open source.
3. From the resources provided by others, learn the techniques and
intuition behind standard data pre-processing (I mean the ways in which
you fill missing values, bin numeric variables, merge categorical
variables, scale data, reduce dimensionality etc.).
4. Use Excel/Deducer to create the new attributes and pre-process the data.
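To make the "create new attributes" step concrete, here is a minimal sketch. It is written in plain Python rather than R/Deducer, which is my assumption for illustration only. The transaction records, the column layout and the function name derive_attributes are all hypothetical, and "per day" is taken to mean "per day on which the customer had at least one transaction".

```python
from collections import defaultdict
from datetime import date

# Hypothetical transaction records: (customer_id, transaction_date, amount).
transactions = [
    ("c1", date(2014, 9, 1), 20.0),
    ("c1", date(2014, 9, 1), 30.0),
    ("c1", date(2014, 9, 3), 10.0),
    ("c2", date(2014, 9, 2), 100.0),
]

def derive_attributes(rows):
    """Return one dict of derived attributes per customer."""
    amounts = defaultdict(list)   # customer -> list of transaction amounts
    active_days = defaultdict(set)  # customer -> distinct days with a transaction
    for customer, day, amount in rows:
        amounts[customer].append(amount)
        active_days[customer].add(day)

    features = {}
    for customer, values in amounts.items():
        features[customer] = {
            "n_transactions": len(values),
            # Assumption: average over active days, not calendar days.
            "avg_transactions_per_active_day": len(values) / len(active_days[customer]),
            "avg_transaction_value": sum(values) / len(values),
        }
    return features

features = derive_attributes(transactions)
for customer, attrs in features.items():
    print(customer, attrs)
```

Each customer ends up as one row of new columns, which is the shape you want for the big structured table.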
Milestone 2: Creating one big structured table where independent
attributes are columns and records are rows is a huge step in solving the
problem. You should be able to do this with 4 weeks of work. Don’t forget
to add a few slides to your deck on data pre-processing.
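Three of the pre-processing techniques mentioned above (filling missing values, scaling and binning) can be sketched as follows. This is a plain Python illustration under my own assumptions, not what Deducer does internally: missing values are filled with the median, scaling is min-max to [0, 1], and the age column, bin edges and labels are made up.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def bin_numeric(values, edges, labels):
    """Label each value by the first upper edge it does not exceed.

    Values above the last edge receive the last label, so labels must
    have one more entry than edges.
    """
    binned = []
    for v in values:
        for edge, label in zip(edges, labels):
            if v <= edge:
                binned.append(label)
                break
        else:
            binned.append(labels[-1])
    return binned

# Hypothetical numeric column with one missing value.
ages = [25, None, 40, 55, 70]
filled = fill_missing(ages)
scaled = min_max_scale(filled)
groups = bin_numeric(filled, edges=[30, 60], labels=["young", "middle", "senior"])
print(filled, scaled, groups)
```

Median filling is only one option; the mean, a constant, or a model-based estimate are also common, and the right choice depends on the data.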
5. Learn descriptive statistics, histograms, box plots, scatter plots and bar charts. Learn to plot these in Deducer/ggplot.
6. Do detailed descriptive statistics and visualizations on the data.
There are excellent resources on this all over the net. I created a few
videos myself (http://beyond.insofe.edu.in/cate…)
Milestone 3: Visualizing is considered the most important interfacing
step, and you are done with it. Add these to your slide deck. Allocate
two weeks for this.
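Before plotting, it helps to see the numbers a box plot and a histogram are built from. The sketch below uses only Python's standard library, which is my choice for illustration and not the Deducer/ggplot route described above. The sample values and the function names are hypothetical, and quartile conventions differ between tools, so other software may report slightly different Q1 and Q3.

```python
import statistics
from collections import Counter

def five_number_summary(values):
    """Minimum, quartiles and maximum: the numbers behind a box plot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {"min": min(values), "q1": q1, "median": median,
            "q3": q3, "max": max(values)}

def histogram_counts(values, bin_width):
    """Count values per bin, keyed by the lower edge of each bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical sample of a numeric attribute.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(sample)
counts = histogram_counts(sample, bin_width=5)
print(summary)
print(counts)
```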
6. Learn linear and logistic regression and clustering from any of the resources given in these threads.
7. Apply them on your data sets and run all the diagnostics. Deducer makes it easy to do this.
Milestone 4: Congrats! You have built your predictive models. I think you need 3 weeks for this step.
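For intuition about what the simplest of these models computes, here is a sketch of one-variable linear regression by ordinary least squares, with R-squared as a basic diagnostic. It is plain Python written for illustration; the data and the function names fit_line and r_squared are hypothetical, and in practice you would let R/Deducer do the fitting and the fuller set of diagnostics.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in y explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data lying exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)
fit_quality = r_squared(xs, ys, slope, intercept)
print(slope, intercept, fit_quality)
```

Real data will not sit on the line, so also look at the residuals (actual minus predicted values) before trusting the fit.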
8. Brainstorm and think about how you can simplify and present these
results. The goal is to present to a non-data scientist. Use your
visualization skills again. Add these slides to your deck.
Milestone 5: Take a week or two for this.
You have created a slide deck, some code and a knowledge base. More
importantly, you solved a problem end-to-end. Voilà, in approximately 12
weeks you are where 90% of data scientists are.
Now, to get to a higher level:
Add more algorithms (decision trees, neural nets etc.). Learn more
domains and problems. Study techniques for working with unstructured
data. There are wonderful courses in the thread. Take them slowly.
Hope this helps.