become a data scientist

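This article lays out a clear learning path for professionals who want to move from the IT industry into data science. It covers the essential skills, the tools and software you need, and offers plenty of learning resources and practical advice.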

Learning Path for Developers & IT Professionals to become a Data Scientist

May 26, 2016

Introduction

This guide is meant to help web developers, software engineers and other IT professionals transition into the analytics / data science industry.

Last week, I gave a guest lecture at one of the well-known institutes in India. Rather (un)surprisingly, more than 60% of the students were experienced IT professionals. Most of them face a common problem: “I have been in IT / software / web development for more than a few years and want to up-skill myself in analytics. I have taken a few MOOCs and have tried a few books / platforms. Still, I don’t know what I should do next.”

This scenario is not very different from what we have seen at the several learnups / meetups we have conducted over the last year. To help these people as much as I can, I’ve created this comprehensive career guide to get you started quickly with data science. Once you finish reading this post, you will know the next steps in making the transition.


 

Self-assessment: A summary of where you stand

People working in the IT industry are generally comfortable with coding, working with databases and using frameworks. After spending a few years as a developer, you would know at least some languages such as Java, ASP.NET, JavaScript, C++, C, HTML, Python or PHP, and you would have worked with several database technologies such as SQL, MongoDB and Oracle. With these skills, you are envied by people trying to transition from a non-programming background. Ask them how they feel about it!

Let me explain the set of skills (eventually advantages) which I expect every good IT professional to have:

  1. Coding Experience: If you are comfortable with programming, you can focus on understanding the concepts and solving the problems. This can be a huge benefit in the initial days. Even if you have no experience in the languages used in data science, you still have an advantage because you understand concepts like indexes, functions, objects and referencing.
  2. Logical Thinking: Data scientists are logical. They make data-based decisions. So will you! Coding requires logic, and your brain is already accustomed to thinking that way. Definitely a plus point for you!
  3. Numerical Ability: Mathematics forms the core of data science, so being good with numbers is one of the biggest assets you can have. In my experience, people who are good at coding are often good with numbers too.
  4. Database Knowledge: IT professionals usually have good knowledge of databases. This knowledge is immensely helpful for a data scientist who needs to obtain data quickly from database teams.

On the other hand, you can be fooled into believing that data science is about learning a few more tools. Just as in programming, knowing a few languages or frameworks does not make you a good software engineer. What differentiates a good analyst from a bad one is problem solving and structured thinking. Tools are just a way to implement this thinking. Hence, I recommend that people pick just one tool, whichever is most convenient for them, and then focus on getting hands-on experience.

Here are the areas you need to focus on going forward:

  1. Structured Thinking: Structured thinking refers to the art of taking an ambiguous problem and then putting a structure around it. Remember those days, when you got a very loosely / generically written requirements document from a client? This is bound to happen often in analytics. Your clients and stakeholders will tell you the business problem and you will be expected to put a structure around it and solve it using data science.
  2. Math & Statistics: You should know some basics of statistics, algebra and statistical inference to start with. You can build on this over time.
  3. Domain knowledge: As a data scientist, you need to know the domain very closely. Why is a particular process done that way? If you are in banking, you need to understand why risk management is important and what the common terms used in the industry are. You need to spend some time understanding the domain and its current processes.
  4. Presentation skills: As a developer, you are mostly fine as long as you can explain your code to fellow developers. Not all developers need to coordinate with the client and present their solutions to the client directly. A data scientist / analyst, on the other hand, needs to present his / her findings in a manner which is easy and exciting for the business.
  5. Technical Data Science skills: From the biases that can arise in data collection, through data cleaning, to predictive modeling and the implementation of data science solutions, you will need to pick all of it up in the coming days.

 

Where should you start?

The problem in today’s world is the problem of plenty. I am sure you couldn’t agree more. Try searching the Internet for resources on R / Python / Data Science and you end up with a long list. Talk to a few people who have made the transition and they will add a few more resources which worked for them. If you are an avid reader, you can add a few books and blogs over and above this. Check out the platforms offering MOOCs and you will see a few good courses.

The sad part is that while you have access to plenty of resources today, you find it difficult to find your way through these resources. Hence, I have created this learning path:

 

Step 1: Getting your machine ready

The Hardware:

Data science is computationally intense (this should not be news to you!). So, the first thing you should do is set up a machine which helps you in your learning. Ideally, I would say that any machine used for serious data science work should have at least 8 GB of RAM and the equivalent of an Intel i5 / i7 chip. Of course, the higher the capacity you buy, the better! If you have some more money to spend at this stage, you can even get an SSD upgrade.

The Operating System:

Ironically, there is no single OS which works perfectly for data science. You would likely need a mix of Unix and Windows machines. Unix is better at resource management and for performing the data science work itself. On the other hand, you would need Windows for PowerPoint and Excel, both of which are used very heavily in data science workflows.

Also, there are a few visualization tools which work better in a Windows environment. Hence, I would recommend using Linux as the core OS with a virtual machine running Windows, or vice versa. If you are used to a Mac and can work comfortably with Excel on a Mac, you might be good too.

The Software:

You need to choose the language / tool of your preference here. If you have experience coding in object-oriented languages, I would recommend Python. It is easy to learn and has a vibrant community on the Internet. If you aren’t used to object-oriented programming, you can give R a try as well. If you need more details before making a decision, read this comparison – SAS vs. R vs. Python

Here is a list of software I would recommend at the minimum:

  1. MS Office for Excel, building presentations and writing documents.
  2. FileZilla for transferring files using FTP.
  3. Git & GitHub for version management.
  4. VMware / Oracle VirtualBox / Vagrant for running virtual machines.
  5. Cygwin / PuTTY (for Windows).
  6. I use Evernote for taking notes. On Linux, you need to run it in the browser.
  7. Terminator (for Linux) to run multiple terminals in a single view (it is awesome!).
  8. Sublime Text (or any other editor of your choice) for editing code. You should install additional plugins for the languages you use.

If you have chosen Python as your preferred tool:

  • Install Python (2.7 or 3.4) with numpy, scipy, matplotlib, scikit-learn, pandas
  • Install Jupyter notebooks
  • In addition, Dato (Graphlab), vowpal-wabbit, import.io are some interesting additional libraries
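
Once the libraries above are installed, a quick check can confirm that they are importable. The snippet below is a minimal sketch, not an official installer check; the helper name `check_packages` is made up for illustration, and note that scikit-learn is imported under the name `sklearn`.

```python
import importlib.util

# Import names of the libraries listed above (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "scipy", "matplotlib", "sklearn", "pandas"]


def check_packages(names):
    """Return a dict mapping each import name to True if it can be found."""
    # check_packages is a hypothetical helper written for this example.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, installed in check_packages(REQUIRED).items():
        print(f"{name:12s} {'OK' if installed else 'MISSING'}")
```

Anything reported as MISSING can then be installed with your package manager of choice before you move on.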

If you have chosen R as your preferred tool:

  • RStudio is a good choice of environment.
  • Install important packages like dplyr, lubridate, ggplot2, caret, randomForest etc.

Your machine is ready to crunch some numbers now!

 

Step 2: Getting used to solving ambiguous problems (i.e. Developing Structured Thinking)

Structured thinking is a tacit requirement, yet it is strongly sought after in every data scientist. Below are some good resources to enhance this skill.

Assignment: Solve this case study on operational analytics: Call Center Optimization. It’s a beginner level assignment. Once you complete it successfully, you can move to medium and advanced level.

Think you are ready for the next level? Try out our practice problem on strategic thinking.

 

Step 3: Basics of Math & Statistics

Mathematics plays an important role in defining data science. Thankfully, you don’t need to learn all of math, just a few topics would do. You can start from the basic topics (marked mandatory) and pick up the rest of the topics as you progress:

Assignment: Do a statistical analysis on the Big Mart Practice Problem. After you have finished this assignment, you can showcase it on your LinkedIn profile as project work.
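
To give a feel for what a basic statistical analysis looks like in code, here is a minimal sketch using only Python's standard `statistics` module. The sales figures are made up for illustration and are not taken from the Big Mart data; the helper `summarize` is likewise hypothetical.

```python
import statistics

# Hypothetical weekly sales figures for one store; made up for illustration only.
sales = [120.0, 135.5, 98.0, 150.25, 110.0, 142.0, 105.5]


def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


summary = summarize(sales)
for key, value in summary.items():
    print(f"{key}: {value:.2f}")
```

On a real dataset you would compute the same measures per column and then ask why they look the way they do.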

 

Step 4: Basics of new tools (R & Python)

After step 3, you should move on to programming. Code in data science tends to be concise: the best practitioners avoid redundant lines and look for ways to make their code faster. Your prior knowledge of programming basics should give you a nice head start in solving practical problems using R or Python.

Get used to the basics of R / Python from any of the following introductory courses:

Apart from these two tools, you can also use Julia, Go or Java to build predictive models. However, a possible drawback of other programming languages is weaker community support. So far, Python and R have the best community support on the web, which helps you debug issues and learn faster.

Assignment: Use the practice links given above.

 

Step 5: Get your hands dirty – Your First Project

Time to get your hands on your first project. Like programming, the best way to learn data science is to do data science. Hence, let us start by taking up a problem to work on. You can choose any of our Practice problems or any of the projects mentioned here to start with. Perform an exploratory analysis on the data to get you started.

Here are a few good places to look at:

Assignment: Perform a similar analysis on the project of your choice.
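
If you are unsure what a first exploratory pass looks like, the sketch below shows one possible starting point with pandas. The column names and values are invented stand-ins for whichever practice dataset you choose, so treat it as an outline under those assumptions rather than a prescribed method.

```python
import pandas as pd

# Made-up records standing in for a real practice dataset.
df = pd.DataFrame({
    "outlet_type": ["grocery", "supermarket", "supermarket", "grocery", "supermarket"],
    "item_price": [45.0, 120.5, None, 60.0, 99.9],
    "units_sold": [10, 32, 25, 8, 40],
})

print(df.dtypes)        # column types
print(df.isna().sum())  # number of missing values per column
print(df.describe())    # summary statistics for the numeric columns

# Average units sold per outlet type: a typical first "slice" of the data.
avg_units = df.groupby("outlet_type")["units_sold"].mean()
print(avg_units)
```

From here, the usual next steps are plotting distributions and deciding how to handle the missing values.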

 

Step 6: Follow the steps in learning path

Now that you have tasted blood, go for the kill! Check out our learning paths on R & Python – follow them step by step. Skip the steps you are comfortable with. Do as many exercises as you can!

Here are some additional machine learning resources you can look at. Remember that the best way to do data science is to learn data science thoroughly:

Assignment: Build a machine learning model on the Loan Prediction Problem using the following algorithms:

  • Logistic Regression
  • Decision Tree
  • Random Forest

For Decision Tree and Random Forest, you can seek help from: Complete Guide on Tree Based Modeling.

Make sure you understand how each of these algorithms works. Just implementing them and obtaining predictions wouldn’t be a success for you. The real success lies in gaining knowledge of how they work!
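
As a starting point, the three algorithms can be fitted and compared in a few lines with scikit-learn. This is a minimal sketch on synthetic data generated by `make_classification`, not on the actual Loan Prediction data; the hyperparameters are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for a real loan dataset.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # accuracy on held-out data
    print(f"{name}: {scores[name]:.3f}")
```

Getting numbers out is the easy part; the assignment is to be able to explain why the three models score differently.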

 

Step 7: Start participating in the competitions

Time to step onto the battleground. A benefit of being part of the analytics community is that you get access to many exciting ways of learning concepts. You no longer need to stick to traditional ways of learning.

Competitions: Several data science competitions are organized across the globe, where you can participate and win prizes too. After you’ve completed the steps above, you should participate in these competitions to assess your learning level.

Assignment: Within 6 months to 1 year, try to rank in the top 100 in the competition rankings on both websites. This would give a massive boost to your profile.

 

Mistakes You Should Avoid While Learning

We are prone to making mistakes (unknowingly) in the pursuit of learning concepts quickly. Nothing to worry about; we are all susceptible to such things. But we need to be prudent enough to assess our learning pace and proceed accordingly. Below is the list of mistakes (bad practices) you should avoid while completing this learning path:

  1. Choosing difficult problems: Working on difficult problems during the initial days won’t make you a kick-ass data scientist. You’d only end up torturing yourself. Follow a natural pace. Start with simple problems such as the assignments above. They will build your much-needed confidence.
  2. Learning too many concepts: You might get overwhelmed by the availability of online resources. Don’t get swayed by their abundance. You don’t have to be a “jack of all trades and master of none”. Become a master. Take one tool or one concept at a time. Once you are confident about it, move to the next.
  3. Losing your way in coding: You might have done die-hard coding over the last few years. But data science is a bit different. You are no longer writing software; you are using it. Don’t waste time writing algorithms from scratch. Use libraries (or packages). R and Python both enjoy incredible support from powerful libraries and packages. Use them. In short, figure out ways to do more in fewer lines of code.
  4. Getting impatient: This is the most important of all. I’ve met people who claim to have spent hours learning data science, yet haven’t got a break in this industry. You need to stay patient. After all, good things take time. You’ve got access to amazing community support on Discuss. Engage more. Share your queries. Let people know what you are up to. Remember, you are not alone in this; you are with us.
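
To illustrate the third point, the sketch below computes the same group-wise average twice: once with a hand-written loop and once with pandas. The records are made up for illustration.

```python
from collections import defaultdict

import pandas as pd

# Made-up (region, amount) records for illustration.
records = [("north", 10), ("south", 7), ("north", 14), ("south", 9)]

# Hand-written version: accumulate sums and counts, then divide.
totals, counts = defaultdict(float), defaultdict(int)
for region, amount in records:
    totals[region] += amount
    counts[region] += 1
manual_means = {region: totals[region] / counts[region] for region in totals}

# Library version: the same group-wise mean in one line with pandas.
df = pd.DataFrame(records, columns=["region", "amount"])
pandas_means = df.groupby("region")["amount"].mean().to_dict()

print(manual_means)
print(pandas_means)
```

Both versions give the same averages; the library version is shorter and is usually the one other analysts expect to read.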

 

Finally, a peek into the Life of Data Scientists at Companies

Just for some motivation…

Video: https://www.youtube.com/embed/7mXFkvu98vI

Video: https://www.youtube.com/embed/doySgcGFZnc

 

End Notes

I hope this guide will help people from an IT / software development background take up data science / machine learning as a career option. In summary: rely on your strengths, focus on developing structured thinking and problem solving, practice a lot and get your hands dirty on as many real-life problems as possible. If you get stuck along the way, lean on the communities and the people in your network to help you out.

As usual, if you have any questions, or suggestions I might have missed, feel free to reach out to us through the comments below.

You can test your skills and knowledge. Check out Live Competitions and compete with the best data scientists from all over the world.

Author
Kunal is a postgraduate from IIT Bombay in Aerospace Engineering. He has spent more than 8 years in the field of data science. His work experience ranges from mature markets like the UK to a developing market like India. During this period he has led teams of various sizes and has worked with various tools like SAS, SPSS, Qlikview, R, Python, Adobe Insight (Omniture) and Matlab.