Before diving into this topic, let me shed some light on the root term on which the whole domain stands: data. Every day, the volume of data produced on Earth grows exponentially. From the simple text you sent on WhatsApp to the 200+ minutes of movies you streamed online, we are surrounded by data in every possible way. The problem is that these gigantic volumes of data are unorganized and unstructured, so they provide less useful information than they otherwise might. This is where data science comes to the rescue.
Understanding the term ‘data science’
Data science is an interdisciplinary field that spans programming, statistics, and, most importantly, domain knowledge. Simply put, data science is the process of drawing insights by manipulating data.
Who can learn or practice data science:
Anyone with at least a bachelor’s degree in engineering, science, or commerce and a fair knowledge of math, statistics, and programming can practice data science. If you come from another domain and want to switch to data science, that is the cherry on top: you may be able to leverage your existing domain knowledge to thrive in the field.
Math and stats, the building blocks of a data science career:
First things first: a promising data science career needs a strong grip on mathematics, statistics, and probability. This might seem like a waste of time, because once you start programming, most of those calculations are handled in the background by the software libraries or packages you use. However, this is where people tend to go wrong: they learn only the programming and execution part of the data science lifecycle. Mathematics and statistics are both crucial in model building. In addition, many of the decisions data scientists make during their careers rest mainly on statistical and mathematical intuition, where the coding itself seldom matters.
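To make the point about libraries working "in the background" concrete, here is a minimal sketch of what a line-fitting routine computes under the hood. It uses ordinary least squares for a straight line, written with only Python's standard library. The function name `fit_line` and the sample numbers are made up for illustration and are not taken from any particular package.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of co-deviations of x and y (numerator of the slope formula).
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    # Sum of squared deviations of x (denominator of the slope formula).
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical data: hours studied vs. score, chosen to lie on a straight line.
hours = [1, 2, 3, 4]
scores = [3, 5, 7, 9]
slope, intercept = fit_line(hours, scores)
print(slope, intercept)
```

Knowing that the slope is just the co-variation of x and y divided by the variation of x is the kind of intuition that helps you judge whether a model's output makes sense, whichever library produced it.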
Required programming languages
If you are starting your career in 2020, the Python programming language is a must: Python is considered easier than most other programming languages and has great community support, which keeps growing by the day. Along with Python, it is advisable to have a good grasp of R, the well-known statistical programming language.
Programming languages help you build machine learning and deep learning models, which draw insights from the data they are given as input. Before that, however, the data needs to be preprocessed. Various packages are dedicated to such tasks; you just need to use them effectively.
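As a rough illustration of what "preprocessing" can mean, the sketch below tidies a few hand-made records: it normalizes text, converts strings to numbers, and fills missing values with the column median. In practice you would most likely reach for a dedicated package; plain Python is used here only so the example runs without installing anything. The rows, column names, and helper functions are all hypothetical.

```python
from statistics import median

# Hypothetical raw records with stray whitespace, mixed case and missing fields.
raw_rows = [
    {"city": " Mumbai ", "age": "34", "income": "52000"},
    {"city": "delhi", "age": "", "income": "61000"},
    {"city": "Delhi", "age": "29", "income": None},
    {"city": "MUMBAI", "age": "41", "income": "75000"},
]


def to_number(value):
    """Return a float, or None when the field is empty or missing."""
    if value is None or str(value).strip() == "":
        return None
    return float(value)


def preprocess(rows):
    # Step 1: normalize text and convert numeric strings.
    cleaned = [
        {
            "city": row["city"].strip().lower(),
            "age": to_number(row["age"]),
            "income": to_number(row["income"]),
        }
        for row in rows
    ]
    # Step 2: fill each missing number with the median of the known values.
    for column in ("age", "income"):
        known = [r[column] for r in cleaned if r[column] is not None]
        fill = median(known)
        for r in cleaned:
            if r[column] is None:
                r[column] = fill
    return cleaned


clean_rows = preprocess(raw_rows)
for row in clean_rows:
    print(row)
```

Filling gaps with the median is only one of several reasonable choices; dropping incomplete rows or using the mean are common alternatives, and which one is right depends on the data and the problem.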
In India, the predominant programming language among data scientists is Python, although the gap between Python and R in usage is small. In other parts of the globe, Python far outpaces R.
Coding isn’t everything you need:
We have discussed how data scientists build machine learning models from data. The catch is that the amount of data involved can be incredibly high: we’re talking gigabytes! That is the notion of big data. As a data scientist, you might need to store data in vaults or warehouses. These technical vaults are known as “databases”. There are several database management systems (DBMS for short), such as MySQL, Microsoft SQL Server, PostgreSQL, and MongoDB. Learning how to query a few of them is very important and can turn out to be very beneficial to your career.
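To give a feel for what querying a database looks like, here is a small sketch using SQLite through Python's built-in `sqlite3` module. SQLite is not one of the systems named above; it is used here only because it ships with Python and needs no server. The basic SQL shown (`CREATE TABLE`, `INSERT`, `SELECT ... GROUP BY`) carries over with minor differences to relational systems such as MySQL and PostgreSQL, while MongoDB uses a different, document-based query style. The table, columns, and values are invented for the example.

```python
import sqlite3

# An in-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
# Parameterized inserts keep data separate from the SQL text.
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("asha", 120.0), ("ravi", 80.0), ("asha", 40.0), ("meera", 200.0)],
)
conn.commit()

# Aggregate per customer: number of orders and total spend, largest first.
query = """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for customer, n_orders, total in totals:
    print(customer, n_orders, total)
```

Letting the database do the grouping and summing, rather than pulling every row into your program first, is what keeps this approach workable when the table holds gigabytes instead of four rows.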
Apart from DBMS, model building, and statistics, another major part goes into the process: storytelling. Storytelling is considered one of the major soft skills you need, as it is the key to making the data make sense. To enhance this skill, you might need data visualization software such as Tableau, Power BI, and sometimes MS Excel, which help you present your story.
Completing the data science pipeline
Here are some steps to complete the data science pipeline:
- Understanding the problem statement
- Gathering the data
- Preprocessing/polishing the data according to our needs
- Data visualization and exploration
- Model building and experimenting
- Model deployment to tackle the problem statement
Check this Medium post to learn more about the data science pipeline.
The importance of domain knowledge in data science
Domain knowledge (also known as “subject matter expertise”) is what lets you apply the versatility of data science to a particular field. All data scientists generate insights from data and apply them to the specific use case of the domain they work in. In doing so, they produce better results and enhanced performance. Data science blends well with many industries, giving them better leverage over the data that surrounds them. Major examples include the medical industry, the IT sector, and many business firms.
Learning paths in Data Science:
There are many sub-domains of data science, for example the positions of data scientist, data analyst, machine learning engineer, and data engineer (to name only a few). The main issue, however, is that these sub-domains are not cleanly separated, because a data scientist often needs to work on a project from scratch through to the end. In that process, the boundaries between the categories can become blurred. Which sub-domain you choose depends on your interests. If you are interested in developing machine learning models, then becoming a machine learning engineer could be a good option for you, whereas if you are into data warehousing and like SQL querying, then being a data engineer could be a good fit. You can even take a software engineer job in an IT-based company and still find yourself occupying a data scientist role. With this being the case, it is a good idea not to turn down the opportunities you get in your career: at the very least, you will improve your domain knowledge.
Conclusion
As you can see, a lot goes into the field of data science because of the abstractness and versatility of the profession. One through line is that curiosity to learn will always stand you in good stead, so don’t let it fade away. Happy learning!
Translated from: https://medium.com/@siddhanth.n.b/taking-baby-steps-into-data-science-fdf113557bfb