In the previous blog post of this series, we discussed data science and the various problems it can solve. We also mentioned specializations in the data field that have become popular over the last few years. Although they are not the only ones, we will present our interpretation of three of the most popular data roles that companies seek nowadays: data scientists, machine learning engineers, and data engineers.
Data scientists are professionals with a strong background in statistics and programming who usually work as part of a team, with the role of generating valuable and sometimes complex analytical insights.
Much of a data scientist’s job involves understanding the business in depth, then creating analytical tools that optimize processes, reveal hidden problems, or help explain business details. These three pillars can involve a lot of research, experiments, and discussions with professionals from different areas.
Over the last decade, as computer processing capacity and big data advanced, artificial intelligence and machine learning have stood out. These fields consist precisely of methods developed to deal with large volumes of data, and they serve as the basis for many of the complex applications and discoveries we have today.
Depending on the application area, a data scientist may take on additional responsibilities: developing data pipelines so that the business in question can be analyzed further, or putting into production machine learning models that are highly relevant to solving a given problem. Both tasks can become very complex, which is why two other areas have become very popular in recent years: data engineering and machine learning engineering.
There is a problem with Data Scientists: sometimes they overthink. That is to be expected when so many of them come from an academic background and the Data Science job includes a lot of research. Now let’s face it: working with people full of ideas is excellent, but when all they have is ideas… well… then projects never get finished. Enter the Machine Learning Engineer: the doer.
The difference between a Data Scientist and a Machine Learning Engineer is the same as the difference between a scientist and an engineer: one thinks and develops, the other executes. This is not to say a machine learning engineer cannot do research and keep pace with the state of the art. It is just that their profile is more task oriented. When it comes to getting a project ready and keeping a client happy, it is good to know what moves the project forward and what holds it back.
The Machine Learning Engineer is the counterpart to the Data Scientist’s academic and scientific profile. Although their backgrounds can be related, machine learning engineers are the professionals who put things into production and optimize models as needed. Their job is to develop the last part of the data science stack. They might rewrite code from R to Scala, adapt POCs from Jupyter Notebooks into a deployable Flask application, and optimize it to run faster and process high volumes of data. A Machine Learning Engineer should also be able to keep improving deployed models, adapting them to new data, new frameworks, and even malicious attacks. For that effort, they combine knowledge from data science, software engineering, and data engineering.
Data engineers are more focused on programming and development than on researching solutions. They usually have strong programming backgrounds (especially in Python, Java, and Scala) and are specialists in distributed systems and high-performance computing.
A data engineer should have strong development and programming skills. Like any developer, they should not have direct access to the production environment (that is a job for system admins and DevOps). They can, however, be somewhat involved in the cluster’s operations and, in some cases, even take on some DevOps or DataOps work. Nevertheless, most of the technologies used by a data engineer do not require any DevOps or DataOps setup skills.
The main job of a data engineer is to create data pipelines, and knowing the right tools to use for each of them is the key to being a good data engineer. When dealing with big data, a developer must often bring together dozens of technologies. A data engineer must know how to combine all those technologies and frameworks to develop solutions using data pipelines.
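To make the idea of a data pipeline a little more concrete, here is a minimal sketch of the extract, transform, and load stages using only the Python standard library. Everything in it is an assumption made for illustration: the inline sample data, the column names, and the function names are hypothetical, and a real pipeline would typically read from files, queues, or databases and rely on distributed tooling rather than a single script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical raw input: in a real pipeline this would come from a file,
# a message queue, or a database rather than an inline string.
RAW_CSV = """customer,amount
alice,10.50
bob,
alice,4.50
carol,not_a_number
bob,20.00
"""


def extract(text):
    """Extract: yield one dict per CSV row."""
    yield from csv.DictReader(io.StringIO(text))


def transform(rows):
    """Transform: parse amounts and drop rows that are missing or malformed."""
    for row in rows:
        try:
            amount = float(row["amount"])
        except (TypeError, ValueError):
            continue  # skip bad records instead of failing the whole run
        yield row["customer"], amount


def load(records):
    """Load: aggregate totals per customer (stands in for a warehouse write)."""
    totals = defaultdict(float)
    for customer, amount in records:
        totals[customer] += amount
    return dict(totals)


def run_pipeline(text):
    """Chain the three stages; generators keep rows streaming between them."""
    return load(transform(extract(text)))


totals = run_pipeline(RAW_CSV)
print(totals)
```

Each stage is a small function that can be tested and replaced on its own, which mirrors, on a very small scale, how a data engineer combines separate tools into one pipeline.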
The once generalist definition of Data Scientist has been branching out into new specializations to meet the needs of a booming industry, and it will not stop evolving any time soon. Just as software engineering has evolved to meet new demands, so will data science.
In the next and final article of this series, we will discuss what the future holds for data science.