One of the current trends in the tech world is the growing adoption of Data Science across companies. Data Science has much of its background in academia, so when bringing it into companies, one of the main questions that arises is: how should we manage data science projects?
One natural way to answer this question is to use what has been working for software engineering lately: agile development methodologies and frameworks. However, if you look at articles posted online, watch conference talks, or talk with practitioners, something seems to be off: those techniques sometimes don't seem to fully work for managing data science projects.
Some of those problems could be attributed to the maturity of the fields: in the business context, software engineering has been around for much longer than Data Science, so people have had more time to test new approaches to project management and find what works best. This is true, but I also think there are some fundamental differences between Software Engineering and Data Science projects that demand new management approaches.
Before talking about those differences, there is an important distinction to be made. When speaking about Data Science projects, it is useful to think about two stages: the exploration stage (where the business problem is defined, the data is analyzed and understood, and experiments are created) and the product stage (where we turn the insights from the first stage into data products). The product stage is very similar to a software engineering project, so we can mostly use the same management techniques. When talking about the differences between the fields, however, I mean mostly the exploration stage, in which I found three main differences:
The first difference is estimation. In a software development project, an experienced developer can usually grasp the problem and break it down into smaller tasks reasonably well. Estimating time is always hard, but teams that have done similar tasks over time get better at predicting it. For Data Science projects, however, the problem is bigger, because the estimate depends not only on the task itself but also on the data available (which is never the same from one project to another). For example, the effort to create an image classification product with a given accuracy depends on the complexity of building an image classification model, but the achievable accuracy depends heavily on the size and quality of the data.
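The dependence on data can be made concrete with a toy experiment. The sketch below (standard library only) trains the same deliberately simple model, a from-scratch 1-nearest-neighbour classifier standing in for a real one, on the same synthetic two-class task, varying only the amount of training data. Everything here (the Gaussian clusters, the sample sizes) is an illustrative assumption, not a real project's data:

```python
import random

random.seed(42)

def make_points(n, cx, cy, label):
    """Sample n points around centre (cx, cy) with the given class label."""
    return [((random.gauss(cx, 1.0), random.gauss(cy, 1.0)), label)
            for _ in range(n)]

def one_nn_predict(train, x):
    """Predict the label of the closest training point (1-nearest-neighbour)."""
    nearest = min(train, key=lambda p: (p[0][0] - x[0]) ** 2 + (p[0][1] - x[1]) ** 2)
    return nearest[1]

def accuracy(train, test):
    """Fraction of test points whose predicted label matches the true one."""
    return sum(one_nn_predict(train, x) == y for x, y in test) / len(test)

# Two classes centred at (0, 0) and (3, 3); same task, same model,
# different amounts of training data.
test_set = make_points(100, 0, 0, "a") + make_points(100, 3, 3, "b")
small_train = make_points(5, 0, 0, "a") + make_points(5, 3, 3, "b")
large_train = make_points(200, 0, 0, "a") + make_points(200, 3, 3, "b")

print(f"accuracy with  10 training points: {accuracy(small_train, test_set):.2f}")
print(f"accuracy with 400 training points: {accuracy(large_train, test_set):.2f}")
```

The algorithm and its cost are fixed and known in advance; the number that the business cares about is not, because it depends on data the team only discovers along the way.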
The second difference is the cyclical nature of the work. The usual idea when tackling a big problem is to divide it into smaller problems and then create tasks that, done in some order, should solve it. The problem with creating data science tasks is that the work is cyclical by nature, which makes creating an ordered list of tasks, and keeping track of it, hard.
One way to visualize this problem is to imagine a common workflow: first you get the data, then you create new features, and finally you train the model. However, most of the time this first model is not good enough. So what should you do? Train the same model further, or test new ones (and in that case, do you keep the same “Train model” task or create a new one)? Another alternative is to go back to the feature engineering task and create new features, but then should you create a new “Create features” task? And for how long should you repeat this process? Finally, the bigger problem with this workflow is that it happens very fast: sometimes within the same day or week the team has to go back and forth between these cyclical tasks.
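The workflow above is a loop, not a list. A minimal sketch of that control flow, with hypothetical placeholder functions standing in for the real work (the hard-coded improving score exists only to keep the sketch runnable; a real project has no such guarantee):

```python
def engineer_features(data, iteration):
    """Hypothetical placeholder for the "create features" task."""
    return {"name": f"feature_set_v{iteration}", "data": data}

def train_and_evaluate(features, iteration):
    """Hypothetical placeholder for the "train model" task. The score
    improves each round only so the sketch terminates."""
    return (70 + 5 * iteration) / 100

def exploration_loop(data, target=0.90, max_iterations=10):
    """Cycle between feature engineering and training until the metric
    is good enough or the experimentation budget runs out."""
    history = []
    for i in range(1, max_iterations + 1):
        features = engineer_features(data, i)    # back to "create features"
        score = train_and_evaluate(features, i)  # back to "train model"
        history.append((features["name"], score))
        if score >= target:                      # good enough: stop the cycle
            break
    return history

history = exploration_loop(data="raw_table")
for name, score in history:
    print(name, score)
```

Each pass through the loop revisits tasks a linear board would consider “done”, which is exactly what makes the ordered-task-list model a poor fit here.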
The third difference is changing business objectives. This problem also occurs in software engineering, but I think the changes are more frequent in data science projects, given that the problems we are trying to solve are by nature more ill-defined and vague, and many times the question “is it technically feasible to solve this problem?” is impossible to answer before working with the data and testing some models. For example, the problem of “training a model that predicts the type of a document with 97% accuracy” may be impossible to solve (with the desired accuracy) given the data available to the company and the current state of the art. However, the team will only be able to answer this question after a lot of work and experimentation. And what happens if the problem is, in fact, impossible to solve? Sometimes we can accept a lower accuracy, but at other times the best business decision may be to try to solve a different problem, which could change the planning of the current sprint a lot.
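One way to make the “accept lower accuracy or pivot” choice explicit is to time-box a round of experiments and then compare the best result against both the business target and an agreed-upon floor. The sketch below is a hedged illustration of that decision logic; the function, thresholds, and scores are all hypothetical, not a real project's numbers:

```python
def decide(best_accuracy, target, acceptable_floor):
    """Turn the outcome of a time-boxed round of experiments into an
    explicit business decision (hypothetical decision rule)."""
    if best_accuracy >= target:
        return "ship: target met"
    if best_accuracy >= acceptable_floor:
        return "negotiate: accept a lower accuracy?"
    return "pivot: try to solve a different problem"

# Hypothetical scores from a time-boxed round of experiments.
experiment_scores = [0.81, 0.88, 0.90]
decision = decide(max(experiment_scores), target=0.97, acceptable_floor=0.85)
print(decision)
```

The point is not the thresholds themselves but that the “is it feasible?” question gets an explicit answer at a fixed point in time, instead of silently stretching the sprint.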
Given all those considerations, one could think that we should not try to apply agile methodologies to data science projects, but that is not a conclusion I agree with. If there are problems with agile, the problems would be even bigger with waterfall methodologies (and, obviously, not using any methodology to manage the project is not really a solution). So what should we do? We need to use agile while making some changes to adapt it to Data Science's needs. In some future posts, I'll suggest adaptations that could be made, especially to Kanban and Scrum.
Featured Image from Freepik.