DevOps and data science might not seem related, but a DevOps approach can help you formulate good data science questions.
The roles of software engineering and DevOps have been separated for years: software engineers write applications, and DevOps is responsible for deploying and maintaining software in production environments. The same divide exists between DevOps and data scientists. Separating the data scientist, who is responsible for developing models, from the people who develop and deploy software is an obstacle to operationalizing algorithms.
The data science process can be split into several phases. The typical workflow is to define questions; get the data (often a difficult ask); explore and validate the data; build a model; perform an analysis; generate outputs, e.g. a model, a visualization, new KPIs or a new process; deploy the result; and evaluate. The first three activities look linear, but in practice they loop. Once you reach step three, exploring and validating the data, you might be pushed back to step two to look for more data. That initial exploration might also force you to refine or change the question, because you find you need additional data or information. Among the most painful things in data science are getting data in the first place and waiting to have enough data to answer a question.
Using a DevOps Approach to Construct Good Data Science Questions
Formulating useful data science questions on the first try can be challenging for a number of reasons. One to watch out for is an organizational lack of understanding of how to form a good data science question. An approach that borrows from DevOps best practices might start by stating the goal: getting through question ‘development’ to a deployed state (a model) and getting results that can be ‘operationalized’ (collect results and report), as quickly as possible.
Iterating quickly on questions is useful, but it is difficult without access to raw data that has not yet been reformatted or filtered. If IT and DevOps teams use a flexible data management platform that supports common log formats and retains longer ranges of historical data, the task is simplified without burdening IT. If the data needed to answer the question resides in multiple systems, functionality such as a correlation engine that can join data sources and create materialized views of combined data sets is helpful. Tools that allow rapid exploration of data and provide indicators that the data looks correct also help data scientists quickly assess the quality of their questions.
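As a rough illustration of joining data that lives in two systems and getting a quick indicator that the result looks right, here is a minimal Python sketch. The record layouts, the session_id key and the join_on_key helper are assumptions made up for this example; they do not come from any particular platform.

```python
def join_on_key(left, right, key):
    """Inner-join two lists of dicts on `key`; also return left rows with no match."""
    index = {}
    for row in right:
        index.setdefault(row[key], []).append(row)

    joined, unmatched = [], []
    for row in left:
        matches = index.get(row[key])
        if not matches:
            unmatched.append(row)
            continue
        for match in matches:
            # Fields from the left row win if both sides share a field name.
            joined.append({**match, **row})
    return joined, unmatched


# Hypothetical data from two separate systems.
web_events = [
    {"session_id": "s1", "page": "/pricing"},
    {"session_id": "s2", "page": "/docs"},
    {"session_id": "s3", "page": "/signup"},
]
crm_sessions = [
    {"session_id": "s1", "account": "acme"},
    {"session_id": "s3", "account": "globex"},
]

combined, unmatched = join_on_key(web_events, crm_sessions, "session_id")

# A simple "does this look right?" indicator: share of left rows that found a match.
match_rate = 1 - len(unmatched) / len(web_events)
print(f"{len(combined)} joined rows, match rate {match_rate:.0%}")
```

A low match rate is an early hint that the question cannot be answered with the combined data yet, which is cheaper to learn here than after a model has been built.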
Another challenge is working with unproven data. It’s tough to know in advance whether a dataset has limitations. Perhaps the data doesn’t exist: the company keeps only a week of hot data, and the question requires a longer range of historical data. Attributes may be missing. For example, the data may come from weblogs that show user behavior, but the user ID is missing. Or time frames may be missing, so there’s no data for Tuesdays. Another common challenge is that data exists in multiple systems, and IT needs to use ETL to glue it together.
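Limitations like these can often be surfaced with a small profiling pass before any modeling starts. The sketch below is one way to do that in Python, assuming weblog records arrive as dicts with an ISO 8601 timestamp and a user_id field; the field names, the sample rows and the profile_weblog helper are all hypothetical.

```python
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def profile_weblog(records):
    """Report missing user IDs, weekdays with no data, and the time span covered."""
    timestamps = [datetime.fromisoformat(r["timestamp"]) for r in records]
    seen_days = {WEEKDAYS[t.weekday()] for t in timestamps}
    return {
        "rows": len(records),
        "missing_user_id": sum(1 for r in records if not r.get("user_id")),
        "missing_weekdays": [d for d in WEEKDAYS if d not in seen_days],
        "span_days": (max(timestamps) - min(timestamps)).days if timestamps else 0,
    }


# Made-up sample: one row lacks a user ID and no row falls on a Tuesday.
sample = [
    {"timestamp": "2024-01-01T09:00:00", "user_id": "u1"},
    {"timestamp": "2024-01-03T10:30:00", "user_id": None},
    {"timestamp": "2024-01-04T11:00:00", "user_id": "u2"},
    {"timestamp": "2024-01-05T08:15:00", "user_id": "u1"},
    {"timestamp": "2024-01-06T14:00:00", "user_id": "u3"},
    {"timestamp": "2024-01-07T16:45:00", "user_id": "u2"},
    {"timestamp": "2024-01-08T12:00:00", "user_id": "u1"},
]

report = profile_weblog(sample)
print(report)
```

Comparing span_days with the history the question needs shows quickly whether a week of hot data is enough.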
A DevOps-influenced way to address these data issues is to find the right toolset for sourcing data, for example a flexible data management platform, to reduce the natural friction in iterating on data science questions.
A DevOps mindset will also enable data scientists to define and refine questions quickly through rapid interactive exploration of the dataset. Once iterating on the question by testing it against datasets becomes a repeatable process, automation of the data science process, the goal of DevOps, is closer to reality. Speeding up this loop helps you define your question more quickly.
The next time-consuming activity is deploying and productizing the model. This typically involves an engineering handoff, where the model developed by hackers and data scientists is given to engineering to build a production-ready workload. Data scientists using a DevOps approach can smooth this process by instrumenting their models and ensuring that input data from the product is transformed to be appropriate for the model. Be clear on the assumptions you’ve made about how quickly the model has to run. If the model is perfect but takes 10 seconds to make a prediction and can’t scale, or if it needs to be highly available, be prepared to work with the DevOps and site engineering teams to get A/B testing going. Specify how you’re going to act on model outputs, for example by specifying a platform with alerting functionality that can trigger events based on alerts.
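One way to make those assumptions explicit is to wrap the model so that the input transformation, the latency budget and the alert hook are all visible in code. The following Python sketch is illustrative only: the InstrumentedModel class, the to_features transform, the event fields and the budget value are assumptions for this example, and the "model" is a stand-in function.

```python
import time


class InstrumentedModel:
    """Wraps a prediction function with input transformation, timing and alerting."""

    def __init__(self, predict_fn, transform, latency_budget_s, on_alert):
        self.predict_fn = predict_fn
        self.transform = transform
        self.latency_budget_s = latency_budget_s
        self.on_alert = on_alert
        self.latencies = []  # seconds per call, kept for later reporting

    def predict(self, raw_input):
        start = time.perf_counter()
        features = self.transform(raw_input)
        result = self.predict_fn(features)
        elapsed = time.perf_counter() - start
        self.latencies.append(elapsed)
        if elapsed > self.latency_budget_s:
            self.on_alert(
                f"prediction took {elapsed:.3f}s, "
                f"budget is {self.latency_budget_s:.3f}s"
            )
        return result


def to_features(event):
    # Hypothetical product event -> model input: cents to dollars, count to float.
    return [event["cart_cents"] / 100.0, float(event["items"])]


def slow_model(features):
    time.sleep(0.05)  # stand-in for an expensive model
    return features[0] > 50.0


alerts = []  # in production this would be the alerting platform's hook
model = InstrumentedModel(slow_model, to_features,
                          latency_budget_s=0.01, on_alert=alerts.append)
prediction = model.predict({"cart_cents": 7500, "items": 3})
print(prediction, alerts)
```

Handing engineering a wrapper like this, rather than a bare model, documents how fast the model is expected to run and what should happen when it does not.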
Another area where taking a page from the DevOps playbook can help is containerization. Models and their required supporting libraries can be deployed using containers, which isolates the model under development, its datasets and its workflow from production, and frees the DevOps team from having to manage the data science process.
Next is the evaluation phase, when DevOps has fully deployed your model and you want to collect data and evaluate how good it is. Specifying the right analytics platform or toolset at the outset of the process ensures you capture both your predictions and the actual results. With both in hand, you can explore and understand the results on the platform.
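Once predictions and actual outcomes are both captured, the comparison itself can be quite small. The Python sketch below assumes binary predictions and outcomes keyed by a shared ID; the evaluate helper, the IDs and the values are invented for illustration, and a real evaluation would likely use whatever metrics the analytics platform provides.

```python
def evaluate(predictions, actuals):
    """Compare predicted and actual boolean labels for IDs present in both dicts."""
    scored = predictions.keys() & actuals.keys()
    tp = sum(1 for k in scored if predictions[k] and actuals[k])
    fp = sum(1 for k in scored if predictions[k] and not actuals[k])
    fn = sum(1 for k in scored if not predictions[k] and actuals[k])
    tn = len(scored) - tp - fp - fn
    return {
        "scored": len(scored),
        # Predictions whose real-world outcome has not arrived yet.
        "awaiting_outcome": len(predictions.keys() - actuals.keys()),
        "accuracy": (tp + tn) / len(scored) if scored else None,
        "precision": tp / (tp + fp) if (tp + fp) else None,
        "recall": tp / (tp + fn) if (tp + fn) else None,
    }


# Hypothetical logged predictions and the outcomes observed so far.
predictions = {"a": True, "b": True, "c": False, "d": False, "e": True}
actuals = {"a": True, "b": False, "c": False, "d": True}

metrics = evaluate(predictions, actuals)
print(metrics)
```

The awaiting_outcome count is worth tracking alongside the metrics, since a model can look better or worse than it is while many outcomes are still missing.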
DevOps has mastered the intricacies of change control, documentation, expected outcomes, reliability testing and more. Learn from these best practices to develop sound data science processes.
Most data scientists spend the majority of their time getting access to data or trying to get their algorithms deployed. With better tooling and a DevOps point of view, this process can be improved. When DevOps and data scientists collaborate earlier in the process, it’s possible to ensure data flow pipelines get the same respect as a consumer-facing website.