Searching for a big data tool? You’ll want to make sure it allows you to embed results, can easily integrate with other apps and offers version control, among other capabilities.
George Leopold September 6, 2018
Big data analytics involves a complex process that can span business management, data scientists, developers and production teams. Crafting a new data analytics model is just one part of this elaborate process.
The following are 10 must-have features in big data analytics tools that can help reduce the effort required by data scientists to improve business results:
#1 Embeddable results
Big data analytics gain value when the insights gleaned from data models can help support decisions made while using other applications.
“It is of utmost importance to be able to incorporate these insights into a real-time decision-making process,” said Dheeraj Remella, chief technologist at VoltDB, an in-memory database provider.
These features should include the ability to create insights in a format that is easily embeddable into a decision-making platform, which should be able to apply these insights in a real-time stream of event data to make in-the-moment decisions.
#2 Data wrangling
Data scientists tend to spend a good deal of time cleaning, labeling and organizing data for data analytics. This involves seamless integration across disparate data sources and types, applications and APIs, cleansing data, and providing granular, role-based, secure access to the data.
Big data analytics tools must support the full spectrum of data types, protocols and integration scenarios to speed up and simplify these data wrangling steps, said Joe Lichtenberg, director of marketing for data platforms at InterSystems, a database provider.
#3 Data exploration
Data analytics frequently involves an ad hoc discovery and exploration phase of the underlying data. This exploration helps organizations understand the business context of a problem and formulate better analytic questions. Features that help streamline this process can reduce the effort involved in testing new hypotheses about the data to weed out bad ones faster and streamline the discovery of useful connections buried in the data.
Strong visualization capabilities can also help this data exploration process.
#4 Support for different analytics
There are a wide variety of approaches for putting data analytics results into production, including business intelligence, predictive analytics, real-time analytics and machine learning. Each approach provides a different kind of value to the business. Good big data analytics tools should be functional and flexible enough to support these different use cases with minimal effort or the retraining that might be involved when adopting different tools
Data scientists typically have the luxury of developing and testing different data models on small data sets for long durations. But the resulting analytics models need to run economically and often must deliver results quickly. This requires that these models support high levels of scale for ingesting data and working with large data sets in production without exorbitant hardware or cloud service costs.
“A tool that scales an algorithm from small data sets to large with minimal effort is also critical,” said Eduardo Franco, data science lead at Descartes Labs, a predictive analytics company. “So much time and effort is spent in making this transition, so automating this is a huge help.”
#6 Version control
In a large data analytics project, several individuals may be involved in adjusting the data analytics model parameters. Some of these changes may initially look promising, but they can create unexpected problems when pushed into production.
Version control built into big data analytics tools can improve the ability to track these changes. If problems emerge later, it can also make it easier to roll back an analytics model to a previous version that worked better.
“Without version control, one change made by a single developer can result in a breakdown of all that was already created,” said Charles Amick, vice president of data science at Devo USA, a data operations platform provider.
#7 Simple integration
The less time data scientists and developers spend customizing integrations to process data sources and connect with applications, the more time they can spend improving data analytic models and applications.
Simple integrations also make it easier to share results with other developers and data scientists. Data analytics tools should support easy integration with existing enterprise and cloud applications and data warehouses.
#8 Data management
Big data analytics tools need a robust yet efficient data management platform to ensure continuity and standardization across all deliverables, said Tim Lafferty, director of analytics at Velocity Group Development, a data analytics consultancy. As the magnitude of data increases, so does variability.
A robust data management platform can help an enterprise maintain a single source for truth, which is critical for a successful data initiative.
#9 Data governance
Data governance features are important for big data analytics tools to help enterprises stay compliant and secure. This includes being able to track the source and characteristics of the data sets used to build analytic models and to help secure and manage data used by data scientists and engineers. Data sets used to build models may introduce hidden biases that could create discrimination problems.
Data governance is especially crucial for sensitive data, such as protected health information and personally identifiable information that needs to comply with privacy regulations. Some tools now include the ability to pseudonymize data, allowing data scientists to build models based on personal information in compliance with regulations like GDPR.
#10 Data processing frameworks
Many big data analytics tools focus on either analytics or data processing. Some frameworks, like Apache Spark, support both. These enable developers and data scientists to use the same tools for real-time processing; complex extract, transform and load tasks; machine learning; reporting; and SQL. This is important because data science is a highly iterative process. A data scientist might create 100 models before arriving at one that is put into production. This iterative process often involves enriching the data to improve the results of the models.
Unified analytics tools help enterprises build data pipelines across a multitude of siloed data storage systems while training and modeling their solution in an iterative fashion, said Ali Ghodsi, CEO and co-founder of Databricks, a data analytics platform provider.