This is part 5 of a multi-part series on the Analytics Operating Model.
As many organizations embark on the transition from “proof-of-concept” to production, selecting the optimal data platform is not a task for the faint of heart. According to Gartner, through the end of 2017, approximately 60 percent of big data projects will fail to go beyond piloting and will be abandoned. Compounding the problem, only 15 percent of those deployed projects will make it to production, compared with 14 percent in 2016. These numbers raise the question: Why can’t enterprises make the leap?
The traditional IT concept of maintaining infrastructures and applications for years no longer applies. As the needs of the enterprise mature, so should the ecosystem. Capabilities to handle real-time streaming data, in-flight transformation, and the shift from Extract-Transform-Load (ETL) to Extract-Load-Transform (ELT)/Schema-on-Read are no longer concepts but realities, and supporting them requires adopting newer tools and techniques.
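To make the ETL versus ELT/Schema-on-Read distinction a little more concrete, here is a minimal Python sketch. It assumes raw events are landed exactly as they arrive (load first) and that a schema is applied only when a consumer reads them. The names RAW_EVENTS, ORDER_SCHEMA, and read_with_schema are hypothetical and exist only for illustration; a real platform would use its own storage and query engine.

```python
import json

# Hypothetical raw events, landed as-is: the "load" step happens before any transform.
RAW_EVENTS = [
    '{"user": "a1", "amount": "19.99", "ts": "2017-03-01T10:00:00"}',
    '{"user": "b2", "amount": "5", "channel": "web"}',
    '{"user": "c3"}',
]

# A schema chosen by the reader at query time: field name -> (cast, default).
ORDER_SCHEMA = {"user": (str, ""), "amount": (float, 0.0)}


def read_with_schema(raw_lines, schema):
    """Apply a schema while reading; fields outside the schema are ignored."""
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: cast(record[name]) if name in record else default
            for name, (cast, default) in schema.items()
        }


# The transform happens here, at read time, not before the data was stored.
orders = list(read_with_schema(RAW_EVENTS, ORDER_SCHEMA))
total = sum(row["amount"] for row in orders)
```

Because the raw lines are never rewritten, a different consumer could read the same events with a different schema (for example, one that keeps the channel field) without re-ingesting anything.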
Selection of an enterprise tool or platform is common practice and is typically facilitated by enterprise domain/subject matter experts, based on requirements. However, the selection of Big Data tools and platforms has left the enterprise perplexed because the ecosystem is quite different. The tools and platforms that comprise the ecosystem have created a crowded landscape, especially in the last few years. Customers are no longer faced with finding a solution, but rather with determining which solution fits their short-term tactical requirements while hoping for a long-term strategic fit. The Big Data Landscape 2016 (Figure 1) breaks down the myriad tools and platforms into functional domains. Still, the selection and integration of point solutions continue to plague customers as they migrate from proof-of-concept to production.
Open-Source Limitations: The world of open-source tools and code provides an avenue for exploration, innovation, and problem solving free of charge. In the Big Data space, many of these solutions were developed to crack very specific technical challenges at places like Google, Yahoo, and Facebook. As a result, these tools lag in maturity in many areas of required enterprise functionality (e.g., usability, end-to-end integration, security, and back-up/recovery), which hinders adoption in mainstream IT. Vendors such as Cloudera, Hortonworks, and MapR have worked to overcome these shortcomings, but the ecosystem is evolving rapidly, and point-solution vendors are taking advantage of this, disrupting the harmonizing process.
Complex End-to-End Integration: Cobbling together point solutions presents a real integration challenge to data engineering and production operations. As mentioned above, point-solution vendors sell their products by targeting various stakeholders in the organization according to their differing priorities. Often, this leaves stakeholders unable to agree on the appropriate solution and/or platform for the centralized function. A complex spaghetti of tools emerges as the so-called “platform” and all synergies begin to fall apart. This fuels the fire, and many organizations cannot recover because they are unable to show value.
First and foremost, Big Data is not all about technology. A typical approach has been “build it (the platform) and let them come”. While this might have worked in the past, it no longer applies if the enterprise expects the various stakeholders to come together. The enterprise must step back and focus its initial efforts on driving a foundational “data strategy”. The strategy, as the name implies, must concentrate on how the business will derive value from data initiatives. This plan should ultimately define how the business will invest in the technologies that support its goals. A key facet of the strategy is defining the “data value chain”.
The “data value chain” is defined as the series of steps required to generate value and useful insights from data. Much has been written on the various stages of the data lifecycle, but in general, understanding this flow provides the necessary insight into the data friction points that can hinder progress. Enterprises recognize that data must be treated as an asset and have revisited their retention policies. By 2020, IDC predicts that the amount of data stored in the world will reach a staggering 44 zettabytes. With all this data being retained, the data value chain will become critical. Many organizations encounter friction (political, technical, process, diversity) when attempting to shift from siloed data stores to a centralized data lake. Although the data lake concept has been adopted, enterprises have encountered a steep learning curve regarding what information is stored in the platform, how it is stored, and in which format. As a result, only about 20 percent of enterprise data actually makes its way to the data lake, with only a small fraction of that data being useful for the data scientist. The data value chain addresses strategies at each step for overcoming a given friction point; this is key for building an agile data platform/architecture that will scale insights quickly.
Second, no single data platform is going to solve every business need and/or use case. In the beginning, Hadoop was the promise for solving all Big Data related projects. With the wide diversity of data streams, this is no longer the case. In more mature organizations, there can be upwards of five different platforms/solutions, much to the displeasure of production operations. But one should remember that the architecture is driven by the strategy to support business needs, not the other way around. There is no silver bullet here, so the focus must shift to the data pipelines, to standardizing how data ingestion is performed, and to a governance model that ensures data has the proper metadata and lineage support. Organizations must get data ingestion right to be successful and to break down the friction points. These practices will help enforce certain platform standards and tools on which the organization can build a successful foundation for the future.
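As one way to picture standardized ingestion with metadata and lineage support, consider the minimal Python sketch below. It is an illustration under stated assumptions, not a reference design: the IngestedRecord structure, the ingest and transform helpers, and the crm_export source name are all hypothetical. The idea is simply that every record enters the platform through one function that attaches the same metadata, and every later step appends itself to a lineage trail.

```python
import hashlib
import json
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IngestedRecord:
    """Hypothetical envelope that every ingested record shares."""
    payload: dict
    source: str
    ingested_at: str
    checksum: str
    lineage: list = field(default_factory=list)


def ingest(payload, source):
    """Single entry point: attach source, timestamp, and checksum metadata."""
    body = json.dumps(payload, sort_keys=True).encode("utf-8")
    return IngestedRecord(
        payload=payload,
        source=source,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        checksum=hashlib.sha256(body).hexdigest(),
        lineage=[f"ingested from {source}"],
    )


def transform(record, step_name, fn):
    """Apply fn to the payload and append the step to the lineage trail."""
    return IngestedRecord(
        payload=fn(record.payload),
        source=record.source,
        ingested_at=record.ingested_at,
        checksum=record.checksum,  # still the checksum of the original raw payload
        lineage=record.lineage + [step_name],
    )


raw = ingest({"customer": " Acme ", "revenue": "1200"}, source="crm_export")
clean = transform(
    raw,
    "trim_customer_name",
    lambda p: {**p, "customer": p["customer"].strip()},
)
```

In this sketch the raw record is left untouched and each transformation returns a new record, so the lineage list on clean shows where the data came from and which steps were applied to it.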
Finally, the data platform solution needs to be agile enough to scale out, not up. Traditionally, the enterprise has opted to maintain on-premises solutions. With data warehouses and data marts, data flows through a very pragmatic process that is predictable and repeatable. In big data, data is highly agile, experimental, ad-hoc, and workstation-driven. As business cases evolve and mature, the platform must be elastic enough to support rapid growth with dynamic workloads and real-time capabilities. On-premises solutions are difficult to scale out in a cost-effective manner because of physical barriers in the hardware itself, and this has pushed the shift toward cloud to critical mass. According to IDC, spending on cloud-based analytics technology will grow 4.5x faster through 2020 than spending on on-premises solutions, with open-source technology representing the core of this growth. Shifting to the cloud does not mean the enterprise should shy away from fully integrated providers, but rather that it should embrace some of the simplicities the cloud creates in the underlying integration.