When it comes to big data analytics, some key industries, such as healthcare and financial services, have made great strides in standing up platforms to leverage their data effectively; oil & gas, however, is not among them.  A February 2016 survey conducted by DNV GL found that 52% of executives are aware of big data capabilities and see them as an opportunity. However, it also uncovered that only 25% have a clear big data strategy and are able to leverage it to boost productivity and value creation.  Given the concrete benefits of complementing existing architecture with big data, these numbers are astonishingly low.

Assuming that an advanced analytics platform is “the” solution, a savvy upstream executive might ask for an example of a specific problem:

  • Classic Question – “What is my Planned vs. Actual Production?”  This is a problem with well-understood relationships and variables. It can be adeptly handled by current infrastructure (i.e. a traditional, structured-to-structured relational database management system).  The business knows exactly what it needs, and IT knows which data sources to pull from and how to integrate them by leveraging a global schema.
  • Current Question – “What are the factors causing lost circulation?”  Lost circulation occurs when drilling fluid (i.e. mud) seeps out into a formation rather than returning up the annulus.  When lost circulation occurs, the conceptual understanding is in a state of flux. This type of problem can only be solved by a system that can handle semi-structured and unstructured data integration, combined with advanced analytics.  In this case, the business is unsure of which data it needs, thus IT is unable to effectively source relevant data, leaving any analysis at a dead end.

What if we tried to solve the latter question in the classic, Enterprise Data Warehouse (EDW) world?  The project could linger for months as intelligent people worked to identify relevant data sources and create a global schema for integrating information from disparate sources.  Time, budget and resource needs would skyrocket, ultimately resulting in a high-risk venture, viewed skeptically by leadership. If done in the big data world, all potential sources of data could be ingested into a data lake, which allows for schema-on-read capability.  Using sprint iterations, one could run hypotheses on the data, leveraging pattern recognition and multivariate correlation algorithms to determine whether the problem could be solved with those parameters.  If a hypothesis proves incorrect, a subsequent sprint can test a new one, which significantly mitigates risk.
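To make the sprint idea concrete, a single iteration might test whether a hypothesized set of drilling parameters correlates with fluid loss. The sketch below is illustrative only; the field names (mud_weight, rop, fluid_loss) and the numbers are hypothetical placeholders, not real well data.

```python
# One analysis "sprint": test whether hypothesized drilling parameters
# correlate with fluid loss. All field names and values are hypothetical.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y)

# Hypothetical per-interval drilling records (one row per drilled stand).
records = [
    {"mud_weight": 9.5,  "rop": 55, "fluid_loss": 2.1},
    {"mud_weight": 10.2, "rop": 48, "fluid_loss": 5.3},
    {"mud_weight": 10.8, "rop": 40, "fluid_loss": 9.0},
    {"mud_weight": 11.5, "rop": 35, "fluid_loss": 14.2},
]

loss = [r["fluid_loss"] for r in records]
for factor in ("mud_weight", "rop"):
    r = pearson([rec[factor] for rec in records], loss)
    print(f"{factor} vs fluid_loss: r = {r:+.2f}")
```

If the correlations are weak, the next sprint swaps in a different parameter set against the same lake, rather than re-architecting a schema.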
This big data solution engine rests on a data lake.  “…Think of a data mart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples” (James Dixon, Pentaho CTO).  This model allows structured and unstructured data to flow in, serves as a staging area for a consuming entity (e.g. a data warehouse or SQL staging database), enables data streaming for real-time analytics and provides a discovery ground for Data Scientists to test theories.
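The schema-on-read idea behind the lake can be sketched in a few lines: raw records land untouched, and each consumer applies only the structure it needs at read time. The record contents and field names (well_id, mud_loss_bbl) below are hypothetical.

```python
# Minimal illustration of schema-on-read: raw records land in the "lake"
# untouched; structure is applied only when a consumer reads them.
import json

# Heterogeneous raw records, ingested as-is -- no upfront global schema.
lake = [
    '{"well_id": "W-01", "depth_ft": 8500, "mud_loss_bbl": 12}',
    '{"well_id": "W-02", "note": "daily report attached", "mud_loss_bbl": 3}',
    '{"sensor": "pit-volume", "reading": 410.2}',
]

def read_with_schema(raw_records, fields):
    """Project only the fields a given consumer cares about (schema-on-read)."""
    for raw in raw_records:
        rec = json.loads(raw)
        if all(f in rec for f in fields):
            yield {f: rec[f] for f in fields}

# A loss-analysis consumer applies its own schema at read time; a different
# consumer can apply a different schema to the very same raw data.
loss_view = list(read_with_schema(lake, ("well_id", "mud_loss_bbl")))
print(loss_view)
```

The sensor record that lacks the requested fields is simply skipped by this consumer, yet it remains in the lake, intact, for a streaming consumer with a different schema.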
When contemplating potential use-cases and justifying a big data investment in upstream, four themes emerge, and the problems requiring a big data solution typically fall into one of these categories:

    • Unstructured
      Traditional EDWs are not designed for processing unstructured data, while big data platforms, such as Hadoop, can store, manage and analyze a wide variety of data formats, such as PDFs, video, audio, well logs, etc., all of which are prevalent in upstream.  The ability to ingest unstructured data expands the world of potentially solvable problems.
    • Data Science
      Searching, exploring, discovering, analyzing and modeling newfound data is the Data Scientist’s dream.  A big data platform allows one to run algorithms on top of the platform with existing tools (e.g. RStudio).  Doing so streamlines workflows and enables an iterative/Agile approach for ad hoc, what-if analysis of unknowns.
    • Volume/Velocity
      Hadoop scales by adding nodes to the cluster, increasing both storage capacity and distributed processing power in step with demand. Thus, ingesting high-volume (e.g. seismic) data and streaming real-time (e.g. WITSML) data can be accomplished with relative ease.
    • Integration
      Hadoop facilitates schema-on-read vs. schema-on-write, which enhances the Agile working environment.  Ingesting all data sources into a data lake and mixing and matching data from disparate sources can yield cross-system insights that enrich existing Business Intelligence tools and sharpen decision-making.
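The Volume/Velocity theme above can be illustrated with a simple streaming computation: a rolling average over a real-time feed, as one might run against a WITSML mud-pit volume stream. The readings and window size here are hypothetical placeholders.

```python
# Sketch of real-time stream analytics: a rolling average over incoming
# sensor readings. Values and window size are hypothetical.
from collections import deque

def rolling_mean(stream, window):
    """Yield the mean of the last `window` readings as each one arrives."""
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

# Hypothetical pit-volume readings: a sustained drop in the rolling average
# can flag lost circulation far earlier than an end-of-day report.
readings = [420.0, 419.5, 420.1, 405.0, 390.2, 376.8]
for value, avg in zip(readings, rolling_mean(readings, window=3)):
    print(f"reading={value:6.1f}  3-pt avg={avg:6.1f}")
```

In a production cluster the same logic would run over a distributed stream (e.g. via a Hadoop-ecosystem streaming engine) rather than an in-memory list, but the schema-light, as-it-arrives pattern is the same.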


Architectures across myriad industry landscapes are changing.  Upstream oil and gas, though presented with a specific set of challenges, will be no different.  Considering the recent downturn in USD/bbl, realizing cost efficiencies in upstream is critical.  The nature of problems surfacing in upstream necessitates a more robust solution for solving multifaceted and complex challenges.  Though a big data platform purchase entails upfront capital costs, and proper governance must be in place to ensure success, a well-built business case will typically demonstrate a positive NPV for taking the plunge into the data lake.