How We Extract Data Lineage from Large Data Warehouses


Enterprises have built massive data infrastructures to capture and manage their ever-growing mountains of data. But as data stores increase, the pipelines that carry precious information to the business become murkier, making the resulting data analysis less trustworthy. Informatica’s Enterprise Data Catalog (EDC) can help you shed light on data transformations along your pipelines, and LumenData has further built tools to extract and visualize additional data lineage to extend the use of Informatica EDC. Here’s how.


More Data Leads to More Transformations

Most organizations today aim to be data-driven. The complex data engineering and rapid development needed to get there have brought a new challenge to every CDO and CIO: how do we manage and visualize data movement across the organization? This challenge is exacerbated by the advent of massive data warehouses, where the tendency is to store all enterprise data “just in case we need it in the future.” Instead of extracting data from the source, transforming it into an intelligible, trustworthy form, and then loading it into the warehouse (the pattern known as ETL), many organizations extract data and load it straight into the warehouse with the intention of transforming it when they need it, a pattern known as ELT.


When data is required for a particular analytic purpose, enterprise developers read the raw data from the warehouse, transform it into the form needed, and store it back in the warehouse or push it to another destination. They use data pipelines to perform these calculations (“transformations”): a pipeline step reads from one data set, applies algorithms to the data, and stores the intermediate result in the warehouse. Further steps then repeat this with more transformations, as many times as needed, until the final data is in the desired form.
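
To make the read-transform-store cycle concrete, here is a minimal sketch in plain Python. The in-memory "warehouse" dictionary, the table names, and the two step functions are all hypothetical and exist only for illustration; a real pipeline would run on Spark or a similar engine against actual warehouse tables.

```python
from collections import defaultdict

# Hypothetical in-memory "warehouse": table name -> list of row dicts.
warehouse = {
    "raw_orders": [
        {"region": "east", "amount": 120.0, "status": "complete"},
        {"region": "east", "amount": 80.0, "status": "cancelled"},
        {"region": "west", "amount": 200.0, "status": "complete"},
    ]
}


def filter_completed(source, target):
    """Step 1: read raw rows, keep completed orders, store an intermediate table."""
    warehouse[target] = [r for r in warehouse[source] if r["status"] == "complete"]


def total_by_region(source, target):
    """Step 2: read the intermediate table, aggregate it, store the final table."""
    totals = defaultdict(float)
    for row in warehouse[source]:
        totals[row["region"]] += row["amount"]
    warehouse[target] = [{"region": k, "total": v} for k, v in sorted(totals.items())]


# Each step reads one data set and writes another back to the warehouse.
filter_completed("raw_orders", "orders_completed")
total_by_region("orders_completed", "sales_by_region")
print(warehouse["sales_by_region"])
```

Note that nothing in the final table records that cancelled orders were dropped along the way. That missing context is what the rest of this article is about.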


Tracing Data Along the Pipelines

Each department within an organization will create multiple data pipelines. The simplest analogy is that of a truck that picks up a payload and stops at an intermediate stop, where parts of the payload are added, removed, or even processed into another form, before it drops the payload at a destination. A second truck picks up the payload, transforms it some more, and moves to the next stop. This occurs multiple times until the last truck reaches the final destination. However, this “multi-step transformation” creates issues of trust in the final data set. Since you don’t know where the parts of the payload originated, you don’t know what you have in the end state. Essentially, you can’t easily trace it back to the source.


Inherent to the use of all enterprise data is trust. Organizations must be able to trust the veracity of data at every step along the way. This means they need to understand how the data was transformed within the pipeline, because typically only the starting data and the end result are visible. The transformation steps in the middle are not. Organizations therefore want to know how data was transformed, what the specific operations were, and at which step in the pipeline these transformations took place. Most importantly, they want to know how to reconcile the final data set with another, somewhat similar data set they see elsewhere that seems different or, worse, inconsistent.


How many times has each of us read one set of analytics and spotted unexplainable inconsistencies with another set of analytics from a different source? To resolve this, one has to trace the data from the destination back to the source, or keep extremely detailed records at each transformation step.


The challenge is that pipelines are generally not built from the ground up for this type of detailed “traceback” insight. For that very reason, developers often “kick the can down the road.” In other words, using our road and truck analogy, unless every truck driver keeps detailed records of every stop along the way, the changes in the truck’s payload will not be easily understood.
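
As an illustration of what those “detailed records” buy you, the sketch below assumes each pipeline step appends one record to a lineage log, then walks that log backward from a final data set to its original sources. The record layout, the data set names, and both helper functions are assumptions made for this example, not part of any particular product.

```python
# Hypothetical lineage log: one record per pipeline step (the driver's logbook).
lineage_log = [
    {"step": 1, "operation": "filter", "inputs": ["raw_orders"],
     "output": "orders_completed"},
    {"step": 2, "operation": "join", "inputs": ["orders_completed", "raw_customers"],
     "output": "orders_enriched"},
    {"step": 3, "operation": "aggregate", "inputs": ["orders_enriched"],
     "output": "sales_by_region"},
]


def trace_back(dataset, log):
    """Return the steps that contributed to `dataset`, nearest step first."""
    produced_by = {rec["output"]: rec for rec in log}
    steps, queue, seen = [], [dataset], set()
    while queue:
        current = queue.pop(0)
        rec = produced_by.get(current)
        if rec is None or current in seen:
            continue  # a raw source with no producing step, or already visited
        seen.add(current)
        steps.append(rec)
        queue.extend(rec["inputs"])
    return steps


def original_sources(dataset, log):
    """Return the data sets upstream of `dataset` that no logged step produced."""
    outputs = {rec["output"] for rec in log}
    sources = set()
    for rec in trace_back(dataset, log):
        sources.update(name for name in rec["inputs"] if name not in outputs)
    return sorted(sources)


for rec in trace_back("sales_by_region", lineage_log):
    print(rec["step"], rec["operation"], rec["inputs"], "->", rec["output"])
print(original_sources("sales_by_region", lineage_log))
```

Without such a log, the same question can only be answered by reading every job's code, which is the situation most pipelines are in.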


Using Informatica EDC to Gain Transformational Insights

Informatica’s Enterprise Data Catalog (EDC) is an excellent product for gaining insights into single-step transformations. But what should an organization do when multiple transformations are applied to the data over many months or years, by different developers from different departments?


LumenData has built a tool that leverages the open-source Spline project, Python, and Informatica EDC to help. The tool extracts data lineage from Spark job executions and visualizes it in Informatica EDC. Spline captures lineage information from Spark’s internal execution plans (DAGs) and persists it in ArangoDB in its own graph format. A custom Python scanner then parses that information and reformats the lineage so it can be imported into Informatica EDC. Because Spline maintains a different data structure for each operation, the Python code parses the lineage data operation by operation, and it is generic enough to handle most of the Spark operations supported by Spline.
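
The general shape of such a scanner can be sketched as follows. This is not LumenData’s code: the plan structure, its field names, and the output columns are hypothetical simplifications, and neither Spline’s actual schema nor EDC’s actual import format is reproduced here. The sketch only shows the idea of walking a plan operation by operation and emitting source-to-target links as CSV.

```python
import csv
import io

# Hypothetical, simplified execution plan. Field names are assumptions for
# illustration and are not taken from Spline's actual schema.
plan = {
    "operations": [
        {"id": "op-0", "type": "read", "source": "s3://lake/raw_orders", "children": []},
        {"id": "op-1", "type": "filter", "children": ["op-0"]},
        {"id": "op-2", "type": "aggregate", "children": ["op-1"]},
        {"id": "op-3", "type": "write", "target": "s3://lake/sales_by_region",
         "children": ["op-2"]},
    ]
}


def walk(op_id, ops):
    """Yield the operations upstream of op_id, sources first, then op_id itself."""
    op = ops[op_id]
    for child in op["children"]:
        yield from walk(child, ops)
    yield op


def extract_links(plan):
    """Build one source-to-target link per (read, write) pair in the plan."""
    ops = {op["id"]: op for op in plan["operations"]}
    links = []
    for op in plan["operations"]:
        if op["type"] != "write":
            continue
        upstream = list(walk(op["id"], ops))
        # Everything between the reads and the write is a transformation step.
        steps = [o["type"] for o in upstream if o["type"] not in ("read", "write")]
        for read_op in (o for o in upstream if o["type"] == "read"):
            links.append({
                "from_object": read_op["source"],
                "to_object": op["target"],
                "operations": " > ".join(steps),
            })
    return links


def to_csv(links):
    """Serialize links to CSV text; the column names are hypothetical."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["from_object", "to_object", "operations"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(links)
    return buffer.getvalue()


print(to_csv(extract_links(plan)))
```

A real scanner would need a dedicated parser for each operation type and would have to map the result onto the import format that EDC expects.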



Spline, ArangoDB, and the Python code have all been dockerized. This encapsulates the underlying platform and makes the tool easy to upgrade and plug-and-play to deploy. All it needs is network connectivity to Spark and to Informatica EDC.


The result is detailed insight into the “what,” “when,” and “how” of individual transformation steps in a data pipeline: what was done, when it was done, and how it was performed. This enables organizations to further extend the use of Informatica’s excellent EDC product.


Visit our solutions page to learn how we can help you leverage more of your own data.