How We Extract Data Lineage from Large Data Warehouses

Don’t lose sight of your data as it’s piped for analysis. Learn how LumenData uses Informatica EDC to shed light on multi-stage data transformations.

Enterprises have built massive data infrastructures to capture and manage their ever-growing mountains of data. But as data stores increase, the pipelines that carry precious information to the business become murkier, making the resulting data analysis less trustworthy. Informatica’s Enterprise Data Catalog (EDC) can help you shed light on data transformations along your pipelines, and LumenData has further built tools to extract and visualize additional data lineage to extend the use of Informatica EDC. Here’s how.

More Data Leads to More Transformations

Every organization wants to be data-driven. The complex data engineering and rapid development needed to get there have brought a new challenge to every CDO and CIO: how do we manage and visualize data movement across the organization? The challenge is exacerbated by the advent of massive data warehouses, where the tendency is to store all enterprise data “just in case we need it in the future.” Instead of following Extract (from the source), Transform (into an intelligible, trustworthy form), and then Load (into the warehouse), otherwise known as ETL, many organizations extract data and load it straight into the warehouse with the intention of transforming it only when they need it. This approach is known as ELT.

When data is required for a particular analytic purpose, enterprise developers read the raw data from the warehouse, transform it into the form needed, and store it back in the warehouse or push it to another destination. They perform these calculations (“transformations”) in data pipelines: each step reads from one data set, applies algorithms to the data, and stores the result in an intermediate form in the warehouse. The next step repeats the process with further transformations, and this continues as many times as needed until the final data is in the desired form.
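To make the read, transform, and store cycle concrete, here is a minimal sketch in plain Python. It is an illustration only: a dictionary stands in for the warehouse, and the table names (`raw_orders`, `clean_orders`, `revenue_by_region`) and both transformation functions are hypothetical, not taken from any real pipeline.

```python
from collections import defaultdict

# A dict stands in for the warehouse: table name -> list of rows.
# All table names and rows here are made up for illustration.
warehouse = {
    "raw_orders": [
        {"region": "east", "amount": "120.50"},
        {"region": "west", "amount": "80.00"},
        {"region": "east", "amount": None},  # incomplete record
        {"region": "west", "amount": "19.50"},
    ]
}


def clean_orders(rows):
    """Step 1: drop incomplete rows and cast amounts to float."""
    return [
        {"region": r["region"], "amount": float(r["amount"])}
        for r in rows
        if r["amount"] is not None
    ]


def total_by_region(rows):
    """Step 2: aggregate the cleaned rows into one total per region."""
    totals = defaultdict(float)
    for r in rows:
        totals[r["region"]] += r["amount"]
    return [{"region": k, "total": v} for k, v in sorted(totals.items())]


# Each step reads one table and writes an intermediate table back.
PIPELINE = [
    ("raw_orders", clean_orders, "clean_orders"),
    ("clean_orders", total_by_region, "revenue_by_region"),
]


def run_pipeline(warehouse, pipeline):
    """Run every step in order, storing each result in the warehouse."""
    for source, transform, target in pipeline:
        warehouse[target] = transform(warehouse[source])
    return warehouse


run_pipeline(warehouse, PIPELINE)
```

After the run, the warehouse holds the raw table plus two derived tables, but nothing in the tables themselves says which step produced which rows. That gap is what the rest of this article is about.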

Tracing Data Along the Pipelines

Each department within an organization will create multiple data pipelines. The simplest analogy is that of a truck that picks up a payload, stops at an intermediate point where payload is added, removed, or even processed into another form, and then drops the final payload at a destination. A second truck picks up the payload, transforms it some more, and moves on to the next stop. This occurs multiple times until the last truck reaches the final destination. However, this “Multi-Step Transformation” creates issues of trust in the final data set. Since you don’t know where the parts of the payload originated, you don’t know what you have in the end state. Essentially, you can’t easily trace it back to the source.

Inherent to the use of all enterprise data is trust. Organizations must be able to trust the veracity of data at every step along the way. This means they need to understand how the data was transformed within the pipeline, because typically only the starting data and the end result are visible; the transformation steps in between are not. Organizations therefore want to know how data was transformed, what the specific operations were, and at which step in the pipeline each transformation took place. Most importantly, they want to know how to reconcile the final data set with another, somewhat similar data set they see elsewhere that seems different or, worse, inconsistent.

How many times has each of us read one set of analytics and spotted unexplainable inconsistencies with another set from a different source? To resolve them, one has to trace the data from the destination back to the source, or keep extremely detailed records at each transformation step.
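Tracing from a destination back to its sources amounts to walking a lineage graph upstream. The sketch below assumes the lineage has already been captured as a simple mapping from each derived data set to its direct inputs; the data set names and the `trace_to_sources` helper are hypothetical.

```python
# Hypothetical lineage records: each derived data set maps to the
# data sets it was directly built from.
LINEAGE = {
    "revenue_report": ["revenue_by_region", "fx_rates"],
    "revenue_by_region": ["clean_orders"],
    "clean_orders": ["raw_orders"],
}


def trace_to_sources(dataset, lineage):
    """Return the original (root) data sets that feed into `dataset`."""
    roots, seen, stack = set(), set(), [dataset]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited via another path
        seen.add(current)
        parents = lineage.get(current, [])
        if not parents:
            roots.add(current)  # nothing upstream: this is a source
        stack.extend(parents)
    return roots
```

For example, `trace_to_sources("revenue_report", LINEAGE)` returns the two root data sets, `raw_orders` and `fx_rates`. The walk is only possible because every hop was recorded; with a single link missing, the trace stops at an intermediate table.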

The challenge is that pipelines are generally not built from the ground up for this type of detailed “traceback” insight. For that very reason, developers often “kick the can down the road.” In terms of our road and truck analogy, unless each truck driver keeps detailed records of every stop along the way, the changes in the truck’s payload will not be easily understood.
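As a rough illustration of what “keeping records at every stop” could look like when done by hand, the sketch below wraps each pipeline step so that it logs where it read from, where it wrote to, and when it ran. The `run_step` helper, the log fields, and the sample table are all assumptions made for this example.

```python
from datetime import datetime, timezone

lineage_log = []  # one entry per executed step, in execution order


def run_step(name, transform, source, target, warehouse):
    """Run one transformation and record what it read and wrote."""
    rows_in = warehouse[source]
    rows_out = transform(rows_in)
    warehouse[target] = rows_out
    lineage_log.append({
        "step": name,
        "source": source,
        "target": target,
        "rows_in": len(rows_in),
        "rows_out": len(rows_out),
        "ran_at": datetime.now(timezone.utc).isoformat(),
    })
    return rows_out


# Hypothetical usage: a dict stands in for the warehouse.
warehouse = {"raw_orders": [{"amount": 10}, {"amount": None}, {"amount": 5}]}
run_step(
    "drop_incomplete",
    lambda rows: [r for r in rows if r["amount"] is not None],
    "raw_orders",
    "clean_orders",
    warehouse,
)
```

This only works if every developer in every department routes every step through the same wrapper, which is exactly the discipline that tends to be missing in practice.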

Using Informatica EDC to Gain Transformational Insights

Informatica’s Enterprise Data Catalog (EDC) is an excellent product for gaining insights into single-step transformations. But what should an organization do when multiple transformations are applied to the data over many months or years by different developers from different departments?

LumenData has built a tool that leverages Apache Spline (open source), Python, and Informatica EDC to help. The tool extracts data lineage from Spark job executions and visualizes it in Informatica EDC. Spline captures lineage information from Spark’s internal execution plans (DAGs) and persists it in ArangoDB in a Spline-specific graph format. A custom Python scanner parses that information and reformats the lineage so it can be imported into Informatica EDC. Because Spline maintains a different data structure for each operation, the Python code parses the lineage data operation by operation, and it is generic enough to handle most of the Spark operations supported by Spline.
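The operation-by-operation parsing can be pictured with the following sketch. It is not LumenData’s scanner and does not use Spline’s actual document schema or EDC’s import format: the `PLAN` structure, its field names, and the CSV columns are simplified assumptions chosen to show the general idea of walking an execution plan from each write back to its reads and emitting flat source-to-target links.

```python
import csv
import io

# Hypothetical, simplified plan for one Spark job. Real Spline documents
# are richer and structured differently; these field names are assumptions.
PLAN = {
    "operations": [
        {"id": 0, "type": "Write", "output": "s3://lake/revenue_by_region", "children": [1]},
        {"id": 1, "type": "Aggregate", "children": [2]},
        {"id": 2, "type": "Join", "children": [3, 4]},
        {"id": 3, "type": "Read", "input": "s3://lake/clean_orders", "children": []},
        {"id": 4, "type": "Read", "input": "s3://lake/regions", "children": []},
    ]
}


def extract_links(plan):
    """Return sorted (source, target, operations) links for every write."""
    ops = {op["id"]: op for op in plan["operations"]}
    links = []
    for op in plan["operations"]:
        if op["type"] != "Write":
            continue
        # Walk from the write towards the reads, carrying the names of
        # the transformations passed along the way.
        stack = [(child, []) for child in op["children"]]
        while stack:
            op_id, path = stack.pop()
            current = ops[op_id]
            if current["type"] == "Read":
                # Reverse the path so operations appear in execution order.
                applied = " > ".join(reversed(path))
                links.append((current["input"], op["output"], applied))
            else:
                stack.extend(
                    (c, path + [current["type"]]) for c in current["children"]
                )
    return sorted(links)


def to_csv(links):
    """Serialize the links as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "target", "operations"])
    writer.writerows(links)
    return buffer.getvalue()
```

Calling `to_csv(extract_links(PLAN))` yields one row per source data set, each listing the target it feeds and the chain of operations in between. A real scanner would need a handler per operation type, since each type carries different details.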

Spline, ArangoDB, and the Python code have all been Dockerized, which encapsulates the underlying platform and makes upgrades easy and deployment plug-and-play. All the tool needs is network connectivity to the Spark environment and Informatica EDC.

The result is detailed insight into the “what,” “when,” and “how” of individual transformation steps in a data pipeline: what was done, when it was done, and how it was performed. This enables organizations to further extend the use of Informatica’s excellent EDC product.

Visit our solutions page to learn how we can help you leverage more of your own data.


