How Does Incorporating Machine Learning Within Data Lakes Lead to Robust Business Results?

August 27, 2018

As the “the new oil,” data is an invaluable commodity and a source of power in the modern world. But just storing massive amounts of it doesn’t transform an organization into a data-driven organization! For the true value of data to be (continuously) unlocked, it must be curated across the enterprise to derive required, meaningful, and actionable insights.

Large companies worldwide collect massive amounts of data in various formats. With previous data storage platforms, the data must first be formatted to be uploaded into the system. This limits the type of information that can be stored on each platform and leads to disparate and disjointed systems. On the other hand, data lakes allow the raw data to be stored regardless of format and to remain unstructured until needed, so the source data is very detailed. With all their different types of data organized in one place, businesses can more easily and quickly realize value from their data, especially through analytics like machine learning.

A machine learning system that is incorporated within a data lake has access to a vast amount of very detailed data with which to quickly and cost-effectively train itself, to improve the accuracy of its algorithms, and to map relationships between data and entities. Businesses can then leverage this information to gain value in many ways, including the following:

  • To discover patterns within their business and operations to detect anomalies and mitigate risks
  • To spot new trends within their current and potential markets that will give them a competitive edge and lead to sales opportunities
  • To enhance their understanding of customer behaviors and sentiments for real-time customer targeting, which in turn leads to increased upselling and cross-selling opportunities through personalized product recommendations and greater customer loyalty and retention
  • To enable their marketers and call centers to identify a first-time customer’s purchasing potential (despite limited interactions and data) using automatic classification and then employ the most effective sales approach

Companies across all industries can benefit from incorporating machine learning within their data lakes. Below are a few examples of machine learning already in use.

Real Estate

Skyline AI’s technology helps real estate investors make better decisions and react quickly to real estate changes. The technology leverages the “largest dataset in the industry” to identify real estate opportunities. Parameters used to calculate the current market value and potential future value of an asset include the condition of the asset, the way it is managed, and its amenities. In June 2018, Skyline AI Ltd. and an unnamed partner made an unsolicited offer after its technology determined that two residential complexes in Philadelphia were mismanaged and that the owner was highly likely to sell. The result was an acquisition for $26 million.

Online Businesses

Since 2015, Yelp has been using machine learning to classify photos by categories (food, drink, inside, outside, menu). To train their system, they use various ways to capture data, such as photo captions, photo attributes marked by users at the time of the upload, and crowdsourcing to correct the data and collect more photos for deficient categories. Being able to quickly categorize the photos despite the multiple formats the source data comes in has enhanced the user experience on and helped restaurants bring in more business, thus making Yelp a much more valuable tool and resource.


Machine learning has also played a role in advancing security measures for financial institutions, helping them avoid losing money from fraudulent transactions and loan defaults. Advanced analytics has helped credit cards monitor transactions to alert them of fraud in near real-time and to offer customers relevant products and services. Advanced Analytics can also help financial institutions gage potential borrowers’ probability of repayment by comparing their profiles against profiles generated from various sources such as credit reports, transactional records, and public information records.


In healthcare, machine learning has played a role in providing faster and more accurate diagnoses and treatment recommendations by pulling historical information that includes factors such as symptoms, conditions, and treatments reported by previous patients and even current patients using mobile apps and fitness devices.

However, simply using a data lake is not enough. In order for machine learning analytics to be accurate, the data lake also needs to be a trusted data lake. That means having proper data governance and data quality strategies in place to make sure that the raw source data is clean, usable, and properly managed. ‘

Unfortunately, very few businesses have actually achieved continuous data quality measurements and management, and data governance programs are often underappreciated and underfunded. Businesses remain skeptical of IT reports, data cataloging is siloed as a nascent concept, and master data continues to be underleveraged. Consequently, many massive data lake initiatives have resulted in data swamps. The entire EIM ecosystem depends on the alignment of the people, processes, and technologies, so having the optimal data strategy and EIM roadmap is critical.

Contact LumenData for a consultation on how your company can build trusted data lakes to leverage the power of distributed computing for machine learning (or, alternatively, time series) analysis to automate business operations, perform predictive analytics, and more.

Leave a Reply

Your email address will not be published. Required fields are marked *