Ensuring Data Quality in Metals Manufacturing: Techniques and Challenges with SCADA and Databricks
E14

In this episode of the Smart Metals Podcast, hosts Luke van Enkhuizen and Denis Gontcharov explore the critical topic of data quality in metals manufacturing, with a strong focus on SCADA systems and modern cloud platforms like Databricks.

Denis kicks off with a big announcement: he is refocusing his business on integrating legacy SCADA architectures with scalable, cloud-native environments such as Azure Databricks. Together, Luke and Denis dive into the key challenges of aligning SCADA data with business use cases, the erosion of trust caused by bad data, and the urgent need for automated monitoring.

The discussion emphasizes how companies—from SMBs to enterprises—can implement robust data quality testing using open-source frameworks like Soda and Great Expectations. You’ll learn how to embed testing into ETL pipelines, use Databricks to store and analyze data reliably, and ensure high-quality inputs within a Unified Namespace (UNS).  
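For a flavour of what such a check looks like in code, here is a minimal sketch using Great Expectations' classic pandas-style API (pre-1.0). The tag names, limits, and sample values are illustrative only, not taken from the episode:

```python
import great_expectations as ge
import pandas as pd

# Illustrative SCADA-style time series batch: a tag, a timestamp, and a reading.
# Column names and limits are hypothetical, not from the episode.
readings = pd.DataFrame({
    "tag": ["furnace_temp_c"] * 4,
    "timestamp": pd.to_datetime([
        "2024-01-01 00:00:00", "2024-01-01 00:00:01",
        "2024-01-01 00:00:02", "2024-01-01 00:00:03",
    ]),
    "value": [1250.0, 1251.5, None, 9999.0],  # a null and an implausible spike
})

batch = ge.from_pandas(readings)

# Declare expectations: no missing readings, and values inside a plausible physical range.
batch.expect_column_values_to_not_be_null("value")
batch.expect_column_values_to_be_between("value", min_value=0, max_value=1600)

# Validate the batch; a failed expectation is exactly what an automated monitor would alert on.
# Overall success is False here because of the null and the out-of-range spike.
results = batch.validate()
print(results)
```

The same idea carries over to Soda, which expresses comparable rules as declarative checks that run on a schedule instead of by hand.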

Timestamps: 
00:00 Introduction to the Smart Metals Podcast
00:44 Big Announcement: Refocusing Business Activities
01:12 Understanding SCADA and Data Quality Challenges
04:37 Importance of Data Quality in Manufacturing
07:22 Real-World Data Quality Issues and Consequences
11:04 Steps to Ensure High Data Quality
27:00 Open Source Solutions for Data Quality Testing

Notable Quotes: 
  1. “SCADA is essentially the second layer of the automation pyramid—supervisory control and data acquisition. It collects data from PLCs and individual machines. The challenge is moving this high-frequency, millisecond-level time series data to the cloud. Data quality is one of the key problems in this area.” – Denis Gontcharov
  2. “My new focus is helping companies integrate legacy SCADA systems into modern platforms like Azure Databricks, where they can finally get control over their industrial data.” – Denis Gontcharov
  3. “Almost any factory using modern machinery has multiple layers—sensors, PLCs, SCADA, MES, ERP, and eventually the cloud. Much of this may be hidden inside vendor-specific solutions, but understanding these layers is essential.” – Luke van Enkhuizen
  4. “Bad data completely erodes trust. If your dashboard shows an off number and you can’t explain it, users stop trusting your data platform—no matter if it’s SCADA or Databricks behind the scenes.” – Denis Gontcharov
  5. “You can’t manually verify data coming from hundreds of time series across SCADA systems. You need an automated application watching your data 24/7 and flagging anomalies before they affect operations.” – Denis Gontcharov
  6. “Where should you do data quality checks? Ideally, inside your pipeline—after transformations—whether you’re using SCADA historians or sending data into Databricks. This prevents dirty data from entering your clean system.” – Denis Gontcharov
  7. “ETL stands for extract, transform, load. As you bring SCADA data into Databricks or your UNS, every step must be monitored and tested.” – Denis Gontcharov
  8. “Just like raw ore needs refining before it becomes usable gold, raw SCADA data must be cleaned, structured, and tested—often inside platforms like Databricks—to unlock its real business value.” – Luke van Enkhuizen
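
Quotes 6 and 7 describe running quality checks inside the pipeline, after the transform step and before the load. A minimal PySpark sketch of that pattern on Databricks might look like the following; the table names, columns, and thresholds are hypothetical, not from the episode:

```python
from pyspark.sql import SparkSession, functions as F

# Hypothetical pipeline step on Databricks: validate transformed SCADA readings
# before they are loaded into a curated Delta table.
spark = SparkSession.builder.getOrCreate()

transformed = spark.read.table("scada_staging.furnace_readings")  # output of the transform step

# Split the batch into rows that pass basic quality rules and rows that do not.
quality_rule = (
    F.col("value").isNotNull()
    & F.col("value").between(0, 1600)
    & F.col("timestamp").isNotNull()
)
clean = transformed.filter(quality_rule)
rejected = transformed.filter(~quality_rule)

# Load only clean rows into the curated table; quarantine the rest for review,
# so dirty data never enters the "clean" system.
clean.write.format("delta").mode("append").saveAsTable("scada_curated.furnace_readings")
rejected.write.format("delta").mode("append").saveAsTable("scada_quarantine.furnace_readings")

# An automated monitor could alert when the rejected share crosses a threshold.
print("rejected rows:", rejected.count())
```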
 
Relevant Links: