
Ensuring Data Quality in Metals Manufacturing: Techniques and Challenges with SCADA and Databricks
Luke: Welcome back to
the Smart Metals Podcast.
Hi, I'm Luke van Enkhuizen.
I'm here together with my co-host.
Denis: Denis Gontcharov
Luke: We dive into designing
and implementing smart factories
for the metals industry, helping
you become more productive.
Today, Denis and I will dive into the topic of data quality in metals manufacturing.
We'll explore how that works, what the common misconceptions are, and how you can succeed as an enterprise, but particularly also as an SMB.
Denis, let's start by talking about the origin of this topic, because you also have a big announcement, right?
Denis: Yes, that's true.
I think that's a good segue
into the announcement.
Indeed.
So the topic came to be because of the focus of my new positioning. In essence, I will refocus my business activities more on integrating legacy SCADA systems with modern cloud platforms such as Azure Databricks.
Now, before I continue, let me quickly explain again, for those who don't know, what SCADA is. SCADA is essentially the second level of the automation pyramid, level two, and it stands for supervisory control and data acquisition.
In a nutshell, it's a system that's responsible for collecting all the data from your PLCs, from your individual machines. In that sense, the historian is a part of SCADA. Now, it's called supervisory control because SCADA, in some implementations, also allows you to steer the process by setting certain parameters across many machines.
And in essence, the big challenge is: how do you get this much data, at such high frequency, to the cloud? We're talking about time series here, potentially at millisecond resolution. This is a very big topic that a lot of industrial enterprises are working on and really struggling with, and I think data quality is one of the key problems in this area. It's very hard to get right.
Luke: So this is very interesting
and I'm really looking forward
to diving into this topic.
Just to clarify a little bit, for
those who are not really familiar with
what we are talking about here and
why this is also relevant for an SMB.
Because almost any factory that produces with modern machinery has multiple layers in the factory.
You might not see it.
It might be part of your
equipment already, or it
might be part of your process.
You might not interact with it on
a daily basis because it's hidden
within the vendor specific solution.
But any factory has, on the lowest level, of course, the sensors that are part of your equipment. Then there are PLCs that control the machinery and run its logic, and those feed into the SCADA layer, which summarizes, as you said. Above that is usually where your manufacturing execution system lives. That is the software you normally interact with for your daily planning, scheduling, and work. Above that is the ERP, and sometimes even above that, they say, there is the cloud.
We are really talking about the lower levels: what happens in a PLC, on the factory floor, or in a continuous process anywhere in the metals industry, and how that data is then captured.
Do you have anything to
add or change to that?
Denis: Yeah, pretty much. I will just summarize that, indeed, you have to separate two topics here. SCADA is currently my focus because that's where most of my experience has been in the enterprise world. But the data quality topic we will discuss in the next part of this podcast essentially applies to any layer of the automation pyramid: SCADA, MES, and ERP alike. At the end of the day, clean data is of vital importance, and that's what we're going to discuss now.
Luke: Yeah.
It's very important to then also start with the most important question: what is data quality? Like, you know, we are collecting data. Great. Now what?
Denis: Yeah, that's a great point. Data quality is often seen as a checkbox, in the sense that as long as we get data, when the file is created, when the rows are written to the database, we assume the story is done: the data is there. But when people actually start looking at the data, that's when you get the angry phone calls and the annoyed emails saying: hey, this dashboard is wrong, or some data fields are missing. What's wrong? So essentially we have to define data quality more precisely, in the sense that it consists of various metrics you can define quite precisely. One of them, for example, is accuracy.
For example, being accuracy.
Like does this value actually
represent what you want to
represent in the real world?
When you want to inform the use of a
certain temperature in the furnace,
do you actually send the value of the
sensor that is in that furnace, or do
you send some proxy value, let's say
a sensor who's outside that particular
furnace, but it gives you just an idea.
Luke: Right.
So why is this so commonly misunderstood?
Denis: I think everyone, in essence, understands what good data is. We all have experience working with bad data quality. I think when we ask someone what they expect from clean data, a business user can define it pretty accurately, in my opinion. The reason why we don't have clean data is a bit paradoxical: the problem is assumed to be easy, and that's why we assume the data is just correct. We assume that if we move data from place A to place B, it should be fine.
To me, it seems that everyone
underestimates the problem.
I think that's the key
word in this discussion.
We underestimate how difficult
it is to get the data right.
Luke: Yeah.
Yeah, I can imagine.
So underestimating is never a good thing, of course. You also shared, in the preparation of the show notes, this image where somebody complains about the report being broken. Maybe you can explain it a little bit more.
Denis: Let me give an example of a typical data quality problem that happens in reality.
Luke: Mm-hmm.
Denis: Imagine a big industrial enterprise that's obligated to report. Let's say you are a coal power plant. Right. You produce energy by burning coal. You have investors, because you have a very big business, and of course those investors want to know how much energy you produce, because that determines how much money they can earn and how much what they invested in will appreciate. So they expect from you, let's say on a monthly basis, very clear statistics about energy production.
And this, by the way, is not just an agreement; it's an actual contractual obligation, with hefty fines in case you do not meet those requirements. So if you fail to deliver the data, you get a fine that could potentially run into the millions.
So you can imagine this VP
is looking at the dashboard
and he sees absolute garbage.
And he writes an email to the
head of the data department,
and says, what's going on?
Where is the data?
And then, instead of having an easy answer, they have to create tickets and go look for the IT person, who is maybe on holiday. And that's where tensions rise.
So it seems to me that data quality is a problem at such a small level, essentially looking at the actual numbers, that it sits far below the problems someone on the board is usually tasked with, and it just never really wins their priority or their recognition.
Luke: Yeah, I can imagine; it really seems to be taken as a given. How about we dive a little bit into why you should really solve it, like now? If you don't, what could go wrong?
Denis: Mm-hmm.
Yeah.
I mean, essentially what you try to do is avoid losing money, in a sense. You want to avoid bad decisions. If we look at it in terms of examples, right? In my previous example, a failure to deliver clean data, or to meet the obligations in contracts with regard to data reporting, will actually lead to fines. The European Union will fine you if you fail to produce data on your energy or environmental emissions, for instance.
But if you look at an SMB, they perhaps
do not have these specific contracts,
but they do have machines running, and they may depend on things like
predictive or preventive maintenance
to prevent a machine from failing.
Now, if you have bad data, you
will not discover this problem and
your machine will in fact fail.
And finally, a third point I want to mention, perhaps the most important one: bad data completely erodes trust. Imagine you spend time building a dashboard, hoping it'll be used by engineers or by business users.
Then someone mentions: hey, this number looks off. And if you can't explain why it looks odd, people lose confidence in your solutions.
And the entire data department essentially
loses the complete trust of the company.
Luke: Yeah.
Sounds like a horrible situation to be in. You know, keeping confidence is key in any initiative, not only in the system, but often in each other: whether your colleague did the right work. If your colleague comes with a report and you know he is not using the right source data, how can you trust him on that report? Right?
Denis: Mm-hmm.
Luke: It's hard to keep the relationship well. So let's talk a little bit about how you can make sure that you're doing this right, that you get this right. How do you get the right data quality? 'Cause that's what this episode is about. You know, in metals manufacturing, everything matters. So where do you start to get your data right?
Denis: Well, I would begin with a typical Dutch proverb, which says "meten is weten", which translates to English as: to measure is to know. I think that's the first step to improving your data quality. So first know: what is the data quality, actually?
Like, how good or how bad is it?
A lot of companies have absolutely no view of this. And if you speak, for example, of an SMB manufacturer: they may not have as much data, but they also may not have enough people to manage it, to maintain it, and to monitor it.
What I'm getting at is that data
quality monitoring should not be
done by humans, in my opinion.
It should be automated as much as possible, because it can be.
Luke: So step one would be: remove the human factor, to say it bluntly.
Denis: Yeah, I mean, whenever you are copying data manually, you always introduce room for error. But it's not only about avoiding errors; it's also about the fact that that work is not really productive. I don't think in the 21st century we need people who copy numbers from a paper sheet into an Excel file, and then from that Excel file into a program, when this can be automated. I think it's a waste of human talent and ingenuity.
Imagine the effort you could liberate at metals companies, in the aluminum industry, where entire jobs consisted simply of data entry,
Luke: Yeah,
Denis: and it not only introduces a lot of errors, it's also just a very expensive process, don't you think?
Luke: Yeah, it absolutely is. And it numbs you down. It feels like such a missed opportunity to be doing that work instead of adding value, right? So maybe you can paint us a picture of how to do this right.
How do you get your data from, for example, a SCADA system and use it for analysis, or for whatever purpose you want? In the last episode, for example, we talked about predictive maintenance. Perhaps you can talk a little bit about this specific process, and about how to keep your data in check, so you can do the things we talked about in that episode.
Denis: Yeah, good point. So in essence, what we're talking about here is data testing.
Luke: Mm-hmm.
Denis: If you look at software development, well, every application we use, all the code that's written, is almost always thoroughly tested. Developers write unit tests, they write integration tests, and they only commit code to a repository if it has been sufficiently covered with tests.
This may be a bit too IT-technical for our listeners, but you have to imagine that we want to be sure that the code just written is actually correct, and you do this by making what we call assertions about the code. You could say that if you have a function in your code and you give it these inputs, it should yield this result. If it does, the test passes, it completes successfully. But if the code produces a different result, we say that the test has failed. And this allows you to basically verify and validate the logic of your application.
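For illustration, here is what such a unit test looks like in Python; the function and the expected values are hypothetical, a minimal sketch rather than code from any real project:

```python
# A hypothetical function whose logic we want to verify.
def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32

# A unit test: given these inputs, assert the expected results.
# Run with pytest; if the assertions hold, the test passes.
def test_celsius_to_fahrenheit():
    assert celsius_to_fahrenheit(0) == 32.0
    assert celsius_to_fahrenheit(100) == 212.0
```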
Now, we do something very similar in the data world with so-called data quality tests, where you make the same kind of assertions, or expectations, about your data. An example could be: imagine you are reading a column from a database table, column A, which contains numeric values, and let's say those values are a percentage.
Luke: Mm-hmm.
Denis: In that case you could say that all the values have to be a number between zero and 100. That would be the test for your data. And whenever a value pops up that is, for example, 105, the test will fail because it is outside of zero to 100, and it'll raise a flag. So we want to install these tests across our data sets, wherever we want to test something.
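A minimal sketch of that percentage check in plain Python with pandas; the column name and values are made up for illustration:

```python
import pandas as pd

# Hypothetical data: column_a holds percentages, so every value
# is expected to lie between 0 and 100.
df = pd.DataFrame({"column_a": [12.5, 99.0, 105.0, 47.3]})

# The data quality test: collect every row that violates the expectation.
violations = df[(df["column_a"] < 0) | (df["column_a"] > 100)]

if not violations.empty:
    # The test has failed: raise a flag (here we just print the offenders).
    print(f"Range check failed for {len(violations)} row(s):")
    print(violations)
```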
Luke: Yeah.
Can we make it a bit more specific here, as in maybe a step 1, 2, 3 for the listeners? Because not everybody is on that technical level, but they're interested to learn more about what it can mean for them. Many decision makers are just waiting to learn: what would the high-level steps be? What do we need to do next? What is the pre-work? Maybe you can talk us through a 1, 2, 3 step approach for this.
Denis: Yeah, sure. Let's imagine a use case where you have to collect a lot of data from a lot of different sources; those could be SCADA systems. And let's imagine you have about 100 different historians, as part of SCADA, and all of these historians collect data on, let's say, the average pressure of a rolling mill. Your rolling mills produce time series on the force, right? So you're essentially trying to measure the force across all your rolling mills. You could use this value, for instance, to try to predict when a machine will fail or when something is going wrong, based on the force. But in essence, you have to imagine you get hundreds of different time series.
Now, to analyze this data, you want to have it in one single source of truth, which on this podcast is the unified namespace. Yes, I had to plug it in, didn't I? Well, you can build the data pipelines as we discussed in previous episodes,
Luke: Hmm.
Denis: but the question we are working on here is: how can you be sure that you actually get all the data, and that it's all correct? Because we are looking at hundreds of time series from hundreds of different systems. You cannot check this manually. So you need some automatic way to watch over the data that lands in your unified namespace. We're talking about some application that will, day and night, look at the data that is landing. An obvious check would be: is data missing? For example, if SCADA system number 87 has not sent any data for more than, say, eight hours, that's something you would like to know, because maybe the connection is broken, maybe there's a firewall, something is wrong. You'd be surprised how often in the industry this is simply not noticed for days.
Did this help a bit to clarify?
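As a rough illustration, a minimal watchdog sketch in Python, assuming you track the last-received timestamp per source; the source names and the eight-hour threshold are hypothetical:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical state: last-received timestamp per SCADA source.
now = datetime.now(timezone.utc)
last_seen = {
    "scada_086": now - timedelta(minutes=5),
    "scada_087": now - timedelta(hours=12),  # silent for too long
}

MAX_SILENCE = timedelta(hours=8)

# The watchdog: flag every source that has been silent beyond the threshold.
for source, ts in last_seen.items():
    if now - ts > MAX_SILENCE:
        print(f"ALERT: {source} has sent no data for {now - ts}")
```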
Luke: Yeah, definitely. That's very important: if something slips through, you'd only notice it when it's too late, and then you need to clean it up. I think we addressed this in previous episodes as well: the importance of bringing in, first of all, an expert who understands the source and the destination, but also of really doing upfront transformations. Is this what you would also call ETL, in the terminology?
Denis: Exactly. ETL stands for extract, transform, load. So you're trying to get data from a system, transform it in your data pipeline, and then load it into a new system. Essentially, your data flows from point A to point B and undergoes transformation along the way.
Well, I'm really happy you mentioned this point, because the question is: where should you do the data quality checks? A very good approach is to actually integrate a data quality check as a step in your pipeline, for example at the end of the transformations, because then you can warn the user as the data is being written. So as the problem occurs, you can notify the user: hey, hold on, this transformation yielded a result that is outside of the expected min and max values, for instance. And then you can either refuse to write the data to the system, to prevent polluting your clean database, or you can raise the alarm and let the user deal with it.
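A sketch of that pattern in Python, with a validation step between transform and load; the functions, column names, and thresholds are hypothetical stand-ins for a real pipeline:

```python
import pandas as pd

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    # Hypothetical transformation: force readings as a percentage of rated load.
    return raw.assign(load_pct=raw["force_kn"] / raw["rated_kn"] * 100)

def validate(df: pd.DataFrame) -> None:
    # Quality gate at the end of the transformations:
    # refuse to write data that falls outside the expected min/max.
    bad = df[(df["load_pct"] < 0) | (df["load_pct"] > 100)]
    if not bad.empty:
        raise ValueError(f"{len(bad)} row(s) outside expected range; load aborted")

def load(df: pd.DataFrame) -> None:
    # Stand-in for writing to the clean target system.
    df.to_parquet("rolling_mill_clean.parquet")

raw = pd.DataFrame({"force_kn": [150.0, 240.0], "rated_kn": [300.0, 300.0]})
clean = transform(raw)
validate(clean)  # raises before load() can pollute the clean database
load(clean)
```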
Luke: Okay, so this is key indeed. So there is a distinction between the two, but they're related in a certain way. Now,
Denis: Mm-hmm.
Luke: let's tie this all back to the start of this episode, where we said: okay, we're talking about data quality for metals manufacturing, and we talked about the new positioning that you're focused on. Could you maybe help the listeners out and tell a little bit about how you aim to help companies with this? What are the kinds of transformations you achieve with them, and how is this relevant for them? I'm just trying to really understand, also for the listeners: what can they rely on you for?
Denis: Yeah. I think we all have an intuitive idea of how data quality can be tested. In fact, there's a simple test we all do on a daily basis: just looking at data and seeing if the values are correct, right? I believe business owners are already capable of defining checks for their data, or at least of defining what clean data should look like for them.
The real question, where I come in, is: how can you implement a testing framework that automates this process? So essentially, you define the checks you want to do, but how can you design a framework that will then execute these checks on an hourly or daily basis, or whenever data is being written?
Luke: Mm-hmm.
Denis: So where I come in is that essentially I help companies establish open source frameworks for data quality testing that allow you not only to test the data and create reports about the results, but also to generate, from these data quality checks, a document that proves to business users: hey, we can actually prove or certify that this data is correct, because we have performed these and these checks. And because trust is so important here, we don't want to just say the data is tested; we want to give the business users a way to verify the tests that are being made, in plain English. For example: we test that for these data sets, in this column, all values are guaranteed to be between two and ten.
Luke: Yeah, those are very crucial details, right? You can validate it, like you say. You can make sure that the data actually is what you expect it to be. You would think the data is just the data, but it's not guaranteed to be right, and you need to filter it: you need to know what is true and what is not. There's so much of interest below the surface that most people aren't even remotely aware of, right?
Just like many fabricators and machine builders and metals companies are not even remotely aware of what they already have in terms of equipment, sensors, generated data, and probably SCADA systems, and of what these can do for them. And then, to realize that what is in there isn't necessarily immediately right or useful, or of the quality you need to achieve your use cases, and that there's a whole level below that. That is super fascinating, and I think it also shows, again, how diverse our industry can be and in how many places work needs to be done before you can actually achieve those smart factory results, before you can really, truly achieve a true digital transformation.
Denis: Yeah, absolutely.
I mean, we say that data is the new oil, or we hear quotes that you should make data the primary commodity of your business, but then we don't actually talk about looking at its quality.
Luke: Yeah, right. But it's funny, right? Because if you say data is the new gold of your business, well, even gold has a certain quality, right? You need to make sure
Denis: Yeah.
Luke: Yeah. Otherwise you still get people having to go to each individual asset and check it. And for me, someone really focused on SMBs: this sounds like it's specifically about a large number of assets, but I can quickly imagine that if you have connectivity issues in your network, or various interruptions in the system, you can already have this problem with just one asset. Is that right?
Denis: Yeah, for sure. Imagine, in the use case we discussed in the last episode, one of the problems I remember was with work orders. Sometimes a work order or a sales order is updated in one system, the client changes his mind, but that data doesn't travel back from the ERP to the MES system, and the bar still gets produced.
So for me, that still falls under data quality as well, 'cause in essence, data is not being propagated
Luke: Right.
Denis: from the ERP back to the MES.
Luke: It's very important to emphasize this: data is not just time series data and numerical values. It is the full spectrum. It could also be entries in a ledger, entries in a table. It could be work orders, sales orders, anything that is important. It could even be a customer's email address, right? It could be anything. This is very important to emphasize, in the end.
No matter where you are, in any case, it helps to do that work. And I think that's also what we did in the project, right? Making sure that if an ERP system releases a certain work order, the work order is stored somewhere. Now, what happens when you start changing the leading records in various systems, and you forget to run any scripts or take any action as a user to update the other systems? You already have a data problem. And there are ways to somewhat automate that, check it, and make sure it is actually in place; one of them is the UNS, and others are testing frameworks.
I think you briefly touched on it, but you didn't mention any examples, and I noticed that you also just posted a new article about Soda. So maybe you can talk about some of the technologies, and then we'll wrap up the episode.
Denis: I can mention two open
source testing frameworks that
caught my attention because again,
they're open source, which we
really value on this podcast.
Luke: Mm-hmm.
Denis: But they both have companies behind them, meaning that their core product is free, but if you want additional features, you would have to sign up with the company, which is not something we are doing at the moment. We are perfectly happy with the open source features. The two solutions I mentioned are indeed Soda, or Soda Core, and Great Expectations.
Luke: Mm-hmm.
Denis: Now, Great Expectations has been around for a while. It's a Python package that allows you to define expectations about your data, in the sense of: I expect that this column has no missing values. It will then test that and produce beautiful reports. Soda Core essentially does the same thing, but it positions itself as the simpler, more minimal framework: easier to understand, more modern, but less feature-complete. So if you would like something quick but minimal, you can go with Soda. If you want a more complete, more rigorous framework, you can stick to Great Expectations. I think both are great.
I've written articles on both of these frameworks, and in essence, yes, I would say you can integrate them very well into the UNS. You can do so in two places. Either you integrate them inside your data pipelines, where you perform the checks as data is being written to the UNS, or you create a third-party service that also lives inside your UNS ecosystem and periodically checks the data. It can connect to any of the systems, to the historian or to the broker, and then just perform the checks as an external guard of your data.
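To give a flavor of what these frameworks look like, here is a minimal sketch using Great Expectations' classic pandas-backed interface; the API has changed across versions, so treat the exact calls as illustrative, and the data is hypothetical. In Soda Core, similar checks would be declared in a SodaCL YAML file instead.

```python
import pandas as pd
import great_expectations as ge

# Hypothetical rolling-mill data landing in the UNS.
df = ge.from_pandas(pd.DataFrame({"force_pct": [42.0, 87.5, None, 105.0]}))

# Expectations read almost like plain English.
no_nulls = df.expect_column_values_to_not_be_null("force_pct")
in_range = df.expect_column_values_to_be_between(
    "force_pct", min_value=0, max_value=100
)

print(no_nulls.success)  # False: one value is missing
print(in_range.success)  # False: 105.0 is out of range
```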
Luke: Oh, this is super fascinating. Yeah, and of course you can also do things in your UNS, like in the bridges, to trigger actions when things happen, to prevent data from being bad in the first place, and then notify if something is rejected. That's also an option: to prevent bad data from passing through at all. Right. That would be the alternative. There are
Denis: Yeah.
Luke: downsides to that, but if it's not
mission critical to have it immediately
in the other system, then you just
Denis: Mm-hmm.
Luke: filter it out, and why not reject it by logic? This really goes into, you know, doing data analysis here.
Denis: Exactly. It's all about reporting and monitoring. These tools can also do things like alerting via Slack, via email, via Teams, and also show you the results of your data quality checks over time. So you get a nice view of how many days of data were missed in a year, and of the average score for your data quality across all your data sources. So it's indeed, as you mentioned, doing the checks, but also reporting and doing analysis on the results.
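A small sketch of that kind of reporting, aggregating hypothetical check outcomes into a pass-rate score per source; the log format and names are made up for illustration:

```python
import pandas as pd

# Hypothetical log of check outcomes: one row per check run.
results = pd.DataFrame({
    "date":   ["2024-05-01", "2024-05-01", "2024-05-02", "2024-05-02"],
    "source": ["scada_086", "scada_087", "scada_086", "scada_087"],
    "passed": [True, False, True, True],
})

# Average pass rate per source: a simple data quality score over time.
print(results.groupby("source")["passed"].mean())
# scada_087 scores 0.5 here because one of its two checks failed.
```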
Luke: Yeah.
You know what it is? With all these developments: I'm a very practical guy, and I'm really hands-on, you know, for the shop floor and automating my daily work, as I always did in metal fabrication. So whenever I dive deeper into these topics with you, it's absolutely fascinating. It's like I'm seeing the matrix, right? Like, literally, you take the pill, you dive into the UNS, you discover what works there and what doesn't. We have been doing this for the last year, and it's been absolutely fascinating. But now you discover there are even multiple layers below that, and they're all open source again; whatever direction you look in, there seems to be some kind of open source solution that allows you to really make everything watertight. It's just fascinating. It never stops fascinating me. It's really awesome. And so, yeah, this has been a great episode so far. Is there anything I forgot to ask, or that you'd like to tell or share?
Denis: No, I think it was a pretty complete introduction. I think we covered a pretty interesting topic, so everything's good on my side. I was really, really happy to talk about this; I find it a very fascinating subject that I'm passionate about.
Luke: Yeah, and I think we'll definitely hear more from you about this in your articles and your posts on various channels. I will definitely add them to the show notes. And I hope many other companies will also explore these options, read the articles, and, if they want to know more, work with you, of course. Right?
Denis: Yeah, of course.
Luke: Yeah.
Great.
Well this has been a good episode.
Thank you so much and I look
forward to having you around
again at the Smart Metals Podcast.
Thanks for tuning in.
Bye-bye.
Denis: Thank you.
Bye-bye.