Capture the red flags: How to improve data validation
By: Will Shiflett
February 1, 2026 | Automated QA, Cost Savings, Data Insights
The last article explained why data validation matters. This one explains what it actually is and how to tell if you’re already doing it.
What’s data validation?
Data validation is a set of tools and techniques to achieve consistency between your data and a set of thoughtful expectations provided by you, your team, and your users. No more, no less, no exceptions.
Like parenting, some data validation techniques are acceptable, some are great, and some are anti-patterns disguised as common sense. Here’s an example of the latter:
When getting your child ready for school, it’s tempting to do everything for them, whether it’s pulling on their socks or putting food into their mouths. If you are an exceptionally patient person, you can probably do this every day. And at first that will seem like the common sense way to do things, and will become the only way to do things, and then you will have another child, and then it will become painfully clear that you cannot do everything for two children and get out the door before the daycare closes.
In this anti-pattern analogy, you are the parent, the data source is your child, and the data validation technique is getting your child dressed. The key takeaway is that you probably do not want to do absolutely everything for your child (or your data) in a way that takes your time away from other important activities. It’s fine to start out reactive, but the goal is to become proactive, and the ideal is to become predictive and adaptive.
Are you already doing it?
If you ask around, someone in your organization will assure you that, yes, there’s an entire branch of the org chart* dedicated to data quality and validation. Then they’ll turn pale and whisper, “Why? Did something happen?”
If you, or a small army of manual testers you employ, are excited by the prospect of manually reviewing CSV exports of your database tables for errors, writing update statements, and lobbing them at the production database, you can do that.
This is a form of data validation. Some organizations call this quality assurance, others call it manual testing, and still others will insist that they are legally obligated to refer to it as data management, or data governance, or whatever. There are exceptions, and you should consider telling your rapt colleagues all about them.
This approach is absolutely capable of achieving consistency between data and expectations. And you absolutely shouldn’t do it. But if you do, you should watch out for three things:
- Validating data manually takes longer or costs more than you’d like.
- The quality of data validation decreases as data volume increases.
- You or your team validate your data, but users still don’t trust it.
Ignoring these is akin to ignoring red flags on a first date, which is understandable, but ill-advised. It’s understandable because nobody wants to believe the data validation they do is ineffective. It’s ill-advised because the presence of any of these suggests that the work being done is not effective, efficient, or scalable.
Learning from mistakes instead of repeating them
You don’t have to defend yourself; this isn’t a morality play. No matter what people say, you need to make mistakes to learn from them. And data validation is, in fact, a controlled form of making, learning from, and not repeating mistakes. With an extreme emphasis on the “learning from” and “not repeating” aspects.
If you have the time, if the stakes are low, and if you want to experiment with manual, one-off approaches, go for it. Plenty of small- to medium-sized businesses do the same thing, indefinitely. And plenty of large businesses do this in silos across multiple departments while simultaneously hiring consultants to tell them why they shouldn’t.
Note that the problem isn’t that these organizations are trying to brute-force their way through a problem. The problem is that they continue to use brute force when that method has been shown to be wholly insufficient and wildly inefficient. They can do this because they have resources to burn on this approach. Do you?
Getting beyond the basics
After you’ve spent time validating your data, consider your failures. And not just those in your relationships, finances, and career choices. You should also consider, and record, how your data failed to meet the clearly and not so clearly defined expectations you had for it. For example:
- Was data missing?
- Were there duplicate rows?
- Did you end up with those weird, European periods in your numbers instead of good, old-fashioned, Anglo-Saxon commas?
- Is it possible that all of your date fields have the year 2568 not because that is, in fact, the correct date, but because you’ve somehow managed to format them using the Thai Solar Calendar?
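Checks like these are easy to automate once you write your expectations down. Here’s a minimal sketch in Python that flags the failures above; the row shape and column names (`id`, `amount`, `when`) are illustrative assumptions, not a real schema.

```python
from datetime import date

# Hypothetical rows exported from a database table; values chosen to
# trip each of the checks described above.
rows = [
    {"id": 1, "amount": "1,234.50", "when": "2025-06-01"},
    {"id": 1, "amount": "1.234,50", "when": "2568-06-01"},  # duplicate id, European format, Thai year
    {"id": 2, "amount": None, "when": "2025-06-02"},        # missing value
]

def find_failures(rows):
    """Return (category, id) pairs for rows that violate our expectations."""
    failures = []
    seen_ids = set()
    for row in rows:
        # Duplicate rows: the same id appears more than once.
        if row["id"] in seen_ids:
            failures.append(("duplicate", row["id"]))
        seen_ids.add(row["id"])

        # Missing data: a required field is empty.
        if row["amount"] is None:
            failures.append(("missing", row["id"]))
        # European number format: the comma comes after the period,
        # as in "1.234,50" rather than "1,234.50".
        elif "," in row["amount"] and "." in row["amount"] and \
                row["amount"].rindex(",") > row["amount"].rindex("."):
            failures.append(("european_decimal", row["id"]))

        # Suspicious year: far in the future, e.g. a Thai Solar
        # Calendar year (2568 BE ≈ 2025 CE) stored as-is.
        year = int(row["when"][:4])
        if year > date.today().year + 1:
            failures.append(("suspicious_year", row["id"]))
    return failures
```

Each flag carries a category name, which feeds directly into the categorization exercise described below.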
Whatever the reason for your data’s deviance, note it and categorize the failures. Daniel Beach talks about a few categories of data validation that can help you get started, but you don’t need to use this terminology. The key exercise is to determine if your expectations for your data are fully fleshed out. And then, if so, determine whether the data met those expectations.
From there, you can start fixing problems. We’ll talk about how to do that in the next blog in this series.
*Fun fact: “org chart” is an American take on the much more fun “organogram.”