The Anatomy of a Dataset
What You See Versus What's Actually There
S showed up with a laptop this time. Screen already open to a spreadsheet.
“I did the lineage exercise you suggested,” S said. “Traced one number from the board report all the way back to its source.”
“How did it go?”
“Painful. But useful. I found three places where the data gets transformed and nobody documented why.”
“That’s progress.”
“But it raised another question.” S turned the laptop toward me. “I got to the source dataset. The original table. And I realized I don’t actually know what’s in it.”
“What do you mean?”
“I mean I can see the columns. I can see the rows. I can see the values. But I don’t know if I can trust any of it. I don’t know what’s missing. I don’t know what’s wrong. I don’t know what assumptions are baked in. I’m looking at it, but I’m not seeing it.”
The Illusion of Understanding
This is one of the most common traps in working with data. You open a dataset. It has columns and rows. It has labels that seem clear. Customer_ID. Transaction_Date. Revenue. Status.
You assume you understand it.
But the surface of a dataset tells you almost nothing about what’s actually inside. The labels tell you what someone intended to capture. They don’t tell you what was actually captured. They don’t tell you what’s missing. They don’t tell you what’s inconsistent. They don’t tell you what’s changed since the dataset was created.
“I pulled some basic statistics,” S said. “Row counts. Column types. That kind of thing.”
“That’s a start. What did you find?”
“Nothing obviously wrong. Which is almost worse. I don’t know if there’s nothing wrong, or if I just don’t know how to look.”
“There’s a difference between looking at data and profiling data. Most people do the first. Very few do the second.”
What Data Profiling Actually Means
Data profiling is the systematic examination of a dataset to understand its structure, content, and quality. It goes far beyond opening a file and scrolling through rows.
A proper profile examines several dimensions; a short code sketch of a first pass follows.
Structure: How many rows? How many columns? What are the data types? Are the types consistent with what the labels suggest? A column called “Age” should contain numbers. A column called “Email” should contain text in a specific format. If the types don’t match expectations, something is wrong.
Completeness: Which fields have missing values? How many? Is the missingness random or systematic? If 90 percent of records are missing a field, that field is probably useless. If only records from a certain time period are missing a field, something changed in the data collection process.
Uniqueness: Are there duplicates? How do you define a duplicate? Two records with the same customer ID? The same email? The same name and address? The answer depends on context, and getting it wrong can significantly skew any analysis.
Distribution: What values actually appear in each column? What’s the minimum? The maximum? The mean? The median? Are there outliers? A column called “Age” should probably have values between 0 and 120. If there’s a value of 999 or -1, that’s likely a placeholder for missing data, not an actual age.
Consistency: Do related fields make sense together? If a record has a “Signup_Date” of 2025 but a “First_Purchase_Date” of 2019, something is wrong. If a “State” field says “California” but the “Zip_Code” starts with “100,” that’s New York, not California.
Validity: Do the values conform to known rules? Email addresses should match a pattern. Phone numbers should have the right number of digits. Status codes should come from a defined list. Dates should be actual dates.
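Most of these checks are a few lines of code once the table is loaded into a pandas DataFrame. Here is a minimal first-pass sketch; the file name and the Email check are illustrative assumptions, not a reference to any particular system:

```python
import pandas as pd

df = pd.read_csv("source_table.csv")  # hypothetical file name, for illustration

# Structure: size and the types pandas actually inferred.
print(df.shape)    # (rows, columns)
print(df.dtypes)   # an "Age" column that loads as text is already a finding

# Completeness: share of missing values per column.
print(df.isna().mean().sort_values(ascending=False))

# Validity: do values conform to an expected pattern? A rough email check:
if "Email" in df.columns:
    looks_like_email = df["Email"].str.contains(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
    print((~looks_like_email).sum(), "Email values that don't look like an address")
```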
“That’s a lot to check,” S said.
“It is. But most of it can be automated. The problem isn’t that it’s hard to do. The problem is that people don’t do it.”
The Problems Hiding in Plain Sight
Let me give you some examples of what profiling reveals that casual inspection misses.
The 99 Percent Problem
Let’s say you have a dataset of customer orders. It has columns for Order_ID, Customer_Name, Product, Amount, and Country. You’re asked to analyze international sales trends.
You open the file. You scroll through the Country column. You see “USA,” “Canada,” “UK,” “Germany,” “Australia.” Looks good. You have international data.
But when you actually count the values, you discover that 99 percent of the orders are from “USA.” The remaining 1 percent is scattered across 47 different countries, with most having fewer than 10 orders each.
If you’re trying to understand international buying patterns, this dataset is nearly useless. You don’t have enough data from any country except the US to draw meaningful conclusions. But you’d never know this from scrolling. You’d only know it from counting.
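Counting instead of scrolling is a one-line check. A minimal sketch, assuming the orders have been loaded into a pandas DataFrame from a hypothetical orders.csv:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical file name, for illustration

# Share of orders per country, not just which countries appear.
print(orders["Country"].value_counts(normalize=True).head(10))

# How many countries have too few orders to support any conclusion?
counts = orders["Country"].value_counts()
print((counts < 10).sum(), "countries with fewer than 10 orders")
```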
The Placeholder Problem
Let’s say you have a dataset of employees. It has columns for Employee_ID, Name, Department, Start_Date, and Annual_Salary. You’re asked to calculate the average salary by department.
You scroll through the Salary column. You see values like $52,000, $78,000, $95,000, $120,000. The numbers look reasonable for salaries.
But when you look at the full distribution, you find something strange. About 15 percent of the values are exactly $0. Another 5 percent are exactly $999,999.
These aren’t real salaries. The $0 values are probably contractors or interns whose salary wasn’t entered. The $999,999 values are probably executives whose salary is confidential and someone entered a placeholder instead.
If you calculate the average salary including these values, you get a number that represents nobody. The zeros pull it down. The 999,999 values pull it up. The result is meaningless.
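Suspicious clusters like these surface as soon as you look at exact value frequencies. A sketch, assuming the employee table is loaded from a hypothetical employees.csv:

```python
import pandas as pd

employees = pd.read_csv("employees.csv")  # hypothetical file name, for illustration

# Real salaries rarely repeat this often; placeholders do.
print(employees["Annual_Salary"].value_counts().head(10))

# The usual suspects, checked explicitly.
print((employees["Annual_Salary"] == 0).mean(), "share of $0 salaries")
print((employees["Annual_Salary"] == 999_999).mean(), "share of $999,999 salaries")

# The average with and without the suspect values shows how much they distort.
real = employees.loc[~employees["Annual_Salary"].isin([0, 999_999])]
print(employees["Annual_Salary"].mean(), "naive average")
print(real["Annual_Salary"].mean(), "average after excluding placeholders")
```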
The Format Problem
Let’s say you have a dataset of transactions. It has columns for Transaction_ID, Customer_ID, Date, and Amount. You’re asked to analyze sales trends by month.
You scroll through the Date column. You see values like “01/02/2024,” “15/03/2024,” “22/04/2024.” They look like dates. Good.
But when you look more carefully, you notice something. Some records show “01/02/2024.” Others show “2024-02-01.” Others show “02-01-2024.”
Three different formats in the same column. And here’s the problem: is “01/02/2024” January 2nd or February 1st? In the US, it’s January 2nd. In Europe, it’s February 1st. In the ISO format, it would be written differently altogether.
If this data came from multiple sources or multiple time periods, the formats might be mixed. Without checking, you could be putting January transactions in February and vice versa. Your monthly trend analysis would be wrong, and you’d never know why.
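One way to catch this before it corrupts a trend is to test how much of the raw column each candidate format can actually parse. A sketch, with an assumed transactions.csv and the Date column kept as text:

```python
import pandas as pd

transactions = pd.read_csv("transactions.csv", dtype={"Date": str})  # hypothetical

raw = transactions["Date"]
for fmt in ("%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%m-%d-%Y"):
    parsed = pd.to_datetime(raw, format=fmt, errors="coerce")
    print(fmt, "fails on", parsed.isna().sum(), "of", len(raw), "values")

# If no single format parses everything, the column is mixed, and the
# day/month ambiguity has to be resolved source by source, not guessed.
```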
The Evolution Problem
Let’s say you have a dataset of product sales going back five years. It has columns for Sale_ID, Date, Product_Name, Category, and Revenue. You’re asked to show how each product category has grown over time.
You scroll through the Category column. You see “Electronics,” “Clothing,” “Home Goods,” “Sports.” Looks consistent.
But when you filter by year and look at the values, you discover something. In 2020 and 2021, there was one category called “Electronics.” In 2022 and 2023, that category was split into “Consumer Electronics” and “Professional Electronics.” In 2024, they merged it back into just “Electronics.”
The label looks the same in 2020 and 2024, but they mean different things. In 2020, “Electronics” included everything. In 2024, “Electronics” is the merger of two categories that existed separately for two years.
If you draw a trend line for “Electronics” across five years, the line is lying to you. The dip in 2022-2023 isn’t a real decline. It’s just that the category was split. The spike in 2024 isn’t real growth. It’s just that they merged it back.
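Filtering by year is exactly the check that exposes this. A cross-tabulation of category against year makes a split or merge visible at a glance; a sketch, assuming a hypothetical sales.csv:

```python
import pandas as pd

sales = pd.read_csv("sales.csv", parse_dates=["Date"])  # hypothetical file name

# Sales per category per year: categories that appear, vanish,
# or split mid-series stand out immediately.
print(pd.crosstab(sales["Category"], sales["Date"].dt.year))
```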
The Silent Null Problem
Let’s say you have a dataset of customer contacts. It has columns for Customer_ID, Name, Email, Phone, and Address. You’re asked to verify that the database is complete before a marketing campaign.
You run a quick check for missing values. The system tells you every row has a value in every column. 100 percent complete. Great.
But when you look at the actual values in the Email column, you find entries like “N/A,” “NA,” “none,” “null,” “not provided,” “-,” and what looks like empty space but is actually a single space character.
These are all different ways that people entered “we don’t have this information.” But because they typed something instead of leaving it blank, the system counts them as complete.
Your 100 percent completeness is fiction. You might actually be missing email addresses for 20 percent of your customers. If you send a marketing campaign based on this “complete” data, 20 percent of your emails go nowhere.
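Disguised nulls only show up when you look for them explicitly. A sketch, assuming a hypothetical contacts.csv and an illustrative list of the usual stand-ins:

```python
import pandas as pd

contacts = pd.read_csv("contacts.csv", dtype=str)  # hypothetical file name

# Values that technically fill the cell but carry no information.
FAKE_NULLS = {"n/a", "na", "none", "null", "not provided", "-", ""}

email = contacts["Email"].fillna("").str.strip().str.lower()
disguised = email.isin(FAKE_NULLS)

print(contacts["Email"].isna().mean(), "share officially missing")
print(disguised.mean(), "share actually missing once disguised nulls are counted")
```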
“I’ve seen some of these,” S said. “The placeholder thing especially. We have records with birthdates of January 1, 1900.”
“That’s a classic. January 1, 1900 is the default date in many systems. It gets entered when the real date wasn’t known. Include those records in an average-age calculation and you’re averaging in customers who appear to be 125 years old. The result is obviously nonsense.”
The Metadata Gap
There’s another layer that profiling reveals: the gap between what documentation says and what data contains.
Let me give you an example.
Let’s say you have a dataset of support tickets. It has columns for Ticket_ID, Customer_ID, Date_Opened, Date_Closed, Status, and Priority. You’re asked to analyze how quickly tickets get resolved.
You find a data dictionary that someone wrote two years ago. It says the Status column contains three possible values: “Open,” “Closed,” or “Pending.”
You start writing your analysis based on this. Open tickets are active. Closed tickets are resolved. Pending tickets are waiting for something.
But when you actually look at the Status column, you find values like “Open,” “OPEN,” “open,” “Closed,” “closed,” “CLOSED,” “Pending,” “pending,” “Resolved,” “Cancelled,” “Escalated,” “On Hold,” “Waiting for Customer,” and “Duplicate.”
The documentation said three values. Reality has fourteen. Some are just capitalization differences. Others are completely new statuses that were added after the documentation was written.
If you build your analysis assuming only three statuses exist, you’ll miss or miscategorize a significant portion of your tickets. Your resolution time calculation will be wrong because you didn’t account for “Escalated” or “On Hold” tickets.
Documentation tells you what someone once believed about the data. Profiling tells you what’s actually there today. These are often very different things.
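Checking the documentation against reality is a set comparison. A sketch, assuming a hypothetical tickets.csv and the three documented values:

```python
import pandas as pd

tickets = pd.read_csv("tickets.csv")  # hypothetical file name

DOCUMENTED = {"Open", "Closed", "Pending"}          # what the data dictionary claims
actual = set(tickets["Status"].dropna().unique())   # what the column actually holds

print("Documented but never used:", DOCUMENTED - actual)
print("Present but undocumented:", actual - DOCUMENTED)

# How much of the gap is just capitalization?
normalized = tickets["Status"].dropna().str.strip().str.title().unique()
print("Distinct statuses after normalizing case:", sorted(normalized))
```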
“So the documentation is lies,” S said.
“Not lies. Outdated truths. The documentation was probably accurate when someone wrote it two years ago. But then the support team added new statuses. Someone started entering values in all caps. A new manager created an ‘Escalated’ category. And nobody updated the documentation because nobody owns it.”
The Upstream Blindness Problem
Here’s another thing profiling reveals that’s particularly important if you’re building AI or machine learning models.
Let me explain with an example.
Let’s say you’re building a model to predict which customers will cancel their subscription. You have a dataset of past customers with columns for Customer_ID, Age, Income, Subscription_Length, Support_Tickets_Filed, and Cancelled (yes or no).
You train the model. It works well in testing. You deploy it.
Six months later, the model is making terrible predictions. Customers it said would stay are leaving. Customers it said would leave are staying. What happened?
You go back and profile the original training data. Here’s what you find.
The Age column had placeholder values. About 10 percent of customers had an age of -1, which was entered when age was unknown. The model learned that customers with age -1 had specific cancellation patterns. But -1 isn’t a real age. It’s a data entry convention.
The Income column had duplicates. Some customers appeared multiple times with slightly different information. The model learned those examples more heavily because it saw them more often. But they weren’t more important. They were just duplicated.
The Support_Tickets_Filed column changed meaning over time. In the first year of data, it counted all tickets. In the second year, it only counted unresolved tickets. The model learned patterns based on a number that meant different things at different times.
The model didn’t know these were data quality issues. The model just saw patterns. And it learned whatever patterns helped it make predictions, including patterns that only existed because of data problems.
This is how you get models that work in testing but fail in production. They learned the quirks of the training data, not the underlying reality.
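A short pre-training audit would have caught at least the first two issues before the model ever saw them. A sketch, assuming the training table is a hypothetical customers.csv with the columns from the example:

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical file name

# Sentinel values masquerading as real ages.
print((customers["Age"] < 0).mean(), "share of rows with an impossible Age")

# Duplicated customers the model would see, and learn, more than once.
print(customers["Customer_ID"].duplicated().sum(), "repeated Customer_IDs")

# The third issue, a column changing meaning over time, needs a timestamp to
# detect: group Support_Tickets_Filed by period and look for a sudden shift
# in its distribution around the date the definition changed.
```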
“So profiling is actually a prerequisite for machine learning,” S said.
“It should be. But teams are under pressure to ship models quickly. Profiling feels like a delay. So they skip it, train on whatever data they have, and hope for the best. Sometimes they get lucky. Often they don’t. And when the model fails, they blame the algorithm instead of examining the data.”
What a Proper Profile Contains
Let me describe what a thorough data profile actually looks like, using a concrete example.
Let’s say you have a dataset of online orders. It has columns for Order_ID, Customer_ID, Order_Date, Product_ID, Quantity, Unit_Price, Total_Amount, Shipping_Address, and Status.
A proper profile would examine this dataset across several dimensions.
Summary statistics for every column
For Order_ID: How many total orders? Are all IDs unique, or are there duplicates?
For Quantity: What’s the minimum value? Maximum? Average? Is the minimum 1, or are there zeros or negative numbers that shouldn’t exist?
For Unit_Price: What’s the range? Is the minimum $0.01 or $0.00? Is the maximum reasonable, or is there a $999,999 placeholder?
For Status: What values appear? Just “Shipped” and “Delivered”? Or also “Cancelled,” “Returned,” “Processing,” “On Hold,” and twelve other variations?
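In pandas, answering these takes a handful of lines (the file name is illustrative):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical file name

print(orders["Order_ID"].is_unique)                    # any duplicate order IDs?
print(orders[["Quantity", "Unit_Price"]].describe())   # min, max, mean, quartiles
print(orders["Status"].value_counts(dropna=False))     # every status that actually appears
```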
Completeness analysis
Which columns have missing values?
Maybe Shipping_Address is missing for 5 percent of orders. Are those digital products that don’t need shipping? Or are they data entry errors?
Maybe Customer_ID is missing for 2 percent of orders. Are those guest checkouts? Or are they system errors?
Is the missingness random, or is it concentrated in certain time periods or certain product types?
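A sketch of the same checks in code, assuming Order_Date parses as a date:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["Order_Date"])  # hypothetical

# Share of missing values per column.
print(orders.isna().mean().sort_values(ascending=False))

# Is the missingness concentrated in time? Missing shipping addresses by month:
missing_by_month = (orders["Shipping_Address"].isna()
                    .groupby(orders["Order_Date"].dt.to_period("M")).mean())
print(missing_by_month)
```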
Uniqueness analysis
Order_ID should be unique. Is it? If the same Order_ID appears twice, do both rows have identical information, or different information? Which one is correct?
Are there duplicate orders where everything is the same except Order_ID? That might indicate the same order was entered twice by mistake.
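Both kinds of duplicate are easy to surface (a sketch, same hypothetical file):

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical file name

# Repeated Order_IDs: do the duplicated rows agree with each other or conflict?
print(orders[orders.duplicated("Order_ID", keep=False)].sort_values("Order_ID"))

# Rows identical in every column except Order_ID: likely the same order entered twice.
other_cols = [c for c in orders.columns if c != "Order_ID"]
print(orders[orders.duplicated(other_cols, keep=False)])
```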
Distribution analysis
For Quantity, most orders probably have quantities between 1 and 5. If there’s an order with quantity 10,000, is that a real bulk order or a data entry error?
For Total_Amount, what does the distribution look like? Is it clustered around certain price points? Are there outliers like $0.00 orders or $50,000 orders that need investigation?
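A sketch of the distribution checks:

```python
import pandas as pd

orders = pd.read_csv("orders.csv")  # hypothetical file name

print(orders["Quantity"].describe())                          # typical range
print(orders.loc[orders["Quantity"] > 100])                   # bulk orders or typos?
print(orders["Total_Amount"].quantile([0, .01, .5, .99, 1]))  # the tails worth a look
print((orders["Total_Amount"] == 0).sum(), "zero-amount orders")
```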
Cross-field validation
Does Quantity times Unit_Price equal Total_Amount? If not, which field is wrong?
If Status is “Delivered,” is there a delivery date somewhere? If Status is “Cancelled,” is Total_Amount zero or was the customer still charged?
Does Order_Date make sense? Are there orders dated in the future? Orders dated before the company existed?
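A sketch of the cross-field checks:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["Order_Date"])  # hypothetical

# Does the arithmetic hold?
expected = (orders["Quantity"] * orders["Unit_Price"]).round(2)
mismatch = expected.ne(orders["Total_Amount"].round(2))
print(mismatch.sum(), "orders where Quantity x Unit_Price != Total_Amount")

# Impossible dates.
print((orders["Order_Date"] > pd.Timestamp.today()).sum(), "orders dated in the future")
```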
Temporal analysis
Are orders distributed evenly over time, or are there gaps? A gap might mean data wasn’t collected during that period.
Did the average order value change suddenly at some point? That might indicate a pricing change, or it might indicate a data issue.
Did new Status values start appearing at a certain date? That might indicate a system change that needs to be accounted for.
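And a sketch of the temporal checks:

```python
import pandas as pd

orders = pd.read_csv("orders.csv", parse_dates=["Order_Date"])  # hypothetical

monthly = orders.groupby(orders["Order_Date"].dt.to_period("M"))
print(monthly.size())                   # months with no rows simply won't appear
print(monthly["Total_Amount"].mean())   # sudden jumps in average order value

# When did each Status value first appear?
print(orders.groupby("Status")["Order_Date"].min().sort_values())
```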
“That sounds like a lot of work for every dataset,” S said.
“It sounds like more work than it is. Most of this can be automated. You run a profiling tool, and it generates these statistics in seconds. The real work isn’t generating the numbers. It’s reviewing them and deciding what they mean.”
The Decision That Follows
Profiling isn’t an end in itself. It’s the foundation for a decision: can you trust this data for your intended purpose?
Let me give you four scenarios.
Scenario 1: Green light.
You profile the order dataset. Everything looks clean. Order_IDs are unique. Quantities and prices are in reasonable ranges. Total_Amount matches Quantity times Unit_Price in 99.9 percent of cases. Missing values are minimal and explainable. Status values match the documentation.
Decision: You can proceed with confidence.
Scenario 2: Yellow light with caveats.
You profile the dataset and find some issues. About 3 percent of orders have a Status of “Unknown” that isn’t in the documentation. About 1 percent have Total_Amount that doesn’t match Quantity times Unit_Price.
Decision: You can proceed, but document the limitations. Exclude the “Unknown” status orders from your analysis, or treat them as a separate category. Flag the mismatched amounts for investigation but don’t let them block the whole project.
Scenario 3: Red light, fixable.
You profile the dataset and find significant issues. The Order_Date column has three different formats. About 15 percent of Unit_Price values are $0.00, which are probably placeholder values. There are duplicate Order_IDs that need to be resolved.
Decision: You cannot proceed until these issues are fixed. The date formats need to be standardized. The $0.00 prices need to be investigated and either corrected or excluded. The duplicates need to be resolved. This is data cleaning work that must happen before analysis.
Scenario 4: Red light, unfixable.
You profile the dataset and find fundamental problems. The data only covers three months, but you need a full year for seasonal analysis. Half the customers are missing demographic information that’s essential for your segmentation. The Status definitions changed twice during the time period and there’s no way to reconcile them.
Decision: This data cannot support your intended purpose. You need different data, or you need to change what you’re trying to do. This is painful to admit, but better to know now than to build something on a broken foundation.
“How often is it that last one?” S asked.
“More often than people want to admit. Projects get approved based on assumptions about data availability. Someone says ‘we have two years of customer data’ and everyone assumes it’s usable. Then someone finally profiles it and discovers half the fields are empty, the definitions changed three times, and it’s actually only useful for six months of analysis. By that point, there’s pressure to proceed anyway. That’s how bad analyses get built. That’s how bad models get deployed.”
Making This Practical
S closed the laptop.
“So before I trust any dataset, I should profile it. Understand what’s actually there. Compare it to what’s documented. Make an explicit decision about whether it’s fit for purpose.”
“That’s exactly right.”
“And if I’m going to train a model on it, I should be especially careful. Because the model will learn whatever patterns are in there, including patterns that only exist because of data problems.”
“Yes. The model doesn’t know that -1 means ‘missing.’ The model doesn’t know that January 1, 1900 means ‘unknown date.’ The model just sees numbers and finds patterns. Your job is to make sure the patterns it finds are real.”
“And most of this profiling can be automated, so it’s not as slow as it sounds.”
“Correct. The barrier isn’t technical. There are tools that generate profiles with a few clicks. The barrier is cultural. It’s making profiling a standard step that happens before anyone builds anything, not an afterthought that happens when something goes wrong.”
Where This Goes Next
“Next week,” I said, “we should talk about what happens after you deploy something. You’ve profiled the data. You’ve built your model or your report. It’s working today. But data changes over time. What happens when the data you trained on stops representing the data you’re seeing? That’s called drift, and it’s one of the most common reasons things fail in production.”
S nodded. “Same time?”
“Same time.”
Next week: Data Drift. Why yesterday's training data fails tomorrow, and how to detect when your data stops representing reality.
You can’t trust what you haven’t examined. And scrolling through rows is not examination.
Download: “Dataset Profiling Checklist” — A systematic guide to understanding what’s actually in your data before you build anything on top of it.
Founder of SANJEEVANI AI. ISO/IEC 42001 Lead Auditor. 25+ years in AI, data, and compliance across HealthTech, FinTech, EdTech, and Insurance. Building METRIS, the quantitative AI governance platform.


