Data Quality Guide: 6 Dimensions, Big Data Lifecycle, Management, Validation Framework

The Big [Data] Idea:  

Big Data is best utilized when you have rich analytics to support your decisions, but this is only possible when the data is of high quality. However, 67% of U.S. executives are not comfortable using data from analytics because they lack confidence in data quality. This article will take you through a data quality management framework to help you manage data quality and gain confidence in your data analytics.

Why it is Important: 

Data that cannot support your decisions is virtually useless. In the era of Big Data, bad data has real financial consequences, costing U.S. companies $3.1 trillion yearly; on average, organizations around the world lose $9.7 million every year. Some researchers suggest that employees today spend 50% of their time hunting for the right data and 60% cleaning it. If data quality is ensured at the source or in storage, much of this cost can be avoided.

What’s ahead:  

This article will help you understand data quality and find ways to manage it through:

  • 6 Dimensions of data quality and how they impact the healthcare and banking industries 
  • Big-Data quality management framework 
  • Data Validation Framework  
  • Quality management process 
  • Responsibility for data quality

Understanding Quality in Data 

High quality data is the key to driving business intelligence. But with the dynamic nature of data, it’s important to consider all its dimensions.  

Data quality has six core dimensions: 

  • Timeliness 
  • Completeness 
  • Uniqueness 
  • Validity 
  • Consistency 
  • Accuracy 

Discrepancies in any of these dimensions can affect the quality of your data. You must also consider the industry in which the data is being used as the dimensions represent different use cases. Here are a few examples of data quality dimensions as they relate to healthcare and banking settings: 

Timeliness (real-time update of the data captured)
  • Healthcare: The difference between the time patient vitals were captured and the time they become visible in your EMR (Electronic Medical Records) system.
  • Banking: Your customer has withdrawn cash from an ATM. Is the withdrawal reflected immediately in their account?

Completeness (whether your database has complete information)
  • Healthcare: The number of births recorded in your system vs. the total number of babies delivered in the hospital.
  • Banking: A new account has been opened. Does your portal show all the details captured when your customer logs in?

Uniqueness (whether any record is captured more than once)
  • Healthcare: Duplicate entries of your patients in your database.
  • Banking: Your customer transferred an amount online to another account just once, but does your automated bank feed record it twice?

Validity (whether your data conforms to the required syntax)
  • Healthcare: Whether the severity of a hearing loss is selected from the list of allowed values.
  • Banking: Does the account holder's name on your bank statement follow the prescribed first name-middle name-surname format?

Consistency (agreement between different representations of the same information)
  • Healthcare: A patient's test result should match between your internal records and the lab's records.
  • Banking: Do the customer details on a loan account match those on the savings account?

Accuracy (whether the data describes the real object correctly)
  • Healthcare: If a date is entered day-first instead of in the prescribed MM/DD/YYYY format, the patient data will be incorrect.
  • Banking: If your customer is a working professional, does your customer database contain the correct occupation details?

Big Data Quality Management Framework 

The speed at which data flows is truly mind-boggling. An effective governance framework is required, especially for banks. The framework can ensure that data quality is maintained across the flow of big data, from discovery through definition, design, deployment, and monitoring. Your framework must contain several key components, including:

  • A concept of quality that measures the dimensions identified above 
  • A Big Data life cycle that guides your quality management processes 
  • A data validation framework that helps you ensure the validity of your data 
  • A quality management process that provides steps to manage quality 
  • Shared responsibility for data quality, which lies not just with data engineers but with all stakeholders

Let’s dive into each of these.  

Big Data Life Cycle 

Big data flows through a series of 5 stages before it reaches the end-user. To avoid compounding data issues, close attention must be paid at each stage. Here’s what you need to know about each stage: 

Stage 1: Data generation 

Whether we search for something online, click a link, or send a message, every action generates information. When collected, you’ll possess a mix of structured, semi-structured, and unstructured data. At this stage, you might not be able to do much with the unstructured data, but while collecting data in a structured format, you can take the following measures to ensure quality: 

  • Put validation checks on the fields where data is captured to ensure that entries follow the prescribed format 
  • Hold approvals or flag data fields that are not completed, and hold people accountable for completing them within a targeted time (see the sketch after this list)
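To make the first measure concrete, here is a minimal sketch of a capture-time validation function (assuming Python; the field names, date format, and allowed severity values are hypothetical):

```python
from datetime import datetime

# Hypothetical intake-form fields and allowed values
REQUIRED_FIELDS = ["patient_id", "visit_date", "severity"]
ALLOWED_SEVERITY = {"mild", "moderate", "severe"}

def validate_record(record: dict) -> list:
    """Return a list of quality issues found in one captured record."""
    issues = []
    # Completeness: flag any required field that is missing or empty
    for field in REQUIRED_FIELDS:
        if not record.get(field):
            issues.append(f"missing required field: {field}")
    # Validity: the date must follow the prescribed MM/DD/YYYY format
    try:
        datetime.strptime(record.get("visit_date", ""), "%m/%d/%Y")
    except ValueError:
        issues.append("visit_date not in MM/DD/YYYY format")
    # Validity: severity must be selected from the list of allowed values
    if record.get("severity") and record["severity"] not in ALLOWED_SEVERITY:
        issues.append(f"severity not in allowed values: {record['severity']!r}")
    return issues

# A record with a mis-formatted date is flagged before it enters the pipeline
print(validate_record({"patient_id": "P-001", "visit_date": "2024-01-05", "severity": "mild"}))
```

Records that return an empty list pass; anything else can be held for approval or sent back to the person who captured the data.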

Stage 2: Data collection 

Your data will likely come from several different sources. Here are a few things you can do to organize your data sources: 

  • Use filters when collecting data instead of capturing everything, so you collect only what you need 
  • Merge data received from different sources so that a single source of truth is created 
  • If your data-capturing system is integrated with your storage, run a check for an existing entry to avoid duplicates (a sketch of these two steps follows this list)
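Here is a minimal sketch of the merging and duplicate-check steps (assuming Python with pandas; the feeds and customer IDs are hypothetical):

```python
import pandas as pd

# Hypothetical extracts from two source systems with overlapping customers
branch_feed = pd.DataFrame({"customer_id": [101, 102], "name": ["A. Shah", "B. Lopez"]})
online_feed = pd.DataFrame({"customer_id": [102, 103], "name": ["B. Lopez", "C. Okoro"]})

# Merge the feeds into a single source of truth, keeping each customer only once
combined = (
    pd.concat([branch_feed, online_feed], ignore_index=True)
      .drop_duplicates(subset="customer_id", keep="first")
)

# Before loading a new record, check storage for an existing entry to avoid duplicates
new_record = {"customer_id": 103, "name": "C. Okoro"}
if new_record["customer_id"] in set(combined["customer_id"]):
    print("Duplicate found - merge with the existing record instead of inserting")
```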

Stage 3: Data processing 

Processing a large volume of data is a huge undertaking. At this point, you will be formatting, cleaning, and compressing it.

  • While formatting your data, check for conformity to syntax and make corrections wherever required 
  • Cleaning your data before it reaches your storage space helps you catch missing information 
  • Merge records whenever there are duplicate data entries (see the sketch after this list)
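A minimal sketch of these processing steps (assuming Python with pandas; the columns and formats are hypothetical):

```python
import pandas as pd

# Hypothetical raw extract with formatting, completeness, and duplication issues
raw = pd.DataFrame({
    "patient_id": ["P-001", "p-001 ", "P-002"],
    "visit_date": ["01/05/2024", "01/05/2024", "2024-01-07"],
    "weight_kg":  [70.5, None, 82.0],
})

# Formatting: normalize identifiers and check dates for conformity to MM/DD/YYYY
raw["patient_id"] = raw["patient_id"].str.strip().str.upper()
raw["visit_date"] = pd.to_datetime(raw["visit_date"], format="%m/%d/%Y", errors="coerce")
print(f"{int(raw['visit_date'].isna().sum())} date(s) need correction")

# Cleaning: surface missing values before the data reaches storage
print(f"{int(raw['weight_kg'].isna().sum())} record(s) with a missing weight flagged for follow-up")

# Deduplication: merge rows that describe the same visit
clean = raw.drop_duplicates(subset=["patient_id", "visit_date"], keep="first")
```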

Stage 4: Data storage 

Once your data gets to this stage, it will be published to all users, so any quality issues must be resolved before this point. When you are storing data manually, validation checks by data managers are helpful; with automated data capture, the validation rules applied earlier in the pipeline do much of that work for you.

Stage 5: Data Analytics & Visualization 

Once you reach the analytics stage, you will begin to see the results of your quality management efforts. Minor errors that slipped through the earlier stages are exposed visually, so use checks to discover quality issues and flag them to your team for correction.

Data Validation Framework 

Big Data programming languages provide built-in data validation options that support you in checking data against quality dimensions like accuracy, completeness, timeliness, consistency, and uniqueness. Standard validation options are:

Type validation: Ensure each field matches its expected type; for example, age must be a whole number and dates must follow the MM/DD/YYYY format.
Value validation: Specify what is required, allowed, and forbidden in data fields.
Key validation: If your visualizer cannot support a specific feature of the data, such as an upper-case font, add a check for that.
Range validation: Define a range for fields like age, salary, and duration to ensure that values fall within it.
Consistency check: Ensure that data is logically correct; for example, a shipping date cannot come before the order date.
Uniqueness check: Check that values which must be unique do not repeat in another row of your data file.
Code check: Check that the entry is selected from a valid list, such as a list of countries or valid postal codes.
Lineage validation: Compare source and destination datasets to ensure all data generated at the source reaches storage as required.
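Several of these options translate directly into code. Below is a minimal sketch (assuming Python with pandas; the columns, ranges, and allowed values are hypothetical) covering range, code, consistency, and uniqueness checks:

```python
import pandas as pd

# Hypothetical order data to run the standard validation options against
orders = pd.DataFrame({
    "order_id":   [1, 2, 2, 3],
    "age":        [34, 17, 210, 45],
    "country":    ["US", "IN", "XX", "GB"],
    "order_date": pd.to_datetime(["2024-01-02", "2024-01-03", "2024-01-03", "2024-01-10"]),
    "ship_date":  pd.to_datetime(["2024-01-05", "2024-01-01", "2024-01-04", "2024-01-12"]),
})
ALLOWED_COUNTRIES = {"US", "IN", "GB"}  # reference list for the code check (assumed)

checks = {
    # Range validation: age must fall within a valid range
    "age_out_of_range":   ~orders["age"].between(18, 120),
    # Code check: country must be selected from the allowed list
    "invalid_country":    ~orders["country"].isin(ALLOWED_COUNTRIES),
    # Consistency check: a shipping date cannot come before the order date
    "ship_before_order":  orders["ship_date"] < orders["order_date"],
    # Uniqueness check: order_id must not repeat in another row
    "duplicate_order_id": orders["order_id"].duplicated(keep=False),
}

for name, failed in checks.items():
    print(f"{name}: {int(failed.sum())} row(s) failed")
```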

 

Data Quality Management Process 

Data quality can be managed by following this 5-step process:  

1. Identification and measurement of data quality: What is not measured cannot be verified. So, your first step should be to define quality on your own terms and identify metrics to measure data quality. Some common data quality measures used by U.S. organizations (computed in the sketch after this list) are:

  • data-to-error ratio 
  • empty values count 
  • accurate values to all values ratio  
  • field miss rate
  • duplication rate
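As a minimal sketch of how these measures can be computed (assuming Python with pandas; the sample data and the error count are hypothetical):

```python
import pandas as pd

# Hypothetical extract; in practice, point this at the dataset you are measuring
df = pd.DataFrame({
    "email": ["a@x.com", None, "b@y.com", "b@y.com"],
    "age":   [29, 41, 35, 35],
})
known_errors = 1  # errors reported by downstream validation checks (assumed figure)

total_cells = df.size
empty_values_count = int(df.isna().sum().sum())
field_miss_rate = empty_values_count / total_cells
data_to_error_ratio = total_cells / known_errors
accurate_values_ratio = (total_cells - empty_values_count - known_errors) / total_cells
duplication_rate = df.duplicated().sum() / len(df)

print(f"empty values: {empty_values_count}, field miss rate: {field_miss_rate:.0%}, "
      f"accurate-to-all ratio: {accurate_values_ratio:.0%}, duplication rate: {duplication_rate:.0%}")
```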

2. Define rules and targets for data performance: Set policies for data quality by assigning rules and targets for each quality dimension. For each activity or process, you might define your focus area in terms of quality checks. For instance, during data migration from legacy systems to the cloud, your focus may be on completeness. For every process, the focus areas can be different, and for each focus area, you can identify data quality issues. Some examples of these rules include:

Uniqueness: No recorded entity may exist more than once in the database, so every new record must be checked for duplication. If a matching record already exists, the new information may only be merged with that record.

Accuracy: Registration status must have a value consistent with the regional accreditation board's reference data; if it does not, the board is to be approached for verification.

Completeness: Constraints can be placed on which data-entry fields are mandatory and which are optional before a form is accepted by the system.

3. Design of quality improvement processes: Your quality improvement cannot be guaranteed unless you create a culture that respects data quality. Create local structures and define interventions to manage quality issues. Several measures can be taken to improve the quality of your data, as listed below (a short profiling sketch follows the list):

Profiling: Gather statistics on the data for quality assessment.
Matching: Merge or integrate related data records.
Parsing and standardization: Break data into components and re-combine them in a specific format.
Normalization: Eliminate redundancy and perform feature scaling.
Cleaning: Remove duplicate and incorrect data; modify values to meet standards and rules.
Metadata management: Centralize management of semantic metadata for organizational consistency.
Data quality dashboard: Show the measured quality metrics to compare actual with target data quality.
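As a minimal sketch of the profiling and normalization measures (assuming Python with pandas; the file name and the income column are hypothetical):

```python
import pandas as pd

customers = pd.read_csv("customers.csv")  # hypothetical extract to be profiled

# Profiling: gather per-column statistics that feed the quality assessment
profile = pd.DataFrame({
    "dtype":        customers.dtypes.astype(str),
    "missing_pct":  (customers.isna().mean() * 100).round(1),
    "unique_count": customers.nunique(),
})
print(profile)

# Normalization: min-max feature scaling for a numeric column (assumes an 'income' field)
if "income" in customers.columns:
    income = customers["income"]
    customers["income_scaled"] = (income - income.min()) / (income.max() - income.min())
```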

 

4. Implementation of quality improvement methods: Create a proper plan not just for making rules, but also for monitoring data quality, identifying opportunities for improvement, and implementing those improvements. Share this plan with your whole team to keep them aware of how important data quality is. Communicate successes as well, to encourage your team to keep up the good work.
 

5. Monitoring of data quality: You can create reporting systems, monitor performance in the field, or run real-time quality checks on data. To build a reporting system, assign the responsibility of checking data to supervisors. You can also create a team of data quality checkers to ensure quality dimensions are well taken care of when managing data. Real-time data quality checks can be run using tools such as: 

  • CDI (Customer Data Integration) 
  • MDM (Master Data Management) 
  • PIM (Product Information Management) 
  • DAM (Data Asset Management)

Responsibility for Data Quality 

From the bottom all the way to the top, many individuals across an enterprise encounter data. Therefore, they all influence data quality. It is essential to educate all knowledge workers on the importance of data quality and encourage them to follow the rules required to maintain it.  

Data Quality in the Age of Big Data 

Quality data in Big Data streams is essential to achieving positive business outcomes, but the path there isn't always easy. Understanding the Big Data life cycle and committing to a data quality management process will create your roadmap. Additionally, we've designed a QE self-assessment test that you can use to understand whether your quality engineering practices are equipped to deal with the quality challenges of Big Data. If they are not, our team is here to assist you in your digital transformation.

For more on data quality, quality engineering, and machine learning services, check out Apexon's data services and quality engineering practice, or get in touch directly using the form below.

 

Interested in our Data Services?
