The key characteristics of Big Data are commonly known as “the 3 V’s”: Volume, Velocity, and Variety. Volume refers to the amount of data to store and process; Velocity to the tremendous pace at which data is generated, stored, and analyzed; and Variety to the many sources that generate data as well as the disparity of formats.
Two more V’s have been incorporated into the Big Data literature in recent years: Veracity and Value.
Veracity (the fourth V): The most critical aspect of the “5 V’s” is Veracity. It refers to uncertainty due to data inconsistency, incompleteness, ambiguity, latency, deception, and model approximation.
Value (the fifth V): All data becomes valuable as it turns into information and is consumed as knowledge; the value is subjective to each data consumer, but at the end of the day it has some value.
So, let's focus on Veracity. Despite widespread interest in Big Data over the last decade, there has been only slight success at standardizing data quality metrics and processes for data projects, Big Data included.
The importance of Veracity becomes clear when looking at some statistics. Example: fake news in 2016.
2.1M - Number of shares, reactions, and comments for the top performing fake news article on Facebook
46% - Percentage of the top-performing fake news stories on Facebook that were about US Politics
$1M - Amount that Craig Newmark (Craigslist founder) recently donated to fight fake news
Veracity is determined by data quality, and Veracity affects Value: no matter the Volume, Variety, and Velocity, poor Veracity undermines the real value of any data, Big Data included. Dedicating resources to data quality therefore improves the value of the resulting information, and knowledge becomes a true asset. There are many approaches to building quality processes for data, but when it comes to actually doing it, almost everyone takes a different path, with widely varying levels of success. When it works, you absolutely love the results. When it fails, you hate the headache of figuring out what is not working and why.
It is important to remember that Veracity adds value to data by providing a more realistic view of historical data for Business Intelligence and reporting, and by improving the quality of advanced analytics such as predictive analytics and machine learning.
Let’s not forget quality among the other V’s. Here is a high-level strategy for developing data quality across them.
In data quality assessment and reporting, measurements can at best be approximations. It is necessary to redefine most data quality metrics based on the specific characteristics of the data project, so that those metrics have a clear meaning, can be measured (as good approximations), and can be used to evaluate alternative strategies for data quality improvement.
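As a minimal sketch of what such a project-specific, approximate metric might look like, the completeness score below rates a batch of records by the fraction of required fields that are present and non-empty. The record layout and field names are hypothetical, invented purely for illustration.

```python
def completeness(records, required_fields):
    """Approximate completeness: fraction of required fields
    that are present and non-empty across all records."""
    if not records:
        return 0.0
    total = len(records) * len(required_fields)
    filled = sum(
        1
        for rec in records
        for field in required_fields
        if rec.get(field) not in (None, "")
    )
    return filled / total

# Hypothetical batch of customer records.
batch = [
    {"id": 1, "name": "Ana", "email": "ana@example.com"},
    {"id": 2, "name": "", "email": "bob@example.com"},
    {"id": 3, "name": "Eve"},  # email missing entirely
]
print(round(completeness(batch, ["id", "name", "email"]), 2))  # 7 of 9 values present -> 0.78
```

A score like this is only an approximation of "quality", but it is cheap to compute and comparable across batches, which is what makes it usable for evaluating improvement strategies.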
Given the volume of underlying data, it is not uncommon to find that some desired data was never captured or is unavailable for other reasons (high cost, delays in obtaining it, etc.). It is ironic but true that data availability continues to be a prominent data quality concern in the Big Data era.
Very often it is difficult to monitor data quality at the fast pace of data generation while keeping the overhead in time and resources (storage, compute, human effort, etc.) reasonable. In such scenarios, you need to define data quality metrics that are both relevant and feasible, especially in a real-time context.
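One way to keep a metric feasible at high velocity (a sketch under the assumption that a rolling view of recent events is acceptable, not a prescription) is to track quality incrementally over a fixed-size window instead of re-scanning stored data:

```python
from collections import deque

class RollingNullRate:
    """Track the share of missing values over the last `window` events,
    using O(1) work per event and O(window) memory."""
    def __init__(self, window=1000):
        self.events = deque(maxlen=window)

    def observe(self, value):
        # Record only whether the value was missing, not the value itself.
        self.events.append(value is None)

    def null_rate(self):
        return sum(self.events) / len(self.events) if self.events else 0.0

monitor = RollingNullRate(window=4)
for v in ["a", None, "b", None]:   # hypothetical stream of field values
    monitor.observe(v)
print(monitor.null_rate())  # 2 of the last 4 events were missing -> 0.5
```

Because old events fall out of the window automatically, the metric stays relevant to the current stream and the cost per event stays constant.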
Sampling can speed up data quality efforts, but this comes at a cost in accuracy, because samples are rarely a perfect representation of the entire dataset. Smaller samples give higher speed, but larger error.
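The trade-off can be illustrated with a quick simulation on synthetic data (seeded for reproducibility): estimating a known 10% bad-record rate from random samples of different sizes. The population and error rate here are invented for the example.

```python
import random

random.seed(42)
# Synthetic population of 100,000 records; exactly 10% are flagged bad (1).
population = [1] * 10_000 + [0] * 90_000
random.shuffle(population)

def estimated_error_rate(sample_size):
    """Estimate the bad-record rate from a random sample."""
    sample = random.sample(population, sample_size)
    return sum(sample) / sample_size

for n in (100, 1_000, 10_000):
    print(f"sample={n:>6}  estimate={estimated_error_rate(n):.3f}")
```

Running this repeatedly shows the small-sample estimates scattering much further from the true 0.10 than the large-sample ones, which is exactly the speed-versus-accuracy trade the text describes.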
Perhaps the biggest data quality issue in any data project is that the data includes several data types (structured, semi-structured, and unstructured) coming in from different data sources.
Semantic differences and syntactic inconsistencies arising from the variety of sources are key elements to consider when applying any data quality process under Variety.
Thus, a single data quality metric will often not be applicable to the entire dataset; the best alternative is to define quality metrics for each data type.
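A sketch of per-type metrics could look like the following, where each data type gets a check suited to its shape. The type labels and the specific checks are illustrative assumptions, not a standard.

```python
import json

def quality_for_structured(row, expected_columns):
    # Structured data: score by schema conformance.
    return sum(1 for c in expected_columns if c in row) / len(expected_columns)

def quality_for_semistructured(raw):
    # Semi-structured data: does the payload even parse as JSON?
    try:
        json.loads(raw)
        return 1.0
    except (ValueError, TypeError):
        return 0.0

def quality_for_unstructured(text):
    # Unstructured data: crude proxy -- non-empty and mostly printable.
    if not text:
        return 0.0
    printable = sum(ch.isprintable() or ch.isspace() for ch in text)
    return printable / len(text)

scores = {
    "structured": quality_for_structured({"id": 1, "name": "Ana"},
                                         ["id", "name", "email"]),
    "semi-structured": quality_for_semistructured('{"id": 1}'),
    "unstructured": quality_for_unstructured("free-form note"),
}
print(scores)  # each type gets its own, differently defined metric
```

The point is not these particular checks, but that "quality" means schema conformance for one type, parseability for another, and content heuristics for a third, so one metric cannot cover them all.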