DATA PROFILING FOR DUMMIES
Data profiling involves the evaluation of an existing data source in order to collect statistics or summaries about the data. These statistics can be used to determine whether the data can be readily used for other purposes, such as:
To facilitate data searches by including keyword tags or descriptions, or by assigning the data to a category;
To assess data quality and to determine whether the data conforms to particular standards or patterns;
To assess the risks of integrating data into new applications;
To identify metadata;
To assess whether known metadata accurately describe the actual data;
To understand data challenges early in any project, thereby avoiding delays and cost overruns;
To obtain an enterprise-wide view of all data, for uses such as master data management, where key data is needed, or data governance, where data quality must be improved.
Data profiling relies on descriptive statistics such as minimum, maximum, mean, mode, percentile, standard deviation, frequency, and variance; aggregates such as count and sum; and metadata about the data such as data type, length, and uniqueness. This information can then be used to identify issues such as illegal values, misspellings, missing values, varying value representations, and duplicates.
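The statistics above can be computed with only a few lines of code. The sketch below (a minimal illustration, not any particular tool's implementation; the `profile_column` name and the sample data are invented for this example) profiles a single column and surfaces missing values, duplicates, and a suspicious maximum:

```python
from collections import Counter
from statistics import mean, mode, pstdev

def profile_column(values):
    """Compute a simple descriptive profile of one column of data."""
    non_null = [v for v in values if v is not None]
    numeric = [v for v in non_null if isinstance(v, (int, float))]
    freq = Counter(non_null)
    profile = {
        "count": len(values),
        "missing": len(values) - len(non_null),          # null entries
        "unique": len(set(non_null)),                    # distinct values
        "duplicates": sum(c - 1 for c in freq.values()), # repeated entries
        "inferred_type": "numeric" if len(numeric) == len(non_null) else "mixed",
    }
    if numeric:
        profile.update({
            "min": min(numeric),
            "max": max(numeric),
            "mean": mean(numeric),
            "mode": mode(numeric),
            "std_dev": pstdev(numeric),
            "sum": sum(numeric),
        })
    return profile

# An "age" column with one missing value and one illegal value (150):
ages = [34, 29, None, 41, 29, 29, 150]
p = profile_column(ages)
# p["missing"] == 1; p["max"] == 150 immediately flags the outlier
```

Even this toy profile demonstrates the core idea: simple summary statistics make anomalies visible before the data is loaded into a downstream application.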
Data profiling is used to improve data quality, to shorten the implementation cycle of major projects, and to increase understanding of the data. This facilitates the discovery of embedded business knowledge, which is one of the significant benefits of data profiling. Data profiling is one of the most effective technologies for improving data accuracy in corporate databases, and it allows prompt identification of relationships relevant to data analyses.
Data profiling has evolved as a result of the increasing complexity of issues with data collection and analyses, and provides a sophisticated means of dealing with these issues. Modern applications of data profiling have evolved considerably, and generate more sophisticated profiles. These applications are increasingly streamlined and user-friendly.
Some data-profiling tools are free, although these have limited functionality. Data quality tools can be compared on the basis of which data quality tasks they perform: data profiling; data analysis; data transformation; data cleaning; duplicate elimination; and data enrichment. The following data profiling tools are worthy of note.
Ataccama: DQ Analyzer discovers, analyzes, and helps users understand critical data patterns. It provides visual representations of frequency, domain, and mask analyses of values, uncovers complex dependencies between data attributes, and can test relationships.
Experian: Connects data from different sources and eliminates duplication.
Informatica: Data Explorer is available in two editions, Standard and Advanced, both of which employ powerful data-profiling capabilities to scan every single data record, from any source, to find anomalies and hidden relationships regardless of their complexity.
Talend: Confers access to hundreds of types of data sources using built-in data connectors, and rapidly generates a variety of statistics, such as category-based counts, text and numeric-field analyses, and pattern frequency analyses. It can compare data to custom-defined, business-relevant thresholds and ranges, and can measure conformity to internal standards (SKUs or serial numbers) or external standards (international postal codes or credit card numbers).
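The pattern-frequency and conformity analyses mentioned for these tools follow a common idea: reduce each value to a character mask and count how often each mask occurs. The sketch below is a generic illustration of that technique, not the API of Talend or any other product; the `mask_of` helper and the sample postal codes are invented for this example:

```python
import re
from collections import Counter

def mask_of(value):
    """Reduce a value to a pattern mask: digits become 9, letters become A."""
    masked = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", masked)

def pattern_frequencies(values):
    """Count how often each pattern mask occurs in a column."""
    return Counter(mask_of(v) for v in values)

# A postal-code column with two conforming US ZIPs, one truncated
# value, and one UK-format code:
zips = ["90210", "10001", "1000", "SW1A 1AA"]
freqs = pattern_frequencies(zips)
# "99999" occurs twice; the rare masks "9999" and "AA9A 9AA"
# flag entries that do not conform to the expected format
```

Rare masks in the resulting frequency table are exactly the "varying value representations" that profiling is meant to surface.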
Data profiling helps ensure that the data fulfills collection and data-warehousing requirements. It also identifies anomalies and relationships that can improve subsequent data collection, analysis, and warehousing processes. The cost of good data profiling should not hinder its implementation.