Data cleaning removes two types of problems from a dataset:
Problems with attributes - the values of variables, represented as columns in the tabular form of the dataset;
At the attribute level there are 6 main problems:
Invalid values that lie outside the permitted range, such as the number 7 in a field for school grades on a five-point scale;
Missing values that are not entered, meaningless, or undefined, e.g., the number 000-0000-0000 as a phone number;
Spelling errors: misspellings such as "voditl" instead of "driver", or "Omsk" instead of "Tomsk", which distort the original meaning of the variable by substituting one city for another;
Multiple meanings: different words used for the same concept, for example "driver" and "chauffeur", or one abbreviation used for different concepts, for example "DB" may stand for "big data" or "database";
Word transposition, usually found in free-format text fields;
Nested values - several values packed into a single attribute, such as a free-format field.
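The attribute-level checks above can be sketched in code. This is a minimal illustration; the field names, sample record, and validation rules are assumptions made for the example, not part of any specific dataset.

```python
# Minimal sketch of attribute-level checks: invalid range, placeholder
# "missing" values, and nested values in one field. All names and rules
# here are illustrative assumptions.

PLACEHOLDER_PHONES = {"", "000-0000-0000"}  # values treated as "missing"

def check_grade(value):
    """Invalid value: a grade outside the five-point scale 1..5."""
    return 1 <= value <= 5

def check_phone(value):
    """Missing value: a meaningless placeholder instead of a real number."""
    return value not in PLACEHOLDER_PHONES

def check_nested(value, sep=";"):
    """Nested values: several values packed into one free-format field."""
    return sep not in value

record = {"grade": 7, "phone": "000-0000-0000", "city": "Omsk;Tomsk"}
problems = []
if not check_grade(record["grade"]):
    problems.append("invalid grade")
if not check_phone(record["phone"]):
    problems.append("missing phone")
if not check_nested(record["city"]):
    problems.append("nested city values")

print(problems)  # every attribute-level problem in the sample is flagged
```

In practice each attribute gets its own rule, and the flagged records are routed to correction rather than silently dropped.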
Problems with records - the objects that form the rows of the dataset and are described by attribute values. At the record level there are four main problems:
Violation of uniqueness, for example of a passport number or other identifier;
Duplicate records, when the same object is described twice;
Inconsistent records, when the same object is described with different attribute values;
Incorrect references - violation of logical relationships between attributes.
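Two of the record-level problems above, exact duplicates and inconsistent descriptions of the same object, can be told apart with a single pass over the data keyed by an identifier. The sample records and the "passport" key below are illustrative assumptions.

```python
# Sketch of record-level checks: a duplicate is the same object described
# twice identically; an inconsistency is the same identifier with
# different attribute values. The sample data is an illustrative assumption.

records = [
    {"passport": "4509 123456", "name": "Ivanov", "city": "Omsk"},
    {"passport": "4509 123456", "name": "Ivanov", "city": "Omsk"},   # duplicate
    {"passport": "4509 123456", "name": "Ivanov", "city": "Tomsk"},  # inconsistent
]

seen = {}          # passport -> first full record seen with that identifier
duplicates = 0
inconsistent = 0
for rec in records:
    key = rec["passport"]
    if key not in seen:
        seen[key] = rec
    elif rec == seen[key]:
        duplicates += 1        # same object, identical description
    else:
        inconsistent += 1      # same identifier, different feature values

print(duplicates, inconsistent)
```

Duplicates can usually be dropped automatically; inconsistencies need a resolution rule (most recent wins, trusted source wins, or manual review).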
A data cleaning procedure should:
1. be able to identify and remove all major errors and inconsistencies, both in individual data sources and when integrating multiple sources;
2. be supported by tools that reduce manual checking and programming;
3. be flexible enough to work with additional sources.
Because data fuels machine learning and artificial intelligence technologies, enterprises need to take care of data quality. While data marketplaces and other data providers can help organizations get clean and structured data, these platforms cannot ensure the quality of an organization's own data. Enterprises therefore need to understand the necessary steps of a data quality strategy and use data quality tools to troubleshoot problems in their datasets.
Better data improves every activity that involves data, and almost all modern business processes involve data. Consequently, when data cleaning is treated as an important organizational effort, it can deliver a wide range of benefits for everyone.
Some of the biggest benefits include:
Optimized business practices: Imagine that none of your records contain duplicates, errors, or inconsistencies. How much more efficient would all of your key daily activities become?
Increased productivity: Being able to focus on core work tasks instead of hunting for the right data or correcting errors is a significant benefit. Access to clean, high-quality data through effective knowledge management can make a real difference.
A faster sales cycle: Marketing decisions depend on data. Giving your marketing department the highest-quality data means your sales team can convert more leads into sales. The same applies to B2C relationships!
#1 Develop a data quality plan
It's important to first understand where most of the errors occur, so you can identify the root cause and make a plan to manage them. Remember that effective data quality practices will make a huge difference for the entire organization, so it's important to stay as open and communicative as possible. The plan should include:
Responsibilities: ownership should sit with a C-level manager, such as a chief data officer (CDO) if the company has already appointed one. In addition, individuals responsible for specific kinds of data should be named.
Metrics: ideally, data quality should be summarized as a single number on a scale of 1-100. Different data may be of different quality, but a common number helps an organization measure continuous improvement. This number can give more weight to the data that is critical to the company's success, helping to prioritize the quality initiatives that affect the most important data.
Actions: a clear set of actions must be defined to begin implementing the data quality plan. Over time, these actions will need to be updated as data quality and company priorities change.
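The single weighted score described under "Metrics" can be sketched as a weighted average of per-attribute quality. The attribute names, per-attribute scores, and weights below are illustrative assumptions; a real plan would derive them from profiling and business priorities.

```python
# Sketch of the single 1-100 data quality score: a weighted average where
# business-critical attributes get more weight. All names, scores, and
# weights here are illustrative assumptions.

def quality_score(per_attribute_quality, weights):
    """Weighted average of per-attribute quality scores (each 0-100)."""
    total_weight = sum(weights.values())
    return sum(per_attribute_quality[a] * w
               for a, w in weights.items()) / total_weight

quality = {"customer_email": 90.0, "shipping_address": 60.0, "notes": 40.0}
# Critical attributes weigh more, as the plan suggests.
weights = {"customer_email": 5, "shipping_address": 3, "notes": 1}

score = quality_score(quality, weights)
print(round(score, 1))  # -> 74.4
```

Tracking this one number over time makes continuous improvement visible even when individual attributes move in different directions.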
#2 Correct data at source
If data can be corrected before it enters the system in an erroneous (or duplicated) form, it saves hours of work and reduces the burden down the line.
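One common way to correct data at the source is to validate each entry before it is accepted into the system. The fields and rules below are illustrative assumptions, a sketch of the pattern rather than a complete validation layer.

```python
# Sketch of entry-time validation: an entry is rejected (or sent back for
# correction) before it can pollute the system. Fields and rules are
# illustrative assumptions.

def validate_entry(entry):
    """Return a list of problems; an empty list means the entry is clean."""
    errors = []
    if not (1 <= entry.get("grade", 0) <= 5):
        errors.append("grade must be on the five-point scale")
    if entry.get("phone") in (None, "", "000-0000-0000"):
        errors.append("phone is missing or a placeholder")
    return errors

clean = validate_entry({"grade": 4, "phone": "555-0100"})
dirty = validate_entry({"grade": 7, "phone": "000-0000-0000"})
print(clean, dirty)
```

Rejecting the bad entry here costs one round trip to the person entering it; finding the same error months later costs an investigation.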
#3 Measure the accuracy of the data
Invest the time, tools, and research needed to measure the accuracy of your data in real time. If you need to purchase a tool to measure data accuracy, you can check out our article, "Tools for Measuring Data Quality," in which we explain the criteria for selecting the right tool for measuring data quality.
#4 Manage data and duplicates
If some duplicates do sneak past your data entry practices, make sure they are actively detected and removed. After removing duplicate records, it's also important to consider the following:
Standardization: Confirming that the same type of data exists in each column.
Normalization: Ensuring that all data is recorded consistently.
Merging: When data is scattered across multiple data sets, merging is the act of combining the relevant parts of those data sets to create a new file.
Aggregation: Sorting the data and expressing it in summary form.
Filtering: Narrowing the data set to include only the information we need.
Scaling: Converting the data so that it fits on a certain scale, such as 0-100 or 0-1.
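Several of the steps above (standardization, deduplication, aggregation, and min-max scaling) can be chained on a tiny sample. The sample rows are an illustrative assumption; the point is the order of operations, since standardizing first is what lets the duplicate be detected at all.

```python
# Sketch of post-entry cleanup: standardize, deduplicate, aggregate into
# summary form, then scale onto 0-1. The sample data is an illustrative
# assumption.

rows = [
    {"city": " omsk ", "sales": 20.0},
    {"city": "Tomsk", "sales": 80.0},
    {"city": "omsk", "sales": 20.0},   # duplicate once city is standardized
]

# Standardization/normalization: record the city consistently.
for r in rows:
    r["city"] = r["city"].strip().title()

# Deduplication: drop records that became identical.
unique = []
for r in rows:
    if r not in unique:
        unique.append(r)

# Aggregation: express the data in summary form (total sales per city).
totals = {}
for r in unique:
    totals[r["city"]] = totals.get(r["city"], 0.0) + r["sales"]

# Scaling: min-max transform onto a 0-1 scale.
lo, hi = min(totals.values()), max(totals.values())
scaled = {city: (v - lo) / (hi - lo) for city, v in totals.items()}

print(unique)
print(scaled)  # -> {'Omsk': 0.0, 'Tomsk': 1.0}
```

Note that min-max scaling divides by `hi - lo`, so a real pipeline must handle the degenerate case where all values are equal.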
#5 Add data
Adding data is a process that helps organizations identify and fill in missing information. Reliable third-party sources are often one of the best options for this.
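Filling gaps from a reference source can be sketched as a keyed lookup that only writes into missing fields. Both datasets below, and the rule of matching on the company name, are illustrative assumptions; real enrichment usually matches on a more reliable identifier.

```python
# Sketch of appending (enriching) records: missing fields are filled in
# from a trusted reference source, and existing values are never
# overwritten. All data here is an illustrative assumption.

customers = [
    {"name": "Acme LLC", "industry": None},       # gap to fill
    {"name": "Globex", "industry": "Energy"},     # already complete
]
# Hypothetical third-party reference data, keyed by company name.
reference = {"Acme LLC": {"industry": "Manufacturing"}}

for customer in customers:
    extra = reference.get(customer["name"], {})
    for field, value in extra.items():
        if customer.get(field) is None:   # only fill gaps, never overwrite
            customer[field] = value

print(customers)
```

The "never overwrite" guard is the important design choice: enrichment should add information, while conflicts between your data and the third-party source belong to the inconsistency-resolution step, not to this one.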
After completing these five steps, your data will be ready to be exported to the data catalog and used whenever analysis is needed. Keep in mind that with large datasets, 100% cleanliness is nearly impossible.