"Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database."
The process of cleaning and preparing data for analysis, including dealing with missing values and outliers.
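As a minimal sketch of the two steps named above, the snippet below drops missing entries and then filters outliers with a z-score rule. The function name, the use of `None` for missing values, and the threshold are illustrative choices, not from the source.

```python
import statistics

def clean_series(values, z_thresh=1.5):
    """Drop missing entries (None), then drop points whose z-score
    exceeds z_thresh (one possible outlier rule among many)."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.pstdev(present)
    if stdev == 0:  # all values identical: nothing to flag
        return present
    return [v for v in present if abs(v - mean) <= z_thresh * stdev]

# The missing entry and the extreme reading 500 are both removed.
print(clean_series([10, 12, None, 11, 13, 500]))  # [10, 12, 11, 13]
```

A z-score cutoff is only sensible for roughly normal data; interquartile-range fences are a common alternative for skewed series.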
Data sources: Understanding where data comes from and how to access it.
Data integrity: Ensuring the accuracy and completeness of data.
Data formatting: Fixing issues with data structure, such as inconsistent naming conventions.
Data standardization: Establishing consistent formatting across multiple data sources.
Data validation: Verifying data accuracy through cross-checking and external sources.
Data cleaning tools: Utilizing software to automate data cleaning processes.
Missing data: Dealing with null or missing data points in a dataset.
Outliers: Detecting and removing extreme data points that skew analysis.
Data transformation: Converting data into a different format or structure to enhance analysis.
Data augmentation: Enriching a dataset with additional relevant data.
Data deduplication: Identifying and merging multiple records for the same entity.
Data matching: Combining different datasets through identifying common data points.
Data normalization: Transforming data to a standard format for easier comparison.
Entity resolution: Resolving ambiguities in data across various data sources.
Data sampling: Using a subset of data to test cleaning and analysis procedures before applying them to the entire dataset.
Data privacy: Protecting confidential or sensitive data from unauthorized access or misuse.
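Several of the terms above (deduplication, normalization, entity resolution) often appear together in practice: a field is normalized first so that superficially different records collapse to the same match key. The sketch below assumes a `name` field and a keep-first merge rule; both are illustrative.

```python
def deduplicate(records, key="name"):
    """Collapse records that share a normalized key, keeping the first seen."""
    seen = {}
    for rec in records:
        # Normalize before matching: trim, collapse whitespace, lowercase.
        k = " ".join(rec[key].split()).lower()
        if k not in seen:
            seen[k] = rec
    return list(seen.values())

rows = [
    {"name": "Alice Smith", "city": "Springfield"},
    {"name": "alice  smith", "city": "Springfield"},  # same entity, different formatting
    {"name": "Bob Jones", "city": "Shelbyville"},
]
print(len(deduplicate(rows)))  # 2 distinct entities remain
```

Real entity resolution usually needs fuzzier keys than exact-after-normalization, but the normalize-then-match shape is the same.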
Data Aggregation: Combining multiple data tables or sources into a single one for easier analysis.
Data Cleansing: Identifying and correcting inaccurate, incomplete, or irrelevant data in a database or table.
Data Parsing: Extracting structured components from raw fields, for example splitting a full address into street, city, and postal code, so each part can be analyzed independently.
Data Standardization: Converting the different formats, variables, and units of measure in a dataset into a consistent format that can be easily analyzed.
Deduplication: Identifying and removing duplicate data points within a dataset.
Fuzzy Matching: Matching records that are similar but not identical, using a similarity measure and threshold rather than exact equality.
Outlier Detection: Identifying extreme or aberrant data points that may not be representative of the dataset as a whole.
Normalization: Scaling numerical data to a common range so that values can be compared across different variables.
Text Cleaning: Removing unwanted characters, punctuation, or formatting from text fields to make them easier to process.
Missing Value Imputation: Filling in missing data points using predefined methods such as the mean, the median, or a model-based estimate.
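Two of the techniques above fit in a few lines each. Mean imputation and min-max scaling, shown below, are just one choice of "predefined method" for imputation and one choice of target range for normalization.

```python
def impute_mean(values):
    """Fill None entries with the mean of the observed values."""
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in values]

def min_max_scale(values):
    """Rescale numeric values to the common range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

print(impute_mean([1, None, 3]))   # [1, 2.0, 3]
print(min_max_scale([0, 5, 10]))   # [0.0, 0.5, 1.0]
```

Mean imputation shrinks variance and can bias downstream statistics, which is why the list above treats imputation method choice as its own decision.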
"Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data."
"The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores."
"Data cleansing may be performed interactively with data wrangling tools, or as batch processing through scripting or a data quality firewall."
"The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities."
"The validation may be strict (such as rejecting any address that does not have a valid postal code), or with fuzzy or approximate string matching (such as correcting records that partially match existing, known records)."
"Some data cleansing solutions will clean data by cross-checking with a validated data set."
"A common data cleansing practice is data enhancement, where data is made more complete by adding related information. For example, appending addresses with any phone numbers related to that address."
"Data cleansing may also involve harmonization (or normalization) of data, which is the process of bringing together data of 'varying file formats, naming conventions, and columns' and transforming it into one cohesive data set."
"A simple example is the expansion of abbreviations ('st, rd, etc.' to 'street, road, etcetera')."
"The inconsistencies detected or removed may have been originally caused by user entry errors, by corruption in transmission or storage, or by different data dictionary definitions of similar entities in different stores."
"Data cleansing may be performed interactively with data wrangling tools or as batch processing through scripting or a data quality firewall."
"After cleansing, a data set should be consistent with other similar data sets in the system."
"Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data."
"Data cleansing refers to identifying incomplete, incorrect, inaccurate, or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data."
"After cleansing, a data set should be consistent with other similar data sets in the system."
"Data cleaning differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at the time of entry, rather than on batches of data."
"The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities."
"Some data cleansing solutions will clean data by cross-checking with a validated data set."
"A common data cleansing practice is data enhancement, where data is made more complete by adding related information. For example, appending addresses with any phone numbers related to that address."