
Image by Editor
Power Query provides a powerful function called Table.Distinct to easily remove duplicate records from your dataset. In this tutorial, we will show how to use Table.Distinct to remove duplicate data using Power Query M language.
What is Table.Distinct?
Table.Distinct removes duplicate rows from a table, keeping unique records. It can remove duplicates based on all columns or just specific columns.
Syntax:
Table.Distinct(table as table, optional equationCriteria as any) as table
Where:
- table: The input table to remove duplicates from.
- comparisonCriteria [Optional]: It specifies which columns to use for comparison.
To show the use of Table.Distnict function, we will use sales data that contains multiple duplicate data. We loaded the data from Excel to Power Query.
Sales Data:
Basic Use: Remove Complete Duplicates
Let’s assume a sales data set that has some duplicate data. You can use Table.Distinct function to remove the duplicates from an entire dataset.
Power Query M Code:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], UniqueRows = Table.Distinct(Source) in UniqueRows
All identical rows are removed, preserving unique records only.
Remove Duplicates Based on Specific Columns
There are cases when duplicates are defined by one or more specific columns rather than entire rows. Removing duplicates based on the Region column.
Power Query M Code:
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], UniqueRows = Table.Distinct(Source, {"Region"}) in UniqueRows
Only the first occurrence of each unique Region is retained.
Use Multiple Columns As Criteria
For more granular control, specify multiple columns to determine which rows are considered duplicates. Removing duplicates based on Product and Salesperson column.
let Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], UniqueRows = Table.Distinct(Source, {"Product", "Salesperson"}) in UniqueRows
This retains rows unique across both columns.
Use Comparison Criteria (Case Sensitivity)
By default, Table.Distinct is case-sensitive. To handle duplicates in a case-insensitive manner, you can use the Comparer.OrdinalIgnoreCase function. Let’s remove case-insensitive duplicates from the Product column.
let Source = SalesData, UniqueRows = Table.Distinct(Source, {"Product", Comparer.OrdinalIgnoreCase}) in UniqueRows
This considers “Product A”, “product A”, “Product a” as duplicates.
Advanced Techniques:
Create Custom Comparers
Advanced users can specify custom functions as the criteria for Table.Distinct. This allows for more complex logic. Let’s create a table in Power Query to create a custom comparer.
let Source = #table( {"OrderID", "CustomerID", "Product", "Quantity"}, { {1001, "C001", "Keyboard", 5}, {1002, "C-001", "Mouse", 10}, {1003, "C_002", "Monitor", 2} } ), // Add a normalized column by removing '-' and '_' AddNormalizedID = Table.AddColumn( Source, "NormalizedCustomerID", each Text.Remove([CustomerID], {"-", "_"}), type text ), // Remove duplicates based on normalized Customer IDs DistinctCustomers = Table.Distinct( AddNormalizedID, {"NormalizedCustomerID"} ), // Optionally remove the helper column Result = Table.RemoveColumns( DistinctCustomers, {"NormalizedCustomerID"} ) in Result
The row with C-001 is removed because it duplicates C001 after normalization.
Use Table.DistinctRows for Legacy Support
In older versions of Power Query, you might see Table.DistinctRows instead.
Table.DistinctRows(YourTable, {"Column1", "Column2"})
This function is similar but has a slightly different syntax where you specify columns directly as the second parameter.
Tips for Working with Table.Distinct
- Performance: For large datasets, consider using Table.Distinct only on necessary columns to improve performance.
- Debugging: If your results aren’t as expected, use Table.View to inspect the data before and after applying Table.Distinct.
- Data Cleaning: Consider cleaning your data (standardizing case, trimming whitespace) before using Table.Distinct for better results.
- Column Selection: Only include columns that matter for your duplicate identification to improve results.
Conclusion
Table.Distinct is an essential function in the Power Query M language for data cleansing and preparation. By understanding its parameters and options, you can effectively remove duplicate data based on your requirements. Always ensure the columns specified in Table.Distinct accurately represents your intended duplicate criteria. Preview results in Power Query are used to verify that only intended duplicates have been removed.
Get FREE Advanced Excel Exercises with Solutions!