Detecting Duplication and Finding Unique Values in Excel – Pro Tips
Duplication of values in Excel, can occur in a single column or across multiple columns and rows in Excel. It is often necessary to eliminate duplication, for correct data analysis and data integration purposes. Other times, one may just want to note recurring values, rather than remove them. There are few key ways to get rid of duplicate values in Excel or to detect unique values. So, we are going to examine a few of these in this post.
Right, let’s get started with an example.
- Using Advanced Filter to Extract Unique Values
- Using the Remove Duplicates Feature on the Data Tab
- Using COUNTIF, EXACT and conditional formatting to detect duplication
- Creating an Excel Table and using the table options to remove duplicates or filter and sorting to detect duplication
- Using Slicers with Tables in order to filter and detect repeat values
- Remove Duplicates in a Table
- Download Excel File
- Some useful links:
High Performance Liquid Chromatography (HPLC) is a well-known and widely utilized analytical chemistry technique. It allows for separation, identification and quantification of compounds in a mixture. HPLC is used in many different scientific industries. From environmental sciences to pharmaceutical sciences, HPLC chemical compound analysis is fairly ubiquitous. Different HPLC Columns allow for analysis and identification of different types of chemical compounds. HPLC has been used to separate and identify:
- Antiarrhythmic drugs
- Pesticides & Herbicides
The complete range of compounds, where HPLC analysis has been used or found application in, is immense.
An analytical chemist working for hypothetical supplier, has been tasked to find out which HPLC Columns, universities and research institutes in the area, are using for chemical analysis.
He has been asked to send targeted advertising brochures to the said universities and research institutes, of the HPLC Columns (in the case of the capitalized column – this refers to a component of the HPLC instrument that is filled with a packing material; the sample mixture and a solvent is pumped through this particular column for analysis purpose) that they do not already have.
He has received an Excel worksheet with a lot of data, including the type of HPLC Column the university/research institution has and the analysis it is capable of. He also has the compound analysis that the university actually uses the HPLC Column for, since in some cases, due to the lab manager adjusting certain settings, the HPLC Column is used for additional compound analysis not necessarily specified by the manufacturer. He is interested in this section of the data, in order to add to the body of knowledge on analysis.
If he finds that by tweaking the sample preparation, or by adjusting certain settings, compound analysis not necessarily specified by the supplier is possible, this adds another avenue of revenue potentially, if further validated. The compounds the HPLC can detect are referred to by the number of carbon atoms in the molecules.
Different research assistants have also worked on compiling the spreadsheet, so needless to say there is a lot of duplication in the sheet. The source data is shown below, there are 200 rows of data.
Using Advanced Filter to Extract Unique Values
1) First things first, create five new sheets in order to store the extracted unique values, in these sheets. It is always a good idea to keep the original data intact.
2) We now have a sheet called UniversitiesResearchInstitutes, and we are going to extract all the unique university and research institution names to this sheet using Advanced Filter.
3) So we start on our destination sheet, which is the UniversitiesResearchInstitutes sheet as shown below.
4) Go to Data>Sort & Filter> Advanced.
5) Select the Copy to another location checkbox.
6) For the Copy to option, select $A$1 on the UniversitiesResearchInstitutes sheet, and tick unique records only as shown below.
7) Select the List range, and then navigate to the OriginalDataset sheet and select column $A$4:$A$204, which is our Name of University or Research Institute column, as shown below.
8) Click Ok, while still on the OriginalDataset sheet.
9) The unique university and research institute names are then extracted to the UniversitiesResearchInstitutes sheet.
10) We then follow the same process for all the other columns in the original data set.
11) We have now removed the duplication to obtain the unique list of universities, HPLC Columns, compounds advertised by the suppliers, the compound identification and detection actually taking place at the labs and research assistants using Advanced Filter.
We have the answers to the following questions, through using the Advanced Filter Feature:
How many unique universities or research institutions are there in the data set? Six unique names have been extracted.
How many unique HPLC Columns are on site in terms, of actually being at the university lab or research institution? Eighteen unique HPLC Column names have been extracted.
How many unique compounds can be separated or identified, as specified by the supplier(s) of the HPLC Columns? Eighteen unique compounds (denoted by their carbon number) can be extracted as specified by the actual supplier.
How many compounds are actually separated or identified, at the actual university labs or research institution labs? Twenty-one compounds are separated and identified using the HPLC columns at the actual labs.
How many research assistants worked on compiling the spreadsheet? Eleven research assistants worked on compiling the spreadsheet.
The analytical chemist can now look at the results in context for the Advanced Filter set of extractions and sees that twenty-one compounds are actually being separated and identified. This could be due to a number of reasons. In the course of standard analysis, the lab manager could have discovered by accident, that additional analysis was possible on the HPLC column, outside of the specs specified by the supplier. Alternatively, due to the expense of acquiring a new HPLC column for additional compound identification– the lab manager could have set out to tweak the sample preparation and settings on the instrument, in order to get other compounds to be separated.
Either way, the analytical chemist employed by the supplier would now book a trip, to the institutions that are doing these extra sets of analysis and verify their methods. If indeed they have discovered another protocol that allows for additional analysis, the supplier would add this compound to the set of compounds the HPLC column can analyse, in order to increase potential revenue streams.
Using the Remove Duplicates Feature on the Data Tab
One can also remove duplicate values, using the Remove Duplicates Feature on the Data Tab.
1) First things first, make a copy of the Original Dataset sheet, since it is always best to keep the source data intact as mentioned and call this copy, the RemoveDuplicates1 sheet.
2) Select the range, go to Data>Data Tools>Remove Duplicates.
3) Make sure the My data has headers checkbox is ticked, and then choose the column that you’d like the duplicates to be removed from or leave all the columns checked.
4) If you choose to remove duplication for one particular column in the data set, you’ll get the same results as for the Advanced Filter, unique extracts example above.
5) For example, choosing Name of University or Research Institute as shown below.
6) Results in a dialog box popping up, showing that based on the Name of University or Research Institute column 194 duplicate values were removed, and six unique values remain.
7) This confirms the Advanced Filter results and additionally highlights the number of duplicate values, within this specific column.
8) Doing the same thing for the HPLC Column yields 182 duplicate values found, 18 unique values remain.
9) Removing duplicates and ensuring all the columns are checked as shown below.
10) This results in Excel returning an answer that says that there were 3 duplicate values and 197 unique values. This means that there were three rows in the dataset where the values for Name of University or Research Institute column, HPLC Column column, Compounds the HPLC Column can Detect (Supplier) column, Compounds the HPLC Column is actually used to detect (specific lab) column and Research Assistant column were all equal.
11) We now want to look at the data in context and find out, outside of the Research Assistant column, how many values are duplicated.
12) We get 23 duplicate values removed, and 177 unique values remain in the data set as shown below.
Using COUNTIF, EXACT and conditional formatting to detect duplication
One can also use the COUNTIF function to detect duplication or recurring values.
1) For example in our Compounds the HPLC Column is actually used to detect (specific lab) column, we want to see the amount of times C8 compounds, is repeated.
2) We will make a copy of our OriginalDataset sheet and work in this copy. In a cell in the sheet, we enter the formula:
3) We get 18 results returned.
We can then use the EXACT function, in conjunction with conditional formatting in order to show us where exactly these values in the column are occurring.
1) We highlight the Compounds the HPLC Column is actually used to detect (specific lab) column as shown below.
2) We select the following Home>Styles >Conditional Formatting.
3) We select New Rule from the drop down options given.
4) We choose the Use a formula to determine which cells to format option.
5) In the Format values where this formula is true: textbox enter the following formula:
6) Choose Format…….
7) Choose a dark blue fill, with white bold text from the options given, as shown below.
8) Click Ok once done, in order to preview what the formatting will look like.
9) Click Ok again.
10) The recurring instances of C8 compounds are now highlighted through the conditional formatting.
11) In order to quickly see all the rows, containing the formatting, highlight the entire range, and go to Data>Sort & Filter>Sort.
12) In the Sort dialog box, for the Sort by option select the Compounds the HPLC Column is actually used to detect (specific lab) column. For Sort On, select the Cell Color option. For the Order, choose the cell color as shown, and then select on top and click Ok.
13) Now you can see all the C8 compound rows, on top, sorted by this colour formatting. This type of conditional formatting, followed by a colour sort to get all the values formatted on top, is ideal for data sets, which have large numbers of rows.
Creating an Excel Table and using the table options to remove duplicates or filter and sorting to detect duplication
You can also use the built in features of Excel Tables, in order to detect and remove duplication
1) Make a copy of the sheet containing the original data.
2) With a cell in the data set selected, press CTRL-T on your keyboard or go to Insert>Tables >Table.
3) Using the filtering options, on the table also allows for isolation of certain values. Clicking on the drop down arrow next to the Name of University or Research Institute column, allows one to select certain universities. Ticking just University ABC isolates all the records for University ABC in the data set as shown below.
4) One can also filter a table using more than one column, for example lets say one wanted to see only the records in the table from University ABC, that were compiled by research assistant Jennifer Smith. One would first tick University ABC as in the above example and then click the drop down arrow next to the Research Assistant column and select Jennifer Smith as shown.
5) Going to Data>Sort & Filter and clicking on Clear, in order to toggle custom filters you have applied, returns the table back to its original unfiltered state.
Using Slicers with Tables in order to filter and detect repeat values
1) Slicers provide a quick way to filter tables visually. In order to insert a slicer, select a cell in the table, go to the Design tab in the Table Tools context sensitive menu>Tools> Insert Slicer.
2) In the Insert Slicers dialog box, tick the Compounds the HPLC Column can Detect (Supplier) column and the Compounds the HPLC Column is actually used to detect (specific lab) column as shown below.
3) The slicers below are inserted.
4) Selecting C8 compounds from the Compounds the HPLC Column can Detect (Supplier) slicer, filters the table instantly to show only the rows where C8 compounds were listed, in this column as shown below.
5) We then choose C8 compounds again in the Compounds the HPLC Column is actually used to detect (specific lab) slicer as shown below. We want to check, where the value is repeated in order to see which labs are actually following the specifications set by the supplier.
6) We can now also see, which labs are not using the HPLC column in the manner specified by the manufacturer, by clearing the filter on the slicer on Compounds the HPLC Column is actually used to detect (specific lab) slicer. Consequently, we then see all the other carbon number compounds the HPLC Column is being used for, in addition to the one specified by the supplier.
7) We can see from just applying the filter to Compounds the HPLC Column can Detect (Supplier) slicer, that the compounds this HPLC Column, is also used for in addition to the supplier’s specification are C10 compounds, C11 compounds, C12 compounds, C15 compounds, C16 compounds, C17 compounds, C18 compounds, C19 compounds, C22 compounds, C5 compounds, C7 compounds and C9 compounds. The greyed out values indicate the column is not being used to analyse these other compounds. One can thus use slicers to conduct quick visual analysis and filtering.
What the analytical chemist would now do is book an appointment to visit these labs and see and verify the other compound detection. As mentioned, often due to the expense of procuring HPLC Columns, lab managers fine tune and tweak the sample preparation and the instrument settings to accommodate other compound analysis. If some of these other compounds are verified for this column, in addition to the C8 compounds specified by the supplier, then the analytical chemist has found another source of revenue for the supplier, due to the additional applications and compounds.
Remove Duplicates in a Table
1) One can also remove duplicates in an Excel Table. With a cell in the table selected, go to the Design tab in the Table Tools context sensitive menu>Tools > Remove Duplicates as shown below.
2) We can then select the columns, we’ll just leave all of the columns checked for this example.
3) We then get the amount of duplicates detected using these settings, which is the same value, that was returned using the Remove Duplicates Feature on the Data Tab example above
Download Excel File
Download the file that has been used to make this tutorial.
Excel provides a wide array of tools to detect recurring values and/or remove duplicate values. It is sometimes necessary when cleaning datasets in Excel for future database integration (which requires removal of redundancy and duplication) to use more than one tool. As mentioned and illustrated in some of the examples above, we don’t always want to remove recurring values instead we just want to count them or note them, scenarios that are catered to in Excel as well.
Please feel free to comment in the attachment and tell us your tips for removing duplicate values/detecting recurring values and if you use HPLC analysis in your lab.