Delete duplicate records in sql for specific columns

Delete duplicate records from table is the very common question and task we need in our day to day life, so we will see how easily we can delete duplicate records from a table by using subquery and using partition by function supported by Sql Server 2005 and later version.

Say the table name is Address, having following records

AddressID   CustomerID  Address  City    Zip        Country
1           100         A        B       1          USA
2           100         A        B       1          USA
3           111         A        B       1          USA
4           101         B        C       2          USA
5           101         B        C       2          USA
6           112         B        C       2          USA
7           103         C        D       3          USA
8           103         C        D       3          USA
9           113         C        D       3          USA

So If we will check duplicate records on the basis of CustomerID, Address, City, Zip and Country columns then record 1 and 2 is same, 4 and 5 is same, 7 and 8 is same. How to check this in Sql server, execute fullotin query

select *, 
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
  Order By addressId) [ranked] 
from address

Records with “ranked” > 1 is the duplicate record. So how to delete them, simple, we will delete all those records where ranked > 1, in this case record with address Id 2, 5 and 8 should be deleted, right? So let’s write the query to check

With ranked_records AS
(
   select *, 
    ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
    Order By addressId) [ranked] 
    from address
)
select * from ranked_records
where ranked > 1

And here is the result:

AddressID   CustomerID  Address  City    Zip        Country   ranked
2           100         A        B       1          USA       2
5           101         B        C       2          USA       2
8           103         C        D       3          USA       2

Means we are getting correct records, so change our last line to delete the record rather than select

select * from ranked_records where ranked > 1 
To
delete ranked_records where ranked > 1

Up to now we checked duplicate records on columns Customer ID, Address, City, Zip and Country, but what if we want to check duplicate records on the basis of only three columns (Address, City and Zip)? Don't worry, in this case we need a small change in our query, only remove columns from partition by, let's see this:

select *, 
ROW_NUMBER() OVER(Partition By  address, city, zip 
  Order By addressId) [ranked] 
from address

and here is the result:

AddressID   CustomerID  Address  City    Zip        Country   Ranked
1           100         A        B       1          USA        1
2           100         A        B       1          USA        2
3           111         A        B       1          USA        3
4           101         B        C       2          USA        1
5           101         B        C       2          USA        2
6           112         B        C       2          USA        3
7           103         C        D       3          USA        1
8           103         C        D       3          USA        2
9           113         C        D       3          USA        3

It shows record number 1, 2 and 3 are duplicate, 4, 5, and 6 are duplicate and 7, 8 and 9 are also duplicate. To delete duplicate records we will use ranked column where ranked > 1 as we done in our previous example.

In all the above examples we delete those records which were added latter and keep the first records but as we are using the address table so last address will be the correct one so we need to delete all duplicate records except the last one. To achieve this we need to make a small change in our Row_number() query, so let’s change it, we will use Order By AddressID DESC:

 ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
  Order By addressId DESC) [ranked] 

If there is not Id column then how we can delete duplicate records? ID column is not needed to delete duplicate records but in that case we cannot guarantee which record will be deleted.

But if there is any column like CreationDate or DateUpdate etc. we can order our record to delete.

Another Way;

Lets say we want to delete all those records where address, city and zip is same in address table, then we will use the inner join on same table on the basis of required columns, see this

DELETE A1
From Address A1
Inner Join Address A2 ON A2.City = A1.City
    AND A2.Address = A1.Address
    AND A2.Zip = A1.Zip
Where A1.AddressID > A2.AddressID

We can also use subquery like this

DELETE A1
From Address A1
Where Exists (Select 1 From Address A2 
    Where A2.City = A1.City
    AND A2.Address = A1.Address
    AND A2.Zip = A1.Zip
    AND A1.AddressID > A2.AddressID), 

But these both technique is slow, if you will change this query's first line Delete A1 to Select A1.*, you will know why. Suppose you have 3 records with matching address, city and zip, when you will do the join all the 3 records will join From A1 to all the 3 records of A2 so 9 row will come in result, you can think if there is 100,000 matching records then it will create 100,000,000,000 records which is quite huge, while in our previous example where we used partition by, will always return the only those much records which we have in our table.

it's up to you which one will be best for your requirements.

You may ask if there is no column which can help to get the older or latest record then how we can order and delete older or newer records? No way, question itself says there is no way to know which record is older and which one is newer.

Ali Adravi Having 13+ years of experience in Microsoft Technologies (C#, ASP.Net, MVC and SQL Server). Worked with Metaoption LLC, for more than 9 years and still with the same company. Always ready to learn new technologies and tricks.
  • sql server
By Ali Adravi On 26 Aug, 13  Viewed: 1,455

Other blogs you may like

Adding identity to an existing column in SQL Server

We can not directly change a column to identity columns, say we have a primary key column or any normal column in our table and want to make it identity column then we cannot do it directly by altering the table but there are only two way to change column to identity column 1. Create a new... By Ali Adravi   On 02 May 2013  Viewed: 1,084

some useful sql tricks

There are many ways which can help us to save our time to write some repeated code or to do some repeated action, in this post we will see some very common code which we need very frequently in our development life. I am not saying that after this post you will save your half say work every day but... By Ali Adravi   On 31 Mar 2013  Viewed: 924

Get precision and decimal value in SQL Server by using PARSENAME

In SQL server there are decimal and money data type to store precision and scale both together or say decimal values. Suppose we want to get precision and scale separately then how to get it. There are many ways to do this but there is a function in SQL named PARSENAME which can be used to get... By Myghty   On 26 Mar 2013  Viewed: 3,310

Available wildcards in sql server with examples

To write a good performing procedure or function we should be aware about available wildcards in SQL Server. I found most developer depends on ‘%’ wildcard only while we have others as well. In this article we will discuss all the available wildcards in detail with examples. First of all let’s see... By Dev D   On 25 Feb 2013  Viewed: 666

Concatenate multiple columns in SQL Server with NULL value

When we need to concatenate two columns simply we can use `+` sign and that is correct, but what if any of them is null, Will it return what we want, NO, it will return null. So let's discuss how we should concatenate two or more columns without creating extra space or extra comma problem. Let's... By Ali Adravi   On 13 Feb 2013  Viewed: 12,549