Delete duplicate records from table is the very common question and task we need in our day to day life, so we will see how easily we can delete duplicate records from a table by using subquery and using partition by function supported by Sql Server 2005 and later version.
Say the table name is Address, having following records
AddressID CustomerID Address City Zip Country 1 100 A B 1 USA 2 100 A B 1 USA 3 111 A B 1 USA 4 101 B C 2 USA 5 101 B C 2 USA 6 112 B C 2 USA 7 103 C D 3 USA 8 103 C D 3 USA 9 113 C D 3 USA
So If we will check duplicate records on the basis of CustomerID, Address, City, Zip and Country columns then record 1 and 2 is same, 4 and 5 is same, 7 and 8 is same. How to check this in Sql server, execute fullotin query
select *, ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country Order By addressId) [ranked] from address
Records with “ranked” > 1 is the duplicate record. So how to delete them, simple, we will delete all those records where ranked > 1, in this case record with address Id 2, 5 and 8 should be deleted, right? So let’s write the query to check
With ranked_records AS ( select *, ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country Order By addressId) [ranked] from address ) select * from ranked_records where ranked > 1
And here is the result:
AddressID CustomerID Address City Zip Country ranked 2 100 A B 1 USA 2 5 101 B C 2 USA 2 8 103 C D 3 USA 2
Means we are getting correct records, so change our last line to delete the record rather than select
select * from ranked_records where ranked > 1 To delete ranked_records where ranked > 1
Up to now we checked duplicate records on columns Customer ID, Address, City, Zip and Country, but what if we want to check duplicate records on the basis of only three columns (Address, City and Zip)? Don't worry, in this case we need a small change in our query, only remove columns from partition by, let's see this:
select *, ROW_NUMBER() OVER(Partition By address, city, zip Order By addressId) [ranked] from address
and here is the result:
AddressID CustomerID Address City Zip Country Ranked 1 100 A B 1 USA 1 2 100 A B 1 USA 2 3 111 A B 1 USA 3 4 101 B C 2 USA 1 5 101 B C 2 USA 2 6 112 B C 2 USA 3 7 103 C D 3 USA 1 8 103 C D 3 USA 2 9 113 C D 3 USA 3
It shows record number 1, 2 and 3 are duplicate, 4, 5, and 6 are duplicate and 7, 8 and 9 are also duplicate. To delete duplicate records we will use ranked column where ranked > 1 as we done in our previous example.
In all the above examples we delete those records which were added latter and keep the first records but as we are using the address table so last address will be the correct one so we need to delete all duplicate records except the last one. To achieve this we need to make a small change in our Row_number() query, so let’s change it, we will use Order By AddressID DESC:
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country Order By addressId DESC) [ranked]
If there is not Id column then how we can delete duplicate records? ID column is not needed to delete duplicate records but in that case we cannot guarantee which record will be deleted.
But if there is any column like CreationDate or DateUpdate etc. we can order our record to delete.
Lets say we want to delete all those records where address, city and zip is same in address table, then we will use the inner join on same table on the basis of required columns, see this
DELETE A1 From Address A1 Inner Join Address A2 ON A2.City = A1.City AND A2.Address = A1.Address AND A2.Zip = A1.Zip Where A1.AddressID > A2.AddressID
We can also use subquery like this
DELETE A1 From Address A1 Where Exists (Select 1 From Address A2 Where A2.City = A1.City AND A2.Address = A1.Address AND A2.Zip = A1.Zip AND A1.AddressID > A2.AddressID),
But these both technique is slow, if you will change this query's first line Delete A1 to Select A1.*, you will know why. Suppose you have 3 records with matching address, city and zip, when you will do the join all the 3 records will join From A1 to all the 3 records of A2 so 9 row will come in result, you can think if there is 100,000 matching records then it will create 100,000,000,000 records which is quite huge, while in our previous example where we used partition by, will always return the only those much records which we have in our table.
it's up to you which one will be best for your requirements.
You may ask if there is no column which can help to get the older or latest record then how we can order and delete older or newer records? No way, question itself says there is no way to know which record is older and which one is newer.
Having 10+ years of experience in Microsoft Technologies (C#, ASP.Net, MVC and SQL Server). Worked with Metaoption LLC, for more than 8 years and still with the same company. Always ready to learn new technologies and tricks.
|By Ali Adravi On 26 Aug, 13 Viewed: 649|
We can not directly change a column to identity columns, say we have a primary key column or any normal column in our table and want to make it identity column then we cannot do it directly by altering the table but there are only two way to change column to identity column 1. Create a new... By Ali Adravi On 02 May 2013 Viewed: 336
There are many ways which can help us to save our time to write some repeated code or to do some repeated action, in this post we will see some very common code which we need very frequently in our development life. I am not saying that after this post you will save your half say work every day but... By Ali Adravi On 31 Mar 2013 Viewed: 456
In SQL server there are decimal and money data type to store precision and scale both together or say decimal values. Suppose we want to get precision and scale separately then how to get it. There are many ways to do this but there is a function in SQL named PARSENAME which can be used to get... By Myghty On 26 Mar 2013 Viewed: 1,057
To write a good performing procedure or function we should be aware about available wildcards in SQL Server. I found most developer depends on ‘%’ wildcard only while we have others as well. In this article we will discuss all the available wildcards in detail with examples. First of all let’s see... By Dev D On 25 Feb 2013 Viewed: 304
When we need to concatenate two columns simply we can use `+` sign and that is correct, but what if any of them is null, Will it return what we want, NO, it will return null. So let's discuss how we should concatenate two or more columns without creating extra space or extra comma problem. Let's... By Ali Adravi On 13 Feb 2013 Viewed: 4,508