Intelligently Deduplicating Records in Rails
I recently needed to write a deduplicate method in a Rails app I built. In the application, each artist
has_one song. However, over the course of several months, several songs had been added to the database with the same
artist_id as an existing song.
Based on this helpful post, my deduplicate method looked like this:
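A minimal sketch of that approach, not the exact method: it uses a plain Ruby Struct as a stand-in for the Song model so it runs without Rails, and the destroyed flag is purely illustrative. In the app itself, the collection would come from something like Song.all and destroy would be the usual ActiveRecord call.

```ruby
# Stand-in for the ActiveRecord model so the sketch runs without Rails.
# The destroyed flag only records that destroy was called.
Song = Struct.new(:id, :artist_id, :destroyed) do
  def destroy
    self.destroyed = true
  end
end

# Keep the first song for each artist and destroy the rest.
def deduplicate(songs)
  songs.group_by(&:artist_id).each_value do |duplicates|
    duplicates.shift            # the first song in the group is kept
    duplicates.each(&:destroy)  # whatever remains is a duplicate
  end
end

songs = [Song.new(1, 10), Song.new(2, 10), Song.new(3, 11)]
deduplicate(songs)
puts songs.reject(&:destroyed).map(&:id).inspect
```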
This method groups every song by the song’s artist. For groupings with more than one song (i.e. songs that have the same
artist_id), it keeps the first song and deletes all the rest.
This works great, but the choice is naive: it keeps whichever record happens to come first in each group, regardless of which duplicate is actually worth keeping.
To make the deduplicating method “smarter”, we can use sort_by to influence which duplicate will be kept. For example, if we wanted to keep the duplicate song with the highest play count, we could add this line before the shift method call:
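As a sketch of what that could look like, assuming the song has an integer attribute named play_count (a hypothetical name) and again using a Struct in place of the real model:

```ruby
# Stand-in for the Rails model; play_count is an assumed attribute name.
Song = Struct.new(:id, :artist_id, :play_count)

# Sort a group of duplicates so the most-played song comes first, then
# take it off the front. Whatever is left in the array would be destroyed.
def pick_most_played(duplicates)
  duplicates.sort_by! { |song| -song.play_count }  # highest play count first
  duplicates.shift
end

duplicates = [Song.new(1, 10, 5), Song.new(2, 10, 40), Song.new(3, 10, 12)]
kept = pick_most_played(duplicates)
puts kept.id                        # 2, the song with 40 plays
puts duplicates.map(&:id).inspect   # [3, 1], the songs left to destroy
```

Negating the play count sorts in descending order, so shift picks up the most-played song without a separate reverse step.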
This would ensure that the song “kept” using the shift method is the song with the highest play count.
It’s also possible to sort records by boolean values using sort_by. This was useful for my method, because in cases where an artist had two songs, I preferred to keep the song that had been reviewed by a trusted editor over one that had not. Ruby’s sort method doesn’t work with booleans, so we instead use sort_by and replace the booleans with integers to achieve the sort. I used the following line of code before the shift method call:
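A sketch of the boolean-to-integer trick, assuming a boolean attribute named reviewed (a hypothetical name) and a Struct standing in for the model:

```ruby
# Stand-in for the Rails model; reviewed is an assumed boolean attribute.
Song = Struct.new(:id, :artist_id, :reviewed)

# true and false cannot be compared with each other, so map them to
# integers: reviewed songs get 0 and therefore sort to the front.
def pick_reviewed(duplicates)
  duplicates.sort_by! { |song| song.reviewed ? 0 : 1 }
  duplicates.shift
end

duplicates = [Song.new(1, 10, false), Song.new(2, 10, true)]
kept = pick_reviewed(duplicates)
puts kept.id                        # 2, the reviewed song
puts duplicates.map(&:id).inspect   # [1], the song left to destroy
```

Note that sort_by makes no promise about the relative order of songs that share the same key, so if several duplicates are reviewed, which one ends up first is not guaranteed.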
Now, I have a concise, reusable method I can use to intelligently de-duplicate my songs.