
Thursday, March 27, 2008

Sort + De-dupe? Easy.

Sometimes when dumping data, it makes more sense from a performance perspective not to worry about removing duplicates in the SQL query itself, and to do it after the fact instead. This is really easy to do in the shell:

sort --buffer-size=32M filename.csv | uniq > filename_uniq.csv

You can omit the --buffer-size argument to use sort's default size, or set it to whatever you want.
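
If you want to try the idea on something small first, here is a minimal sketch. It assumes a POSIX-style sort, and the file names and rows are made up for illustration. It also uses sort's -u flag, which sorts and drops duplicate lines in one step, as an alternative to piping into uniq.

```shell
# Build a tiny sample file with duplicate rows (hypothetical data).
printf 'b,2\na,1\nb,2\nc,3\na,1\n' > sample.csv

# -u tells sort to emit only one copy of each identical line,
# so no separate uniq step is needed here.
sort -u sample.csv > sample_uniq.csv

# Show the de-duplicated result.
cat sample_uniq.csv
```

Note that either approach sorts the whole file, so a CSV header row, if there is one, gets sorted in with the data.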