Thursday, March 27, 2008

Sort + De-dupe? Easy.

Sometimes when dumping data, it makes more sense from a performance perspective not to worry about removing duplicates in the SQL query itself and to do it after the fact instead. This is really easy to do in the shell:

sort --buffer-size=32M filename.csv | uniq > filename_uniq.csv

The --buffer-size argument is optional: omit it to use sort's default, or set it to whatever suits the memory you have available.
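
To see what the pipeline does, here is a small sketch run against made-up data. The temporary directory and the sample rows are assumptions for illustration only, and --buffer-size is assumed to be available (it is in GNU sort). The sort has to come first because uniq only collapses duplicate lines that are adjacent.

```shell
# Hypothetical sample data in a scratch directory (not from the original dump).
workdir=$(mktemp -d)
printf '%s\n' 'b,2' 'a,1' 'b,2' 'c,3' 'a,1' > "$workdir/filename.csv"

# Sort first so identical rows end up next to each other, then drop the repeats.
sort --buffer-size=32M "$workdir/filename.csv" | uniq > "$workdir/filename_uniq.csv"

# Show the de-duplicated result: three rows remain (a,1 / b,2 / c,3).
cat "$workdir/filename_uniq.csv"
```

Running uniq on the unsorted file would leave all five rows in place here, since no two identical rows sit next to each other.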

1 comment:

graywh said...

Even better, sort has a -u option so you don't have to use uniq.
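
A minimal sketch of that suggestion, again on made-up sample rows in a scratch directory (both are assumptions for illustration). It writes the de-duplicated file with sort -u alone, then checks that the result matches the two-step sort | uniq pipeline for this sample.

```shell
# Hypothetical sample data in a scratch directory.
tmpdir=$(mktemp -d)
printf '%s\n' 'b,2' 'a,1' 'b,2' 'c,3' 'a,1' > "$tmpdir/filename.csv"

# -u makes sort emit only one copy of each line, so uniq is not needed.
sort -u "$tmpdir/filename.csv" > "$tmpdir/filename_uniq.csv"

# Compare with the sort | uniq pipeline; cmp reads the pipeline output from stdin.
sort "$tmpdir/filename.csv" | uniq | cmp -s - "$tmpdir/filename_uniq.csv" && echo "same output"
```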