Sometimes when dumping data, it makes more sense from a performance perspective not to worry about removing duplicate rows in the SQL query itself and to strip them out after the fact instead. This is easily done on the shell as follows:
cat filename.csv | sort --buffer-size=32M | uniq > filename_uniq.csv
You can omit the --buffer-size argument to sort and use the default size, or set it to whatever suits the memory you have available.
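Note that uniq only collapses adjacent duplicate lines, which is why the sort has to come first. As a rough sketch of the same idea, sort can also do both steps itself with its -u flag. The file names and sample rows below are made up purely for illustration:

```shell
# Build a small sample file with duplicate rows (hypothetical data).
printf '%s\n' 'b,2' 'a,1' 'b,2' 'c,3' 'a,1' > sample.csv

# sort -u sorts the lines and drops duplicates in one step,
# giving the same result as piping sort into uniq.
sort -u sample.csv > sample_uniq.csv

# Show the de-duplicated result.
cat sample_uniq.csv
```

Either form sorts the whole file, so a CSV header row, if there is one, will end up sorted in with the data rather than staying on the first line.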
Showing posts with label de-dupe. Show all posts
Thursday, March 27, 2008