Tuesday, April 1, 2014

Sort 10GB dataset over 1GB memory

Reposted from here
This is a frequently encountered problem, the solution to which is based on the concept of External Sorting.
Here’s a summary of the process involved in doing this:
  1. Of course, given that you only have 1 GB of main memory, that is the most you can read at any given time. So, read the data in chunks of 1 GB at a time and sort each chunk in memory using one of the conventional sorting algorithms, most likely Quicksort.
  2. Write each sorted dataset to disk (of course, assuming that you have enough disk space :)).
  3. You will end up with 10 individually sorted 1 GB chunks (10 GB / 1 GB = 10 chunks), which now need to be merged into one single output file (on disk).
  4. Now read the first 100 MB (= 1 GB / 10 chunks) of each sorted chunk into an input buffer in main memory and allocate the remaining memory to an output buffer.
  5. Perform a 10-way merge and store the results in the output buffer. Whenever the output buffer fills up, write it to the final sorted file and empty it. Whenever one of the 10 input buffers runs empty, refill it with the next 100 MB of its associated 1 GB sorted chunk until no more data from that chunk is available. This is the key step that makes external merge sort work externally: because the merge makes only one sequential pass through each chunk, a chunk never has to be loaded completely; sequential parts of it can be loaded as needed (see the sketch after this list).
<Inspired by Wikipedia>
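Here is a minimal sketch of the two phases in Python, assuming the 10 GB input is a plain-text file with one sortable record per line. The file names, the CHUNK_LINES constant, and the helpers sort_chunks/merge_chunks are illustrative choices, not part of the description above; heapq.merge performs the k-way merge and, because it streams each chunk file sequentially through buffered reads, it plays the same role as the 100 MB input buffers.

import heapq
import os
import tempfile

CHUNK_LINES = 1_000_000  # stand-in for "as many lines as fit in 1 GB"

def sort_chunks(input_path):
    """Phase 1: read the input in memory-sized chunks, sort each, spill to disk."""
    chunk_paths = []
    with open(input_path) as f:
        while True:
            lines = [line for _, line in zip(range(CHUNK_LINES), f)]
            if not lines:
                break
            # Normalize a possibly missing final newline so sorting cannot fuse two lines.
            lines = [l if l.endswith("\n") else l + "\n" for l in lines]
            lines.sort()
            tmp = tempfile.NamedTemporaryFile("w", delete=False, suffix=".chunk")
            tmp.writelines(lines)
            tmp.close()
            chunk_paths.append(tmp.name)
    return chunk_paths

def merge_chunks(chunk_paths, output_path):
    """Phase 2: k-way merge of the sorted chunks.
    heapq.merge reads each chunk file lazily, so only a small buffered
    window of every chunk is in memory at any given time."""
    files = [open(p) for p in chunk_paths]
    with open(output_path, "w") as out:
        for line in heapq.merge(*files):
            out.write(line)
    for f in files:
        f.close()
        os.remove(f.name)

if __name__ == "__main__":
    chunks = sort_chunks("big_input.txt")      # hypothetical 10 GB input file
    merge_chunks(chunks, "sorted_output.txt")  # final sorted result on disk

The sketch sorts lines lexicographically; for numeric data you would pass a key function to sort() and heapq.merge(), but the structure of the two phases stays the same.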
