Divide and Combine to Conquer Big Data
I’ve recently been reading and hearing more about an approach called “Divide and Combine” (sometimes “Divide and Recombine”), which pops up in various guises and contexts. “Divide and conquer” algorithms in computer science are one aspect of this, but there is more to it than mere parallelization of computing.
The starting point is that it may be hard, or even impossible, to analyze a full data set because of its size. In some extreme cases it may not be feasible to store the entire data set on disk, let alone manipulate it for statistical analysis. With a clever parallel approach, though (applied on the statistical side, not just the computational one), it is sometimes possible to break the data into manageable chunks (the “Divide” part), analyze each chunk separately, and then put the results back together in a way that recovers, either exactly or approximately, the result that would have been obtained from analyzing the whole data set (the “Combine” part).
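To make the idea concrete, here is a minimal sketch in Python with NumPy (not from the original post; the simulated data and chunk size are illustrative assumptions). It uses ordinary least squares, a case where the Combine step is exact, because the sufficient statistics X'X and X'y are simple sums over chunks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "big" data set: n observations, p predictors.
n, p = 100_000, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# --- Divide: process the data in chunks, as if it never fit in memory ---
chunk_size = 10_000
XtX = np.zeros((p, p))   # running sum of X'X over chunks
Xty = np.zeros(p)        # running sum of X'y over chunks
for start in range(0, n, chunk_size):
    Xc = X[start:start + chunk_size]
    yc = y[start:start + chunk_size]
    XtX += Xc.T @ Xc
    Xty += Xc.T @ yc

# --- Combine: solve the normal equations from the accumulated pieces ---
beta_chunked = np.linalg.solve(XtX, Xty)

# Matches the full-data OLS fit up to floating-point error.
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.allclose(beta_chunked, beta_full))  # True
```

Here the combined estimate equals the full-data fit to floating-point precision; for analyses whose pieces do not add up so neatly (quantiles, say, or many nonlinear models), the Combine step typically yields an approximation instead.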