Big data is data so large that it does not fit in the main memory of a single machine. The need to process big data by space-efficient algorithms arises in Internet search, machine learning, network traffic monitoring, scientific computing, signal processing, and other areas.
This course will cover mathematically rigorous models for developing such algorithms, as well as some provable limitations of algorithms operating in those models. Some topics covered include:
Dimensionality reduction. General techniques and impossibility results for reducing data dimension while still preserving geometric structure.
Numerical linear algebra. Algorithms for big matrices (e.g. a user/product rating matrix for Netflix or Amazon). Regression, low rank approximation, matrix completion, etc.
Compressed sensing. Recovery of (approximately) sparse signals based on few linear measurements.
Sparse Fourier Transform. Fast algorithms for (approximately) computing the Fourier Transform of signals which are (approximately) sparse in the frequency domain.
Problem: Design an algorithm that monitors a sequence of events and upon request can output (an estimate of) the number of events thus far. More formally, create a data structure that maintains a single integer n and supports the following operations.
init(): sets $n \leftarrow 0$.
update(): increments $n \leftarrow n + 1$.
query(): outputs $n$ or an estimate of $n$.
Before any operations are performed, we assume that n=0.
A trivial algorithm stores $n$ as a sequence of $\lceil \log_2 n \rceil = O(\log n)$ bits (a counter). If we want query() to return the exact value of $n$, this is the best we can do. Suppose some such algorithm uses $f(n)$ bits to store the integer $n$. There are $2^{f(n)}$ possible configurations of these bits, so for the algorithm to represent every integer up to $n$ exactly, the number of configurations must be at least $n$. Hence,
$$2^{f(n)} \ge n \implies f(n) \ge \log_2 n.$$
The goal is to use much less space than $O(\log n)$, so we must instead answer query() with some estimate $\tilde{n}$ of $n$. We would like this $\tilde{n}$ to satisfy
$$\mathbb{P}[\,|\tilde{n} - n| > \varepsilon n\,] < \delta$$
for some $0 < \varepsilon, \delta < 1$ that are given to the algorithm up front.
Morris’s algorithm:
init(): sets $X \leftarrow 0$.
update(): increments $X$ with probability $2^{-X}$.
query(): outputs $\tilde{n} = 2^X - 1$.
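As a concrete illustration, here is a minimal Python sketch of these three operations (the class and variable names are my own; the idealized algorithm stores only the counter $X$):

```python
import random

class Morris:
    """Morris's approximate counter: stores X rather than n itself."""

    def __init__(self):
        self.X = 0                          # init(): X <- 0

    def update(self):
        # increment X with probability 2^{-X}
        if random.random() < 2 ** (-self.X):
            self.X += 1

    def query(self):
        # output the estimate n~ = 2^X - 1
        return 2 ** self.X - 1
```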
Analysis: Let $X_n$ denote $X$ in Morris's algorithm after $n$ updates. One can show that $\mathbb{E}[2^{X_n}] = n + 1$, so $\tilde{n} = 2^{X_n} - 1$ is an unbiased estimator of $n$, and that $\mathrm{Var}[\tilde{n}] < n^2/2$. Chebyshev's inequality then gives
$$\mathbb{P}[\,|\tilde{n} - n| > \varepsilon n\,] < \frac{1}{2\varepsilon^2},$$
which is not particularly meaningful, since the right-hand side beats failure probability $1/2$ only when $\varepsilon \ge 1$ (in which case the estimator may very well always be $0$).
Morris+: To decrease the failure probability of Morris's basic algorithm, we instantiate $s$ independent copies of Morris's algorithm and average their outputs. That is, we obtain independent estimators $\tilde{n}_1, \ldots, \tilde{n}_s$ from independent instantiations of Morris's algorithm, and our output to a query is
$$\tilde{n} = \frac{1}{s} \sum_{i=1}^{s} \tilde{n}_i.$$
By linearity of expectation, $\mathbb{E}[\tilde{n}] = n$, and by independence, $\mathrm{Var}[\tilde{n}] = \frac{1}{s}\mathrm{Var}[\tilde{n}_1] < \frac{n^2}{2s}$. Thus, by Chebyshev's inequality,
$$\mathbb{P}[\,|\tilde{n} - n| > \varepsilon n\,] < \frac{1}{2s\varepsilon^2} < \delta$$
for $s > \frac{1}{2\varepsilon^2\delta} = \Theta\!\left(\frac{1}{\varepsilon^2\delta}\right)$.
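In code, Morris+ is a thin wrapper around the Morris sketch above (reusing that hypothetical class; $s$ would be chosen according to the bound just derived):

```python
class MorrisPlus:
    """Average s independent Morris counters to shrink the variance."""

    def __init__(self, s):
        self.copies = [Morris() for _ in range(s)]

    def update(self):
        for c in self.copies:
            c.update()

    def query(self):
        # n~ = (1/s) * sum_i n~_i
        return sum(c.query() for c in self.copies) / len(self.copies)
```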
It turns out there is a simple technique (which we will see often) to reduce the dependence on the failure probability δ from 1/δ to log(1/δ). The technique is as follows.
Morris++: We run $t$ instantiations of Morris+, each with failure probability $\frac{1}{3}$. We then output the median estimate from all $t$ Morris+ instantiations.
Analysis:
If more than half of the instantiations give good estimates, then the median value is also a good estimate.
Note that the expected number of Morris+ instantiations that succeed is $\frac{2t}{3}$ (the expectation of a sum of Bernoulli trials). Let
$$Y_i = \begin{cases} 1 & \text{if the $i$-th instantiation gives a good estimate} \\ 0 & \text{otherwise.} \end{cases}$$
By a Chernoff bound, $\mathbb{P}\left[\sum_{i=1}^{t} Y_i \le \frac{t}{2}\right] \le e^{-ct}$ for some constant $c > 0$, which is at most $\delta$ once $t = \Theta(\log\frac{1}{\delta})$.
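A sketch of the median trick in code, again reusing the hypothetical classes above (statistics.median is from the Python standard library):

```python
import statistics

class MorrisPlusPlus:
    """Median of t Morris+ estimators, each run with failure probability 1/3."""

    def __init__(self, s, t):
        self.copies = [MorrisPlus(s) for _ in range(t)]

    def update(self):
        for c in self.copies:
            c.update()

    def query(self):
        # the median is a good estimate whenever more than half the copies are good
        return statistics.median(c.query() for c in self.copies)
```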
We now turn to estimating the number $t$ of distinct elements in a stream. The following solutions assume that storing a real number does not require infinite memory and that we have a truly random hash function $h$ mapping stream elements into $[0,1]$. We will introduce a non-idealized solution later.
Maintain in memory the smallest hash value we've seen so far: $X = \min_{i \in \text{stream}} h(i)$.
query(): output $\frac{1}{X} - 1$.
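A minimal sketch of this idealized algorithm; the dictionary below only simulates the truly random hash function by lazily drawing a fresh uniform value per distinct element, so it is a stand-in for the idealized $h$, not the space-efficient version:

```python
import random

class FM:
    """Idealized distinct-elements sketch: track the minimum hash value seen."""

    def __init__(self):
        self.h = {}        # simulated truly random hash into [0, 1]
        self.X = 1.0       # smallest hash value seen so far

    def update(self, item):
        if item not in self.h:
            self.h[item] = random.random()
        self.X = min(self.X, self.h[item])

    def query(self):
        # with t distinct elements, E[X] = 1/(t+1), so 1/X - 1 estimates t
        return 1 / self.X - 1
```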
For some intuition, say that $t$ is the number of distinct elements. The hash values partition the interval $[0,1]$ into bins of expected size $\frac{1}{t+1}$. With this in mind, we claim that $\mathbb{E}[X] = \frac{1}{t+1}$, which is why query() outputs $\frac{1}{X} - 1$.
To achieve a logarithmic dependence on the failure probability, we can run $t = \Theta(\log\frac{1}{\delta})$ independent copies of FM+, each with failure probability $\delta = \frac{1}{3}$. Then query() outputs the median across all FM+ estimates.
The space for FM++ is $O(\frac{1}{\varepsilon^2}\log\frac{1}{\delta})$. The analysis of the median trick is similar to that in Lecture 1.
Definition: A family $\mathcal{H}$ of functions mapping $[a]$ into $[b]$ is $k$-wise independent if and only if for all distinct $i_1, \ldots, i_k \in [a]$ and for all $j_1, \ldots, j_k \in [b]$,
$$\mathbb{P}_{h \in \mathcal{H}}[h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k] = \frac{1}{b^k}.$$
Note that we can store $h \in \mathcal{H}$ in memory with $\lceil \log_2 |\mathcal{H}| \rceil$ bits. One example of such a family $\mathcal{H}$ is the set of all functions mapping $[a]$ to $[b]$. Then $|\mathcal{H}| = b^a$, so $\log|\mathcal{H}| = a \log b$. A less trivial example is due to Carter and Wegman, where $\mathcal{H}$ is the set of all degree-$(k-1)$ polynomials over $\mathbb{F}_q$ with $a = b = q$. Then $|\mathcal{H}| = q^k$, and so $\log|\mathcal{H}| = k \log q$.
Having seen these examples, we will just assume that we have access to some 2-wise independent hash family, which we can store using $O(\log n)$ bits.
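As an illustration, here is a sketch of a 2-wise independent family in the Carter–Wegman style: a random degree-1 polynomial $h(x) = (ax + b) \bmod p$ over a prime field (the choice of prime and how the output is used are illustrative, not fixed by the notes):

```python
import random

class PairwiseHash:
    """2-wise independent hash h(x) = (a*x + b) mod p, a random degree-1
    polynomial over the field F_p (Carter-Wegman)."""

    def __init__(self, p):
        self.p = p                       # p should be a prime at least the universe size
        self.a = random.randrange(p)     # storing (a, b) takes O(log p) bits
        self.b = random.randrange(p)

    def __call__(self, x):
        return (self.a * x + self.b) % self.p
```

Dividing the output by $p$ gives values in $[0,1)$ if a hash into the unit interval is wanted, as in the idealized sketch above.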
Suppose we have a subroutine that gives us $\tilde{t}$ as a 32-approximation to $t$. To get a $(1+\varepsilon)$-approximation, we will use the common strategy of geometric sampling of streams. This is important to understand because it is used fairly often in scenarios like this one.
First, let's consider the trivial solution: remember the first $K$ distinct elements in the stream, with $K = \frac{c}{\varepsilon^2}$; if the number of distinct elements exceeds $K$, the trivial solution dismisses the rest (a code sketch of this trivial routine appears at the end of this section). We can then compose these pieces as follows:
Assume $n$ is a power of 2, and pick $g: [n] \to [n]$ from a 2-wise independent family.
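For reference, here is a sketch of just the trivial small-$t$ routine from the paragraph above (the behavior on overflow is an illustrative choice, not fixed by the notes):

```python
class SmallCounter:
    """Trivial distinct-elements counter: exact while at most K distinct
    elements have appeared, otherwise it gives up."""

    def __init__(self, K):
        self.K = K
        self.seen = set()
        self.overflowed = False

    def update(self, item):
        if item in self.seen or len(self.seen) < self.K:
            self.seen.add(item)
        else:
            self.overflowed = True       # more than K distinct elements: dismissed

    def query(self):
        return None if self.overflowed else len(self.seen)
```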