Uniform Sampling of Facebook

Monday, March 8, 2010 - 6:00 p.m. to Tuesday, March 9, 2010 - 6:55 p.m.
Engineering Gateway 3161
Center for Pervasive Communications and Computing Seminar Series

Featuring Minas Gjoka
Ph.D. Candidate
The Henry Samueli School of Engineering, UC Irvine

Free and open to the public

Abstract:

With more than 250 million active users, Facebook is currently one of the most important online social networks. Our goal is to obtain a representative (unbiased) sample of Facebook users by crawling its social graph. In this quest, we consider and implement several candidate techniques. Two approaches that are found to perform well are the Metropolis-Hastings random walk (MHRW) and a re-weighted random walk (RWRW). Both have pros and cons, which we demonstrate through a comparison to each other as well as to the "ground truth" obtained through true uniform sampling of Facebook user IDs. In contrast, the traditional Breadth-First Search (BFS) and Random Walk (RW) perform quite poorly, producing substantially biased results. In addition to offline performance assessment, we introduce online formal convergence diagnostics to assess sample quality during the data collection process. We show how these can be used to effectively determine when a random walk sample is of adequate size and quality for subsequent use. Using these methods, we collect what is, to the best of our knowledge, the first unbiased sample of Facebook. Finally, we use one of our representative datasets, collected through MHRW, to characterize several key properties of Facebook.
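
For readers unfamiliar with the two random-walk techniques named in the abstract, the following is a minimal Python sketch of the textbook forms of MHRW and RWRW. It is not the speaker's crawler. The five-node graph `TOY_GRAPH` and the function names `random_walk`, `mhrw`, and `reweighted_mean_degree` are hypothetical and chosen only for illustration. The sketch assumes an undirected, connected graph stored as adjacency lists. A plain random walk visits a node with probability proportional to its degree; MHRW corrects this during the walk by sometimes rejecting moves, while RWRW corrects it afterward by weighting each sample by the inverse of its degree.

```python
import random
from collections import Counter

# Hypothetical toy friendship graph (undirected adjacency lists); not Facebook data.
TOY_GRAPH = {
    0: [1, 2, 3, 4],
    1: [0, 2],
    2: [0, 1],
    3: [0],
    4: [0],
}


def random_walk(neighbors, start, num_steps, rng):
    """Plain random walk: always move to a uniformly chosen neighbor.

    In the long run a node is visited in proportion to its degree,
    so high-degree nodes are over-represented.
    """
    current, visited = start, []
    for _ in range(num_steps):
        current = rng.choice(neighbors[current])
        visited.append(current)
    return visited


def mhrw(neighbors, start, num_steps, rng):
    """Metropolis-Hastings random walk targeting the uniform distribution.

    A neighbor is proposed uniformly at random and accepted with
    probability min(1, deg(current) / deg(candidate)); otherwise the
    walk stays in place and the current node is sampled again.
    """
    current, visited = start, []
    for _ in range(num_steps):
        candidate = rng.choice(neighbors[current])
        ratio = len(neighbors[current]) / len(neighbors[candidate])
        if rng.random() < ratio:
            current = candidate
        visited.append(current)
    return visited


def reweighted_mean_degree(neighbors, visited):
    """Re-weighted estimate of the mean degree from a plain random walk.

    Each sampled node is weighted by 1 / degree to undo the walk's
    bias toward high-degree nodes.
    """
    weights = [1.0 / len(neighbors[v]) for v in visited]
    weighted_sum = sum(w * len(neighbors[v]) for w, v in zip(weights, visited))
    return weighted_sum / sum(weights)


if __name__ == "__main__":
    rng = random.Random(0)
    mh_visits = Counter(mhrw(TOY_GRAPH, 0, 100_000, rng))
    print("MHRW visit share per node:",
          {v: round(c / 100_000, 3) for v, c in sorted(mh_visits.items())})

    rw_visits = random_walk(TOY_GRAPH, 0, 100_000, rng)
    naive = sum(len(TOY_GRAPH[v]) for v in rw_visits) / len(rw_visits)
    print("Plain RW mean degree (biased):", round(naive, 3))
    print("Re-weighted mean degree:",
          round(reweighted_mean_degree(TOY_GRAPH, rw_visits), 3))
```

On this toy graph the true mean degree is 2, whereas an uncorrected random walk converges to about 2.6 because it oversamples the hub node. Both corrections assume that the degree of each visited node can be observed during the crawl.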

About the Speaker:
Minas Gjoka received a B.S. degree in computer science from the Athens University of Economics and Business, Greece, in 2005 and an M.S. degree in networked systems from the University of California, Irvine, in 2008. He is currently a Ph.D. student in the Department of Electrical Engineering and Computer Science at the University of California, Irvine. His research interests include online social networks, peer-to-peer systems, network measurements, network protocols, and internet modeling.