# Sampling Online Social Networks

## People

- Faculty: Athina Markopoulou (EECS), Carter T. Butts (Sociology).
- Postdocs: Maciej Kurant, Minas Gjoka
- PhD Students: Blerim Cici
- MS Students: Yan Wang (visualization project)

- Collaborator: Natasa Przulj, Imperial College, London, UK.
- Current visitors: Omer Nebil Yaveroglu, and Kai Sun, Ph.D. students at Imperial College, London, UK.

## Funding

This work is currently supported by the following grants:

- AFOSR MURI: "Information Dynamics as Foundation for Network Management". AFOSR MURI prime award FA9550-09-0643 (subcontract to UCI from Princeton). 9/1/2009-8/31/2014.
- Docomo USA Labs contract: “OSNs from an Operator's Perspective”.

**Disclaimer:** Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the sponsors.

## Available Datasets

## Presentations

- 2/16/2012: Athina gives a talk at CSSI Seminar Series at UMASS.
- Nov. 2012: Maciej gives a presentation on Sampling Massive Online Graphs: Challenges, Techniques, and Applications to Facebook at KTH, University of Helsinki and ETHZ
- 9/23/2011: Athina gives a seminar (overview of our work on Sampling Online Social Networks) at the Advanced Networks Colloqium at the University of Maryland, College Park.
- Earlier versions of this talk were presented at the at NNMC, Madison, WI, in May 2011; IMBS Workshop at UC Irvine, in Jan. 2011; and at the UCLA Colloqium in April 2011.
**Workshop on Exponential Random Graphs and the ERGM in statnet, by Carter Butts, Sept. 21st, 2011.**- Here is a poster presented at the CalIT2 anniversary on our ongoing work on Mobile Online Social Networks.

## Publications

**Construction of Simple Graphs with a Target Joint Degree Matrix and Beyond**

[ pdf, bibtex , slides, software ]

Minas Gjoka, Balint Tillman, Athina Markopoulou

To appear in IEEE INFOCOM '15.

**Efficient Construction of 2K+ Graphs**

[ pdf, bibtex , slides, abstract, software ]

Minas Gjoka, Balint Tillman, Athina Markopoulou, Rasmus Pagh

NetSci 2014 Posters.

**Estimating Clique Composition and Size Distributions from Sampled Network Data**

[ pdf, bibtex , slides, software ]

Minas Gjoka, Emily Smith, Carter T. Butts

To appear in IEEE INFOCOM, NetSciCom Workshop '14 and on arXiv:cs.SI:1308.3297, Aug. 2013.

**2.5K-Graphs: from Sampling to Generation**

[ pdf, bibtex , slides, software ]

Minas Gjoka, Maciej Kurant, Athina Markopoulou

To appear in IEEE INFOCOM '13 and on arXiv:cs.SI:1208.3667.

**Coarse-Grained Topology Estimation via Graph Sampling**

[ pdf, bibtex , slides, report, dataset, software ]

Maciej Kurant,Minas Gjoka, Yan Wang, Zack W. Almquist,Carter T. Butts, , Athina Markopoulou

SIGCOMM Workshop on Online Social Networks (WOSN) '12, and available at arXiv:cs.SI:1105.5488

**Multigraph Sampling of Online Social Networks**

[ pdf , bibtex , slides, dataset ]

Minas Gjoka, Carter T. Butts, Maciej Kurant, Athina Markopoulou

JSAC special issue on Measurement of Internet Topologies, Vol.29, No. 9, Oct. 2011.

**Practical Recommendations on Crawling Online Social Networks**

[ pdf, bibtex, slides, dataset1, dataset2, software ]

Minas Gjoka, Maciej Kurant, Carter T. Butts, Athina Markopoulou

JSAC special issue on Measurement of Internet Topologies, Vol.29, No. 9, Oct. 2011.

**Towards Unbiased BFS Sampling**

[ pdf ]

Maciej Kurant, Athina Markopoulou, Patrick Thiran

JSAC special issue on Measurement of Internet Topologies, Vol.29, No. 9, Oct. 2011.

**Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks**

[ pdf, bibtex , slides, dataset ]

Maciej Kurant, Minas Gjoka, Carter T. Butts, Athina Markopoulou,

ACM SIGMETRICS 2011.

**On the Bias of Breadth First Search (BFS) and of Other Graph Sampling Techniques**

[ pdf, bibtex, slides, report ]

Maciej Kurant, Athina Markopoulou, Patrick Thiran

International Teletraffic Congress (ITC 22) '10, Amsterdam, Sept 2010

**Walking in Facebook: A Case Study of Unbiased Sampling of OSN**

[ pdf, bibtex, slides, dataset1, dataset2, software ]

Minas Gjoka, Maciej Kurant, Carter T. Butts, Athina Markopoulou

IEEE INFOCOM '10, San Diego, March 2010

**Poking Facebook: Characterization of OSN Applications**

[ pdf, bibtex, slides, dataset, software ]

Minas Gjoka,Michael Sirivianos, Athina Markopoulou, Xiaowei Yang

ACM SIGCOMM Workshop on Online Social Networks (WOSN) '08,Seattle, August 2008

## Software

- Clique Estimation: two Python scripts that demonstrate the estimators described in our paper Estimating Clique Composition and Size Distributions from Sampled Network Data. The first script implements two types of unbiased estimators of clique size distributions, one of which exploits labeling of sampled nodes neighbors and one of which does not require this information. Additionally, it supports the compositions of cliques by node attributes (only supports binary node attributes, such as gender). The second script demonstrates how to prepare the data for input to the first script. More specifically, it receives as input a known graph, sampling parameters (sampling method, sampling size, replacement type), and clique distribution preferences (labeling, attributes). It then appropriately samples egonets from the given graph and calculates the maximal clique distribution for each sampled egonet.

- 2.5K-Graphs: two software packages to demonstrate the algorithms and estimators described in our paper 2.5K-Graphs: from Sampling to Generation. The first package receives as input a random walk graph sample and estimates the degree-dependent clustering coefficient distribution and network average clustering coefficient. The second package implements all the algorithms and estimators within classes “Estimation” and “Generation”. It receives as an input a fully known graph and then simulates a random walk graph sample of given size. The class “Estimation” provides functions that estimate the degree-dependent clustering coefficient (CCK) and joint degree distribution (JDD). The class “Generation” provides functions that generate a 2.5K graph given specific CCK and JDD distributions.

- Geosocialmap: a web-based tool that visualizes geo-social data. More information can be found in our paper Coarse-Grained Topology Estimation via Graph Sampling and the M.Sc. thesis GeoSocialMap Visualization

- Graph sampling: A set of functions to sample nodes of a graph with replacements (Simple Random Walk, Weighted Random Walk, Metropolis Hastings Random Walk, Uniform Independent Sampling, Weighted Independent Sampling) and corresponding estimators.

- Facebook Applications: protype crawlers of the Facebook user profiles in 2008 and user coverage simulator. More information can be found in our paper Poking Facebook: Characterization of OSN Applications

## Datasets

The conditions of use for the released datasets are:

- the corresponding paper is cited
- no further re-distribution without our permission.

### Last.fm Multigraph

We release the following single and multigraph crawls of Last.fm, collected during July 2010 and presented in our IEEE JSAC journal paper.

- Friends - Contains 5 random walks of 50K users each on the friendship relation (5x50K=250K users)
- Events - Contains 5 random walks of 50K users each on the events relation (5x50K=250K users)
- Groups - Contains 5 random walks of 50K users each on the groups relation (5x50K=250K users)
- Neighbors - Contains 5 random walks of 50K users each on the symmetrized neighbors relation (5x50K=250K users)
- Friends_Events - Contains 5 multigraph random walks of 50K users each on the friendship and events relations (5x50K=250K users)
- Friends_Events_Groups -Contains 5 multigraph random walks of 50K users each on the friendship, events, and groups relations (5x50K=250K users)
- Friends_Events_Groups_Neighbors -Contains 5 multigraph random walks of 50K users each on the friendship, events, groups, and neighbors relations (5x50K=250K users)
- Uniform - Contains 10 simple random samples with replacement of 50K users each (10x50K=500K users). The samples are obtained by probing the userID space with rejection sampling.

Please refer to Table 1 and section IV-A of our paper for more information about the crawls.

##### Random Walk Crawls

For each random walk crawl we release the following files:

- Files “randomwalk-0 randomwalk-1 randomwalk-2 randomwalk-3 randomwalk-4” contain the order of user samples in each random walk, and the multigraph type and id visited. In the below example, the order is user0, user1, user2, user3, user1. The second column denotes the multigraph type with possible values being (i) friend (ii) event (iii) group (iv) neighbor. The third column is used only during multigraph types event and group, and denotes the specific multigraph id chosen by the random walk. For the specifics of the multigraph sampling algorithm see Algorithm 1 in our paper.

user_0|group|group_1

user_1|friend|

user_2|group|group_2

user_3|event|event_1

user_1|neighbor|

..

In our example, during the random walk (i) user_0 chose user_1 randomly from group 'group_1' (ii) user_1 chose user_2 randomly from her friends (iii) user_2 chose user_3 randomly from group 'group_2' (iv) user_3 chose user_1 randomly from event 'event_1' (v) user_1 chose the next (unseen here) visited user randomly from her symmetrized neighbors.

- File “uname_friends” contains the friends of each sampled user. Friends ids are the same as user ids.

user_1|friend_1|friend_2|..|friend_n

..

- File “uname_events” contains the events of each sampled user. For each user the events are split to future and past using as a reference time point the crawling period in mid-July 2010. In each row, the second and third fields determine the number of future and past events respectively. The row has a number of events equal to (n_future_events + n_past_events), of which the first 'n_future_events' are future events and the rest of 'n_past_events' are past events.

user_1|n_future_events|n_past_events|event_1|event_2|..|event_m

..

- File “uname_groups” contains the groups of each sampled user.

user_1|group_1|group_2|..|group_q

..

- File “uname_symNeighbors” contains the symmetrized neighbors of each sampled user. Neighbor ids are the same as user ids. This file is only present in the “neighbors” and “friends_events_groups_neighbors” crawls.

user_1|neighbor_1|neighbor_2|..|neighbor_r

..

- File “uname_info” contains user information for each sampled user.

user_1|is_subscriber|has_profilepicture|number of playcounts|number of playlists|registration date

..

- File “uname_infoExtra” contains additional user information for each sampled user. This file is NOT included in this release by default. It will be distributed only for well specified research projects.

user_1|country|age|gender

..

- File “events_size” contains the number of users for each event.

event_1|size_1

..

- File “groups_size” contains the number of users for each group

group_1|size_1

group_2|size_2

..

##### Uniform Crawl

For the uniform crawl we release the following files:

- Files “uniform_selection-0 .. uniform_selection-9” contain the user samples included in each of the 10 uniform samples with replacement.

user_1|uniform|

user_2|uniform|

..

- Files uname_friends, uname_events, uname_groups, uname_info, uname_infoExtra, events_size, and groups_size are also included and contain information for each sampled user. See above for a description of each file.

##### Summary

In summary the total number of sampled and observed items over all crawls are as follows.

Unique users sampled : 1,251,047

Unique users sampled and observed: 3,196,820

Unique events : 892,357

Unique groups : 102,706

The user_ids (same as friend_ids, neighbor_ids), event_ids and group_ids are all anonymized and are consistent between all files and all crawls.

As a condition of usage, please cite the Last.fm dataset with the following bibtex entry :

@article{mgjoka_multigraph_jsac, title = {Multigraph Sampling of Online Social Networks}, author = {Minas Gjoka and Carter T. Butts and Maciej Kurant and Athina Markopoulou}, journal = {IEEE JSAC on Measurement of Internet Topologies}, volume = {29}, number = {9}, year = {2011} }

### Facebook Geosocialmap

In our Coarse-Grained Topology Estimation paper we developed estimators that take as input a probability sample of nodes from an *original graph* and produce a *category-to-category graph*. In the *original graph* each node/user has declared a category i.e. a node/user can declare that she belongs to the “UC Irvine” Facebook network. In the *category-to-category graph*, each node corresponds to a category i.e. two nodes can be “UC Irvine” and “UC San Diego”. Additionally, each edge in the *category-to-category* graph reflects the strength between two category members in the original graph i.e. the weighted edge between “UC Irvine” and “UC San Diego” is interpreted as the probability that a random user from “UC Irvine” is a friend of a random user from “UC San Diego”.

As a practical illustration of our approach, we applied our methodology to our previously collected datasets Facebook Social Graph and Facebook Weighted Random Walks. We estimated several Facebook category graphs and visualized them at Geosocialmap. Here we make available (i) the mapping of anonymized networkIDs to Facebook network names in our released Facebook Social Graph dataset (ii) all the estimated category-to-category graphs.

##### University Category Graph

This category graph has been estimated from the Facebook Weighted Random Walks dataset. Its categories are the top 133 US national universities according to the “US News World Report ’09”.

**Nodes**

Category nodes are contained in the file “univ_nodes_2010.csv”. Each row in this file describes a category node and related category features. The structure of each row is as follows:

<Node ID> | <Node Name> | <Longitude> |<Latitude> | <FB Network Name> | <US State> | <Tier> | <Rank> | <Score> | <Type> | <Year Founded> | <Religion> | <Calendar> | <# Students> | <Setting> | <Acceptance Rate> | <Tuition Cost>

An example of a category node is the following:

16777277|University of California–Irvine|-117.8426417|33.64535|UC Irvine|CA|1|44|58|Public|1965|None|quarter|21696|suburban|0.556|7556

**Edges**

Category edges are contained in the file “univ_edges_2010.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID>| <Edge Weight>

##### Country Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are world countries that Facebook users could join as a regional network in 2009.

**Nodes**

Category nodes are contained in the file “country_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name> | <Longitude> |<Latitude>

**Edges**

Category edges are contained in the file “country_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID>| <Edge Weight>

##### North America Counties Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are regions of the United States and Canada that Facebook users could join in 2009.

**Nodes**

Category nodes are contained in the file “northamerica_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name> | <Longitude> |<Latitude> | <Country>

**Edges**

Category edges are contained in the file “northamerica_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID>| <Edge Weight>

##### UK Cities Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are regions of the United Kingdom that Facebook users could join in 2009.

**Nodes**

Category nodes are contained in the file “ukcities_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name>

**Edges**

Category edges are contained in the file “ukcities_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID>| <Edge Weight>

##### Mapping of NetworkIDs to Network names

The file “net2net_mapping_2009” contains all regional/school/workplace Facebook networks discovered during the collection of the Facebook Social Graph dataset. The structure is:

<Network ID> # <Network Name>

One can use the mapping of networksIDs in combination with the Facebook Social Graph to estimate the category-to-category graphs. One can use the estimated category-to-category graphs to create models and test hypotheses on how category features (rank and type of a university, language and religion of a country, geographical distance) affect the inter-category interaction rates.

You can request the Facebook Geosocialmap dataset here.

As a condition of usage, please cite the Facebook Geosocialmap dataset with the following bibtex entry :

@inproceedings{kurant11_coarsetopology, title= {{Coarse-Grained Topology Estimation via Graph Sampling}}, author= {Maciej Kurant and Minas Gjoka and Yan Wang and Zack W. Almquist and Carter T. Butts and Athina Markopoulou}, booktitle = {Proceedings of ACM SIGCOMM Workshop on Online Social Networks (WOSN) '12}, address = {Helsinki, Finland}, month = {August}, year = {2012} }

### Facebook Social Graph - MHRW & UNI

We release the following datasets, collected in April of 2009 through data scraping from Facebook :

- MHRW - A sample of 957K unique users obtained Facebook-wide by 28 independent Metropolis-Hastings random walks
- UNI - A sample of 984K unique users that represents the “ground truth” i.e., a truly uniform sample of Facebook userIDs, selected by a rejection sampling procedure from the system's 32-bit ID space.

For each dataset, we release two files. The first file contains for each sampled userID,the number of times the user was sampled and the userIDs of his/her friends.

** <uid> <#times sampled> <friend_uid_1> <friend_uid_2> .. <friend_uid_j> **

The second file contains additional node properties for each sampled user. For each sampled userID we have the number of times sampled, the total number of friends, the privacy settings and network membership.

**<uid> <#times sampled> <#totalfriends> <privacy settings> <networkID(s)>**

The privacy settings consist of four basic binary privacy attributes: 1) Add as friend 2) Photo Thumbnail 3) View Friends 4) Send message. We refer to the resulting 4-bit number as the “privacy settings of a user” and encode it in the released dataset as a decimal number i.e., 1111 is “15”, 1000 is “8”, 0001 is “1” . The list of networkIDs contains the regional/school/workplace FB networks that the user is a member of. For more information about the privacy settings and network membership see journal paper.

UserIDs are consistent across files. UserIDs and networkIDs are anonymized. The mapping of anonymized networkIDs to Facebook network names is available as part of our Facebook Geosocialmap dataset release.

**For assumptions, goal and applications of the collected datasets, see Sections III-A and III-B in the journal version of this work**

##### MHRW

- mhrw-socialgraph-anonymized (1.4GB)

##### UNI

- uni-socialgraph-anonymized (794MB)

As a condition of usage, please cite the Facebook Social graph data sets using the following bibtex entry:

@inproceedings{gjoka10_walkingfb, author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou}, title= { {Walking in Facebook: A Case Study of Unbiased Sampling of OSNs} }, booktitle = {Proceedings of IEEE INFOCOM '10}, address = {San Diego, CA}, month = {March}, year = {2010} }

### Facebook Social Graph - Breadth First Search

We extend the Facebook Social Graph dataset by releasing two more social graph samples collected in April of 2009 through data scraping from Facebook :

- BFS-28 - A sample of 2,198K unique users collected by 28 independent Breadth-First-Search Traversals of length 81K.
- BFS-1 - A sample of 1,189K unique users collected by 1 Breadth-First-Search traversal.

The BFS-28 and BFS-1 social graph samples are released in an adjacency list format. Each line contains a sampled userID and the userIDs of his/her friends.

** <uid> <friend_uid_1> <friend_uid_2> .. <friend_uid_j> **

UserIDs are anonymized and are not consistent between the BFS-28 and BFS-1 social graph samples.

##### BFS-28

- bfs-28-socialgraph-release.zip (2.3GB)

##### BFS-1

- bfs-1-socialgraph-release.zip (1.1GB)

As a condition of usage, please cite the “Facebook Social Graph - Breadth First Search” dataset using any of the following bibtex entries:

@inproceedings{gjoka10_walkingfb, author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou}, title= { {Walking in Facebook: A Case Study of Unbiased Sampling of OSNs} }, booktitle = {Proceedings of IEEE INFOCOM '10}, address = {San Diego, CA}, month = {March}, year = {2010} }

@article{mgjoka_recommendations_jsac, title= {{Practical Recommendations on Crawling Online Social Networks}}, author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou}, journal = {IEEE J. Sel. Areas Commun. on Measurement of Internet Topologies}, year = {2011} }

### Facebook Applications

##### Dataset I

We release dataset I, which contains the number of active users and total application installations daily for every Facebook application between 08/29/2007 and 02/14/2008 . The data was retrieved from the Adonomics website, which had been collecting aggregate applications statistics, Daily Active Users (DAU) and Application Installs, by scraping the Facebook application directory.

Dataset I comprises of 16,812 files (one file for each application present in the Facebook application directory until 02/14/2008). The filename of each file contains the respective <app_id>. The structure of each file is

””, <app_name>, ””

“Time”,”Installs”,”DAU”

<day_1>, <#installs_1>, <#DAU_1>

<day_2>, <#installs_2>, <#DAU_2>

<day_3>, <#installs_3>, <#DAU_3>

…

##### Dataset II

We release dataset II, collected in February 2008, which contains a list of installed applications for 297K Facebook users

**<uid> <app_id_1> <app_id_2> .. <app_id_j>**

UserIDs are anonymized. More information about the collection process and the representativeness of this dataset is contained in the paper.

Dataset II - Facebook Application Installations per User

As a condition of usage, please cite the Facebook Applications data sets using the following bibtex entry:

@inproceedings{mgjoka_wosn08, author= {Minas Gjoka and Michael Sirivianos and Athina Markopoulou and Xiaowei Yang}, title= { {Poking Facebook: Characterization of OSN Applications} }, booktitle = {Proceedings of ACM SIGCOMM Workshop on Online Social Networks (WOSN) '08}, address = {Seattle, WA}, month = {August}, year = {2008} }

### Facebook Weighted Random Walks

We release the following datasets, collected in October of 2010 through data scraping from Facebook :

- RW - A sample of 1M unique users obtained Facebook-wide by 25 independent simple Random Walks
- Hybrid - A sample of 1M unique users obtained Facebook-wide by 25 independent Stratified Weighted Random Walks (S-WRW) with hybrid conflict resolution. The measurement objective in the Hybrid sample are Facebook users with college network membership.

For each dataset, we release two files. The first file contains for each sampled userID, (i) the weight of the sampled user, (ii) the number of vfriends, the visitable friends during the social graph exploration (or friends for which “View Friends”=1), (iii) the total number of friends , and (iv) list of networkIDs of which the user is a member of. The symbol ”#” is used as a separator.

** <uid> <weight> <#vfriends> <#totalfriends> <networkID_1#networkID_2#> **

The second file contains mappings from networkIDs to network names and network types (college,work,school) .

**<networkID_1> <network_name_1> <network_type_1> **

UserIDs are anonymized and are consistent across files.

##### RW

##### Hybrid

- hybrid_release-networkmapping (1.4MB)

As a condition of usage, please cite the Facebook Weighted Random Walks data sets using the following bibtex entry:

@inproceedings{kurant11_magnifying, title= {{Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks}}, author= {Maciej Kurant and Minas Gjoka and Carter T. Butts and Athina Markopoulou}, booktitle = {Proceedings of ACM SIGMETRICS '11}, address = {San Jose, CA}, month = {June}, year = {2011} }