Datasets

The conditions of use for the released datasets are:

  • the corresponding paper is cited
  • no further redistribution without our permission.

Facebook Egonet Sample

In our previous work, we collected a representative sample of approximately 1 million unique Facebook users by crawling the Facebook social graph with a Metropolis-Hastings Random Walk (MHRW). We then collected the egonets of 36,628 unique nodes selected at random from the MHRW sample; this sub-sampling eliminates the correlation between consecutive nodes of the same crawl. The representativeness of this data has been validated against true uniform samples of Facebook taken during the same period. The sample closely approximates a uniform, with-replacement sample of egonets from the publicly visible Facebook graph.

For our recent paper, Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data, we complemented the Facebook Egonet Sample with a gender attribute for each ego and all of its neighbors. 90% of the gender attributes are self-declared, i.e., retrieved directly from Facebook (from the URL http://graph.facebook.com/userid), whereas the remaining 10% are inferred using machine learning (see our paper for details). In the paper we (1) computed exactly the maximal clique size distribution and the gender composition of cliques for each egonet in our Facebook sample, and (2) used our techniques to estimate the maximal clique size distribution and gender composition of cliques over all of Facebook. A maximal clique is a clique that cannot be extended by including one more adjacent vertex.

Here we release:

  1. the 36,628 egonet samples
  2. the gender for all egos and neighbors (total of 5,628,206 non-unique users)
  3. maximal clique computation for all egonets
  4. maximal clique computation by gender for all egonets.

This release consists of the following 3 files:

  • the compressed file egonet_data.tar.bz2 (246 Mbytes) contains 36,628 files that describe the structure of each egonet and another 36,628 files that map every node in each egonet to a gender (male or female). The file that describes the structure of an egonet is in edgelist format and has the extension “.edges”. Here is an example from the file “aaaac.edges”, which corresponds to egonet “aaaac”. The first two lines contain the number of nodes (20) and edges (122) in egonet “aaaac”. The remaining 122 lines list all edges, each consisting of two node IDs separated by the character “,”.

20
122
0,3
3,0
0,7
7,0
0,10
0,9
9,0
..
The file that maps nodeIDs to gender has as many lines as the number of nodes and has the extension “.gender”; e.g., for egonet “aaaac” you will find the file “aaaac.gender” with 20 lines. In this file, line 0 holds the gender of the node with ID 0, line 1 the gender of the node with ID 1, and so on. The gender attribute takes two values: 0 corresponds to male and 1 to female.
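For convenience, the two formats above can be parsed with a few lines of Python. This is a minimal sketch; the toy inputs at the bottom are illustrative, not real egonet data:

```python
def parse_edges(lines):
    """Parse a ".edges" file: node count, edge count, then "a,b" lines.

    Edges are listed in both directions, so we canonicalize each pair.
    """
    n_nodes = int(lines[0])
    n_edges = int(lines[1])          # number of directed edge lines
    edges = set()
    for line in lines[2:2 + n_edges]:
        a, b = (int(x) for x in line.strip().split(","))
        edges.add((min(a, b), max(a, b)))
    return n_nodes, edges

def parse_gender(lines):
    """Parse a ".gender" file: line i holds node i's gender (0=male, 1=female)."""
    return [int(line) for line in lines]

# Toy 3-node triangle (6 directed lines = 3 undirected edges):
n, edges = parse_edges(["3", "6", "0,1", "1,0", "0,2", "2,0", "1,2", "2,1"])
assert n == 3 and edges == {(0, 1), (0, 2), (1, 2)}
assert parse_gender(["0", "1", "0"]) == [0, 1, 0]
```

In the real dataset you would pass, e.g., open("aaaac.edges").read().splitlines() as the input lines.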

  • the Python pickle anonid_cliqcomputations.pickle (47 Mbytes) contains all maximal clique computations. To load this file in Python, one needs a script with the following lines:

import pickle

with open("anonid_cliqcomputations.pickle", "rb") as f:
    myhashtable = pickle.load(f)

The above Python commands load the data into a dictionary (hash table) named “myhashtable”. This dictionary has the following four keys:
(i) 'maxcliq' which contains the exact maximal clique computation for each egonet; entry i of the value list is the number of maximal cliques of size i. For example, the entry myhashtable['maxcliq']['aaaac'], which corresponds to egonet “aaaac”, has as value the list [0, 0L, 1L, 3L, 2L, 6L, 0L, 1L]. This means that egonet “aaaac” has 1 maximal clique of size 2, 3 maximal cliques of size 3, 2 of size 4, 6 of size 5, 0 of size 6, and 1 of size 7.
(ii) 'maxcliq_gender' which contains the exact maximal clique computation by gender for each egonet. For example, the entry myhashtable['maxcliq_gender']['aaaac'] has as value the two-level hash table {1L: {1: 1L, 2: 3L, 4: 2L}, 2L: {2: 1L, 3: 3L}, 3L: {1: 1L, 2: 1L, 4: 1L}}. The first-level key is the number of males in a maximal clique and the second-level key is the number of females. Therefore myhashtable['maxcliq_gender']['aaaac'][3][4] indicates the number of maximal cliques with 3 males and 4 females, which in this case is 1.
(iii) 'degree' which contains the ego's full number of neighbors on Facebook. For example, the entry myhashtable['degree']['aaaac'] has value 19, which is expected in this case since the egonet consists of 20 nodes (19 neighbors + the ego). In many cases, however, the number of neighbors in the egonet will be lower than this degree, because Facebook users who did not publicly share their friend lists were not included in the egonet.
(iv) 'weight' which contains the number of times this ego was resampled in the Metropolis-Hastings random walk. For example, the entry myhashtable['weight']['aaaac'] has value 5, which means that this ego was resampled 5 times. This is important if one wants to compute unbiased estimates of network properties.
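One plausible use of the 'weight' key is a weighted mean over egos, sketched below under the assumption that each with-replacement resample should count once. The tiny dictionary is a stand-in for the real myhashtable, with illustrative values:

```python
# Sketch: a weighted mean over egos that counts each ego once per
# resample. The dictionary below is a toy stand-in for the real
# myhashtable loaded from the pickle; values are illustrative.
myhashtable = {
    "degree": {"aaaac": 19, "aaaad": 7},
    "weight": {"aaaac": 5, "aaaad": 1},
}

def weighted_mean(table, key):
    num = sum(table[key][ego] * table["weight"][ego] for ego in table[key])
    den = sum(table["weight"][ego] for ego in table[key])
    return num / den

# (19*5 + 7*1) / (5 + 1) = 102 / 6
assert weighted_mean(myhashtable, "degree") == 17.0
```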

  • the JSON file anonid_cliqcomputations.json (40 Mbytes) contains the same information as the Python pickle anonid_cliqcomputations.pickle but in a text format, for users who would like to use a language other than Python. The fields inside this JSON file are identical to those of the pickle and are explained above.
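Loading the JSON variant is a one-liner in Python, sketched below on a small inline string since the real file is 40 Mbytes:

```python
import json

# The JSON fields mirror the pickle's four keys; only two are shown here.
raw = '{"degree": {"aaaac": 19}, "weight": {"aaaac": 5}}'
myhashtable = json.loads(raw)
assert myhashtable["degree"]["aaaac"] == 19

# For the real file:
# with open("anonid_cliqcomputations.json") as f:
#     myhashtable = json.load(f)
```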

You can request the Facebook Egonet sample here.

As a condition of usage, please cite the Facebook Egonet sample with the following BibTeX entry:

   @article{mgjoka15_subgraphsampling,
   title     = {Estimating Subgraph Frequencies with or without Attributes from Egocentrically Sampled Data},
   author    = {Minas Gjoka and Emily Smith and Carter T. Butts},     
   journal   = {arXiv CoRR},
   volume    = {abs/1510.08119},
   year      = {2015},
   ee        = {http://arxiv.org/abs/1510.08119},     
   }

Last.fm Multigraph

We release the following single and multigraph crawls of Last.fm, collected during July 2010 and presented in our IEEE JSAC journal paper.

  1. Friends - Contains 5 random walks of 50K users each on the friendship relation (5x50K=250K users)
  2. Events - Contains 5 random walks of 50K users each on the events relation (5x50K=250K users)
  3. Groups - Contains 5 random walks of 50K users each on the groups relation (5x50K=250K users)
  4. Neighbors - Contains 5 random walks of 50K users each on the symmetrized neighbors relation (5x50K=250K users)
  5. Friends_Events - Contains 5 multigraph random walks of 50K users each on the friendship and events relations (5x50K=250K users)
  6. Friends_Events_Groups - Contains 5 multigraph random walks of 50K users each on the friendship, events, and groups relations (5x50K=250K users)
  7. Friends_Events_Groups_Neighbors - Contains 5 multigraph random walks of 50K users each on the friendship, events, groups, and neighbors relations (5x50K=250K users)
  8. Uniform - Contains 10 simple random samples with replacement of 50K users each (10x50K=500K users). The samples are obtained by probing the userID space with rejection sampling.

Please refer to Table 1 and section IV-A of our paper for more information about the crawls.

Random Walk Crawls

For each random walk crawl we release the following files:

  • Files “randomwalk-0 randomwalk-1 randomwalk-2 randomwalk-3 randomwalk-4” contain the order of user samples in each random walk, together with the multigraph type and id visited. In the example below, the order is user_0, user_1, user_2, user_3, user_1. The second column denotes the multigraph type, with possible values (i) friend (ii) event (iii) group (iv) neighbor. The third column is used only for the multigraph types event and group, and denotes the specific multigraph id chosen by the random walk. For the specifics of the multigraph sampling algorithm, see Algorithm 1 in our paper.

user_0|group|group_1
user_1|friend|
user_2|group|group_2
user_3|event|event_1
user_1|neighbor|
..
In our example, during the random walk (i) user_0 chose user_1 randomly from group 'group_1' (ii) user_1 chose user_2 randomly from her friends (iii) user_2 chose user_3 randomly from group 'group_2' (iv) user_3 chose user_1 randomly from event 'event_1' (v) user_1 chose the next (unseen here) visited user randomly from her symmetrized neighbors.
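A minimal Python sketch that turns such lines into (user, type, id) steps; the input lines mirror the example above, and an empty third field becomes None:

```python
# Parse random-walk lines of the form "user|type|id" into step tuples.
# The third field is empty for friend/neighbor steps.
def parse_walk(lines):
    steps = []
    for line in lines:
        user, relation, relation_id = line.strip().split("|")
        steps.append((user, relation, relation_id or None))
    return steps

walk = parse_walk(["user_0|group|group_1", "user_1|friend|"])
assert walk == [("user_0", "group", "group_1"), ("user_1", "friend", None)]
```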

  • File “uname_friends” contains the friends of each sampled user. Friends ids are the same as user ids.

user_1|friend_1|friend_2|..|friend_n
..

  • File “uname_events” contains the events of each sampled user. For each user, the events are split into future and past, using the crawling period in mid-July 2010 as the reference time point. In each row, the second and third fields give the number of future and past events respectively. The row lists (n_future_events + n_past_events) events, of which the first 'n_future_events' are future events and the remaining 'n_past_events' are past events.

user_1|n_future_events|n_past_events|event_1|event_2|..|event_m
..
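The future/past split can be recovered as sketched below (a minimal Python example; the event ids are illustrative):

```python
# Split a uname_events row into future and past event lists using the
# two count fields that follow the user id.
def parse_events(line):
    fields = line.strip().split("|")
    user = fields[0]
    n_future, n_past = int(fields[1]), int(fields[2])
    events = fields[3:]
    assert len(events) == n_future + n_past
    return user, events[:n_future], events[n_future:]

user, future, past = parse_events("user_1|2|1|event_9|event_4|event_7")
assert future == ["event_9", "event_4"] and past == ["event_7"]
```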

  • File “uname_groups” contains the groups of each sampled user.

user_1|group_1|group_2|..|group_q
..

  • File “uname_symNeighbors” contains the symmetrized neighbors of each sampled user. Neighbor ids are the same as user ids. This file is only present in the “neighbors” and “friends_events_groups_neighbors” crawls.

user_1|neighbor_1|neighbor_2|..|neighbor_r
..

  • File “uname_info” contains user information for each sampled user.

user_1|is_subscriber|has_profilepicture|number of playcounts|number of playlists|registration date
..

  • File “uname_infoExtra” contains additional user information for each sampled user. This file is NOT included in this release by default. It will be distributed only for well specified research projects.

user_1|country|age|gender
..

  • File “events_size” contains the number of users for each event.

event_1|size_1
..

  • File “groups_size” contains the number of users for each group.

group_1|size_1
group_2|size_2
..

Uniform Crawl

For the uniform crawl we release the following files:

  • Files “uniform_selection-0 .. uniform_selection-9” contain the user samples included in each of the 10 uniform samples with replacement.

user_1|uniform|
user_2|uniform|
..

  • Files uname_friends, uname_events, uname_groups, uname_info, uname_infoExtra, events_size, and groups_size are also included and contain information for each sampled user. See above for a description of each file.

Summary

In summary, the total numbers of sampled and observed items over all crawls are as follows.

Unique users sampled: 1,251,047
Unique users sampled and observed: 3,196,820
Unique events: 892,357
Unique groups: 102,706

The user_ids (same as friend_ids, neighbor_ids), event_ids and group_ids are all anonymized and are consistent between all files and all crawls.

As a condition of usage, please cite the Last.fm dataset with the following BibTeX entry:

   @article{mgjoka_multigraph_jsac,
   title     = {Multigraph Sampling of Online Social Networks},
   author    = {Minas Gjoka and  Carter T. Butts and Maciej Kurant and  Athina Markopoulou},
   journal   = {IEEE JSAC on Measurement of Internet Topologies},
   volume    = {29},
   number    = {9},  
   year      = {2011}
   }

Facebook Geosocialmap

In our Coarse-Grained Topology Estimation paper we developed estimators that take as input a probability sample of nodes from an original graph and produce a category-to-category graph. In the original graph each node/user has declared a category, e.g., a user can declare that she belongs to the “UC Irvine” Facebook network. In the category-to-category graph, each node corresponds to a category, e.g., “UC Irvine” and “UC San Diego”. Additionally, each edge in the category-to-category graph reflects the strength of connection between members of the two categories in the original graph, e.g., the weighted edge between “UC Irvine” and “UC San Diego” is interpreted as the probability that a random user from “UC Irvine” is a friend of a random user from “UC San Diego”.

As a practical illustration of our approach, we applied our methodology to our previously collected datasets Facebook Social Graph and Facebook Weighted Random Walks. We estimated several Facebook category graphs and visualized them at Geosocialmap. Here we make available (i) the mapping of anonymized networkIDs to Facebook network names in our released Facebook Social Graph dataset (ii) all the estimated category-to-category graphs.

University Category Graph

This category graph has been estimated from the Facebook Weighted Random Walks dataset. Its categories are the top 133 US national universities according to the “US News & World Report ’09” ranking.

Nodes
Category nodes are contained in the file “univ_nodes_2010.csv”. Each row in this file describes a category node and related category features. The structure of each row is as follows:

<Node ID> | <Node Name> | <Longitude> | <Latitude> | <FB Network Name> | <US State> | <Tier> | <Rank> | <Score> | <Type> | <Year Founded> | <Religion> | <Calendar> | <# Students> | <Setting> | <Acceptance Rate> | <Tuition Cost>

An example of a category node is the following:
16777277|University of California–Irvine|-117.8426417|33.64535|UC Irvine|CA|1|44|58|Public|1965|None|quarter|21696|suburban|0.556|7556
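Since the file is pipe-delimited, it can be read with Python's csv module. The field names below are descriptive labels of our own, following the structure listed above:

```python
import csv

# Read a pipe-delimited node row into a dict keyed by descriptive labels.
FIELDS = ["node_id", "name", "longitude", "latitude", "fb_network",
          "state", "tier", "rank", "score", "type", "year_founded",
          "religion", "calendar", "n_students", "setting",
          "acceptance_rate", "tuition"]

row = ("16777277|University of California-Irvine|-117.8426417|33.64535|"
       "UC Irvine|CA|1|44|58|Public|1965|None|quarter|21696|suburban|"
       "0.556|7556")
record = dict(zip(FIELDS, next(csv.reader([row], delimiter="|"))))
assert record["fb_network"] == "UC Irvine" and int(record["rank"]) == 44
```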

Edges
Category edges are contained in the file “univ_edges_2010.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID> | <Edge Weight>

Country Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are world countries that Facebook users could join as a regional network in 2009.

Nodes
Category nodes are contained in the file “country_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name> | <Longitude> | <Latitude>

Edges
Category edges are contained in the file “country_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID> | <Edge Weight>

North America Counties Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are regions of the United States and Canada that Facebook users could join in 2009.

Nodes
Category nodes are contained in the file “northamerica_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name> | <Longitude> | <Latitude> | <Country>

Edges
Category edges are contained in the file “northamerica_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID> | <Edge Weight>

UK Cities Category Graph

This category graph has been estimated from the Facebook Social Graph dataset. Its categories are regions of the United Kingdom that Facebook users could join in 2009.

Nodes
Category nodes are contained in the file “ukcities_nodes_2009.csv”. Each row in this file describes a category node and has the following structure:

<Node ID> | <Node Name>

Edges
Category edges are contained in the file “ukcities_edges_2009.csv”. Each row in this file contains a category edge and has the following structure:

<Edge ID> | <Node ID> | <Node ID> | <Edge Weight>

Mapping of NetworkIDs to Network names

The file “net2net_mapping_2009” contains all regional/school/workplace Facebook networks discovered during the collection of the Facebook Social Graph dataset. The structure is:

<Network ID> # <Network Name>

One can use the mapping of networkIDs in combination with the Facebook Social Graph to estimate the category-to-category graphs, and use the estimated graphs to create models and test hypotheses on how category features (rank and type of a university, language and religion of a country, geographical distance) affect the inter-category interaction rates.

You can request the Facebook Geosocialmap dataset here.

As a condition of usage, please cite the Facebook Geosocialmap dataset with the following BibTeX entry:

   @inproceedings{kurant11_coarsetopology,
   title= {{Coarse-Grained Topology Estimation via Graph Sampling}},
   author= {Maciej Kurant and Minas Gjoka and Yan Wang and Zack W. Almquist and Carter T. Butts and Athina Markopoulou},
   booktitle = {Proceedings of ACM SIGCOMM Workshop on Online Social Networks (WOSN) '12},
   address = {Helsinki, Finland},
   month = {August},
   year = {2012}
   }

Facebook Social Graph - MHRW & UNI

We release the following datasets, collected in April of 2009 through data scraping from Facebook:

  1. MHRW - A sample of 957K unique users obtained Facebook-wide by 28 independent Metropolis-Hastings random walks
  2. UNI - A sample of 984K unique users that represents the “ground truth” i.e., a truly uniform sample of Facebook userIDs, selected by a rejection sampling procedure from the system's 32-bit ID space.

For each dataset, we release two files. The first file contains, for each sampled userID, the number of times the user was sampled and the userIDs of his/her friends.
<uid> <#times sampled> <friend_uid_1> <friend_uid_2> .. <friend_uid_j>
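A minimal Python sketch for parsing one line of this file, assuming whitespace-separated fields as the template suggests; the ids are illustrative:

```python
# Split an adjacency line into (uid, times_sampled, friend list).
def parse_adjacency(line):
    parts = line.split()
    uid, times_sampled, friends = parts[0], int(parts[1]), parts[2:]
    return uid, times_sampled, friends

uid, times, friends = parse_adjacency("u7 2 u1 u5 u9")
assert uid == "u7" and times == 2 and friends == ["u1", "u5", "u9"]
```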

The second file contains additional node properties for each sampled user. For each sampled userID we have the number of times sampled, the total number of friends, the privacy settings and network membership.
<uid> <#times sampled> <#totalfriends> <privacy settings> <networkID(s)>

The privacy settings consist of four basic binary privacy attributes: 1) Add as friend 2) Photo Thumbnail 3) View Friends 4) Send message. We refer to the resulting 4-bit number as the “privacy settings of a user” and encode it in the released dataset as a decimal number, e.g., 1111 is “15”, 1000 is “8”, 0001 is “1”. The list of networkIDs contains the regional/school/workplace FB networks of which the user is a member. For more information about the privacy settings and network membership, see the journal paper.
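The 4-bit encoding can be decoded back into the four attributes as sketched below; the attribute names are labels of our own, in the order listed above, with the most significant bit being “Add as friend”:

```python
# Decode the decimal privacy value into four boolean attributes.
ATTRS = ["add_as_friend", "photo_thumbnail", "view_friends", "send_message"]

def decode_privacy(value):
    bits = format(value, "04b")        # e.g. 8 -> "1000"
    return {name: bit == "1" for name, bit in zip(ATTRS, bits)}

assert decode_privacy(15) == {name: True for name in ATTRS}
assert decode_privacy(8)["add_as_friend"] and not decode_privacy(8)["view_friends"]
```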

UserIDs are consistent across files. UserIDs and networkIDs are anonymized. The mapping of anonymized networkIDs to Facebook network names is available as part of our Facebook Geosocialmap dataset release.

For the assumptions, goals, and applications of the collected datasets, see Sections III-A and III-B in the journal version of this work.

MHRW
UNI

As a condition of usage, please cite the Facebook Social graph data sets using the following BibTeX entry:

   @inproceedings{gjoka10_walkingfb, 
   author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou}, 
   title= { {Walking in Facebook: A Case Study of Unbiased Sampling of OSNs} }, 
   booktitle = {Proceedings of IEEE INFOCOM '10},
   address = {San Diego, CA}, 
   month = {March}, 
   year = {2010} 
   }

We extend the Facebook Social Graph dataset by releasing two more social graph samples collected in April of 2009 through data scraping from Facebook:

  1. BFS-28 - A sample of 2,198K unique users collected by 28 independent Breadth-First-Search Traversals of length 81K.
  2. BFS-1 - A sample of 1,189K unique users collected by 1 Breadth-First-Search traversal.

The BFS-28 and BFS-1 social graph samples are released in an adjacency list format. Each line contains a sampled userID and the userIDs of his/her friends.
<uid> <friend_uid_1> <friend_uid_2> .. <friend_uid_j>

UserIDs are anonymized and are not consistent between the BFS-28 and BFS-1 social graph samples.

BFS-28
BFS-1

As a condition of usage, please cite the “Facebook Social Graph - Breadth First Search” dataset using any of the following BibTeX entries:

   @inproceedings{gjoka10_walkingfb, 
   author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou}, 
   title= { {Walking in Facebook: A Case Study of Unbiased Sampling of OSNs} }, 
   booktitle = {Proceedings of IEEE INFOCOM '10},
   address = {San Diego, CA}, 
   month = {March}, 
   year = {2010} 
   }
   @article{mgjoka_recommendations_jsac,
   title= {{Practical Recommendations on Crawling Online Social Networks}},
   author= {Minas Gjoka and Maciej Kurant and Carter T. Butts and Athina Markopoulou},
   journal = {IEEE J. Sel. Areas Commun. on Measurement of Internet Topologies},
   year = {2011}
   }

Facebook Applications

Dataset I

We release dataset I, which contains the number of active users and total application installations, daily, for every Facebook application between 08/29/2007 and 02/14/2008. The data was retrieved from the Adonomics website, which had been collecting aggregate application statistics, namely Daily Active Users (DAU) and Application Installs, by scraping the Facebook application directory.

Dataset I comprises 16,812 files (one file for each application present in the Facebook application directory until 02/14/2008). The filename of each file contains the respective <app_id>. The structure of each file is:
“”, <app_name>, “”
“Time”,“Installs”,“DAU”
<day_1>, <#installs_1>, <#DAU_1>
<day_2>, <#installs_2>, <#DAU_2>
<day_3>, <#installs_3>, <#DAU_3>
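A sketch of reading one such file with Python's csv module; the inline sample mimics the structure above with made-up values:

```python
import csv

# The first row carries the application name; rows after the header are
# (day, installs, DAU) triples.
sample = [
    '"","Sample App",""',
    '"Time","Installs","DAU"',
    "2007-08-29,1000,120",
    "2007-08-30,1150,140",
]
rows = list(csv.reader(sample))
app_name = rows[0][1]
series = [(day, int(installs), int(dau)) for day, installs, dau in rows[2:]]
assert app_name == "Sample App"
assert series[0] == ("2007-08-29", 1000, 120)
```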

Dataset I - Facebook Applications Statistics in Aggregate

Dataset II

We release dataset II, collected in February 2008, which contains the list of installed applications for 297K Facebook users:
<uid> <app_id_1> <app_id_2> .. <app_id_j>

UserIDs are anonymized. More information about the collection process and the representativeness of this dataset is contained in the paper.

Dataset II - Facebook Application Installations per User

As a condition of usage, please cite the Facebook Applications data sets using the following BibTeX entry:

   @inproceedings{mgjoka_wosn08,
   author= {Minas Gjoka and Michael Sirivianos and Athina Markopoulou and Xiaowei Yang},
   title= { {Poking Facebook: Characterization of OSN Applications} },
   booktitle = {Proceedings of ACM SIGCOMM Workshop on Online Social Networks (WOSN) '08},
   address = {Seattle, WA},
   month = {August},
   year = {2008}
   }

Facebook Weighted Random Walks

We release the following datasets, collected in October of 2010 through data scraping from Facebook:

  1. RW - A sample of 1M unique users obtained Facebook-wide by 25 independent simple Random Walks
  2. Hybrid - A sample of 1M unique users obtained Facebook-wide by 25 independent Stratified Weighted Random Walks (S-WRW) with hybrid conflict resolution. The measurement objective in the Hybrid sample is Facebook users with college network membership.

For each dataset, we release two files. The first file contains, for each sampled userID, (i) the weight of the sampled user, (ii) the number of vfriends, i.e., the friends visitable during the social graph exploration (friends for which “View Friends”=1), (iii) the total number of friends, and (iv) the list of networkIDs of which the user is a member. The symbol “#” is used as a separator in the networkID list.
<uid> <weight> <#vfriends> <#totalfriends> <networkID_1#networkID_2#>
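A minimal Python sketch for parsing one record of the first file, assuming whitespace-separated fields as in the template; ids and values are illustrative:

```python
# Parse one record; the trailing "#" in the networkID list is dropped.
def parse_record(line):
    uid, weight, n_vfriends, n_totalfriends, nets = line.split()
    networks = [n for n in nets.split("#") if n]
    return uid, float(weight), int(n_vfriends), int(n_totalfriends), networks

rec = parse_record("u42 1.5 80 120 net_1#net_2#")
assert rec == ("u42", 1.5, 80, 120, ["net_1", "net_2"])
```

A robust parser should also handle users with no network memberships, for whom the final field may be absent.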

The second file contains mappings from networkIDs to network names and network types (college, work, school).
<networkID_1> <network_name_1> <network_type_1>

UserIDs are anonymized and are consistent across files.

RW
Hybrid

As a condition of usage, please cite the Facebook Weighted Random Walks data sets using the following BibTeX entry:

   @inproceedings{kurant11_magnifying,
   title= {{Walking on a Graph with a Magnifying Glass: Stratified Sampling via Weighted Random Walks}},
   author= {Maciej Kurant and Minas Gjoka and Carter T. Butts and Athina Markopoulou},
   booktitle = {Proceedings of ACM SIGMETRICS '11},
   address = {San Jose, CA},
   month = {June},
   year = {2011}
   }