The risk of node re-identification in labeled social graphs
Applied Network Science volume 4, Article number: 33 (2019)
Abstract
Real network datasets provide significant benefits for understanding phenomena such as information diffusion or network evolution. Yet the privacy risks raised by sharing real graph datasets, even when stripped of user identity information, are significant. When nodes have associated attributes, the privacy risks increase further. In this paper we quantitatively study the impact of binary node attributes on node privacy by employing machine-learning-based re-identification attacks and exploring the interplay between graph topology and attribute placement. We also analyze the risk to anonymity in epidemic networks subject to different node re-identification attacks. Our experiments show that the population’s diversity on the binary attribute consistently degrades anonymity. More interestingly, we show that similarly diverse populations in the SI epidemic model maintain different levels of anonymity under different infection rates.
Introduction
Real graph datasets are fundamental to understanding a variety of phenomena, such as epidemics, adoption of behavior, crowd management and political uprisings. At the same time, many such datasets capturing computer-mediated social interactions are recorded nowadays by individual researchers or by organizations. However, while the need for real social graphs and the supply of such datasets are well established, the flow of data from data owners to researchers is significantly hampered by serious privacy risks: even when humans’ identities are removed, studies have repeatedly shown that de-anonymization is achievable with a high success rate (Narayanan et al. 2011; Srivatsa and Hicks 2012; Ji et al. 2014; Korula and Lattanzi 2014). Such de-anonymization techniques reconstruct user identities using third-party public data and the graph structure of the naively anonymized social network: specifically, the information about one’s social ties, even without the particularities of the individual nodes, is sufficient to re-identify individuals.
Many anonymization methods have been proposed to mitigate the privacy invasion of individuals from the public release of graph data (Ji et al. 2016). Naive anonymization schemes employ methods to scrub identities of nodes without modifying the graph structure. Structural anonymization methods change the topology of the original graph while attempting to preserve (at least some of) the original graph characteristics (Liu and Terzi 2008; Sala et al. 2011; Liu and Mittal 2016). Often the utility of an anonymized graph depends not only on preserving essential graph properties of the original graph, but also on preserving node attributes, such as labels that identify nodes as cheaters or non-cheaters in online gaming platforms (Blackburn and Iamnitchi 2014).
However, the effects of node attributes on the risks of re-identifications are not yet well understood. While intuitively any extra piece of information can be a danger to privacy, a rigorous understanding of what topological and attribute properties affect the re-identification risks is needed. In cases such as information dissemination, node attributes may be informed by the local graph topology. How does the interplay between topology and node attributes affect node privacy?
Our work assesses the additional vulnerability to re-identification attacks posed by the attributes of a labeled graph. We consider exactly one binary attribute to understand the lower bound of the damage that node attributes inflict. We focus our empirical study on the interplay between topology and labeling as a leverage point for re-identification. While most efforts for re-identification attacks are meant to show the vulnerability or resilience of a particular anonymization technique, this work is different, as it focuses on understanding in which conditions node re-identification is feasible, given the network topology and node attributes. Consequently, whether the network topology is original or anonymized is irrelevant for our study. We apply machine learning techniques that use both topological and attribute information to re-identify nodes based on a common threat model. Our study involves real-world graphs and synthetic graphs in which we control how labels are placed relative to ties to mimic the ubiquitous phenomena of homophily—the tendency to connect with similar people—found in social graphs (McPherson and Cook 2001).
Our empirical results show that the vulnerability to node re-identification depends on the population diversity with respect to the attribute considered (Horawalavithana et al. 2018). Using information about the distribution of labels in a node’s neighborhood provides additional leverage for the re-identification process, even when labels are rudimentary. In this study, we present further evidence of this phenomenon based on the well-studied Susceptible-Infectious (SI) epidemic model. Furthermore, we quantify the relative importance of attribute-related and topological features in graphs of different characteristics.
Related Work
The availability of auxiliary data (such as public records, product reviews, or comments posted online) helps reveal the true identities of anonymized individuals, as proven empirically in large privacy violation incidents (Lemos 2007; Griffith and Jakobsson 2005). Similarly, in the case of graph de-anonymization attacks, information from an auxiliary graph is used to re-identify the nodes in an anonymized graph (Narayanan and Shmatikov 2009). The quality of such an attack is determined by the rate of correct re-identification of the original nodes in the network. In general, de-anonymization attacks harness structural characteristics of nodes that uniquely distinguish them (Ji et al. 2016). Many such attacks can be categorized into seed-based and seed-free, based on the prior seed knowledge available to an attacker (Ji et al. 2016).
In seed-based attacks, known mappings of some nodes in an auxiliary graph aid the re-identification of anonymized nodes (Narayanan et al. 2011; Srivatsa and Hicks 2012; Ji et al. 2014; 2016; Korula and Lattanzi 2014). The effectiveness of such attacks is influenced by the quality of the seeds (Sharad 2016b). The quality of the seeds is defined by topological properties of the seeds’ neighborhoods: for example, seeds with high degree whose neighbors have also been mapped to real identities have been shown to be highly effective in helping the re-identification process of the other nodes.
In seed-free attacks, the problem of deanonymization is usually modeled as a graph matching problem. Several research efforts have proposed statistical models for the re-identification of nodes without relying on seeds, such as the Bayesian model (Pedarsani et al. 2013) or optimization models (Ji et al. 2014; 2016). Many heuristics are used in the propagation process of re-identification, exploiting graph characteristics such as degree (Gulyás et al. 2016), k-hop neighborhood (Yartseva and Grossglauser 2013), linkage-covariance (Aggarwal et al. 2011), eccentricity (Narayanan and Shmatikov 2009), or community (Nilizadeh et al. 2014).
Recently, there have been efforts to incorporate node attribute information into deanonymization attacks. Gong et al. (2014) evaluate the combination of structural and attribute information on link prediction models. Attributes not present may be inferred through prior knowledge and network homophily. Qian et al. (2016) apply link prediction and attribute inference to deanonymization by quantifying the prior background information of an attacker using knowledge graphs. In knowledge graphs, edges represent not only links between nodes but also node-attribute links and relationships among attributes. The deanonymization attack in (Ji et al. 2017) maps node-attribute links between an anonymized graph and its auxiliary. In addition to structural similarity, nodes are matched by attribute difference, defined as the union of the node’s attributes in the anonymized and auxiliary graphs divided by their intersection.
However, the success rate of a de-anonymization process is often reported in the literature as dependent on the chosen heuristic of the attack, which is typically designed with knowledge of the anonymization technique (Sharad and Danezis 2014). Comparing the strengths of different anonymization techniques thus becomes challenging, if not impossible. Recently, Sharad (2016b) proposed a general threat model to measure the quality of a deanonymization attack which is independent of the anonymization scheme. He proposed a machine learning framework to benchmark perturbation-based graph anonymization schemes. This framework explores the hidden invariants and similarities to re-identify nodes in the anonymized graphs (Sharad and Danezis 2013; 2014). Importantly, this framework can be easily tuned to model various types of attacks.
Several researchers propose theoretical frameworks to examine how vulnerable or deanonymizable any (anonymized) graph dataset is, given its structure (Pedarsani and Grossglauser 2011; Ji et al. 2014; Ji et al. 2015; Ji et al. 2016). However, some techniques are based on Erdős–Rényi (ER) models (Pedarsani and Grossglauser 2011), while others make impractical assumptions about the seed knowledge (Ji et al. 2015). Ji et al. (2016) also introduced a configuration model to quantify the de-anonymizability of graph datasets by considering the topological importance of nodes. The same set of authors analyzed the impact of attributes on graph data anonymity (Ji et al. 2017). They show a significant loss of anonymity when more node-attribute relations are shared between anonymized and auxiliary graph data. Specifically, they measure the entropy present in node-attribute mappings available to an attacker. As the entropy decreases, the graph loses node anonymity.
The main aspects distinguishing this study from existing works are as follows: i) In our work, we study the inherent conditions in graphs that provide resistance/vulnerability to a general node re-identification attack based on machine learning techniques. ii) To the best of our knowledge, this is the first work that quantifies the privacy impact of node attributes under an attribute attachment model biased towards homophily. iii) We analyze the interplay between the intrinsic vulnerability of the graph structure and attribute information.
Methodology
Our main objective is to quantitatively estimate the vulnerability to re-identification attacks added by node attributes. In particular, we ask: Given a graph topology, how much better does a node re-identification attack perform when the node attributes are included in the attack compared to when there is no node attribute information available to the attacker?
We are interested in measuring the intrinsic vulnerability of a graph with attributes on nodes, in the absence of any particular anonymization technique on topology or node attributes. The intuition is that particular graphs are inherently more private: for example, in a regular graph, nodes are structurally indistinguishable. Adding attributes to nodes, however, may contribute extra information that could make the re-identification attack more successful. Consider another example: in a highly disassortative network (such as a sexual relationships network), knowing the attribute values (i.e., gender) of a few nodes will quickly lead to correctly inferring the attribute values of the majority of nodes, thus possibly contributing to the re-identification of more nodes. Thus, we also ask the following question in this study: How does the distribution of node attributes affect the intrinsic vulnerability to a re-identification attack of a labeled graph topology?
To answer these questions, we developed a machine learning-based re-identification attack inspired by the one presented in (Sharad 2016b). We use the same threat model (“The Threat Model” section), which aims at finding a bijective mapping between nodes in two different graphs. We mount a machine-learning based attack (“Machine Learning Attack” section), in which the algorithm learns the correct mapping between some pairs of nodes from the two graphs and estimates the mapping of the rest of the dataset. As input data, we use both real and synthetic datasets (as presented in the “Datasets” section).
The Threat Model
The threat model we consider is the classical threat model in this context (Pedarsani and Grossglauser 2011): the attacker aims to match nodes from two networks whose edge sets are correlated. We assume each node is associated with a binary-valued attribute that is publicly available. Common examples of such attributes are gender, professional level (i.e., junior or senior), or education level (i.e., higher education or not).
For clarity, consider the following example: an attacker has access to two networks of individuals in an organization that represent the communication patterns (e.g., email) and friendship information available from an online social network. Individuals in the communication network are described by professional seniority (e.g., junior or senior), while individuals in the friendship network are described by gender. These graphs are structurally overlapping, in that some individuals are present in both graphs, even if their identities have been removed. The attacker’s task is to find a bijective (i.e., one-to-one) mapping between the two subsets of nodes in the two graphs that correspond to the individuals present in both networks.
Machine Learning Attack
We assume that the adversary has a sanitized graph Gsan that could be associated with an auxiliary graph Gaux for the re-identification attack (as depicted in Fig. 1). As in the scenario discussed above, Gsan could be the communication network, while Gaux is the friendship network of a set of individuals in an organization.
In order to model this scenario using real data, we split a real dataset graph G=(V,E) into two subgraphs G1=(V1,E1) and G2=(V2,E2), such that V1⊂V, V2⊂V and V1∩V2=Vα, where Vα≠∅. The fraction of the overlap α is measured by the Jaccard coefficient of the two subsets: \(\alpha =\frac {|V_{1} \cap V_{2}|}{|V_{1} \cup V_{2}|}\). In the shared subgraph induced by the nodes in Vα, nodes preserve their edges with nodes from Vα but might have different edges to nodes that are part of V1−Vα or part of V2−Vα. Each node v∈V1∪V2 maintains its original attribute value.
In an optimistic scenario, an attacker has access to a part of the original graph (e.g., G1) as auxiliary data and to an unperturbed subgraph (e.g., G2) as the sanitized data whose nodes the attacker wants to re-identify. We use G1 and G2 as baseline graphs to measure the impact of attributes on de-anonymizability of network data. It is also possible to split G1 and G2 recursively into multiple overlapping graphs, maintaining the same values of overlap parameters as above. This allows us to assess the feasibility of the de-anonymization process for large networks by significantly reducing the size of G1 and G2.
The resulting graphs are now the equivalent of the email/friendship networks we used as an example above. The overlap is the knowledge repository that the attacker uses for de-anonymization (Henderson et al. 2011). Part of this knowledge will be made available to the machine learning algorithms.
Previous work shows that the larger α, the more successful the attack. However, the relative success of attacks under different anonymization schemes is observed to be independent of α (Sharad 2016b). In order to experiment with a homogeneous attack, we set α=0.2, and we construct Vα by building a breadth-first-search tree starting from the highest-degree node (BFS-HD) in G. While other alternatives are certainly possible, we chose this approach for two reasons. First, it appears that the threat model we used is quite sensitive to the sampling process when generating G1 and G2 (Pedarsani and Grossglauser 2011). To avoid sampling bias, we chose a BFS-HD split to have a deterministic set of nodes in Vα. Second, we empirically found that BFS-HD provides the maximally informed seeds for an adversary to propagate the re-identification process, thus providing a best-case scenario for the attacker.
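As an illustration, the following is a minimal sketch of this splitting step, assuming an undirected networkx graph G with node attributes stored as node data; the function name split_bfs_hd and the even split of the non-overlapping nodes between the two subgraphs are our own choices, not prescribed above.

```python
import networkx as nx

def split_bfs_hd(G, alpha=0.2):
    """Split G into overlapping subgraphs G1, G2 whose shared node set V_alpha
    is grown by breadth-first search from the highest-degree node, so that
    |V_alpha| / |V1 u V2| equals alpha (here V1 u V2 = V)."""
    n_overlap = int(alpha * G.number_of_nodes())
    start = max(G.degree, key=lambda kv: kv[1])[0]        # highest-degree node
    bfs_order = [start] + [v for _, v in nx.bfs_edges(G, start)]
    v_alpha = set(bfs_order[:n_overlap])                  # seed knowledge shared by G1 and G2
    rest = [v for v in G.nodes if v not in v_alpha]
    v1_only, v2_only = set(rest[0::2]), set(rest[1::2])   # disjoint remainders (one simple choice)
    G1 = G.subgraph(v_alpha | v1_only).copy()             # induced subgraphs keep node attributes
    G2 = G.subgraph(v_alpha | v2_only).copy()
    return G1, G2, v_alpha
```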
Node Signatures
Since we are employing machine learning techniques to re-identify nodes in a graph, we need to represent nodes as feature vectors. We define the node u’s features using a combination of two vectors made up from its neighborhood degree distribution (NDD) and neighborhood attribute distribution (NAD) (as depicted in Fig. 2).
NDD is a vector of positive integers where \(NDD^{q}_{u}[k]\) represents the number of u’s neighbors at distance q with degree k. We concatenate the binned version of \(NDD^{1}_{u}\) with the binned version of \(NDD^{2}_{u}\) to define the node u’s NDD signature. We use a bin size of 50, which was shown empirically (Sharad 2016b) to capture the high degree variations of large social graphs. For each q, we use 21 bins, which corresponds to a maximum node degree of 1050; all larger values are placed in the last bin. This binning strategy is designed to capture the aggregate structure of ego networks, which is expected to be robust against edge perturbation (Sharad 2016a).
NAD is defined by \(NAD^{q}_{u}[i]\) which represents the number of u’s neighbors at distance q with an attribute value i. It is shown experimentally that the use of neighbor attributes as features often improves the accuracy of edge classification tasks (McDowell and Aha 2013).
We use the notation GS to represent the prediction results from input features built from the topology alone (i.e., NDD), and GS(LBL) to represent results from features built from both the topology and the attribute information (i.e., the concatenation of the NDD and NAD vectors).
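A minimal sketch of these node signatures follows, assuming an undirected networkx graph whose binary attribute is stored under the node key "label"; the bin size of 50 and the 21 bins per hop follow the text, while the function names are ours.

```python
import networkx as nx

BIN_SIZE, NUM_BINS = 50, 21

def neighbors_at_distance(G, u, q):
    """Nodes at exactly q hops from u (q = 1 or 2)."""
    lengths = nx.single_source_shortest_path_length(G, u, cutoff=q)
    return [v for v, d in lengths.items() if d == q]

def ndd(G, u, q):
    """Binned count of u's distance-q neighbors by degree (degrees 1-50 fall in the first bin)."""
    vec = [0] * NUM_BINS
    for v in neighbors_at_distance(G, u, q):
        vec[min(max(G.degree(v) - 1, 0) // BIN_SIZE, NUM_BINS - 1)] += 1
    return vec

def nad(G, u, q, values=(0, 1)):
    """Count of u's distance-q neighbors holding each binary attribute value."""
    vec = [0] * len(values)
    for v in neighbors_at_distance(G, u, q):
        vec[values.index(G.nodes[v]["label"])] += 1
    return vec

def signature(G, u, with_labels=True):
    feats = ndd(G, u, 1) + ndd(G, u, 2)              # GS features (topology only)
    if with_labels:                                  # GS(LBL) adds attribute information
        feats += nad(G, u, 1) + nad(G, u, 2) + [G.nodes[u]["label"]]
    return feats
```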
Random Forest Classification
Note that the nodes in Gsan∩Gaux, common to both graphs, can be recognized as being the same node (identical) in the two graphs based on their node identifier. Non-identical nodes are unique to each Gsan and Gaux and would not exist in the overlap. In the classification task, we wish to output 1 for an identical node pair and 0 for a non-identical node pair. This is the ground truth against which we measure the accuracy of the learning algorithms.
We generate examples for the training phase of the deanonymization attack by randomly picking node pairs from the sanitized (Gsan) and the auxiliary (Gaux) graphs, respectively. In most cases, we have an unbalanced dataset, with the degree of imbalance depending on the overlap parameter α, where the majority class consists of non-identical node pairs. We use the reservoir sampling technique (Haas 2016) to take ℓ=1000 sub-samples from the population S, and the SMOTE algorithm (Chawla et al. 2002) as an over-sampling technique to balance each sub-sample. Each sub-sample is used to train a forest of 100 random decision trees, with the Gini index as the impurity measure for the random forest classification. Given the size α of the overlap, we measure the quality of the classifier on the task of differentiating two nodes as identical or not.
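The classification step could be sketched as follows. How a candidate node pair is turned into a single feature vector is not fixed by the description above; the element-wise absolute difference of the two node signatures used here is one plausible choice, and the ℓ=1000 sub-sampling loop and the 5×2 cross-validation are omitted for brevity.

```python
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

def pair_features(sig_san, sig_aux):
    # One plausible pair representation: element-wise absolute difference of node signatures.
    return np.abs(np.asarray(sig_san) - np.asarray(sig_aux))

def train_and_score(X, y):
    """X: matrix of pair features; y: 1 for identical pairs, 0 for non-identical pairs."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, stratify=y)
    X_bal, y_bal = SMOTE().fit_resample(X_tr, y_tr)      # over-sample the minority class
    clf = RandomForestClassifier(n_estimators=100, criterion="gini")
    clf.fit(X_bal, y_bal)
    return f1_score(y_te, clf.predict(X_te))
```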
Metrics
We measure the accuracy of the classifier in determining whether a randomly chosen pair of nodes (with one node in Gsan and another in Gaux) are identical or not. We use the F1-score to evaluate the quality of the classifier. The F1-score is the harmonic mean of precision and recall, typical metrics for the prediction output of machine learning algorithms.
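For reference, with precision and recall computed over the identical/non-identical pair predictions, \(F_{1} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}\), where \(\text{precision} = \frac{TP}{TP+FP}\) and \(\text{recall} = \frac{TP}{TP+FN}\).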
For each data sample, we perform 5×2 cross-validation to evaluate the classifier and record the mean F1-score. We thus build two vectors of mean F1-scores, each of size ℓ=1000 (as described above), one for the labeled (GS(LBL)) and one for the unlabeled network topology (GS). An important aspect of these vectors is that they are related in the sense that the ith element in one vector represents the same sample as the ith element of the other vector. This is important for the pairwise comparison of the two mean F1-score vectors.
We perform a standard T-test on these two vectors and report the T-statistic value. The T-statistic value is a measure of how close to the hypothesis an estimated value is. In our case, the hypothesis is the prediction accuracy of the node identities in the unlabeled graph (GS) and the estimated value is the prediction accuracy in the labeled graph (GS(LBL)). Thus, a large T-statistic value implies a significantly better prediction accuracy of node identities in GS(LBL) than in GS. In such cases, we can say that the network with node attributes is more vulnerable to node re-identification. This value serves as our statistical measurement to quantify the vulnerability cost of node attributes.
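A minimal sketch of this comparison (reading the pairwise comparison of the aligned F1-score vectors as a paired t-test; the variable and function names are ours):

```python
from scipy.stats import ttest_rel

def vulnerability_cost(f1_labeled, f1_unlabeled):
    """Paired t-test of GS(LBL) F1-scores against GS F1-scores on the same samples.
    A large positive T-statistic means that node attributes significantly
    improve re-identification accuracy."""
    t_stat, p_value = ttest_rel(f1_labeled, f1_unlabeled)
    return t_stat, p_value
```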
Datasets
Because our work is empirically driven, a larger set of test datasets promises a better understanding of the relations between vulnerability to re-identification attacks and the particular characteristics of the node attributes (such as fractions of attributes of a particular value or the assignment of attributes to topologically related nodes). In this respect, real datasets are always preferable to synthetic ones, as they potentially encapsulate phenomena that are missing in the graph generative models. As an example, until very recently, the relation between the local degree assortativity coefficient and node degree was not captured in graph topology generators (Sendiña-Nadal et al. 2016).
However, relying only on real datasets has its limitations, due to the scarcity of relevant data (in this case, networks with binary node attributes) and the difficulty of covering the relevant space of graph metrics when relying only on available real datasets. Thus, in this work, we combine real networks (described in “Real Network Datasets” section) with synthetic networks generated from the real datasets. For generating synthetic labelled networks, we employ ERGMs (Holland and Leinhardt 1981; Wasserman and Pattison 1996) and a controlled node-labeling algorithm as described in “Synthetic Graphs” section.
Real Network Datasets
We chose six publicly available datasets from four different contexts and generated eight networks with binary node attributes.
- polblogs (Adamic and Glance 2005) is an interaction network between political blogs during the lead-up to the 2004 US presidential election. This dataset includes ground-truth labels identifying each blog as either conservative or liberal.
- fb-dartmouth, fb-michigan, and fb-caltech (Traud et al. 2012) are Facebook social networks at three US universities in 2005. A number of node attributes such as dorm, gender, graduation year, and academic major are available. We chose two attributes that could be represented as binary attributes: gender and occupation, where occupation distinguishes the attribute values “student” and “faculty”. From each dataset, we obtained two networks with the same topology but different node attribute distributions.
- pokec-1 (Takac and Zabovsky 2012) is a sample of an online social network in Slovakia. While the Facebook samples are university networks, Pokec is a general social platform whose membership comprises 30% of the Slovakian population. pokec-1 is a one-fortieth sample. This dataset has gender information available as a node attribute.
- amazon-products (Leskovec et al. 2007) is a bi-modal projection of categories in an Amazon product co-purchase network. Nodes are labeled as “book” or “music”, and edges signify that the two items were purchased together.
As Table 1 shows, the networks generated from these datasets have different graph characteristics. For example, the density (d) of the graphs varies across three orders of magnitude, while degree assortativity ranges from disassortative (for polblogs, r=−0.22, where there are more interactions between popular and obscure blogs than expected by chance) to assortative (as expected for social networks). All topologies except for amazon-products have small average path length.
The metrics p and τ shown in Table 1 are inspired by the synthetic node labeling algorithm used for generating synthetic graphs (and presented later), and they also show high variation across different networks. Intuitively, p captures the diversity of attribute values in the node population (with p=0.5 showing equal representation of the attributes) while τ captures the homophily phenomenon (that functions as an attraction force between nodes with identical attribute values). The homophilic attraction metric τ varies between 0 in pokec-1 (thus, no higher than chance preference for social ties with people of the same gender in Slovakia) to 0.99 in amazon-products (books are purchased together with other books much more strongly than given by chance). The diversity metric p varies between the overrepresentation of males in the US academic Facebook networks (8% female representation) to an almost perfectly balanced political representation in the polblogs dataset (where p=0.48). Note that we take p to be the minimum of the two group proportions, due to the symmetric nature of the binary attribute in our experiments.
This wide variation in graph metric values is what motivated our choice of this set of real networks. We opted to include the three Facebook networks from similar contexts to also capture more subtle variations in network characteristics.
Synthetic Graphs
In order to be able to control graph characteristics and node attribute distributions, we also generated a number of synthetic graphs comparable with the real datasets just described. The graph generation included two aspects: topology generation, for which we opted for ERGMs, and node attribute assignments, for which we implemented the technique proposed in (Skvoretz 2013).
Varying Topology via ERGMs
Exponential-family random graph models (ERGMs) or p-star models (Holland and Leinhardt 1981; Wasserman and Pattison 1996) are used in social network analysis for stipulating, within a set of structural parameters, distribution probabilities for networks. Their primary use is to describe the structural and local forces that shape the general topology of a network. This is achieved by using a selected set of parameters that encompass different structural forces (e.g., homophily, degree correlation/assortativity, clustering, and average path length). Once the model has converged, we can obtain maximum-likelihood estimates, model comparison and goodness-of-fit tests, and generate simulated networks tied to the relationship between the original network and the probability distribution provided by the ERGM.
Our interest in ERGMs is based on simulating graphs that retain set structural information from the original graph to generate a diverse set of graph structures. We used R (R Core Team 2014) and the statnet suite (Handcock et al. 2014), which contains several packages for network analysis, to produce ERGMs and simulate graphs from our real-world network datasets. In this case, we focused on three structural aspects of the graphs: clustering coefficient, average path length, and degree correlation/assortativity.
For the ERGM based on clustering coefficient, we used the edges and triangle parameters in the statnet package. The edges parameter measures the probability of linkage or no linkage between nodes, and the triangle term looks at the number of triangles or triad formations in the original graph. For the average path length model, edges and twopath terms were used. The twopath term measures the number of 2-paths in the original network and produces a probability distribution of their formation for the converged ERGM. Lastly, for the assortativity measure, the terms edges and degcor were used to produce the models. The degcor term considers the degree correlation of all pairs of tied nodes (for more on ERGMs see (Hunter et al. 2008; Morris et al. 2008)). These terms proved to be our best choices for preserving, to a certain extent, the desired structural information. Although the creation of ERGMs is a trial and error process, the selected terms were successful in producing models for each of the original networks.
After a successful model convergence, a simulated graph was generated for each model, constraining the number of edges to that of the original graph. It is worth mentioning that within the built-in simulate function in the statnet suite there is no way of forcibly constraining the aspects of the original graph we want to control. Thus, we experience variation, in some cases more than in others. The difference between the original and the simulated graphs is more prominent for smaller networks (see Tables 1 and 2 for comparison) than for models based on the larger networks, which came closer to the values of the original graphs.
Synthetic Labeling
A simple model that parameterizes a labeled graph with a tendency towards homophily (ties disproportionately between those of similar attribute background) is an “attraction” model (Skvoretz 2013). In the basic case of a binary attribute variable and a constant tendency to inbreed, two parameters, p and τ, both in the (0,1) interval, characterize the distribution of ties within and between the two groups. The first is the proportion of the population that takes on one value of the attribute (with 1−p, the proportion taking on the other value). The second parameter, the inbreeding coefficient or probability, expresses the degree to which a tie whose source is in one group is “attracted” to a target in that group. When τ=0, there is no special attraction and ties within and between groups occur in chance proportions. When τ>0, ties occur disproportionately within groups, increasing as τ approaches 1. Given a total number of ties, values for p and τ determine the number of ties/edges that are between groups, namely, δ=|E|×2×(1−τ)p(1−p).
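For illustration with hypothetical values: a graph with |E|=1000 edges, p=0.5, and τ=0.6 yields δ = 2 × 1000 × (1−0.6) × 0.5 × (1−0.5) = 200 cross-group edges, down from the 500 expected by chance (τ=0).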
In the process of generating synthetic node attributes, we first randomly assign two arbitrary values (i.e., R and B) as labels to all the nodes in the graph for a given p, 1−p split. Then, we draw an R node and a B node at random and swap their labels if doing so would decrease the number of R-B ties. This process converges when the total number of cross-group ties is reduced to δ for a particular value of τ.
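A sketch of this labeling procedure follows, assuming a networkx graph and using our own function names; the improving-swap loop mirrors the description above, and, as noted below, convergence to δ is not guaranteed for every (p, τ) combination.

```python
import random

def cross_ties(G, labels):
    """Number of edges whose endpoints carry different labels."""
    return sum(1 for u, v in G.edges if labels[u] != labels[v])

def assign_labels(G, p, tau, max_iter=100000):
    nodes = list(G.nodes)
    random.shuffle(nodes)
    n_r = int(p * len(nodes))
    labels = {u: ("R" if i < n_r else "B") for i, u in enumerate(nodes)}
    delta = 2 * G.number_of_edges() * (1 - tau) * p * (1 - p)   # target number of cross-group ties
    for _ in range(max_iter):
        current = cross_ties(G, labels)
        if current <= delta:
            break                                               # target reached
        u = random.choice([v for v in nodes if labels[v] == "R"])
        w = random.choice([v for v in nodes if labels[v] == "B"])
        labels[u], labels[w] = "B", "R"
        if cross_ties(G, labels) >= current:                    # keep only improving swaps
            labels[u], labels[w] = "R", "B"
    return labels
```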
Figure 3 shows the proportion of cross-group ties in the synthetic labeled networks generated from the polblogs topology. The proportion of cross-group ties increases with p and decreases with τ. When p reaches its maximum (pmax=0.5, due to the symmetric nature of binary attribute values), the proportion of cross-group ties is largest at the minimum inbreeding coefficient τ.
It should be noted that convergence is not guaranteed for all possible combinations of p and τ. The swapping procedure holds constant all graph properties except the mapping of nodes to labels, and consequently, it may not be possible to find a mapping of nodes to labels that achieves a target number of ties between groups (when that number is low as it is for higher values of τ).
Table 2 presents the graph characteristics of the synthetically generated labeled graphs.
Empirical Results
Our objective is not to measure the success of re-identification attacks on original datasets in which node identities have been removed: it has been demonstrated long ago (Backstrom et al. 2007) that naive anonymization of graph datasets does not provide privacy. Instead, our objective is to quantify the exposure provided by node attributes on top of the intrinsic vulnerability of the particular graph topology under attack.
In our experiments, we leverage the real and synthetic networks described above. We mount the machine learning attack described in “Machine Learning Attack” section to re-identify nodes using features based on both graph topology and node attributes. Our first guiding question is thus: How much risk of node re-identification is added to a network dataset by its binary node attributes?
The Vulnerability Cost of Node Attributes
Figure 4 presents the accuracy of node re-identification in the original graph topology GS and in the same topology augmented with node attributes GS(LBL). As expected, the re-identification attack performs (generally) better when node attributes are used in the attack. Surprising to us, however, is the relatively small vulnerability cost that node attributes introduce. For example, the occupation attribute has a barely noticeable benefit to the attacker in fb-dartmouth. More interestingly, however, the same attribute performs differently for the other two Facebook networks considered: for fb-caltech the occupation label functions as noise, leading to a small decrease in the F1-score. For fb-michigan, on the other hand, the occupation label significantly improves the attacker’s performance.
Another observation from this figure is that different node attributes applied to the same topology have different outcomes: see, for example, the case of the fb-michigan topology, where the difference between the impacts of the gender and the occupation attributes is the largest. We thus formulate a new question: What placement of attributes onto nodes reveals more information?
Diversity Matters, Homophily Not
To understand how the placement of attribute values on nodes affects vulnerability, we generate synthetic node attributes in a controlled manner. By varying p (the diversity ratio) and τ (the bias of nodes with same-value attributes to be connected by an edge), we can study the effect of these parameters on node re-identification.
Figure 5 presents the T-statistics of the F1-scores for node re-identification attacks on the original topology vs. labeled versions of the original topology. In addition to the original topologies, Fig. 5 also presents results on various synthetic networks generated as presented in “Synthetic Graphs” section.
We observe three phenomena. First, it appears that p is positively correlated with the T-statistic value measuring the re-identification impact of attributes. That is, the more diversity (that is, the larger p), the more vulnerable to re-identification the labeled nodes become on average. Intuitively, in a highly skewed attribute population, while the minority nodes will be identified more quickly due to node attributes, the majority remains protected. On the other hand, when p=0.5, a network has two equal-sized sets of nodes, each taking one of the two attribute values. This is explained by the fact that the NAD feature vector captures more diverse information from the attributes of neighbors when p is larger. This is also the explanation for why the node attributes contribute so much more to vulnerability in the polblogs dataset, which has a large diversity (p=0.48) (thus, almost equal numbers of conservative and liberal blogs). Note that the effect of p on the added vulnerability remains consistent across all topologies (real and synthetic) tested.
The second observation is that there is no visible pattern in how τ influences the vulnerability added by binary node attributes. While this is disappointing from the perspective of storytelling, it is potentially encouraging for data sharing, as it suggests that datasets that record homophily (or influence; the debate is irrelevant in this context) do not have to be anonymized by damaging this pattern. As a specific example, the privacy of a dataset that records an information dissemination phenomenon could be provided without perturbing the cascading-related ties.
The third class of observations is related to the relative effect of the topological characteristics on the added vulnerability. Both amazon-products and pokec-1 are orders of magnitude sparser than the other datasets considered. This means that the topological information available to the machine learning algorithm is limited. In this situation, the addition of the attribute information turns out to be very significant: the T-statistic values for these datasets are significantly larger than for the other datasets, with values over 400 in some cases.
Another topological effect is noticed when comparing the real pokec-1 topology with the ERGM-generated ones in Fig. 5e: the node attribute contributes much more to the vulnerability of the original topology compared to the synthetic topologies. The reason for this unusual behavior may lie in the different clustering coefficients of the networks, as seen in Tables 1 and 2: the ERGM-generated topologies have clustering coefficients one order of magnitude higher than the original topology (for the same graph density), which leads to more diverse NDD feature vectors for the networks with higher clustering and thus richer training information. This in turn leads to better accuracy in node re-identification in the unlabeled ERGM topologies (with higher clustering) than in the original topology. For example, the maximum F1-score for the ERGM-dc topology is 0.92 while for the original it is 0.76 in pokec-1. Thus, the relative benefit of the node attribute is significantly higher when the topology features were poorer.
Topology Leaks
Figure 6 presents the importance of features that are used in node re-identification. A high importance score represents a feature that is responsible for accurately classifying a large proportion of examples.
We make three observations from this figure. First, most of the NAD features (together with the node’s own attribute value), which represent node attribute information, prove to be important in all datasets.
Second, among the NDD features, only a small number contributes consistently to accurate prediction. As shown in Figs. 6c–i, the first bins of the 1-hop and 2-hop NDD vectors contribute the most. That is, a high impact on the re-identification of a node is brought by the number of its neighbors with degrees between 1 and 50. Even in large networks such as pokec-1 and amazon-products, with a larger range of node degrees, this behavior is observed.
Third, Fig. 6 suggests which features explain the effect of diversity p on node re-identification in labeled networks. On datasets with large diversity (such as polblogs or pokec-1), the topological information contributes less than on datasets with low diversity (such as fb-caltech (gender)). This is because high diversity correlates with richer NAD feature vectors, and thus the relative importance of the NAD features increases.
Epidemic and the Risk of Node Re-identification
In this section we consider the scenario of node attribute placement under the constraint of an epidemic process. We use the Susceptible-Infectious (SI) model (Kermack and Mckendrick 2003) to generate an epidemic process on the original graph topology. In the SI model, individuals are initially susceptible, with the exception of a small fraction of the population that is infectious. Upon contact with an infectious individual, a susceptible individual becomes infectious with probability β. Once infected, individuals stay infected and infectious throughout their lifetime.
We use this model to assign binary attributes (i.e., susceptible and infectious) to the nodes in the graph. In each experiment, we select the 0.1% highest degree nodes as infectious to initialize the epidemic. We vary the infection probability β between 0 and 1. We mount the machine-learning attack on each epidemic graph independently, both on the graph topology alone (GS) and on the same topology augmented with the binary node attributes assigned by the respective epidemic process (GS(LBL)). We make two assumptions in this task. First, we assume that the graph topology remains static during the epidemic process. Second, we assume that the adversary does not have any prior information about other epidemic graphs in the series.
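A minimal sketch of this attribute-assignment process, assuming a static undirected networkx graph; the per-contact infection trial below is one common reading of the SI step, and the function names are ours.

```python
import random

def snapshot(G, infectious):
    """Binary labels at the current time-step: "I" (infectious) or "S" (susceptible)."""
    return {u: ("I" if u in infectious else "S") for u in G.nodes}

def si_labelings(G, beta, steps, seed_fraction=0.001):
    by_degree = sorted(G.nodes, key=G.degree, reverse=True)
    n_seeds = max(1, int(seed_fraction * G.number_of_nodes()))   # top 0.1% highest-degree nodes
    infectious = set(by_degree[:n_seeds])
    snapshots = [snapshot(G, infectious)]
    for _ in range(steps):
        newly = {v for u in infectious for v in G.neighbors(u)
                 if v not in infectious and random.random() < beta}
        infectious |= newly                                      # infected nodes never recover
        snapshots.append(snapshot(G, infectious))
    return snapshots        # one labeled graph per epidemic time-step
```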
We calculate the significance of the vulnerability scores in GS(LBL) compared with GS via a standard T-test, and report the T-statistic value for each epidemic graph. Figure 7 shows the T-statistic values over multiple steps in the epidemic process, together with other characteristics (e.g., the infection probability β and the estimated homophily τ observed in the network).
We observe the same phenomenon regarding the correlation between the population’s diversity (p) and the T-statistic values over the epidemic graphs. However, the T-statistic values show different patterns depending on the infection probability β. Note that the population’s diversity (p) increases to a local maximum in the initial time-steps and then drops in later time-steps, an intuitive observation given the properties of the SI model (Kermack and Mckendrick 2003).
When the epidemic grows slowly (i.e., low infection probability), the T-statistic value also increases at a slower rate. On the other hand, when the epidemic breaks out at a faster infection rate, the T-statistic value also increases at a higher rate and reaches a relatively larger peak value. For the fb-caltech network, the T-statistic value reaches a peak value of 10 in four infection steps for β=0.1, while it reaches a peak value of 50 in two infection steps for β=0.9. Interestingly, the most diverse population in the fb-caltech network is also observed after four infection steps for β=0.1, and two infection steps for β=0.9 (as shown in Fig. 7d). In polblogs, T-statistic values reach peak values of 31 and 36 for the infection rates of 0.1 and 0.9, respectively (as shown in Fig. 7h). The polblogs population reaches its most diverse state after a similar number of infection steps for the respective infection rates.
Summary and Discussions
This paper shows that the addition of even a single binary attribute to nodes in a network increases the vulnerability to node re-identification. The increase in vulnerability derives from the fact that the machine learning attack makes use of the relationship between topology and the distribution of node labels. Using information about the distribution of labels in a node’s neighborhood provides additional leverage for the re-identification process, even when the labels are rudimentary.
Furthermore, we find that a population’s diversity with regard to the binary attribute consistently degrades anonymity and increases vulnerability. Diversity means a more even distribution of the binary attribute, which produces a more varied set of neighborhood distributions that nodes can exhibit. Consequently, nodes are more easily distinguished from one another by virtue of their differing neighborhood distributions of labels.
This observation is critical for network datasets for which the node attributes are the result of an epidemic process. If the epidemic process is monitored, an adversary could observe the node states and their changes repeatedly over multiple time steps. In such a scenario, the adversary could mount an even stronger node re-identification attack. The techniques presented in this paper can be applied to build strong anonymization techniques for such cases. Specifically, our techniques can be used to estimate the rate of anonymity loss over the lifespan of an epidemic process and more efficiently guide data owners in the process of network data anonymization.
Another outcome of this work is that there is no consistent discernible impact of homophily, as measured by the inbreeding coefficient, on vulnerability. Our procedure for investigating the impact of homophily simply involves swapping labels without disturbing ties. Therefore, both local and global (unlabeled) topologies remain constant as we decrease the number of cross-group ties to achieve a target value implied by a particular inbreeding coefficient for a given proportional split along the binary attribute. This procedure disturbs the local labeled topology, but because the machine learning attack uses information from that local topology, it apparently can adapt to the changes and make equally successful predictions regardless of the value of the inbreeding coefficient.
There are multiple directions in which this work could be extended. For example, we would like to assess the vulnerability risk of network data that is subject to different epidemic processes, especially processes in which nodes can recover and become infected multiple times. We suspect that such dynamic processes could lead to less vulnerable network datasets. Also, we would like to apply the techniques developed in this paper for guiding efficient anonymization strategies for network datasets with dynamic node attributes, such as those assigned by an epidemic process.
Availability of data and materials
The datasets generated and/or analysed during the current study are available in the Graph_Unmasking repository, [https://github.com/SamTube405/Graph_Unmasking].
References
Adamic, LA, Glance N (2005) The political blogosphere and the 2004 us election: divided they blog In: Proceedings of the 3rd International Workshop on Link Discovery, 36–43.. ACM, New York.
Aggarwal, CC, Li Y, Philip SY (2011) On the hardness of graph anonymization In: Data Mining (ICDM), 2011 IEEE 11th International Conference On, 1002–1007.. IEEE, Vancouver.
Backstrom, L, Dwork C, Kleinberg J (2007) Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography In: Proceedings of the 16th International Conference on World Wide Web, 181–190.. ACM, New York.
Blackburn, J, Kourtellis N, Skvoretz J, Ripeanu M, Iamnitchi A (2014) Cheating in online games: A social network perspective. ACM Transactions on Internet Technology 13(3):9:1–9:25.
Chawla, NV, Bowyer KW, Hall LO, Kegelmeyer WP (2002) Smote: Synthetic minority over-sampling technique. J Artif Int Res 16(1):321–357.
Gong, NZ, Talwalkar A, Mackey L, Huang L, Shin ECR, Stefanov E, Shi ER, Song D (2014) Joint link prediction and attribute inference using a social-attribute network. ACM Transactions on Intelligent Systems and Technology (TIST) 5(2):27.
Griffith, V, Jakobsson M (2005) Messin’ with Texas: Deriving mother’s maiden names using public records In: Applied Cryptography and Network Security, 91–103.. Springer, New York.
Gulyás, GG, Simon B, Imre S (2016) An efficient and robust social network de-anonymization attack In: Proceedings of the 2016 ACM on Workshop on Privacy in the Electronic Society, 1–11.. ACM, New York.
Haas, PJ (2016) Data-stream sampling: basic techniques and results In: Data Stream Management, 13–44.. Springer, Berlin, Heidelberg.
Handcock, M, Hunter DR, Butts CT, Goodreau S, Krivitsky P, Bender-deMoll S, Morris M (2014) statnet: Software tools for the statistical analysis of network data. The Statnet Project. (http://www.statnet.org). R package version. Accessed 1 Mar 2019.
Henderson, K, Gallagher B, Li L, Akoglu L, Eliassi-Rad T, Tong H, Faloutsos C (2011) It’s who you know: graph mining using recursive structural features In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 663–671.. ACM, New York.
Holland, PW, Leinhardt S (1981) An exponential family of probability distributions for directed graphs. Journal of the american Statistical association 76(373):33–50.
Horawalavithana, S, Gandy C, Flores JA, Skvoretz J, Iamnitchi A (2018) Diversity, homophily and the risk of node re-identification in labeled social graphs In: International Conference on Complex Networks and Their Applications, 400–411.. Springer, Switzerland.
Hunter, DR, Handcock MS, Butts CT, Goodreau SM, Morris M (2008) ergm: A package to fit, simulate and diagnose exponential-family models for networks. Journal of statistical software 24(3):54860.
Ji, S, Li W, Gong NZ, Mittal P, Beyah RA (2015) On your social network de-anonymizablity: Quantification and large scale evaluation with seed knowledge In: NDSS.. NDSS, San Diego.
Ji, S, Li W, Srivatsa M, He JS, Beyah R (2014) Structure based data de-anonymization of social networks and mobility traces In: International Conference on Information Security, 237–254.. Springer, Switzerland.
Ji, S, Li W, Srivatsa M, Beyah R (2014) Structural data de-anonymization: Quantification, practice, and implications In: Proceedings of the 2014 ACM SIGSAC Conference on Computer and Communications Security, 1040–1053.. ACM, New York.
Ji, S, Li W, Srivatsa M, Beyah R (2016) Structural data de-anonymization: Theory and practice. IEEE/ACM Transactions on Networking 24(6):3523–3536. New York.
Ji, S, Li W, Srivatsa M, He JS, Beyah R (2016) General graph data de-anonymization: From mobility traces to social networks. ACM Transactions on Information and System Security (TISSEC) 18(4):12:1–12:29.
Ji, S, Li W, Yang S, Mittal P, Beyah R (2016) On the relative de-anonymizability of graph data: Quantification and evaluation In: Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference On, 1–9.. IEEE.
Ji, S, Mittal P, Beyah R (2016) Graph data anonymization, de-anonymization attacks, and de-anonymizability quantification: A survey. IEEE Communications Surveys & Tutorials.
Ji, S, Wang T, Chen J, Li W, Mittal P, Beyah R (2017) De-sag: On the de-anonymization of structure-attribute graph data. IEEE Transactions on Dependable and Secure Computing PP(99):1–1. https://doi.org/10.1109/TDSC.2017.2712150.
Kermack, W, Mckendrick A (2003) A contribution to the mathematical theory of epidemics. Proc Roy Soc 5(772):700–721.
Korula, N, Lattanzi S (2014) An efficient reconciliation algorithm for social networks. Proceedings of the VLDB Endowment 7(5):377–388.
Lemos, R (2007) Researchers reverse Netflix anonymization. http://www.securityfocus.com/news/11497. Accessed 11 Aug 2017.
Leskovec, J, Adamic LA, Huberman BA (2007) The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1(1):5.
Liu, C, Mittal P (2016) Linkmirage: Enabling privacy-preserving analytics on social relationships In: NDSS.. NDSS, San Diego.
Liu, K, Terzi E (2008) Towards identity anonymization on graphs In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 93–106.. ACM, New York.
McDowell, LK, Aha DW (2013) Labels or attributes?: rethinking the neighbors for collective classification in sparsely-labeled networks In: Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, 847–852.. ACM, New York.
McPherson, M, Smith-Lovin L, Cook JM (2001) Birds of a feather: Homophily in social networks. Annual Review of Sociology 27:415–444.
Morris, M, Handcock MS, Hunter DR (2008) Specification of exponential-family random graph models: terms and computational aspects. Journal of statistical software 24(4):1548.
Narayanan, A, Shi E, Rubinstein BI (2011) Link prediction by de-anonymization: How we won the kaggle social network challenge In: Neural Networks (IJCNN), The 2011 International Joint Conference On, 1825–1834.. IEEE, San Jose.
Narayanan, A, Shmatikov V (2009) De-anonymizing social networks In: Security and Privacy, 2009 30th IEEE Symposium On, 173–187.. IEEE.
Nilizadeh, S, Kapadia A, Ahn Y-Y (2014) Community-enhanced de-anonymization of online social networks In: Proceedings of the 2014 Acm Sigsac Conference on Computer and Communications Security, 537–548.. ACM, New York.
Pedarsani, P, Figueiredo DR, Grossglauser M (2013) A bayesian method for matching two similar graphs without seeds In: Communication, Control, and Computing (Allerton), 2013 51st Annual Allerton Conference On, 1598–1607.. IEEE, Monticello.
Pedarsani, P, Grossglauser M (2011) On the privacy of anonymized networks In: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1235–1243.. ACM, New York.
Qian, J, Li X-Y, Zhang C, Chen L (2016) De-anonymizing social networks and inferring private attributes using knowledge graphs In: Computer Communications, IEEE INFOCOM 2016-The 35th Annual IEEE International Conference On, 1–9.. IEEE, San Francisco.
Sala, A, Zhao X, Wilson C, Zheng H, Zhao BY (2011) Sharing graphs using differentially private graph models In: Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, 81–98.. ACM, New York.
Sendiña-Nadal, I, Danziger MM, Wang Z, Havlin S, Boccaletti S (2016) Assortativity and leadership emerge from anti-preferential attachment in heterogeneous networks. Scientific Reports 6:21297.
Sharad, K, Danezis G (2013) De-anonymizing d4d datasets In: Workshop on Hot Topics in Privacy Enhancing Technologies, 10.. PETS, Bloomington, Indiana.
Sharad, K, Danezis G (2014) An automated social graph de-anonymization technique In: Proceedings of the 13th Workshop on Privacy in the Electronic Society, 47–58.. ACM, New York.
Sharad, K (2016a) Learning to de-anonymize social networks. PhD thesis, Computer Laboratory, University of Cambridge.
Sharad, K (2016b) True friends let you down: Benchmarking social graph anonymization schemes In: Proceedings of the 2016 ACM Workshop on Artificial Intelligence and Security. AISec ’16, 93–104.. ACM, New York. https://doi.org/10.1145/2996758.2996765.
Skvoretz, J (2013) Diversity, integration, and social ties: Attraction versus repulsion as drivers of intra- and intergroup relations. American Journal of Sociology 119:486–517.
Srivatsa, M, Hicks M (2012) Deanonymizing mobility traces: Using social network as a side-channel In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, 628–637.. ACM, New York.
Takac, L, Zabovsky M (2012) Data analysis in public social networks In: International Scientific Conference and International Workshop Present Day Trends of Innovations, 1.. Present Day Trends of Innovations Lamza, Poland.
R Core Team (2014) R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. http://www.R-project.org/.
Traud, AL, Mucha PJ, Porter MA (2012) Social structure of facebook networks. Physica A: Statistical Mechanics and its Applications 391(16):4165–4180.
Wasserman, S, Pattison P (1996) Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika 61(3):401–425.
Yartseva, L, Grossglauser M (2013) On the performance of percolation graph matching In: Proceedings of the First ACM Conference on Online Social Networks, 119–130.. ACM, New York.
Acknowledgements
We are grateful to Clayton Gandy for his support with the acquisition and processing of network data.
Funding
This research is supported by the National Science Foundation (NSF), USA, under grant IIS 1546453.
Author information
Contributions
SH implemented and executed the experiments. SH and AI designed the experiments. JF and JS generated synthetic graphs based on the ERGM package. SH and AI wrote the manuscript with important contributions from JF and JS. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
About this article
Cite this article
Horawalavithana, S., Arroyo Flores, J., Skvoretz, J. et al. The risk of node re-identification in labeled social graphs. Appl Netw Sci 4, 33 (2019). https://doi.org/10.1007/s41109-019-0148-x