Saturday, January 25, 2020

Measurements, Analyses, and Insights on the Entire Ethereum Blockchain Network



A summary of the WebConf 2020 (formerly WWW) research paper by Xi Tong Lee, Arijit Khan, Sourav Sen Gupta, Yu Hann Ong, and Xuan Liu.


Background of the Ethereum Blockchain: It has been more than ten years since Bitcoin [1] introduced the era of decentralized, community-controlled currency. Since then, several cryptocurrency variants, such as Litecoin, Namecoin, Dash, and Zcash, have been introduced. Blockchains are increasingly popular due to the prevalence of such cryptocurrencies and of decentralized applications, which are built on numerous blockchain frameworks such as Hyperledger, Corda, Ripple, Stellar, EOS, NEO, IOTA, and many more. Among them, Ethereum [2] is a distributed public blockchain network that focuses on running code (smart contracts) for decentralized applications. More simply, it is a platform for sharing information in a global state that cannot be manipulated or changed.

While Bitcoin-like cryptocurrency networks concern themselves only with users (wallets) transacting over the blockchain, Ethereum-like blockchains present a decentralized computing environment. Ethereum is a transaction-based state machine, where the state is made up of accounts. Transfers of value and information between accounts cause transitions in the global state of Ethereum, which are recorded in the blockchain [3]. There are primarily two types of accounts: (a) User accounts, controlled by external users with their private keys, and (b) Contract accounts, controlled by contract code that behaves like an “autonomous agent”. Transactions in Ethereum are data packets sent by user accounts and signed with their private keys, while Messages in Ethereum are virtual objects produced by contract accounts, generally sent to other contracts.

In addition to the native unit of value, ether, the Ethereum blockchain allows the creation of Tokens, an abstraction of “digital assets”, with the help of suitable data structures and methods implemented through smart contracts. Similar to base transactions using ether, accounts in Ethereum may transact in tokens of various kinds, fungible or otherwise, through the appropriate smart contracts. This allows a rich ecosystem of tokens, including various ERC20 (fungible) and ERC721 (non-fungible) tokens, to thrive on the Ethereum blockchain.

Motivation of Our Study: The genre of blockchain introduced by Ethereum brings forth a fascinating ecosystem of humans and autonomous agents (smart contracts) cohabiting the underlying blockchain fabric. It is neither like conventional social networks, where the players are human users, nor like cryptocurrencies, where all interactions are transfers of value or assets. In essence, such a blockchain network is closer to the Internet or the Web, where users are allowed to interact with one another as well as with programs. However, unlike the Web, there is also an interaction framework for smart contracts, where they can call (or kill) each other to maintain and advance the global state of the blockchain. This motivates us to study a public permissionless blockchain network as a complex system. We choose Ethereum, the most prominent public permissionless blockchain, to measure and draw insights from the network interactions.

In the last decade, several works [4-9] explored Bitcoin, cryptocurrencies, and other blockchain networks based on graph theory and network analysis. This line of research gained momentum due to the transparency offered by public permissionless blockchains, which allows anyone to access transactional information on the networks. We follow the direction of these previous works to measure and analyze the entire Ethereum blockchain network — above and beyond the financially relevant token transfer layer. Our approach closely follows the norms of measuring and analyzing social networks, the Internet, and the Web [10-23], as the entirety of the Ethereum blockchain network presents itself as an equally complex system.

Datasets and Four Constructed Networks: Google Cloud BigQuery curates the entire Ethereum blockchain data in terms of blocks, contracts, transactions, traces, logs, tokens and token transfers [24]. We extract all relevant data for Ethereum from the ethereum_blockchain dataset under the Google Cloud bigquery-public-data repository, till 2019-02-07 00:00:27 UTC, which amounts to all blocks from genesis (#0) up to #7185508.
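For concreteness, the raw traces can be pulled with a query along the following lines. This is only a minimal sketch using the Python BigQuery client (pip install google-cloud-bigquery), not our exact extraction pipeline; the table and column names follow the public dataset's schema at the time of our snapshot and may have changed since.

# Sketch: export traces up to block #7185508 from the public BigQuery dataset
# to a local CSV. Assumes configured Google Cloud credentials; table and column
# names reflect the dataset schema at the time of our snapshot.
from google.cloud import bigquery

client = bigquery.Client()
SQL = """
SELECT from_address, to_address, value, status
FROM `bigquery-public-data.ethereum_blockchain.traces`
WHERE block_number <= 7185508
  AND from_address IS NOT NULL
  AND to_address IS NOT NULL
"""
rows = client.query(SQL).result()          # run the query and wait for results
with open("traces.csv", "w") as out:
    out.write("from_address,to_address,value,status\n")
    for r in rows:
        out.write(f"{r.from_address},{r.to_address},{r.value},{r.status}\n")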


Figure 1: Interactions in the Ethereum Blockchain Network

We are interested in all interactions between Ethereum accounts, both in terms of standard ether transactions and token transfers. This requires us to construct interaction networks from the Ethereum blockchain data, where vertices are accounts (users or contracts) and arcs denote their interactions. There are four major types of interaction between Ethereum addresses — (i) User-to-User (transaction or token transfer), (ii) User-to-Contract (call or kill), (iii) Contract-to-User (transaction or token transfer), and (iv) Contract-to-Contract (create, call, kill or hard fork), as illustrated in Figure 1. In addition, there are some interactions to and from the Null address, which denote the creation of smart contracts and the generation of ether (mining rewards), respectively. We create four interaction networks out of the entire blockchain dataset of Ethereum, as follows.

TraceNet. We first create TraceNet, with all possible user and smart contract addresses found in the entire blockchain dataset as vertices, and all successful traces with non-null from/to addresses as arcs. We characterize the vertices by their type – regular users, miners/mining pools, regular contracts, and miner/mining-pool contracts. This is the most comprehensive interaction network for Ethereum, with well-classified vertices and arcs.

ContractNet. The second network, ContractNet, is a subgraph of TraceNet, where we retain the arcs with both from_address and to_address belonging to smart contracts (verified using the contracts table). This provides us with a pure contract-to-contract interaction network on Ethereum, where arcs are direct messages and/or transactions between smart contracts. We observe four arc types in this category — (i) Create arcs that involve the creation of new smart contracts, (ii) Suicide arcs where the owner of a smart contract decides to kill it, (iii) Call arcs that transfer ether from one account to another or call another smart contract, and (iv) Daofork arcs where a hard fork has occurred in the blockchain. These arcs connect three major vertex types — (i) ERC20 token contracts, (ii) ERC721 token contracts, and (iii) other contracts for intermediary functions and token-related services.

TransactionNet. The third network, TransactionNet, is the network of all Ethereum transactions recorded in the transactions table. Transactions are made by users, either to other users or smart contracts, or to the Null address in the case of smart contract creation. The vertices and arcs of this network are thus similar to those of TraceNet, with the exception of the Null address as an extra vertex, and the ‘User to Null’ arcs.

TokenNet. Finally, we create TokenNet, pertaining only to the transfer of tokens between Ethereum accounts. We use the token transfers table to extract only token-related transactions in the blockchain. The basic types of users and arcs are similar to those in TransactionNet, with an additional level of arc characterization based on the token in use.

We have open sourced our network datasets here [25]. Each interaction network provides us with a different perspective on the Ethereum blockchain, and our analyses reveal new insights by combining information from the four networks. While TraceNet presents a global view of interactions between Ethereum accounts, ContractNet focuses only on the automated multi-agent network of contracts, providing us with a functional view of the Ethereum state machine. While TransactionNet helps us analyze the base ether transactions in the blockchain, TokenNet focuses on the rich and diverse token ecosystem built on top of the Ethereum blockchain.
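To make the construction concrete, the sketch below builds TraceNet and carves ContractNet out of it from a trace file such as the one exported above, given the set of contract addresses. It is only an illustration with networkx multigraphs, not our exact construction code.

# Sketch: build TraceNet (all account-to-account traces) and ContractNet
# (the contract-to-contract subgraph) as directed multigraphs.
import csv
import networkx as nx

def build_networks(trace_csv, contract_addresses):
    trace_net = nx.MultiDiGraph()
    contract_net = nx.MultiDiGraph()
    with open(trace_csv) as f:
        for row in csv.DictReader(f):
            u, v = row["from_address"], row["to_address"]
            trace_net.add_edge(u, v, value=row["value"])
            # ContractNet retains only arcs whose endpoints are both contracts
            if u in contract_addresses and v in contract_addresses:
                contract_net.add_edge(u, v, value=row["value"])
    return trace_net, contract_net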

Summary of Contributions: To the best of our knowledge, we are the first to conduct a comprehensive study of the large-scale Ethereum blockchain network, cohabited by both human users and autonomous agents (smart contracts).

— We study the four blockchain networks based on local and global graph properties, e.g., network size, density, degree distribution, in-to-out degree correlation, vertex centrality, reciprocity, assortativity, connected components, core decomposition, transitivity, clustering coefficient, higher-order motifs, articulation points, adhesion, cohesion, and small-world characterization (a minimal sketch of computing a few such measures appears after this list). Such structural information is useful for characterizing interactions and for evaluating the current Ethereum blockchain system at scale – an effort that has not been attempted before. We also identify their similarities and differences with social networks and the Web, and draw interesting conclusions.

— We further consider three prominent token subnetworks, Bancor, Binance Coin, and Zilliqa, and investigate the amount of activity in each token network over time, as well as the size of the core community driving the token economy over time. We identify an interesting correlation between the temporal evolution of the number of cores in the token subgraphs and the corresponding evolution of the token's price in the cryptocurrency market.

— We open source the datasets [25] and highlight important research directions such as analysis of mining pools, identifying complex patterns to detect fraudulent activities, and temporal analysis of token subnetworks to forecast the price of Ethereum backed tokens.
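Referring back to the first contribution, many of the network-level measures listed there are available off the shelf once an interaction network is loaded. The sketch below (networkx, on simple directed and undirected views of the multigraph) only illustrates the kind of quantities we report; it is not our measurement pipeline.

# Sketch: a few global properties of an interaction network G (a MultiDiGraph).
import networkx as nx

def basic_properties(G):
    D = nx.DiGraph(G)                     # collapse parallel arcs
    U = nx.Graph(D)                       # undirected simple view
    U.remove_edges_from(nx.selfloop_edges(U))
    return {
        "vertices": D.number_of_nodes(),
        "arcs": G.number_of_edges(),      # multigraph arc count
        "density": nx.density(D),
        "reciprocity": nx.overall_reciprocity(D),
        "assortativity": nx.degree_assortativity_coefficient(D),
        "transitivity": nx.transitivity(U),
        "average_clustering": nx.average_clustering(U),
        "weak_components": nx.number_weakly_connected_components(D),
    }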

Our Findings and Future Research Directions: We investigate several local and global graph properties over four Ethereum blockchain networks (TraceNet, ContractNet, TransactionNet, and TokenNet), as well as in three prominent token subnetworks (Bancor, Binance Coin, and Zilliqa), and conduct a thorough experimental evaluation.

We find that these blockchain networks are very different from social networks. For both TraceNet and TransactionNet, the Lognormal, Weibull, and Power-law-with-cutoff distributions are better fits than the traditional power-law degree distribution. All four blockchain networks contain high-outdegree vertices (e.g., mining pools and mixers) as well as high-indegree vertices (e.g., ICO smart contracts). This characteristic is similar to the Web, which consists of both hub (high-outdegree) and authority (high-indegree) vertices, and is uncommon in social networks, which usually exhibit high correlation between indegrees and outdegrees. As a result, blockchain networks are disassortative and have very low transitivity. Moreover, the most frequent motifs observed in blockchain graphs are chain- and star-shaped. Complex patterns, such as triangles, cycles, and cliques, occur less frequently, indicating a lack of community structure in blockchain networks. Removal of only the highest-degree vertex (e.g., Binance, a global cryptocurrency exchange) can disconnect the largest weakly connected component in these graphs.
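The degree-distribution comparison follows the standard likelihood-ratio methodology; a rough sketch with the powerlaw Python package is shown below (in that package, the Weibull corresponds to the 'stretched_exponential' alternative and the power law with cut-off to 'truncated_power_law'). Parameters and data preparation are illustrative only.

# Sketch: compare candidate degree distributions via log-likelihood ratios.
import powerlaw

def compare_degree_fits(G):
    degrees = [d for _, d in G.degree() if d > 0]
    fit = powerlaw.Fit(degrees, discrete=True)
    for alt in ("lognormal", "stretched_exponential", "truncated_power_law"):
        # R > 0 favors the pure power law, R < 0 favors the alternative;
        # p gives the significance of the comparison.
        R, p = fit.distribution_compare("power_law", alt)
        print(f"power_law vs {alt}: R = {R:.2f}, p = {p:.3f}")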

In spite of the aforementioned differences, blockchain networks are surprisingly small-world and well-connected. Analogous to social networks, blockchain graphs have short average shortest-path lengths. Similar to both social networks and the Web, blockchain networks contain a single, large strongly connected component (SCC), and about 98% of the remaining vertices can either reach this SCC or be reached from it.
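This bow-tie-like structure can be verified directly: take the largest SCC and count how many of the remaining vertices can reach it or be reached from it. A minimal sketch on a networkx DiGraph D:

# Sketch: size of the largest SCC and the fraction of the remaining vertices
# that belong to its IN or OUT sets.
import networkx as nx

def bowtie_fractions(D):
    scc = max(nx.strongly_connected_components(D), key=len)
    rep = next(iter(scc))                      # any representative vertex
    out_set = nx.descendants(D, rep) - scc     # vertices reachable from the SCC
    in_set = nx.ancestors(D, rep) - scc        # vertices that can reach the SCC
    rest = D.number_of_nodes() - len(scc)
    covered = len(in_set | out_set)
    return len(scc) / D.number_of_nodes(), covered / max(rest, 1)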

In terms of the four different networks, we observe that ContractNet has more self-loops and bidirectional arcs (hence, higher reciprocity), while TokenNet has fewer of them. As a result, ContractNet is denser as a MultiDigraph, while TokenNet is denser in its simple, undirected form. Both of them yield larger core indices for vertices in the innermost cores, indicating a higher density of their innermost cores. Moreover, both ContractNet and TokenNet have smaller radius and diameter compared to our larger networks, TraceNet and TransactionNet.

Following our characterization of Ethereum into four different blockchain networks, there is ample opportunity for future work. Studying individual mining pools as complex, self-contained, evolving networks would be interesting, as would an investigation of the interplay between mining pools to identify instances of selfish mining and mining strategies. Further analysis of the individual token networks in terms of activity signatures and temporal evolution (such as changes in coreness) may lead to more accurate forecasting of trading behavior and token prices in the cryptocurrency market. Identification of influential vertices and complex motifs (such as cliques and cycles) in the blockchain networks may also lead to detection of fraudulent activities in the transaction and token networks of Ethereum. Quite naturally, a similar line of measurements and analyses can be applied to other public blockchain platforms to unearth interesting phenomena within and across the Web of blockchain networks.

For more information, click here for our paper.  Datasets: GitHub.
Blog post contributed by: Arijit Khan
  
[Reference]

[1] S. Nakamoto. 2008. Bitcoin: A Peer-to-Peer Electronic Cash System. https://bitcoin.org/bitcoin.pdf.

[2] F. Vogelsteller, V. Buterin, et al. [n.d.]. Ethereum Whitepaper. https://github.com/ethereum/wiki/wiki/White-Paper.

[3] G. Wood. [n.d.]. Ethereum: A Secure Decentralised Generalised Transaction Ledger. https://github.com/ethereum/yellowpaper.

[4] L. Ermann, K. M. Frahm, and D. L. Shepelyansky. 2018. Google Matrix of Bitcoin Network. The European Physical Journal B (2018), 91–127.

[5] S. Ferretti and G. D’Angelo. 2019. On the Ethereum Blockchain Structure: A Complex Networks Theory Perspective. Concurrency and Computation: Practice and Experience (2019), e5493.

[6] D. Ron and A. Shamir. 2013. Quantitative Analysis of the Full Bitcoin Transaction Graph. In Financial Cryptography and Data Security.

[7] S. Somin, G. Gordon, and Y. Altshuler. 2018. Network Analysis of ERC20 Tokens Trading on Ethereum Blockchain, In Complex Systems. Springer Proceedings in Complexity, 439–450.

[8] S. Somin, G. Gordon, and Y. Altshuler. 2018. Social Signals in the Ethereum Trading Network. CoRR abs/1805.12097 (2018). arXiv:1805.12097 http://arxiv.org/abs/1805.12097.

[9] F. Victor and B. K. Lüders. 2019. Measuring Ethereum-based ERC20 Token Networks. In Financial Cryptography and Data Security.

[10] L. Adamic, O. Buyukkokten, and E. Adar. 2003. A Social Network Caught in the Web. First Monday 8, 6 (2003).

[11] Y.-Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong. 2007. Analysis of Topological Characteristics of Huge Online Social Networking Services. In WWW.

[12] L. Backstrom, D. Huttenlocher, J. Kleinberg, and X. Lan. 2006. Group Formation in Large Social Networks: Membership, Growth, and Evolution. In KDD.

[13] A.-L. Barabasi and R. Albert. 1999. Emergence of Scaling in Random Networks. Science 286, 5439 (1999), 509–512.

[14] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. 2000. Graph Structure in the Web. In WWW.

[15] M. Faloutsos, P. Faloutsos, and C. Faloutsos. 1999. On Power-law Relationships of the Internet Topology. SIGCOMM Comput. Commun. Rev. 29, 4 (1999), 251–262.

[16] M. S. Granovetter. 1973. The Strength of Weak Ties. The American Journal of Sociology 78, 6 (1973), 1360–1380.

[17] J. Kleinberg and S. Lawrence. 2001. The Structure of the Web. Science 294, 5548 (2001), 1849–1850.

[18] R. Kumar, J. Novak, and A. Tomkins. 2006. Structure and Evolution of Online Social Networks. In KDD.

[19] F. Liljeros, C. R. Edling, L. A. N. Amaral, H. E. Stanley, and Y. Aberg. 2001. The Web of Human Sexual Contacts. Nature 411 (2001), 907–908.

[20] A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. 2008. Growth of the Flickr Social Network. In WOSN.

[21] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and B. Bhattacharjee. 2007. Measurement and Analysis of Online Social Networks. In SIGCOMM IMC.

[22] M. E. J. Newman. 2001. The Structure of Scientific Collaboration Networks. Proceedings of the National Academy of Sciences 98, 2 (2001), 404–409.

[23] G. Siganos, S. L. Tauro, and M. Faloutsos. 2006. Jellyfish: A Conceptual Model for the AS Internet Topology. Journal of Communications and Networks 8, 3 (2006), 339–350.

[24] 2018. Ethereum in BigQuery: a Public Dataset for Smart Contract Analytics. https://cloud.google.com/blog/products/data-analytics/ethereum-bigquerypublic-dataset-smart-contract-analytics.  Accessed: 2010-10-12.


Friday, November 15, 2019

Maximizing Contrasting Opinions in Signed Social Networks



A summary of the IEEE BigData 2019 research paper by Kaivalya Rawal and
Arijit Khan.

Background: A central characteristic of social networks is that they facilitate rapid dissemination of information among large groups of individuals [1]. Online social networks, such as Facebook, Twitter, LinkedIn, Flickr, and Digg, are used for spreading ideas and messages. Users’ behaviors and opinions are highly affected by their friends in social networks, a phenomenon known as social influence. Motivated by various real-world applications, e.g., viral marketing [2] and social and political campaigning [3], social influence studies have attracted extensive research attention. The classic influence maximization problem [4], [2] identifies the top-k seed users in a social network such that the expected number of influenced users in the network, starting from those seeds and following an influence diffusion model, is maximized. The budget k on the seed set size usually depends on how many initial users the campaigner can directly influence by advertisements, re-tweets from “bots”, free samples, and discounted prices.

Problem: We investigate a novel influence diffusion problem: COSiNe (Contrasting Opinions Maximization in a Signed Social Network). We find a limited number of influential seed nodes that maximize the adoption of two distinct, antithetical opinions in two non-overlapping user groups with opposing views. The objective behind such influence maximization is to create awareness in a population by improving the quality of the debate on naturally contentious issues.

Applications: An ideal application of our problem would be to increase awareness about infrequently discussed issues that are nonetheless controversial (such as capital punishment, nuclear energy, or affirmative action) — in a target population that naturally splits into two distinct ideological groups (such as Democrats and Republicans); in a forum that extensively debates topics and proposes mutually agreeable solutions based on compromise, diversity, and inclusion (such as the United States Senate or House of Representatives). Contrary to initial expectations, polarization of opinions and increased conflict can often be beneficial [5], [6], [7], [8], as discussed in the following.

The benefit of conflicting opinions among collaborators has been clearly observed in Wikipedia. Controversial articles, such as those on the Syrian Civil War, Israel/Palestine, or George W. Bush, attract a higher number of edits. Higher polarization in the contributing community is associated with higher article quality for a broad range of articles – from politics to science and social issues [6]. Increased diversity is also often correlated with better business performance. Similarly, disagreements amongst co-workers have been found to improve decision making at the organisation level. Thus, encouraging different opinions about the same topic can be leveraged to improve the productivity of the organisation [7]. When dealt with correctly, such differences in thought and opinion are a force for good.

Lastly, we illustrate an example from the world of politics that is most similar to our “ideal” application scenario. Unlike the American presidential system, countries based upon the Westminster parliamentary system have an appointed head of government, distinct from the head of state, and an appointed head of opposition. This balance between the government and the opposition is considered integral to the success of a functioning democracy in countries as diverse as Britain and India [8]. An equivalent analysis was made for the political system of the United States of America in 1950 by the American Political Science Association [5], which recommended a stronger two-party system in order to strengthen the democratic process. Both these analyses point to the importance of opposition in political discourse, and go on to show that policies being enacted and implemented benefit from engagement, and even opposition. Meaningful discourse and spirited debate require people who inherently hold opposing beliefs on a given issue, and thus maximizing opposing influences can be beneficial for a legislative body from the point of view of the general population.

Challenges: Contrasting opinions maximization, as required in our problem setting, is non-trivial. First, one must employ an influence cascade model that has properties different from those used for commercial, one-time, product-purchase-based marketing strategies. For example, people’s opinions change over time; thus, activation-based models, such as the independent cascade (IC) and linear threshold (LT) models [4], are less appropriate in political contexts. Second, in reality a signed social network might not be perfectly balanced [9], that is, there may not exist a partition V1, V2 of the node set V such that all edges within V1 and within V2 are positive and all edges across V1 and V2 are negative. Such a network does not follow social balance theory, and adds more complexity to the social influence cascade.

Contributions: In this work, we employ the voter model [9], [10], [11], [12] to characterize influence diffusion in the two population groups of a social network. We define our model such that opposite influences, when applied on the same user, cancel each other, leading to a decay in the influence strength on any given user. Our model does not mandate that a user’s choice be frozen upon one-time activation, explicitly allowing the user to switch opinions at later times. Moreover, the voter model, being stochastic (it has a random-walk-based interpretation), can deal with signed networks that are not perfectly balanced. We then formulate our novel COSiNe problem (contrasting opinions maximization), and design an efficient, exact solution.
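To make the dynamics concrete, the following is a minimal simulation sketch of voter-model diffusion on a signed network, in the spirit of [9]: in each round every node copies the opinion of a uniformly chosen neighbor, negated if the connecting edge is a "foe" edge. The function names, the synchronous update schedule, and the seed handling are illustrative assumptions; the exact formulation we use (and the COSiNe objective built on top of it) is in the paper.

# Sketch: signed voter-model dynamics. Opinions are +1 / -1; G is a networkx
# Graph whose edges carry a 'sign' attribute in {+1, -1}.
import random
import networkx as nx

def voter_step(G, opinion):
    new_opinion = dict(opinion)
    for v in G.nodes():
        nbrs = list(G.neighbors(v))
        if nbrs:
            u = random.choice(nbrs)                        # pick a random neighbor
            new_opinion[v] = G[v][u]["sign"] * opinion[u]  # flip on a foe edge
    return new_opinion

def simulate(G, seeds_pos, seeds_neg, rounds=50):
    opinion = {v: random.choice([+1, -1]) for v in G.nodes()}
    opinion.update({v: +1 for v in seeds_pos})             # campaigner's seeds
    opinion.update({v: -1 for v in seeds_neg})
    for _ in range(rounds):
        opinion = voter_step(G, opinion)
    return opinion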

For more information, click here for our paper.  Codes and datasets: GitHub.
Blog post contributed by: Arijit Khan
  
[Reference]

[1] W. Chen, L. V. S. Lakshmanan, and C. Castillo, “Information and Influence Propagation in Social Networks”, Morgan & Claypool, 2013.

[2] P. Domingos and M. Richardson, “Mining the Network Value of Customers”, KDD, 2001.

[3] B. A. Conway, K. Kenski, and D. Wang, “The Rise of Twitter in the Political Campaign: Searching for Intermedia Agenda-Setting Effects in the Presidential Primary”, Journal of Computer-Mediated Communication, vol. 20(4), 2015, pp. 363–380.

[4] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the Spread of Influence through a Social Network”, KDD, 2003.

[5] A. Schlesinger JR, “Toward A More Responsible Two-Party System: A Report”, American Political Science Association, 1950.

[6] F. Shi, M. Teplitskiy, E. Duede, and J. A. Evans, “The Wisdom of Polarized Crowds”, Nature Human Behaviour, 2019.

[7] K. Ferrazzi, “The Benefits of Conflict at Work”, 2014, http://fortune.com/2014/03/11/the-benefits-of-conflict-at-work.

[8] A. Beteille, “Democracy and Its Institutions”, Oxford University Press, Chapter Government and Opposition, 2012.

[9] Y. Li, W. Chen, Y. Wang, and Z.-L. Zhang, “Influence Diffusion Dynamics and Influence Maximization in Social Networks with Friend and Foe Relationships”, WSDM, 2013.

[10] P. Clifford and A. Sudbury, “A Model for Spatial Conflict”, Biometrika, vol. 60(3), 1973, pp. 581–588.

[11] R. A. Holley and T. M. Liggett, “Ergodic Theorems for Weakly Interacting Infinite Systems and the Voter Model”, Ann. Probab., vol. 3(4), 1975, pp. 643–663.

[12] E. Even-Dar and A. Shapira, “A Note on Maximizing the Spread of Influence in Social Networks”, “Internet and Network Economics”, 2007.

Sunday, April 28, 2019

Distance-generalized Core Decomposition


A summary of the SIGMOD 2019 research paper by Francesco Bonchi, Arijit Khan, and Lorenzo Severini

Background: Extracting dense structures from large graphs has emerged as a key graph-mining primitive in a variety of application scenarios, ranging from web mining, to biology, to finance. Many different definitions of dense subgraphs have been proposed (e.g., cliques, n-cliques, n-clans, k-plexes, f-groups, n-clubs, lambda sets), but most of them are NP-hard to compute, or at least quadratic. In this respect, the concept of core decomposition is particularly appealing because (i) it can be computed in linear time, and (ii) it is related to many of the various definitions of a dense subgraph and can be used to speed up or approximate their computation. The k-core of a graph is defined as a maximal subgraph in which every vertex is connected to at least k other vertices within that subgraph. The set of all k-cores of a graph, for each k, forms its core decomposition. The core index of a vertex v is the maximal k for which v belongs to the k-core.
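For reference, the classic (distance-1) decomposition is readily available in standard graph libraries; e.g., with networkx (self-loops must be removed first):

# Classic core decomposition with networkx.
import networkx as nx

G = nx.karate_club_graph()
G.remove_edges_from(nx.selfloop_edges(G))   # core_number disallows self-loops
core_index = nx.core_number(G)              # vertex -> largest k with v in the k-core
three_core = nx.k_core(G, k=3)              # the 3-core as a subgraph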

While core decomposition is based on the number of immediate connections that a vertex has within a subgraph (its degree), the importance of studying network structures beyond the horizon of the distance-1 neighborhood of a vertex has been well established for several decades, especially in the social sciences [1]. Following this observation, in this paper we introduce an important generalization of the notion of core decomposition. Looking through the lens of shortest-path distance, one can see the degree of a vertex v as the number of vertices in the graph that have distance ≤ 1 from v, or equivalently, the size of the 1-neighborhood of v. From this perspective, a natural generalization is to consider a distance threshold h > 1. This leads smoothly to the notions of h-neighborhood, h-degree, and in turn, to the distance-generalized notion of the (k,h)-core, i.e., the maximal subgraph in which every vertex has at least k other vertices at distance ≤ h within that subgraph. As we formally prove in this work, the (k,h)-core is unique and is contained in the (k − 1,h)-core: these facts allow us to define the notion of distance-generalized core decomposition.


Figure 1 shows the differences between the classic core decomposition (left) and the (k, 2)-core decomposition (right). For this example graph, the classic core decomposition (i.e., the (k, 1)-core decomposition in our setting) puts all the vertices in the core with k = 2. On the contrary, by considering the distance-2 neighborhood, the (k, 2)-core decomposition is able to capture structural differences among the vertices, thus providing a finer-grained analysis. In particular, it detects that vertices 4 to 13 form a denser and more structured region (the (6, 2)-core), while vertices 2 and 3 are in the (5, 2)-core, and vertex 1 only belongs to the (4, 2)-core.

Why Core Decomposition of the Power Graph Does Not Work: One might think of obtaining the distance-generalized (k,h)-core decomposition by first computing the h-power of the input graph (as in Figure 2), and then applying the state-of-the-art algorithms for core decomposition. However, this does not yield the correct decomposition, as shown next. Figure 2 shows the power graph G^2 of the graph in Figure 1. We can observe that, according to the classic core decomposition of G^2, vertices 2 and 3 have core index 6, while in the (k, 2)-core decomposition of G (right side of Figure 1) they have core index 5. This is because, in G^2, vertices 2 and 3 become adjacent through vertex 1, but vertex 1 does not belong to the (5, 2)-core.

Challenges: In the classic core decomposition, when a vertex is removed, the degree of its neighbors decreases by 1; this observation allows several optimizations which make the classic case easy to compute and to scale [2, 3, 4, 5, 6, 7]. Instead, in the generalized (k,h)-core decomposition, the removal of a vertex can decrease the h-degree of some of its h-neighbors by more than one. This is the reason why the idea of decomposing the h-power graph does not work, and it is also the main reason why distance-generalized core decomposition (h > 1) is much harder to compute than the standard core decomposition (h = 1). In fact, when a vertex is removed, we need to recompute the h-degree of a large number of vertices, i.e., we need a large number of h-bounded BFS traversals.

Algorithmic Contribution: In spite of such challenges, we devise efficient algorithms. As a baseline, we first extend the state-of-the-art Batagelj and Zaveršnik algorithm [2] to deal with (k,h)-core decomposition (we dub the algorithm h-BZ). Next, we exploit a lower bound on the core index of a vertex to avoid a large number of useless h-degree re-computations. We call this algorithm h-LB. Finally, we propose an algorithm that further improves efficiency by computing an upper bound and processing the vertices with larger h-degrees as early as possible (dubbed h-LB+UB). In order to scale to larger graphs, we also exploit the multi-threading provided by modern architectures to parallelize the computation of h-degrees.
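For intuition only, a naive peeling sketch for a single (k,h)-core is given below: it repeatedly removes vertices whose h-degree falls below k, recomputing h-degrees with h-bounded BFS. It is correct but far slower than the h-BZ, h-LB, and h-LB+UB algorithms of the paper.

# Naive sketch: extract the (k,h)-core of an undirected networkx graph G.
import networkx as nx

def h_degree(H, v, h):
    # number of vertices within distance <= h of v in H (excluding v itself)
    return len(nx.single_source_shortest_path_length(H, v, cutoff=h)) - 1

def kh_core(G, k, h):
    H = G.copy()
    changed = True
    while changed:
        changed = False
        for v in list(H.nodes()):
            if h_degree(H, v, h) < k:
                H.remove_node(v)        # removal may lower other h-degrees
                changed = True
    return H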

Applications: We show that distance-generalized core decomposition generalizes many of the nice properties of the classic core decomposition, e.g., its connection with the notion of distance-generalized chromatic number, or its usefulness in speeding-up or approximating distance-generalized notions of dense structures, such as h-club, (distance-generalized) densest subgraph and community search problems (i.e., cocktail party) [8]. We also show that it is very informative and performs well as a heuristic for selecting landmarks to build shortest-path-distance oracles.


For more information, click here for our paper. 
Blog post contributed by: Arijit Khan


[Reference]

[1] R. Luce. Connectivity and Generalized Cliques in Sociometric Group Structure. Psychometrika, 15(2):169–190, 1950.

[2] V. Batagelj and M. Zaveršnik. Fast Algorithms for Determining (Generalized) Core Groups in Social Networks. Advances in Data Analysis and Classification, 5(2):129–145, 2011.

[3] J. Cheng, Y. Ke, S. Chu, and M. T. Özsu. Efficient core decomposition in massive networks. In ICDE, 2011.

[4] W. Khaouid, M. Barsky, V. Srinivasan, and A. Thomo. K-core decomposition of large networks on a single PC. PVLDB, 9(1):13–23, 2015.

[5] A. Montresor, F. D. Pellegrini, and D. Miorandi. Distributed k-Core Decomposition. TPDS, 24(2), 2013.

[6] K. Pechlivanidou, D. Katsaros, and L. Tassiulas. MapReduce-Based Distributed K-Shell Decomposition for Online Social Networks. In SERVICES, 2014.

[7] A. E. Sariyüce, B. Gedik, G. Jacques-Silva, K. Wu, and Ü. V. Çatalyürek. Streaming Algorithms for k-Core Decomposition. PVLDB, 6(6), 2013.

[8] M. Sozio and A. Gionis. The community-search problem and how to plan a successful cocktail party. In KDD, 2010.

Wednesday, April 24, 2019


An In-Depth Comparison of s-t Reliability Algorithms over Uncertain Graphs

A summary of the PVLDB 2019 research paper by Xiangyu Ke, Arijit Khan, and Leroy Lim Hong Quan

Uncertain, or probabilistic, graphs have been increasingly used to represent noisy linked data in many emerging applications, and have recently attracted the attention of the database research community [7]. A fundamental problem on uncertain graphs is s-t reliability, which measures the probability that a target node t is reachable from a source node s in a probabilistic (or uncertain) graph, i.e., a graph where every edge is assigned a probability of existence. The s-t reliability estimation has been used in many applications, such as measuring the quality of connections between two terminals in a sensor network, finding other proteins that are highly probable to be connected with a specific protein in a protein-protein interaction (PPI) network, identifying highly reliable peers containing some file to transfer in a peer-to-peer (P2P) network, probabilistic path queries in a road network, and evaluating information diffusion in a social influence network.

Due to the inherent complexity of the problem (#P-hard), although exact reliability computation has received attention in the past, the focus nowadays is mainly on approximate and heuristic solutions over large-scale graphs. The large spectrum of the reliability problem is categorized in Figure 1. Notice that in this work, we focus on sequential algorithms for the fundamental s-t reliability query. In particular, various sampling- and indexing-based efficient algorithms have been proposed in the literature. Estimation of reliability in uncertain graphs began with the use of Monte Carlo (MC) sampling [1]. Subsequently, more advanced sampling methods were proposed in the form of recursive sampling [2, 3], lazy propagation sampling [4], and shared possible worlds [5], as well as other indexing methods [6]. With the wide range of algorithms available for estimating s-t reliability over uncertain graphs, there is an urgent need to understand their trade-offs, and to employ the best algorithm for a given scenario.
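As a baseline for intuition, MC sampling is simple to sketch: in each of K possible worlds, keep every edge independently with its probability and test whether t is reachable from s. The version below flips each edge's coin lazily as the BFS reaches it, which is equivalent in distribution since every directed edge is examined at most once per sample; it is an illustration, not the optimized implementations we benchmark.

# Sketch: Monte Carlo estimation of s-t reliability. G is a networkx DiGraph
# whose edges carry a 'prob' attribute (probability of existence).
import random
from collections import deque

def mc_reliability(G, s, t, K=1000):
    hits = 0
    for _ in range(K):
        visited = {s}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            if u == t:
                hits += 1
                break
            for v in G.successors(u):
                # flip the edge's existence coin the first time it is examined
                if v not in visited and random.random() < G[u][v]["prob"]:
                    visited.add(v)
                    queue.append(v)
    return hits / K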


Figure 1: The broad spectrum of reliability problem over uncertain graphs

As depicted in Figure 2, we find serious concerns in the existing experimental comparisons of state-of-the-art reliability estimation algorithms over uncertain graphs. (1) There is no prior work that compared all state-of-the-art methods with each other. It is, therefore, difficult to draw a general conclusion on the superiority and trade-offs of different methods. (2) As shown in Figure 2, with the exception of [4], the experimental studies in [2, 3, 5, 6] either considered a fixed number of samples (e.g., 1000), or limited the maximum number of samples to 2000. However, we observe in our experiments that the number of samples necessary for the convergence of reliability estimation varies considerably depending on the specific algorithm used (e.g., for Recursive Stratified Sampling [3], the number of samples required for convergence is 250∼1000, while for Lazy Propagation [4], it is 500∼1500), and also on the underlying characteristics of the uncertain graph dataset. Therefore, the running time necessary to achieve convergence (and hence, good-quality results) should be reported instead, as opposed to using the same number of samples in all experiments. (3) The metrics used for empirically comparing these techniques were not consistent in the past literature, making them apples-to-oranges comparisons in the larger context. For example, [3] measured the relative variance of different estimators and their running times for the same number of samples (2000). In both [5, 6], the authors reported accuracy (with respect to baseline MC sampling) and running times of different algorithms using a maximum of 1000 samples. On the other hand, [2] compared relative variance, accuracy, and running times of various estimators by considering a fixed number (1000) of samples. In addition, surprisingly, none of these studies reported the online memory usage which, according to our experiments, varies to a great extent. (4) Last but not least, we find certain errors and further scope for optimization in past algorithms (e.g., the accuracy of [4], the time complexity analysis of [5]), and by correcting (or updating) them we significantly improve their performance.


Figure 2:  State-of-the-art reliability estimation algorithms in uncertain graphs: A directed arrow depicts reported superiority in prior works. All algorithms have not been thoroughly compared with each other. Moreover, previous works did not employ identical frameworks, datasets, and metrics for comparison. Thus, it is critical to investigate their trade-offs and superiority over each other.

Through our systematic and in-depth analysis of experimental results, we report surprising findings, such as that many follow-up algorithms can actually be several orders of magnitude less efficient, less accurate, and more memory-intensive than the ones proposed earlier. Our findings can be summarized as follows.

For estimator variance, both recursive methods, RHH and RSS, exhibit significantly better performance than the four MC-based approaches: MC, BFSSharing, ProbTree, and LP+. Methods in the same category share very similar variance. In general, RSS is the best in terms of variance, and achieves the fastest convergence. To achieve convergence, there is no single sample size K that can be used across various datasets and estimators. Usually, recursive methods require about 500 fewer samples than MC-based methods on the same dataset.

For accuracy, all methods have similar relative error (<1.5%) at convergence. If K is set to 1000 for all estimators, some of them might not reach convergence; their relative errors can be further reduced by using a larger K until convergence.

For efficiency, RHH and RSS are the fastest when running time is measured at convergence. When K is set to 1000, there is no common winner in terms of running time. Overall, the running times of RHH, RSS, ProbTree, and LP+ are comparable. BFSSharing is 4× slower than MC, since it estimates the reliability of all nodes from the source node.

The memory usage ranking (in increasing order of memory) is: MC < LP+ < ProbTree < BFSSharing < RHH ≈ RSS. 


Table 1: Summary and recommendation
Figure 3: The decision tree for selecting a proper reliability estimator under different scenarios

Recommendation. Table 1 summarizes the recommendation level of each method according to different performance metrics. The scale is from 1 to 4 stars, with more stars indicating a higher ranking. Clearly, there is no single winner. Considering the various trade-offs, we recommend ProbTree for s-t reliability estimation. It provides good performance in accuracy, online running time, and memory cost. Its index can slightly reduce the variance compared to other MC-based estimators. Notably, we adopted MC as ProbTree's reliability estimating component (as the original paper [6] did). However, one can replace this with any other estimator (e.g., the recursive estimators RHH and RSS) to further improve ProbTree's efficiency and to reduce its variance (as demonstrated in our paper).

The decision tree shown in Figure 3 demonstrates our recommended strategy for estimator selection under different constraints. Following the branches with red ticks, the better estimator(s) under the given conditions can be determined. Notice that the path from the root to the leaf of ProbTree consists of all red ticks, indicating its broad applicability and favorable trade-offs across various factors.


For more information, click here for our paper. Codebase: GitHub 
Blog post contributed by: Xiangyu Ke and Arijit Khan


[Reference]

[1] G. S. Fishman. A Comparison of Four Monte Carlo Methods for Estimating the Probability of s-t Connectedness. IEEE Tran. Rel., 1986.

[2] R. Jin, L. Liu, B. Ding, and H. Wang. Distance-Constraint Reachability Computation in Uncertain Graphs. PVLDB, 4(9):551–562, 2011.

[3] R. Li, J. X. Yu, R. Mao, and T. Jin. Recursive Stratified Sampling: A New Framework for Query Evaluation on Uncertain Graphs. IEEE TKDE, 28(2):468–482, 2016

[4] Y. Li, J. Fan, D. Zhang, and K.-L. Tan. Discovering Your Selling Points: Personalized Social Influential Tags Exploration. In SIGMOD, 2017.

[5] R. Zhu, Z. Zou, and J. Li. Top-k Reliability Search on Uncertain Graphs. In ICDM, 2015.

[6] S. Maniu, R. Cheng, and P. Senellart. An Indexing Framework for Queries on Probabilistic Graphs. ACM Trans. Database Syst., 42(2):13:1–13:34, 2017.

[7] A. Khan, Y. Ye, and L. Chen. On Uncertain Graphs. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2018.

Wednesday, November 7, 2018

Notes on Expressibility of Vertex Centric Graph Processing Paradigm

A summary of the IEEE BigData 2018 research paper by Siyuan Liu and Arijit Khan

We study the vertex-centric (VC) paradigm for distributed graph processing, popularized by Google’s Pregel system [1]. In Pregel, which was inspired by the Bulk Synchronous Parallel (BSP) model [2], graph algorithms are expressed as a sequence of iterations called supersteps (Figure 1). Each superstep is an atomic unit of parallel computation. During a superstep, Pregel executes a user-defined function for each vertex in parallel. The user-defined function specifies the operation at a single vertex v and at a single superstep S. The supersteps are globally synchronous among all vertices, and messages are usually sent along the outgoing edges of each vertex. In 2012, Yahoo! launched Apache Giraph [3] as an open-source project that clones the concepts of Pregel. Since then, there have been several attempts to implement many graph algorithms in a vertex-centric framework, as well as efforts to design optimization techniques for improving their efficiency. Many follow-up works experimentally compared the efficiency and scalability of existing VC systems (for a survey, see [10], [11], [12], [13]). However, to the best of our knowledge, there has not been any systematic study to analyze the expressibility of the VC paradigm itself.
Fig. 1: Vertex-Centric paradigm
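To make the paradigm concrete, here is a toy single-machine simulation of the superstep loop for single-source shortest paths: every vertex reads its incoming messages, updates its value, and sends messages along its out-edges, with a global barrier between supersteps. This mimics the vertex-centric "compute" contract but is not Pregel or Giraph API code.

# Toy simulation of vertex-centric SSSP over supersteps (illustration only).
import math

def vc_sssp(adj, source):
    """adj: {vertex: [(neighbor, edge_weight), ...]}"""
    value = {v: math.inf for v in adj}
    inbox = {v: [] for v in adj}
    inbox[source].append(0.0)                   # initial message to the source
    while any(inbox.values()):                  # halt when no messages remain
        outbox = {v: [] for v in adj}
        for v in adj:                           # "compute" runs per vertex
            if not inbox[v]:
                continue                        # vertex is inactive this superstep
            best = min(inbox[v])
            if best < value[v]:
                value[v] = best
                for u, w in adj[v]:             # send messages along out-edges
                    outbox[u].append(best + w)
        inbox = outbox                          # synchronization barrier
    return value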

In this work, we investigate some of the fundamental limitations of the vertex-centric paradigm itself, by implementing a few distributed graph algorithms that are difficult to express in the VC paradigm. Several graph problems have multiple distributed algorithms, which vary in their efficiency. We observe that not all distributed algorithms for a graph problem can be implemented effectively in the vertex-centric framework. In particular, we focus on two specific classes of algorithms: bucketing-based and multi-phased.

Bucketing-based algorithms rely on a bucket structure to define the priorities of vertices. Vertices are processed in batches, from the highest-priority vertices to the lowest-priority ones, and can be added to or removed from buckets during execution to update their priorities. Example algorithms in this class include the ∆-stepping algorithm [4] for single-source shortest paths, weighted BFS [5], k-core [6], and approximate set cover [7].

Multi-phased algorithms consist of several distinct phases. Each phase may itself be iterative, but the overall execution follows a predefined order of phases. Example algorithms in this category include the distributed Brandes algorithm [8] for betweenness centrality, and algorithms for finding bi-connected and strongly connected components [9].

We find that, often, the more efficient distributed algorithm for a graph problem in the above two categories cannot be implemented effectively in the vertex-centric paradigm. We highlight the key expressibility limitations (e.g., more work, many active vertices, large message overheads, etc.) of the VC paradigm with regard to both bucketing-based and multi-phased algorithms. We conduct extensive experiments to demonstrate these expressibility challenges specific to the vertex-centric paradigm, while also showing that such bottlenecks do not exist in traditional message passing interface (MPI) based distributed systems.

We conclude by discussing our recommendations on the road ahead for VC systems, such as (1) directly incorporating priority-based data structures (e.g., buckets) in each worker node, that will permit defining and dynamically updating priorities of vertices, and (2) providing more control to the master node, so that it can re-activate vertices during phase changes.


For more information, click here for our paper.
Blog post contributed by: Arijit Khan

[Reference]

[1] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A System for Large-scale Graph Processing. In SIGMOD, 2010.
[2] L. G. Valiant. A Bridging Model for Parallel Computation. Commun. ACM, 33(8):103–111, 1990.
[3] Apache Giraph. http://giraph.apache.org/.
[4] U. Meyer and P. Sanders. Δ-stepping: A parallelizable shortest path algorithm. J. Algorithms, 49(1):114–152, 2003.
[5] R. Dial. Algorithm 360: Shortest-path forest with topological ordering [H]. Commun. ACM, 12(11):632–633, 1969.
[6] L. Dhulipala, G. Blelloch, and J. Shun. Julienne: A framework for parallel graph algorithms using work-efficient bucketing. In SPAA, 2017.
[7] G. Blelloch, R. Peng, and K. Tangwongsan. Linear-work greedy parallel approximate set cover and variants. In SPAA, 2011.
[8] U. Brandes. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology, 25(163), 2001.
[9] D. Yan, J. Cheng, K. Xing, Y. Lu, W. Ng, and Y. Bu. Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees. PVLDB, 7(14):1821–1832, 2014.
[10] M. Han, K. Daudjee, K. Ammar, M. T. Ozsu, X. Wang, and T. Jin. An Experimental Comparison of Pregel-like Graph Processing Systems. PVLDB, 7(12):1047–1058, 2014.
[11] A. Khan and S. Elnikety. Systems for Big-Graphs. PVLDB, 7(13):1709– 1710, 2014.
[12] R. R. McCune, T. Weninger, and G. Madey. Thinking Like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. ACM Comput. Surv., 48(2):25:1–25:39, 2015.
[13] D. Yan, Y. Bu, Y. Tian, A. Deshpande, and J. Cheng. Big Graph Analytics Systems. In SIGMOD, 2016.

Wednesday, May 30, 2018

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage


A summary of the USENIX ATC 2018 research paper by Arijit Khan, Gustavo Segovia, and Donald Kossmann

Background: Distributed Graph Querying & Graph Partitioning

Graphs with millions of nodes and billions of edges are ubiquitous for representing highly interconnected structures, including the World Wide Web, social networks, knowledge graphs, genome and scientific databases, and medical and government records. To support online search and query services (possibly from many clients) with low latency and high throughput, data centers and cloud operators consider scale-out solutions, in which the graph and its data are partitioned horizontally across cheap commodity servers. We assume that the graph topology and the data associated with nodes and edges are co-located, since they are often accessed together. In keeping with the modern database trend of supporting low-latency operations, we target a fully in-memory system, and use disks only for durability.
                                 
Fig. 1: State-of-the-art distributed graph querying systems (e.g., SEDGE [1], Trinity [2], Horton [3])

For efficiently answering online queries in a distributed environment, state-of-the-art systems (e.g., SEDGE [1], Trinity [2], Horton [3]) first partition the graph, and then place each partition on a separate server, where query answering over that partition takes place (Fig. 1). Since only the server that contains the query node can handle that request, the router maintains a fixed routing table (or a fixed routing strategy, e.g., modulo hashing). Hence, these systems are less flexible with respect to query routing and fault tolerance; e.g., adding more machines requires updating the data partition and/or the routing table. Besides, an effective graph partitioning in these systems must achieve: (1) workload balancing to maximize parallelism, and (2) locality of data access to minimize network communication. It has been demonstrated in the literature that sophisticated partitioning schemes improve the performance of graph querying compared to an inexpensive hash partitioning. However, due to the power-law degree distribution of real-world graphs, it is difficult to obtain high-quality partitions. Besides, a one-time partitioning cannot cope with later updates to the graph structure or variations in the query workload. Several graph re-partitioning and replication-based strategies have been proposed. However, online monitoring of workload changes, re-partitioning of the graph topology, and migration of graph data across servers are expensive, and they reduce the efficiency and throughput of online querying.

A Decoupled Graph Querying System: Pros and Cons

In contrast to existing systems, we consider a different architecture, which relies less on an effective graph partitioning. Instead, we decouple query processing and graph storage into two separate tiers (Fig. 2). This decoupling happens at a logical level. For example, query processors can be physical machines different from the storage servers. On the other hand, the same physical machine can also run a query processing daemon while storing a graph partition in its main memory as a storage server. However, the logical separation between the two layers is critical in our design.

In a decoupled framework, the graph is partitioned across the servers allocated to the storage tier, and these storage servers hold the graph data in their main memory. Since a query processor is no longer assigned any fixed part of the graph, it is equally capable of handling any request, thus facilitating load balancing and fault tolerance. At the same time, the query router can send a request to any of the query processors, which adds more flexibility to query routing; e.g., more query processors can be added (or a query processor that is down can be replaced) without affecting the routing strategy. Another benefit of the decoupled design is that each tier can be scaled up independently. If a certain workload is processing-intensive, more servers can be allocated to the processing tier. On the contrary, if the graph size increases over time, more servers can be added to the storage tier. This decoupled architecture, being generic, can be employed in many existing graph querying systems.
Fig. 2: Decoupled architecture for graph querying

The idea of decoupling has been effectively used in the past. Facebook implemented a fast caching layer, Memcached, on top of a graph database to scale the performance of graph query answering [4]. Google’s F1 [5] and ScaleDB [6] are based on a decoupling principle for scalability. Recently, Loesing et al. [7] and Binnig et al. [8] demonstrated the benefits of a decoupled, shared-data architecture, together with a low-latency, high-throughput InfiniBand network. Shalita et al. [9] employed decoupling for an optimal assignment of HTTP requests over a distributed graph storage.

However, we must also consider the drawbacks of having the graph storage apart. First, query processors may need to communicate with the storage tier via the network. This includes an additional penalty to the response time for answering a query. Second, it is possible that this design causes high contention rates on either the network, storage tier, or both.

Our contribution lies in designing a smart query routing logic to utilize the cache of query processors over such a decoupled architecture. Achieving more cache hits is critical in a decoupled architecture, in order to reduce the communication between query processors and storage servers. However, this is a non-trivial problem; e.g., exploiting cache locality and balancing workloads are conflicting in nature. For example, to achieve maximum cache locality, the router could send all queries to the same processor (assuming no cache eviction happens). However, the workload of the processors would then be highly imbalanced, resulting in lower throughput. In addition, graph workloads are significantly different from traditional database applications. The interconnected nature of graph data results in poor locality, and each query usually accesses multiple neighboring nodes spread across the distributed storage. Therefore, to maximize cache hit rates at query processors, it is not sufficient to only route queries on the same nodes to the same processor. Rather, successive queries on neighboring nodes should also be routed to the same processor, since the neighborhoods of two nearby nodes may significantly overlap. To the best of our knowledge, such smart query routing schemes for effectively leveraging cache contents were not considered in existing graph querying systems.

h-Hop Graph Traversal Queries:

In this work, we consider online, h-hop queries that explore a small region of the entire graph and require fast response times. These queries usually start with a query node, and traverse its neighboring nodes up to a certain number of hops (e.g., h = 2, 3). Examples include h-hop neighbor aggregation, h-step random walk with restart, and h-hop reachability. These queries are often used as the basis for more complex graph operations. For example, neighborhood aggregation is critical for node labeling and classification; that is, the label of an unlabeled node could be assigned as the most frequent label present within its h-hop neighborhood. The h-step random walk is useful in expert finding, ranking, and discovering functional modules, complexes, and pathways.
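As an example of such a query, h-hop neighbor label aggregation can be written in a few lines (a sketch on a networkx graph, not our system's query interface):

# Sketch: predict a node's label as the most frequent label in its h-hop
# neighborhood; `labels` maps the already-labeled nodes to their labels.
from collections import Counter
import networkx as nx

def h_hop_label(G, v, labels, h=2):
    neighborhood = nx.single_source_shortest_path_length(G, v, cutoff=h)
    votes = Counter(labels[u] for u in neighborhood if u != v and u in labels)
    return votes.most_common(1)[0][0] if votes else None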

Smart Graph Query Routing Schemes:

We identify the following objectives that a smart graph query routing scheme must satisfy.

1. Leverage each processor’s cached data. Let us consider t successive queries received by the router. The router should send them to query processors in a way that maximizes the average number of cache hits at the processors. This, in turn, reduces the average query processing latency. However, as stated earlier, to achieve maximum cache hits, it is not sufficient to only route queries on the same nodes to the same processor. Rather, successive queries on neighboring nodes should also be routed to the same processor, since the neighborhoods of two nearby nodes may significantly overlap.

2. Balance the workload even if it is skewed or contains hotspots. As earlier, let us consider a set of t successive queries. A naïve approach would be to ensure that each query processor receives an equal number of queries, e.g., a round-robin way of query dispatching by the router. However, each query might have a different workload and require a different processing time. We, therefore, aim at maximizing the overall throughput via query stealing, which automatically balances the workload across query processors.

3. Make fast routing decisions. The average time at the router to dispatch a query should be minimized, ideally a small constant time, or much smaller than O(n), where n is the number of nodes in the input graph. This reduces the query processing latency.

4. Have low storage overhead in the router. The router may store auxiliary data to enable fast routing decisions. However, this additional storage overhead must be a small fraction compared to the graph size.

Based on the above requirements, we design two smart routing strategies: landmark routing and graph embedding-based routing. Kleinberg et al. [10] discussed the problem of approximating graph distances using a small set of beacons (i.e., landmarks). Graph embedding algorithms, on the other hand, place nodes at points on some surface such that the inherent graph structure is preserved [11, 12]. In this post, we briefly discuss our graph embedding-based query routing strategy: we embed the graph into a lower-dimensional (D-dimensional) Euclidean space such that the hop-count distance between graph nodes is approximately preserved by their Euclidean distance (Fig. 3). Since each node receives D co-ordinates, this requires O(nD) space in total at the router, which is linear in the number of nodes.

Fig. 3: Example of graph embedding in 2D Euclidean plane

In embed routing, the router has access to each node's co-ordinates. By keeping an average of the co-ordinates of the query nodes that it sent to each processor, it is able to infer the cache contents of these processors. As a consequence, the router computes the distance between a query node u and a processor p, denoted as d(u, p), and defined as the distance of the query node's co-ordinates to the historical mean of the processor's cache contents. Since recent queries are more likely to influence the cache contents due to the LRU eviction policy, we use an exponential moving average (i.e., recent queries are given more weight) to compute the mean of the processor's cache contents. We select the processor with the smallest d(u, p). Observe that the routing decision time is only O(PD), where P is the number of processors and D the number of dimensions. Furthermore, the distance metric d(u, p) is useful not only for finding the best processor for a certain query, but also for load balancing and for dealing with workload skew and hotspots. As an example, let us assume that the closest processor for a certain query is busy, or is currently down. Since the distance metric gives us distances to all processors, the router is able to select the second-closest, third-closest, and so on, processor (query stealing). This form of load balancing affects nearby query nodes in the same way, hence they are routed to the same query processor, thereby leveraging neighborhood-based locality.
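A condensed sketch of this routing rule is given below. The embedding (node to D-dimensional co-ordinates) is assumed to be precomputed; the class name, the cold-start handling, and the smoothing factor alpha are illustrative choices rather than our exact implementation.

# Sketch: embedding-based routing. The router keeps, per processor, an
# exponential moving average (EMA) of the co-ordinates of the query nodes sent
# to it, and routes each new query to the processor whose EMA is closest.
import math

class EmbedRouter:
    def __init__(self, coords, num_processors, alpha=0.3):
        self.coords = coords                   # node -> D-dimensional co-ordinates
        self.alpha = alpha                     # weight given to the newest query
        self.means = [None] * num_processors   # EMA of each processor's cache

    def route(self, query_node, busy=()):
        c = self.coords[query_node]
        def dist(p):
            m = self.means[p]
            return -1.0 if m is None else math.dist(c, m)   # unused processors first (cold start)
        ranked = sorted(range(len(self.means)), key=dist)
        # query stealing: skip busy processors, fall back to the next closest
        target = next((p for p in ranked if p not in busy), ranked[0])
        m = self.means[target]
        self.means[target] = list(c) if m is None else [
            self.alpha * ci + (1 - self.alpha) * mi for ci, mi in zip(c, m)]
        return target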

Our experiments with several real-world, large graphs demonstrate that the proposed framework, gRouting, even with simple hash partitioning, achieves up to an order of magnitude better query throughput compared to existing distributed graph systems that employ expensive graph partitioning and re-partitioning strategies. In addition to workload balancing and deployment flexibility, gRouting provides linear scalability in throughput with an increasing number of query processors, works well in the presence of query hotspots, and is adaptive to workload changes and graph updates.

For more information, click here for our paper.
Blog post contributed by: Arijit Khan

[Reference]

[1] YANG, S., YAN, X., ZONG, B., AND KHAN, A. Towards Effective Partition Management for Large Graphs. In SIGMOD (2012).
[2] SHAO, B., WANG, H., AND LI, Y. Trinity: A Distributed Graph Engine on a Memory Cloud. In SIGMOD (2013).
[3] SARWAT, M., ELNIKETY, S., HE, Y., AND MOKBEL, M. F. Horton+: A Distributed System for Processing Declarative Reachability Queries over Partitioned Graphs. PVLDB 6, 14 (2013). 
[4] NISHTALA, R., FUGAL, H., GRIMM, S., KWIATKOWSKI, M., LEE, H., LI, H. C., MCELROY, R., PALECZNY, M., PEEK, D., SAAB, P., STAFFORD, D., TUNG, T., AND VENKATARAMANI, V. Scaling Memcache at Facebook. In NSDI (2013).
[5] SHUTE, J., VINGRALEK, R., SAMWEL, B., HANDY, B., WHIPKEY, C., ROLLINS, E., OANCEA, M., LITTLEFIELD, K., MENESTRINA, D., ELLNER, S., CIESLEWICZ, J., RAE, I., STANCESCU, T., AND APTE, H. F1: A Distributed SQL Database That Scales. PVLDB 6, 11 (2013).
[6] http://scaledb.com/pdfs/TechnicalOverview.pdf
[7] LOESING, S., PILMAN, M., ETTER, T., AND KOSSMANN, D. On the Design and Scalability of Distributed Shared-Data Databases. In SIGMOD (2015).
[8] BINNIG, C., CROTTY, A., GALAKATOS, A., KRASKA, T., AND ZAMANIAN, E. The End of Slow Networks: It’s Time for a Redesign. PVLDB 9, 7 (2016).
[9] SHALITA, A., KARRER, B., KABILJO, I., SHARMA, A., PRESTA, A., ADCOCK, A., KLLAPI, H., AND STUMM, M. Social Hash: An Assignment Framework for Optimizing Distributed Systems Operations on Social Networks. In NSDI (2016).
[10] KLEINBERG, J., SLIVKINS, A., AND WEXLER, T. Triangulation and Embedding Using Small Sets of Beacons. J. ACM 56, 6 (2009). 
[11] DABEK, F., COX, R., KAASHOEK, F., AND MORRIS, R. Vivaldi: A Decentralized Network Coordinate System. In SIGCOMM (2004).
[12] ZHAO, X., SALA, A., WILSON, C., ZHENG, H., AND ZHAO, B. Y. Orion: Shortest Path Estimation for Large Social Graphs. In WOSN (2010).