Tutorial

Cloudlight Tutorial

This work was financed by CONICET (Argentinian Council of Research), in the context of a phd grant awarded to José Orlicki under supervision of José Ignacio Alvarez-Hamelin and Pablo Fierens (Beca Tipo I CONICET, Convocatoria 2009).

What is Cloudlight?

In the absence of sun on a cloudy overcast day you have cloudlight.

Cloudlight is a Python library designed for experimental support of the following features:

What is new about Cloudlight?

The original features of Cloudlight are:

Downloading

Cloudlight is only available through it's Subversion repository.

On *nix:

svn co http://svn2.assembla.com/svn/cloudlight/ /your/favorite/localpath

Unless you add /your/favorite/localpath/src (notice the src folder) to you command path or install the folder inside your Python installation Cloudlight can only be imported when you are located inside folder/your/favorite/localpath/src.

Cloudlight was tested on Python 2.5-2.6.

>>> import cloudlight
>>>
>>> dir(cloudlight)

Dependencies

Basic mandatory dependencies include:

Some functionality depends on other Python libraries or Python library wrappers:

Hello World!

Let's start with a simple example. As Cloudlight extends Graph and DiGraph classes from Networkx basic operations on graphs are the same.

>>> from cloudlight import DiGraph
>>> dg = DiGraph()
>>> dg.add_edge('Alice', 'Bob', {'message' : 'Hello World!'} )
>>> dg.edges()
[('Alice', 'Bob')]
>>> dg['Alice']['Bob']['message']
'Hello World!'

Notice that as we are using a directed graph then ordering of nodes when creating and printing edges is relevant.

Loading an edgelist file

>>> from cloudlight import Graph
>>> from cStringIO import StringIO
>>>
>>> example_txt = '''
... Alice Bob
... Alice Eve
... Eve Mallory
... '''
>>> graph = Graph()
>>> graph.load_edgelist( StringIO(example_txt) )
>>> graph.edges()
[('Mallory', 'Eve'), ('Bob', 'Alice'), ('Alice', 'Eve')]

In this case an undirected graph is created and some edges are added using a newline-separated text as input.

NetworkX Support and BigGraph

For in-memory graphs (Graph and DiGraph) complete support of methods and behavior from NetworkX is available. But for in-disk graph (BigGraph and BigDiGraph) node and edges attributes are not supported yet (except for integral edge weight that it is supported).

NetworkX Documentation

BigGraph implementation is designed to support iterator methods efficiently without consuming RAM memory except for SQL DBMS caches.

By default, BigGraph disk storage is located with extension .ldb on temporary files inside folder /tmp/, but you can specify the file you want to use. Temporary files are deleted when the class instance (object) is destroyed by Python. Graphs with and ad-hoc filename are not destroyed by Python and can live through many Python interpreter runs. Meaning that you can load previously created graph file with specific file names.

from cloudlight import BigGraph

# delete file after usage
temporary_disk_graph = BigGraph()

# no file deletion after usage
special_disk_graph = BigGraph('/home/user/mygraph.ldb')

BiGraph was stress-tested with graphs of over 50 million edges. Non-local non-linear time complexity analysis are discouraged with big graphs.

Notice: no repeated edges are allowed when indexing edges loaded into a BigGraph instance, so edge lists must be previously and independently curated to avoid edge repetition.

Symmetrizing a directed graph

We can extract symmetric (both directions occur) links from a directed graph into an undirected graph.

An extra method is needed to index graph edges if you want to achieve optimal performance when loading big graphs into the database, it is called Graph.create_indices(). Nodes are indexed by default at the moment of graph creation.

from cStringIO import StringIO
from cloudlight import BigGraph, BigDiGraph
from cloudlight.tests.data import example_txt

digraph = BigDiGraph()
digraph.load_edgelist(StringIO(example_txt))
digraph.create_indices()

graph = BigGraph()
digraph.add_only_symmetric_edgelist(graph)
graph.create_indices()

The same results can be obtained using shell tools directly over the edgelist text files. Surely shell tools for simpler tasks like this will ran faster.

cat graph.txt | awk -F'[ ]' '{ if ($1 <= $2) print $1, $2; else print $2, $1 }' | sort | uniq -d > graph.txt.symm

Loading and saving compressed graphs

To save some space in disk graphs can be encoded and compressed. Space improvements of 50% were observed on tests. Techniques are basic frequency encoding and Zempel-Liv algorithms (zlib).

from cloudlight import BigGraph
import cloudlight.tests.data_enc1

graph = BigGraph()
use_big_alphabet = False
graph.load_compressed_graph(cloudlight.tests.data_enc1, use_big_alphabet, False)
graph.create_indices()

graph.save_compressed_graph( '/tmp/compressed_graph.ldb', use_big_alphabet)

The encoding uses the position of the node ordered from biggest to smallest node degree. The flag use_big_alphabet is used for extending the alphabet in the encoding to 91 symbols. Otherwise only decimal symbols are used.

Complex graph node parameters or centralities

To analyze complex networks you compute values of properties from nodes and then built frequency distributions. Some properties are inherited from NetworkX and other are only included in Cloudlight. Edge properties are sometimes useful but analyzing is more common and there are also techniques to convert a graph into it's dual graph where nodes represent edges and viceversa. We will focus on BigGraph where only iterative methods are implemented, if you want to use non-iterative methods go for Graph. Usually only degree, betweenness, closeness and eigenvector values are considered centralities but the other parameters can also be considered centralities as they are relative to the node analysis we are making.

from cloudlight import Graph
import cloudlight.tests.data_enc1

graph = Graph()
graph.load_compressed_graph(cloudlight.tests.data_enc1, use_big_alphabet=False, has_num=False)

of_a_node = graph.nodes()[0]
print graph.degree( of_a_node )
print graph.clustering_indices( [of_a_node] )
print graph.average_neighbor_degrees( [of_a_node] )
print graph.eccentricities( [of_a_node] )
print graph.max_internal_scaling( [of_a_node] )
print graph.max_connectivity( [of_a_node] )

Warning: some parameters need edge indexes for BigGraph to work correctly, beware! Use BigGraph.create_indices() before!

Scripting all the node parameters and plotting

To automatically script the analysis of an undirected graph, in the case of a small graph (about 10^5 edges) you can use script complete_analysis.py with the following command shells parameters:

  1. filename: text files in edgelist format (one edge per line, node name separated by spaces).
  2. destination_folder: target folder where the numeric results and plots of the analysis will be stored.
  3. complex_network_parameters: optional, comma-separated list of complex networks parameters you want to analysis. By default only degree, average neighbor degree (knn) and clustering index (clustering) are analyzed. Other possible (and more expensive analysis) can include kcore, eccentricity, path_len, scaling and connectivity. Check out previous section for details on the parameters available.

The script computes the analytical parameters and the plots distributions and 2D comparisons of the different complex network parameters. Prints plots in PNG and PostScript formats, also raw text files with the numerical results with nodes in the same order as in the nodes.txt file (also dumped). Finally an example from the command line:

$ mkdir /home/cpetruza/analysis/

$ python ./complete_analysis.py /home/cpetruza/smallgraph.txt /home/cpetruza/analysis/ degree,clustering,kcore

INFO: FINISH INPUT load_edgelist(), link count = 10000
INFO: 1000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 1 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 2000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 2 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 3000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 3 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 4000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 4 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 5000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 5 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 6000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 6 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 7000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 7 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 8000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 8 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 9000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 9 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: complete graph analysis, sample size = 1000 , number of bin = 8
INFO: computing degree...
INFO: degree 1 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT degrees(), node count = 1000
INFO: computing clustering...
INFO: clustering 0.0 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT clustering_indices(), node count = 1000
INFO: computing kcore...
INFO: kcore 1 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT kcoreness(), node count = 1000
INFO: plotting parameter degree ...
INFO: plotting parameter kcore ...
INFO: plotting parameter clustering ...
INFO: plotting parameters degree and kcore ...
INFO: plotting parameters kcore and degree ...
INFO: plotting parameters clustering and degree ...
INFO: plotting parameters degree and clustering ...
INFO: plotting parameters clustering and kcore ...
INFO: plotting parameters kcore and clustering ...
$ cd /home/cpetruza/analysis

$ ls

analysis.txt           clustering-kcore.eps  degree-clustering.eps  degree-kcore.png      kcore-clustering.png  kcore.png
clustering-degree.eps  clustering-kcore.png  degree-clustering.png  degree.png            kcore-degree.eps      kcore.txt
clustering-degree.png  clustering.png        degree.eps             degree.txt            kcore-degree.png      nodes.txt
clustering.eps         clustering.txt        degree-kcore.eps       kcore-clustering.eps  kcore.eps

Notice: remember that as said on previous sections plotting features depend on libraries Numpy, Scipy and Matplotlib.

Scripting all the node parameters and plotting for BigGraph (larger graphs)

In the case of larger graphs (size > 10^5 edges) is recommended to use the BigGraph version of the complete analysis script.

Notice: no repeated edges are allowed when indexing edges loaded into a BigGraph instance, so edge lists must be previously and independently curated to avoid edge repetition.

Call Bash script load_and_analyze.sh to fully automate the process, that is divided in the following steps. Can be manually and separately activated if necessary:

  1. build_db.py : to load and index the edgelist text file, also indexing edges.
  2. index_db_all.py : create all indexes for node parameters.
  3. complete_big_analysis.sh : dumping of node parameters and numeric processing for plotting.

Basic node parameters included by default are:

Another, more expensive, node parameters that can be included (uncommenting directly on the file) are:

Dictionary parameters on file complete_big_analysis.py sets the max number of bins included in the plots.

Finally, an example. Notice that only the input edgelist file and the maximum number of edges to be included must be specified:

bash ./load_and_analyse.sh /tmp/largegraph.txt 100000

Too many scripts for too many funny things

A complete list of script implemented follows:

Scripting lookahead privacy simulations

Some specific privacy attack simulations are implemented in Cloudlight. Some are identical to lookahead privacy attack involving heuristics using a-priori visible degree of the nodes (Korolova et al, 2008). Other heuristics implemented are based on a privacy model where degree is not a-priori known. In the models tested an attack bribes a node to cover the local neighborhood of the node construction an attacker-visible subgraph, so the important decision is in which order the nodes are bribed.

Full-blown privacy simulations (including indexing of the network) can be launched with Bash script privacy_run.sh. Parameters include, in this order:

  1. input file with edgelist, undirected edges without repetitions.
  2. output folder for results and indexed network.
  3. filename for results and indexed network, file extensions are automatically appended.
  4. maximum number of input edges you want to include in the simulation.
  5. attack heuristic set, as heuristics are classified into passive, triangles and no_degree. Detailed later in this section.
  6. coverage types (comma separated) used to measure progress of the attack: node, link, complete_node or triangle. Node and link coverage is straight forward to understand, complete node coverage involves the number of nodes with all of their incident edges covered/visible. Triangle coverage measure the percentage of node triangles covered.
  7. end coverage 50, 60, 95, etc, is the halting condition, when the coverage specified reaches the end coverage the simulation ends.
  8. "index" or something not-null to index from scratch, if you avoid this optional parameter then it is assumed that you have previously indexed the same dataset with the same parameters and your want to safe some time.

Now we detail the global concepts behind the heuristics. For specific details read the source (privacy.py and privacy_attack.py).

Heuristic set passive involves heuristics degree, greedy, crawler and random. All based on heuristic described in bibliography (Korolova, 2008).

Heuristic set triangle involves heuristics triangles, greedy_triangles, crawler_triangles and crawler_seen_triangles. The heuristics are similar to degree heuristics but assuming an a-priori available table of node versus triangles where the node is involved (not seen in practice in social networks). Also a table of unseen_triangles is used in some cases. For crawler_seen_triangles no a-priori table of triangles is assumed, the seen triangles are seen as the attack progresses.

Heuristic set no_degree involves three heuristics that do not use an a-priori table of degree but constructs tables of seen degrees and histograms as the attack moves forward.

An example:

./privacy_run.sh /home/cpetruza/lj-200k.edgelist /home/cpetruza/results/ lj-200k-files 200000 passive node,link,korolova 40 index

This script indexed the networks, the parameters needed and also calls script test_big_privacy.py to start the simulation.

The output file with simulation results (see following excerpt) can be processed and plotted with script plot_effort.py (see source code foradditional details regarding script parameters and magic numbers). Lookahead 1 means that nodes have a visibility of nodes distant 2 hops from the position in the network.

# looahead coverage heuristic-strategy #bribed_nodes %bribed_nodes date

1  0.10  start_degree  2  0.004193  Tue_Jul_20_19:06:06_2010
1  0.10  start_greedy  2  0.004193  Tue_Jul_20_19:06:06_2010
1  0.45  start_random  26  0.054507  Tue_Jul_20_19:06:06_2010
...

...

Graph Bots, online crawling and interaction with the Web

Basic experimental support for collecting complex networks from the Internet is included. It is based on abstract classes of agents that implement:

Details of the interfaces and what methods should be implemented are detailed within the source code of the bot modules.

In the following example a bot called DegreeBot is constructed using multiple inheritance to compute the sum of the degrees (that is the same as two times the number of edges). NodeIterator only iterates the nodes, DegreeVisitor computes the degree of each node visited and SumBuilder sums the results from the Visitor.

from cloudlight import Graph
from cloudlight import Node, Bot, NodeIterator, DegreeVisitor, SumBuilder

g = Graph()
n1 = Node()
n2 = Node()
n3 = Node()
n4 = Node()
g.add_edge(n1, n2)
g.add_edge(n2,n3)
g.add_edge(n3,n1)
g.add_edge(n3,n4)

class DegreeBot(NodeIterator, DegreeVisitor, SumBuilder, Bot): pass

bot = DegreeBot()
assert bot.process(g) == 8.0

Now let's use the interfaces to implement a basic crawler for Twitter following and Facebook friends, at the same time!

from cloudlight import Graph
from cloudlight import Node, Bot, GraphIterator, GraphBuilder
from cloudlight import FacebookNode, FacebookFriendVisitor, FacebookBrowser
from cloudlight import TwitterNode, TwitterFriendVisitor, TwitterBrowser

g = Graph()
g.add_node( FacebookNode(id=100002106847654) )
g.add_node( TwitterNode('cfkargentina') )

# automatically choose which visitor to use depending on the node type
class MyVisitor(FacebookFriendVisitor, TwitterFriendVisitor): pass

class MyBot(GraphIterator, MyVisitor, GraphBuilder, Bot): pass
# is the same as
#from cloudlight import CopyBot
#class MyBot(CopyBot, MyVisitor): pass

bot = MyBot()
bot.twitter_browser = TwitterBrowser()
# facebook browser is not very stable
bot.facebook_browser = FacebookBrowser()
bot.facebook_browser.initialize(user='alice.private.life@gmail.com', passw='asdfasdf0')

new_graph = bot.process(g)

print new_graph.number_of_edges()

In this case we use GraphBuilder or CopyBot to construct a new graph but NullBuilder is used when you need to build nothing or you assume useful collateral effects generated by the visitor.