Version 80, last updated by therm000 at 20110408
Cloudlight Tutorial
This work was financed by CONICET (Argentinian Council of Research), in the context of a PhD grant awarded to José Orlicki under the supervision of José Ignacio Alvarez-Hamelin and Pablo Fierens (Beca Tipo I CONICET, Convocatoria 2009).
What is Cloudlight?
In the absence of sun on a cloudy, overcast day, you have cloudlight.
Cloudlight is a Python library designed for experimental support of the following features:
- Flexible manipulation of data from large complex networks/graphs, for example collecting and analyzing the Twitter network of followers. Undirected graphs have the richest interface; directed graphs are partially supported.
- Analysis based on the theory of complex networks.
- Experimental support for crawlers and for simulations of events within complex networks.
What is new about Cloudlight?
The original features of Cloudlight are:
- Finite computation of the internal scaling dimension and connectivity dimension of graphs, based on concepts of fractal dimension on complex graphs (paper).
- Statistical analysis of node neighborhood size for persistent complex graphs.
- Automated analysis of all statistical graph parameters included in the library, with 2D plotting. This includes degree, clustering index, average neighbor degree, eccentricity, average path length, internal scaling, connectivity and shell index.
- Implementation of heuristic simulations of privacy attacks on social network systems, based on heuristics from the literature and centered on scenarios where node degree is a priori visible (paper).
Downloading
Cloudlight is only available through its Subversion repository.
On *nix:
svn co http://svn2.assembla.com/svn/cloudlight/ /your/favorite/localpath
Unless you add /your/favorite/localpath/src (notice the src folder) to your command path, or install the folder inside your Python installation, Cloudlight can only be imported when you are located inside the folder /your/favorite/localpath/src.
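Alternatively, you can make the checkout importable from any directory by extending sys.path at runtime (the path below is the placeholder checkout location from above; substitute your own):

```python
import sys

# Make the checkout importable from anywhere (adjust to your local path).
sys.path.insert(0, '/your/favorite/localpath/src')

# After this, `import cloudlight` works regardless of the current directory.
```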
Cloudlight was tested on Python 2.5 and 2.6.
>>> import cloudlight
>>>
>>> dir(cloudlight)
Dependencies
Basic mandatory dependencies include:
- Python 2.5 or 2.6.
- NetworkX 1.0.1 (or newer).
Some functionality depends on other Python libraries or Python library wrappers:
- SQLite 3: used by the persistent graph class BigGraph.
- NumPy, SciPy, Matplotlib: used by class Plot for custom graph parameter plotting.
- Matplotlib: used by class Graph for graph visualization.
- Mechanize: used by the Bot classes and by some classes derived from Node to collect Web information.
- The Python multiprocessing library: used by the script privacy_run.sh and by test_big_privacy.py to natively parallelize the simulations, launching many simultaneous processes.
Hello World!
Let's start with a simple example. As Cloudlight extends the Graph and DiGraph classes from NetworkX, basic operations on graphs are the same.
>>> from cloudlight import DiGraph
>>> dg = DiGraph()
>>> dg.add_edge('Alice', 'Bob', {'message' : 'Hello World!'} )
>>> dg.edges()
[('Alice', 'Bob')]
>>> dg['Alice']['Bob']['message']
'Hello World!'
Notice that, since we are using a directed graph, the ordering of nodes when creating and printing edges is relevant.
Loading an edgelist file
>>> from cloudlight import Graph
>>> from cStringIO import StringIO
>>>
>>> example_txt = '''
... Alice Bob
... Alice Eve
... Eve Mallory
... '''
>>> graph = Graph()
>>> graph.load_edgelist( StringIO(example_txt) )
>>> graph.edges()
[('Mallory', 'Eve'), ('Bob', 'Alice'), ('Alice', 'Eve')]
In this case an undirected graph is created and some edges are added using newline-separated text as input.
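Conceptually, loading an edgelist is just parsing one whitespace-separated node pair per line; a stdlib-only sketch (not Cloudlight's actual implementation):

```python
from io import StringIO

def parse_edgelist(fileobj):
    """Yield (node, node) pairs from a newline-separated edgelist."""
    for line in fileobj:
        parts = line.split()
        if len(parts) == 2:          # skip blank or malformed lines
            yield (parts[0], parts[1])

example_txt = u'''
Alice Bob
Alice Eve
Eve Mallory
'''
edges = list(parse_edgelist(StringIO(example_txt)))
print(edges)  # [('Alice', 'Bob'), ('Alice', 'Eve'), ('Eve', 'Mallory')]
```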
NetworkX Support and BigGraph
For in-memory graphs (Graph and DiGraph), complete support of the methods and behavior of NetworkX is available. For on-disk graphs (BigGraph and BigDiGraph), node and edge attributes are not supported yet (except for integer edge weights, which are supported).
The BigGraph implementation is designed to support iterator methods efficiently, without consuming RAM beyond the SQL DBMS caches.
By default, BigGraph disk storage is located in temporary files with extension .ldb inside the folder /tmp/, but you can specify the file you want to use. Temporary files are deleted when the class instance (object) is destroyed by Python. Graphs with an ad hoc filename are not deleted by Python and can live through many Python interpreter runs, meaning that you can load previously created graph files with specific file names.
from cloudlight import BigGraph
# delete file after usage
temporary_disk_graph = BigGraph()
# no file deletion after usage
special_disk_graph = BigGraph('/home/user/mygraph.ldb')
BigGraph was stress-tested with graphs of over 50 million edges. Non-local analyses with non-linear time complexity are discouraged on big graphs.
Notice: no repeated edges are allowed when indexing edges loaded into a BigGraph instance, so edge lists must be previously and independently curated to avoid edge repetition.
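A minimal curation sketch in plain Python (not part of Cloudlight) that removes duplicate undirected edges before loading:

```python
def dedupe_undirected(edges):
    """Keep each undirected edge once, normalizing (b, a) to (a, b)."""
    seen = set()
    result = []
    for a, b in edges:
        key = (a, b) if a <= b else (b, a)   # order-insensitive key
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result

raw = [('Alice', 'Bob'), ('Bob', 'Alice'), ('Alice', 'Eve')]
print(dedupe_undirected(raw))  # [('Alice', 'Bob'), ('Alice', 'Eve')]
```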
Symmetrizing a directed graph
We can extract symmetric (both directions occur) links from a directed graph into an undirected graph.
An extra method is needed to index graph edges if you want to achieve optimal performance when loading big graphs into the database, it is called Graph.create_indices(). Nodes are indexed by default at the moment of graph creation.
from cStringIO import StringIO
from cloudlight import BigGraph, BigDiGraph
from cloudlight.tests.data import example_txt
digraph = BigDiGraph()
digraph.load_edgelist(StringIO(example_txt))
digraph.create_indices()
graph = BigGraph()
digraph.add_only_symmetric_edgelist(graph)
graph.create_indices()
The same result can be obtained using shell tools directly on the edgelist text files. For simple tasks like this, shell tools will usually run faster:
cat graph.txt | awk -F'[ ]' '{ if ($1 <= $2) print $1, $2; else print $2, $1 }' | sort | uniq -d > graph.txt.symm
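An equivalent pure-Python sketch of the symmetric-link extraction (assumed semantics: keep only pairs that appear in both directions, normalized to one undirected edge):

```python
def symmetric_edges(directed_edges):
    """Return undirected edges (a, b) such that both a->b and b->a occur."""
    edge_set = set(directed_edges)
    out = set()
    for a, b in edge_set:
        if (b, a) in edge_set and a != b:
            out.add((a, b) if a <= b else (b, a))
    return sorted(out)

edges = [('Alice', 'Bob'), ('Bob', 'Alice'), ('Alice', 'Eve')]
print(symmetric_edges(edges))  # [('Alice', 'Bob')]
```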
Loading and saving compressed graphs
To save some disk space, graphs can be encoded and compressed. Space improvements of 50% were observed in tests. The techniques used are basic frequency encoding and the Lempel-Ziv algorithm (zlib).
from cloudlight import BigGraph
import cloudlight.tests.data_enc1
graph = BigGraph()
use_big_alphabet = False
graph.load_compressed_graph(cloudlight.tests.data_enc1, use_big_alphabet, False)
graph.create_indices()
graph.save_compressed_graph( '/tmp/compressed_graph.ldb', use_big_alphabet)
The encoding uses the position of the node when ordered from biggest to smallest node degree. The flag use_big_alphabet extends the alphabet of the encoding to 91 symbols; otherwise only decimal digits are used.
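The spirit of the encoding can be sketched with the stdlib: rename nodes by descending degree rank so that frequent nodes get short decimal identifiers, then deflate the text (a hypothetical simplification; Cloudlight's actual on-disk format differs):

```python
import zlib
from collections import Counter

def encode_edgelist(edges):
    """Rename nodes by descending degree rank, then zlib-compress the text."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    # Highest-degree nodes get the smallest (shortest) decimal ids.
    rank = {n: i for i, (n, _) in
            enumerate(sorted(degree.items(), key=lambda kv: -kv[1]))}
    text = '\n'.join('%d %d' % (rank[a], rank[b]) for a, b in edges)
    return zlib.compress(text.encode('ascii')), rank

edges = [('Alice', 'Bob'), ('Alice', 'Eve'), ('Eve', 'Mallory')]
blob, rank = encode_edgelist(edges)
decoded = zlib.decompress(blob).decode('ascii')
print(decoded.count('\n') + 1)  # 3 encoded edges round-trip
```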
Complex graph node parameters or centralities
To analyze complex networks you compute the values of properties of nodes and then build frequency distributions. Some properties are inherited from NetworkX and others are only included in Cloudlight. Edge properties are sometimes useful, but node analysis is more common; there are also techniques to convert a graph into its dual graph, where nodes represent edges and vice versa. We will focus on BigGraph, where only iterative methods are implemented; if you want to use non-iterative methods, go for Graph. Usually only degree, betweenness, closeness and eigenvector values are considered centralities, but the other parameters can also be considered centralities, as they are relative to the node analysis we are making.
- Degree: number of edges incident to the vertex; degree() for one node and degrees_iter() for many nodes. Local (efficiently computed) algorithm.
- Clustering coefficient: the global clustering coefficient is based on triplets of nodes; clustering() for one node and clustering_indices_iter() for many nodes. Local (efficiently computed) algorithm.
- Average neighbor degree: used to correlate the degree of a node with that of its neighbors; average_neighbor_degree() for one node, average_neighbor_degrees_iter() for many nodes. Local (efficiently computed) algorithm.
- Eccentricity: the greatest geodesic distance between v and any other vertex; use eccentricities() or eccentricities_iter(). Global algorithm: O(number of edges).
- Average path length: as the name indicates, the average path length between a node and all the nodes in the same connected component; use average_path_lengths_iter() or average_path_lengths(). Global algorithm: O(number of edges).
- Maximum internal scaling: a finite approximation to the logarithmic growth of the node neighborhood; use max_internal_scaling_iter() or max_internal_scaling(). Global algorithm: O(number of edges). Also related to this paper.
- Maximum connectivity: the same as the previous, but for increasing neighborhood differences; use max_connectivity_iter() or max_connectivity(). Global algorithm: O(number of edges).
- Shell index: see the Network Workbench Wiki for a definition and bibliography; use kcoreness_iter() or kcoreness().
from cloudlight import Graph
import cloudlight.tests.data_enc1
graph = Graph()
graph.load_compressed_graph(cloudlight.tests.data_enc1, use_big_alphabet=False, has_num=False)
of_a_node = graph.nodes()[0]
print graph.degree( of_a_node )
print graph.clustering_indices( [of_a_node] )
print graph.average_neighbor_degrees( [of_a_node] )
print graph.eccentricities( [of_a_node] )
print graph.max_internal_scaling( [of_a_node] )
print graph.max_connectivity( [of_a_node] )
Warning: some parameters need edge indexes to work correctly with BigGraph. Use BigGraph.create_indices() first!
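For intuition, the local parameters above are easy to compute by hand on a toy adjacency structure; a pure-Python sketch of degree and clustering coefficient (not Cloudlight's implementation):

```python
def clustering(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves linked."""
    neigh = list(adj[node])
    k = len(neigh)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if neigh[j] in adj[neigh[i]])
    return 2.0 * links / (k * (k - 1))

# Toy graph: triangle a-b-c plus a pendant node d attached to c.
adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(len(adj['c']))         # degree of 'c': 3
print(clustering(adj, 'c'))  # 1 of 3 neighbor pairs linked: 0.333...
```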
Scripting all the node parameters and plotting
To automatically script the analysis of an undirected graph, in the case of a small graph (about 10^5 edges), you can use the script complete_analysis.py with the following command-line parameters:
- filename: text file in edgelist format (one edge per line, node names separated by spaces).
- destination_folder: target folder where the numeric results and plots of the analysis will be stored.
- complex_network_parameters: optional, comma-separated list of the complex network parameters you want to analyze. By default only degree, average neighbor degree (knn) and clustering index (clustering) are analyzed. Other possible (and more expensive) analyses include kcore, eccentricity, path_len, scaling and connectivity. Check the previous section for details on the parameters available.
The script computes the analytical parameters and then plots distributions and 2D comparisons of the different complex network parameters. It prints the plots in PNG and PostScript formats, plus raw text files with the numerical results, with nodes in the same order as in the nodes.txt file (also dumped). Finally, an example from the command line:
$ mkdir /home/cpetruza/analysis/
$ python ./complete_analysis.py /home/cpetruza/smallgraph.txt /home/cpetruza/analysis/ degree,clustering,kcore
INFO: FINISH INPUT load_edgelist(), link count = 10000
INFO: 1000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 1 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 2000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 2 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 3000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 3 its. | 9 total , 0.0 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 4000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 4 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 5000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 5 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.1 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 6000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 6 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 7000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 7 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 8000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 8 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: 9000 nodes processed in index_parameter_generic, param_name degree
(Wed Feb 23 16:10:02 2011) 9 its. | 9 total , 0.1 secs (0.0 mins) elapsed | 0.0 secs (0.0 mins) per it. | 0.0 secs (0.0 mins) left (ending Wed Feb 23 16:10:02 2011)
INFO: complete graph analysis, sample size = 1000 , number of bin = 8
INFO: computing degree...
INFO: degree 1 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT degrees(), node count = 1000
INFO: computing clustering...
INFO: clustering 0.0 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT clustering_indices(), node count = 1000
INFO: computing kcore...
INFO: kcore 1 for node 2517095 computed (1 of 9074).
INFO: FINISH OUTPUT kcoreness(), node count = 1000
INFO: plotting parameter degree ...
INFO: plotting parameter kcore ...
INFO: plotting parameter clustering ...
INFO: plotting parameters degree and kcore ...
INFO: plotting parameters kcore and degree ...
INFO: plotting parameters clustering and degree ...
INFO: plotting parameters degree and clustering ...
INFO: plotting parameters clustering and kcore ...
INFO: plotting parameters kcore and clustering ...
$ cd /home/cpetruza/analysis
$ ls
analysis.txt clustering-kcore.eps degree-clustering.eps degree-kcore.png kcore-clustering.png kcore.png
clustering-degree.eps clustering-kcore.png degree-clustering.png degree.png kcore-degree.eps kcore.txt
clustering-degree.png clustering.png degree.eps degree.txt kcore-degree.png nodes.txt
clustering.eps clustering.txt degree-kcore.eps kcore-clustering.eps kcore.eps
Notice: remember that, as said in previous sections, the plotting features depend on the libraries NumPy, SciPy and Matplotlib.
Scripting all the node parameters and plotting for BigGraph (larger graphs)
In the case of larger graphs (more than 10^5 edges), it is recommended to use the BigGraph version of the complete analysis script.
Notice: no repeated edges are allowed when indexing edges loaded into a BigGraph instance, so edge lists must be previously and independently curated to avoid edge repetition.
Call the Bash script load_and_analyze.sh to fully automate the process, which is divided into the following steps (each can be run manually and separately if necessary):
- build_db.py : loads and indexes the edgelist text file, also indexing edges.
- index_db_all.py : creates all indexes for node parameters.
- complete_big_analysis.sh : dumps the node parameters and does the numeric processing for plotting.
Basic node parameters included by default are:
- degree
- clustering index
- average neighbor degree (knn)
- kcoreness
- triangles (number of triangles the node is involved in, related to the clustering index)
- linksphere1 (number of links in the neighborhood of distance 2 around the node)
- nodesphere1 (number of nodes in the neighborhood of distance 2 around the node)
Other, more expensive, node parameters can be included by uncommenting them directly in the file:
- linksphere2, nodesphere2
- linksphere3, nodesphere3
- eccentricity
- path_len
- scaling
- connectivity
The parameters dictionary in the file complete_big_analysis.py sets the maximum number of bins included in the plots.
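As an illustration of what the bin count controls, here is a hypothetical sketch of logarithmic binning of a degree sample (Cloudlight's actual binning code may differ):

```python
import math

def log_bins(values, num_bins):
    """Count positive values into `num_bins` logarithmically spaced bins."""
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    span = (math.log(hi) - math.log(lo)) or 1.0   # avoid div by zero
    for v in values:
        idx = int((math.log(v) - math.log(lo)) / span * num_bins)
        counts[min(idx, num_bins - 1)] += 1       # clamp the max value
    return counts

degrees = [1, 1, 1, 2, 2, 4, 8, 16]
print(log_bins(degrees, 4))  # most of the mass falls in the low-degree bins
```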
Finally, an example. Notice that only the input edgelist file and the maximum number of edges to be included must be specified:
bash ./load_and_analyse.sh /tmp/largegraph.txt 100000
Too many scripts for too many funny things
A complete list of the scripts implemented follows:
- build_db_bigger_comp.py : extracts a random component from a BigGraph instance; if you are lucky, you will be extracting the biggest component.
- build_db_digraph.py : loads an edgelist into a directed big graph instance, BigDiGraph.
- build_db_symmetric.py : loads only the symmetric subgraph of an edgelist into a BigGraph instance.
- build_db.py : loads an edgelist file into a BigGraph instance and indexes edges.
- complete_analysis.py : discussed in previous sections.
- complete_big_analysis.py : discussed in previous sections.
- compute_linksphere1.py : directly computes the linksphere1 node parameters based on degree, clustering and knn.
- connected.py : saves the biggest connected component of a Graph instance.
- drop_index_db_*.py : erases the index of any node parameter.
- dump_db_graph.py : the inverse of building a graph db; just dumps the BigGraph instance into an edgelist file.
- dump_snowball.py : similar to dump_db_graph.py, but traverses the BigGraph instance in a BFS fashion.
- export_*.py : helps you dump the node parameters you computed and indexed for BigGraph instances.
- index_db_all.py : previously discussed; indexes all basic graph node parameters.
- index_db_*.py : indexes a particular node parameter.
- test_big_privacy.py : deals with lookahead privacy simulations on BigGraph instances, discussed in the next section.
Scripting lookahead privacy simulations
Some specific privacy attack simulations are implemented in Cloudlight. Some are identical to the lookahead privacy attacks involving heuristics that use the a priori visible degree of the nodes (Korolova et al., 2008). Other heuristics implemented are based on a privacy model where the degree is not known a priori. In the models tested, an attacker bribes a node to uncover its local neighborhood, constructing an attacker-visible subgraph; the important decision is therefore the order in which the nodes are bribed.
Full-blown privacy simulations (including indexing of the network) can be launched with the Bash script privacy_run.sh. The parameters are, in this order:
- input file with the edgelist: undirected edges without repetitions.
- output folder for results and the indexed network.
- filename for results and the indexed network; file extensions are automatically appended.
- maximum number of input edges you want to include in the simulation.
- attack heuristic set; heuristics are classified into passive, triangles and no_degree, detailed later in this section.
- coverage types (comma-separated) used to measure the progress of the attack: node, link, complete_node or triangle. Node and link coverage are straightforward to understand; complete node coverage counts the nodes with all of their incident edges covered/visible; triangle coverage measures the percentage of node triangles covered.
- end coverage (50, 60, 95, etc.): the halting condition; when the specified coverage reaches the end coverage, the simulation ends.
- "index" or any other non-null value to index from scratch; if you omit this optional parameter, it is assumed that you have previously indexed the same dataset with the same parameters and you want to save some time.
Now we detail the global concepts behind the heuristics. For specific details, read the source (privacy.py and privacy_attack.py).
The heuristic set passive involves the heuristics degree, greedy, crawler and random, all based on heuristics described in the literature (Korolova et al., 2008).
- degree: the degrees of the nodes are assumed to be available a priori (an assumption usually seen in social networks such as LinkedIn), so nodes are bribed in descending degree order.
- greedy: the degrees of the nodes are assumed to be available a priori; nodes are bribed in descending unseen-degree order, where unseen degree is a table dynamically updated during the attack.
- crawler: similar to greedy, but the attacker can only bribe nodes in a visible connected component.
- random: nodes to bribe are chosen at random, without repetition.
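The greedy heuristic can be sketched as follows (a simplification, not Cloudlight's code: degrees are known a priori, and an unseen-degree table is updated after each bribe):

```python
def greedy_attack(adj, steps):
    """Bribe nodes in descending unseen-degree order; return the bribe order."""
    unseen = {u: len(vs) for u, vs in adj.items()}  # a priori degrees
    covered_links = set()
    order = []
    for _ in range(steps):
        target = max(unseen, key=unseen.get)
        order.append(target)
        for v in adj[target]:
            link = (target, v) if target <= v else (v, target)
            if link not in covered_links:
                covered_links.add(link)
                unseen[v] -= 1        # v's uncovered degree shrinks
        unseen[target] = -1           # never bribe the same node twice
    return order

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(greedy_attack(adj, 2))  # 'c' first (degree 3), then 'a' or 'b'
```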
The heuristic set triangles involves the heuristics triangles, greedy_triangles, crawler_triangles and crawler_seen_triangles. These are similar to the degree heuristics, but assume an a priori available table of each node versus the triangles the node is involved in (not seen in practice in social networks). A table of unseen_triangles is also used in some cases. For crawler_seen_triangles no a priori table of triangles is assumed; triangles are discovered as the attack progresses.
The heuristic set no_degree involves heuristics that do not use an a priori table of degrees, but instead construct tables of seen degrees and histograms as the attack moves forward.
- crawler_seen_degree: the next node bribed is the not-fully-covered node with the biggest dynamically updated seen degree.
- crawler_degree_hist: a histogram of degrees is updated, and the degree chosen is the one that maximizes an estimation of the links discovered.
- crawler_degree_hist_bin_dist: the same as the previous one, but with bin redistribution to make the estimation more accurate.
- crawler_degree_hist_bin_dist_rand: idem, but with random redistribution of bins within the histogram.
An example:
./privacy_run.sh /home/cpetruza/lj200k.edgelist /home/cpetruza/results/ lj200kfiles 200000 passive node,link,korolova 40 index
This script indexes the network and the parameters needed, and also calls the script test_big_privacy.py to start the simulation.
The output file with the simulation results (see the following excerpt) can be processed and plotted with the script plot_effort.py (see the source code for additional details regarding script parameters and magic numbers). Lookahead 1 means that nodes have visibility of nodes up to 2 hops away from their position in the network.
# lookahead coverage heuristic-strategy #bribed_nodes %bribed_nodes date
1 0.10 start_degree 2 0.004193 Tue_Jul_20_19:06:06_2010
1 0.10 start_greedy 2 0.004193 Tue_Jul_20_19:06:06_2010
1 0.45 start_random 26 0.054507 Tue_Jul_20_19:06:06_2010
...
...
Graph Bots, online crawling and interaction with the Web
Basic experimental support for collecting complex networks from the Internet is included. It is based on abstract agent classes that implement:
- an Iterator interface: which elements of the graph are visited, and in which order.
- a Visitor interface: what happens when an element of the graph is visited.
- a Builder interface: what to do with the results of the visits to the elements of the graph.
The interfaces and the methods that should be implemented are detailed within the source code of the bot modules.
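The cooperation between the three interfaces can be illustrated with plain Python mixins (a self-contained sketch with hypothetical minimal classes; it mirrors, but does not reproduce, Cloudlight's own):

```python
class NodeIterator(object):
    def elements(self, graph):
        return iter(graph)               # visit every node, in dict order

class DegreeVisitor(object):
    def visit(self, graph, node):
        return len(graph[node])          # result of one visit: the degree

class SumBuilder(object):
    def build(self, results):
        return float(sum(results))       # combine all visit results

class Bot(object):
    def process(self, graph):
        # Iterator picks the elements, Visitor handles each, Builder combines.
        return self.build(self.visit(graph, n) for n in self.elements(graph))

class DegreeBot(NodeIterator, DegreeVisitor, SumBuilder, Bot):
    pass

adj = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
print(DegreeBot().process(adj))  # 8.0: twice the number of edges
```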
In the following example a bot called DegreeBot is constructed using multiple inheritance to compute the sum of the degrees (which is the same as two times the number of edges). NodeIterator iterates only over the nodes, DegreeVisitor computes the degree of each node visited, and SumBuilder sums the results from the Visitor.
from cloudlight import Graph
from cloudlight import Node, Bot, NodeIterator, DegreeVisitor, SumBuilder
g = Graph()
n1 = Node()
n2 = Node()
n3 = Node()
n4 = Node()
g.add_edge(n1, n2)
g.add_edge(n2, n3)
g.add_edge(n3, n1)
g.add_edge(n3, n4)
class DegreeBot(NodeIterator, DegreeVisitor, SumBuilder, Bot): pass
bot = DegreeBot()
assert bot.process(g) == 8.0
Now let's use the interfaces to implement a basic crawler for Twitter following and Facebook friends, at the same time!
from cloudlight import Graph
from cloudlight import Node, Bot, GraphIterator, GraphBuilder
from cloudlight import FacebookNode, FacebookFriendVisitor, FacebookBrowser
from cloudlight import TwitterNode, TwitterFriendVisitor, TwitterBrowser
g = Graph()
g.add_node( FacebookNode(id=100002106847654) )
g.add_node( TwitterNode('cfkargentina') )
# automatically choose which visitor to use depending on the node type
class MyVisitor(FacebookFriendVisitor, TwitterFriendVisitor): pass
class MyBot(GraphIterator, MyVisitor, GraphBuilder, Bot): pass
# is the same as
#from cloudlight import CopyBot
#class MyBot(CopyBot, MyVisitor): pass
bot = MyBot()
bot.twitter_browser = TwitterBrowser()
# facebook browser is not very stable
bot.facebook_browser = FacebookBrowser()
bot.facebook_browser.initialize(user='alice.private.life@gmail.com', passw='asdfasdf0')
new_graph = bot.process(g)
print new_graph.number_of_edges()
In this case we use GraphBuilder or CopyBot to construct a new graph; NullBuilder is used when you need to build nothing, or when you rely on useful side effects generated by the visitor.