Version 4, last updated by Robert Isele at July 01, 2011 05:33 UTC
Silk MapReduce
Introduction
Silk MapReduce is used to generate RDF links between data sets using a cluster of multiple machines. Silk MapReduce is based on Hadoop and can for instance be run on Amazon Elastic MapReduce. Silk MapReduce enables Silk to scale out to very large datasets by distributing the link generation to multiple machines.
Usage
To run Silk MapReduce, you need:
- Silk Link Discovery Framework.
- Java Runtime Environment: The Silk Link Discovery Framework runs on top of the JVM. Get the most recent [[http://java.com|JRE]].
The Silk MapReduce linking workflow is divided into 2 phases:
Load phase
In the Load phase Silk loads all data sets which are specified by the user-provided link specifications into the instance cache.
hadoop jar silkmr.jar load configFile outputDir [linkSpec]
The following parameters are accepted:
- configFile: The path to the Silk configuration file, which contains the link specifications. For details on the Silk – Link Specification Language, please read the Specification.
- outputDir: The directory to which the instance cache will be written. This will be the input directory of the Link Generation phase.
- linkSpec (optional): If given, only the specified link specification will be loaded. If not given, all link specifications in the provided configuration will be loaded.
Example:
hadoop jar silkmr.jar load ./config.xml ./cache
Link Generation phase
In the Link Generation phase Silk generates the links from the previously loaded instance cache.
hadoop jar silkmr.jar match inputDir outputDir [linkSpec]
The following parameters are accepted:
- inputDir: The path to the previously loaded instance cache.
- outputDir: The directory to which the generated links will be written.
- linkSpec (optional): The link specification for which links should be generated. Can be omitted if the provided configuration contains only one link specification.
Example:
hadoop jar silkmr.jar match ./cache ./links
Usage example
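As a first, minimal sketch, the two phases can be chained in a small shell script so that the instance cache written by the Load phase becomes the input of the Link Generation phase. The names silk_mr, run, and DRY_RUN are illustrative assumptions and not part of Silk. By default the script only prints the commands it would execute.

```shell
#!/bin/sh
# Hypothetical helper, not part of Silk: print a command in dry-run mode
# (the default), or execute it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper: run the Load phase, then the Link Generation phase.
# $1 = Silk configuration file, $2 = instance cache dir, $3 = links dir.
# The match command only starts if the load command succeeded.
silk_mr() {
  run hadoop jar silkmr.jar load "$1" "$2" &&
    run hadoop jar silkmr.jar match "$2" "$3"
}

# In dry-run mode this prints the load and match commands for these paths.
silk_mr ./config.xml ./cache ./links
```

Setting DRY_RUN=0 executes the commands instead of printing them, which requires hadoop and silkmr.jar to be available on the machine.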
We employed Silk to find owl:sameAs links between cities in DBpedia and in LinkedGeoData. The link specification used can be found online. As both datasets are very large, we used a reduced dataset consisting of 105,000 settlements from DBpedia and 59,000 cities and towns from LinkedGeoData (omitting villages).
We executed the link specification using two different configurations:
- Silk Single Machine running on an Intel Core2Duo E8500 with 8 GB of RAM
- Silk MapReduce running on an Amazon Elastic MapReduce cluster consisting of 10 Amazon EC2 instances (High-CPU Medium Instance Profile)
For each configuration, we executed the link specification twice:
- Without the use of the blocking feature. In this case the Silk Linking Engine has to evaluate the full Cartesian product, resulting in over 6 billion instance comparisons.
- Using the blocking feature. The link specification has been extended to block the cities by name using 50 blocks.
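To make the size of the comparison space concrete, the following shell arithmetic multiplies the two dataset sizes. It assumes 105,000 DBpedia settlements, the figure consistent with the stated total of over 6 billion comparisons, and 59,000 LinkedGeoData cities and towns. The blocked estimate is an idealized assumption (every instance falls into exactly one of 50 equally sized blocks), not a description of how Silk actually distributes instances. Real name distributions are uneven, so the true number will differ.

```shell
# Dataset sizes (105,000 is an assumption consistent with "over 6 billion").
DBPEDIA=105000
LGD=59000
BLOCKS=50

# Full Cartesian product: every DBpedia instance is compared
# against every LinkedGeoData instance.
FULL=$((DBPEDIA * LGD))
echo "$FULL"        # 6195000000

# Idealized blocking: only instances within the same block are compared.
BLOCKED=$((BLOCKS * (DBPEDIA / BLOCKS) * (LGD / BLOCKS)))
echo "$BLOCKED"     # 123900000
```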
The number of generated links and the time needed to generate the links for each combination is given in the following table:
| Variant | Link Generation time | Number of links |
|---|---|---|
| Without Blocking | ||
| Silk Single Machine | 54 hours | 9,283 |
| Silk MapReduce | 6.7 hours | 9,283 |
| With Blocking | ||
| Silk Single Machine | 155.5 minutes | 9,224 |
| Silk MapReduce | 14.4 minutes | 9,224 |
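The table shows that Silk MapReduce reduces the execution time substantially by scaling out to a cluster of multiple machines: on the 10-instance cluster, link generation ran roughly 8 times faster without blocking (6.7 hours instead of 54 hours) and almost 11 times faster with blocking (14.4 minutes instead of 155.5 minutes). Employing the blocking feature included in Silk improves performance further, while losing less than 1% of the links (59 of 9,283) compared to the link specification without blocking.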
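The speed-up factors and the share of links lost through blocking can be recomputed from the table. The sketch below uses awk for the floating-point division; the helper names speedup and loss_percent are illustrative and not part of Silk.

```shell
# Ratio of two run times given in the same unit, rounded to one decimal.
speedup() {
  awk -v slow="$1" -v fast="$2" 'BEGIN { printf "%.1f\n", slow / fast }'
}

# Percentage of links lost when going from $1 links to $2 links.
loss_percent() {
  awk -v all="$1" -v kept="$2" 'BEGIN { printf "%.2f\n", (all - kept) / all * 100 }'
}

speedup 54 6.7          # without blocking (hours): 8.1
speedup 155.5 14.4      # with blocking (minutes): 10.8
loss_percent 9283 9224  # links lost through blocking: 0.64
```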