Version 64, last updated by Andrea Matteini at 02 Apr 21:15 UTC

Benchmark machines

FUB machines – used for the in-memory and TDB tests and for the FUB Hadoop cluster

  • Hardware
    • Processor: Intel Core i7 950, 3.07 GHz (quad-core)
    • Memory: 24GB
    • Hard Disks: 2 × 1.8TB (7,200 rpm) SATA2
  • Software
    • Operating System: Ubuntu 11.04 64-bit, Kernel: 2.6.38-10
    • Java version: 1.6.0_22

EC2 c1.medium instances – used for EC2 Hadoop clusters

  • Hardware
    • Processor: 5 EC2 Compute Units [1]
    • Memory: 1.7 GB
    • Hard Disks: 350 GB
    • I/O Performance: Moderate
  • Software
    • Operating System: Ubuntu 11.04 32-bit

Benchmark clusters

In all cluster configurations, the master acts as job tracker and name node, while the slaves act as data nodes and task trackers. See the Hadoop configuration page for more details.

FUB 2-slaves cluster

  • 1 master, 2 slaves (FUB machines)
  • Network: Gigabit Ethernet

EC2 X-slaves cluster

  • 1 master, X slaves (EC2 c1.medium instances)

Test description

Data sets, mappings and link specifications are described in use case B of the benchmark page. For the 150M and 300M scales we added further data from the KEGG GENES and UniProt data sets.
We used LDIF revision 3c5bed30e4 for the in-memory and TDB tests, and LDIF revision 38d9c0fab4 for the Hadoop tests.

Results


25M run times

Phase | In-memory [2] | TDB [3] | Hadoop FUB 2-slaves
Load and build entities for R2R | 131 s | 1334 s | 688 s
R2R data translation | 110 s | 85 s | 87 s
Build entities for Silk | 13 s | 139 s | 225 s
Silk identity resolution | 123 s | 213 s | 216 s
URIs rewriting | 17 s | 15 s | 427 s
Overall execution | 6.5 min | 29.7 min | 27.4 min
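
The overall figures can be cross-checked against the per-phase times. The following is a minimal Python sketch, with the 25M values copied from the table above; the dictionary layout and the function name are illustrative assumptions and are not part of LDIF.

```python
# Per-phase run times in seconds for the 25M data set, copied from the table above.
# Order: load/build for R2R, R2R translation, build for Silk, Silk resolution, URIs rewriting.
PHASES_25M = {
    "In-memory": [131, 110, 13, 123, 17],
    "TDB": [1334, 85, 139, 213, 15],
    "Hadoop FUB 2-slaves": [688, 87, 225, 216, 427],
}


def overall_minutes(phase_seconds):
    """Sum the phase times (seconds) and convert the total to minutes."""
    return sum(phase_seconds) / 60.0


if __name__ == "__main__":
    for setup, seconds in PHASES_25M.items():
        print(f"{setup}: {sum(seconds)} s = {overall_minutes(seconds):.2f} min")
```

The phase sums are 394 s, 1786 s and 1643 s, which agree with the reported 6.5, 29.7 and 27.4 minutes to within 0.1 minute.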

100M run times

Phase | In-memory [2] | TDB [3] | Hadoop FUB 2-slaves
Load and build entities for R2R | 834 s | 7014 s | 3112 s
R2R data translation | 735 s | 607 s | 165 s
Build entities for Silk | 87 s | 806 s | 405 s
Silk identity resolution | 2743 s | 4656 s | 544 s
URIs rewriting | 124 s | 118 s | 1056 s
Overall execution | 75 min | 220 min | 88 min

150M run times

Phase | In-memory [2] | TDB [3] | Hadoop FUB 2-slaves
Load and build entities for R2R | Out of Memory | 13776 s | 3830 s
R2R data translation | - | 1206 s | 170 s
Build entities for Silk | - | 847 s | 380 s
Silk identity resolution | - | 5328 s | 688 s
URIs rewriting | - | 173 s | 1235 s
Overall execution | - | 355 min | 105 min

300M run times

Phase | In-memory [2] | TDB [3] | Hadoop FUB 2-slaves
Load and build entities for R2R | Out of Memory | 22870 s | 6070 s
R2R data translation | - | 1203 s | 179 s
Build entities for Silk | - | 1006 s | 436 s
Silk identity resolution | - | 8392 s | 1022 s
URIs rewriting | - | 176 s | 1232 s
Overall execution | - | 560 min | 148 min
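
To compare the single-machine TDB runs with the FUB 2-slaves Hadoop cluster across scales, the overall times from the 25M, 100M, 150M and 300M tables above can be turned into a ratio. This is only a sketch: the names `OVERALL_MIN` and `speedup` are chosen for illustration, and the ratio ignores that the Hadoop cluster uses three machines.

```python
# Overall run times in minutes, copied from the tables above:
# scale -> (TDB, Hadoop FUB 2-slaves).
OVERALL_MIN = {
    "25M": (29.7, 27.4),
    "100M": (220, 88),
    "150M": (355, 105),
    "300M": (560, 148),
}


def speedup(tdb_minutes, hadoop_minutes):
    """How many times faster the Hadoop run finished than the TDB run."""
    return tdb_minutes / hadoop_minutes


if __name__ == "__main__":
    for scale, (tdb, hadoop) in OVERALL_MIN.items():
        print(f"{scale}: Hadoop is {speedup(tdb, hadoop):.2f}x faster than TDB")
```

With the values above, the ratio grows from about 1.08 at 25M to about 3.78 at 300M.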

300M run times on Hadoop EC2 clusters

Phase | Hadoop EC2 8-slaves | Hadoop EC2 16-slaves | Hadoop EC2 32-slaves
Load and build entities for R2R | 7933 s | 4647 s | 2382 s
R2R data translation | 297 s | 173 s | 114 s
Build entities for Silk | 646 s | 421 s | 324 s
Silk identity resolution | 1546 s | 932 s | 580 s
URIs rewriting | 2174 s | 1430 s | 1085 s
Overall execution | 209 min | 126 min | 75 min
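
How well the EC2 runs scale with cluster size can be estimated from the overall times above. The sketch below computes the speedup relative to the 8-slaves cluster and the parallel efficiency (speedup divided by the factor by which the number of slaves grew). The function name and data layout are illustrative assumptions.

```python
# Overall run times in minutes for the 300M data set on EC2, from the table above.
EC2_OVERALL_MIN = {8: 209, 16: 126, 32: 75}


def scaling(overall_by_slaves):
    """Return {slaves: (speedup, efficiency)} relative to the smallest cluster."""
    base_slaves = min(overall_by_slaves)
    base_time = overall_by_slaves[base_slaves]
    result = {}
    for slaves, minutes in sorted(overall_by_slaves.items()):
        speedup = base_time / minutes
        # Efficiency is 1.0 when run time halves every time the cluster doubles.
        efficiency = speedup / (slaves / base_slaves)
        result[slaves] = (speedup, efficiency)
    return result


if __name__ == "__main__":
    for slaves, (speedup, efficiency) in scaling(EC2_OVERALL_MIN).items():
        print(f"{slaves} slaves: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

For these figures, 16 slaves give a speedup of about 1.66 (efficiency about 0.83) and 32 slaves about 2.79 (efficiency about 0.70), so the gain from each doubling is less than linear.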

3.69B (3690M, complete UniProt and complete KEGG GENES) run times

Config files:

Run times (format: hh:mm:ss)

Phase In-memory [2] TDB [3] Hadoop FUB 2-slaves
Load and build entities for R2R Out of Memory 9:34:44
R2R data translation - 15:22
Build entities for Silk - 51:12
Silk identity resolution - 15:27:52
Find sameAs URI sets - 1:44:24
URIs rewriting - 1:45:29
Overall execution - 29:09:03
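
The times in this table use hh:mm:ss (or mm:ss) rather than the seconds used in the earlier tables. A small helper such as the following can convert them for comparison. This is a sketch: `to_seconds` is a hypothetical name, and the durations are copied from the table above.

```python
# Durations as listed in the 3.69B table above (hh:mm:ss or mm:ss).
PHASES_3690M = {
    "Load and build entities for R2R": "9:34:44",
    "R2R data translation": "15:22",
    "Build entities for Silk": "51:12",
    "Silk identity resolution": "15:27:52",
    "Find sameAs URI sets": "1:44:24",
    "URIs rewriting": "1:45:29",
}


def to_seconds(text):
    """Convert 'hh:mm:ss' or 'mm:ss' into a number of seconds."""
    seconds = 0
    for part in text.split(":"):
        # Each further field is 60 times finer than the one before it.
        seconds = seconds * 60 + int(part)
    return seconds


if __name__ == "__main__":
    for phase, duration in PHASES_3690M.items():
        print(f"{phase}: {duration} = {to_seconds(duration)} s")
```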

Data statistics

Metric | Amount
Overall input triples | 3,687,918,681
Triples in UniProt | 3,602,395,798
Triples in KEGG | 85,522,883
Triples after filtering | 1,164,250,713
Triples after mapping phase | 98,380,001
Number of entities considered for entity resolution | 22,496,678
Number of pairs of equivalent entities resolved | 6,321,750
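
The statistics above show how much each stage reduces the data. The following sketch checks that the two source counts add up to the overall input and expresses the later stages as a share of the input. The constant and function names are illustrative.

```python
# Triple counts copied from the data statistics above.
INPUT_TRIPLES = 3_687_918_681
UNIPROT_TRIPLES = 3_602_395_798
KEGG_TRIPLES = 85_522_883
FILTERED_TRIPLES = 1_164_250_713
MAPPED_TRIPLES = 98_380_001


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


if __name__ == "__main__":
    # The two sources should account for every input triple.
    print("sources sum to input:", UNIPROT_TRIPLES + KEGG_TRIPLES == INPUT_TRIPLES)
    print(f"after filtering: {share(FILTERED_TRIPLES, INPUT_TRIPLES):.1f}% of input")
    print(f"after mapping:   {share(MAPPED_TRIPLES, INPUT_TRIPLES):.1f}% of input")
```

The UniProt and KEGG counts do sum to the overall input; roughly 31.6% of the input triples remain after filtering and roughly 2.7% after the mapping phase.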

[1] According to this article, c1.medium instances with 5 ECU have been observed to run on 2 of the 4 cores of an Intel E5410 processor (4 cores, 2.33 GHz).

[2] JVM parameters: -Xmx20G
From release 0.3, data translation starts during the preceding entity building phase.
From release 0.3, URIs rewriting no longer runs entirely in memory (it runs slower but uses less memory).

[3] JVM parameters: -Xmx4G