Version 64, last updated by Andrea Matteini at 02 Apr 21:15 UTC
Benchmark machines
FUB machines – used for In-Memory, TDB and FUB Hadoop clusters
- Hardware
- Processor: Intel i7 950, 3.07 GHz (quad-core)
- Memory: 24GB
- Hard Disks: 2 × 1.8TB (7,200 rpm) SATA2
- Software
- Operating System: Ubuntu 11.04 64-bit, Kernel: 2.6.38-10
- Java version: 1.6.0_22
EC2 c1.medium instances – used for EC2 Hadoop clusters
- Hardware
- Processor: 5 EC2 Compute Units ¹
- Memory: 1.7 GB
- Hard Disks: 350 GB
- I/O Performance: Moderate
- Software
- Operating System: Ubuntu 11.04 32-bit
Benchmark clusters
In all cluster configurations, the master acts as job tracker and name node, while the slaves act as data nodes and task trackers. See the Hadoop configuration page for more details.
FUB 2-slaves cluster
- 1 master, 2 slaves (FUB machines)
- Network: Gigabit Ethernet
EC2 X-slaves cluster
- 1 master, X slaves (EC2 c1.medium instances)
Tests description
Data sets, mappings and link specs are described in use case B of the benchmark page. For the 150M and 300M scales we added further data from the KEGG GENES and UniProt data sets.
We used LDIF revision 3c5bed30e4 for the in-memory and TDB tests, and LDIF revision 38d9c0fab4 for the Hadoop tests.
Results
25M run times
| Phase | In-memory ² | TDB ³ | Hadoop FUB 2-slaves |
|---|---|---|---|
| Load and build entities for R2R | 131 s | 1334 s | 688 s |
| R2R data translation | 110 s | 85 s | 87 s |
| Build entities for Silk | 13 s | 139 s | 225 s |
| Silk identity resolution | 123 s | 213 s | 216 s |
| URIs rewriting | 17 s | 15 s | 427 s |
| Overall execution | 6.5 min | 29.7 min | 27.4 min |
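As a sanity check, the overall execution times can be recomputed by summing the per-phase times. The following is a minimal Python sketch using the figures from the 25M table; the variable names are illustrative and not part of LDIF.

```python
# Per-phase run times in seconds, copied from the 25M table
# (order: load/build for R2R, R2R translation, build for Silk,
#  Silk identity resolution, URIs rewriting).
phase_seconds = {
    "In-memory": [131, 110, 13, 123, 17],
    "TDB": [1334, 85, 139, 213, 15],
    "Hadoop FUB 2-slaves": [688, 87, 225, 216, 427],
}

# Sum the phases and convert to minutes for comparison with "Overall execution".
overall_minutes = {
    setup: sum(times) / 60 for setup, times in phase_seconds.items()
}

for setup, minutes in overall_minutes.items():
    print(f"{setup}: {minutes:.1f} min")
```

The sums come to roughly 6.6, 29.8 and 27.4 minutes, which agrees with the reported overall values to within 0.1 min.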
100M run times
| Phase | In-memory ² | TDB ³ | Hadoop FUB 2-slaves |
|---|---|---|---|
| Load and build entities for R2R | 834 s | 7014 s | 3112 s |
| R2R data translation | 735 s | 607 s | 165 s |
| Build entities for Silk | 87 s | 806 s | 405 s |
| Silk identity resolution | 2743 s | 4656 s | 544 s |
| URIs rewriting | 124 s | 118 s | 1056 s |
| Overall execution | 75 min | 220 min | 88 min |
150M run times
| Phase | In-memory ² | TDB ³ | Hadoop FUB 2-slaves |
|---|---|---|---|
| Load and build entities for R2R | Out of Memory | 13776 s | 3830 s |
| R2R data translation | - | 1206 s | 170 s |
| Build entities for Silk | - | 847 s | 380 s |
| Silk identity resolution | - | 5328 s | 688 s |
| URIs rewriting | - | 173 s | 1235 s |
| Overall execution | - | 355 min | 105 min |
300M run times
| Phase | In-memory ² | TDB ³ | Hadoop FUB 2-slaves |
|---|---|---|---|
| Load and build entities for R2R | Out of Memory | 22870 s | 6070 s |
| R2R data translation | - | 1203 s | 179 s |
| Build entities for Silk | - | 1006 s | 436 s |
| Silk identity resolution | - | 8392 s | 1022 s |
| URIs rewriting | - | 176 s | 1232 s |
| Overall execution | - | 560 min | 148 min |
300M run times on Hadoop EC2 clusters
| Phase | Hadoop EC2 8-slaves | Hadoop EC2 16-slaves | Hadoop EC2 32-slaves |
|---|---|---|---|
| Load and build entities for R2R | 7933 s | 4647 s | 2382 s |
| R2R data translation | 297 s | 173 s | 114 s |
| Build entities for Silk | 646 s | 421 s | 324 s |
| Silk identity resolution | 1546 s | 932 s | 580 s |
| URIs rewriting | 2174 s | 1430 s | 1085 s |
| Overall execution | 209 min | 126 min | 75 min |
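One way to read the EC2 figures is in terms of speedup and parallel efficiency relative to the smallest cluster. The sketch below is only an illustration: it takes the 8-slaves run as the baseline (an assumption made here, not something the benchmark defines) and uses the overall execution times from the table.

```python
# Overall execution times in minutes for the 300M data set,
# keyed by number of EC2 slaves (from the table above).
overall_minutes = {8: 209, 16: 126, 32: 75}

# Assumption: the 8-slaves cluster serves as the baseline.
baseline_slaves = 8
baseline_time = overall_minutes[baseline_slaves]

scaling = {}
for slaves, minutes in overall_minutes.items():
    speedup = baseline_time / minutes   # how much faster than the baseline
    ideal = slaves / baseline_slaves    # speedup under perfectly linear scaling
    scaling[slaves] = (speedup, speedup / ideal)  # (speedup, efficiency)

for slaves, (speedup, efficiency) in scaling.items():
    print(f"{slaves} slaves: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these assumptions, doubling from 8 to 16 slaves gives a speedup of about 1.66x, and quadrupling to 32 slaves about 2.79x, so scaling is sub-linear on this workload.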
3.69B (3690M, complete UniProt and complete KEGG GENES) run times
Config files:
Run times (format: hh:mm:ss)
| Phase | In-memory ² | TDB ³ | Hadoop FUB 2-slaves |
|---|---|---|---|
| Load and build entities for R2R | Out of Memory | 9:34:44 | |
| R2R data translation | - | 15:22 | |
| Build entities for Silk | - | 51:12 | |
| Silk identity resolution | - | 15:27:52 | |
| Find sameAs URI sets | - | 1:44:24 | |
| URIs rewriting | - | 1:45:29 | |
| Overall execution | - | 29:09:03 | |
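Because this table uses hh:mm:ss while the smaller runs are reported in seconds, a small conversion helper makes the figures comparable. The sketch below assumes that two-field entries such as 15:22 mean mm:ss (hours omitted); the function name `to_seconds` is hypothetical.

```python
def to_seconds(duration: str) -> int:
    """Convert 'hh:mm:ss' or 'mm:ss' to seconds.

    Assumption: when only two fields are present, they are minutes and seconds.
    """
    seconds = 0
    for field in duration.split(":"):
        seconds = seconds * 60 + int(field)
    return seconds


# TDB phase times for the 3.69B run, copied from the table above.
tdb_phases = {
    "Load and build entities for R2R": "9:34:44",
    "R2R data translation": "15:22",
    "Build entities for Silk": "51:12",
    "Silk identity resolution": "15:27:52",
    "Find sameAs URI sets": "1:44:24",
    "URIs rewriting": "1:45:29",
}

tdb_seconds = {phase: to_seconds(t) for phase, t in tdb_phases.items()}

for phase, secs in tdb_seconds.items():
    print(f"{phase}: {secs} s")
```

For example, Silk identity resolution takes 55672 s here, compared with 8392 s for TDB at the 300M scale.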
Data statistics
| Metric | Amount |
|---|---|
| Overall input triples | 3,687,918,681 |
| Triples in UniProt | 3,602,395,798 |
| Triples in KEGG | 85,522,883 |
| Triples after filtering | 1,164,250,713 |
| Triples after mapping phase | 98,380,001 |
| Number of entities considered for entity resolution | 22,496,678 |
| Number of pairs of equivalent entities resolved | 6,321,750 |
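The statistics above show how strongly the data shrinks along the pipeline. The following short Python sketch computes the fraction of input triples remaining after each stage and checks that the two source counts add up to the overall input; the dictionary keys are illustrative names chosen here.

```python
# Triple counts copied from the data statistics table.
stats = {
    "input": 3_687_918_681,
    "uniprot": 3_602_395_798,
    "kegg": 85_522_883,
    "after_filtering": 1_164_250_713,
    "after_mapping": 98_380_001,
}

# The two sources should account for all input triples.
sources_match = stats["uniprot"] + stats["kegg"] == stats["input"]

# Fraction of the input that survives each stage.
kept_after_filtering = stats["after_filtering"] / stats["input"]
kept_after_mapping = stats["after_mapping"] / stats["input"]

print(f"Sources add up to input: {sources_match}")
print(f"Kept after filtering: {kept_after_filtering:.1%}")
print(f"Kept after mapping:   {kept_after_mapping:.1%}")
```

Roughly 32% of the input triples remain after filtering and under 3% after the mapping phase.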
¹ According to this article, c1.medium instances with 5 ECU are observed to run as 2 of the 4 cores of an Intel E5410 processor (4 cores, 2.33 GHz).
² JVM parameters: -Xmx20G
From release 0.3, data translation starts during the preceding entity-building phase.
From release 0.3, URIs rewriting no longer runs entirely in memory (it is slower, but uses less memory).
³ JVM parameters: -Xmx4G