Version 5, last updated by Robert Isele at September 26, 2011 14:11 UTC
Silk Server
Introduction
Silk Server is an extension to the Silk Link Discovery Framework. It is designed to be used with an incoming stream of RDF instances, produced for example by a Linked Data crawler such as LDspider. Silk Server matches data describing incoming instances against a local set of known instances and discovers missing links between them based on user-provided link specifications. Incoming instances which do not match a known instance are continuously added to the local set of instances. Using the Silk Link Specification Language (Silk-LSL), users specify the conditions that data items must fulfill in order to be interlinked by combining various similarity metrics and taking the graph around a data item into account. Based on this assessment, an application can store data about newly discovered instances in its repository or fuse data that is already known about an entity with additional data about the entity from the Web. Silk Server can be used within Linked Data application architectures as an identity resolution component to add missing RDF links to data that is consumed from the Web of Linked Data.
The main features of the Silk Server are:
- It runs as an HTTP server and offers a REST interface that allows applications to check whether data that is discovered on the Web describes an entity that is already known to the system. If the entity is already known, Silk Server returns an RDF link pointing at the URI identifying the known entity.
- It provides a flexible, declarative language for specifying the conditions that are checked in order to determine whether an entity is already known to the system.
- It achieves high performance by holding the data about all known instances that is required for the comparisons in an in-memory cache, which is updated as soon as new instances are discovered. Performance can be further improved using a blocking feature.
- It is available under an open source license and can be run on all major platforms.
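The matching behaviour described above can be illustrated with a minimal Python sketch. It is not the Silk Server implementation: the entity shape (a dictionary with a `uri` key), the function `process_stream`, and the toy `same_name` similarity are all assumptions made for illustration. It only shows the idea of matching incoming instances against a cache and adding unmatched instances to it.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical entity shape: property name -> value, plus a "uri" key.
Entity = Dict[str, str]


def process_stream(
    incoming: List[Entity],
    cache: List[Entity],
    similarity: Callable[[Entity, Entity], float],
    threshold: float = 0.9,
    write_unknown: bool = True,
) -> List[Tuple[str, str]]:
    """Return (incoming URI, known URI) pairs and grow the cache with unknown entities."""
    links = []
    for entity in incoming:
        # Compare the incoming entity against every known entity in the cache.
        matches = [
            known["uri"] for known in cache if similarity(entity, known) >= threshold
        ]
        if matches:
            links.extend((entity["uri"], match) for match in matches)
        elif write_unknown:
            # Unknown entities become known for all later comparisons.
            cache.append(entity)
    return links


def same_name(a: Entity, b: Entity) -> float:
    """Toy stand-in for a real similarity metric: exact match on lower-cased names."""
    return 1.0 if a["name"].lower() == b["name"].lower() else 0.0
```

In this sketch, an instance that was unknown in one step can be the link target of a later instance, which mirrors the continuously updated cache described in the feature list.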
Configuration
Server configuration parameters
| Parameter | Description |
|---|---|
| configDir | The directory where the Silk Link Specifications can be found. On startup, the Silk Server will load all Link Specifications in this directory. For details on writing a Link Specification for the Silk Server see the next Section. |
| directory | Specifies whether unknown instances should be added to the instance cache. |
| writeUnknownEntities | Specifies whether unknown entities should be added to the instance cache. |
| returnUnknownEntities | Specifies whether the server response should also contain unknown instances. |
Writing Link Specifications for the Server
For general information on how to write a Link Specification, refer to the Silk Link Specification Language Specification. In addition, there are a few points to consider when writing a Link Specification for the Server:
DataSources
In a typical use case there is some initial dataset, which shall be loaded by the server on start-up. This can be accomplished by specifying a source dataset in the Link Specification. The target dataset will be formed by the incoming stream and thus is ignored by the server.
Example:
...
<DataSources>
  <DataSource id="source" type="file">
    <Param name="file" value="./initialData.rdf"/>
    <Param name="format" value="RDF/XML"/>
  </DataSource>
  <DataSource id="inputStream" type="rdf">
    <Param name="format" value="N-TRIPLE"/>
    <Param name="input" value=""/>
  </DataSource>
</DataSources>
<Interlinks>
  <Interlink id="persons">
    <LinkType>owl:sameAs</LinkType>
    <SourceDataset dataSource="source" var="a">
      <RestrictTo>
        ?a rdf:type foaf:Person .
      </RestrictTo>
    </SourceDataset>
    <TargetDataset dataSource="inputStream" var="b">
      <RestrictTo>
        ?b rdf:type foaf:Person .
      </RestrictTo>
    </TargetDataset>
...
Output
As the Silk Server returns the generated links in the response, the output section of the Link Specification may be left empty.
Usage
Running the Server
In order to run the Silk Server, you need:
- Silk Link Discovery Framework.
- Java Runtime Environment: The Silk Link Discovery Framework runs on top of the JVM. Get the most recent JRE.
- Maven: Used for project management and build automation. Get it from: http://maven.apache.org.
What to do:
- Write Silk-LSL configuration files to specify which resources should be interlinked.
- Modify the Server configuration if needed.
- Install the Silk Framework in the local Maven Repository: Navigate to the main Silk folder and execute:
mvn install
- Run the Silk Server: Navigate to the server folder and execute:
mvn jetty:run
Note: In order to configure the underlying Jetty Server (e.g. the port on which requests are accepted), you need to edit the pom.xml of the server module. Refer to the official homepage for details.
Making requests to the Server
The Server accepts HTTP POST requests on the URL http://{ip:port}/api/process?format={format}. The input data must be included in the body of the request. By default, the Server expects the input data to be serialized as RDF/XML. Other input formats can be specified using the format parameter. Supported input formats are: “RDF/XML”, “N-TRIPLE”, “TURTLE”, “TTL” and “N3”.
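As a rough illustration of such a request, the following Python sketch builds a POST request for the process endpoint using only the standard library. The helper names `build_process_request` and `send`, the host `localhost:8080`, and the `Content-Type` header are assumptions for this example; the text above does not state which headers, if any, the Server inspects.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_process_request(host: str, data: str, fmt: str = "RDF/XML") -> Request:
    """Build (but do not send) a POST request for the /api/process endpoint."""
    url = f"http://{host}/api/process?{urlencode({'format': fmt})}"
    return Request(
        url,
        data=data.encode("utf-8"),  # the RDF input goes into the request body
        # Assumption: the expected Content-Type is not documented in the text above.
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def send(request: Request) -> str:
    """Send the request and return the response body (the generated links)."""
    with urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # Hypothetical input: a single N-Triples statement about an example person.
    triples = '<http://example.org/p1> <http://xmlns.com/foaf/0.1/name> "Alice" .\n'
    request = build_process_request("localhost:8080", triples, fmt="N-TRIPLE")
    print(request.get_method(), request.full_url)
```

Building the request separately from sending it makes it possible to inspect the URL and body without a running server.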
The Server response contains the generated links as N-Triples. If the returnUnknownInstances configuration parameter is set, it will additionally contain a statement of the form unknownInstance <http://www4.wiwiss.fu-berlin.de/bizer/silk/matchingResult> <http://www4.wiwiss.fu-berlin.de/bizer/silk/UnknownInstance> for each unknown instance in the request.
If the writeUnknownInstances configuration parameter is set, each unknown instance will be added to the instance cache. In that case, it will be included in the link generation in future requests.
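A client may want to separate the generated links from the unknown-instance statements in the response. The following Python sketch shows one possible way to do this. It assumes that every statement in the response consists of three URIs in angle brackets on a single line; the function name `split_response` and the example URIs used with it are hypothetical.

```python
import re

MATCHING_RESULT = "http://www4.wiwiss.fu-berlin.de/bizer/silk/matchingResult"
UNKNOWN_INSTANCE = "http://www4.wiwiss.fu-berlin.de/bizer/silk/UnknownInstance"

# Simplified N-Triples pattern: subject, predicate and object are all URIs.
TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")


def split_response(body: str):
    """Split a response body into (links, unknown instance URIs)."""
    links, unknown = [], []
    for line in body.splitlines():
        match = TRIPLE.match(line)
        if match is None:
            continue  # skip blank lines and statements this sketch does not cover
        subject, predicate, obj = match.groups()
        if predicate == MATCHING_RESULT and obj == UNKNOWN_INSTANCE:
            unknown.append(subject)
        else:
            links.append((subject, predicate, obj))
    return links, unknown
```

A full N-Triples parser would also handle literals and blank nodes; this sketch deliberately ignores them.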
Usage example
This section reports on the results of an experiment in which we used Silk Server to generate RDF links between authors and publications from a Semantic Web Dog Food Corpus dump and a stream of FOAF profiles that we crawled from the Web. The Semantic Web Dog Food Corpus publishes information on people and publications from Semantic Web conferences. FOAF is a widely used vocabulary to describe persons, their connections, projects, publications and interests. Twitter is a social networking and microblogging website which provides user information as RDFa. Given these different sources of information on persons, the experiment aims at linking duplicate person descriptions. In the following, we first explain the Silk-LSL specification used by Silk Server in the experiment, then describe the setup of the experiment, and finally report on and discuss the results.
The Link Specification used
We have used the following link configuration for linking data items describing the same person:
01 <?xml version="1.0" encoding="utf-8" ?>
02 <Silk>
03 <Prefixes>
04 <Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>
05 <Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#"/>
06 <Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#"/>
07 <Prefix id="dcterms" namespace="http://purl.org/dc/terms/"/>
08 <Prefix id="foaf" namespace="http://xmlns.com/foaf/0.1/"/>
09 <Prefix id="vcard" namespace="http://www.w3.org/2006/vcard/ns#" />
10 </Prefixes>
11 <DataSources>
12 <DataSource id="sw_dog_food" type="file">
13 <Param name="file" value="semantic_web_dog_food.rdf"/>
14 <Param name="format" value="RDF/XML"/>
15 </DataSource>
16 <DataSource id="input_stream" type="rdf">
17 <Param name="format" value="N-TRIPLE"/>
18 <Param name="input" value=""/>
19 </DataSource>
20 </DataSources>
21 <Interlinks>
22 <Interlink id="persons">
23 <LinkType>owl:sameAs</LinkType>
24 <SourceDataset dataSource="input_stream" var="a">
25 <RestrictTo>
26 ?a rdf:type foaf:Person .
27 </RestrictTo>
28 </SourceDataset>
29 <TargetDataset dataSource="sw_dog_food" var="b">
30 <RestrictTo>
31 ?b rdf:type foaf:Person .
32 </RestrictTo>
33 </TargetDataset>
34 <LinkageRule>
35 <Aggregate type="average">
36 <Aggregate type="max" required="true">
37 <Compare metric="jaroWinkler">
38 <TransformInput function="lowerCase">
39 <Input path="?a/foaf:name"/>
40 </TransformInput>
41 <TransformInput function="lowerCase">
42 <Input path="?b/foaf:name"/>
43 </TransformInput>
44 </Compare>
45 </Aggregate>
46 <Aggregate type="max" weight="2" required="true">
47 <Compare metric="jaroWinkler">
48 <TransformInput function="lowerCase">
49 <Input path="?a/foaf:homepage"/>
50 </TransformInput>
51 <TransformInput function="lowerCase">
52 <Input path="?b/foaf:homepage"/>
53 </TransformInput>
54 </Compare>
55 <Compare metric="jaroWinkler">
56 <Input path="?a/foaf:mbox_sha1sum"/>
57 <Input path="?b/foaf:mbox_sha1sum"/>
58 </Compare>
59 </Aggregate>
60 </Aggregate>
61 </LinkageRule>
62 <Filter threshold="0.9"/>
63 </Interlink>
64 </Interlinks>
65 </Silk>
The complete link configuration for discovering RDF links between persons as well as publications is available online.
Linkage Rules
The linkage rule specifies how two data entities are compared for similarity. It consists of a number of comparison operators which are combined using aggregation functions.
A comparison operator evaluates two inputs and computes their similarity based on a user-defined metric. Silk provides several similarity metrics including string, numeric, date, and URI similarity. String comparison methods cover the most common ones like Jaro, Jaro-Winkler and Levenshtein.
Multiple comparisons can be aggregated using a specific aggregation method by using the <Aggregate> directive.
In the linkage rule of the experiment, we compute similarity values for the FOAF names, homepages, and mailbox hash sums (lines 34 to 61). The overall similarity value of two data entities is the weighted average of the similarity values of all comparisons. To identify a person uniquely, either a homepage or a mailbox hash sum is required. Thus, two persons are considered equal if both the names and either the homepage or the mailbox hash sum match.
Some comparison operators might be more relevant for the correct establishment of a link between two resources than others and can therefore be weighted higher. If no weight is supplied, a default weight of 1 will be assumed. As a person may be known under different names, matching homepages or mailbox hash sums are more important and therefore weighted higher (line 46).
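The way the weighted average combines the two max aggregations can be sketched in Python as follows. This is a simplified illustration and not Silk's implementation: it assumes that similarity values lie between 0 and 1, that a missing value is represented as `None`, and that a missing required operand yields no overall score. The function names are hypothetical.

```python
def aggregate_max(values):
    """Return the highest available similarity, or None if no value is available."""
    present = [value for value in values if value is not None]
    return max(present) if present else None


def aggregate_average(operands):
    """Weighted average over (value, weight, required) operands.

    Assumption: a missing required operand means no overall score can be computed.
    """
    total = 0.0
    weight_sum = 0
    for value, weight, required in operands:
        if value is None:
            if required:
                return None
            continue
        total += value * weight
        weight_sum += weight
    return total / weight_sum if weight_sum else None


# Example values: names are similar, no homepage is given, mailbox hashes are equal.
name_score = aggregate_max([0.95])
identifier_score = aggregate_max([None, 1.0])
overall = aggregate_average([
    (name_score, 1, True),        # default weight of 1
    (identifier_score, 2, True),  # weight of 2, as in line 46
])
```

With these example values, the overall score is (0.95 * 1 + 1.0 * 2) / 3, which lies above the threshold of 0.9 used in the experiment.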
Filtering
The generated links can be filtered by using the <Filter> directive. A threshold for the minimum similarity of two data items required to generate a link between them can be defined (line 62). The number of links originating from a single data item can be limited. Only the highest-rated links per source data item will remain after the filtering.
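The effect of such a filter can be sketched in Python as follows. The sketch assumes links are represented as (source, target, similarity) tuples; the function `filter_links` and its `limit` parameter are hypothetical names used for illustration and are not taken from the Silk code base.

```python
from collections import defaultdict


def filter_links(links, threshold, limit=None):
    """Keep links at or above the threshold, at most `limit` per source item."""
    by_source = defaultdict(list)
    for source, target, similarity in links:
        if similarity >= threshold:
            by_source[source].append((source, target, similarity))

    kept = []
    for candidates in by_source.values():
        # The highest-rated links come first, so truncating keeps the best ones.
        candidates.sort(key=lambda link: link[2], reverse=True)
        kept.extend(candidates if limit is None else candidates[:limit])
    return kept
```

Applying the threshold before the limit ensures that a low-rated link is never kept merely because a source item has few candidates.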
Setup of the Experiment
For the experiment, we loaded the Semantic Web Dog Food Corpus into the Silk Server. The Semantic Web Dog Food Corpus contains profiles for 3,739 persons, of which 2,580 provide either a homepage or a mailbox hash, which is required to uniquely identify them. We set up a Linked Data crawler which takes a number of FOAF profile URIs as seeds and follows linked profiles. The crawled documents are forwarded to Silk Server, which generates owl:sameAs links to known persons from the Semantic Web Dog Food Corpus. All generated links were written to an output file.
The crawler was also used to traverse the RDFa of Twitter accounts, for which the server identified the corresponding persons in the Semantic Web Dog Food Corpus, if any.
In order to show the flexibility of Silk Server, the link configuration was further enhanced to also match publications. For this purpose the crawler was employed to also follow publication links in addition to FOAF profiles.
Results of the Experiment
Generated links to FOAF profiles
First, we evaluated how exhaustive the discovered links are. For this purpose, we exploited the fact that the Semantic Web Dog Food Corpus already links 56 persons to their FOAF profile. For 51 of these persons, Silk Server was able to reconstruct the links from the stream. For some persons, multiple duplicate profiles were identified. For example, in addition to Tom Heath’s (http://data.semanticweb.org/person/tom-heath) official FOAF profile http://tomheath.com/id/me, Silk Server also identified him on http://www.eswc2006.org/people/#tom-heath. Because Silk Server in some cases found a link to a profile other than the one given in the data set, we checked all links manually for correctness. All generated links were found to be correct.
Next, we evaluated for how many persons in the Semantic Web Dog Food Corpus the server was able to generate links to a FOAF profile. In total, Silk Server found profiles for 228 persons in the data set. Thus, Silk Server discovered links to the FOAF profiles of an additional 177 persons for whom the Semantic Web Dog Food Corpus did not yet contain a link.
Generated links to Twitter accounts
For 89 persons in the Semantic Web Dog Food Corpus, Silk Server was able to find a corresponding Twitter account. Silk Server was able to detect more than one account for persons holding multiple accounts. For example, it found that Ralph Hodgson (http://data.semanticweb.org/person/ralph-hodgson) not only uses the account http://twitter.com/ralphtq but also the account http://twitter.com/oegovnews.
Generated links to publications
For 37 publications in the Semantic Web Dog Food Corpus, Silk Server was able to find the corresponding publication in the Web of Data. The number of links is lower than the number of FOAF profiles found because many persons do not link their publications in their profile. One exception is the Digital Enterprise Research Institute (DERI), which publishes the metadata about all publications as RDF (http://www.deri.ie/publications/).