Version 20, last updated by fkleedorfer at December 15, 2011 08:36 UTC
Comparison
Overview
A comparison operator evaluates two inputs and computes the similarity based on a user-defined distance measure and a user-defined threshold.
The distance measure always outputs 0 for a perfect match, and a higher value for an imperfect match. Only distance values between 0 and threshold will result in a positive similarity score. Therefore it is important to know how the distance measures work and what the range of their output values is in order to set a threshold value sensibly.
Parameters
| Parameter | Description |
|---|---|
| required (optional) | If required is true, the parent aggregation only yields a confidence value if the given inputs have values for both instances. |
| weight (optional) | Weight of this comparison. The weight is used by some aggregations such as the weighted average aggregation. |
| threshold | The maximum distance. For normalized distance measures, the threshold should be between 0.0 and 1.0. |
| distanceMeasure | The distance measure to use (the metric attribute in the examples below). For a list of available distance measures see below. |
| Inputs | The two inputs for the comparison. |
Examples
XML
<Compare metric="levenshteinDistance" threshold="2.0" required="true">
  <TransformInput function="lowerCase">
    <Input path="?a/rdfs:label"/>
  </TransformInput>
  <TransformInput function="lowerCase">
    <Input path="?b/rdfs:label"/>
  </TransformInput>
</Compare>
Scala API
Comparison(
  id = "labels",
  required = false,
  weight = 1,
  threshold = 2.0,
  metric = LevenshteinDistance(),
  inputs = PathInput(path = Path.parse("?a/rdfs:label")) ::
           PathInput(path = Path.parse("?b/rdfs:label")) :: Nil
)
Threshold
The threshold is used to convert the computed distance to a confidence between -1.0 and 1.0. Links are generated for confidences above 0; higher confidence values imply a higher similarity between the compared entities.
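The exact conversion formula is not given on this page. As a minimal sketch, assume a linear mapping in which a distance of 0 yields a confidence of 1.0, a distance equal to the threshold yields 0.0, and larger distances are clamped at -1.0. The object name ThresholdSketch and the function confidence are hypothetical and not part of the Silk API.

```scala
object ThresholdSketch {
  /** Assumed linear mapping from distance to confidence (not taken from Silk):
    * distance 0 -> 1.0, distance == threshold -> 0.0, clamped at -1.0. */
  def confidence(distance: Double, threshold: Double): Double = {
    require(distance >= 0.0, "distance must be non-negative")
    require(threshold >= 0.0, "threshold must be non-negative")
    if (threshold == 0.0) {
      // Degenerate case: only a perfect match is accepted.
      if (distance == 0.0) 1.0 else -1.0
    } else {
      math.max(-1.0, 1.0 - distance / threshold)
    }
  }

  def main(args: Array[String]): Unit = {
    // Threshold 2.0, as in the levenshteinDistance examples on this page.
    println(confidence(0.0, 2.0)) // perfect match
    println(confidence(1.0, 2.0)) // one edit: still positive, so a link is generated
    println(confidence(2.0, 2.0)) // exactly at the threshold: not above 0, no link
    println(confidence(5.0, 2.0)) // far beyond the threshold: clamped
  }
}
```

Under this assumption, only distances strictly below the threshold lead to a link, which matches the statement that the threshold is the maximum distance.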

Distance Measures
Character-Based Distance Measures
Character-based distance measures compare strings on the character level. They are well suited for handling typographical errors.
| Measure | Description | Normalized |
|---|---|---|
| levenshteinDistance | Levenshtein distance. The minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. | No |
| levenshtein | The Levenshtein distance normalized to the interval [0,1]. | Yes |
| jaro | Jaro distance metric. Simple distance metric originally developed to compare person names. | Yes |
| jaroWinkler | Jaro-Winkler distance measure. The Jaro-Winkler distance metric is designed for, and best suited to, short strings such as person names. | Yes |
| equality | 0 if strings are equal, 1 otherwise. | Yes |
| inequality | 1 if strings are equal, 0 otherwise. | Yes |
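To get a feeling for the value ranges of the two Levenshtein variants, the following sketch computes the plain edit distance and a normalized form. The normalization by the length of the longer string is an assumption, as this page does not state how levenshtein is normalized; LevenshteinSketch and its functions are hypothetical names and not the Silk implementation.

```scala
object LevenshteinSketch {
  /** Number of single-character insertions, deletions or substitutions
    * needed to turn `a` into `b` (dynamic programming over two rows). */
  def distance(a: String, b: String): Int = {
    var prev: Array[Int] = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(curr(j - 1) + 1, prev(j) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  /** Assumption: divide by the length of the longer string to reach [0,1]. */
  def normalized(a: String, b: String): Double = {
    val maxLen = math.max(a.length, b.length)
    if (maxLen == 0) 0.0 else distance(a, b).toDouble / maxLen
  }

  def main(args: Array[String]): Unit = {
    println(distance("kitten", "sitting"))   // 3 edits
    println(normalized("kitten", "sitting")) // 3 edits over 7 characters
  }
}
```

The unnormalized distance grows with string length, so a threshold such as 2 means "at most two edits", while a threshold for the normalized variant is a fraction of the string length.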
Example:
<Compare metric="levenshteinDistance" threshold="2">
  <Input path="?a/rdfs:label" />
  <Input path="?b/gn:name" />
</Compare>
Token-Based Distance Measures
While character-based distance measures work well for typographical errors, there are a number of tasks where token-based distance measures are better suited:
- Strings where parts are reordered, e.g. “John Doe” and “Doe, John”
- Texts consisting of multiple words
| Measure | Description | Normalized |
|---|---|---|
| jaccard | Jaccard distance coefficient. | Yes |
| dice | Dice distance coefficient. | Yes |
| softjaccard | Soft Jaccard similarity coefficient. Same as the Jaccard distance, but values within a Levenshtein distance of ‘maxDistance’ are considered equivalent. | Yes |
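The following sketch illustrates why token-based measures handle reordered parts well. It uses the common set-based definitions (Jaccard distance as 1 minus intersection over union, Dice distance as 1 minus twice the intersection over the sum of the set sizes), which are assumed here rather than taken from the Silk sources. The tokenizer is also an assumption: it lower-cases the input and splits on anything that is not a letter or digit. All names are hypothetical.

```scala
object TokenDistanceSketch {
  // Assumption: tokens are lower-cased runs of letters and digits.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  /** Jaccard distance: 1 - |A ∩ B| / |A ∪ B| (0.0 for two empty sets). */
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = a.union(b).size
    if (unionSize == 0) 0.0 else 1.0 - a.intersect(b).size.toDouble / unionSize
  }

  /** Dice distance: 1 - 2|A ∩ B| / (|A| + |B|) (0.0 for two empty sets). */
  def dice(a: Set[String], b: Set[String]): Double = {
    val total = a.size + b.size
    if (total == 0) 0.0 else 1.0 - 2.0 * a.intersect(b).size / total
  }

  def main(args: Array[String]): Unit = {
    // Reordered name parts produce identical token sets, so the distance is 0.
    println(jaccard(tokens("John Doe"), tokens("Doe, John")))
    // One of three distinct tokens differs: about 0.33 (Jaccard), about 0.2 (Dice).
    println(jaccard(tokens("New York City"), tokens("York City")))
    println(dice(tokens("New York City"), tokens("York City")))
  }
}
```

Both measures stay within [0,1], so thresholds such as the 0.2 used in the example on this page express the tolerated share of non-matching tokens.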
Example:
<Compare metric="jaccard" threshold="0.2">
  <TransformInput function="tokenize">
    <Input path="?a/rdfs:label" />
  </TransformInput>
  <TransformInput function="tokenize">
    <Input path="?b/gn:name" />
  </TransformInput>
</Compare>
Special Purpose Distance Measures
Silk offers a number of distance measures which are designed to compare specific types of data, e.g. numeric values.
| Measure | Description | Normalized |
|---|---|---|
| num(float minValue, float maxValue) | Computes the numeric difference between two numbers. Parameters: minValue, maxValue: the minimum and maximum values which occur in the data source. | No |
| date | Computes the distance between two dates (“YYYY-MM-DD” format). Returns the difference in days. | No |
| dateTime | Computes the distance between two date time values (xsd:dateTime format). Returns the difference in seconds. | No |
| wgs84(string unit, string curveStyle) | Computes the geographical distance between two points. Parameters: unit: the unit in which the distance is measured. Allowed values: “meter” or “m” (default), “kilometer” or “km”. Author: Konrad Höffner (MOLE subgroup of Research Group AKSW, University of Leipzig). | No |
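Because date and dateTime are not normalized, the threshold has to be given in days or seconds respectively. The sketch below shows the kind of values to expect, using the java.time classes of the Java standard library. It assumes the distance is the absolute difference and that dateTime values carry a time zone offset; DateDistanceSketch and its functions are hypothetical names, not the Silk implementation.

```scala
import java.time.{LocalDate, OffsetDateTime}
import java.time.temporal.ChronoUnit

object DateDistanceSketch {
  /** Absolute difference in days between two "YYYY-MM-DD" dates. */
  def days(a: String, b: String): Long =
    math.abs(ChronoUnit.DAYS.between(LocalDate.parse(a), LocalDate.parse(b)))

  /** Absolute difference in seconds between two date time values.
    * Assumption: both values include an offset such as "Z" or "+01:00". */
  def seconds(a: String, b: String): Long =
    math.abs(ChronoUnit.SECONDS.between(OffsetDateTime.parse(a), OffsetDateTime.parse(b)))

  def main(args: Array[String]): Unit = {
    println(days("2011-12-15", "2011-12-25"))                       // 10 days apart
    println(seconds("2011-12-15T08:36:00Z", "2011-12-15T09:36:00Z")) // one hour apart
  }
}
```

A threshold of 30 on the date measure would therefore tolerate dates up to about a month apart, while the same tolerance on dateTime has to be expressed in seconds.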
Example:
<Compare metric="wgs84" threshold="50">
  <Input path="?a/wgs84:geometry" />
  <Input path="?b/wgs84:geometry" />
  <Param name="unit" value="km"/>
</Compare>
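To judge what a threshold such as 50 km means in practice, a great-circle distance can be estimated with the haversine formula. This is only a sketch: this page does not say which formula wgs84 uses (the curveStyle parameter suggests more than one is available), and the mean Earth radius of 6,371 km is an assumed constant. Wgs84Sketch and haversineKm are hypothetical names.

```scala
object Wgs84Sketch {
  // Assumed mean Earth radius in kilometers.
  val EarthRadiusKm = 6371.0

  /** Great-circle distance in kilometers between two points given as
    * latitude/longitude in decimal degrees, using the haversine formula. */
  def haversineKm(lat1: Double, lon1: Double, lat2: Double, lon2: Double): Double = {
    val dLat = math.toRadians(lat2 - lat1)
    val dLon = math.toRadians(lon2 - lon1)
    val h = math.pow(math.sin(dLat / 2), 2) +
      math.cos(math.toRadians(lat1)) * math.cos(math.toRadians(lat2)) *
        math.pow(math.sin(dLon / 2), 2)
    // Guard against rounding errors pushing h slightly above 1.
    2 * EarthRadiusKm * math.asin(math.sqrt(math.min(1.0, h)))
  }

  def main(args: Array[String]): Unit = {
    // Approximate coordinates of Berlin and Leipzig, roughly 150 km apart.
    println(haversineKm(52.52, 13.405, 51.34, 12.375))
  }
}
```

With a threshold of 50 km, two points this far apart would lie well beyond the maximum distance and would not be linked.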