Version 20, last updated by fkleedorfer at December 15, 2011 08:36 UTC

Overview

A comparison operator evaluates two inputs and computes the similarity based on a user-defined distance measure and a user-defined threshold.

The distance measure always outputs 0 for a perfect match, and a higher value for an imperfect match. Only distance values between 0 and threshold will result in a positive similarity score. Therefore it is important to know how the distance measures work and what the range of their output values is in order to set a threshold value sensibly.

Parameters

Parameter Description
required (optional) If required is true, the parent aggregation only yields a confidence value if the given inputs have values for both instances.
weight (optional) Weight of this comparison. The weight is used by some aggregations such as the weighted average aggregation.
threshold The maximum distance. For normalized distance measures, the threshold should be between 0.0 and 1.0.
distanceMeasure The distance measure to use (set through the metric attribute in the examples below). For a list of available distance measures see below.
Inputs The two inputs for the comparison.

Examples

XML

<Compare metric="levenshteinDistance" threshold="2.0" required="true">
  <TransformInput function="lowerCase">
    <Input path="?a/rdfs:label"/>
  </TransformInput>
  <TransformInput function="lowerCase">
    <Input path="?b/rdfs:label"/>
  </TransformInput>
</Compare>

Scala API

Comparison(
  id = "labels",
  required = false,
  weight = 1,
  threshold = 2.0,
  metric = LevenshteinDistance(),
  inputs = PathInput(path = Path.parse("?a/rdfs:label")) ::
           PathInput(path = Path.parse("?b/rdfs:label")) :: Nil
)

Threshold

The threshold is used to convert the computed distance to a confidence between -1.0 and 1.0. Links are generated only for confidences above 0; higher confidence values imply a higher similarity between the compared entities.
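
The exact conversion is not spelled out on this page. The following Scala sketch shows one linear mapping that is consistent with the description above: a distance of 0 yields a confidence of 1.0, a distance equal to the threshold yields 0, and larger distances are clamped at -1.0. The object and function names are hypothetical and not part of the Silk API.

```scala
object ThresholdSketch {
  // Assumed linear mapping (not confirmed by this page):
  //   distance 0              ->  1.0
  //   distance == threshold   ->  0.0
  //   distance >= 2*threshold -> -1.0 (clamped)
  def confidence(distance: Double, threshold: Double): Double = {
    require(threshold > 0.0, "threshold must be positive")
    math.max(-1.0, 1.0 - distance / threshold)
  }

  // A link is generated only for confidences strictly above 0.
  def generatesLink(distance: Double, threshold: Double): Boolean =
    confidence(distance, threshold) > 0.0
}
```

Under this assumed mapping and a threshold of 2.0, a Levenshtein distance of 1 gives a confidence of 0.5, while a distance of 2 gives 0 and therefore no link.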


Distance Measures

Character-Based Distance Measures

Character-based distance measures compare strings on the character level. They are well suited for handling typographical errors.

Measure Description Normalized
levenshteinDistance Levenshtein distance. The minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character. No
levenshtein The Levenshtein distance normalized to the interval [0,1]. Yes
jaro Jaro distance metric. Simple distance metric originally developed to compare person names. Yes
jaroWinkler Jaro-Winkler distance measure. The Jaro–Winkler distance metric is designed for and best suited to short strings such as person names. Yes
equality 0 if strings are equal, 1 otherwise. Yes
inequality 1 if strings are equal, 0 otherwise. Yes

Example:

<Compare metric="levenshteinDistance" threshold="2">
  <Input path="?a/rdfs:label" />
  <Input path="?b/gn:name" />
</Compare>
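
To make the difference between levenshteinDistance and levenshtein concrete, here is a minimal Scala sketch of the textbook edit distance with unit costs. It is not Silk's implementation. The normalization by the length of the longer string is an assumption, and the object and function names are chosen for illustration only.

```scala
object LevenshteinSketch {
  // Dynamic-programming edit distance with unit costs for
  // insertion, deletion and substitution of a single character.
  def levenshteinDistance(a: String, b: String): Int = {
    // prev(j) = distance between the first i-1 characters of a and the first j of b
    var prev = Array.range(0, b.length + 1)
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // deleting the first i characters of a
      for (j <- 1 to b.length) {
        val substitution = prev(j - 1) + (if (a(i - 1) == b(j - 1)) 0 else 1)
        val deletion = prev(j) + 1
        val insertion = curr(j - 1) + 1
        curr(j) = math.min(math.min(deletion, insertion), substitution)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Assumed normalization to [0,1]: divide by the length of the longer string.
  def levenshtein(a: String, b: String): Double = {
    val longest = math.max(a.length, b.length)
    if (longest == 0) 0.0 else levenshteinDistance(a, b).toDouble / longest
  }
}
```

The unnormalized measure returns whole edit counts, so a threshold such as 2 is meaningful; the normalized variant returns values in [0,1] and needs a threshold in that range.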

Token-Based Distance Measures

While character-based distance measures work well for typographical
errors, there are a number of tasks where token-based distance measures are better suited:

  • Strings where parts are reordered, e.g. “John Doe” and “Doe, John”
  • Texts consisting of multiple words

Measure Description Normalized
jaccard Jaccard distance coefficient. Yes
dice Dice distance coefficient. Yes
softjaccard Soft Jaccard similarity coefficient. Same as Jaccard distance, but values within a Levenshtein distance of ‘maxDistance’ are considered equivalent. Yes

Example:

<Compare metric="jaccard" threshold="0.2">
  <TransformInput function="tokenize">
    <Input path="?a/rdfs:label" />
  </TransformInput>
  <TransformInput function="tokenize">
    <Input path="?b/gn:name" />
  </TransformInput>
</Compare>
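
The token-based measures can be illustrated with the standard set-based definitions. The Scala sketch below is a hedged illustration rather than Silk's code: the tokenize helper is a hypothetical stand-in for the tokenize transformation (it simply splits on anything that is not a letter or digit), and two empty token sets are assumed to have distance 0.

```scala
object TokenDistanceSketch {
  // Hypothetical tokenizer: split on runs of non-letter, non-digit characters.
  def tokenize(s: String): Set[String] =
    s.split("[^\\p{L}\\p{N}]+").filter(_.nonEmpty).toSet

  // Jaccard distance: 1 - |A ∩ B| / |A ∪ B|
  def jaccard(a: Set[String], b: Set[String]): Double = {
    val unionSize = (a | b).size
    if (unionSize == 0) 0.0 else 1.0 - (a & b).size.toDouble / unionSize
  }

  // Dice distance: 1 - 2 * |A ∩ B| / (|A| + |B|)
  def dice(a: Set[String], b: Set[String]): Double = {
    val total = a.size + b.size
    if (total == 0) 0.0 else 1.0 - 2.0 * (a & b).size / total
  }
}
```

Because the tokens are compared as sets, “John Doe” and “Doe, John” produce the same token set and a distance of 0, which a character-based measure would not.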

Special Purpose Distance Measures

Silk offers a number of distance measures which are designed to compare specific types of data, e.g. numeric values.

Measure Description Normalized
num(float minValue, float maxValue) Computes the numeric difference between two numbers. Parameters: minValue, maxValue: the minimum and maximum values which occur in the data source. No
date Computes the distance between two dates (“YYYY-MM-DD” format). Returns the difference in days. No
dateTime Computes the distance between two date time values (xsd:dateTime format). Returns the difference in seconds. No
wgs84(string unit, string curveStyle) Computes the geographical distance between two points. Parameters: unit: the unit in which the distance is measured; allowed values are “meter” or “m” (default), “kilometer” or “km”. Author: Konrad Höffner (MOLE subgroup of Research Group AKSW, University of Leipzig). No

Example:

<Compare metric="wgs84" threshold="50">
  <Input path="?a/wgs84:geometry" />
  <Input path="?b/wgs84:geometry" />
  <Param name="unit" value="km"/>
</Compare>
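
Since the special purpose measures are not normalized, the threshold has to be given in the unit the measure returns (days, seconds, meters or kilometers). As a rough illustration of the date and dateTime measures, the Scala sketch below computes such differences with java.time. It is not Silk's implementation: the function names are hypothetical, the absolute value is an assumption so that the result behaves like a distance, and the dateTime variant assumes values that carry a time zone offset.

```scala
import java.time.{LocalDate, OffsetDateTime}
import java.time.temporal.ChronoUnit

object DateDistanceSketch {
  // Difference in whole days between two "YYYY-MM-DD" dates.
  def dateDistance(a: String, b: String): Long =
    math.abs(ChronoUnit.DAYS.between(LocalDate.parse(a), LocalDate.parse(b)))

  // Difference in seconds between two date time values with an offset,
  // e.g. "2011-12-15T08:36:00Z" (assumption: an offset is always present).
  def dateTimeDistance(a: String, b: String): Long =
    math.abs(ChronoUnit.SECONDS.between(OffsetDateTime.parse(a), OffsetDateTime.parse(b)))
}
```

With these definitions, a date comparison with a threshold of 30 would only give positive confidences to dates that are less than 30 days apart.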