Version 13, last updated by Robert Isele at November 07, 2011 19:34 UTC

Overview

As different datasets usually use different data formats, a transformation can be used to normalize the values prior to comparison.

Parameters

TODO

Examples

XML

<TransformInput function="lowerCase">
  <TransformInput function="replace">
    <Input path="?a/rdfs:label" />
    <Param name="search" value="_" />
    <Param name="replace" value=" " />
  </TransformInput>
</TransformInput>

Scala API

// Replace underscores in the label with spaces
TransformInput(
  id = "ReplaceUnderscores",
  transformer = ReplaceTransformer("_", " "),
  inputs = PathInput(path = Path.parse("?a/rdfs:label"))
)

Transformations

Silk provides the following transformation and normalization functions:

Function and parameters                              Description
removeBlanks                                         Remove whitespace from a string.
removeSpecialChars                                   Remove special characters (including punctuation) from a string.
lowerCase                                            Convert a string to lower case.
upperCase                                            Convert a string to upper case.
capitalize(allWords)                                 Capitalize the string, i.e. convert its first character to upper case. If ‘allWords’ is set to true, the first character of every word is capitalized, not only the first character of the string. By default ‘allWords’ is set to false.
stem                                                 Apply word stemming to the string.
alphaReduce                                          Strip all non-alphabetic characters from a string.
numReduce                                            Strip all non-numeric characters from a string.
replace(string search, string replace)               Replace all occurrences of “search” with “replace” in a string.
regexReplace(string regex, string replace)           Replace all matches of the regular expression “regex” with “replace” in a string.
stripPrefix                                          Strip the prefix from a string.
stripPostfix                                         Strip the postfix from a string.
stripUriPrefix                                       Strip the URI prefix (e.g. http://dbpedia.org/resource/) from a string.
concat                                               Concatenate the strings from two inputs.
logarithm([base])                                    Apply the logarithm function to all numeric values. Non-numeric values are left unchanged. If ‘base’ is not defined, it defaults to 10.
convert(string sourceCharset, string targetCharset)  Convert the string from “sourceCharset” to “targetCharset”.
tokenize([regex])                                    Split the string into tokens. Splits at all matches of “regex” if provided and at whitespace otherwise.
removeValues(blacklist)                              Remove specific values (e.g. stop words) from the value set. ‘blacklist’ is a comma-separated list of words.
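
As a rough sketch of how several of these functions might be chained, the fragment below reuses the XML syntax from the Examples section. It assumes that each parameter is passed as a Param element whose name matches the parameter name listed above (‘regex’, ‘replace’, ‘blacklist’). The regular expression and the stop-word list are purely illustrative values, not part of Silk.

```xml
<!-- Sketch under the assumptions stated above: Param names are taken from
     the function list; the regex and the blacklist values are illustrative. -->
<TransformInput function="removeValues">
  <TransformInput function="tokenize">
    <TransformInput function="lowerCase">
      <TransformInput function="regexReplace">
        <Input path="?a/rdfs:label" />
        <Param name="regex" value="[0-9]+" />
        <Param name="replace" value="" />
      </TransformInput>
    </TransformInput>
  </TransformInput>
  <Param name="blacklist" value="the,of,and" />
</TransformInput>
```

As in the nested example in the Examples section, the innermost transformation is applied first: digits are removed from the label, the result is converted to lower case and split at whitespace, and finally the listed stop words are dropped from the resulting value set.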