Overview

Because different datasets usually use different data formats, a transformation can be applied to normalize values before they are compared.

Examples

XML

<TransformInput function="lowerCase">
  <TransformInput function="replace">
    <Input path="?a/rdfs:label" />
    <Param name="search" value="_" />
    <Param name="replace" value=" " />
  </TransformInput>
</TransformInput>
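In the nested XML above, the inner transformation (replace) runs first and the outer one (lowerCase) is applied to its output. The following is a minimal sketch of that evaluation order in plain Scala. The object NestedTransformSketch and its members are hypothetical stand-ins written for illustration; they are not the framework's transformer classes.

```scala
// Hypothetical illustration only: plain Scala functions, not the framework API.
object NestedTransformSketch {
  // Stand-in for the "replace" function with its "search" and "replace" parameters.
  def replace(search: String, replacement: String): String => String =
    value => value.replace(search, replacement)

  // Stand-in for the "lowerCase" function.
  val lowerCase: String => String = value => value.toLowerCase

  // The inner transformation is applied first, then the outer one.
  val normalize: String => String = replace("_", " ").andThen(lowerCase)

  // A path can yield several values, so the transformation is mapped over all of them.
  def normalizeAll(values: Seq[String]): Seq[String] = values.map(normalize)
}
```

Under these assumptions, a label such as "New_York_City" and a label such as "new york city" from another dataset normalize to the same string and can then be compared directly.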

Scala API

TransformInput(
  id = "ReplaceUnderscores",
  transformer = ReplaceTransformer("_", " "),
  inputs = PathInput(path = Path.parse("?a/rdfs:label"))
)

Transformations

The following transformation and normalization functions are available by default:

Function and parameters	Description
removeBlanks	Remove whitespace from a string.
removeSpecialChars	Remove special characters from a string.
lowerCase	Convert a string to lower case.
upperCase	Convert a string to upper case.
capitalize	Capitalizes the string, i.e., converts the first character to upper case. If ‘allWords’ is set to true, the first character of every word is capitalized rather than only the first character of the string. By default, ‘allWords’ is set to false.
stem	Apply word stemming to the string.
alphaReduce	Strip all non-alphabetic characters from a string.
numReduce	Strip all non-numeric characters from a string.
replace	Replace all occurrences of “search” with “replace” in a string.
regexReplace	Replace all occurrences of a regex “regex” with “replace” in a string.
stripPrefix	Strip the prefix from a string.
stripPostfix	Strip the postfix from a string.
stripUriPrefix	Strip the URI prefix from a string.
concat	Concatenates strings from two inputs.
logarithm	Transforms all numbers by applying the logarithm function. Non-numeric values are left unchanged. If base is not defined, it defaults to 10.
convert	Converts the string from “sourceCharset” to “targetCharset”.
tokenize	Splits the string into tokens. Splits at all matches of “regex” if provided and at whitespaces otherwise.
removeValues	Removes specific values from the value set. ‘blacklist’ is a comma-separated list of words.
removeParentheses	Removes all parentheses including their content, e.g., transforms ‘Berlin (City)’ → ‘Berlin’.
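
To make the descriptions in the table more concrete, the following sketch shows how a few of the listed functions could behave, written as plain Scala functions. It is an illustration under assumptions, not the framework's implementation: the object TransformationSketches is hypothetical, and details such as the treatment of whitespace before a parenthesis or the exact output format of numbers are guesses.

```scala
import scala.util.Try

// Hypothetical illustration only: approximates the table descriptions in plain Scala.
object TransformationSketches {
  // capitalize: upper-case the first character; with allWords = true, do so for every word.
  def capitalize(value: String, allWords: Boolean = false): String =
    if (allWords) value.split(" ").map(_.capitalize).mkString(" ")
    else value.capitalize

  // alphaReduce: keep only letters (assumption: any Unicode letter counts as alphabetic).
  def alphaReduce(value: String): String = value.filter(_.isLetter)

  // numReduce: keep only digits.
  def numReduce(value: String): String = value.filter(_.isDigit)

  // tokenize: split at matches of the regex if given, at whitespace otherwise.
  def tokenize(value: String, regex: String = "\\s+"): Seq[String] =
    value.split(regex).toSeq.filter(_.nonEmpty)

  // removeParentheses: drop each parenthesized group (assumption: preceding whitespace is dropped too).
  def removeParentheses(value: String): String =
    value.replaceAll("\\s*\\([^)]*\\)", "").trim

  // logarithm: transform numeric values, leave non-numeric values unchanged; base defaults to 10.
  def logarithm(value: String, base: Double = 10.0): String =
    Try(value.toDouble).toOption match {
      case Some(number) => (math.log(number) / math.log(base)).toString
      case None         => value
    }
}
```

For example, TransformationSketches.tokenize("Berlin City") yields the two tokens "Berlin" and "City", while TransformationSketches.logarithm("abc") returns its input unchanged because it is not a number.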