Version 13, last updated by Robert Isele at November 07, 2011 19:34 UTC
Transformation
Overview
As different datasets usually use different data formats, a transformation can be used to normalize the values prior to comparison.
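To illustrate why normalization matters before comparison, here is a minimal sketch in plain Scala that does not use the Silk API. The object name `NormalizationSketch`, the helper `normalize`, and the sample labels are hypothetical; the helper only imitates what chaining a `replace` and a `lowerCase` transformation would do to a label.

```scala
import java.util.Locale

object NormalizationSketch {
  // Hypothetical helper (not part of Silk): imitates replace("_", " ")
  // followed by lowerCase on a single value.
  def normalize(label: String): String =
    label.replace("_", " ").toLowerCase(Locale.ROOT)

  def main(args: Array[String]): Unit = {
    val a = "Berlin_Wall" // sample label as one dataset might write it
    val b = "berlin wall" // the same label as another dataset might write it

    println(a == b)                       // false: the raw values differ
    println(normalize(a) == normalize(b)) // true: the normalized values match
  }
}
```

Without the normalization step, an exact string comparison would treat the two labels as different even though they differ only in formatting.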
Parameters
TODO
Examples
XML
<TransformInput function="lowerCase">
  <TransformInput function="replace">
    <Input path="?a/rdfs:label" />
    <Param name="search" value="_" />
    <Param name="replace" value=" " />
  </TransformInput>
</TransformInput>
Scala API
TransformInput(
  id = "ReplaceUnderscores",
  transformer = ReplaceTransformer("_", " "),
  inputs = PathInput(path = Path.parse("?a/rdfs:label"))
)
Transformations
Silk provides the following transformation and normalization functions:
| Function and parameters | Description |
|---|---|
| removeBlanks | Remove whitespace from a string. |
| removeSpecialChars | Remove special characters (including punctuation) from a string. |
| lowerCase | Convert a string to lower case. |
| upperCase | Convert a string to upper case. |
| capitalize(allWords) | Capitalizes the string, i.e. converts the first character to upper case. If ‘allWords’ is set to true, the first character of every word is capitalized, not only the first character of the string. By default, ‘allWords’ is set to false. |
| stem | Apply word stemming to the string. |
| alphaReduce | Strip all non-alphabetic characters from a string. |
| numReduce | Strip all non-numeric characters from a string. |
| replace(string search, string replace) | Replace all occurrences of “search” with “replace” in a string. |
| regexReplace(string regex, string replace) | Replace all matches of the regular expression “regex” with “replace” in a string. |
| stripPrefix | Strip the prefix from a string. |
| stripPostfix | Strip the postfix from a string. |
| stripUriPrefix | Strip the URI prefix (e.g. http://dbpedia.org/resource/) from a string. |
| concat | Concatenates strings from two inputs. |
| logarithm([base]) | Transforms all numbers by applying the logarithm function. Non-numeric values are left unchanged. If base is not defined, it defaults to 10. |
| convert(string sourceCharset, string targetCharset) | Converts the string from “sourceCharset” to “targetCharset”. |
| tokenize([regex]) | Splits the string into tokens. Splits at all matches of “regex” if provided, and at whitespace otherwise. |
| removeValues(blacklist) | Removes specific values (e.g. stop words) from the value set. ‘blacklist’ is a comma-separated list of words. |
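
The way several of these functions combine can be sketched in plain Scala. This is not the Silk implementation: the object `TransformationChainSketch` and its helpers are hypothetical and only mimic the behavior described in the table for `tokenize` and `removeValues`. Dropping empty tokens and trimming blacklist entries are assumptions of this sketch, not documented Silk behavior.

```scala
import java.util.Locale

object TransformationChainSketch {
  // Mimics tokenize([regex]): split at all matches of `regex`,
  // or at whitespace when no regex is given.
  // Assumption: empty tokens are dropped.
  def tokenize(value: String, regex: String = "\\s+"): Seq[String] =
    value.split(regex).toSeq.filter(_.nonEmpty)

  // Mimics removeValues(blacklist): `blacklist` is a comma-separated list of words.
  // Assumption: whitespace around each blacklist entry is ignored.
  def removeValues(values: Seq[String], blacklist: String): Seq[String] = {
    val blocked = blacklist.split(",").map(_.trim).toSet
    values.filterNot(blocked.contains)
  }

  def main(args: Array[String]): Unit = {
    // Imitate replace("_", " ") and lowerCase first, then tokenize and filter.
    val normalized = "The_Tower_of_London".replace("_", " ").toLowerCase(Locale.ROOT)
    val kept = removeValues(tokenize(normalized), "the,of")
    println(kept.mkString(", ")) // prints "tower, london"
  }
}
```

Each step consumes the output of the previous one, which mirrors how nested `TransformInput` elements feed an inner transformation into an outer one.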