Version 14, last updated by Andrea Matteini at 03 Apr 15:05 UTC
This wiki contains development information about the LDIF project.
Information about the current public LDIF release is provided on the main LDIF homepage.
About
The Web of Linked Data grows rapidly and contains data from a wide range of different domains, including life science data, geographic data, government data, library and media data, as well as cross-domain data sets such as DBpedia or Freebase. Linked Data applications that want to consume data from this global data space face the challenges that:
- data sources use a wide range of different RDF vocabularies to represent data about the same type of entity;
- the same real-world entity, for instance a person or a place, is identified with different URIs within different data sources;
- data about the same real-world entity coming from different sources may contain conflicting value, for example the single value attribute population for a specific country can have multiple, different values after merging data from different sources.
This usage of different vocabularies as well as the usage of URI aliases makes it very cumbersome for an application developer to write SPARQL queries against Web data which originates from multiple sources. In order to ease using Web data in the application context, it is thus advisable to translate data to a single target vocabulary (vocabulary mapping) and to replace URI aliases with a single target URI on the client side (identity resolution), before starting to ask SPARQL queries against the data.
Up-till-now, there have not been any integrated tools that help application developers with these tasks. With LDIF, we try to fill this gap and provide an an open-source Linked Data Integration Framework that can be used by Linked Data applications to translate Web data and normalize URI while keeping track of data provenance.
The LDIF integration pipeline consists of the following steps:
- Collect Data: Access modules locally replicate data sets via file download, crawling or SPARQL;
- Map to Schema: An expressive mapping language allows for translating data from the various vocabularies that are used on the Web into a consistent, local target vocabulary;
- Resolve Identities: An identity resolution component discovers URI aliases in the input data and replaces them with a single target URI based on user-provided matching heuristics;
- Quality Assessment and Data Fusion: A data cleansing component filters data according to different quality assessment policies and provides for fusing data according to different conflict resolution methods;
- Output: LDIF outputs the integrated data and that can be written to file or to a QuadStore. For provenance tracking, LDIF employs the Named Graphs data model.
The figure below shows the schematic architecture of Linked Data applications that implement the crawling/data warehousing pattern. The figure highlights the steps of the data integration process that are currently supported by LDIF.

Next steps
Over the next months, we plan to extend LDIF along the following lines:
- Flexible integration workflow. Currently the integration flow is static and can only be influenced by predefined configuration parameters. We plan to make the workflow and its configuration more flexible in order to make it easier to include additional modules that cover other data integration aspects.