Dataset Integration Workflow

This is the entry point to the integrated version of the Dutch historical censuses (1795-1971) in RDF. The source data is available at volkstellingen.nl as a collection of disparate Excel files that are very difficult to query systematically. We refer to such a collection as a messy spreadsheet collection (MSC). To integrate this MSC on the Web, we use the Integrator, a framework for integrating any kind of MSC using Web technology. To produce an integrated version of this MSC in RDF, we run the Integrator with this configuration. The result is served through this SPARQL endpoint and this YASGUI instance with examples (use the named graphs <urn:graph:cedar:raw-data>, <urn:graph:cedar:rules> and <urn:graph:cedar:release>).
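For instance, a query along the following lines counts the harmonized observations in the release graph. This is a minimal sketch: it assumes the release data is modelled with the RDF Data Cube vocabulary (qb:Observation).

```sparql
PREFIX qb: <http://purl.org/linked-data/cube#>

# Count the harmonized observations published in the release graph.
SELECT (COUNT(?obs) AS ?n)
FROM <urn:graph:cedar:release>
WHERE {
  ?obs a qb:Observation .
}
```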

Pipeline

The following picture illustrates the data integration process. Starting from the Excel files that contain the transcriptions of the census books, we generate a set of raw RDF files. This process requires manual input (red arrow), which consists of annotating the cells with the type of content they contain; this is necessary to deal with the heterogeneity of the files. Once the raw RDF files are created, they are integrated following a number of harmonization rules produced by historians. Any corrections to the content are also applied at that stage.

Data integration workflow

Data Location Definitions

Data in this MSC is arbitrarily located in several places of the spreadsheet layout. To precisely define where data observations and dimensions are located, we mark up the source data with styles. A sample of this markup can be found here.
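In the raw RDF, each cell can then be retrieved together with the style it was marked with. The query below is illustrative only: the ex: property names are assumptions about how the raw graph annotates cells, not the dataset's actual vocabulary.

```sparql
# Illustrative only: the ex: properties (cell value and markup style)
# are assumptions, not the dataset's actual raw-data vocabulary.
PREFIX ex: <http://example.org/tablink#>

SELECT ?cell ?style ?value
FROM <urn:graph:cedar:raw-data>
WHERE {
  ?cell ex:style ?style ;   # e.g. "Data", "ColHeader", "RowHeader"
        ex:value ?value .
}
LIMIT 10
```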

Conciliation Rules

Dimensions in this MSC are implicitly defined. To make them explicit, we generate a set of context-aware conciliation mappings that can be easily written by domain experts. Example mapping files for several dimensions can be found here. A master metadata file is used as an index to all defined mapping files.
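Conceptually, each mapping pairs a source label with a harmonized dimension value. The sketch below queries the rules graph for such pairs; the ex: predicates are hypothetical stand-ins for the actual mapping vocabulary.

```sparql
# Illustrative: the ex: predicates are hypothetical stand-ins for how a
# conciliation mapping (source label -> harmonized value) is stored.
PREFIX ex: <http://example.org/harmonization#>

SELECT ?sourceLabel ?harmonizedValue
FROM <urn:graph:cedar:rules>
WHERE {
  ?mapping ex:sourceLabel ?sourceLabel ;
           ex:harmonizedValue ?harmonizedValue .
}
LIMIT 10
```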

Measurement Transformations

Data in this MSC needs to be transformed in order to be fully integrated. To this end, we extend SPARQL into SPARQLSS (SPARQL Speaks Statistics), a Web-friendly way of transforming Web data without leaving the Web ecosystem. An example implementation that imputes missing values for this MSC can be found here.
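As a rough illustration of the idea, a SPARQLSS-style imputation could look like the sketch below. The ss:impute extension function and the ex:population measure property are hypothetical; refer to the linked implementation for the actual functions.

```sparql
# A minimal sketch, not the actual SPARQLSS syntax: ss:impute is a
# hypothetical statistical extension function, and ex:population a
# hypothetical measure property.
PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX ss: <http://example.org/sparqlss#>
PREFIX ex: <http://example.org/measure#>

CONSTRUCT { ?obs ex:population ?imputed . }
WHERE {
  ?obs a qb:Observation .
  # Only observations that lack the measure get an imputed value.
  FILTER NOT EXISTS { ?obs ex:population ?any }
  BIND (ss:impute(?obs) AS ?imputed)
}
```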

Error Detection

To detect data errors and inconsistencies after the execution of the previous stages, we use Linked Edit Rules (LER). LER allow us to encode domain constraints with which the data cubes must comply. First, we generate this set of LER using domain expert knowledge. The results of checking these LER against the data, containing 44,207 observations that do not meet these LER, are available here.
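A typical census edit rule states, for example, that the male and female counts must add up to the reported total. Expressed as a consistency check in plain SPARQL (with illustrative ex: property names, not the actual CEDAR vocabulary), it could look like this:

```sparql
# Illustrative ex: properties; the actual CEDAR dimension and measure
# IRIs differ. The query lists every slice violating the rule
# males + females = total.
PREFIX ex: <http://example.org/census#>

SELECT ?slice ?males ?females ?total
WHERE {
  ?slice ex:males ?males ;
         ex:females ?females ;
         ex:total ?total .
  FILTER (?males + ?females != ?total)
}
```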

Additional Resources

The following list contains links to additional resources contributed by various users:

- Links to external datasets

Provenance

Use the example SPARQL queries (an API is on the way) to get the full provenance trace associated with any observation.
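A provenance lookup for a single observation might look like the following. This sketch assumes the trace is expressed with the W3C PROV-O vocabulary; the placeholder IRI stands for the observation of interest.

```sparql
# A sketch assuming PROV-O; <http://example.org/observation/123> is a
# placeholder for the IRI of the observation of interest.
PREFIX prov: <http://www.w3.org/ns/prov#>

SELECT ?source ?activity
WHERE {
  <http://example.org/observation/123>
      prov:wasDerivedFrom ?source ;
      prov:wasGeneratedBy ?activity .
}
```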

Funding

The Integrator has been developed with funds from the Royal Netherlands Academy of Arts and Sciences (KNAW) and the Dutch national programme COMMIT. For more information, see the eHumanities Group and CEDAR.

Contact

More information is available at http://cedar-project.nl.