2nd International Workshop on Semantic Statistics, ISWC 2014 — Report

Following up on last year’s edition, on October 19th we participated in the second edition of SemStats, the ISWC workshop that brings together the Semantic Web and Statistics communities. About 40 attendees showed up to talk all things Linked Statistical Data (LSD).

We had two contributions to the workshop:

A paper entitled “From Flat Lists to Taxonomies: Bottom-up Concept Scheme Generation in Linked Statistical Data”, where we leverage lexical and semantic properties of historical statistical data to automatically build standard classification schemes;
A poster named “Semantic Similarity and Correlation of Linked Statistical Data Analysis”, where we explore the relationship between the correlation of Linked Statistical Datasets and their semantic similarity.

If last year’s first edition was a call to both communities to see whether they could do interesting research together, this year’s keynote made clear that SemStats has a sound agenda, and that National Statistical Offices (NSOs) are convinced of the value of Linked Data in statistical data publishing. Emanuele Baldacci from Istat (the Italian National Institute of Statistics) gave a great keynote. He presented as unavoidable the vision that data is (and will be) published in a distributed manner by NSOs, and argued that the real challenge is to remove the bottleneck of mixing and comparing these statistical data, which includes dealing with non-matching or misleading dimensions and classifications. He also emphasised that NSOs traditionally work with monolithic, one-way data processing workflows with no communication with the external world whatsoever. Linked Data offers an interesting paradigm to change this, allowing the Web to play an interactive role in the publication of statistics. Advantages include solving the age-old problem of harmonising classification systems, but also changing the mindset of NSOs, which tend to think of integration as a post-hoc process: Linked Data makes data publishers think about integration from the very beginning.

Interestingly, paper presentations were much more tool-focused than model-focused, which had been the trend in the first edition. We presented LSD Dimensions (an observatory of the current use and reuse of dimensions in LSD), TabCluster (an automatic concept scheme builder from flat lists of literals) and sense-of-lsd-analysis (leveraging semantics to discover statistical correlations that make sense in LSD), tools that emphasise the “analysis” of Linked Statistical Data; many other presentations were devoted to “publishing” Linked Statistical Data, like the one from the Open Cube team.

Overall, an interesting follow-up on last year’s first edition, and a real boost for a community with great motivation, a solid roadmap and its first working solutions.