Data sharing: two small words, but a lot to be unpacked

05 November 2020

By Stephen MacFeely, Chief Statistician, UNCTAD


When I hear the words “data sharing” my immediate reaction is to start asking questions. I can’t help it.

Firstly, what do we mean by the word “data”? Are we talking about aggregate statistics or are we talking about microdata, i.e., individual records? If it’s the latter, are they anonymized (and how well have they been pseudonymized) or do the records contain information that could identify persons or entities?

The answer to those questions will have a profound impact on what I might say about “sharing”.

But even regarding the word “sharing” I have questions. What do we mean by sharing? Does that mean public dissemination? Or does it mean giving selective, bilateral access? Does it require transmission of the data? Who are we sharing the data with? Another unit within the same entity? An external partner? When the data were first collected were any conditions attached that would prohibit sharing or giving access?

Let’s explore a bit

If we are talking about aggregate statistics, then there are relatively few complications, especially if the statistics are official (unless of course member states try to suppress their data). Provided confidentiality is properly protected, official statistics are designed to be public goods and should, by definition, be accessible to – and thus shared with – everyone at the same time. See principles 1 and 6 of the UN Fundamental Principles of Official Statistics and the Principles Governing International Statistical Activities, which are the “constitutions” for national and international statistical compilers respectively.

Of course, if we are talking about microdata, i.e., individual records, then it’s a whole different discussion, as safeguarding confidentiality is much more challenging.

So, first things first – where did the data come from? If they are primary data, i.e., you collected them yourself, what was the stated purpose and what guarantees did you give to the respondents? Did you tell them the data could or would be shared, and if yes, with whom and for what purposes? If respondents were told their data wouldn’t be shared with anyone, then that promise must be respected. If it was made clear to respondents that data would be shared, then whatever conditionality was set out must be respected. So this might mean, at a minimum, stripping all unique identifiers (names, addresses, social security numbers, etc.), but probably also aggregating some data into cohorts. For example, say someone is aged 37, then we might replace their actual age with an age cohort, say 30-40. Ditto for income or any other factors that when combined might reveal an individual identity. For example, if you combine sex, occupation and town, then someone might be able to determine who the person is, as well as their income or health status.
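To make that concrete, here is a minimal sketch in Python of the steps just described: dropping direct identifiers, replacing exact age with a cohort, and then checking whether a combination of remaining fields still singles someone out. The field names, the sample records and the helper functions are all hypothetical and chosen only for illustration. The bands are non-overlapping (30-39 rather than 30-40), which is one reasonable convention among several.

```python
from collections import Counter

# Hypothetical field names treated as direct identifiers.
DIRECT_IDENTIFIERS = {"name", "address", "ssn"}


def age_band(age, width=10):
    """Map an exact age to a non-overlapping band such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def deidentify(record):
    """Drop direct identifiers and replace exact age with an age band."""
    out = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
    out["age_band"] = age_band(out.pop("age"))
    return out


def smallest_group(records, keys):
    """Return the size of the rarest combination of the given fields."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return min(counts.values())


# Invented sample records, for illustration only.
records = [
    {"name": "A", "address": "1 Main St", "ssn": "001", "age": 37,
     "sex": "F", "occupation": "nurse", "town": "Smalltown"},
    {"name": "B", "address": "2 Main St", "ssn": "002", "age": 34,
     "sex": "F", "occupation": "nurse", "town": "Smalltown"},
    {"name": "C", "address": "3 Main St", "ssn": "003", "age": 52,
     "sex": "M", "occupation": "doctor", "town": "Smalltown"},
]

shared = [deidentify(r) for r in records]

# A result of 1 means at least one person is unique on these fields
# and could still be re-identified despite the removed identifiers.
risk = smallest_group(shared, ["sex", "occupation", "town"])
```

In this invented example the smallest group has a single member (the only male doctor in the town), so stripping names and banding ages would not be enough on its own. Further aggregation or suppression would be needed before sharing.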

If the data are secondary, i.e., repurposed data that weren’t collected for statistical purposes, e.g., tax records (an example of administrative data) or mobile phone call detail records (an example of big data), then things get even more complex. As above, there may already be strict conditions attached to the data (from the primary data collector), but there will also be conditions attached to your use of the secondary data – maybe you don’t even have permission to share it.

As if things weren’t already complex enough, what about recursive data, i.e., data produced from other data?

The waters start to get very muddy, because now the issue of ownership is less clear. If I create data from your data, do I need your permission to share it? After all, you didn’t give it to me, I derived it. Does that give you a stake in the ownership?

Apart from the legal, contractual and ethical issues, there is typically a range of logistical issues surrounding the sharing of data.

At the most basic level, are the data digitized? Are they machine readable? If sharing involves transmission, i.e., moving the data, then the data may require encryption, or if the files are very large, they may require sophisticated IT infrastructure. But maybe the data are so sensitive they can only be shared under strict lab conditions – which means putting in place physical infrastructure and security. It may require harmonized data infrastructure, common classifications and codes, that allow datasets to speak to each other.

Some legal issues

An important and persistent problem across many organizations is ambiguity about how their data are licensed and the terms of reuse. This probably prevents a lot of data sharing, because offices may be genuinely unsure what the law allows them to share. Equally, some users may be afraid to use the data for the same reason. But ambiguity can also be used deliberately as a justification or excuse to hoard data. Another issue of growing political sensitivity is the sharing of data across international borders. After all, how can governments protect their citizens’ information when it resides outside their jurisdiction?

From a statistics perspective, these are critical issues for modern statistical offices, whether national or international. The strategic plans “Data strategy of the Secretary-General for action by everyone, everywhere with insight, impact and integrity 2020-22” and the “System-wide roadmap for innovating UN data and statistics”, which were both endorsed by the United Nations System Chief Executives Board for Coordination in May 2020, will together be grappling with these and other related issues.

In the context of COVID-19 there is one last but very important set of issues to consider. The pandemic, I think, has exposed a tension between community and individual rights, just as the threat of terrorism has done in the past. And the tension is becoming white hot as our capacity for “dataveillance” increases, i.e., using data for surveillance. How do we balance the right to individual privacy with the “common good”, however that might be defined? And who decides?

Should governments be allowed to track citizens using data shared by social media platforms or telecoms to contain COVID-19? That’s a big question with no easy answer. Should governments be allowed to reuse those data for other purposes, say to track protesters? I would argue no, but others will, no doubt, disagree.   

Data sharing: two small words, but a lot of big issues to be unpacked

These two words will be the basis of some very important conversations and negotiations in the years to come. Data sharing fuels the digital economy and globalization. It will drive artificial intelligence and the algorithms that will make all sorts of decisions that will affect our daily lives.

Data sharing, I suspect, will have a profound impact on future geopolitics. It will, without question, have existential implications for official statistics.

This opinion piece was first published in The UN Brief Special Edition on Data Governance