The Commissioner's Blog

The Privacy Ramifications of Big Data

Advances in Information and Communications Technology

Digitization, the Internet and other advances in information and communications technology have led to the phenomenon of “big data”. Vast quantities of data are generated, gathered, stored, linked and analysed with phenomenal ease and efficiency.

Content vs Contextual Data

Much of this big data is personal data generated online through our social interactions, our relationship with organisations, and our connection with smart devices which record and process data. This includes “content” data such as tweets, texts, emails, phone calls, social network posts, photos and videos. At the same time, communications devices and communications service providers can generate and retain “contextual” data (called metadata) relating to these communications, for example, the time, origin, destination and duration of a communication.

Internet-based service providers have the capabilities to review our “content” data. It is no secret that Google reads each and every word of the email messages of Gmail users and serves up ads based on the content of these messages. On the other hand, it may not be readily apparent that our metadata can be more revealing than the actual content of the communication. Metadata portrays a detailed, comprehensive and time-stamped picture of who is communicating with whom, when, how often, and for how long; where the senders and recipients are located, who else is connected to whom, and so forth. It thus reveals the details of our personal, political, social, financial, and working lives.

One can argue whether web search, purchase and browsing histories are data or metadata. The distinction is academic. What is important is that they reveal very intimate information about an individual and are often tracked by Internet companies. As Google’s CEO Eric Schmidt put it in 2010, “We know where you are. We know where you’ve been. We can more or less know what you’re thinking about.”

Benefits vs Privacy Risks

No doubt big data can bring enormous economic and societal benefits as companies and governments use it to unleash powerful analytic capabilities. They are connecting data from different sources to find patterns and generate new insights for optimising customer relationships, targeting behavioural advertising, combatting criminal activities, improving health care and many, many other aspects of our lives.

At its core, big data analytics works wonders by uncovering the correlations between data. For example, Google Flu Trends kicked off in 2009 to track outbreaks of influenza around the world, on the premise that the more people in a particular location search for information relating to flu through Google, the more people in that location are likely to have contracted the virus. Similarly, the retail giant Target, through analysing its customers’ purchasing patterns, was able to identify two dozen products (unscented lotion, certain nutrient supplements etc.) which could be used as proxies for predicting pregnancy, so it could send relevant coupons to the target customer.

While these efforts are to be welcomed, they have potential ramifications for privacy and data protection.

Correlation vs Causality

First, correlation does not necessarily imply a cause-and-effect relationship. At best, it points the way for causal investigations. Hence while clinical researchers have found a correlation between skipping breakfast and obesity, it would be wrong for us to conclude that eating breakfast will beat obesity¹. It is possible that the research participant was physically inactive and that was why he did not feel hungry in the morning and at the same time tended to gain weight. It is also possible that he habitually slept late and found no time for breakfast, and he was a junk-food eater. In both scenarios, encouraging the person to eat breakfast would only aggravate obesity.
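The physical-inactivity scenario can be illustrated with a few lines of Python. This is only a sketch: the counts below are invented for illustration and are not taken from any study, and the names `groups` and `obesity_rate` are hypothetical. It shows how breakfast skippers can appear more obese overall even when, within each activity level, skipping breakfast makes no difference at all.

```python
# Hypothetical counts, invented purely for illustration.
# (activity level, skips breakfast) -> (number of people, number obese)
groups = {
    ("inactive", True): (80, 48),
    ("inactive", False): (20, 12),
    ("active", True): (20, 4),
    ("active", False): (80, 16),
}


def obesity_rate(skips, activity=None):
    """Share of obese people among skippers (or eaters),
    optionally restricted to one activity level."""
    people = obese = 0
    for (level, skips_breakfast), (n, k) in groups.items():
        if skips_breakfast == skips and (activity is None or level == activity):
            people += n
            obese += k
    return obese / people


# Pooled comparison: skippers look far more obese than eaters.
print("all skippers:", obesity_rate(True))
print("all eaters:  ", obesity_rate(False))

# Comparison within each activity level: the gap disappears.
for level in ("inactive", "active"):
    print(level, obesity_rate(True, level), obesity_rate(False, level))
```

In this made-up population the whole of the pooled gap is explained by inactivity, which drives both breakfast skipping and obesity. Real data would rarely be this clean, but the same check (comparing like with like) is what a causal investigation would start from.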

In a similar vein, Google Flu Trends has been criticised as a less than accurate predictor of flu². It has been persistently overestimating flu prevalence because most people who think they have flu and undertake flu-related Google searches do not actually have it and in some cases, the flu-like symptoms were due to other viruses.

Another example showing big data can be misleading is the Street Bump community project initiated by Boston in the United States in 2012 to help residents improve their neighbourhood streets. As the volunteers drove, the mobile app Street Bump identified potholes by recording “bump” data, providing the city with real-time information to fix the potholes and plan long term investments. However, the results recorded were skewed in favour of wealthier neighbourhoods with greater smartphone penetration. Had the skewed data not been adjusted, social prejudice would have been perpetuated.

Profiling

Second, big data may be used in profiling, with its attendant risks. For example, some insurance companies have tried to use credit reports and lifestyle data as proxies for the analysis of blood and urine samples when determining eligibility and offers. This has the advantage of offering a more convenient and affordable service, as the customer can complete the transaction online by answering a number of apparently neutral questions and is spared the painful and costly lab tests. However, such predictive modelling always entails some margin of error. On the one hand, high-risk customers may be accepted erroneously. On the other hand, perfectly healthy applicants may be rejected, or be accepted but unknowingly charged a higher insurance premium, without being able to access and correct any misleading information about them.

Similarly, in the fight against terrorism, the use of blacklists based on statistical inferences is bound to result in false positives and false negatives. It offers no absolute guarantee that terrorist passengers will be intercepted, while some innocent passengers will inevitably be prevented from boarding a plane. You can only hope that you will not someday be one of the unfortunate ones in the latter category.
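A little arithmetic shows why innocent passengers are so exposed. The sketch below uses entirely hypothetical figures (ten million passengers, 100 genuine threats, a screening model that catches 99% of threats and wrongly flags 1% of innocent passengers); none of these numbers comes from a real screening programme, and the function name `screening_outcomes` is an assumption made for illustration.

```python
def screening_outcomes(passengers, threats, sensitivity, false_positive_rate):
    """Expected results of screening a population in which genuine
    threats are rare. All inputs are hypothetical."""
    innocent = passengers - threats
    true_positives = threats * sensitivity           # threats correctly flagged
    false_negatives = threats - true_positives       # threats missed
    false_positives = innocent * false_positive_rate # innocents wrongly flagged
    flagged = true_positives + false_positives
    return {
        "true_positives": true_positives,
        "false_negatives": false_negatives,
        "false_positives": false_positives,
        "share_of_flagged_who_are_threats": true_positives / flagged,
    }


result = screening_outcomes(
    passengers=10_000_000,
    threats=100,
    sensitivity=0.99,
    false_positive_rate=0.01,
)
for name, value in result.items():
    print(name, value)
```

Under these assumptions roughly one hundred thousand innocent passengers are flagged for every hundred or so genuine threats, so fewer than one in a thousand flagged passengers is actually a threat, and one threat is still expected to slip through. Because genuine threats are so rare, even a seemingly accurate model produces mostly false alarms.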

In the United States, “big data” scores abound today. They are compiled based on financial, demographic, ethnic, racial, health, social, consumer and other data to characterise individuals or predict behaviours like spending, health, fraud, academic performance, employability and promotability. Scores can be correct, or they can be inaccurate or misleading. More often than not, they lack transparency. Persons affected may not be aware of the existence of a score, its uses, the underlying factors and data sources. They are therefore unable to challenge the score, correct the data on which it is based, or opt out of being the subject of the score. The use of such scores is thus discriminatory, unfair and biased³.

Privacy Intrusiveness

Third, the use of big data could be creepy. In the pregnancy prediction example mentioned above, that Target has “data-mined” its way into the customer’s womb is clearly very privacy-intrusive. The story came to light following a complaint by the father of a teenage girl, who was deeply embarrassed to find out that his daughter was three months pregnant, based on the increased amount of pregnancy-related advertisements from Target arriving in the mail.

The Snowden revelations in 2013 offered perhaps the most illuminating example of how governments can exploit big data to undertake mass surveillance of their own citizens and of people worldwide, intruding deeply into the daily lives of ordinary people. The National Security Agency of the United States, together with its intelligence partners worldwide, runs programmes which collect telephone metadata from US telephone companies and monitor international Internet traffic. This reminds us of the remarks made by Sun Microsystems’ CEO Scott McNealy way back in 1999: “You have zero privacy anyway. Get over it.”

De-identification

Users of big data may claim that they are working with de-identified information, that is, data stripped of the name and other personal identifiers. They contend that with anonymisation, privacy is no longer an issue. However, such an assertion may be a fallacy.

Our online tracks are tied to smartphones or personal computers through UDIDs, IP addresses, “fingerprinting” and other means. Given how closely these personal communication devices are associated with each of us, information linked to these devices is, to all intents and purposes, linked to us as individuals.

Furthermore, big data can increase the risk of re-identification, and in some cases, inadvertently re-identify large swaths of de-identified data all at once. The consequences could be fatal in the event of a data breach.

In 2006, the internet giant AOL released 20 million old search queries of 658,000 subscribers for public view in connection with the company’s newly launched research site. Although identification numbers were used instead of names, user IDs or IP addresses when listing the search logs, privacy advocates were concerned that the subscribers could still be identified individually based on the search log data. Indeed, within days, the New York Times, working from search queries like “60 single men”, “tea for good health” and “landscapers in Lilburn, Ga”, correctly identified one of the subscribers as a 62-year-old widow from Lilburn, Georgia. Her whole personal life was exposed immediately as people reviewed her search queries, which also included “nicotine effect”, “dry mouth”, “hand tremors” and “bipolar disorder”. The ensuing public outcry led to a public apology from AOL and the removal of all the search logs in a matter of 10 days.
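The mechanics of this kind of re-identification are simple enough to sketch in a few lines of Python. Everything below is invented for illustration: the `directory` stands in for any public or commercially available list of people, and the `clues` stand in for attributes that might be inferred from one pseudonymous user's search queries. It is not a reconstruction of how the AOL subscriber was actually traced.

```python
# A fictitious "public directory"; all names and attributes are made up.
directory = [
    {"name": "Resident A", "town": "Springfield", "age": 62, "has_dog": True},
    {"name": "Resident B", "town": "Springfield", "age": 62, "has_dog": False},
    {"name": "Resident C", "town": "Springfield", "age": 35, "has_dog": True},
    {"name": "Resident D", "town": "Shelbyville", "age": 62, "has_dog": True},
]

# Attributes inferred from one pseudonymous user's queries (hypothetical).
clues = [("town", "Springfield"), ("age", 62), ("has_dog", True)]


def narrow(candidates, clues):
    """Apply each clue in turn and record how many candidates remain."""
    sizes = [len(candidates)]
    for key, value in clues:
        candidates = [person for person in candidates if person[key] == value]
        sizes.append(len(candidates))
    return candidates, sizes


matches, sizes = narrow(directory, clues)
print("candidates remaining after each clue:", sizes)
print("remaining:", [person["name"] for person in matches])
```

No single clue identifies anyone, yet three of them together leave exactly one candidate. Removing names from a data set therefore offers little protection when the remaining attributes can be matched against other sources.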

Concluding remarks

Whilst the intelligent use of big data holds great promise for enriching the quality of life and enhancing productivity, consumer privacy and data protection must remain a priority. The challenge before us is how to ensure a win-win outcome by exploiting big data’s potential while addressing its downsides.

In this regard, we hope to benefit from the insights and experience of an international panel of experts who will be speaking at a half-day conference to be held on 10 June 2015⁴. They comprise regulators, academics and privacy professionals from think tanks and multi-national corporations. They will cover legislative controls as well as innovative approaches based on risk-benefit analysis and ethical information governance. Their presentations are expected to stimulate further discussion among various interested sectors in Hong Kong.

¹ http://youtu.be/ROpbdO-gRUo

² http://siliconangle.com/blog/2014/03/24/google-flu-trends-a-case-of-big-data-gone-bad/

³ See World Privacy Forum, The Scoring of America: How Secret Consumer Scores Threaten Your Privacy and Your Future at http://www.worldprivacyforum.org/wp-content/uploads/2014/04/WPF_Scoring_of_America_April2014_fs.pdf

⁴ See http://www.pcpd.org.hk/privacyconference2015/index.html