Open Data: Privacy vs. Innovation

With countless avenues for finding data, it can be difficult to distinguish credible sources from illegitimate ones. The Internet at large is an information platform where accurate facts are easily lost in a jungle of fictitious data. Open data is the transparent and accessible flow of researched information: new data is made universally available, with permission to reuse or redistribute it.

The amount of available government and scientific information has skyrocketed. These datasets have empowered citizens, government agencies, and researchers to innovate, make informed decisions, and collaborate as a cohesive unit. Open data is founded on databases that are reputable, through association and transparency in information gathering. From government agencies publishing decennial U.S. census data to companies such as Garmin, a GPS firm worth over $7 billion, the range of shared datasets is diverse. In fact, firms like Garmin are built on data collected by the government. One way Garmin gathers data for real-time traffic alerts on major roads is from fixed traffic sensors maintained by state governments.

It is almost impossible to cut back on one type of open data without affecting another entity, as data from across disciplines are spread, cited, and cross-referenced among governments and private firms.

With this acceleration of shareable knowledge, problems inevitably arise. Where is the line between an acceptable amount of data and too much? Should privacy be considered when publishing open data that could lead to scientific innovation? Is the collection and publication of government open data ethical, and does scientific open data protect the researcher’s intellectual property?

To understand these conundrums, two perspectives must be considered: the government’s responsibility in accounting for its citizens and the scientific community’s common goal of propelling biological and artificial innovation forward.

Open government data helps shape fiscal and monetary policy as well as public legislation. In 2009, the U.S. became the first country to publicize more than 138,000 datasets collected by its various agencies, excluding personally identifying characteristics and data that would threaten national security. In an effort to organize and lead open data efforts, President Barack Obama announced Dr. DJ Patil as the first U.S. Chief Data Scientist in February 2015. The Obama Administration aims to make available data “that taxpayers have already paid for.” The government aims to collect as much valuable data as possible in order to account for all individuals. For instance, information gathered by the U.S. Census Bureau is referenced when deciding funding for federal programs, so that budgeting is based on public need.

The Sunlight Foundation, a nonpartisan organization that advocates for open government, insists that the government should explicitly state the “values” and “goals” of collected information. By doing so, the government can build rapport with the public, promoting “a sense of trust, transparency, accountability, and civil engagement.” The Sunlight Foundation additionally advocates releasing information in aggregate if it “may provoke concern if released at the individual-level,” in order to protect individual citizens while maintaining the integrity of the study.

For instance, the Bureau of Labor Statistics publishes statistical data describing the employment landscape of the U.S. Private information is not released; rather, numerical datasets are used in place of personal, qualitative data. If individuals agree, their private, identifying data can be used at a micro level by academic researchers. The foundation attempts to provide alternative routes for citizens who may feel scrutinized, while also emphasizing the incredible value of gathered government research.
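The idea of releasing information in aggregate rather than at the individual level can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method of any agency: the records are invented, and the function name `aggregate_release` and the threshold `MIN_CELL_SIZE` are hypothetical. The sketch counts individuals per group and withholds any group too small to hide a single person.

```python
from collections import Counter

# Hypothetical individual-level records: (county, employment_status).
# All values are invented purely for illustration.
records = [
    ("County A", "employed"),
    ("County A", "employed"),
    ("County A", "employed"),
    ("County A", "unemployed"),
    ("County B", "employed"),
    ("County B", "employed"),
    ("County B", "employed"),
    ("County B", "employed"),
]

# Assumed threshold: groups smaller than this are withheld.
# Real agencies set their own disclosure rules.
MIN_CELL_SIZE = 3


def aggregate_release(rows, min_cell_size=MIN_CELL_SIZE):
    """Count records per group and suppress groups below the threshold."""
    counts = Counter(rows)
    released = {}
    for group, n in counts.items():
        # None marks a suppressed cell: the count exists but is not published.
        released[group] = n if n >= min_cell_size else None
    return released


summary = aggregate_release(records)
for group, n in sorted(summary.items()):
    print(group, n if n is not None else "suppressed")
```

In this toy data, the lone unemployed person in County A would be easy to single out, so that cell is withheld while the larger groups are published as plain counts.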

Data collection must be handled with sensitivity. Barbara Parker, city attorney for Oakland, describes the Mosaic Effect, “a phenomenon in which non-PII [personally identifiable information] data can be combined with other available information in such a way as to pose a risk of identifying an individual.” Government agencies gathering information from citizens are strongly encouraged to consider other previously gathered information about individuals that could make it possible to piece together identifying data. For example, when information on gun ownership in Westchester and Rockland counties was “transformed by a local newspaper into a data set, mapped and then published,” the publicized data alarmed many individuals. This caused great concern, as the map could identify potential targets for burglars. The careless distribution of locatable information created a possibly dangerous situation. Such blunders with open data cause mistrust of, and anxiety about, the government.
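A small Python sketch may make the Mosaic Effect more concrete. Everything in it is an assumption made for illustration: both datasets are invented, and the field names, the `QUASI_IDENTIFIERS` tuple, and the `reidentify` function are hypothetical. Neither dataset pairs a name with the sensitive attribute, yet matching them on shared, seemingly harmless fields singles out one person.

```python
# Dataset A: a hypothetical "anonymized" release with no names.
release = [
    {"area": "Z1", "birth_year": 1970, "sex": "F", "sensitive": True},
    {"area": "Z1", "birth_year": 1985, "sex": "M", "sensitive": True},
    {"area": "Z2", "birth_year": 1970, "sex": "F", "sensitive": True},
]

# Dataset B: a hypothetical public directory with names but no sensitive field.
directory = [
    {"name": "Resident 1", "area": "Z1", "birth_year": 1970, "sex": "F"},
    {"name": "Resident 2", "area": "Z1", "birth_year": 1985, "sex": "M"},
    {"name": "Resident 3", "area": "Z1", "birth_year": 1985, "sex": "M"},
    {"name": "Resident 4", "area": "Z2", "birth_year": 1962, "sex": "F"},
]

# Fields that are not PII on their own but appear in both datasets.
QUASI_IDENTIFIERS = ("area", "birth_year", "sex")


def key(row):
    """Build the combined quasi-identifier for one record."""
    return tuple(row[field] for field in QUASI_IDENTIFIERS)


def reidentify(anonymous_rows, named_rows):
    """Return names whose quasi-identifiers match an anonymous row uniquely."""
    names_by_key = {}
    for row in named_rows:
        names_by_key.setdefault(key(row), []).append(row["name"])

    exposed = []
    for row in anonymous_rows:
        candidates = names_by_key.get(key(row), [])
        # Exactly one candidate means the "anonymous" row points at one person.
        if len(candidates) == 1:
            exposed.append(candidates[0])
    return exposed


exposed = reidentify(release, directory)
print(exposed)
```

Here only the first released row matches a single directory entry. The second matches two people and the third matches none, which is why coarser fields or larger groups lower the risk of identification.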

Although private citizens may not directly benefit from or understand the value of open data, scientific innovation thrives on collaboration; academic journals, conferences, and forums exemplify this use. Scholars’ engagement in this exchange of information is, if properly acknowledged, celebrated and encouraged. Open data platforms provide scientists with the necessary tools to collaborate and ultimately innovate.

With open scientific data, researchers are naturally worried about their work being plagiarized. In such a highly competitive field, scientists are often skeptical of the concept of open data. However, researcher Steve Koch insists that scientists should not hesitate to use open data, as the platform actually makes it easier to track unethical practices and prove priority in data findings: “all they [scientists] need to do is point to their open notebooks to show that they had an idea or found a result first.”

Scientists must be cognizant of the nature of their research and their intentions, because open data may harm their personal career goals. Consider chemist Jean-Claude Bradley of Drexel University, who coined the term “Open Notebook Science.” He openly shares his data, methods, and findings in real time on public Google spreadsheets and documents. By publicly displaying his in-progress work, he exposes flaws whose discovery is vital to the progression of his study. As a result, the open data leads to “[transforming] potential rivals to collaborators.”

Currently, Bradley’s open data philosophy and practice work well, but this was not always the case. In 2005, when Bradley researched nanotechnology, open data was not an option; it would have hurt his chances of receiving patents to protect his intellectual property. Because traditional academic research journals will not publish previously published work, scholars seeking tenure will not publish open data. This limits the number of platforms where researchers are able to display their work.

Open data platforms have tremendous power to educate the public in an accurate way. While it is impossible to eliminate all statistical bias, credible open data informs the public with data of the highest caliber. The future of open data relies on individual participation and understanding. Many individuals, unless they have previous experience with data mining and analytics, struggle to read graphs and datasets. Since people are not familiar with data mining, they are less likely to use open data and understand its true, intrinsic value. To resolve this dilemma, organizations are forming “hack-a-thons” and open data conventions. Entrepreneurs and students alike eagerly flock to these events in hopes of gaining new insight. Joel Gurin, founder of the think tank Open Data Now, indicates that without a way for the public to be educated on data mining, the “raw information” in datasets is “meaningless”; that is why think tanks like Gurin’s are essential to the sustainable future of open data.

There is no simple solution to quell the contentious atmosphere that open data creates between privacy and innovation, as the two concepts that open data serves collide with and influence each other. An inverse relationship emerges between privacy and innovation as open data becomes more readily available: privacy decreases while innovation increases. Gurin opines that it ultimately boils down to judging how much privacy is worth protecting in the name of scientific growth. Open data is critical to the growth of the country: government policies can be reformed based on statistical, researched data, scientific ingenuity can surge, and the public can become more informed about their environment.