As a member of one of several research teams in UCLA’s Department of Information Studies, Irene Pasquetto is studying how scientific data gathered by astronomers, earth scientists and biologists are managed and shared. Her research also has a practical application in examining the impact of data on society and government. This month, Pasquetto is helping to present a hackathon, to be held at UCLA’s Perloff Hall on Feb. 14, to examine data mining of large data sets on police brutality in Los Angeles County, and how these data can be best managed and used to effect positive change.
A former practitioner of data-driven journalism in her native Italy, Pasquetto says that building awareness of the significance of these data sets is critical.
“Building awareness is important because the whole concept of public opinion is founded on the possibility for citizens to be informed,” she says. “Without accurate information, citizens are not free to have accurate opinions on facts, and consequently, their political choices are biased. We hope to verify the state of current available information on police-involved homicides. We found some discrepancies between official databases, such as the FBI, and citizen-curated datasets on the same topic. We aim to identify the reasons behind such discrepancies, and why federal datasets didn’t report some cases.”
Last October, Pasquetto was given a travel award to attend the “First Hands-On Workshop on Leveraging High Performance Computing Resources for Managing Large Datasets,” at the 2014 International Conference on Big Data presented by the Institute of Electrical and Electronics Engineers (IEEE), which was held in Washington, D.C., and hosted by the University of Texas at Austin. Pasquetto says that the workshops at the conference gave her a better understanding of how large databases are managed. As a member of Professor Christine Borgman’s Knowledge Infrastructures research team, Pasquetto works on several projects that examine the practices of research scientists and the ways in which they share – or do not share – their findings.
“Open access refers to the fact that scientists are willing or not willing to share their documents and papers with the broader research community and with the public,” says Pasquetto. “The open data movement has the same goal of sharing the results of research, but it is specifically focused on data.
“Open access to publications is already a complicated practice,” she says. “When it comes to data, it’s even more complicated. Privacy and copyright issues around datasets limit sharing practices. It is necessary to rethink and adapt such policies to the needs for data sharing. In addition, the technical management of these databases is difficult. Even when scientists are willing to share their data, this sharing is complicated because you need specific repositories or software to share the data.”
The Knowledge Infrastructures research team is working with four researcher communities – two groups of earth scientists and two groups of astronomers; each community happens to be at a different stage in its research. Pasquetto’s specific contribution is to study the practices that lead to open data, and the effect of standards and policies, the laws of open access, and even requirements for federal grants.
“The National Science Foundation has some requirements, meaning that if you want to obtain a specific kind of grant for your project, you have to assure the scientific community that you will somehow share your data or your publication,” says Pasquetto. “Some of the researchers we work with are willing to share their data. For example, astronomy is a field where they are pretty advanced in sharing both papers and research data.
“The team also studied a community of earth scientists. It was not that they were not willing to share their data. But for their kind of research, they have many different kinds of expertise involved, all with different skills and different databases. For one project, you can have five different databases, and [the researchers] don’t always talk to each other. [Data] has to be organized in a specific way, with the specific software to read the database. What we are trying to see are the conditions that could help scientists to share their data, and also, when they are not willing to share it, why and how open access can be improved.”
Pasquetto earned her bachelor’s degree in digital communications and her master’s degree in journalism at the University of Verona. While still in Europe, she attended the School of Data of the Open Knowledge Foundation.
“While earning my master’s degree, I studied data-driven journalism,” says Pasquetto. “We studied statistics and tools to visualize and analyze data. Before I joined Professor Borgman’s team I was working on government data – data on cities, public transportation, health issues, and data that has to do with society. The School of Data travels all over Europe, teaching journalists and activists how to use data in their work and how data has a social impact.”
Pasquetto says that she enjoys the opportunity to blend her passion for writing and her interest in technology through her work on the Knowledge Infrastructures team.
“When I started working with data, I liked the combination of technical skills and communication skills,” she says. “It’s pretty rare, because people are usually either into engineering and math, or they study literature. That’s what I like about information science in general, that it’s a mixture of these.”
Last summer, Pasquetto participated in the online Big Data Summer School organized by JPL and Caltech. She says that the need to manage big data and the popularity of MOOCs (Massive Open Online Courses) are interconnected, and shares the story of Jack Andraka, who in 2012, at the age of 15, invented a new type of diagnostic test for early-stage pancreatic, ovarian, or lung cancer. He created the test using a paper sensor, similar to a diabetic test strip, that measures the level of mesothelin, a soluble cancer biomarker, with an accuracy rate of more than 90 percent.
“It shouldn’t be that because you belong to a specific institution, you have access to scientific knowledge,” says Pasquetto. “The main reason why publications and data should now be open to people who are not scientists is because the way that people learn is changing. Because of MOOCs, 15-year-old students around the world can read scientific publications and use that data for actual science. We will have a society where [knowledge] is available to everybody, not just within traditional education. This is one of the results of public access.”