As numbers and facts continue to accumulate in today’s world of big data, a growing challenge is how to sift through the reams of data for relevant discoveries. It’s a critical and timely issue because policymakers and scientists base wide-ranging decisions on the results of data analysis. But old methods of data analysis don’t always work well with super-sized data sets.
“Big data” needs some big new ideas.
Enter Sudeepa Roy, assistant professor of computer science. Roy is a database researcher who is creating new ways to help mine enormous data sets for meaning.
“Sudeepa is tackling these really tough challenges where people are really nervous about the existing techniques,” says Jun Yang, professor of computer science. “She’s out there leading this charge.”
In one of Roy’s projects, she’s creating a system that will help users more efficiently find explanations for observations coming out of big databases. Her work is funded by a five-year, $550,000 NSF CAREER award that she won in 2016.
Let’s say you have a database and you ask a question in the form of a query. Perhaps the result is a surprising or interesting observation. Maybe the data show that unmarried mothers are more likely to give birth to underweight babies. You might wonder if that result has something to do with the condition of being unwed, or whether it’s related to age, socioeconomic status, race, geography, education ... or some combination of these and other factors.
Finding an explanation, especially a multi-factorial one, in a database of multiple interrelated tables is like looking for the proverbial needle in a haystack. Just generating all the possible explanations could take a prohibitively long time.
To make the task easier, Roy is working on techniques to create “explanation-ready databases,” in which all possible explanations are pre-loaded into the database. Each explanation involves an “intervention” (removing or changing a particular set of data points) that has an effect on the observation.
“The interventions are changes in the database that change the observation,” Roy says. “For every possible explanation, which may or may not be valid, there is an intervention attached. The system outputs a set of candidate explanations.”
These candidates still need to be evaluated by a human user, but that’s like looking for a needle in a sewing kit instead of a haystack. “The system will find facts that may illuminate the surprising property without you having to do it yourself,” Roy says. “Out of the 50,000 possible explanations, what are the top five for this particular question?”
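To make the idea of ranking candidate explanations by their interventions concrete, here is a minimal sketch in Python. It is not Roy's system: the table, the attribute names (`married`, `smoker`, `teen`), and the scoring rule (how much the observed gap shrinks when the matching rows are removed) are all assumptions chosen for illustration, and the rows are invented toy data.

```python
# Toy rows invented purely for illustration; they are not real birth records.
ROWS = [
    {"married": False, "smoker": True,  "teen": False, "underweight": True},
    {"married": False, "smoker": True,  "teen": False, "underweight": True},
    {"married": False, "smoker": False, "teen": True,  "underweight": False},
    {"married": False, "smoker": False, "teen": False, "underweight": False},
    {"married": True,  "smoker": True,  "teen": False, "underweight": True},
    {"married": True,  "smoker": False, "teen": False, "underweight": False},
    {"married": True,  "smoker": False, "teen": False, "underweight": False},
    {"married": True,  "smoker": False, "teen": False, "underweight": False},
]


def rate(group):
    """Fraction of rows in the group with an underweight birth (0.0 if empty)."""
    return sum(r["underweight"] for r in group) / len(group) if group else 0.0


def gap(rows):
    """The observation: underweight rate for unmarried minus married mothers."""
    unmarried = [r for r in rows if not r["married"]]
    married = [r for r in rows if r["married"]]
    return rate(unmarried) - rate(married)


def rank_explanations(rows, candidates, k):
    """Score each candidate by how much its intervention shrinks the gap."""
    base = gap(rows)
    scored = []
    for name, matches in candidates:
        # Intervention (assumed form): delete every row the candidate matches.
        remaining = [r for r in rows if not matches(r)]
        scored.append((base - gap(remaining), name))
    scored.sort(reverse=True)  # largest reduction in the gap first
    return scored[:k]


# Hypothetical candidate explanations, each a simple predicate over a row.
CANDIDATES = [
    ("smoker", lambda r: r["smoker"]),
    ("teen", lambda r: r["teen"]),
]

top = rank_explanations(ROWS, CANDIDATES, k=1)
print(top)
```

In this toy table, removing the smokers makes the gap between unmarried and married mothers disappear, so "smoker" comes out as the top candidate. A human still has to judge whether that candidate is plausible.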
While computers may be good at generating explanations, humans are good at understanding context and plausibility. So Roy and her colleagues are working to improve the system so that users can give input in terms of asking for explanations of one kind or another, or grouping very similar explanations together.
A major goal of the project is to create a flexible system that will work with a wide variety of data, databases, and applications—whether in epidemiology, sociology, physics, or any other field.
In two other grant-funded projects, Roy is working on a different aspect of data analysis: causal analysis. In this case, the goal is to analyze databases to identify factors that actually cause certain outcomes.
Causal analysis is valuable in situations where randomized controlled clinical trials are not ethical—for example, studies of dangerous behaviors like smoking or binge drinking.
The trick is to distinguish between correlation and causation. Techniques exist to do this, but, as Roy says, “The current approaches for causal analysis are not scalable for big data.”
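A small worked example shows why correlation alone can mislead. The sketch below uses stratification, a standard textbook adjustment and not necessarily the technique Roy's group uses. The counts, the `older`/`younger` strata, and the function names are assumptions invented for illustration.

```python
# Invented counts for illustration: (stratum, exposed, n_people, n_ill).
COUNTS = [
    ("older", True, 80, 40),
    ("older", False, 20, 10),
    ("younger", True, 20, 2),
    ("younger", False, 80, 8),
]


def naive_difference(counts):
    """Illness rate of exposed minus unexposed, ignoring the strata."""
    def pooled_rate(flag):
        n = sum(c[2] for c in counts if c[1] == flag)
        ill = sum(c[3] for c in counts if c[1] == flag)
        return ill / n
    return pooled_rate(True) - pooled_rate(False)


def stratified_difference(counts):
    """Average the within-stratum differences, weighted by stratum size."""
    total = sum(c[2] for c in counts)
    result = 0.0
    for stratum in sorted({c[0] for c in counts}):
        by_flag = {c[1]: c for c in counts if c[0] == stratum}
        exposed, unexposed = by_flag[True], by_flag[False]
        diff = exposed[3] / exposed[2] - unexposed[3] / unexposed[2]
        weight = (exposed[2] + unexposed[2]) / total
        result += weight * diff
    return result


print(naive_difference(COUNTS))
print(stratified_difference(COUNTS))
```

In these made-up counts, the exposed group looks much sicker overall, yet within each age group the illness rate is identical for exposed and unexposed people. The apparent effect comes entirely from age, which is the kind of confounding that causal analysis has to rule out.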
No matter the project, Roy relishes working with colleagues from other specialties, whether machine learning, statistics, or epidemiology. That makes Duke a natural fit for her. “The atmosphere here is very, very collegial,” she says. “Everyone is very, very helpful. It is also very collaborative.”
Yang says Roy’s skills and interests complement the rest of the computer science department nicely. “She’s looking at problems that require an interdisciplinary team,” he says, “and she’s starting to build a bridge between database and machine learning expertise.” He adds: “She’s just a bundle of energy. She makes things happen.”
Roy came to Duke in September 2015. A native of hot and humid Kolkata, she prefers the climate here to that of Philadelphia, where she earned her PhD, or Seattle, where she was a postdoc at the University of Washington.
In her scant free time, Roy likes to read mysteries, watch movies, and travel—particularly to visit family in India.