Marketing companies want to comb big data sets to determine what you’re likely to buy, so they can show you the right offers at the right time. Health care providers want to analyze big data so they can determine when and where outbreaks might happen and stop them before they become virulent. Various three-letter agencies based in DC want to mine that data so they can determine who the bad guys are before somebody decides to blow something up. The list is almost endless.
The problem with Big Data is, well, it’s big. It’s everywhere. It’s in multiple locations and incompatible file formats, with the same information duplicated in hundreds of different ways. Making sense of it can be a nightmare from which many organizations may never awake. But it doesn’t necessarily have to be that way.
I just had a fascinating discussion with a company called Chiliad (pronounced Kiliad, from the Greek word for a “group of one thousand”). Chiliad’s Discovery/Alert 7.0 software allows organizations with massive amounts of data spread across multiple locations to pretend that it’s all in the same place and in the same format, making it easier to search.
Without getting too heavily into the nitty gritty, Chiliad works by installing a small bit of code on every network where the data resides. The code indexes all the data that’s available from each source, and communicates with the copies running on the other networks. Chiliad’s software lets you use English language queries to get at this data and refine your searches. It then flags correlations between different bits of data and brings the relevant connections to the surface – leaving the data exactly where it is, without changing, converting, or even cleaning it.
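For the curious, here is a rough sketch of the general idea of indexing data where it lives and fanning a query out to each location. It is not Chiliad’s actual code or API; the names `SiteIndex` and `federated_search` are made up for illustration, and the simple keyword matching stands in for whatever the real product does with natural-language queries.

```python
from collections import defaultdict
from dataclasses import dataclass, field

PUNCTUATION = ".,;:!?"


@dataclass
class SiteIndex:
    """Hypothetical per-network indexer. Documents never leave the site."""
    name: str
    docs: dict[str, str]  # doc id -> raw text, left exactly as found
    index: dict[str, set[str]] = field(default_factory=lambda: defaultdict(set))

    def build(self) -> None:
        # Build an inverted index: word -> ids of local documents containing it.
        for doc_id, text in self.docs.items():
            for word in text.lower().split():
                self.index[word.strip(PUNCTUATION)].add(doc_id)

    def search(self, query: str) -> set[str]:
        # Return ids of local documents that contain every query term.
        terms = [t.strip(PUNCTUATION) for t in query.lower().split()]
        if not terms:
            return set()
        hits = [self.index.get(term, set()) for term in terms]
        return set.intersection(*hits)


def federated_search(sites: list[SiteIndex], query: str) -> list[tuple[str, str]]:
    """Send the same query to every site and merge the (site, doc id) hits."""
    results = []
    for site in sites:
        for doc_id in site.search(query):
            results.append((site.name, doc_id))
    return sorted(results)


# Two separate "networks", each holding its own records in its own wording.
hospital = SiteIndex("hospital", {
    "h1": "Vitamin D and metabolic syndrome",
    "h2": "Blood pressure notes",
})
lab = SiteIndex("lab", {
    "l1": "vitamin d, diabetes trial",
    "l2": "Aspirin study",
})
for site in (hospital, lab):
    site.build()

print(federated_search([hospital, lab], "vitamin D"))
```

The point of the sketch is that only the query and the list of matches travel between sites; the underlying records stay put and unmodified, which is the property described above.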
So, let’s say you’re searching a medical database to find out if Vitamin D is helpful in treating metabolic syndrome. (Something I do at least once a week–not.) Chiliad’s results may tell you that it is, but they can also point out that Vitamin D is helpful in combatting diabetes and high blood pressure as well – something that may not have been otherwise obvious. And the software can do it virtually instantaneously (at least it did in the demo, from which this screen shot derives).
If broadly deployed, tools like Chiliad’s could make this kind of data correlation as easy as using Google. In fact, if there were a consumer version of Discovery/Alert – or I had a few hundred thousand dollars to spare – I would want to use it on top of Google, so I could sort results to get the most up-to-date information on a topic (something Google is particularly bad at). I’d buy that in a heartbeat.
What’s interesting from a privacy perspective are Chiliad’s clients. For the past decade Chiliad has worked with the Department of Homeland Security and the FBI, helping the spooks sort through 15 billion records from more than 100 sources. The Feds use Chiliad to monitor all flights, trucks, cars, ships, and passengers entering this country. If you’ve ever passed through US Customs, your data has flowed through Chiliad’s software.
Beyond that, though, Chiliad couldn’t tell me what kinds of data the DHS is interested in. Even they don’t know all of the data sources the DHS combs through each day, and if they did, well… you know the rest of that joke.
Now Chiliad is moving into medical records, another area where the data is jumbled, massive, and highly sensitive, with lots of privacy landmines. The idea is that by allowing hospitals and pharmaceutical companies to search across patient records and research data from multiple sources, doctors may be able to pick out patterns that may not have been visible when looking at smaller data sets – like, say, seeing that a particular treatment is vastly more effective for a certain malady. And Chiliad’s auditing capabilities will allow admins to know exactly who saw what data and when they saw it, keeping it in compliance with regulations like HIPAA.
Chiliad marketing veep Ken Rosen says Chiliad’s experience working with the DHS makes it uniquely qualified to handle highly regulated personal data such as medical records.
The other problem with Big Data is that, if it’s misinterpreted, the results could be disastrous. Let’s say the DHS has analyzed a few petabytes of consumer purchase data and determined that terrorists like to order takeout pizza, pay cash for their groceries, take lots of transatlantic flights, and visit Jihadist Web sites. If you happen to do all of these things – and you’re not a terrorist – it’s conceivable you could be mistaken for one, based entirely on what the data seems to be saying about you.
Obviously, the NSA is not going to release the algorithm it uses to determine whether suspect A is a potential terrorist while suspect B is not. On the other hand, Rosen says the Feds take individual privacy quite seriously.
“I’m not pretending to be naïve and say that every person in government takes it just as seriously,” he says, “but one of the hottest issues among our government customers is the need to guarantee that information on individual US citizens does not fall into the wrong hands. There’s rigorous scrutiny to ensure that information isn’t analyzed inappropriately.”
And, he adds, the Feds aren’t generally interested in individuals so much as the connections between networks of people. “They need to understand the entire conspiracy, the complete story,” he says.
So if the spooks do mistake you for a wanted terrorist and toss you into Gitmo, at least you might end up there with people you know. Hopefully, though, better, smarter search tools for manipulating Big Data will make this outcome less likely, not more.
This piece originally appeared on ITworld’s TY4NS blog.
Disembodied head of Data (Brent Spiner) courtesy of FastCompany.