Statistics

Data Mining, PRISM, NSA & False Positives: Update

We’re spying on you for your own good.

Remember how (this is a really brief history lesson) the NSA, CIA, FBI, and many others of those lettered agencies with their ever-increasing budgets, super-sophisticated computers, genius brain mathematicians, statisticians, and computer scientists were able, through thinking clever thoughts and applying unimaginably exquisite algorithms, to identify and thus prevent 9/11?

And how they took the knowledge they gained through that success, in combination with even and ever more data culled from private citizens, to stop the Boston bombing before it happened, to thwart the Fort Hood killings, to stop Benghazi, to defeat the Wisconsin Sikh temple shooting, to discover who sent anthrax through the mail, and how they similarly protected us from a few dozen other bloodlettings?

Neither do I remember.

These gross failures have an explanation; actually two. The first falls under the “If only” rubric. If only the budgets of these bureaucracies weren’t so meager we would have been safe. If only they were allowed a freer hand. If only they were able to gather more information. If only endlessly.

The hard news about If only is that there is some truth in it. An algorithm designed to ferret out fiends from phone records won’t work if there are no phone records. Solution? Get some. And, the logic pushed way past the breaking point and over the Cliffs of Insanity (free Princess Bride reference), if you have some all is better. So get them all, and be damned the morality of the act. Lives are on the line. Think about the children.

This line of reasoning is convincing to politicians anxious to increase their grope of the body politic. This is why Dianne Feinstein trotted to the microphones yesterday to say that “I knew about these programs all along. Congress was briefed.” The implication is that since one budget-consuming branch of the government knew what another budget-consuming branch of the government did, there was nothing to worry about. Feel better?

The second explanation for the intelligence strikeout slump is more disheartening and more likely true. The methods the NSA and others are using don’t work.

Or they don’t work too well. Why? Because there is no task more fraught with error, misunderstanding, and misplaced certainty than predicting human behavior. Just when you think you have it pegged, it changes.

We can, for instance, forecast with reasonable accuracy and on most weekdays that the bulk of New York City residents will pile into cars, cabs, and trains round about 5 pm. But this is only most weekdays. Sometimes, for reasons foreseen by nobody, the regular pattern is disrupted and the forecast goes to Hades.

How much harder is it then to predict what time Mr Smith heads home for the day? And then ask our model to discern whether Smith will vary his route next Tuesday to toss an IED into the Army recruiting center (a real example). You can ask, and the computer will answer, but it will be talking out its bus.

It won’t be long before some disingenuous politician or bureaucrat will say, “These agencies may have been too zealous. But come on. If you have nothing to hide, you have nothing to worry about.”

Bovine spongiography. You have plenty to worry about. Like being falsely suspected or accused. Like having the full might and weight of the Federal Government bear down on you, as the IRS currently does to those it perceives as its enemies. Financial ruin, loss of reputation, careened career—and worse.

Look. Take the smartest guys you can find and have them cobble together a model which predicts whether each citizen is a “potential” terrorist or no. This model will spit out numbers in the form of probabilities. Some of these will be high enough to exceed the “reasonable, articulable suspicion” threshold (the quote is from Feinstein).

At that point even an editorial from the New York Times won’t be able to convince our great brains that they might have the wrong men. The allure of numbers from a computer printed out and physically real is too strong, even for the people who programmed these computers and who know intimately their many and weep-worthy limitations.

The nature of these models is that there are bound to be many, many more false identifications of terrorists than true ones. The price we pay for these errors is a loss of privacy, and the placement of our secrets into the hands of government. What could go wrong? Everything.
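To put rough numbers on that claim, here is a minimal sketch of the arithmetic via Bayes' rule. Every figure in it is an assumption chosen for illustration (a prevalence of 1 in 100,000, and a model that is 99% sensitive and 99% specific); nobody outside the agencies knows the real ones, and the function name `hit_is_real` is just a label.

```python
def hit_is_real(prevalence, sensitivity, specificity):
    """Probability that a flagged person really is a threat (Bayes' rule)."""
    true_hits = sensitivity * prevalence               # threats correctly flagged
    false_hits = (1 - specificity) * (1 - prevalence)  # innocents wrongly flagged
    return true_hits / (true_hits + false_hits)


# Assumed, illustrative inputs; not real figures.
p = hit_is_real(prevalence=1 / 100_000, sensitivity=0.99, specificity=0.99)
print(f"Chance a flagged person is a real threat: {p:.4%}")
```

Under those assumptions roughly one flag in a thousand points at a real threat; the rest land on the innocent.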

Oh, the algorithms will also claim some aren’t a threat when they really are. This we know from all too frequent experience. Yet the confidence of government that success is ever around the corner never abates.

Update Just saw this in today’s WSJ “Thank You For Data Mining”:

The effectiveness of data-mining is inversely proportional to the size of the sample, so the NSA must sweep broadly to learn what is normal and refine the deviations.

This is false. The more people/records searched the greater the number of false positives, the costlier the subsequent follow-up searches (each potential terrorist has to be investigated further), and the more eventual lives harmed. The follow-up searches are also not error free. Blanket screening is rarely a good idea.
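A back-of-the-envelope sketch of why sweeping wider makes matters worse. It assumes, purely for illustration, a fixed 100 real plotters somewhere in the population and a screen that is 99% sensitive and 99% specific; `sweep` is a made-up name and none of these numbers come from any agency.

```python
def sweep(n_screened, n_real=100, sensitivity=0.99, specificity=0.99):
    """Expected (true positives, false positives) from screening n_screened people.

    Assumes all n_real actual threats are inside the screened group.
    """
    true_pos = sensitivity * n_real
    false_pos = (1 - specificity) * (n_screened - n_real)
    return true_pos, false_pos


# Widen the net while the (assumed) number of real threats stays fixed.
for n in (1_000_000, 10_000_000, 100_000_000):
    tp, fp = sweep(n)
    print(f"screened {n:>11,}: ~{tp:.0f} real hits, ~{fp:,.0f} false alarms to chase")
```

The real hits stay capped at about 99 no matter how wide the sweep, while the false alarms, each needing its own follow-up, grow in step with the number of records searched.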

The many who are writing editorials today praising the data collection and computer screening have faith but no experience in statistical modeling.

Update 2 The “inversely” was a typo/mistake of the WSJ editors. They have since removed it from the online version; the mistake remains in the print version.


https://twitter.com/mattstat/status/343890674057371648

Categories: Statistics

36 replies »

  1. What! You mean the television show “Numbers” exaggerated how well we can predict behaviour using algorithms!! No!!!!

    (Very good piece. It’s too bad people don’t understand how limited and imprecise data mining actually is.)

  2. Sounds kind of like inspecting for quality. If you have to inspect every part, to make sure it meets the standard, you don’t really have a quality system.

    Of course, as mentioned, human behavior is unpredictable anyway.

  3. Your update quote is backwards. Have you subconsciously rewritten it to be what you desire?

  4. So, exactly what data mining algorithms or experts have those agencies employed? Perhaps the problem is they haven’t used any. No?

    It only shows that individual human behavior is hard to predict with all the information available to those agencies. Based on what I know about my children, I have no problem predicting how they’d react to certain situations!

  5. Person of Interest isn’t a documentary? Darn, what’s next? Santa Claus?

    False positives are a very real problem. An algorithm that is 99% accurate in identifying a terrorist is 1% inaccurate. If used to screen one million individuals, it has 10,000 false positives. On top of that, the terrorist ratio is likely far less than 1:1000000.

    JH,

    The Tempest standard came out of NSA. It assumes the existence of data mining of a sort though maybe not in the database crunching sense. It’s hard to imagine that an agency that relies on picking information out of noise would forego methods used elsewhere.

  6. The effectiveness of data-mining is inversely proportional to the size of the sample

    So a data-mining algorithm based upon a sample of one is the most accurate?

  7. DAV, William,

    It was a typo in the paper. See the update above.

    I didn’t even notice it, given the context of the editorial. The WSJ is for intrusive data mining.

  8. Why would these brainiacs not design and write this software? They get paid, medical insurance, early retirement, good pensions. Working in Silicon Valley has its downsides too.

    And it isn’t their fault that somebody else is not capable of properly interpreting the data. After all, that kind of defense works for everybody else creating dangerous goods too.

    And this creates lots of interesting business opportunities. Email can already be encrypted, but a Whatsapp variant that sends everything encrypted using public key encryption ain’t there yet. And what about a bot that is capable of creating patriotic Facebook postings?

  9. You must be psychic!!
    “It won’t be long before some disingenuous politician or bureaucrat will say, “These agencies may have been too zealous. But come on. If you have nothing to hide, you have nothing to worry about.””

    “I’m a Verizon customer. I could care less if they’re looking at my phone records. … If you’re not getting a call from a terrorist organization, you got nothing to worry about,” said Sen. Lindsey Graham, R-S.C.

  10. RK,

    Did he really? How depressing.

    Yet another piece of evidence supporting term limits.

  11. I should have included a reference. The quote I posted above came from the AP article titled “US declassifies phone program details after uproar” by Josh Lederman and Donna Cassata.

  12. The WSJ had it halfway right.
    The effectiveness of data-mining is proportional to the proportion of terrorists in the population, and dependent on how accurate the data mining algorithms are. A positive “hit” in data mining of any population where the probability of being a terrorist is small relative to that of not being one will, even with highly accurate (sensitive and specific) algorithms, most likely be a false positive.
    An example using an exaggerated number of terrorists and a highly accurate data mining test shows just how much more likely the innocent are to be identified than the true terrorist.
    The population of America in 2012 was 314 million. If there are 100,000 terrorists in America (so the prior probability of being a terrorist would be 100,000/314 million = 0.03%), and if the NSA’s data mining is 99% sensitive (flags 99 of every 100 terrorists) and 99% specific (clears 99 of every 100 non-terrorists), then the probability of any data “hit” being a terrorist would be 3% (if I did the math right) in this very unlikely scenario. So for every 100 “hits”, 97 will be innocent good ole Americans.

    In reality the number of terrorists is likely 3 to 4 orders of magnitude lower than this example. Though I really have no idea how accurate the NSA algorithms are, I suspect the sensitivity and specificity are likely far below 99%. Even Sen Lindsey Graham needs to be concerned.

  13. Wasn’t it the Republicans who blamed Obama for Boston? The Tsarnaev brothers succeeded because they didn’t talk to anyone.

    That program started after 9/11, which happened after Bush/Cheney disregarded the specific threat of planes being used as bombs.

    The truth behind the IRS came out at yesterday’s hearing. It was admitted that the wording was changed from “exclusively” for charity to “primarily” in 1959 (a change made while Eisenhower was president).

    The truth is that all or most 501-c-4 applications should have been rejected, while they were all accepted,
    even though there is no legal requirement to submit an application for the status to begin with.

  14. I don’t have a problem with the gathering of phone records by the NSA.

    I don’t think I have a privacy right to control my usage of a third party telephone network.

    I think the phone data is owned by the phone company.

    So it is up to the phone companies as to whether to voluntarily turn over their data or turn it over with a court order.

    Verizon was the only company after 9/11 who said no when the government asked for phone records – all the others cooperated voluntarily.

    Perhaps Verizon is the only one with a court order – perhaps they all are under a court order – but I feel pretty confident that the government has been collecting all the phone records for the entire country since 9/11.

    My basic point is that it is not your data – it is the phone company’s data.

    Think of it this way – when you send a letter, do you own the data as to who the letter is going to?

    No – because of course many people will have to look at the envelope data to see where the letter is going to. The government knows the date you sent it (the postmark), the weight (the cost of postage), who sent it (the return address) and where it is going.

    You don’t have a privacy interest in any of that data.

    You do have a Fourth Amendment right against the letter being opened and read (without a search warrant).

    Just as you have a Fourth Amendment right against the phone conversation being listened to (without a search warrant).

    But the date you used Verizon to place the call, the number you asked Verizon to connect you to, the duration of the call and the number called from are all the exact equivalent of the envelope information.

    Now Verizon has a right to complain – because it is their data.

    But I am sure they did, and the court ruled they had to turn it over anyway.

    But this is not about government intrusion into your right of privacy – because you don’t have that right when you use a third party system.

    Ditto for Visa, Mastercard, your electricity, your natural gas, your garbage container size, or any other metadata related to third party services you may use – that is all their data (in my opinion).

  15. What is most surprising in all of this is that people seem shocked that what they say on the phone or write on the internet is being monitored.

  16. Great article! Now just suppose the reason for such massive data gathering is not for combatting terror, but for political reasons. Recent government activities seem to bear this out.

  17. RickA,

    Ditto for Visa, Mastercard, your electricity, your natural gas, your garbage container size, or any other metadata related to third party services you may use – that is all their data

    You left out surveillance cameras. Who does that belong to?

    Once upon a time, and not all that long ago, what one did or said in public would fade from memory. Not any more. Increasingly, actions are being recorded for posterity. And that means umpteen years from now, it’s possible it could resurface and bite regardless of how inconsequential the actions may have been viewed at the time of occurrence.

    It was bad enough when, back in the McCarthy Era during a time when no one was really actively archiving things, a simple, single attendance of a socialist rally while in college was used against a fair number of people. Who’s to say some similar future “sensibility” won’t be applied retroactively?

    I don’t know how old you are but apparently not old enough to know how one’s past can be used in some really resourceful (and not necessarily constructive) ways or old enough to miss what it is you may have already lost.

    Freedom is lost in tiny steps.

  18. RickA,

    I’m not aware of the post office keeping records on letters, either source or destination, except possibly for special delivery. This would require an enormous amount of additional paperwork. Nobody writes anything down when you post a letter; it is just stamped, cancelled and dumped into a big bag. Envelope exteriors are not scanned. I know that zip codes are often scanned by automatic routing machines, but unconnected to names this information would have no other use and wouldn’t be saved. Even for land phone lines local calls could not be tracked, in the past, unless a tap was physically placed on your line. A bare minimum of records were kept for long distance calls for billing purposes. That was the past. Now with computers making the keeping of vast records feasible, more and more information is stored, especially for cell phones with their more complicated billing systems. It is also much easier to tap into individual conversations should someone decide to do so. So yes, I believe that changing times require a fresh look into privacy issues. It is an important legal issue that should not be dismissed by vague reference to third parties.

  19. What people say on the phone or write on the internet is not being monitored under this program. Only the overall pattern of which numbers called which numbers (and at which time and for how long). Given an actual terrorist as a node, one can study incoming and outgoing calls vis a vis that individual number. After screening out pizza delivery services and calls to mama (maybe), other nodes in the network might be identified. But without a warrant, the calls cannot be listened to.
    Internet chatter is often done in public and is no more immune to being read than a conversation in a bus terminal is immune to being overheard by the beat cop.

  20. YOS,

    A rare instance where we disagree. One, we do not know that what people write on the internet isn’t being monitored; some reports suggest emails were taken. Two, if people at a bus stop suspected they were being earwigged, they would speak softer. Or walk.

    Update

    Washington Post:

    The National Security Agency and the FBI are tapping directly into the central servers of nine leading U.S. Internet companies, extracting audio and video chats, photographs, e-mails, documents, and connection logs that enable analysts to track foreign targets, according to a top-secret document obtained by The Washington Post.

  21. William, I don’t disagree with any of the points you made; I do disagree with some possible omissions (the operative term being possible omissions): Your first four paragraphs sarcastically imply that our intelligence and security agencies dropped the ball big time, and so they did. But how many successes have we not heard about simply because they were successes (and likely can’t be revealed w/out compromising sources, means & methods, etc.)? How many attacks were prevented?

    This falls under the general heading of “No one notices things that are working; we only notice when they aren’t.” (There is probably a Latin phrase for that – anyone?)

  22. PaddikJ,

    It’s true that they might have foiled some. Which? We hear about some, but these are the almost entrapment cases where the FBI etc. lure in an internet denizen who might not have otherwise acted had the FBI not tempted him. We do not hear so much about others the data mining stopped. Doesn’t mean that there weren’t such cases, of course, but given this administration’s habit of boasting of even small successes (usually by leaks), we might have expected to hear about some of these.

    Still, it remains true, as you claim, that this program is a success and worth the loss of freedom and that Ben Franklin (in this case) was wrong when he said, “Those who would give up Essential Liberty to purchase a little Temporary Safety, deserve neither Liberty nor Safety.”

    Update

    From NBC, no conservative stronghold:

    The National Security Agency has at times mistakenly intercepted the private email messages and phone calls of Americans who had no link to terrorism, requiring Justice Department officials to report the errors to a secret national security court and destroy the data, according to two former U.S. intelligence officials.

    “Mistakenly.”

    Phone records, even of just call times and locations, in the hands of a political enemy (like the government) are not the same as those records in the hands of a company. A company you can leave. A government you cannot, or at least not easily.

  23. I often wonder as to the value of government protection.
    Does the cost exceed the risk?
    What was the cost of terrorism versus the money spent slamming the barn door after 911?
    I suspect terrorism would drop dramatically if we imposed Roman solutions.
    The historic quote: “Nits breed lice” had basic truth in it.
    When threatened by a species of predator or a group of people our forefathers did what was necessary to protect their children. This is an inherited and successful survival trait.
    Not nice, but very effective.

  24. Problem: People are at war with us, and we can’t bring ourselves to be ruthless and wipe them out.

    Solution: We need more data.

  25. Briggs,

    You realize that this is probably the first time you have agreed with Michael Moore.

  26. Sylvain Allard,

    Just goes to show that even the dimmest bulbs can burn brightly at least momentarily. If he keeps on this road, who knows what good may come out of him.

  27. Long time listener, first time caller…ha ha.

    My guess is that the phone database is primarily used as a “backwards” viewing device. Not necessarily identifying new terrorists, but used to see who they’ve been communicating with after they have been identified, usually through alternate means. Previously you would need to get a wiretap and only have access to this data going forward, which is limiting.

    That being said, do you trust the government to have this large database sitting around and it to never be abused by a “few rogue employees in Cincinnati”? You run for political office and “somehow” all those websites you looked at 15 years ago when you were 17 are leaked to the media.

    This makes me very wary.

  28. Here are my questions for the security branches of government:

    Do you intend to do a better job of communicating with each other? It seems to me that the failures have been due to a failure to communicate, i.e., the Boston bomber trip to Russia and back. This breakdown did not depend on the NSA data bank.

    Do you have a plan to deal with the false phone calls to average Americans dialed by terrorists attempting to thwart the NSA by dialing numbers at random from our phone books just to harass us? If a terrorist dialed 100 people on their home phones and dialed one person who is in sympathy with the terrorist movement or linked to twitter and facebook, will the 99 people be investigated and how would they know which person to watch, or will all be watched?

    What is the probability that a conspiracy of activists within the security agencies could get a judge to authorize surveillance of a private but innocent American?
    What processes in the NSA prevent a rogue activist from releasing data to the media? Security of information depends on the moral and ethical standards of the people employed at the agencies. Are we to believe that leaks are not possible? The issue is trust.

    What are the other surveillance activities by the federal or state governments?
    It appears that the other government agencies have major surveillance activities which may impinge directly on our constitutional rights. Will the other regulatory agencies rely on the NSA data bases when terrorists attempt to undermine their regulations?

    Suppose that a terrorist calls me on a phone on the security agencies’ watch list and strikes up a conversation with me. Since I didn’t know that person is a terrorist, will my emails and cell phone conversations be subjected to review? If I got a call from a gun owner who is an acquaintance and offers to sell me an illegal rifle, will I now be subjected to searching for illegal guns? Without the first call, the second crime would not have been detected. Will I now be viewed as a criminal?

  29. In some ways, a larger data set would be likely to improve the data mining. To make any type of predictions, it is likely that they would try to establish some criteria to describe “normal” behaviour and then identify people whose behaviour stands out as different. Theoretically, a larger data set would give a better chance of establishing the “normal” criteria, up to a point. Though there would be diminishing returns as the algorithm starts to converge.

    That said, I suspect that the data would be more useful for confirmation than prediction. Having identified (by some other means) a potential suspect, trolling through their past communications could potentially strengthen the suspicion.

  30. Stuffing this database with false leads will become a popular way of screwing with the enemy. It will take away resources in a time of conflict, it will sow suspicion amongst parties that will need to cooperate, and it can be done easily and cheaply using automation. What’s there not to like?

    Anyway, Facebook is toast. And I wouldn’t want to be a LinkedIn shareholder right now.
