Sunday, September 30, 2012

Poll Averages Have No History of Consistent Partisan Bias - NYTimes.com

Information Irresponsibility 101: If you don’t like the data, claim bias in its collection.

Presidential elections are high-stakes affairs. So perhaps it is no surprise that when supporters of one candidate do not like the message they are hearing from the polls, they tend to blame the messenger.

In 2004, Democratic Web sites were convinced that the polls were biased toward George W. Bush, asserting that they showed an implausible gain in the number of voters identifying as Republicans. But in fact, the polls were very near the actual result. Mr. Bush defeated John Kerry by 2.5 percentage points, close to (in fact just slightly better than) the 1- or 2-point lead that he had on average in the final polls. Exit polls that year found an equal number of voters describing themselves as Democrats and Republicans, also close to what the polls had predicted.

Since President Obama gained ground in the polls after the Democrats’ convention, it has been the Republicans’ turn to make the same accusations.

Poll Averages Have No History of Consistent Partisan Bias - NYTimes.com

Saturday, September 29, 2012

Overkill Analytics

An alternative to underkill analytics, I guess.

And therein lies the beauty of overkill analytics, a term that Carter might have coined, but that appears to be catching on — especially in the world of web companies and big data. Carter says he doesn’t want to spend a lot of time fine-tuning models, writing complex algorithms or pre-analyzing data to make it work for his purposes. Rather, he wants to utilize some simple models, reduce things to numbers and process the heck out of the data set on as much hardware as is possible.

It’s not about big data so much as it is about big computing power, he said. There’s still work to be done on smaller data sets like the majority of the world deals with, but Hadoop clusters and other architectural advances let you do more to that data in a faster time than was previously possible. Now, Carter said, as long as you account for the effects of overprocessing data, you can create a black-box-like system and run every combination of simple techniques on data until you get the most-accurate answer.
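The recipe is small enough to sketch. Below is a hedged illustration in Python: scikit-learn classifiers stand in for whatever simple models Carter actually uses, the dataset is synthetic, and cross-validation plays the role of the guard against "overprocessing" the data. Nothing here is his pipeline; it just makes the brute-force shape of the idea concrete.

```python
# A toy "overkill analytics" loop: no fine-tuning, no clever algorithms,
# just an exhaustive sweep of simple models scored by cross-validation.
# Models, parameters, and data are illustrative stand-ins.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Deliberately simple model families, each with a coarse parameter grid.
grid = [
    ("logistic", LogisticRegression, "C", [0.01, 0.1, 1.0, 10.0]),
    ("tree", DecisionTreeClassifier, "max_depth", [2, 4, 8, 16]),
    ("forest", RandomForestClassifier, "n_estimators", [50, 100, 200]),
]

# "Process the heck out of it": score every combination. Cross-validation
# is the hedge against overprocessing (i.e., overfitting the sweep).
results = []
for name, cls, param, values in grid:
    for v in values:
        model = cls(**{param: v})
        score = cross_val_score(model, X, y, cv=5).mean()
        results.append((score, name, v))

print(max(results))  # best (accuracy, family, parameter) found by brute force
```

On a cluster, each (model, parameter) pair in the sweep is an independent job, which is why the approach trades analyst time for hardware so naturally.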

http://gigaom.com/data/forget-your-fancy-data-science-try-overkill-analytics/

W.T.F.M: Write The Freaking Manual - Floopsy

WTF, Man?

It seems that nowadays, the original admonition to R.T.F.M. is quickly running up against a prior need: someone has to W.T.F.M. first.

Developers: You spend hours, days, months, perhaps years refining your masterpiece.  It is an expression of your life’s work, heart and soul.  Why, then, would you shortchange yourself by providing poor or no documentation for the rest of us?

W.T.F.M: Write The Freaking Manual - Floopsy

Saturday, September 22, 2012

Exploring Local » Blog Archive » Google Maps announces a 400 year advantage over Apple Maps

Of the vast commentary generated by the high-profile failure of Apple Maps, this bit stands out as highly perceptive.  Again, the human factor in data quality (see the previous post) makes itself known.

Perhaps the most egregious error is that Apple’s team relied on quality control by algorithm and not a process partially vetted by informed human analysis. You cannot read about the errors in Apple Maps without realizing that these maps were being visually examined and used for the first time by Apple’s customers and not by Apple’s QC teams. If Apple thought that the results were going to be any different than they are, I would be surprised. Of course, hubris is a powerful emotion.

If you go back over this blog and follow my recounting of the history of Google’s attempts at developing a quality mapping service, you will notice that they initially tried to automate the entire process and failed miserably, as has Apple. Google learned that you cannot take the human out of the equation. While the mathematics of mapping appear relatively straightforward, I can assure you that if you take the informed human observer who possesses local and cartographic knowledge out of the equation, you will produce exactly what Apple has produced: a failed system.

The issue plaguing Apple Maps is not mathematics or algorithms; it is data quality, and there can be little doubt about the types of errors that are plaguing the system. What is happening to Apple is that their users are measuring data quality. Users look for familiar places they know on maps and use these as methods of orienting themselves, as well as for testing the goodness of maps. They compare maps with reality to determine their location. They query local businesses to provide local services. When these actions fail, the map has failed, and this is the source of Apple’s most significant problems. Apple’s maps are incomplete, illogical, positionally erroneous, out of date, and suffer from thematic inaccuracies.

Exploring Local » Blog Archive » Google Maps announces a 400 year advantage over Apple Maps

DARPA combines human brains and 120-megapixel cameras to create the ultimate military threat detection system | ExtremeTech

Talk about “The Human Side of Data Quality,” which, by the way, will be a theme of the 2013 MIT Chief Data Officer & Information Quality Conference.

There are two discrete parts to the system: The 120-megapixel camera, which is tripod-mounted and looks over the battlefield (pictured below); and the computer system, where a soldier sits in front of a computer monitor with an EEG strapped to his head (pictured above). Images from the camera are fed into the computer system, which runs cognitive visual processing algorithms to detect possible threats (enemy combatants, sniper nests, IEDs). These possible threats are then shown to a soldier whose brain then works out if they’re real threats — or a false alarm (a tree branch, a shadow thrown by an overhead bird).
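The division of labor is a straightforward two-stage pipeline, and a toy sketch makes it concrete. Everything below is hypothetical (the article describes hardware, not an API): a deliberately trigger-happy machine prefilter passes candidate detections to a human stage that, in the real system, is read out via EEG rather than a button press.

```python
# A hedged sketch of the two-stage, human-in-the-loop pipeline described in
# the article. All names, scores, and thresholds are made up for illustration.
from dataclasses import dataclass

@dataclass
class Candidate:
    label: str            # what the vision algorithms think they saw
    machine_score: float  # algorithmic confidence, 0..1

def machine_prefilter(detections, threshold=0.3):
    """Stage 1: keep anything remotely threat-like (recall over precision).
    A low threshold means many false alarms; that is the point, since the
    human stage is far better at rejecting them."""
    return [d for d in detections if d.machine_score >= threshold]

def human_verify(candidates, brain_says_threat):
    """Stage 2: show each candidate to the operator; the EEG reads the
    brain's recognition response instead of waiting for a conscious answer."""
    return [c for c in candidates if brain_says_threat(c)]

# Toy run: candidate detections from the camera's vision algorithms.
detections = [
    Candidate("sniper nest", 0.90),
    Candidate("tree branch", 0.40),      # false alarm the machine can't rule out
    Candidate("shadow of a bird", 0.35), # likewise
    Candidate("IED", 0.60),
]
flagged = machine_prefilter(detections)
confirmed = human_verify(flagged, lambda c: c.label in {"sniper nest", "IED"})
print([c.label for c in confirmed])  # ['sniper nest', 'IED']
```

The design choice worth noticing is that the machine stage is tuned for recall and the human stage for precision; neither alone performs as well as the pair.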

DARPA combines human brains and 120-megapixel cameras to create the ultimate military threat detection system | ExtremeTech

Data Scientist: The Sexiest Job of the 21st Century - Harvard Business Review

If thought leaders believe the most basic, universal skill for data scientists is the ability to write code, data science is at risk of repeating the narrative already experienced in more conventional information management. We now know that in designing information systems, it is a fool’s errand to favor technical prowess over technology-neutral skills like requirements analysis and conceptual modeling. Yes, big data tools are young and a little rough around the edges, so folks who can master the technology will be needed. But the basic, universal skills for data scientists must be acknowledged: understanding how data (both structured and unstructured) work and how humans experience them.

Data scientists’ most basic, universal skill is the ability to write code. This may be less true in five years’ time, when many more people will have the title “data scientist” on their business cards. More enduring will be the need for data scientists to communicate in language that all their stakeholders understand—and to demonstrate the special skills involved in storytelling with data, whether verbally, visually, or—ideally—both.

Data Scientist: The Sexiest Job of the 21st Century - Harvard Business Review

Saturday, September 15, 2012

Become Data Literate in 3 Simple Steps - The Data Journalism Handbook

While we’re on the topic of data quality and journalism…

Just as literacy refers to “the ability to read for knowledge, write coherently and think critically about printed material,” data literacy is the ability to consume for knowledge, produce coherently and think critically about data. Data literacy includes statistical literacy, but also understanding how to work with large data sets, how they were produced, how to connect various data sets, and how to interpret them.

Become Data Literate in 3 Simple Steps - The Data Journalism Handbook

He Said, She Said, and the Truth - NYTimes.com

File this under “Naive notions of information quality, formalized in our cultural and civic institutions.”  

But while balance may be necessary to mediating a dispute between teenage siblings, a different kind of balance — some call it “false equivalency” — has come under increasing fire. The firing squad is the public: readers and viewers who rely on accurate news reporting to make them informed citizens.

Simply put, false balance is the journalistic practice of giving equal weight to both sides of a story, regardless of an established truth on one side. And many people are fed up with it. They don’t want to hear lies or half-truths given credence on one side, and shot down on the other. They want some real answers.

He Said, She Said, and the Truth - NYTimes.com

Friday, September 14, 2012

Language Log » They cut me out

A lesson about the human side of information quality… sigh.

While Victor Mair sweats over sheets of Chinese characters and Mark Liberman generates graphs to see if the results of refereed papers can be replicated from reprocessed raw data, I just play. There's no linguistics at all in a piece like "I Wish I'd Said That", though it is sort of basically about language; and something similar is true for quite a few other posts listed on my reference page of Lingua Franca posts. But in today's piece, for once, everything I say is completely true, and I actually try to teach a tiny bit about syntactic ambiguity. And my reward was swift and cold: the compilers of the daily email newsletter through which The Chronicle points its subscribers to what they can find today on the web refused to include a pointer to my piece.

Language Log » They cut me out

Thursday, September 13, 2012

Write Good. Code Gooder.

Not surprising, but it confirms what years of experience have already shown: computer scientists are generally less articulate than we might hope. The link leads to a table showing scores on the GRE exam, broken out by intended field of study. Those who plan to study computer and information sciences scored miserably in both the verbal and the analytical writing sections. Could this be a factor in the Agile movement’s aversion to formal specifications? (Ya think?)

http://www.ets.org/s/gre/pdf/gre_guide_table4.pdf

An Open Letter to Wikipedia About Anatole Broyard and "The Human Stain" : The New Yorker

Not the first time Wikipedia’s preference for—insistence on, actually—low-quality data has drawn attention. (See also here.)

Yet when, through an official interlocutor, I recently petitioned Wikipedia to delete this misstatement, along with two others, my interlocutor was told by the “English Wikipedia Administrator”—in a letter dated August 25th and addressed to my interlocutor—that I, Roth, was not a credible source: “I understand your point that the author is the greatest authority on their own work,” writes the Wikipedia Administrator—“but we require secondary sources.”

An Open Letter to Wikipedia About Anatole Broyard and "The Human Stain" : The New Yorker

United States Patent: 8254902

Is that a threat as in “We should turn off this fellow’s cell phone because he is texting about his autobiographical screenplay while driving,” or as in “Stop him—he is organizing a protest march against the politically powerful?”

Moreover, in certain situations, the communications capability that the wireless device accords to its user may be what poses the threat.

United States Patent: 8254902

Wednesday, September 5, 2012

SQL vs. NoSQL | Linux Journal

I am well acquainted with the benefits of non-SQL approaches to data management. And I honor those who speak responsibly about the differences between relational and non-relational approaches, including one of my current clients, a vendor of a highly scalable non-SQL DBMS.

I’ve also noticed some irresponsible prattle on this topic, as has the author of this excellent article in Linux Journal.

This scaling myth is perpetuated and given credence every time popular Web sites announce that such-and-such RDBMS doesn't meet their needs, and so they are moving to NoSQL database X. The opinion of some in the RDBMS world is that many of these moves are not so much because the database they were using is deficient in some fundamental way, but because it was being used in a way for which it wasn't designed. To make an analogy, it's like people using flat-head screwdrivers to tighten Phillips-head screws, because it worked well enough to get the job done, but now they've discovered it is better to tighten Phillips screws with an actual Phillips screwdriver, and isn't it wonderful, and we should throw away all flat-head screwdrivers, because their time is past, and Phillips is the future.

One recent SQL-to-NoSQL move involved Digg.com moving from MySQL to Cassandra. As part of the move, Digg folks blogged about how they were using MySQL and why it didn't meet their needs. Others were skeptical. Dennis Forbes, in a series of posts on his site (see Resources), questioned whether Digg needed to use a NoSQL solution like Cassandra at all. His claims centered on what he considered very poor database usage on the part of Digg combined with inadequate hardware. In his mind, if Digg had just designed its database properly or switched to using SSDs in its servers, it would have had no problems. His best quote is this: “The way that many are using NoSQL is like discovering the buggy whip at the beginning of the automotive era.” Ouch.
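Forbes’s point, that many “RDBMS doesn’t scale” stories are really stories about missing database fundamentals, is easy to illustrate. Here is a toy sketch using Python’s sqlite3 module (standing in for any SQL database; this is not Digg’s actual schema): the same query that forces a full table scan becomes a cheap indexed search once the obvious index exists.

```python
# Toy illustration: the "wrong screwdriver" is often just a missing index.
# Schema and data are invented for the example.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE diggs (user_id INTEGER, item_id INTEGER)")
conn.executemany(
    "INSERT INTO diggs VALUES (?, ?)",
    [(u, i) for u in range(200) for i in range(50)],
)

query = "SELECT COUNT(*) FROM diggs WHERE user_id = ?"

# Without an index, the planner has no choice but a full table scan.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# e.g. (..., 'SCAN diggs')

# With the obvious index, the same query becomes an indexed search.
conn.execute("CREATE INDEX idx_diggs_user ON diggs (user_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# e.g. (..., 'SEARCH diggs USING COVERING INDEX idx_diggs_user (user_id=?)')
```

None of which says NoSQL is never the right tool; it says, with Forbes, that the diagnosis should precede the migration.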

SQL vs. NoSQL | Linux Journal

With Rise of Gene Sequencing, Ethical Puzzles - NYTimes.com

Reminds me of the “Fruit of the Poisonous Tree” doctrine from criminal law, which dictates that certain evidence, and all evidence that flows from it, must be ignored if it was collected illegitimately. That case and this one shed some light on a human aspect of information quality: our civic institutions sometimes demand (legitimately, in my opinion) that high-quality data be ignored.

Dr. Arul Chinnaiyan stared at a printout of gene sequences from a man with cancer, a subject in one of his studies. There, along with the man’s cancer genes, was something unexpected — genes of the virus that causes AIDS.

It could have been a sign that the man was infected with H.I.V.; the only way to tell was further testing. But Dr. Chinnaiyan, who leads the Center for Translational Pathology at the University of Michigan, was not able to suggest that to the patient, who had donated his cells on the condition that he remain anonymous.

With Rise of Gene Sequencing, Ethical Puzzles - NYTimes.com