“BOLLOCKS”, says a Cambridge professor. “Hubris,” write researchers at Harvard. “Big data is bullshit,” proclaims Obama’s reelection chief number-cruncher[1]. A few years ago almost no one had heard of “big data”. Today it’s hard to avoid, and as a result the digerati love to condemn it. Wired, Time, Harvard Business Review and other publications are falling over themselves[2] to dance on its grave[3]. “Big data: are we making a big mistake?,” asks the Financial Times. “Eight (No, Nine!) Problems with Big Data,” says the New York Times. What explains the big-data backlash?
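backlash [ˈbæklæʃ] a strong negative reaction; opposition
proclaims [prəˈkleɪmz] declares publicly
reelection election to a further term of office
digerati [ˌdɪdʒəˈrɑːti] computer experts; the digital elite
condemn [kənˈdem] to criticise strongly; to denounce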
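“Nonsense,” says a Cambridge professor. “Arrogance,” write researchers at Harvard. “Big data is rubbish,” declares the chief statistician[1] of Obama’s reelection campaign. A few years ago hardly anyone had heard of big data. Now the term is everywhere, and so technology insiders have taken to attacking it. Wired, Time, Harvard Business Review and other publications are eagerly competing[2] to pronounce big data dead[3]. “Big data: are we making a big mistake?” asks the Financial Times. “Eight (No, Nine!) Problems with Big Data,” says the New York Times. Where does this strong reaction against big data come from?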
Big data refers to the idea that society can do things with a large body of data that weren’t possible when working with smaller amounts. The term was originally applied a decade ago to massive datasets from astrophysics, genomics and internet search engines, and to machine-learning systems (for voice-recognition and translation, for example) that only work well when given lots of data to chew on. Now it refers to the application of data-analysis and statistics in new areas, from retailing to human resources. The backlash began in mid-March, prompted by an article in Science by David Lazer and others at Harvard and Northeastern University. It showed that a big-data poster-child[1] (Google Flu Trends, a 2009 project which identified flu outbreaks from search queries alone) had overestimated the number of cases for four years running, compared with reported data from the Centres for Disease Control (CDC). This led to a wider attack on the idea of big data.
astrophysics [ˌæstrəʊˈfɪzɪks] the physics of stars and other celestial bodies
genomics [dʒiˈnɒmɪks] the study of genomes
retailing [ˈriːteɪlɪŋ] the retail trade
prompted [ˈprɒmptɪd] caused; brought about
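Big data is the idea that society can do things with a very large body of data that it could not do with smaller amounts. Ten years ago the term described the huge datasets of astrophysics, genomics and internet search engines, along with machine-learning systems (such as voice recognition and translation) that only perform well when fed large volumes of data. Today it covers the use of data analysis and statistics in new fields, from retail to human resources. The current wave of criticism began in mid-March with an article in Science by David Lazer and colleagues at Harvard and Northeastern University. The article showed that Google Flu Trends, a model example[1] of big data and a 2009 project that detected flu outbreaks using search queries alone, had overestimated the number of flu cases four years in a row when compared with the figures reported by the Centres for Disease Control (CDC). That finding set off a broader attack on the idea of big data.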
The criticisms fall into three areas that are not intrinsic to big data per se[1], but endemic to data analysis, and have some merit. First, there are biases inherent[2] to data that must not be ignored. That is undeniably the case. Second, some proponents of big data have claimed that theory (ie, generalisable models about how the world works) is obsolete. In fact, subject-area[3] knowledge remains necessary even when dealing with large data sets. Third, the problem of spurious correlations (associations that are statistically robust but only happen by chance) increases with more data. Although there are new statistical techniques to identify and banish spurious correlations, such as running many tests against subsets of the data, this will always be a problem.
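intrinsic [ɪnˈtrɪnzɪk] belonging to the essential nature of something
endemic [enˈdemɪk] regularly found in; characteristic of
merit [ˈmerɪt] value; a good quality
biases [ˈbaɪəsɪz] systematic distortions; prejudices
inherent [ɪnˈherənt] built in; existing as a natural part
proponents [prəˈpəʊnənts] supporters; advocates
obsolete [ˈɒbsəliːt] out of date
spurious [ˈspjʊəriəs] false; not genuine
correlation [ˌkɒrəˈleɪʃn] a statistical relationship between two things
associations [əˌsəʊsɪˈeɪʃ(ə)nz] connections; links
statistically [stəˈtɪstɪkli] in statistical terms
robust [rəʊˈbʌst] strong; holding up under testing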
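The criticisms fall into three areas. None is peculiar to big data itself[1]; all are common to data analysis in general, and each has some validity. First, data contain built-in biases[2] that must not be ignored. That is undeniably true. Second, some supporters of big data have claimed that theory, meaning general models of how the world works, is out of date. In fact, knowledge of the subject area[3] is still needed even when working with large datasets. Third, spurious correlations become more of a problem as the amount of data grows. A spurious correlation is an association that looks statistically strong but arises only by chance. New statistical techniques, such as running many tests against subsets of the data, can identify and remove such correlations, but the problem will never disappear entirely.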
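The third criticism, that spurious correlations multiply as data grows, can be illustrated with a small simulation. The sketch below is not from the article: it assumes every variable is independent Gaussian noise, and the function names `pearson` and `strongest_chance_correlation` are hypothetical. It picks the noise variable that correlates best with a noise target on one half of the samples, then re-tests that variable on the other half, a simple stand-in for "running many tests against subsets of the data".

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def strongest_chance_correlation(n_variables, n_samples, rng):
    """Return (best_index, r_first_half, r_second_half) for pure-noise data.

    The target and every candidate variable are independent Gaussian noise
    (an assumption of this sketch), so any correlation found is spurious
    by construction.
    """
    half = n_samples // 2
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    candidates = [
        [rng.gauss(0, 1) for _ in range(n_samples)]
        for _ in range(n_variables)
    ]
    # "Discover" the best-looking variable using only the first half.
    scores = [abs(pearson(c[:half], target[:half])) for c in candidates]
    best = max(range(n_variables), key=scores.__getitem__)
    # Re-test that same variable on the held-out second half.
    r_first = pearson(candidates[best][:half], target[:half])
    r_second = pearson(candidates[best][half:], target[half:])
    return best, r_first, r_second


if __name__ == "__main__":
    rng = random.Random(0)
    for n_variables in (10, 100, 1000):
        _, r_first, r_second = strongest_chance_correlation(n_variables, 40, rng)
        print(f"{n_variables:5d} variables: first half r={r_first:+.2f}, "
              f"held-out r={r_second:+.2f}")
```

With more candidate variables, the best first-half correlation tends to look stronger even though nothing real is there. The held-out value has no such tendency, which is why a subset check helps. It does not make the problem go away: a chance correlation can occasionally survive the second test too.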
There is some merit to the naysayers’ case, in other words. But these criticisms do not mean that big-data analysis has no merit whatsoever. Even the Harvard researchers who decried big data “hubris” admitted in Science that melding Google Flu Trends analysis with CDC’s data improved the overall forecast, showing that big data can in fact be a useful tool. And research published in PLOS Computational Biology on April 17th shows it is possible to estimate the prevalence of the flu based on visits to Wikipedia articles related to the flu. Behind the big data backlash is the classic hype cycle[1], in which a technology’s early proponents make overly grandiose claims, people sling arrows when those promises fall flat[2], but the technology eventually transforms the world, though not necessarily in ways the pundits expected. It happened with the web, and television, radio, motion pictures and the telegraph before it. Now it is simply big data’s turn to face the grumblers.
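naysayers opponents; people who object
whatsoever at all (used for emphasis after a negative)
decry [dɪˈkraɪ] to condemn publicly
prevalence [ˈprevələns] how widespread something is
grandiose [ˈɡrændiəʊs] exaggerated; pompous
sling [slɪŋ] to throw
pundit [ˈpʌndɪt] an expert commentator
grumbler a complainer; a dissenter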
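In other words, the critics have a point. But their criticisms do not mean that big-data analysis is worthless. Even the Harvard researchers who condemned big-data “arrogance” conceded in Science that combining the Google Flu Trends analysis with CDC data improved the overall forecast, which shows that big data can indeed be a useful tool. Research published in PLOS Computational Biology on April 17th also shows that flu prevalence can be estimated from visits to flu-related Wikipedia articles. Behind the backlash lies the classic hype cycle[1]: a technology’s early promoters make exaggerated claims, people attack it when those promises come to nothing[2], and yet the technology eventually changes the world, though not necessarily in the way the experts expected. The same thing happened with the web, and before it with television, radio, motion pictures and the telegraph. Now it is simply big data’s turn to face the complainers.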