8. Looking Too Hard for Patterns: a post about finding spurious patterns

Latest recommended article published on 2023-02-12 23:34:39

qianer

Category: Joy of Greenfoot. Tags: statistics recursion numbers graph character google

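Summary: This post looks at how Max, the main character of the film Pi, uses mathematics and computer science to try to predict the stock market, and what that shows about applying the scientific method properly and about the bias that comes from hunting for a result. Using Google Correlate to analyse search trends, it shows how misleading correlations can hide in data, and stresses how much correct use of statistics matters for understanding natural phenomena.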

Today, March 14th, is Pi Day. In celebration, this post is related to the film Pi.

Check out the retro style of his computer

Pi is the first film by Darren Aronofsky, who went on to make Requiem for a Dream and Black Swan. I’ll try not to spoil too much, but the starting premise is that the main character, Max, is a mathematician/computer-scientist, who believes he can model the stock market and predict future stock behaviour, if only he finds the right model. I was recently reminded of this central quote from Pi (via Tom Crick), which can be heard in the film’s trailer:

Restate my assumptions:

1. Mathematics is the language of nature.
2. Everything around us can be represented and understood through numbers.
3. If you graph these numbers, patterns emerge.
Therefore: there are patterns everywhere in nature.

By stating his assumptions, Max is following the scientific process (hurrah!). This allows us to analyse his assumptions and see if he has made a mistake. Indeed — the implication of his third assumption is flawed: if you graph things, patterns do emerge — but they might well be spurious.

Google Correlate

Google have released a tool that (inadvertently?) demonstrates this wonderfully: Google Correlate. The idea is that you can enter a term and see what other search terms produce a similar trend. That sounds somewhat useful. I decided to use the term “Greenfoot”. Here’s one of the top results I got at the time (Greenfoot is blue, the matching term is red):

That’s quite a decent match, and has a correlation coefficient of 0.9477. As Max suggested, we’ve graphed the numbers, and a pattern has emerged. This red term that matches so well with Greenfoot is… “Google Images”. Not very useful, and not much of a pattern: these two terms correlate well because they originated around the same time, and have grown in search-popularity with a similar pattern ever since. But really, this seems to me to be a spurious result (technically, a “type I” error): we’ve found an effect where really there is none.
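To see how easily two unrelated growing series can produce a coefficient like that, here is a minimal sketch in Python. It does not use Google Correlate data: the two series, their slopes, noise levels and the random seed are all made up for illustration. Each series is just a straight-line trend plus its own independent noise, so any correlation between them comes purely from the fact that both grow over time.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def growing_series(weeks, slope, noise, rng):
    """Hypothetical search-popularity series: linear growth plus independent noise."""
    return [slope * t + rng.gauss(0, noise) for t in range(weeks)]

rng = random.Random(314)  # fixed seed so the run is repeatable (arbitrary choice)
term_a = growing_series(300, slope=1.0, noise=20.0, rng=rng)
term_b = growing_series(300, slope=0.4, noise=8.0, rng=rng)

# Correlation of the raw series is dominated by the shared upward trend.
r_raw = pearson(term_a, term_b)

# Week-to-week changes remove the trend, leaving mostly the independent noise.
diff_a = [later - earlier for earlier, later in zip(term_a, term_a[1:])]
diff_b = [later - earlier for earlier, later in zip(term_b, term_b[1:])]
r_diff = pearson(diff_a, diff_b)

print(f"raw series:     r = {r_raw:.3f}")
print(f"weekly changes: r = {r_diff:.3f}")
```

The raw series should correlate strongly, while the week-to-week changes should show almost no relationship. Comparing changes rather than levels is one simple check (my suggestion, not something Google Correlate does) for whether a match is anything more than two things growing at the same time.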

This is the problem with Max’s approach. There are patterns everywhere if you look hard enough, but that doesn’t mean that they’re useful. And this is a real problem in science, especially with measurement techniques that generate a large amount of data (on which you can then perform a large variety of analyses). One example of a troublesome area is the neuroscience technique fMRI, where too many comparisons can lead to a dead fish detecting human emotions. The quality of our understanding of the human brain is dependent on statistics being applied properly… by human brains. (Recursion!)
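The "look hard enough" effect can be sketched in a few lines of Python. Everything here is an assumption for illustration: the target and all the candidates are pure random noise, 20 points each, with an arbitrary seed. The only thing that changes is how many candidates we compare against the target before picking the best match.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(2718)  # fixed seed so the run is repeatable (arbitrary choice)
n_points = 20
target = [rng.gauss(0, 1) for _ in range(n_points)]  # noise, no real signal

checkpoints = (10, 100, 1000, 10000)
best_after = {}  # number of candidates tried -> strongest |r| seen so far
best = 0.0
for tried in range(1, max(checkpoints) + 1):
    candidate = [rng.gauss(0, 1) for _ in range(n_points)]  # also pure noise
    best = max(best, abs(pearson(target, candidate)))
    if tried in checkpoints:
        best_after[tried] = best

for tried in checkpoints:
    print(f"best |r| after {tried:>5} candidates: {best_after[tried]:.3f}")
```

None of the candidates has anything to do with the target, yet the best match keeps getting more impressive the more candidates we try. A search tool that compares one term against a huge pool of others is in the same position.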

And so in Pi, Max demonstrates the dark side of science: an obsession with finding a result that drives him so hard that he loses his impartiality and risks finding phantom results. There are techniques to mitigate this problem, called alpha-level correction, and I intend to cover some statistics in future blog posts which will explain these sorts of issues.
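As a small taste of what such a correction looks like, here is a sketch of one common version, the Bonferroni correction, which simply divides the significance threshold by the number of tests. The figures (a threshold of 0.05 and 100 tests) and the function names are my own choices for illustration, and the calculation assumes the tests are independent and that no real effect exists in any of them.

```python
def familywise_error_rate(alpha_per_test, n_tests):
    """Chance of at least one false positive across n independent tests
    when there is no real effect in any of them."""
    return 1 - (1 - alpha_per_test) ** n_tests

def bonferroni_alpha(alpha, n_tests):
    """Bonferroni correction: split the overall threshold evenly across the tests."""
    return alpha / n_tests

alpha = 0.05    # illustrative overall significance threshold
n_tests = 100   # illustrative number of comparisons

uncorrected = familywise_error_rate(alpha, n_tests)
corrected = familywise_error_rate(bonferroni_alpha(alpha, n_tests), n_tests)

print(f"per-test threshold after correction: {bonferroni_alpha(alpha, n_tests):.4f}")
print(f"chance of a phantom result, uncorrected: {uncorrected:.3f}")
print(f"chance of a phantom result, corrected:   {corrected:.3f}")
```

Without the correction, running that many tests makes at least one phantom result close to certain. With it, the overall chance stays at or below the original threshold, at the cost of making each individual test harder to pass.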