Python as a statistics workbench

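This article explores using Python as a statistics workbench, compares the strengths and weaknesses of Python against dedicated statistics software such as R, and recommends a set of Python libraries, including NumPy, SciPy, pandas, and statsmodels, that cover everything from simple descriptive statistics to complex statistical modeling.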

Lots of people use a main tool like Excel or another spreadsheet, SPSS, Stata, or R for their statistics needs. They might turn to some specific package for very special needs, but a lot of things can be done with a simple spreadsheet or a general stats package or stats programming environment.

I've always liked Python as a programming language, and for simple needs, it's easy to write a short program that calculates what I need. Matplotlib allows me to plot it.
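As a rough illustration of the kind of short program meant here, the following sketch uses only the standard library's `statistics` module. The `samples` values and the `describe` helper are made up for this example and are not part of any package.

```python
import statistics

# Hypothetical sample measurements, invented for illustration.
samples = [2.5, 3.1, 2.8, 3.6, 2.9, 3.3]


def describe(values):
    """Return a few basic descriptive statistics as a dict."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


summary = describe(samples)
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

This works for quick one-off calculations, but every new statistic needs its own helper, which is the limitation the question goes on to describe.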

Has anyone switched completely from, say, R to Python? R (or any other statistics package) has a lot of functionality specific to statistics, and it has data structures that let you think about the statistics you want to perform rather than the internal representation of your data. Python (or some other dynamic language) has the benefit of letting me program in a familiar, high-level language, and it lets me programmatically interact with the real-world systems in which the data resides or from which I can take measurements. But I haven't found any Python package that would allow me to express things in "statistical terminology", from simple descriptive statistics to more complicated multivariate methods.

What can you recommend if I wanted to use Python as a "statistics workbench" to replace R, SPSS, etc.?

What would I gain and lose, based on your experience?

It's hard to ignore the wealth of statistical packages available in R/CRAN. That said, I spend a lot of time in Python land and would never dissuade anyone from having as much fun as I do. :) Here are some libraries/links you might find useful for statistical work.

NumPy/SciPy: You probably know about these already. But let me point out the Cookbook where you can read about many statistical facilities already available and the Example List which is a great reference for functions (including data manipulation and other operations). Another handy reference is John Cook's Distributions in SciPy.
pandas: This is a really nice library for working with statistical data -- tabular data, time series, panel data. Includes many builtin functions for data summaries, grouping/aggregation, pivoting. Also has a statistics/econometrics library.
larry: Labeled array that plays nice with NumPy. Provides statistical functions not present in NumPy and good for data manipulation.
python-statlib: A fairly recent effort which combined a number of scattered statistics libraries. Useful for basic and descriptive statistics if you're not using NumPy or pandas.
statsmodels: Statistical modeling: Linear models, GLMs, among others.
scikits: Statistical and scientific computing packages -- notably smoothing, optimization and machine learning.
PyMC: For your Bayesian/MCMC/hierarchical modeling needs. Highly recommended.
PyMix: Mixture models.
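
To give a feel for the data summaries and grouping/aggregation mentioned for pandas, here is a minimal sketch. The `group` and `score` columns and their values are invented for illustration, and the calls shown are just one plausible way to get a summary, assuming a reasonably recent pandas is installed.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "score": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Descriptive summary of one column (count, mean, std, quartiles, ...).
overall = df["score"].describe()

# Grouping/aggregation: mean and count of score within each group.
by_group = df.groupby("group")["score"].agg(["mean", "count"])

print(overall)
print(by_group)
```

The labeled columns let the code talk about "score by group" rather than array indices, which is close to the "statistical terminology" the question asks for.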