The Wide and Long Data Format for Repeated Measures Data

https://www.theanalysisfactor.com/wide-and-long-data/

by KAREN GRACE-MARTIN

One issue in data analysis that feels like it should be obvious, but often isn’t, is setting up your data.

The kinds of issues involved include:

  • What is a variable?
  • What is a unit of observation?
  • Which data should go in each row of the data matrix?

Answering these practical questions is one of those skills that comes with experience, especially in complicated data sets.

Even so, it’s extremely important. If the data isn’t set up right, the software won’t be able to run any of your analyses.

And in many data situations, you will need to set up the data in different ways for different parts of the analysis. This article outlines one of the issues in data setup: using the long vs. the wide data format.

The Wide Format

In the wide format, a subject’s repeated responses will be in a single row, and each response is in a separate column.

For example, in this data set, each county was measured at four time points, once every 10 years starting in 1970. The outcome variable, Jobs, indicates the number of jobs in each county. There are three predictor variables: Land Area, Natural Amenity (4=No and 3=Yes), and College, the proportion of the county population in that year that had graduated from college.

Since land area and presence of a natural amenity don’t change from decade to decade, those predictors have only one variable per county. But both our outcome, Jobs, and one predictor, College, have different values in each year, so each requires a different variable (column) for each year.

The Long Format

In the long format, each row is one time point per subject. So each subject (county) will have data in multiple rows. Any variables that don’t change across time will have the same value in all the rows.

You can see the same five counties’ data below in the long format. Each county has four rows of data–one for each year.

All the same information is there; we’ve just set up the data differently.

We no longer need four columns for either Jobs or College. Instead, all four values of Jobs for each county are stacked–they’re all in the Jobs column. The same is true for the four values of College.

But to keep track of which observation occurred in which year, we need to add a variable, Year.

You’ll notice that variables that didn’t change from year to year–Land Area and Natural Amenity–have the same value in each of the four rows for each county. It looks strange, but it’s okay to have it this way: as long as you analyze the data using the correct procedures, the analysis will take into account that these values are redundant.
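To make the two layouts concrete, here is a minimal sketch in plain Python that stacks wide-format rows into long-format rows. The column names (`jobs_1970`, `college_1970`, and so on), the county labels, and every value are invented for illustration; they are not the article’s actual data.

```python
# Hypothetical wide-format data: one record per county, one key per
# variable-year combination. Names and values are made up for illustration.
YEARS = [1970, 1980, 1990, 2000]

wide_rows = [
    {"county": "A", "land_area": 500, "amenity": 3,
     "jobs_1970": 1000, "jobs_1980": 1200, "jobs_1990": 1500, "jobs_2000": 1700,
     "college_1970": 0.10, "college_1980": 0.12,
     "college_1990": 0.15, "college_2000": 0.18},
    {"county": "B", "land_area": 800, "amenity": 4,
     "jobs_1970": 3000, "jobs_1980": 3100, "jobs_1990": 3300, "jobs_2000": 3600,
     "college_1970": 0.20, "college_1980": 0.22,
     "college_1990": 0.25, "college_2000": 0.30},
]


def wide_to_long(rows, years, varying=("jobs", "college"),
                 fixed=("county", "land_area", "amenity")):
    """Return one record per county per year."""
    long_rows = []
    for row in rows:
        for year in years:
            # Time-invariant variables are copied into every row.
            record = {name: row[name] for name in fixed}
            # The new Year variable tracks which occasion this row is.
            record["year"] = year
            # Each year-specific column collapses into one stacked column.
            for stem in varying:
                record[stem] = row[f"{stem}_{year}"]
            long_rows.append(record)
    return long_rows


long_rows = wide_to_long(wide_rows, YEARS)
for record in long_rows[:4]:
    print(record)
```

Two counties with four decades each become eight rows, and `land_area` and `amenity` repeat within each county, just as described above.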

A Comparison of the Two Approaches

One reason for setting up the data in one format or the other is simply that different analyses require different set ups.

For example, in all software that I know of, the wide format is required for MANOVA and repeated measures procedures.

Many data manipulations are much, much easier as well when data are in the wide format.

Likewise, mixed models and many survival analysis procedures require data to be in the long format.

Beyond software requirements, each approach has analytical implications. For example, in the wide format, the unit of analysis is the subject–the county–whereas in the long format, the unit of analysis is each measurement occasion for each county.

The practical difference is that when the occasion is the unit of analysis, you can use each decade’s college education rate as a covariate for the same decade’s Jobs value. In the wide format, when the unit of analysis is the county, there is no way to do this. You can use any one of the college rates as a covariate for all years, but you can’t have decade-specific covariates.
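As a rough illustration of why the long format allows decade-specific covariates, the sketch below pairs each row’s College value with the same row’s Jobs value and fits a pooled least-squares slope. The data are invented, and the helper `pooled_slope` is hypothetical. It also ignores the correlation among rows from the same county, so it is not a substitute for a mixed model; it only shows that each occasion contributes its own covariate value.

```python
# Hypothetical long-format data: one record per county per decade.
# All values are made up for illustration.
long_rows = [
    {"county": "A", "year": 1970, "college": 0.10, "jobs": 1000},
    {"county": "A", "year": 1980, "college": 0.12, "jobs": 1200},
    {"county": "A", "year": 1990, "college": 0.15, "jobs": 1500},
    {"county": "A", "year": 2000, "college": 0.18, "jobs": 1700},
    {"county": "B", "year": 1970, "college": 0.20, "jobs": 3000},
    {"county": "B", "year": 1980, "college": 0.22, "jobs": 3100},
    {"county": "B", "year": 1990, "college": 0.25, "jobs": 3300},
    {"county": "B", "year": 2000, "college": 0.30, "jobs": 3600},
]


def pooled_slope(rows, x="college", y="jobs"):
    """Ordinary least-squares slope of y on x, treating every row as one case."""
    xs = [row[x] for row in rows]
    ys = [row[y] for row in rows]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Each term pairs a decade's covariate with the same decade's outcome.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    sxx = sum((a - mean_x) ** 2 for a in xs)
    return sxy / sxx


slope = pooled_slope(long_rows)
print(f"Pooled slope of jobs on college: {slope:.1f}")
```

In the wide layout, the same calculation would have to pick a single College column to stand in for all four decades.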

Another implication is that in the wide format, those repeated outcomes are considered different and non-interchangeable variables. Each can have its own distribution. Each is distinct. This makes sense in the county example, where each observation occurred in the same four years for every county. But if each county had been measured a different number of times, or measured in different years, this setup doesn’t make a lot of sense.

So it’s important to think about the implications before you enter data.

Luckily, converting from one to the other is generally not too difficult in most software packages. For example, you can do it with Proc Transpose in SAS or with the Restructure wizard in SPSS.
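For readers working outside SAS or SPSS, the same restructuring can be sketched in a few lines of plain Python. The function below goes from long to wide; the name `long_to_wide`, the `stem_year` column naming, and the data are all assumptions made for illustration. In pandas, `melt`, `wide_to_long`, and `pivot` cover the same ground.

```python
# Hypothetical long-format data; all values are made up for illustration.
long_rows = [
    {"county": "A", "land_area": 500, "year": 1970, "jobs": 1000, "college": 0.10},
    {"county": "A", "land_area": 500, "year": 1980, "jobs": 1200, "college": 0.12},
    {"county": "B", "land_area": 800, "year": 1970, "jobs": 3000, "college": 0.20},
    {"county": "B", "land_area": 800, "year": 1980, "jobs": 3100, "college": 0.22},
]


def long_to_wide(rows, id_key="county", time_key="year",
                 varying=("jobs", "college")):
    """Collapse one-row-per-occasion data into one row per subject."""
    wide = {}
    for row in rows:
        # Start each subject's record with its time-invariant variables.
        fixed = {key: value for key, value in row.items()
                 if key != time_key and key not in varying}
        record = wide.setdefault(row[id_key], fixed)
        # Spread each time-varying value into its own year-specific column.
        for stem in varying:
            record[f"{stem}_{row[time_key]}"] = row[stem]
    return list(wide.values())


wide_rows = long_to_wide(long_rows)
for record in wide_rows:
    print(record)
```

A county missing a year simply ends up without that year’s columns, which is one place where the wide layout starts to get awkward for unbalanced data.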

This is a good skill to practice, as it’s quite helpful to be able to switch back and forth. For example, it’s often easier to enter and manipulate data in the wide format, even if you need to analyze it in the long format.
