Splitting One Row into Multiple Rows, with Dynamic Columns

Raqsoft (润乾软件)

Published 2021-10-15 09:16:21

License: CC 4.0 BY-SA

Article link: https://blog.youkuaiyun.com/raqsoft/article/details/120776827

【Question】

I have a CSV file whose awful format I cannot change (simplified here):

Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"

My desired output is a new CSV containing:

inc,label,one,two,three
1,"a",1,1.5,"5 Things"
2,"a",5,5.5,"10 Things"
3,"a",9,9.5,"15 Things"
1,"b",2,2.5,"10 Things"
2,"b",6,6.5,"20 Things"
3,"b",10,10.5,"30 Things"

Basically:

lowercase the headers
strip off header prefixes and preserve them by adding them to a new column
remove header repetitions in later rows
stack each column that shares the latter part of their names (e.g. a_One and b_One values should be merged into the same column).
During this process, preserve the Inc value from the original row (there may be more than one row like this in various places).

With caveats:

I don't know the column names ahead of time (many files, many different columns). These need to be parsed if they are to be used as logic for stripping the repetitious header rows.
There may or may not be more than one column with properties like Inc that need to be preserved when everything gets stacked. Generally, Inc represents any column that does not have a prefix like a_ or b_. I have a regex to strip out these prefixes already.
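
The second caveat comes down to deciding, at run time, which columns are keys (no prefix) and which belong to a prefixed group. A minimal Python sketch of that classification follows. It is not part of the original question; the pattern `PREFIX_RE` and the helper `classify_headers` are hypothetical names, and the sketch assumes a prefix is everything before the first underscore.

```python
import re

# Assumed prefix pattern (hypothetical): text before the first "_" is the label.
PREFIX_RE = re.compile(r"^([^_]+)_(.+)$")


def classify_headers(headers):
    """Split header names into key columns and prefixed columns.

    Returns (keys, stacked):
      keys    maps column position -> lowercased key name (e.g. "inc")
      stacked maps label -> {lowercased column name: column position}
    """
    keys = {}
    stacked = {}
    for pos, header in enumerate(headers):
        match = PREFIX_RE.match(header)
        if match is None:
            # No prefix: a column like Inc that must be preserved on every row.
            keys[pos] = header.lower()
        else:
            label, name = match.groups()
            stacked.setdefault(label, {})[name.lower()] = pos
    return keys, stacked


headers = ["Inc", "a_One", "a_Two", "a_Three", "b_One", "b_Two", "b_Three"]
keys, stacked = classify_headers(headers)
print(keys)     # key columns by position
print(stacked)  # prefixed columns grouped by label
```

Because positions are recorded rather than assumed, the same function works for any number of prefixes and any number of unprefixed columns.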

So far, I've accomplished this:

> wip_path <- 'C:/path/to/horrible.csv'
> rawwip <- read.csv(wip_path, header = FALSE, fill = FALSE)
> rawwip
   V1    V2    V3        V4    V5    V6        V7
1 Inc a_One a_Two   a_Three b_One b_Two   b_Three
2   1     1   1.5  5 Things     2   2.5 10 Things
3   2     5   5.5 10 Things     6   6.5 20 Things
4 Inc a_One a_Two   a_Three b_One b_Two   b_Three
5   3     9   9.5 15 Things    10  10.5 30 Things

> skips <- which(rawwip$V1==rawwip[1,1])
> skips
[1] 1 4

> filwip <- rawwip[-skips,]
> filwip
  V1 V2  V3        V4 V5   V6        V7
2  1  1 1.5  5 Things  2  2.5 10 Things
3  2  5 5.5 10 Things  6  6.5 20 Things
5  3  9 9.5 15 Things 10 10.5 30 Things

> rawwip[1,]
   V1    V2    V3      V4    V5    V6      V7
1 Inc a_One a_Two a_Three b_One b_Two b_Three

But then when I try to apply a tolower() to these strings, I get:

> tolower(rawwip[1,])
[1] "4" "4" "4" "4" "4" "4" "4"

And this is quite unexpected.

So my questions are:

1) How can I gain access to the header strings in `rawwip[1,]` so that I can reformat them with `tolower()` and other string-manipulating functions?

2) Once I’ve done that, what’s the most effective way to stack the columns with shared names while preserving the inc value for each row?

Bear in mind, there will be well over a thousand repetitious columns that can be filtered down to perhaps 20 shared column names. I will not know the position of each stackable column ahead of time. This needs to be determined within the script.

【Answer】

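If the transformation were static, this would be fairly easy to do in R. A dynamic transformation is harder: for example, when the field names can be arbitrary strings (not just a, b, one, two, three) and there can be any number of them (not just 2*3). SPL handles the dynamic case easily, with the following code: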

	A	B
1	=file("d:\\source.csv").import@tc()
2	=A1.select(Inc!="Inc")
3	=A1.fname().to(2,).(~.split("_"))
4	=A3.id(~(1))	=A3.id(~(2))
5	=create(inc,label,${B4.(lower(~)).concat@c()})
6	=A2.run(A4.run(A5.record(A2.Inc|A4.~|B4.(A2.~.field(A4.~+"_"+~)))))
7	=file("d:\\result.csv").export@tc(A5)
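
For readers who do not use SPL, the same idea can be sketched in Python with only the standard library. This is an illustrative equivalent under stated assumptions, not the author's code: `stack_rows` and `SOURCE` are hypothetical names, a repeated header row is assumed to be identical to the first row, and any column name without the separator `_` is treated as a key column to preserve. The sketch loops over labels first so that rows come out in the order shown in the question.

```python
import csv
import io

# Sample data copied from the question; a real script would read a file instead.
SOURCE = """\
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
1,1,1.5,"5 Things",2,2.5,"10 Things"
2,5,5.5,"10 Things",6,6.5,"20 Things"
Inc,a_One,a_Two,a_Three,b_One,b_Two,b_Three
3,9,9.5,"15 Things",10,10.5,"30 Things"
"""


def stack_rows(text, sep="_"):
    """Return the stacked table as a list of rows (first row is the header)."""
    rows = list(csv.reader(io.StringIO(text)))
    header = rows[0]
    # Drop every later row that repeats the header verbatim.
    body = [row for row in rows[1:] if row != header]

    key_pos = []   # positions of unprefixed columns such as Inc
    labels = []    # distinct prefixes, in first-seen order
    names = []     # distinct stripped column names, in first-seen order
    pos = {}       # (label, name) -> column position
    for i, column in enumerate(header):
        if sep not in column:
            key_pos.append(i)
            continue
        label, name = column.split(sep, 1)
        if label not in labels:
            labels.append(label)
        if name not in names:
            names.append(name)
        pos[(label, name)] = i

    out = [[header[i].lower() for i in key_pos]
           + ["label"]
           + [name.lower() for name in names]]
    for label in labels:
        for row in body:
            # A label that lacks some column gets an empty string there.
            values = [row[pos[(label, n)]] if (label, n) in pos else ""
                      for n in names]
            out.append([row[i] for i in key_pos] + [label] + values)
    return out


result = stack_rows(SOURCE)
buffer = io.StringIO()
csv.writer(buffer).writerows(result)
print(buffer.getvalue())
```

Note that `csv.writer` quotes fields only when necessary by default, so the printed text will not reproduce the exact quoting of the desired output in the question; pass a different `quoting` option if that matters.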