LC_COLLATE is a variable which determines the collation order

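This article looks at a problem that comes up when converting string case with the tr command in the Bash shell: why letter ranges are said to need quoting to avoid filename globbing, and how the LC_COLLATE setting affects the way character range expressions are interpreted.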


There's a code snippet in the ABS (Advanced Bash-Scripting) guide:

#!/bin/bash
# uppercase.sh : Changes input to uppercase.

tr 'a-z' 'A-Z'
#  Letter ranges must be quoted
#+ to prevent filename generation from single-letter filenames.

exit 0

Why do I have to quote the letter ranges?


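The snippet's comments are wrong (as is much of the ABS; it's a very poor reference and should not be used). A bare a-z contains no glob characters, so the shell passes it to tr unchanged whether or not it is quoted.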

If there were square brackets:

tr [A-Z] [a-z]

...then you'd have a concern about [A-Z] matching files named A, B, etc., because a bracket expression matches exactly one character. For a more visible demonstration, try this:

mkdir -p ~/tmp
cd ~/tmp
touch A B C
echo tr [A-Z] [a-z]

...and see what it emits.


As a note -- it's possible to get in trouble here even without single-character filenames on your disk if the nullglob option is set. To demonstrate that:

rm -rf ~/tmp
mkdir -p ~/tmp
cd ~/tmp
shopt -s nullglob
echo tr [A-Z] [a-z]

...and you'll see that tr is invoked with no arguments at all, since [A-Z] and [a-z] are both interpreted as glob expressions that don't match any files, and nullglob tells the shell to simply replace such glob expressions with nothing at all.


To be clear -- glob expansion has nothing to do with tr specifically; the shell would change an unquoted [A-Z] to a list of single-character filenames matching the pattern no matter what program is being run.
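The effect of quoting can be checked in a throwaway directory. This is only a minimal sketch: it assumes mktemp is available, the variable names (scratch, unquoted, quoted) are illustrative, and LC_ALL=C is set purely so that the unquoted result does not depend on the locale behavior discussed later in this post.

```shell
#!/bin/bash
# Assumption: force the C locale so the bracket ranges behave predictably.
export LC_ALL=C

# Work in a fresh scratch directory so no existing files are involved.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch A B C

# Unquoted: the shell expands [A-Z] to the matching filenames before echo runs.
# [a-z] matches nothing here, so (with nullglob off) it is left as typed.
unquoted=$(echo tr [A-Z] [a-z])

# Quoted: both patterns reach the command exactly as written.
quoted=$(echo tr '[A-Z]' '[a-z]')

echo "unquoted: $unquoted"
echo "quoted:   $quoted"

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```

Swapping echo for any other command gives the same argument list, which is the point: the expansion happens in the shell, not in the program being run.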







///////////////////////////////////////////////////////////////////////////

Note that when using range expressions like [a-z], letters of the other case may be included, depending on the setting of LC_COLLATE.

LC_COLLATE is a variable which determines the collation order used when sorting the results of pathname expansion, and determines the behavior of range expressions, equivalence classes, and collating sequences within pathname expansion and pattern matching.


Consider the following:

$ touch a A b B c C x X y Y z Z
$ ls
a  A  b  B  c  C  x  X  y  Y  z  Z
$ echo [a-z] # Note the missing uppercase "Z"
a A b B c C x X y Y z
$ echo [A-Z] # Note the missing lowercase "a"
A b B c C x X y Y z Z

One would expect echo [a-z] to list only the files with lowercase names, and echo [A-Z] only the files with uppercase names. Neither does.


Standard collations with locales such as en_US have the following order:

aAbBcC...xXyYzZ
  • Between a and z (in [a-z]) are ALL uppercase letters, except for Z.
  • Between A and Z (in [A-Z]) are ALL lowercase letters, except for a.

See:

     aAbBcC[...]xXyYzZ
     |              |
from a      to      z

     aAbBcC[...]xXyYzZ
      |              |
from  A     to       Z
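The same ordering can be observed with sort, which also follows the locale's collation rules. This is a rough sketch rather than a definitive test: the letters helper is a made-up name, and the order printed for the current locale depends entirely on how your system is configured, so only the C-locale behavior is described in the comments.

```shell
#!/bin/bash
# Hypothetical helper: emit four letters, one per line.
letters() { printf '%s\n' a A b B; }

# C locale: plain byte order, so all uppercase letters sort before all lowercase ones.
c_order=$(letters | LC_ALL=C sort | paste -sd ' ' -)

# Current locale: system dependent; many en_US setups interleave the two cases.
locale_order=$(letters | sort | paste -sd ' ' -)

echo "C locale:       $c_order"
echo "current locale: $locale_order"
```

If the second line interleaves upper and lower case, a range such as a-z spans letters of both cases in that locale.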

If you change the LC_COLLATE variable to C it looks as expected:

$ export LC_COLLATE=C
$ echo [a-z]
a b c x y z
$ echo [A-Z]
A B C X Y Z

So it's not a bug; it's a collation issue.
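Exporting the variable changes collation for the rest of the session. If that is not wanted, the change can be confined to a subshell. The sketch below makes a few assumptions: mktemp is available, the filesystem is case sensitive, and LC_ALL is used instead of LC_COLLATE because LC_ALL, when set, takes precedence over the individual LC_* variables. The names scratch, lower_c, and upper_c are illustrative.

```shell
#!/bin/bash
# Scratch directory with mixed-case single-letter filenames.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch a A b B c C

# Command substitution runs in a subshell, so the locale override
# applies to the glob inside it and is gone once the subshell exits.
lower_c=$(export LC_ALL=C; echo [a-z])
upper_c=$(export LC_ALL=C; echo [A-Z])

echo "lowercase matches: $lower_c"
echo "uppercase matches: $upper_c"

# Clean up the scratch directory.
cd / && rm -rf "$scratch"
```

The parent shell's locale settings are untouched afterwards, so later globs in the same script still use whatever collation was in effect before.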


Instead of range expressions you can use the POSIX character classes, such as upper or lower. They work regardless of the LC_COLLATE setting and also match accented characters (the listing that follows assumes files named à, è, and é exist as well):

$ echo [[:lower:]]
a b c x y z à è é
$ echo [[:upper:]]
A B C X Y Z
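The character classes apply to tr as well, which brings the discussion back to the original uppercase script. A minimal sketch, where to_upper is a hypothetical function name; the classes are quoted so the shell cannot treat the brackets as a glob:

```shell
#!/bin/bash
# Hypothetical helper: convert standard input to uppercase using
# POSIX character classes instead of a-z / A-Z ranges.
to_upper() {
    tr '[:lower:]' '[:upper:]'
}

# Usage: pipe any text through the function.
printf 'Hello, World\n' | to_upper
```

This avoids range expressions entirely, so the result for plain ASCII letters does not depend on the collation order.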