Linux software RAID 1 - root filesystem becomes read-only after a fault on one disk

This post records a problem encountered with a software RAID 1 setup on CentOS 5.2: when one disk develops a fault, the root filesystem becomes read-only. It describes the symptoms, the commands used to inspect the disk and array state, and the attempts made to resolve the issue.

 

Reposted from http://serverfault.com/questions/52067/linux-software-raid-1-root-filesystem-becomes-read-only-after-a-fault-on-one-d

 


Linux software RAID 1 locking to read-only mode

The setup:
Centos 5.2, 2x 320 GB sata drives in RAID 1.

  • /dev/md0 (/dev/sda1 + /dev/sdb1) is /boot
  • /dev/md1 (/dev/sda2 + /dev/sdb2) is an LVM physical volume which contains the /, /data and swap volumes

All filesystems other than swap are ext3
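For reference, a quick way to review this kind of layout on a running system (a sketch assuming the stock mdadm and LVM2 tools shipped with CentOS 5) is:

cat /proc/mdstat                 # member state of both arrays at a glance
mdadm --detail /dev/md0
mdadm --detail /dev/md1
lvm pvs                          # physical volumes (should list /dev/md1)
lvm lvs                          # logical volumes: root, data and swap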

We've had a problem on several systems where a fault on one drive has left the root filesystem locked read-only, which obviously causes problems.

[root@myserver /]# mount | grep Root
/dev/mapper/VolGroup00-LogVolRoot on / type ext3 (rw)
[root@myserver /]# touch /foo
touch: cannot touch `/foo': Read-only file system

I can see that one of the partitions in the array is faulted:

[root@myserver /]# mdadm --detail /dev/md1
/dev/md1:
[...]
          State : clean, degraded
 Active Devices : 1
Working Devices : 1
 Failed Devices : 1
  Spare Devices : 0
[...]
    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       18        1      active sync   /dev/sdb2
       2       8        2        -      faulty spare   /dev/sda2
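(As an aside, not a fix for the read-only root: once the array has failed a member like this, the usual sequence for swapping the disk is to remove the failed member, replace the drive, and add the new partition back. A hedged sketch with standard mdadm commands, using the /dev/sda2 member shown above:)

mdadm /dev/md1 --remove /dev/sda2    # drop the failed member from the array
# ... replace the physical disk and recreate the partition layout ...
mdadm /dev/md1 --add /dev/sda2       # re-add; the kernel resyncs from /dev/sdb2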

Remounting as rw fails:

[root@myserver /]# mount -n -o remount /
mount: block device /dev/VolGroup00/LogVolRoot is write-protected, mounting read-only

The LVM tools give an error unless --ignorelockingfailure is used (because they can't write to /var) but show the volume group as rw:

[root@myserver /]# lvm vgdisplay
Locking type 1 initialisation failed.
[root@myserver /]# lvm pvdisplay --ignorelockingfailure
  --- Physical volume ---
  PV Name               /dev/md1
  VG Name               VolGroup00
  PV Size               279.36 GB / not usable 15.56 MB
  Allocatable           yes (but full)
  [...]

[root@myserver /]# lvm vgdisplay --ignorelockingfailure
  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  4
  VG Access             read/write
  VG Status             resizable
  [...]

[root@myserver /]# lvm lvdisplay /dev/VolGroup00/LogVolRoot --ignorelockingfailure
  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVolRoot
  VG Name                VolGroup00
  LV UUID                PGoY0f-rXqj-xH4v-WMbw-jy6I-nE04-yZD3Gx
  LV Write Access        read/write
  [...]

In this case /boot (a separate RAID meta-device) and /data (a different logical volume in the same volume group) are still writable. From previous occurrences I know that a restart will bring the system back up with a read/write root filesystem and a properly degraded RAID array.

So, I have two questions:

1) When this occurs, how can I get the root filesystem back to read/write without a system restart?

2) What needs to be changed to stop this filesystem locking? With a RAID 1 failure on a single disk we don't want the filesystems to lock up; we want the system to keep running until we can replace the bad disk.
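One setting worth inspecting for question 2 (a hedged aside rather than a confirmed fix, since ext3 forces a read-only remount whenever its journal aborts, regardless of this setting) is the error behaviour recorded in the ext3 superblock and in /etc/fstab:

tune2fs -l /dev/VolGroup00/LogVolRoot | grep -i 'errors behavior'
tune2fs -e continue /dev/VolGroup00/LogVolRoot   # continue | remount-ro | panic
# or per-mount in /etc/fstab:
# /dev/VolGroup00/LogVolRoot  /  ext3  defaults,errors=continue  1 1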


Edit: I can see this in the dmesg output - does this indicate a failure of /dev/sda, then a separate failure on /dev/sdb that led to the filesystem being set to read-only?

sda: Current [descriptor]: sense key: Aborted Command
    Add. Sense: Recorded entity not found

Descriptor sense data with sense descriptors (in hex):
        72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        00 03 ce 85
end_request: I/O error, dev sda, sector 249477
raid1: Disk failure on sda2, disabling device.
        Operation continuing on 1 devices
ata1: EH complete
SCSI device sda: 586072368 512-byte hdwr sectors (300069 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
RAID1 conf printout:
 --- wd:1 rd:2
 disk 0, wo:1, o:0, dev:sda2
 disk 1, wo:0, o:1, dev:sdb2
RAID1 conf printout:
 --- wd:1 rd:2
 disk 1, wo:0, o:1, dev:sdb2
ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000001
ata2.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 51/04:00:34:cf:f3/00:00:00:f3:40/a3 Emask 0x1 (device error)
ata2.00: status: { DRDY ERR }
ata2.00: error: { ABRT }
ata2.00: configured for UDMA/133
ata2: EH complete



sdb: Current [descriptor]: sense key: Aborted Command
    Add. Sense: Recorded entity not found

Descriptor sense data with sense descriptors (in hex):
        72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        01 e3 5e 2d
end_request: I/O error, dev sdb, sector 31677997
Buffer I/O error on device dm-0, logical block 3933596
lost page write due to I/O error on dm-0
ata2: EH complete
SCSI device sdb: 586072368 512-byte hdwr sectors (300069 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
ata2.00: irq_stat 0x40000008
ata2.00: cmd 61/38:00:f5:d6:03/00:00:00:00:00/40 tag 0 ncq 28672 out
         res 41/10:00:f5:d6:03/00:00:00:00:00/40 Emask 0x481 (invalid argument) <F>
ata2.00: status: { DRDY ERR }
ata2.00: error: { IDNF }
ata2.00: configured for UDMA/133
sd 1:0:0:0: SCSI error: return code = 0x08000002
sdb: Current [descriptor]: sense key: Aborted Command
    Add. Sense: Recorded entity not found

Descriptor sense data with sense descriptors (in hex):
        72 0b 14 00 00 00 00 0c 00 0a 80 00 00 00 00 00
        00 03 d6 f5
end_request: I/O error, dev sdb, sector 251637
ata2: EH complete
SCSI device sdb: 586072368 512-byte hdwr sectors (300069 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
Aborting journal on device dm-0.
journal commit I/O error
ext3_abort called.
EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
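Since the log shows command aborts on both sda and sdb, it would also be worth pulling the drives' own SMART health data before trusting either disk (a sketch assuming smartmontools is installed):

smartctl -H -l error /dev/sda     # overall health verdict plus the drive's error log
smartctl -H -l error /dev/sdb
smartctl -t long /dev/sdb         # background long self-test on the surviving member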