one or more partitions are busy

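This article explains how to use Linux commands to forcefully unmount a busy disk partition, covering the fuser and umount commands and describing some useful options such as -l and -f.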

Original article: http://www.cyberciti.biz/tips/how-do-i-forcefully-unmount-a-disk-partition.html


Understanding device error busy error

Linux and UNIX will not let you unmount a device that is busy. There are several possible reasons for this (for example, a program is accessing the partition or has a file open on it), but the most important one is to prevent data loss.

Run the following command to find out which processes are using the device or partition. If your device name is /dev/sda1, enter the following command as the root user:
# lsof | grep '/dev/sda1'
Output:

vi 4453       vivek    3u      BLK        8,1                 8167 /dev/sda1

The output above shows that user vivek has a vi process running that is using /dev/sda1. All you have to do is stop the vi process and run umount again. As soon as that program exits, the device is no longer busy and you can unmount it with the following command:
# umount /dev/sda1
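
When lsof prints many lines, it can help to reduce them to just the process ID, user and command. The following is a minimal sketch, not part of the original article: holders_of is a hypothetical helper name, and it assumes the default lsof column layout shown above (COMMAND, PID, USER, ..., NAME), which may differ if lsof is run with options that add columns.

```shell
#!/bin/sh
# holders_of is a hypothetical helper, not an lsof feature.
# It reads lsof-style lines on stdin and prints "PID USER COMMAND"
# for every line whose last field (NAME) equals the given device or path.
# Assumes the default column order: COMMAND PID USER ... NAME.
holders_of() {
    awk -v target="$1" '$NF == target { print $2, $3, $1 }'
}

# Usage (as root):
#   lsof | holders_of /dev/sda1
```

For the sample output above this would print the PID 4453, the user vivek and the command vi on one line.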

The following discussion shows how to unmount a device or partition forcefully using Linux commands.

Linux fuser command to forcefully unmount a disk partition

Suppose you have /dev/sda1 mounted on the /mnt directory. You can then use the fuser command as follows:

WARNING! These examples may result in data loss if not executed properly (see "Understanding device error busy error" for more information).

Type the following command to kill all processes that are using /mnt, after which umount /mnt should succeed:
# fuser -km /mnt
Where,

  • -k : Kill processes accessing the file.
  • -m : The name specifies a file on a mounted file system or a mounted block device; all processes accessing files on that file system are selected. In the example above the name is /mnt.
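
Because fuser -k kills processes without asking, it can be worth previewing the sequence first. The sketch below is only an illustration under stated assumptions: free_mountpoint, run and the DRY_RUN variable are hypothetical names introduced here, and the function defaults to printing the commands instead of executing them. The -v option of fuser lists the matching processes verbosely.

```shell
#!/bin/sh
# run and free_mountpoint are hypothetical helpers; DRY_RUN is an assumed
# variable. With DRY_RUN=1 (the default) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

free_mountpoint() {
    mnt=$1
    run fuser -vm "$mnt"   # list the processes using the file system
    run fuser -km "$mnt"   # kill them (fuser sends SIGKILL by default)
    run umount "$mnt"      # unmount once nothing holds the file system
}

# Usage:
#   free_mountpoint /mnt             # preview only
#   DRY_RUN=0 free_mountpoint /mnt   # really run the commands (as root)
```

In a real run the killed processes may need a moment to exit, so the final umount can still report a busy device and may have to be repeated.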

Linux umount command to unmount a disk partition
You can also try the umount command with the -l option:
# umount -l /mnt
Where,

  • -l : Also known as lazy unmount. Detach the filesystem from the filesystem hierarchy now, and clean up all references to the filesystem as soon as it is no longer busy. This option requires kernel 2.4.11 or later.

If you would like to unmount an NFS mount point, try the following command:
# umount -f /mnt
Where,

  • -f: Force unmount in case of an unreachable NFS system

Caution: Using these commands or options can cause data loss for open files; programs that access files after the file system has been unmounted will get an error.



Sometimes fuser and lsof fail. One option that works for me is remounting read-only, and then doing a lazy unmount:

mount -o ro,remount /dev/sdb3
umount -l /dev/sdb3
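
Before the lazy unmount it may be useful to confirm that the read-only remount actually took effect. The following is a rough sketch under assumptions: is_readonly is a hypothetical helper, it expects a mounts table in the /proc/mounts format (device, mount point, type, comma-separated options), and it assumes the device appears there under the same name you pass in.

```shell
#!/bin/sh
# is_readonly is a hypothetical helper, not a standard command.
# It succeeds if the given device is listed with the "ro" option in a
# mounts table. The optional second argument names the table to read;
# it defaults to /proc/mounts.
is_readonly() {
    awk -v dev="$1" '
        $1 == dev {
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] == "ro") ro = 1
        }
        END { exit !ro }
    ' "${2:-/proc/mounts}"
}

# Usage (as root):
#   mount -o ro,remount /dev/sdb3 && is_readonly /dev/sdb3 && umount -l /dev/sdb3
```

The check matches the option exactly, so an entry such as errors=remount-ro is not mistaken for a read-only mount.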


protected-mode no port 6379 tcp-backlog 511 timeout 0 tcp-keepalive 300 daemonize no pidfile /var/run/redis_6379.pid loglevel notice logfile "" databases 16 always-show-logo no set-proc-title yes proc-title-template "{title} {listen-addr} {server-mode}" stop-writes-on-bgsave-error yes rdbcompression yes rdbchecksum yes dbfilename dump.rdb rdb-del-sync-files no dir ./ replica-serve-stale-data yes replica-read-only yes repl-diskless-sync no repl-diskless-sync-delay 5 repl-diskless-load disabled repl-disable-tcp-nodelay no replica-priority 100 acllog-max-len 128 requirepass Guyuan@2021 # New users are initialized with restrictive permissions by default, via the # equivalent of this ACL rule 'off resetkeys -@all'. Starting with Redis 6.2, it # is possible to manage access to Pub/Sub channels with ACL rules as well. The # default Pub/Sub channels permission if new users is controlled by the # acl-pubsub-default configuration directive, which accepts one of these values: # # allchannels: grants access to all Pub/Sub channels # resetchannels: revokes access to all Pub/Sub channels # # To ensure backward compatibility while upgrading Redis 6.0, acl-pubsub-default # defaults to the 'allchannels' permission. # # Future compatibility note: it is very likely that in a future version of Redis # the directive's default of 'allchannels' will be changed to 'resetchannels' in # order to provide better out-of-the-box Pub/Sub security. Therefore, it is # recommended that you explicitly define Pub/Sub permissions for all users # rather then rely on implicit default values. Once you've set explicit # Pub/Sub for all existing users, you should uncomment the following line. # # acl-pubsub-default resetchannels # Command renaming (DEPRECATED). # # ------------------------------------------------------------------------ # WARNING: avoid using this option if possible. 
Instead use ACLs to remove # commands from the default user, and put them only in some admin user you # create for administrative purposes. # ------------------------------------------------------------------------ # # It is possible to change the name of dangerous commands in a shared # environment. For instance the CONFIG command may be renamed into something # hard to guess so that it will still be available for internal-use tools # but not available for general clients. # # Example: # # rename-command CONFIG b840fc02d524045429941cc15f59e41cb7be6c52 # # It is also possible to completely kill a command by renaming it into # an empty string: # # rename-command CONFIG "" # # Please note that changing the name of commands that are logged into the # AOF file or transmitted to replicas may cause problems. ################################### CLIENTS #################################### # Set the max number of connected clients at the same time. By default # this limit is set to 10000 clients, however if the Redis server is not # able to configure the process file limit to allow for the specified limit # the max number of allowed clients is set to the current file limit # minus 32 (as Redis reserves a few file descriptors for internal uses). # # Once the limit is reached Redis will close all the new connections sending # an error 'max number of clients reached'. # # IMPORTANT: When Redis Cluster is used, the max number of connections is also # shared with the cluster bus: every node in the cluster will use two # connections, one incoming and another outgoing. It is important to size the # limit accordingly in case of very large clusters. # # maxclients 10000 ############################## MEMORY MANAGEMENT ################################ # Set a memory usage limit to the specified amount of bytes. # When the memory limit is reached Redis will try to remove keys # according to the eviction policy selected (see maxmemory-policy). 
# # If Redis can't remove keys according to the policy, or if the policy is # set to 'noeviction', Redis will start to reply with errors to commands # that would use more memory, like SET, LPUSH, and so on, and will continue # to reply to read-only commands like GET. # # This option is usually useful when using Redis as an LRU or LFU cache, or to # set a hard memory limit for an instance (using the 'noeviction' policy). # # WARNING: If you have replicas attached to an instance with maxmemory on, # the size of the output buffers needed to feed the replicas are subtracted # from the used memory count, so that network problems / resyncs will # not trigger a loop where keys are evicted, and in turn the output # buffer of replicas is full with DELs of keys evicted triggering the deletion # of more keys, and so forth until the database is completely emptied. # # In short... if you have replicas attached it is suggested that you set a lower # limit for maxmemory so that there is some free RAM on the system for replica # output buffers (but this is not needed if the policy is 'noeviction'). # # maxmemory <bytes> # MAXMEMORY POLICY: how Redis will select what to remove when maxmemory # is reached. You can select one from the following behaviors: # # volatile-lru -> Evict using approximated LRU, only keys with an expire set. # allkeys-lru -> Evict any key using approximated LRU. # volatile-lfu -> Evict using approximated LFU, only keys with an expire set. # allkeys-lfu -> Evict any key using approximated LFU. # volatile-random -> Remove a random key having an expire set. # allkeys-random -> Remove a random key, any key. # volatile-ttl -> Remove the key with the nearest expire time (minor TTL) # noeviction -> Don't evict anything, just return an error on write operations. # # LRU means Least Recently Used # LFU means Least Frequently Used # # Both LRU, LFU and volatile-ttl are implemented using approximated # randomized algorithms. 
# # Note: with any of the above policies, when there are no suitable keys for # eviction, Redis will return an error on write operations that require # more memory. These are usually commands that create new keys, add data or # modify existing keys. A few examples are: SET, INCR, HSET, LPUSH, SUNIONSTORE, # SORT (due to the STORE argument), and EXEC (if the transaction includes any # command that requires memory). # # The default is: # # maxmemory-policy noeviction # LRU, LFU and minimal TTL algorithms are not precise algorithms but approximated # algorithms (in order to save memory), so you can tune it for speed or # accuracy. By default Redis will check five keys and pick the one that was # used least recently, you can change the sample size using the following # configuration directive. # # The default of 5 produces good enough results. 10 Approximates very closely # true LRU but costs more CPU. 3 is faster but not very accurate. # # maxmemory-samples 5 # Eviction processing is designed to function well with the default setting. # If there is an unusually large amount of write traffic, this value may need to # be increased. Decreasing this value may reduce latency at the risk of # eviction processing effectiveness # 0 = minimum latency, 10 = default, 100 = process without regard to latency # # maxmemory-eviction-tenacity 10 # Starting from Redis 5, by default a replica will ignore its maxmemory setting # (unless it is promoted to master after a failover or manually). It means # that the eviction of keys will be just handled by the master, sending the # DEL commands to the replica as keys evict in the master side. 
# # This behavior ensures that masters and replicas stay consistent, and is usually # what you want, however if your replica is writable, or you want the replica # to have a different memory setting, and you are sure all the writes performed # to the replica are idempotent, then you may change this default (but be sure # to understand what you are doing). # # Note that since the replica by default does not evict, it may end using more # memory than the one set via maxmemory (there are certain buffers that may # be larger on the replica, or data structures may sometimes take more memory # and so forth). So make sure you monitor your replicas and make sure they # have enough memory to never hit a real out-of-memory condition before the # master hits the configured maxmemory setting. # # replica-ignore-maxmemory yes # Redis reclaims expired keys in two ways: upon access when those keys are # found to be expired, and also in background, in what is called the # "active expire key". The key space is slowly and interactively scanned # looking for expired keys to reclaim, so that it is possible to free memory # of keys that are expired and will never be accessed again in a short time. # # The default effort of the expire cycle will try to avoid having more than # ten percent of expired keys still in memory, and will try to avoid consuming # more than 25% of total memory and to add latency to the system. However # it is possible to increase the expire "effort" that is normally set to # "1", to a greater value, up to the value "10". At its maximum value the # system will use more CPU, longer cycles (and technically may introduce # more latency), and will tolerate less already expired keys still present # in the system. It's a tradeoff between memory, CPU and latency. # # active-expire-effort 1 ############################# LAZY FREEING #################################### # Redis has two primitives to delete keys. One is called DEL and is a blocking # deletion of the object. 
It means that the server stops processing new commands # in order to reclaim all the memory associated with an object in a synchronous # way. If the key deleted is associated with a small object, the time needed # in order to execute the DEL command is very small and comparable to most other # O(1) or O(log_N) commands in Redis. However if the key is associated with an # aggregated value containing millions of elements, the server can block for # a long time (even seconds) in order to complete the operation. # # For the above reasons Redis also offers non blocking deletion primitives # such as UNLINK (non blocking DEL) and the ASYNC option of FLUSHALL and # FLUSHDB commands, in order to reclaim memory in background. Those commands # are executed in constant time. Another thread will incrementally free the # object in the background as fast as possible. # # DEL, UNLINK and ASYNC option of FLUSHALL and FLUSHDB are user-controlled. # It's up to the design of the application to understand when it is a good # idea to use one or the other. However the Redis server sometimes has to # delete keys or flush the whole database as a side effect of other operations. # Specifically Redis deletes objects independently of a user call in the # following scenarios: # # 1) On eviction, because of the maxmemory and maxmemory policy configurations, # in order to make room for new data, without going over the specified # memory limit. # 2) Because of expire: when a key with an associated time to live (see the # EXPIRE command) must be deleted from memory. # 3) Because of a side effect of a command that stores data on a key that may # already exist. For example the RENAME command may delete the old key # content when it is replaced with another one. Similarly SUNIONSTORE # or SORT with STORE option may delete existing keys. The SET command # itself removes any old content of the specified key in order to replace # it with the specified string. 
# 4) During replication, when a replica performs a full resynchronization with # its master, the content of the whole database is removed in order to # load the RDB file just transferred. # # In all the above cases the default is to delete objects in a blocking way, # like if DEL was called. However you can configure each case specifically # in order to instead release memory in a non-blocking way like if UNLINK # was called, using the following configuration directives. lazyfree-lazy-eviction no lazyfree-lazy-expire no lazyfree-lazy-server-del no replica-lazy-flush no # It is also possible, for the case when to replace the user code DEL calls # with UNLINK calls is not easy, to modify the default behavior of the DEL # command to act exactly like UNLINK, using the following configuration # directive: lazyfree-lazy-user-del no # FLUSHDB, FLUSHALL, and SCRIPT FLUSH support both asynchronous and synchronous # deletion, which can be controlled by passing the [SYNC|ASYNC] flags into the # commands. When neither flag is passed, this directive will be used to determine # if the data should be deleted asynchronously. lazyfree-lazy-user-flush no ################################ THREADED I/O ################################# # Redis is mostly single threaded, however there are certain threaded # operations such as UNLINK, slow I/O accesses and other things that are # performed on side threads. # # Now it is also possible to handle Redis clients socket reads and writes # in different I/O threads. Since especially writing is so slow, normally # Redis users use pipelining in order to speed up the Redis performances per # core, and spawn multiple instances in order to scale more. Using I/O # threads it is possible to easily speedup two times Redis without resorting # to pipelining nor sharding of the instance. # # By default threading is disabled, we suggest enabling it only in machines # that have at least 4 or more cores, leaving at least one spare core. 
# Using more than 8 threads is unlikely to help much. We also recommend using # threaded I/O only if you actually have performance problems, with Redis # instances being able to use a quite big percentage of CPU time, otherwise # there is no point in using this feature. # # So for instance if you have a four cores boxes, try to use 2 or 3 I/O # threads, if you have a 8 cores, try to use 6 threads. In order to # enable I/O threads use the following configuration directive: # # io-threads 4 # # Setting io-threads to 1 will just use the main thread as usual. # When I/O threads are enabled, we only use threads for writes, that is # to thread the write(2) syscall and transfer the client buffers to the # socket. However it is also possible to enable threading of reads and # protocol parsing using the following configuration directive, by setting # it to yes: # # io-threads-do-reads no # # Usually threading reads doesn't help much. # # NOTE 1: This configuration directive cannot be changed at runtime via # CONFIG SET. Aso this feature currently does not work when SSL is # enabled. # # NOTE 2: If you want to test the Redis speedup using redis-benchmark, make # sure you also run the benchmark itself in threaded mode, using the # --threads option to match the number of Redis threads, otherwise you'll not # be able to notice the improvements. ############################ KERNEL OOM CONTROL ############################## # On Linux, it is possible to hint the kernel OOM killer on what processes # should be killed first when out of memory. # # Enabling this feature makes Redis actively control the oom_score_adj value # for all its processes, depending on their role. The default scores will # attempt to have background child processes killed before all others, and # replicas killed before masters. # # Redis supports three options: # # no: Don't make changes to oom-score-adj (default). # yes: Alias to "relative" see below. 
# absolute: Values in oom-score-adj-values are written as is to the kernel. # relative: Values are used relative to the initial value of oom_score_adj when # the server starts and are then clamped to a range of -1000 to 1000. # Because typically the initial value is 0, they will often match the # absolute values. oom-score-adj no # When oom-score-adj is used, this directive controls the specific values used # for master, replica and background child processes. Values range -2000 to # 2000 (higher means more likely to be killed). # # Unprivileged processes (not root, and without CAP_SYS_RESOURCE capabilities) # can freely increase their value, but not decrease it below its initial # settings. This means that setting oom-score-adj to "relative" and setting the # oom-score-adj-values to positive values will always succeed. oom-score-adj-values 0 200 800 #################### KERNEL transparent hugepage CONTROL ###################### # Usually the kernel Transparent Huge Pages control is set to "madvise" or # or "never" by default (/sys/kernel/mm/transparent_hugepage/enabled), in which # case this config has no effect. On systems in which it is set to "always", # redis will attempt to disable it specifically for the redis process in order # to avoid latency problems specifically with fork(2) and CoW. # If for some reason you prefer to keep it enabled, you can set this config to # "no" and the kernel global to "always". disable-thp yes ############################## APPEND ONLY MODE ############################### # By default Redis asynchronously dumps the dataset on disk. This mode is # good enough in many applications, but an issue with the Redis process or # a power outage may result into a few minutes of writes lost (depending on # the configured save points). # # The Append Only File is an alternative persistence mode that provides # much better durability. 
For instance using the default data fsync policy # (see later in the config file) Redis can lose just one second of writes in a # dramatic event like a server power outage, or a single write if something # wrong with the Redis process itself happens, but the operating system is # still running correctly. # # AOF and RDB persistence can be enabled at the same time without problems. # If the AOF is enabled on startup Redis will load the AOF, that is the file # with the better durability guarantees. # # Please check https://redis.io/topics/persistence for more information. appendonly yes # The name of the append only file (default: "appendonly.aof") appendfilename "appendonly.aof" # The fsync() call tells the Operating System to actually write data on disk # instead of waiting for more data in the output buffer. Some OS will really flush # data on disk, some other OS will just try to do it ASAP. # # Redis supports three different modes: # # no: don't fsync, just let the OS flush the data when it wants. Faster. # always: fsync after every write to the append only log. Slow, Safest. # everysec: fsync only one time every second. Compromise. # # The default is "everysec", as that's usually the right compromise between # speed and data safety. It's up to you to understand if you can relax this to # "no" that will let the operating system flush the output buffer when # it wants, for better performances (but if you can live with the idea of # some data loss consider the default persistence mode that's snapshotting), # or on the contrary, use "always" that's very slow but a bit safer than # everysec. # # More details please check the following article: # http://antirez.com/post/redis-persistence-demystified.html # # If unsure, use "everysec". 
# appendfsync always appendfsync everysec # appendfsync no # When the AOF fsync policy is set to always or everysec, and a background # saving process (a background save or AOF log background rewriting) is # performing a lot of I/O against the disk, in some Linux configurations # Redis may block too long on the fsync() call. Note that there is no fix for # this currently, as even performing fsync in a different thread will block # our synchronous write(2) call. # # In order to mitigate this problem it's possible to use the following option # that will prevent fsync() from being called in the main process while a # BGSAVE or BGREWRITEAOF is in progress. # # This means that while another child is saving, the durability of Redis is # the same as "appendfsync none". In practical terms, this means that it is # possible to lose up to 30 seconds of log in the worst scenario (with the # default Linux settings). # # If you have latency problems turn this to "yes". Otherwise leave it as # "no" that is the safest pick from the point of view of durability. no-appendfsync-on-rewrite no # Automatic rewrite of the append only file. # Redis is able to automatically rewrite the log file implicitly calling # BGREWRITEAOF when the AOF log size grows by the specified percentage. # # This is how it works: Redis remembers the size of the AOF file after the # latest rewrite (if no rewrite has happened since the restart, the size of # the AOF at startup is used). # # This base size is compared to the current size. If the current size is # bigger than the specified percentage, the rewrite is triggered. Also # you need to specify a minimal size for the AOF file to be rewritten, this # is useful to avoid rewriting the AOF file even if the percentage increase # is reached but it is still pretty small. # # Specify a percentage of zero in order to disable the automatic AOF # rewrite feature. 
auto-aof-rewrite-percentage 100 auto-aof-rewrite-min-size 64mb # An AOF file may be found to be truncated at the end during the Redis # startup process, when the AOF data gets loaded back into memory. # This may happen when the system where Redis is running # crashes, especially when an ext4 filesystem is mounted without the # data=ordered option (however this can't happen when Redis itself # crashes or aborts but the operating system still works correctly). # # Redis can either exit with an error when this happens, or load as much # data as possible (the default now) and start if the AOF file is found # to be truncated at the end. The following option controls this behavior. # # If aof-load-truncated is set to yes, a truncated AOF file is loaded and # the Redis server starts emitting a log to inform the user of the event. # Otherwise if the option is set to no, the server aborts with an error # and refuses to start. When the option is set to no, the user requires # to fix the AOF file using the "redis-check-aof" utility before to restart # the server. # # Note that if the AOF file will be found to be corrupted in the middle # the server will still exit with an error. This option only applies when # Redis will try to read more data from the AOF file but not enough bytes # will be found. aof-load-truncated yes # When rewriting the AOF file, Redis is able to use an RDB preamble in the # AOF file for faster rewrites and recoveries. When this option is turned # on the rewritten AOF file is composed of two different stanzas: # # [RDB file][AOF tail] # # When loading, Redis recognizes that the AOF file starts with the "REDIS" # string and loads the prefixed RDB file, then continues loading the AOF # tail. aof-use-rdb-preamble yes ################################ LUA SCRIPTING ############################### # Max execution time of a Lua script in milliseconds. 
# # If the maximum execution time is reached Redis will log that a script is # still in execution after the maximum allowed time and will start to # reply to queries with an error. # # When a long running script exceeds the maximum execution time only the # SCRIPT KILL and SHUTDOWN NOSAVE commands are available. The first can be # used to stop a script that did not yet call any write commands. The second # is the only way to shut down the server in the case a write command was # already issued by the script but the user doesn't want to wait for the natural # termination of the script. # # Set it to 0 or a negative value for unlimited execution without warnings. lua-time-limit 5000 ################################ REDIS CLUSTER ############################### # Normal Redis instances can't be part of a Redis Cluster; only nodes that are # started as cluster nodes can. In order to start a Redis instance as a # cluster node enable the cluster support uncommenting the following: # # cluster-enabled yes # Every cluster node has a cluster configuration file. This file is not # intended to be edited by hand. It is created and updated by Redis nodes. # Every Redis Cluster node requires a different cluster configuration file. # Make sure that instances running in the same system do not have # overlapping cluster configuration file names. # # cluster-config-file nodes-6379.conf # Cluster node timeout is the amount of milliseconds a node must be unreachable # for it to be considered in failure state. # Most other internal time limits are a multiple of the node timeout. # # cluster-node-timeout 15000 # A replica of a failing master will avoid to start a failover if its data # looks too old. 
# # There is no simple way for a replica to actually have an exact measure of # its "data age", so the following two checks are performed: # # 1) If there are multiple replicas able to failover, they exchange messages # in order to try to give an advantage to the replica with the best # replication offset (more data from the master processed). # Replicas will try to get their rank by offset, and apply to the start # of the failover a delay proportional to their rank. # # 2) Every single replica computes the time of the last interaction with # its master. This can be the last ping or command received (if the master # is still in the "connected" state), or the time that elapsed since the # disconnection with the master (if the replication link is currently down). # If the last interaction is too old, the replica will not try to failover # at all. # # The point "2" can be tuned by user. Specifically a replica will not perform # the failover if, since the last interaction with the master, the time # elapsed is greater than: # # (node-timeout * cluster-replica-validity-factor) + repl-ping-replica-period # # So for example if node-timeout is 30 seconds, and the cluster-replica-validity-factor # is 10, and assuming a default repl-ping-replica-period of 10 seconds, the # replica will not try to failover if it was not able to talk with the master # for longer than 310 seconds. # # A large cluster-replica-validity-factor may allow replicas with too old data to failover # a master, while a too small value may prevent the cluster from being able to # elect a replica at all. # # For maximum availability, it is possible to set the cluster-replica-validity-factor # to a value of 0, which means, that replicas will always try to failover the # master regardless of the last time they interacted with the master. # (However they'll always try to apply a delay proportional to their # offset rank). 
# # Zero is the only value able to guarantee that when all the partitions heal # the cluster will always be able to continue. # # cluster-replica-validity-factor 10 # Cluster replicas are able to migrate to orphaned masters, that are masters # that are left without working replicas. This improves the cluster ability # to resist to failures as otherwise an orphaned master can't be failed over # in case of failure if it has no working replicas. # # Replicas migrate to orphaned masters only if there are still at least a # given number of other working replicas for their old master. This number # is the "migration barrier". A migration barrier of 1 means that a replica # will migrate only if there is at least 1 other working replica for its master # and so forth. It usually reflects the number of replicas you want for every # master in your cluster. # # Default is 1 (replicas migrate only if their masters remain with at least # one replica). To disable migration just set it to a very large value or # set cluster-allow-replica-migration to 'no'. # A value of 0 can be set but is useful only for debugging and dangerous # in production. # # cluster-migration-barrier 1 # Turning off this option allows to use less automatic cluster configuration. # It both disables migration to orphaned masters and migration from masters # that became empty. # # Default is 'yes' (allow automatic migrations). # # cluster-allow-replica-migration yes # By default Redis Cluster nodes stop accepting queries if they detect there # is at least a hash slot uncovered (no available node is serving it). # This way if the cluster is partially down (for example a range of hash slots # are no longer covered) all the cluster becomes, eventually, unavailable. # It automatically returns available as soon as all the slots are covered again. # # However sometimes you want the subset of the cluster which is working, # to continue to accept queries for the part of the key space that is still # covered. 
In order to do so, just set the cluster-require-full-coverage # option to no. # # cluster-require-full-coverage yes # This option, when set to yes, prevents replicas from trying to failover its # master during master failures. However the replica can still perform a # manual failover, if forced to do so. # # This is useful in different scenarios, especially in the case of multiple # data center operations, where we want one side to never be promoted if not # in the case of a total DC failure. # # cluster-replica-no-failover no # This option, when set to yes, allows nodes to serve read traffic while the # the cluster is in a down state, as long as it believes it owns the slots. # # This is useful for two cases. The first case is for when an application # doesn't require consistency of data during node failures or network partitions. # One example of this is a cache, where as long as the node has the data it # should be able to serve it. # # The second use case is for configurations that don't meet the recommended # three shards but want to enable cluster mode and scale later. A # master outage in a 1 or 2 shard configuration causes a read/write outage to the # entire cluster without this option set, with it set there is only a write outage. # Without a quorum of masters, slot ownership will not change automatically. # # cluster-allow-reads-when-down no # In order to setup your cluster make sure to read the documentation # available at https://redis.io web site. ########################## CLUSTER DOCKER/NAT support ######################## # In certain deployments, Redis Cluster nodes address discovery fails, because # addresses are NAT-ted or because ports are forwarded (the typical case is # Docker and other containers). # # In order to make Redis Cluster working in such environments, a static # configuration where each node knows its public address is needed. 
# The following four options are used for this scope, and are:
#
# * cluster-announce-ip
# * cluster-announce-port
# * cluster-announce-tls-port
# * cluster-announce-bus-port
#
# Each instructs the node about its address, client ports (for connections
# without and with TLS) and cluster message bus port. The information is then
# published in the header of the bus packets so that other nodes will be able to
# correctly map the address of the node publishing the information.
#
# If cluster-tls is set to yes and cluster-announce-tls-port is omitted or set
# to zero, then cluster-announce-port refers to the TLS port. Note also that
# cluster-announce-tls-port has no effect if cluster-tls is set to no.
#
# If the above options are not used, the normal Redis Cluster auto-detection
# will be used instead.
#
# Note that when remapped, the bus port may not be at the fixed offset of
# clients port + 10000, so you can specify any port and bus-port depending
# on how they get remapped. If the bus-port is not set, a fixed offset of
# 10000 will be used as usual.
#
# Example:
#
# cluster-announce-ip 10.1.1.5
# cluster-announce-tls-port 6379
# cluster-announce-port 0
# cluster-announce-bus-port 6380

################################## SLOW LOG ###################################

# The Redis Slow Log is a system to log queries that exceeded a specified
# execution time. The execution time does not include the I/O operations
# like talking with the client, sending the reply and so forth,
# but just the time needed to actually execute the command (this is the only
# stage of command execution where the thread is blocked and can not serve
# other requests in the meantime).
#
# You can configure the slow log with two parameters: one tells Redis
# what is the execution time, in microseconds, to exceed in order for the
# command to get logged, and the other parameter is the length of the
# slow log.
# When a new command is logged the oldest one is removed from the
# queue of logged commands.

# The following time is expressed in microseconds, so 1000000 is equivalent
# to one second. Note that a negative number disables the slow log, while
# a value of zero forces the logging of every command.
slowlog-log-slower-than 10000

# There is no limit to this length. Just be aware that it will consume memory.
# You can reclaim memory used by the slow log with SLOWLOG RESET.
slowlog-max-len 128

################################ LATENCY MONITOR ##############################

# The Redis latency monitoring subsystem samples different operations
# at runtime in order to collect data related to possible sources of
# latency of a Redis instance.
#
# Via the LATENCY command this information is available to the user that can
# print graphs and obtain reports.
#
# The system only logs operations that were performed in a time equal or
# greater than the amount of milliseconds specified via the
# latency-monitor-threshold configuration directive. When its value is set
# to zero, the latency monitor is turned off.
#
# By default latency monitoring is disabled since it is mostly not needed
# if you don't have latency issues, and collecting data has a performance
# impact, that while very small, can be measured under big load. Latency
# monitoring can easily be enabled at runtime using the command
# "CONFIG SET latency-monitor-threshold <milliseconds>" if needed.
latency-monitor-threshold 0

############################# EVENT NOTIFICATION ##############################

# Redis can notify Pub/Sub clients about events happening in the key space.
# This feature is documented at https://redis.io/topics/notifications
#
# For instance if keyspace events notification is enabled, and a client
# performs a DEL operation on key "foo" stored in the Database 0, two
# messages will be published via Pub/Sub:
#
# PUBLISH __keyspace@0__:foo del
# PUBLISH __keyevent@0__:del foo
#
# It is possible to select the events that Redis will notify among a set
# of classes. Every class is identified by a single character:
#
#  K     Keyspace events, published with __keyspace@<db>__ prefix.
#  E     Keyevent events, published with __keyevent@<db>__ prefix.
#  g     Generic commands (non-type specific) like DEL, EXPIRE, RENAME, ...
#  $     String commands
#  l     List commands
#  s     Set commands
#  h     Hash commands
#  z     Sorted set commands
#  x     Expired events (events generated every time a key expires)
#  e     Evicted events (events generated when a key is evicted for maxmemory)
#  t     Stream commands
#  d     Module key type events
#  m     Key-miss events (Note: It is not included in the 'A' class)
#  A     Alias for g$lshzxetd, so that the "AKE" string means all the events
#        (Except key-miss events which are excluded from 'A' due to their
#        unique nature).
#
# The "notify-keyspace-events" takes as argument a string that is composed
# of zero or multiple characters. The empty string means that notifications
# are disabled.
#
# Example: to enable list and generic events, from the point of view of the
# event name, use:
#
# notify-keyspace-events Elg
#
# Example 2: to get the stream of the expired keys subscribing to channel
# name __keyevent@0__:expired use:
#
# notify-keyspace-events Ex
#
# By default all notifications are disabled because most users don't need
# this feature and the feature has some overhead. Note that if you don't
# specify at least one of K or E, no events will be delivered.
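The class table above can be read as a small lookup. The following is a minimal Python sketch, not Redis's own parser: `describe_flags`, `EVENT_CLASSES` and `ALIAS_A` are hypothetical names introduced only for illustration. It splits a notify-keyspace-events string into its channel prefixes and event classes and expands the 'A' alias as documented.

```python
# Illustrative sketch only: mirrors the class table in the comments above.
# describe_flags, EVENT_CLASSES and ALIAS_A are hypothetical names, not Redis APIs.
EVENT_CLASSES = {
    "g": "generic commands",
    "$": "string commands",
    "l": "list commands",
    "s": "set commands",
    "h": "hash commands",
    "z": "sorted set commands",
    "x": "expired events",
    "e": "evicted events",
    "t": "stream commands",
    "d": "module key type events",
    "m": "key-miss events",
}
ALIAS_A = "g$lshzxetd"  # 'A' expands to these classes; 'm' is not part of it


def describe_flags(flags):
    """Split a notify-keyspace-events string into prefixes and event classes."""
    prefixes = set()
    classes = set()
    for ch in flags:
        if ch == "K":
            prefixes.add("__keyspace@<db>__")
        elif ch == "E":
            prefixes.add("__keyevent@<db>__")
        elif ch == "A":
            classes.update(ALIAS_A)
        elif ch in EVENT_CLASSES:
            classes.add(ch)
        else:
            raise ValueError(f"unknown event class: {ch!r}")
    # Per the note above, nothing is delivered unless K or E is present.
    return {"prefixes": prefixes, "classes": classes, "has_prefix": bool(prefixes)}


if __name__ == "__main__":
    print(describe_flags("Elg"))
    print(describe_flags("Ex"))
```

For the two examples in the comments, "Elg" yields the keyevent prefix with the list and generic classes, and "Ex" yields the keyevent prefix with the expired class.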
notify-keyspace-events ""

############################### GOPHER SERVER #################################

# Redis contains an implementation of the Gopher protocol, as specified in
# the RFC 1436 (https://www.ietf.org/rfc/rfc1436.txt).
#
# The Gopher protocol was very popular in the late '90s. It is an alternative
# to the web, and the implementation both server and client side is so simple
# that the Redis server has just 100 lines of code in order to implement this
# support.
#
# What do you do with Gopher nowadays? Well Gopher never *really* died, and
# lately there is a movement in order for the Gopher more hierarchical content
# composed of just plain text documents to be resurrected. Some want a simpler
# internet, others believe that the mainstream internet became too much
# controlled, and it's cool to create an alternative space for people that
# want a bit of fresh air.
#
# Anyway for the 10th birthday of Redis, we gave it the Gopher protocol
# as a gift.
#
# --- HOW IT WORKS? ---
#
# The Redis Gopher support uses the inline protocol of Redis, and specifically
# two kinds of inline requests that were anyway illegal: an empty request
# or any request that starts with "/" (there are no Redis commands starting
# with such a slash). Normal RESP2/RESP3 requests are completely out of the
# path of the Gopher protocol implementation and are served as usual as well.
#
# If you open a connection to Redis when Gopher is enabled and send it
# a string like "/foo", if there is a key named "/foo" it is served via the
# Gopher protocol.
#
# In order to create a real Gopher "hole" (the name of a Gopher site in Gopher
# talking), you likely need a script like the following:
#
# https://github.com/antirez/gopher2redis
#
# --- SECURITY WARNING ---
#
# If you plan to put Redis on the internet in a publicly accessible address
# to serve Gopher pages MAKE SURE TO SET A PASSWORD to the instance.
# Once a password is set:
#
# 1.
#    The Gopher server (when enabled, not by default) will still serve
#    content via Gopher.
# 2. However other commands cannot be called before the client will
#    authenticate.
#
# So use the 'requirepass' option to protect your instance.
#
# Note that Gopher is not currently supported when 'io-threads-do-reads'
# is enabled.
#
# To enable Gopher support, uncomment the following line and set the option
# from no (the default) to yes.
#
# gopher-enabled no

############################### ADVANCED CONFIG ###############################

# Hashes are encoded using a memory efficient data structure when they have a
# small number of entries, and the biggest entry does not exceed a given
# threshold. These thresholds can be configured using the following directives.
hash-max-ziplist-entries 512
hash-max-ziplist-value 64

# Lists are also encoded in a special way to save a lot of space.
# The number of entries allowed per internal list node can be specified
# as a fixed maximum size or a maximum number of elements.
# For a fixed maximum size, use -5 through -1, meaning:
# -5: max size: 64 Kb  <-- not recommended for normal workloads
# -4: max size: 32 Kb  <-- not recommended
# -3: max size: 16 Kb  <-- probably not recommended
# -2: max size: 8 Kb   <-- good
# -1: max size: 4 Kb   <-- good
# Positive numbers mean store up to _exactly_ that number of elements
# per list node.
# The highest performing option is usually -2 (8 Kb size) or -1 (4 Kb size),
# but if your use case is unique, adjust the settings as necessary.
list-max-ziplist-size -2

# Lists may also be compressed.
# Compress depth is the number of quicklist ziplist nodes from *each* side of
# the list to *exclude* from compression. The head and tail of the list
# are always uncompressed for fast push/pop operations.
# Settings are:
# 0: disable all list compression
# 1: depth 1 means "don't start compressing until after 1 node into the list,
#    going from either the head or tail"
#    So: [head]->node->node->...->node->[tail]
#    [head], [tail] will always be uncompressed; inner nodes will compress.
# 2: [head]->[next]->node->node->...->node->[prev]->[tail]
#    2 here means: don't compress head or head->next or tail->prev or tail,
#    but compress all nodes between them.
# 3: [head]->[next]->[next]->node->node->...->node->[prev]->[prev]->[tail]
# etc.
list-compress-depth 0

# Sets have a special encoding in just one case: when a set is composed
# of just strings that happen to be integers in radix 10 in the range
# of 64 bit signed integers.
# The following configuration setting sets the limit in the size of the
# set in order to use this special memory saving encoding.
set-max-intset-entries 512

# Similarly to hashes and lists, sorted sets are also specially encoded in
# order to save a lot of space. This encoding is only used when the length and
# elements of a sorted set are below the following limits:
zset-max-ziplist-entries 128
zset-max-ziplist-value 64

# HyperLogLog sparse representation bytes limit. The limit includes the
# 16 bytes header. When an HyperLogLog using the sparse representation crosses
# this limit, it is converted into the dense representation.
#
# A value greater than 16000 is totally useless, since at that point the
# dense representation is more memory efficient.
#
# The suggested value is ~ 3000 in order to have the benefits of
# the space efficient encoding without slowing down too much PFADD,
# which is O(N) with the sparse encoding. The value can be raised to
# ~ 10000 when CPU is not a concern, but space is, and the data set is
# composed of many HyperLogLogs with cardinality in the 0 - 15000 range.
hll-sparse-max-bytes 3000

# Streams macro node max size / items. The stream data structure is a radix
# tree of big nodes that encode multiple items inside.
# Using this configuration it is possible to configure how big a single node
# can be in bytes, and the maximum number of items it may contain before
# switching to a new node when appending new stream entries. If any of the
# following settings are set to zero, the limit is ignored, so for instance
# it is possible to set just a max entries limit by setting max-bytes to 0
# and max-entries to the desired value.
stream-node-max-bytes 4096
stream-node-max-entries 100

# Active rehashing uses 1 millisecond every 100 milliseconds of CPU time in
# order to help rehashing the main Redis hash table (the one mapping top-level
# keys to values). The hash table implementation Redis uses (see dict.c)
# performs a lazy rehashing: the more operations you run into a hash table
# that is rehashing, the more rehashing "steps" are performed, so if the
# server is idle the rehashing is never complete and some more memory is used
# by the hash table.
#
# The default is to use this millisecond 10 times every second in order to
# actively rehash the main dictionaries, freeing memory when possible.
#
# If unsure:
# use "activerehashing no" if you have hard latency requirements and it is
# not a good thing in your environment that Redis can reply from time to time
# to queries with 2 milliseconds delay.
#
# use "activerehashing yes" if you don't have such hard requirements but
# want to free memory asap when possible.
activerehashing yes

# The client output buffer limits can be used to force disconnection of clients
# that are not reading data from the server fast enough for some reason (a
# common reason is that a Pub/Sub client can't consume messages as fast as the
# publisher can produce them).
#
# The limit can be set differently for the three different classes of clients:
#
# normal -> normal clients including MONITOR clients
# replica -> replica clients
# pubsub -> clients subscribed to at least one pubsub channel or pattern
#
# The syntax of every client-output-buffer-limit directive is the following:
#
# client-output-buffer-limit <class> <hard limit> <soft limit> <soft seconds>
#
# A client is immediately disconnected once the hard limit is reached, or if
# the soft limit is reached and remains reached for the specified number of
# seconds (continuously).
# So for instance if the hard limit is 32 megabytes and the soft limit is
# 16 megabytes / 10 seconds, the client will get disconnected immediately
# if the size of the output buffers reach 32 megabytes, but will also get
# disconnected if the client reaches 16 megabytes and continuously overcomes
# the limit for 10 seconds.
#
# By default normal clients are not limited because they don't receive data
# without asking (in a push way), but just after a request, so only
# asynchronous clients may create a scenario where data is requested faster
# than it can read.
#
# Instead there is a default limit for pubsub and replica clients, since
# subscribers and replicas receive data in a push fashion.
#
# Both the hard or the soft limit can be disabled by setting them to zero.
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit replica 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60

# Client query buffers accumulate new commands. They are limited to a fixed
# amount by default in order to avoid that a protocol desynchronization (for
# instance due to a bug in the client) will lead to unbound memory usage in
# the query buffer. However you can configure it here if you have very special
# needs, such as huge multi/exec requests or alike.
#
# client-query-buffer-limit 1gb

# In the Redis protocol, bulk requests, that is, elements representing single
# strings, are normally limited to 512 mb. However you can change this limit
# here, but it must be 1mb or greater
#
# proto-max-bulk-len 512mb

# Redis calls an internal function to perform many background tasks, like
# closing connections of clients in timeout, purging expired keys that are
# never requested, and so forth.
#
# Not all tasks are performed with the same frequency, but Redis checks for
# tasks to perform according to the specified "hz" value.
#
# By default "hz" is set to 10. Raising the value will use more CPU when
# Redis is idle, but at the same time will make Redis more responsive when
# there are many keys expiring at the same time, and timeouts may be
# handled with more precision.
#
# The range is between 1 and 500, however a value over 100 is usually not
# a good idea. Most users should use the default of 10 and raise this up to
# 100 only in environments where very low latency is required.
hz 10

# Normally it is useful to have an HZ value which is proportional to the
# number of clients connected. This is useful in order, for instance, to
# avoid processing too many clients for each background task invocation
# and so avoid latency spikes.
#
# Since the default HZ value is conservatively set to 10, Redis
# offers, and enables by default, the ability to use an adaptive HZ value
# which will temporarily raise when there are many connected clients.
#
# When dynamic HZ is enabled, the actual configured HZ will be used
# as a baseline, but multiples of the configured HZ value will be actually
# used as needed once more clients are connected. In this way an idle
# instance will use very little CPU time while a busy instance will be
# more responsive.
dynamic-hz yes

# When a child rewrites the AOF file, if the following option is enabled
# the file will be fsync-ed every 32 MB of data generated.
# This is useful in order to commit the file to the disk more incrementally
# and avoid big latency spikes.
aof-rewrite-incremental-fsync yes

# When redis saves RDB file, if the following option is enabled
# the file will be fsync-ed every 32 MB of data generated. This is useful
# in order to commit the file to the disk more incrementally and avoid
# big latency spikes.
rdb-save-incremental-fsync yes

# Redis LFU eviction (see maxmemory setting) can be tuned. However it is a good
# idea to start with the default settings and only change them after investigating
# how to improve the performances and how the keys LFU change over time, which
# is possible to inspect via the OBJECT FREQ command.
#
# There are two tunable parameters in the Redis LFU implementation: the
# counter logarithm factor and the counter decay time. It is important to
# understand what the two parameters mean before changing them.
#
# The LFU counter is just 8 bits per key, its maximum value is 255, so Redis
# uses a probabilistic increment with logarithmic behavior. Given the value
# of the old counter, when a key is accessed, the counter is incremented in
# this way:
#
# 1. A random number R between 0 and 1 is extracted.
# 2. A probability P is calculated as 1/(old_value*lfu_log_factor+1).
# 3. The counter is incremented only if R < P.
#
# The default lfu-log-factor is 10.
# This is a table of how the frequency
# counter changes with a different number of accesses with different
# logarithmic factors:
#
# +--------+------------+------------+------------+------------+------------+
# | factor | 100 hits   | 1000 hits  | 100K hits  | 1M hits    | 10M hits   |
# +--------+------------+------------+------------+------------+------------+
# | 0      | 104        | 255        | 255        | 255        | 255        |
# +--------+------------+------------+------------+------------+------------+
# | 1      | 18         | 49         | 255        | 255        | 255        |
# +--------+------------+------------+------------+------------+------------+
# | 10     | 10         | 18         | 142        | 255        | 255        |
# +--------+------------+------------+------------+------------+------------+
# | 100    | 8          | 11         | 49         | 143        | 255        |
# +--------+------------+------------+------------+------------+------------+
#
# NOTE: The above table was obtained by running the following commands:
#
#   redis-benchmark -n 1000000 incr foo
#   redis-cli object freq foo
#
# NOTE 2: The counter initial value is 5 in order to give new objects a chance
# to accumulate hits.
#
# The counter decay time is the time, in minutes, that must elapse in order
# for the key counter to be divided by two (or decremented if it has a value
# <= 10).
#
# The default value for the lfu-decay-time is 1. A special value of 0 means to
# decay the counter every time it happens to be scanned.
#
# lfu-log-factor 10
# lfu-decay-time 1

########################### ACTIVE DEFRAGMENTATION #######################
#
# What is active defragmentation?
# -------------------------------
#
# Active (online) defragmentation allows a Redis server to compact the
# spaces left between small allocations and deallocations of data in memory,
# thus allowing to reclaim back memory.
#
# Fragmentation is a natural process that happens with every allocator (but
# less so with Jemalloc, fortunately) and certain workloads.
# Normally a server restart is needed in order to lower the fragmentation,
# or at least to flush away all the data and create it again. However thanks
# to this feature implemented by Oran Agra for Redis 4.0 this process can
# happen at runtime in a "hot" way, while the server is running.
#
# Basically when the fragmentation is over a certain level (see the
# configuration options below) Redis will start to create new copies of the
# values in contiguous memory regions by exploiting certain specific Jemalloc
# features (in order to understand if an allocation is causing fragmentation
# and to allocate it in a better place), and at the same time, will release the
# old copies of the data. This process, repeated incrementally for all the keys
# will cause the fragmentation to drop back to normal values.
#
# Important things to understand:
#
# 1. This feature is disabled by default, and only works if you compiled Redis
#    to use the copy of Jemalloc we ship with the source code of Redis.
#    This is the default with Linux builds.
#
# 2. You never need to enable this feature if you don't have fragmentation
#    issues.
#
# 3. Once you experience fragmentation, you can enable this feature when
#    needed with the command "CONFIG SET activedefrag yes".
#
# The configuration parameters are able to fine tune the behavior of the
# defragmentation process. If you are not sure about what they mean it is
# a good idea to leave the defaults untouched.
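To make the thresholds listed just below more concrete, here is a rough Python sketch of how they could combine into a CPU effort figure. It is a reading of the option descriptions, not the Redis implementation: `defrag_effort` is a hypothetical helper, and the linear interpolation between the minimum and maximum effort is an assumption, since the comments only state which effort applies at the lower and upper thresholds.

```python
# Sketch under stated assumptions; defrag_effort is a hypothetical helper and
# the linear ramp between cycle_min and cycle_max is assumed, not documented here.
MB = 1024 * 1024


def defrag_effort(frag_pct, frag_bytes,
                  ignore_bytes=100 * MB,   # active-defrag-ignore-bytes
                  lower=10,                # active-defrag-threshold-lower
                  upper=100,               # active-defrag-threshold-upper
                  cycle_min=1,             # active-defrag-cycle-min
                  cycle_max=25):           # active-defrag-cycle-max
    """Return the CPU percentage to spend on defragmentation (0 means idle)."""
    # Below either minimum, active defrag does not start at all.
    if frag_bytes < ignore_bytes or frag_pct < lower:
        return 0
    # At or above the upper threshold, use the maximum effort.
    if frag_pct >= upper:
        return cycle_max
    # In between, ramp linearly from cycle_min to cycle_max (assumption).
    ratio = (frag_pct - lower) / (upper - lower)
    return cycle_min + ratio * (cycle_max - cycle_min)


if __name__ == "__main__":
    for pct in (5, 10, 55, 100):
        print(pct, defrag_effort(pct, 200 * MB))
```

With the defaults shown, fragmentation of 5% stays idle, 10% uses the minimum effort, and 100% uses the maximum; values in between depend on the assumed ramp.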
# Enable active defragmentation
# activedefrag no

# Minimum amount of fragmentation waste to start active defrag
# active-defrag-ignore-bytes 100mb

# Minimum percentage of fragmentation to start active defrag
# active-defrag-threshold-lower 10

# Maximum percentage of fragmentation at which we use maximum effort
# active-defrag-threshold-upper 100

# Minimal effort for defrag in CPU percentage, to be used when the lower
# threshold is reached
# active-defrag-cycle-min 1

# Maximal effort for defrag in CPU percentage, to be used when the upper
# threshold is reached
# active-defrag-cycle-max 25

# Maximum number of set/hash/zset/list fields that will be processed from
# the main dictionary scan
# active-defrag-max-scan-fields 1000

# Jemalloc background thread for purging will be enabled by default
jemalloc-bg-thread yes

# It is possible to pin different threads and processes of Redis to specific
# CPUs in your system, in order to maximize the performances of the server.
# This is useful both in order to pin different Redis threads in different
# CPUs, but also in order to make sure that multiple Redis instances running
# in the same host will be pinned to different CPUs.
#
# Normally you can do this using the "taskset" command, however it is also
# possible to do this via Redis configuration directly, both in Linux and FreeBSD.
#
# You can pin the server/IO threads, bio threads, aof rewrite child process, and
# the bgsave child process.
# The syntax to specify the cpu list is the same as
# the taskset command:
#
# Set redis server/io threads to cpu affinity 0,2,4,6:
# server_cpulist 0-7:2
#
# Set bio threads to cpu affinity 1,3:
# bio_cpulist 1,3
#
# Set aof rewrite child process to cpu affinity 8,9,10,11:
# aof_rewrite_cpulist 8-11
#
# Set bgsave child process to cpu affinity 1,10,11
# bgsave_cpulist 1,10-11

# In some cases redis will emit warnings and even refuse to start if it detects
# that the system is in bad state, it is possible to suppress these warnings
# by setting the following config which takes a space delimited list of warnings
# to suppress
#
# ignore-warnings ARM64-COW-BUG

Add bind 0.0.0.0 to the configuration above.
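As a side note on the cpulist syntax used by the CPU affinity options in the configuration above (server_cpulist, bio_cpulist and so on), the following Python sketch expands such a list into explicit CPU numbers. `parse_cpulist` is a hypothetical helper written for illustration; it only covers the forms that appear in the examples (single CPUs, ranges, and ranges with a stride) and is not the parser used by Redis or taskset.

```python
# Illustrative sketch only; parse_cpulist is a hypothetical helper that handles
# the forms shown in the config comments: "N", "N-M" and "N-M:S".
def parse_cpulist(spec):
    """Expand a taskset-style CPU list such as "0-7:2" into sorted CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        rng, _, stride = part.partition(":")
        step = int(stride) if stride else 1
        if step < 1:
            raise ValueError(f"invalid stride in {part!r}")
        start, dash, end = rng.partition("-")
        first = int(start)
        last = int(end) if dash else first
        if last < first:
            raise ValueError(f"invalid range in {part!r}")
        # range() is exclusive at the top, so add 1 to include the last CPU.
        cpus.update(range(first, last + 1, step))
    return sorted(cpus)


if __name__ == "__main__":
    for example in ("0-7:2", "1,3", "8-11", "1,10-11"):
        print(example, "->", parse_cpulist(example))
```

The four calls correspond to the four examples given in the comments, which state the expected affinities as 0,2,4,6 then 1,3 then 8,9,10,11 then 1,10,11.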
05-24
xref: /casio_MT6878_16.0.0_master/vnd/kernel-6.1/block/blk-core.c

// SPDX-License-Identifier: GPL-2.0
/*
 * Copyright (C) 1991, 1992 Linus Torvalds
 * Copyright (C) 1994, Karl Keyte: Added support for disk statistics
 * Elevator latency, (C) 2000 Andrea Arcangeli <andrea@suse.de> SuSE
 * Queue request tables / lock, selectable elevator, Jens Axboe <axboe@suse.de>
 * kernel-doc documentation started by NeilBrown <neilb@cse.unsw.edu.au>
 *	- July2000
 * bio rewrite, highmem i/o, etc, Jens Axboe <axboe@suse.de> - may 2001
 */

/*
 * This handles all read/write requests to block devices
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/bio.h>
#include <linux/blkdev.h>
#include <linux/blk-pm.h>
#include <linux/blk-integrity.h>
#include <linux/highmem.h>
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/kernel_stat.h>
#include <linux/string.h>
#include <linux/init.h>
#include <linux/completion.h>
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/writeback.h>
#include <linux/task_io_accounting_ops.h>
#include <linux/fault-inject.h>
#include <linux/list_sort.h>
#include <linux/delay.h>
#include <linux/ratelimit.h>
#include <linux/pm_runtime.h>
#include <linux/t10-pi.h>
#include <linux/debugfs.h>
#include <linux/bpf.h>
#include <linux/part_stat.h>
#include <linux/sched/sysctl.h>
#include <linux/blk-crypto.h>

#define CREATE_TRACE_POINTS
#include <trace/events/block.h>

#include "blk.h"
#ifndef __GENKSYMS__
#include "blk-mq-debugfs.h"
#endif
#include "blk-mq-sched.h"
#include "blk-pm.h"
#include "blk-cgroup.h"
#include "blk-throttle.h"
#include "blk-ioprio.h"

#ifdef CONFIG_BLK_MQ_USE_LOCAL_THREAD
extern long bio_cnt; // total bio submit
extern long rt_bio_cnt; // total rt bio submit, part of bio_cnt
extern long ux_bio_cnt; // total ux bio submit, part of rt_bio_cnt
#endif

struct dentry *blk_debugfs_root;

EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_remap);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_complete);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_split);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_unplug);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_insert);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_bio_queue);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_getrq);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_issue);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_merge);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_requeue);
EXPORT_TRACEPOINT_SYMBOL_GPL(block_rq_complete);

DEFINE_IDA(blk_queue_ida);

/*
 * For queue allocation
 */
struct kmem_cache *blk_requestq_cachep;
struct kmem_cache *blk_requestq_srcu_cachep;

/*
 * Controlling structure to kblockd
 */
static struct workqueue_struct *kblockd_workqueue;

/**
 * blk_queue_flag_set - atomically set a queue flag
 * @flag: flag to be set
 * @q: request queue
 */
void blk_queue_flag_set(unsigned int flag, struct request_queue *q)
{
	set_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_set);

/**
 * blk_queue_flag_clear - atomically clear a queue flag
 * @flag: flag to be cleared
 * @q: request queue
 */
void blk_queue_flag_clear(unsigned int flag, struct request_queue *q)
{
	clear_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL(blk_queue_flag_clear);

/**
 * blk_queue_flag_test_and_set - atomically test and set a queue flag
 * @flag: flag to be set
 * @q: request queue
 *
 * Returns the previous value of @flag - 0 if the flag was not set and 1 if
 * the flag was already set.
 */
bool blk_queue_flag_test_and_set(unsigned int flag, struct request_queue *q)
{
	return test_and_set_bit(flag, &q->queue_flags);
}
EXPORT_SYMBOL_GPL(blk_queue_flag_test_and_set);

#define REQ_OP_NAME(name) [REQ_OP_##name] = #name
static const char *const blk_op_name[] = {
	REQ_OP_NAME(READ),
	REQ_OP_NAME(WRITE),
	REQ_OP_NAME(FLUSH),
	REQ_OP_NAME(DISCARD),
	REQ_OP_NAME(SECURE_ERASE),
	REQ_OP_NAME(ZONE_RESET),
	REQ_OP_NAME(ZONE_RESET_ALL),
	REQ_OP_NAME(ZONE_OPEN),
	REQ_OP_NAME(ZONE_CLOSE),
	REQ_OP_NAME(ZONE_FINISH),
	REQ_OP_NAME(ZONE_APPEND),
	REQ_OP_NAME(WRITE_ZEROES),
	REQ_OP_NAME(DRV_IN),
	REQ_OP_NAME(DRV_OUT),
};
#undef REQ_OP_NAME

/**
 * blk_op_str - Return string XXX in the REQ_OP_XXX.
 * @op: REQ_OP_XXX.
 *
 * Description: Centralize block layer function to convert REQ_OP_XXX into
 * string format. Useful in the debugging and tracing bio or request. For
 * invalid REQ_OP_XXX it returns string "UNKNOWN".
 */
inline const char *blk_op_str(enum req_op op)
{
	const char *op_str = "UNKNOWN";

	if (op < ARRAY_SIZE(blk_op_name) && blk_op_name[op])
		op_str = blk_op_name[op];

	return op_str;
}
EXPORT_SYMBOL_GPL(blk_op_str);

static const struct {
	int errno;
	const char *name;
} blk_errors[] = {
	[BLK_STS_OK] = { 0, "" },
	[BLK_STS_NOTSUPP] = { -EOPNOTSUPP, "operation not supported" },
	[BLK_STS_TIMEOUT] = { -ETIMEDOUT, "timeout" },
	[BLK_STS_NOSPC] = { -ENOSPC, "critical space allocation" },
	[BLK_STS_TRANSPORT] = { -ENOLINK, "recoverable transport" },
	[BLK_STS_TARGET] = { -EREMOTEIO, "critical target" },
	[BLK_STS_NEXUS] = { -EBADE, "critical nexus" },
	[BLK_STS_MEDIUM] = { -ENODATA, "critical medium" },
	[BLK_STS_PROTECTION] = { -EILSEQ, "protection" },
	[BLK_STS_RESOURCE] = { -ENOMEM, "kernel resource" },
	[BLK_STS_DEV_RESOURCE] = { -EBUSY, "device resource" },
	[BLK_STS_AGAIN] = { -EAGAIN, "nonblocking retry" },
	[BLK_STS_OFFLINE] = { -ENODEV, "device offline" },

	/* device mapper special case, should not leak out: */
	[BLK_STS_DM_REQUEUE] = { -EREMCHG, "dm internal retry" },

	/* zone device specific errors */
	[BLK_STS_ZONE_OPEN_RESOURCE] = { -ETOOMANYREFS, "open zones exceeded" },
	[BLK_STS_ZONE_ACTIVE_RESOURCE] = { -EOVERFLOW, "active zones exceeded" },

	/* everything else not covered above: */
	[BLK_STS_IOERR] = { -EIO, "I/O" },
};

blk_status_t errno_to_blk_status(int errno)
{
	int i;

	for (i = 0; i < ARRAY_SIZE(blk_errors); i++) {
		if (blk_errors[i].errno == errno)
			return (__force blk_status_t)i;
	}

	return BLK_STS_IOERR;
}
EXPORT_SYMBOL_GPL(errno_to_blk_status);

int blk_status_to_errno(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return -EIO;
	return blk_errors[idx].errno;
}
EXPORT_SYMBOL_GPL(blk_status_to_errno);

const char *blk_status_to_str(blk_status_t status)
{
	int idx = (__force int)status;

	if (WARN_ON_ONCE(idx >= ARRAY_SIZE(blk_errors)))
		return "<null>";
	return blk_errors[idx].name;
}

/**
 * blk_sync_queue - cancel any pending callbacks on a queue
 * @q: the queue
 *
 * Description:
 *     The block layer may perform asynchronous callback activity
 *     on a queue, such as calling the unplug function after a timeout.
 *     A block device may call blk_sync_queue to ensure that any
 *     such activity is cancelled, thus allowing it to release resources
 *     that the callbacks might use. The caller must already have made sure
 *     that its ->submit_bio will not re-add plugging prior to calling
 *     this function.
 *
 *     This function does not cancel any asynchronous activity arising
 *     out of elevator or throttling code. That would require elevator_exit()
 *     and blkcg_exit_queue() to be called with queue lock initialized.
 *
 */
void blk_sync_queue(struct request_queue *q)
{
	del_timer_sync(&q->timeout);
	cancel_work_sync(&q->timeout_work);
}
EXPORT_SYMBOL(blk_sync_queue);

/**
 * blk_set_pm_only - increment pm_only counter
 * @q: request queue pointer
 */
void blk_set_pm_only(struct request_queue *q)
{
	atomic_inc(&q->pm_only);
}
EXPORT_SYMBOL_GPL(blk_set_pm_only);

void blk_clear_pm_only(struct request_queue *q)
{
	int pm_only;

	pm_only = atomic_dec_return(&q->pm_only);
	WARN_ON_ONCE(pm_only < 0);
	if (pm_only == 0)
		wake_up_all(&q->mq_freeze_wq);
}
EXPORT_SYMBOL_GPL(blk_clear_pm_only);

/**
 * blk_put_queue - decrement the request_queue refcount
 * @q: the request_queue structure to decrement the refcount for
 *
 * Decrements the refcount of the request_queue kobject. When this reaches 0
 * we'll have blk_release_queue() called.
278 * 279 * Context: Any context, but the last reference must not be dropped from 280 * atomic context. 281 */ 282 void blk_put_queue(struct request_queue *q) 283 { 284 kobject_put(&q->kobj); 285 } 286 EXPORT_SYMBOL(blk_put_queue); 287 288 void blk_queue_start_drain(struct request_queue *q) 289 { 290 /* 291 * When queue DYING flag is set, we need to block new req 292 * entering queue, so we call blk_freeze_queue_start() to 293 * prevent I/O from crossing blk_queue_enter(). 294 */ 295 blk_freeze_queue_start(q); 296 if (queue_is_mq(q)) 297 blk_mq_wake_waiters(q); 298 /* Make blk_queue_enter() reexamine the DYING flag. */ 299 wake_up_all(&q->mq_freeze_wq); 300 } 301 302 /** 303 * blk_queue_enter() - try to increase q->q_usage_counter 304 * @q: request queue pointer 305 * @flags: BLK_MQ_REQ_NOWAIT and/or BLK_MQ_REQ_PM 306 */ 307 int blk_queue_enter(struct request_queue *q, blk_mq_req_flags_t flags) 308 { 309 const bool pm = flags & BLK_MQ_REQ_PM; 310 311 while (!blk_try_enter_queue(q, pm)) { 312 if (flags & BLK_MQ_REQ_NOWAIT) 313 return -EAGAIN; 314 315 /* 316 * read pair of barrier in blk_freeze_queue_start(), we need to 317 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and 318 * reading .mq_freeze_depth or queue dying flag, otherwise the 319 * following wait may never return if the two reads are 320 * reordered. 
321 */ 322 smp_rmb(); 323 wait_event(q->mq_freeze_wq, 324 (!q->mq_freeze_depth && 325 blk_pm_resume_queue(pm, q)) || 326 blk_queue_dying(q)); 327 if (blk_queue_dying(q)) 328 return -ENODEV; 329 } 330 331 return 0; 332 } 333 334 int __bio_queue_enter(struct request_queue *q, struct bio *bio) 335 { 336 while (!blk_try_enter_queue(q, false)) { 337 struct gendisk *disk = bio->bi_bdev->bd_disk; 338 339 if (bio->bi_opf & REQ_NOWAIT) { 340 if (test_bit(GD_DEAD, &disk->state)) 341 goto dead; 342 bio_wouldblock_error(bio); 343 return -EAGAIN; 344 } 345 346 /* 347 * read pair of barrier in blk_freeze_queue_start(), we need to 348 * order reading __PERCPU_REF_DEAD flag of .q_usage_counter and 349 * reading .mq_freeze_depth or queue dying flag, otherwise the 350 * following wait may never return if the two reads are 351 * reordered. 352 */ 353 smp_rmb(); 354 wait_event(q->mq_freeze_wq, 355 (!q->mq_freeze_depth && 356 blk_pm_resume_queue(false, q)) || 357 test_bit(GD_DEAD, &disk->state)); 358 if (test_bit(GD_DEAD, &disk->state)) 359 goto dead; 360 } 361 362 return 0; 363 dead: 364 bio_io_error(bio); 365 return -ENODEV; 366 } 367 368 void blk_queue_exit(struct request_queue *q) 369 { 370 percpu_ref_put(&q->q_usage_counter); 371 } 372 373 static void blk_queue_usage_counter_release(struct percpu_ref *ref) 374 { 375 struct request_queue *q = 376 container_of(ref, struct request_queue, q_usage_counter); 377 378 wake_up_all(&q->mq_freeze_wq); 379 } 380 381 static void blk_rq_timed_out_timer(struct timer_list *t) 382 { 383 struct request_queue *q = from_timer(q, t, timeout); 384 385 kblockd_schedule_work(&q->timeout_work); 386 } 387 388 static void blk_timeout_work(struct work_struct *work) 389 { 390 } 391 392 struct request_queue *blk_alloc_queue(int node_id, bool alloc_srcu) 393 { 394 struct request_queue *q; 395 396 q = kmem_cache_alloc_node(blk_get_queue_kmem_cache(alloc_srcu), 397 GFP_KERNEL | __GFP_ZERO, node_id); 398 if (!q) 399 return NULL; 400 401 if (alloc_srcu) { 402 
blk_queue_flag_set(QUEUE_FLAG_HAS_SRCU, q); 403 if (init_srcu_struct(q->srcu) != 0) 404 goto fail_q; 405 } 406 407 q->last_merge = NULL; 408 409 q->id = ida_alloc(&blk_queue_ida, GFP_KERNEL); 410 if (q->id < 0) 411 goto fail_srcu; 412 413 q->stats = blk_alloc_queue_stats(); 414 if (!q->stats) 415 goto fail_id; 416 417 q->node = node_id; 418 419 atomic_set(&q->nr_active_requests_shared_tags, 0); 420 421 timer_setup(&q->timeout, blk_rq_timed_out_timer, 0); 422 INIT_WORK(&q->timeout_work, blk_timeout_work); 423 INIT_LIST_HEAD(&q->icq_list); 424 425 kobject_init(&q->kobj, &blk_queue_ktype); 426 427 mutex_init(&q->debugfs_mutex); 428 mutex_init(&q->sysfs_lock); 429 mutex_init(&q->sysfs_dir_lock); 430 spin_lock_init(&q->queue_lock); 431 432 init_waitqueue_head(&q->mq_freeze_wq); 433 mutex_init(&q->mq_freeze_lock); 434 435 /* 436 * Init percpu_ref in atomic mode so that it's faster to shutdown. 437 * See blk_register_queue() for details. 438 */ 439 if (percpu_ref_init(&q->q_usage_counter, 440 blk_queue_usage_counter_release, 441 PERCPU_REF_INIT_ATOMIC, GFP_KERNEL)) 442 goto fail_stats; 443 444 blk_set_default_limits(&q->limits); 445 q->nr_requests = BLKDEV_DEFAULT_RQ; 446 447 return q; 448 449 fail_stats: 450 blk_free_queue_stats(q->stats); 451 fail_id: 452 ida_free(&blk_queue_ida, q->id); 453 fail_srcu: 454 if (alloc_srcu) 455 cleanup_srcu_struct(q->srcu); 456 fail_q: 457 kmem_cache_free(blk_get_queue_kmem_cache(alloc_srcu), q); 458 return NULL; 459 } 460 461 /** 462 * blk_get_queue - increment the request_queue refcount 463 * @q: the request_queue structure to increment the refcount for 464 * 465 * Increment the refcount of the request_queue kobject. 466 * 467 * Context: Any context. 
468 */ 469 bool blk_get_queue(struct request_queue *q) 470 { 471 if (unlikely(blk_queue_dying(q))) 472 return false; 473 kobject_get(&q->kobj); 474 return true; 475 } 476 EXPORT_SYMBOL(blk_get_queue); 477 478 #ifdef CONFIG_FAIL_MAKE_REQUEST 479 480 static DECLARE_FAULT_ATTR(fail_make_request); 481 482 static int __init setup_fail_make_request(char *str) 483 { 484 return setup_fault_attr(&fail_make_request, str); 485 } 486 __setup("fail_make_request=", setup_fail_make_request); 487 488 bool should_fail_request(struct block_device *part, unsigned int bytes) 489 { 490 return part->bd_make_it_fail && should_fail(&fail_make_request, bytes); 491 } 492 493 static int __init fail_make_request_debugfs(void) 494 { 495 struct dentry *dir = fault_create_debugfs_attr("fail_make_request", 496 NULL, &fail_make_request); 497 498 return PTR_ERR_OR_ZERO(dir); 499 } 500 501 late_initcall(fail_make_request_debugfs); 502 #endif /* CONFIG_FAIL_MAKE_REQUEST */ 503 504 static inline void bio_check_ro(struct bio *bio) 505 { 506 if (op_is_write(bio_op(bio)) && bdev_read_only(bio->bi_bdev)) { 507 if (op_is_flush(bio->bi_opf) && !bio_sectors(bio)) 508 return; 509 pr_warn_ratelimited("Trying to write to read-only block-device %pg\n", 510 bio->bi_bdev); 511 /* Older lvm-tools actually trigger this */ 512 } 513 } 514 515 static noinline int should_fail_bio(struct bio *bio) 516 { 517 if (should_fail_request(bdev_whole(bio->bi_bdev), bio->bi_iter.bi_size)) 518 return -EIO; 519 return 0; 520 } 521 ALLOW_ERROR_INJECTION(should_fail_bio, ERRNO); 522 523 /* 524 * Check whether this bio extends beyond the end of the device or partition. 525 * This may well happen - the kernel calls bread() without checking the size of 526 * the device, e.g., when mounting a file system. 
527 */ 528 static inline int bio_check_eod(struct bio *bio) 529 { 530 sector_t maxsector = bdev_nr_sectors(bio->bi_bdev); 531 unsigned int nr_sectors = bio_sectors(bio); 532 533 if (nr_sectors && maxsector && 534 (nr_sectors > maxsector || 535 bio->bi_iter.bi_sector > maxsector - nr_sectors)) { 536 pr_info_ratelimited("%s: attempt to access beyond end of device\n" 537 "%pg: rw=%d, sector=%llu, nr_sectors = %u limit=%llu\n", 538 current->comm, bio->bi_bdev, bio->bi_opf, 539 bio->bi_iter.bi_sector, nr_sectors, maxsector); 540 return -EIO; 541 } 542 return 0; 543 } 544 545 /* 546 * Remap block n of partition p to block n+start(p) of the disk. 547 */ 548 static int blk_partition_remap(struct bio *bio) 549 { 550 struct block_device *p = bio->bi_bdev; 551 552 if (unlikely(should_fail_request(p, bio->bi_iter.bi_size))) 553 return -EIO; 554 if (bio_sectors(bio)) { 555 bio->bi_iter.bi_sector += p->bd_start_sect; 556 trace_block_bio_remap(bio, p->bd_dev, 557 bio->bi_iter.bi_sector - 558 p->bd_start_sect); 559 } 560 bio_set_flag(bio, BIO_REMAPPED); 561 return 0; 562 } 563 564 /* 565 * Check write append to a zoned block device. 566 */ 567 static inline blk_status_t blk_check_zone_append(struct request_queue *q, 568 struct bio *bio) 569 { 570 int nr_sectors = bio_sectors(bio); 571 572 /* Only applicable to zoned block devices */ 573 if (!bdev_is_zoned(bio->bi_bdev)) 574 return BLK_STS_NOTSUPP; 575 576 /* The bio sector must point to the start of a sequential zone */ 577 if (bio->bi_iter.bi_sector & (bdev_zone_sectors(bio->bi_bdev) - 1) || 578 !bio_zone_is_seq(bio)) 579 return BLK_STS_IOERR; 580 581 /* 582 * Not allowed to cross zone boundaries. Otherwise, the BIO will be 583 * split and could result in non-contiguous sectors being written in 584 * different zones. 
585 */ 586 if (nr_sectors > q->limits.chunk_sectors) 587 return BLK_STS_IOERR; 588 589 /* Make sure the BIO is small enough and will not get split */ 590 if (nr_sectors > q->limits.max_zone_append_sectors) 591 return BLK_STS_IOERR; 592 593 bio->bi_opf |= REQ_NOMERGE; 594 595 return BLK_STS_OK; 596 } 597 598 static void __submit_bio(struct bio *bio) 599 { 600 struct gendisk *disk = bio->bi_bdev->bd_disk; 601 602 if (unlikely(!blk_crypto_bio_prep(&bio))) 603 return; 604 605 if (!disk->fops->submit_bio) { 606 blk_mq_submit_bio(bio); 607 } else if (likely(bio_queue_enter(bio) == 0)) { 608 disk->fops->submit_bio(bio); 609 blk_queue_exit(disk->queue); 610 } 611 } 612 613 /* 614 * The loop in this function may be a bit non-obvious, and so deserves some 615 * explanation: 616 * 617 * - Before entering the loop, bio->bi_next is NULL (as all callers ensure 618 * that), so we have a list with a single bio. 619 * - We pretend that we have just taken it off a longer list, so we assign 620 * bio_list to a pointer to the bio_list_on_stack, thus initialising the 621 * bio_list of new bios to be added. ->submit_bio() may indeed add some more 622 * bios through a recursive call to submit_bio_noacct. If it did, we find a 623 * non-NULL value in bio_list and re-enter the loop from the top. 624 * - In this case we really did just take the bio of the top of the list (no 625 * pretending) and so remove it from bio_list, and call into ->submit_bio() 626 * again. 627 * 628 * bio_list_on_stack[0] contains bios submitted by the current ->submit_bio. 629 * bio_list_on_stack[1] contains bios that were submitted before the current 630 * ->submit_bio, but that haven't been processed yet. 
631 */ 632 static void __submit_bio_noacct(struct bio *bio) 633 { 634 struct bio_list bio_list_on_stack[2]; 635 636 BUG_ON(bio->bi_next); 637 638 bio_list_init(&bio_list_on_stack[0]); 639 current->bio_list = bio_list_on_stack; 640 641 do { 642 struct request_queue *q = bdev_get_queue(bio->bi_bdev); 643 struct bio_list lower, same; 644 645 /* 646 * Create a fresh bio_list for all subordinate requests. 647 */ 648 bio_list_on_stack[1] = bio_list_on_stack[0]; 649 bio_list_init(&bio_list_on_stack[0]); 650 651 __submit_bio(bio); 652 653 /* 654 * Sort new bios into those for a lower level and those for the 655 * same level. 656 */ 657 bio_list_init(&lower); 658 bio_list_init(&same); 659 while ((bio = bio_list_pop(&bio_list_on_stack[0])) != NULL) 660 if (q == bdev_get_queue(bio->bi_bdev)) 661 bio_list_add(&same, bio); 662 else 663 bio_list_add(&lower, bio); 664 665 /* 666 * Now assemble so we handle the lowest level first. 667 */ 668 bio_list_merge(&bio_list_on_stack[0], &lower); 669 bio_list_merge(&bio_list_on_stack[0], &same); 670 bio_list_merge(&bio_list_on_stack[0], &bio_list_on_stack[1]); 671 } while ((bio = bio_list_pop(&bio_list_on_stack[0]))); 672 673 current->bio_list = NULL; 674 } 675 676 static void __submit_bio_noacct_mq(struct bio *bio) 677 { 678 struct bio_list bio_list[2] = { }; 679 680 current->bio_list = bio_list; 681 682 do { 683 __submit_bio(bio); 684 } while ((bio = bio_list_pop(&bio_list[0]))); 685 686 current->bio_list = NULL; 687 } 688 689 void submit_bio_noacct_nocheck(struct bio *bio) 690 { 691 blk_cgroup_bio_start(bio); 692 blkcg_bio_issue_init(bio); 693 694 if (!bio_flagged(bio, BIO_TRACE_COMPLETION)) { 695 trace_block_bio_queue(bio); 696 /* 697 * Now that enqueuing has been traced, we need to trace 698 * completion as well. 699 */ 700 bio_set_flag(bio, BIO_TRACE_COMPLETION); 701 } 702 703 /* 704 * We only want one ->submit_bio to be active at a time, else stack 705 * usage with stacked devices could be a problem. 
Use current->bio_list 706 * to collect a list of requests submited by a ->submit_bio method while 707 * it is active, and then process them after it returned. 708 */ 709 if (current->bio_list) 710 bio_list_add(&current->bio_list[0], bio); 711 else if (!bio->bi_bdev->bd_disk->fops->submit_bio) 712 __submit_bio_noacct_mq(bio); 713 else 714 __submit_bio_noacct(bio); 715 } 716 717 /** 718 * submit_bio_noacct - re-submit a bio to the block device layer for I/O 719 * @bio: The bio describing the location in memory and on the device. 720 * 721 * This is a version of submit_bio() that shall only be used for I/O that is 722 * resubmitted to lower level drivers by stacking block drivers. All file 723 * systems and other upper level users of the block layer should use 724 * submit_bio() instead. 725 */ 726 void submit_bio_noacct(struct bio *bio) 727 { 728 struct block_device *bdev = bio->bi_bdev; 729 struct request_queue *q = bdev_get_queue(bdev); 730 blk_status_t status = BLK_STS_IOERR; 731 struct blk_plug *plug; 732 733 might_sleep(); 734 735 plug = blk_mq_plug(bio); 736 if (plug && plug->nowait) 737 bio->bi_opf |= REQ_NOWAIT; 738 739 /* 740 * For a REQ_NOWAIT based request, return -EOPNOTSUPP 741 * if queue does not support NOWAIT. 742 */ 743 if ((bio->bi_opf & REQ_NOWAIT) && !bdev_nowait(bdev)) 744 goto not_supported; 745 746 if (should_fail_bio(bio)) 747 goto end_io; 748 bio_check_ro(bio); 749 if (!bio_flagged(bio, BIO_REMAPPED)) { 750 if (unlikely(bio_check_eod(bio))) 751 goto end_io; 752 if (bdev->bd_partno && unlikely(blk_partition_remap(bio))) 753 goto end_io; 754 } 755 756 /* 757 * Filter flush bio's early so that bio based drivers without flush 758 * support don't have to worry about them. 
759 */ 760 if (op_is_flush(bio->bi_opf) && 761 !test_bit(QUEUE_FLAG_WC, &q->queue_flags)) { 762 bio->bi_opf &= ~(REQ_PREFLUSH | REQ_FUA); 763 if (!bio_sectors(bio)) { 764 status = BLK_STS_OK; 765 goto end_io; 766 } 767 } 768 769 if (!test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) 770 bio_clear_polled(bio); 771 772 switch (bio_op(bio)) { 773 case REQ_OP_DISCARD: 774 if (!bdev_max_discard_sectors(bdev)) 775 goto not_supported; 776 break; 777 case REQ_OP_SECURE_ERASE: 778 if (!bdev_max_secure_erase_sectors(bdev)) 779 goto not_supported; 780 break; 781 case REQ_OP_ZONE_APPEND: 782 status = blk_check_zone_append(q, bio); 783 if (status != BLK_STS_OK) 784 goto end_io; 785 break; 786 case REQ_OP_ZONE_RESET: 787 case REQ_OP_ZONE_OPEN: 788 case REQ_OP_ZONE_CLOSE: 789 case REQ_OP_ZONE_FINISH: 790 if (!bdev_is_zoned(bio->bi_bdev)) 791 goto not_supported; 792 break; 793 case REQ_OP_ZONE_RESET_ALL: 794 if (!bdev_is_zoned(bio->bi_bdev) || !blk_queue_zone_resetall(q)) 795 goto not_supported; 796 break; 797 case REQ_OP_WRITE_ZEROES: 798 if (!q->limits.max_write_zeroes_sectors) 799 goto not_supported; 800 break; 801 default: 802 break; 803 } 804 805 if (blk_throtl_bio(bio)) 806 return; 807 submit_bio_noacct_nocheck(bio); 808 return; 809 810 not_supported: 811 status = BLK_STS_NOTSUPP; 812 end_io: 813 bio->bi_status = status; 814 bio_endio(bio); 815 } 816 EXPORT_SYMBOL(submit_bio_noacct); 817 818 #ifdef CONFIG_BLK_MQ_USE_LOCAL_THREAD 819 extern bool test_task_ux(struct task_struct *task); 820 #endif 821 822 static void bio_set_ioprio(struct bio *bio) 823 { 824 /* Nobody set ioprio so far? 
Initialize it based on task's nice value */ 825 if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_NONE) 826 bio->bi_ioprio = get_current_ioprio(); 827 blkcg_set_ioprio(bio); 828 #ifdef CONFIG_BLK_MQ_USE_LOCAL_THREAD 829 bio_cnt++; 830 831 if (IOPRIO_PRIO_CLASS(bio->bi_ioprio) == IOPRIO_CLASS_RT) { 832 rt_bio_cnt++; 833 } else if (test_task_ux(current)) { 834 bio->bi_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 4); 835 rt_bio_cnt++; 836 ux_bio_cnt++; 837 } 838 #endif 839 } 840 841 /** 842 * submit_bio - submit a bio to the block device layer for I/O 843 * @bio: The &struct bio which describes the I/O 844 * 845 * submit_bio() is used to submit I/O requests to block devices. It is passed a 846 * fully set up &struct bio that describes the I/O that needs to be done. The 847 * bio will be send to the device described by the bi_bdev field. 848 * 849 * The success/failure status of the request, along with notification of 850 * completion, is delivered asynchronously through the ->bi_end_io() callback 851 * in @bio. The bio must NOT be touched by the caller until ->bi_end_io() has 852 * been called. 853 */ 854 void submit_bio(struct bio *bio) 855 { 856 if (blkcg_punt_bio_submit(bio)) 857 return; 858 859 if (bio_op(bio) == REQ_OP_READ) { 860 task_io_account_read(bio->bi_iter.bi_size); 861 count_vm_events(PGPGIN, bio_sectors(bio)); 862 } else if (bio_op(bio) == REQ_OP_WRITE) { 863 count_vm_events(PGPGOUT, bio_sectors(bio)); 864 } 865 866 bio_set_ioprio(bio); 867 submit_bio_noacct(bio); 868 } 869 EXPORT_SYMBOL(submit_bio); 870 871 /** 872 * bio_poll - poll for BIO completions 873 * @bio: bio to poll for 874 * @iob: batches of IO 875 * @flags: BLK_POLL_* flags that control the behavior 876 * 877 * Poll for completions on queue associated with the bio. Returns number of 878 * completed entries found. 879 * 880 * Note: the caller must either be the context that submitted @bio, or 881 * be in a RCU critical section to prevent freeing of @bio. 
882 */ 883 int bio_poll(struct bio *bio, struct io_comp_batch *iob, unsigned int flags) 884 { 885 blk_qc_t cookie = READ_ONCE(bio->bi_cookie); 886 struct block_device *bdev; 887 struct request_queue *q; 888 int ret = 0; 889 890 bdev = READ_ONCE(bio->bi_bdev); 891 if (!bdev) 892 return 0; 893 894 q = bdev_get_queue(bdev); 895 if (cookie == BLK_QC_T_NONE || 896 !test_bit(QUEUE_FLAG_POLL, &q->queue_flags)) 897 return 0; 898 899 /* 900 * As the requests that require a zone lock are not plugged in the 901 * first place, directly accessing the plug instead of using 902 * blk_mq_plug() should not have any consequences during flushing for 903 * zoned devices. 904 */ 905 blk_flush_plug(current->plug, false); 906 907 /* 908 * We need to be able to enter a frozen queue, similar to how 909 * timeouts also need to do that. If that is blocked, then we can 910 * have pending IO when a queue freeze is started, and then the 911 * wait for the freeze to finish will wait for polled requests to 912 * timeout as the poller is preventer from entering the queue and 913 * completing them. As long as we prevent new IO from being queued, 914 * that should be all that matters. 915 */ 916 if (!percpu_ref_tryget(&q->q_usage_counter)) 917 return 0; 918 if (queue_is_mq(q)) { 919 ret = blk_mq_poll(q, cookie, iob, flags); 920 } else { 921 struct gendisk *disk = q->disk; 922 923 if (disk && disk->fops->poll_bio) 924 ret = disk->fops->poll_bio(bio, iob, flags); 925 } 926 blk_queue_exit(q); 927 return ret; 928 } 929 EXPORT_SYMBOL_GPL(bio_poll); 930 931 /* 932 * Helper to implement file_operations.iopoll. Requires the bio to be stored 933 * in iocb->private, and cleared before freeing the bio. 934 */ 935 int iocb_bio_iopoll(struct kiocb *kiocb, struct io_comp_batch *iob, 936 unsigned int flags) 937 { 938 struct bio *bio; 939 int ret = 0; 940 941 /* 942 * Note: the bio cache only uses SLAB_TYPESAFE_BY_RCU, so bio can 943 * point to a freshly allocated bio at this point. 
If that happens 944 * we have a few cases to consider: 945 * 946 * 1) the bio is beeing initialized and bi_bdev is NULL. We can just 947 * simply nothing in this case 948 * 2) the bio points to a not poll enabled device. bio_poll will catch 949 * this and return 0 950 * 3) the bio points to a poll capable device, including but not 951 * limited to the one that the original bio pointed to. In this 952 * case we will call into the actual poll method and poll for I/O, 953 * even if we don't need to, but it won't cause harm either. 954 * 955 * For cases 2) and 3) above the RCU grace period ensures that bi_bdev 956 * is still allocated. Because partitions hold a reference to the whole 957 * device bdev and thus disk, the disk is also still valid. Grabbing 958 * a reference to the queue in bio_poll() ensures the hctxs and requests 959 * are still valid as well. 960 */ 961 rcu_read_lock(); 962 bio = READ_ONCE(kiocb->private); 963 if (bio) 964 ret = bio_poll(bio, iob, flags); 965 rcu_read_unlock(); 966 967 return ret; 968 } 969 EXPORT_SYMBOL_GPL(iocb_bio_iopoll); 970 971 void update_io_ticks(struct block_device *part, unsigned long now, bool end) 972 { 973 unsigned long stamp; 974 again: 975 stamp = READ_ONCE(part->bd_stamp); 976 if (unlikely(time_after(now, stamp)) && 977 likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) && 978 (end || part_in_flight(part))) 979 __part_stat_add(part, io_ticks, now - stamp); 980 981 if (part->bd_partno) { 982 part = bdev_whole(part); 983 goto again; 984 } 985 } 986 987 unsigned long bdev_start_io_acct(struct block_device *bdev, 988 unsigned int sectors, enum req_op op, 989 unsigned long start_time) 990 { 991 const int sgrp = op_stat_group(op); 992 993 part_stat_lock(); 994 update_io_ticks(bdev, start_time, false); 995 part_stat_inc(bdev, ios[sgrp]); 996 part_stat_add(bdev, sectors[sgrp], sectors); 997 part_stat_local_inc(bdev, in_flight[op_is_write(op)]); 998 part_stat_unlock(); 999 1000 return start_time; 1001 } 1002 
EXPORT_SYMBOL(bdev_start_io_acct); 1003 1004 /** 1005 * bio_start_io_acct_time - start I/O accounting for bio based drivers 1006 * @bio: bio to start account for 1007 * @start_time: start time that should be passed back to bio_end_io_acct(). 1008 */ 1009 void bio_start_io_acct_time(struct bio *bio, unsigned long start_time) 1010 { 1011 bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio), 1012 bio_op(bio), start_time); 1013 } 1014 EXPORT_SYMBOL_GPL(bio_start_io_acct_time); 1015 1016 /** 1017 * bio_start_io_acct - start I/O accounting for bio based drivers 1018 * @bio: bio to start account for 1019 * 1020 * Returns the start time that should be passed back to bio_end_io_acct(). 1021 */ 1022 unsigned long bio_start_io_acct(struct bio *bio) 1023 { 1024 return bdev_start_io_acct(bio->bi_bdev, bio_sectors(bio), 1025 bio_op(bio), jiffies); 1026 } 1027 EXPORT_SYMBOL_GPL(bio_start_io_acct); 1028 1029 void bdev_end_io_acct(struct block_device *bdev, enum req_op op, 1030 unsigned long start_time) 1031 { 1032 const int sgrp = op_stat_group(op); 1033 unsigned long now = READ_ONCE(jiffies); 1034 unsigned long duration = now - start_time; 1035 1036 part_stat_lock(); 1037 update_io_ticks(bdev, now, true); 1038 part_stat_add(bdev, nsecs[sgrp], jiffies_to_nsecs(duration)); 1039 part_stat_local_dec(bdev, in_flight[op_is_write(op)]); 1040 part_stat_unlock(); 1041 } 1042 EXPORT_SYMBOL(bdev_end_io_acct); 1043 1044 void bio_end_io_acct_remapped(struct bio *bio, unsigned long start_time, 1045 struct block_device *orig_bdev) 1046 { 1047 bdev_end_io_acct(orig_bdev, bio_op(bio), start_time); 1048 } 1049 EXPORT_SYMBOL_GPL(bio_end_io_acct_remapped); 1050 1051 /** 1052 * blk_lld_busy - Check if underlying low-level drivers of a device are busy 1053 * @q : the queue of the device being checked 1054 * 1055 * Description: 1056 * Check if underlying low-level drivers of a device are busy. 
1057 * If the drivers want to export their busy state, they must set own 1058 * exporting function using blk_queue_lld_busy() first. 1059 * 1060 * Basically, this function is used only by request stacking drivers 1061 * to stop dispatching requests to underlying devices when underlying 1062 * devices are busy. This behavior helps more I/O merging on the queue 1063 * of the request stacking driver and prevents I/O throughput regression 1064 * on burst I/O load. 1065 * 1066 * Return: 1067 * 0 - Not busy (The request stacking driver should dispatch request) 1068 * 1 - Busy (The request stacking driver should stop dispatching request) 1069 */ 1070 int blk_lld_busy(struct request_queue *q) 1071 { 1072 if (queue_is_mq(q) && q->mq_ops->busy) 1073 return q->mq_ops->busy(q); 1074 1075 return 0; 1076 } 1077 EXPORT_SYMBOL_GPL(blk_lld_busy); 1078 1079 int kblockd_schedule_work(struct work_struct *work) 1080 { 1081 return queue_work(kblockd_workqueue, work); 1082 } 1083 EXPORT_SYMBOL(kblockd_schedule_work); 1084 1085 int kblockd_mod_delayed_work_on(int cpu, struct delayed_work *dwork, 1086 unsigned long delay) 1087 { 1088 return mod_delayed_work_on(cpu, kblockd_workqueue, dwork, delay); 1089 } 1090 EXPORT_SYMBOL(kblockd_mod_delayed_work_on); 1091 1092 void blk_start_plug_nr_ios(struct blk_plug *plug, unsigned short nr_ios) 1093 { 1094 struct task_struct *tsk = current; 1095 1096 /* 1097 * If this is a nested plug, don't actually assign it. 
1098 */ 1099 if (tsk->plug) 1100 return; 1101 1102 plug->mq_list = NULL; 1103 plug->cached_rq = NULL; 1104 plug->nr_ios = min_t(unsigned short, nr_ios, BLK_MAX_REQUEST_COUNT); 1105 plug->rq_count = 0; 1106 plug->multiple_queues = false; 1107 plug->has_elevator = false; 1108 plug->nowait = false; 1109 INIT_LIST_HEAD(&plug->cb_list); 1110 1111 /* 1112 * Store ordering should not be needed here, since a potential 1113 * preempt will imply a full memory barrier 1114 */ 1115 tsk->plug = plug; 1116 } 1117 1118 /** 1119 * blk_start_plug - initialize blk_plug and track it inside the task_struct 1120 * @plug: The &struct blk_plug that needs to be initialized 1121 * 1122 * Description: 1123 * blk_start_plug() indicates to the block layer an intent by the caller 1124 * to submit multiple I/O requests in a batch. The block layer may use 1125 * this hint to defer submitting I/Os from the caller until blk_finish_plug() 1126 * is called. However, the block layer may choose to submit requests 1127 * before a call to blk_finish_plug() if the number of queued I/Os 1128 * exceeds %BLK_MAX_REQUEST_COUNT, or if the size of the I/O is larger than 1129 * %BLK_PLUG_FLUSH_SIZE. The queued I/Os may also be submitted early if 1130 * the task schedules (see below). 1131 * 1132 * Tracking blk_plug inside the task_struct will help with auto-flushing the 1133 * pending I/O should the task end up blocking between blk_start_plug() and 1134 * blk_finish_plug(). This is important from a performance perspective, but 1135 * also ensures that we don't deadlock. For instance, if the task is blocking 1136 * for a memory allocation, memory reclaim could end up wanting to free a 1137 * page belonging to that request that is currently residing in our private 1138 * plug. By flushing the pending I/O when the process goes to sleep, we avoid 1139 * this kind of deadlock. 
1140 */ 1141 void blk_start_plug(struct blk_plug *plug) 1142 { 1143 blk_start_plug_nr_ios(plug, 1); 1144 } 1145 EXPORT_SYMBOL(blk_start_plug); 1146 1147 static void flush_plug_callbacks(struct blk_plug *plug, bool from_schedule) 1148 { 1149 LIST_HEAD(callbacks); 1150 1151 while (!list_empty(&plug->cb_list)) { 1152 list_splice_init(&plug->cb_list, &callbacks); 1153 1154 while (!list_empty(&callbacks)) { 1155 struct blk_plug_cb *cb = list_first_entry(&callbacks, 1156 struct blk_plug_cb, 1157 list); 1158 list_del(&cb->list); 1159 cb->callback(cb, from_schedule); 1160 } 1161 } 1162 } 1163 1164 struct blk_plug_cb *blk_check_plugged(blk_plug_cb_fn unplug, void *data, 1165 int size) 1166 { 1167 struct blk_plug *plug = current->plug; 1168 struct blk_plug_cb *cb; 1169 1170 if (!plug) 1171 return NULL; 1172 1173 list_for_each_entry(cb, &plug->cb_list, list) 1174 if (cb->callback == unplug && cb->data == data) 1175 return cb; 1176 1177 /* Not currently on the callback list */ 1178 BUG_ON(size < sizeof(*cb)); 1179 cb = kzalloc(size, GFP_ATOMIC); 1180 if (cb) { 1181 cb->data = data; 1182 cb->callback = unplug; 1183 list_add(&cb->list, &plug->cb_list); 1184 } 1185 return cb; 1186 } 1187 EXPORT_SYMBOL(blk_check_plugged); 1188 1189 void __blk_flush_plug(struct blk_plug *plug, bool from_schedule) 1190 { 1191 if (!list_empty(&plug->cb_list)) 1192 flush_plug_callbacks(plug, from_schedule); 1193 blk_mq_flush_plug_list(plug, from_schedule); 1194 /* 1195 * Unconditionally flush out cached requests, even if the unplug 1196 * event came from schedule. Since we know hold references to the 1197 * queue for cached requests, we don't want a blocked task holding 1198 * up a queue freeze/quiesce event. 
1199 */ 1200 if (unlikely(!rq_list_empty(plug->cached_rq))) 1201 blk_mq_free_plug_rqs(plug); 1202 } 1203 1204 /** 1205 * blk_finish_plug - mark the end of a batch of submitted I/O 1206 * @plug: The &struct blk_plug passed to blk_start_plug() 1207 * 1208 * Description: 1209 * Indicate that a batch of I/O submissions is complete. This function 1210 * must be paired with an initial call to blk_start_plug(). The intent 1211 * is to allow the block layer to optimize I/O submission. See the 1212 * documentation for blk_start_plug() for more information. 1213 */ 1214 void blk_finish_plug(struct blk_plug *plug) 1215 { 1216 if (plug == current->plug) { 1217 __blk_flush_plug(plug, false); 1218 current->plug = NULL; 1219 } 1220 } 1221 EXPORT_SYMBOL(blk_finish_plug); 1222 1223 void blk_io_schedule(void) 1224 { 1225 /* Prevent hang_check timer from firing at us during very long I/O */ 1226 unsigned long timeout = sysctl_hung_task_timeout_secs * HZ / 2; 1227 1228 if (timeout) 1229 io_schedule_timeout(timeout); 1230 else 1231 io_schedule(); 1232 } 1233 EXPORT_SYMBOL_GPL(blk_io_schedule); 1234 1235 int __init blk_dev_init(void) 1236 { 1237 #ifdef CONFIG_BLK_MQ_USE_LOCAL_THREAD 1238 const char *config = of_blk_feature_read("kblockd_ux_unbound_enable"); 1239 #endif 1240 BUILD_BUG_ON((__force u32)REQ_OP_LAST >= (1 << REQ_OP_BITS)); 1241 BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 * 1242 sizeof_field(struct request, cmd_flags)); 1243 BUILD_BUG_ON(REQ_OP_BITS + REQ_FLAG_BITS > 8 * 1244 sizeof_field(struct bio, bi_opf)); 1245 BUILD_BUG_ON(ALIGN(offsetof(struct request_queue, srcu), 1246 __alignof__(struct request_queue)) != 1247 sizeof(struct request_queue)); 1248 1249 #ifdef CONFIG_BLK_MQ_USE_LOCAL_THREAD 1250 if (config && strcmp(config, "y") == 0) 1251 kblockd_workqueue = alloc_workqueue("kblockd", 1252 WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_UX | WQ_UNBOUND, 0); 1253 else 1254 #endif 1255 /* used for unplugging and affects IO latency/throughput - HIGHPRI */ 1256 kblockd_workqueue = 
alloc_workqueue("kblockd", 1257 WQ_MEM_RECLAIM | WQ_HIGHPRI, 0); 1258 if (!kblockd_workqueue) 1259 panic("Failed to create kblockd\n"); 1260 1261 blk_requestq_cachep = kmem_cache_create("request_queue", 1262 sizeof(struct request_queue), 0, SLAB_PANIC, NULL); 1263 1264 blk_requestq_srcu_cachep = kmem_cache_create("request_queue_srcu", 1265 sizeof(struct request_queue) + 1266 sizeof(struct srcu_struct), 0, SLAB_PANIC, NULL); 1267 1268 blk_debugfs_root = debugfs_create_dir("block", NULL); 1269 blk_mq_debugfs_init(); 1270 1271 return 0; 1272 } 1273 served by {OpenGrok Last Index Update: Sat Nov 22 16:34:38 CST 2025 submit_bio的代码流程,怎么提交为request
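In the listing, `submit_bio()` calls `submit_bio_noacct()`, which after its checks calls `submit_bio_noacct_nocheck()`. From there, `__submit_bio()` hands the bio to `blk_mq_submit_bio()` when the disk has no `->submit_bio` method. That function is outside this listing and is where a `struct request` is normally set up for the bio. A nested `submit_bio_noacct()` call made while a submission is already active does not recurse: the bio is parked on `current->bio_list` and the outer loop picks it up afterwards. The userspace sketch below models only that queue-instead-of-recurse loop, in the simpler single-list form of `__submit_bio_noacct_mq()`. Every name in it (`toy_bio`, `toy_list`, `toy_submit`, `toy_driver_submit`) is hypothetical and none of it is kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct bio: only what the sketch needs. */
struct toy_bio {
	struct toy_bio *next;
	int level;	/* stacking depth, 0 is the top-level device */
	int id;
};

/* Hypothetical stand-in for struct bio_list: a singly linked FIFO. */
struct toy_list {
	struct toy_bio *head, *tail;
};

static void toy_list_add(struct toy_list *l, struct toy_bio *b)
{
	b->next = NULL;
	if (l->tail)
		l->tail->next = b;
	else
		l->head = b;
	l->tail = b;
}

static struct toy_bio *toy_list_pop(struct toy_list *l)
{
	struct toy_bio *b = l->head;

	if (b) {
		l->head = b->next;
		if (!l->head)
			l->tail = NULL;
		b->next = NULL;
	}
	return b;
}

/* Plays the role of current->bio_list: non-NULL while a submit is active. */
static struct toy_list *active_list;

/* Bookkeeping so the behaviour can be observed. */
static int cur_depth, max_depth;
static int order[16], n_order;

static void toy_submit(struct toy_bio *bio);

/*
 * Plays the role of a stacking driver's ->submit_bio: a bio above level 2
 * is remapped to a child bio one level further down and resubmitted.
 */
static void toy_driver_submit(struct toy_bio *bio)
{
	static struct toy_bio children[8];
	static int used;

	if (++cur_depth > max_depth)
		max_depth = cur_depth;
	assert(n_order < 16);
	order[n_order++] = bio->id;

	if (bio->level < 2) {
		struct toy_bio *child;

		assert(used < 8);
		child = &children[used++];
		child->level = bio->level + 1;
		child->id = bio->id * 10 + 1;
		toy_submit(child);	/* nested call: only queues the bio */
	}
	cur_depth--;
}

static void toy_submit(struct toy_bio *bio)
{
	struct toy_list on_stack = { NULL, NULL };

	/* A submission is already active: park the bio and return. */
	if (active_list) {
		toy_list_add(active_list, bio);
		return;
	}

	active_list = &on_stack;
	do {
		toy_driver_submit(bio);
	} while ((bio = toy_list_pop(&on_stack)));
	active_list = NULL;
}
```

Calling `toy_submit()` on a level-0 bio processes the whole three-level chain while the driver function is never entered more than one level deep, which is the stack-usage property the kernel comment is after.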
For a Hisilicon chip whose partition table is invalid or whose partitions are not enabled, the following step-by-step approach may help:

### 1. Verify the partition table definition file

1. Locate the partition table configuration file:

   `find device/hisilicon/Hi3751V350/ -name "partition*.xml"`

   Confirm that the file contains valid partition definitions (start address, size, type)[^1].

2. A typical partition definition looks like this:

```xml
<partition name="boot" start="0x100000" size="0x200000" type="binary"/>
<partition name="system" start="0x300000" size="0x800000" type="ext4"/>
```

### 2. Check the build configuration

1. Confirm in `device_property.mk` that partitions are enabled:

```makefile
PRODUCT_ENABLE_PARTITIONS := true # must be set to true
```

   This flag controls whether the partition table is compiled into the firmware[^2].

2. Verify the storage device configuration:

```makefile
BOARD_USE_EMMC := true # must match the actual storage medium
```

### 3. Debugging

1. **Enable verbose boot logging** by changing the log level in `hi_log.h`:

```c
#define HI_LOG_LEVEL HI_LOG_LEVEL_VERBOSE // the original level is usually HI_LOG_LEVEL_INFO
```

   The log then shows any errors raised while the partitions are loaded[^1].

2. **Verify with the flashing tool**:

```shell
hisilicon-flashtool --verify partition.xml # check that the flashed file is compatible with the target chip
```

### 4. Hardware-level checks

1. Use a JTAG tool to read the partition header:

   `ReadMemory 0x0 0x100`

   Verify that the first 256 bytes contain a valid partition signature, `0xAA55`.

2. Measure the supply voltage of the storage medium:

   $$ V_{cc} \geq 3.3V \pm 0.2V $$

   An unstable voltage can cause reads of the partition information to fail.
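The signature check in the hardware step can also be done offline on a saved dump of the header bytes. The sketch below is a generic byte search and makes two assumptions that the text does not confirm: that the signature is stored little-endian (byte `0x55` followed by `0xAA`), and that it may sit anywhere in the dumped region. The function name `find_signature_le` is hypothetical. For comparison, a classic MBR keeps its `0x55 0xAA` marker at byte offset 510, which lies outside a 256-byte read, so the vendor's actual header layout should be checked against its documentation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Return the offset of the first little-endian occurrence of `sig`
 * in buf[0..len), or -1 if it does not occur.
 */
static long find_signature_le(const uint8_t *buf, size_t len, uint16_t sig)
{
	const uint8_t lo = (uint8_t)(sig & 0xff);
	const uint8_t hi = (uint8_t)(sig >> 8);
	size_t i;

	assert(buf != NULL);

	/* i + 1 < len keeps the two-byte window inside the buffer. */
	for (i = 0; i + 1 < len; i++) {
		if (buf[i] == lo && buf[i + 1] == hi)
			return (long)i;
	}
	return -1;
}
```

A caller would load the dumped bytes into a buffer and pass `0xAA55` as `sig`. A return value of -1 means the byte pair is absent from that region under the assumptions above.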