File Descriptor Leaks in Long-Running Programs

So you’ve created a socket server that opens sockets over TCP/IP. After running it for a few hours under heavy load, your process dies with the error “Too many open files”.

What happened?

The problem

On most POSIX operating systems, including Linux and OS X, each process is allocated a fixed table of file handles, or file descriptors. This is typically about 1024 handles. When you try to open the 1,025th file descriptor, the operation fails with the “Too many open files” error.

The problem is almost certainly that you are leaking file handles. That is, handles are being opened, and after you are done with them they are not closed.

Leaked file handles can come from many sources, not just open files. Some common sources are:

  • Sockets: Every socket also opens a file handle. Sockets need to be closed even if the remote party closes the connection; the handle is not automatically released when the connection closes (see the sketch after this list).
  • Pipes: If you spawn sub-processes that communicate through pipes, the pipe ends often have to be duplicated (with dup() or dup2()) and the right parts of the pipe need to be closed. If this isn’t done properly the server will leak file handles.
  • Database connections: When communicating with a database, even through a third-party library, each connection will open some kind of file handle. SQL databases will typically open network sockets. File-oriented databases like BerkeleyDB will open file handles, but the communication can also happen through named pipes. Each database API will have a call to close the connection.
  • Windows HANDLEs: On Windows there are a nearly infinite number of objects that hold handles. Each one must be closed properly.
  • Files: You can also leak handles the old-fashioned way, by failing to close() handles to regular files.
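
To make the sockets point concrete, here is a minimal sketch (assuming plain POSIX sockets with blocking I/O) showing that the local descriptor survives the peer’s close and must be released explicitly:

#include <sys/socket.h>
#include <unistd.h>
 
// When recv() returns 0 the remote party has closed the connection, but
// the local file descriptor stays allocated until we close() it ourselves.
void handleConnection( int client_fd )
{
    char buf[4096];
    for (;;)
    {
        ssize_t n = recv( client_fd, buf, sizeof(buf), 0 );
        if ( n <= 0 ) break;     // 0 = peer closed, -1 = error
        // ... process the n bytes in buf ...
    }
    close( client_fd );          // skip this and the handle leaks
}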

One very sneaky source of open file handles is when you fork a process, or execute a subprocess. By default the new process inherits a copy of every open file handle. Even after the new process replaces itself with a new program image, and all the data structures of the old process are wiped away, the old process’ file handles will remain, unless they were set with the FD_CLOEXEC flag. More on this later.

Wrong Answers, Myths and Bad Ideas

Raise the file handle limit

One common answer to this problem is to just raise the limit of open file handles and then restart the server every day or every few hours.

On Linux, the system-wide limit on open file handles can be read from:

cat /proc/sys/fs/file-max

The per-process limit is reported by the shell with:

ulimit -n

To raise the per-process limit:

ulimit -n 4096

This will delay the problem but likely will not fix it. It is possible that your program is not leaking and has a legitimate need to hold a large number of file handles. But if your program is designed correctly there usually isn’t a need to keep a large number of handles open – even if you have thousands of simultaneous connections. We’ll discuss some methods of managing that later.

If this were a good idea, the operating system would already come configured with a higher file descriptor limit. If this were necessary, Apache would require you to raise this limit before running.

Myth: Sockets in TCP TIME_WAIT are holding file handles hostage

When you close a TCP/IP socket the operating system does not release the socket right away. For complex reasons, the socket structure must be kept out of circulation for a few minutes because there is a small chance that an IP packet may arrive on that socket after it has been closed. If the operating system re-used the socket then the new user of that connection would have their session affected by someone else’s lost packets.

But this does not hold a file handle open. When you close the socket’s file descriptor, the file descriptor itself is closed immediately. You will not get the “Too many open files” error. If you have too many sockets in TIME_WAIT, your server may stop accepting new connections. There are ways to deal with that (allowing the local address to be re-used, as in the sketch below, or lowering the TCP TIME_WAIT interval) – but raising the file handle limit isn’t one of them.
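
For illustration, a minimal sketch of the first option (POSIX sockets assumed): set SO_REUSEADDR on the listening socket, before bind(), so a restarted server can bind an address whose old sockets are still in TIME_WAIT:

#include <sys/socket.h>
 
// Returns 0 on success, -1 on error. Must run before bind().
int enableAddressReuse( int listen_fd )
{
    int yes = 1;
    return setsockopt( listen_fd, SOL_SOCKET, SO_REUSEADDR,
                       &yes, sizeof(yes) );
}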

Myth: It takes time for file handles to be released

This is related to the TCP TIME_WAIT myth: the mistaken belief that when you close a file handle you must wait some time for the operating system to release it.

Closing a file handle calls into whatever OS method releases the resource, and the OS will release that resource either immediately or, as in the case of sockets, sometime later – but close() frees the slot in the file handle table immediately. Your process is in complete control of its file handle table and doesn’t need to wait for anything to free a slot in it. The short program below demonstrates this.
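
A quick way to convince yourself: POSIX guarantees that open() returns the lowest-numbered free descriptor, so a freshly closed slot is reused immediately. A minimal demonstration (assuming /etc/hostname exists and is readable):

#include <fcntl.h>
#include <unistd.h>
#include <cstdio>
 
int main()
{
    int a = open( "/etc/hostname", O_RDONLY );
    printf( "first open:  fd %d\n", a );
    close( a );                            // the slot is free immediately
 
    int b = open( "/etc/hostname", O_RDONLY );
    printf( "second open: fd %d\n", b );   // prints the same number as 'a'
    close( b );
    return 0;
}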

Finding and Fixing File Handle Leaks

Finding all your open file descriptors is easy. When you print them out it will probably be clear what type of handles are not being closed.

File handles are integers from 0 to the handle limit minus one. So if you can find the handle limit, you can simply query each one for information. getdtablesize() returns the size of the file descriptor table for the current process. Once you know that, you iterate over all the integers, making any file-handle OS call you like, such as fcntl( i, F_GETFD ). If the call is successful, the handle is currently open:

#include <unistd.h>    // getdtablesize(), readlink()
#include <fcntl.h>     // fcntl(), the O_* flags, struct flock
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <iostream>
 
using std::cerr;
 
typedef int32_t s32;          // fixed-width alias used by this code
 
void showFDInfo( s32 fd );    // per-descriptor version, defined below
 
void showFDInfo()
{
    s32 numHandles = getdtablesize();
 
    for ( s32 i = 0; i < numHandles; i++ )
    {
       s32 fd_flags = fcntl( i, F_GETFD );
       if ( fd_flags == -1 ) continue;   // this slot is not an open descriptor
 
       showFDInfo( i );
    }
}

For each valid file descriptor, you query it to get as much information about it as you can. Dumping its name and flags will help you identify its likely source:

void showFDInfo( s32 fd )
{
    char buf[256];
 
    s32 fd_flags = fcntl( fd, F_GETFD );
    if ( fd_flags == -1 ) return;
 
    s32 fl_flags = fcntl( fd, F_GETFL );
    if ( fl_flags == -1 ) return;
 
    char path[256];
    snprintf( path, sizeof(path), "/proc/self/fd/%d", fd );
 
    memset( &buf[0], 0, sizeof(buf) );
    ssize_t s = readlink( path, &buf[0], sizeof(buf) - 1 );  // leave room for the NUL
    if ( s == -1 )
    {
         cerr << " (" << path << "): " << "not available\n";
         return;
    }
    cerr << fd << " (" << buf << "): ";
 
    if ( fd_flags & FD_CLOEXEC )  cerr << "cloexec ";
 
    // file status
    if ( fl_flags & O_APPEND   )  cerr << "append ";
    if ( fl_flags & O_NONBLOCK )  cerr << "nonblock ";
 
    // access mode: a two-bit field, not independent flags, so mask it with
    // O_ACCMODE (O_RDONLY is 0 and can never be tested with &)
    switch ( fl_flags & O_ACCMODE )
    {
       case O_RDONLY:  cerr << "read-only ";  break;
       case O_WRONLY:  cerr << "write-only "; break;
       case O_RDWR:    cerr << "read-write "; break;
    }
 
    if ( fl_flags & O_DSYNC    )  cerr << "dsync ";
    if ( fl_flags & O_RSYNC    )  cerr << "rsync ";
    if ( fl_flags & O_SYNC     )  cerr << "sync ";
 
    // report any advisory lock held on the file
    struct flock fl;
    fl.l_type = F_WRLCK;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;
    if ( fcntl( fd, F_GETLK, &fl ) != -1 && fl.l_type != F_UNLCK )
    {
       if ( fl.l_type == F_WRLCK )
          cerr << "write-locked";
       else
          cerr << "read-locked";
       cerr << " (pid:" << fl.l_pid << ")";
    }
    cerr << "\n";
}

In /proc/self/fd you will find a descriptive name for every file handle; the readlink() call above retrieves it. The flags can also help you narrow down likely culprits.

The output looks something like this:

open file descriptors:
    
    0     (/dev/pts/2):                              read-write
    1     (/dev/pts/2):                              read-write
    2     (/dev/pts/2):                              read-write
    3     (/blar/blar/blar/system.log):             cloexec append read-write
    4     (/blar/blar/blar/access.log):             cloexec append read-write
    5     (/blar/blar/blar/abuse.log):              cloexec append read-write
    6     (anon_inode:[eventpoll]):                  cloexec read-write
    7     (anon_inode:[eventfd]):                    cloexec nonblock read-write
    8     (socket:[49863900]):                       cloexec nonblock read-write
    9     (/blar/blar/blar/db/db_err):               cloexec append read-write
    10    (/blar/blar/blar/db/blar.db):              cloexec read-write
    11    (/blar/blar/blar/log.0000000001):         cloexec read-write
    12    (socket:[50603354]):                       cloexec nonblock read-write
    13    (socket:[50603364]):                       cloexec nonblock read-write
    14    (socket:[50603365]):                       cloexec nonblock read-write
    15    (socket:[50603366]):                       cloexec nonblock read-write
    16    (socket:[50603367]):                       cloexec nonblock read-write
    17    (socket:[50603368]):                       cloexec nonblock read-write

This is from one of my own projects. Notice that all the file descriptors except the first three are set to close-on-exec.

If you still can’t figure it out, you could log every call that creates file descriptors ( e.g. fopen(), socket(), accept(), dup(), pipe() ) along with __FILE__ and __LINE__, as in the sketch below.
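
One way to do that, as a sketch: wrap each descriptor-creating call in a macro that logs the returned descriptor along with the call site. TRACK_FD and trackFD are hypothetical names here, not from any library:

#include <cstdio>
 
// Logs every tracked descriptor with the expression that created it and
// the source location, then passes the descriptor through unchanged.
static inline int trackFD( int fd, const char *expr, const char *file, int line )
{
    fprintf( stderr, "fd %d <- %s (%s:%d)\n", fd, expr, file, line );
    return fd;
}
 
#define TRACK_FD( call )  trackFD( (call), #call, __FILE__, __LINE__ )
 
// usage:
//   int s = TRACK_FD( socket( AF_INET, SOCK_STREAM, 0 ) );
//   int f = TRACK_FD( open( "data.txt", O_RDONLY ) );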

Limiting Open File Handles with Thousands of Sockets

If you have thousands of simultaneous network requests you can limit the number of open file descriptors by simply not accept()ing every connection at once.

If you really need to keep thousands of TCP connections open simultaneously then you really do need to raise the file descriptor limit. But if your server is request based, like a web server, then don’t accept() them all at once. Pending connections wait in the kernel’s listen backlog, holding a connection open – but not consuming a file descriptor in your process – until you are ready to handle them.

If you have this problem then you should be using one of the many event-handling APIs (like select() or epoll). These APIs will notify you when a socket is ready, so you only accept connections as you have system resources available, as the sketch below illustrates.
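
Here is a rough sketch using Linux epoll (level-triggered). It assumes listen_fd is already a bound, listening, non-blocking socket, and max_fds_in_use is a descriptor budget you pick for your own process – neither is a system constant:

#include <sys/epoll.h>
#include <sys/socket.h>
 
void acceptLoop( int listen_fd, int max_fds_in_use )
{
    int ep = epoll_create1( 0 );
    struct epoll_event ev = {};
    ev.events  = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl( ep, EPOLL_CTL_ADD, listen_fd, &ev );
 
    int fds_in_use = 0;
    for (;;)
    {
        struct epoll_event events[64];
        int n = epoll_wait( ep, events, 64, -1 );
        for ( int i = 0; i < n; i++ )
        {
            if ( events[i].data.fd != listen_fd )
            {
                // ... service a client fd; decrement fds_in_use on close() ...
                continue;
            }
            if ( fds_in_use >= max_fds_in_use )
                continue;                    // leave it in the kernel backlog
            int client = accept( listen_fd, NULL, NULL );
            if ( client != -1 )
            {
                fds_in_use++;
                // ... set client non-blocking and EPOLL_CTL_ADD it to ep ...
            }
        }
    }
}

One wrinkle the sketch glosses over: with level-triggered epoll a readable listen socket keeps waking the loop while you are at the budget, so a real server would temporarily remove the listen descriptor from the epoll set until a slot frees up.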

Also if you go this route, it is better to use a library like libev or libevent. Each operating system provides different event handling APIs. These libraries make it easier to use the best API for each system.

Dealing with Duplicate File Descriptors from Sub-Processes

When you spawn subprocesses they inherit all the open file descriptors of the parent process, unless those descriptors have been specifically flagged FD_CLOEXEC.

This is because the operating system doesn’t know which file descriptors the subprocess will use for inter-process communication, or whether the subprocess will be the one to handle an open network socket.

Normally when you spawn subprocesses the only file descriptors that need to stay open are the pipes connected to STDIN, STDOUT, and STDERR of the child process. But every subprocess is different.

On POSIX systems, you flag a file descriptor to not be duplicated on exec like this:

bool setCloExec( s32 fd, bool val )
{
    s32 flags = fcntl( fd, F_GETFD );
    if ( flags == -1 ) return false ;
 
    if ( val )
    {
       if ( fcntl( fd, F_SETFD, flags |  FD_CLOEXEC ) == -1 ) return false ;
    } else {
       if ( fcntl( fd, F_SETFD, flags & ~FD_CLOEXEC ) == -1 ) return false ;
    }
 
    return true ;
}

On Win32 you do it like this:

bool setCloExec( HANDLE handle, bool val )
{
    // HANDLE_FLAG_INHERIT set means the handle IS inherited by child
    // processes, so "close on exec" corresponds to clearing the flag.
    DWORD flags = val ? 0 : HANDLE_FLAG_INHERIT;
    if ( SetHandleInformation( handle, HANDLE_FLAG_INHERIT, flags ) )
       return true;
 
    assert( 0 );   // failed to set the inheritance flag
    return false;
}

Right before your server calls fork() or exec(), call showFDInfo() to make sure that all file descriptors are flagged FD_CLOEXEC except for the ones you need to be duplicated.

Dealing with Multiple Database Connections

Don’t open a new database connection for every database request. This is wasteful and unnecessary. Each database connection may open a new TCP/IP socket, and every connection you close leaves a socket lingering in TIME_WAIT for a while – so churning through connections inflates both your socket count and your file descriptor usage.

Open just one, or a limited number of database connections, and pass the open database connection to each request-handling thread, protected by a mutex of course. A minimal sketch of that pattern follows.
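
In the sketch, DbConnection and Result are hypothetical stand-ins for whatever types your database client library actually provides:

#include <mutex>
#include <string>
 
// Hypothetical stand-ins for your database client library's types.
struct Result { /* rows, status, ... */ };
struct DbConnection
{
    Result query( const std::string & /*sql*/ ) { return Result(); }
};
 
// One connection, opened at startup, shared by all request threads.
class SharedDbConnection
{
public:
    explicit SharedDbConnection( DbConnection *conn ) : conn_( conn ) {}
 
    Result query( const std::string &sql )
    {
        std::lock_guard<std::mutex> lock( mutex_ );   // one request at a time
        return conn_->query( sql );
    }
 
private:
    DbConnection *conn_;   // closed exactly once, at shutdown
    std::mutex    mutex_;
};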

Closing File Descriptors when Spawning Subprocesses

To communicate with subprocesses you very often open pipes that connect the two processes, typically by passing three pipes connected to STDIN, STDOUT, and STDERR of the child process.

Before the process is forked or a new process is spawned, you create a pipe and then duplicate it. You then have two read ends and two write ends. Each process then closes three of the associated handles. So for a pipe that writes from the parent to the child, the parent process closes the two read ends and one of the write ends, while the child process closes the two write ends and one of the read ends. At the end of the spawn the parent process is left with a write file descriptor, and the child with a read file descriptor. When the child terminates it doesn’t need to close its read end, but the parent must still close its write end. A sketch of this dance follows.
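
As a concrete illustration, here is a minimal sketch of the simpler variant for a parent-to-child pipe, where the duplication happens in the child via dup2() after the fork (error handling abbreviated):

#include <unistd.h>
 
// Returns the write end the parent keeps, or -1 on failure. The parent
// must close() the returned descriptor when it is done writing.
int spawnChildWithPipedStdin( const char *program )
{
    int p[2];
    if ( pipe( p ) == -1 ) return -1;   // p[0] = read end, p[1] = write end
 
    pid_t pid = fork();
    if ( pid == -1 ) return -1;
 
    if ( pid == 0 )                     // child
    {
        close( p[1] );                  // child never writes to this pipe
        dup2( p[0], STDIN_FILENO );     // the read end becomes the child's stdin
        close( p[0] );                  // the original descriptor is no longer needed
        execlp( program, program, (char *)NULL );
        _exit( 127 );                   // only reached if exec failed
    }
 
    close( p[0] );                      // parent never reads from this pipe
    return p[1];
}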

The best explanation of how to use pipes for inter-process communication is here.
You can find more details about how to do this on Windows here and how to do this on Linux and Os X here.

