Troubleshooting notes: pthread_cond_destroy hangs in a deadlock

Original article. First published 2022-09-27 17:06:03, last modified 2022-09-27 17:28:45.

License: CC 4.0 BY-SA

Tags: #pthread_cond

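This article looks at a problem hit while destroying a condition variable, in particular the undefined behavior that results when other threads are still waiting on it. Reading the glibc source shows that the root cause was using the condition variable before it had been initialized; the correct fix is described at the end.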

Problem

In vendor code, a thread blocks while destroying a condition variable during its exit path.

Reference manual

According to the man page, destroying a cond that other threads are still waiting on results in undefined behavior:

pthread_cond_destroy()
It  shall be safe to destroy an initialized condition variable upon which no threads are currently blocked. Attempting to destroy a condition variable upon which other threads are currently blocked results in undefined behavior.

So before destroying, call pthread_cond_broadcast(&pEvent->cond); to wake all waiting threads first:

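int32_t osEventDestroy(osEvent *pEvent)
{
    /* Wake every waiter while holding the mutex. */
    pthread_mutex_lock(&pEvent->mutex);
    pthread_cond_broadcast(&pEvent->cond);
    pthread_mutex_unlock(&pEvent->mutex);

    pthread_cond_destroy(&pEvent->cond);
    pthread_mutex_destroy(&pEvent->mutex);
    return 0; /* assumed; the original excerpt ends before the return */
}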

After retesting, the problem still occurred intermittently.

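Reading the source

According to the comments in the source, __pthread_cond_destroy assumes that every waiter still accessing the condvar has already been woken, so it waits until the waiter reference count held in __wrefs drops to zero: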


/* See __pthread_cond_wait for a high-level description of the algorithm.

   A correct program must make sure that no waiters are blocked on the condvar
   when it is destroyed, and that there are no concurrent signals or
   broadcasts.  To wake waiters reliably, the program must signal or
   broadcast while holding the mutex or after having held the mutex.  It must
   also ensure that no signal or broadcast are still pending to unblock
   waiters; IOW, because waiters can wake up spuriously, the program must
   effectively ensure that destruction happens after the execution of those
   signal or broadcast calls.
   Thus, we can assume that all waiters that are still accessing the condvar
   have been woken.  We wait until they have confirmed to have woken up by
   decrementing __wrefs.  */
int
__pthread_cond_destroy (pthread_cond_t *cond)
{
  LIBC_PROBE (cond_destroy, 1, cond);

  /* Set the wake request flag.  We could also spin, but destruction that is
     concurrent with still-active waiters is probably neither common nor
     performance critical.  Acquire MO to synchronize with waiters confirming
     that they finished.  */
  unsigned int wrefs = atomic_fetch_or_acquire (&cond->__data.__wrefs, 4);
  int private = __condvar_get_private (wrefs);
  while (wrefs >> 3 != 0)
    {
      futex_wait_simple (&cond->__data.__wrefs, wrefs, private);
      /* See above.  */
      wrefs = atomic_load_acquire (&cond->__data.__wrefs);
    }
  /* The memory the condvar occupies can now be reused.  */
  return 0;
}

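Printing __wrefs just before the destroy call showed -8, a value that makes no sense for a reference count. Forcing it to zero made the problem disappear.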

int32_t osEventDestroy(osEvent *pEvent)
{
    pthread_mutex_lock(&pEvent->mutex);
    pthread_cond_broadcast(&pEvent->cond);
    pthread_mutex_unlock(&pEvent->mutex);

    /* Diagnostic workaround only: __data.__wrefs is a glibc-internal field. */
    if (0 != pEvent->cond.__data.__wrefs)
    {
        OSLAYER_ERR("%p %s cond error with refs %d\n", pEvent, __func__, pEvent->cond.__data.__wrefs);
        pEvent->cond.__data.__wrefs = 0;
    }
    pthread_cond_destroy(&pEvent->cond);
    pthread_mutex_destroy(&pEvent->mutex);
    return 0; /* assumed; the original excerpt ends before the return */
}
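
The reading of -8 is enough to explain the hang. In glibc, __wrefs is an unsigned int whose low 3 bits are flags, so -8 printed with %d is the bit pattern 0xFFFFFFF8, and the loop condition wrefs >> 3 != 0 in the source quoted above stays true. The sketch below only reproduces that arithmetic; waiter_refs and destroy_would_block are hypothetical helper names, not glibc functions.

```c
#include <stdint.h>

/* Hypothetical helper: the waiter reference count that the destroy loop
   tests. The low 3 bits of __wrefs are flags, so the count is wrefs >> 3. */
uint32_t waiter_refs(uint32_t wrefs)
{
    return wrefs >> 3;
}

/* Hypothetical helper: returns 1 if the while loop in
   __pthread_cond_destroy would keep calling futex_wait_simple. */
int destroy_would_block(uint32_t wrefs)
{
    return waiter_refs(wrefs) != 0;
}
```

Calling destroy_would_block((uint32_t)-8) returns 1: the destroy loop sees a huge apparent waiter count, and since no real waiter exists to decrement it, the call never returns.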

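Tracking down the root cause

A review of the code showed that no other threads were using the variable, so why was the value corrupted?
It turned out that the code started the waiting thread, which calls pthread_cond_wait, before it initialized the cond.
In other words, the condition variable was still uninitialized when it was passed to pthread_cond_wait, and pthread_cond_init ran only after pthread_cond_wait had already been entered.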
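
One plausible way to arrive at exactly -8 under this ordering is sketched below. It assumes glibc's scheme in which each waiter adds 8 to __wrefs on entering pthread_cond_wait and subtracts 8 after waking, and it assumes the late pthread_cond_init resets the field to 0 in between. The code only models the counter with a hypothetical struct fake_cond; it does not call any pthread function, because waiting on an uninitialized condvar is undefined behavior.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the one condvar field that matters here. */
struct fake_cond {
    uint32_t wrefs;
};

/* Models the order found in the vendor code: the wait begins, then init
   runs, then the waiter wakes up and drops its reference. */
int32_t simulate_wait_before_init(void)
{
    struct fake_cond c;
    memset(&c, 0, sizeof c);   /* assumption: the memory happened to be zeroed */

    c.wrefs += 8;              /* waiter enters the wait: one reference added */
    memset(&c, 0, sizeof c);   /* late init wipes the state, losing that reference */
    c.wrefs -= 8;              /* waiter wakes and releases a reference that is gone */

    return (int32_t)c.wrefs;   /* the value as %d would print it */
}
```

On common platforms simulate_wait_before_init() returns -8, matching the value logged before the destroy call.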

After reordering the code so that the cond is initialized before the waiting thread starts, the problem went away.
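
The corrected ordering can be sketched as follows. This is a minimal illustration, not the vendor's code: demo_event, waiter_main and run_event_lifecycle are hypothetical names, and the signaled flag is an added predicate to guard against spurious wakeups. The waiter is also joined before destruction, so that no thread can still be blocked on the cond when pthread_cond_destroy runs.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical event type; the vendor's osEvent layout is not shown. */
typedef struct {
    pthread_mutex_t mutex;
    pthread_cond_t  cond;
    bool            signaled;   /* predicate checked around the wait */
} demo_event;

static void *waiter_main(void *arg)
{
    demo_event *ev = arg;
    pthread_mutex_lock(&ev->mutex);
    while (!ev->signaled)
        pthread_cond_wait(&ev->cond, &ev->mutex);
    pthread_mutex_unlock(&ev->mutex);
    return NULL;
}

/* Runs one init -> wait -> broadcast -> join -> destroy cycle.
   Returns 0 on success, otherwise the first non-zero pthread error code. */
int run_event_lifecycle(void)
{
    demo_event ev;
    pthread_t waiter;
    int rc;

    /* 1. Initialize everything before any thread can touch it. */
    ev.signaled = false;
    if ((rc = pthread_mutex_init(&ev.mutex, NULL)) != 0)
        return rc;
    if ((rc = pthread_cond_init(&ev.cond, NULL)) != 0) {
        pthread_mutex_destroy(&ev.mutex);
        return rc;
    }

    /* 2. Only now start the thread that waits on the cond. */
    rc = pthread_create(&waiter, NULL, waiter_main, &ev);
    if (rc == 0) {
        /* 3. Set the predicate and wake waiters while holding the mutex. */
        pthread_mutex_lock(&ev.mutex);
        ev.signaled = true;
        pthread_cond_broadcast(&ev.cond);
        pthread_mutex_unlock(&ev.mutex);

        /* 4. Wait until the waiter has really left pthread_cond_wait. */
        rc = pthread_join(waiter, NULL);
    }

    /* 5. No thread uses the objects any more, so destroying is safe. */
    int rc_cond = pthread_cond_destroy(&ev.cond);
    int rc_mutex = pthread_mutex_destroy(&ev.mutex);
    if (rc == 0)
        rc = rc_cond != 0 ? rc_cond : rc_mutex;
    return rc;
}
```

On older glibc versions this needs to be built with -pthread.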