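Useful HTCondor Resources

This article covers choosing between TCP and UDP in HTCondor configuration, diagnosing and fixing permission problems, and the specific steps for running jobs only on nodes with a particular software environment.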

Getting started: Creating a multiple node Condor pool

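[1] By default, HTCondor daemons use TCP to send updates to the condor_collector. The collector that daemons report to is set by COLLECTOR_HOST; CONDOR_VIEW_HOST names the additional condor_collector daemons to which the collector forwards updates.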


UPDATE_COLLECTOR_WITH_TCP


    Defaults to True. The HTCondor daemons use TCP to update the condor_collector. If set to False, the HTCondor daemons use UDP to update the condor_collector instead.


UPDATE_VIEW_COLLECTOR_WITH_TCP


     When set to True, the HTCondor collector will use TCP to forward updates to condor_collector daemons specified by CONDOR_VIEW_HOST, instead of the default UDP. Defaults to False.


TCP_UPDATE_COLLECTORS


    A list of condor_collector daemons that are updated with TCP instead of UDP when UPDATE_COLLECTOR_WITH_TCP or UPDATE_VIEW_COLLECTOR_WITH_TCP is set to False.
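
How the three settings interact can be read as a small decision rule. The following Python sketch is not HTCondor code: `choose_transport` and its arguments are hypothetical names used only to illustrate the rule as described above, and comparing collectors by host name string is a simplifying assumption.

```python
def choose_transport(collector, update_with_tcp=True, tcp_update_collectors=()):
    """Return "TCP" or "UDP" for updates sent to `collector`.

    update_with_tcp stands in for UPDATE_COLLECTOR_WITH_TCP (or for
    UPDATE_VIEW_COLLECTOR_WITH_TCP when forwarding to a view collector).
    tcp_update_collectors stands in for TCP_UPDATE_COLLECTORS.
    """
    if update_with_tcp:
        return "TCP"
    # Assumption for illustration: host names are compared case-insensitively.
    listed = {name.strip().lower() for name in tcp_update_collectors}
    return "TCP" if collector.strip().lower() in listed else "UDP"


# Hypothetical host names, used only for the demonstration.
print(choose_transport("cm.example.org"))
print(choose_transport("cm.example.org", update_with_tcp=False))
print(choose_transport("view.example.org", update_with_tcp=False,
                       tcp_update_collectors=["view.example.org"]))
```

In this sketch TCP_UPDATE_COLLECTORS only matters once the corresponding boolean is False, which matches the description of the setting.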
 

[2] An error message within the collector log of the form

DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 0 (UPDATE_STARTD_AD)

indicates a permissions problem. The condor_startd daemons do not have write permission to the condor_collector daemon. This could be because you used domain names in your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros, but the domain name server (DNS) is not properly configured at your site. Without the proper configuration, HTCondor cannot resolve the IP addresses of your machines into fully-qualified domain names (an inverse look up). If this is the problem, then the solution takes one of two forms:

  1. Fix the DNS so that inverse look ups (trying to get the domain name from an IP address) works for your machines. You can either fix the DNS itself, or use the DEFAULT_DOMAIN_NAME setting in your HTCondor configuration file.
  2. Use numeric IP addresses in the HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros instead of domain names. As an example of this, assume your site has a machine such as foo.your.domain.com, and it has two subnets, with IP addresses 129.131.133.10, and 129.131.132.10. If the configuration macro is set as

     HOSTALLOW_WRITE = *.your.domain.com
    

    and this does not work, use

     HOSTALLOW_WRITE = 129.131.133.*, 129.131.132.*
    

Alternatively, this permissions problem may be caused by being too restrictive in the setting of your HOSTALLOW_WRITE and/or HOSTDENY_WRITE configuration macros. If it is, then the solution is to change the macros, for example from

 HOSTALLOW_WRITE = condor.your.domain.com

to

 HOSTALLOW_WRITE = *.your.domain.com

or possibly

 HOSTALLOW_WRITE = condor.your.domain.com, foo.your.domain.com, \
 bar.your.domain.com

Another likely error message within the collector log of the form

DaemonCore: PERMISSION DENIED to host <xxx.xxx.xxx.xxx> for command 5 (QUERY_STARTD_ADS)

indicates a similar problem as above, but read permission is the problem (as opposed to write permission). Use the solutions given above.
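
To see why a failed inverse look up produces this error, it can help to reproduce the check outside HTCondor. The Python sketch below is only an approximation under stated assumptions: `reverse_lookup`, `matches`, and `write_allowed` are hypothetical helpers, wildcard matching is done with the standard `fnmatch` module rather than HTCondor's own matcher, and the rule that deny entries take precedence is a simplification of the real authorization logic.

```python
import fnmatch
import socket


def reverse_lookup(ip):
    """Return the host name DNS reports for `ip`, or None if the inverse look up fails."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except OSError:
        return None


def matches(candidates, patterns):
    """True if any non-empty candidate (IP or host name) matches any wildcard pattern."""
    return any(
        fnmatch.fnmatch(c.lower(), p.strip().lower())
        for c in candidates if c
        for p in patterns
    )


def write_allowed(ip, hostname, allow, deny=()):
    """Simplified check: a deny match rejects, otherwise the allow list must match."""
    candidates = [ip, hostname]
    if matches(candidates, deny):
        return False
    return matches(candidates, allow)


# With no resolvable host name, a domain-name pattern cannot match the bare IP.
print(write_allowed("129.131.133.10", None, ["*.your.domain.com"]))
# Numeric patterns still match when the inverse look up fails.
print(write_allowed("129.131.133.10", None, ["129.131.133.*", "129.131.132.*"]))
```

In practice one would pass `reverse_lookup(ip)` as the `hostname` argument on the collector machine; if it returns None there, the DNS explanation above is the likely cause.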

 How do you configure HTCondor so that jobs run only on compute nodes that have specific software installed?


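  The method [4] takes two steps:
   Step 1:
     In the configuration file of each compute node that has the software installed, add the parameter
     HAS_MY_SOFTWARE = True
     and then add
      STARTD_ATTRS = HAS_MY_SOFTWARE, $(STARTD_ATTRS)


   Step 2:
     In the job's submit file, add
      Requirements = (HAS_MY_SOFTWARE =?= True)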
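
The `=?=` operator matters here because nodes that never define HAS_MY_SOFTWARE advertise no such attribute at all. The Python sketch below imitates the matching under stated assumptions: machine ads are modeled as plain dictionaries, `UNDEFINED`, `meta_equals`, and `eligible_machines` are hypothetical names, and `meta_equals` only approximates the ClassAd "is identical to" semantics (same type and same value, with an undefined operand never yielding an undefined result).

```python
UNDEFINED = object()  # stand-in for the ClassAd UNDEFINED value


def meta_equals(left, right):
    """Approximate `=?=`: true only when both sides are identical."""
    if left is UNDEFINED or right is UNDEFINED:
        return left is right
    return type(left) is type(right) and left == right


def eligible_machines(machine_ads, attr="HAS_MY_SOFTWARE"):
    """Return the names of machines whose ad satisfies (attr =?= True)."""
    return [
        ad["Name"]
        for ad in machine_ads
        if meta_equals(ad.get(attr, UNDEFINED), True)
    ]


# Hypothetical machine ads for illustration.
ads = [
    {"Name": "node1", "HAS_MY_SOFTWARE": True},
    {"Name": "node2"},                            # attribute not advertised
    {"Name": "node3", "HAS_MY_SOFTWARE": False},
]
print(eligible_machines(ads))
```

Only the node that explicitly advertises the attribute as True is selected; nodes without the attribute are simply rejected instead of leaving the requirement undefined.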

References

[1] Using TCP to Send Updates to the condor_collector

[2] collector permission problem

[3] https://github.com/htcondor/htcondor

[4] http://research.cs.wisc.edu/htcondor/manual/v7.8/7_2Setting_up.html


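HTCondor is an open-source project whose main daemons include condor_master, condor_startd, and condor_schedd. condor_master is the master daemon that starts and manages the other HTCondor daemons; condor_startd manages the compute resources; condor_schedd manages the job queue and job scheduling [^1].

As a usage example, a submit description file for submitting an LS-DYNA job to an HTCondor pool looks like this:

```plaintext
####################
#
# Example 1 - LS-DYNA
# Simple HTCondor submit description file
####################
Executable = E:\\LSDYNAR7\\program\\ls-dyna_smp_d_R700_winx64_ifort101.exe
universe = vanilla
input = 'I=I:\\HTCondor\\foo\\foo-airbag.k O=I:\\HTCondor\\foo\\result.txt'
should_transfer_files = YES
transfer_input_files = LSTC_FILE
when_to_transfer_output = ON_EXIT
output = output.$(cluster).$(process).txt
log = log.$(cluster).$(process).txt
queue
```

This file defines the executable, the input and output files, and related settings [^2].

To install HTCondor 8.6 or later, follow these steps:

```bash
# Install the EPEL repository
yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# Install the HTCondor repository
yum install -y https://research.cs.wisc.edu/htcondor/repo/8.8/el7/release/htcondor-release-8.8-1.el7.noarch.rpm
# Install HTCondor (minicondor for a single machine, condor for a cluster)
yum install -y minicondor   # or condor
# Run
systemctl start condor      # start condor through systemd
systemctl enable condor     # start condor automatically at boot
# Check that it is running
condor_q                    # list the jobs in the queue
condor_status               # list the nodes in the pool
systemctl status condor     # check the state of the condor service
```

Choose minicondor or condor depending on whether you are installing on a single machine or on a cluster [^3].

HTCondor tutorials are also available from several websites, such as the High Throughput Computing (HTC) site and the official quickstart and tutorial pages. A good tutorial is http://research.cs.wisc.edu/htcondor/tutorials/intl-grid-school-3/ ; for an unofficial tutorial, see the batch-processing section of http://www.hexfarm.rutgers.edu . Baidu Wenku also hosts material such as Condor survey reports and write-ups on the Condor job scheduling system, along with papers on the application of Condor to high-throughput computing and on Condor's design and performance analysis [^4].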