SemanGit: A Linked Dataset from git

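This post introduces SemanGit, a linked dataset built from GitHub activity that contains more than 20 billion RDF triples and is intended to accelerate open-source software development and collaboration. By semantically enriching GitHub metadata, it creates a rich data resource for analyzing open-source projects, developer behavior, and social relationships.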

The paper was published jointly by Dennis Oliver Kubitza, Matthias Bockmann, and Damien Graux.

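Introduction:
The growing popularity of open-source software has driven the evolution of version control systems, which help developers collaborate on projects. Dedicated tools and open online platforms are becoming increasingly important. The paper shares SemanGit, a resource at the intersection of the Semantic Web and git-based version control.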

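Goals of the paper:
The goals are to add semantics to the git protocol, to exploit the advantages of semantic and graph databases, to integrate other datasets from the Linked Open Data cloud, and to design a backbone ontology that supports logical reasoning. Because of GitHub's sheer size, the first RDF dataset is built from data extracted from GitHub. GitHub offers a REST API as a query point for collecting data, but the number of requests is limited, so collection is slow and yields little. The GHTorrent project provides a large amount of data and is a better input than the rate-limited GitHub API. Its data is stored in a relational model, however, which is poorly suited to linked-data analysis, where a graph dataset is preferred. A converter is therefore used to transform the relational tables into RDF.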

Background on git

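  1. git is a protocol for tracking changes to files (such as additions, deletions, and modifications), and GitHub is the largest online git provider. Besides hosting git repositories, GitHub implements features that are not part of the git protocol, such as following users, watching projects for changes, and creating project versions.
  2. git offers several properties, such as data integrity and support for distributed, non-linear workflows. Because the file system represented by a git repository is distributed, developers can commit their changes to git, and once the repository is updated, collaborators can access the latest version.
  3. The git protocol relies on data repositories, and many online git hosting providers add their own features that are not part of the git protocol. To keep the ontology extensible, it must be clear what belongs to the git protocol and what is provider-specific. The part of the ontology that covers git protocol features represents only data that strictly belongs to the protocol. This protocol-related part is small and contains just four classes: user, project, commit, and request. The user class stores only an email address.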

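Adding a prefix to class names keeps the classes that correspond to provider-specific extensions separate from those of the original protocol. Most of the ontology does not describe parts of the git protocol but covers features that providers have added or extended. Some of these are social relationships. GitHub, for example, lets users comment on certain objects, which the git protocol does not specify: it only allows the initial commit message. In such cases, where the added feature is not an extension of an existing git protocol feature, the corresponding class in the ontology does not inherit from a class that represents a git feature.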
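The separation described above can be illustrated with a small sketch. The class names and the `github_` prefix below are hypothetical and chosen only for illustration; the actual SemanGit ontology may name and structure its classes differently.

```python
# Protocol classes as listed in the text above.
PROTOCOL_CLASSES = {"user", "project", "commit", "request"}

# Hypothetical provider-specific classes, marked by a "github_" prefix.
# The value is the parent class, or None when there is no parent.
SUBCLASS_OF = {
    "github_user": "user",        # extends a protocol feature
    "github_project": "project",  # extends a protocol feature
    "github_comment": None,       # comments are not in the git protocol
}


def extends_protocol(cls: str) -> bool:
    """Return True if cls inherits, directly or indirectly, from a protocol class."""
    parent = SUBCLASS_OF.get(cls)
    while parent is not None:
        if parent in PROTOCOL_CLASSES:
            return True
        parent = SUBCLASS_OF.get(parent)
    return False
```

In this sketch `github_user` inherits from the protocol class `user`, whereas `github_comment` stands on its own because commenting is a feature the provider added.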

Creating the SemanGit dataset

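  1. Data generation: The GHTorrent project continuously mines metadata from GitHub and publishes monthly database dumps. Working from these dumps sidesteps the query limits of the GitHub API. The dumps come as comma-separated values (CSV) files, each storing one kind of object or one kind of relationship between objects. Because different relationship types live in different files, parallelization is trivial. Given the ontology and the GHTorrent input, converting the CSV files to Turtle is straightforward. Because of the dataset's size, the output also has to be as compact as possible.
    Conversion process: Bash scripts automate the pipeline by checking for new data dumps, managing downloads and decompression, and keeping the resources in use fault-tolerant. Every step has error checks and fallback mechanisms to ensure the integrity of the result. The checks rely mainly on log files that record which tasks have completed, so the process can be restarted from a suitable point.
    Techniques for reducing RDF file size: The data is serialized as Turtle, using prefixes and abbreviating parts of triples. A prefix is created for every URI in the ontology; prefix names are at most two characters long, and the most frequently used URIs get the shortest prefixes. Converting all integers in resource identifiers from base 10 to base 64 greatly reduces the output size while staying compatible with Turtle. Once data generation is complete, the Vocabulary of Interlinked Datasets (VoID) is used to describe the resulting dataset; these triples cover the dataset's name and description, its format, and its license.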

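  2. Dataset statistics
    SemanGit is 353 GB in size and contains more than 21 billion triples. The GHTorrent input files take up 340 GB, so adding semantics costs less than 4% in overhead. Using prefixes alone, as far as the Turtle format allows, the output was 25% larger than the input files. Adding a Base64-like integer representation brought the overhead down to 4%. The whole conversion takes less than seven hours.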
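
The CSV-to-Turtle step and the base-64 identifier trick can be sketched as follows. This is a minimal illustration, not the SemanGit extractor: the two-column follower layout, the `u:` and `s:` prefixes, the `example.org` URIs, the `s:follows` predicate, and the 64-symbol alphabet are all assumptions. The alphabet (digits, letters, `_` and `:`) is chosen with the aim of staying within the characters Turtle accepts in the local part of a prefixed name.

```python
import csv
import io
import string

# Assumed 64-symbol alphabet; the one SemanGit uses is not stated in the text.
ALPHABET = string.digits + string.ascii_uppercase + string.ascii_lowercase + "_:"


def to_base64_id(n: int) -> str:
    """Encode a non-negative integer in base 64 using ALPHABET."""
    if n < 0:
        raise ValueError("identifier must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, rem = divmod(n, 64)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))


def followers_csv_to_turtle(csv_text: str) -> str:
    """Convert rows of (follower_id, user_id) into abbreviated Turtle triples."""
    out = [
        "@prefix u: <http://example.org/user/> .",      # hypothetical URI
        "@prefix s: <http://example.org/semangit#> .",  # hypothetical URI
    ]
    for follower_id, user_id in csv.reader(io.StringIO(csv_text)):
        subject = to_base64_id(int(follower_id))
        obj = to_base64_id(int(user_id))
        out.append(f"u:{subject} s:follows u:{obj} .")
    return "\n".join(out)


print(followers_csv_to_turtle("1000000,64\n"))
```

With this alphabet the identifier `1000000` becomes the four-character string `3q90`, which gives a feel for where the size savings on large numeric identifiers come from.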

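Social relationships of organizations
To investigate the social side of open-source collaboration, two features are analyzed: the number of members of an organization and the number of followers of a user. In the SemanGit dataset, the user class corresponds to either a natural user or an organization. In the latter case, the organization has members, who are themselves users. Membership is tracked through organization join events, and aggregating all of these events yields the full set of members.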
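
Aggregating join events into member sets can be sketched in a few lines. The event representation below, a pair of organization and user identifiers, is an assumption made for illustration; in SemanGit the events are RDF resources.

```python
from collections import defaultdict


def members_by_org(join_events):
    """Collect the set of members per organization from (org, user) join events."""
    members = defaultdict(set)
    for org, user in join_events:
        members[org].add(user)  # a set ignores repeated joins by the same user
    return dict(members)


# Hypothetical events, for illustration only.
events = [("orgA", "alice"), ("orgA", "bob"), ("orgB", "alice"), ("orgA", "alice")]
sizes = {org: len(users) for org, users in members_by_org(events).items()}
print(sizes)
```

The number of members per organization, one of the two features analyzed, is then simply the size of each set.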

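Developers: The social interactions recorded on GitHub are useful for more than social analysis. Suppose a developer regularly uses an open-source tool. With SemanGit, similar projects can be found by taking the set of developers of the tool's repository and evaluating which repositories those developers are watching.
Economists: Economics is useful for studying economic drivers, structures, and institutions, but the analysis of open software projects and the motivations behind them is still a subject of current research. For this kind of research, linked data opens up new opportunities, because it is tailored to analyzing local models and gives direct access to the experience of individuals.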
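
The developer use case can be sketched as a simple counting procedure. The two dictionaries are hypothetical in-memory stand-ins for what would be queried from the dataset, and ranking by the number of shared watchers is an assumed heuristic rather than a method taken from the paper.

```python
from collections import Counter


def similar_projects(tool_repo, contributors, watches):
    """Rank repositories by how many of tool_repo's developers watch them.

    contributors: maps a repository to the set of its developers.
    watches: maps a developer to the set of repositories they watch.
    """
    counts = Counter()
    for dev in contributors.get(tool_repo, ()):
        for repo in watches.get(dev, ()):
            if repo != tool_repo:  # skip the tool itself
                counts[repo] += 1
    return counts.most_common()


# Hypothetical data, for illustration only.
contributors = {"toolX": {"a", "b", "c"}}
watches = {"a": {"toolX", "toolY"}, "b": {"toolY", "toolZ"}, "c": {"toolY"}}
print(similar_projects("toolX", contributors, watches))
```

Repositories watched by many of the tool's developers appear first, which makes them candidates for similar projects.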

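The paper shares the SemanGit dataset, a linked-data version of GitHub activity consisting of more than 20 billion RDF triples. Alongside the publicly available dataset, a GitHub repository provides the extractor that converts GHTorrent data into RDF conforming to the ontology, and an ontology was designed to represent git repositories and GitHub activity. SemanGit still lacks a public SPARQL endpoint that would make the whole dataset queryable. The volume of data makes this a challenging task. Computing resources are currently devoted to vertical extension and to links with other datasets. Once the necessary computing power and storage have been freed up, an endpoint will be set up on a server.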
