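While studying the code in GitHub - ArvinZhuang/DSI-QG (the official repository for "Bridging the Gap Between Indexing and Retrieval for Differentiable Search Index with Query Generation", by Shengyao Zhuang, Houxing Ren, Linjun Shou, Jian Pei, Ming Gong, Guido Zuccon and Daxin Jiang), I had a rough time with the pretrained models and datasets on https://huggingface.co/. Loading the models directly was inconvenient, so my first plan was to download everything to disk and load it from there.
For the dataset download I used the code below, which worked in the end. It follows the CSDN blog post "huggingface之datasets将数据集下载到本地" by 快去写论文.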
# Download the dataset and save it to disk
from datasets import load_dataset
dataset = load_dataset("neulab/conala", cache_dir="./dataset")
dataset.save_to_disk('dataset/neulab/conala')
# Load the dataset back from disk
import datasets
dataset = datasets.load_from_disk("dataset/neulab/conala")
Then I manually moved it under data:
DSI-QG
- __pycache__
- cache
  - downloads
  - Tevatron__msmarco-passage-corpus
    - default
- CE
- data
  - msmarco_data
    - 100k
      - X.tsv
- other files (.py, .sh, etc.)
But when the pretrained model is loaded online, I could not tell where the downloaded files ended up:
if 'mt5' in run_args.model_name:
    tokenizer = MT5Tokenizer.from_pretrained(run_args.model_name, cache_dir='cache')
    fast_tokenizer = MT5TokenizerFast.from_pretrained(run_args.model_name, cache_dir='cache')
    if run_args.model_path:
        model = MT5ForConditionalGeneration.from_pretrained(run_args.model_path, cache_dir='cache')
    else:
        model = MT5ForConditionalGeneration.from_pretrained(run_args.model_name, cache_dir='cache')
else:
    tokenizer = T5Tokenizer.from_pretrained(run_args.model_name, cache_dir='cache')
    fast_tokenizer = T5TokenizerFast.from_pretrained(run_args.model_name, cache_dir='cache')
    if run_args.model_path:
        model = T5ForConditionalGeneration.from_pretrained(run_args.model_path, cache_dir='cache')
    else:
        model = T5ForConditionalGeneration.from_pretrained(run_args.model_name, cache_dir='cache')
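To find where `from_pretrained(..., cache_dir='cache')` puts its files, it can help to compute the folder name the Hugging Face hub cache is expected to use. The sketch below assumes the `models--<org>--<name>` layout used by recent `huggingface_hub` releases (older releases stored files differently). `expected_cache_folder` and `list_cached_files` are hypothetical helpers, not library functions, and `google/mt5-base` is only an example model name.

```python
import os


def expected_cache_folder(model_name: str, cache_dir: str = "cache") -> str:
    """Return the folder where the hub cache is expected to store a model.

    Assumption: the `models--<org>--<name>` layout of recent
    huggingface_hub releases.
    """
    folder = "models--" + model_name.replace("/", "--")
    return os.path.join(cache_dir, folder)


def list_cached_files(model_name: str, cache_dir: str = "cache") -> list:
    """List files under the expected cache folder (empty if nothing is there)."""
    root = expected_cache_folder(model_name, cache_dir)
    found = []
    # os.walk yields nothing for a missing directory, so this is safe to call
    # before any download has happened.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    print(expected_cache_folder("google/mt5-base"))
    print(list_cached_files("google/mt5-base"))
```

Because `cache_dir='cache'` is a relative path, it is resolved against the current working directory, so when the script is started from the DSI-QG folder the files should land under DSI-QG/cache.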
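So I tried a different approach: loading everything online directly, through the usual Clash proxy.
But SSL certificate verification kept failing.
With requests, the usual workaround is: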
requests.get(url, verify=False)
But I was not calling requests myself, so I looked for a global switch instead.
I followed the CSDN blog post "下载Pytorch的自带数据集时报错=urllib.error.URLError: urlopen error [SSL: CERTIFICATE_VERIFY_FAILED]":
# Disable certificate verification globally
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
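Turning verification off for the whole process is a blunt tool, so a narrower variant may be worth considering. The sketch below wraps the same two lines in a context manager that restores the original setting afterwards. `unverified_https` is a hypothetical helper name. As far as I know this hook is consulted by `urllib`-based code, so libraries that go through `requests` may not be affected by it.

```python
import contextlib
import ssl


@contextlib.contextmanager
def unverified_https():
    """Temporarily disable certificate verification for urllib-style clients."""
    original = ssl._create_default_https_context
    ssl._create_default_https_context = ssl._create_unverified_context
    try:
        yield
    finally:
        # Restore verification even if the download raised an exception.
        ssl._create_default_https_context = original


if __name__ == "__main__":
    with unverified_https():
        # Run the download that fails certificate verification here.
        pass
```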
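Another small hiccup: the proxy server does not necessarily support https.
import os

# Address and port of the local proxy server
proxy_host = "127.0.0.1"
proxy_port = 7890
# Set the environment variables that point to the proxy
os.environ['http_proxy'] = f'http://{proxy_host}:{proxy_port}'
os.environ['https_proxy'] = f'http://{proxy_host}:{proxy_port}'
Finally, think carefully about the http in that last line: the value of https_proxy also starts with http://, because the scheme describes how to talk to the proxy itself, and the proxy may not accept https.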
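The proxy setup can be wrapped in a small function that also reports what the standard library picks up, as a quick sanity check before starting a long download. This is a minimal sketch: `set_local_proxy` is a hypothetical helper, and the default host and port are the ones used in this post.

```python
import os
import urllib.request


def set_local_proxy(host: str = "127.0.0.1", port: int = 7890) -> dict:
    """Route both http and https traffic through a local proxy that speaks plain HTTP."""
    proxy_url = f"http://{host}:{port}"
    os.environ["http_proxy"] = proxy_url
    os.environ["https_proxy"] = proxy_url
    # urllib reads these variables; return what it sees as a sanity check.
    return urllib.request.getproxies()


if __name__ == "__main__":
    proxies = set_local_proxy()
    print(proxies.get("http"), proxies.get("https"))
```

Call it before importing or running anything that opens network connections, so the variables are already set when the first request is made.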