How to download CIFAR-10 and split it into training and testing data in Python

Article link: https://blog.youkuaiyun.com/weixin_39587440/article/details/106185846

How to use CIFAR-10 in Python

- Step 1: download CIFAR-10 with a shell script
- How to split CIFAR-10 into training data and testing data
- How to convert the data into a more convenient format

Step 1: download CIFAR-10 with a shell script

#!/usr/bin/env bash
if ! [ -d "cifar-10-batches-py" ]; then
        wget https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
        tar xvzf cifar-10-python.tar.gz
        rm -f cifar-10-python.tar.gz
fi

Line 1 (the shebang) marks the file as a bash script.
Line 2 checks whether a directory named cifar-10-batches-py already exists in the current folder; the download only runs when it does not.
Line 3 downloads the archive from the University of Toronto server with wget.
Line 4 extracts the archive into the current folder, which creates cifar-10-batches-py.
Line 5 deletes the compressed archive, which is no longer needed.
The last line, fi, closes the if block.
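
For readers who would rather stay in Python, the same logic can be sketched with the standard library alone. This is a minimal sketch and not part of the original script: the function name `ensure_cifar10` and its `root` and `fetch` parameters are illustrative assumptions. Only the URL and the folder name come from the script above.

```python
import os
import tarfile
import urllib.request

CIFAR10_URL = "https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz"
CIFAR10_DIR = "cifar-10-batches-py"


def ensure_cifar10(root=".", fetch=urllib.request.urlretrieve):
    """Download and extract CIFAR-10 into `root` unless it is already there.

    `fetch(url, filename)` is a hypothetical hook so the download step can be
    swapped out; it defaults to urllib.request.urlretrieve.
    Returns the path of the extracted folder.
    """
    target = os.path.join(root, CIFAR10_DIR)
    if os.path.isdir(target):      # same check as `[ -d ... ]` in the script
        return target

    os.makedirs(root, exist_ok=True)
    archive = os.path.join(root, "cifar-10-python.tar.gz")
    fetch(CIFAR10_URL, archive)    # plays the role of wget
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(root)       # plays the role of tar xvzf
    os.remove(archive)             # plays the role of rm -f
    return target
```

Calling `ensure_cifar10("./data")` would place the batch files under ./data/cifar-10-batches-py, which is the path the loading code later in this post expects.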

Suppose you have created a folder named data and run the script inside it. If you then cd data, you will see a new folder named cifar-10-batches-py.
If you cd into cifar-10-batches-py, you will see the files data_batch_1 through data_batch_5, test_batch, batches.meta and readme.html. The five data_batch files hold the training data, and test_batch holds the testing data.

How to split CIFAR-10 into training data and testing data

A model is fitted on the training data and evaluated on the testing data, so the two sets must be kept apart. Conveniently, CIFAR-10 already ships split this way: the five data_batch files are the training set (50,000 images) and test_batch is the test set (10,000 images), so all you need to do is read them out.
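
Before loading everything, it can help to look inside one batch file. The sketch below assumes the layout documented on the CIFAR-10 page: each batch is a pickled dict whose `data` entry is a 10000 x 3072 uint8 array and whose `labels` entry is a list of integers from 0 to 9. The helper name `summarize_batch` is hypothetical.

```python
import pickle

import numpy as np


def summarize_batch(path):
    """Return basic facts about one CIFAR-10 batch file."""
    with open(path, "rb") as f:
        # With encoding="bytes" the dict keys come back as bytes,
        # e.g. b"data" and b"labels".
        batch = pickle.load(f, encoding="bytes")
    data = np.asarray(batch[b"data"])
    labels = np.asarray(batch[b"labels"])
    return {
        "keys": sorted(batch.keys()),
        "data_shape": data.shape,
        "data_dtype": str(data.dtype),
        "num_labels": int(labels.shape[0]),
        "label_min": int(labels.min()),
        "label_max": int(labels.max()),
    }
```

For example, `summarize_batch("./data/cifar-10-batches-py/data_batch_1")` should report a data shape of (10000, 3072) if the download is intact.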

import pickle

import numpy as np


# The folder holds five training files (data_batch_1 ... data_batch_5), so nbbatch=5.
def load_cifar10_2(nbbatch=5):
    all_data = []    # training images, one array per batch file
    all_labels = []  # training labels, one array per batch file

    # Read the training batches in order.
    for i in range(nbbatch):
        # 'rb' opens the file read-only in binary mode, which pickle requires.
        with open("./data/cifar-10-batches-py/data_batch_%s" % (i + 1), 'rb') as f:
            # The files were pickled with Python 2, so encoding='bytes' is needed;
            # the keys of the returned dict are bytes.
            batch = pickle.load(f, encoding='bytes')
        all_data.append(batch[b'data'])
        # Labels are stored as a plain list; turn them into an (N, 1) array.
        all_labels.append(np.asarray(batch[b'labels']).reshape((-1, 1)))

    # Read the single test batch the same way.
    with open("./data/cifar-10-batches-py/test_batch", 'rb') as f:
        batch = pickle.load(f, encoding='bytes')
    test_data = batch[b'data']
    test_labels = np.asarray(batch[b'labels']).reshape((-1, 1))

    # Concatenate the per-batch arrays along the sample axis.
    all_data = np.concatenate(all_data, axis=0)
    all_labels = np.concatenate(all_labels, axis=0)
    return (all_data, all_labels, test_data, test_labels)

How to convert the data into a more convenient format


def cifar10_proper_array(data):
    all_red = data[:,:1024].reshape(-1, 32, 32)
    all_green = data[:,1024:2048].reshape(-1, 32, 32)
    all_blue = data[:,2048:].reshape(-1, 32, 32)
    return np.stack([all_red, all_green, all_blue], axis=1) / 255.0

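The snippet above reshapes each flat 3072-value row into a 3 x 32 x 32 (channel, height, width) image and normalizes the pixel values from [0, 255] to [0, 1].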
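
To see what the reshaping does, recall that each row stores 1024 red values, then 1024 green, then 1024 blue, each plane in row-major order. The sketch below runs the same slicing on a tiny synthetic row and then moves the channel axis last, which is the layout that image viewers such as matplotlib's imshow expect. The helper name `to_channels_last` and the synthetic row are illustrative assumptions; `cifar10_proper_array` is repeated so the block runs on its own.

```python
import numpy as np


def cifar10_proper_array(data):
    # Same function as above, repeated so this block is self-contained.
    all_red = data[:, :1024].reshape(-1, 32, 32)
    all_green = data[:, 1024:2048].reshape(-1, 32, 32)
    all_blue = data[:, 2048:].reshape(-1, 32, 32)
    return np.stack([all_red, all_green, all_blue], axis=1) / 255.0


def to_channels_last(images):
    """Convert (N, 3, 32, 32) arrays to (N, 32, 32, 3)."""
    return images.transpose(0, 2, 3, 1)


# One synthetic image: red plane all 255, green plane all 0, blue plane all 51.
row = np.concatenate([
    np.full(1024, 255, dtype=np.uint8),
    np.zeros(1024, dtype=np.uint8),
    np.full(1024, 51, dtype=np.uint8),
]).reshape(1, 3072)

chw = cifar10_proper_array(row)
hwc = to_channels_last(chw)
print(chw.shape)               # (1, 3, 32, 32)
print(hwc.shape)               # (1, 32, 32, 3)
print(hwc[0, 0, 0].tolist())   # [1.0, 0.0, 0.2]
```

The channel-first layout returned by `cifar10_proper_array` is the one convolutional layers in some frameworks expect, so the transpose is only needed when a channel-last layout is required, for instance for plotting.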

data, labels, test_data, test_label = load_cifar10_2()
labels = labels.reshape(-1)
test_label = test_label.reshape(-1)

data = cifar10_proper_array(data)
test_data = cifar10_proper_array(test_data)