[Evolutionary Computation] [Paper Notes] Large-Scale Evolution of Image Classifiers

1. Introduction

Paper link: https://arxiv.org/abs/1703.01041

The paper's main purpose is to “employ evolutionary algorithms to discover such networks automatically.”

The datasets used in the paper are CIFAR-10 and CIFAR-100.

Table 1. Comparison with single-model hand-designed architectures. The “C10+” and “C100+” columns indicate the test accuracy on the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. The “Reachable?” column denotes whether the given hand-designed model lies within our search space. An entry of “–” indicates that no value was reported. The † indicates a result reported by Huang et al. (2016b) instead of the original author. Much of this table was based on that presented in Huang et al. (2016a).


2. Related Work

Table 2. Comparison with automatically discovered architectures. The “C10+” and “C100+” columns contain the test accuracy on the data-augmented CIFAR-10 and CIFAR-100 datasets, respectively. An entry of “–” indicates that the information was not reported or is not known to us. For Zoph & Le (2016), we quote the result with the most similar search space to ours, as well as their best result. Please refer to Table 1 for hand-designed results, including the state of the art. “Discrete params.” means that the parameters can be picked from a handful of values only (e.g. strides ∈ {1, 2, 4}).


3. Methods

3.1 Evolutionary Algorithm

During each evolutionary step, a computer—a worker—chooses two individuals at random from this population and compares their fitnesses. The worst of the pair is immediately removed from the population—it is killed. The best of the pair is selected to be a parent, that is, to undergo reproduction.

The related code from the paper's appendix:

# Supplementary Material
#
# S1. Methods Details
#
# This section contains additional implementation details, roughly following the order in Section 3.
# Short code snippets illustrate the ideas. The code is not intended to run on its own and it has
# been highly edited for clarity.
#
# In our implementation, each worker runs an outer loop that is responsible for selecting a pair
# of random individuals from the population. The individual with the highest fitness usually becomes
# a parent and the one with the lowest fitness is usually killed (Section 3.1). Occasionally, either
# of these two actions is not carried out in order to keep the population size close to a set-point:

def evolve_population(self):
    # Iterate indefinitely.
    while True:
        # Select two random individuals from the population.
        valid_individuals = []
        for individual in self.load_individuals():  # Only loads the IDs and states.
            if individual.state in [TRAINING, ALIVE]:
                valid_individuals.append(individual)
        individual_pair = random.sample(valid_individuals, 2)

        for individual in individual_pair:
            # Sync changes from other workers from file-system. Loads everything else.
            individual.update_if_necessary()

            # Ensure the individual is fully trained.
            if individual.state == TRAINING:
                self._train(individual)

        # Select by fitness (accuracy).
        # If the population is too large, kill the worse one; if too small, reproduce the better one.
        individual_pair.sort(key=lambda i: i.fitness, reverse=True)
        better_individual = individual_pair[0]
        worse_individual = individual_pair[1]

        # If the population is not too small, kill the worst of the pair.
        # (This is correct as written: an individual is killed only when the
        # population is at or above the setpoint, keeping its size near the setpoint.)
        if self._population_size() >= self._population_size_setpoint:
            self._kill_individual(worse_individual)

        # If the population is not too large, reproduce the best of the pair.
        if self._population_size() < self._population_size_setpoint:
            self._reproduce_and_train_individual(better_individual)

Using this strategy to search large spaces of complex image models requires considerable computation. To achieve scale, we developed a massively-parallel, lock-free infrastructure. Many workers operate asynchronously on different computers. They do not communicate directly with each other. Instead, they use a shared file-system, where the population is stored. The file-system contains directories that represent the individuals. Operations on these individuals, such as the killing of one, are represented as atomic renames on the directory. Occasionally, a worker may concurrently modify the individual another worker is operating on. In this case, the affected worker simply gives up and tries again. The population size is 1000 individuals, unless otherwise stated. The number of workers is always $\frac{1}{4}$ of the population size. To allow for long run-times with a limited amount of space, dead individuals' directories are frequently garbage-collected.
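The infrastructure code itself is not published, but the atomic-rename trick is easy to sketch. A minimal illustration, assuming each individual lives in a directory named <state>_<id> (the naming scheme and helper are my own; os.rename is atomic on POSIX file-systems):

import os

ALIVE = 'alive'
DEAD = 'dead'

def try_kill_individual(population_dir, individual_id):
    """Attempt to kill an individual by atomically renaming its directory.

    Only one worker can win the rename. If another worker already renamed
    (or garbage-collected) the directory, os.rename raises OSError and this
    worker simply gives up and tries again, as described above.
    """
    src = os.path.join(population_dir, '%s_%s' % (ALIVE, individual_id))
    dst = os.path.join(population_dir, '%s_%s' % (DEAD, individual_id))
    try:
        os.rename(src, dst)  # Atomic: no locks needed.
        return True
    except OSError:
        return False  # Lost the race; pick another pair of individuals.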

3.2 Encoding and Mutations

  • ALTER-LEARNING-RATE (sampling details below).
  • IDENTITY (effectively means “keep training”).
  • RESET-WEIGHTS (sampled as in He et al. (2015), for example).
  • INSERT-CONVOLUTION (inserts a convolution at a random location in the “convolutional backbone”, as in Figure 1. The inserted convolution has 3 × 3 filters, strides of 1 or 2 at random, number of channels same as input. May apply batch-normalization and ReLU activation or none at random).
  • REMOVE-CONVOLUTION.
  • ALTER-STRIDE (only powers of 2 are allowed).
  • ALTER-NUMBER-OF-CHANNELS (of random conv.).
  • FILTER-SIZE (horizontal or vertical at random, on random convolution, odd values only).
  • INSERT-ONE-TO-ONE (inserts a one-to-one/identity connection, analogous to insert-convolution mutation).
  • ADD-SKIP (identity between random layers).
  • REMOVE-SKIP (removes random skip).

These specific mutations were chosen for their similarity to the actions that a human designer may take when improving an architecture. This may clear the way for hybrid evolutionary–hand-design methods in the future. The probabilities for the mutations were not tuned in any way.
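Because the probabilities were not tuned, choosing a mutation reduces to a uniform random pick over the list above. A minimal sketch in the style of the paper's snippets (the Mutation subclasses and the exceptions module appear in the supplementary code below):

import random

def mutate_once(dna, mutation_classes):
    """Apply one mutation chosen uniformly at random from the list above.

    If the chosen mutation does not apply to this particular DNA (it raises
    exceptions.MutationException), another one is drawn.
    """
    while True:
        mutation_class = random.choice(mutation_classes)
        try:
            return mutation_class().mutate(dna)
        except exceptions.MutationException:
            continue  # "Try another mutation", as in the paper's code.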

3.3 Initial Conditions

Every evolution experiment begins with a population of simple individuals, all with a learning rate of 0.1. Each initial individual is just a single-layer model with no convolutions, so all of them start out as very poor performers.
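Under the DNA encoding shown in the supplementary material below, such an individual could be built roughly like this. This is only a sketch under assumed proto field names (the dna_pb2 map fields, 'linear' and 'identity' oneofs), not the paper's actual initialization code:

def make_initial_dna():
    """A single-layer model with no convolutions: input and output vertices
    joined by one identity edge, with the starting learning rate of 0.1."""
    dna_proto = dna_pb2.DnaProto(learning_rate=0.1)
    dna_proto.vertices['input'].linear.SetInParent()   # Linear activations.
    dna_proto.vertices['output'].linear.SetInParent()
    dna_proto.edges['e0'].from_vertex = 'input'
    dna_proto.edges['e0'].to_vertex = 'output'
    dna_proto.edges['e0'].identity.SetInParent()       # Identity edge.
    return DNA(dna_proto)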

3.4 Training and Validation

3.5 Computation Cost

To estimate computation costs, we identified the basic TensorFlow (TF) operations used by our model training and validation, like convolutions, generic matrix multiplications, etc. For each of these TF operations, we estimated the theoretical number of floating-point operations (FLOPs) required. This resulted in a map from TF operation to FLOPs, which is valid for all our experiments.

For each individual within an evolution experiment, we compute the total FLOPs incurred by the TF operations in its architecture over one batch of examples, both during its training ($F_t$ FLOPs) and during its validation ($F_v$ FLOPs). Then we assign to the individual the cost $F_t N_t + F_v N_v$, where $N_t$ and $N_v$ are the number of training and validation batches, respectively. The cost of the experiment is then the sum of the costs of all its individuals.
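The bookkeeping reduces to a dictionary lookup plus a weighted sum. A minimal sketch (the helper names and FLOPs figures are invented for illustration):

# Hypothetical map from TF operation to estimated FLOPs per batch.
FLOPS_PER_OP = {'Conv2D': 2.1e9, 'MatMul': 4.2e8, 'BiasAdd': 1.6e5}

def individual_cost(train_ops, valid_ops, num_train_batches, num_valid_batches):
    """Cost of one individual: F_t * N_t + F_v * N_v (Section 3.5)."""
    f_t = sum(FLOPS_PER_OP[op] for op in train_ops)  # FLOPs per training batch.
    f_v = sum(FLOPS_PER_OP[op] for op in valid_ops)  # FLOPs per validation batch.
    return f_t * num_train_batches + f_v * num_valid_batches

def experiment_cost(individual_costs):
    """Cost of the whole experiment: the sum over all its individuals."""
    return sum(individual_costs)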

3.6 Weight Inheritance

In the paper, children inherit their parents' trained weights wherever possible: any layer whose shape is unchanged by the mutation keeps the parent's weights, while altered or new layers are reinitialized.
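A minimal NumPy sketch of that idea (the helper names are mine, not the paper's):

import numpy as np

def inherit_weights(parent_weights, child_shapes, initializer):
    """Copy parent weights into the child wherever shapes match.

    parent_weights maps layer name to np.ndarray; child_shapes maps layer
    name to the shape the child requires. Layers that are new, or whose
    shape was altered by the mutation, are freshly initialized instead.
    """
    child_weights = {}
    for name, shape in child_shapes.items():
        parent = parent_weights.get(name)
        if parent is not None and parent.shape == shape:
            child_weights[name] = parent.copy()  # Inherit trained weights.
        else:
            child_weights[name] = initializer(shape)  # Train from scratch.
    return child_weights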

4. Experiments and Results

5. Analysis

Figure 1. Progress of an evolution experiment. Each dot represents an individual in the population. Blue dots (darker, top-right) are alive. The rest have been killed. The four diagrams show examples of discovered architectures. These correspond to the best individual (rightmost) and three of its ancestors. The best individual was selected by its validation accuracy. Evolution sometimes stacks convolutions without any nonlinearity in between (“C”, white background), which are mathematically equivalent to a single linear operation. Unlike typical hand-designed architectures, some convolutions are followed by more than one nonlinear function (“C+BN+R+BN+R+…”, orange background).
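The claim that stacked linear convolutions collapse into a single linear operation can be checked directly: convolution is associative, so applying two kernels in sequence equals applying their convolution once. A quick 1-D NumPy check:

import numpy as np

x = np.random.randn(32)   # Input signal.
k1 = np.random.randn(3)   # First convolution, no nonlinearity after it.
k2 = np.random.randn(3)   # Second convolution, stacked directly on the first.

stacked = np.convolve(np.convolve(x, k1), k2)  # Two convolutions in a row.
single = np.convolve(x, np.convolve(k1, k2))   # One equivalent convolution.
assert np.allclose(stacked, single)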

6. Supplementary Material

6.1 DNA


# The encoding for an individual is represented by a serializable DNA class instance containing all
# information except for the trained weights (Section 3.2). For all results in this paper, this
# encoding is a directed, acyclic graph where edges represent convolutions and vertices represent
# nonlinearities. This is a sketch of the DNA class:


class DNA(object):
    def __init__(self, dna_proto):
        """Initializes the ‘DNA‘ instance from a protocol buffer(协议缓冲区).
        The ‘dna_proto‘ is a protocol buffer used to restore the DNA state from disk.
        Together with the corresponding ‘to_proto‘ method, they allow for a
        serialization-deserialization(序列化-反序列化) mechanism.
        """
        # Allows evolving the learning rate, i.e. exploring the space of
        # learning rate schedules.
        self.learning_rate = dna_proto.learning_rate
        # Mutable vertices and connecting edges.
        self._vertices = {}  # String vertex ID to ‘Vertex‘ instance.
        for vertex_id in dna_proto.vertices:
            self._vertices[vertex_id] = Vertex(
                vertex_proto=dna_proto.vertices[vertex_id])

        self._edges = {}  # String edge ID to ‘Edge‘ instance.
        for edge_id in dna_proto.edges:
            self._edges[edge_id] = Edge(edge_proto=dna_proto.edges[edge_id])

        ...

    def to_proto(self):
        """Returns this instance in protocol buffer form."""
        dna_proto = dna_pb2.DnaProto(learning_rate=self.learning_rate)

        for vertex_id, vertex in self._vertices.items():
            dna_proto.vertices[vertex_id].CopyFrom(vertex.to_proto())

        for edge_id, edge in self._edges.items():
            dna_proto.edges[edge_id].CopyFrom(edge.to_proto())
        ...
        return dna_proto

    def add_edge(self, from_vertex_id, to_vertex_id, edge_type, edge_id):
        """Adds an edge to the DNA graph, ensuring internal consistency."""
        # ‘EdgeProto‘ defines defaults for other attributes.
        edge = Edge(EdgeProto(
            from_vertex=from_vertex_id, to_vertex=to_vertex_id, type=edge_type))
        self._edges[edge_id] = edge
        self._vertices[from_vertex_id].edges_out.add(edge_id)
        self._vertices[to_vertex_id].edges_in.add(edge_id)
        return edge
    # Other methods like ‘add_edge‘ to manipulate the graph structure.
    ...
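For instance, a skip connection between two existing vertices could be added like this (the vertex IDs are hypothetical; IDENTITY and _random_id are the constant and helper used elsewhere in the paper's snippets):

# Hypothetical usage: add an identity (skip) edge between two vertices.
dna.add_edge(from_vertex_id='vertex_3', to_vertex_id='vertex_7',
             edge_type=IDENTITY, edge_id=_random_id())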

6.2 Vertex


# The DNA holds Vertex and Edge instances. The Vertex class looks like this:
class Vertex(object):
    def __init__(self, vertex_proto):
        # Relationship to the rest of the graph.
        self.edges_in = set(vertex_proto.edges_in)  # Incoming edge IDs.
        self.edges_out = set(vertex_proto.edges_out)  # Outgoing edge IDs.

        # The type of activations.
        if vertex_proto.HasField('linear'):
            self.type = LINEAR  # Linear activations.
        elif vertex_proto.HasField('bn_relu'):
            self.type = BN_RELU  # ReLU activations with batch-normalization.
        else:
            raise NotImplementedError()

        # Some parts of the graph can be prevented from being acted upon by mutations.
        # The following boolean flags control this.
        self.inputs_mutable = vertex_proto.inputs_mutable
        self.outputs_mutable = vertex_proto.outputs_mutable
        self.properties_mutable = vertex_proto.properties_mutable
        # Each vertex represents a 2^s x 2^s x d block of nodes. s and d are
        # positive integers computed dynamically from the in-edges. s stands for
        # "scale" so that 2^s x 2^s is the spatial size of the activations. d
        # stands for "depth", the number of channels.

    def to_proto(self):
        ...

6.3 Edge


class Edge(object):

    def __init__(self, edge_proto):
        # Relationship to the rest of the graph.
        self.from_vertex = edge_proto.from_vertex  # Source vertex ID.
        self.to_vertex = edge_proto.to_vertex  # Destination vertex ID.
        if edge_proto.HasField('conv'):
            # In this case, the edge represents a convolution.
            self.type = CONV

            # Controls the depth (i.e. number of channels) in the output, relative to the
            # input. For example if there is only one input edge with a depth of 16 channels
            # and ‘self._depth_factor‘ is 2, then this convolution will result in an output
            # depth of 32 channels. Multiple-inputs with conflicting depth must undergo
            # depth resolution first.
            self.depth_factor = edge_proto.conv.depth_factor

            # Controls the shape of the convolution filters (i.e. transfer function).
            # This parameterization ensures that the filter width and height are odd
            # numbers: filter_width = 2 * filter_half_width + 1.
            self.filter_half_width = edge_proto.conv.filter_half_width
            self.filter_half_height = edge_proto.conv.filter_half_height

            # Controls the strides of the convolution. It will be 2^stride_scale.
            # Note that conflicting input scales must undergo scale resolution. This
            # controls the spatial scale of the output activations relative to the
            # spatial scale of the input activations.
            self.stride_scale = edge_proto.conv.stride_scale
        elif edge_proto.HasField('identity'):
            self.type = IDENTITY
        else:
            raise NotImplementedError()

        # In case depth or scale resolution is necessary due to conflicts in inputs,
        # these integer parameters determine which of the inputs takes precedence in
        # deciding the resolved depth or scale.
        self.depth_precedence = edge_proto.depth_precedence
        self.scale_precedence = edge_proto.scale_precedence

    def to_proto(self):
        ...
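To make the shape arithmetic in the comments above concrete, here is a small sketch (my helper, not paper code) of how a conv edge transforms a 2^s x 2^s x d block:

def describe_conv_edge(edge, input_scale, input_depth):
    """Shape arithmetic for a conv edge, per the comments above.

    A vertex holds a 2^s x 2^s x d block; a conv edge with stride
    2^stride_scale maps scale s to s - stride_scale and multiplies
    depth d by depth_factor.
    """
    return {
        'filter_width': 2 * edge.filter_half_width + 1,    # Odd by construction.
        'filter_height': 2 * edge.filter_half_height + 1,  # Odd by construction.
        'stride': 2 ** edge.stride_scale,
        'output_scale': input_scale - edge.stride_scale,
        'output_depth': input_depth * edge.depth_factor,
    }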

Note the concrete configuration of Vertex and Edge: mapped onto the earlier schematic, the actual composition of a DNA would look like this (lin ~ linear):

6.4 AddEdgeMutation

# Many mutations modify the structure. Mutations to insert and excise vertex-edge pairs
# build up a main convolutional column, while mutations to add and remove edges can handle
# the skip connections. For example, the AddEdgeMutation can add a skip connection between
# random vertices.

import copy
import random


class AddEdgeMutation(Mutation):
    """Adds a single edge to the graph."""
    def mutate(self, dna):
        # Try the candidates in random order until one has the right connectivity.
        for from_vertex_id, to_vertex_id in self._vertex_pair_candidates(dna):
            mutated_dna = copy.deepcopy(dna)
            if (self._mutate_structure(mutated_dna, from_vertex_id,
                                       to_vertex_id)):
                return mutated_dna
        raise exceptions.MutationException()  # Try another mutation.

    def _vertex_pair_candidates(self, dna):
        """Yields connectable vertex pairs."""
        from_vertex_ids = _find_allowed_vertices(dna, self._from_regex, ...)
        if not from_vertex_ids:
            raise exceptions.MutationException()  # Try another mutation.
        random.shuffle(from_vertex_ids)

        to_vertex_ids = _find_allowed_vertices(dna, self._to_regex, ...)
        if not to_vertex_ids:
            raise exceptions.MutationException()  # Try another mutation.
        random.shuffle(to_vertex_ids)

        for to_vertex_id in to_vertex_ids:
            # Avoid back-connections.
            disallowed_from_vertex_ids, _ = topology.propagated_set(
                to_vertex_id)
            for from_vertex_id in from_vertex_ids:
                if from_vertex_id in disallowed_from_vertex_ids:
                    continue
                # This pair does not generate a cycle, so we yield it.
                yield from_vertex_id, to_vertex_id

    def _mutate_structure(self, dna, from_vertex_id, to_vertex_id):
        """Adds the edge to the DNA instance."""
        edge_id = _random_id()
        edge_type = random.choice(self._edge_types)
        if dna.has_edge(from_vertex_id, to_vertex_id):
            return False
        else:
            new_edge = dna.add_edge(from_vertex_id, to_vertex_id, edge_type,
                                    edge_id)
            ...
            return True

6.5 AlterLearningRateMutation


# Mutations act on DNA instances. The set of mutations restricts the space explored
# somewhat (Section 3.2). The following are some example mutations. The
# AlterLearningRateMutation simply randomly modifies the attribute in the DNA:

import copy
import random


class AlterLearningRateMutation(Mutation):
    """Mutation that modifies the learning rate."""

    def mutate(self, dna):
        mutated_dna = copy.deepcopy(dna)

        # Mutate the learning rate by a random factor between 0.5 and 2.0,
        # uniformly distributed in log scale.
        factor = 2 ** random.uniform(-1.0, 1.0)
        mutated_dna.learning_rate = dna.learning_rate * factor
        return mutated_dna
