Motivation: Policies in Heuristic algorithms can be parameterized using deep neural network, and be trained via reinforcement to obtain new and stronger algorithms for many different combinatorial optimization problems.
Related work:
The application of Neural Networks (NNs) for optimizing decisions in combinatorial optimization problems dates back to Hopfield & Tank (1985), who applied a Hopfield-network for solving small TSP instances.
Pointer Network (PN) for (Euclidean) TSP
Graph Neural Network in a supervised manner
Encoder:
Attetion layer:

Training:
REINFORCE
WITH GREEDY ROLLOUT BASELINE
baseline: to estimate the difficulty of an instance or problem.
A good baseline reduces gradient variance and increase the speed of learning.
If one instance is difficult to solve and