Fundamentally, the urban traffic distribution problem is a multi-criteria decision-making problem. The Reinforcement Learning framework, in which an agent learns an optimal policy by interacting with its environment, offers an advantageous method for algorithmic development and network improvement. Each action the agent takes yields a reward or punishment together with a new observation of the state. Through this learning process, the agent acquires a distributed routing policy that can maximise the capacity of an urban transport network. The process can be treated as a Markov Decision Process (MDP), which moves towards the best solution by improving a specific policy step by step. A Markov Decision Process contains five key elements, as shown below:


S: The set of states, each of which reflects the situation of the system through specific properties.

A: The set of actions that can change the state.

P(s' | s, a): The probability of transitioning from state s to state s' under action a.

γ: The discount factor.

R(s, a): The expected reward received when action a is taken in state s.
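
For concreteness, these five elements can be gathered into a single container. The following is a minimal Python sketch; the class and field names (TrafficMDP, states, actions, transition_prob, discount, reward) are illustrative choices rather than part of any particular library.

    from typing import Callable, List, NamedTuple

    class TrafficMDP(NamedTuple):
        """Minimal container for the five MDP elements (illustrative names)."""
        states: List[str]                                  # S: the set of traffic states
        actions: List[str]                                 # A: the set of routing actions
        transition_prob: Callable[[str, str, str], float]  # P(s' | s, a)
        discount: float                                    # gamma, the discount factor
        reward: Callable[[str, str], float]                # R(s, a): expected reward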


Considering that the urban transport network is a fast-changing and complex system (the decision tree is hard to evaluate, and multiple decisions need to be made at the same time), a model-free Reinforcement Learning method is applied in this research. In this case, we use Q-learning with function approximation to learn the best vehicle distribution strategy for sustainable and resilient urban transportation.

A Q-learning algorithm is a model-free Reinforcement Learning method that can find an optimal action-selection policy for an MDP.

Q-learning operates on a general Markov Decision Process (MDP), as described below. Under an MDP with fixed transition probabilities and rewards, the Bellman optimality equation (Equation 1) characterises the optimal action-value function, from which the optimal policy follows:


Q(s, a) = R(s, a) + γ max_a' Q(s', a')    (1)


Where Q(s, a) is the immediate reward R(s, a) for taking action a in state s plus the discounted best utility of the resulting state, max_a' Q(s', a'). For each state-action pair (s, a), we initialise its value to zero, observe the current state s, and then repeat the following steps (a minimal sketch follows the list):

  • Pick an action a and execute it
  • Receive the immediate reward R
  • Observe the new state s'
  • Update Q(s, a) using the equation above with the discount factor γ
  • Set s = s'
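
Under the stated assumptions, the loop above can be written as a compact tabular update. The sketch below is illustrative: step(s, a) is an assumed environment hook returning (reward, next_state, done), and alpha and epsilon are an assumed learning rate and exploration rate, not values taken from this study.

    import random
    from collections import defaultdict

    def q_learning(states, actions, step, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        """Tabular Q-learning sketch; step(s, a) is an assumed environment hook."""
        Q = defaultdict(float)                   # Q(s, a) initialised to zero
        for _ in range(episodes):
            s = random.choice(states)            # observe the current state s
            done = False
            while not done:
                # Pick an action (epsilon-greedy) and execute it
                if random.random() < epsilon:
                    a = random.choice(actions)
                else:
                    a = max(actions, key=lambda x: Q[(s, x)])
                r, s_next, done = step(s, a)     # receive reward R, observe new state s'
                # Update Q(s, a) towards R(s, a) + gamma * max_a' Q(s', a')
                target = r + gamma * max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (target - Q[(s, a)])
                s = s_next                       # set s = s'
        return Q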

The Bellman equation plays an important role in Reinforcement Learning and is widely applied during policy evaluation. If the Q function is estimated correctly, the greedy policy π(s) derived from it becomes the optimal policy and selects actions according to Equation 2.


π(s) = arg max_a Q(s, a)    (2)
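
Equation 2 corresponds to a simple greedy read-out of the learned Q table. The short sketch below assumes the Q dictionary and action list from the previous example.

    def greedy_policy(Q, actions):
        """pi(s) = arg max_a Q(s, a), read directly from the learned Q table."""
        return lambda s: max(actions, key=lambda a: Q[(s, a)])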


With the Q-learning approach, a few key quantities are recorded at each simulation step: the state features, the action, the reward, and the next state features.

The main objective is to minimise the total travel time of every driver in the urban transportation network, which in turn reduces carbon dioxide emissions and fossil fuel consumption. The reward is inherited directly from this objective: to minimise the total travel time of the group of vehicles in the network. The action space, however, is rather simple; it contains only path selection. Given a predetermined threshold on average waiting times, the agent chooses which car to assign to a designated path so that the group's routes are optimised. The state features are the properties that describe the traffic situation; they have to reflect the congestion level of the urban transportation network. To achieve this, a neural network might be applied to “read” the situation of the transportation network.
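
As one possible way to “read” the network, a small neural network can map traffic state features (for example, queue lengths or average waiting times per link) to a Q value for each candidate path. The sketch below uses PyTorch; the feature dimension, layer sizes, and the choice of negative total travel time as the reward are assumptions made for illustration, not settings taken from this study.

    import torch
    import torch.nn as nn

    class QNetwork(nn.Module):
        """Q-function approximator: traffic state features in, one Q value per path out."""
        def __init__(self, n_state_features, n_paths, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_state_features, hidden),  # state features, e.g. link occupancies
                nn.ReLU(),
                nn.Linear(hidden, hidden),
                nn.ReLU(),
                nn.Linear(hidden, n_paths),           # one Q value per candidate path
            )

        def forward(self, state_features):
            return self.net(state_features)

    # Illustrative use: pick the path with the highest predicted Q value;
    # during training the reward would be the negative total travel time.
    q_net = QNetwork(n_state_features=32, n_paths=4)
    state = torch.rand(1, 32)                         # placeholder traffic features
    best_path = q_net(state).argmax(dim=1).item()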

In conclusion, this article explores the current issues in modern urban transport networks and how Reinforcement Learning could be applied to urban traffic distribution. By combining Reinforcement Learning with a neural network, we gain the ability to assess the state of the transport network and manage the traffic accordingly. Reinforcement Learning is a promising solution to this problem because it allows many candidate routings to be explored before the optimal one is delivered. Taking an action (in this case a routing instruction) changes the state of the transport network, from which we can determine whether to reward or punish the algorithm based on the relevant properties. This makes the approach well suited to dynamic control and, above all, to adapting to new conditions.