
Multi-Agent Multi-Objective Optimization with Deep Q-Learning (Bachelor's Thesis Project)

  • Writer: Guining Pertin
  • Jun 18, 2023
  • 5 min read

Updated: Jul 2, 2023

//certain sections of the original BTP have been removed

Introduction


This work was my final-year project under Dr. Hanumant Singh Shekhawat at EEE, IITG, carried out along with Devashish Taneja (ECE, IITG). The research targeted defense scenarios and disaster-management tasks where multiple "goal areas" with different "weights" are to be reached and completed by a group of agents within a time limit. The work primarily involved developing a model for this problem, building a simulation model to test the algorithms, and designing the control and planning algorithm for the same.


The final algorithm combined classical path planning with coordinated learning for the agents using Deep Q-Networks, along with a novel reduced input-space representation to improve the learning process. The project was well received by the evaluators and was awarded an AA grade.

Preliminaries


I will first go over the preliminary topics and algorithms upon which the final solution is based, namely Motion planning, Reinforcement learning, Deep Q-Learning and Multi-agent reinforcement learning.


1. Motion Planning
2. Reinforcement Learning
3. Deep Q-Learning
4. Multi-Agent Reinforcement Learning (MARL)

Problem Formulation


Natural disasters like floods and landslides are a common occurrence in the North Eastern states of India, yet few measures are taken before or during large-scale catastrophes like the Assam floods of 2020 and 2022. Our research started with finding different approaches to improve the conditions or reduce the problems during such events. We finally settled on the problem of finding optimal logistics and/or planning solutions for the disaster-struck locations, called goals, for rescuers (either robots or humans), called agents.


This brings us to a multi-goal, multi-agent scenario, where each goal can have a different importance, here called its weight (greater = more important/difficult), within a certain area of optimization, called the map/environment.


This goal-weights-agents-map problem is then mapped to a MARL problem:

Agents: multiple RL agents working on improving their rewards.

Goals: multiple locations that provide reward values to these agents.

Weights: maximum amount of reward each goal can provide.

Map: an NxN matrix with obstacles, goals and agent locations.

Simulation Model


We first developed the environment/map using the OpenAI Gym library in Python, a popular toolkit for testing RL algorithms. We used its wrapper classes to develop our own environment from scratch, allowing us to have full control over the environment variables. Although implementation details have been removed, overall this outputs the environment as follows:


The size of this NxN matrix can be changed, along with the number of agents, the number of goals, and the weights assigned to each goal. Movable regions are shown in dark grey, agents in purple, and goals in yellow-orange-red depending on their assigned weights.
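
Since the implementation details have been removed, here is a bare-bones sketch of how such a custom environment can be wrapped with Gym; the class name, cell codes, and default parameters are illustrative choices for this post, not the thesis code:

    import numpy as np
    import gym
    from gym import spaces

    class MultiGoalGridEnv(gym.Env):
        """Minimal NxN grid world with obstacles, weighted goals and agents.
        Cell codes (illustrative): 0 = obstacle, 1 = free, 2 = agent, 3 = goal."""

        def __init__(self, n=20, n_agents=2, n_goals=3, total_weight=100):
            super().__init__()
            self.n, self.n_agents, self.n_goals = n, n_agents, n_goals
            self.total_weight = total_weight
            # one discrete action per agent: 0 = stop, 1..K = go towards goal k
            self.action_space = spaces.MultiDiscrete([n_goals + 1] * n_agents)
            self.observation_space = spaces.Box(low=0, high=3, shape=(n, n), dtype=np.uint8)

        def reset(self):
            # map generation and goal/agent placement would go here
            self.grid = np.zeros((self.n, self.n), dtype=np.uint8)
            return self.grid

        def step(self, action):
            # move agents, decrement remaining goal weights, compute the reward
            reward, done, info = 0.0, False, {}
            return self.grid, reward, done, info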


The final goal for the agents is to complete all these objectives as soon as possible under the following constraints:

  • Time taken to complete an objective is equal to the weight assigned to it, e.g. a goal with weight 50 will take 50 time steps to complete.

  • Rate of completion of an objective is equal to the number of agents overlapping with it, e.g. with two agents, the goal in the above example will take 25 time steps instead.
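
To make these two rules concrete, a tiny helper capturing them could look as follows (a sketch; it assumes the remaining weight drops by the number of overlapping agents per step and rounds up):

    import math

    def steps_to_complete(weight, agents_on_goal):
        # time steps needed to finish one goal with the given number of overlapping agents
        if agents_on_goal == 0:
            return math.inf              # nobody is working on it
        return math.ceil(weight / agents_on_goal)

    print(steps_to_complete(50, 1))      # 50
    print(steps_to_complete(50, 2))      # 25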

For each run of the simulation we need to generate random maps, where the goal locations and weights are also varied randomly. For map generation we implemented an improved version of the random-walk algorithm. This improved algorithm generates better-quality maps in terms of clustered empty cells, so that every non-obstacle point in space can be reached (a fully connected graph). The algorithm is described below, with a code sketch after the list:

  • Initialize: parameters num_tunnels, max_length of tunnel

  • Initialize: map with only walls, a random start location and direction

  • Loop: until num_tunnels are created

    • Select: random direction perpendicular to previous direction

    • Select: random length < max_length

    • Move: move in the selected direction for given length

    • Convert: visited cells to valid empty cells

We then sample goal locations from these empty cells and assign them random weights that add up to a particular limit value.
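
Goal placement on top of such a map could then be done as below (again a sketch; the weight-splitting scheme here is an assumption, not necessarily the one used in the thesis):

    import random
    import numpy as np

    def sample_goals(grid, n_goals=3, total_weight=100, seed=None):
        # pick goal cells from the empty cells and split a weight budget among them
        rng = random.Random(seed)
        empty = list(zip(*np.nonzero(grid == 1)))          # all empty (row, col) cells
        locations = rng.sample(empty, n_goals)
        # n_goals random positive integers that sum to total_weight
        cuts = sorted(rng.sample(range(1, total_weight), n_goals - 1))
        weights = [b - a for a, b in zip([0] + cuts, cuts + [total_weight])]
        return dict(zip(locations, weights))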

Proposed Algorithm (Removed details)


It was initially planned to make the algorithm independent of map size and the number of agents and goals, but due to time constraints, we could only solve the first problem.

In any RL problem, we start by defining the reward function, the states, and the actions possible in each state. Given these, we can develop the algorithm for the optimal policy.


The extrinsic reward function can be defined as follows:

  • +goal_weight reward at the completion of that goal.

  • +avg_weight reward at the completion of all goals.

  • +0 for all other time steps.
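
Written out in code, this sparse extrinsic reward is straightforward (a sketch with illustrative names, where weights maps each goal to its assigned weight):

    def extrinsic_reward(completed_now, all_done, weights):
        # completed_now: goals finished at this time step; all_done: True once every goal is done
        reward = sum(weights[g] for g in completed_now)    # +goal_weight at the completion of that goal
        if all_done:
            reward += sum(weights.values()) / len(weights) # +avg_weight once all goals are complete
        return reward                                      # 0 for all other time steps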


Other RL algorithms dealing with similar agent-goal scenarios often take the entire map as the input state and output actions in the [up, down, left, right] space. But this approach means the agents must relearn motion planning from scratch while also learning to coordinate with one another; the logistics of which goal to solve first are then out of the question. Another problem with this approach is the presence of fixed obstacles, which provide no information to the learning algorithm and also fix the map size, since any form of filtering/convolution on the NxN map would lose the data on certain obstacles.


Our solution was to use an existing (and highly researched) motion planning algorithm on the map to generate a feature vector describing the current status of the environment with respect to the agent and goal locations.


Our state is then defined as this feature vector, along with information on the current goal weights and the previous states.
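
To illustrate what such a feature vector can look like, the sketch below uses a plain grid BFS as a stand-in for the planner and concatenates each agent's path length to every goal with the remaining goal weights; the thesis' actual planner, feature layout, and handling of previous states have been omitted:

    from collections import deque
    import numpy as np

    def bfs_distances(grid, start):
        # shortest obstacle-free path length from start to every free cell (1 = free)
        n = grid.shape[0]
        dist = np.full((n, n), np.inf)
        dist[start] = 0
        queue = deque([start])
        while queue:
            r, c = queue.popleft()
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < n and 0 <= nc < n and grid[nr, nc] == 1 and np.isinf(dist[nr, nc]):
                    dist[nr, nc] = dist[r, c] + 1
                    queue.append((nr, nc))
        return dist

    def state_features(grid, agents, goals, remaining_weights):
        # reduced state: per-agent path lengths to each goal plus the remaining goal weights
        feats = []
        for agent in agents:
            dist = bfs_distances(grid, agent)
            feats.extend(dist[g] for g in goals)
        feats.extend(remaining_weights)
        return np.array(feats, dtype=np.float32)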

The action for each agent is an integer corresponding to:

  • 0 - stop

  • 1 - go towards objective 1

  • ...

  • K - go towards objective K

This removes the problem of learning motion planning for the shortest path, since the direction towards each goal has already been captured during the generation of the states.


Our agent is then a branched dueling double Q-learning network (BDDQN), which learns the optimal Q-function and generates the optimal policy for our problem.
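
The network details have been omitted from this post, but the core idea of an action-branching dueling architecture can be sketched in PyTorch as follows (one advantage branch per agent; the double-Q target-network machinery and the training loop are left out):

    import torch.nn as nn

    class BranchedDuelingQNet(nn.Module):
        # shared trunk, one state-value head, one advantage branch per agent
        def __init__(self, state_dim, n_agents, n_actions, hidden=128):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.value = nn.Linear(hidden, 1)
            self.branches = nn.ModuleList(
                [nn.Linear(hidden, n_actions) for _ in range(n_agents)])

        def forward(self, state):
            h = self.trunk(state)
            v = self.value(h)
            # dueling combination per branch: Q = V + (A - mean(A))
            return [v + b(h) - b(h).mean(dim=-1, keepdim=True) for b in self.branches]

Each branch then greedily picks the action for its agent, i.e. stop or head towards one of the K goals.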

The agent is also provided with an intrinsic reward function to interpolate over the time steps with zero extrinsic reward. Other details have been omitted, but the complete system can now be described as follows:


Results


The algorithm was trained on a system with a GTX 1050 Ti and a 7th-gen Intel i7, which made it impractical to train beyond 2500 iterations for the different scenarios. Still, a few important behaviors emerged from the training process and are described below.


For a single-agent, multiple-objective configuration, you can see that it finds the optimal plan, i.e. solving the closest and most important goal first (time steps run left to right, top to bottom):


In the multiple-agent, multiple-objective scenario, the algorithm behaved differently depending on the time limit set for each instance of the simulation.


On setting a low time limit, most agents demonstrated the common behavior of completing at least a single goal as soon as possible. The green, blue, and orange plots represent n_agents > n_goals, n_agents = n_goals, and n_agents < n_goals respectively, with iterations along the x-axis and total reward along the y-axis.


On setting a high time limit, the agents moved independently to the nearest goals in most cases. When n_agents >= n_goals, the leftover agents often refused to move, but all the other agents moved towards their single nearest goals and hence ended the game with comparatively higher total rewards (green and blue). In the orange case, however, all agents moved towards the most important goals, leaving the less important goals behind and hence a lower overall reward.


 
 
 



"The best way to predict the future is to invent it" ~ Alan Kay
