
Master's Thesis: Part 4 of 4: Differentiable Simulations

  • Writer: Guining Pertin
  • Jan 3
  • 3 min read

Updated: Mar 24

//certain sections of the original thesis have been removed

Introduction

This set of four blog posts covers the work I performed for my Master's Thesis at KAIST, titled "Accelerating Policy Learning for Robust Control of Robotic Manipulators and Aerial Vehicles via Physics-Informed Guidance", under Professor Dong Eui Chang. The full thesis can be found at the KAIST library: THESIS, once it is publicly released. It has an in-depth explanation of the preliminary concepts, the simulation and control design (including both equations and intuition), and the experimental results. Please note that these posts do not cover the entire thesis in detail; they are instead built upon my thesis defence slides, which summarize the full approach. Parts of this thesis are still being utilized for other research work, and so some sections have been removed.


This is part four of the four-part series; this blog post covers the differentiable-simulation-based manipulator auto-tuning design.


Differentiable simulation is still fairly new, and there are review papers that explain it in far greater detail than I could here. In this post I will explain only some parts; in a future blog post, I plan to use simpler examples to teach the basics (specifically parameter estimation and policy learning).

Going beyond gradient estimation

In RL (here specifically actor-critic algorithms):

  • Actor performs action given the state of the system,

  • Environment shares reward with the critic; RL wants to maximize cumulative reward,

  • Critic estimates the value of actions, which is used to approximate the actor’s gradient.
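
As a rough sketch of that last point, here is how a deterministic actor-critic update estimates the actor's gradient through a learned critic. Everything below (network sizes, the DDPG-style update, the variable names) is illustrative and not the specific algorithm used in the thesis:

```python
import torch
import torch.nn as nn

# Illustrative networks; sizes and architectures are placeholders.
actor = nn.Sequential(nn.Linear(8, 64), nn.Tanh(), nn.Linear(64, 2))
critic = nn.Sequential(nn.Linear(8 + 2, 64), nn.Tanh(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

state = torch.randn(32, 8)     # a batch of states
action = actor(state)          # actor proposes actions
q_value = critic(torch.cat([state, action], dim=-1))

# The actor's gradient is only an estimate: it flows through the learned
# critic, never through the environment dynamics themselves.
actor_loss = -q_value.mean()
actor_opt.zero_grad()
actor_loss.backward()
actor_opt.step()
```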

Differentiable simulations for policy learning can be viewed as an extension:

  • Actor performs actions given the state of the system,

  • Environment shares reward; we want to maximize cumulative reward/minimize total loss,

  • We just differentiate the reward/loss with respect to the actor's parameters to obtain the analytic gradient.

Note that differentiable simulation for policy optimization extends the ideas of model-based techniques by backpropagating directly through a high-fidelity simulation rather than relying on probabilistic gradient estimation.
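
To make the contrast concrete, here is a minimal sketch of backpropagating through a differentiable rollout. The point-mass dynamics, horizon, and network are toy placeholders rather than the thesis simulation; the point is only that the loss gradient reaches the policy parameters analytically, with no critic in between:

```python
import torch
import torch.nn as nn

# Toy policy mapping (position, velocity) to a scalar force.
policy = nn.Sequential(nn.Linear(2, 32), nn.Tanh(), nn.Linear(32, 1))
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)

dt = 0.05
target = torch.tensor(1.0)
pos = torch.zeros(1)
vel = torch.zeros(1)

loss = torch.zeros(())
for _ in range(100):
    state = torch.cat([pos, vel])
    force = policy(state)                        # action from the policy
    vel = vel + dt * force                       # differentiable semi-implicit Euler step
    pos = pos + dt * vel
    loss = loss + ((pos - target) ** 2).sum()    # tracking loss over the rollout

# Backpropagation through time: analytic gradient of the rollout loss
# with respect to the policy parameters.
opt.zero_grad()
loss.backward()
opt.step()
```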

Simulation and control

We utilize Nvidia Warp+Newton to simulate the manipulator and the ship motion:

  • Manipulator simulation through the Featherstone algorithm with semi-implicit Euler integration,

  • Sea-state simulation rewritten as Warp kernels to allow backpropagation through the disturbance,

  • The FBPID controller and NDOB (from part 2) are reutilized, but with zero integral gain.


I utilized a custom, modified version of NVIDIA Newton for my work, with a custom Featherstone algorithm and Warp kernels for faster training. This version is still being used for other research work and hence is not public currently.
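
As a rough illustration of why the disturbance has to be written as Warp kernels, the sketch below shows a generic sinusoidal disturbance whose parameters receive gradients through wp.Tape. The wave form, array names, and values are placeholders and not the actual sea-state model:

```python
import warp as wp

wp.init()

@wp.kernel
def wave_disturbance(t: float,
                     amp: wp.array(dtype=float),
                     freq: wp.array(dtype=float),
                     force: wp.array(dtype=float)):
    # One thread per disturbance channel; the expression is differentiable,
    # so adjoints can flow back into amp and freq.
    i = wp.tid()
    force[i] = amp[i] * wp.sin(freq[i] * t)

amp = wp.array([1.0, 0.5], dtype=float, requires_grad=True)
freq = wp.array([0.3, 0.7], dtype=float, requires_grad=True)
force = wp.zeros(2, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(wave_disturbance, dim=2, inputs=[0.5, amp, freq, force])

# Seed the output adjoint and backpropagate through the kernel.
tape.backward(grads={force: wp.ones(2, dtype=float)})
print(amp.grad.numpy(), freq.grad.numpy())
```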


We utilize differentiable sim to optimize both the gain-tuning agent (π_ν) and the NDOB gain matrix (L). To ensure positive NDOB gains, we reparametrize the matrix with a learnable parameter Γ.
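
The exact reparameterization is not reproduced here; as one hedged example of how positivity can be enforced, a free parameter Γ can be passed through a softplus to build a diagonal, strictly positive gain matrix (the mapping and joint count below are assumptions for illustration only):

```python
import torch
import torch.nn.functional as F

n_joints = 6                                          # placeholder joint count
Gamma = torch.zeros(n_joints, requires_grad=True)     # unconstrained learnable parameter

def ndob_gain(Gamma):
    # softplus(x) > 0 for any real x, so the diagonal gain matrix L stays
    # positive no matter where the optimizer pushes Gamma.
    return torch.diag(F.softplus(Gamma))

L = ndob_gain(Gamma)   # differentiable w.r.t. Gamma, usable inside the rollout
```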

The loss function to minimize is constructed over the simulated rollout.

Truncated BPTT

Differentiable simulations utilize backpropagation through time (BPTT) to compute the analytic gradients from all time steps, all the way back to the initial step. However, due to the long horizon of length N, the gradients can explode or vanish over time. Since the PID gain-tuning agent acts every H steps for N/H blocks, we truncate the loss into short horizons by stopping the computation graph every H steps. The final parameter gradients are then accumulated over these truncated blocks,

and the parameters are updated using gradient descent: ν ← ν - η∇_ν L, Γ ← Γ - η∇_Γ L.
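
A minimal autograd sketch of this truncation pattern is shown below; rollout_block, params, and the block loss are hypothetical placeholders standing in for the gain-tuning agent, the NDOB parameter, and the simulated loss, not the thesis implementation:

```python
import torch

def truncated_bptt_update(params, rollout_block, init_state, N, H, lr):
    """One update with truncated BPTT: N simulation steps split into N/H blocks.

    rollout_block(state, H) is a hypothetical helper that runs H differentiable
    simulation steps with the current parameters and returns (block_loss, next_state).
    """
    state = init_state
    total_loss = 0.0
    for _ in range(N // H):
        block_loss, state = rollout_block(state, H)
        # Gradients from this block only; .backward() accumulates into .grad fields.
        block_loss.backward()
        total_loss += float(block_loss)
        # Cut the computation graph: the next block starts from a constant state,
        # so gradients never flow back further than H steps.
        state = state.detach()
    with torch.no_grad():
        for p in params:          # e.g. the agent parameters nu and the NDOB parameter Gamma
            p -= lr * p.grad      # plain gradient descent: p <- p - eta * grad
            p.grad.zero_()
    return total_loss
```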

Training and Simulation

(A greater part of the simulation design has been omitted here for now.) With the differentiable-simulation-based approach, the loss decreases rapidly within only 100 steps, with stable training and much higher sample efficiency than standard RL training. Training took about 30 minutes on a standard personal computer. We found the average PID gains for each joint to converge to physically meaningful values over time. We utilize IsaacSim for the training and evaluation rendering.

