I need help structuring an ML agent and its config file

Hello!

I am currently working on a city simulation that I let an ML agent control. However, no matter how I structure the rewards and the config, the agent's rewards are extremely noisy and it often ends up failing at its task. Any help or ideas would be greatly appreciated!

I am most concerned with my config file. Should the network settings (like normalization), the reward signals, and the time horizon be set up differently for my purposes?
Below I've explained how the project and the AI are set up:


How the project is set up:

  • My goal is for the AI to lower the amount of emissions that the city has.
    The city consists of 8 different structures (2 houses, 2 buses, 3 power plants, and a tree), all with different costs and emissions. At the beginning of every episode, the population of the city increases and the AI receives a specified amount of money. The episode ends either when the AI runs out of money, or when it has built enough houses and buses to satisfy the city's needs while having a positive net energy production.

  • The population increase and the money received are always the same.

  • Note: I originally wanted to let the AI create structures every year (the population would increase and the AI would need to create the necessary structures to progress to the next year, where it would then get more money and the population would increase again). But after the failed results I assumed that this was too complex, and therefore made the simpler model described above.


AI:

  • The AI receives 5 observations: money, emissions, required houses, required buses, and the net energy production of the city. All of these stats update after each AI action.

  • The AI has one branch of 8 discrete actions, which represent the 8 structures that the AI can build.
    If the agent runs out of money, a reward of -10 000 is added and the episode ends. If the AI builds a house or bus while the required amount is greater than 0, a reward of 10 (house) or 5 (bus) is added. If the AI builds a power plant while the net energy is less than 0, a similar reward (but proportional to the amount of energy that the plant produces) is added.

  • If the required amount of structures has been built and the net energy production is greater than 0, the AI receives a reward equal to 100 000 000 minus the emissions that the city has, and the episode ends.


Config file:

behaviors:
  OptimizeCity:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.01
    max_steps: 1e20
    time_horizon: 1000
    summary_freq: 10000

AI script:

using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;

public class AIScript : Agent
{
    [SerializeField] private GameManager gm;

    [SerializeField] private bool debug;

    private float optimizationMultiplier = 1;

    private float maxCO2 = 100000000; 

    public override void OnEpisodeBegin()
    {
        gm.CreateCity();
    }

    public override void CollectObservations(VectorSensor sensor)
    {       
        sensor.AddObservation(gm.Emissions);

        sensor.AddObservation(gm.Money);

        sensor.AddObservation(gm.ReqBusses);

        sensor.AddObservation(gm.ReqHouses);

        sensor.AddObservation(gm.NetEnergy);
        
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        int action = actions.DiscreteActions[0];

        gm.CreateStructure(action); 

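        // Ran out of money: large penalty and the episode ends.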
        if (gm.Money < 0)
        {
            AddReward(-10000f);

            if (debug)
            {
                Debug.Log("Fail");
            }

            EndEpisode();

            return;
        }

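        // All required structures built and net energy non-negative: big emission-based reward, then the episode ends.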
        if (gm.ReqHouses <= 0 && gm.ReqBusses <= 0 && gm.NetEnergy >= 0)
        {
            float optimizationReward = (float)gm.Emissions * optimizationMultiplier;

            AddReward(maxCO2 - optimizationReward);

            if (debug)
            {
                Debug.Log("WIN");
            }

            gm.OnWin();

            EndEpisode();

            return;
        }

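        // Shaping rewards for building structures that are currently needed.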
        if(action == gm.HouseInt || action == gm.NHouseInt)
        {
            if(gm.ReqHouses > 0)
            {
                AddReward(10f);
            }
        }
        else if (action == gm.BussInt || action == gm.EBussInt)
        {
            if (gm.ReqBusses > 0)
            {
                AddReward(5f);
            }
        }
        else if (action == gm.WindInt || action == gm.SolarInt || action == gm.CoalInt)
        {
            if (gm.NetEnergy < 0)
            {
                float reward = 10f * gm.Structures[action].EnergyCreated / gm.Structures[gm.CoalInt].EnergyCreated;

                AddReward(reward);
            }
        }
    }
}

Screenshot of the cumulative reward after 20 hours

Hey there, thank you for this extensive and detailed post…

From what I can see here, the most likely issue is your reward function.

In general, if I remember correctly, it is best to keep rewards normalized between -1 and 1. On top of that, I get the feeling that the reward from your emission calculation completely outweighs anything you give or take away during the building process. These should be balanced better, in my opinion.
Imagine it like this: once your emission level has passed a certain good or bad point, the agent no longer gets any direct feedback from the actions it took. So it might be better to give the emission-based reward a more nonlinear shape:

if level > some_bad_level → give -0.5 reward
else if level < some_really_good_level → give +0.5 reward
else → give ((some_bad_level - some_really_good_level) / current_level) reward

Then add, for example, 0.1 per good build/buy action and subtract the same for bad actions.

This should lead to a more stable reward and hopefully to a better learning result.

Also, reduce the penalty for running out of money to something like -10.
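If it helps, here is a rough C# sketch of that scheme inside your agent class. The thresholds badEmissionLevel and goodEmissionLevel and the WasUsefulBuild helper are placeholders you would have to adapt to your GameManager, and I have read the middle case as a linear interpolation between the two thresholds so the value stays within ±0.5:

    // Sketch only, not a drop-in replacement: the thresholds and the build check
    // below are hypothetical and need tuning for your city.
    private float badEmissionLevel = 500f;    // placeholder value
    private float goodEmissionLevel = 100f;   // placeholder value

    // Maps the current emission level to a reward in [-0.5, +0.5].
    private float EmissionReward(float emissions)
    {
        if (emissions > badEmissionLevel)
            return -0.5f;

        if (emissions < goodEmissionLevel)
            return 0.5f;

        // In between, interpolate linearly so the agent still feels a gradient.
        float t = (emissions - goodEmissionLevel) / (badEmissionLevel - goodEmissionLevel);
        return 0.5f - t;   // +0.5 near the good level, -0.5 near the bad level
    }

    // Roughly mirrors the checks you already have; the tree would need its own rule.
    private bool WasUsefulBuild(int action)
    {
        if (action == gm.HouseInt || action == gm.NHouseInt) return gm.ReqHouses > 0;
        if (action == gm.BussInt || action == gm.EBussInt) return gm.ReqBusses > 0;
        return gm.NetEnergy < 0;
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        int action = actions.DiscreteActions[0];

        gm.CreateStructure(action);

        // Small, symmetric shaping for the build itself.
        AddReward(WasUsefulBuild(action) ? 0.1f : -0.1f);

        if (gm.Money < 0)
        {
            AddReward(-10f);   // much smaller failure penalty
            EndEpisode();
            return;
        }

        if (gm.ReqHouses <= 0 && gm.ReqBusses <= 0 && gm.NetEnergy >= 0)
        {
            AddReward(EmissionReward((float)gm.Emissions));   // bounded win reward
            gm.OnWin();
            EndEpisode();
        }
    }

The exact shape matters less than keeping every reward in roughly the same order of magnitude, so the agent can actually tell its actions apart.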

Hope this helps - let me know if it worked. If it did not we can discuss what else could be done.

EDIT:

Something I just spotted in your config: you could also try raising the layer count to 4 or 5 while reducing the number of hidden units to 32. That should be more than enough for this case.
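In your config that would be something like this (the exact numbers are just a suggestion):

    network_settings:
      normalize: true
      hidden_units: 32
      num_layers: 4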