Hello!
I am currently working on a city simulation that I let an ML agent control. However, no matter how I structure the rewards and the config, the cumulative reward stays extremely noisy, and the agent usually ends up failing at its task. Any help or ideas would be greatly appreciated!
I am most concerned with my config file. Should the network settings (like normalization), the reward signals, and the time horizon be set up differently for my purposes?
Below I’ve explained how the project and the AI are set up.
How the project is set up:
- My goal for the AI is for it to lower the amount of emissions that the city has.
- The city consists of 8 different structures (2 houses, 2 buses, 3 power plants, and a tree), all with different costs and emissions. At the beginning of every episode, the population of the city increases and the AI receives a specified amount of money. The episode ends either when the AI runs out of money, or when the AI has built enough houses and buses to satisfy the city’s needs while having a non-negative net energy production.
- The population increase and the money received are always the same.
- Note: I originally wanted to let the AI create structures after every year (the population would increase, and the AI would need to create the necessary structures to progress to the next year, where it would then get more money and the population would again increase). But later (due to the failed results) I assumed that this was too complex and therefore made the new, simpler model.
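To make the episode-ending rule above concrete, here is a tiny Python sketch of it (just an illustration of the rule as described, not my actual game code):

```python
def episode_done(money, req_houses, req_buses, net_energy):
    """Illustration of the episode-ending rule described above."""
    bankrupt = money < 0
    # All required houses/buses built and net energy production non-negative.
    satisfied = req_houses <= 0 and req_buses <= 0 and net_energy >= 0
    return bankrupt or satisfied
```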
AI:
- The AI receives 5 observations: money, emissions, required houses, required buses, and the net energy production of the city. All of these stats update after each action.
- The AI has one branch of 8 discrete actions, which represent the 8 structures that it can build.
- If the agent runs out of money, a reward of -10000 is added and the episode ends. If the AI builds a house while the required amount is greater than 0, a reward of 10 is added (5 for a bus). If the AI builds a power plant while the net energy is less than 0, a similar reward (proportional to the amount of energy that the plant produces) is added.
- If the required structures have been built and the net energy production is non-negative, the AI receives a reward equal to 100,000,000 minus the city’s emissions, and the episode ends.
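Here is a rough Python sketch of that reward scheme in one place (the structure indices and ENERGY values are placeholders, not my real data):

```python
# Rough sketch of the reward logic described above (Python for brevity).
# The structure indices and ENERGY values are placeholders, not my real data.
HOUSE, N_HOUSE, BUS, E_BUS, WIND, SOLAR, COAL, TREE = range(8)
ENERGY = {WIND: 2.0, SOLAR: 1.0, COAL: 4.0}  # made-up production values
MAX_CO2 = 100_000_000

def step_reward(money, req_houses, req_buses, net_energy, emissions, action):
    """Returns (reward, episode_done) for one build action."""
    if money < 0:  # bankrupt: large penalty, episode ends
        return -10_000.0, True
    if req_houses <= 0 and req_buses <= 0 and net_energy >= 0:  # win
        return float(MAX_CO2 - emissions), True
    if action in (HOUSE, N_HOUSE) and req_houses > 0:
        return 10.0, False
    if action in (BUS, E_BUS) and req_buses > 0:
        return 5.0, False
    if action in (WIND, SOLAR, COAL) and net_energy < 0:
        # Proportional to the plant's energy output (relative to coal).
        return 10.0 * ENERGY[action] / ENERGY[COAL], False
    return 0.0, False
```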
Config file:
behaviors:
  OptimizeCity:
    trainer_type: ppo
    hyperparameters:
      batch_size: 10
      buffer_size: 100
      learning_rate: 3.0e-4
      beta: 5.0e-4
      epsilon: 0.2
      lambd: 0.99
      num_epoch: 3
      learning_rate_schedule: linear
      beta_schedule: constant
      epsilon_schedule: linear
    network_settings:
      normalize: true
      hidden_units: 128
      num_layers: 2
    reward_signals:
      extrinsic:
        gamma: 0.99
        strength: 1.0
      curiosity:
        strength: 0.01
    max_steps: 1e20
    time_horizon: 1000
    summary_freq: 10000
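Part of why I'm asking about gamma and time_horizon: with gamma = 0.99, the weight of a reward n steps in the future shrinks as 0.99^n (assuming I understand standard exponential discounting correctly), so the big terminal reward looks heavily discounted from early steps:

```python
# How strongly a reward n steps in the future is discounted with gamma = 0.99
# (standard exponential discounting, as I understand it).
gamma = 0.99
weight_100 = gamma ** 100    # roughly a third of the original reward
weight_1000 = gamma ** 1000  # nearly zero at my time_horizon of 1000
print(weight_100, weight_1000)
```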
AI script:
using UnityEngine;
using Unity.MLAgents;
using Unity.MLAgents.Actuators;
using Unity.MLAgents.Sensors;

public class AIScript : Agent
{
    [SerializeField] private GameManager gm;
    [SerializeField] private bool debug;

    private float optimizationMultiplier = 1;
    private float maxCO2 = 100000000;

    public override void OnEpisodeBegin()
    {
        gm.CreateCity();
    }

    public override void CollectObservations(VectorSensor sensor)
    {
        sensor.AddObservation(gm.Emissions);
        sensor.AddObservation(gm.Money);
        sensor.AddObservation(gm.ReqHouses);
        sensor.AddObservation(gm.ReqBusses);
        sensor.AddObservation(gm.NetEnergy);
    }

    public override void OnActionReceived(ActionBuffers actions)
    {
        int action = actions.DiscreteActions[0];
        gm.CreateStructure(action);

        // Bankrupt: large penalty and the episode ends.
        if (gm.Money < 0)
        {
            AddReward(-10000f);
            if (debug)
            {
                Debug.Log("Fail");
            }
            EndEpisode();
            return;
        }

        // Win: all required structures built and net energy non-negative.
        if (gm.ReqHouses <= 0 && gm.ReqBusses <= 0 && gm.NetEnergy >= 0)
        {
            float optimizationReward = (float)gm.Emissions * optimizationMultiplier;
            AddReward(maxCO2 - optimizationReward);
            if (debug)
            {
                Debug.Log("WIN");
            }
            gm.OnWin();
            EndEpisode();
            return;
        }

        // Per-action shaping rewards.
        if (action == gm.HouseInt || action == gm.NHouseInt)
        {
            if (gm.ReqHouses > 0)
            {
                AddReward(10f);
            }
        }
        else if (action == gm.BussInt || action == gm.EBussInt)
        {
            if (gm.ReqBusses > 0)
            {
                AddReward(5f);
            }
        }
        else if (action == gm.WindInt || action == gm.SolarInt || action == gm.CoalInt)
        {
            if (gm.NetEnergy < 0)
            {
                float reward = 10f * gm.Structures[action].EnergyCreated / gm.Structures[gm.CoalInt].EnergyCreated;
                AddReward(reward);
            }
        }
    }
}
Screenshot of the cumulative reward after 20 hours