Homework 12 - Reinforcement Learning

Deep Learning

notebook

python

goujiaxin

Published on 2024-04-11

Recommended image: Basic Image:bohrium-notebook:2023-04-07

Recommended machine type: c4_m15_1 * NVIDIA T4

If you have any problems, e-mail us at ntu-ml-2022spring-ta@googlegroups.com

Preliminary work

First, we need to install all necessary packages. One of them, gym, built by OpenAI, is a toolkit for developing reinforcement learning algorithms. The other packages are for visualization in Colab.

[1]

!apt update
!apt install python-opengl xvfb -y
#!pip install gym[box2d]==0.18.3 pyvirtualdisplay tqdm numpy==1.20 torch==1.8.1
!pip install -q swig
!pip install box2d==2.3.2 gym[box2d]==0.25.2 box2d-py pyvirtualdisplay tqdm numpy==1.22.4
!pip install box2d==2.3.2 box2d-kengz
!pip freeze > requirements.txt

0 upgraded, 19 newly installed, 0 to remove and 163 not upgraded.
...
Requirement already satisfied: gym[box2d]==0.25.2 in /opt/conda/lib/python3.8/site-packages (0.25.2)
Requirement already satisfied: tqdm in /opt/conda/lib/python3.8/site-packages (4.64.1)
Requirement already satisfied: numpy==1.22.4 in /opt/conda/lib/python3.8/site-packages (1.22.4)
...
Successfully installed box2d-2.3.2 box2d-py-2.3.5 pygame-2.1.0 pyvirtualdisplay-3.0
...
Successfully installed box2d-kengz-2.3.3

Next, set up a virtual display and import all necessary packages.

[2]

%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm.notebook import tqdm

Warning! Do not change the random seed!

Otherwise, your submission on JudgeBoi will not reproduce your result!

Make your homework results reproducible.
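Why a fixed seed matters can be seen in a small standalone sketch. It uses only Python's random module instead of gym or torch, and the helper name draw is hypothetical, introduced here purely for illustration. The same idea applies to every generator that the notebook seeds.

```python
import random


def draw(seed, n=5):
    """Return n pseudo-random action indices drawn from a freshly seeded generator."""
    rng = random.Random(seed)  # private generator, global state stays untouched
    return [rng.randint(0, 3) for _ in range(n)]


first = draw(543)
second = draw(543)

# The same seed yields the same sequence, which is what makes a run reproducible.
print(first == second)
```

Changing the seed, or consuming random numbers in a different order, generally produces a different sequence, which is why a grading server can only reproduce a result when the seed is left alone.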

[3]

import random  # needed by fix() below

seed = 543 # Do not change this
def fix(env, seed):
    # Seed every source of randomness so that runs are reproducible
    env.seed(seed)
    env.action_space.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    #torch.set_deterministic(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True

Lastly, import gym and build a Lunar Lander environment.

[4]

%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment. Do not revise this!

What is Lunar Lander?

"LunarLander-v2" simulates a craft landing on the surface of the moon.

The task is to land the craft "safely" on the pad between the two yellow flags.

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector.

"LunarLander-v2" actually includes both an "Agent" and an "Environment".

In this homework, we will use the step() function to control the action of the "Agent".

step() then returns the observation/state and the reward given by the "Environment".

Observation / State

First, we can take a look at what an Observation / State looks like.

[5]

print(env.observation_space)

Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)

Box(8,) means that the observation is an 8-dimensional vector.
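To make the eight numbers easier to read, they can be given names. The sketch below is an illustration only: the class LanderState and its field names are hypothetical and follow the component order described in the Gym documentation for LunarLander-v2 (position, velocity, angle, angular velocity, and two leg-contact flags). The sample values are copied from a reset() output in this notebook.

```python
from typing import NamedTuple


class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dimensional observation."""
    x: float                  # horizontal position (landing pad at 0)
    y: float                  # vertical position (landing pad at 0)
    vx: float                 # horizontal velocity
    vy: float                 # vertical velocity
    angle: float              # lander angle in radians
    angular_velocity: float   # rate of rotation
    left_leg_contact: float   # 1.0 if the left leg touches the ground
    right_leg_contact: float  # 1.0 if the right leg touches the ground


raw = [-1.2619973e-03, 1.3984586e+00, -1.2784091e-01, -5.5384123e-01,
       1.4691149e-03, 2.8957864e-02, 0.0, 0.0]
state = LanderState(*raw)

print(state.y, state.left_leg_contact)
```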

Action

The actions the agent can take look like this:

[6]

print(env.action_space)

Discrete(4)

Discrete(4) implies that there are four kinds of actions the agent can take.

0 means the agent does nothing
2 means the agent fires the main engine, which thrusts downward and pushes the lander upward
1, 3 mean the agent fires the left and right orientation engines, respectively
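The integer codes are easy to mix up, so a named mapping can help when reading logs. This is a minimal sketch: the enum LanderAction and the helper describe are hypothetical names, and the meanings follow the Gym documentation for LunarLander-v2, where action 2 fires the main engine and actions 1 and 3 fire the left and right orientation engines.

```python
from enum import IntEnum


class LanderAction(IntEnum):
    """Hypothetical names for the four discrete action codes."""
    NOOP = 0
    FIRE_LEFT_ENGINE = 1
    FIRE_MAIN_ENGINE = 2
    FIRE_RIGHT_ENGINE = 3


def describe(action):
    """Translate an integer action code into a readable name."""
    return LanderAction(action).name


print([describe(a) for a in range(4)])
```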

Next, we will make the agent interact with the environment. Before taking any actions, we recommend calling the reset() function to reset the environment. This function also returns the initial state of the environment.

[7]

initial_state = env.reset()

print(initial_state)

[-1.2619973e-03  1.3984586e+00 -1.2784091e-01 -5.5384123e-01
  1.4691149e-03  2.8957864e-02  0.0000000e+00  0.0000000e+00]

Then, we try to get a random action from the agent's action space.

[8]

random_action = env.action_space.sample()

print(random_action)

Furthermore, we can use step() to make the agent act according to the randomly selected random_action. The step() function returns four values:

observation / state
reward
done (True / False)
other information
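The reset() and step() pattern with its four return values can be tried without installing gym. The class ToyLander below is a hypothetical one-dimensional stand-in, not the real LunarLander-v2; its rewards are invented for illustration and only the shape of the interface mirrors the environment used in this notebook.

```python
import random


class ToyLander:
    """Hypothetical 1-D environment exposing reset() and a four-value step()."""

    def __init__(self):
        self.height = 0

    def reset(self):
        self.height = 10
        return (self.height,)  # initial observation

    def step(self, action):
        # action 0: fall by one unit; action 1: fire the engine and hover
        if action == 0:
            self.height -= 1
        reward = -0.3 if action == 1 else 0.0  # firing costs a little
        done = self.height <= 0                # episode ends on touchdown
        if done:
            reward += 100.0
        info = {}                              # extra diagnostics, empty here
        return (self.height,), reward, done, info


env_toy = ToyLander()
rng = random.Random(543)
state = env_toy.reset()
done, total, steps = False, 0.0, 0
while not done:
    action = rng.choice([0, 1])  # random agent
    state, reward, done, info = env_toy.step(action)
    total += reward
    steps += 1

print(steps, done)
```

The loop has the same structure as an episode in the real environment: act, receive the four values, accumulate the reward, and stop when done becomes True.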

[9]

observation, reward, done, info = env.step(random_action)

[10]

print(done)

False

Reward

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. The reward for moving from the top of the screen to the landing pad and coming to zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points, respectively. Each leg's ground contact is worth +10 points. Firing the main engine costs 0.3 points each frame. The environment counts as solved at 200 points.
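A rough tally shows how these terms add up over an episode. The function rough_episode_return and its parameters are hypothetical, and the sketch only sums the terms listed above; the simulator's actual reward shaping is computed frame by frame and is more detailed.

```python
SOLVED_THRESHOLD = 200    # "solved" score
MAIN_ENGINE_COST = 0.3    # per frame with the main engine firing
LEG_CONTACT_BONUS = 10    # per leg touching the ground
LANDING_BONUS = 100       # lander comes to rest
CRASH_PENALTY = -100      # lander crashes


def rough_episode_return(approach_reward, main_engine_frames, legs_down, crashed):
    """Sum the reward terms from the description; an approximation only."""
    total = approach_reward
    total -= MAIN_ENGINE_COST * main_engine_frames
    total += LEG_CONTACT_BONUS * legs_down
    total += CRASH_PENALTY if crashed else LANDING_BONUS
    return total


# A clean landing: good approach, 100 frames of main engine, both legs down.
good = rough_episode_return(120, 100, 2, crashed=False)
# A crash: poor approach, same fuel use, no leg contact.
bad = rough_episode_return(40, 100, 0, crashed=True)

print(good >= SOLVED_THRESHOLD, bad >= SOLVED_THRESHOLD)
```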

[11]

print(reward)

-1.0511407416545058

Random Agent

Finally, before we start training, we can check whether a random agent can land on the moon successfully.

[12]

env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)

    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)

/opt/conda/lib/python3.8/site-packages/gym/core.py:43: DeprecationWarning: WARN: The argument mode in render method is deprecated; use render_mode during environment initialization instead.
See here for more information: https://www.gymlibrary.ml/content/api/
  deprecation(

Policy Gradient

Now, we can build a simple policy network. The network outputs a probability distribution over the actions in the action space.

[13]

class PolicyGradientNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)

Then, we need to build a simple agent. The agent acts according to the output of the policy network above. The agent can do the following:

learn(): update the policy network from log probabilities and rewards.
sample(): after receiving an observation from the environment, use the policy network to decide which action to take. This function returns the action and its log probability.
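What sample() and learn() compute can be sketched in plain Python, without torch. The helper names softmax, sample_action and policy_gradient_loss are hypothetical; the loss shown is the negative sum of log probability times reward, which is the quantity the notebook's agent minimizes. No gradient step is performed here, so this is an illustration of the arithmetic only.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def sample_action(probs, rng):
    """Draw an action index from probs and return it with its log probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])


def policy_gradient_loss(log_probs, rewards):
    """Scalar loss: -sum(log_prob * reward)."""
    return -sum(lp * r for lp, r in zip(log_probs, rewards))


rng = random.Random(543)
probs = softmax([0.1, 0.2, 0.3, 0.4])  # stand-in for the network output
action, log_prob = sample_action(probs, rng)
loss = policy_gradient_loss([log_prob], [1.0])

print(action, log_prob, loss)
```

With a positive reward, lowering this loss raises the log probability of the sampled action, so actions followed by high reward become more likely.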

[14]

from torch.optim.lr_scheduler import StepLR
class PolicyGradientAgent():

    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)

    def forward(self, state):
        return self.network(state)

    def learn(self, log_probs, rewards):
        loss = (-log_probs * rewards).sum() # You don't need to revise this to pass simple baseline (but you can)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def sample(self, state):
        action_prob = self.network(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob

Lastly, build a network and agent to start training.

[15]

network = PolicyGradientNetwork()

agent = PolicyGradientAgent(network)

Training Agent

Now let's start to train our agent. By taking all the interactions between the agent and the environment as training data, the policy network can learn from all these attempts.
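The comments in the training cell suggest replacing the immediate reward with an accumulated decaying reward using a factor of 0.99. A minimal sketch of that computation, together with the mean and standard deviation normalization the training cell applies, is given below in plain Python. The function names discounted_returns and normalize are hypothetical, and this is one possible way to do it, not the reference solution.

```python
def discounted_returns(rewards, gamma=0.99):
    """For each step t, compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ..."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the last step backward
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def normalize(values, eps=1e-9):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / (std + eps) for v in values]


episode_rewards = [1.0, 0.0, -1.0, 2.0]  # made-up rewards of one short episode
returns = discounted_returns(episode_rewards)
normalized = normalize(returns)

print(returns)
```

Because a batch contains several episodes, the accumulation would have to be done per episode and restarted whenever done is True, so that rewards from one episode do not leak into the previous one.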

[18]

agent.network.train() # Switch network into training mode
EPISODE_PER_BATCH = 5 # update the agent every 5 episodes
NUM_BATCH = 500 # update the agent 500 times in total

avg_total_rewards, avg_final_rewards = [], []

prg_bar = tqdm(range(NUM_BATCH))
for batch in prg_bar:

    log_probs, rewards = [], []
    total_rewards, final_rewards = [], []

    # collect trajectory
    for episode in range(EPISODE_PER_BATCH):

        state = env.reset()
        total_reward, total_step = 0, 0
        seq_rewards = []
        while True:

            action, log_prob = agent.sample(state) # at, log(at|st)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob) # [log(a1|s1), log(a2|s2), ...., log(at|st)]
            # seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            rewards.append(reward) # change here
            # ! IMPORTANT !
            # Current reward implementation: immediate reward, given action_list : a1, a2, a3 ......
            # rewards : r1, r2 ,r3 ......
            # medium: change "rewards" to accumulative decaying reward, given action_list : a1, a2, a3, ......
            # rewards : r1+0.99*r2+0.99^2*r3+......, r2+0.99*r3+0.99^2*r4+...... , r3+0.99*r4+0.99^2*r5+ ......
            # boss : implement Actor-Critic
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                break

    print(f"rewards looks like ", np.shape(rewards))
    #print(f"log_probs looks like ", np.shape(log_probs))
    # record training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")

    # update agent
    # rewards = np.concatenate(rewards, axis=0)
    rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-9) # normalize the reward
    agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))
    print("logs prob looks like ", torch.stack(log_probs).size())
    print("torch.from_numpy(rewards) looks like ", torch.from_numpy(rewards).size())

rewards looks like  (467,)
logs prob looks like  torch.Size([467])
torch.from_numpy(rewards) looks like  torch.Size([467])
rewards looks like  (460,)
logs prob looks like  torch.Size([460])
torch.from_numpy(rewards) looks like  torch.Size([460])
rewards looks like  (493,)
logs prob looks like  torch.Size([493])
torch.from_numpy(rewards) looks like  torch.Size([493])
rewards looks like  (426,)
logs prob looks like  torch.Size([426])
torch.from_numpy(rewards) looks like  torch.Size([426])
rewards looks like  (415,)
logs prob looks like  torch.Size([415])
torch.from_numpy(rewards) looks like  torch.Size([415])
rewards looks like  (504,)
logs prob looks like  torch.Size([504])
torch.from_numpy(rewards) looks like  torch.Size([504])
rewards looks like  (466,)
logs prob looks like  torch.Size([466])
torch.from_numpy(rewards) looks like  torch.Size([466])
rewards looks like  (475,)
logs prob looks like  torch.Size([475])
torch.from_numpy(rewards) looks like  torch.Size([475])
rewards looks like  (513,)
logs prob looks like  torch.Size([513])
torch.from_numpy(rewards) looks like  torch.Size([513])
rewards looks like  (618,)
logs prob looks like  torch.Size([618])
torch.from_numpy(rewards) looks like  torch.Size([618])
rewards looks like  (533,)
logs prob looks like  torch.Size([533])
torch.from_numpy(rewards) looks like  torch.Size([533])
rewards looks like  (475,)
logs prob looks like  torch.Size([475])
torch.from_numpy(rewards) looks like  torch.Size([475])
rewards looks like  (465,)
logs prob looks like  torch.Size([465])
torch.from_numpy(rewards) looks like  torch.Size([465])
rewards looks like  (1396,)
logs prob looks like  torch.Size([1396])
torch.from_numpy(rewards) looks like  torch.Size([1396])
rewards looks like  (541,)
logs prob looks like  torch.Size([541])
torch.from_numpy(rewards) looks like  torch.Size([541])
rewards looks like  (400,)
logs prob looks like  torch.Size([400])
torch.from_numpy(rewards) looks like  torch.Size([400])
rewards looks like  (541,)
logs prob looks like  torch.Size([541])
torch.from_numpy(rewards) looks like  torch.Size([541])
rewards looks like  (478,)
logs prob looks like  torch.Size([478])
torch.from_numpy(rewards) looks like  torch.Size([478])
rewards looks like  (491,)
logs prob looks like  torch.Size([491])
torch.from_numpy(rewards) looks like  torch.Size([491])
rewards looks like  (599,)
logs prob looks like  torch.Size([599])
torch.from_numpy(rewards) looks like  torch.Size([599])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (787,)
logs prob looks like  torch.Size([787])
torch.from_numpy(rewards) looks like  torch.Size([787])
rewards looks like  (656,)
logs prob looks like  torch.Size([656])
torch.from_numpy(rewards) looks like  torch.Size([656])
rewards looks like  (574,)
logs prob looks like  torch.Size([574])
torch.from_numpy(rewards) looks like  torch.Size([574])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (542,)
logs prob looks like  torch.Size([542])
torch.from_numpy(rewards) looks like  torch.Size([542])
rewards looks like  (558,)
logs prob looks like  torch.Size([558])
torch.from_numpy(rewards) looks like  torch.Size([558])
rewards looks like  (565,)
logs prob looks like  torch.Size([565])
torch.from_numpy(rewards) looks like  torch.Size([565])
rewards looks like  (463,)
logs prob looks like  torch.Size([463])
torch.from_numpy(rewards) looks like  torch.Size([463])
rewards looks like  (551,)
logs prob looks like  torch.Size([551])
torch.from_numpy(rewards) looks like  torch.Size([551])
rewards looks like  (580,)
logs prob looks like  torch.Size([580])
torch.from_numpy(rewards) looks like  torch.Size([580])
rewards looks like  (694,)
logs prob looks like  torch.Size([694])
torch.from_numpy(rewards) looks like  torch.Size([694])
rewards looks like  (537,)
logs prob looks like  torch.Size([537])
torch.from_numpy(rewards) looks like  torch.Size([537])
rewards looks like  (639,)
logs prob looks like  torch.Size([639])
torch.from_numpy(rewards) looks like  torch.Size([639])
rewards looks like  (519,)
logs prob looks like  torch.Size([519])
torch.from_numpy(rewards) looks like  torch.Size([519])
rewards looks like  (657,)
logs prob looks like  torch.Size([657])
torch.from_numpy(rewards) looks like  torch.Size([657])
rewards looks like  (647,)
logs prob looks like  torch.Size([647])
torch.from_numpy(rewards) looks like  torch.Size([647])
rewards looks like  (554,)
logs prob looks like  torch.Size([554])
torch.from_numpy(rewards) looks like  torch.Size([554])
rewards looks like  (558,)
logs prob looks like  torch.Size([558])
torch.from_numpy(rewards) looks like  torch.Size([558])
rewards looks like  (1382,)
logs prob looks like  torch.Size([1382])
torch.from_numpy(rewards) looks like  torch.Size([1382])
rewards looks like  (500,)
logs prob looks like  torch.Size([500])
torch.from_numpy(rewards) looks like  torch.Size([500])
rewards looks like  (575,)
logs prob looks like  torch.Size([575])
torch.from_numpy(rewards) looks like  torch.Size([575])
rewards looks like  (576,)
logs prob looks like  torch.Size([576])
torch.from_numpy(rewards) looks like  torch.Size([576])
rewards looks like  (510,)
logs prob looks like  torch.Size([510])
torch.from_numpy(rewards) looks like  torch.Size([510])
rewards looks like  (703,)
logs prob looks like  torch.Size([703])
torch.from_numpy(rewards) looks like  torch.Size([703])
rewards looks like  (509,)
logs prob looks like  torch.Size([509])
torch.from_numpy(rewards) looks like  torch.Size([509])
rewards looks like  (580,)
logs prob looks like  torch.Size([580])
torch.from_numpy(rewards) looks like  torch.Size([580])
rewards looks like  (1475,)
logs prob looks like  torch.Size([1475])
torch.from_numpy(rewards) looks like  torch.Size([1475])
rewards looks like  (729,)
logs prob looks like  torch.Size([729])
torch.from_numpy(rewards) looks like  torch.Size([729])
rewards looks like  (589,)
logs prob looks like  torch.Size([589])
torch.from_numpy(rewards) looks like  torch.Size([589])
rewards looks like  (494,)
logs prob looks like  torch.Size([494])
torch.from_numpy(rewards) looks like  torch.Size([494])
rewards looks like  (511,)
logs prob looks like  torch.Size([511])
torch.from_numpy(rewards) looks like  torch.Size([511])
rewards looks like  (816,)
logs prob looks like  torch.Size([816])
torch.from_numpy(rewards) looks like  torch.Size([816])
rewards looks like  (562,)
logs prob looks like  torch.Size([562])
torch.from_numpy(rewards) looks like  torch.Size([562])
rewards looks like  (827,)
logs prob looks like  torch.Size([827])
torch.from_numpy(rewards) looks like  torch.Size([827])
rewards looks like  (747,)
logs prob looks like  torch.Size([747])
torch.from_numpy(rewards) looks like  torch.Size([747])
rewards looks like  (804,)
logs prob looks like  torch.Size([804])
torch.from_numpy(rewards) looks like  torch.Size([804])
rewards looks like  (555,)
logs prob looks like  torch.Size([555])
torch.from_numpy(rewards) looks like  torch.Size([555])
rewards looks like  (786,)
logs prob looks like  torch.Size([786])
torch.from_numpy(rewards) looks like  torch.Size([786])
rewards looks like  (536,)
logs prob looks like  torch.Size([536])
torch.from_numpy(rewards) looks like  torch.Size([536])
rewards looks like  (680,)
logs prob looks like  torch.Size([680])
torch.from_numpy(rewards) looks like  torch.Size([680])
rewards looks like  (721,)
logs prob looks like  torch.Size([721])
torch.from_numpy(rewards) looks like  torch.Size([721])
rewards looks like  (664,)
logs prob looks like  torch.Size([664])
torch.from_numpy(rewards) looks like  torch.Size([664])
rewards looks like  (916,)
logs prob looks like  torch.Size([916])
torch.from_numpy(rewards) looks like  torch.Size([916])
rewards looks like  (1148,)
logs prob looks like  torch.Size([1148])
torch.from_numpy(rewards) looks like  torch.Size([1148])
rewards looks like  (644,)
logs prob looks like  torch.Size([644])
torch.from_numpy(rewards) looks like  torch.Size([644])
rewards looks like  (671,)
logs prob looks like  torch.Size([671])
torch.from_numpy(rewards) looks like  torch.Size([671])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (865,)
logs prob looks like  torch.Size([865])
torch.from_numpy(rewards) looks like  torch.Size([865])
rewards looks like  (621,)
logs prob looks like  torch.Size([621])
torch.from_numpy(rewards) looks like  torch.Size([621])
rewards looks like  (772,)
logs prob looks like  torch.Size([772])
torch.from_numpy(rewards) looks like  torch.Size([772])
rewards looks like  (720,)
logs prob looks like  torch.Size([720])
torch.from_numpy(rewards) looks like  torch.Size([720])
rewards looks like  (972,)
logs prob looks like  torch.Size([972])
torch.from_numpy(rewards) looks like  torch.Size([972])
rewards looks like  (979,)
logs prob looks like  torch.Size([979])
torch.from_numpy(rewards) looks like  torch.Size([979])
rewards looks like  (1539,)
logs prob looks like  torch.Size([1539])
torch.from_numpy(rewards) looks like  torch.Size([1539])
rewards looks like  (604,)
logs prob looks like  torch.Size([604])
torch.from_numpy(rewards) looks like  torch.Size([604])
rewards looks like  (724,)
logs prob looks like  torch.Size([724])
torch.from_numpy(rewards) looks like  torch.Size([724])
rewards looks like  (821,)
logs prob looks like  torch.Size([821])
torch.from_numpy(rewards) looks like  torch.Size([821])
rewards looks like  (778,)
logs prob looks like  torch.Size([778])
torch.from_numpy(rewards) looks like  torch.Size([778])
rewards looks like  (625,)
logs prob looks like  torch.Size([625])
torch.from_numpy(rewards) looks like  torch.Size([625])
rewards looks like  (853,)
logs prob looks like  torch.Size([853])
torch.from_numpy(rewards) looks like  torch.Size([853])
rewards looks like  (797,)
logs prob looks like  torch.Size([797])
torch.from_numpy(rewards) looks like  torch.Size([797])
rewards looks like  (922,)
logs prob looks like  torch.Size([922])
torch.from_numpy(rewards) looks like  torch.Size([922])
rewards looks like  (839,)
logs prob looks like  torch.Size([839])
torch.from_numpy(rewards) looks like  torch.Size([839])
rewards looks like  (765,)
logs prob looks like  torch.Size([765])
torch.from_numpy(rewards) looks like  torch.Size([765])
rewards looks like  (682,)
logs prob looks like  torch.Size([682])
torch.from_numpy(rewards) looks like  torch.Size([682])
rewards looks like  (809,)
logs prob looks like  torch.Size([809])
torch.from_numpy(rewards) looks like  torch.Size([809])
rewards looks like  (768,)
logs prob looks like  torch.Size([768])
torch.from_numpy(rewards) looks like  torch.Size([768])
rewards looks like  (635,)
logs prob looks like  torch.Size([635])
torch.from_numpy(rewards) looks like  torch.Size([635])
rewards looks like  (722,)
logs prob looks like  torch.Size([722])
torch.from_numpy(rewards) looks like  torch.Size([722])
rewards looks like  (894,)
logs prob looks like  torch.Size([894])
torch.from_numpy(rewards) looks like  torch.Size([894])
rewards looks like  (912,)
logs prob looks like  torch.Size([912])
torch.from_numpy(rewards) looks like  torch.Size([912])
rewards looks like  (769,)
logs prob looks like  torch.Size([769])
torch.from_numpy(rewards) looks like  torch.Size([769])
rewards looks like  (719,)
logs prob looks like  torch.Size([719])
torch.from_numpy(rewards) looks like  torch.Size([719])
rewards looks like  (1036,)
logs prob looks like  torch.Size([1036])
torch.from_numpy(rewards) looks like  torch.Size([1036])
rewards looks like  (671,)
logs prob looks like  torch.Size([671])
torch.from_numpy(rewards) looks like  torch.Size([671])
rewards looks like  (795,)
logs prob looks like  torch.Size([795])
torch.from_numpy(rewards) looks like  torch.Size([795])
rewards looks like  (822,)
logs prob looks like  torch.Size([822])
torch.from_numpy(rewards) looks like  torch.Size([822])
rewards looks like  (940,)
logs prob looks like  torch.Size([940])
torch.from_numpy(rewards) looks like  torch.Size([940])
rewards looks like  (805,)
logs prob looks like  torch.Size([805])
torch.from_numpy(rewards) looks like  torch.Size([805])
rewards looks like  (888,)
logs prob looks like  torch.Size([888])
torch.from_numpy(rewards) looks like  torch.Size([888])
rewards looks like  (795,)
logs prob looks like  torch.Size([795])
torch.from_numpy(rewards) looks like  torch.Size([795])
rewards looks like  (732,)
logs prob looks like  torch.Size([732])
torch.from_numpy(rewards) looks like  torch.Size([732])
rewards looks like  (857,)
logs prob looks like  torch.Size([857])
torch.from_numpy(rewards) looks like  torch.Size([857])
rewards looks like  (1208,)
logs prob looks like  torch.Size([1208])
torch.from_numpy(rewards) looks like  torch.Size([1208])
rewards looks like  (755,)
logs prob looks like  torch.Size([755])
torch.from_numpy(rewards) looks like  torch.Size([755])
rewards looks like  (975,)
logs prob looks like  torch.Size([975])
torch.from_numpy(rewards) looks like  torch.Size([975])
rewards looks like  (969,)
logs prob looks like  torch.Size([969])
torch.from_numpy(rewards) looks like  torch.Size([969])
rewards looks like  (1217,)
logs prob looks like  torch.Size([1217])
torch.from_numpy(rewards) looks like  torch.Size([1217])
rewards looks like  (1466,)
logs prob looks like  torch.Size([1466])
torch.from_numpy(rewards) looks like  torch.Size([1466])
rewards looks like  (892,)
logs prob looks like  torch.Size([892])
torch.from_numpy(rewards) looks like  torch.Size([892])
rewards looks like  (933,)
logs prob looks like  torch.Size([933])
torch.from_numpy(rewards) looks like  torch.Size([933])
rewards looks like  (1991,)
logs prob looks like  torch.Size([1991])
torch.from_numpy(rewards) looks like  torch.Size([1991])
rewards looks like  (602,)
logs prob looks like  torch.Size([602])
torch.from_numpy(rewards) looks like  torch.Size([602])
rewards looks like  (694,)
logs prob looks like  torch.Size([694])
torch.from_numpy(rewards) looks like  torch.Size([694])
rewards looks like  (962,)
logs prob looks like  torch.Size([962])
torch.from_numpy(rewards) looks like  torch.Size([962])
rewards looks like  (889,)
logs prob looks like  torch.Size([889])
torch.from_numpy(rewards) looks like  torch.Size([889])
rewards looks like  (874,)
logs prob looks like  torch.Size([874])
torch.from_numpy(rewards) looks like  torch.Size([874])
rewards looks like  (1108,)
logs prob looks like  torch.Size([1108])
torch.from_numpy(rewards) looks like  torch.Size([1108])
rewards looks like  (994,)
logs prob looks like  torch.Size([994])
torch.from_numpy(rewards) looks like  torch.Size([994])
rewards looks like  (1742,)
logs prob looks like  torch.Size([1742])
torch.from_numpy(rewards) looks like  torch.Size([1742])
rewards looks like  (1287,)
logs prob looks like  torch.Size([1287])
torch.from_numpy(rewards) looks like  torch.Size([1287])
rewards looks like  (1190,)
logs prob looks like  torch.Size([1190])
torch.from_numpy(rewards) looks like  torch.Size([1190])
rewards looks like  (1016,)
logs prob looks like  torch.Size([1016])
torch.from_numpy(rewards) looks like  torch.Size([1016])
rewards looks like  (810,)
logs prob looks like  torch.Size([810])
torch.from_numpy(rewards) looks like  torch.Size([810])
rewards looks like  (1244,)
logs prob looks like  torch.Size([1244])
torch.from_numpy(rewards) looks like  torch.Size([1244])
rewards looks like  (1755,)
logs prob looks like  torch.Size([1755])
torch.from_numpy(rewards) looks like  torch.Size([1755])
rewards looks like  (1467,)
logs prob looks like  torch.Size([1467])
torch.from_numpy(rewards) looks like  torch.Size([1467])
rewards looks like  (1530,)
logs prob looks like  torch.Size([1530])
torch.from_numpy(rewards) looks like  torch.Size([1530])
rewards looks like  (2494,)
logs prob looks like  torch.Size([2494])
torch.from_numpy(rewards) looks like  torch.Size([2494])
rewards looks like  (1130,)
logs prob looks like  torch.Size([1130])
torch.from_numpy(rewards) looks like  torch.Size([1130])
rewards looks like  (1282,)
logs prob looks like  torch.Size([1282])
torch.from_numpy(rewards) looks like  torch.Size([1282])
rewards looks like  (2414,)
logs prob looks like  torch.Size([2414])
torch.from_numpy(rewards) looks like  torch.Size([2414])
rewards looks like  (1461,)
logs prob looks like  torch.Size([1461])
torch.from_numpy(rewards) looks like  torch.Size([1461])
rewards looks like  (818,)
logs prob looks like  torch.Size([818])
torch.from_numpy(rewards) looks like  torch.Size([818])
rewards looks like  (1231,)
logs prob looks like  torch.Size([1231])
torch.from_numpy(rewards) looks like  torch.Size([1231])
rewards looks like  (2387,)
logs prob looks like  torch.Size([2387])
torch.from_numpy(rewards) looks like  torch.Size([2387])
rewards looks like  (421,)
logs prob looks like  torch.Size([421])
torch.from_numpy(rewards) looks like  torch.Size([421])
rewards looks like  (374,)
logs prob looks like  torch.Size([374])
torch.from_numpy(rewards) looks like  torch.Size([374])
rewards looks like  (419,)
logs prob looks like  torch.Size([419])
torch.from_numpy(rewards) looks like  torch.Size([419])
rewards looks like  (345,)
logs prob looks like  torch.Size([345])
torch.from_numpy(rewards) looks like  torch.Size([345])
rewards looks like  (422,)
logs prob looks like  torch.Size([422])
torch.from_numpy(rewards) looks like  torch.Size([422])
rewards looks like  (426,)
logs prob looks like  torch.Size([426])
torch.from_numpy(rewards) looks like  torch.Size([426])
rewards looks like  (416,)
logs prob looks like  torch.Size([416])
torch.from_numpy(rewards) looks like  torch.Size([416])
rewards looks like  (374,)
logs prob looks like  torch.Size([374])
torch.from_numpy(rewards) looks like  torch.Size([374])
rewards looks like  (442,)
logs prob looks like  torch.Size([442])
torch.from_numpy(rewards) looks like  torch.Size([442])
rewards looks like  (387,)
logs prob looks like  torch.Size([387])
torch.from_numpy(rewards) looks like  torch.Size([387])
rewards looks like  (364,)
logs prob looks like  torch.Size([364])
torch.from_numpy(rewards) looks like  torch.Size([364])
rewards looks like  (433,)
logs prob looks like  torch.Size([433])
torch.from_numpy(rewards) looks like  torch.Size([433])
rewards looks like  (447,)
logs prob looks like  torch.Size([447])
torch.from_numpy(rewards) looks like  torch.Size([447])
rewards looks like  (450,)
logs prob looks like  torch.Size([450])
torch.from_numpy(rewards) looks like  torch.Size([450])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (459,)
logs prob looks like  torch.Size([459])
torch.from_numpy(rewards) looks like  torch.Size([459])
rewards looks like  (463,)
logs prob looks like  torch.Size([463])
torch.from_numpy(rewards) looks like  torch.Size([463])
rewards looks like  (1427,)
logs prob looks like  torch.Size([1427])
torch.from_numpy(rewards) looks like  torch.Size([1427])
rewards looks like  (1327,)
logs prob looks like  torch.Size([1327])
torch.from_numpy(rewards) looks like  torch.Size([1327])
rewards looks like  (1328,)
logs prob looks like  torch.Size([1328])
torch.from_numpy(rewards) looks like  torch.Size([1328])
rewards looks like  (1374,)
logs prob looks like  torch.Size([1374])
torch.from_numpy(rewards) looks like  torch.Size([1374])
rewards looks like  (2257,)
logs prob looks like  torch.Size([2257])
torch.from_numpy(rewards) looks like  torch.Size([2257])
rewards looks like  (1379,)
logs prob looks like  torch.Size([1379])
torch.from_numpy(rewards) looks like  torch.Size([1379])
rewards looks like  (2934,)
logs prob looks like  torch.Size([2934])
torch.from_numpy(rewards) looks like  torch.Size([2934])
rewards looks like  (1415,)
logs prob looks like  torch.Size([1415])
torch.from_numpy(rewards) looks like  torch.Size([1415])
rewards looks like  (698,)
logs prob looks like  torch.Size([698])
torch.from_numpy(rewards) looks like  torch.Size([698])
rewards looks like  (1740,)
logs prob looks like  torch.Size([1740])
torch.from_numpy(rewards) looks like  torch.Size([1740])
rewards looks like  (2216,)
logs prob looks like  torch.Size([2216])
torch.from_numpy(rewards) looks like  torch.Size([2216])
rewards looks like  (1920,)
logs prob looks like  torch.Size([1920])
torch.from_numpy(rewards) looks like  torch.Size([1920])
rewards looks like  (1229,)
logs prob looks like  torch.Size([1229])
torch.from_numpy(rewards) looks like  torch.Size([1229])
rewards looks like  (2278,)
logs prob looks like  torch.Size([2278])
torch.from_numpy(rewards) looks like  torch.Size([2278])
rewards looks like  (2598,)
logs prob looks like  torch.Size([2598])
torch.from_numpy(rewards) looks like  torch.Size([2598])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (2926,)
logs prob looks like  torch.Size([2926])
torch.from_numpy(rewards) looks like  torch.Size([2926])
rewards looks like  (1525,)
logs prob looks like  torch.Size([1525])
torch.from_numpy(rewards) looks like  torch.Size([1525])
rewards looks like  (965,)
logs prob looks like  torch.Size([965])
torch.from_numpy(rewards) looks like  torch.Size([965])
rewards looks like  (1734,)
logs prob looks like  torch.Size([1734])
torch.from_numpy(rewards) looks like  torch.Size([1734])
rewards looks like  (1625,)
logs prob looks like  torch.Size([1625])
torch.from_numpy(rewards) looks like  torch.Size([1625])
rewards looks like  (1081,)
logs prob looks like  torch.Size([1081])
torch.from_numpy(rewards) looks like  torch.Size([1081])
rewards looks like  (1628,)
logs prob looks like  torch.Size([1628])
torch.from_numpy(rewards) looks like  torch.Size([1628])
rewards looks like  (2825,)
logs prob looks like  torch.Size([2825])
torch.from_numpy(rewards) looks like  torch.Size([2825])
rewards looks like  (3485,)
logs prob looks like  torch.Size([3485])
torch.from_numpy(rewards) looks like  torch.Size([3485])
rewards looks like  (1514,)
logs prob looks like  torch.Size([1514])
torch.from_numpy(rewards) looks like  torch.Size([1514])
rewards looks like  (846,)
logs prob looks like  torch.Size([846])
torch.from_numpy(rewards) looks like  torch.Size([846])
rewards looks like  (755,)
logs prob looks like  torch.Size([755])
torch.from_numpy(rewards) looks like  torch.Size([755])
rewards looks like  (1059,)
logs prob looks like  torch.Size([1059])
torch.from_numpy(rewards) looks like  torch.Size([1059])
rewards looks like  (2581,)
logs prob looks like  torch.Size([2581])
torch.from_numpy(rewards) looks like  torch.Size([2581])
rewards looks like  (2767,)
logs prob looks like  torch.Size([2767])
torch.from_numpy(rewards) looks like  torch.Size([2767])
rewards looks like  (899,)
logs prob looks like  torch.Size([899])
torch.from_numpy(rewards) looks like  torch.Size([899])
rewards looks like  (2808,)
logs prob looks like  torch.Size([2808])
torch.from_numpy(rewards) looks like  torch.Size([2808])
rewards looks like  (1459,)
logs prob looks like  torch.Size([1459])
torch.from_numpy(rewards) looks like  torch.Size([1459])
rewards looks like  (2458,)
logs prob looks like  torch.Size([2458])
torch.from_numpy(rewards) looks like  torch.Size([2458])
rewards looks like  (1027,)
logs prob looks like  torch.Size([1027])
torch.from_numpy(rewards) looks like  torch.Size([1027])
rewards looks like  (1907,)
logs prob looks like  torch.Size([1907])
torch.from_numpy(rewards) looks like  torch.Size([1907])
rewards looks like  (1878,)
logs prob looks like  torch.Size([1878])
torch.from_numpy(rewards) looks like  torch.Size([1878])
rewards looks like  (2129,)
logs prob looks like  torch.Size([2129])
torch.from_numpy(rewards) looks like  torch.Size([2129])
rewards looks like  (2873,)
logs prob looks like  torch.Size([2873])
torch.from_numpy(rewards) looks like  torch.Size([2873])
rewards looks like  (1311,)
logs prob looks like  torch.Size([1311])
torch.from_numpy(rewards) looks like  torch.Size([1311])
rewards looks like  (1888,)
logs prob looks like  torch.Size([1888])
torch.from_numpy(rewards) looks like  torch.Size([1888])
rewards looks like  (870,)
logs prob looks like  torch.Size([870])
torch.from_numpy(rewards) looks like  torch.Size([870])
rewards looks like  (1193,)
logs prob looks like  torch.Size([1193])
torch.from_numpy(rewards) looks like  torch.Size([1193])
rewards looks like  (1367,)
logs prob looks like  torch.Size([1367])
torch.from_numpy(rewards) looks like  torch.Size([1367])
rewards looks like  (1786,)
logs prob looks like  torch.Size([1786])
torch.from_numpy(rewards) looks like  torch.Size([1786])
rewards looks like  (992,)
logs prob looks like  torch.Size([992])
torch.from_numpy(rewards) looks like  torch.Size([992])
rewards looks like  (1037,)
logs prob looks like  torch.Size([1037])
torch.from_numpy(rewards) looks like  torch.Size([1037])
rewards looks like  (2417,)
logs prob looks like  torch.Size([2417])
torch.from_numpy(rewards) looks like  torch.Size([2417])
rewards looks like  (2027,)
logs prob looks like  torch.Size([2027])
torch.from_numpy(rewards) looks like  torch.Size([2027])
rewards looks like  (1203,)
logs prob looks like  torch.Size([1203])
torch.from_numpy(rewards) looks like  torch.Size([1203])
rewards looks like  (2168,)
logs prob looks like  torch.Size([2168])
torch.from_numpy(rewards) looks like  torch.Size([2168])
rewards looks like  (1097,)
logs prob looks like  torch.Size([1097])
torch.from_numpy(rewards) looks like  torch.Size([1097])
rewards looks like  (2070,)
logs prob looks like  torch.Size([2070])
torch.from_numpy(rewards) looks like  torch.Size([2070])
rewards looks like  (1878,)
logs prob looks like  torch.Size([1878])
torch.from_numpy(rewards) looks like  torch.Size([1878])
rewards looks like  (1325,)
logs prob looks like  torch.Size([1325])
torch.from_numpy(rewards) looks like  torch.Size([1325])
rewards looks like  (2611,)
logs prob looks like  torch.Size([2611])
torch.from_numpy(rewards) looks like  torch.Size([2611])
rewards looks like  (1549,)
logs prob looks like  torch.Size([1549])
torch.from_numpy(rewards) looks like  torch.Size([1549])
rewards looks like  (2479,)
logs prob looks like  torch.Size([2479])
torch.from_numpy(rewards) looks like  torch.Size([2479])
rewards looks like  (1987,)
logs prob looks like  torch.Size([1987])
torch.from_numpy(rewards) looks like  torch.Size([1987])
rewards looks like  (1370,)
logs prob looks like  torch.Size([1370])
torch.from_numpy(rewards) looks like  torch.Size([1370])
rewards looks like  (1003,)
logs prob looks like  torch.Size([1003])
torch.from_numpy(rewards) looks like  torch.Size([1003])
rewards looks like  (2640,)
logs prob looks like  torch.Size([2640])
torch.from_numpy(rewards) looks like  torch.Size([2640])
rewards looks like  (1486,)
logs prob looks like  torch.Size([1486])
torch.from_numpy(rewards) looks like  torch.Size([1486])
rewards looks like  (2105,)
logs prob looks like  torch.Size([2105])
torch.from_numpy(rewards) looks like  torch.Size([2105])
rewards looks like  (2222,)
logs prob looks like  torch.Size([2222])
torch.from_numpy(rewards) looks like  torch.Size([2222])
rewards looks like  (1209,)
logs prob looks like  torch.Size([1209])
torch.from_numpy(rewards) looks like  torch.Size([1209])
rewards looks like  (1666,)
logs prob looks like  torch.Size([1666])
torch.from_numpy(rewards) looks like  torch.Size([1666])
rewards looks like  (1435,)
logs prob looks like  torch.Size([1435])
torch.from_numpy(rewards) looks like  torch.Size([1435])
rewards looks like  (1231,)
logs prob looks like  torch.Size([1231])
torch.from_numpy(rewards) looks like  torch.Size([1231])
rewards looks like  (1207,)
logs prob looks like  torch.Size([1207])
torch.from_numpy(rewards) looks like  torch.Size([1207])
rewards looks like  (1155,)
logs prob looks like  torch.Size([1155])
torch.from_numpy(rewards) looks like  torch.Size([1155])
rewards looks like  (1526,)
logs prob looks like  torch.Size([1526])
torch.from_numpy(rewards) looks like  torch.Size([1526])
rewards looks like  (2181,)
logs prob looks like  torch.Size([2181])
torch.from_numpy(rewards) looks like  torch.Size([2181])
rewards looks like  (1868,)
logs prob looks like  torch.Size([1868])
torch.from_numpy(rewards) looks like  torch.Size([1868])
rewards looks like  (2452,)
logs prob looks like  torch.Size([2452])
torch.from_numpy(rewards) looks like  torch.Size([2452])
rewards looks like  (1363,)
logs prob looks like  torch.Size([1363])
torch.from_numpy(rewards) looks like  torch.Size([1363])
rewards looks like  (1543,)
logs prob looks like  torch.Size([1543])
torch.from_numpy(rewards) looks like  torch.Size([1543])
rewards looks like  (2103,)
logs prob looks like  torch.Size([2103])
torch.from_numpy(rewards) looks like  torch.Size([2103])
rewards looks like  (1750,)
logs prob looks like  torch.Size([1750])
torch.from_numpy(rewards) looks like  torch.Size([1750])
rewards looks like  (1453,)
logs prob looks like  torch.Size([1453])
torch.from_numpy(rewards) looks like  torch.Size([1453])
rewards looks like  (1996,)
logs prob looks like  torch.Size([1996])
torch.from_numpy(rewards) looks like  torch.Size([1996])
rewards looks like  (1634,)
logs prob looks like  torch.Size([1634])
torch.from_numpy(rewards) looks like  torch.Size([1634])
rewards looks like  (1364,)
logs prob looks like  torch.Size([1364])
torch.from_numpy(rewards) looks like  torch.Size([1364])
rewards looks like  (2401,)
logs prob looks like  torch.Size([2401])
torch.from_numpy(rewards) looks like  torch.Size([2401])
rewards looks like  (1041,)
logs prob looks like  torch.Size([1041])
torch.from_numpy(rewards) looks like  torch.Size([1041])
rewards looks like  (1014,)
logs prob looks like  torch.Size([1014])
torch.from_numpy(rewards) looks like  torch.Size([1014])
rewards looks like  (1723,)
logs prob looks like  torch.Size([1723])
torch.from_numpy(rewards) looks like  torch.Size([1723])
rewards looks like  (1141,)
logs prob looks like  torch.Size([1141])
torch.from_numpy(rewards) looks like  torch.Size([1141])
rewards looks like  (1153,)
logs prob looks like  torch.Size([1153])
torch.from_numpy(rewards) looks like  torch.Size([1153])
rewards looks like  (1345,)
logs prob looks like  torch.Size([1345])
torch.from_numpy(rewards) looks like  torch.Size([1345])
rewards looks like  (1537,)
logs prob looks like  torch.Size([1537])
torch.from_numpy(rewards) looks like  torch.Size([1537])
rewards looks like  (1362,)
logs prob looks like  torch.Size([1362])
torch.from_numpy(rewards) looks like  torch.Size([1362])
rewards looks like  (1400,)
logs prob looks like  torch.Size([1400])
torch.from_numpy(rewards) looks like  torch.Size([1400])
rewards looks like  (1363,)
logs prob looks like  torch.Size([1363])
torch.from_numpy(rewards) looks like  torch.Size([1363])
rewards looks like  (1381,)
logs prob looks like  torch.Size([1381])
torch.from_numpy(rewards) looks like  torch.Size([1381])
rewards looks like  (2077,)
logs prob looks like  torch.Size([2077])
torch.from_numpy(rewards) looks like  torch.Size([2077])
rewards looks like  (2517,)
logs prob looks like  torch.Size([2517])
torch.from_numpy(rewards) looks like  torch.Size([2517])
rewards looks like  (1419,)
logs prob looks like  torch.Size([1419])
torch.from_numpy(rewards) looks like  torch.Size([1419])
rewards looks like  (960,)
logs prob looks like  torch.Size([960])
torch.from_numpy(rewards) looks like  torch.Size([960])
rewards looks like  (1079,)
logs prob looks like  torch.Size([1079])
torch.from_numpy(rewards) looks like  torch.Size([1079])
rewards looks like  (1285,)
logs prob looks like  torch.Size([1285])
torch.from_numpy(rewards) looks like  torch.Size([1285])
rewards looks like  (2475,)
logs prob looks like  torch.Size([2475])
torch.from_numpy(rewards) looks like  torch.Size([2475])
rewards looks like  (1376,)
logs prob looks like  torch.Size([1376])
torch.from_numpy(rewards) looks like  torch.Size([1376])
rewards looks like  (2248,)
logs prob looks like  torch.Size([2248])
torch.from_numpy(rewards) looks like  torch.Size([2248])
rewards looks like  (2912,)
logs prob looks like  torch.Size([2912])
torch.from_numpy(rewards) looks like  torch.Size([2912])
rewards looks like  (1334,)
logs prob looks like  torch.Size([1334])
torch.from_numpy(rewards) looks like  torch.Size([1334])
rewards looks like  (1481,)
logs prob looks like  torch.Size([1481])
torch.from_numpy(rewards) looks like  torch.Size([1481])
rewards looks like  (2016,)
logs prob looks like  torch.Size([2016])
torch.from_numpy(rewards) looks like  torch.Size([2016])
rewards looks like  (1899,)
logs prob looks like  torch.Size([1899])
torch.from_numpy(rewards) looks like  torch.Size([1899])
rewards looks like  (1171,)
logs prob looks like  torch.Size([1171])
torch.from_numpy(rewards) looks like  torch.Size([1171])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1945,)
logs prob looks like  torch.Size([1945])
torch.from_numpy(rewards) looks like  torch.Size([1945])
rewards looks like  (2421,)
logs prob looks like  torch.Size([2421])
torch.from_numpy(rewards) looks like  torch.Size([2421])
rewards looks like  (1859,)
logs prob looks like  torch.Size([1859])
torch.from_numpy(rewards) looks like  torch.Size([1859])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (1297,)
logs prob looks like  torch.Size([1297])
torch.from_numpy(rewards) looks like  torch.Size([1297])
rewards looks like  (2085,)
logs prob looks like  torch.Size([2085])
torch.from_numpy(rewards) looks like  torch.Size([2085])
rewards looks like  (1478,)
logs prob looks like  torch.Size([1478])
torch.from_numpy(rewards) looks like  torch.Size([1478])
rewards looks like  (1131,)
logs prob looks like  torch.Size([1131])
torch.from_numpy(rewards) looks like  torch.Size([1131])
rewards looks like  (1370,)
logs prob looks like  torch.Size([1370])
torch.from_numpy(rewards) looks like  torch.Size([1370])
rewards looks like  (1503,)
logs prob looks like  torch.Size([1503])
torch.from_numpy(rewards) looks like  torch.Size([1503])
rewards looks like  (1058,)
logs prob looks like  torch.Size([1058])
torch.from_numpy(rewards) looks like  torch.Size([1058])
rewards looks like  (1350,)
logs prob looks like  torch.Size([1350])
torch.from_numpy(rewards) looks like  torch.Size([1350])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1364,)
logs prob looks like  torch.Size([1364])
torch.from_numpy(rewards) looks like  torch.Size([1364])
rewards looks like  (1084,)
logs prob looks like  torch.Size([1084])
torch.from_numpy(rewards) looks like  torch.Size([1084])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1286,)
logs prob looks like  torch.Size([1286])
torch.from_numpy(rewards) looks like  torch.Size([1286])
rewards looks like  (1477,)
logs prob looks like  torch.Size([1477])
torch.from_numpy(rewards) looks like  torch.Size([1477])
rewards looks like  (1172,)
logs prob looks like  torch.Size([1172])
torch.from_numpy(rewards) looks like  torch.Size([1172])
rewards looks like  (1366,)
logs prob looks like  torch.Size([1366])
torch.from_numpy(rewards) looks like  torch.Size([1366])
rewards looks like  (1826,)
logs prob looks like  torch.Size([1826])
torch.from_numpy(rewards) looks like  torch.Size([1826])
rewards looks like  (1165,)
logs prob looks like  torch.Size([1165])
torch.from_numpy(rewards) looks like  torch.Size([1165])
rewards looks like  (2540,)
logs prob looks like  torch.Size([2540])
torch.from_numpy(rewards) looks like  torch.Size([2540])
rewards looks like  (1507,)
logs prob looks like  torch.Size([1507])
torch.from_numpy(rewards) looks like  torch.Size([1507])
rewards looks like  (2418,)
logs prob looks like  torch.Size([2418])
torch.from_numpy(rewards) looks like  torch.Size([2418])
rewards looks like  (1300,)
logs prob looks like  torch.Size([1300])
torch.from_numpy(rewards) looks like  torch.Size([1300])
rewards looks like  (2572,)
logs prob looks like  torch.Size([2572])
torch.from_numpy(rewards) looks like  torch.Size([2572])
rewards looks like  (1225,)
logs prob looks like  torch.Size([1225])
torch.from_numpy(rewards) looks like  torch.Size([1225])
rewards looks like  (1586,)
logs prob looks like  torch.Size([1586])
torch.from_numpy(rewards) looks like  torch.Size([1586])
rewards looks like  (1460,)
logs prob looks like  torch.Size([1460])
torch.from_numpy(rewards) looks like  torch.Size([1460])
rewards looks like  (1458,)
logs prob looks like  torch.Size([1458])
torch.from_numpy(rewards) looks like  torch.Size([1458])
rewards looks like  (1381,)
logs prob looks like  torch.Size([1381])
torch.from_numpy(rewards) looks like  torch.Size([1381])
rewards looks like  (1356,)
logs prob looks like  torch.Size([1356])
torch.from_numpy(rewards) looks like  torch.Size([1356])
rewards looks like  (1520,)
logs prob looks like  torch.Size([1520])
torch.from_numpy(rewards) looks like  torch.Size([1520])
rewards looks like  (1570,)
logs prob looks like  torch.Size([1570])
torch.from_numpy(rewards) looks like  torch.Size([1570])
rewards looks like  (1303,)
logs prob looks like  torch.Size([1303])
torch.from_numpy(rewards) looks like  torch.Size([1303])
rewards looks like  (2160,)
logs prob looks like  torch.Size([2160])
torch.from_numpy(rewards) looks like  torch.Size([2160])
rewards looks like  (1344,)
logs prob looks like  torch.Size([1344])
torch.from_numpy(rewards) looks like  torch.Size([1344])
rewards looks like  (1496,)
logs prob looks like  torch.Size([1496])
torch.from_numpy(rewards) looks like  torch.Size([1496])
rewards looks like  (1905,)
logs prob looks like  torch.Size([1905])
torch.from_numpy(rewards) looks like  torch.Size([1905])
rewards looks like  (1255,)
logs prob looks like  torch.Size([1255])
torch.from_numpy(rewards) looks like  torch.Size([1255])
rewards looks like  (1440,)
logs prob looks like  torch.Size([1440])
torch.from_numpy(rewards) looks like  torch.Size([1440])
rewards looks like  (1472,)
logs prob looks like  torch.Size([1472])
torch.from_numpy(rewards) looks like  torch.Size([1472])
rewards looks like  (1261,)
logs prob looks like  torch.Size([1261])
torch.from_numpy(rewards) looks like  torch.Size([1261])
rewards looks like  (2225,)
logs prob looks like  torch.Size([2225])
torch.from_numpy(rewards) looks like  torch.Size([2225])
rewards looks like  (1071,)
logs prob looks like  torch.Size([1071])
torch.from_numpy(rewards) looks like  torch.Size([1071])
rewards looks like  (1033,)
logs prob looks like  torch.Size([1033])
torch.from_numpy(rewards) looks like  torch.Size([1033])
rewards looks like  (856,)
logs prob looks like  torch.Size([856])
torch.from_numpy(rewards) looks like  torch.Size([856])
rewards looks like  (1261,)
logs prob looks like  torch.Size([1261])
torch.from_numpy(rewards) looks like  torch.Size([1261])
rewards looks like  (1782,)
logs prob looks like  torch.Size([1782])
torch.from_numpy(rewards) looks like  torch.Size([1782])
rewards looks like  (1867,)
logs prob looks like  torch.Size([1867])
torch.from_numpy(rewards) looks like  torch.Size([1867])
rewards looks like  (2025,)
logs prob looks like  torch.Size([2025])
torch.from_numpy(rewards) looks like  torch.Size([2025])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1323,)
logs prob looks like  torch.Size([1323])
torch.from_numpy(rewards) looks like  torch.Size([1323])
rewards looks like  (1349,)
logs prob looks like  torch.Size([1349])
torch.from_numpy(rewards) looks like  torch.Size([1349])
rewards looks like  (1617,)
logs prob looks like  torch.Size([1617])
torch.from_numpy(rewards) looks like  torch.Size([1617])
rewards looks like  (1668,)
logs prob looks like  torch.Size([1668])
torch.from_numpy(rewards) looks like  torch.Size([1668])
rewards looks like  (1109,)
logs prob looks like  torch.Size([1109])
torch.from_numpy(rewards) looks like  torch.Size([1109])
rewards looks like  (1102,)
logs prob looks like  torch.Size([1102])
torch.from_numpy(rewards) looks like  torch.Size([1102])
rewards looks like  (2017,)
logs prob looks like  torch.Size([2017])
torch.from_numpy(rewards) looks like  torch.Size([2017])
rewards looks like  (2368,)
logs prob looks like  torch.Size([2368])
torch.from_numpy(rewards) looks like  torch.Size([2368])
rewards looks like  (1128,)
logs prob looks like  torch.Size([1128])
torch.from_numpy(rewards) looks like  torch.Size([1128])
rewards looks like  (1469,)
logs prob looks like  torch.Size([1469])
torch.from_numpy(rewards) looks like  torch.Size([1469])
rewards looks like  (1091,)
logs prob looks like  torch.Size([1091])
torch.from_numpy(rewards) looks like  torch.Size([1091])
rewards looks like  (1516,)
logs prob looks like  torch.Size([1516])
torch.from_numpy(rewards) looks like  torch.Size([1516])
rewards looks like  (1145,)
logs prob looks like  torch.Size([1145])
torch.from_numpy(rewards) looks like  torch.Size([1145])
rewards looks like  (1594,)
logs prob looks like  torch.Size([1594])
torch.from_numpy(rewards) looks like  torch.Size([1594])
rewards looks like  (1536,)
logs prob looks like  torch.Size([1536])
torch.from_numpy(rewards) looks like  torch.Size([1536])
rewards looks like  (1295,)
logs prob looks like  torch.Size([1295])
torch.from_numpy(rewards) looks like  torch.Size([1295])
rewards looks like  (1473,)
logs prob looks like  torch.Size([1473])
torch.from_numpy(rewards) looks like  torch.Size([1473])
rewards looks like  (1458,)
logs prob looks like  torch.Size([1458])
torch.from_numpy(rewards) looks like  torch.Size([1458])
rewards looks like  (1316,)
logs prob looks like  torch.Size([1316])
torch.from_numpy(rewards) looks like  torch.Size([1316])
rewards looks like  (1257,)
logs prob looks like  torch.Size([1257])
torch.from_numpy(rewards) looks like  torch.Size([1257])
rewards looks like  (2354,)
logs prob looks like  torch.Size([2354])
torch.from_numpy(rewards) looks like  torch.Size([2354])
rewards looks like  (1340,)
logs prob looks like  torch.Size([1340])
torch.from_numpy(rewards) looks like  torch.Size([1340])
rewards looks like  (1900,)
logs prob looks like  torch.Size([1900])
torch.from_numpy(rewards) looks like  torch.Size([1900])
rewards looks like  (1513,)
logs prob looks like  torch.Size([1513])
torch.from_numpy(rewards) looks like  torch.Size([1513])
rewards looks like  (1873,)
logs prob looks like  torch.Size([1873])
torch.from_numpy(rewards) looks like  torch.Size([1873])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (2151,)
logs prob looks like  torch.Size([2151])
torch.from_numpy(rewards) looks like  torch.Size([2151])
rewards looks like  (1933,)
logs prob looks like  torch.Size([1933])
torch.from_numpy(rewards) looks like  torch.Size([1933])
rewards looks like  (2081,)
logs prob looks like  torch.Size([2081])
torch.from_numpy(rewards) looks like  torch.Size([2081])
rewards looks like  (1054,)
logs prob looks like  torch.Size([1054])
torch.from_numpy(rewards) looks like  torch.Size([1054])
rewards looks like  (1158,)
logs prob looks like  torch.Size([1158])
torch.from_numpy(rewards) looks like  torch.Size([1158])
rewards looks like  (1369,)
logs prob looks like  torch.Size([1369])
torch.from_numpy(rewards) looks like  torch.Size([1369])
rewards looks like  (1148,)
logs prob looks like  torch.Size([1148])
torch.from_numpy(rewards) looks like  torch.Size([1148])
rewards looks like  (1898,)
logs prob looks like  torch.Size([1898])
torch.from_numpy(rewards) looks like  torch.Size([1898])
rewards looks like  (1424,)
logs prob looks like  torch.Size([1424])
torch.from_numpy(rewards) looks like  torch.Size([1424])
rewards looks like  (2106,)
logs prob looks like  torch.Size([2106])
torch.from_numpy(rewards) looks like  torch.Size([2106])
rewards looks like  (1310,)
logs prob looks like  torch.Size([1310])
torch.from_numpy(rewards) looks like  torch.Size([1310])
rewards looks like  (1423,)
logs prob looks like  torch.Size([1423])
torch.from_numpy(rewards) looks like  torch.Size([1423])
rewards looks like  (1866,)
logs prob looks like  torch.Size([1866])
torch.from_numpy(rewards) looks like  torch.Size([1866])
rewards looks like  (2571,)
logs prob looks like  torch.Size([2571])
torch.from_numpy(rewards) looks like  torch.Size([2571])
rewards looks like  (1958,)
logs prob looks like  torch.Size([1958])
torch.from_numpy(rewards) looks like  torch.Size([1958])
rewards looks like  (1608,)
logs prob looks like  torch.Size([1608])
torch.from_numpy(rewards) looks like  torch.Size([1608])
rewards looks like  (1197,)
logs prob looks like  torch.Size([1197])
torch.from_numpy(rewards) looks like  torch.Size([1197])
rewards looks like  (1429,)
logs prob looks like  torch.Size([1429])
torch.from_numpy(rewards) looks like  torch.Size([1429])
rewards looks like  (1466,)
logs prob looks like  torch.Size([1466])
torch.from_numpy(rewards) looks like  torch.Size([1466])
rewards looks like  (1405,)
logs prob looks like  torch.Size([1405])
torch.from_numpy(rewards) looks like  torch.Size([1405])
rewards looks like  (1304,)
logs prob looks like  torch.Size([1304])
torch.from_numpy(rewards) looks like  torch.Size([1304])
rewards looks like  (2045,)
logs prob looks like  torch.Size([2045])
torch.from_numpy(rewards) looks like  torch.Size([2045])
rewards looks like  (1565,)
logs prob looks like  torch.Size([1565])
torch.from_numpy(rewards) looks like  torch.Size([1565])
rewards looks like  (2539,)
logs prob looks like  torch.Size([2539])
torch.from_numpy(rewards) looks like  torch.Size([2539])
rewards looks like  (1497,)
logs prob looks like  torch.Size([1497])
torch.from_numpy(rewards) looks like  torch.Size([1497])
rewards looks like  (2141,)
logs prob looks like  torch.Size([2141])
torch.from_numpy(rewards) looks like  torch.Size([2141])
rewards looks like  (1141,)
logs prob looks like  torch.Size([1141])
torch.from_numpy(rewards) looks like  torch.Size([1141])
rewards looks like  (2892,)
logs prob looks like  torch.Size([2892])
torch.from_numpy(rewards) looks like  torch.Size([2892])
rewards looks like  (841,)
logs prob looks like  torch.Size([841])
torch.from_numpy(rewards) looks like  torch.Size([841])
rewards looks like  (1129,)
logs prob looks like  torch.Size([1129])
torch.from_numpy(rewards) looks like  torch.Size([1129])
rewards looks like  (1347,)
logs prob looks like  torch.Size([1347])
torch.from_numpy(rewards) looks like  torch.Size([1347])
rewards looks like  (1596,)
logs prob looks like  torch.Size([1596])
torch.from_numpy(rewards) looks like  torch.Size([1596])
rewards looks like  (2045,)
logs prob looks like  torch.Size([2045])
torch.from_numpy(rewards) looks like  torch.Size([2045])
rewards looks like  (1247,)
logs prob looks like  torch.Size([1247])
torch.from_numpy(rewards) looks like  torch.Size([1247])
rewards looks like  (1289,)
logs prob looks like  torch.Size([1289])
torch.from_numpy(rewards) looks like  torch.Size([1289])
rewards looks like  (2360,)
logs prob looks like  torch.Size([2360])
torch.from_numpy(rewards) looks like  torch.Size([2360])
rewards looks like  (2745,)
logs prob looks like  torch.Size([2745])
torch.from_numpy(rewards) looks like  torch.Size([2745])
rewards looks like  (1191,)
logs prob looks like  torch.Size([1191])
torch.from_numpy(rewards) looks like  torch.Size([1191])
rewards looks like  (1266,)
logs prob looks like  torch.Size([1266])
torch.from_numpy(rewards) looks like  torch.Size([1266])
rewards looks like  (1424,)
logs prob looks like  torch.Size([1424])
torch.from_numpy(rewards) looks like  torch.Size([1424])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (2134,)
logs prob looks like  torch.Size([2134])
torch.from_numpy(rewards) looks like  torch.Size([2134])
rewards looks like  (1933,)
logs prob looks like  torch.Size([1933])
torch.from_numpy(rewards) looks like  torch.Size([1933])
rewards looks like  (1357,)
logs prob looks like  torch.Size([1357])
torch.from_numpy(rewards) looks like  torch.Size([1357])
rewards looks like  (1807,)
logs prob looks like  torch.Size([1807])
torch.from_numpy(rewards) looks like  torch.Size([1807])
rewards looks like  (2153,)
logs prob looks like  torch.Size([2153])
torch.from_numpy(rewards) looks like  torch.Size([2153])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (1263,)
logs prob looks like  torch.Size([1263])
torch.from_numpy(rewards) looks like  torch.Size([1263])
rewards looks like  (2021,)
logs prob looks like  torch.Size([2021])
torch.from_numpy(rewards) looks like  torch.Size([2021])
rewards looks like  (1306,)
logs prob looks like  torch.Size([1306])
torch.from_numpy(rewards) looks like  torch.Size([1306])
rewards looks like  (1696,)
logs prob looks like  torch.Size([1696])
torch.from_numpy(rewards) looks like  torch.Size([1696])
rewards looks like  (1593,)
logs prob looks like  torch.Size([1593])
torch.from_numpy(rewards) looks like  torch.Size([1593])
rewards looks like  (1181,)
logs prob looks like  torch.Size([1181])
torch.from_numpy(rewards) looks like  torch.Size([1181])
rewards looks like  (2203,)
logs prob looks like  torch.Size([2203])
torch.from_numpy(rewards) looks like  torch.Size([2203])
rewards looks like  (2740,)
logs prob looks like  torch.Size([2740])
torch.from_numpy(rewards) looks like  torch.Size([2740])
rewards looks like  (1403,)
logs prob looks like  torch.Size([1403])
torch.from_numpy(rewards) looks like  torch.Size([1403])
rewards looks like  (1326,)
logs prob looks like  torch.Size([1326])
torch.from_numpy(rewards) looks like  torch.Size([1326])
rewards looks like  (2057,)
logs prob looks like  torch.Size([2057])
torch.from_numpy(rewards) looks like  torch.Size([2057])
rewards looks like  (3534,)
logs prob looks like  torch.Size([3534])
torch.from_numpy(rewards) looks like  torch.Size([3534])
rewards looks like  (1318,)
logs prob looks like  torch.Size([1318])
torch.from_numpy(rewards) looks like  torch.Size([1318])
rewards looks like  (1419,)
logs prob looks like  torch.Size([1419])
torch.from_numpy(rewards) looks like  torch.Size([1419])
rewards looks like  (1403,)
logs prob looks like  torch.Size([1403])
torch.from_numpy(rewards) looks like  torch.Size([1403])
rewards looks like  (2790,)
logs prob looks like  torch.Size([2790])
torch.from_numpy(rewards) looks like  torch.Size([2790])
rewards looks like  (1318,)
logs prob looks like  torch.Size([1318])
torch.from_numpy(rewards) looks like  torch.Size([1318])
rewards looks like  (1406,)
logs prob looks like  torch.Size([1406])
torch.from_numpy(rewards) looks like  torch.Size([1406])
rewards looks like  (1603,)
logs prob looks like  torch.Size([1603])
torch.from_numpy(rewards) looks like  torch.Size([1603])
rewards looks like  (1794,)
logs prob looks like  torch.Size([1794])
torch.from_numpy(rewards) looks like  torch.Size([1794])
rewards looks like  (1461,)
logs prob looks like  torch.Size([1461])
torch.from_numpy(rewards) looks like  torch.Size([1461])
rewards looks like  (1343,)
logs prob looks like  torch.Size([1343])
torch.from_numpy(rewards) looks like  torch.Size([1343])
rewards looks like  (1442,)
logs prob looks like  torch.Size([1442])
torch.from_numpy(rewards) looks like  torch.Size([1442])
rewards looks like  (1414,)
logs prob looks like  torch.Size([1414])
torch.from_numpy(rewards) looks like  torch.Size([1414])
rewards looks like  (2715,)
logs prob looks like  torch.Size([2715])
torch.from_numpy(rewards) looks like  torch.Size([2715])
rewards looks like  (2386,)
logs prob looks like  torch.Size([2386])
torch.from_numpy(rewards) looks like  torch.Size([2386])
rewards looks like  (1905,)
logs prob looks like  torch.Size([1905])
torch.from_numpy(rewards) looks like  torch.Size([1905])
rewards looks like  (1031,)
logs prob looks like  torch.Size([1031])
torch.from_numpy(rewards) looks like  torch.Size([1031])
rewards looks like  (1125,)
logs prob looks like  torch.Size([1125])
torch.from_numpy(rewards) looks like  torch.Size([1125])
rewards looks like  (1556,)
logs prob looks like  torch.Size([1556])
torch.from_numpy(rewards) looks like  torch.Size([1556])
rewards looks like  (1906,)
logs prob looks like  torch.Size([1906])
torch.from_numpy(rewards) looks like  torch.Size([1906])
rewards looks like  (1777,)
logs prob looks like  torch.Size([1777])
torch.from_numpy(rewards) looks like  torch.Size([1777])
rewards looks like  (1269,)
logs prob looks like  torch.Size([1269])
torch.from_numpy(rewards) looks like  torch.Size([1269])
rewards looks like  (1407,)
logs prob looks like  torch.Size([1407])
torch.from_numpy(rewards) looks like  torch.Size([1407])
rewards looks like  (1333,)
logs prob looks like  torch.Size([1333])
torch.from_numpy(rewards) looks like  torch.Size([1333])
rewards looks like  (1224,)
logs prob looks like  torch.Size([1224])
torch.from_numpy(rewards) looks like  torch.Size([1224])
rewards looks like  (1997,)
logs prob looks like  torch.Size([1997])
torch.from_numpy(rewards) looks like  torch.Size([1997])
rewards looks like  (1610,)
logs prob looks like  torch.Size([1610])
torch.from_numpy(rewards) looks like  torch.Size([1610])
rewards looks like  (1393,)
logs prob looks like  torch.Size([1393])
torch.from_numpy(rewards) looks like  torch.Size([1393])
rewards looks like  (1808,)
logs prob looks like  torch.Size([1808])
torch.from_numpy(rewards) looks like  torch.Size([1808])
rewards looks like  (1448,)
logs prob looks like  torch.Size([1448])
torch.from_numpy(rewards) looks like  torch.Size([1448])
rewards looks like  (1558,)
logs prob looks like  torch.Size([1558])
torch.from_numpy(rewards) looks like  torch.Size([1558])
rewards looks like  (1766,)
logs prob looks like  torch.Size([1766])
torch.from_numpy(rewards) looks like  torch.Size([1766])
rewards looks like  (1942,)
logs prob looks like  torch.Size([1942])
torch.from_numpy(rewards) looks like  torch.Size([1942])
rewards looks like  (1487,)
logs prob looks like  torch.Size([1487])
torch.from_numpy(rewards) looks like  torch.Size([1487])
rewards looks like  (2154,)
logs prob looks like  torch.Size([2154])
torch.from_numpy(rewards) looks like  torch.Size([2154])
rewards looks like  (1400,)
logs prob looks like  torch.Size([1400])
torch.from_numpy(rewards) looks like  torch.Size([1400])
rewards looks like  (1379,)
logs prob looks like  torch.Size([1379])
torch.from_numpy(rewards) looks like  torch.Size([1379])
rewards looks like  (2227,)
logs prob looks like  torch.Size([2227])
torch.from_numpy(rewards) looks like  torch.Size([2227])
rewards looks like  (1308,)
logs prob looks like  torch.Size([1308])
torch.from_numpy(rewards) looks like  torch.Size([1308])
rewards looks like  (1469,)
logs prob looks like  torch.Size([1469])
torch.from_numpy(rewards) looks like  torch.Size([1469])
rewards looks like  (1734,)
logs prob looks like  torch.Size([1734])
torch.from_numpy(rewards) looks like  torch.Size([1734])
rewards looks like  (1994,)
logs prob looks like  torch.Size([1994])
torch.from_numpy(rewards) looks like  torch.Size([1994])
rewards looks like  (2025,)
logs prob looks like  torch.Size([2025])
torch.from_numpy(rewards) looks like  torch.Size([2025])
rewards looks like  (2223,)
logs prob looks like  torch.Size([2223])
torch.from_numpy(rewards) looks like  torch.Size([2223])
rewards looks like  (2418,)
logs prob looks like  torch.Size([2418])
torch.from_numpy(rewards) looks like  torch.Size([2418])
rewards looks like  (1520,)
logs prob looks like  torch.Size([1520])
torch.from_numpy(rewards) looks like  torch.Size([1520])
rewards looks like  (1613,)
logs prob looks like  torch.Size([1613])
torch.from_numpy(rewards) looks like  torch.Size([1613])
rewards looks like  (1984,)
logs prob looks like  torch.Size([1984])
torch.from_numpy(rewards) looks like  torch.Size([1984])
rewards looks like  (1563,)
logs prob looks like  torch.Size([1563])
torch.from_numpy(rewards) looks like  torch.Size([1563])
rewards looks like  (1559,)
logs prob looks like  torch.Size([1559])
torch.from_numpy(rewards) looks like  torch.Size([1559])
rewards looks like  (2198,)
logs prob looks like  torch.Size([2198])
torch.from_numpy(rewards) looks like  torch.Size([2198])
rewards looks like  (1582,)
logs prob looks like  torch.Size([1582])
torch.from_numpy(rewards) looks like  torch.Size([1582])
rewards looks like  (1423,)
logs prob looks like  torch.Size([1423])
torch.from_numpy(rewards) looks like  torch.Size([1423])
rewards looks like  (2810,)
logs prob looks like  torch.Size([2810])
torch.from_numpy(rewards) looks like  torch.Size([2810])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (2219,)
logs prob looks like  torch.Size([2219])
torch.from_numpy(rewards) looks like  torch.Size([2219])
rewards looks like  (1930,)
logs prob looks like  torch.Size([1930])
torch.from_numpy(rewards) looks like  torch.Size([1930])


Training Result

During training, we record avg_total_reward, the average total reward of the episodes collected before each update of the policy network.

In theory, avg_total_reward should increase as the agent improves. The training curve is plotted below:


[19]

plt.plot(avg_total_rewards)

plt.title("Total Rewards")

plt.show()
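
Reward curves from policy-gradient training are usually noisy from one update to the next, so a moving average can make the trend easier to read. The sketch below is only an illustration and not part of the homework code: the helper `moving_average`, the window size, and the sample data are assumptions. In the notebook you would pass `avg_total_rewards` instead of the stand-in list.

```python
import numpy as np

def moving_average(values, window=10):
    """Simple moving average of `values` (hypothetical helper, not from the homework)."""
    values = np.asarray(values, dtype=float)
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the full window fits
    return np.convolve(values, kernel, mode="valid")

# Stand-in data; replace with avg_total_rewards in the notebook.
sample_rewards = [-200.0, -180.0, -190.0, -150.0, -120.0, -130.0]
smoothed = moving_average(sample_rewards, window=3)
print(smoothed)
```

The smoothed array is shorter than the input by `window - 1` points, so plot it against a matching x-range if you overlay it on the raw curve.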


In addition, avg_final_reward is the average final reward over episodes. The final reward is the last reward received in an episode, and it indicates whether the craft landed successfully.


[20]

plt.plot(avg_final_rewards)

plt.title("Final Rewards")

plt.show()
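
To make the two metrics concrete, the following sketch computes both from per-episode reward sequences. It is a minimal illustration under assumptions: `episode_rewards` is made-up data and `summarize_batch` is a hypothetical helper, not the training loop used above. In LunarLander the environment adds -100 when the lander crashes and +100 when it comes to rest, which is why the final reward works as a rough success signal.

```python
import numpy as np

def summarize_batch(episode_rewards):
    """Return (average total reward, average final reward) for a batch of episodes.

    `episode_rewards` is a list with one list of per-step rewards per episode.
    Hypothetical helper for illustration only.
    """
    totals = [sum(rewards) for rewards in episode_rewards]   # sum over each episode
    finals = [rewards[-1] for rewards in episode_rewards]    # last reward of each episode
    return float(np.mean(totals)), float(np.mean(finals))

# Made-up data: one episode ending in a crash, one ending in a landing.
episode_rewards = [
    [-1.0, -2.0, -100.0],
    [0.5, 1.5, 100.0],
]
avg_total_reward, avg_final_reward = summarize_batch(episode_rewards)
print(avg_total_reward, avg_final_reward)
```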


Testing

The test result is the average total reward over 5 test episodes.


[21]

fix(env, seed)
agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5  # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()

    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False
    while not done:
        action, _ = agent.sample(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)

        total_reward += reward

        img.set_data(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)

    print(total_reward)
    test_total_reward.append(total_reward)

    action_list.append(actions)  # save the result of testing

-209.13696525868605


[22]

print(np.mean(test_total_reward))

-106.5599827895497


Action list


[23]

print("Action list looks like ", action_list)

print("Action list's shape looks like ", np.shape(action_list))

Action list looks like  [[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3], [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 2, 2, 3, 2, 3, 0, 2, 2, 2, 0, 2, 1, 2, 3, 2, 2, 0, 2, 2, 1, 0, 2, 2, 
3, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 0, 2, 2, 0, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 3, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 0, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 2, 2, 3, 0, 2, 0, 0, 2, 3, 2, 0, 2, 3, 3, 2, 0, 2, 0, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 3, 2, 2, 0, 2, 1, 0, 2, 1, 2, 3, 2, 0, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 3, 2, 2, 2, 3, 0, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 0, 2, 3, 2, 3, 0, 2, 3, 2, 3, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 0, 2, 3, 0, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]
Action list's shape looks like  (5,)
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:2007: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = asarray(a).shape


Analysis of actions taken by agent

[24]

distribution = {}
for actions in action_list:
    for action in actions:
        if action not in distribution.keys():
            distribution[action] = 1
        else:
            distribution[action] += 1
print(distribution)

{2: 991, 3: 374, 0: 108, 1: 496}
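
The same tally can be written more compactly with collections.Counter, and turning the counts into shares makes it easier to see which action dominates. The sketch below is only an illustration: toy_action_list stands in for the real action_list, and the ACTION_NAMES mapping is an assumption based on the usual Lunar Lander action description (used for display only).

```python
from collections import Counter
from itertools import chain

# Assumed action names for display; check them against the "Action" section.
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def action_distribution(action_list):
    """Count how often each action appears across all episodes."""
    return Counter(chain.from_iterable(action_list))


def action_shares(counts):
    """Convert raw counts into fractions of all recorded steps."""
    total = sum(counts.values())
    return {action: n / total for action, n in counts.items()}


# Toy data standing in for the real action_list (5 episodes of different lengths).
toy_action_list = [[0, 2, 2], [1, 2], [3, 3, 2, 2], [2], [0, 1]]
counts = action_distribution(toy_action_list)
shares = action_shares(counts)
for action in sorted(counts):
    print(f"{action} ({ACTION_NAMES[action]}): {counts[action]} steps, {shares[action]:.1%}")
```

Replacing toy_action_list with the real action_list should give the same counts as the dictionary printed above, plus the share of each action.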

Saving the result of Model Testing

[25]

PATH = "Action_List.npy"  # Change this to any file name or path you want
np.save(PATH, np.array(action_list))

/tmp/ipykernel_123/1616289779.py:2: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.save(PATH ,np.array(action_list))
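
The VisibleDeprecationWarning appears because the five episodes have different lengths, so NumPy has to fall back to an object array. One way to avoid the warning is to build the object array explicitly before saving. This is a minimal sketch, not the notebook's original method: save_action_list, toy and demo_path are hypothetical names, and the file is written to a temporary directory so it does not overwrite the real submission.

```python
import os
import tempfile

import numpy as np


def save_action_list(path, action_list):
    """Save ragged episodes as a 1-D object array, one entry per episode."""
    arr = np.empty(len(action_list), dtype=object)
    for i, actions in enumerate(action_list):
        arr[i] = list(actions)  # each element is one episode's action list
    np.save(path, arr)
    return arr


# Toy data standing in for the real action_list.
toy = [[0, 2, 2], [1, 2], [3, 3, 2, 2], [2], [0, 1]]
demo_path = os.path.join(tempfile.mkdtemp(), "Action_List_demo.npy")
saved = save_action_list(demo_path, toy)

# Object arrays are pickled on disk, so allow_pickle=True is needed to load them.
loaded = np.load(demo_path, allow_pickle=True)
print(saved.shape, len(loaded))
```

Filling a pre-allocated object array keeps the shape at (5,) even if all episodes happen to have the same length, a case in which np.array(action_list, dtype=object) would produce a 2-D array instead.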

This is the file you need to submit !!!

Download the testing result to your device

[26]

from google.colab import files
files.download(PATH)

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[26], line 1
----> 1 from google.colab import files
      2 files.download(PATH)

ModuleNotFoundError: No module named 'google.colab'
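
The cell fails here because google.colab is only available inside Google Colab. Outside Colab the file is already on disk, so a fallback can simply report where it is. The helper below is a hedged sketch: download_or_locate is a hypothetical name, and it assumes that knowing the absolute path is enough to fetch the file through your platform's own file browser.

```python
import os


def download_or_locate(path):
    """Trigger a browser download in Colab; elsewhere just return the absolute path."""
    try:
        from google.colab import files  # only importable inside Google Colab
    except ImportError:
        # Not running in Colab: the file stays where np.save wrote it.
        return os.path.abspath(path)
    files.download(path)
    return os.path.abspath(path)


location = download_or_locate("Action_List.npy")
print(f"Action list file: {location}")
```

In this notebook you would call download_or_locate(PATH) and then download the printed path manually.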

Server

The code below simulates the environment on the judge server. You can use it to check your action list before submitting.

[27]

action_list = np.load(PATH, allow_pickle=True)  # The action list you upload
seed = 543  # Do not revise this
fix(env, seed)

agent.network.eval()  # set network to evaluation mode

test_total_reward = []
if len(action_list) != 5:
    print("Wrong format of file !!!")
    exit(0)

for actions in action_list:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0
    done = False

    # Replay the recorded actions until the list runs out or the episode ends
    for action in actions:
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

    print("Your reward is : %.2f" % total_reward)
    test_total_reward.append(total_reward)

Your reward is : -209.14
Your reward is : -45.50
Your reward is : 62.21
Your reward is : -200.09
Your reward is : -240.06
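
The server cell only checks that the file contains 5 episodes. A slightly stricter local check can catch other format problems before submitting. The sketch below is an assumption-laden illustration: validate_action_list is a hypothetical helper, n_episodes=5 comes from the check in the cell above, and n_actions=4 is inferred from the action ids 0 to 3 seen in the recorded outputs.

```python
def validate_action_list(action_list, n_episodes=5, n_actions=4):
    """Return a list of problems found; an empty list means the format looks fine."""
    problems = []
    valid_actions = set(range(n_actions))

    if len(action_list) != n_episodes:
        problems.append(f"expected {n_episodes} episodes, got {len(action_list)}")

    for i, actions in enumerate(action_list):
        if len(actions) == 0:
            problems.append(f"episode {i} is empty")
        bad = [a for a in actions if a not in valid_actions]
        if bad:
            problems.append(f"episode {i} has {len(bad)} invalid action(s)")

    return problems


# Toy examples: one well-formed list and one with two problems.
good = [[0, 1], [2], [3], [0], [1]]
broken = [[0], [5]]
print(validate_action_list(good))
print(validate_action_list(broken))
```

Running validate_action_list(action_list) right after np.load and stopping on a non-empty result would replace the bare length check with a more informative message.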

Your score

[28]

print("Your final reward is : %.2f" % np.mean(test_total_reward))

Your final reward is : -126.51

Reference

Below are some useful tips for getting a high score.
