Homework 12 - Reinforcement Learning
Deep Learning, notebook, python
goujiaxin, published 2024-04-11
Recommended image: Basic Image:bohrium-notebook:2023-04-07
Recommended machine: c4_m15_1 * NVIDIA T4
If you have any problems, e-mail us at ntu-ml-2022spring-ta@googlegroups.com

Preliminary work

First, we need to install all the necessary packages. One of them, gym, built by OpenAI, is a toolkit for developing reinforcement learning algorithms. The other packages are for visualization in Colab.

[1]
!apt update
!apt install python-opengl xvfb -y
#!pip install gym[box2d]==0.18.3 pyvirtualdisplay tqdm numpy==1.20 torch==1.8.1
!pip install -q swig
!pip install box2d==2.3.2 gym[box2d]==0.25.2 box2d-py pyvirtualdisplay tqdm numpy==1.22.4
!pip install box2d==2.3.2 box2d-kengz
!pip freeze > requirements.txt
Successfully installed box2d-2.3.2 box2d-py-2.3.5 pygame-2.1.0 pyvirtualdisplay-3.0
Successfully installed box2d-kengz-2.3.3

Next, set up a virtual display and import all the necessary packages.

[2]
%%capture
from pyvirtualdisplay import Display
virtual_display = Display(visible=0, size=(1400, 900))
virtual_display.start()

%matplotlib inline
import matplotlib.pyplot as plt

from IPython import display

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Categorical
from tqdm.notebook import tqdm

Warning! Do not change the random seed!!!

Otherwise, your submission on JudgeBoi will not reproduce your result!!!

Fix the seed so that your homework results are reproducible.

[3]
import random  # used by fix(); the original notebook only imports it in a later cell

seed = 543 # Do not change this
def fix(env, seed):
    env.seed(seed)
    env.action_space.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)
    random.seed(seed)
    #torch.set_deterministic(True)
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
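
To illustrate why the seed is fixed, here is a minimal sketch that uses only Python's standard random module (not gym or torch). It shows that re-seeding with the same value reproduces the same sequence of pseudo-random draws. The helper draw_actions is hypothetical and merely stands in for an agent sampling from a four-action space.

```python
import random

def draw_actions(seed, n=5):
    """Hypothetical helper: draw n pseudo-random action indices in [0, 3]."""
    rng = random.Random(seed)  # independent generator with an explicit seed
    return [rng.randrange(4) for _ in range(n)]

first = draw_actions(543)
second = draw_actions(543)  # same seed as above

print(first == second)  # the two sequences are identical
print(first)
```

The fix() function applies the same idea to every source of randomness the homework touches: the environment, its action space, PyTorch, NumPy, and Python's random module.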

Last, import gym and build a Lunar Lander environment.

[4]
%%capture
import gym
import random
env = gym.make('LunarLander-v2')
fix(env, seed) # fix the environment Do not revise this !!!

What is Lunar Lander?

"LunarLander-v2" simulates a craft landing on the surface of the moon.

The task is to make the craft land "safely" on the pad between the two yellow flags.

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector.

"LunarLander-v2" includes both an "Agent" and an "Environment".

In this homework, we use the step() function to control the action of the "Agent".

step() then returns the observation/state and the reward given by the "Environment".

Observation / State

First, we can take a look at what an Observation / State looks like.

[5]
print(env.observation_space)
Box([-1.5       -1.5       -5.        -5.        -3.1415927 -5.
 -0.        -0.       ], [1.5       1.5       5.        5.        3.1415927 5.        1.
 1.       ], (8,), float32)

The trailing (8,) in the Box description means that each observation is an 8-dimensional vector.
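
As a reading aid, the sketch below names the eight components. The layout follows the Gym documentation for LunarLander-v2 (position, linear velocity, angle, angular velocity, and two leg-contact flags). The LanderState type, the unpack_state helper, and the example values are hypothetical and not part of the homework code.

```python
from typing import NamedTuple, Sequence

class LanderState(NamedTuple):
    """Hypothetical named view of the 8-dim observation."""
    x: float                 # horizontal position (pad is at 0)
    y: float                 # vertical position (pad is at 0)
    vx: float                # horizontal velocity
    vy: float                # vertical velocity
    angle: float             # lander angle in radians
    angular_velocity: float
    leg1_contact: bool       # first leg touches the ground
    leg2_contact: bool       # second leg touches the ground

def unpack_state(obs: Sequence[float]) -> LanderState:
    if len(obs) != 8:
        raise ValueError(f"expected 8 values, got {len(obs)}")
    x, y, vx, vy, angle, ang_vel, leg1, leg2 = (float(v) for v in obs)
    # The contact flags arrive as 0.0 / 1.0, so threshold them into booleans.
    return LanderState(x, y, vx, vy, angle, ang_vel, leg1 > 0.5, leg2 > 0.5)

example = [0.0, 1.4, -0.1, -0.5, 0.0, 0.03, 0.0, 0.0]  # made-up values
state = unpack_state(example)
print(state)
```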

Action

The action space looks like this:

[6]
print(env.action_space)
Discrete(4)

Discrete(4) means that the agent can take four kinds of actions:

  • 0: do nothing
  • 1: fire the left orientation engine
  • 2: fire the main engine
  • 3: fire the right orientation engine
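
To avoid bare integers in your own experiments, the four action indices can be given names. This is only a convenience sketch: the LanderAction enum is hypothetical, and its member names follow the action descriptions in the Gym documentation for LunarLander-v2.

```python
from enum import IntEnum

class LanderAction(IntEnum):
    """Hypothetical names for the four Discrete(4) action indices."""
    DO_NOTHING = 0
    FIRE_LEFT_ENGINE = 1   # left orientation engine
    FIRE_MAIN_ENGINE = 2   # main engine
    FIRE_RIGHT_ENGINE = 3  # right orientation engine

action = LanderAction(3)
print(action.name, int(action))
```

Because IntEnum members behave like integers, a value such as LanderAction.FIRE_MAIN_ENGINE can be passed wherever the plain index 2 is expected.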

Next, we make the agent interact with the environment. Before taking any action, we recommend calling reset() to reset the environment. This function also returns the initial state of the environment.

[7]
initial_state = env.reset()
print(initial_state)
[-1.2619973e-03  1.3984586e+00 -1.2784091e-01 -5.5384123e-01
  1.4691149e-03  2.8957864e-02  0.0000000e+00  0.0000000e+00]

Then, we sample a random action from the agent's action space.

[8]
random_action = env.action_space.sample()
print(random_action)
3

Furthermore, we can use step() to make the agent act according to the randomly selected random_action. The step() function returns four values:

  • observation / state
  • reward
  • done (True / False)
  • other information
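
The reset() / step() contract can be tried out without gym. The sketch below uses a hypothetical ToyEnv (a one-dimensional walk, unrelated to Lunar Lander) whose step() returns the same four values, so the interaction loop has the same shape as the one used in this homework.

```python
import random

class ToyEnv:
    """Hypothetical stand-in environment: a 1-D walk that ends at position 3
    or after 10 steps, whichever comes first."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.pos = 0
        self.steps = 0

    def reset(self):
        self.pos, self.steps = 0, 0
        return self.pos  # initial observation

    def step(self, action):
        # Action 1 moves right; any other action moves left.
        self.pos += 1 if action == 1 else -1
        self.steps += 1
        reward = 1.0 if self.pos == 3 else -0.1
        done = self.pos == 3 or self.steps >= 10
        info = {"steps": self.steps}
        return self.pos, reward, done, info

env_demo = ToyEnv()
obs = env_demo.reset()
done, total = False, 0.0
while not done:
    action = env_demo.rng.randrange(2)  # random policy
    obs, reward, done, info = env_demo.step(action)
    total += reward
print(obs, total, info)
```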
[9]
observation, reward, done, info = env.step(random_action)
[10]
print(done)
False

Reward

The landing pad is always at coordinates (0,0), and the coordinates are the first two numbers in the state vector. Moving from the top of the screen to the landing pad and coming to rest there earns about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points, respectively. Each leg in contact with the ground is worth +10 points. Firing the main engine costs 0.3 points per frame. The task counts as solved at 200 points.
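
To get a feel for these numbers, the sketch below adds up the quantities quoted above for one imagined episode. It is a rough tally under stated assumptions, not the environment's actual reward computation: the function rough_episode_return and all input values are hypothetical.

```python
def rough_episode_return(approach_reward, main_engine_frames, legs_on_ground, landed):
    """Hypothetical tally based on the figures quoted in the text.

    approach_reward:    points for moving to the pad (about 100 to 140)
    main_engine_frames: frames during which the main engine fired
    legs_on_ground:     number of legs in ground contact (0, 1, or 2)
    landed:             True if the lander came to rest, False if it crashed
    """
    terminal = 100.0 if landed else -100.0
    return (approach_reward
            + 10.0 * legs_on_ground
            - 0.3 * main_engine_frames
            + terminal)

good = rough_episode_return(120.0, 100, 2, landed=True)
crash = rough_episode_return(60.0, 100, 0, landed=False)
print(good, good >= 200)    # a clean landing can clear the 200-point bar
print(crash, crash >= 200)  # a crash stays far below it
```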

[11]
print(reward)
-1.0511407416545058

Random Agent

Finally, before we start training, let's see whether a random agent can successfully land on the moon.

[12]
env.reset()

img = plt.imshow(env.render(mode='rgb_array'))

done = False
while not done:
    action = env.action_space.sample()
    observation, reward, done, _ = env.step(action)

    img.set_data(env.render(mode='rgb_array'))
    display.display(plt.gcf())
    display.clear_output(wait=True)
/opt/conda/lib/python3.8/site-packages/gym/core.py:43: DeprecationWarning: WARN: The argument mode in render method is deprecated; use render_mode during environment initialization instead.
See here for more information: https://www.gymlibrary.ml/content/api/
  deprecation(

Policy Gradient

Now, we can build a simple policy network. The network takes the 8-dimensional state as input and returns a probability for each of the four actions in the action space.

[13]
class PolicyGradientNetwork(nn.Module):

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(8, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, 4)

    def forward(self, state):
        hid = torch.tanh(self.fc1(state))
        hid = torch.tanh(self.fc2(hid))
        return F.softmax(self.fc3(hid), dim=-1)
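
The final softmax turns the four raw outputs of fc3 into a probability distribution over the actions. The following plain-Python sketch shows that step in isolation. The softmax helper and the logit values are made up for illustration; the network itself uses F.softmax.

```python
import math

def softmax(logits):
    """Plain-Python softmax over a list of numbers."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 0.5, -1.0])  # made-up logits for the four actions
print(probs)
print(sum(probs))  # the probabilities sum to 1 (up to rounding)
```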

Then, we need to build a simple agent. The agent acts according to the output of the policy network above. The agent can do the following:

  • learn(): update the policy network from the log probabilities and rewards.
  • sample(): after receiving an observation from the environment, use the policy network to decide which action to take. This function returns the action and its log probability.
[14]
from torch.optim.lr_scheduler import StepLR
class PolicyGradientAgent():

    def __init__(self, network):
        self.network = network
        self.optimizer = optim.SGD(self.network.parameters(), lr=0.001)

    def forward(self, state):
        return self.network(state)

    def learn(self, log_probs, rewards):
        loss = (-log_probs * rewards).sum() # You don't need to revise this to pass simple baseline (but you can)

        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()

    def sample(self, state):
        action_prob = self.network(torch.FloatTensor(state))
        action_dist = Categorical(action_prob)
        action = action_dist.sample()
        log_prob = action_dist.log_prob(action)
        return action.item(), log_prob
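
The two key steps of the agent, sampling an action with its log probability and forming the loss from log probabilities and rewards, can be sketched in plain Python. This is an illustration only: it computes the scalar value of the loss but no gradients, and the helpers sample_action and policy_gradient_loss, as well as the probabilities and rewards, are hypothetical.

```python
import math
import random

def sample_action(probs, rng):
    """Draw an action index from a categorical distribution and
    return it together with its log probability."""
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

def policy_gradient_loss(log_probs, rewards):
    """Scalar value of sum(-log_prob * reward), mirroring the expression in learn()."""
    return sum(-lp * r for lp, r in zip(log_probs, rewards))

rng = random.Random(0)
probs = [0.1, 0.2, 0.6, 0.1]            # made-up policy output
log_probs, rewards = [], []
for step_reward in [1.0, -0.5, 2.0]:    # made-up rewards
    action, log_prob = sample_action(probs, rng)
    log_probs.append(log_prob)
    rewards.append(step_reward)

loss = policy_gradient_loss(log_probs, rewards)
print(loss)
```

Minimizing this loss raises the log probability of actions paired with positive rewards and lowers it for actions paired with negative rewards.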

Lastly, build a network and an agent to start training.

[15]
network = PolicyGradientNetwork()
agent = PolicyGradientAgent(network)

Training Agent

Now let's start training our agent. By using all the interactions between the agent and the environment as training data, the policy network can learn from all these attempts.
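
The comments in the training cell suggest replacing the immediate reward with an accumulated, decaying reward (r1 + 0.99*r2 + 0.99^2*r3 + ...) for the medium baseline. One possible way to compute it is sketched here. The function discounted_returns is a hypothetical helper, not part of the provided code, and it assumes the rewards of a single episode are passed in, so that returns do not leak across episode boundaries.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every step t
    of one episode (hypothetical helper)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each return reuses the one after it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

episode_rewards = [1.0, 0.0, 2.0]  # made-up rewards of one short episode
print(discounted_returns(episode_rewards))
```

In the training loop, this would mean collecting the rewards of each episode separately (for example in seq_rewards) and extending the batch-level rewards list with the discounted values once the episode is done.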

[18]
agent.network.train()  # Switch network into training mode
EPISODE_PER_BATCH = 5  # update the agent every 5 episodes
NUM_BATCH = 500        # update the agent 500 times in total

avg_total_rewards, avg_final_rewards = [], []

prg_bar = tqdm(range(NUM_BATCH))
for batch in prg_bar:

    log_probs, rewards = [], []
    total_rewards, final_rewards = [], []

    # collect trajectory
    for episode in range(EPISODE_PER_BATCH):
        state = env.reset()
        total_reward, total_step = 0, 0
        seq_rewards = []
        while True:

            action, log_prob = agent.sample(state) # at, log(at|st)
            next_state, reward, done, _ = env.step(action)

            log_probs.append(log_prob) # [log(a1|s1), log(a2|s2), ...., log(at|st)]
            # seq_rewards.append(reward)
            state = next_state
            total_reward += reward
            total_step += 1
            rewards.append(reward) # change here
            # ! IMPORTANT !
            # Current reward implementation: immediate reward, given action_list : a1, a2, a3 ......
            # rewards : r1, r2 ,r3 ......
            # medium: change "rewards" to accumulative decaying reward, given action_list : a1, a2, a3, ......
            # rewards : r1+0.99*r2+0.99^2*r3+......, r2+0.99*r3+0.99^2*r4+...... , r3+0.99*r4+0.99^2*r5+ ......
            # boss : implement Actor-Critic
            if done:
                final_rewards.append(reward)
                total_rewards.append(total_reward)
                break

    print("rewards looks like ", np.shape(rewards))
    #print(f"log_probs looks like ", np.shape(log_probs))
    # record training process
    avg_total_reward = sum(total_rewards) / len(total_rewards)
    avg_final_reward = sum(final_rewards) / len(final_rewards)
    avg_total_rewards.append(avg_total_reward)
    avg_final_rewards.append(avg_final_reward)
    prg_bar.set_description(f"Total: {avg_total_reward: 4.1f}, Final: {avg_final_reward: 4.1f}")

    # update agent
    # rewards = np.concatenate(rewards, axis=0)
    rewards = (rewards - np.mean(rewards)) / (np.std(rewards) + 1e-9) # normalize the reward
    agent.learn(torch.stack(log_probs), torch.from_numpy(rewards))
    print("logs prob looks like ", torch.stack(log_probs).size())
    print("torch.from_numpy(rewards) looks like ", torch.from_numpy(rewards).size())
rewards looks like  (467,)
logs prob looks like  torch.Size([467])
torch.from_numpy(rewards) looks like  torch.Size([467])
rewards looks like  (460,)
logs prob looks like  torch.Size([460])
torch.from_numpy(rewards) looks like  torch.Size([460])
rewards looks like  (493,)
logs prob looks like  torch.Size([493])
torch.from_numpy(rewards) looks like  torch.Size([493])
rewards looks like  (426,)
logs prob looks like  torch.Size([426])
torch.from_numpy(rewards) looks like  torch.Size([426])
rewards looks like  (415,)
logs prob looks like  torch.Size([415])
torch.from_numpy(rewards) looks like  torch.Size([415])
rewards looks like  (504,)
logs prob looks like  torch.Size([504])
torch.from_numpy(rewards) looks like  torch.Size([504])
rewards looks like  (466,)
logs prob looks like  torch.Size([466])
torch.from_numpy(rewards) looks like  torch.Size([466])
rewards looks like  (475,)
logs prob looks like  torch.Size([475])
torch.from_numpy(rewards) looks like  torch.Size([475])
rewards looks like  (513,)
logs prob looks like  torch.Size([513])
torch.from_numpy(rewards) looks like  torch.Size([513])
rewards looks like  (618,)
logs prob looks like  torch.Size([618])
torch.from_numpy(rewards) looks like  torch.Size([618])
rewards looks like  (533,)
logs prob looks like  torch.Size([533])
torch.from_numpy(rewards) looks like  torch.Size([533])
rewards looks like  (475,)
logs prob looks like  torch.Size([475])
torch.from_numpy(rewards) looks like  torch.Size([475])
rewards looks like  (465,)
logs prob looks like  torch.Size([465])
torch.from_numpy(rewards) looks like  torch.Size([465])
rewards looks like  (1396,)
logs prob looks like  torch.Size([1396])
torch.from_numpy(rewards) looks like  torch.Size([1396])
rewards looks like  (541,)
logs prob looks like  torch.Size([541])
torch.from_numpy(rewards) looks like  torch.Size([541])
rewards looks like  (400,)
logs prob looks like  torch.Size([400])
torch.from_numpy(rewards) looks like  torch.Size([400])
rewards looks like  (541,)
logs prob looks like  torch.Size([541])
torch.from_numpy(rewards) looks like  torch.Size([541])
rewards looks like  (478,)
logs prob looks like  torch.Size([478])
torch.from_numpy(rewards) looks like  torch.Size([478])
rewards looks like  (491,)
logs prob looks like  torch.Size([491])
torch.from_numpy(rewards) looks like  torch.Size([491])
rewards looks like  (599,)
logs prob looks like  torch.Size([599])
torch.from_numpy(rewards) looks like  torch.Size([599])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (787,)
logs prob looks like  torch.Size([787])
torch.from_numpy(rewards) looks like  torch.Size([787])
rewards looks like  (656,)
logs prob looks like  torch.Size([656])
torch.from_numpy(rewards) looks like  torch.Size([656])
rewards looks like  (574,)
logs prob looks like  torch.Size([574])
torch.from_numpy(rewards) looks like  torch.Size([574])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (542,)
logs prob looks like  torch.Size([542])
torch.from_numpy(rewards) looks like  torch.Size([542])
rewards looks like  (558,)
logs prob looks like  torch.Size([558])
torch.from_numpy(rewards) looks like  torch.Size([558])
rewards looks like  (565,)
logs prob looks like  torch.Size([565])
torch.from_numpy(rewards) looks like  torch.Size([565])
rewards looks like  (463,)
logs prob looks like  torch.Size([463])
torch.from_numpy(rewards) looks like  torch.Size([463])
rewards looks like  (551,)
logs prob looks like  torch.Size([551])
torch.from_numpy(rewards) looks like  torch.Size([551])
rewards looks like  (580,)
logs prob looks like  torch.Size([580])
torch.from_numpy(rewards) looks like  torch.Size([580])
rewards looks like  (694,)
logs prob looks like  torch.Size([694])
torch.from_numpy(rewards) looks like  torch.Size([694])
rewards looks like  (537,)
logs prob looks like  torch.Size([537])
torch.from_numpy(rewards) looks like  torch.Size([537])
rewards looks like  (639,)
logs prob looks like  torch.Size([639])
torch.from_numpy(rewards) looks like  torch.Size([639])
rewards looks like  (519,)
logs prob looks like  torch.Size([519])
torch.from_numpy(rewards) looks like  torch.Size([519])
rewards looks like  (657,)
logs prob looks like  torch.Size([657])
torch.from_numpy(rewards) looks like  torch.Size([657])
rewards looks like  (647,)
logs prob looks like  torch.Size([647])
torch.from_numpy(rewards) looks like  torch.Size([647])
rewards looks like  (554,)
logs prob looks like  torch.Size([554])
torch.from_numpy(rewards) looks like  torch.Size([554])
rewards looks like  (558,)
logs prob looks like  torch.Size([558])
torch.from_numpy(rewards) looks like  torch.Size([558])
rewards looks like  (1382,)
logs prob looks like  torch.Size([1382])
torch.from_numpy(rewards) looks like  torch.Size([1382])
rewards looks like  (500,)
logs prob looks like  torch.Size([500])
torch.from_numpy(rewards) looks like  torch.Size([500])
rewards looks like  (575,)
logs prob looks like  torch.Size([575])
torch.from_numpy(rewards) looks like  torch.Size([575])
rewards looks like  (576,)
logs prob looks like  torch.Size([576])
torch.from_numpy(rewards) looks like  torch.Size([576])
rewards looks like  (510,)
logs prob looks like  torch.Size([510])
torch.from_numpy(rewards) looks like  torch.Size([510])
rewards looks like  (703,)
logs prob looks like  torch.Size([703])
torch.from_numpy(rewards) looks like  torch.Size([703])
rewards looks like  (509,)
logs prob looks like  torch.Size([509])
torch.from_numpy(rewards) looks like  torch.Size([509])
rewards looks like  (580,)
logs prob looks like  torch.Size([580])
torch.from_numpy(rewards) looks like  torch.Size([580])
rewards looks like  (1475,)
logs prob looks like  torch.Size([1475])
torch.from_numpy(rewards) looks like  torch.Size([1475])
rewards looks like  (729,)
logs prob looks like  torch.Size([729])
torch.from_numpy(rewards) looks like  torch.Size([729])
rewards looks like  (589,)
logs prob looks like  torch.Size([589])
torch.from_numpy(rewards) looks like  torch.Size([589])
rewards looks like  (494,)
logs prob looks like  torch.Size([494])
torch.from_numpy(rewards) looks like  torch.Size([494])
rewards looks like  (511,)
logs prob looks like  torch.Size([511])
torch.from_numpy(rewards) looks like  torch.Size([511])
rewards looks like  (816,)
logs prob looks like  torch.Size([816])
torch.from_numpy(rewards) looks like  torch.Size([816])
rewards looks like  (562,)
logs prob looks like  torch.Size([562])
torch.from_numpy(rewards) looks like  torch.Size([562])
rewards looks like  (827,)
logs prob looks like  torch.Size([827])
torch.from_numpy(rewards) looks like  torch.Size([827])
rewards looks like  (747,)
logs prob looks like  torch.Size([747])
torch.from_numpy(rewards) looks like  torch.Size([747])
rewards looks like  (804,)
logs prob looks like  torch.Size([804])
torch.from_numpy(rewards) looks like  torch.Size([804])
rewards looks like  (555,)
logs prob looks like  torch.Size([555])
torch.from_numpy(rewards) looks like  torch.Size([555])
rewards looks like  (786,)
logs prob looks like  torch.Size([786])
torch.from_numpy(rewards) looks like  torch.Size([786])
rewards looks like  (536,)
logs prob looks like  torch.Size([536])
torch.from_numpy(rewards) looks like  torch.Size([536])
rewards looks like  (680,)
logs prob looks like  torch.Size([680])
torch.from_numpy(rewards) looks like  torch.Size([680])
rewards looks like  (721,)
logs prob looks like  torch.Size([721])
torch.from_numpy(rewards) looks like  torch.Size([721])
rewards looks like  (664,)
logs prob looks like  torch.Size([664])
torch.from_numpy(rewards) looks like  torch.Size([664])
rewards looks like  (916,)
logs prob looks like  torch.Size([916])
torch.from_numpy(rewards) looks like  torch.Size([916])
rewards looks like  (1148,)
logs prob looks like  torch.Size([1148])
torch.from_numpy(rewards) looks like  torch.Size([1148])
rewards looks like  (644,)
logs prob looks like  torch.Size([644])
torch.from_numpy(rewards) looks like  torch.Size([644])
rewards looks like  (671,)
logs prob looks like  torch.Size([671])
torch.from_numpy(rewards) looks like  torch.Size([671])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (865,)
logs prob looks like  torch.Size([865])
torch.from_numpy(rewards) looks like  torch.Size([865])
rewards looks like  (621,)
logs prob looks like  torch.Size([621])
torch.from_numpy(rewards) looks like  torch.Size([621])
rewards looks like  (772,)
logs prob looks like  torch.Size([772])
torch.from_numpy(rewards) looks like  torch.Size([772])
rewards looks like  (720,)
logs prob looks like  torch.Size([720])
torch.from_numpy(rewards) looks like  torch.Size([720])
rewards looks like  (972,)
logs prob looks like  torch.Size([972])
torch.from_numpy(rewards) looks like  torch.Size([972])
rewards looks like  (979,)
logs prob looks like  torch.Size([979])
torch.from_numpy(rewards) looks like  torch.Size([979])
rewards looks like  (1539,)
logs prob looks like  torch.Size([1539])
torch.from_numpy(rewards) looks like  torch.Size([1539])
rewards looks like  (604,)
logs prob looks like  torch.Size([604])
torch.from_numpy(rewards) looks like  torch.Size([604])
rewards looks like  (724,)
logs prob looks like  torch.Size([724])
torch.from_numpy(rewards) looks like  torch.Size([724])
rewards looks like  (821,)
logs prob looks like  torch.Size([821])
torch.from_numpy(rewards) looks like  torch.Size([821])
rewards looks like  (778,)
logs prob looks like  torch.Size([778])
torch.from_numpy(rewards) looks like  torch.Size([778])
rewards looks like  (625,)
logs prob looks like  torch.Size([625])
torch.from_numpy(rewards) looks like  torch.Size([625])
rewards looks like  (853,)
logs prob looks like  torch.Size([853])
torch.from_numpy(rewards) looks like  torch.Size([853])
rewards looks like  (797,)
logs prob looks like  torch.Size([797])
torch.from_numpy(rewards) looks like  torch.Size([797])
rewards looks like  (922,)
logs prob looks like  torch.Size([922])
torch.from_numpy(rewards) looks like  torch.Size([922])
rewards looks like  (839,)
logs prob looks like  torch.Size([839])
torch.from_numpy(rewards) looks like  torch.Size([839])
rewards looks like  (765,)
logs prob looks like  torch.Size([765])
torch.from_numpy(rewards) looks like  torch.Size([765])
rewards looks like  (682,)
logs prob looks like  torch.Size([682])
torch.from_numpy(rewards) looks like  torch.Size([682])
rewards looks like  (809,)
logs prob looks like  torch.Size([809])
torch.from_numpy(rewards) looks like  torch.Size([809])
rewards looks like  (768,)
logs prob looks like  torch.Size([768])
torch.from_numpy(rewards) looks like  torch.Size([768])
rewards looks like  (635,)
logs prob looks like  torch.Size([635])
torch.from_numpy(rewards) looks like  torch.Size([635])
rewards looks like  (722,)
logs prob looks like  torch.Size([722])
torch.from_numpy(rewards) looks like  torch.Size([722])
rewards looks like  (894,)
logs prob looks like  torch.Size([894])
torch.from_numpy(rewards) looks like  torch.Size([894])
rewards looks like  (912,)
logs prob looks like  torch.Size([912])
torch.from_numpy(rewards) looks like  torch.Size([912])
rewards looks like  (769,)
logs prob looks like  torch.Size([769])
torch.from_numpy(rewards) looks like  torch.Size([769])
rewards looks like  (719,)
logs prob looks like  torch.Size([719])
torch.from_numpy(rewards) looks like  torch.Size([719])
rewards looks like  (1036,)
logs prob looks like  torch.Size([1036])
torch.from_numpy(rewards) looks like  torch.Size([1036])
rewards looks like  (671,)
logs prob looks like  torch.Size([671])
torch.from_numpy(rewards) looks like  torch.Size([671])
rewards looks like  (795,)
logs prob looks like  torch.Size([795])
torch.from_numpy(rewards) looks like  torch.Size([795])
rewards looks like  (822,)
logs prob looks like  torch.Size([822])
torch.from_numpy(rewards) looks like  torch.Size([822])
rewards looks like  (940,)
logs prob looks like  torch.Size([940])
torch.from_numpy(rewards) looks like  torch.Size([940])
rewards looks like  (805,)
logs prob looks like  torch.Size([805])
torch.from_numpy(rewards) looks like  torch.Size([805])
rewards looks like  (888,)
logs prob looks like  torch.Size([888])
torch.from_numpy(rewards) looks like  torch.Size([888])
rewards looks like  (795,)
logs prob looks like  torch.Size([795])
torch.from_numpy(rewards) looks like  torch.Size([795])
rewards looks like  (732,)
logs prob looks like  torch.Size([732])
torch.from_numpy(rewards) looks like  torch.Size([732])
rewards looks like  (857,)
logs prob looks like  torch.Size([857])
torch.from_numpy(rewards) looks like  torch.Size([857])
rewards looks like  (1208,)
logs prob looks like  torch.Size([1208])
torch.from_numpy(rewards) looks like  torch.Size([1208])
rewards looks like  (755,)
logs prob looks like  torch.Size([755])
torch.from_numpy(rewards) looks like  torch.Size([755])
rewards looks like  (975,)
logs prob looks like  torch.Size([975])
torch.from_numpy(rewards) looks like  torch.Size([975])
rewards looks like  (969,)
logs prob looks like  torch.Size([969])
torch.from_numpy(rewards) looks like  torch.Size([969])
rewards looks like  (1217,)
logs prob looks like  torch.Size([1217])
torch.from_numpy(rewards) looks like  torch.Size([1217])
rewards looks like  (1466,)
logs prob looks like  torch.Size([1466])
torch.from_numpy(rewards) looks like  torch.Size([1466])
rewards looks like  (892,)
logs prob looks like  torch.Size([892])
torch.from_numpy(rewards) looks like  torch.Size([892])
rewards looks like  (933,)
logs prob looks like  torch.Size([933])
torch.from_numpy(rewards) looks like  torch.Size([933])
rewards looks like  (1991,)
logs prob looks like  torch.Size([1991])
torch.from_numpy(rewards) looks like  torch.Size([1991])
rewards looks like  (602,)
logs prob looks like  torch.Size([602])
torch.from_numpy(rewards) looks like  torch.Size([602])
rewards looks like  (694,)
logs prob looks like  torch.Size([694])
torch.from_numpy(rewards) looks like  torch.Size([694])
rewards looks like  (962,)
logs prob looks like  torch.Size([962])
torch.from_numpy(rewards) looks like  torch.Size([962])
rewards looks like  (889,)
logs prob looks like  torch.Size([889])
torch.from_numpy(rewards) looks like  torch.Size([889])
rewards looks like  (874,)
logs prob looks like  torch.Size([874])
torch.from_numpy(rewards) looks like  torch.Size([874])
rewards looks like  (1108,)
logs prob looks like  torch.Size([1108])
torch.from_numpy(rewards) looks like  torch.Size([1108])
rewards looks like  (994,)
logs prob looks like  torch.Size([994])
torch.from_numpy(rewards) looks like  torch.Size([994])
rewards looks like  (1742,)
logs prob looks like  torch.Size([1742])
torch.from_numpy(rewards) looks like  torch.Size([1742])
rewards looks like  (1287,)
logs prob looks like  torch.Size([1287])
torch.from_numpy(rewards) looks like  torch.Size([1287])
rewards looks like  (1190,)
logs prob looks like  torch.Size([1190])
torch.from_numpy(rewards) looks like  torch.Size([1190])
rewards looks like  (1016,)
logs prob looks like  torch.Size([1016])
torch.from_numpy(rewards) looks like  torch.Size([1016])
rewards looks like  (810,)
logs prob looks like  torch.Size([810])
torch.from_numpy(rewards) looks like  torch.Size([810])
rewards looks like  (1244,)
logs prob looks like  torch.Size([1244])
torch.from_numpy(rewards) looks like  torch.Size([1244])
rewards looks like  (1755,)
logs prob looks like  torch.Size([1755])
torch.from_numpy(rewards) looks like  torch.Size([1755])
rewards looks like  (1467,)
logs prob looks like  torch.Size([1467])
torch.from_numpy(rewards) looks like  torch.Size([1467])
rewards looks like  (1530,)
logs prob looks like  torch.Size([1530])
torch.from_numpy(rewards) looks like  torch.Size([1530])
rewards looks like  (2494,)
logs prob looks like  torch.Size([2494])
torch.from_numpy(rewards) looks like  torch.Size([2494])
rewards looks like  (1130,)
logs prob looks like  torch.Size([1130])
torch.from_numpy(rewards) looks like  torch.Size([1130])
rewards looks like  (1282,)
logs prob looks like  torch.Size([1282])
torch.from_numpy(rewards) looks like  torch.Size([1282])
rewards looks like  (2414,)
logs prob looks like  torch.Size([2414])
torch.from_numpy(rewards) looks like  torch.Size([2414])
rewards looks like  (1461,)
logs prob looks like  torch.Size([1461])
torch.from_numpy(rewards) looks like  torch.Size([1461])
rewards looks like  (818,)
logs prob looks like  torch.Size([818])
torch.from_numpy(rewards) looks like  torch.Size([818])
rewards looks like  (1231,)
logs prob looks like  torch.Size([1231])
torch.from_numpy(rewards) looks like  torch.Size([1231])
rewards looks like  (2387,)
logs prob looks like  torch.Size([2387])
torch.from_numpy(rewards) looks like  torch.Size([2387])
rewards looks like  (421,)
logs prob looks like  torch.Size([421])
torch.from_numpy(rewards) looks like  torch.Size([421])
rewards looks like  (374,)
logs prob looks like  torch.Size([374])
torch.from_numpy(rewards) looks like  torch.Size([374])
rewards looks like  (419,)
logs prob looks like  torch.Size([419])
torch.from_numpy(rewards) looks like  torch.Size([419])
rewards looks like  (345,)
logs prob looks like  torch.Size([345])
torch.from_numpy(rewards) looks like  torch.Size([345])
rewards looks like  (422,)
logs prob looks like  torch.Size([422])
torch.from_numpy(rewards) looks like  torch.Size([422])
rewards looks like  (426,)
logs prob looks like  torch.Size([426])
torch.from_numpy(rewards) looks like  torch.Size([426])
rewards looks like  (416,)
logs prob looks like  torch.Size([416])
torch.from_numpy(rewards) looks like  torch.Size([416])
rewards looks like  (374,)
logs prob looks like  torch.Size([374])
torch.from_numpy(rewards) looks like  torch.Size([374])
rewards looks like  (442,)
logs prob looks like  torch.Size([442])
torch.from_numpy(rewards) looks like  torch.Size([442])
rewards looks like  (387,)
logs prob looks like  torch.Size([387])
torch.from_numpy(rewards) looks like  torch.Size([387])
rewards looks like  (364,)
logs prob looks like  torch.Size([364])
torch.from_numpy(rewards) looks like  torch.Size([364])
rewards looks like  (433,)
logs prob looks like  torch.Size([433])
torch.from_numpy(rewards) looks like  torch.Size([433])
rewards looks like  (447,)
logs prob looks like  torch.Size([447])
torch.from_numpy(rewards) looks like  torch.Size([447])
rewards looks like  (450,)
logs prob looks like  torch.Size([450])
torch.from_numpy(rewards) looks like  torch.Size([450])
rewards looks like  (468,)
logs prob looks like  torch.Size([468])
torch.from_numpy(rewards) looks like  torch.Size([468])
rewards looks like  (459,)
logs prob looks like  torch.Size([459])
torch.from_numpy(rewards) looks like  torch.Size([459])
rewards looks like  (463,)
logs prob looks like  torch.Size([463])
torch.from_numpy(rewards) looks like  torch.Size([463])
rewards looks like  (1427,)
logs prob looks like  torch.Size([1427])
torch.from_numpy(rewards) looks like  torch.Size([1427])
rewards looks like  (1327,)
logs prob looks like  torch.Size([1327])
torch.from_numpy(rewards) looks like  torch.Size([1327])
rewards looks like  (1328,)
logs prob looks like  torch.Size([1328])
torch.from_numpy(rewards) looks like  torch.Size([1328])
rewards looks like  (1374,)
logs prob looks like  torch.Size([1374])
torch.from_numpy(rewards) looks like  torch.Size([1374])
rewards looks like  (2257,)
logs prob looks like  torch.Size([2257])
torch.from_numpy(rewards) looks like  torch.Size([2257])
rewards looks like  (1379,)
logs prob looks like  torch.Size([1379])
torch.from_numpy(rewards) looks like  torch.Size([1379])
rewards looks like  (2934,)
logs prob looks like  torch.Size([2934])
torch.from_numpy(rewards) looks like  torch.Size([2934])
rewards looks like  (1415,)
logs prob looks like  torch.Size([1415])
torch.from_numpy(rewards) looks like  torch.Size([1415])
rewards looks like  (698,)
logs prob looks like  torch.Size([698])
torch.from_numpy(rewards) looks like  torch.Size([698])
rewards looks like  (1740,)
logs prob looks like  torch.Size([1740])
torch.from_numpy(rewards) looks like  torch.Size([1740])
rewards looks like  (2216,)
logs prob looks like  torch.Size([2216])
torch.from_numpy(rewards) looks like  torch.Size([2216])
rewards looks like  (1920,)
logs prob looks like  torch.Size([1920])
torch.from_numpy(rewards) looks like  torch.Size([1920])
rewards looks like  (1229,)
logs prob looks like  torch.Size([1229])
torch.from_numpy(rewards) looks like  torch.Size([1229])
rewards looks like  (2278,)
logs prob looks like  torch.Size([2278])
torch.from_numpy(rewards) looks like  torch.Size([2278])
rewards looks like  (2598,)
logs prob looks like  torch.Size([2598])
torch.from_numpy(rewards) looks like  torch.Size([2598])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (2926,)
logs prob looks like  torch.Size([2926])
torch.from_numpy(rewards) looks like  torch.Size([2926])
rewards looks like  (1525,)
logs prob looks like  torch.Size([1525])
torch.from_numpy(rewards) looks like  torch.Size([1525])
rewards looks like  (965,)
logs prob looks like  torch.Size([965])
torch.from_numpy(rewards) looks like  torch.Size([965])
rewards looks like  (1734,)
logs prob looks like  torch.Size([1734])
torch.from_numpy(rewards) looks like  torch.Size([1734])
rewards looks like  (1625,)
logs prob looks like  torch.Size([1625])
torch.from_numpy(rewards) looks like  torch.Size([1625])
rewards looks like  (1081,)
logs prob looks like  torch.Size([1081])
torch.from_numpy(rewards) looks like  torch.Size([1081])
rewards looks like  (1628,)
logs prob looks like  torch.Size([1628])
torch.from_numpy(rewards) looks like  torch.Size([1628])
rewards looks like  (2825,)
logs prob looks like  torch.Size([2825])
torch.from_numpy(rewards) looks like  torch.Size([2825])
rewards looks like  (3485,)
logs prob looks like  torch.Size([3485])
torch.from_numpy(rewards) looks like  torch.Size([3485])
rewards looks like  (1514,)
logs prob looks like  torch.Size([1514])
torch.from_numpy(rewards) looks like  torch.Size([1514])
rewards looks like  (846,)
logs prob looks like  torch.Size([846])
torch.from_numpy(rewards) looks like  torch.Size([846])
rewards looks like  (755,)
logs prob looks like  torch.Size([755])
torch.from_numpy(rewards) looks like  torch.Size([755])
rewards looks like  (1059,)
logs prob looks like  torch.Size([1059])
torch.from_numpy(rewards) looks like  torch.Size([1059])
rewards looks like  (2581,)
logs prob looks like  torch.Size([2581])
torch.from_numpy(rewards) looks like  torch.Size([2581])
rewards looks like  (2767,)
logs prob looks like  torch.Size([2767])
torch.from_numpy(rewards) looks like  torch.Size([2767])
rewards looks like  (899,)
logs prob looks like  torch.Size([899])
torch.from_numpy(rewards) looks like  torch.Size([899])
rewards looks like  (2808,)
logs prob looks like  torch.Size([2808])
torch.from_numpy(rewards) looks like  torch.Size([2808])
rewards looks like  (1459,)
logs prob looks like  torch.Size([1459])
torch.from_numpy(rewards) looks like  torch.Size([1459])
rewards looks like  (2458,)
logs prob looks like  torch.Size([2458])
torch.from_numpy(rewards) looks like  torch.Size([2458])
rewards looks like  (1027,)
logs prob looks like  torch.Size([1027])
torch.from_numpy(rewards) looks like  torch.Size([1027])
rewards looks like  (1907,)
logs prob looks like  torch.Size([1907])
torch.from_numpy(rewards) looks like  torch.Size([1907])
rewards looks like  (1878,)
logs prob looks like  torch.Size([1878])
torch.from_numpy(rewards) looks like  torch.Size([1878])
rewards looks like  (2129,)
logs prob looks like  torch.Size([2129])
torch.from_numpy(rewards) looks like  torch.Size([2129])
rewards looks like  (2873,)
logs prob looks like  torch.Size([2873])
torch.from_numpy(rewards) looks like  torch.Size([2873])
rewards looks like  (1311,)
logs prob looks like  torch.Size([1311])
torch.from_numpy(rewards) looks like  torch.Size([1311])
rewards looks like  (1888,)
logs prob looks like  torch.Size([1888])
torch.from_numpy(rewards) looks like  torch.Size([1888])
rewards looks like  (870,)
logs prob looks like  torch.Size([870])
torch.from_numpy(rewards) looks like  torch.Size([870])
rewards looks like  (1193,)
logs prob looks like  torch.Size([1193])
torch.from_numpy(rewards) looks like  torch.Size([1193])
rewards looks like  (1367,)
logs prob looks like  torch.Size([1367])
torch.from_numpy(rewards) looks like  torch.Size([1367])
rewards looks like  (1786,)
logs prob looks like  torch.Size([1786])
torch.from_numpy(rewards) looks like  torch.Size([1786])
rewards looks like  (992,)
logs prob looks like  torch.Size([992])
torch.from_numpy(rewards) looks like  torch.Size([992])
rewards looks like  (1037,)
logs prob looks like  torch.Size([1037])
torch.from_numpy(rewards) looks like  torch.Size([1037])
rewards looks like  (2417,)
logs prob looks like  torch.Size([2417])
torch.from_numpy(rewards) looks like  torch.Size([2417])
rewards looks like  (2027,)
logs prob looks like  torch.Size([2027])
torch.from_numpy(rewards) looks like  torch.Size([2027])
rewards looks like  (1203,)
logs prob looks like  torch.Size([1203])
torch.from_numpy(rewards) looks like  torch.Size([1203])
rewards looks like  (2168,)
logs prob looks like  torch.Size([2168])
torch.from_numpy(rewards) looks like  torch.Size([2168])
rewards looks like  (1097,)
logs prob looks like  torch.Size([1097])
torch.from_numpy(rewards) looks like  torch.Size([1097])
rewards looks like  (2070,)
logs prob looks like  torch.Size([2070])
torch.from_numpy(rewards) looks like  torch.Size([2070])
rewards looks like  (1878,)
logs prob looks like  torch.Size([1878])
torch.from_numpy(rewards) looks like  torch.Size([1878])
rewards looks like  (1325,)
logs prob looks like  torch.Size([1325])
torch.from_numpy(rewards) looks like  torch.Size([1325])
rewards looks like  (2611,)
logs prob looks like  torch.Size([2611])
torch.from_numpy(rewards) looks like  torch.Size([2611])
rewards looks like  (1549,)
logs prob looks like  torch.Size([1549])
torch.from_numpy(rewards) looks like  torch.Size([1549])
rewards looks like  (2479,)
logs prob looks like  torch.Size([2479])
torch.from_numpy(rewards) looks like  torch.Size([2479])
rewards looks like  (1987,)
logs prob looks like  torch.Size([1987])
torch.from_numpy(rewards) looks like  torch.Size([1987])
rewards looks like  (1370,)
logs prob looks like  torch.Size([1370])
torch.from_numpy(rewards) looks like  torch.Size([1370])
rewards looks like  (1003,)
logs prob looks like  torch.Size([1003])
torch.from_numpy(rewards) looks like  torch.Size([1003])
rewards looks like  (2640,)
logs prob looks like  torch.Size([2640])
torch.from_numpy(rewards) looks like  torch.Size([2640])
rewards looks like  (1486,)
logs prob looks like  torch.Size([1486])
torch.from_numpy(rewards) looks like  torch.Size([1486])
rewards looks like  (2105,)
logs prob looks like  torch.Size([2105])
torch.from_numpy(rewards) looks like  torch.Size([2105])
rewards looks like  (2222,)
logs prob looks like  torch.Size([2222])
torch.from_numpy(rewards) looks like  torch.Size([2222])
rewards looks like  (1209,)
logs prob looks like  torch.Size([1209])
torch.from_numpy(rewards) looks like  torch.Size([1209])
rewards looks like  (1666,)
logs prob looks like  torch.Size([1666])
torch.from_numpy(rewards) looks like  torch.Size([1666])
rewards looks like  (1435,)
logs prob looks like  torch.Size([1435])
torch.from_numpy(rewards) looks like  torch.Size([1435])
rewards looks like  (1231,)
logs prob looks like  torch.Size([1231])
torch.from_numpy(rewards) looks like  torch.Size([1231])
rewards looks like  (1207,)
logs prob looks like  torch.Size([1207])
torch.from_numpy(rewards) looks like  torch.Size([1207])
rewards looks like  (1155,)
logs prob looks like  torch.Size([1155])
torch.from_numpy(rewards) looks like  torch.Size([1155])
rewards looks like  (1526,)
logs prob looks like  torch.Size([1526])
torch.from_numpy(rewards) looks like  torch.Size([1526])
rewards looks like  (2181,)
logs prob looks like  torch.Size([2181])
torch.from_numpy(rewards) looks like  torch.Size([2181])
rewards looks like  (1868,)
logs prob looks like  torch.Size([1868])
torch.from_numpy(rewards) looks like  torch.Size([1868])
rewards looks like  (2452,)
logs prob looks like  torch.Size([2452])
torch.from_numpy(rewards) looks like  torch.Size([2452])
rewards looks like  (1363,)
logs prob looks like  torch.Size([1363])
torch.from_numpy(rewards) looks like  torch.Size([1363])
rewards looks like  (1543,)
logs prob looks like  torch.Size([1543])
torch.from_numpy(rewards) looks like  torch.Size([1543])
rewards looks like  (2103,)
logs prob looks like  torch.Size([2103])
torch.from_numpy(rewards) looks like  torch.Size([2103])
rewards looks like  (1750,)
logs prob looks like  torch.Size([1750])
torch.from_numpy(rewards) looks like  torch.Size([1750])
rewards looks like  (1453,)
logs prob looks like  torch.Size([1453])
torch.from_numpy(rewards) looks like  torch.Size([1453])
rewards looks like  (1996,)
logs prob looks like  torch.Size([1996])
torch.from_numpy(rewards) looks like  torch.Size([1996])
rewards looks like  (1634,)
logs prob looks like  torch.Size([1634])
torch.from_numpy(rewards) looks like  torch.Size([1634])
rewards looks like  (1364,)
logs prob looks like  torch.Size([1364])
torch.from_numpy(rewards) looks like  torch.Size([1364])
rewards looks like  (2401,)
logs prob looks like  torch.Size([2401])
torch.from_numpy(rewards) looks like  torch.Size([2401])
rewards looks like  (1041,)
logs prob looks like  torch.Size([1041])
torch.from_numpy(rewards) looks like  torch.Size([1041])
rewards looks like  (1014,)
logs prob looks like  torch.Size([1014])
torch.from_numpy(rewards) looks like  torch.Size([1014])
rewards looks like  (1723,)
logs prob looks like  torch.Size([1723])
torch.from_numpy(rewards) looks like  torch.Size([1723])
rewards looks like  (1141,)
logs prob looks like  torch.Size([1141])
torch.from_numpy(rewards) looks like  torch.Size([1141])
rewards looks like  (1153,)
logs prob looks like  torch.Size([1153])
torch.from_numpy(rewards) looks like  torch.Size([1153])
rewards looks like  (1345,)
logs prob looks like  torch.Size([1345])
torch.from_numpy(rewards) looks like  torch.Size([1345])
rewards looks like  (1537,)
logs prob looks like  torch.Size([1537])
torch.from_numpy(rewards) looks like  torch.Size([1537])
rewards looks like  (1362,)
logs prob looks like  torch.Size([1362])
torch.from_numpy(rewards) looks like  torch.Size([1362])
rewards looks like  (1400,)
logs prob looks like  torch.Size([1400])
torch.from_numpy(rewards) looks like  torch.Size([1400])
rewards looks like  (1363,)
logs prob looks like  torch.Size([1363])
torch.from_numpy(rewards) looks like  torch.Size([1363])
rewards looks like  (1381,)
logs prob looks like  torch.Size([1381])
torch.from_numpy(rewards) looks like  torch.Size([1381])
rewards looks like  (2077,)
logs prob looks like  torch.Size([2077])
torch.from_numpy(rewards) looks like  torch.Size([2077])
rewards looks like  (2517,)
logs prob looks like  torch.Size([2517])
torch.from_numpy(rewards) looks like  torch.Size([2517])
rewards looks like  (1419,)
logs prob looks like  torch.Size([1419])
torch.from_numpy(rewards) looks like  torch.Size([1419])
rewards looks like  (960,)
logs prob looks like  torch.Size([960])
torch.from_numpy(rewards) looks like  torch.Size([960])
rewards looks like  (1079,)
logs prob looks like  torch.Size([1079])
torch.from_numpy(rewards) looks like  torch.Size([1079])
rewards looks like  (1285,)
logs prob looks like  torch.Size([1285])
torch.from_numpy(rewards) looks like  torch.Size([1285])
rewards looks like  (2475,)
logs prob looks like  torch.Size([2475])
torch.from_numpy(rewards) looks like  torch.Size([2475])
rewards looks like  (1376,)
logs prob looks like  torch.Size([1376])
torch.from_numpy(rewards) looks like  torch.Size([1376])
rewards looks like  (2248,)
logs prob looks like  torch.Size([2248])
torch.from_numpy(rewards) looks like  torch.Size([2248])
rewards looks like  (2912,)
logs prob looks like  torch.Size([2912])
torch.from_numpy(rewards) looks like  torch.Size([2912])
rewards looks like  (1334,)
logs prob looks like  torch.Size([1334])
torch.from_numpy(rewards) looks like  torch.Size([1334])
rewards looks like  (1481,)
logs prob looks like  torch.Size([1481])
torch.from_numpy(rewards) looks like  torch.Size([1481])
rewards looks like  (2016,)
logs prob looks like  torch.Size([2016])
torch.from_numpy(rewards) looks like  torch.Size([2016])
rewards looks like  (1899,)
logs prob looks like  torch.Size([1899])
torch.from_numpy(rewards) looks like  torch.Size([1899])
rewards looks like  (1171,)
logs prob looks like  torch.Size([1171])
torch.from_numpy(rewards) looks like  torch.Size([1171])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1945,)
logs prob looks like  torch.Size([1945])
torch.from_numpy(rewards) looks like  torch.Size([1945])
rewards looks like  (2421,)
logs prob looks like  torch.Size([2421])
torch.from_numpy(rewards) looks like  torch.Size([2421])
rewards looks like  (1859,)
logs prob looks like  torch.Size([1859])
torch.from_numpy(rewards) looks like  torch.Size([1859])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (1297,)
logs prob looks like  torch.Size([1297])
torch.from_numpy(rewards) looks like  torch.Size([1297])
rewards looks like  (2085,)
logs prob looks like  torch.Size([2085])
torch.from_numpy(rewards) looks like  torch.Size([2085])
rewards looks like  (1478,)
logs prob looks like  torch.Size([1478])
torch.from_numpy(rewards) looks like  torch.Size([1478])
rewards looks like  (1131,)
logs prob looks like  torch.Size([1131])
torch.from_numpy(rewards) looks like  torch.Size([1131])
rewards looks like  (1370,)
logs prob looks like  torch.Size([1370])
torch.from_numpy(rewards) looks like  torch.Size([1370])
rewards looks like  (1503,)
logs prob looks like  torch.Size([1503])
torch.from_numpy(rewards) looks like  torch.Size([1503])
rewards looks like  (1058,)
logs prob looks like  torch.Size([1058])
torch.from_numpy(rewards) looks like  torch.Size([1058])
rewards looks like  (1350,)
logs prob looks like  torch.Size([1350])
torch.from_numpy(rewards) looks like  torch.Size([1350])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1364,)
logs prob looks like  torch.Size([1364])
torch.from_numpy(rewards) looks like  torch.Size([1364])
rewards looks like  (1084,)
logs prob looks like  torch.Size([1084])
torch.from_numpy(rewards) looks like  torch.Size([1084])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1286,)
logs prob looks like  torch.Size([1286])
torch.from_numpy(rewards) looks like  torch.Size([1286])
rewards looks like  (1477,)
logs prob looks like  torch.Size([1477])
torch.from_numpy(rewards) looks like  torch.Size([1477])
rewards looks like  (1172,)
logs prob looks like  torch.Size([1172])
torch.from_numpy(rewards) looks like  torch.Size([1172])
rewards looks like  (1366,)
logs prob looks like  torch.Size([1366])
torch.from_numpy(rewards) looks like  torch.Size([1366])
rewards looks like  (1826,)
logs prob looks like  torch.Size([1826])
torch.from_numpy(rewards) looks like  torch.Size([1826])
rewards looks like  (1165,)
logs prob looks like  torch.Size([1165])
torch.from_numpy(rewards) looks like  torch.Size([1165])
rewards looks like  (2540,)
logs prob looks like  torch.Size([2540])
torch.from_numpy(rewards) looks like  torch.Size([2540])
rewards looks like  (1507,)
logs prob looks like  torch.Size([1507])
torch.from_numpy(rewards) looks like  torch.Size([1507])
rewards looks like  (2418,)
logs prob looks like  torch.Size([2418])
torch.from_numpy(rewards) looks like  torch.Size([2418])
rewards looks like  (1300,)
logs prob looks like  torch.Size([1300])
torch.from_numpy(rewards) looks like  torch.Size([1300])
rewards looks like  (2572,)
logs prob looks like  torch.Size([2572])
torch.from_numpy(rewards) looks like  torch.Size([2572])
rewards looks like  (1225,)
logs prob looks like  torch.Size([1225])
torch.from_numpy(rewards) looks like  torch.Size([1225])
rewards looks like  (1586,)
logs prob looks like  torch.Size([1586])
torch.from_numpy(rewards) looks like  torch.Size([1586])
rewards looks like  (1460,)
logs prob looks like  torch.Size([1460])
torch.from_numpy(rewards) looks like  torch.Size([1460])
rewards looks like  (1458,)
logs prob looks like  torch.Size([1458])
torch.from_numpy(rewards) looks like  torch.Size([1458])
rewards looks like  (1381,)
logs prob looks like  torch.Size([1381])
torch.from_numpy(rewards) looks like  torch.Size([1381])
rewards looks like  (1356,)
logs prob looks like  torch.Size([1356])
torch.from_numpy(rewards) looks like  torch.Size([1356])
rewards looks like  (1520,)
logs prob looks like  torch.Size([1520])
torch.from_numpy(rewards) looks like  torch.Size([1520])
rewards looks like  (1570,)
logs prob looks like  torch.Size([1570])
torch.from_numpy(rewards) looks like  torch.Size([1570])
rewards looks like  (1303,)
logs prob looks like  torch.Size([1303])
torch.from_numpy(rewards) looks like  torch.Size([1303])
rewards looks like  (2160,)
logs prob looks like  torch.Size([2160])
torch.from_numpy(rewards) looks like  torch.Size([2160])
rewards looks like  (1344,)
logs prob looks like  torch.Size([1344])
torch.from_numpy(rewards) looks like  torch.Size([1344])
rewards looks like  (1496,)
logs prob looks like  torch.Size([1496])
torch.from_numpy(rewards) looks like  torch.Size([1496])
rewards looks like  (1905,)
logs prob looks like  torch.Size([1905])
torch.from_numpy(rewards) looks like  torch.Size([1905])
rewards looks like  (1255,)
logs prob looks like  torch.Size([1255])
torch.from_numpy(rewards) looks like  torch.Size([1255])
rewards looks like  (1440,)
logs prob looks like  torch.Size([1440])
torch.from_numpy(rewards) looks like  torch.Size([1440])
rewards looks like  (1472,)
logs prob looks like  torch.Size([1472])
torch.from_numpy(rewards) looks like  torch.Size([1472])
rewards looks like  (1261,)
logs prob looks like  torch.Size([1261])
torch.from_numpy(rewards) looks like  torch.Size([1261])
rewards looks like  (2225,)
logs prob looks like  torch.Size([2225])
torch.from_numpy(rewards) looks like  torch.Size([2225])
rewards looks like  (1071,)
logs prob looks like  torch.Size([1071])
torch.from_numpy(rewards) looks like  torch.Size([1071])
rewards looks like  (1033,)
logs prob looks like  torch.Size([1033])
torch.from_numpy(rewards) looks like  torch.Size([1033])
rewards looks like  (856,)
logs prob looks like  torch.Size([856])
torch.from_numpy(rewards) looks like  torch.Size([856])
rewards looks like  (1261,)
logs prob looks like  torch.Size([1261])
torch.from_numpy(rewards) looks like  torch.Size([1261])
rewards looks like  (1782,)
logs prob looks like  torch.Size([1782])
torch.from_numpy(rewards) looks like  torch.Size([1782])
rewards looks like  (1867,)
logs prob looks like  torch.Size([1867])
torch.from_numpy(rewards) looks like  torch.Size([1867])
rewards looks like  (2025,)
logs prob looks like  torch.Size([2025])
torch.from_numpy(rewards) looks like  torch.Size([2025])
rewards looks like  (1250,)
logs prob looks like  torch.Size([1250])
torch.from_numpy(rewards) looks like  torch.Size([1250])
rewards looks like  (1323,)
logs prob looks like  torch.Size([1323])
torch.from_numpy(rewards) looks like  torch.Size([1323])
rewards looks like  (1349,)
logs prob looks like  torch.Size([1349])
torch.from_numpy(rewards) looks like  torch.Size([1349])
rewards looks like  (1617,)
logs prob looks like  torch.Size([1617])
torch.from_numpy(rewards) looks like  torch.Size([1617])
rewards looks like  (1668,)
logs prob looks like  torch.Size([1668])
torch.from_numpy(rewards) looks like  torch.Size([1668])
rewards looks like  (1109,)
logs prob looks like  torch.Size([1109])
torch.from_numpy(rewards) looks like  torch.Size([1109])
rewards looks like  (1102,)
logs prob looks like  torch.Size([1102])
torch.from_numpy(rewards) looks like  torch.Size([1102])
rewards looks like  (2017,)
logs prob looks like  torch.Size([2017])
torch.from_numpy(rewards) looks like  torch.Size([2017])
rewards looks like  (2368,)
logs prob looks like  torch.Size([2368])
torch.from_numpy(rewards) looks like  torch.Size([2368])
rewards looks like  (1128,)
logs prob looks like  torch.Size([1128])
torch.from_numpy(rewards) looks like  torch.Size([1128])
rewards looks like  (1469,)
logs prob looks like  torch.Size([1469])
torch.from_numpy(rewards) looks like  torch.Size([1469])
rewards looks like  (1091,)
logs prob looks like  torch.Size([1091])
torch.from_numpy(rewards) looks like  torch.Size([1091])
rewards looks like  (1516,)
logs prob looks like  torch.Size([1516])
torch.from_numpy(rewards) looks like  torch.Size([1516])
rewards looks like  (1145,)
logs prob looks like  torch.Size([1145])
torch.from_numpy(rewards) looks like  torch.Size([1145])
rewards looks like  (1594,)
logs prob looks like  torch.Size([1594])
torch.from_numpy(rewards) looks like  torch.Size([1594])
rewards looks like  (1536,)
logs prob looks like  torch.Size([1536])
torch.from_numpy(rewards) looks like  torch.Size([1536])
rewards looks like  (1295,)
logs prob looks like  torch.Size([1295])
torch.from_numpy(rewards) looks like  torch.Size([1295])
rewards looks like  (1473,)
logs prob looks like  torch.Size([1473])
torch.from_numpy(rewards) looks like  torch.Size([1473])
rewards looks like  (1458,)
logs prob looks like  torch.Size([1458])
torch.from_numpy(rewards) looks like  torch.Size([1458])
rewards looks like  (1316,)
logs prob looks like  torch.Size([1316])
torch.from_numpy(rewards) looks like  torch.Size([1316])
rewards looks like  (1257,)
logs prob looks like  torch.Size([1257])
torch.from_numpy(rewards) looks like  torch.Size([1257])
rewards looks like  (2354,)
logs prob looks like  torch.Size([2354])
torch.from_numpy(rewards) looks like  torch.Size([2354])
rewards looks like  (1340,)
logs prob looks like  torch.Size([1340])
torch.from_numpy(rewards) looks like  torch.Size([1340])
rewards looks like  (1900,)
logs prob looks like  torch.Size([1900])
torch.from_numpy(rewards) looks like  torch.Size([1900])
rewards looks like  (1513,)
logs prob looks like  torch.Size([1513])
torch.from_numpy(rewards) looks like  torch.Size([1513])
rewards looks like  (1873,)
logs prob looks like  torch.Size([1873])
torch.from_numpy(rewards) looks like  torch.Size([1873])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (2151,)
logs prob looks like  torch.Size([2151])
torch.from_numpy(rewards) looks like  torch.Size([2151])
rewards looks like  (1933,)
logs prob looks like  torch.Size([1933])
torch.from_numpy(rewards) looks like  torch.Size([1933])
rewards looks like  (2081,)
logs prob looks like  torch.Size([2081])
torch.from_numpy(rewards) looks like  torch.Size([2081])
rewards looks like  (1054,)
logs prob looks like  torch.Size([1054])
torch.from_numpy(rewards) looks like  torch.Size([1054])
rewards looks like  (1158,)
logs prob looks like  torch.Size([1158])
torch.from_numpy(rewards) looks like  torch.Size([1158])
rewards looks like  (1369,)
logs prob looks like  torch.Size([1369])
torch.from_numpy(rewards) looks like  torch.Size([1369])
rewards looks like  (1148,)
logs prob looks like  torch.Size([1148])
torch.from_numpy(rewards) looks like  torch.Size([1148])
rewards looks like  (1898,)
logs prob looks like  torch.Size([1898])
torch.from_numpy(rewards) looks like  torch.Size([1898])
rewards looks like  (1424,)
logs prob looks like  torch.Size([1424])
torch.from_numpy(rewards) looks like  torch.Size([1424])
rewards looks like  (2106,)
logs prob looks like  torch.Size([2106])
torch.from_numpy(rewards) looks like  torch.Size([2106])
rewards looks like  (1310,)
logs prob looks like  torch.Size([1310])
torch.from_numpy(rewards) looks like  torch.Size([1310])
rewards looks like  (1423,)
logs prob looks like  torch.Size([1423])
torch.from_numpy(rewards) looks like  torch.Size([1423])
rewards looks like  (1866,)
logs prob looks like  torch.Size([1866])
torch.from_numpy(rewards) looks like  torch.Size([1866])
rewards looks like  (2571,)
logs prob looks like  torch.Size([2571])
torch.from_numpy(rewards) looks like  torch.Size([2571])
rewards looks like  (1958,)
logs prob looks like  torch.Size([1958])
torch.from_numpy(rewards) looks like  torch.Size([1958])
rewards looks like  (1608,)
logs prob looks like  torch.Size([1608])
torch.from_numpy(rewards) looks like  torch.Size([1608])
rewards looks like  (1197,)
logs prob looks like  torch.Size([1197])
torch.from_numpy(rewards) looks like  torch.Size([1197])
rewards looks like  (1429,)
logs prob looks like  torch.Size([1429])
torch.from_numpy(rewards) looks like  torch.Size([1429])
rewards looks like  (1466,)
logs prob looks like  torch.Size([1466])
torch.from_numpy(rewards) looks like  torch.Size([1466])
rewards looks like  (1405,)
logs prob looks like  torch.Size([1405])
torch.from_numpy(rewards) looks like  torch.Size([1405])
rewards looks like  (1304,)
logs prob looks like  torch.Size([1304])
torch.from_numpy(rewards) looks like  torch.Size([1304])
rewards looks like  (2045,)
logs prob looks like  torch.Size([2045])
torch.from_numpy(rewards) looks like  torch.Size([2045])
rewards looks like  (1565,)
logs prob looks like  torch.Size([1565])
torch.from_numpy(rewards) looks like  torch.Size([1565])
rewards looks like  (2539,)
logs prob looks like  torch.Size([2539])
torch.from_numpy(rewards) looks like  torch.Size([2539])
rewards looks like  (1497,)
logs prob looks like  torch.Size([1497])
torch.from_numpy(rewards) looks like  torch.Size([1497])
rewards looks like  (2141,)
logs prob looks like  torch.Size([2141])
torch.from_numpy(rewards) looks like  torch.Size([2141])
rewards looks like  (1141,)
logs prob looks like  torch.Size([1141])
torch.from_numpy(rewards) looks like  torch.Size([1141])
rewards looks like  (2892,)
logs prob looks like  torch.Size([2892])
torch.from_numpy(rewards) looks like  torch.Size([2892])
rewards looks like  (841,)
logs prob looks like  torch.Size([841])
torch.from_numpy(rewards) looks like  torch.Size([841])
rewards looks like  (1129,)
logs prob looks like  torch.Size([1129])
torch.from_numpy(rewards) looks like  torch.Size([1129])
rewards looks like  (1347,)
logs prob looks like  torch.Size([1347])
torch.from_numpy(rewards) looks like  torch.Size([1347])
rewards looks like  (1596,)
logs prob looks like  torch.Size([1596])
torch.from_numpy(rewards) looks like  torch.Size([1596])
rewards looks like  (2045,)
logs prob looks like  torch.Size([2045])
torch.from_numpy(rewards) looks like  torch.Size([2045])
rewards looks like  (1247,)
logs prob looks like  torch.Size([1247])
torch.from_numpy(rewards) looks like  torch.Size([1247])
rewards looks like  (1289,)
logs prob looks like  torch.Size([1289])
torch.from_numpy(rewards) looks like  torch.Size([1289])
rewards looks like  (2360,)
logs prob looks like  torch.Size([2360])
torch.from_numpy(rewards) looks like  torch.Size([2360])
rewards looks like  (2745,)
logs prob looks like  torch.Size([2745])
torch.from_numpy(rewards) looks like  torch.Size([2745])
rewards looks like  (1191,)
logs prob looks like  torch.Size([1191])
torch.from_numpy(rewards) looks like  torch.Size([1191])
rewards looks like  (1266,)
logs prob looks like  torch.Size([1266])
torch.from_numpy(rewards) looks like  torch.Size([1266])
rewards looks like  (1424,)
logs prob looks like  torch.Size([1424])
torch.from_numpy(rewards) looks like  torch.Size([1424])
rewards looks like  (929,)
logs prob looks like  torch.Size([929])
torch.from_numpy(rewards) looks like  torch.Size([929])
rewards looks like  (2134,)
logs prob looks like  torch.Size([2134])
torch.from_numpy(rewards) looks like  torch.Size([2134])
rewards looks like  (1933,)
logs prob looks like  torch.Size([1933])
torch.from_numpy(rewards) looks like  torch.Size([1933])
rewards looks like  (1357,)
logs prob looks like  torch.Size([1357])
torch.from_numpy(rewards) looks like  torch.Size([1357])
rewards looks like  (1807,)
logs prob looks like  torch.Size([1807])
torch.from_numpy(rewards) looks like  torch.Size([1807])
rewards looks like  (2153,)
logs prob looks like  torch.Size([2153])
torch.from_numpy(rewards) looks like  torch.Size([2153])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (1263,)
logs prob looks like  torch.Size([1263])
torch.from_numpy(rewards) looks like  torch.Size([1263])
rewards looks like  (2021,)
logs prob looks like  torch.Size([2021])
torch.from_numpy(rewards) looks like  torch.Size([2021])
rewards looks like  (1306,)
logs prob looks like  torch.Size([1306])
torch.from_numpy(rewards) looks like  torch.Size([1306])
rewards looks like  (1696,)
logs prob looks like  torch.Size([1696])
torch.from_numpy(rewards) looks like  torch.Size([1696])
rewards looks like  (1593,)
logs prob looks like  torch.Size([1593])
torch.from_numpy(rewards) looks like  torch.Size([1593])
rewards looks like  (1181,)
logs prob looks like  torch.Size([1181])
torch.from_numpy(rewards) looks like  torch.Size([1181])
rewards looks like  (2203,)
logs prob looks like  torch.Size([2203])
torch.from_numpy(rewards) looks like  torch.Size([2203])
rewards looks like  (2740,)
logs prob looks like  torch.Size([2740])
torch.from_numpy(rewards) looks like  torch.Size([2740])
rewards looks like  (1403,)
logs prob looks like  torch.Size([1403])
torch.from_numpy(rewards) looks like  torch.Size([1403])
rewards looks like  (1326,)
logs prob looks like  torch.Size([1326])
torch.from_numpy(rewards) looks like  torch.Size([1326])
rewards looks like  (2057,)
logs prob looks like  torch.Size([2057])
torch.from_numpy(rewards) looks like  torch.Size([2057])
rewards looks like  (3534,)
logs prob looks like  torch.Size([3534])
torch.from_numpy(rewards) looks like  torch.Size([3534])
rewards looks like  (1318,)
logs prob looks like  torch.Size([1318])
torch.from_numpy(rewards) looks like  torch.Size([1318])
rewards looks like  (1419,)
logs prob looks like  torch.Size([1419])
torch.from_numpy(rewards) looks like  torch.Size([1419])
rewards looks like  (1403,)
logs prob looks like  torch.Size([1403])
torch.from_numpy(rewards) looks like  torch.Size([1403])
rewards looks like  (2790,)
logs prob looks like  torch.Size([2790])
torch.from_numpy(rewards) looks like  torch.Size([2790])
rewards looks like  (1318,)
logs prob looks like  torch.Size([1318])
torch.from_numpy(rewards) looks like  torch.Size([1318])
rewards looks like  (1406,)
logs prob looks like  torch.Size([1406])
torch.from_numpy(rewards) looks like  torch.Size([1406])
rewards looks like  (1603,)
logs prob looks like  torch.Size([1603])
torch.from_numpy(rewards) looks like  torch.Size([1603])
rewards looks like  (1794,)
logs prob looks like  torch.Size([1794])
torch.from_numpy(rewards) looks like  torch.Size([1794])
rewards looks like  (1461,)
logs prob looks like  torch.Size([1461])
torch.from_numpy(rewards) looks like  torch.Size([1461])
rewards looks like  (1343,)
logs prob looks like  torch.Size([1343])
torch.from_numpy(rewards) looks like  torch.Size([1343])
rewards looks like  (1442,)
logs prob looks like  torch.Size([1442])
torch.from_numpy(rewards) looks like  torch.Size([1442])
rewards looks like  (1414,)
logs prob looks like  torch.Size([1414])
torch.from_numpy(rewards) looks like  torch.Size([1414])
rewards looks like  (2715,)
logs prob looks like  torch.Size([2715])
torch.from_numpy(rewards) looks like  torch.Size([2715])
rewards looks like  (2386,)
logs prob looks like  torch.Size([2386])
torch.from_numpy(rewards) looks like  torch.Size([2386])
rewards looks like  (1905,)
logs prob looks like  torch.Size([1905])
torch.from_numpy(rewards) looks like  torch.Size([1905])
rewards looks like  (1031,)
logs prob looks like  torch.Size([1031])
torch.from_numpy(rewards) looks like  torch.Size([1031])
rewards looks like  (1125,)
logs prob looks like  torch.Size([1125])
torch.from_numpy(rewards) looks like  torch.Size([1125])
rewards looks like  (1556,)
logs prob looks like  torch.Size([1556])
torch.from_numpy(rewards) looks like  torch.Size([1556])
rewards looks like  (1906,)
logs prob looks like  torch.Size([1906])
torch.from_numpy(rewards) looks like  torch.Size([1906])
rewards looks like  (1777,)
logs prob looks like  torch.Size([1777])
torch.from_numpy(rewards) looks like  torch.Size([1777])
rewards looks like  (1269,)
logs prob looks like  torch.Size([1269])
torch.from_numpy(rewards) looks like  torch.Size([1269])
rewards looks like  (1407,)
logs prob looks like  torch.Size([1407])
torch.from_numpy(rewards) looks like  torch.Size([1407])
rewards looks like  (1333,)
logs prob looks like  torch.Size([1333])
torch.from_numpy(rewards) looks like  torch.Size([1333])
rewards looks like  (1224,)
logs prob looks like  torch.Size([1224])
torch.from_numpy(rewards) looks like  torch.Size([1224])
rewards looks like  (1997,)
logs prob looks like  torch.Size([1997])
torch.from_numpy(rewards) looks like  torch.Size([1997])
rewards looks like  (1610,)
logs prob looks like  torch.Size([1610])
torch.from_numpy(rewards) looks like  torch.Size([1610])
rewards looks like  (1393,)
logs prob looks like  torch.Size([1393])
torch.from_numpy(rewards) looks like  torch.Size([1393])
rewards looks like  (1808,)
logs prob looks like  torch.Size([1808])
torch.from_numpy(rewards) looks like  torch.Size([1808])
rewards looks like  (1448,)
logs prob looks like  torch.Size([1448])
torch.from_numpy(rewards) looks like  torch.Size([1448])
rewards looks like  (1558,)
logs prob looks like  torch.Size([1558])
torch.from_numpy(rewards) looks like  torch.Size([1558])
rewards looks like  (1766,)
logs prob looks like  torch.Size([1766])
torch.from_numpy(rewards) looks like  torch.Size([1766])
rewards looks like  (1942,)
logs prob looks like  torch.Size([1942])
torch.from_numpy(rewards) looks like  torch.Size([1942])
rewards looks like  (1487,)
logs prob looks like  torch.Size([1487])
torch.from_numpy(rewards) looks like  torch.Size([1487])
rewards looks like  (2154,)
logs prob looks like  torch.Size([2154])
torch.from_numpy(rewards) looks like  torch.Size([2154])
rewards looks like  (1400,)
logs prob looks like  torch.Size([1400])
torch.from_numpy(rewards) looks like  torch.Size([1400])
rewards looks like  (1379,)
logs prob looks like  torch.Size([1379])
torch.from_numpy(rewards) looks like  torch.Size([1379])
rewards looks like  (2227,)
logs prob looks like  torch.Size([2227])
torch.from_numpy(rewards) looks like  torch.Size([2227])
rewards looks like  (1308,)
logs prob looks like  torch.Size([1308])
torch.from_numpy(rewards) looks like  torch.Size([1308])
rewards looks like  (1469,)
logs prob looks like  torch.Size([1469])
torch.from_numpy(rewards) looks like  torch.Size([1469])
rewards looks like  (1734,)
logs prob looks like  torch.Size([1734])
torch.from_numpy(rewards) looks like  torch.Size([1734])
rewards looks like  (1994,)
logs prob looks like  torch.Size([1994])
torch.from_numpy(rewards) looks like  torch.Size([1994])
rewards looks like  (2025,)
logs prob looks like  torch.Size([2025])
torch.from_numpy(rewards) looks like  torch.Size([2025])
rewards looks like  (2223,)
logs prob looks like  torch.Size([2223])
torch.from_numpy(rewards) looks like  torch.Size([2223])
rewards looks like  (2418,)
logs prob looks like  torch.Size([2418])
torch.from_numpy(rewards) looks like  torch.Size([2418])
rewards looks like  (1520,)
logs prob looks like  torch.Size([1520])
torch.from_numpy(rewards) looks like  torch.Size([1520])
rewards looks like  (1613,)
logs prob looks like  torch.Size([1613])
torch.from_numpy(rewards) looks like  torch.Size([1613])
rewards looks like  (1984,)
logs prob looks like  torch.Size([1984])
torch.from_numpy(rewards) looks like  torch.Size([1984])
rewards looks like  (1563,)
logs prob looks like  torch.Size([1563])
torch.from_numpy(rewards) looks like  torch.Size([1563])
rewards looks like  (1559,)
logs prob looks like  torch.Size([1559])
torch.from_numpy(rewards) looks like  torch.Size([1559])
rewards looks like  (2198,)
logs prob looks like  torch.Size([2198])
torch.from_numpy(rewards) looks like  torch.Size([2198])
rewards looks like  (1582,)
logs prob looks like  torch.Size([1582])
torch.from_numpy(rewards) looks like  torch.Size([1582])
rewards looks like  (1423,)
logs prob looks like  torch.Size([1423])
torch.from_numpy(rewards) looks like  torch.Size([1423])
rewards looks like  (2810,)
logs prob looks like  torch.Size([2810])
torch.from_numpy(rewards) looks like  torch.Size([2810])
rewards looks like  (1279,)
logs prob looks like  torch.Size([1279])
torch.from_numpy(rewards) looks like  torch.Size([1279])
rewards looks like  (1101,)
logs prob looks like  torch.Size([1101])
torch.from_numpy(rewards) looks like  torch.Size([1101])
rewards looks like  (2219,)
logs prob looks like  torch.Size([2219])
torch.from_numpy(rewards) looks like  torch.Size([2219])
rewards looks like  (1930,)
logs prob looks like  torch.Size([1930])
torch.from_numpy(rewards) looks like  torch.Size([1930])

Training Result

During the training process, we recorded avg_total_reward, which represents the average total reward of episodes before updating the policy network.

Theoretically, if the agent becomes better, the avg_total_reward will increase. The visualization of the training process is shown below:

[19]
plt.plot(avg_total_rewards)
plt.title("Total Rewards")
plt.show()
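
Reward curves from policy-gradient training tend to be noisy, so a smoothed curve can make the trend easier to read. The following is a minimal sketch, not part of the homework code: the helper name `moving_average`, the window size, and the stand-in data are all assumptions. In the notebook you would pass `avg_total_rewards` instead.

```python
import numpy as np

def moving_average(values, window=10):
    """Running mean over a sliding window (hypothetical helper, not from the homework)."""
    values = np.asarray(values, dtype=float)
    if len(values) < window:
        return values  # too short to smooth; return the data unchanged
    kernel = np.ones(window) / window
    # mode="valid" keeps only the positions where the full window fits
    return np.convolve(values, kernel, mode="valid")

# Stand-in data for illustration only
example_rewards = [-200.0, -180.0, -190.0, -150.0, -120.0, -130.0]
smoothed = moving_average(example_rewards, window=3)
```

Plotting `smoothed` next to the raw values with `plt.plot` shows whether the reward is actually trending upward or just fluctuating.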

In addition, avg_final_reward represents the average final reward of the episodes. Specifically, the final reward is the last reward received in an episode, which indicates whether the craft landed successfully.

[20]
plt.plot(avg_final_rewards)
plt.title("Final Rewards")
plt.show()
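
To make the difference between the two metrics concrete, here is a small sketch with made-up per-step rewards for two episodes. The numbers and the variable `episode_rewards` are assumptions for illustration; they are not taken from the training run above.

```python
import numpy as np

# Hypothetical per-step rewards for a batch of two episodes
episode_rewards = [
    [-1.0, -0.5, 0.3, -100.0],  # last step carries a large penalty (crash)
    [-0.8, 0.2, 0.4, 100.0],    # last step carries a large bonus (landing)
]

# Total reward: sum over all steps of one episode
total_rewards = [sum(rewards) for rewards in episode_rewards]
# Final reward: only the last step of one episode
final_rewards = [rewards[-1] for rewards in episode_rewards]

avg_total_reward = float(np.mean(total_rewards))
avg_final_reward = float(np.mean(final_rewards))
```

The total reward reflects the whole trajectory, while the final reward isolates the outcome of the landing attempt.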

Testing

The testing result is the average total reward over 5 test episodes.

[21]
fix(env, seed)
agent.network.eval()  # set the network into evaluation mode
NUM_OF_TEST = 5  # Do not revise this !!!
test_total_reward = []
action_list = []
for i in range(NUM_OF_TEST):
    actions = []
    state = env.reset()

    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False
    while not done:
        action, _ = agent.sample(state)
        actions.append(action)
        state, reward, done, _ = env.step(action)

        total_reward += reward

        img.set_data(env.render(mode='rgb_array'))
        display.display(plt.gcf())
        display.clear_output(wait=True)

    print(total_reward)
    test_total_reward.append(total_reward)

    action_list.append(actions)  # save the result of testing

-209.13696525868605
[22]
print(np.mean(test_total_reward))
-106.5599827895497

Action list

[23]
print("Action list looks like ", action_list)
print("Action list's shape looks like ", np.shape(action_list))
Action list looks like  [[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3], [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 2, 2, 3, 2, 3, 0, 2, 2, 2, 0, 2, 1, 2, 3, 2, 2, 0, 2, 2, 1, 0, 2, 2, 
3, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 0, 2, 2, 0, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 3, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 0, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
1, 2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 2, 2, 3, 0, 2, 0, 0, 2, 3, 2, 0, 2, 3, 3, 2, 0, 2, 0, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 3, 2, 2, 0, 2, 1, 0, 2, 1, 2, 3, 2, 0, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 3, 2, 2, 2, 3, 0, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 0, 2, 3, 2, 3, 0, 2, 3, 2, 3, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 0, 2, 3, 0, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]]
Action list's shape looks like  (5,)
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:2007: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  result = asarray(a).shape
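
The warning above appears because the five episodes have different lengths, so `np.shape` has to build a ragged array. A minimal sketch of inspecting the structure without NumPy, using a small stand-in list (`example_action_list` is an assumption, not the real `action_list`):

```python
# Stand-in for action_list: episodes of different lengths
example_action_list = [[2, 2, 3], [1, 2], [0, 0, 1, 2]]

# Number of episodes and number of steps in each episode
num_episodes = len(example_action_list)
episode_lengths = [len(actions) for actions in example_action_list]

print("Number of episodes:", num_episodes)
print("Steps per episode:", episode_lengths)
```

This reports the same information as the shape (how many episodes there are) plus the per-episode step counts, without triggering the deprecation warning.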

Analysis of actions taken by agent

[24]
distribution = {}
for actions in action_list:
    for action in actions:
        if action not in distribution.keys():
            distribution[action] = 1
        else:
            distribution[action] += 1
print(distribution)
{2: 991, 3: 374, 0: 108, 1: 496}
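
The same tally can be written with `collections.Counter`, which also makes it easy to report each action as a share of all steps. The sketch below is illustrative only: `sample_action_list` is made-up data, the helper names are hypothetical, and the action names assume the standard LunarLander-v2 discrete action set (0: do nothing, 1: fire left engine, 2: fire main engine, 3: fire right engine).

```python
from collections import Counter

# Assumed LunarLander-v2 action meanings (an assumption, not read from this notebook).
ACTION_NAMES = {
    0: "do nothing",
    1: "fire left engine",
    2: "fire main engine",
    3: "fire right engine",
}


def action_distribution(action_list):
    """Count how often each action occurs across all episodes."""
    counts = Counter()
    for actions in action_list:
        counts.update(int(a) for a in actions)
    return counts


def action_shares(counts):
    """Convert raw counts into fractions of the total number of steps."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {action: count / total for action, count in counts.items()}


# Hypothetical sample: two short episodes of different lengths.
sample_action_list = [[0, 2, 2, 3], [2, 1, 2]]
counts = action_distribution(sample_action_list)
shares = action_shares(counts)
for action in sorted(counts):
    print(f"{action} ({ACTION_NAMES[action]}): {counts[action]} steps, {shares[action]:.1%}")
```

Applied to the counts printed above, action 2 accounts for 991 of 1969 steps, roughly half of everything the agent did.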

Saving the result of Model Testing

[25]
PATH = "Action_List.npy" # Can be modified into the name or path you want
np.save(PATH ,np.array(action_list))
/tmp/ipykernel_123/1616289779.py:2: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  np.save(PATH ,np.array(action_list))
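
The VisibleDeprecationWarning above appears because the episodes have different lengths, so `np.array(action_list)` has to build a ragged array. One way to avoid it, sketched below, is to construct the object array explicitly before saving. The helper name `save_action_list`, the temporary path, and the sample data are hypothetical; loading the file still requires `allow_pickle=True`.

```python
import os
import tempfile

import numpy as np


def save_action_list(path, action_list):
    """Save a list of variable-length episodes as a 1-D object array."""
    # Allocate the object array first so NumPy never tries to infer a
    # rectangular shape from the nested lists.
    arr = np.empty(len(action_list), dtype=object)
    for i, actions in enumerate(action_list):
        arr[i] = [int(a) for a in actions]
    np.save(path, arr)
    return arr


# Hypothetical sample: three episodes of different lengths.
sample_action_list = [[0, 2, 2, 3], [2, 1], [3, 3, 2, 0, 1]]
sample_path = os.path.join(tempfile.mkdtemp(), "Action_List.npy")
save_action_list(sample_path, sample_action_list)

# Object arrays are pickled on disk, so allow_pickle=True is needed to read them.
loaded = np.load(sample_path, allow_pickle=True)
print(loaded.shape)
```

Filling the array element by element also keeps the result one-dimensional in the unlikely case that every episode has the same length.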

This is the file you need to submit !!!

Download the testing result to your device

[26]
from google.colab import files
files.download(PATH)
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[26], line 1
----> 1 from google.colab import files
      2 files.download(PATH)

ModuleNotFoundError: No module named 'google.colab'
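
The import fails because `google.colab` exists only inside Google Colab; on other platforms the saved file has to be fetched through that platform's own file browser. A minimal sketch of a fallback follows, assuming a hypothetical helper named `download_or_locate` that triggers the Colab download when it can and otherwise just reports where the file is.

```python
import os


def download_or_locate(path):
    """Start a browser download in Colab; elsewhere return the file's absolute path."""
    try:
        from google.colab import files  # only importable inside Google Colab
    except ImportError:
        abs_path = os.path.abspath(path)
        print(f"Not running in Colab. Download the file manually from: {abs_path}")
        return abs_path
    files.download(path)
    return os.path.abspath(path)


location = download_or_locate("Action_List.npy")
```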

Server

The code below simulates the environment on the judge server. You can use it to test your action list before submitting.

[27]
action_list = np.load(PATH, allow_pickle=True) # The action list you upload
seed = 543 # Do not revise this
fix(env, seed)

agent.network.eval() # set network to evaluation mode

test_total_reward = []
# The judge expects exactly 5 episodes
if len(action_list) != 5:
    print("Wrong format of file !!!")
    exit(0)
for actions in action_list:
    state = env.reset()
    img = plt.imshow(env.render(mode='rgb_array'))

    total_reward = 0

    done = False

    # Replay the recorded actions; stop early if the episode ends
    for action in actions:
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:
            break

    print(f"Your reward is : %.2f"%total_reward)
    test_total_reward.append(total_reward)
Your reward is : -209.14
Your reward is : -45.50
Your reward is : 62.21
Your reward is : -200.09
Your reward is : -240.06
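
Before uploading, it can help to check the file against the format the server cell expects. The sketch below assumes the judge requires exactly five episodes (as the length check in the cell above does) and that every action is one of the four discrete LunarLander actions, 0 to 3. The function name `validate_action_list` and the sample data are hypothetical.

```python
def validate_action_list(action_list, n_episodes=5, n_actions=4):
    """Return a list of format problems; an empty list means the file looks valid."""
    problems = []
    if len(action_list) != n_episodes:
        problems.append(f"expected {n_episodes} episodes, got {len(action_list)}")
    for i, actions in enumerate(action_list):
        if len(actions) == 0:
            problems.append(f"episode {i} is empty")
        # Collect any action values outside the assumed range 0..n_actions-1.
        bad = sorted({int(a) for a in actions if int(a) not in range(n_actions)})
        if bad:
            problems.append(f"episode {i} has invalid actions {bad}")
    return problems


# Hypothetical samples: one well-formed list and one with two problems.
good_list = [[0, 1], [2], [3, 3], [1, 2, 0], [2, 2]]
bad_list = [[0, 4], [2]]
print(validate_action_list(good_list))
print(validate_action_list(bad_list))
```

Returning a list of problems rather than calling `exit(0)` keeps the notebook kernel alive and reports every issue at once.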

Your score

[28]
print(f"Your final reward is : %.2f"%np.mean(test_total_reward))
Your final reward is : -126.51