Homework 12 - Reinforcement Learning
If you have any problems, e-mail us at ntu-ml-2022spring-ta@googlegroups.com
Preliminary work
First, we need to install all necessary packages. One of them, gym, built by OpenAI, is a toolkit for developing Reinforcement Learning algorithms. The other packages are for visualization in Colab.
[Installation output omitted. apt installed 19 packages, including python-opengl and xvfb; pip installed box2d 2.3.2, box2d-py 2.3.5, box2d-kengz 2.3.3, pygame 2.1.0 and pyvirtualdisplay 3.0; gym[box2d] 0.25.2, numpy 1.22.4 and tqdm 4.64.1 were already present.]
Next, set up a virtual display and import all necessary packages.
Warning! Do not change the random seed!
Otherwise, your submission on JudgeBoi will not reproduce your result.
Make sure your HW result is reproducible.
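As a rough illustration of what fixing the seed involves, here is a minimal sketch. The helper name `fix_seed` and the seed value 0 are assumptions made for this example, not the values the homework requires; NumPy and PyTorch are seeded only if they happen to be installed.

```python
import random

def fix_seed(seed: int) -> None:
    """Seed every random number generator in use (hypothetical helper)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed, so there is nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(seed)
    except ImportError:
        pass  # PyTorch is not installed, so there is nothing to seed

fix_seed(0)
first = [random.random() for _ in range(3)]
fix_seed(0)
second = [random.random() for _ in range(3)]
print(first == second)  # True: the same seed gives the same sequence
```

Changing the seed changes every sampled action afterwards, which is why a different seed cannot reproduce the same trajectory.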
Last, call gym and build a Lunar Lander environment.
What is Lunar Lander?
"LunarLander-v2" simulates a craft landing on the surface of the moon.
This task is to enable the craft to land "safely" at the pad between the two yellow flags.
The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector.
"LunarLander-v2" actually includes "Agent" and "Environment".
In this homework, we will use the function step() to control the action of the "Agent".
Then step() will return the observation/state and the reward given by the "Environment".
Observation / State
First, we can take a look at what an Observation / State looks like.
Box([-1.5 -1.5 -5. -5. -3.1415927 -5. -0. -0. ], [1.5 1.5 5. 5. 3.1415927 5. 1. 1. ], (8,), float32)
Box(8,) means that the observation is an 8-dimensional vector.
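To make the printed bounds concrete, the following sketch checks whether a vector fits inside them. It only mimics the idea of a bounded 8-dimensional space in plain Python; the function name `in_box` and the sample vectors are made up for illustration and are not part of gym.

```python
# Lower and upper bounds copied from the printed Box above.
LOW = [-1.5, -1.5, -5.0, -5.0, -3.1415927, -5.0, 0.0, 0.0]
HIGH = [1.5, 1.5, 5.0, 5.0, 3.1415927, 5.0, 1.0, 1.0]

def in_box(observation):
    """Return True if the vector has 8 entries and each lies within its bounds."""
    if len(observation) != len(LOW):
        return False
    return all(lo <= x <= hi for lo, x, hi in zip(LOW, observation, HIGH))

inside = in_box([0.0, 1.4, -0.1, -0.5, 0.0, 0.03, 0.0, 0.0])  # made-up values
outside = in_box([0.0, 2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])    # second entry exceeds 1.5
print(inside, outside)  # True False
```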
Action
The actions that the agent can take look like this:
Discrete(4)
Discrete(4) implies that there are four kinds of actions that the agent can take.
- 0 means the agent takes no action
- 2 means the agent fires the main engine, which thrusts downward
- 1 and 3 mean the agent fires the left and right orientation engines, respectively
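The four action indices can be given readable names, as in the sketch below. The enum and its member names are our own labels (assumptions for this example), and `sample_action` only imitates drawing a uniformly random action from a four-element discrete space.

```python
import random
from enum import IntEnum

class Action(IntEnum):
    """Readable labels for the four action indices (names are hypothetical)."""
    NOOP = 0
    LEFT_ENGINE = 1
    MAIN_ENGINE = 2
    RIGHT_ENGINE = 3

def sample_action(rng):
    """Pick one of the four actions uniformly at random."""
    return Action(rng.randrange(len(Action)))

rng = random.Random(0)
random_action = sample_action(rng)
print(int(random_action), random_action.name)
```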
Next, we will try to make the agent interact with the environment.
Before taking any actions, we recommend calling the reset() function to reset the environment. This function also returns the initial state of the environment.
[-1.2619973e-03 1.3984586e+00 -1.2784091e-01 -5.5384123e-01 1.4691149e-03 2.8957864e-02 0.0000000e+00 0.0000000e+00]
Then, we try to get a random action from the agent's action space.
3
Moreover, we can use step() to make the agent act according to the randomly selected random_action.
The step() function will return four values:
- observation / state
- reward
- done (True/False)
- other information
False
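The reset() and step() calls described above fit together in a loop like the one sketched here. `ToyLander` is a hypothetical one-dimensional stand-in written for this example so that it runs without gym; its physics and reward numbers are invented, and only the shape of the interface (reset() returns a state, step() returns four values) follows the text.

```python
import random

class ToyLander:
    """Hypothetical 1-D stand-in with the same reset()/step() shape as the text."""

    def reset(self):
        self.height = 10.0
        self.velocity = 0.0
        self.steps = 0
        return [self.height, self.velocity]

    def step(self, action):
        self.steps += 1
        thrust = 1.5 if action == 1 else 0.0   # action 1 fires the engine
        self.velocity += thrust - 1.0          # gravity pulls down by 1.0 per step
        self.height = max(0.0, self.height + self.velocity)
        landed = self.height == 0.0
        done = landed or self.steps >= 200     # time limit guarantees termination
        reward = -0.3 * action                 # made-up fuel cost
        if landed:
            # made-up terminal bonus for a slow touchdown, penalty otherwise
            reward += 100.0 if abs(self.velocity) < 2.0 else -100.0
        return [self.height, self.velocity], reward, done, {}

env = ToyLander()
rng = random.Random(0)
observation = env.reset()
done = False
total_reward = 0.0
while not done:
    action = rng.randrange(2)                  # random agent: 0 or 1
    observation, reward, done, info = env.step(action)
    total_reward += reward
print(done, total_reward)
```

The four-value return matches the gym 0.25.2 interface used in this notebook. In Gym 0.26 and later, and in Gymnasium, step() returns five values, with terminated and truncated replacing done.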
Reward
The reward for moving from the top of the screen to the landing pad with zero speed is about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. The episode finishes if the lander crashes or comes to rest, receiving an additional -100 or +100 points, respectively. Each leg in contact with the ground is worth +10. Firing the main engine costs 0.3 points per frame. The environment counts as solved at 200 points.
-1.0511407416545058
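As a worked example of how the reward components add up, the sketch below tallies one imagined episode. The function `episode_score` and the episode's numbers are hypothetical; the real environment computes its reward step by step from the lander's state rather than from a summary like this.

```python
import math

def episode_score(approach_reward, legs_in_contact, main_engine_frames, landed_safely):
    """Tally an episode total from the reward components described above."""
    score = approach_reward                  # about 100 to 140 for reaching the pad at zero speed
    score += 10 * legs_in_contact            # +10 for each leg touching the ground
    score -= 0.3 * main_engine_frames        # -0.3 for each frame of main-engine firing
    score += 100 if landed_safely else -100  # bonus for coming to rest, penalty for a crash
    return score

score = episode_score(approach_reward=120, legs_in_contact=2,
                      main_engine_frames=100, landed_safely=True)
solved = score >= 200
print(score, solved)
```

Here 120 + 20 - 30 + 100 gives roughly 210, which clears the 200-point threshold; the same descent ending in a crash would come to roughly 10.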
Random Agent
Finally, before we start training, we can see whether a random agent can successfully land on the moon.
/opt/conda/lib/python3.8/site-packages/gym/core.py:43: DeprecationWarning: WARN: The argument mode in render method is deprecated; use render_mode during environment initialization instead. See here for more information: https://www.gymlibrary.ml/content/api/ deprecation(
Policy Gradient
Now, we can build a simple policy network. The network will return one of the actions in the action space.
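One common way to build such a network, assumed here rather than taken from the homework, is to output a probability for each of the four actions and then sample from them. The sketch below is written in plain Python instead of PyTorch so that it runs without dependencies; the hidden size of 16, the tanh activation, and all function names are illustrative choices.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def linear(layer, x):
    """Apply one fully connected layer to the input vector x."""
    weights, biases = layer
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, biases)]

def softmax(logits):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def policy_forward(layers, state):
    """8-dim state -> tanh hidden layer -> probabilities over the 4 actions."""
    hidden = [math.tanh(v) for v in linear(layers[0], state)]
    return softmax(linear(layers[1], hidden))

rng = random.Random(0)
layers = [make_layer(8, 16, rng), make_layer(16, 4, rng)]
state = [0.0, 1.4, -0.1, -0.5, 0.0, 0.03, 0.0, 0.0]  # made-up observation
action_probs = policy_forward(layers, state)
action = rng.choices(range(4), weights=action_probs)[0]
print(action_probs, action)
```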
Then, we need to build a simple agent. The agent will act according to the output of the policy network above. There are a few things the agent can do:
- learn(): update the policy network from log probabilities and rewards.
- sample(): after receiving an observation from the environment, use the policy network to decide which action to take. This function returns the action and its log probability.
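The two methods can be sketched with a very small agent. This is not the homework's PyTorch agent: `TinyPolicyGradientAgent` is a hypothetical linear-softmax policy in plain Python, and because it has no automatic differentiation, its learn() takes only the rewards and uses the analytic gradient of the log probability instead of the log probabilities themselves.

```python
import math
import random

class TinyPolicyGradientAgent:
    """Hypothetical linear-softmax agent exposing sample() and learn()."""

    def __init__(self, n_state, n_action, lr=0.1, seed=0):
        self.weights = [[0.0] * n_state for _ in range(n_action)]
        self.lr = lr
        self.rng = random.Random(seed)
        self.history = []  # (state, action, probs) for every sampled step

    def probabilities(self, state):
        """Softmax over one linear score per action."""
        logits = [sum(w * s for w, s in zip(row, state)) for row in self.weights]
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]

    def sample(self, state):
        """Return an action drawn from the policy and its log probability."""
        probs = self.probabilities(state)
        action = self.rng.choices(range(len(probs)), weights=probs)[0]
        self.history.append((list(state), action, probs))
        return action, math.log(probs[action])

    def learn(self, rewards):
        """REINFORCE step: ascend sum_t reward_t * grad log pi(a_t | s_t)."""
        for (state, action, probs), reward in zip(self.history, rewards):
            for k, row in enumerate(self.weights):
                # d log pi(a|s) / d logit_k = 1[k == a] - pi(k|s)
                coeff = (1.0 if k == action else 0.0) - probs[k]
                for i, s in enumerate(state):
                    row[i] += self.lr * reward * coeff * s
        self.history.clear()

agent = TinyPolicyGradientAgent(n_state=8, n_action=4)
state = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # made-up observation
action, log_prob = agent.sample(state)
before = agent.probabilities(state)[action]
agent.learn([1.0])  # a positive reward makes the sampled action more likely here
after = agent.probabilities(state)[action]
print(before, after)
```

With all weights starting at zero, every action begins at probability 0.25, and a single positive reward raises the probability of the sampled action for that state.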
Lastly, build a network and agent to start training.
Training Agent
Now let's start training our agent. By taking all the interactions between the agent and the environment as training data, the policy network can learn from all these attempts.
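One way to turn a batch of interactions into a training signal is sketched below, under stated assumptions: the per-step rewards and log probabilities from several episodes are concatenated, the rewards are standardized, and the loss is the negative sum of log probability times reward. The standardization step, the function names, and the numbers are illustrative and not confirmed by the text above.

```python
import math

def standardize(values):
    """Shift to zero mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + 1e-9) for v in values]

def policy_gradient_loss(log_probs, rewards):
    """Minimizing this raises the log probability of well-rewarded steps."""
    assert len(log_probs) == len(rewards), "one reward per sampled action"
    return -sum(lp * r for lp, r in zip(log_probs, rewards))

# Two made-up episodes; per-step values are concatenated into one batch.
episodes = [
    {"log_probs": [-1.2, -1.4, -1.1], "rewards": [-0.5, 0.2, 1.0]},
    {"log_probs": [-1.3, -1.5], "rewards": [0.1, -2.0]},
]
batch_log_probs = [lp for ep in episodes for lp in ep["log_probs"]]
batch_rewards = [r for ep in episodes for r in ep["rewards"]]
loss = policy_gradient_loss(batch_log_probs, standardize(batch_rewards))
print(len(batch_rewards), len(batch_log_probs), loss)
```

Because each sampled action yields exactly one reward and one log probability, the two batch sequences always have the same length.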
[Debug output from training, condensed. Each update prints three shapes: rewards looks like (N,), logs prob looks like torch.Size([N]), and torch.from_numpy(rewards) looks like torch.Size([N]). The three agree in every record shown; N starts at 467, 460, 493, ... and ranges from 400 to 1539.]
torch.from_numpy(rewards) looks like torch.Size([940]) rewards looks like (805,) logs prob looks like torch.Size([805]) torch.from_numpy(rewards) looks like torch.Size([805]) rewards looks like (888,) logs prob looks like torch.Size([888]) torch.from_numpy(rewards) looks like torch.Size([888]) rewards looks like (795,) logs prob looks like torch.Size([795]) torch.from_numpy(rewards) looks like torch.Size([795]) rewards looks like (732,) logs prob looks like torch.Size([732]) torch.from_numpy(rewards) looks like torch.Size([732]) rewards looks like (857,) logs prob looks like torch.Size([857]) torch.from_numpy(rewards) looks like torch.Size([857]) rewards looks like (1208,) logs prob looks like torch.Size([1208]) torch.from_numpy(rewards) looks like torch.Size([1208]) rewards looks like (755,) logs prob looks like torch.Size([755]) torch.from_numpy(rewards) looks like torch.Size([755]) rewards looks like (975,) logs prob looks like torch.Size([975]) torch.from_numpy(rewards) looks like torch.Size([975]) rewards looks like (969,) logs prob looks like torch.Size([969]) torch.from_numpy(rewards) looks like torch.Size([969]) rewards looks like (1217,) logs prob looks like torch.Size([1217]) torch.from_numpy(rewards) looks like torch.Size([1217]) rewards looks like (1466,) logs prob looks like torch.Size([1466]) torch.from_numpy(rewards) looks like torch.Size([1466]) rewards looks like (892,) logs prob looks like torch.Size([892]) torch.from_numpy(rewards) looks like torch.Size([892]) rewards looks like (933,) logs prob looks like torch.Size([933]) torch.from_numpy(rewards) looks like torch.Size([933]) rewards looks like (1991,) logs prob looks like torch.Size([1991]) torch.from_numpy(rewards) looks like torch.Size([1991]) rewards looks like (602,) logs prob looks like torch.Size([602]) torch.from_numpy(rewards) looks like torch.Size([602]) rewards looks like (694,) logs prob looks like torch.Size([694]) torch.from_numpy(rewards) looks like torch.Size([694]) rewards 
looks like (962,) logs prob looks like torch.Size([962]) torch.from_numpy(rewards) looks like torch.Size([962]) rewards looks like (889,) logs prob looks like torch.Size([889]) torch.from_numpy(rewards) looks like torch.Size([889]) rewards looks like (874,) logs prob looks like torch.Size([874]) torch.from_numpy(rewards) looks like torch.Size([874]) rewards looks like (1108,) logs prob looks like torch.Size([1108]) torch.from_numpy(rewards) looks like torch.Size([1108]) rewards looks like (994,) logs prob looks like torch.Size([994]) torch.from_numpy(rewards) looks like torch.Size([994]) rewards looks like (1742,) logs prob looks like torch.Size([1742]) torch.from_numpy(rewards) looks like torch.Size([1742]) rewards looks like (1287,) logs prob looks like torch.Size([1287]) torch.from_numpy(rewards) looks like torch.Size([1287]) rewards looks like (1190,) logs prob looks like torch.Size([1190]) torch.from_numpy(rewards) looks like torch.Size([1190]) rewards looks like (1016,) logs prob looks like torch.Size([1016]) torch.from_numpy(rewards) looks like torch.Size([1016]) rewards looks like (810,) logs prob looks like torch.Size([810]) torch.from_numpy(rewards) looks like torch.Size([810]) rewards looks like (1244,) logs prob looks like torch.Size([1244]) torch.from_numpy(rewards) looks like torch.Size([1244]) rewards looks like (1755,) logs prob looks like torch.Size([1755]) torch.from_numpy(rewards) looks like torch.Size([1755]) rewards looks like (1467,) logs prob looks like torch.Size([1467]) torch.from_numpy(rewards) looks like torch.Size([1467]) rewards looks like (1530,) logs prob looks like torch.Size([1530]) torch.from_numpy(rewards) looks like torch.Size([1530]) rewards looks like (2494,) logs prob looks like torch.Size([2494]) torch.from_numpy(rewards) looks like torch.Size([2494]) rewards looks like (1130,) logs prob looks like torch.Size([1130]) torch.from_numpy(rewards) looks like torch.Size([1130]) rewards looks like (1282,) logs prob looks like torch.Size([1282]) torch.from_numpy(rewards) looks like torch.Size([1282]) rewards looks like (2414,) 
logs prob looks like torch.Size([2414]) torch.from_numpy(rewards) looks like torch.Size([2414]) rewards looks like (1461,) logs prob looks like torch.Size([1461]) torch.from_numpy(rewards) looks like torch.Size([1461]) rewards looks like (818,) logs prob looks like torch.Size([818]) torch.from_numpy(rewards) looks like torch.Size([818]) rewards looks like (1231,) logs prob looks like torch.Size([1231]) torch.from_numpy(rewards) looks like torch.Size([1231]) rewards looks like (2387,) logs prob looks like torch.Size([2387]) torch.from_numpy(rewards) looks like torch.Size([2387]) rewards looks like (421,) logs prob looks like torch.Size([421]) torch.from_numpy(rewards) looks like torch.Size([421]) rewards looks like (374,) logs prob looks like torch.Size([374]) torch.from_numpy(rewards) looks like torch.Size([374]) rewards looks like (419,) logs prob looks like torch.Size([419]) torch.from_numpy(rewards) looks like torch.Size([419]) rewards looks like (345,) logs prob looks like torch.Size([345]) torch.from_numpy(rewards) looks like torch.Size([345]) rewards looks like (422,) logs prob looks like torch.Size([422]) torch.from_numpy(rewards) looks like torch.Size([422]) rewards looks like (426,) logs prob looks like torch.Size([426]) torch.from_numpy(rewards) looks like torch.Size([426]) rewards looks like (416,) logs prob looks like torch.Size([416]) torch.from_numpy(rewards) looks like torch.Size([416]) rewards looks like (374,) logs prob looks like torch.Size([374]) torch.from_numpy(rewards) looks like torch.Size([374]) rewards looks like (442,) logs prob looks like torch.Size([442]) torch.from_numpy(rewards) looks like torch.Size([442]) rewards looks like (387,) logs prob looks like torch.Size([387]) torch.from_numpy(rewards) looks like torch.Size([387]) rewards looks like (364,) logs prob looks like torch.Size([364]) torch.from_numpy(rewards) looks like torch.Size([364]) rewards looks like (433,) logs prob looks like torch.Size([433]) torch.from_numpy(rewards) 
looks like torch.Size([433]) rewards looks like (447,) logs prob looks like torch.Size([447]) torch.from_numpy(rewards) looks like torch.Size([447]) rewards looks like (450,) logs prob looks like torch.Size([450]) torch.from_numpy(rewards) looks like torch.Size([450]) rewards looks like (468,) logs prob looks like torch.Size([468]) torch.from_numpy(rewards) looks like torch.Size([468]) rewards looks like (459,) logs prob looks like torch.Size([459]) torch.from_numpy(rewards) looks like torch.Size([459]) rewards looks like (463,) logs prob looks like torch.Size([463]) torch.from_numpy(rewards) looks like torch.Size([463]) rewards looks like (1427,) logs prob looks like torch.Size([1427]) torch.from_numpy(rewards) looks like torch.Size([1427]) rewards looks like (1327,) logs prob looks like torch.Size([1327]) torch.from_numpy(rewards) looks like torch.Size([1327]) rewards looks like (1328,) logs prob looks like torch.Size([1328]) torch.from_numpy(rewards) looks like torch.Size([1328]) rewards looks like (1374,) logs prob looks like torch.Size([1374]) torch.from_numpy(rewards) looks like torch.Size([1374]) rewards looks like (2257,) logs prob looks like torch.Size([2257]) torch.from_numpy(rewards) looks like torch.Size([2257]) rewards looks like (1379,) logs prob looks like torch.Size([1379]) torch.from_numpy(rewards) looks like torch.Size([1379]) rewards looks like (2934,) logs prob looks like torch.Size([2934]) torch.from_numpy(rewards) looks like torch.Size([2934]) rewards looks like (1415,) logs prob looks like torch.Size([1415]) torch.from_numpy(rewards) looks like torch.Size([1415]) rewards looks like (698,) logs prob looks like torch.Size([698]) torch.from_numpy(rewards) looks like torch.Size([698]) rewards looks like (1740,) logs prob looks like torch.Size([1740]) torch.from_numpy(rewards) looks like torch.Size([1740]) rewards looks like (2216,) logs prob looks like torch.Size([2216]) torch.from_numpy(rewards) looks like torch.Size([2216]) rewards looks like 
(1920,) logs prob looks like torch.Size([1920]) torch.from_numpy(rewards) looks like torch.Size([1920]) rewards looks like (1229,) logs prob looks like torch.Size([1229]) torch.from_numpy(rewards) looks like torch.Size([1229]) rewards looks like (2278,) logs prob looks like torch.Size([2278]) torch.from_numpy(rewards) looks like torch.Size([2278]) rewards looks like (2598,) logs prob looks like torch.Size([2598]) torch.from_numpy(rewards) looks like torch.Size([2598]) rewards looks like (1279,) logs prob looks like torch.Size([1279]) torch.from_numpy(rewards) looks like torch.Size([1279]) rewards looks like (2926,) logs prob looks like torch.Size([2926]) torch.from_numpy(rewards) looks like torch.Size([2926]) rewards looks like (1525,) logs prob looks like torch.Size([1525]) torch.from_numpy(rewards) looks like torch.Size([1525]) rewards looks like (965,) logs prob looks like torch.Size([965]) torch.from_numpy(rewards) looks like torch.Size([965]) rewards looks like (1734,) logs prob looks like torch.Size([1734]) torch.from_numpy(rewards) looks like torch.Size([1734]) rewards looks like (1625,) logs prob looks like torch.Size([1625]) torch.from_numpy(rewards) looks like torch.Size([1625]) rewards looks like (1081,) logs prob looks like torch.Size([1081]) torch.from_numpy(rewards) looks like torch.Size([1081]) rewards looks like (1628,) logs prob looks like torch.Size([1628]) torch.from_numpy(rewards) looks like torch.Size([1628]) rewards looks like (2825,) logs prob looks like torch.Size([2825]) torch.from_numpy(rewards) looks like torch.Size([2825]) rewards looks like (3485,) logs prob looks like torch.Size([3485]) torch.from_numpy(rewards) looks like torch.Size([3485]) rewards looks like (1514,) logs prob looks like torch.Size([1514]) torch.from_numpy(rewards) looks like torch.Size([1514]) rewards looks like (846,) logs prob looks like torch.Size([846]) torch.from_numpy(rewards) looks like torch.Size([846]) rewards looks like (755,) logs prob looks like 
torch.Size([755]) torch.from_numpy(rewards) looks like torch.Size([755]) rewards looks like (1059,) logs prob looks like torch.Size([1059]) torch.from_numpy(rewards) looks like torch.Size([1059]) rewards looks like (2581,) logs prob looks like torch.Size([2581]) torch.from_numpy(rewards) looks like torch.Size([2581]) rewards looks like (2767,) logs prob looks like torch.Size([2767]) torch.from_numpy(rewards) looks like torch.Size([2767]) rewards looks like (899,) logs prob looks like torch.Size([899]) torch.from_numpy(rewards) looks like torch.Size([899]) rewards looks like (2808,) logs prob looks like torch.Size([2808]) torch.from_numpy(rewards) looks like torch.Size([2808]) rewards looks like (1459,) logs prob looks like torch.Size([1459]) torch.from_numpy(rewards) looks like torch.Size([1459]) rewards looks like (2458,) logs prob looks like torch.Size([2458]) torch.from_numpy(rewards) looks like torch.Size([2458]) rewards looks like (1027,) logs prob looks like torch.Size([1027]) torch.from_numpy(rewards) looks like torch.Size([1027]) rewards looks like (1907,) logs prob looks like torch.Size([1907]) torch.from_numpy(rewards) looks like torch.Size([1907]) rewards looks like (1878,) logs prob looks like torch.Size([1878]) torch.from_numpy(rewards) looks like torch.Size([1878]) rewards looks like (2129,) logs prob looks like torch.Size([2129]) torch.from_numpy(rewards) looks like torch.Size([2129]) rewards looks like (2873,) logs prob looks like torch.Size([2873]) torch.from_numpy(rewards) looks like torch.Size([2873]) rewards looks like (1311,) logs prob looks like torch.Size([1311]) torch.from_numpy(rewards) looks like torch.Size([1311]) rewards looks like (1888,) logs prob looks like torch.Size([1888]) torch.from_numpy(rewards) looks like torch.Size([1888]) rewards looks like (870,) logs prob looks like torch.Size([870]) torch.from_numpy(rewards) looks like torch.Size([870]) rewards looks like (1193,) logs prob looks like torch.Size([1193]) 
torch.from_numpy(rewards) looks like torch.Size([1193]) rewards looks like (1367,) logs prob looks like torch.Size([1367]) torch.from_numpy(rewards) looks like torch.Size([1367]) rewards looks like (1786,) logs prob looks like torch.Size([1786]) torch.from_numpy(rewards) looks like torch.Size([1786]) rewards looks like (992,) logs prob looks like torch.Size([992]) torch.from_numpy(rewards) looks like torch.Size([992]) rewards looks like (1037,) logs prob looks like torch.Size([1037]) torch.from_numpy(rewards) looks like torch.Size([1037]) rewards looks like (2417,) logs prob looks like torch.Size([2417]) torch.from_numpy(rewards) looks like torch.Size([2417]) rewards looks like (2027,) logs prob looks like torch.Size([2027]) torch.from_numpy(rewards) looks like torch.Size([2027]) rewards looks like (1203,) logs prob looks like torch.Size([1203]) torch.from_numpy(rewards) looks like torch.Size([1203]) rewards looks like (2168,) logs prob looks like torch.Size([2168]) torch.from_numpy(rewards) looks like torch.Size([2168]) rewards looks like (1097,) logs prob looks like torch.Size([1097]) torch.from_numpy(rewards) looks like torch.Size([1097]) rewards looks like (2070,) logs prob looks like torch.Size([2070]) torch.from_numpy(rewards) looks like torch.Size([2070]) rewards looks like (1878,) logs prob looks like torch.Size([1878]) torch.from_numpy(rewards) looks like torch.Size([1878]) rewards looks like (1325,) logs prob looks like torch.Size([1325]) torch.from_numpy(rewards) looks like torch.Size([1325]) rewards looks like (2611,) logs prob looks like torch.Size([2611]) torch.from_numpy(rewards) looks like torch.Size([2611]) rewards looks like (1549,) logs prob looks like torch.Size([1549]) torch.from_numpy(rewards) looks like torch.Size([1549]) rewards looks like (2479,) logs prob looks like torch.Size([2479]) torch.from_numpy(rewards) looks like torch.Size([2479]) rewards looks like (1987,) logs prob looks like torch.Size([1987]) torch.from_numpy(rewards) looks 
like torch.Size([1987]) rewards looks like (1370,) logs prob looks like torch.Size([1370]) torch.from_numpy(rewards) looks like torch.Size([1370]) rewards looks like (1003,) logs prob looks like torch.Size([1003]) torch.from_numpy(rewards) looks like torch.Size([1003]) rewards looks like (2640,) logs prob looks like torch.Size([2640]) torch.from_numpy(rewards) looks like torch.Size([2640]) rewards looks like (1486,) logs prob looks like torch.Size([1486]) torch.from_numpy(rewards) looks like torch.Size([1486]) rewards looks like (2105,) logs prob looks like torch.Size([2105]) torch.from_numpy(rewards) looks like torch.Size([2105]) rewards looks like (2222,) logs prob looks like torch.Size([2222]) torch.from_numpy(rewards) looks like torch.Size([2222]) rewards looks like (1209,) logs prob looks like torch.Size([1209]) torch.from_numpy(rewards) looks like torch.Size([1209]) rewards looks like (1666,) logs prob looks like torch.Size([1666]) torch.from_numpy(rewards) looks like torch.Size([1666]) rewards looks like (1435,) logs prob looks like torch.Size([1435]) torch.from_numpy(rewards) looks like torch.Size([1435]) rewards looks like (1231,) logs prob looks like torch.Size([1231]) torch.from_numpy(rewards) looks like torch.Size([1231]) rewards looks like (1207,) logs prob looks like torch.Size([1207]) torch.from_numpy(rewards) looks like torch.Size([1207]) rewards looks like (1155,) logs prob looks like torch.Size([1155]) torch.from_numpy(rewards) looks like torch.Size([1155]) rewards looks like (1526,) logs prob looks like torch.Size([1526]) torch.from_numpy(rewards) looks like torch.Size([1526]) rewards looks like (2181,) logs prob looks like torch.Size([2181]) torch.from_numpy(rewards) looks like torch.Size([2181]) rewards looks like (1868,) logs prob looks like torch.Size([1868]) torch.from_numpy(rewards) looks like torch.Size([1868]) rewards looks like (2452,) logs prob looks like torch.Size([2452]) torch.from_numpy(rewards) looks like torch.Size([2452]) rewards 
looks like (1363,) logs prob looks like torch.Size([1363]) torch.from_numpy(rewards) looks like torch.Size([1363]) rewards looks like (1543,) logs prob looks like torch.Size([1543]) torch.from_numpy(rewards) looks like torch.Size([1543]) rewards looks like (2103,) logs prob looks like torch.Size([2103]) torch.from_numpy(rewards) looks like torch.Size([2103]) rewards looks like (1750,) logs prob looks like torch.Size([1750]) torch.from_numpy(rewards) looks like torch.Size([1750]) rewards looks like (1453,) logs prob looks like torch.Size([1453]) torch.from_numpy(rewards) looks like torch.Size([1453]) rewards looks like (1996,) logs prob looks like torch.Size([1996]) torch.from_numpy(rewards) looks like torch.Size([1996]) rewards looks like (1634,) logs prob looks like torch.Size([1634]) torch.from_numpy(rewards) looks like torch.Size([1634]) rewards looks like (1364,) logs prob looks like torch.Size([1364]) torch.from_numpy(rewards) looks like torch.Size([1364]) rewards looks like (2401,) logs prob looks like torch.Size([2401]) torch.from_numpy(rewards) looks like torch.Size([2401]) rewards looks like (1041,) logs prob looks like torch.Size([1041]) torch.from_numpy(rewards) looks like torch.Size([1041]) rewards looks like (1014,) logs prob looks like torch.Size([1014]) torch.from_numpy(rewards) looks like torch.Size([1014]) rewards looks like (1723,) logs prob looks like torch.Size([1723]) torch.from_numpy(rewards) looks like torch.Size([1723]) rewards looks like (1141,) logs prob looks like torch.Size([1141]) torch.from_numpy(rewards) looks like torch.Size([1141]) rewards looks like (1153,) logs prob looks like torch.Size([1153]) torch.from_numpy(rewards) looks like torch.Size([1153]) rewards looks like (1345,) logs prob looks like torch.Size([1345]) torch.from_numpy(rewards) looks like torch.Size([1345]) rewards looks like (1537,) logs prob looks like torch.Size([1537]) torch.from_numpy(rewards) looks like torch.Size([1537]) rewards looks like (1362,) logs prob 
looks like torch.Size([1362]) torch.from_numpy(rewards) looks like torch.Size([1362]) rewards looks like (1400,) logs prob looks like torch.Size([1400]) torch.from_numpy(rewards) looks like torch.Size([1400]) rewards looks like (1363,) logs prob looks like torch.Size([1363]) torch.from_numpy(rewards) looks like torch.Size([1363]) rewards looks like (1381,) logs prob looks like torch.Size([1381]) torch.from_numpy(rewards) looks like torch.Size([1381]) rewards looks like (2077,) logs prob looks like torch.Size([2077]) torch.from_numpy(rewards) looks like torch.Size([2077]) rewards looks like (2517,) logs prob looks like torch.Size([2517]) torch.from_numpy(rewards) looks like torch.Size([2517]) rewards looks like (1419,) logs prob looks like torch.Size([1419]) torch.from_numpy(rewards) looks like torch.Size([1419]) rewards looks like (960,) logs prob looks like torch.Size([960]) torch.from_numpy(rewards) looks like torch.Size([960]) rewards looks like (1079,) logs prob looks like torch.Size([1079]) torch.from_numpy(rewards) looks like torch.Size([1079]) rewards looks like (1285,) logs prob looks like torch.Size([1285]) torch.from_numpy(rewards) looks like torch.Size([1285]) rewards looks like (2475,) logs prob looks like torch.Size([2475]) torch.from_numpy(rewards) looks like torch.Size([2475]) rewards looks like (1376,) logs prob looks like torch.Size([1376]) torch.from_numpy(rewards) looks like torch.Size([1376]) rewards looks like (2248,) logs prob looks like torch.Size([2248]) torch.from_numpy(rewards) looks like torch.Size([2248]) rewards looks like (2912,) logs prob looks like torch.Size([2912]) torch.from_numpy(rewards) looks like torch.Size([2912]) rewards looks like (1334,) logs prob looks like torch.Size([1334]) torch.from_numpy(rewards) looks like torch.Size([1334]) rewards looks like (1481,) logs prob looks like torch.Size([1481]) torch.from_numpy(rewards) looks like torch.Size([1481]) rewards looks like (2016,) logs prob looks like torch.Size([2016]) 
torch.from_numpy(rewards) looks like torch.Size([2016]) rewards looks like (1899,) logs prob looks like torch.Size([1899]) torch.from_numpy(rewards) looks like torch.Size([1899]) rewards looks like (1171,) logs prob looks like torch.Size([1171]) torch.from_numpy(rewards) looks like torch.Size([1171]) rewards looks like (1250,) logs prob looks like torch.Size([1250]) torch.from_numpy(rewards) looks like torch.Size([1250]) rewards looks like (1945,) logs prob looks like torch.Size([1945]) torch.from_numpy(rewards) looks like torch.Size([1945]) rewards looks like (2421,) logs prob looks like torch.Size([2421]) torch.from_numpy(rewards) looks like torch.Size([2421]) rewards looks like (1859,) logs prob looks like torch.Size([1859]) torch.from_numpy(rewards) looks like torch.Size([1859]) rewards looks like (1101,) logs prob looks like torch.Size([1101]) torch.from_numpy(rewards) looks like torch.Size([1101]) rewards looks like (1297,) logs prob looks like torch.Size([1297]) torch.from_numpy(rewards) looks like torch.Size([1297]) rewards looks like (2085,) logs prob looks like torch.Size([2085]) torch.from_numpy(rewards) looks like torch.Size([2085]) rewards looks like (1478,) logs prob looks like torch.Size([1478]) torch.from_numpy(rewards) looks like torch.Size([1478]) rewards looks like (1131,) logs prob looks like torch.Size([1131]) torch.from_numpy(rewards) looks like torch.Size([1131]) rewards looks like (1370,) logs prob looks like torch.Size([1370]) torch.from_numpy(rewards) looks like torch.Size([1370]) rewards looks like (1503,) logs prob looks like torch.Size([1503]) torch.from_numpy(rewards) looks like torch.Size([1503]) rewards looks like (1058,) logs prob looks like torch.Size([1058]) torch.from_numpy(rewards) looks like torch.Size([1058]) rewards looks like (1350,) logs prob looks like torch.Size([1350]) torch.from_numpy(rewards) looks like torch.Size([1350]) rewards looks like (1250,) logs prob looks like torch.Size([1250]) torch.from_numpy(rewards) looks 
like torch.Size([1250]) rewards looks like (1364,) logs prob looks like torch.Size([1364]) torch.from_numpy(rewards) looks like torch.Size([1364]) rewards looks like (1084,) logs prob looks like torch.Size([1084]) torch.from_numpy(rewards) looks like torch.Size([1084]) rewards looks like (1250,) logs prob looks like torch.Size([1250]) torch.from_numpy(rewards) looks like torch.Size([1250]) rewards looks like (1286,) logs prob looks like torch.Size([1286]) torch.from_numpy(rewards) looks like torch.Size([1286]) rewards looks like (1477,) logs prob looks like torch.Size([1477]) torch.from_numpy(rewards) looks like torch.Size([1477]) rewards looks like (1172,) logs prob looks like torch.Size([1172]) torch.from_numpy(rewards) looks like torch.Size([1172]) rewards looks like (1366,) logs prob looks like torch.Size([1366]) torch.from_numpy(rewards) looks like torch.Size([1366]) rewards looks like (1826,) logs prob looks like torch.Size([1826]) torch.from_numpy(rewards) looks like torch.Size([1826]) rewards looks like (1165,) logs prob looks like torch.Size([1165]) torch.from_numpy(rewards) looks like torch.Size([1165]) rewards looks like (2540,) logs prob looks like torch.Size([2540]) torch.from_numpy(rewards) looks like torch.Size([2540]) rewards looks like (1507,) logs prob looks like torch.Size([1507]) torch.from_numpy(rewards) looks like torch.Size([1507]) rewards looks like (2418,) logs prob looks like torch.Size([2418]) torch.from_numpy(rewards) looks like torch.Size([2418]) rewards looks like (1300,) logs prob looks like torch.Size([1300]) torch.from_numpy(rewards) looks like torch.Size([1300]) rewards looks like (2572,) logs prob looks like torch.Size([2572]) torch.from_numpy(rewards) looks like torch.Size([2572]) rewards looks like (1225,) logs prob looks like torch.Size([1225]) torch.from_numpy(rewards) looks like torch.Size([1225]) rewards looks like (1586,) logs prob looks like torch.Size([1586]) torch.from_numpy(rewards) looks like torch.Size([1586]) rewards 
looks like (1460,) logs prob looks like torch.Size([1460]) torch.from_numpy(rewards) looks like torch.Size([1460]) rewards looks like (1458,) logs prob looks like torch.Size([1458]) torch.from_numpy(rewards) looks like torch.Size([1458]) rewards looks like (1381,) logs prob looks like torch.Size([1381]) torch.from_numpy(rewards) looks like torch.Size([1381]) rewards looks like (1356,) logs prob looks like torch.Size([1356]) torch.from_numpy(rewards) looks like torch.Size([1356]) rewards looks like (1520,) logs prob looks like torch.Size([1520]) torch.from_numpy(rewards) looks like torch.Size([1520]) rewards looks like (1570,) logs prob looks like torch.Size([1570]) torch.from_numpy(rewards) looks like torch.Size([1570]) rewards looks like (1303,) logs prob looks like torch.Size([1303]) torch.from_numpy(rewards) looks like torch.Size([1303]) rewards looks like (2160,) logs prob looks like torch.Size([2160]) torch.from_numpy(rewards) looks like torch.Size([2160]) rewards looks like (1344,) logs prob looks like torch.Size([1344]) torch.from_numpy(rewards) looks like torch.Size([1344]) rewards looks like (1496,) logs prob looks like torch.Size([1496]) torch.from_numpy(rewards) looks like torch.Size([1496]) rewards looks like (1905,) logs prob looks like torch.Size([1905]) torch.from_numpy(rewards) looks like torch.Size([1905]) rewards looks like (1255,) logs prob looks like torch.Size([1255]) torch.from_numpy(rewards) looks like torch.Size([1255]) rewards looks like (1440,) logs prob looks like torch.Size([1440]) torch.from_numpy(rewards) looks like torch.Size([1440]) rewards looks like (1472,) logs prob looks like torch.Size([1472]) torch.from_numpy(rewards) looks like torch.Size([1472]) rewards looks like (1261,) logs prob looks like torch.Size([1261]) torch.from_numpy(rewards) looks like torch.Size([1261]) rewards looks like (2225,) logs prob looks like torch.Size([2225]) torch.from_numpy(rewards) looks like torch.Size([2225]) rewards looks like (1071,) logs prob 
looks like torch.Size([1071]) torch.from_numpy(rewards) looks like torch.Size([1071]) rewards looks like (1033,) logs prob looks like torch.Size([1033]) torch.from_numpy(rewards) looks like torch.Size([1033]) rewards looks like (856,) logs prob looks like torch.Size([856]) torch.from_numpy(rewards) looks like torch.Size([856]) rewards looks like (1261,) logs prob looks like torch.Size([1261]) torch.from_numpy(rewards) looks like torch.Size([1261]) rewards looks like (1782,) logs prob looks like torch.Size([1782]) torch.from_numpy(rewards) looks like torch.Size([1782]) rewards looks like (1867,) logs prob looks like torch.Size([1867]) torch.from_numpy(rewards) looks like torch.Size([1867]) rewards looks like (2025,) logs prob looks like torch.Size([2025]) torch.from_numpy(rewards) looks like torch.Size([2025]) rewards looks like (1250,) logs prob looks like torch.Size([1250]) torch.from_numpy(rewards) looks like torch.Size([1250]) rewards looks like (1323,) logs prob looks like torch.Size([1323]) torch.from_numpy(rewards) looks like torch.Size([1323]) rewards looks like (1349,) logs prob looks like torch.Size([1349]) torch.from_numpy(rewards) looks like torch.Size([1349]) rewards looks like (1617,) logs prob looks like torch.Size([1617]) torch.from_numpy(rewards) looks like torch.Size([1617]) rewards looks like (1668,) logs prob looks like torch.Size([1668]) torch.from_numpy(rewards) looks like torch.Size([1668]) rewards looks like (1109,) logs prob looks like torch.Size([1109]) torch.from_numpy(rewards) looks like torch.Size([1109]) rewards looks like (1102,) logs prob looks like torch.Size([1102]) torch.from_numpy(rewards) looks like torch.Size([1102]) rewards looks like (2017,) logs prob looks like torch.Size([2017]) torch.from_numpy(rewards) looks like torch.Size([2017]) rewards looks like (2368,) logs prob looks like torch.Size([2368]) torch.from_numpy(rewards) looks like torch.Size([2368]) rewards looks like (1128,) logs prob looks like torch.Size([1128]) 
Training Result
During training, we record avg_total_reward, which represents the average total reward of the episodes collected before each update of the policy network.
Theoretically, if the agent becomes better, avg_total_reward will increase.
The visualization of the training process is shown below:
In addition, avg_final_reward represents the average final reward of the episodes. Specifically, the final reward is the last reward received in an episode, which indicates whether the craft landed successfully.
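As a rough illustration of the two metrics described above, the sketch below computes them from per-episode reward sequences. The function name `summarize_episodes` and the toy reward values are assumptions made for this example; they are not taken from the notebook's training code.

```python
def summarize_episodes(episode_rewards):
    """Return (avg_total_reward, avg_final_reward) for a batch of episodes.

    episode_rewards: list of per-step reward sequences, one per episode.
    """
    # Total reward of an episode: sum of all rewards received in it.
    total_rewards = [sum(rewards) for rewards in episode_rewards]
    # Final reward of an episode: the last reward received in it.
    final_rewards = [rewards[-1] for rewards in episode_rewards]
    n = len(episode_rewards)
    return sum(total_rewards) / n, sum(final_rewards) / n


# Toy data (made up): two short episodes of different lengths.
batch = [[-1.0, -2.0, 100.0], [-0.5, -100.0]]
avg_total_reward, avg_final_reward = summarize_episodes(batch)
print(avg_total_reward, avg_final_reward)
```

The two numbers measure different things: avg_total_reward reflects the whole trajectory, while avg_final_reward only reflects how each episode ended.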
Testing
The test result is the average reward over 5 test episodes.
-209.13696525868605
-106.5599827895497
Action list
Action list looks like [[2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 0, 2, 2, 0, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 1, 1, 2, 2, 1, 1, 2, 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 2, 0, 2, 0, 2, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], [1, 1, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 3, 3, 3, 3], [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 2, 2, 3, 2, 3, 0, 2, 2, 2, 0, 2, 1, 2, 3, 2, 2, 0, 2, 2, 1, 0, 2, 2, 3, 
2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 0, 1, 2, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 3, 2, 2, 0, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 2, 0, 2, 2, 0, 2, 2, 1, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 1, 1, 2, 1, 2, 1, 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 2, 2, 3, 3, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 3, 3, 2, 2, 2, 2, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 2, 1, 1, 2, 1, 2, 1, 1, 1, 1, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 2, 2, 3, 3, 3, 3, 3, 2, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 3, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 2, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 0, 2, 3, 2, 3, 3, 3, 2, 2, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 2, 3, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 2, 3, 2, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 
2, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 2, 1, 2, 1, 2, 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 2, 1, 2, 2, 1, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 3, 2, 3, 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 2, 3, 2, 3, 2, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 2, 2, 2, 3, 0, 2, 0, 0, 2, 3, 2, 0, 2, 3, 3, 2, 0, 2, 0, 2, 2, 1, 2, 2, 1, 1, 2, 2, 1, 3, 2, 2, 0, 2, 1, 0, 2, 1, 2, 3, 2, 0, 2, 1, 2, 2, 2, 1, 2, 1, 1, 2, 2, 2, 1, 2, 3, 2, 2, 2, 3, 0, 2, 3, 2, 3, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 3, 3, 2, 2, 3, 2, 2, 2, 3, 2, 3, 2, 0, 2, 3, 2, 3, 0, 2, 3, 2, 3, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 1, 2, 1, 2, 1, 2, 2, 1, 2, 0, 2, 3, 0, 2, 3, 2, 3, 3, 2, 3, 3, 2, 3, 2, 3, 2, 3, 3, 2, 3, 3, 3, 2, 3, 2, 3, 2, 2, 3, 2, 3, 3, 2, 2, 2, 3, 2, 2, 3, 2, 3, 2, 2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2, 2, 1, 2, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 1, 1, 2, 1, 1, 2, 1, 1, 2, 1, 2, 2, 1, 1, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 3, 2, 3, 3, 3, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 2, 3, 2, 3, 3, 2, 3, 2, 2, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 2, 2, 1, 2, 2, 1, 2, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3]] Action list's shape looks like (5,) 
/opt/conda/lib/python3.8/site-packages/numpy/core/fromnumeric.py:2007: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. result = asarray(a).shape
Analysis of actions taken by agent
{2: 991, 3: 374, 0: 108, 1: 496}
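A distribution like the one above can be obtained by counting action ids over all test episodes. The following is a minimal sketch using `collections.Counter`; the helper name `count_actions` and the toy input are assumptions for illustration. In the output above, action 2 accounts for 991 of the 1969 recorded steps, roughly half.

```python
from collections import Counter


def count_actions(action_list):
    """Count how often each action id appears across all episodes."""
    counts = Counter()
    for actions in action_list:
        counts.update(actions)
    # Keys appear in first-seen order, as in the notebook output.
    return dict(counts)


# Toy input (made up): two episodes of unequal length.
toy_action_list = [[2, 2, 3, 0], [1, 2, 1]]
distribution = count_actions(toy_action_list)
print(distribution)  # {2: 3, 3: 1, 0: 1, 1: 2}
```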
Saving the result of Model Testing
/tmp/ipykernel_123/1616289779.py:2: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray. np.save(PATH ,np.array(action_list))
This is the file you need to submit !!!
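The VisibleDeprecationWarning shown above appears because the episodes have different lengths, so the action list is ragged. One way to avoid it, sketched below under the assumption that a 1-D object array is acceptable for the saved file, is to pass `dtype=object` explicitly. The toy action list and the temporary `PATH` are made up for this example. Note that recent NumPy releases raise an error instead of a warning when `dtype=object` is omitted for ragged input.

```python
import os
import tempfile

import numpy as np

# Toy action list (made up): episodes have different lengths, so it is ragged.
action_list = [[2, 2, 3], [1, 0], [2]]

# dtype=object stores each episode as one Python list inside a 1-D array
# and avoids the ragged-sequence deprecation warning.
actions_array = np.array(action_list, dtype=object)
print(actions_array.shape)  # (3,)

# PATH is a temporary file chosen for this example only.
PATH = os.path.join(tempfile.mkdtemp(), "Action_List.npy")
np.save(PATH, actions_array)

# Object arrays are pickled on save, so loading needs allow_pickle=True.
loaded = np.load(PATH, allow_pickle=True)
print([list(actions) for actions in loaded])
```

If every episode happened to have the same length, `np.array(..., dtype=object)` would produce a 2-D array instead, so code that reads the file should not rely on the shape being `(5,)`.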
Download the testing result to your device
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[26], line 1
----> 1 from google.colab import files
      2 files.download(PATH)
ModuleNotFoundError: No module named 'google.colab'
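The error above occurs because `google.colab` is only available inside Colab. A minimal sketch of a guarded download, assuming you simply want to skip the browser download when running elsewhere, could look like this. The helper name `download_result` and the file name are hypothetical.

```python
def download_result(path):
    """Trigger a browser download on Colab; otherwise just report the path."""
    try:
        # This import only succeeds inside a Colab runtime.
        from google.colab import files
    except ImportError:
        print(f"Not running on Colab; the file is available at {path}")
        return False
    files.download(path)
    return True


# Hypothetical file name used for illustration only.
ran_on_colab = download_result("Action_List.npy")
```

Outside Colab, the saved file can be copied from the notebook's working directory by whatever means the local environment provides.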
Server
The code below simulates the environment on the judge server and can be used for testing.
Your reward is : -209.14
Your reward is : -45.50
Your reward is : 62.21
Your reward is : -200.09
Your reward is : -240.06
Your score
Your final reward is : -126.51
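The scoring above can be sketched as replaying each recorded action sequence and averaging the per-episode totals. The example below is only an illustration: `ToyEnv` is a hypothetical stand-in with a made-up reward rule, and `replay` assumes the classic Gym step API that returns four values (newer Gym and Gymnasium versions return five). Replaying recorded actions only reproduces the original rewards if the environment behaves deterministically, which is an assumption here.

```python
class ToyEnv:
    """Hypothetical stand-in for the real environment (classic Gym step API)."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 2 else -1.0  # made-up reward rule
        done = self.t >= self.horizon
        return 0, reward, done, {}


def replay(env, action_list):
    """Replay each recorded episode and return the per-episode total rewards."""
    totals = []
    for actions in action_list:
        env.reset()
        total_reward = 0.0
        for action in actions:
            _, reward, done, _ = env.step(action)
            total_reward += reward
            if done:
                break  # ignore any actions recorded after the episode ended
        totals.append(total_reward)
    return totals


rewards = replay(ToyEnv(), [[2, 2, 2], [2, 1], [0, 0, 0, 2]])
for r in rewards:
    print("Your reward is : %.2f" % r)
print("Your final reward is : %.2f" % (sum(rewards) / len(rewards)))
```

The final score is the plain mean of the per-episode totals, so a single badly failed landing can pull the average down considerably.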
Reference
Below are some useful tips to help you get a high score.