Using Model-Free Reinforcement Learning Combined With Underwater Mass Spectrometer and Material Archiving Coupled to Lab Analysis for Autonomous Chemical Source Verifications

Connor Tate, David P. Fries, Micael Vignati, Kevin Francis
DOI: https://doi.org/10.23919/oceans44145.2021.9706111
2021-09-20
Abstract: To address the general problem of Autonomous Underwater/Surface Vehicle (AUV/ASV) based chemical detection and source localization, we propose the design of a system that fuses an AUV/ASV with Q-learning and a real-time underwater mass spectrometer, which provides the feedback and reward signal for in situ source localization. Additionally, an autonomous sampler can be coupled to the system, permitting molecular material archiving for subsequent expanded measurement and validation in the lab. This combination of real-time chemical sensing with archived sample capture and verification yields an adaptive sensing and sampling system. The in situ mass spectrometer allows real-time measurement of membrane-compatible chemistries such as volatile organic compounds (VOCs) and lightweight gases, while the sampler purifies, enriches and accurately isolates targeted molecular compounds in the field for subsequent full mass spectrometer analysis back in the lab. In the overall AUV system design, the battery-driven mass spectrometer provides real-time signals for reinforcement learning (RL) behaviors, and the portable adaptive sampling system automates sample collection, molecular purification/concentration and preservation. The mass spectrometer is of the membrane inlet type, and the automated sampler is a combination of customizable fluidic management systems, pumps, valve arrays and motion control systems. For field sampling, the prototype sampling module is designed for triggered sensing and sampling but can also be actuated to collect variable volumes over any period of time. The mass spectrometer and sampling systems can be hosted on AUVs/ASVs for most chemical source localization activities. The entire mobile system (AUV mobile platform, reinforcement learning controller, mass spectrometer, and sampler) constitutes an adaptive chemical sampling platform.
The ‘back end’ laboratory identification is performed using any type of mass spectrometer and can provide high-confidence verification of the specific material archived. The results from the lab verification can also inform the design of a reward signal for subsequent training of the Q-learning and mass spectrometer data sub-system, increasing the accuracy of the source localization policy. Using mass spectrometer data to train a Q-learning based agent allows the team to pretrain the agent with real sensory data similar to that which will be seen in the field in future deployments. Appropriately simulated data can approximate the anticipated environment and distribution patterns, supporting the development of a custom reward function representative of the mission objective. Preliminary simulations have shown promising results when testing the agent’s performance with a trained policy in a similar environment in which the location of a generic ‘pollution source’ has been perturbed from the training scenario. The policy is acquired by training on pollution data for a set environment, with the trade-off between exploration and exploitation defined appropriately for the environment size, pollution distribution and training duration to optimize the agent’s learning. That policy is then tested in a similar but slightly perturbed environment. This method can be applied to future missions to allow continual policy updates based on the observed data. This approach is advantageous because it limits the need for operator-vehicle communication, giving the agent sufficient autonomy to locate the source based on its prior training, and it circumvents the need for a model-based decision and control approach as the agent becomes better trained through real-world observations. This is a model-free learning approach requiring no a priori knowledge of the environment.
This is a distinct benefit over model-based approaches, which depend on the accuracy and fidelity of the environmental model used during training of the agent; obtaining such a model is notoriously difficult both logistically and computationally.
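The train-then-perturb procedure described in the abstract can be illustrated with a minimal tabular Q-learning sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the 10x10 grid, the Gaussian concentration field standing in for mass spectrometer readings, the four-move action set, the linear epsilon schedule, and all function names and hyperparameters are hypothetical.

```python
import numpy as np

# Hypothetical action set: up, down, left, right on a square grid.
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]


def make_plume(size, source, sigma=3.0):
    """Assumed Gaussian concentration field peaking (value 1.0) at `source`.

    Stands in for simulated mass spectrometer readings.
    """
    ys, xs = np.mgrid[0:size, 0:size]
    d2 = (ys - source[0]) ** 2 + (xs - source[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))


def step(pos, action, size):
    """Move one cell; moves that would leave the grid are clamped."""
    dy, dx = ACTIONS[action]
    return (min(max(pos[0] + dy, 0), size - 1),
            min(max(pos[1] + dx, 0), size - 1))


def train(field, episodes=3000, steps=40, alpha=0.5, gamma=0.9,
          eps_start=1.0, eps_end=0.05, q_init=None, seed=0):
    """Model-free Q-learning; the reward is the concentration at the new cell."""
    rng = np.random.default_rng(seed)
    size = field.shape[0]
    q = np.zeros((size, size, len(ACTIONS))) if q_init is None else q_init.copy()
    for ep in range(episodes):
        # Linear decay from exploration toward exploitation.
        eps = eps_start + (eps_end - eps_start) * ep / max(episodes - 1, 1)
        pos = (int(rng.integers(size)), int(rng.integers(size)))
        for _ in range(steps):
            if rng.random() < eps:
                a = int(rng.integers(len(ACTIONS)))      # explore
            else:
                a = int(np.argmax(q[pos[0], pos[1]]))    # exploit
            nxt = step(pos, a, size)
            reward = field[nxt]                          # simulated sensor reading
            target = reward + gamma * np.max(q[nxt[0], nxt[1]])
            q[pos[0], pos[1], a] += alpha * (target - q[pos[0], pos[1], a])
            pos = nxt
    return q


def rollout(q, start, steps=40):
    """Follow the greedy policy from `start` and return the final cell."""
    size = q.shape[0]
    pos = start
    for _ in range(steps):
        pos = step(pos, int(np.argmax(q[pos[0], pos[1]])), size)
    return pos


if __name__ == "__main__":
    train_field = make_plume(10, (7, 7))   # training scenario
    test_field = make_plume(10, (5, 6))    # source perturbed for testing
    q = train(train_field)
    end = rollout(q, (0, 0))
    print("final cell:", end, "reading in perturbed field:", test_field[end])
```

Because this sketch indexes the Q-table by absolute position, the pretrained greedy policy heads for the training source, so it helps only when the perturbation is small. Passing the pretrained table back in through the hypothetical `q_init` argument, for example `train(test_field, q_init=q, eps_start=0.3)`, is one way to mimic the continual policy update on newly observed data that the abstract mentions.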