Abstract: This article, written by JPT Technology Editor Chris Carpenter, contains highlights of paper SPE 201254, “Reinforcement Learning for Field-Development Policy Optimization,” by Giorgio De Paola, SPE, and Cristina Ibanez-Llano, Repsol, and Jesus Rios, IBM, et al., prepared for the 2020 SPE Annual Technical Conference and Exhibition, originally scheduled to be held in Denver, Colorado, 5–7 October. The paper has not been peer reviewed.

A field-development plan consists of a sequence of decisions. Each action taken affects the reservoir and conditions any future decision. The presence of uncertainty associated with this process, however, is undeniable. The novelty of the approach proposed by the authors in the complete paper is the consideration of the sequential nature of the decisions through the framework of dynamic programming (DP) and reinforcement learning (RL). This methodology allows moving the focus from a static field-development plan optimization to a more-dynamic framework that the authors call field-development policy optimization. This synopsis focuses on the methodology, while the complete paper also contains a real-field application of the methodology.

Methodology

Deep RL (DRL). RL is considered an important learning paradigm in artificial intelligence (AI) but differs from supervised and unsupervised learning, the most commonly known types currently studied in the field of machine learning. During the last decade, RL has attracted greater attention because of its success in applications related to games and self-driving cars. That success results from its combination with deep-learning architectures, known as DRL, which has allowed RL to scale to previously unsolvable problems and, therefore, to solve much larger sequential decision problems. RL, also referred to as stochastic approximate dynamic programming, is a goal-directed sequential-learning-from-interaction paradigm.
The learner or agent is not told what to do but instead has to learn which actions or decisions yield a maximum reward through interaction with an uncertain environment without losing too much reward along the way. Learning from interaction in this way requires balancing the exploration of possible actions against the exploitation of those already known to pay off. Another key characteristic of this type of problem is its sequential nature, where the actions taken by the agent affect the environment itself and, therefore, the subsequent data it receives and the subsequent actions to be taken. Mathematically, such problems are formulated in the framework of the Markov decision process (MDP), which arises primarily in the field of optimal control.

An RL problem consists of two principal parts: the agent, or decision-making engine, and the environment, the interactive world for an agent (in this case, the reservoir). Sequentially, at each timestep, the agent takes an action (e.g., changing control rates or deciding a well location) that makes the environment (reservoir) transition from one state to another. Next, the agent receives a reward (e.g., a cash flow) and an observation of the state of the environment (partial or total) before taking the next action. All relevant information informing the agent of the state of the system is assumed to be included in the last state observed by the agent (Markov property). If the agent observes the full environment state once it has acted, the MDP is said to be fully observable; otherwise, a partially observable Markov decision process (POMDP) results.

The agent’s objective is to learn a policy, a mapping from states (MDPs) or histories (POMDPs) to actions, such that the agent’s cumulative (discounted) reward in the long run is maximized.
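
The agent-environment loop described above can be made concrete with a small sketch. This is an illustrative toy, not the authors' method: the "reservoir" is a hypothetical fully observable MDP whose state is (timestep, wells drilled), the prices, costs, horizon, and discount factor are invented numbers, and tabular Q-learning with epsilon-greedy exploration stands in for the deep RL architectures discussed in the paper.

```python
import random

# All constants below are assumptions chosen for illustration only.
GAMMA = 0.9        # discount factor applied to future cash flows
HORIZON = 5        # number of decision timesteps per episode
MAX_WELLS = 3      # drilling beyond this count is a no-op
ACTIONS = (0, 1)   # 0 = wait, 1 = drill one well


def step(state, action, rng):
    """Toy environment transition: returns (next_state, reward, done)."""
    t, wells = state
    cost = 0.0
    if action == 1 and wells < MAX_WELLS:
        wells += 1
        cost = 2.0                            # hypothetical drilling cost
    price = 3.0 + rng.uniform(-0.5, 0.5)      # uncertain revenue per well
    reward = wells * price - cost             # cash flow for this timestep
    next_state = (t + 1, wells)
    return next_state, reward, next_state[0] >= HORIZON


def discounted_return(rewards, gamma=GAMMA):
    """Cumulative discounted reward: sum of gamma**k * r_k."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def greedy_action(q, state):
    """Exploitation: pick the action with the highest estimated value."""
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))


def train(episodes=5000, alpha=0.1, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {}  # maps (state, action) to an estimated discounted return
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            # Exploration with probability epsilon, exploitation otherwise.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                action = greedy_action(q, state)
            next_state, reward, done = step(state, action, rng)
            best_next = 0.0 if done else max(
                q.get((next_state, a), 0.0) for a in ACTIONS
            )
            old = q.get((state, action), 0.0)
            # Move the estimate toward reward + discounted best next value.
            q[(state, action)] = old + alpha * (reward + GAMMA * best_next - old)
            state = next_state
    return q


if __name__ == "__main__":
    q_table = train()
    for wells in range(MAX_WELLS + 1):
        print("t=0, wells =", wells, "-> action", greedy_action(q_table, (0, wells)))
```

The learned table is a policy in the sense used above: it maps every observed state to an action rather than fixing one plan in advance. Because the agent here sees the full state, the toy is an MDP; a POMDP variant would have to condition on the history of observations instead.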
