Abstract: The reliability of modern computing control systems in a heterogeneous distributed computing environment is, along with efficiency, survivability, security, and control efficiency, an important component of their quality. Increasingly, these systems are classed as "critical" and have a decisive impact on the activities of the organizations and enterprises in which they operate. The loss of such a system, even for a short time, leads to serious problems: lost income, unforeseen costs, downtime of production and personnel, and sometimes man-made disasters. The reliability of a control system depends most strongly on the reliability and fault tolerance of its software and hardware. Improving the reliability of the software part of these systems is therefore the most urgent task. Significant results have been obtained in evaluating and forecasting the reliability indicators of elements and typical software packages at the design stage; a large number of methods, algorithms, and programs are known; and a number of normative documents on design-stage reliability assessment have been developed. However, the task of real-time reliability assessment, which requires accurate and timely accounting of a number of factors, has not been adequately solved. To address multi-agent control of computing in a heterogeneous distributed computing environment, the following methods are used: systems analysis and set theory, to develop models of task distribution, of tasks, and of computing resources; general systems theory, to study and develop methods of task distribution; and logic theory, to model computational processes. The article considers a multi-agent approach to computing control in a heterogeneous distributed computing environment.
The proposed algorithm uses economic mechanisms to regulate the supply of and demand for resources in the computing environment. The architecture of the multi-agent approach and the functions of the agents are described. Particular attention is paid to calculating the reliability of the task plan using the logical-probabilistic method.
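
As a rough illustration of what a logical-probabilistic reliability calculation for a task plan can look like, the sketch below treats a plan as a set of tasks that must all complete, where each task may be replicated on several resources and succeeds if at least one replica succeeds. This series-parallel structure, the independence of failures, the function names, and the sample probabilities are all assumptions made for illustration; the abstract does not specify the article's actual model.

```python
from math import prod


def replica_group_reliability(probabilities):
    """Probability that at least one replica of a task succeeds.

    Assumes replicas fail independently (an illustrative assumption).
    """
    return 1.0 - prod(1.0 - p for p in probabilities)


def plan_reliability(plan):
    """Probability that every task in the plan completes.

    `plan` maps a task name to the success probabilities of the
    resources the task is assigned to. Tasks are combined in series
    (logical AND), replicas of one task in parallel (logical OR).
    """
    return prod(replica_group_reliability(group) for group in plan.values())


# Hypothetical plan: task_a runs on one resource, the others are replicated.
plan = {
    "task_a": [0.9],
    "task_b": [0.8, 0.8],
    "task_c": [0.95, 0.7],
}

print(round(plan_reliability(plan), 5))  # 0.85104
```

Under these assumptions, replicating a task on a second resource raises the reliability of the plan, which is the kind of trade-off a scheduler can weigh against the cost of the extra resource.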