Harrison Lam, Yuanjie Chen, Noboru Kanazawa, Mohammad Chowdhury, Anna Battista, Stephan Waldert
Abstract: We explored the challenge of predicting and explaining the occurrence of events within sequences of data points. Our focus was particularly on scenarios in which unknown triggers causing the occurrence of events may consist of non-consecutive, masked, noisy data points. This scenario is akin to an agent tasked with learning to predict and explain the occurrence of events without understanding the underlying processes or having access to crucial information. Such scenarios are encountered across various fields, such as genomics, hardware and software verification, and financial time series prediction. We combined analytical, simulation, and machine learning (ML) approaches to investigate, quantify, and provide solutions to this challenge. We deduced and validated equations generally applicable to any variation of the underlying challenge. Using these equations, we (1) described how the level of complexity changes with various parameters (e.g., number of apparent and hidden states, trigger length, confidence, etc.) and (2) quantified the data needed to successfully train an ML model. We then (3) proved our ML solution learns and subsequently identifies unknown triggers and predicts the occurrence of events. If the complexity of the challenge is too high, our ML solution can identify trigger candidates to be used to interactively probe the system under investigation to determine the true trigger in a way considerably more efficient than brute force methods. By sharing our findings, we aim to assist others grappling with similar challenges, enabling estimates on the complexity of their problem, the data required and a solution to solve it.
What problem does this paper attempt to address?
### The problems the paper attempts to solve
This article explores the challenge of predicting and explaining the occurrence of events in a sequence of data points. It focuses on the situation where events are caused by unknown triggers that may consist of non-consecutive, masked, or noisy data points. This scenario is similar to an agent that must learn to predict and explain events without understanding the underlying process or having access to key information. Such scenarios occur in multiple fields, such as genomics, hardware and software verification, and financial time series prediction.
To address this challenge, the authors combine analytical, simulation, and machine learning (ML) methods to study and quantify the problem and to provide solutions. Specifically:
1. **Describe the change in complexity**: By deriving and validating equations applicable to any variant of the challenge, describe how the complexity changes with parameters such as the number of explicit (apparent) states \(a\), the number of implicit (hidden) states \(h\), the trigger length \(l\), and the confidence.
2. **Quantify the amount of data required**: Using these equations, quantify the amount of data required to successfully train a machine learning model.
3. **Prove the effectiveness of the ML solution**: Prove that the machine learning solution can learn and identify unknown triggers and thereby predict the occurrence of events. If the complexity of the challenge is too high, the ML solution can identify trigger candidates that are then used to interactively probe the system and determine the true trigger, which is considerably more efficient than brute-force methods.
By sharing these findings, the authors aim to assist researchers facing similar challenges, enabling them to estimate the complexity of their problem and the amount of data required, and to find a solution.
### Main problem statements
1. **Problem 1**: Based on \(a\), \(h\) and \(l\), what is the relationship between the window length \(n\) and the probability of finding the trigger \(t\)? That is, what is the probability that the trigger \(t\) exists in a window of length \(n\)?
2. **Problem 2**: Infer \(T\) from \(X\) when the length \(l\) of \(t\) is unknown and each \(x_i\) has an unobservable state.
3. **Problem 3**: What is the minimum amount of data required to solve Problem 1 and Problem 2?
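Problem 1 can be explored numerically. The following is a minimal Monte Carlo sketch, not the paper's equations. It assumes that each element is drawn uniformly and independently from \(a \cdot h\) combined states, and that the trigger is a fixed, contiguous pattern of length \(l\); the function name `estimate_trigger_probability` and its parameters are hypothetical.

```python
import random


def estimate_trigger_probability(a, h, l, n, trials=20000, seed=0):
    """Estimate P(a fixed contiguous trigger of length l occurs in a window of length n).

    Assumptions (not taken from the paper): elements are i.i.d. uniform over
    a * h combined states, and the trigger is contiguous.
    """
    rng = random.Random(seed)
    k = a * h  # combined (explicit x implicit) states per element
    trigger = tuple(rng.randrange(k) for _ in range(l))
    hits = 0
    for _ in range(trials):
        window = [rng.randrange(k) for _ in range(n)]
        # Slide over every start position; the range is empty when n < l.
        if any(tuple(window[i:i + l]) == trigger for i in range(n - l + 1)):
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for n in (10, 100, 1000):
        p = estimate_trigger_probability(a=2, h=4, l=3, n=n, trials=2000)
        print(f"n={n}: estimated probability {p:.3f}")
```

For \(l = 1\) the estimate can be checked against the closed form \(1 - (1 - 1/(a h))^n\), which holds under the same uniformity assumption.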
### Example problem
Suppose each element has one of two explicit states (\(a = 2\)), leave (L) or stay (S), and one of four implicit states (\(h = 4\)), and the trigger length is three (\(l = 3\)). A possible explicit sequence (\(n = 10\)) could be [LSLLLSSLSS] ⇒ E, while the actual but unobservable sequence is [\(L_H S_H L_H L_H L_H S_H S_H L_H S_H S_H\)] ⇒ E, where the subscript \(H\) stands for each element's hidden state. Although the explicit sequence is known, there are \(h^n = 4^{10} = 1{,}048{,}576\) possible underlying sequences.
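The arithmetic in this example can be reproduced with a few lines of Python. This is a sketch for orientation only: the function `search_space` and the extra quantities it reports (candidate positions and position subsets) are illustrative assumptions about how one might size the search, not formulas taken from the paper.

```python
import math


def search_space(a, h, l, n):
    """Count a few quantities that size the example problem."""
    return {
        # Hidden-state assignments compatible with one known explicit sequence.
        "hidden_sequences": h ** n,
        # Distinct explicit sequences of length n.
        "explicit_sequences": a ** n,
        # Start positions for a contiguous trigger of length l.
        "contiguous_positions": max(n - l + 1, 0),
        # Ways to choose l (possibly non-consecutive) positions out of n.
        "position_subsets": math.comb(n, l),
        # Hidden-state assignments for the l trigger elements alone.
        "hidden_assignments_per_trigger": h ** l,
    }


if __name__ == "__main__":
    for name, value in search_space(a=2, h=4, l=3, n=10).items():
        print(f"{name}: {value}")
```

With \(a = 2\), \(h = 4\), \(l = 3\), and \(n = 10\), `hidden_sequences` is 1,048,576, matching the figure above, and there are 120 ways to pick three positions once non-consecutive triggers are allowed.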
### Solutions
1. **Analytical method**: Derive and numerically validate equations to directly quantify the complexity of the problem and the amount of data required.
2. **Simulation method**: Verify the analytical results through simulation.
3. **Machine learning method**: Propose a machine learning model that automatically extracts the triggers that cause events, thereby revealing the underlying process that drives the occurrence of events.
### Results
- **Analysis and simulation**: Equations applicable to any trigger length and any number of explicit and implicit states are derived analytically and checked by simulation. They answer the questions of which window length and how much data are required.
- **Machine learning**: A deep learning architecture using an embedding layer and an attention mechanism is designed, and it successfully identifies the hidden trigger sequences. Experimental results show that the model identifies the true trigger sequences with high confidence, for both consecutive and non-consecutive triggers.
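To illustrate how an attention mechanism can point at trigger candidates, here is a minimal sketch of scaled dot-product attention with a single query vector, written with the standard library only. It is not the paper's architecture: the embeddings, the query, and the helper names (`attention_weights`, `top_positions`) are hypothetical, and in a trained model the vectors would be learned rather than hand-set.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all positions."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    return softmax(scores)


def top_positions(weights, count):
    """Indices of the `count` highest-weighted positions, in sequence order."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:count])


if __name__ == "__main__":
    # Hand-set toy embeddings (assumption): positions 0 and 2 align with the query.
    keys = [[1.0, 0.0], [0.0, 1.0], [5.0, 0.0], [0.0, 0.0]]
    query = [1.0, 0.0]
    weights = attention_weights(query, keys)
    print(top_positions(weights, 2))
```

Positions with the largest weights are the ones a model of this kind attends to most, so they serve as natural trigger candidates, including when those positions are not adjacent.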
### Summary
By combining analytical, simulation, and machine learning methods, this article addresses the problem of predicting and explaining the occurrence of events when part of the information is missing. The results give researchers in related fields a way to estimate the complexity of their problem and the data it requires, together with an ML approach for solving it.