Modelling Second-Order Uncertainty in State Machines

Modelling the behaviour of state-based systems can be challenging, especially when the modeller is not entirely certain about the system's intended interactions with the user or the environment. Currently, it is possible to associate a stated level of uncertainty with a given event by attaching probabilities to transitions (producing ‘Probabilistic State Machines’). This captures the ‘first-order uncertainty’: the (un-)certainty that a given event will occur. However, this does not permit the modeller to capture their own uncertainty (or lack thereof) about that stated probability, also known as ‘second-order uncertainty’. In this article we introduce a generalisation of probabilistic finite state machines that makes it possible to incorporate this important additional dimension of uncertainty. For this we adopt a formalism for reasoning about uncertainty called Subjective Logic. We present an algorithm to create these enhanced state machines automatically from a conventional state machine and a set of observed sequences. We show how this approach can be used for reverse-engineering predictive state machines from traces.

to capture uncertainty: "Uncertainty is an inherent property of any system that operates in a real environment or that interacts with physical elements or with humans." They note that there is a need for software engineers to be able to express uncertainty within a model in a suitable way. They also note the need to be able to explicitly analyse uncertainty within these models, so that it can be handled appropriately.
In state machines uncertainty might arise if the system being modelled uses probabilistic algorithms or the system's environment is stochastic or poorly understood. Such systems are conventionally represented as probabilistic state machines [3], where transitions are labelled with probabilities. This information can be leveraged to, for example, verify probabilistic properties [4], or to tailor test sets [5].
If we have a state with two outgoing transitions, and the modeller knows that the choice of transition comes down to a random process that amounts to a fair coin-flip (cf. the IEEE 1394 FireWire root contention protocol [6]), it is clear that each transition has a probability of 0.5. This sort of uncertainty is referred to as 'first-order uncertainty' [7].
It can, however, easily be the case that such probabilities are themselves subject to uncertainty. The modeller might not have prior knowledge or empirical observations upon which to gauge the probabilities of events in a given state (epistemic uncertainty [8]). Even if they have plenty of observations, these might be subject to random effects that make it difficult to definitively pin down a specific uncertainty value (aleatory uncertainty [8]). In either case, there is uncertainty surrounding the probability in question. This is generally referred to as 'second-order uncertainty' [7].
There is a qualitative difference between probabilities that are well-founded (based on, for example, 15,000 observations) and probabilities that are speculative and uncertain (a guess that each transition has a likelihood of 0.5, based on no prior knowledge). However, Probabilistic FSMs (PFSMs) do not capture such second-order (un-)certainties. As a result, if we are using a PFSM, we can only approximate the trustworthiness of a prediction if we understand how it was generated (obtaining associated execution samples if these were used to estimate the probabilities). This could at best be used to produce a generic assessment of trustworthiness of the PFSM, but there is no apparent means of deriving levels of trustworthiness for specific paths through the model.
Subjective Logic [9] is a relatively recent framework that was developed to reason about such uncertainties. It provides a formalism that captures a probability and associated uncertainty as a 'subjective opinion'. It also provides a variety of operators that enable us to combine subjective opinions (conjunction, disjunction, fusion, etc.), whilst also computing the associated uncertainty.
In this article we use Subjective Logic to reason about the uncertainty pertaining to probabilities of events occurring at different states in a state machine. We show how Subjective Logic can be used to reason about the cumulative uncertainty of sequences of events in a model. As a result, any path through a model corresponding to a sequence of events can be explicitly associated with a corresponding probability and a level of 'trustworthiness' in this probability.
The article is motivated by a specific scenario: We have an FSM model of a system and a set of traces corresponding to system execution sequences (e.g., execution logs). This situation arises in most settings where Machine Learning is used to infer a model from traces (examples from adaptive GUI / Android app testing are cited here [10], [11], but there are similar approaches for other areas). The ability to associate trustworthiness with predicted likelihoods of sequences would enable us to leverage this additional information, to avoid acting on untrustworthy predictions, or even to correct predictions. As an example of a potential application area, in their work on Android testing, Choi et al. [10] trace paths across inferred FSMs to identify test sequences that are as long as possible (to avoid frequent expensive restarts). In this setting, one might leverage the second-order uncertainty surrounding the feasibility of a sequence to avoid attempting impossible paths.
To enable this we propose Subjective Opinion State Machines (SOSMs). We show how these can be automatically derived from a state machine and an accompanying set of traces. In an SOSM, for state q, there is a subjective opinion that captures probability, coupled with a degree of second-order uncertainty, of a given event occurring in state q.
The contributions of this article are as follows:
• We introduce SOSMs, which generalise PFSMs by replacing probabilities with subjective opinions, enabling the 'belief' corresponding to a transition to be accompanied by an explicit level of uncertainty (Section III).
• We provide an algorithm that generates an SOSM from a traditional (non-probabilistic) state machine and a set of traces. This makes our approach applicable to a broad range of Software Engineering settings.
• We present an empirical study that evaluates the approach on 47 published state machine models.

II. BACKGROUND
We start with definitions of state machines and probabilistic state machines and then describe Subjective Logic.

A. State Machines and Probabilistic State Machines
Definition 2.1. A Finite State Machine (FSM) is defined by a tuple (Q, q_0, A, δ, Q_F). Q is the finite set of states, q_0 ∈ Q is the initial state, and A is the finite alphabet. δ : Q × A → Q is the state transition function, and Q_F ⊆ Q represents the set of accepting states.
An FSM processes a sequence of elements from A; we use A* to denote the set of such sequences. Given FSM F and string x ∈ A*, x is accepted if there exists a corresponding path (q_0, x_0, q_1, x_1, ..., x_n, q_n), such that q_n ∈ Q_F. If there is no such path, or q_n ∉ Q_F, then x is 'rejected'. Observe that the FSMs used here are deterministic: for each state and event, there is at most one possible next state. However, a non-deterministic FSM can always be mapped to an equivalent deterministic FSM.
Definition 2.2. A Probabilistic Finite State Machine (PFSM) is defined by a tuple (Q, q_0, A, δ, Q_F, P). Q, q_0, A, δ and Q_F are defined as in Definition 2.1. The additional element P is a transition probability function P : Q × A → [0, 1]. For all q ∈ Q, Σ_{a∈A} P(q, a) = 1.
Probabilities model how likely it is that a particular event a occurs in state q. We can define a function P_T that gives the probability of a transition occurring from a given state.
Given PFSM PF and string x ∈ A*, one can derive probability p(x) by tracing the corresponding path and multiplying together the probabilities that label its transitions.
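The path-tracing computation can be sketched as follows. This is a minimal illustration under assumptions of our own: the dictionary encoding of δ and P, the function name path_probability and the toy machine are all hypothetical, and acceptance (membership of the final state in Q_F) is deliberately ignored.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical encoding of a PFSM: delta maps (state, event) to the next
# state, and prob maps (state, event) to the transition probability P(q, a).
Delta = Dict[Tuple[str, str], str]
Prob = Dict[Tuple[str, str], float]


def path_probability(delta: Delta, prob: Prob, q0: str,
                     x: Sequence[str]) -> float:
    """Multiply P(q, a) along the unique path that x traces from q0."""
    q, p = q0, 1.0
    for a in x:
        if (q, a) not in delta:
            return 0.0  # no such transition: the sequence cannot occur
        p *= prob[(q, a)]
        q = delta[(q, a)]
    return p


# Toy two-state machine with illustrative values only.
delta = {("s0", "a"): "s1", ("s0", "b"): "s0", ("s1", "a"): "s0"}
prob = {("s0", "a"): 0.5, ("s0", "b"): 0.5, ("s1", "a"): 1.0}
print(path_probability(delta, prob, "s0", ["b", "a", "a"]))
```

Because the machine is deterministic, each sequence traces at most one path, so a single left-to-right pass suffices.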

B. First- and Second-Order Probabilities
There have been many attempts to offer different taxonomies of uncertainty, several of which are discussed by Troya et al. in their survey of uncertainty in software models [2]. Perhaps the most common way of classifying uncertainty is to divide it into 'epistemic' or 'aleatory' uncertainty [8]. Epistemic uncertainty refers to the situation where uncertainty arises because of a fundamental lack of information or knowledge about the phenomenon in question. Aleatory uncertainty [8] on the other hand refers to the intrinsic variability or randomness of a phenomenon.
The uncertainty intrinsic to a probabilistic statement (e.g., "I believe that there is a 60% chance my local team will win the match today.") is captured as a 'first-order' probability (this is irrespective of whether the uncertainty is aleatory or epistemic [9]). However, our interpretation of that statement might change if we learn that it is made by somebody who has never seen the team in question play, as opposed to someone who has coached the team for many years. This degree of trust (or lack thereof) in a first-order probability is referred to as a 'second-order probability' [9].
The notion that first-order probabilities can be subject to uncertainty is well established. The question of how to reason about probabilities that are subject to different levels of uncertainty is a longstanding one. In Boole's 1854 "An Investigation of the Laws of Thought" [12], he criticises the assumption that it is possible to "... assign the probabilities with perfect rigour... " even when the problem "... admits only of an indefinite solution".

C. Subjective Logic
There have been numerous efforts over the last 60 years to capture second-order probabilities and to reason about them. Notable examples include Dempster-Shafer Theory [13], Evidential Reasoning [14], the Imprecise Dirichlet Model [15], and Fuzzy Logic [16]. In recent years Subjective Logic has emerged as a particularly flexible and general basis upon which to reason about probabilities and their respective uncertainties. It subsumes Dempster-Shafer Theory, and incorporates a range of operators that make it a generalisation of traditional probabilistic logic [9].
Subjective Logic is based on the premise that it is possible to associate traditional propositions with a belief that the proposition is true and a level of uncertainty, representing the epistemic uncertainty in the assessment of the proposition. These 'subjective opinions' can be reasoned about and combined with a variety of operators. Below we introduce the fragments of Subjective Logic used in this article (a more complete reference is available in Jøsang's book [9]).

1) Subjective Opinions:
Subjective opinions express beliefs about the truth of propositions. The key difference between subjective opinions and conventional probabilistic statements is the ability to associate them with an explicit level of second-order uncertainty. With reference to the statement of confidence in the performance of a local team (Section II-B), a subjective opinion can also capture a level of second-order uncertainty associated with that statement: "I believe that there is a 60% chance my local team will win the match today, but I only give my assessment a 10% chance of being accurate [perhaps because I have no experience of watching them or their opposition play]".
There are three types of subjective opinion [9]: Binomial opinions, multinomial opinions, and hypernomial opinions. Binomial opinions capture a belief (and associated uncertainty) regarding a single proposition that can be true or false. Multinomial opinions generalise this: the proposition in question can take on one of several possible values (not just true or false). Hypernomial opinions further generalise this to enable the expression of a belief that one of a set of values is true, without the need to identify a particular value which is believed to be true. In this article we use the first two types: binomial and multinomial opinions. Throughout this article we base our notation on Jøsang's notation [9].
Definition 2.3. A binomial subjective opinion is an opinion over a binary domain 𝕏 = {x, x̄}, where x is a random variable in 𝕏. A binomial opinion about the truth of x is the ordered tuple ω_x = (b_x, d_x, u_x, a_x), where:
• b_x is a belief mass in support of x being true.
• d_x is a belief mass in support of x being false (i.e., x̄ being true).
• u_x is a scalar uncertainty mass representing the 'vacuity of evidence'.
• a_x is a 'base rate': a prior probability of x without any evidence.
For an opinion to be valid, the following 'additivity requirement' must hold: b_x + d_x + u_x = 1.
The 'base rate' is akin to the 'prior probability distribution' in a Bayesian setting. To use our earlier example, it may be common knowledge that our local team has traditionally won 90% of its games, so that the 'a-priori' probability of a win can be taken as 0.9 (a_x = 0.9). However, we might also know that for this match one of our key players is injured, which changes our assessment of the chances of a win, and introduces a high degree of uncertainty. We might only give a 15% chance that whatever we estimate is accurate (u_x = 0.85), but still marginally favour a win with a probability of 60%, which results in the remaining belief mass of 0.15 (1 − u_x).
It is worth highlighting two notable opinions. First, u_x = 1 is a 'vacuous opinion': nothing is known about x (there is zero belief mass). Second, b_x = 1 or d_x = 1: x is dogmatically true or false respectively (and so u_x = 0).
Definition 2.4. The 'projected probability' [9] for x is defined as: P(x) = b_x + a_x u_x. This represents the overall belief in x, factoring in the prior probability and the uncertainty.
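As an illustration, a binomial opinion and its projected probability can be written down directly. The class name BinomialOpinion, its field names and the numerical tolerance are our own assumptions rather than part of the formalism; the sketch simply checks that belief, disbelief and uncertainty sum to 1 and evaluates P(x) = b_x + a_x u_x.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinomialOpinion:
    b: float  # belief mass in support of x
    d: float  # disbelief mass (support for x being false)
    u: float  # uncertainty mass
    a: float  # base rate (prior probability of x)

    def __post_init__(self) -> None:
        # Additivity requirement: belief, disbelief and uncertainty sum to 1.
        if abs(self.b + self.d + self.u - 1.0) > 1e-9:
            raise ValueError("b + d + u must equal 1")

    def projected(self) -> float:
        """Projected probability P(x) = b + a * u (Definition 2.4)."""
        return self.b + self.a * self.u


vacuous = BinomialOpinion(b=0.0, d=0.0, u=1.0, a=0.5)   # nothing is known
dogmatic = BinomialOpinion(b=1.0, d=0.0, u=0.0, a=0.5)  # x is certainly true
print(vacuous.projected(), dogmatic.projected())
```

For the vacuous opinion the projected probability falls back to the base rate, whereas for the dogmatic opinion the base rate plays no role.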
In multinomial opinions, a belief is a distribution (across propositions) with an uncertainty in the whole distribution:
Definition 2.5. For a multinomial subjective opinion, let 𝕏 be a domain, where |𝕏| > 2. Let X be a random variable in 𝕏. A multinomial opinion over the random variable X is the ordered triplet ω_X = (b_X, u_X, a_X), where:
• b_X is a belief mass distribution over 𝕏.
• u_X is a scalar uncertainty mass representing the 'vacuity of evidence'.
• a_X is a prior probability distribution over 𝕏.
For an opinion to be valid, the following 'additivity requirement' must hold: u_X + Σ_{x∈𝕏} b_X(x) = 1.
A vacuous opinion can be expressed by using u_X = 1 and ∀x ∈ 𝕏 : b_X(x) = 0. A traditional probability distribution occurs when u_X = 0.
Sometimes it can be helpful to 'extract' a belief about a single proposition from a multinomial opinion. This process is referred to as 'coarsening' [9], and works as follows.
Definition 2.6. Given a multinomial opinion ω_X = (b_X, u_X, a_X) where X is some random variable in 𝕏, it is possible to coarsen it to a binomial opinion ω_x = (b_x, d_x, u_x, a_x) for some x ∈ 𝕏 such that: b_x = b_X(x), d_x = Σ_{x′∈𝕏, x′≠x} b_X(x′), u_x = u_X, and a_x = a_X(x).
A binomial opinion can be visualised as an equilateral barycentric triangle [17]. Here, each of the vertices in the triangle represents a maximum value for belief, disbelief and uncertainty respectively, with the minimum value sitting on the opposite axis. For example, the maximum value for Uncertainty is at the top vertex of the triangle, and the lowest value is on the mid-point of the Disbelief-Belief axis of the triangle. All possible binomial opinions sit within this triangle (this is ensured by the inherent constraint that all values sum to 1). Examples of barycentric triangles for two subjective logic opinions are shown in Fig. 1.

Fig. 1. Barycentric triangles and corresponding beta-distributions for two binomial subjective opinions. An opinion is captured by a coordinate within the triangle, where the distance from each edge represents the degree of uncertainty, belief and disbelief. The red line represents the a-priori probability, and the blue line represents the projected probability.
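Returning to the coarsening step of Definition 2.6, the following sketch shows one way to compute it. It assumes the usual reading of coarsening [9], in which the belief in x is retained, the belief assigned to every other value becomes disbelief, and the uncertainty and the base rate of x carry over unchanged. The function name coarsen, the dictionary encoding and the example values are our own.

```python
from typing import Dict, Tuple

Binomial = Tuple[float, float, float, float]  # (b, d, u, a)


def coarsen(belief: Dict[str, float], u: float,
            base: Dict[str, float], x: str) -> Binomial:
    """Reduce a multinomial opinion (belief, u, base) to a binomial one on x."""
    b_x = belief.get(x, 0.0)
    # Belief assigned to any other value counts as disbelief in x.
    d_x = sum(mass for y, mass in belief.items() if y != x)
    return (b_x, d_x, u, base.get(x, 0.0))


# Illustrative multinomial opinion over three values.
belief = {"a": 0.3, "b": 0.2, "c": 0.1}
base = {"a": 0.5, "b": 0.25, "c": 0.25}
print(coarsen(belief, 0.4, base, "a"))
```

Since the belief masses and the uncertainty of a valid multinomial opinion sum to 1, the coarsened tuple automatically satisfies the binomial additivity requirement.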

2) Subjective Opinions as Probability Distributions:
When it comes to reasoning about uncertainty, it is commonplace to visualise and understand it in terms of probability distributions. In the context of Software Engineering, for example, Duan et al. have shown how to represent safety properties as beta-distributions [18]. One limitation of distributions is that there is no established basis upon which to analytically combine them or reason about them.
One useful attribute of Subjective Logic is the capability to map opinions to probability distributions [9]. The components of a binomial opinion can be mapped to the parameters that are used to express beta-distributions, and the components of multinomial opinions can be mapped to the parameters that are used to express Dirichlet distributions. We will be largely concerned with binomial opinions, and so focus on using beta-distributions.
Definition 2.7. The beta-distribution refers to a family of distributions that are a continuous version of the binomial distribution, which makes them appropriate for modelling probabilities [18]. Its probability density function is defined by two 'shape' parameters α and β, and is f(p; α, β) = p^(α−1) (1 − p)^(β−1) / B(α, β), where 0 ≤ p ≤ 1, α > 0, β > 0 and B(α, β) is the beta function.
As α increases (and β is held constant), the mode of the distribution shifts to the right (towards 1). As β increases (and α is held constant), the mode shifts to the left (towards 0).
When α and β are equal (and greater than 1) we get a symmetric distribution with an approximately Gaussian shape. When α = β = 1 we get a uniform distribution. When modelling a belief, a relative increase in α can be interpreted as an increase in supportive evidence, whereas a relative increase in β can be interpreted as an increase in evidence to the contrary. A distribution with a well-defined peak indicates that the probability is highly concentrated around a point, whereas a flatter distribution indicates higher uncertainty.
Jøsang has shown that there is a bijective mapping between subjective opinions and beta-distributions [17]. It is therefore possible to interpret a subjective opinion 'coordinate' as a continuous distribution, where 'density' represents probability. This mapping (with respect to some x) is achieved by reformulating the definition of α and β, so that these are explicitly linked to evidence that either supports or contradicts a belief:
Definition 2.8. Jøsang considers the following parameters: r_x, s_x, a_x, and W [9]. r_x represents the number of observations in support of x, and s_x represents the number of observations that contradict x. a_x is the a-priori belief in x as defined above. Finally, W is the weighting to be given to a_x. The α and β parameters are defined by: α = r_x + a_x W and β = s_x + (1 − a_x) W.
From this, the bijective relation derived by Jøsang [9] is shown in Definition 2.9. The non-informative prior weight W is set to W = 2, because of the requirement that a vacuous opinion is mapped to a uniform beta-distribution in the case of a default base rate a_x = 1/2 [9].
Definition 2.9. The bijective mapping between a binomial opinion (b_x, d_x, u_x, a_x) and a beta-distribution as expressed by the parameters r_x, s_x, a_x and W (see Definition 2.8) is as follows.

Binomial subjective opinion       | Beta-distribution (for u_x ≠ 0)
b_x = r_x / (W + r_x + s_x)       | r_x = b_x W / u_x
d_x = s_x / (W + r_x + s_x)       | s_x = d_x W / u_x
u_x = W / (W + r_x + s_x)         | 1 = b_x + d_x + u_x
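The link between evidence counts, opinions and beta parameters can be sketched as below. This assumes the standard form of the mapping given by Jøsang [9], namely b = r / (r + s + W), d = s / (r + s + W), u = W / (r + s + W), together with α = r + aW and β = s + (1 − a)W. The function names and the example counts are hypothetical.

```python
from typing import Tuple

W = 2.0  # non-informative prior weight


def opinion_from_evidence(r: float, s: float, a: float,
                          w: float = W) -> Tuple[float, float, float, float]:
    """Map r supporting and s contradicting observations to (b, d, u, a)."""
    total = r + s + w
    return (r / total, s / total, w / total, a)


def beta_parameters(r: float, s: float, a: float,
                    w: float = W) -> Tuple[float, float]:
    """Shape parameters (alpha, beta) of the corresponding beta-distribution."""
    return (r + a * w, s + (1.0 - a) * w)


# No evidence and a default base rate: vacuous opinion, uniform Beta(1, 1).
print(opinion_from_evidence(0, 0, 0.5), beta_parameters(0, 0, 0.5))

# Eight supporting and two contradicting observations (illustrative values).
b, d, u, a = opinion_from_evidence(8, 2, 0.5)
alpha, beta = beta_parameters(8, 2, 0.5)
print(b + a * u, alpha / (alpha + beta))  # projected probability, beta mean
```

Under these assumptions the projected probability of the opinion coincides with the mean of the beta-distribution, which is a convenient sanity check on any implementation of the mapping.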
This relationship between subjective opinions and the beta-distribution has been explored within Software Engineering.
In work on reasoning about uncertainty in safety cases, Duan et al. [18] consider the setting where different safety requirements are subject to varying degrees of evidence. They show how the beta-distribution can capture the relationship between evidence and uncertainty.
In Fig. 1, the distributions corresponding to the two sample opinions are shown on the right. This illustrates how a higher level of belief will end up with a higher peak in the distribution towards a probability of 1. A higher level of uncertainty will result in a 'flatter' distribution.
3) Multiplying Subjective Opinions:
Subjective Logic is associated with operators that combine subjective opinions. All of the traditional probabilistic logic operators have corresponding operators in Subjective Logic, making it possible to factor second-order uncertainty into traditional probabilistic reasoning techniques. Subjective Logic also has additional 'fusion' operators, making it possible to combine subjective opinions more flexibly [9]. We use the binomial multiplication operator [9].
Definition 2.10. Given binomial opinions ω_x = (b_x, d_x, u_x, a_x) and ω_y = (b_y, d_y, u_y, a_y), the binomial opinion ω_{x∧y} on the conjunction (multiplication) (x ∧ y) is:
b_{x∧y} = b_x b_y + [(1 − a_x) a_y b_x u_y + a_x (1 − a_y) u_x b_y] / (1 − a_x a_y)
d_{x∧y} = d_x + d_y − d_x d_y
u_{x∧y} = u_x u_y + [(1 − a_y) b_x u_y + (1 − a_x) u_x b_y] / (1 − a_x a_y)
a_{x∧y} = a_x a_y
The operator involves the multiplication of the respective belief values. However, this product is then modulated according to the respective uncertainties and a-priori probabilities of the two beliefs. A more elaborate discussion of this operator is available elsewhere [9] (see also footnote 6). One useful note pertaining to the validity of the operator is the property, observed by Jøsang, that the projected probability (Definition 2.4) of the product computed by this operator is always equal to the expected probability of the product of the equivalent beta-distributions [9].
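A minimal sketch of binomial multiplication follows. It assumes the standard form of the operator from Jøsang [9], in which disbelief combines as d_x + d_y − d_x d_y, the base rates multiply, and the belief and uncertainty terms are corrected by a factor of 1 / (1 − a_x a_y). The tuple encoding, the function names and the example opinions are our own, and the case a_x a_y = 1 is simply rejected.

```python
from typing import Tuple

Opinion = Tuple[float, float, float, float]  # (b, d, u, a)


def multiply(x: Opinion, y: Opinion) -> Opinion:
    """Binomial multiplication (conjunction) of two opinions."""
    bx, dx, ux, ax = x
    by, dy, uy, ay = y
    k = 1.0 - ax * ay
    if k == 0.0:
        raise ValueError("undefined when both base rates equal 1")
    b = bx * by + ((1 - ax) * ay * bx * uy + ax * (1 - ay) * ux * by) / k
    d = dx + dy - dx * dy
    u = ux * uy + ((1 - ay) * bx * uy + (1 - ax) * ux * by) / k
    return (b, d, u, ax * ay)


def projected(o: Opinion) -> float:
    """Projected probability b + a * u."""
    return o[0] + o[3] * o[2]


x = (0.6, 0.2, 0.2, 0.5)  # illustrative opinions
y = (0.3, 0.3, 0.4, 0.5)
xy = multiply(x, y)
print(xy, projected(xy), projected(x) * projected(y))
```

With this form of the operator, the result remains a valid opinion and its projected probability equals the product of the two input projected probabilities, which mirrors ordinary probability multiplication when both uncertainties are zero.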

III. SUBJECTIVE OPINION STATE MACHINES
Probabilities in a PFSM can be associated with a considerable amount of second-order uncertainty. For example, where the state machine is created by a human modeller, they might be less sure about the probabilities governing some aspects of system behaviour than others. In a setting where the state machine is inferred by some inference algorithm [19], the probabilities labelling transitions would be entirely dependent on the quality and amount of training data, which will vary from one transition to another.
In both cases, it would be helpful to convey the level of second-order (un-)certainty alongside the probability for a transition, or a path through the machine. If an inferred state machine was used in an operational (e.g., automotive [20]) context, it would be helpful to distinguish between predictions with a low degree of uncertainty, and those that are entirely speculative (and thus entirely uncertain). Conventional PFSMs do not offer such a distinction.
In this section we introduce Subjective Opinion State Machines (SOSMs), where transitions can be associated with explicit levels of uncertainty. We also introduce a technique whereby traditional FSMs that are accompanied by a set of traces can be used to automatically derive SOSMs. This means that our approach can be used as a post-processing step for any technique that seeks to reason about a state machine with respect to a set of observations, such as testing techniques or state machine inference techniques.
For the sake of illustration, we use a toy example of a simple editor, where the basic FSM is shown in Fig. 2. A goal in this scenario could, for example, be to probabilistically model a user's interaction with the editor in a similar vein to the work on PFSM inference for GUIs [21], [22].

Footnote 6: A visual demo of the multiplication operator, along with others, is available here: https://folk.universitetetioslo.no/josang/sl/Op.html

A. Definition of Subjective Opinion Machines
A Subjective Opinion State Machine (SOSM) combines an FSM structure with subjective opinions. Each state is linked to a multinomial subjective opinion, which captures the belief that a given element of the alphabet can be consumed at that state. Being a subjective opinion, this includes an explicit level of second-order uncertainty.
Definition 3.1. An SOSM is defined as a tuple (Q, q_0, A, δ, Q_F, Ω, S). Elements Q, q_0, A, δ and Q_F are defined as in an FSM (Section II-A). Ω is a set of multinomial subjective opinions where 𝕏 = A (Definition 2.5). S : Q → Ω is a function from states to subjective opinions over elements in A and is defined for all q ∈ Q, where q has at least one outgoing transition. The subjective opinion for state q ∈ Q is denoted ω_q. By default, we assume a uniform distribution of prior probabilities over the elements in A that label transitions from q.
As with PFSMs, we are interested in the likelihood of an event in A occurring when in a state q. However, we also want to capture the associated second-order uncertainty.
As an example, consider state D in our editor example (Fig. 2). We know that some events in A (exit and open) are impossible from state D, so their a-priori probabilities and their belief values are 0. Let us suppose that we have no prior reason to believe that any one of the remaining possible events would be more likely than the others in state D. This means that the events save, close, and edit have an a-priori probability of 1/3 each. During operation of the program, however, we observe that when in state D the edit event occurs with 60% probability, the save event with a 30% probability, and the close event with a 10% probability. However, we have only observed state D a small number of times and would only rate our certainty in these probabilities at 20%. The corresponding subjective opinion is shown in Table I.

B. Derivation of SOSMs From FSMs and Traces
Given an FSM M, there is a well-established method for taking a set of observations of sequences accepted by M and deriving probabilities for each transition [23] (Chapter 17.1). This can be achieved by counting the number of times each transition is traversed and dividing a transition's frequency by the sum of frequencies across all transitions from its source state. The process is illustrated in Fig. 3 in which, for a state q and event a, the machine on the left gives the number of times that the traces included a occurring in q; the machine on the right is the result of normalising these for each state so that they form probabilities (i.e., the sum of the values, over transitions leaving q, is 1).
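The counting-and-normalising procedure can be sketched as follows. The function name estimate_probabilities, the dictionary encoding of δ and the three-transition toy machine are hypothetical (the machine is not the editor of Fig. 2), and every trace is assumed to be accepted by the machine, so that each step has a matching transition.

```python
from collections import Counter
from typing import Dict, Iterable, Sequence, Tuple

Delta = Dict[Tuple[str, str], str]  # (state, event) -> next state


def estimate_probabilities(delta: Delta, q0: str,
                           traces: Iterable[Sequence[str]]
                           ) -> Dict[Tuple[str, str], float]:
    """Relative frequency of each traversed transition per source state."""
    counts: Counter = Counter()
    for trace in traces:
        q = q0
        for a in trace:
            counts[(q, a)] += 1
            q = delta[(q, a)]  # raises KeyError if the trace is not accepted
    totals: Counter = Counter()
    for (q, _), n in counts.items():
        totals[q] += n
    return {(q, a): n / totals[q] for (q, a), n in counts.items()}


# Toy machine and traces with illustrative values only.
delta = {("A", "open"): "B", ("B", "edit"): "B", ("B", "close"): "A"}
traces = [["open", "edit", "close"],
          ["open", "edit", "edit", "close"],
          ["open", "close"]]
print(estimate_probabilities(delta, "A", traces))
```

Transitions that are never traversed do not appear in the result, which is one reason why the number of underlying observations matters when interpreting the estimates.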
As an example, let us assume that we have observed a small set of sequences for the FSM in Fig. 2. The probabilities derived from them (Fig. 3) are based on a small sample and so there is significant second-order uncertainty: an additional sequence could lead to very different probabilities. In addition, usage can vary significantly between users, leading to aleatory uncertainty [8].
Our approach of deriving SOSMs from traces is similar to the approach for PFSMs described above. However, instead of using trace data to derive a simple probability (which cannot capture second-order uncertainty), we use the same data to produce a subjective opinion for each state. The approach described here was devised with a specific application in mind: the derivation of SOSMs to make predictions of software behaviour. Thus, certain design choices (particularly around the derivation of a-priori distributions for subjective opinions and the calculation of uncertainties) might not be appropriate for other applications.
We start by providing an overview of the algorithm, following this by elaborating on (1) the parameter α used to calculate the uncertainty for each subjective opinion, and (2) the a-priori distributions for each subjective opinion.
The Algorithm. The process is captured in Algorithm 1 (Computing the SOSM From Traces). We start in lines 3-11 by creating, for each state, a distribution of frequencies spanning the alphabet of the machine (lines 4-6) and an a-priori distribution that attributes an equal probability to all outgoing transitions (lines 7-10). In lines 11-13 we calculate the uncertainty by dividing the sum of transition frequencies by parameter α and subtracting the result from 1 (line 11); if the resultant value is negative then we assign zero to the uncertainty (line 13). We then calculate the belief distribution from the frequency values, normalising with respect to uncertainty (lines 14-16). The update only happens if there is at least one observation (total ≠ 0). The belief distribution, uncertainty value and a-priori distribution form a subjective opinion for the state (lines 17-19).
The α Parameter to Calculate Uncertainty. The traditional approach to deriving probabilities for FSMs from observations involves calculating the relative frequencies (i.e., the distribution) of elements in A from each state. If we rely entirely on the resulting probability distribution over A, it is impossible to distinguish whether the distribution is the result of 10 data-points or 1,000. If it is the result of 1,000 data-points we would consider it to be much more reliable (i.e., have a much lower level of second-order uncertainty).
We incorporate a user-defined parameter α that represents the number of times a state must be traversed by a set of traces for the probabilities to be deemed to be certain. The value of α determines the balance between the probability mass that is associated with belief or disbelief, and the probability mass that is associated with uncertainty. The choice of α depends to an extent on an understanding of the underlying model and on an intuition of what a sufficient number of observations should be for a given state. In practice, this may vary depending on usage context (e.g., the amount of data that is realistically available) and on the domain (e.g., with a particularly high threshold for safety-critical systems). For the models we refer to in our evaluation, our preliminary study on the selection of α (described in Appendix B, available online) also suggests that it is safer to use higher values. Whereas selecting values of α that are too low can hide differences in terms of uncertainty between sequences, selecting high values of α (perhaps higher than they need to be) does not affect the relative balance between belief and disbelief and ensures that different levels of uncertainty can be established for a larger range of sequences.
Distributions of a-Priori Probabilities. The distributions of a-priori probabilities for each state enable us to express any prior knowledge about the relative likelihoods of events occurring at a state. As with any Bayesian approach [24], these could be specified up-front (e.g., if our domain knowledge of the program suggests that 'save' operations occur much less frequently than 'edit' operations). As a default, we presume that there is no such prior knowledge and so we use a 'uniform' prior distribution across elements in A that are possible from that state. Any element in A that is not possible from a given state is given an a-priori probability of 0. For any n remaining outgoing elements, the respective a-priori probabilities are set to 1/n.
Example. Table II contains the subjective opinions derived from the traces for the accepting states in the example (Fig. 3). To take state A as an example, there are 12 observations of outgoing transitions. The a-priori belief is uniformly distributed; 0.2 per event. Uncertainty (line 11, Algorithm 1) is 1 − 12/20 (for α = 20), which is 0.4. The remaining belief mass (0.6) is used to define the belief distribution. For A, 'open' and 'exit' are executed the same number of times, so we split this belief mass evenly between them (0.3 each), having a belief of 0 for other events. Note that, although we attribute a belief of 0 to edit, save, and close, this does not mean that they cannot happen. The uncertainty mass accounts for the possibility that they could occur, i.e., that we might have been mistaken in our attribution of beliefs. This is not captured in conventional PFSMs.
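The derivation described in this section can be sketched in code. This is not a transcription of Algorithm 1: the function name build_sosm, the data encoding and the toy machine are our own assumptions, and the prior follows the textual description (uniform over the events that label outgoing transitions of a state). Uncertainty is computed as max(0, 1 − total/α), and the remaining mass is shared among events in proportion to their observed frequencies.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple

Delta = Dict[Tuple[str, str], str]
# A multinomial opinion as (belief distribution, uncertainty, prior).
Opinion = Tuple[Dict[str, float], float, Dict[str, float]]


def build_sosm(delta: Delta, q0: str, alphabet: List[str],
               traces: Iterable[Sequence[str]],
               alpha: float) -> Dict[str, Opinion]:
    """Derive one multinomial opinion per state with outgoing transitions."""
    counts: Dict[str, Counter] = {}
    for trace in traces:
        q = q0
        for a in trace:
            counts.setdefault(q, Counter())[a] += 1
            q = delta[(q, a)]  # traces are assumed to be accepted

    opinions: Dict[str, Opinion] = {}
    for q in {source for (source, _) in delta}:
        outgoing = [a for a in alphabet if (q, a) in delta]
        prior = {a: (1.0 / len(outgoing) if a in outgoing else 0.0)
                 for a in alphabet}
        freq = counts.get(q, Counter())
        total = sum(freq.values())
        u = max(0.0, 1.0 - total / alpha)  # clamp negative values to zero
        belief = {a: 0.0 for a in alphabet}
        if total > 0:  # only update beliefs if the state was observed
            for a in alphabet:
                belief[a] = (freq[a] / total) * (1.0 - u)
        opinions[q] = (belief, u, prior)
    return opinions


# Toy machine: 12 observations at state A, split evenly, with alpha = 20.
delta = {("A", "open"): "B", ("A", "exit"): "E", ("B", "close"): "A"}
traces = [["exit"]] * 6 + [["open"]] * 6
sosm = build_sosm(delta, "A", ["open", "exit", "close"], traces, alpha=20)
print(sosm["A"])
print(sosm["B"])
```

With 12 observations and α = 20, state A receives an uncertainty of 0.4 and a belief of 0.3 for each of its two observed events, matching the arithmetic of the example above; state B is never left in these traces, so its opinion stays vacuous.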

C. The Subjective Opinion of a Sequence of Events
Given an SOSM, Subjective Logic operators can be used to reason about the likelihood and uncertainty associated with events in different states. As with PFSMs, we can use multiplication to reason about the likelihood of sequences of events. Here we provide a relatively brief and formalised description of how these operators can be applied to SOSMs.
Given a PFSM, one can follow the path corresponding to a sequence and multiply together the probabilities that label the transitions. In an SOSM, it is possible to achieve the equivalent by multiplying together the multinomial opinions on the states of a path. However, if the alphabet has size greater than one then the multiplication of multinomial opinions is exponential in terms of the sequence length [9].
Fortunately, there is a work-around. When we encounter a state, we are only concerned with the probability of a single outgoing transition corresponding to the current symbol in the sequence. As such, we can always coarsen a multinomial opinion to the Binomial opinion for the given event (Definition 2.6). Multiplying binomial opinions (Definition 2.10) is computationally much more efficient.
Given a string $x_0, x_1, \ldots, x_n \in A^*$ and an SOSM SO, we trace the path $(q_0, x_0, q_1, \ldots, x_n, q_{n+1})$. For a transition $(q, x, q')$, we obtain the (multinomial) subjective opinion S(q) for q, and then apply the binomial extraction procedure with respect to x (Definition 2.6). This leads to a sequence of binomial opinions, one for each transition on the path. The result can be obtained by multiplying these together with the binomial multiplication operator (Definition 2.10), starting from the opinion obtained at $q_0$.
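The coarsen-then-multiply procedure can be sketched as follows. Definitions 2.6 and 2.10 are not reproduced in this section, so the sketch assumes the standard forms from Subjective Logic [9]: coarsening treats belief in every other event as disbelief, and multiplication is the binomial product operator. The class `Binomial`, the helper names, and the opinion used for the second step are hypothetical.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Binomial:
    b: float  # belief that the event occurs
    d: float  # disbelief
    u: float  # uncertainty mass
    a: float  # a-priori probability (base rate)

    def projected(self) -> float:
        """Projected probability b + a*u."""
        return self.b + self.a * self.u


def coarsen(belief, uncertainty, prior, event):
    """Assumed form of Definition 2.6: belief in other events becomes disbelief."""
    b = belief[event]
    return Binomial(b, 1.0 - b - uncertainty, uncertainty, prior[event])


def multiply(x, y):
    """Assumed form of Definition 2.10: binomial multiplication."""
    k = 1.0 - x.a * y.a
    if k == 0.0:
        raise ValueError("undefined when both base rates are 1")
    b = x.b * y.b + ((1 - x.a) * y.a * x.b * y.u + x.a * (1 - y.a) * x.u * y.b) / k
    d = x.d + y.d - x.d * y.d
    u = x.u * y.u + ((1 - y.a) * x.b * y.u + (1 - x.a) * x.u * y.b) / k
    return Binomial(b, d, u, x.a * y.a)


def sequence_opinion(opinions):
    """Fold the per-transition binomial opinions into one opinion for the path."""
    return reduce(multiply, opinions)


# State A from the running example, coarsened with respect to 'open'.
state_a = coarsen({"open": 0.3, "exit": 0.3}, 0.4, {"open": 0.2, "exit": 0.2}, "open")
# Hypothetical second step: belief 0.15, uncertainty 0.85, assumed base rate 0.5.
step_two = Binomial(0.15, 0.0, 0.85, 0.5)
result = sequence_opinion([state_a, step_two])
print(result, result.projected())
```

Two properties make this a useful sanity check: the masses of the product still sum to 1, and its projected probability equals the product of the projected probabilities of the factors, which is the PFSM-style multiplication the opinions generalise.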

IV. UNCERTAINTY IN INFERRED STATE MACHINES
State machine inference [23] has become an active area of research within Software Engineering, especially within the context of mining software or hardware specifications [20], [21], [22], [25], [26]. State machines are inferred from samples of observed sequences of events. Although the task of inferring a state machine is NP-hard [27], [28], numerous approaches have been developed that are capable of inferring approximately correct models [29] in polynomial time.
Before we describe our use of SOSMs with FSM inference, it is worth setting this into the context of existing similar approaches for inferring PFSMs. The task of PFSM inference can be split into two families [23], [30]: (1) augmenting a given FSM with probabilities to enable probabilistic predictions (also referred to as 'probability estimation' [23]), or (2) inferring the PFSM structure by taking into account the distributions of sequences in the trace data.
We focus on probability estimation (Section III-B): taking an inferred FSM and its training set, and estimating the probabilities (and in our case uncertainties) for different paths through the model. We use the approach described in Section III-B to convert an inferred FSM into an SOSM. We show how subjective opinions can help one to interpret uncertain predictions and to correct incorrect predictions.

A. Motivation for the Use of SOSMs
Probabilities in PFSMs are first-order. As discussed in Section II-B, they are 'point-values' that capture the likelihood of a transition, but do not convey (second-order) uncertainty. In other words, if a state has two outgoing transitions, each of which has a probability of 0.5, we cannot tell from these numbers alone whether the probabilities are the guaranteed equivalent of a fair coin-flip, or whether they are equal because there is no information to suggest otherwise. Let us consider the model in Fig. 4, which was inferred by the EDSM state merging algorithm [31] from the traces used for Fig. 3. Superficially, the probabilities look sensible. However, there are some striking features. For example, from state E (entered via a save event) there is a 100% probability that the next event will be close. This is consistent with the observations: only three of the six traces involve state E, and none of the traces feature the scenario where a file is saved and subsequently edited.
In practice we know that this does not reflect the actual system (see Fig. 2): it is possible to carry on editing the document after saving but the sample of sequences used missed this scenario. There is a high degree of epistemic uncertainty [2]; having only observed the state three times, we lack sufficient evidence to conclude that there are no missing transitions from that state. As a result, this value of 100% is subject to a high degree of second-order uncertainty and should be treated with a substantial degree of caution. This is where subjective opinions can play an important role. In the rest of this section we show how SOSMs can be computed from the same trace data. These make it possible to explicitly factor-in and reason about underlying uncertainties when referring to a prediction made.

B. Computing an SOSM for an Inferred State Machine
Uncertainty in subjective opinions conveys the extent to which probabilities for a given state should be trusted. The approach of deriving an SOSM from an FSM and a set of traces (Algorithm 1) can easily be used as a post-processing technique for any model (not just an inferred model). The underlying inference algorithm does not matter (it could be a 'passive' state merging algorithm [26] or an 'active' algorithm in the Angluin L* style [32]).
The only requisite is that we have the FSM and the traces used. [...] final element of any sequence should cause the sequence to be rejected.
[Table III. Caption partially recovered: "... (Fig. 4), for α = 20".]
Once an FSM has been inferred, to obtain an SOSM, we apply the process described in Algorithm 1. The set Traces used is the union of $T^+$ and the set $T^-$, with $T^-$ stripped to the accepting prefixes. The input FSM used in the algorithm corresponds to the FSM inferred from Traces.
Once the SOSM has been derived, it is possible to compute the second-order uncertainty associated with a path (Section III-C). To illustrate this, consider the sequence: open,edit,save,close. Using the probabilities in Fig. 4 (and treating the machine as a PFSM), this sequence would yield probability 0.67 × 0.6 × 0.2 × 1 = 0.0804, or 8.04%. This conveys nothing about the 'trustworthiness' of the probabilities.
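For reference, the PFSM calculation above is a plain product of the transition probabilities along the path. The snippet below is a minimal sketch; the function name `path_probability` is hypothetical, and the probabilities are those quoted from Fig. 4.

```python
import math


def path_probability(transition_probabilities):
    """Multiply the first-order probabilities labelling a path through a PFSM."""
    return math.prod(transition_probabilities)


# open, edit, save, close with the probabilities quoted from Fig. 4.
p = path_probability([0.67, 0.6, 0.2, 1.0])
print(f"{p:.4f}")  # prints 0.0804
```

The result is a single point value; nothing in it records that the final factor of 1 rests on only three observations of state E.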
Suppose we choose α = 20 for Algorithm 1; we assume that 20 observations of a state is sufficient. The resulting subjective opinions are shown in Table III (recall from Definition 3.1 that we do not define subjective opinions for states without outgoing transitions, so C and F are omitted).
We obtain binomial opinions by coarsening the multinomial opinions. The resultant opinions are shown in Table IV. Whereas the probability of the transition $E \xrightarrow{\text{close}} A$ was 1 in the PFSM, our binomial opinion differs. Only a small amount of belief mass (0.15) is attributed to close. The bulk of the belief mass (0.85) is attributed to uncertainty; there may be other events that are possible, but haven't occurred in our sample.
Multiplying binomial opinions (Definition 2.10) gives the opinion shown in Table IV (bottom row), and visualised in Fig. 5. The PFSM probability (8.04%) is close to the peak of the beta-distribution. However, uncertainty is also captured and is visualised in the beta-distribution (right of Fig. 5). This shows that the probability could be significantly higher: the tail only levels off at 0 when p > 0.5.

C. Correcting Predictions
Inferred FSMs can be error-prone [26]: they can be too general (accept too many sequences) or too specific (reject too many sequences). Once an FSM has been converted to an SOSM, however, a prediction of whether a sequence should be accepted is given a (second-order) uncertainty. If a sequence is classified as 'Accept,' but has high uncertainty, there is a greater chance that it is actually a 'Reject'.
The relationship between the subjective opinions calculated for sequences and the (in-)correctness of the corresponding prediction depends on the training data. In this subsection we show how it is possible, for an SOSM and associated sequence data, to infer a classifier. This makes it possible to improve the classification (accept/reject decision) accuracy for future sequences that we have not yet encountered and to correct any incorrect classifications that would have been made by the underlying FSM.
The training set for this classifier is 'recycled' from the set of sequences used to derive the SOSM. Instead of labelling sequences as 'accept' or 'reject,' we label the corresponding subjective opinions (i.e., belief and uncertainty). The classifier trained on this set of decisions can then be used to provide an accept/reject decision on the basis of the underlying subjective opinion for sequences, even if they are not part of the original set of traces.
The training process is shown in Algorithm 2. First, each positive trace in T + is traced in the SOSM (line 5), the subjective opinion is calculated (line 6), and the resulting belief is linked to an 'ACCEPT' (line 7). This is repeated for the traces in T − , with the resulting opinions linked to 'REJECT' (lines 9-13). The resulting set is used to train a classifier (line 14). The choice of classifier is flexible. In our evaluation we opt for a Random Forest classifier [33].
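The way the training set is 'recycled' can be sketched as below. This is an illustration, not Algorithm 2 itself: `opinion_of` is a hypothetical callable standing in for tracing a sequence through the SOSM and computing its opinion, the toy opinion values are invented, and a 1-nearest-neighbour rule is swapped in for the Random Forest so that the example needs only the standard library.

```python
def build_training_set(positive, negative, opinion_of):
    """Pair each trace's (belief, uncertainty) with its ACCEPT/REJECT label."""
    rows = []
    for label, traces in (("ACCEPT", positive), ("REJECT", negative)):
        for trace in traces:
            belief, uncertainty = opinion_of(trace)
            rows.append(((belief, uncertainty), label))
    return rows


def nearest_neighbour(rows):
    """Stand-in classifier (NOT a Random Forest): label of the closest training row."""
    def classify(features):
        def squared_distance(row):
            return sum((p - q) ** 2 for p, q in zip(row[0], features))
        return min(rows, key=squared_distance)[1]
    return classify


# Invented opinions for four toy traces; a real run would compute these from the SOSM.
toy_opinions = {
    ("open", "close"): (0.50, 0.20),
    ("open", "edit", "close"): (0.40, 0.30),
    ("save",): (0.05, 0.90),
    ("close",): (0.10, 0.80),
}
rows = build_training_set(
    positive=[("open", "close"), ("open", "edit", "close")],
    negative=[("save",), ("close",)],
    opinion_of=toy_opinions.__getitem__,
)
classify = nearest_neighbour(rows)
print(classify((0.45, 0.25)), classify((0.08, 0.85)))
```

The point of the design is that the classifier never sees the sequences themselves, only their opinions, so it can be applied to any sequence for which an opinion can be computed.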

V. EVALUATION
The use of SOSMs for inferred automata is based on the rationale that the additional information in subjective opinions should be useful for interpreting their predictions. Subjective opinions should convey the trustworthiness of a prediction. Here we explore whether this is the case:
Research Question 1. Does the level of uncertainty associated with a prediction tend to be higher when a smaller number of traces has been used to derive the SOSM?
When working with inferred models on software engineering tasks, the second-order uncertainty pertaining to their predictions should reflect the amount of evidence from which the underlying model has been derived.

Research Question 2. To what extent, and in what respect, are the computed subjective opinions able to discriminate between correct and incorrect predictions?
If, as we expect, subjective opinions are able to discriminate between correct and incorrect predictions, this gives rise to the question of whether this ability is useful.
Research Question 3. Does the use of a classifier (as detailed in Algorithm 2) lead to more accurate predictions?

A. Methodology
1) RQ1: Does the Level of Uncertainty Increase as the Number of Traces Used to Derive SOSMs Decreases?:
a) Subjects: For this RQ we applied the approach to log data recorded from a production system. For this we used the Android trace from the LogHub data set [34]. This consists of 1,555,005 OS-level messages.
An Android log comprises a list of events. Each event is associated with a process identifier, a thread identifier within that process, and the name of a function, along with other logging data. For this RQ we inferred an SOSM corresponding to each process, where the sequence of events for each thread within the process is treated as a trace.
Splitting the LogHub trace in this way led to 643 sets of traces (i.e., 643 processes) containing a total of 10,917 traces (sequences of events belonging to different threads). The number of traces per process forms a long-tailed distribution. The largest set had 2,001 traces (followed by a process that had 1,021 traces and another that had 527), and 187 processes had only one trace.
We omitted single-trace state machines from this experiment, as explained below.
b) Analysis: For each Android process we inferred a state machine from the underlying trace set. For this we used the Evidence Driven State Merging (EDSM) algorithm with the Blue-Fringe search window [31], a baseline that is commonly used to evaluate passive state machine inference approaches [26]. In state merging algorithms it is possible to set a "minimal state match" score threshold, where a pair of states is only merged if a given number of transitions in their outgoing state transition structures overlap. We chose a threshold of 10, because with values below 10 the inferred models tended to 'collapse' into trivial single-state machines.
We used leave-one-out cross-validation. One trace was kept aside, and the remainder were used to infer the machine. The left-out trace was used to compute a walk over the inferred machine, to produce the final uncertainty value for that walk. This process was repeated for all traces in the set. Since this process requires at least two traces in a set, we omitted models for which there was only a single trace (187 of the 643 processes).
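The leave-one-out loop can be sketched generically. The callables `infer` and `walk_uncertainty` are hypothetical placeholders for EDSM inference plus SOSM derivation and for the walk over the resulting machine; the toy versions below exist only to make the sketch runnable.

```python
def leave_one_out(traces, infer, walk_uncertainty):
    """Hold out each trace in turn, infer from the rest, and score the held-out trace."""
    if len(traces) < 2:
        raise ValueError("leave-one-out needs at least two traces")
    results = []
    for i, held_out in enumerate(traces):
        training = traces[:i] + traces[i + 1:]
        model = infer(training)
        results.append(walk_uncertainty(model, held_out))
    return results


# Toy stand-ins: the 'model' is the set of events seen in training, and the
# 'uncertainty' is the fraction of held-out events that the model has never seen.
def toy_infer(training):
    return {event for trace in training for event in trace}


def toy_walk_uncertainty(model, trace):
    return sum(event not in model for event in trace) / len(trace)


traces = [["a", "b"], ["a", "c"], ["a", "b"]]
scores = leave_one_out(traces, toy_infer, toy_walk_uncertainty)
print(scores)
```

The explicit guard mirrors the reason single-trace processes were excluded: with one trace there is nothing left to infer a machine from.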

2) RQ2: Are the Computed Subjective Opinions Able to Discriminate Between Correct and Incorrect Predictions?:
a) Subjects: To answer this question (and RQ3), we required a set of reference models. We identified two sets of state machines that have been used in the state machine literature (mainly related to testing and verification). We used the ACM/SIGDA benchmarks [35], a set of FSMs used in workshops between 1989 and 1993. We also used some machines from a collection curated by Neider et al. [36].
We used Mealy machines that were not inferred, but represented a genuine 'ground truth'. We excluded those where we could not generate suitable traces. In total, 11 models were left out. These are listed in Appendix A, available in the online supplemental material. The final set of 47 models is listed in Table V.
b) Generating traces: For each model we created a set of positive traces (sequences ending in a final state) and a set of negative traces (sequences whose final element was not accepted). For both sets, the maximum length was the depth of the machine (the length of the longest shortest-path) plus 5. Traces were generated as random walks. We generated 200 random walks that were accepted, and 200 that were rejected. We omitted walks that were prefixes of existing sequences, ensuring that each sequence added information.
The ACM/SIGDA benchmark machines were fully-specified (every state had an outgoing transition for every element in the alphabet). In such machines, elements in the alphabet for which there should be no response by the machine tend to be modelled by silent loops. For these machines, we removed the silent self-loops, so that the corresponding sequences involving them would be rejected.
c) Model inference and accuracy evaluation: Since the objective was to investigate whether subjective opinions can provide information about (in-)accuracies, we used sets of traces that were not likely to be sufficient to infer an accurate model. We used k-folds cross validation for k = 10 to partition the set of traces into ten different batches of training and testing samples. [Equation lost in extraction; only its denominator, Sensitivity + Specificity, survives.]
d) Method: To choose a suitable value of α, for use in Algorithm 1, we carried out a small preliminary study (see Appendix B, available in the online supplemental material). This indicated that, for our model inference setting, α = 2000 is appropriate.
We will return to a more detailed study of how to calibrate α in future work.
We used the training set partitions (from the k-folds cross validation) to calculate the differences in uncertainty and belief with respect to sequences that are correctly or incorrectly classified. To measure this difference we used the Vargha-Delaney $\hat{A}_{12}$ effect size [44]. This is a non-parametric measure that lies between 0 and 1 and estimates the probability that a value drawn from one group is larger than a value drawn from the other; a score of 0.5 indicates that the two groups are stochastically equivalent.
The $\hat{A}_{12}$ score related sequences correctly classified against sequences incorrectly classified. We calculated the score in terms of Belief and Uncertainty. For example, if we are measuring uncertainty, and we get $\hat{A}_{12} = 0.7$, this can be interpreted as '70% of correctly classified sequences have a higher uncertainty score than incorrectly classified sequences'. To be discriminative, the effect size for either belief or uncertainty should differ from 0.5 (which indicates no effect), with the utility increasing as the distance from 0.5 increases.
We divided the set of sequences into sequences classified by the inferred FSM as "reject" and sequences classified as "accept". Bearing in mind that the level of belief specifically refers to the proposition that a sequence should be accepted, we would expect correctly accepted sequences to have a high level of belief, and correctly rejected sequences to have a low level of belief. For sequences that are accepted but should be rejected, there should be a lower level of belief than for sequences that are correctly accepted (i.e., $\hat{A}_{12} > 0.5$). Sequences that are incorrectly rejected should have a higher belief value than sequences that are correctly rejected. It was not clear what to expect of uncertainty.
We plotted the $\hat{A}_{12}$ values on two charts: one for belief and one for uncertainty. For each model (x-axis), we plotted the $\hat{A}_{12}$ for sequences that had been accepted and rejected. For subjective opinions to effectively discriminate between correct and incorrect predictions, we would expect $\hat{A}_{12}$ to be 'medium' or 'high' for belief, uncertainty or both. We adopted the thresholds by Vargha and Delaney [44] for a "medium" effect size: $\hat{A}_{12} > 0.71$ or $\hat{A}_{12} < 0.29$ (0.5 represents a negligible effect size). To visualise the results we subtracted 0.5 from the $\hat{A}_{12}$ value.
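As a concrete reading of the effect size, the Vargha-Delaney statistic can be computed by comparing every pair of values across the two groups, counting ties as half. The sketch below uses that standard pairwise definition; the function name `a12` and the sample values are hypothetical.

```python
def a12(xs, ys):
    """Vargha-Delaney effect size: probability that a value from xs exceeds one from ys.

    Ties between a pair of values count as half a 'win'.
    """
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))


# Invented uncertainty values for correctly and incorrectly accepted sequences.
correct = [0.2, 0.3, 0.6]
incorrect = [0.5, 0.7]
score = a12(correct, incorrect)
print(round(score, 3), round(score - 0.5, 3))  # second value is the plotted offset
```

A score below 0.5 here means that correctly classified sequences tend to have lower uncertainty than incorrectly classified ones; subtracting 0.5 centres the plot on 'no effect'.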
3) RQ3: Does the Use of a Classifier Lead to More Accurate Predictions?:
The setup for RQ3 was identical to that of RQ2. However, we applied Algorithm 2 to produce a classifier from the SOSM and the original training set. We again used k-folds cross validation, with stratified sub-sampling to ensure that we properly distributed 'ACCEPT' and 'REJECT' sequences. We used a Random Forest classifier [33] to relate subjective opinions to classifications. For every evaluation phase in k-folds cross validation, we retained the classifications made by the inferred model against the training set. We compared the prediction made by our classifier against the ground truth and the prediction made by the original inferred FSM, computing the BCR for each case.

Fig. 6. Relationship between the number of traces used to infer an SOSM, and the uncertainty computed for sequences across it (restricted to models with 400 traces or fewer).

B. Results and Discussion
The software used to infer state machines, along with the target state machines used for RQs 2 and 3, is available online.
1) RQ1: Relating Uncertainty to Amount of Evidence:
Fig. 6 shows an apparent relationship between the number of traces used to infer a model, and the uncertainty computed for sequences. When fewer than 20 traces were involved in the inference, the uncertainty tends to be high. For all models that involved over 20 traces, the uncertainty is below 0.5 (mostly below 0.25). The Spearman-rank correlation coefficient is -0.68 (a 'strong' correlation).

RQ1:
Uncertainty values tend to be higher for state machines inferred from low numbers of traces.
2) RQ2: Discriminating Between Correct and Incorrect Predictions:
The $\hat{A}_{12}$ results are shown in Fig. 7. We separate the statistics for the accept and reject sequences because they are different (for reasons which we discuss below). As a reminder of how to interpret them, consider the 'Accept' and 'Reject' bars for ActiveMQ in the 'Uncertainty' chart. The dashed horizontal bars indicate the threshold for 'medium' effect size in either direction. The red bar for ActiveMQ can be read as 'the majority of sequences correctly classified as "accept" have a lower level of uncertainty than sequences that were incorrectly classified as "accept"; only 28% ($\hat{A}_{12} = 0.28$) had a higher level of uncertainty'. The blue bar can be interpreted as '57% ($\hat{A}_{12} = 0.57$) of sequences correctly classified as "reject" have a higher level of uncertainty than sequences incorrectly classified as "reject"'. In both cases the effect sizes would count as 'low' [44]: they are neither above the 0.75 threshold nor below the 0.25 threshold required to count as 'medium'.
In 98% (46/47) of the subject systems, accepted sequences have an $\hat{A}_{12}$ value of ≤ 0.5 for uncertainty (i.e., in the majority of cases, sequences that are correctly classified as accept have a lower level of uncertainty than sequences that are incorrectly classified). In 53% (25/47), $\hat{A}_{12} \leq 0.25$. Thus, for sequences accepted by the inferred model, uncertainty tends to be a good indicator for whether the 'accept' classification is correct (incorrect classifications are associated with higher uncertainty).
For sequences rejected by the inferred model, uncertainty is a less reliable indicator. There is no obvious trend: $\hat{A}_{12}$ values are > 0.5 in 68% (32/47) of models; they are ≤ 0.5 in 34% (16/47) of cases. The magnitude of the effect is only 'medium' (i.e., ≤ 0.25 or ≥ 0.75) in three instances.
Looking at levels of belief, in 87% (41/47) of cases, accept sequences have an $\hat{A}_{12}$ value of ≤ 0.5, and in 81% of cases (38/47) this is ≤ 0.25. In other words, sequences that are correctly classified as accepts have a lower level of belief than incorrectly classified sequences. In summary, accept sequences that are incorrectly classified will tend to be associated with a higher level of belief as well as a higher level of uncertainty than correctly classified sequences.

RQ2:
Subjective opinions are capable of discriminating between correct and incorrect predictions. Incorrect predictions of accept sequences tend to be associated with a higher level of uncertainty, as well as a higher level of belief. There is a less apparent effect for sequences that should be rejected.
Although the uncertainty results are intuitive (correctly accepted sequences have lower uncertainty than incorrectly accepted ones), the results for belief are somewhat counterintuitive. Correctly accepted sequences have lower belief values than incorrectly accepted ones (and correctly rejected sequences have higher belief values than incorrectly rejected ones). This can be explained by the fact that the EDSM algorithm (used to infer the models) always merges together states for which there is the most evidence in the underlying trace-sets. In the resulting state machine, we can always expect an especially high level of belief for any accepted sequence, and an especially low level of belief for any incorrect sequence. When the algorithm is inaccurate, and incorrectly merges together states, this will be reflected in levels of belief that are too high for sequences that should be rejected, and too low for sequences that should be accepted.
Recall that we deliberately used a setting where the inferred models are inaccurate (see Section V-A2c). Although the belief values can be misleading (reflecting inference mistakes made by the EDSM algorithm), second-order uncertainty values are more intuitive; correctly classified "accept" sequences tend to have a lower level of uncertainty than incorrectly classified ones. This gives rise to the question of whether the differentiation between correctly and incorrectly classified sequences can be exploited (RQ3).

3) RQ3: Correcting Incorrect Predictions:
Here we face the question of whether we can use the additional information available from subjective opinions to predict when a classification is incorrect, and to correct the prediction. The results are shown in Table VI; the left half of the table contains the 'uncorrected' results, and the right half contains the corrected counterparts.
The uncorrected scores show the prediction accuracy of using just the inferred FSM. These show, as expected, that there has been a significant degree of over-generalisation in FSM inference. Accepting too many sequences leads to a high Sensitivity (a mean of 0.96), but a low Specificity (mean of 0.64), with a mean BCR of 0.76.
The uncorrected mean BCR is high because Sensitivity tends to be in the high 90s. Specificity scores can be very low. There are 16 Specificity scores < 0.6. For dk16 there is a Specificity of 0.21. For planet and planet1 there is a Specificity of 0.51 and 0.5 respectively; both have a BCR of 0.66. For tav there is a Specificity of 0.18 and a BCR of 0.29.
With the use of the classifier trained on the subjective opinions, there is a marked improvement. Mean Sensitivity slightly decreases by 0.03 to 0.93, but mean Specificity increases by a very large margin from 0.64 to 0.94, leading to an increase in mean BCR from 0.76 to 0.94. There is no trade-off; BCR increases for every model, and often by a significant margin. This extends to the cases mentioned above for which the initial inference produced especially inaccurate results. For example for tav the original BCR score is 0.29 (arising from a poor Specificity score of 0.18). The corrected model produces a BCR of 0.97, with a Specificity of 0.98.
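The relationship between the three scores can be illustrated with a short sketch. It assumes that BCR is the harmonic mean of Sensitivity and Specificity; that reading is an assumption on our part, chosen because it is broadly in line with the reported means, and the function name `bcr` is hypothetical.

```python
def bcr(sensitivity, specificity):
    """Assumed BCR: the harmonic mean of Sensitivity and Specificity."""
    total = sensitivity + specificity
    if total == 0:
        return 0.0
    return 2 * sensitivity * specificity / total


# Applied to the reported mean Sensitivity and Specificity (not to per-model scores).
uncorrected = bcr(0.96, 0.64)
corrected = bcr(0.93, 0.94)
print(round(uncorrected, 3), round(corrected, 3))
```

Under this assumption the means give roughly 0.77 before correction and 0.93 after. These differ slightly from the reported mean BCRs of 0.76 and 0.94, which is expected because the mean of per-model scores is not the same as the score of the mean Sensitivity and Specificity. A harmonic mean is pulled towards the lower of its two inputs, which is why a low Specificity depresses the uncorrected score despite the high Sensitivity.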

RQ3:
Subjective opinions can be used to correct incorrect predictions from inferred state machines.
TABLE VI. The Sensitivity, Specificity and BCR Scores for the FSM, and Equivalent Scores Produced With the Subjective-Opinion Trained Classifier (RQ2)
The extent to which the additional information offered through subjective opinions can improve predictions is worth highlighting. The mean improvement of the BCR score is 18% (minimum of 6% and maximum of 67%). In the case of planet and planet1, BCR improves from 0.66 to 1.
Increases in accuracy result from improved Specificity. The inferred state machines tended to over-generalise. However, for the 'corrected' versions, a larger proportion of sequences that had been falsely accepted (false-positives) were instead correctly rejected (true-negatives). In a software-engineering context where, for example, the inferred models are used to identify candidate test cases [10], [11], this would lead to a more focussed set of candidates, avoiding candidates that are infeasible in practice.

C. Threats to Validity
Internal Validity. For all RQs we used the EDSM Blue-Fringe algorithm [31] and this choice may have affected the results. In future work we will investigate the use of different learning algorithms to infer SOSMs.
External Validity. For RQ1 we based our findings on state machines inferred from logs for Android processes, derived from a single system [34]. There is the possibility that the same findings might not hold for other classes of traces, and exploring this will be the subject of future work.
The state machines for RQs 2 and 3 were drawn from two collections [35], [36], raising the question of whether they are more generally representative. This threat was mitigated by the results being similar for the two collections. In addition, the models are highly diverse (see Table V). Some are small and simple (e.g., lion has 5 states, 36 transitions and an alphabet of 9), but others are much larger (e.g., s298 with 219 states, 8,066 transitions and an alphabet of 46). Sensitivity, Specificity, and BCR scores were calculated with respect to 10-fold cross-validation, mitigating against overfitting or selection bias. However, the results have to be interpreted as being valid with respect to traces sampled according to the probabilities attached to the transitions in the models. Although there are measures that assess the structural accuracy of inferred models (as opposed to language accuracy) [43], these will need to be enhanced to accommodate probabilities, and this is left for future work.

VI. RELATED WORK

A. Second-Order Uncertainty in Software Engineering
Several recent developments support reasoning about uncertainty in Software Engineering [2], particularly uncertainty arising from noisy, partial and skewed distributions in empirical studies. The rigidity of "traditional" statistical analyses has often led to findings that have been difficult to justify and explain. The rise of Bayesian analysis, recently illustrated by Furia et al. [24] and Dorn et al. [45] offers a more robust and explainable basis for managing this uncertainty. Several research efforts have specifically sought to focus on representing second-order uncertainty.
Recent work by Walkinshaw and Shepperd [46] showed how outcomes from empirical studies could be encoded as binomial subjective opinions, and how Subjective Logic fusion operators could be used to combine results from multiple experiments, whilst providing an explicit measure of uncertainty for the grouped experiments. Although we did not use fusion operators, their ability to systematically 'fuse' together opinions is something that could be useful in the context of reasoning about or manipulating SOSMs.
Software safety assessment has also seen efforts to reason about epistemic uncertainty. Duan et al. used Subjective Logic to reason about uncertainty in safety-cases, and emphasised the value of using the beta-distribution (Section II-C2) for capturing this uncertainty. Nair et al. [47] applied Evidential Reasoning [14] to the same area. Evidential Reasoning is also founded on Dempster-Shafer theory, but does not feature the variety of operators in Subjective Logic.

B. Uncertainty in Sequential Models
There is an extensive history of research into the combination of finite automata and probability theory. Although this article used PFSMs [3] as a baseline, there exist other potential representations of probabilistic sequential systems. Hidden Markov Models (HMMs) and Fuzzy Finite Automata are two notable, widely used examples.
HMMs can be seen as a form of PFSM [48], where symbols label states and every state has a probability of being the initial state. Given the equivalence with PFSMs [48], HMMs also only model 'first-order' probabilities and do not explicitly model uncertainty associated with a probability.
Fuzzy Automata map a transition t to a value representing the degree to which t is in the automaton [49]. Initial work used values in [0,1] but, more generally, one can use values from a complete lattice [50]. The value for a path is the greatest lower bound of the values of the transitions; for [0,1], this is the minimum. With multiple paths, the value is the least upper bound of the values for the individual paths [49]; for [0,1], this is the maximum over the paths.
One might use values to represent uncertainty in probabilities and we discuss two options. First, one might use a value in [0,1]. Consider a path t 1 t 2 , where t 1 and t 2 are transitions. If t 1 is assigned a and t 2 is assigned b then t 1 t 2 has value min{a, b}. Now, if a and b are identical and strictly between 0 and 1 then t 1 t 2 also has uncertainty a. However, we would expect the path to have greater uncertainty than the transitions. We obtain the same problem if we instead give a transition a value [a, b], 0 ≤ a ≤ b ≤ 1, and we use the subset relation as the partial order. SOSMs thus appear to be more suitable for the work described in this article.
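The first option can be illustrated with a small sketch of the [0,1] case, where a path takes the minimum of its transition values and an automaton takes the maximum over its paths. The function names and the values are hypothetical.

```python
def path_value(transition_values):
    """Greatest lower bound over [0,1]: the minimum of the transition values."""
    return min(transition_values)


def automaton_value(paths):
    """Least upper bound over [0,1]: the maximum of the individual path values."""
    return max(path_value(path) for path in paths)


a = 0.4
two_step = path_value([a, a])  # both transitions carry the same value
best = automaton_value([[0.4, 0.9], [0.6, 0.7]])
print(two_step, best)
```

With identical values on both transitions the two-step path still evaluates to 0.4, so chaining transitions does not accumulate any uncertainty. This is the behaviour that the argument above identifies as unsuitable.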

C. Inference of Probabilistic Sequential Models
There have been several efforts to reverse-engineer probabilistic models from software (and hardware) systems [51]. Although most concerned PFSM inference, some involved HMMs and other probabilistic sequential models.
Efforts to infer PFSMs date back to work by Rivest and Schapire [52], who inferred robot controllers. In 1998 Cook and Wolf experimented with different inference approaches to infer sequential models, one being a Markov Model [53]. (The non-hidden version of the Markov Model is equivalent to a PFSM; probabilities are expressed on transitions as opposed to states.)
There have been recent efforts to infer HMMs. Nguyen et al.'s DroidAssist tool [54] infers what are in effect HMMs representing sequences of API usages in mobile apps. Emam and Miller's approach [25] produces PFSMs, but is underpinned by HMM inference. HMMs have been extensively used to detect malware from the structure of executable fragments of bytecode [55]. Recently, HMMs featured in eye-tracking studies of software developers [56].

D. Testing Uncertain Sequential Systems
The system under test (SUT) might incorporate stochastic or non-deterministic behaviours. Elbaum and Rosenblum [57] present compelling examples, such as services that depend on GPS location services, and hence on estimated locations of a device. They suggest HMMs as a useful basis for testing such systems.
There is also epistemic uncertainty; having observed a set of executions (a test set), how certain can we be that the SUT is 'correct'? Weyuker showed that the answer to this question can never (in the general case) be certain [58]. There is always the risk that there exists some input that is not part of the test set that might expose new behaviours. This has given rise to 'learning-based testing' [59]. Much of the attention has been on the combination of software testing with state machine inference; a good overview is provided by Aichernig et al. [60]. Although uncertainty has been exploited for non-sequential systems [61], the models inferred have tended to be FSMs [60].
Where probabilistic models are available, authors have explored testing from these. Much of this work concerns probabilistic labelled transition systems. In order to decide which test cases to use, one can find tests that kill mutants [5]. Given a test case t, one can test the SUT with t multiple times and use statistical techniques to check that the observed frequencies are consistent with the specification [5].
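As an illustration of the frequency check described above, the following minimal sketch runs a Pearson chi-square goodness-of-fit test. The choice of chi-square is an assumption made here for illustration; the text does not say which statistical technique [5] uses. The specification `spec`, the observed counts, and the function name `chi_square_statistic` are all hypothetical.

```python
from collections import Counter


def chi_square_statistic(observed_counts, expected_probs):
    """Pearson chi-square statistic of observed counts against specified probabilities."""
    # An observed outcome that the specification gives probability zero
    # (or does not mention at all) is an immediate mismatch.
    for outcome, count in observed_counts.items():
        if count > 0 and expected_probs.get(outcome, 0.0) == 0.0:
            return float("inf")
    n = sum(observed_counts.values())
    stat = 0.0
    for outcome, p in expected_probs.items():
        if p == 0.0:
            continue
        expected = n * p
        stat += (observed_counts.get(outcome, 0) - expected) ** 2 / expected
    return stat


# Hypothetical specification: probabilities of the outcomes of test case t.
spec = {"ok": 0.7, "retry": 0.2, "fail": 0.1}
# Hypothetical outcomes of running t against the SUT 100 times.
observed = Counter({"ok": 66, "retry": 24, "fail": 10})

# Chi-square critical value for 2 degrees of freedom at the 5% level.
CRITICAL_0_05_DF2 = 5.991

stat = chi_square_statistic(observed, spec)
consistent = stat <= CRITICAL_0_05_DF2
print(f"statistic={stat:.3f}, consistent with specification: {consistent}")
```

The hard-coded critical value only applies to three outcomes (two degrees of freedom); a statistics library would normally supply the threshold or a p-value for other cases.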

VII. CONCLUSION AND FUTURE WORK
We have introduced SOSMs, a generalisation of PFSMs, where states are labelled with multinomial subjective opinions. These not only capture the relative likelihoods of transitions, but also associate them with a level of uncertainty. Thanks to Subjective Logic [9], it is possible to compute the combined likelihood and uncertainty associated with sequences of events. This means that predictions can be associated with a level of 'trustworthiness'. We also provided an algorithm that can be used to automatically generate an SOSM from an FSM and a set of traces.
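To make the notion of a multinomial subjective opinion concrete, the sketch below derives one from observation counts using the standard evidence-to-opinion mapping of Subjective Logic, with a non-informative prior weight of 2 and uniform base rates. It is an illustration under those assumptions, not necessarily the exact procedure used by our algorithm; the function name `opinion_from_counts` and the event counts are hypothetical.

```python
def opinion_from_counts(counts, prior_weight=2.0):
    """Map per-event observation counts at a state to a multinomial opinion.

    Returns (belief, uncertainty, projected), where belief and projected are
    dictionaries keyed by event. Uniform base rates are assumed.
    """
    total = sum(counts.values())
    denom = prior_weight + total
    # Belief mass for each event grows with the evidence observed for it.
    belief = {event: r / denom for event, r in counts.items()}
    # Uncertainty mass shrinks as the total amount of evidence grows.
    uncertainty = prior_weight / denom
    base_rate = 1.0 / len(counts)
    # Projected probability: belief plus a base-rate share of the uncertainty.
    projected = {event: b + base_rate * uncertainty for event, b in belief.items()}
    return belief, uncertainty, projected


# Hypothetical counts of outgoing events observed at one state.
few_b, few_u, few_p = opinion_from_counts({"a": 6, "b": 2})
# The same relative frequencies, backed by ten times as many observations.
many_b, many_u, many_p = opinion_from_counts({"a": 60, "b": 20})

print(f"few observations:  u={few_u:.3f}, P(a)={few_p['a']:.3f}")
print(f"many observations: u={many_u:.3f}, P(a)={many_p['a']:.3f}")
```

Both opinions reflect the same 3:1 ratio of observed events, but the second carries far less uncertainty mass, which is the distinction a PFSM with plain transition probabilities cannot express.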
In our evaluation, we used SOSMs to infer predictive classifiers of behaviour, showing that uncertainty captured within an inferred SOSM can be a strong indicator of how (in-)accurate a prediction will be. We also showed that the subjective opinions associated with classifications can be used to 'correct' the predictions of the underlying FSM, to the point that the predictions in our experiments were improved from a mean BCR score of 0.76 to a score of 0.94.
We believe that there are several exciting opportunities to apply SOSMs within software testing. There is a well-established line of work in testing from probabilistic models, and a (currently) separate line of research into the relationship between testing and uncertainty. We believe that the SOSM could form the basis for combining these two strands. Finally, we have only used a relatively small subset of Subjective Logic; there are other elements that could be used to refine and strengthen our approach, such as the use of hyper-opinions [9], which would enable us to model uncertainty at individual states in more accurate terms.