Why do some probabilistic forecasts lack reliability?

In this work, we investigate the reliability of probabilistic binary forecasts. We mathematically prove that a necessary, but not sufficient, condition for achieving a reliable probabilistic forecast is that the Peirce skill score (PSS) is maximized at a threshold probability equal to the climatological base rate. The condition is confirmed by using artificially synthesized forecast-outcome pair data and previously published probabilistic solar flare forecast models. The condition gives a partial answer as to why some probabilistic forecast systems lack reliability, because a system that does not satisfy the proved condition can never be reliable. Therefore, the proved condition is very important for the developers of a probabilistic forecast system. The result implies that those who want to develop a reliable probabilistic forecast system must adjust or train the system so as to maximize PSS near the threshold probability of the climatological base rate.


Introduction
Forecasts of space weather phenomena have become operational. There are at least two types of forecast of the occurrence of space weather phenomena, namely, deterministic and probabilistic. Because it is difficult to forecast the occurrence of natural phenomena deterministically, a probabilistic forecast is suitable for the occurrence of space weather phenomena, such as solar flares. Moreover, a deterministic forecast can easily be derived by thresholding a probabilistic forecast (e.g., Jolliffe & Stephenson, 2012). The conversion from a probabilistic forecast to a deterministic forecast can be performed by the forecast users themselves, whose threshold probabilities for deciding event occurrence differ. Several authors (e.g., Murphy, 1977; Richardson, 2000; Zhu et al., 2002) showed in the framework of decision-analytic models that the relative economic value of a probabilistic forecast is higher than that of a deterministic forecast, which means that a probabilistic forecast is more useful than a deterministic forecast in the sense of economic value. Murphy (1993) stated, in the context of forecast consistency, that "Since forecasters' judgments necessarily contain an element of uncertainty, their forecasts must reflect this uncertainty accurately in order to satisfy the basic maxim of forecasting. In general, then, forecasts must be expressed in probabilistic terms." For these reasons, probabilistic forecast models for the occurrence of space weather phenomena have been developed by several authors.
Solar flare occurrence forecasts have been actively studied in the operational space weather forecast community. Recently, many articles related to solar flare occurrence forecasts have been published, which include deterministic forecasts as well as probabilistic forecasts. Examples are human-judged forecasts (Crown, 2012; Devos et al., 2014; Kubo et al., 2017; Murray et al., 2017), statistical methods (Wheatland, 2005; Falconer et al., 2011; Bloomfield et al., 2012; McCloskey et al., 2016; Steward et al., 2017; Leka et al., 2018), and machine learning forecasts (Bobra & Couvidat, 2015; Muranushi et al., 2015; Huang et al., 2018; Nishizuka et al., 2017, 2018). Many authors have assessed the performance of the forecast models. However, many of the probabilistic forecast models are verified only for discrimination performance, by using a relative operating characteristic curve, and are not verified for other attributes such as reliability, which is one of the most important attributes to be assessed in forecast verification. Several authors (e.g., Jolliffe & Stephenson, 2012; Kubo et al., 2017) mentioned that there are many attributes to be assessed in forecast verification, such as bias, accuracy, discrimination, reliability, and skill. Murphy (1991) pointed out that a single verification measure is not enough to correctly assess forecast performance because of the high dimensionality of the joint probability density of the outcome and forecast. For example, at least three verification measures are required in the case of a dichotomous deterministic forecast because the dimensionality in this situation is three.
Efforts to compare the performances of several forecast models have also been in progress. Barnes et al. (2016) compared eleven probabilistic solar flare forecast models and used the relative operating characteristic curve, reliability diagram, and Brier skill score as verification measures, together with some skill scores for a contingency table created using only one threshold probability of the probabilistic forecast. The reliability diagrams in Barnes et al. (2016) showed that several probabilistic solar flare forecast models lack reliability. In the terrestrial weather forecast community, an unreliable probabilistic forecast model is often calibrated (e.g., Gneiting et al., 2007; Primo et al., 2009). However, the calibration of an unreliable probabilistic forecast model is not yet popular in the space weather forecast community. Therefore, it is better that direct outputs from the probabilistic forecast model are already reliable. To realize reliable outputs of probabilistic forecast models, we must investigate the reason why some probabilistic forecast models lack reliability.
We investigate a condition for a probabilistic binary forecast to be reliable in this work. In Section 2, we investigate the condition mathematically, and derive a necessary, but not a sufficient, condition for a probabilistic binary forecast to be reliable. In Section 3, the condition will be confirmed by using artificially synthesized forecast probabilities with corresponding outcomes and several probabilistic solar flare forecast models. The discussion and conclusion are described in Sections 4 and 5, respectively.

Mathematical derivation of the condition
One of the important attributes that a probabilistic forecast system should satisfy is reliability. Reliability means agreement between the forecast event occurrence probability $x$, which has the probability density function $p(x)$ ($0 \le x \le 1$), and the conditional expectation value of the outcome given the probability $x$, $E(o|x)$ (e.g., Jolliffe & Stephenson, 2012). If a probabilistic forecast system is perfectly reliable, $E(o|x)$ should be equal to $x$. In the case of a binary event, as the outcome $o$ is 1 (100% probability) for an event and 0 (0% probability) for no event, $E(o|x)$ can be rewritten as:

$$E(o|x) = \sum_{o \in \{0,1\}} o\, p(o|x) = p(o=1|x) = \frac{p(o=1, x)}{p(o=1, x) + p(o=0, x)}, \tag{1}$$

where $p(o=1, x)$ and $p(o=0, x)$ are joint probability densities of the outcome and forecast. Therefore, the equation,

$$p(o=1|x) = x, \tag{2}$$

must be satisfied for a perfectly reliable probabilistic forecast system. By using Bayes' theorem, $p(o=1|x)$ can be rewritten as:

$$p(o=1|x) = \frac{p(x|o=1)\, p(o=1)}{p(x|o=1)\, p(o=1) + p(x|o=0)\, p(o=0)}, \tag{3}$$

where $p(x|o=1)$ and $p(x|o=0)$ are the conditional probability density functions given the outcome of event and no event, respectively. Hereafter, we refer to $p(x|o=1)$ and $p(x|o=0)$ as $p_1(x)$ and $p_0(x)$, respectively. As $p(o=1)$ is the climatological base rate, we write $p(o=1)$ and $p(o=0)$ as $s$ and $1-s$, respectively. From equations (2) and (3), the equation

$$p_0(x) = \frac{s(1-x)}{(1-s)x}\, p_1(x) \tag{4}$$

is derived for a perfectly reliable forecast system. Here, we define the function $f(x)$ as:

$$f(x) = p_0(x) - p_1(x) = \frac{s-x}{(1-s)x}\, p_1(x). \tag{5}$$

$f(x)$ is zero at $x = s$, positive or zero for $0 \le x < s$, and negative or zero for $s < x \le 1$, because $p_1(x)$ takes a positive or zero value for $0 \le x \le 1$. Because $p_1(x)$ and $p_0(x)$ are conditional probability density functions given the outcome of event and no event, respectively, the integrals of $p_1(x)$ and $p_0(x)$ from $x$ to 1 are regarded as the Probability of Detection (POD) and the Probability of False Detection (POFD), respectively, in forecast verification. Therefore, the derivative of the Peirce Skill Score (PSS = POD $-$ POFD) with respect to $x$ becomes $f(x)$:

$$\frac{d\,\mathrm{PSS}(x)}{dx} = \frac{d}{dx}\left[\int_x^1 p_1(x')\,dx' - \int_x^1 p_0(x')\,dx'\right] = p_0(x) - p_1(x) = f(x). \tag{6}$$
As already mentioned, because $f(x)$ is zero at $x = s$, positive or zero for $0 \le x < s$, and negative or zero for $s < x \le 1$, $\mathrm{PSS}(x)$ is maximum at $x = s$. In conclusion, we were able to prove that the proposition,

$$\text{the probabilistic forecast is reliable} \;\Rightarrow\; \mathrm{PSS}(x) \text{ is maximum at } x = s,$$

is true. This means that the maximization of PSS at a threshold probability equal to the climatological base rate is a necessary condition for a reliable probabilistic forecast.

In the following, we investigate whether the derived necessary condition is sufficient. If a probabilistic forecast system is unreliable, the conditional expectation value of the outcome given a forecast of the event occurrence probability is not equal to the forecast probability, that is,

$$E(o|x) = p(o=1|x) = g(x), \qquad g(x) \not\equiv x, \tag{7}$$

can be assumed, where $g(x)$ is a function representing a reliability curve. From equations (1), (3), and (7), the equation

$$p_0(x) = \frac{s\,(1-g(x))}{(1-s)\,g(x)}\, p_1(x) \tag{8}$$

is derived. As $p_1(x)$ and $p_0(x)$ are conditional probability density functions given the outcome of event and no event, respectively, the derivative of PSS with respect to $x$ is written as:

$$\frac{d\,\mathrm{PSS}(x)}{dx} = p_0(x) - p_1(x) = \frac{s - g(x)}{(1-s)\,g(x)}\, p_1(x). \tag{9}$$

If there exists a function $g(x)$ satisfying

$$g(x) \le s \;\; (0 \le x < s), \qquad g(x) = s \;\; (x = s), \qquad g(x) \ge s \;\; (s < x \le 1), \tag{10}$$

then $\mathrm{PSS}(x)$ can be maximum at $x = s$, because the derivative of $\mathrm{PSS}(x)$ with respect to $x$ is zero at $x = s$, positive or zero for $0 \le x < s$, and negative or zero for $s < x \le 1$. Actually, because a function $g(x)$ that differs from $x$ and satisfies equation (10) exists (the example given as equation (11)), $\mathrm{PSS}(x)$ is maximum at $x = s$ for such an unreliable forecast system. Therefore, the proposition,

$$\text{the probabilistic forecast is unreliable} \;\Rightarrow\; \mathrm{PSS}(x) \text{ is not maximum at } x = s,$$

is false. This means that the proposition,

$$\mathrm{PSS}(x) \text{ is maximum at } x = s \;\Rightarrow\; \text{the probabilistic forecast is reliable},$$

is false, and the maximization of PSS at a threshold probability equal to the climatological base rate is a necessary, but not sufficient, condition for a reliable probabilistic forecast system. An important point is that no assumption is made about the functional form of the probability densities $p_1(x)$ and $p_0(x)$ when deriving the condition. This means that the condition is independent of the form of the probability density function.
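The argument above can be illustrated with a small numerical check. The sketch below is not taken from the paper: the helper names and the two unreliable curves `g_power` and `g_shifted` are hypothetical choices made only for illustration. It tests on a grid whether a candidate reliability curve g(x) stays at or below s for x < s, equals s at x = s, and stays at or above s for x > s, and it evaluates the factor (s - g(x)) / ((1 - s) g(x)) that multiplies the non-negative density p1(x) in the derivative of PSS when E(o|x) = g(x).

```python
def satisfies_condition(g, s, n_grid=1000, tol=1e-12):
    """Check on a grid over (0, 1] that g(x) <= s below s, g(s) = s, and g(x) >= s above s."""
    if abs(g(s) - s) > 1e-9:
        return False
    for i in range(1, n_grid + 1):
        x = i / n_grid
        if x < s and g(x) > s + tol:
            return False
        if x > s and g(x) < s - tol:
            return False
    return True


def dpss_factor(g, s, x):
    """Factor (s - g(x)) / ((1 - s) g(x)) that multiplies p1(x) >= 0 in dPSS/dx."""
    return (s - g(x)) / ((1.0 - s) * g(x))


s = 0.1  # climatological base rate, the value used for the synthetic cases


def g_reliable(x):
    return x  # perfectly reliable: E(o|x) = x


def g_power(x, k=0.6):
    # HYPOTHETICAL unreliable curve; k is an arbitrary illustrative exponent
    return s * (x / s) ** k


def g_shifted(x):
    # HYPOTHETICAL unreliable curve that crosses s at x = 0.5 instead of x = s
    return 0.4 * x ** 2


for name, g in [("reliable", g_reliable), ("power", g_power), ("shifted", g_shifted)]:
    print(name, satisfies_condition(g, s), dpss_factor(g, s, 0.05), dpss_factor(g, s, 0.3))
```

Under these assumptions, `g_power` passes the check although it is not the diagonal, which is the situation that makes the condition necessary but not sufficient, whereas `g_shifted` fails it.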

Confirmation using forecast data and models
The necessary condition derived in the previous section is based on continuous probability density functions, which implies that it is based on an infinite number of sample data. However, an infinite number of samples is never available in practice. Therefore, the derived condition should be confirmed by using a finite number of sample data. In this section, we confirm the derived condition first by using artificially sampled forecast-outcome pairs and then by using several probabilistic solar flare forecast models described by Barnes et al. (2016).

Synthetic forecast data
A probabilistic binary forecast system is fully determined by defining the climatological base rate $s$ and the two conditional probability density functions of the event occurrence probability, $p_1(x)$ and $p_0(x)$. Synthetic forecast-outcome pairs are randomly sampled from $p_1(x)$ and $p_0(x)$ so that the climatological base rate is $s$. In this article, the climatological base rate $s$ is fixed at 0.1, which represents a somewhat rare event case. The total number of sampled forecast-outcome pairs is 10 000.
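One way to generate such pairs can be sketched as follows. This is an assumption about the procedure, not the author's code: instead of drawing separately from $p_1(x)$ and $p_0(x)$, the sketch draws the forecast x from the marginal density p(x) = s p1(x) + (1 - s) p0(x) and then draws the outcome with probability g(x) = s p1(x) / p(x), which yields the same joint distribution (the number of events is then random rather than fixed). It also assumes the standard beta density Be(x; a, b) = x^(a-1) (1-x)^(b-1) / B(a, b); with the case 1 densities given in the text, the marginal reduces to Be(x; 0.1, 0.9). The function names are hypothetical.

```python
import math
import random


def log_beta_fn(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def event_probability(x, s, p1_params, marginal_params):
    """g(x) = s * Be(x; a1, b1) / Be(x; am, bm), evaluated in log space."""
    a1, b1 = p1_params
    am, bm = marginal_params
    log_g = (math.log(s)
             + log_beta_fn(am, bm) - log_beta_fn(a1, b1)
             + (a1 - am) * math.log(x)
             + (b1 - bm) * math.log(1.0 - x))
    return min(math.exp(log_g), 1.0)  # guard: a probability cannot exceed 1


def sample_pairs(n, s, p1_params, marginal_params, seed=0):
    """Draw n (forecast, outcome) pairs: x from the marginal, o ~ Bernoulli(g(x))."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        x = rng.betavariate(*marginal_params)
        x = min(max(x, 1e-12), 1.0 - 1e-12)  # keep log() finite at the endpoints
        g = event_probability(x, s, p1_params, marginal_params)
        pairs.append((x, 1 if rng.random() < g else 0))
    return pairs


# Case 1 of the text: p1 = Be(x; 1.1, 0.9), marginal = Be(x; 0.1, 0.9), s = 0.1
pairs = sample_pairs(10_000, 0.1, (1.1, 0.9), (0.1, 0.9))
base_rate = sum(o for _, o in pairs) / len(pairs)
print(f"sampled base rate: {base_rate:.3f}")
```

Working in log space avoids overflow of the beta densities near x = 0, where a large share of the case 1 forecasts lies.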
Because the independent variable of the conditional probability density functions is the probability $x$, the range of $x$ must be from 0 to 1. Therefore, the beta distribution $\mathrm{Be}(x; a, b)$, whose definition appears in the Appendix, is employed for the probability density functions. Because a beta distribution can flexibly change its shape depending on the two parameters, it is suitable for investigating various types of situations. Three cases are investigated: (1) perfectly reliable, (2) PSS is maximum at a probability largely different from the climatological base rate, and (3) PSS is maximum at the probability equal to the climatological base rate but the forecast is unreliable. Although only specific forms of the probability density function are considered in the subsequent three subsections, the results are independent of the form of the probability density function.

Case 1: Perfectly reliable forecast
In case 1, the two conditional probability density functions of the event occurrence probability, $p_1(x)$ and $p_0(x)$, are set as $\mathrm{Be}(x; 1.1, 0.9)$ and $[10\,\mathrm{Be}(x; 0.1, 0.9) - \mathrm{Be}(x; 1.1, 0.9)]/9$, respectively, so that the two density functions satisfy equation (4), which states that the probabilistic forecast system is reliable. Randomly sampled variates from the probability density functions are pooled as the artificial forecast-outcome pairs. Figure 1a shows a reliability diagram for case 1. The blue dots connected by lines depict the conditional expectation values of the outcome. A perfect reliability curve is depicted by the diagonal dashed line, on which the 99% consistency bars (Bröcker & Smith, 2007; Jolliffe & Stephenson, 2012) are drawn as vertical dashes. The 99% consistency bar shows the range within which 99% of the conditional expectation values of the outcome given the probability would fall if the original data were sampled from a perfectly reliable probabilistic forecast system. The red histogram, with the right axis, shows the number of probabilistic forecasts within each bin. It is clear that all the conditional expectation values of the outcome are located within the 99% consistency bars. This means that the synthetic probabilistic forecast is almost perfectly reliable (of course, it is so by construction).
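The ingredients of such a reliability diagram can be sketched in a few lines. The following is a simplified illustration in the spirit of the consistency-bar resampling idea, not a reproduction of the procedure used for Figure 1a: forecasts are resampled with replacement, surrogate outcomes are drawn as if the system were perfectly reliable, and the spread of the binned event frequencies gives the bar. The bin count, the number of resamples, the function names, and the uniformly distributed toy forecasts are all assumptions made for the example.

```python
import random


def reliability_table(pairs, n_bins=10):
    """Per bin: (mean forecast, observed event frequency, count); None for empty bins."""
    sum_x = [0.0] * n_bins
    sum_o = [0] * n_bins
    count = [0] * n_bins
    for x, o in pairs:
        k = min(int(x * n_bins), n_bins - 1)  # x = 1.0 goes into the last bin
        sum_x[k] += x
        sum_o[k] += o
        count[k] += 1
    return [(sum_x[k] / count[k], sum_o[k] / count[k], count[k]) if count[k]
            else (None, None, 0) for k in range(n_bins)]


def consistency_bars(forecasts, n_bins=10, level=0.99, n_resamples=100, seed=0):
    """Per-bin (low, high) range of event frequencies expected under perfect reliability."""
    rng = random.Random(seed)
    n = len(forecasts)
    freqs = [[] for _ in range(n_bins)]
    for _ in range(n_resamples):
        surrogate = []
        for _ in range(n):
            x = forecasts[rng.randrange(n)]                       # resample a forecast
            surrogate.append((x, 1 if rng.random() < x else 0))  # reliable outcome
        for k, (_, freq, cnt) in enumerate(reliability_table(surrogate, n_bins)):
            if cnt:
                freqs[k].append(freq)
    bars = []
    for values in freqs:
        if not values:
            bars.append((None, None))
            continue
        values.sort()
        i = int((1.0 - level) / 2.0 * (len(values) - 1))  # symmetric empirical quantiles
        bars.append((values[i], values[-1 - i]))
    return bars


def count_inside(pairs, bars, n_bins=10):
    """Number of non-empty bins whose observed frequency lies within its bar."""
    inside = 0
    for (_, freq, cnt), (lo, hi) in zip(reliability_table(pairs, n_bins), bars):
        if cnt and lo is not None and lo <= freq <= hi:
            inside += 1
    return inside


# Toy data (hypothetical): uniformly distributed forecasts
rng = random.Random(1)
forecasts = [rng.random() for _ in range(2000)]
reliable_pairs = [(x, 1 if rng.random() < x else 0) for x in forecasts]
never_pairs = [(x, 0) for x in forecasts]  # events never occur: clearly unreliable
bars = consistency_bars(forecasts)
n_inside = count_inside(reliable_pairs, bars)
n_inside_bad = count_inside(never_pairs, bars)
print(n_inside, n_inside_bad)
```

With only 100 resamples, the 99% bar in this sketch is effectively the minimum-to-maximum range of the surrogate frequencies; a larger number of resamples would give a proper quantile estimate at higher cost.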
According to the condition derived in Section 2, PSS must be maximum at the threshold probability of 0.1, which is a climatological base rate. Figure 1b shows the variation of PSS versus the various threshold probabilities calculated using the synthetic forecast-outcome pairs. We can clearly see that PSS is maximum at around the climatological base rate.
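A curve of this kind can be computed directly from forecast-outcome pairs. The sketch below is a minimal illustration and not the code behind Figure 1b. It assumes that an event is forecast whenever x is greater than or equal to the threshold, uses hypothetical function names, and builds its demonstration data by drawing x from Be(x; 0.1, 0.9) and the outcome with probability x, which is one way to obtain a perfectly reliable system with a base rate near 0.1.

```python
import random


def pss_curve(pairs, thresholds):
    """Return [(threshold, PSS)], where an event is forecast when x >= threshold."""
    n_event = sum(o for _, o in pairs)
    n_null = len(pairs) - n_event
    curve = []
    for t in thresholds:
        hits = sum(1 for x, o in pairs if o == 1 and x >= t)
        false_alarms = sum(1 for x, o in pairs if o == 0 and x >= t)
        curve.append((t, hits / n_event - false_alarms / n_null))  # POD - POFD
    return curve


def argmax_threshold(curve):
    """Threshold at which PSS is largest."""
    return max(curve, key=lambda item: item[1])[0]


# Tiny hand-made example (hypothetical numbers)
toy = [(0.05, 0), (0.1, 0), (0.15, 0), (0.2, 1), (0.4, 0), (0.7, 1), (0.9, 1)]
toy_curve = pss_curve(toy, [0.0, 0.18, 0.5])
toy_best = argmax_threshold(toy_curve)

# Perfectly reliable demonstration data with a base rate near 0.1
rng = random.Random(1)
demo = []
for _ in range(10_000):
    x = rng.betavariate(0.1, 0.9)
    demo.append((x, 1 if rng.random() < x else 0))
thresholds = [i / 100 for i in range(1, 100)]
best = argmax_threshold(pss_curve(demo, thresholds))
print(toy_best, best)
```

Because the PSS curve is fairly flat near its maximum, the maximizing threshold found from a finite sample scatters around the base rate rather than matching it exactly, which is why the text speaks of a maximum "at around" the climatological base rate.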

Case 2: Maximize PSS at a probability different from the climatological base rate
In case 2, $p_1(x)$ and $p_0(x)$ are set as $\mathrm{Be}(x; 2.2, 0.4)$ and $[10\,\mathrm{Be}(x; 0.2, 0.4) - \mathrm{Be}(x; 2.2, 0.4)]/9$, respectively, for which PSS is maximum at the threshold probability of 0.5, which is largely different from the climatological base rate. Figure 2b depicts the plot of PSS versus various threshold probabilities. The diagram shows that PSS is maximum at the threshold probability of around 0.5 (by construction), which is far from the climatological base rate.
According to the condition mathematically derived in Section 2, the probabilistic forecast in case 2 must be unreliable. In the following, we confirm that the forecast is unreliable by drawing the reliability diagram. Figure 2a shows a reliability diagram for case 2. The dots, lines, dashes, and histogram represent the same quantities as those in case 1. We can recognize from the figure that the conditional expectation values of the outcome are not on the perfect reliability line. This fact confirms that case 2 is an unreliable probabilistic forecast system.
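The behaviour of case 2 can also be read off analytically. The following worked step is a sketch that uses only the densities stated for case 2, $s = 0.1$, the standard beta density $\mathrm{Be}(x; a, b) = x^{a-1}(1-x)^{b-1}/B(a, b)$ (assumed here to match the Appendix definition), and the identity $B(a+1, b) = B(a, b)\,a/(a+b)$. The symbol $g(x)$ denotes the implied reliability curve $s\,p_1(x)/p(x)$, with $p(x)$ the marginal forecast density.

```latex
\begin{align*}
p(x) &= s\,p_1(x) + (1-s)\,p_0(x)
      = 0.1\,\mathrm{Be}(x; 2.2, 0.4)
        + 0.9\,\frac{10\,\mathrm{Be}(x; 0.2, 0.4) - \mathrm{Be}(x; 2.2, 0.4)}{9}
      = \mathrm{Be}(x; 0.2, 0.4), \\
g(x) &= \frac{s\,p_1(x)}{p(x)}
      = 0.1\,x^{2}\,\frac{B(0.2, 0.4)}{B(2.2, 0.4)}
      = 0.1\,x^{2}\,\frac{0.6 \times 1.6}{0.2 \times 1.2}
      = 0.4\,x^{2}, \\
g(x) &= s \iff 0.4\,x^{2} = 0.1 \iff x = 0.5 .
\end{align*}
```

Since $g(x) < s$ for $x < 0.5$ and $g(x) > s$ for $x > 0.5$, PSS increases up to $x = 0.5$ and decreases afterwards, in line with Figure 2b. The curve $0.4\,x^{2}$ lies below the diagonal for all $0 < x \le 1$, so under these assumptions the case 2 system over-forecasts at every probability.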

Case 3: Maximize PSS at the climatological base rate but unreliable forecast
In case 3, $p_1(x)$ and $p_0(x)$ are set as $\mathrm{Be}(x; 0.83, 1.19)$ and $[10\,\mathrm{Be}(x; 0.23, 1.19) - \mathrm{Be}(x; 0.83, 1.19)]/9$, respectively, for which PSS is maximum at the threshold probability of the climatological base rate.
Y. Kubo: J. Space Weather Space Clim. 2019, 9, A17

A plot of PSS versus various threshold probabilities, calculated using the synthetic forecast-outcome pairs, is depicted in Figure 3b. The figure shows that PSS is maximum at around the climatological base rate (by construction). However, as proven in the previous section, because the maximization of PSS at the threshold probability of the climatological base rate is not a sufficient condition for a probabilistic forecast to be reliable, whether the forecast is reliable cannot be decided from this fact alone. To confirm this theoretically derived result, a reliability diagram for case 3 is drawn in Figure 3a. The dots, lines, dashes, and histogram represent the same quantities as those in case 1. Clearly, the conditional expectation values of the outcome do not follow the perfect reliability line. This fact shows that a probabilistic forecast can be unreliable even if PSS is maximum at around the climatological base rate.
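The same calculation as a worked step for case 3 makes the failure of sufficiency explicit. It is a sketch under the assumption of the standard beta density $\mathrm{Be}(x; a, b) = x^{a-1}(1-x)^{b-1}/B(a, b)$, with $s = 0.1$ and $g(x) = s\,p_1(x)/p(x)$ the implied reliability curve; the numerical coefficient is rounded.

```latex
\begin{align*}
p(x) &= s\,p_1(x) + (1-s)\,p_0(x) = \mathrm{Be}(x; 0.23, 1.19), \\
g(x) &= \frac{s\,p_1(x)}{p(x)}
      = 0.1\,x^{0.6}\,\frac{B(0.23, 1.19)}{B(0.83, 1.19)}
      = 0.1\,x^{0.6}\,\frac{\Gamma(0.23)\,\Gamma(2.02)}{\Gamma(0.83)\,\Gamma(1.42)}
      \approx 0.40\,x^{0.6}, \\
g(x) &= s \iff x \approx 0.10 .
\end{align*}
```

The curve crosses the level $s$ at a threshold close to the base rate, so PSS peaks there, yet $g(x)$ lies above the diagonal for $x$ below about 0.1 and below it for larger $x$ (for example, $g(1) \approx 0.40$), so the forecast is not reliable.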

Solar flare forecast models
Barnes et al. (2016) plotted reliability diagrams and estimated the threshold probabilities maximizing PSS for eleven solar flare forecast models, and these results are used here to confirm the validity of the condition derived in this study. Although they dealt with three event definitions, we refer to only one event definition (C1.0 or greater flare), because there were few flare event samples for the other two event definitions and the error bars in the corresponding reliability diagrams were large. In this subsection, the terms "table" and "figure" denote the tables and figures that appeared in Barnes et al. (2016) unless explicitly stated.
Ten models out of eleven can forecast the events of C1.0 or greater flare, and were assessed for these events (figures 11, 12, 13, 17, 19, 20, 22, 23, 25, and 26). Climatological base rates for the ten models were shown in tables 8, 9, 10, 12, 13, 14, 15, 16, 17, and 18, respectively. The reliability diagrams (top panels) in figures 12, 13, 15, 19, 20, and 22 show that the reliabilities of these models were relatively good (of course, no model has perfect reliability). According to the condition derived in Section 2, the threshold probabilities maximizing PSS for these models should be near the climatological base rate. As the threshold probabilities maximizing PSS are shown in the bottom panels of the figures, we refer to these values. The absolute differences between the climatological base rate and the threshold probability maximizing PSS for the relatively reliable forecast models were between 0.015 and 0.049, which shows that the threshold probabilities maximizing PSS were very close to the climatological base rates. On the other hand, the absolute differences between the climatological base rate and the threshold probability maximizing PSS for the models shown in figures 11, 23, 25, and 26 were 0.193, 0.268, 0.393, and 0.150, respectively, which means that the threshold probabilities maximizing PSS were far from the climatological base rates. We clearly recognize from figures 11, 23, 25, and 26 that the reliabilities of these models were relatively poor. This result shows that a model whose threshold probability maximizing PSS is far from the climatological base rate lacks reliability. These results are consistent with the mathematically derived condition and are summarized in Table 1 of this paper.
From the examples shown in this section, it is confirmed that the maximization of PSS at a climatological base rate is a necessary, but not sufficient, condition for a reliable probabilistic forecast. We used beta distributions to describe the probability densities in the examples. However, we emphasize again that the confirmed result does not depend on the form of the probability density as shown in Section 3.5, so the result is quite general.

Discussion
The condition that PSS is maximum at a threshold probability equal to the climatological base rate is a necessary condition for a probabilistic forecast system to be reliable. That is, if the probabilistic forecast system is reliable, the PSS of the system is necessarily maximum at the threshold probability equal to the climatological base rate. In other words, a probabilistic forecast system whose PSS is maximum at a threshold probability largely different from the climatological base rate can never be reliable. This claim is very important for developers of probabilistic forecast systems. Those who want to develop a reliable probabilistic forecast system must adjust or train their system so that PSS is maximum at the threshold probability of the climatological base rate. Of course, this adjustment or training alone is not necessarily enough for a reliable system, because the condition is not a sufficient condition. However, unless the condition is met, the system can never be reliable.
A joint probability density of forecast-outcome pairs can be factored into a conditional probability density and a marginal probability density. In a distribution-oriented forecast verification framework (Murphy & Winkler, 1987), two types of factorization are possible. One is the calibration-refinement factorization, which is a factorization into the conditional probability density of the observation given the forecast (calibration distribution) and the marginal probability density of the forecast (refinement distribution). The other is the likelihood-base rate factorization, which is a factorization into the conditional probability density of the forecast given the observation (likelihood distribution) and the marginal probability density of the observation (base rate distribution). While the attribute of reliability is directly related to the calibration distribution, PSS is related only to the likelihood distribution, which implies that PSS alone can say nothing about reliability. That is, completely different aspects of the joint probability density are assessed by reliability and by PSS. It is interesting that, despite this fact, a reliable probabilistic forecast is directly related to the maximization of PSS. The question as to why the maximization of PSS at the threshold probability of the climatological base rate is related to a reliable probabilistic forecast system can partly be answered by considering the factorization of the joint probability density. A combination of the likelihood and base rate distributions can completely describe the joint probability density of forecast and outcome. This means that although the likelihood distribution alone cannot assess the calibration distribution, the combination of the likelihood and base rate distributions can do so. Therefore, the combination of information on PSS and on the climatological base rate is required for assessing information on reliability.

Table 1. Summary of climatological base rate ($s$) and threshold probability maximizing PSS ($p_{th}$) that appeared in Barnes et al. (2016). Note. Fig. # and Tab. # are figure numbers and table numbers in Barnes et al. (2016), respectively.

Some literature related to this study has been published in the field of meteorological forecast verification. Richardson (2000) discussed the relative economic value of forecasts in the framework of decision-analytic models. He mentioned that the maximum relative economic value for a deterministic forecast is attained at the point where a user's cost-loss ratio equals the climatological base rate and is given by PSS. This means that the maximum relative economic value for a probabilistic forecast is given by the maximum PSS under the condition that a user's cost-loss ratio equals the climatological base rate. It is interesting that the relationship between the climatological base rate and the maximum PSS appears in several kinds of situations in forecast verification. This point should be further investigated.
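The link to the cost-loss framework can be made concrete with a short worked step. It assumes the usual static cost-loss expression for the relative economic value $V$ of a deterministic forecast, as commonly written following Richardson (2000), with $H$ = POD, $F$ = POFD, $\alpha$ the user's cost-loss ratio, and $s$ the climatological base rate. The notation is introduced here for illustration and does not appear in the text above.

```latex
\begin{align*}
V(\alpha) &= \frac{\min(\alpha, s) - F\,\alpha\,(1-s) + H\,s\,(1-\alpha) - s}
                 {\min(\alpha, s) - s\,\alpha}, \\
V(\alpha = s) &= \frac{s - F\,s\,(1-s) + H\,s\,(1-s) - s}{s - s^{2}}
              = \frac{s\,(1-s)\,(H - F)}{s\,(1-s)}
              = H - F = \mathrm{PSS}.
\end{align*}
```

Under this assumed expression, the value at $\alpha = s$ is exactly PSS, which matches the statement attributed to Richardson (2000) above.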

Conclusion
We mathematically derived a necessary condition for a probabilistic binary forecast to be reliable. The condition is that PSS is maximized at a threshold probability equal to the climatological base rate. The condition was confirmed by using artificially synthesized forecast-outcome pair data and several published probabilistic solar flare forecast models. An important point is that the condition is derived without assuming the form of the probability density function. This means that the condition holds generally. This condition is quite important for the developers of probabilistic forecast systems. When developing a reliable probabilistic binary forecast system, the developer must adjust or train the system so as to maximize PSS at the threshold probability of the climatological base rate. The condition gives a partial answer as to why some probabilistic forecast systems lack reliability, because a system that does not satisfy the condition can never be reliable.