Issue 
J. Space Weather Space Clim.
Volume 7, 2017



Article Number  A20  
Number of page(s)  16  
DOI  https://doi.org/10.1051/swsc/2017018  
Published online  24 August 2017 
Research Article
Verification of operational solar flare forecast: Case of Regional Warning Center Japan
National Institute of Information and Communications Technology (NICT), Tokyo
1848795, Japan
^{*} Corresponding author: kubo@nict.go.jp
Received:
1
November
2016
Accepted:
20
July
2017
In this article, we discuss a verification study of an operational solar flare forecast in the Regional Warning Center (RWC) Japan. The RWC Japan has been issuing fourcategorical deterministic solar flare forecasts for a long time. In this forecast verification study, we used solar flare forecast data accumulated over 16 years (from 2000 to 2015). We compiled the forecast data together with solar flare data obtained with the Geostationary Operational Environmental Satellites (GOES). Using the compiled data sets, we estimated some conventional scalar verification measures with 95% confidence intervals. We also estimated a multicategorical scalar verification measure. These scalar verification measures were compared with those obtained by the persistence method and recurrence method. As solar activity varied during the 16 years, we also applied verification analyses to four subsets of forecastobservation pair data with different solar activity levels. We cannot conclude definitely that there are significant performance differences between the forecasts of RWC Japan and the persistence method, although a slightly significant difference is found for some event definitions. We propose to use a scalar verification measure to assess the judgment skill of the operational solar flare forecast. Finally, we propose a verification strategy for deterministic operational solar flare forecasting. For dichotomous forecast, a set of proposed verification measures is a frequency bias for bias, proportion correct and critical success index for accuracy, probability of detection for discrimination, false alarm ratio for reliability, Peirce skill score for forecast skill, and symmetric extremal dependence index for association. For multicategorical forecast, we propose a set of verification measures as marginal distributions of forecast and observation for bias, proportion correct for accuracy, correlation coefficient and joint probability distribution for association, the likelihood distribution for discrimination, the calibration distribution for reliability and resolution, and the GandinMurphyGerrity score and judgment skill score for skill.
Key words: Solar flare / Multicategorical forecast / Forecast verification / Confidence interval
© Y. Kubo et al., Published by EDP Sciences 2017
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Forecast verification has been recognized as one of the most important topics in terrestrial weather forecast research. In the field of space weather forecast, however, forecast verification is still in the early stages. Recently, some Regional Warning Centers (RWCs), belonging to the International Space Environment Service (ISES), have started to verify their operational forecasts. As a result of this initiative, forecast verification has become recognized as one of the important research topics in the operational space weather forecasting community.
RWC USA (Space Weather Prediction Center at the National Oceanic and Atmospheric Administration: SWPC/NOAA) verified their operational solar flare forecast by using forecast data accumulated over about 12.5 years (Crown 2012). RWC Belgium (Royal Observatory of Belgium: ROB) also verified their operational solar flare forecast and geomagnetic Kindex forecast by using forecast data accumulated over about 8.5 years^{1} (Devos et al. 2014). Other RWCs have also started verification studies of their operational space weather forecasts.
Recently, the Community Coordinated Modeling Center has been planning the Flare Scoreboard, which is an online platform of realtime probabilistic solar flare forecast verification (http://ccmc.gsfc.nasa.gov/challenges/flare.php). RWC Japan (National Institute of Information and Communications Technology: NICT) also has an online platform of operational solar flare and geomagnetic Kindex forecast verification (http://segweb.nict.go.jp/cgibin/forecast/eng_forecast_score.cgi), which compares forecasts of some RWCs. However, as the forecast conditions are not the same and verification methods are insufficient to compare the forecasts, we cannot directly compare the forecast performances among the RWCs. While comparing forecast performances among some RWCs is very informative, we have to proceed with efforts to compare the operational space weather forecasts. Verifying one’s own forecast performance is the first step toward proceeding with the comparison of forecast performances.
The origin of the study of forecast verification goes back to 1884. Finley published some results of his tornado occurrence forecast (Finley 1884). Finley’s results indicated that the accuracy of the tornado forecasts was extremely high, with the probability of correct forecasts exceeding 95%. Shortly after the publication of Finley’s paper, three papers that pointed out the deficiency of Finley’s verification method appeared and alternative verification measures were proposed (Murphy 1996). These events were the start of forecast verification studies. Therefore, forecast verification has a long history in the terrestrial weather forecasting community, and the verification methods of the terrestrial weather forecast are more sophisticated than those of the space weather forecast. In this work, we perform a verification study of an operational solar flare forecast in RWC Japan (hereafter, RWCJ forecast) while referring to the methods of verifying terrestrial weather forecasts.
This article is organized as follows. In Section 2, we describe the RWCJ forecast and solar flare observation data. In Section 3, we describe verification measures with the method of estimating the confidence intervals of the measures. The verification analyses of the RWCJ forecast are described in Section 4. A discussion and summary are given in Sections 5 and 6, respectively. In the Appendices, brief descriptions of the definitions of the verification measures used in this study are shown.
2. RWCJ forecast and flare observation data
The solar flare class is defined on the basis of the 1–8Å Xray flux observation by the Geostationary Operational Environmental Satellite (GOES). The RWCJ forecast gives the expected maximum solar flare class within 24 h from the forecast issuing time. Table 1 shows the definition of the scale of the RWCJ forecast. We can easily recognize from the definition that the RWCJ forecast is a fourcategorical deterministic forecast. Note that the flux level of a specific forecast has an upper bound, for example, when the forecast is “Active”, the expected Xray flux F (Wm^{−2}) is not 10^{−5} to infinity but 10^{−5} to 10^{−4}. The RWCJ forecast is not an automatically determined forecast but a human decision forecast. Forecasters analyze many types of solar data such as the current situation and history of the solar Xray flux, sunspot magnetic field configuration, and chromospheric brightening in active regions. Finally, the forecasters decide deterministically which class of solar activity will most likely occur in the next 24 h. The forecast is issued at 6:00UT every day, so the range of the RWCJ forecast is from 6:00UT to 6:00UT the next day. While the RWCJ forecast started in 1992 and has continued to the present day, we use the RWCJ forecast data accumulated over 16 years, from 2000 to 2015 (5844 days), in this verification study. We did not use the data from 1992 to 1999 because a complete set of the forecast data could not be collected due to some missing data, and the forecast issue time was different until October 1994.
Definition of Regional Warning Center Japan’s operational solar flare forecast.
For solar flare observation data, we use solar activity event lists issued by SWPC/NOAA. The flare peak time and peak flux found in the event lists are used as the flare time and flare class, respectively. We define a day as from 6:00UT to 6:00UT the next day, which is the same as the RWCJ forecast range. The observed flare class of a day is defined as the maximum flare class of that day, the flare time of which is within the forecast range. Because of the automatic detection of Xray flare for the event lists, some small and/or gradual increases in Xray flux cannot be detected and listed as a flare even if it is a Cclass activity. On 22 February 2001, for example, no flare was registered in the event lists although the flare activity was apparently Cclass. Owing to the absence of flares, there were 6 days whose flare activities were incorrectly determined as below Cclass activity instead of Cclass. We corrected these incorrectly determined flare classes. By compiling the forecast and observation data, we obtained 5844 forecastobservation pair data that were analyzed to assess the RWCJ forecast performance.
3. Forecast verification methods
Here we introduce the verification methods for the RWCJ forecast. Using the 5844 forecastobservation pair data, we constructed a contingency table for the RWCJ forecast. Table 2 depicts a fourcategorical contingency table for the RWCJ forecast. The number 0–3 on the forecast and observation axes stands for the flare class codes defined in Table 1.
Fourcategorical contingency table for the RWCJ forecast from 2000 to 2015.
3.1. Verification measures for dichotomous forecasts
There are many scalar measures that can be used to verify the performance of a dichotomous forecast. Although the RWCJ forecast is not a dichotomous forecast, we can apply the RWCJ forecast to the scalar measures by collapsing the fourcategorical forecast to a dichotomous forecast with a certain threshold. In this study, we deal with two types of event thresholds. In the Mthreshold, Mclass or larger flares are defined as events and below Mclass flares are defined as no events. In the Xthreshold, Xclass flares are defined as events and below Xclass flares are defined as no events.
Many scalar verification measures have been proposed since the first paper on terrestrial weather forecast verification by Finley (1884). All scalar verification measures have some advantages and disadvantages, and it is not known which scalar verification measures are the most suitable for operational solar flare forecast verification although Bloomfield et al. (2012) recommended to use Peirce skill score for comparing a performance of solar flare forecasts. However, we may obtain some hints from terrestrial weather forecast verification strategies. The World Meteorological Organization (WMO) published a recommendation on verification methods for tropical cyclone forecasts, which describes the recommended scalar verification measures to assess tropical cyclone forecasts (WMO 2014). WMO (2009) published a recommendation for the verification and intercomparison of precipitation forecast, and this reappeared in WMO (2014) together with an extremal dependence index (EDI), which has emerged as the important factor for rare event forecast verification since WMO (2009). Because tropical cyclones are relatively rare events, tropical cyclone forecasts resemble large class solar flare forecasts in terms of their rarity. Table 3 shows the recommended verification measures for assessing rare events that appeared in WMO (2014), which divided the scalar verification measures into three categories: mandatory, highly recommended, and recommended measures. The mandatory category is composed of hits, misses, false alarms, and correct rejections, which are all elements of a dichotomous forecast contingency table. The highly recommended measures are the frequency bias (FB), proportion correct (PC), probability of detection (POD), false alarm ratio (FAR), and equitable threat score (ETS). The recommended measures are the probability of false detection (POFD), critical success index (CSI), Peirce skill score (PSS), Heidke skill score (HSS), odds ratio (OR), odds ratio skill score (ORSS), and extremal dependence index (EDI). In this study, we estimate these verification measures except for the OR to assess the RWCJ forecast. Because the ORSS is the OR transformed so that its range is from −1 to +1, the OR and ORSS have a similar meaning. While two types of EDIs were proposed, we use a symmetric version of the extremal dependence index (SEDI) because the SEDI has somewhat better properties than a nonsymmetric version of EDI (Ferro & Stephenson 2011). Brief definitions of these scalar verification measures are given in Appendix 1.
We have to note that it is not sure whether this recommendation is the best for operational solar flare forecasting or not. An important implication of the recommendation is that only one verification measure is not enough to correctly assess the forecast performance (Murphy 1991), at least several attributes of forecast system, such as bias, accuracy, discrimination, reliability, and skill, must be assessed by verification measures.
3.2. Verification measure for multicategorical forecast
As already mentioned, most of the scalar verification measures are defined on the basis of the dichotomous contingency table, so the multicategorical contingency table must be collapsed to a dichotomous contingency table with a certain threshold so that the scalar verification measures can be applied to the RWCJ forecast. However, the collapsing process causes a loss of information included in the multicategorical contingency table. Therefore, it is better to directly estimate a scalar verification measure for a multicategorical forecast for the verification of the RWCJ forecasts.
Gandin & Murphy (1992) proposed a scalar verification measure for multicategorical forecasts. This measure is based on a scoring matrix, whose elements denote scores assigned to all elements of a multicategorical contingency table, so it does not require the collapse of a multicategorical contingency table. Moreover, the measure satisfies equitability, which is a highly desirable property for a scalar verification measure. An undesirable property of the measure is that it is not a closed form, meaning that some free parameters are required to estimate them. Gerrity (1992) derived a closed form of the Gandin and Murphytype verification measure, which does not require any free parameters. In this study, we use the scalar verification measure derived by Gerrity (1992), which we hereafter call the GandinMurphyGerrity score (GMGS). A brief definition of the GMGS is given in Appendix 2.
A ranked probability score (RPS) is often used for a verification of a multicategorical probabilistic forecast. If the forecast probabilities are set as one for any category and zero for others, it may be possible that the RPS is used for the verification of the multicategorical deterministic forecast. However, as the RPS applied to a deterministic forecast does not satisfy equitability (see Appendix 3), the GMGS is suitable for a verification of a multicategorical deterministic forecast.
3.3. Confidence interval for verification measures
Many of the forecast verification studies do not take data sampling uncertainty into account. However, because all of the verification measures are calculated from a finite number of sampled data, it is necessary to estimate confidence intervals for the calculated verification measures, as some authors have pointed out (e.g., Stephenson 2000; Jolliffe & Stephenson 2003; Wilks 2006; Jolliffe 2007).
Because the elements of a dichotomous contingency table are regarded as binomial variables, the measures expressed as proportions such as the PC, POD, FAR, POFD, and CSI are sample estimates of the binomial probability . The simplest method of estimating a confidence interval of the binomial probability is the socalled Wald confidence interval. The interval is calculated on the basis of a Gaussian approximation with mean and variance instead of the binomial distribution. The resulting 1 − α confidence interval is(1)where z _{(1−α/2)} is the 1 − α/2 quantile of the standard Gaussian distribution. The Wald confidence interval is simple, but can be rather inaccurate unless the number of samples n is very large (Agresti & Coull 1998).
A more accurate method of estimating a confidence interval was proposed by Wilson (1927). This confidence interval is also based on a Gaussian approximation, but its mean and variance are not assumed by the sample estimate of the binomial probability but by using an unfixed value of the binomial probability as mean p and variance . This interval is derived from the inequality . The resulting 1 − α confidence interval is(2)
This confidence interval is sufficiently accurate even when the number of samples n is quite small. According to Agresti & Coull (1998), a confidence interval derived from later formula with sample number n = 5 is more accurate than the Wald confidence interval derived with sample number n = 100. For the confidence interval of the sample proportion, this formula is better than that for the Wald confidence interval.
There are some verification measures that cannot be written as sample proportions of the elements of the contingency table, such as the FB, ETS, HSS, PSS, ORSS, SEDI, and GMGS. For these verification measures, Equation (2) cannot be applied directly to estimate the confidence interval. In this case, an error propagation rule can be applied to the verification measures written as the function of the POD, POFD, and S (base rate), when estimating the confidence intervals. However, the error propagation rule assumes implicitly that confidence intervals of the POD, POFD, and S are sufficiently small with respect to main values. In this verification study, although most of the verification measures have a sufficiently small confidence interval, some of the measures do not satisfy the assumption for a rare event forecast, in which case the error propagation rule cannot be applied to estimate the confidence interval.
To overcome these problems, we use a bootstrap method to estimate confidence intervals for scalar verification measures. The bootstrap method is becoming a popular means of constructing confidence intervals in statistical analyses and is based on a resampling procedure from an original data set. Many bootstrap replicates of our interest, such as scalar verification measures, are calculated from many resampled data sets, then their distribution is estimated from the many replicates. The confidence intervals are calculated by estimating a 1 − α/2 quantile from the distribution. This means that an assumption about the distribution is not required in the bootstrap methods, which is a major advantage over the other methods. Details of the method have appeared in textbooks (e.g., Efron & Tibshirani 1993). While there are many types of bootstrap confidence intervals, we use a BCa confidence interval in this study (BCa stands for “biascorrected and accelerated”). Because the BCa confidence interval includes corrections of bias and skew of bootstrap distribution, the BCa confidence interval has secondorder accuracy^{2} in the sample number n, whereas the simplest bootstrap confidence interval has an accuracy of the first order in n (e.g., DiCiccio & Efron 1996). Bootstrap samples, which are randomly sampled with replacement from original data set, are produced by the MonteCarlo method. The number of bootstrap samples is the same as the original data set (5844). The number of bootstrap replicates is 10,000 in this study. In statistical analyses, 95% confidence intervals (α = 0.05) are often used, so we also estimate 95% confidence intervals in this study.
3.4. Distributionoriented approach
In a distributionoriented approach, a contingency table is regarded as a joint probability distribution for a pair of forecasts and observations, which is derived by dividing the elements of the contingency table by the total number of elements. The joint probability distribution is related to the attribute of the association between forecasts and observations.
A joint probability distribution can be factorized into a marginal distribution and a conditional probability distribution. There are two types of factorization in the forecast verification framework (Murphy & Winkler 1987). One is a calibrationrefinement factorization and the other is a likelihoodbase rate factorization. In the calibrationrefinement factorization, the joint probability is factorized into the marginal probability of the forecasts and the conditional probability of the observations given the forecasts: p(f, o) = p(of) · p(f). In the likelihoodbase rate factorization, the joint probability is factorized into the marginal probability of the observations and the conditional probability of the forecasts given the observations: p(f, o) = p(fo) · p(o). The p(of) and p(fo) are called the calibration distribution and likelihood distribution, respectively. The calibration distribution is related to attributes of reliability and resolution, whereas the likelihood distribution is related to the attribute of discrimination. Marginal distributions are related to the attribute of bias. Details of the distributionoriented approach can be found in Murphy & Winkler (1987).
4. RWCJ forecast verification
4.1. Verification of overall data
In this section, we describe the results of the verification of the RWCJ forecast using data accumulated over 16 years. As mentioned in Section 3.1, a multicategorical forecast can be collapsed to a dichotomous forecast by setting a specific threshold when the conventional scalar verification measures are calculated. In this study, we deal with the M and Xthresholds defined in Section 3.1. Table 4 summarizes the estimated scalar verification measures with a 95% confidence interval for the M and Xthresholds.
Estimated scores of verification measures for events defined as larger than or equal to the M and Xclass flares.
The scores of FB are almost unbiased in the Mthreshold, meaning that the number of forecasted events is almost the same as the number of observed events. On the other hand, events are underforecasted in the Xthreshold.
The accuracy of the RWCJ forecast seems to be extremely high and the Xthreshold seems to be more accurate than the Mthreshold according to the scores of PC. According to the numbers of hits and correct rejections in Table 4, most of the correct forecasts are forecasts of null events, and the null events can be forecasted easily when the event is quite rare. This is why the scores of PC are extremely high, especially for the Xthreshold. On the other hand, the Xthreshold is less accurate than the Mthreshold in terms of the scores of CSI. As described in Appendix 1, the correct rejections are not taken into consideration when calculating the CSI. This is why the scores of CSI show the opposite trend to those of PC. When the correct forecasts of null events do not have an essential meaning, the CSI becomes a good measure of accuracy although an intuitive meaning of the measure is somewhat ambiguous.
The POD and POFD are the verification measures of discrimination, which is related to the likelihood distribution of the dichotomous contingency table. The discrimination means the ability of a forecast system to discriminate whether an event would occur under a condition of realized events. The scores of POD and POFD for the M and Xthresholds are shown in Table 4. The POD for the Mthreshold is reasonably good, meaning that the RWCJ forecast can discriminate the occurrence of below Mclass flares and M or Xclass flares. On the other hand, the POD for the Xthreshold is small, meaning that the RWCJ forecast cannot effectively discriminate the occurrence of below Xclass and Xclass flares. Because the POFD is a negative orientation measure, a small POFD means better performance. From the scores of POFD, it appears that the Xthreshold has a better performance than the Mthreshold. However, this is an invalid conclusion because the rarer the event, the larger the number of null events and the smaller the POFD. This means that the POFD approaches zero regardless of the discrimination performance when the event frequency approaches zero.
The reliability of a forecast can be expressed by the FAR. The FAR pertains to the relationship of a forecast to an average observation given the specific forecast and the calibration distribution of the dichotomous contingency table. The FAR is a negative orientation measure. For the RWCJ forecast, the Xthreshold has bad performance, as shown in Table 4. The 55–75% alarms issued for the occurrence of Xclass flare were false alarms. This rate may be too large when the forecast is used for decision making to activate some countermeasures.
The ETS, HSS, and PSS are all verification measures of skill. All of these verification measures have significant positive values. All of these verification measures have larger scores for the Mthreshold than for the Xthreshold. It seems that the forecast skill of the Mthreshold is better than that of the Xthreshold. However, according to Stephenson et al. (2008), the scores of ETS, HSS, and PSS tend to degenerate to zero irrespective of their skill and behave as the trivial noninformative limit for vanishingly rare events. Therefore, we have to pay attention to the interpretation of the scores of the skill measures when comparing the scores for different event frequencies such as M and Xthresholds.
The association describes the overall strength of the relationship between the individual pairs of forecasts and observations. According to the ORSS, the association between the forecasts and the observations is reasonably strong for both the M and Xthresholds. The ORSS for Xthreshold is higher than that of Mthreshold. This is easily accounted for from a definition of the ORSS (see Appendix 1). When the POFD is much smaller than the POD like the Xthreshold, the ORSS approaches one. However, when the POFD is as same order of magnitude as the POD, the ORSS approaches zero even when the POFD is small enough (see Fig. 3). We note that the association and the accuracy seem to have the same attributes; however, they are different. The accuracy describes the correspondence between the individual pairs of forecasts and observations, while the association describes the overall strength of the relationship between them.
The SEDI was proposed by Ferro & Stephenson (2011) and was designed to verify the performance of extremely rare event forecasts. The advantage of this measure is that the score for vanishingly rare events does not converge to zero but to a nontrivial meaningful value, whereas conventional measures such as the HSS, PSS, and ETS tend to be zero for vanishingly rare events. The scores of the SEDI in Table 4 show that the RWCJ forecast for rare events is significantly better than random forecasting, for which the score of SEDI is zero.
In the rest of the subsection, a verification of the fourcategorical forecast is given. Figure 1a depicts the joint probability distribution p(f, o) for the RWCJ forecast calculated from the fourcategorical contingency table in Table 2. A good association between the RWCJ forecast and the observation can be seen in the figure. The correlation coefficient (CC) between them is estimated to be 0.717 with a 95% confidence interval of [0.703, 0.730].
Fig. 1. Probability distributions for the RWCJ forecast. (a) Joint probability density. (b) Marginal distributions of forecast and observation. (c, d) Calibration and likelihood distributions, respectively. The black dots connected with lines in (c) and (d) stand for the conditional expectation values of the observation given the forecast and of the forecast given the observation, respectively. 
Figures 1b shows the marginal distributions of the RWCJ forecast and observation. The figure shows that the RWCJ forecast is almost unbiased, although there is slight overforecasting (underforecasting) for the Mclass (Xclass) flare. Figures 1c and 1d show the calibration distribution and the likelihood distribution, respectively. The black dots with numbers connected by the line in Figures 1c and 1d are the conditional expectation values of the observation given the forecast and of the forecast given the observation, respectively. We can recognize from Figure 1c that the flare class that most frequently occurred under the condition that a specific flare class had been forecasted was the same as the forecasted flare class except when an Xclass flare had been forecasted. When Xclass flares had been forecasted, the most frequently occurring flares were Mclass flares. This means that the reliability of the RWCJ forecast is good for below Xclass flares but not good for Xclass flares. As shown in Figure 1d, under the condition that a specific class flare occurred, the most frequently forecasted flare class was the same as the observed class except for the condition that an Xclass flare occurred. The most frequently forecasted flare class under the condition that an Xclass flare occurred was an Mclass flare. This means that the RWCJ forecast cannot successfully discriminate between the occurrences of Xclass and below Xclass flares.
The GMGS was calculated from the fourcategorical contingency table in Table 2. As already mentioned, because the GMGS satisfies equitability, the score for unskillful forecasts becomes zero. For the RWCJ forecast shown in Table 4, the score is significantly larger than zero, meaning that the RWCJ forecast has some forecast skill.
4.2. Comparison with persistence and recurrence methods
It was shown in Section 4.1 that the RWCJ forecast seems to have reasonable performance. In this subsection, we compare the performance of the RWCJ forecast with those of two other forecasting methods: a persistence method and a recurrence method. In the persistence method, today’s forecast is the same as yesterday’s observation results. In the recurrence method, today’s forecast is the same as the observation results 27 days ago. As solar flare activity is not independent between consecutive days, it is expected that the persistence method have some forecast performance. As the solar rotation period when viewed from Earth is almost 27 days, the recurrence method is also expected to have some forecast performance. Therefore, it is useful to compare the three forecasting methods to assess the RWCJ forecast performance.
Figure 2a shows comparisons of various verification measures among the three forecasting methods for the Mthreshold. The magenta, cyan, and yellow bars show the resultant scores for RWCJ, persistence, and recurrence methods, respectively. The black intervals drawn on the colored bars stand for the 95% confidence intervals of the scores. We can easily recognize that all three methods are skillful forecasts because all the scores of ETS, HSS, and PSS are positive. All the verification measures, except for FB, for the recurrence method have significantly worse scores than the RWCJ forecast (FAR and POFD are negative orientation measures), meaning that the performance of the RWCJ forecast is significantly better than that of the recurrence method. On the other hand, the differences in the scores between the RWCJ forecast and the persistence method are small for all the verification measures, and the 95% confidence intervals overlap each other. To investigate the difference in scores between the RWCJ forecast and persistence method more precisely, we show the difference in the various scores between the RWCJ forecast and the other two methods with 95% confidence intervals in Figure 2b. The red bars stand for the differences in scores between the RWCJ forecast and the persistence method while the blue bars stand for the differences between the RWCJ forecast and the recurrence method. We can recognize that all the verification measures have slightly better scores for the RWCJ forecast than for the persistence method. However, the 95% confidence intervals for some measures such as the PC, FAR, POFD, and ORSS include the zero score. This means that we cannot definitely conclude that there are significant differences in the scores between the RWCJ forecast and the persistence method. What we can conclude from the results for the performances of the RWCJ forecast and the persistence method is that (1) there is no significant difference in the accuracy, but the accuracy is slightly better when correct forecasts of a null event are not essential, (2) discrimination is slightly better for the RWCJ forecast, (3) there is no significant difference in reliability, (4) the RWCJ forecast has slightly better skill than the persistence method, (5) there is very slight or no significant difference in association, and (6) the performance of extreme event forecasts is slightly better for the RWCJ forecast. In summary, the RWCJ forecast for the Mthreshold seems to have a slightly better performance than the persistence method.
Fig. 2. Comparison of various verification measures among the RWCJ forecast, persistence method, and recurrence method for the Mthreshold. (a) Absolute scores of the verification measures. Magenta, cyan, and yellow stand for the RWCJ forecast, persistence method, and recurrence method, respectively. (b) Difference in the scores between the RWCJ forecast and the persistence method as red and that between the RWCJ forecast and the recurrence method as blue. The black bars stand for the 95% confidence intervals of the scores. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 
Figure 3 is the same as Figure 2 except that the event definition is the Xthreshold. From Figure 3a, we can immediately recognize that the RWCJ forecast and persistence method have some forecast performance, whereas the recurrence method has no performance. This may mean that the recurrence of an active region that produced an Xclass flare during the last solar rotation period provides little information on the productivity of the Xclass flare. Differences in scores between the RWCJ forecast and the persistence method are shown as red bars in Figure 3b. For almost all verification measures, the 95% confidence intervals include the zero score, meaning that we cannot definitely conclude that there is a significant difference between the performances of the RWCJ forecast and the persistence method for the Xthreshold.
Fig. 3. Same as Figure 2, except that the event definition is the Xthreshold. 
The scores of GMGS for the three forecast methods are included in Figures 2 and 3. The scores drawn in these two figures are exactly the same because the GMGS is calculated directly from the fourcategorical contingency table, not a collapsed dichotomous contingency table. Therefore, the measures express the performances of the fourcategorical forecast system. The scores of the GMGS show that all three forecast methods have some skill as a fourcategorical forecast system. The skill of the recurrence method is significantly worse than those of the other two methods. The difference between the RWCJ forecast and the persistence method may not be significant.
4.3. Verification of subset data
Hamill & Juras (2006) revealed that some conventional scalar verification measures such as the ETS were prone to give an unexpectedly increased score when the climatological frequency of the event occurrence varies among pooled samples. In the case of solar flare forecasts, the climatological frequency of event occurrence appears to vary chronologically owing to changing solar activity. Therefore, we apply the verification study to subset data, which are divided by solar activity levels. Figure 4 depicts a chronological history of the Mthreshold event occurrence defined in Section 4.1. The blue vertical lines show the maximum Mthreshold solar flare events within 24 h from the forecast issue time. The red dots show the cumulative number of maximum Mthreshold events counted from 1 January 2000 plotted against the event occurrence date. The slope of the red dots can be regarded as the climatological frequency of event occurrence. We can clearly recognize that there are four separate periods, during which the event frequencies are almost constant, which are shown by the dashed gray lines. Therefore, we divide the 16 years of data into four subsets: 2000–2002 (subset1), 2003–2005 (subset2), 2006–2010 (subset3), and 2011–2015 (subset4).
Fig. 4. Chronological history of the Mthreshold event occurrence. The blue vertical lines show the maximum Mthreshold events within 24 h from the forecast issue time. The red dots show the cumulative number of maximum Mthreshold events plotted against the event occurrence date. The gray dashed lines are fitted lines of cumulative number of events for each subset. 
Figures 5 and 6 show the scores of various verification measures for the four subsets with a whole data set as references for the M and Xthresholds, respectively. The cyan, yellow, red, and blue bars are for subset1 through subset4, respectively, with magenta used for the whole data set. In the Mthreshold, the FB shows that subset1 and 2 are overforecasting while subset3 and 4 are underforecasting, although subset3 has very wide 95% confidence interval. According to the PC, accuracy seems to be the best for subset3. However, the high score of subset3 is ascribed by correct forecast of null events because the CSI of subset3 is significantly lower than those of other three subsets. Comparing the subset2 and 4, which have almost the same event frequency, discrimination shown by the POD of subset2 is better than that of subset4. From the POD and FAR, we can recognize that subset3 has low discrimination and reliability than other three subsets. For the forecast skill, we can see from the ETS, HSS, and PSS that subset2 seems to have the best forecast skill among the four subsets, although a slight overlap of the 95% confidence intervals exists. The SEDI shows that the performance of extreme event forecast is subset2, 4, and 1 in descending order, except for subset3 which has large uncertainty. In the Xthreshold, we can recognize from the FB that subset1 and 4 are large underforecasting. Accuracy shown by the CSI is subset2, 4, and 1 in descending order, although 95% confidence intervals overwrap each other. Low PODs for subset1 and 4 would be ascribed by large underforecasting. There seems to be no significant difference for reliability among all subsets because all the 95% confidence intervals of FAR overwrap each other. It seems that subset2 has the best forecast skill among the subsets except for subset3, for which we cannot comment on the forecast skill because the 95% confidence intervals for subset3 are extremely wide. The SEDI has the same pattern as the POD because the POFDs for all subsets have almost the same values and the SEDI is calculated by only the POD and POFD. Regarding the skill of fourcategorical forecast, the GMGS shows that the best forecast skill seems to be for subset2, although the upper limit of the relatively wide confidence interval of subset3 is higher than that of subset2.
Fig. 5. Comparison of various verification measures with 95% confidence interval for the Mthreshold for four subsets with the overall data as a reference. The magenta, cyan, yellow, red, and blue bars stand for 2000–2015 (overall), 2000–2002 (subset1), 2003–2005 (subset2), 2006–2010 (subset3), and 2011–2015 (subset4), respectively. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 
As already shown in Section 4.2, the skill score of the RWCJ forecast is often similar to that of the persistence method. This may imply that a relatively high score of subset2 is due to the persistence of event occurrence because the solar activity during the period of subset2 was reasonably high. To investigate this point, the differences in score between the RWCJ forecast and the persistence method for each subset for the Mthreshold are drawn in Figure 7. For most of the verification measures, a zero score is included in the 95% confidence interval, meaning that we cannot conclude that there are significant differences between the RWCJ forecast and the persistence method for all the subsets. The comparison of each subset for a specific verification measure is also important. For the verification measure of the skill (ETS, HSS, and PSS), the largest differences in scores appear in subset4. For subset2, the difference in score is positive but the smallest among all the subsets except for subset3. Although we cannot give a definitive conclusion because the difference among the subsets is small and the confidence intervals overlap each other, the relatively high score of subset2 is probably due to the persistence of event occurrence. Moreover, the best judgment skill, which is defined in Section 5, for the Mthreshold may be in subset4 because the difference in scores of the ETS, HSS, and PSS between the RWCJ forecast and the persistence method for the subset4 is the largest among those for all subsets. In the Xthreshold, we cannot comment on the significance of the score differences because of the extremely wide confidence intervals compared with the score difference (not shown).
Fig. 7. Difference in scores with 95% confidence interval for Mthreshold event between the RWCJ forecast and the persistence method. The magenta, cyan, yellow, red, and blue bars stand for 2000–2015 (overall), 2000–2002 (subset1), 2003–2005 (subset2), 2006–2010 (subset3), and 2011–2015 (subset4), respectively. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 
5. Discussion
As already mentioned in Section 4.2, the persistence method, as well as the RWCJ forecast, is a skillful forecast method. However, the persistence method is determined by only the observation result of the previous day, namely, the persistence method is a skillful forecast method without judgment. On the other hand, the RWCJ forecast (and other operational solar flare forecasts) includes a judgment process to determine the issued forecast. If the skill scores of the operational solar flare forecast are smaller than those of the persistence method, the contribution of the judgment of flare occurrence to the operational solar flare forecast is almost zero. Therefore, it would be better to assess the judgment skill for the operational solar flare forecast in addition to forecast skills, which are assessed by the verification measures of ETS, HSS, and PSS. In Sections 4.2 and 4.3, we estimated the difference in the scores between the RWCJ forecast and the persistence method as one of the assessments of the judgment skill. It is also useful for estimating a skill measure with reference to the persistence method to assess the judgment skill (called a judgment skill measure henceforth). Similar to the HSS, the judgment skill (JS) measure based on the PC is defined as(3)where a, b, c, and d are the elements of the dichotomous contingency table (see, Table 5). The elements with subscript “p” are for the persistence method. The scores of the JS are 1 and 0 for a perfect forecast and the persistence method, respectively. However, for a perfectly incorrect forecast the score of the JS is not −1 but depends on the score of the persistence method. One of the most important characteristics that a verification measure of skill should have is equitability (Gandin & Murphy 1992). The equitability means that the unskillful forecasts involving forecasting “yes” every time, forecasting “no” every time, and forecasting at random must have the same scores. However, the JS does not satisfy equitability. This is not a good characteristic for a verification measure of skill. Recognizing this characteristic, it is better to use the judgment skill measure JS. For the RWCJ forecast for overall data, the JS is estimated to be 0.0639 [−0.0305, 0.145] for the Mthreshold and 0.223 [0.0843, 0.332] for the Xthreshold. Since the 95% confidence intervals of the scores of JS for M and Xthresholds are largely overwrapped, the difference in the JS scores is probably not significant. We have to pay attention to interpretation of the JS for different event thresholds in case of PC ~ PC_{p}, because a denominator of the JS is dominated by the number of false alarms (b) and misses (c), which can be small for rare event forecast, and the JS can be large irrespective of their judgment skill. Similar to the ETS, the judgment skill measure based on the CSI can also be considered, whose definition is the same as the ETS except with a _{r} replacing a _{p}. However, this formulation has a critical defect. When the judgment skill is very poor, it may be possible that a _{p} is larger than a + b + c. In this case, the defined formula is positive despite the very poor judgment skill. Therefore, the formulation based on the CSI cannot be used to assess the judgment skill.
Contingency table for dichotomous forecast.
As the RWCJ forecast is a fourcategorical forecast, the performance assessment of only the collapsed dichotomous forecast is insufficient, and the direct assessment of a fourcategorical contingency table is required in addition to the verification as a dichotomous forecast with some thresholds. However, a small scalar verification measure for a multicategorical contingency table has been discovered, for example, the GMGS for measure of skill. For accuracy, we can use a proportion correct extended to a multicategorical contingency table (PC_{m}), which is defined as PC_{m} = ∑_{ i } p _{ ii }, where p _{ ii } is the diagonal element of a joint probability distribution for the multicategorical contingency table (e.g., Jolliffe & Stephenson 2003). No scalar verification measures for other attributes such as reliability, resolution, and discrimination have been proposed yet. Therefore, a distributionoriented approach such as the discussion of a joint probability distribution in Section 4.1 is also required. We propose a verification strategy for a multicategorical deterministic operational solar flare forecast as follows: marginal distributions for bias, PC_{m} for accuracy, CC and joint probability distribution for association, the likelihood distribution for discrimination, the calibration distribution for reliability and resolution, the GMGS for forecast skill, and JS_{m} for judgment skill, whose definition is the same as the JS except with PC replaced by PC_{m} and PC_{p} replaced by PC_{mp}, in addition to the verification as dichotomous forecasts with M and Xthresholds. The scores of the proposed scalar verification measures for the overall data of multicategorical RWCJ forecast are summarized in Table 6. The accuracy as the fourcategorical forecast is reasonably good because PC_{m} is 68–71%. According to CC and joint probability distribution, the association for fourcategorical forecast is also reasonably good. However, forecast skill is not so high although it may be acceptable depending on the user needs. As the GMGS imposes a larger penalty for multicategory error than for onecategory error, reducing the multicategory error will lead to a high score. Although the confidence interval of the JS_{m} does not include zero, the judgment skill as fourcategorical forecast is small.
Estimated scores of multicategorical verification measures for the overall data of RWCJ forecast.
On the other hand, the dichotomous forecast verification strategy for a deterministic operational solar flare forecast is under discussion. As already mentioned, many verification measures for a dichotomous deterministic forecast have been proposed since the paper by Finley (1884). Because all verification measures have both advantage and disadvantage, it is difficult to determine which verification measure is the best for operational solar flare forecasting. As we already mentioned, the CSI will be more suitable than the PC for rare event forecasting. However, an intuitive meaning of the CSI is somewhat ambiguous, while a meaning of the PC is completely clear, which means a percentage of correct forecast over whole forecast. For a verification measure of skill, the ETS is commonly used in terrestrial weather forecasting community, while the PSS is recommended by Bloomfield et al. (2012) in the space weather forecasting community. The PSS is baserateindependent measure, which is a good characteristic for verification measure of skill, while the ETS is baseratedependent measure. On the other hand, while the PSS are designed on the basis of the PC, the ETS is defined on the basis of the CSI. So, it is a difficult question as to which measure is more suitable for verifying rare event forecasts. These kinds of discussions appear frequently for other verification measures. With the understanding of this situation, we propose a verification strategy for dichotomous forecast for rare event as follows: FB for bias, PC and CSI for accuracy, POD for discrimination, FAR for reliability, PSS for forecast skill, and SEDI for association. As the SEDI is formally a nonlinear transformation of the OR (Ferro & Stephenson 2011) although its derivation is completely different, the SEDI can be regarded as the verification measure of association for rare event forecast. For the verification measure of skill, although there are many discussions on their usability, the PSS is selected for respecting Bloomfield et al. (2012). We have to note that this is just one suggestion. We think further researches on verification measures themselves are highly required to determine the best verification strategy for dichotomous deterministic operational solar flare forecast.
As already mentioned in Section 1, comparing forecast performances among some RWCs is informative. We briefly discuss the comparison between the RWCJ forecast and the RWC Belgium (ROB) solar flare forecast (Devos et al. 2014). As a verification study of RWC USA (Crown 2012) is for probabilistic forecast of solar flare, it cannot be compared directly with RWCJ forecast. Scores for some verification measures for Mthreshold for ROB are shown in Table 3 of Devos et al. (2014). Comparison between scores for ROB and RWCJ forecast (Table 4 in this article) shows the following results: (1) RWCJ is almost unbiased while ROB is obviously underforecasted, (2) accuracy of RWCJ is a little worse than that of ROB according to PC, however, excluding trivial correct forecasts (correct rejections) leads to reverse result (CSI of ROB is estimated as 0.311), (3) discrimination of RWCJ is obviously better than that of ROB according to POD, (4) reliability of RWCJ is somewhat worse than that of ROB because FAR of RWCJ is larger than that of ROB, (5) forecast skill of RWCJ is somewhat better than that of ROB, (6) the performance of extreme event forecast of RWCJ is slightly better than that of ROB because SEDI of ROB is estimated as 0.594 from their Table 3. The obvious underforecasting of ROB may lead to the small POD and FAR. Because a large overforecasting (underforecasting) leads sometimes to a large POD and FAR (small POD and FAR), an unbiased forecast will be required for a good forecasting system. We should stress that a period of verified data used in Devos et al. (2014) is different from that of this study.
RWCJ forecast is a fourcategorial deterministic forecast, which is defined by ISES. However, some readers may think that only M and Xthreshold forecasts are enough because B or Cclass flares will not affect social infrastructures. This claim may be true. On the other hand, the M and Xthreshold forecasts can be easily made by setting some threshold in the fourcategorical forecast. Of course, no (or almost no) flare forecast such as less than or equal to Bclass flare forecast can be made from the fourcategorical forecast by setting the threshold between categories of B and Cclass flares. Therefore, various threshold forecasts, which the forecast users require, can be made from the fourcategorical forecast. As many RWCs belonging to ISES have issued fourcategorical forecast, issuing the fourcategorical forecast is preferable for the forecast verification researches among some RWCs. Another preferable forecast is a probabilistic forecast. Because deterministic solar flare forecast based on the underlying physics is unlikely to be realized, a probabilistic forecast is essential (Kubo 2008). We think that a future direction of our operational solar flare forecast will be a probabilistic forecast.
Forecast performance comparison with a persistence and recurrence method was performed in this study. This is a common approach as a verification study of space weather as well as terrestrial weather forecasting, for example, Devos et al. (2014). This approach is also justified by the scientific researches of solar flare occurrence. McCloskey et al. (2016) showed that time evolution of flare occurrence is important factor for flare forecasting. This fact will be related with the persistence method to be skillful. Démoulin et al. (2002) and Green et al. (2002) showed by analyzing long lifetime active regions that magnetic flux and helicity stored in an active region, which are expelled by solar flares accompanied by coronal mass ejections, are often reduced during successive solar rotations. This fact may be related with that the recurrence method is almost unskillful forecast, which was shown by verification of Xthreshold for recurrence method in this study.
6. Summary
A verification study on the operational solar flare forecast in the Regional Warning Center Japan (RWCJ forecast) was performed for the first time. Forecast and observation pair data accumulated over 16 years from 2000 to 2015 were used in the study. We estimated various types of scalar verification measures with 95% confidence intervals for the overall data of the RWCJ forecast, and they were compared with those of persistence and recurrence methods. The performance of the recurrence method is significantly worse than that of the RWCJ forecast. However, the score difference in various verification measures between the RWCJ forecast and the persistence method is small, and we could not conclude definitely that there were significant performance differences between these two forecast methods, although a slightly significant difference was found for the Mthreshold events. We also compared various types of scalar verification measures among four subsets of data, whose longterm event frequencies were almost the same. The forecast skill for 2003–2005 seemed to be the best among the four subsets; however, the better forecast skill for 2003–2005 seemed to be due to the persistence of solar activity. The judgment skill seemed to be the best during 2011–2015. Finally, we proposed the use of the judgment skill score to assess the judgment skill of an operational solar flare forecast and the verification strategy for a dichotomous and a multicategorical operational solar flare forecast.
Acknowledgments
We would like to thank SWPC/NOAA for compiling the GOES Xray flare events list. We would also like to thank anonymous referees for useful comments to improve the manuscript. The editor thanks David Jackson and an anonymous referee for their assistance in evaluating this paper.
Other bootstrap confidence intervals with secondorder accuracy, such as a bootstrapt confidence interval and an ABC confidence interval, are also proposed (e.g., DiCiccio & Efron 1996).
References
 Agresti, A., and B.A. Coull. Approximate is Better than “Exact” for Interval Estimation of Binomial Proportions. Amer. Stat., 52, 1, 1998. [Google Scholar]
 Bloomfield, D.S., P.A. Higgins, R.T. James McAteer, and P. Gallagher. Toward reliable benchmarking of solar flare forecasting methods. Astrophys. J., 747, L41, 2012, DOI: 10.1088/20418205/747/L41. [NASA ADS] [CrossRef] [Google Scholar]
 Crown, M.D. Validation of the NOAA Space Weather Prediction Center’s solar flare forecasting lookup table and forecasterissued probabilities. Space Weather, 10, S06006, 2012, DOI: 10.1029/2011SW000760. [CrossRef] [Google Scholar]
 Démoulin, P., C.H. Mandrini, L. van DrielGesztelyi, B.J. Thompson, S. Plunkett, Z. Kövári, G. Aulanier, and A. Young. What is the source of the magnetic helicity shed by CMEs? The longterm helicity budget of AR 7978. A&A, 382, 650, 2002. DOI: 10.1051/00046361:20011634. [NASA ADS] [CrossRef] [EDP Sciences] [Google Scholar]
 Devos, A., C. Verbeeck, and E. Robbrecht. Verification of space weather forecasting at the Regional Warning Center in Belgium. J. Space Weather Space Clim., 4, A29, 2014, DOI: 10.1051/swsc/2014025. [CrossRef] [EDP Sciences] [Google Scholar]
 DiCiccio, T.J., and B. Efron. Bootstrap confidence intervals. Stat. Sci., 11, 189, 1996. [CrossRef] [Google Scholar]
 Efron, B., and R.J. Tibshirani. An introduction to the bootstrap. Chapman and Hall, New York, 1993. [CrossRef] [Google Scholar]
 Ferro, C.A.T., and D.B. Stephenson. Extremal dependence indices: improved verification measures for deterministic forecasts of rare binary events. Wea. Forecasting, 26, 699, 2011, DOI: 10.1175/WAFD1005030.1. [CrossRef] [Google Scholar]
 Finley, J.P.. Tornado predictions. Amer. Meteor. J., 1, 85, 1884. [Google Scholar]
 Gandin, L.S., and A.H. Murphy. Equitable skill score for categorical forecasts. Mon. Wea. Rev., 120, 361, 1992. [CrossRef] [Google Scholar]
 Gerrity, J.P. A note on Gandin and Murphy’s equitable skill score. Mon. Wea. Rev., 120, 2709, 1992. [CrossRef] [Google Scholar]
 Green, L.M., M.C. López Fuentes, C.H. Mandrini, P. Démoulin, L. van DrielGesztelyi, and J.L. Culhane. The magnetic helicity budget of a CMEprolific active region. Sol. Phys., 208, 43, 2002. [NASA ADS] [CrossRef] [Google Scholar]
 Hamill, T.M., and J. Juras. Measuring forecast skill: is it real skill or is it the varying climatology? Q. J. R. Meteorol. Soc., 132, 2905, 2006. DOI: 10.1256/qj.06.25. [CrossRef] [Google Scholar]
 Jolliffe, I.T. Uncertainty and inference for verification measures. Wea. Forecasting, 22, 637, 2007, DOI: 10.1175/WAF989.1. [CrossRef] [Google Scholar]
 Jolliffe, I.T., and D.B. Stephenson. Forecast verification: a practitioner’s guide in atmospheric science. 1st edn., John Wiley and Sons Ltd, UK, 2003. [Google Scholar]
 Kubo, Y. Statistical models for the solar flare interval distribution in individual active regions. Sol. Phys., 248, 85, 2008, DOI: 10.1007/s1120700891356. [CrossRef] [Google Scholar]
 McCloskey, A.E., P.T. Gallagher, and D.S. Bloomfield. Flaring rates and the evolution of sunspot group mcintosh classifications. Sol. Phys., 291, 1711, 2016, DOI: 10.1007/s112070160933y. [CrossRef] [Google Scholar]
 Murphy, A.H. Forecast verification: its complexity and dimensionality. Mon. Wea. Rev., 119, 1590, 1991. [CrossRef] [Google Scholar]
 Murphy, A.H. The Finley affair: a signal event in the history of forecast verification. Wea. Forecasting, 11, 3, 1996. [CrossRef] [Google Scholar]
 Murphy, A.H., and R.L. Winkler. A general framework for forecast verification. Mon. Wea. Rev., 115, 1330, 1987. [CrossRef] [Google Scholar]
 Stephenson, D.B. Use of the “odds ratio” for diagnosing forecast skill. Wea. Forecasting, 15, 221, 2000. [CrossRef] [EDP Sciences] [Google Scholar]
 Stephenson, D.B., B. Casati, C.A.T. Ferro, and C.A. Wilson. The extreme dependency score: a nonvanishing measure for forecast of rare events. Meteorol. Appl., 15, 41, 2008, DOI: 10.1002/met.53. [CrossRef] [EDP Sciences] [Google Scholar]
 Wilks, D.S. Statistical methods in the atmospheric sciences. 2nd edn., Elsevier, Oxford, 2006. [Google Scholar]
 Wilson, E.B. Probable inference, the law of succession, and statistical inference. J. Amer. Stat. Assoc., 22, 209, 1927. [CrossRef] [Google Scholar]
 WMO. Recommendation for the Verification and Intercomparison of QPFs and PQPFs from Operational NWP Models, WWRP 20091. 2009, available at https://www.wmo.int/pages/prog/arep/wwrp/new/documents/WWRP20091_web_CD.pdf. [Google Scholar]
 WMO. Verification Methods for Tropical Cyclone Forecasts, WWRP 20137. 2014, available at https://www.wmo.int/pages/prog/arep/wwrp/new/documents/WWRP_2013_7_TC_verification_15_Nov_en.pdf. [Google Scholar]
Appendix 1
We briefly introduce the definitions of the scalar verification measures for the dichotomous forecast used in this study. Table 5 is a contingency table for the dichotomous forecast.
Appendix 2
We briefly introduce the definition of the GandinMurphyGerrity score (GMGS) used in this study. A content of this Appendix is following Gandin & Murphy (1992) and Gerrity (1992).
Define an Ncategorical contingency table P with an element p _{ ij }, which is a relative frequency of an observation falling category i and a forecast falling category j. The GMGS is calculated as(4)where S is an Nrank scoring matrix with an element s _{ ij }. The s _{ ij } is determined for the GMGS satisfying equitability condition. The condition is written as follows:(5) (6)where p _{ i } is the relative observation frequency of category i defined as(7)
Equation (5) means that scores of the GMGS for the forecast with always category i vanish. Equation (6) means that score of the GMGS for perfect forecast is one. Symmetry of scoring matrix (s _{ ij } = s _{ ji }) is also imposed.
Gerrity (1992) found a closed form scoring matrix satisfying the all conditions described above. The scoring matrix is defined as(8) (9) (10) (11)
As we can recognize from the definition that the elements of scoring matrix are determined by only the observation frequency, so they do not depend on the forecast frequency.
The GMGS has a score of zero for unskillful forecast and one for perfect forecast. The GMGS for Ncategorical contingency table is mathematically equal to an arithmetic mean of (N − 1)number PSSs, which are calculated from dichotomous contingency tables collapsed with threshold k (). Therefore, the GMGS reaches the PSS for a case of dichotomous forecast (i.e., N = 2).
Appendix 3
We briefly show that a ranked probability score (RPS) applied to a multicategorical deterministic forecast does not satisfy equitability.
An RPS for Ncategorical probabilistic forecast is defined as(12)where F _{ n } and O _{ n } are cumulative probabilities at category n of forecast and observation, respectively (e.g., Jolliffe & Stephenson 2003). The expectation value E(⋯) is calculated for all forecastobservation pairs. When an observation falls category i, the O _{ n } is zero and one for n < i and n ≥ i, respectively. If the forecast probabilities are set as one for category j and zero for others (meaning deterministic forecast for category j), the F _{ n } is zero and one for n < j and n ≥ j, respectively. Therefore, the RPS applied to the deterministic forecasts can be written as(13)
When an Ncategorical contingency table P with an element p _{ ij }, which is a relative frequency of an observation falling category i and a forecast falling category j, is constructed from all forecastobservation pairs, Equation (13) can be rewritten as(14)where s _{ ij } = j − i/(N − 1). Therefore, the RPS applied to the deterministic forecast can mathematically be expressed by similar form of the GMGS.
Equitability requires a condition of Equation (5) to be satisfied (more precisely, the condition does not require zero for right hand side of the Equation (5), but a certain constant c to be enough). The condition can be written by using the s _{ ij } determined for the RPS as(15)where p _{ i } is the relative observation frequency of category i. It is obvious that satisfying Equation (15) for any p _{ i } is impossible. Therefore, the RPS applied to the deterministic forecast cannot satisfy the condition of Equation (5) and does not satisfy equitability.
Cite this article as: Kubo Y, Den M & Ishii M. Verification of operational solar flare forecast: Case of Regional Warning Center Japan. J. Space Weather Space Clim., 7, A20, 2017, DOI: 10.1051/swsc/2017018.
All Tables
Estimated scores of verification measures for events defined as larger than or equal to the M and Xclass flares.
Estimated scores of multicategorical verification measures for the overall data of RWCJ forecast.
All Figures
Fig. 1. Probability distributions for the RWCJ forecast. (a) Joint probability density. (b) Marginal distributions of forecast and observation. (c, d) Calibration and likelihood distributions, respectively. The black dots connected with lines in (c) and (d) stand for the conditional expectation values of the observation given the forecast and of the forecast given the observation, respectively. 

In the text 
Fig. 2. Comparison of various verification measures among the RWCJ forecast, persistence method, and recurrence method for the Mthreshold. (a) Absolute scores of the verification measures. Magenta, cyan, and yellow stand for the RWCJ forecast, persistence method, and recurrence method, respectively. (b) Difference in the scores between the RWCJ forecast and the persistence method as red and that between the RWCJ forecast and the recurrence method as blue. The black bars stand for the 95% confidence intervals of the scores. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 

In the text 
Fig. 3. Same as Figure 2, except that the event definition is the Xthreshold. 

In the text 
Fig. 4. Chronological history of the Mthreshold event occurrence. The blue vertical lines show the maximum Mthreshold events within 24 h from the forecast issue time. The red dots show the cumulative number of maximum Mthreshold events plotted against the event occurrence date. The gray dashed lines are fitted lines of cumulative number of events for each subset. 

In the text 
Fig. 5. Comparison of various verification measures with 95% confidence interval for the Mthreshold for four subsets with the overall data as a reference. The magenta, cyan, yellow, red, and blue bars stand for 2000–2015 (overall), 2000–2002 (subset1), 2003–2005 (subset2), 2006–2010 (subset3), and 2011–2015 (subset4), respectively. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 

In the text 
Fig. 6. Same as Figure 5, except that the event definition is the Xthreshold. 

In the text 
Fig. 7. Difference in scores with 95% confidence interval for Mthreshold event between the RWCJ forecast and the persistence method. The magenta, cyan, yellow, red, and blue bars stand for 2000–2015 (overall), 2000–2002 (subset1), 2003–2005 (subset2), 2006–2010 (subset3), and 2011–2015 (subset4), respectively. We note that as the FAR and POFD are negative orientation measures, the 1FAR and 1POFD are plotted, which are positive orientation values. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.