The Flare Likelihood and Region Eruption Forecasting (FLARECAST) Project: Flare forecasting in the big data&machine learning era

The EU funded the FLARECAST project, that ran from Jan 2015 until Feb 2018. FLARECAST had a R2O focus, and introduced several innovations into the discipline of solar flare forecasting. FLARECAST innovations were: first, the treatment of hundreds of physical properties viewed as promising flare predictors on equal footing, extending multiple previous works; second, the use of fourteen (14) different ML techniques, also on equal footing, to optimize the immense Big Data parameter space created by these many predictors; third, the establishment of a robust, three-pronged communication effort oriented toward policy makers, space-weather stakeholders and the wider public. FLARECAST pledged to make all its data, codes and infrastructure openly available worldwide. The combined use of 170+ properties (a total of 209 predictors are now available) in multiple ML algorithms, some of which were designed exclusively for the project, gave rise to changing sets of best-performing predictors for the forecasting of different flaring levels. At the same time, FLARECAST reaffirmed the importance of rigorous training and testing practices to avoid overly optimistic pre-operational prediction performance. In addition, the project has (a) tested new and revisited physically intuitive flare predictors and (b) provided meaningful clues toward the transition from flares to eruptive flares, namely, events associated with coronal mass ejections (CMEs). These leads, along with the FLARECAST data, algorithms and infrastructure, could help facilitate integrated space-weather forecasting efforts that take steps to avoid effort duplication. In spite of being one of the most intensive and systematic flare forecasting efforts to-date, FLARECAST has not managed to convincingly lift the barrier of stochasticity in solar flare occurrence and forecasting: solar flare prediction thus remains inherently probabilistic.


Introduction
The first decades of the 21st century have seen the transformative effect of the ever-increasing, widespread use of wireless technologies.Enhanced by the equally expansive use of the internet, these technologies have claimed, and are expected to continue claiming, a crucial part of our everyday routine, from services to communications and from information to edutainment.Space-based satellite technologies have also been instrumental in establishing wireless capabilities to a degree that few could predict, even by the standards of the late 20th century.
When relying on space, however, it is imperative to keep in mind the decisive effects of our magnetically active star, the Sun.Simultaneously with the expansion of human capabilities came increased awareness of the adverse effects of space weather, namely, the short-term (hours to days) impact of solar magnetic activity, from the fast solar wind spewed by extended coronal holes to the storm-like transport of solar eruptions through the entire heliosphere.Electromagnetic and particulate emission from solar eruptions can cause anything from shortlived, relatively unimpactful disruptions to major damage in satellite infrastructure, on top of the biological hazards they pose to exposed humans in space conditions, either during extravehicular activities or in future manned space travel and missions to Moon and Mars (see, for example, ESA's 1 Moon Village and NASA's Artemis Programs).
Exceptional solar flares and eruptive manifestations, up to the first flare observed by Carrington (1859), are among deepspace phenomena whose repercussions go past the ionosphere, reaching down to aviation altitudes and even to Earth's surface.Figure 1 portrays this impact in an image produced by the University of Applied Sciences and Arts of Northwestern Switzerland (FHNW) partner of the FLARECAST Consortium.There, one sees ramifications spanning from what we already knew before the space age (i.e., the aurora, long-range electrical power networks or radar disruptions) to any applications that GPS or Galileo enable.During the space age we have seen some solar eruptions that have caused major effects in May 1967, March 1989, and October-November 2003, although solar flares associated with these eruptions were arguably smaller than the Carrington flare (Cliver & Dietrich, 2013).However, while cruising on the far-side of the Sun in July 2012, the STEREO-A spacecraft detected the transit of an extreme solar eruption that was characterized as a Carrington-scale event (Baker et al., 2013).Geospace was spared from that eruption but, statistically, in future events it will not be (see, e.g., Riley & Love, 2017).
The staggering short-and long-term financial impact of extreme space weather has been delineated in a series of recent works (e.g., MacAlester & Murtagh, 2014;Oughton et al., 2017;Eastwood et al., 2018), as well as in governmental guidelines and action plans, such as the US National Space Weather Strategy and Action Plan (2015, 2019 2 ) and the UK Space Weather Preparedeness Strategy (2015 3 ).Of particular interest, however, is the 2008 National Research Council's Severe Space Weather Events: Understanding Societal and Economic Impacts Workshop Report (NRC, 2008) that portrays in its Figure 3.1 the nonlinear inner workings and interconnections of sectors comprising our societal fabric.Like domino bricks, if any of these sectors fails due to extreme space weather, the ramifications will be hard to imagine and even harder to mitigate.The reliable forecasting of extreme space weather, therefore, upgrades to a major challenge of our times.
Energetic events accompanying major solar eruptions are by far the main agents of extreme space weather.These events comprise three distinct aspects: solar flares, CMEs, and SEP events.A reliable forecasting, therefore, should encompass three very different forecast efforts with unique characteristics and challenges.In solar flare prediction, that is the topic of discussion in this work, there is no early warning for the flares' X-and c-ray photons.Only a slim window of ~10-12 min exists for flare-accelerated SEPs, if any (Haggerty & Roelof, 2002;Rust et al., 2008).To address the lack of advance warning, therefore, major solar flaresand flare-related, impulsive SEP events, by extensionneed to be predicted well before their occurrence (i.e., several hours to 1-2 days in advance).There are currently significant shortcomings in our flare forecasting ability, as sections below will show.In addition, CMEs, particularly the fastest ones that are stemming from solar ARs and are virtually associated one-to-one to major flares (Yashiro et al., 2005, but see Liu et al., 2016 for an exceptional active region), should ideally be predicted along with flares and SEPs (e.g., Anastasiadis et al., 2017).There is a window of innerheliospheric transit ranging between ~20 h and 2-3 days after the initial, near-Sun detection of CMEs until they reach geospace.If Earth-directed, their arrival time and geoeffectiveness (i.e., their potential ability to trigger a geomagnetic storm) should also be predicted in advance (e.g., Möstl et al., 2014;Mays et al., 2015).Finally, CME-shock-accelerated SEP events may arrive at geospace hours after the source solar eruption or, in the worst-case scenario, even before the CME registers in near-Sun height-time diagrams (Reames, 2017;Malandraki & Crosby, 2018, and references therein).We also need to know in advance the temporal profile of the SEP event and its peak flux or fluence, particularly for proton energy channels exceeding 50-100 MeV, as per NOAA guidelines and recently defined benchmarks 4 .
Although an ultimate goal, we still seem to be far from achieving an integrated platform for the prediction of all extreme space weather manifestations.Among them, solar flare prediction has historically been humanity's first stride.Since the Fig. 1.A graphical representation of extreme space weather and its effects on contemporary technology and infrastructure.Credit: FHNW.
1980s, there have been persistent efforts toward flare prediction introducing a wealth of physical, semi-empirical or empirical AR properties and proxies that have been claimed to hold a flare-predictive capability.A short, non-exhaustive review of these properties, including an effort to group them into different categories, appears in Georgoulis (2012).
However, earlier efforts (e.g., Leka & Barnes, 2003b, 2007;Barnes & Leka, 2008) aiming to assess the relative performance of these properties indicated that, on one hand, none was solely capable to predict flares reliably while, on the other hand, when a capability to simultaneously test multiple properties was achieved, the predictive information contained in the full property set was highly redundant.The first comparative evaluation of prominent flare-predictive properties and methodologies, undertaken by Barnes et al. (2016), established that no single method clearly outperformed the others.This and other initial findings (e.g., the predictive value of timeseries, previous flare history) were further solidified by collaborative work on operational forecasts by Leka et al. (2019a, b) and Park et al. (2020).
Meanwhile, the explosive increase in computing power spearheaded critical advances in computer science, in an already existing Big Data ecosystem facilitated by the wealth of groundand space-based solar observations since the mid-1990s.Data mining and the advent of machine learning eventually led to the first application of a SVM and neural networks in flare forecasting (Qahwaji & Colak, 2007).Several seminal works followed thereafter (Li et al., 2008;Qahwaji et al., 2008;Song et al., 2009;Yu et al., 2009;Bobra & Couvidat, 2015;Muranushi et al., 2015;Nishizuka et al., 2017) and the list is ever-expanding.Today, we know that solar flare forecastingand space weather forecasting, in generalshould be viewed as an interdisciplinary effort, with a potentially critical contribution from machine learning (Camporeale et al., 2018, and references therein), albeit not without open challenges impeding progress (Camporeale, 2019).
In this continuously and rapidly evolving landscape, a Consortium of nine institutes spanning over six European countries took advantage of the EU Horizon-2020 2014 PROTEC-2014 opportunity to propose the FLARECAST project.Having all the above in mind, FLARECAST pledged to develop an advanced solar flare prediction system based on automatically extracted physical properties of solar active regions, coupled with state-of-the-art machine learning solar flare prediction methods and validated using the most appropriate forecast verification measures.FLARECAST featured three top-level objectives, namely, one scientific, one devoted to the R2O philosophy and one engaging in communication.In particular, FLARECAST proposed: -In terms of science, to understand the drivers of solar flare activity and improve flare forecasting.-In terms of R2O, to provide a globally accessible flare forecast service that facilitates expansion.-In terms of communication, to engage with space weather end users, inform stakeholders and policy makers, and educate the broader public on solar flares and space weather in general.
This collective work summarizes FLARECAST's most significant findings and conclusions in all three objectives above, along with key elements from peer-reviewed publications that occurred in its course.Section 2 describes the methodology followed throughout the project, while Section 3 elaborates on the tasks of data handling and monitoring.Section 4 discusses the FLARECAST performance verification strategy, while Section 5 briefly describes the project's scientific results and explorative component.Section 6 encapsulates the main conclusions of the project's three top-level objectives (i.e., science, operations, communication) along with lessons learned in its course.Finally, in a series of Appendices we provide detailed instructions on accessing FLARECAST data, codes and infrastructure (Appendix A), key results from the FLARECAST Users Survey (Appendix B; see Sect.2.5.1 for a relevant discussion), the list of refereed publications related to or acknowledging the FLARECAST project (Appendix C) and, finally, a list of acronyms and abbreviations used in this paper (Appendix D).
All things considered, as commented by an attendee of one of the FLARECAST Stakeholders' Workshops, "the real fun starts now"; we expect a number of future works that will take advantage of and exploit the FLARECAST products.These may not be restricted to flare prediction, as (i) the volume of metadata provided by the processing of the NRT SHARP data product (Bobra et al., 2014) is substantial and (ii) as per EU's OpenAIRE initiative (https://www.openaire.eu),all data, codes and infrastructure of the project are openly available to any interested individual or team worldwide.Ideally, then, one might view the FLARECAST infrastructure as a vehicle for a future integrated space weather prediction platform.Comprehensive and diverse material on the FLARECAST project can be found in the openly accessible FLARECAST website http://flarecast.eu(see also Fig. 9 in Sect.2.5 for a top-level structure of this website).

The FLARECAST approach
FLARECAST embraced the technical architecture and methodology structure illustrated in Figure 2.This was realized in a sequence of four procedural steps, namely, ( 1 This work plan structure, together with the technical scheme of Figure 2, show the design philosophy of the FLARECAST science and technology.The project relied on machine learning applied to data from the HMI vector magnetograph (Scherrer et al., 2012;Schou et al., 2012) onboard the SDO mission (Pesnell et al., 2012) in order to implement a technological service for flare forecasting with the ultimate goal of contributing to a data-driven understanding of the physical triggering fmechanisms of flares.Rigorous forecast verification and dissemination of results have been crucial steps of this basic research effort.

Science motivation
The solar flare phenomenon pertains to explosive energy release in the low solar atmosphere that results in short-lasting (i.e., minutes to hours) enhancement of emission over virtually the entire electromagnetic spectrum (see, e.g., Benz, 2008;Fletcher et al., 2011 for comprehensive reviews).Major, highly-energetic flares (i.e., GOES M-and X-class events, in particular) are much less frequent (in a distribution that is long known to be a power lawsee Drake, 1971;Rosner & Vaiana, 1978;Crosby et al., 1993) than minor flares (i.e, GOES C-class) and subflares5 .In terms of mean occurrence frequency, or climatology in the statistical language of forecasting problems (e.g., Barnes et al., 2016;Leka et al., 2019a), major flares fall under the category of rare events, namely, events that are much more infrequent than the physical systems in which they appear (in this case, solar ARs).A typical 11-year sunspot cycle involves the appearance, evolution and fading of a few thousand NOAA-numbered ARs, yet Carrington's flare is considered a one-in-a-century (or even more rare) event.In other words, viewing a set of flaring ARs as a "positive" sample vs. a "negative" sample of non-flaring ones, the ratio of sample sizes is substantially different than one.Increasing the flare magnitude threshold between positive and negative samples only pushes this ratio to further extremes.For example, in Solar Cycle 23 only ~1.8% of ARs hosted at least one GOES X-class flare (an imbalance ratio of ~0.0183), with a respective ratio of ~0.005 for GOES !X5.0 flares.The climatology of the FLAR-ECAST flare sample in the even weaker Solar Cycle 24 is 1 GOES C-class flare every ~11 h, 1 M-class flare every ~4.5 days and 1 X-class flare every ~67 days; substantial variations obviously exist over different phases of the cycle.Class imbalance in major flare prediction and other rare-event problems is a central concern (Woodcock, 1976;Bloomfield et al., 2012;Jolliffe & Stephenson, 2012;Bobra & Couvidat, 2015) for machine learning methods with proposed remedies including undersampling, oversampling and misclassification weighting (e.g., Longadge & Dongre, 2013;Ahmadzadeh et al., 2021).
Flares at the top end of the observed energy distribution are viewed and treated as extreme events, given their very significant disruptive ability on top of their scarcity.For diverse accounts of extreme events in physical systems one may review Albeverio et al. (2006), Meyers (2011) and Sharma et al. (2012), among other comprehensive works.These accounts also refer extensively to two intrinsic characteristics of extreme events: environmental complexity and difficulty in forecasting.It is a fact that forecasting solar flares is a pressing issue for space-faring nations, mainly for two reasons: first, because of flares' biological and technological repercussions and, second, because flares are a common element of the two other aspects of extreme space weather, CMEs and SEPs.Notwithstanding the lack of an early warning for flares, mentioned already, just a cursory examination of some biological implications indicates that, say, a 500 keV c-ray photon has a wavelength of ~0.25 nm.This is much shorter than a DNA helix (~3.5 nm), meaning that such radiation acting on exposed humans engaging in extravehicular activities can lead to acute radiation sickness, if not being altogether fatal (see, for example, Freese et al., 2016).
The complexity, in terms of the multiscale (i.e., multifractal) behavior in the turbulent solar atmosphere (see, for example, Fig. 2. The FLARECAST architecture and WP Assignment.Rectangles indicate exchangeable software components, such as algorithms and codes, while cylinders indicate data model components.WPs 6 and 7 are not included here but are integral to the project and complement the overall efforts.Georgoulis, 2005) undoubtedly adds to the difficulty of predicting solar flares.In particular, flares are long thought to be responses of nonlinear dynamical systems (solar ARs) in a SOC state.The SOC concept was initiated by seminal works on the topic (Lu & Hamilton, 1991;Lu et al., 1993;Vlahos et al., 1995) inspired by groundbreaking developments in theoretical physics (Bak et al., 1987(Bak et al., , 1988;;Kadanoff et al., 1989; see also Bak, 1996 for an encompassing account).An account of apparent SOC manifestations in Astrophysics, including flares, is presented in relatively recent anniversary works (Aschwanden et al., 2016;McAteer et al., 2016).However, a SOC evolution of flaring ARs would lead to an intrinsic stochasticity in flare occurrence; this translates to an intrinsically probabilistic forecasting relying by definition on small probabilities for major and extreme flares.Randomly driven SOC models explore precisely this stochasticity; hence they are incapable of predicting small-and mid-size flares.However, Strugarek & Charbonneau (2014) reported that a class of deterministically driven SOC models could be used for predictive purposes as they raise the memory of the system, thus exerting a partial control over flare waiting times; however, this is an avenue of research yet to be sufficiently explored.Recent studies have also aimed to assess the frequency of a Carrington-level event in the framework of extreme value theory (Elvidge & Angling, 2018).M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 Rather than focusing exclusively on extreme events, the key scientific question and incentive behind the FLARECAST project was to determine to what extent can the skill currently achieved in the forecasting of solar flares be advanced.
Correlating solar flares with magnetic evolution dates back several decades.One of the earliest, pioneering accounts was that of Howard & Severny (1963), who reported major magnetic field changes before and after a major (coined as "great") flare.More systematic works were added in the 1980s with limitedresolution magnetograms (Krall et al., 1982;Hagyard et al., 1984;Zirin & Liggett, 1987) linking flares -and even repeated flaringto d-sunspot complexes and velocity shear, before the first semi-operational flare prediction schemes appeared (McIntosh, 1990;Zirin & Marquette, 1991).In more recent years, however, the formation of long and intense magnetic PILs in the photospheric magnetic field of ARs was established as a feature of direct relevance to major flaring (for comprehensive reviews, see Schrijver, 2009;Toriumi & Wang, 2019).By "intense", we mean PILs exhibiting substantial amounts of magnetic flux and strong magnetic gradients due to the closely seated, opposite magnetic polarities and magnetic/velocity shear (e.g., Georgoulis et al., 2019;Patsourakos et al., 2020, and references therein).Major flares occur as such PILs evolve and intensify, fueled by magnetic flux emergence and cancellation along them (van Ballegooijen & Martens, 1989;Gibson & Fan, 2006;Archontis & Török, 2008).Sometimes, an eruption including a major flare may occur in the absence of intense PILs, when the emerging flux reconnects with the pre-existing field or has enough magnetic twist to become unstable during emergence.Such a exceptional case, giving rise to a GOES X3 flare, was reported by Gary & Moore (2004).Regardless, the majority of X-class flares -the ones mostly affecting space weather conditions-occur above evolving, intense photospheric PILs.
An example of ARs with and without strong PILs is given in Figure 3, where the relatively flare-quiet (up to mid-C-class flares) NOAA AR 12674 is compared against the intensely flaring NOAA AR 12673.The latter in September 6 and 10 2017 gave the strongest (~X10) flares of solar cycle 24 (e.g., Yan et al., 2018).Both active regions evolved simultaneously in the solar disk, located just a few hundred arcsec away from each other.While the photospheric compactness and conspicuous PILs are evident in NOAA AR 12673, NOAA AR 12674 is much more scattered.Coronal information in Figure 3 (sampled indicatively at 19.3 nm to showcase some structure) indeed shows significant complexity in terms of several bright kernels, complexity and apparent twist; significant non-potentiality, in brief (e.g., Schrijver et al., 2005).Emission generally lacks bright kernels and does not show such non-potentiality in NOAA AR 12674.
Figure 4 further shows a PIL analysis on NOAA AR 12253, achieved in FLARECAST's framework.More information is given in Section 5.1.

Active region properties and flare predictors
Flare forecasting relies almost entirely on statistical correlations between solar magnetic field data and observed flare characteristics.In this case, local (i.e., AR scale) photospheric magnetic fields are parameterized in order to identify and quantify patterns potentially associated with flares.Should a flare occur, flare detection and classification is primarily done using the GOES 0.1-0.8nm soft X-ray band.The peak photon flux in this band is historically used to determine the GOES flare class.There are obviously other sophisticated ways to detect flares in EUV, X-ray, and optical wavelengths (see Martens et al., 2012, for an attempt in the framework of the HEK project), but the GOES X-ray classification is the one used most widely by the space weather and flare forecasting communities.Although the GOES soft X-ray detectors are, in essence, spectral irradiance instruments without spatial resolution on the solar disk, observations by the GOES/SXI telescopes complement flare information with the source ARs, at least for the largest events.However, these spatial identifications are not error-free and verification of all GOES flare locations is a nontrivial task.Significant effort on identified flare locations was also put by Hock (2012) and, more recently, by Angryk et al. (2020).
Quantitative active region photospheric properties, used as flare predictors in FLARECAST's framework, were derived from the HMI SHARP data product (Bobra et al., 2014;Hoeksema et al., 2014).A FLARECAST processing pipeline was developed using the hmi.sharp_cea_720s_nrt data series that contains AR maps of the LOS magnetic field component, B LOS , continuum intensity, I c , and vector magnetic field components in the radial, B r , co-latitude, B h , and azimuthal, B / , directions of the solar spherical coordinate system.These maps are produced at HMI's JSOC6 at a cadence of 12 min, while the NRT HMI stream assures that these data products are available within typically 4 h of acquisition7 (Hoeksema et al., 2014).The identified PILs in the region are further characterized as "moderate" (yellow contours) and "strong" (red contours).The difference between "strong" and "moderate" PILs relies on the absolute value of the gradient of the vertical magnetic field (>40 G/ pixel and >16 G/pixel, respectively) and the strength of the horizontal field (>120 G and >100 G, respectively).Parts of the PIL that are not highlighted in color are below one or both thresholds and are considered as "weak".The algorithm used is the FLARECAST PIL identification code, accessible as described in Appendix A.2. Another reason for using the NRT stream was that any operationally-oriented flare forecasting service has to rely on NRT data for training, testing, and validation of its method(s).
In a further attempt to replicate operational conditions to the greatest extent possible, FLARECAST did not restrict its input data to longitudinal areas close to the central solar meridian, but identically treated all data regardless of solar disk position (for regions very close to the solar limbs though, some selection criteria were imposed, as discussed below).This was decided in spite of a complete understanding that magnetic projection effects, foreshortening and noise near the solar limbs are substantial.
In the FLARECAST processing pipeline, NRT SHARP maps are pre-processed before extracting any property.Preprocessing aims to: (a) check for missing information in the FITS headers and restore it, if possible; (b) examine for bad-quality or missing data (i.e., a NaN or a constant value); (c) capture possible differences between WCS (Thompson, 2006) positions and header positions; (d) trim any part of the FOV containing off-disk pixels for ARs near the limbs.Magnetogram maps are flagged as null and no properties are calculated for HARP timestamps and dates with (i) bad-quality data, (ii) absolute differences between FITS header coordinates and WCS-calculated coordinates that are larger than 5°or (iii) trimmed images that are less than 20 pixels in horizontal size.
Active region properties extracted from the photospheric magnetic field in FLARECAST have all been previously tested for relevance in flare prediction.These properties are presented in Table 1, itemized into 15 different groups.Each property group relates to either a certain property or a set of properties and gives rise to one or more flare predictors, respectively.These predictors, and groups thereof, quantify the state of the AR photosphere and corona and reflect (1) the entire region or SHARP FOV, (2) magnetic PILs, or (3) sunspots (as the areas with the strongest magnetic field).Most property groups characterise only the state of the photospheric magnetic field.Some, however, characterize the state of the AR corona, either by photospheric proxies or by means of current-free PFEs or by differences between the photospheric field and that expected by a PFE.The two last columns of Table 1 indicate relevant published work, distinguishing between studies that were directly implemented in the course of the project from additional or previous studies pertinent to some predictors.
It should be mentioned that, for many of the properties in Table 1, estimating uncertainties is nontrivial.For the cases uncertainties exist, mostly in terms of r-values in fits, these values are provided in the property database.This said, prediction algorithms (Sect.2.3) currently do not utilize predictor uncertainties.Uncertainties in verification metrics are provided, however, as per the sensitivity analysis of Section 4.3.Evaluating the impact of predictor uncertainties in multi-parametric machine learning prediction schemes could be a meaningful future step.
Up to 209 predictors are calculated from each SHARP observation at different cadence, namely 1 h, 3 h, 6 h, 12 h, and 24 h.The full 12-min cadence of HMI SHARPs was exploited only in a very limited number of cases, due to the immense computational time it required.This computational expense dictated the different cadence as a necessity.Clearly, FLARECAST did not restrict to the 25 scalar properties included in the SHARP predictor metadata.It did, however, replicate the calculations of these predictors to validate them and consider additional image statistics (i.e., higher-order moments).Instructions on how to access these and other data and information are detailed in Appendix A.
Figure 5 displays histograms of the monthly number of property groups and SHARP timestamps analyzed over the period September 2012 -January 2019, for the finest (1 h) and the coarsest (24 h) cadences considered.Numbers corresponding to 12-min cadence are not shown due to their lack of statistical significance.The progression of Solar Cycle 24 can be roughly assessed, with significantly higher numbers until early 2016, when the declining phase gave way to the latest solar minimum.Thereafter, the numbers of both calculated property groups and processed SHARP timestamps gradually decline as eligible SHARPs in the NRT stream become fewer and fewer.The shaded interval between 13 April 2016 and 1 September 2017 corresponds to a time of potentially problematic NRT data due M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 to a transient misalignment between the two HMI cameras providing filtergrams for the generation of the full Stokes vector8 .Only definitive vector magnetogram data were reprocessed -NRTs were notwhile full-disk LOS magnetograms at 45s cadence were not affected.Artifacts in this case intensify with increasing central meridian distance.
The property data set shown in Figure 5 is, to our knowledge and understanding, one of the largest and most diverse ever assembled for flare prediction, offering unique opportunities for statistical and physics-based studies to interested teams worldwide.In Guerra et al. (2018), a sub-group of six predictors were studied to determine the effects of using the LOS field (B LOS ) versus the surface-radial field (B r ) for their calculation and subsequent use in flare forecasting.It was determined that both LOS and radial field components can be advantageous in different circumstances.The project hence decided to use both property versions to facilitate any independent information these two versions could furnish.

Flare association
The NOAA/SWPC flare event list data include the universal times (UT) of their start (i.e., onset), peak, and end, along with the flare magnitude, all obtained by the GOES spacecraft 0.1-0.8nm soft X-ray channel.Where possible, source ARs are identified from the daily NOAA/SWPC flare event lists, with their morphological properties extracted from the daily NOAA/SWPC SRS reports.NOAA ARs, flaring or not, are linked to the HMI SHARPs and the FLARECAST property database includes information on all ARs located within each SHARP FOV.
The process below is followed for each SHARP during its solar disk passage: 1. SHARPs are first checked to determine whether their FOV contains the time-advanced centroid locations of NOAA-numbered ARs.This makes use of the closest SRS report before the SHARP observation, with solar fdifferential rotation taken into account.
2. If any NOAA ARs are assigned to the SHARP, the SWPC flare event list is searched for those NOAA numbers and related flares are associated to all of that SHARP's property database entries.3.For X-ray flares with no reported NOAA number, locations of co-temporal flares observed in ground-based Ha images (also from the NOAA/SWPC flare event list) are used to determine if the multi-wavelength flare event occurred within the SHARP's FOV (again taking solar differential rotation into account).Positive Ha flare association results in the inclusion of the respective X-ray flare.A very small fraction of flares of GOES class C and above, of the order 0.1%, seems to miss both criteria above over the analysis period.
After all SHARP-associated flares are identified, we extract the following properties for each flare: -F M : GOES peak magnitude (e.g., M1.3).
s s : Time difference (in seconds) between the SHARP observation time T 0 and the reported flare start time T s (i.e., s s = T s À T 0 ).s p : Time difference (in seconds) between T 0 and the reported flare peak time T p (i.e., s p = T p À T 0 ).s e : Time difference (in seconds) between T 0 and the reported flare end time T e (i.e., s e = T e À T 0 ).
Figure 6 provides a snapshot of FLARECAST's flare data set and displays the temporal distribution of different flare classes for Solar Cycle 24 (Fig. 6a), along with their rise time (Fig. 6b), decay time (Fig. 6c) and duration (Fig. 6d) for approximately the same time coverage as that of the property database statistics in Figure 5. Rise/decay times and durations have been provided by NOAA.
We consider as eligible flares only those of GOES C-class and above (i.e., !C1.0) to make sure that as many as possible are included, unobscured by an often elevated solar soft X-ray background.We acknowledge, though, that even a C1.0 flare threshold might result in loss of some flares in cases of intensely high solar activity.Others may be lost due to lack of, or an erroneous, location information.

Methods: machine learning
The FLARECAST computational component fully relied on machine learning, that is, on prediction algorithms that utilize an automatic learning step based on either labeled or unlabeled input data.The conceptual core of each machine learning technique lies on the modality of this learning step.Labeled data are characterized by a tag including one or more property-specific labels (e.g., flaring or non-flaring), whereas unlabeled data are free of such tags.
We distinguish between two broad categories of relevant machine learning methods: -Unsupervised methods, that are free to infer the data structure from the data themselves and realize the learning task in a fully data-driven manner.Such methods train on unlabeled data.-Supervised methods, that perform an unknown input-output mapping from known input-output samples.Sparsity enhancing techniques in these methods enable a quantitative ranking of properties (predictors) that contribute most to the achieved prediction.Supervised methods nominally train on labeled data.
The project's machine learning inventory comprises three (3) unsupervised and eleven (11) supervised machine learning methods, listed in Table 2.We chose to implement a diverse array of methods because we identified a need to experimentally test the strengths and weaknesses of each method, along with their key differences and predictive capacity.Identifying the state-of-the-art in machine learning methods flows directly from FLARECAST's main goal (Sect.1).

Unsupervised methods
FLARECAST proposes three unsupervised (i.e., clustering) methods for the automatic classification of sets of unlabeled AR properties.To briefly explain how these three algorithms work, we denote as , where x ik is the i-th property of the k-th sample x k , d is the dimension of the property space R d and Y ¼ fy j jy j 2 R d ; j ¼ 1; . . .; cg is the set of the c centers of the clusters to determine.The algorithms' objective is to minimize a certain functional with respect to the cluster centers.The functionals of the K-means (Anderberg, 2014) and Fuzzy C-means (Bezdek, 1981) algorithms are given by, and, respectively.In both equations above, U = [u jk ] is the n Â c membership matrix whose entries represent the memberships of the k-th sample to the j-th cluster and d jk is the distance of the k-th sample from the j-th cluster center.In the case of the K-means algorithm, the memberships are binary values, while in the case of the Fuzzy C-means algorithm they are real numbers 2[0, 1], representing the membership probability.In the latter case, there is also a "fuzzifier" parameter m.
The third clustering method is Possibilistic C-means (Krishnapuram & Keller, 1996;Massone et al., 2006).This is an elaborate development of Fuzzy C-means, in which each sample can, in principle, belong simultaneously to several clusters with different degrees of membership.In this case the cost function to minimize is given by, where the entries of the membership matrix satisfy the constraint max j u jk > 0 "k and, for each j, the regularization parameter g j depends on the average size and shape of the j-th cluster.

Supervised methods
Most FLARECAST methods are supervised, in line with contemporary applications of machine learning to flare prediction (e.g., Ahmed et al., 2013;Bobra & Couvidat, 2015;Benvenuto et al., 2018;Florios et al., 2018).A detailed description of these methods is beyond the scope of this work.The latest release of the project's platform contains a standard MLP trained by means of Error-Back-Propagation and two RNNs that allow feedback loops in the feed-forward architecture.This modification is realized by means of both an Elman neural network (Elman, 1990), in which any number of context nodes is permitted, and a Jordan neural network (Jordan, 1997) serial, in which the number of context nodes is constrained to coincide with the number of output nodes.
In a more recent development, regularization neural networks (Evgeniou et al., 2000) enable the connection between training and generalization via the minimization of functionals such as the following, represents the training set made of N property-label pairs, V(Á,Á) is the loss function that measures the price paid for the inaccuracy of predicting y i with f(x i ), and k is the regularization parameter, which realizes the tradeoff between fitting over the training set and generalization.
SVM for regression (Scholkopf & Smola, 2001) is one of the standard regularization networks implemented in FLARECAST.In this case, V(Á,Á) is a standard quadratic loss function and F is a Reproducing Kernel Hilbert Space (De Vito et al., 2004), in which four different kernel types can be selected (i.e., linear, polynomial, radial basis function, and sigmoidal).The FLARE-CAST platform also contains a SVM for classification that uses the hinge loss function, namely the one thought most appropriate for classification (Rosasco et al., 2004).Two sparsity enhancing regularization methods were also implemented, in which the number of features that effectively contribute to the generalization is constrained to the smallest possible.This is achieved by minimizing the l1 norm of the feature vector (i.e., the sum of the absolute values of its components), which is further achieved by means of two different approaches: penalized logistic regression (Wu et al., 2009), in which the loss function realizes the Bernoulli distribution for the labels, and LASSO (Yuan & Lin, 2006), in which the loss function is quadratic.FLARECAST also includes a hybrid version of penalized logistic regression and LASSO, in which the regression outcome is partitioned by means of a Fuzzy C-means scheme, without focusing on optimizing a specific skill score (Benvenuto et al., 2018).The platform contains a further generalization, namely an elastic net (Zou & Hastie, 2005) algorithm, in which the minimization functional contains two penalty terms (l1 and l2) with two different regularization parameters optimized by cross validation.
Ensemble learning (Dietterich, 2000) is another supervised approach that uses a combination of different learning models to increase the classification accuracy.In this framework, FLARECAST offers a RF algorithm (Breiman, 2001), which works as a large collection of de-correlated decision trees.Given a training set of samples made of properties and corresponding labels, a decision tree recursively splits the training samples into subsets based on the value of a single property.Each split corresponds to a node in the tree and the task is to separate records in the training set that have different characteristics.We follow the implementation described in Liaw & Wiener (2002), by splitting the tree until every subset is pure (i.e., all samples in the subset belong to the same class).In this way all terminal nodes (i.e., the leaves) are assigned a unique class label.Once M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 the decision tree has been constructed, classifying a test record is achieved by starting from the root node, applying the test condition to the record and following the appropriate branch based on the outcome of the test.This can lead either to another internal node, for which a new test condition is applied, or to a leaf node.If a leaf node is reached, the label associated with it is assigned to the record.In the RF approach, the training set is randomly divided into a fixed number of subsets and for each subset a decision tree is built.New, incoming unlabeled samples are classified by aggregating the predictions of the decision trees via a majority vote procedure.
Details on how to obtain the FLARECAST machine learning algorithms can be found in Appendix A. We note in passing that the landscape of machine learning methods is continuously and rapidly evolving, with new algorithms constantly introduced.Our approach, as explained in Section 2.4.3 below, is an implementation based on modularity in such a way that incoming machine learning methods can be easily integrated in the FLARECAST platform.

Technology
FLARECAST relies on a broad selection of different technologies, including hardware for storage and computing as well as software for data handling, infrastructure management and computation.The software was designed in a way that is hardware independent and can be installed on numerous platforms.All code developed during FLARECAST has been published under an open source license and is freely available.Code acquisition and license information are detailed in Appendix A.

Computing hardware
A computing server dedicated to FLARECAST has been integrated into the MEDOC computing infrastructure9 and hosts the production version of the FLARECAST pipeline.Queries on this server for SDO/HMI data are made directly into the MEDOC database tables, and SDO/HMI files are accessed locally.This allows efficient runs of the FLARECAST property extraction algorithms.

Data storage hardware
The main FLARECAST data volume is provided by eight SDO/HMI data series at 12 min cadence that have been downloaded using the SDO NetDRMS and archived as part of the MEDOC solar physics data archive.Besides the one used primarily for the forecasting tasks (hmi.sharp_cea_720s_nrt), downloaded series included hmi.m_720s, hmi.m_720s_ nrt, hmi.sharp_720s, hmi.sharp_720s_ nrt, hmi.sharp_cea_720s, hmi.ic_720s and hmi.ic_nolimbdark_720s_nrt.Not all of them were finally used by the Consortium, although definitive SHARP data and continuum images were used for tasks akin to the explorative science component (Sect.5).NRT data series were downloaded no later than 1 hr after they were made available by HMI's JSOC.This overall delay could be further reduced, if necessary, by more frequent download requests.

FLARECAST software architecture
The FLARECAST architecture design was driven by the needs for modularity, portability and ability to perform and accommodate different algorithms written in various programming languages.This philosophy is best described by the toplevel diagram of Figure 2  The underlying management infrastructure controls and monitors these algorithms.The layout of the FLARECAST architecture simplified the project management as the individual components are under the responsibility of dedicated WPs.The relation between components and WPs is shown by the black rectangles in Figure 2. The software infrastructure is discussed in detail in Section 3.

Software components
The software components, represented by the blue rectangles in Figure 2, fulfill specific tasks such as data loading, property extraction, machine learning, or verification.Generally, The management infrastructure is implemented as a collection of containers (see Sect. 2.4.6 below).In automated mode, a set of repeatedly executed computer scripts (i.e., cron jobs) regularly check for new data and start the algorithms, if necessary.Additionally, a small web application allows users with administrator privileges to manually trigger the execution of algorithms.

Data components
The blue database symbols in Figure 2 denote individual parts of the FLARECAST data model, hereafter referred to as data components.Within the workflow, every software component (i.e., every algorithm) reads from several and writes to precisely one data component.Like this, the software components are decoupled from each other and only have to provide a control interface (start, stop) for the management infrastructure.The management infrastructure connects to a dedicated data component for configuration and data logging.
Each data component defines a generic interface to read, create, update, and delete data.The read methods include a query language for simple queries.Figure 7 illustrates this process for an example property extraction algorithm.

Docker containers
All FLARECAST infrastructure components are implemented as Docker containers.Docker10 is an open source software ecosystem (i.e., engine) in which different Docker containers co-exist and function independently.A Docker container can be viewed as a light-weight virtual machine that hosts an arbitrarily configured software environment.This includes custom programming languages and different library versions per container.Each FLARECAST infrastructure component or algorithm is deployed in a dedicated Docker container.These containers can be installed on a high performance cluster as well as on a developer's desktop machine, making it possible to deploy FLARECAST in different environments at the same time.A schematic of the FLARECAST Docker containers is shown in Figure 8.

Language independence
As mentioned already, algorithms may be written in any programming language.This allows to re-use already existing libraries for certain tasks.Currently tested and supported are Python 2.7 and 3.0, the Interactive Data Language (IDL) 8.5 and R. Other languages that can be executed on a Linux system (e.g., Java, C++ and Perl) are not expected to generate major issues, but we must make clear that we have not systematically tested them.

Data model
The design of the data model is balanced between flexible support for different kinds of data structures and strict schema definitions for better interoperability.This can be achieved through a semi-structured data model.Its main structures are defined by a traditional table-based, relational data model.Individual entries of the table support weakly-typed data types (i.e., types supported by weakly-or loosely-typed programming languages, such as C).In this sense, an algorithm can handle data types as simple as numeric values or strings simultaneously with more complex data types, such as arrays, matrices, dictionaries and whole hierarchies of objects.For the weakly-typed data types we rely on the built-in JSON data type of PostgreSQL 9.6.Algorithms encode their output (e.g., extracted properties) in the JSON data format.JSON is supported by most C-like programming languages, including Python, R and IDL.

Communication and dissemination
The FLARECAST communication objective ran throughout the project as it had a dedicated, overarching WP (WP7; see the WP list in Sect.2).The FLARECAST Consortium aimed to raise awareness in three different directions, namely industry and government, space weather end users, and the general public.

Industry/government, scientists and end users
Initial engagement with end users, in terms of both scientists and forecasters, evolved into a User Survey conducted by the Met Office partner in autumn 2016, in advance of the FLARE-CAST First Stakeholders' Workshop held at the Met Office (Exeter, UK) in January 2017.In its findings (described in more detail in Appendix B) about a quarter of respondents were not sure about the accuracy of the forecasts they were using, clearly indicating a need for more education about the forecast methods, as well as about the verification of data, methods, and performance.This may also explain why around 60% of respondents were passive regarding recommending forecast services.Further, it was found that timeliness, accuracy and ease of use are the most important factors in a forecast, while scientific detail is the least important information to include.
The First Stakeholders' Workshop was attended by 30 participants in total, including 20 non-FLARECAST attendees from basic solar and space-weather research and operational forecasting, as well as end users and stakeholders from the aviation, defence, marine, satellite, and communications sectors.There was discussion of the survey results and an extended discussion on the construction of rough research and development roadmaps for the short and long terms.Top-level conclusions of the Workshop included the following: -One forecasting solution is not possible to fit all sectors and users, as different sectors need different forecast windows, latencies and means of verification depending on their generic business models (i.e., false alarms, misses, cost-to-loss ratios, etc. ..). -FLARECAST forecasts will need an intermediate step before reaching end users, to make the forecasts understandable to them.This step should be taken by operational centres, such as the Met Office Space Weather Operations Centre (MOS-WOC).It was deduced that the real end users of FLARE-CAST are, in fact, the operational forecasters themselves.-The users stressed the practical aspects of an operational forecast (i.e., reliability and precision) over scientific details.They also found the Workshop to be an excellent educational experience.-The need for an integrated, Sun-to-Earth, space-weather forecasting system was portrayed as evident and pressing.It will become increasingly pressing in the future.
A Second Stakeholders' Workshop was held in Ostend (Belgium) in November 2017 (alongside that year's European Space Weather Week) and was oriented more toward scientists, end users and operational space-weather forecasters.Among others, it featured participation by the EASA and NASA's CCMC.This second workshop pointed out that FLARECAST is producing and publishing results that are readily available to be used and validated, and the FLARECAST forecasting service will follow shortly.However, some convenors agreed that, in some respects, the "fun" for FLARECAST's potential really starts now, since the greater community should think about scientific/operational projects involving the new FLARECAST "software instrument".Finally, in view of the strong contribution from EASA in the discussion, it is increasingly clear that aviation agencies (and the aviation industry, naturally) are becoming more engaged in space-weather impacts and associated forecast services.Therefore, the heliophysics community needs to proactively follow up with them to ensure that this opportunity for a meaningful, prolific interaction is not missed.
In addition, intense communication took place with the scientific community at conferences and during a dedicated FLARECAST Science Workshop organized in June 2017 in Paris by the CNRS partner.These actions anchored FLARE-CAST within the larger field of current and future spaceweather research and secured collaboration between interested parties beyond the project's duration.

Education and public outreach
A variety of communication formats were used to involve the public in all partner countries: science cafés; citizen science events; children's workshops; interactive exhibits; social media.Public engagement activities raised awareness not only about space weather and the need for reliable solar flare forecasts but also, more generally, about the working of collaborative research in Europe.We witnessed the relentless interest and fascination of the public on the realization that real efforts are made to predict solar activity, because this can have a very tangible impact on human life and well-being.This was not initially clear and we strove to pass the word clearly and compellingly.

Press releases and press appearances
The numerous direct interactions and networking activities between scientists and different stakeholders in the project's framework were aimed at sharing information, advancing mutual understanding, and increasing trust in science (i.e., embedding the project in society at large).These efforts came just in time, as two partner countries (Switzerland and United Kingdom) faced political uncertainties about their involvement in future European research programmes.In one remarkable development11 , the project itself was brought up as a paradigm of solid, meaningful collaboration between British and Continental European researchers.
This visibility was assisted by the Consortium's concerted efforts to exploit opportunities for disseminating the project's achievements and deliberations.There were annual press releases in all official languages of the partner countries, while Consortium members were interviewed in numerous occasions by the electronic and printed media based in the partner countries.Much attention was, therefore, paid to contextualized press work to make sure that the project was broadly communicated in all participating countries and beyond.
Detailed information on all three communication and dissemination elements of FLARECAST can be found on the project's website, a skeleton diagram of which is provided in Figure 9.
3 The FLARECAST infrastructure: handling and monitoring The software developed within FLARECAST is a computational resource enabling one to apply several machine-learning methods to data available in the FLARECAST property database.The software is object-oriented, with a set of objects for reading and analyzing data formats (i.e., model learning or prediction), another set for visualizing the results and, finally, another set for storing them in appropriate prediction databases.The software has been designed and written in Python, with the pertinent software modules described in the following sections.

Data handling
Data handling is managed by a database interface module that can read and collect properties from the FLARECAST property database and write the results of the analysis into other databases.The databases involved are: The flare intensity number reflects the GOES flare class (forecast and actual), while the delay refers to the interval between the time of the specific flare forecast and the actual flare onset.

Data monitoring
A consistent and as complete as possible coverage of the SDO/HMI archive at MEDOC is important for FLARECAST to fully exploit the JSOC data, along with the ability to promptly download the latest SDO/HMI NRT SHARPs for as long as they are available.As a result, both the archive coverage and download delays had to be monitored.Additional monitoring tools have, therefore, been developed to keep track of both the infrastructure workflow and database coverage.

Data coverage monitoring
Monitoring of data coverage is achieved using the scripts in the "coverage" directory of the project's public Git repository.12These scripts are automatically executed daily with their results published at MEDOC. 13 This webpage uses color tables to display the percentage of existing SDO/HMI data, for different data series, that are available at JSOC and are mirrored at MEDOC.This is done as a function of month and day of month using a color table in which even just one missing observation is visible and encodes in white those cases where no data are available at JSOC.Notice that for the SHARP data series a gap may exist either when no eligible SHARPs can be found on the solar disk on a given day or in (relatively rare) cases of issues with the HMI data pipeline.

Download delays monitoring
Monitoring of download delays is achieved using the script in the "delays" directory of the above-mentioned Git repository.The observation date and time at which files are available at MEDOC are directly queried from the database, with year and month supplied to the script as external arguments.
The output is a plot displaying the download delay (compared to the observation [i.e., data acquisition] time) as a function of observing time for the selected month.The delay axis spans from 0 h to 24 h.Missing data (that could not be downloaded from JSOC) are shown as a 0-h delay in red, while data downloaded with a delay of more than 24 h are shown as a 24-h delay in orange.

Infrastructure and database coverage monitoring
To overview and monitor the workflow process and database coverage, the FLARECAST infrastructure provides several viewers, some of which are intended as internal tools.These tools can be found in dedicated Git repositories in the INFRA project area. 14The infra_viewer allows FLARECAST developers to manually start new processes and to monitor their performance.The property_viewer and the prediction_viewer allow to visually inspect the distribution and completeness of the data.They help system admins to better assess the quality of the data.
In addition to viewers, the FLARECAST infrastructure provides a collection of API routes to analyse the data on a database level (Appendix A).These scripts iterate through the entire data set and report unexpected gaps in text form.Manual analysis is then needed to decide if data need to be reprocessed or the gaps can be explained by missing or corrupted source data (i.e., HMI SHARPs).

Performance verification strategy
The entire verification software for the FLARECAST project is contained in the verification_engine.py and verification_ module.pycodes in the project's Stash repository. 15The starting point for the software is reading from a JSON-format configuration file.This initializes the algorithm name for the prediction method, the threshold flare intensity of interest, the base issuing time of the forecasts, the length of the forecast window, the latency of the forecast windows from their issuing times, whether magnetic and/or flare history property sets are used and, finally, the start and end date-times for the period undergoing verification.Given an input forecast window length, the software determines which issuing times will be considered for reading prediction "algorithm_config_name" entries from the prediction database.This is to ensure that temporally edge-to-edge forecasts are considered in the verification, such as a single 24-h forecast window issued at 00:00 UT each day, or two 12-h forecast windows issued at 00:00 and 12:00 UT each day, or four 6-h forecast windows issued at 00:00, 06:00, 12:00, and 18:00 UT each day.In all of these examples, the latency (i.e., the time difference between a forecast issuing time and the start of the prediction window) is zero, meaning that forecasts are effective immediately.Nonzero latency can also be implemented, if desired.
Following this, two separate calls are made to the FLARE-CAST prediction API.The first is a request to the "algoconfig/data" service for the related prediction algorithm configuration settings (or a list of prediction algorithm configuration settings in the case of forecast window durations less than 24 h).The second is a request for the entire set of prediction results through the "prediction/list_v2" service.Having retrieved the full set of forecasts to be verified, minimum and maximum flare class information for that prediction set are extracted from the "algoconfig/data" results.This information is needed to convert the associated flare information contained in each prediction result entry into the relevant binary observational truths (i.e., one or more suitable flare did [1], or did not [0], happen within the forecast window) that are then compared to the decimal forecast probability or binary classification issued by the prediction algorithm under consideration.
The next stage of verification_engine.py is the initial construction of the output metadata structure, containing the full set of metadata keys and a placeholder for the "data" key.At this point, "forecast_type" and "source_predictions" are left blank as the content depends on whether forecasts are probabilities or classifications; distinction is not possible based only on "algo-rithm_config_name" because some prediction methods generate both types.Classification forecast-observation pairs are then passed into the calc_class_stats() function in verification_ module.py,while probability forecast-observation pairs are passed into the equivalent calc_prob_stats() function.Both return structured results saved directly into the "data" key at the metadata level of the output structure.Numerous skill scores are then computed for both the probabilistic and the categorical-classification format of forecasts, outlined in more detail below.

Probabilistic forecasting
The computation of probabilistic scores relies on two metrics associated to the i-th forecast: the binary observed flaring o i {0, 1}, and the forecast probabilty f i 2 [0, 1].Using them, the verification module computes the Brier Score, Table 4. Metrics and skill scores used for categorical forecasts in FLARECAST, along with their applicable ranges.All parameters shown rely on the simple 2 Â 2 contingency table of Table 3.

Name
Notation Formula Range together with its decomposition into the three components of reliability, resolution, and uncertainty (see, e.g., Richardson, 2012).
From the BS, one obtains the Brier Skill Score, where the reference score BS ref is given by, is the average of the binary observed flare occurrences over the N forecasts (i.e., the climatology).
Notice that BSS directly combines probabilistic forecasts with binary observations in a paired format.Further characterization is achieved through binning of these forecast-observation paired data according to their forecast values.This enables construction of the four quantities required to plot reliability diagrams, which are plots comparing the observed frequency of events to the forecast probability (see, e.g., Broecker, 2012, as well as Murray et al., 2017 for an application to flare forecasting).These quantities are (1) the number of entries in each forecast bin, (2) the average forecast value in each bin, (3) the average observation in each bin, and (4) an estimate of uncertainty in the average observation value.The key parameter in the construction of these reliability diagram quantities is the size of the decimal-probability forecast value bins, which is controlled by the "data set"-"probabilities"-"rel_dia_stepsize" parameter in the JSON configuration file input into verification_ engine.py.The reliability diagram data are specifically included in the verification output "data" result structure to prevent the need for the full prediction set to be requested again through the API service at a later time.
In addition to the standard probabilistic metrics, decimal forecast probabilities are converted by a threshold value into categorical-classification forecasts of no-flare (i.e., probabilities below the threshold) and yes-flare (i.e., probabilities at or above the threshold).Varying this threshold yields a series of binary forecast and observation data sets that have the full set of categorical metrics and skill scores calculated for them.These threshold-dependent categorical metric data sets are then included in the output probability verification "data" result structure as a nested list of categorical-classification output result dictionaries, with each list entry supplemented by the threshold probability value used to construct that entry.These categorical metrics and skill scores are provided below.

Categorical forecasting
Given a certain range of flare magnitudes, forecast window duration and latency, verification of the simplest possible categorical forecasting scheme (i.e., binary 0 for no-flare and 1 for yes-flare) relies on the 2 Â 2 contingency table (aka "confusion matrix") of Table 3.This is constructed from the intersection of the binary forecast-observation pairs corresponding to the following four cases: number of forecast windows predicted to flare with flare(s) observed (TP); number of forecast windows predicted to flare with no flare(s) observed (FP); number of forecast windows predicted to not flare with flare(s) observed (FN); number of forecast windows predicted to not flare with no flare(s) observed (TN).The first row (TP + FP) provides the number of flare-predicted forecast windows, while the second row (FN + TN) corresponds to the number of non-flare-predicted forecast windows.The first column (TP + FN) provides the number of observed flaring forecast windows, while the second column (FP + TN) corresponds to the observed non-flaring forecast windows.The total number of forecast windows is N = TP + FP + FN + TN.Table 3 gives rise to a long list of possible forecast metrics and skill scores, with those utilized in FLARECAST outlined in Table 4.
Another important element of forecast verification is the use of the POD and POFD (see Table 4) as ordinate and abscissa, respectively, of the ROC plot (see, e.g., Broecker, 2012, as well as Sharpe & Murray, 2017 for an application specific to flare forecasting).For thresholded probabilistic forecasts (Bloomfield et al., 2012) the ROC plot contains the (POD, POFD) pairs for each probability threshold.In a manner similar to the reliability diagram of Section 4.1, readily creating ROC plots avoids unnecessary reprocessing of large quantities of forecast data just for visualization purposes.ROC plots directly provide the AUC metric that is determined by the integration of the POD values over their spacing in POFD.

Verification metric uncertainty calculation
In standard operation, the verification process calculates a single value of each metric and skill score that represents the performance achieved across all forecasts within the selected verification time range.FLARECAST also includes an optional assessment of uncertainty for these single-valued metrics and skill scores, controlled through the "data set" -"uncertainties" parameters in the JSON verification configuration file.The choice of resampling method for the calculation of uncertainties is controlled by the string-format "sampling_scheme" parameter.In the current implementation, three resampling schemes can be utilized, namely bootstrap and jackknife (Efron, 1982) or random sampling without replacement (i.e., a sub-sampling method).The size of each partial sample is set via "sample_percent", namely the decimal percentage of the total number of forecast-observation pairs in the verification time range, N. Finally, the number of times that resampling is performed is set by the "realizations" parameter.It is worth noting that the jackknife method overrides these to self-define the sample size as N À 1 (one less than the number of forecast-observation pairs) and the number of realizations as N, because this is the number of times one forecast-observation pair can be dropped.In addition, the standard bootstrap method uses N samples (i.e., "sample_percent" = 1), with the freedom to choose any number of realizations.In our infrastructure it is also possible to run the FLARECAST verification engine with the bootstrap resampling with replacement but using "sample_percent" < 1, although this should not be treated (or referred to) as a standard bootstrap implementation, except in the case "sample_percent" = 1.
The assessment of metric uncertainties is optionally invoked after the categorical-classification or probability forecast metrics have been written into the output verification data structure.In the calc_ss_uncert() uncertainty assessment function, indices for the forecast-observation pair arrays are resampled to create 2D index arrays of the form ["realizations", "sample_size"].For a bootstrap, indices in the range [0, (N À 1)] are uniformly randomly drawn for a total of "realizations" Â "sample_size" times to directly populate the 2D index array.For a jackknife, each index is dropped to generate a 2D [N, (N À 1)] index array.Finally, for a subsample, in each realization the monotonic array of indices 0, 1, 2, . . ., (N À 1) are randomly shuffled and the first "sample_size" number of indices are extracted and stored in the 2D ["realizations", "sample_size"] index array.After the 2D index permutation sets are generated, the full suite of verification metrics and skill scores are calculated over the "sample_size" entries for each realization.These are achieved using calc_class_stats() and calc_prob_stats() once again to ensure consistency of the metric and skill score calculations with those reported as primary results in the verification data structure.To simplify the uncertainty assessment portion of the structure, any key names containing nested structures are removed (i.e., the contingency table in the classification data structure and the probability-thresholded reliability diagram in the probability data structure).Following this, averages and standard deviations are calculated over the number of realizations for each separate metric and skill score.The resulting dictionaries of average and standard deviation values are separately saved into the "unc_avg" and "unc_std" keys, respectively, in the output verification data structure.
We note for completeness that the above implementation of metric uncertainties is used because it is deemed as the most time and computationally effective.Alternative implementations can be pursued, but these are reserved for future FLARECAST infrastructure upgrades.

The FLARECAST science results
FLARECAST's scientific outcome relied on the following four pillars: an investigation of new, or a revisit of potentially interesting, flare predictors (Sect.5.1); the application of machine-learning methods to flare forecasting, also enabling feature (i.e., predictor) ranking in terms of predictor importance (Sect.5.2); exploration of, first, eruptivity studies with synthetic, 3D MHD models (Sect.5.3) and, second, the flare-CME connection (Sect.5.4).This last component shows that, importantly, the FLARECAST AR property database can be meaningfully coupled with CME catalogs such as those of the EU HELCATS project16 or NASA's DONKI17 to extend flare forecasting to eruptive flare (i.e., CME onset) forecasting with expectation values of CME properties.The complete list of FLARECASTrelated peer-reviewed articles can be found in Appendix C.

New and revisited flare predictors
In spite of possessing one of the most comprehensive ensembles of flare predictors, FLARECAST opted to identify additional properties that could assist in flare forecasting.In this respect, subsets of the FLARECAST property (including flare association) database were used and linked to the flaring history of ARs.Promising quantities involved photospheric shear flows (Park et al., 2018), non-neutralized electric currents (Kontogiannis et al., 2017), magnetic gradients (Kontogiannis et al., 2018), and the differential emission measure calculated above ARs (Gontikakis et al., 2020).In addition, FLARECAST has taken advantage of the results of Guerra et al. (2018), advocating for the joint use of line-of-sight and solar-surface-vertical versions of predictors in machine learning methods, as well as of the results of McCloskey et al. (2016), revisiting sunspot class associations with flaring rates and confirming the increased likelihood of flaring in case flux emergence enhances the photospheric compactness of ARs.
Strong shear flows are observed along intense magnetic PILs in the AR photosphere.PILs are arguably the single most telltale feature of enhanced flare productivity.Properties derived from a map of plasma shear flows were thus analysed by Park et al. (2018).In particular, an algorithm was developed to quantify photospheric shear flows in ARs in terms of three properties, treated as predictors in this framework: the mean (<S>), maximum (S max ) and integral (S sum ) shear-flow speeds along strong-gradient, strong-field PIL segments.This algorithm was then applied to a large data set of ~2500 co-aligned pairs of AR vector magnetograms with 12-min separation over the period 2012-2016.Park et al. (2018) found that ARs with more widespread and/or stronger shear flows tend to not only be more flare-productive over the next 24 h but also produce major flares sooner.This finding paves the way for future conditional predictions as well as forecasts better integrating the flare history of ARs.The importance of flare history in predictions of future flaring has already been identified by Falconer et al. (2012), as well as by comprehensive statistical studies (Leka et al., 2019a, b;Park et al., 2020).As further discussed in Park et al. (2018) and Welsch et al. (2009) shear-flow properties exhibit a weaker, albeit still positive, correlation with the flare peak flux than Schrijver's R.However, Park et al. (2018) revisited the problem using almost as large a sample of co-aligned magnetograms as Welsch et al. (2009), but at much higher quality (i.e., spatial resolution, cadence) and with a more detailed sample of NOAA-numbered ARs (i.e., in terms of observed times/locations, flaring activities and photospheric magnetic properties).Park et al. (2018) further focused on flaring activity over the next 24 h, as well as on the waiting time between major flares as a function of the shear-flow properties.We believe that the new results obtained with this improved magnetogram dataset and systematic study warrant additional investigation.
The predictive potential of properties related to non-neutralized electric currents in ARs was also investigated.Intense, shear-ridden PILs are exclusive areas of non-neutralized currents that imply the injection of net currents into the corona whose associated non-potential (i.e., free) magnetic energy could be instrumental for flares and eruptions (Török et al., 2014;Dalmasse et al., 2015).Georgoulis et al. (2012) had already proposed a method to calculate non-neutralized currents accurately, along with their applicable uncertainties.This method is slightly different than the one of Leka & Barnes (2003a) who added the absolute values of the total current from each magnetic polarity in that, first, it algebraically adds the total currents from both polarities to obtain the net current and, second, it performs a robust flux partitioning of a given magnetogram to identify and distinguish between magnetic polarities.The integral form of Ampere's law for the inference of net currents was also applied by Falconer et al. (2002); however, no specific discussion on the precise choice of contours was made.Relying on the method of Georgoulis et al. (2012) and Kontogiannis et al. (2017) implemented the computation of the non-neutralized electric currents within the FLARECAST property database.Calculations for a sample of AR time series showed with statistical significance that the systematic increase in the amount of non-neutralized currents concurred with the development of strong PILs and signaled phases of intense flaring.Further application to a representative sample of Solar Cycle 24 ARs showed that the total unsigned non-neutralized current of an AR is promising in distinguishing between flaring and non-flaring AR populations within a 24-h time window (Kontogiannis et al., 2017)."Promise" in that work implied a significantly better (i.e., beyond applicable uncertainties) ability to distinguish between flaring and non-flaring active regions than the (often used as baseline) unsigned magnetic fluxsee Figure 10 and discussions in Barnes & Leka (2008) and Georgoulis (2012Georgoulis ( , 2013)).
Additionally, FLARECAST developed new algorithms to reproduce some recently proposed predictors not yet implemented in forecasting schemes, relying on the complexity of the spatial distribution of the photospheric magnetic flux.These were the sum of the horizontal magnetic gradient (c.f.Korsós et al., 2014Korsós et al., , 2015;;Korsós & Erdélyi, 2016) and the Ising energy (Ahmed et al., 2010).The sum of the horizontal magnetic gradient is calculated by combining magnetograms and photospheric continuum images.The latter are used to locate the umbrae of sunspot groups while the magnetograms are used to define the mean magnetic field strength in the umbrae.These values and the separation distances between opposite-polarity umbrae are used to calculate the sum of the magnetic field gradients between all possible opposite-polarity umbrae pairs.Furthermore, while the original formulation of Ising energy considers pairings between opposite polarities on a pixel-bypixel level, FLARECAST examined that form as well as the Ising energy of opposite-polarity umbrae and topologically inferred magnetic partitions.Kontogiannis et al. (2018) demonstrated that magneticgradient-and Ising-energy-based predictors are also worth including in automated flare forecasting schemes.In terms of Bayesian conditional flaring probabilities (Fig. 10), the sum of the horizontal magnetic gradient appeared to be the most efficient property, followed by the Ising energy of the umbrae pairs.Efficiency here implies again an ability to better distinguish between flaring and non-flaring active region populations, not necessarily in terms of predictive potential.The latter, of course, is to be tested in more strict terms within machine learning prediction methods employing multiple predictors and being able to rank them.Although many of these and other properties relate to PILs, Kontogiannis et al. (2018) showed that each of these properties exhibits specific dependence on the position of ARs on the solar disk and encompasses different information.These differences can justify the different efficiency of these flare predictors and their potential inclusion in multi-parameter machine learning-based forecasting schemes.Gontikakis et al. (2020), in a forthcoming study, introduced a new, "orthogonal" view on flare prediction, analysing EUV image time series of ARs.More specifically, the temporal evolution of the DEM, calculated over ARs, was studied as a possible flare predictor.The DEM shows the distribution of the plasma EM as a function of temperature, T, and is derived from optically thin spectral lines.It is defined as DEMðT Þ ¼ n 2 e dl=dT and involves the derivative of the line-of-sight l over M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 the plasma temperature, with n e being the electron number density.Syntelis et al. (2016) found that the DEM calculated over an intensely eruptive AR increased substantially a few (i.e., ~5) hours before one of the eruptions, with the other eruption following shortly after.This was interpreted as a possible pre-flare manifestation (see, e.g., Fletcher et al., 2011).To achieve statistical significance in a DEM-based pre-flare activity study, Gontikakis et al. (2020) used DEM inferences by the GAIA-DEM archive. 18In GAIA-DEM, the DEM is assumed to be a Gaussian function of log(T) (Guennou et al., 2012) and the archive provides images of three Gaussian parameters, along with v 2 .These parameters are the EM (i.e., the integral of the DEM), the temperature T max where the DEM is maximized, and the Gaussian width of the DEM.Gontikakis et al. ( 2020) used 6-h long time series of each DEM parameter.The DEM parameters were computed from solar structures extracted from 9000 HMI HARPs.Positive derivatives of the time series, indicating that AR plasma heating is at work, were treated as indicators of imminent flares.Analyzing the DEM time series appeared to give more significant conditional flare probabilities than the reference unsigned magnetic flux (i.e., its value at the end of the time series) for GOES !M1.0 flares, but it was less successful for GOES !C1.0 flares.These may be grounds to advocate that, provided that the significant uncertainties in the DEM calculation can be constrained, DEM time series may complement or even enhance the short-term predictive ability of photospheric properties, at least for major flares.

Flare prediction and feature ranking
The FLARECAST machine-learning prediction component draws from and relies on the plethora of predictors extracted from NRT HMI SHARP magnetograms populating the project's property database.This central objective has been realized by Benvenuto et al. (2018), Florios et al. (2018), Piana et al. (2019) and Campi et al. (2019), while the overall process has been described and discussed by Massone et al. (2018).Benvenuto et al. (2018) tested an array of supervised and hybrid (i.e., comprising both supervised and unsupervised elements; see Sect.2016), but using machine learning for prediction.The main result was that hybrid methods tend to outperform strictly supervised ones and approach the performance of clustering methods.Benvenuto et al. (2018) used a variety of metrics and skill scores present in Table 4 and further determined that a reliable feature ranking by means of their prediction value is possible.Florios et al. (2018) used three different supervised methods (i.e., MLPs, SVMs and RFs; see Table 2) on a subset of the FLARECAST property database over a five-year interval covering 2012-2016.They were able to correlate between the different methods and concluded that RFs (Breiman, 2001) could be the method of choice in a routine, operational flare forecasting scheme.Feature ranking was performed by means of different scores and indices and it was found that peak performance could be achieved by using 10-12 predictors in total.Both purely probabilistic and probability-thresholded categorical performance verification was performed, with optimal probability thresholds determined.The overall performance practically matched, and in some cases exceeded, that of Bobra & Couvidat (2015) who used a SVM and performed the first flare forecasting work using HMI SHARPs.
The study of Campi et al. (2019) performed a data-and property-intensive flare prediction investigation.It utilized a database of 14,931 point-in-time property vectors, each comprising up to 171 predictors within the interval September 2012 -April 2016.The main objective was to determine whether, and to what extent, this abundance of predictors is essential or redundant in flare prediction.The study utilized two machine-learning methods, namely hybrid LASSO and RFs, and opted to construct four training and four testing sets, each corresponding to a 24-h forecast window from a specific issuing time (i.e., 00:00, 06:00, 12:00, and 18:00 UT each day).Particular attention was paid to the complete separation between training and testing sets, that were not only nontime-overlapping (i.e., results from each issuing time were not combined) but they used different ARs (i.e., HARP numbers, in implementation) for training and testing.As such, another objective was to find the impact of different intervals of the same ARs being used in both training and testing.The main findings of Campi et al. (2019) were the following: -Properties with the best predictor ranking have the smallest variance in ranking, which implies that their impact on prediction is consistently high.However, the order of bestranking predictors varies for different prediction methods.-Particularly for prediction of GOES !C1.0 flares, the bestranking predictors are common across all issuing times studied.This is less of a case for the prediction of major flares of GOES !M1.0 flares.-Only a small number of predictors (different for each method, however, most notably for major flares) suffices to make the prediction method achieve its top performance.Hence, the vast majority of predictors are redundant in such approaches and, in fact, using all of them can even be detrimental to performance.The redundancy of predictors was also noted in Barnes et al. (2016) as possibly due to intrinsic, but not necessarily known ab initio, correlations between different predictor values.This effect may also be enhanced for events of increased rarity and hence poorer statistics, i.e., when forecasting increasingly higher flare classes.
-The stricter approach of separating ARs between training and testing sets leads to notably lower skill score values.
Relaxing the robustness of training and testing by allowing "information leaking", not necessarily in terms of temporal overlapping but in terms of different intervals in the same ARs, readily increases these values.Improved performance in this case, however, is due to memorization, rather than learning, on behalf of the machine learning method used.Lower skill scores for unbiased training and testing, as also found by Piana et al. (2019), reflect the intrinsic, uncompromising stochasticity of the flaring phenomenon.
We note at this point that the relatively small number of best predictors found by Florios et al. (2018) and Campi et al. (2019) essentially aligns with earlier results by Ahmed et al. (2013), Bobra & Couvidat (2015) and Barnes et al. (2017).The latter study also found that a set of three or four predictors was sufficient to guarantee best performance for two classifiers, NPDA and RF.That study addressed prediction of both filament eruptions and flares and found significant overlapping of about half of best predictors between the two types of predictions.In the case of Campi et al. (2019), however, such predictor patterns were found only for the prediction of GOES !C1.0 flares within different classifiers, while mostly different sets for different classifiers were found in the prediction of major flares (i.e., GOES !M1.0).
Both Florios et al. (2018) and Campi et al. (2019) used various metrics and skill scores from Table 4 but focused on HSS and TSS.Example runs by both studies and their associated HSS and TSS values are shown in Figure 11.This leads to the conclusion that, for both studies, the metric values increase while adding more predictors, although in both cases the   amplitude of the increase and the values achieved depend on the adopted skill score.Furthermore, in Florios et al. (2018) the increase seems to persist after 12 predictors while in Campi et al. (2019) saturation is reached with just six predictors (although when RF is implemented on three predictors, it leads to a TSS value very close to the one obtained with six predictors).The difference between the two studies (and subsequent panels in Fig. 11) is probably due to the different way the training set is populated, the latter experiment not mixing up active regions between the training and test sets.We notice, moreover, that the TSS value for Campi et al. (2019) is lower than that of Florios et al. (2018) for both applied methods, despite the fact that the predictors used in Florios et al. (2018) are a subset of those used in Campi et al. (2019).This is again a result of the more stringently separated training and testing procedure followed by Campi et al. (2019).In fact, the prediction method itself may often be less important than the robustness of training and testing practices.By roughly reproducing the training and testing of Bobra & Couvidat (2015), Florios et al. (2018), Campi et al. (2019) were able to practically match the performances of these studies for different machine learning methods.As a result, the "uncompromised" stochasticity of flare occurrence, as coined by Campi et al. (2019), could not be curbed even with one of the largest-dimensionality parameter spaces.
Given the above, the FLARECAST performance verification infrastructure (Sect.4) is in place and allows the evaluation of performance of any future effort than can rely on either the existing or enhanced property database (Sect.2.2) and the prediction database (Sect.2.3).Obviously, the quest continues.
At this point it is important to clarify that, although FLAR-ECAST valued highly the use of predictor time series for flare forecasting, this important part of the analysis did not have the time to materialize during the project's nominal period of execution.The only exception is the forthcoming work of Gontikakis et al. (2020) that exploits DEM time series to uncover flare predictors or precursors.Time series, in terms of flare history and/or predictor evolution, have been found useful in flare prediction (Leka et al., 2018(Leka et al., , 2019b;;Park et al., 2020), going beyond point-in-time forecasting.While this avenue is yet to be investigated with the FLARECAST databases, these data clearly enable one to pursue this objective in the future.

Studies using synthetic data
While observational data are the ultimate reference frame to determine the flare-predictive ability of a given quantity, these data have critical limitations toward a complete and pertinent description of solar ARs.For example, while it is agreed that the 3D coronal distribution of the AR magnetic field plays a key role in the triggering and development of instabilities leading to flares and eruptions in general, only the 2D photospheric field vector can be routinely inverted from Stokes images by means of the Zeeman effect.Thanks to tremendous developments in computational solar physics, numerical experiments of flare and eruption triggering are becoming increasingly common and realistic (see, e.g., the review of Green et al., 2018).These experiments have now made it possible to explore new flare predictors.Kusano et al. (2012) have carried out numerous 3D MHD simulations in which the initial set-up was parametrically modified to accommodate eruptive and non-eruptive configurations.Distinguishing between flaring and non-flaring configurations is essential in investigating the efficiency of potential predictors.
This exploratory component of FLARECAST was facilitated by the 3D MHD simulations of Zuccarello et al. (2018), who extended upon Kusano et al. (2012).These simulations started with a dipole magnetic configuration, emulating the two polarities of an AR, and were driven by specific, carefully modeled and controllable photospheric velocity flows.Four types of converging flows were tested (all eventually leading to eruptions) against a control, non-eruptive case.Depending on these driving flows and their experimental control, eruptions took place at different times that, however, could be determined precisely for each simulation.Given the full 3D data sets, different scalar volumetric quantities were calculated, among which were the potential and non-potential (i.e., free) magnetic energy Fig. 13. (a) Correlation between CME speed and the peak of total, unsigned magnetic flux in the source active region up to 24 h prior to the eruption.Points correspond to the actual data values for 124 CMEsource region pairs, while a box-and-whiskers plot is overlaid along with a spline fit to quantify the correlation.Reproduced from Murray et al. (2018).(b) Linear correlation coefficient and respective uncertainties between source region properties and CME speed for a sample of 32 CMEsource region pairs.A distinction between fast (red columns) and slow (green columns) CMEs has been made while results for the entire CME sample are shown in blue columns.Reproduced from Kontogiannis et al. (2019).M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 budgets and the relative magnetic helicity H, that was further decomposed into non-potential and volume-threading helicity terms (c.f.Berger, 2003;Moraitis et al., 2014;Linan et al., 2018).At the onset of eruptions, all parameter values showed a significant dispersion, indicating the lack of a practical instability threshold.Interestingly, no particular free-energy threshold could be identified for flares and eruptions in general, despite the key role of the magnetic free energy in eruptions.However, the ratio of a specific term H j (coined the "non-potential" helicity) to the total relative helicity behaved differently; eruptions seemed to always take place for the same value of the ratio H j /H.Therefore, this ratio may represent a critical threshold at which a modeled AR becomes unstable.
The importance of the ratio H j /H was further highlighted by Pariat et al. (2017) who analysed parametric 3D MHD simulations of the formation of solar-like ARs.These simulations were originally presented by Leake et al. (2013Leake et al. ( , 2014) ) and modeled the self-consistent formation of ARs via the emergence of a twisted magnetic flux tube from the upper convection zone into the solar atmosphere.Flux emergence led to the formation of a magnetic configuration similar to standard bipolar ARs.The parametrisation of the initial conditions enabled the simulation of both quiescent and eruptive configurations.In Pariat et al. (2017), the full 3D MHD simulated coronal magnetic field for different cases was investigated and analyzed, using magnetic energy and helicity.While magnetic energy, in particular free magnetic energy, exhibited strong differences during the evolution of various configurations, some helicity-based quantities were robustly discriminating between eruptive and non-eruptive simulations.The total relative magnetic helicity H was not one of them, showing its largest budget for the non-eruptive simulation.However, the ratio H j /H showed high values for eruptive simulations only, most importantly before the onset of eruptions.Shortly after eruptions, its values were similar for both eruptive and non-eruptive simulations.As a result, the ratio H j /H was portrayed as a promising eruptivity proxy.
Given the above, the actual predictive value of the ratio H j /H will ultimately be determined by data-driven modeling of observed solar ARs.If future computational advances make the meaningful calculation of the non-potential 3D coronal magnetic field feasible in NRT, testing of this proxy will become possible and may lead to innovative flare predictors in future, advanced versions of FLARECAST and other operationallyoriented systems.Synthetic data obtained from numerical simulations can also be used to study flare proxies following the same methodology with observed data.Guennou et al. (2017) studied flux emergence simulations by Leake et al. (2013Leake et al. ( , 2014) ) in a way that was similar to the treatment of observational data, namely by isolating the 2D magnetic field vector of the modeled AR photosphere.A list of about 100 different scalar properties was then computed in a way similar to the procedures developed in FLARECAST.Guennou et al. (2017) found that, among the properties tested, only those associated with PIL features presented significant preflare signatures.Virtually all other properties were not able to differentiate between eruptive and quiescent simulations.Since the computation of several of these properties depends on the choice of parameters (e.g., thresholds for masking the data, magnetic field thresholds, etc.), Guennou et al. (2017) could study the role of these thresholds on the ability of a property to determine the eruptivity.As exemplified in Figure 12, it was found that some properties were highly sensitive to the choice of user-defined thresholds whereas some others were more robust.The four panels on the left show the distribution of B h at the modeled photospheric boundary of two 3D MHD flux emergence simulations at t = 100, in normalized time units of the non-eruptive (top) and eruptive (bottom) simulations.The white curves show the portion of the PIL where B h > B h,th , for two different choices of B h,th (0 G and 50 G).The length of the PIL where B h > B h,th is one of two criteria used in the computation of L ss , while L sg uses a different criterion based on the potential field, namely the length of the PIL where B h,pot > B h,th .
The six plots on the right show the temporal evolution of L ss and L sg for the eruptive (solid curves) and non-eruptive (dashed curves) simulations under three different values of B h,th , namely 0 G (top), 25 G (middle), and 50 G (bottom).The different colors corresponds to different choices of criteria also entering in the computation of L ss and L sg , with details discussed in Guennou et al. (2017).As B h,th increases, L ss becomes similar for the non-eruptive and the eruptive simulation.This predictor is thus highly sensitive to the choice of B h,th .Conversely, L sg is relatively insensitive to the choice of B h,th .As a result, L sg is capable of discriminating the two simulations and could conceivably be used as an eruptivity proxy.
In addition, Guennou et al. (2017) studied the robustness of certain parameters in the presence of noise, reporting that parameters related to electric currents seem to be the most sensitive.This would imply the need for screening such predictors before inserting them in prediction pipelines.

Flare-CME connection studies
The FLARECAST exploratory component further enabled researchers to expand upon the project results into other cutting-edge research areas, including coronal mass ejection (CME) prediction.Space-weather operations are currently lacking sufficient warning for eruptive flares, with forecasters only predicting CME arrival at Earth once an eruption is observed (see, e.g., Möstl et al., 2014;Verbeke et al., 2019).One may reasonably expect that CMEs and other solar eruptive phenomena can be physically linked by combining data from a multitude of ground-and space-based instruments, as well as models.However, this can be challenging for automated operational systems.Understanding the processes and sources involved in solar eruptions that could enable the use of photospheric properties is imperative for improving this area of forecasting.
The flare-CME connection that could advance CME prediction was investigated via a unique synergy between FLARE-CAST and an EU Framework Programme 7 project funded under the "Exploitation of space science and exploration data" Call.The HELCATS project created numerous data products from heliospheric imaging onboard the two NASA STEREO spacecraft in order to track the evolution of CMEs in the inner heliosphere.Using the project's main catalogue of over 2000 CME events imaged between 2007 and 2017, an automated algorithm was developed to connect the CMEs observed by STEREO to any corresponding solar flares and AR sources on the solar surface, resulting in the HELCATS LOWCAT M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 catalogue.Murray et al. (2018) supplemented the LOWCAT catalogue's information on CME kinematic properties (such as speed and angular width) with FLARECAST's extensive property database derived from vector magnetograms, enabling a deeper study into the characteristics of eruptive ARs.Initial statistical analysis was undertaken on the new combined data set, with total unsigned flux, vertical current density and current helicity identified as properties of interest for potential eruption warning thresholds.Beyond the scientific gain, this collective effort provided an excellent opportunity to foster communication between different, large-scale research projects which may work in parallel on similar research topics to benefit space-weather forecasting.See Figure 13a for an example of this synergy, while further details can be found in Murray et al. (2018).
A further detailed investigation was performed for a smaller sample of 32 eruptive flares (4 C-class; 16 M-class; 12 X-class) by Kontogiannis et al. (2019), seeking correlations between selected AR non-potentiality properties and CME kinematic characteristics.The selection intentionally focused on PILrelated properties, including some well-established ones (such as Schrijver's R, WL SG and B eff ) along with ones implemented in the course of FLARECAST (such as the Ising energy and the total non-neutralized electric current; see Table 1).CME information was collected from NASA's DONKI database, utilizing CME images by LASCO onboard the SOHO mission.Positive correlations between CME kinematic properties were found for all pre-eruption photospheric properties, in line with the results of Murray et al. (2018).Remarkably, correlations were stronger for faster CMEs implying that selected non-potentiality properties may be indicative of upper-limit eruption scales that ARs can produce over a given timescale.Among the FLARECAST properties tested, the preflare total unsigned non-neutralized current and the length of the main PIL stand out, exhibiting stronger correlations with the CME kinetic energy and speed (Fig. 13b).This finding points to a causal relationship between non-neutralized electric currents (stemming from compact, sheared PILs) and the associated free magnetic energy and helicity budgets in ARs and CMEs.Therefore, one envisions a future forecasting capability of basic CME kinematic properties before CMEs actually occur that, combined with the potential CME launch location as per its source AR, could give educated guesses of possible CME shock formation and geoeffectiveness.

Conclusions
FLARECAST was a multifaceted, interdisciplinary project.As such, it is appropriate to identify the different activity areas of the project and summarize conclusions pertinent to each of them.

Science and technology
One of the project's main aims was to seek the most efficient ways of forecasting solar flares.In this respect, FLARE-CAST is arguably the most intensive and systematic solar flare forecasting effort to date due to (i) the amount, cadence and validation of the input SDO/HMI SHARP magnetogram data, (ii) the sheer number of AR properties treated as flare predictors (209 of them, with 171 used in the comprehensive study of Campi et al., 2019), (iii) the breadth of machine-learning prediction algorithms implemented (14 included in Table 2), and (iv) the comprehensive performance verification approach and infrastructure, along with suggestions on how to implement it in a rigorous training and testing philosophy.
Solar flares are long known to be stochastic events, namely responses of strongly nonlinear (quite likely self-organized critical) dynamical systems, as solar ARs are thought to be.Hence, flare forecasting appears intrinsically probabilistic, notwithstanding that some machine-learning prediction methods provide a binary output.Despite this long-standing knowledge, the FLARECAST Consortium was keen to determine whether a combined Big Data and machine learning effort could lift this stochasticity barrier by achieving a compelling, verified binary separation between flaring and non-flaring AR populations.As Campi et al. (2019) declared right from that paper's title, and was also implied by the results and discussion of Florios et al. (2018), stochasticity in flare occurrence and subsequent forecasting remains uncompromising.This, of course, could be due to limitations or shortcomings of the tested methods and/or data.Regardless, despite significant progress brought by FLARECAST and many other efforts worldwide, a decisive (i.e., binary and with high enough verification metric values to be used for decision making on an operational context) flare prediction capability is still beyond reach.This said, FLARE-CAST has generated codes, data and infrastructure that could potentially facilitate this crucial milestone in the future.
A recent trend in the literature (see, e.g., Nishizuka et al., 2018;Li et al., 2020, and follow-up works) suggests that machine learning may not even be sufficient, so the community should be moving toward deep learning flare prediction efforts.This remains to be seen; even in the above deep learning efforts, however, a significant degree of stochasticity persists, reflected in less-than-optimal performance indicators.So far, solar deeplearning prediction methods have by no means outperformed machine learning ones.It is an open question whether this can be reversed in the future, although the massive data needs of deep learning algorithms (e.g., Goodfellow et al., 2016) may render flare prediction a less-than-optimal domain for deep learning efforts due to the lack of sufficient event instances to train on.
In terms of performance verification, while FLARECAST created a detailed apparatus (Sect.4), the nominal duration of the project did not allow its complete exploitation.Existing project studies (Benvenuto et al., 2018;Florios et al., 2018;Campi et al., 2019) have used meaningful parts of it but not the entire infrastructure provided in Table 4. Future studies, however, will have the option to work and rely on this infrastructure for a comprehensive performance verification assessment.
While FLARECAST was a case study of R2O, it soon became clear that we could learn from its output, by means of both predictor ranking and the explorative research component.Then, part of the project's output contributes to O2R objectives.In particular, a robust ranking of predictors triggers the question of why just a few ([10) predictors, different for each prediction method at least for flares of GOES class !M1.0, gather virtually all the predictive power in Campi et al. (2019), while in another attempt by Florios et al. (2018) that used the same (RF) method the increase in performance seems monotonic as more predictors are added.The above generalize and diversify the M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 big picture that now calls for (i) a comprehensive review of training and testing practices, regardless of prediction methods, and (ii) an investigation of specific predictors' performance on given prediction method(s).In light of all these, one key takeaway message is that we should not be employing "black box" (i.e., non-human-interpretable) prediction methods, in spite of their performance, in order to keep the knowledge discovery channel open (Rudin, 2019).
In terms of exploration, the project used synthetic data (Guennou et al., 2017;Pariat et al., 2017) to investigate elements not accessible by the available observational data, but also concerned itself with the flare-CME connection to investigate flare predictors exhibiting a statistically significant correlation with CME parameters.This was pursued via both the FLARECAST -HELCATS synergy (Murray et al., 2018) and the FLARECAST -DONKI association of Kontogiannis et al. (2019).These works, along with previous ones in this respect (Falconer et al., 2002;Barnes, 2007;Ugarte-Urra et al., 2007;Georgoulis, 2008), all show promise toward a viable interpretation of the pre-eruption configuration in solar ARs and a deeper, more complete understanding of solar eruptions.
Along with other recent studies (Barnes et al., 2016;Leka et al., 2019a, b;Park et al., 2020), FLARECAST has showcased the need for fixed, pre-defined data sets in solar flare (and eruption) forecasting.Such "benchmark" data sets (see Angryk et al., 2020, for a definition and an example) enable the precise evaluation of different methods, machine/deep learning or other, on preset training and testing samples.Performance comparisons on different data sets and/or phases of the solar cycle can be problematic due to the differences in underlying flare statistics over different time periods, so a benchmark dataset should be partitioned in such a way as to treat variable climatology to the extent possible.The FLARECAST property database could also play the role of a benchmark dataset.It further enables the important action of time series forecasting, that was unfortunately not achieved during the project's nominal period of execution, and the treatment of class imbalance in machine learning performance.Such investigations directly address another trait of major flares, namely their scarcity and rareness, that becomes extreme for flares of historic magnitudes.
FLARECAST developed an infrastructure that could help address the need for an integrated space weather forecasting platform.Coordination of space weather forecasting efforts has been identified as a key priority in proposed roadmaps such as the one developed on behalf of the COSPAR (Schrijver et al., 2015).It is also prominently recommended by the ESWACC (Opgenoorth et al., 2019).In line with these studies, global coordination of space-weather forecasting efforts sprung out as a top-level conclusion during the FLARECAST Stakeholder Workshops.The project's infrastructure could serve as a testbed, or breadboard, for future applications encompassing the other two main legs of "stormy" space weather, namely CMEs and SEPs.Given the versatility of the Docker engine and containers used in FLARECAST, components of different programming languages can be removed, updated, or added in a highly modular fashion.
All in all, FLARECAST has amply shown that solar flare forecasting should not be, and thankfully is not, an internal affair for heliophysicists.Fusion of expertise is paramount to advance and, ultimately, break ground toward an efficient space-weather forecasting but also other complex real-world problems.

Operations
By defining different flare forecasting modes (i.e., based on GOES flare class, forecast window, latency, etc.) and testing numerous prediction algorithms, FLARECAST reaffirmed that one single recipe does not fit every need.Patterns between different prediction methods were seen in the best-performing predictors for GOES !C1.0 flares (i.e., flares of GOES C-class and above), for example, but not for GOES !M1.0 flares (Campi et al., 2019).In addition, AR properties that correlate best with CME properties (Murray et al., 2018;Kontogiannis et al., 2019) may, or may not, be within the best performing predictors for a given method.We view these findings in light of an outcome of the FLARECAST Stakeholders' Workshops that the real end users of the project's results are most likely expert operational forecasters who can disseminate them as needed.Other synergistic community works fall along similar lines by arguing for the benefit of human "forecasters in the loop" (Leka et al., 2019b).It then becomes clear that more time and effort will be needed before agencies and institutions committed to operations are able to shift from today's relatively simple forecast philosophies to sophisticated, multi-dimensional parameter spaces, Big Data and intensive machine/deep learning methodologies.It is a crucial step to be taken, however, balancing on a fine line between simplicity and interpretability of the concepts and the necessary level of complexity dictated by the problem at hand.
In brief, in spite of its limited duration that left a few envisioned elements to be realised (i.e., time series forecasting and comprehensive performance verification of all prediction algorithms), we hope that FLARECAST can be thought of as a future "textbook" R2O case, paving new ways and even narrowing "the valley of death" of less-than-successful approaches between the two domains of research and operations.As eloquently put by Merceret et al. (2013), who drew analogues between space weather and terrestrial meteorology, there may be leads able to transform the "valley of death" into a "valley of opportunity".
It should not go unnoticed that the immediate next step for the FLARECAST infrastructure is to become a flare prediction service of ESA's SSA Space Weather Service Network 19 .This task is undertaken by the FHNW partner that has originally developed the project's technology apparatus.

Communication
A key objective of the project was the communication with government and industry representatives, that cultivated into two FLARECAST Stakeholders' Workshops.A top-level conclusion of them was that communities involved in (solar flare or space-weather) forecasting should be in close contact.There are "language" and terminology barriers that require a continual effort toward keeping the momentum of communication and better defining the constantly evolving landscape of user needs.A prominent example in this direction is the aviation and power industries that have moved beyond colloquial 19 Available at https://swe.ssa.esa.int/ssa-space-weather-activities.
M.K. Georgoulis et al.: J. Space Weather Space Clim. 2021, 11, 39 interest and are raising their awareness toward the actual capabilities, shortcomings and detailed products of operational forecasting.
We found from these Stakeholders' Workshops that accuracy, timeliness, and ease of use are the most important factors in a forecast from the users' perspective.Less important is the scientific detail accompanying each forecast.As mentioned above, it was also found that there should be an intermediate step between the researchers who devise the forecasts and the end users.This step is filled by operational forecasters, who can interpret forecasts and convey their messages in a language that users can understand and cope with.
As to the communication with the wider public, FLARE-CAST reaffirmed that the public is keen to listen and eagerly seeks knowledge on solar flares and space weather.It became clear that open-access, public-domain projects such as this one should return part of their discovered knowledge to the taxpayer citizen, by educating them credibly and responsibly, steering clear from the (unfortunately high) level of misinformation spread largely untested and un-scrutinized over the internet and social media20 .

Lessons learned
In implementing a sizable and diverse project such as FLARECAST, there was no shortage of lessons to learn.The intrinsic complexity of the project was not only due to its diversity, but also because Consortium partners were geographically distributed throughout Europe.This required certain steps to be taken in order to secure efficient coordination and implementation.Even so, however, some desired tasks could not be implemented within the nominal duration due to lack of time.The most important lessons were the following: -An effective project governance is instrumental for planning ahead.Well-defined, distinct roles foreseen in the Consortium Agreement such as those of the Project Coordinator, Project Scientist, Financial and Legal Manager, as well as bodies such as the Project Management Board and the Steering Committee, provide a governance structure that can, in principle, lead the project into fruition via a properly shared workload.There should be space for some adjustments in the initial plan.However, radical changes midway through the project will only lead to confusion and impede implementation.-The Consortium further needed an efficient way of remote communication, beyond e-mails, phone calls and teleconferences.This was achieved via an integrated platform of collaborative software tools that were utilized on a Consortium-wide level.This system combined flexible internal wiki-pages (e.g., enabling tagging individuals in documents with associated email updates, real-time recording of meeting minutes, efficient exporting of periodic report documents and code documentation, etc.) with integrated code repositories that have easily interpretable visualization interfaces for version tracking (hence, allowing efficient code reviews and prompt troubleshooting of change-related issues).-In Consortium deliberations, common understanding and a continual pursuit of consensus in a collegial atmosphere were valued as the most important elements for a successful collaboration.-In a project with strict requirements for hardware installation and software development, time (and backup time) is necessary; efficient time management is paramount for the aversion of implementation delays.Given the high degree of synchronization required between parallel tasks, delays often seem inevitable; when they happen, they may also cause a domino effect affecting different project parts.Advance risk assessment and time planning help mitigate the impact of these delays.
A.4 Licensing EU's OpenAIRE policy requires that all source code and data models must be openly accessible to any interested person or entity, both within the EU and worldwide.Therefore, the software and data licenses chosen by the FLARECAST Consortium aim to be as permissive as possible, at the same time being free of any liability and provided "as is".The majority of the written code is available under a BSD-2 (https://opensource.org/ licenses/BSD-2-Clause) license.Some of the algorithms, however, depend on third-party libraries that use a less permissive GPL licence (https://www.gnu.org/licenses/gpl-3.0.de.html) and thus these algorithms need to be released under GPL, as well.All FLARECAST data products are provided under the fully open ODC PPDL (https://www.opendatacommons.org/licenses/pddl/1-0/index.html) license.

Appendix B
The FLARECAST user survey The FLARECAST user survey was carried out in Autumn 2016, with around 100 users approached and 31 responding.Around 60% indicated that they were using flare forecast or alert services.Most thought these services were useful but around a quarter did not know whether their preferred service was accurate or inaccurate, which points to a need for better informed end users.Asked whether they would recommend such a service, around two thirds gave a "passive" response, indicating that they neither strongly recommend, nor strongly not recommend, flare forecasting services.
Users were asked if they planned to use flare forecasting services differently in the future.Five of them provided further details of how they would do this.While the assumption is that most users are in the aviation or defence industries, it is interesting that out of these five comments, one was from a satellite user and two from GNSS users.Suggested approaches included: -Developing an operational response to space weather events, and a change in practices to take advantage of increased precision and advance warnings.-Comparison with other flare prediction tools, such as ASSA 22 .-Examining correlations between flares and ionospheric TEC changes.-Checking the impact of space weather disturbances on GNSS systems.-Allowing the spacecraft operator to tailor services for spacecraft control aspects.
The users were then asked what factors would be most important to them in a flare forecasting service.Timeliness and accuracy were found to be most important to a user, while scientific details were deemed least important.A summary of all responses appears in Figure B.1.
Most flare forecasts focus on the occurrence of flares defined by the NOAA R scales (e.g., M-class, X-class), but 77% of responders said forecasts of "All Clear" periods are also important.A similar number knew about the NOAA R scales, but 12.5% said that they were not useful for their purposes.The following reasons were given: The R scales classify impacts to radio systems, so are of some interest to us (a Telecoms company), but we are also interested in space weather impacts to other systems such as power distribution.The scale is useful globally, but too coarse at local (i.e., country) level.Currently we are not reacting to NOAA scales.A decision was taken to react to MOSWOC sourced indices.
Finally, the survey came up with suggestions of other useful points for discussion during the First Stakeholders' Workshop.Suggestions included: -Details on verification of results and methods.
-Explanations for non-scientists; international coordination to raise awareness.-Coupling to D-RAP 23 ; specific solar radio burst forecasts/ alerts.-The state of the science and roadblocks for forecast developments.
-Better understanding of impact / consequence of space weather events, and presentation of this for non-experts.-Clear presentation of applicable uncertainties.

Fig. 3 .
Fig. 3. Examples of two solar ARs, in terms of their photospheric continuum intensity images (a, b), LOS magnetograms (c, d) and EUV coronal images at 19.3 nm (e, f).Both active regions were observed on 5 September 2017 at around 12:01 UT.Of them, NOAA AR 12673 (left column) was intensely eruptive, giving the largest eruptive flares since January 2005, while NOAA AR 12674 (right column) did not host any major flares.Images (a-d) have been acquired by HMI, while images (e, f) by AIA, both onboard the SDO mission.

Fig. 4 .
Fig. 4. Automated PIL identification in a photospheric longitudinal magnetogram of NOAA 12253, observed by SDO/HMI on 2 January 2015.The identified PILs in the region are further characterized as "moderate" (yellow contours) and "strong" (red contours).The difference between "strong" and "moderate" PILs relies on the absolute value of the gradient of the vertical magnetic field (>40 G/ pixel and >16 G/pixel, respectively) and the strength of the horizontal field (>120 G and >100 G, respectively).Parts of the PIL that are not highlighted in color are below one or both thresholds and are considered as "weak".The algorithm used is the FLARECAST PIL identification code, accessible as described in Appendix A.2.

Fig. 5 .
Fig. 5. Monthly numbers of (a) calculated property groups and (b) processed NRT SHARP timestamps between 14 September 2012 and 30 January 2019.Property group numbers are shown at highest (1-h; orange bars) and at lowest (24-h; blue bars) cadence.The shaded interval between 13 April 2016 and 8 September 2017 corresponds to a period of questionable quality for the HMI NRT SHARP data (see text for details).Histograms represent a total of 32,098 processed SHARP timestamps, leading to millions of predictor values, since each timestamp could potentially provide up to 209 predictors.

Fig. 6 .
Fig. 6.Attributes of 5557 SHARP-associated flares over September 2012 to May 2019.(a) Flare onset times, in terms of GOES classes X (red), M (orange) and C (green) and respective sub-classes (tick marks; right ordinate), plotted over the 90-day averaged sunspot number over Solar Cycle 24 (left ordinate; Source: Sunspot Index and Long-Term Solar Observations [SILSO], Royal Observatory of Belgium).The numbers of flares for each GOES class are also given.The bottom line shows waterfall diagrams of the distributions of rise time (b), decay time (c) and duration (d).Extrema for each distributions are shown in each plot.The "All flares" diagram corresponds to the sum of flare numbers in each bin.We use fixed, 10-min bin sizes.

Fig. 7 .
Fig. 7.A data component (dashed box) consists of a database (blue cylinder) and a well-defined API (white box).An algorithm (blue rectangle) uses this API to read data (green arrow) and another API to write data (red arrow) into a separate data component.

Fig. 8 .
Fig. 8. Components of the FLARECAST architecture within the main Docker engine.The different grayscale rectangles correspond to different Docker containers, while adjacent containers typically share a common interface.

1.
Property database: It contains the relevant AR information and comprises the following data: (a) SHARP HMI metadata (i.e., properties); (b) data set of SHARP-calculated properties exclusive to FLARECAST; (c) NOAA/SWPC SRS data; (d) flare association as extracted from the NOAA/SWPC event list.2. Machine learning configuration database: It contains information learned by a given machine learning method from a given training set according to the machine learning algorithm category.We have generated the following machine learning algorithm categories and corresponding parameters to be saved and stored: (a) Neural networks ?architecture, synaptic weights.(b) Clustering methods ?clusters' centers.(c) Regression methods ?predictors' weights, regularization parameters.3. Machine learning result database: It contains the prediction results associated to a given AR property set (i.e., the labels predicted by the selected machine learning algorithms).Depending on the machine-learning algorithm utilized, ARs at a given point in time can be labeled using different types of prediction labels.The label types are: (a) Predicted flare occurrence: binary values, 0 (no-flare)/ 1 (yes-flare) values, referring to the flare occurrence at a specific GOES flux level that must be fixed a priori (e.g., flares !M1.0).(b) Predicted flaring probability: decimal percentage values between 0 and 1, similar to above.(c) Flare intensity: a positive real number.(c) Flare intensity and delay: two positive real numbers.

Fig. 10 .
Fig. 10.Example of the development and tests of new morphological flare predictors (adapted from Kontogiannis et al., 2018).(a) Input LOS magnetogram of NOAA 11611.(b) The magnetogram is partitioned into positive-(red contours) and negative-(blue contours) polarity flux patches that are used to derive the Ising energy E Ising,part of the AR.These partitions, along with the vector magnetogram of the AR, are further used to calculate the total unsigned non-neutralized currents I NN,tot .(c) Positive-(red) and negative-(blue) polarity umbrae identified in maps of continuum intensity are used to calculate the Ising energy of the umbrae E Ising,spot and the sum of the horizontal magnetic gradient G s .(d-e) Conditional probabilities for GOES !C1.0 and !M1.0, respectively, calculated for a set of successive thresholds in E Ising , E Ising,part , E Ising,spot , G s and I NN,tot .The respective probabilities inferred for the total unsigned magnetic flux U are also shown for reference.
2.3) machine-learning methods on historical NOAA/SWPC sunspot classifications between August 1996 and December 2010, training the methods on similar data between December 1988 and June 1996.They relied on flaring rates associated to certain sunspot classes, similar to McCloskey et al. (

Fig. 11 .
Fig. 11.Example prediction runs adapted from (a) Florios et al. (2018) and (b) Campi et al. (2019) for the same basic prediction settings, namely a 24-hr forecast window for GOES !M1.0 flares.Both TSS and HSS values are shown in a using RFs only, while b shows only TSS values for two different methods, namely hybrid LASSO (HLA) and RF.The abscissas in both plots correspond to the number of predictors used.

Fig. 12 .
Fig. 12. Example of the impact of thresholding, here of the threshold B h,th for the horizontal magnetic field strength B h , on the efficiency of two potential flare predictors, namely the lengths of the strong-shear and strong-gradient PIL, L ss and L sg , respectively (see text for details).Figure adapted from Guennou et al. (2017).
Fig. 12. Example of the impact of thresholding, here of the threshold B h,th for the horizontal magnetic field strength B h , on the efficiency of two potential flare predictors, namely the lengths of the strong-shear and strong-gradient PIL, L ss and L sg , respectively (see text for details).Figure adapted from Guennou et al. (2017).

Fig
Fig. B.1.Summary of responses to the question, "Which factors are/would be important to you in a flare forecasting service?"Lowest average score corresponds to the most important factor.

Table 1 .
The complete list of FLARECAST properties, separated in property groups and accounting for a total of 209 predictors.The vast majority(197)of them correspond to the AR magnetic field, while the rest include information from the NOAA SWPC Catalogues and from photospheric intensity images.The project uses the flare attributes as properties.At least one flare of desired magnitude in the forecast window signifies a positive instance.Most SHARP parameters correspond to mean values.In the FLARECAST pipeline, other relevant magnetogram-related parameters are also used as predictors; these are maximum and minimum values, median, standard deviation, kurtosis, and skewness.f Used in conjunction with magnetograms to calculate the sum of the magnetic field gradients between all possible opposite-polarity umbrae pairs.
b c Photospheric proxy for coronal information.d Uses PFEs.e

Table 2 .
FLARECAST unsupervised and supervised machine learning methods (final release).