Issue 
J. Space Weather Space Clim.
Volume 13, 2023



Article Number  26  
Number of page(s)  21  
DOI  https://doi.org/10.1051/swsc/2023026  
Published online  20 October 2023 
Research Article
Forecasting solar energetic proton integral fluxes with bidirectional long shortterm memory neural networks
Institute of Astronomy of the Bulgarian Academy of Sciences, 1784 Sofia, Bulgaria
^{*} Corresponding author: mnedal@astro.bas.bg; mnedal@naorozhen.org
Received:
20
April
2023
Accepted:
20
September
2023
Solar energetic particles are mainly protons and originate from the Sun during solar flares or coronal shock waves. Forecasting the Solar Energetic Protons (SEP) flux is critical for several operational sectors, such as communication and navigation systems, space exploration missions, and aviation flights, as the hazardous radiation may endanger astronauts’, aviation crew, and passengers’ health, the delicate electronic components of satellites, space stations, and ground power stations. Therefore, the prediction of the SEP flux is of high importance to our lives and may help mitigate the negative impacts of one of the serious space weather transient phenomena on the nearEarth space environment. Numerous SEP prediction models are being developed with a variety of approaches, such as empirical models, probabilistic models, physicsbased models, and AIbased models. In this work, we use the bidirectional long shortterm memory (BiLSTM) neural network model architecture to train SEP forecasting models for three standard integral GOES channels (>10 MeV, >30 MeV, >60 MeV) with three forecast windows (1day, 2day, and 3day ahead) based on daily data obtained from the OMNIWeb database from 1976 to 2019. As the SEP variability is modulated by the solar cycle, we select input parameters that capture the shortterm, typically within a span of a few hours, and longterm, typically spanning several days, fluctuations in solar activity. We take the F10.7 index, the sunspot number, the time series of the logarithm of the Xray flux, the solar wind speed, and the average strength of the interplanetary magnetic field as input parameters to our model. The results are validated with an outofsample testing set and benchmarked with other types of models.
Key words: Solar energetic particles: flux / Neural networks: LSTM / SEP flux forecasting / Solar activity / Deep learning
Note to the reader: Following the publication of the Erratum, the article has been corrected on 15 December 2023.
© M. Nedal et al., Published by EDP Sciences 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Solar Energetic Protons (SEP) are highenergy particles that are believed to have originated from the acceleration of particles in the solar corona during coronal mass ejections (CMEs) and solar flares (Aschwanden, 2002; Klein & Dalla, 2017; Lin, 2005, 2011; Kahler et al., 2017). They are typically characterized by their high energy levels – with some particles having energies in the relativistic GeV/nucleon range – and their ability to penetrate through spacecraft shielding, causing radiation damage (Reames, 2013; Desai & Giacalone, 2016). The fluency and energy spectrum of SEP are influenced by several factors, including the strength of the solar flare or CME that produced them, and the conditions of the interplanetary environment (Kahler et al., 1984, 1987; Debrunner et al., 1988; Miteva et al., 2013; Trottet et al., 2015; Dierckxsens et al., 2015; Le & Zhang, 2017; Gopalswamy et al., 2017). SEP exhibits a strong association with the solar cycle, with the frequency and flux of SEP events peaking during the maximum phase of the solar cycle (Reames, 2013). This is thought to be due to the increased activity of the Sun during this phase, which leads to more frequent and powerful flares and CMEs. Previous studies have shown a relationship between the occurrence frequency of SEP and the sunspot number (SN; Nymmik, 2007; Richardson et al., 2016). However, the exact relationship between the solar cycle and SEP is complex and not fully understood. Hence, more work is needed to better understand this connection, as previous studies have reported intense SEP events during relatively weak solar activity (Cohen & Mewaldt, 2018; Ramstad et al., 2018).
SEP have been a subject of interest and research in heliophysics for decades. It is hypothesized that shock waves generated in the corona can lead to an early acceleration of particles. However, SEP have sufficient energy to propagate themselves by surfing the interplanetary magnetic fields (IMF), and therefore, the expanding CME is not necessary for their transport (Reames, 2000; Kóta et al., 2005; Kozarev et al., 2019, 2022). While this theory has gained acceptance, there is an ongoing debate among scientists over the specific mechanisms and conditions responsible for SEP production and acceleration. The creation, acceleration, and transport mechanisms of SEP are complex and involve a combination of magnetic reconnection, shock acceleration, and waveparticle interactions (Li et al., 2003, 2012; Ng et al., 2012). The specific mechanisms responsible for SEP production and acceleration can vary depending on the type and strength of the solar event that triggered them. Further research is imperative to better understand the processes involved in the production and transport of SEP in the heliosphere. This will facilitate the development of more precise models that assist in minimizing the impact of SEP on astronauts and spacebased assets.
Several models are available, or under development, for forecasting SEP, which use diverse approaches and serve different objectives. These models comprise computationally complex physics based models, quick and simple empirical models, Machine Learning (ML)based models, and hybrid models that combine different approaches and produce different types of outputs, including deterministic, probabilistic, categorical, and binary. Deterministic models always generate the same output without any randomness or stochastic components, such as predicting the SEP flux at a specific moment or the arrival time of SEP. On the other hand, probabilistic models provide a probability value that reflects the likelihood of a SEP event occurring. However, replicating SEP fluxes at a specific time is still a significant challenge for current models.
An excellent review of SEP models and predictive efforts was recently published by Whitman et al. (2022), which summarizes the majority of the existing models. For instance, Papaioannou et al. (2022) introduced the Probabilistic Solar Particle Event Forecasting (PROSPER) model, which is incorporated into the Advanced Solar Particle Event Casting System (ASPECS)^{1}. The PROSPER model utilizes a Bayesian approach and datadriven methodology to probabilistically predict SEP events for three integral energy channels >10, >30, and >100 MeV. The model’s validation results indicate that the solar flare and CME modules have hit rates of 90% and 100%, respectively, while the combined flare and CME module have a hit rate of 100%. Bruno & Richardson (2021) developed an empirical model to predict the peak intensity and spectra of SEP at 1 AU between 10 and 130 MeV, using data from multiple spacecraft. The model is tested on 20 SEP events and shows good agreement with observed values. The spatial distribution of SEP intensities was reconstructed successfully, and they found a correlation between SEP intensities and CME speed. Hu et al. (2017) extended the Particle Acceleration and Transport in the Heliosphere (PATH) model to study particle acceleration and transport at CMEdriven shocks. They showed that the model can be used to obtain simultaneous calculations of SEP characteristics such as timeintensity profiles, instantaneous particle spectra, and particle pitch angle distributions at multiple heliospheric locations. Overall, results resemble closely those observed in situ near the Earth but also yield results at other places of interest, such as Mars, making it of particular interest to Mars missions. SPREAdFAST (Kozarev et al., 2017, 2022) is a physicsbased, datadriven framework that utilizes EUV observations and models to simulate SEP fluxes at 1 AU and to estimate energetic particle acceleration and transport to various locations in the inner heliosphere. It generates timedependent histograms and movies distributing them through an online catalog. The accuracy and efficiency of the model were encouraging, but the highest energy fluxes showed disagreement with in situ observations by the SOHO/ERNE instrument. However, the framework has great potential for space weather science and forecasting.
AminalragiaGiamini et al. (2021) used neural networks to provide probabilities for the occurrence of SEP based on soft Xray data from 1988 to 2013. They obtained >85% for correct SEP occurrence predictions and >92% for correct noSEP predictions. Lavasa et al. (2021) described a consistent approach to making a binary prediction of SEP events using ML and conventional statistical techniques. The study evaluated various ML models and concluded that random forests could be the best approach for an optimal sample comprising both flares and CMEs. The most important features for identifying SEP were found to be the CME speed, width, and flare soft Xray fluence. Kasapis et al. (2022) employed ML techniques to anticipate the occurrence of a SEP event in an active region that generates flares. They utilized the SpaceWeather MDI Active Region Patches (SMARP) dataset, which comprises observations of solar magnetograms between June 1996 and August 2010. The SMARP dataset had a success rate of 72% in accurately predicting whether an active region that produces a flare would result in a SEP event. Moreover, it provided a competitive lead time of 55.3 min in forecasting SEP events.
Engell et al. (2017) introduced the Space Radiation Intelligence System (SPRINTS), a technology that uses pre and postevent data to forecast solardriven events such as SEP. It integrates automatic detections and ML to produce forecasts. Results show that SPRINTS can predict SEP with a 56% probability of detection and 34% false alarm rate. Nevertheless, the HESPERIA REleASE tools provide realtime predictions of the proton flux at first Lagrange point (L1) by using nearrelativistic electrons as a warning for the later arrival of protons and have been set to operation (Malandraki & Crosby, 2018). Historical data analysis indicates high prediction accuracy, with a low false alarm rate of approximately 30% and a high probability of detection of 63% (Malandraki & Crosby, 2018).
Forecasting SEP is a critical task that serves operational needs and provides insight into the broader field of space weather science and heliophysics. As emphasized in previous works, a highprecision forecasting model is urgently required to predict SEP flux within a period of time, given the risks associated with these events. This highlights the critical requirement for a dependable forecasting system that can mitigate the risks associated with SEP.
Scientists have been using physicsbased and empirical models for decades to forecast SEP. However, these models have certain limitations. Physicsbased models require accurate input data and underlying physical assumptions. In addition, the complexity of the physics involved and in correct parameters may introduce uncertainties that can lead to inaccurate predictions. On the other hand, empirical models rely on historical data to make predictions. While they can be accurate sometimes, they may be unable to account for changes in physical conditions related to the acceleration and propagation of SEP, which can influence prediction accuracy. ML models, however, provide a different approach to SEP forecasting. These models can analyze vast amounts of data, learn patterns from the data that are used, and connections that may not be obvious to experts. Additionally, ML models can adapt to changes in underlying physical conditions, resulting in more accurate predictions as more data is collected; they also provide relatively rapid forecasts, which allows for incorporation into a realtime forecasting workflow.
In the upcoming sections, we will explore the limitations in accuracy that arise from dealing with an imbalanced dataset and lowresolution data. Specifically, the presence of intrinsic outliers in the time series data pertaining to SEP flux poses a significant challenge in modeling. These outliers correspond to occurrences of SEP events and, consequently, have an impact on the accuracy of predictions. Notably, they often lead to an underestimation of the SEP fluxes, primarily due to the predominance of relatively low values throughout the majority of the time interval.
In this paper, we present advanced deeplearning models to forecast the daily integral flux of SEP over a 3day forecasting window by using bidirectional long shortterm memory (BiLSTM) neural networks, for three energy channels (>10, >30, and >60 MeV). Our models can forecast the timedependent development of SEP events in different energy domains, which can be used to model the space radiation profiles using frameworks such as BRYNTRN Wilson et al. (1988) and GEANT4 (Truscott et al., 2000). We describe the data selection and preprocessing in Section 2. We present an overview of the analysis methods and the models implemented in Section 3. Then we show the forecasting results in Section 4. Finally, the summary and implications are presented in Section 5.
2 Data preparation
In this section, we describe the physical quantities, the types of inputs and their sources, as well as the outputs we are forecasting. Some of the technical terms used in this study are explained further in the Appendices A–E.
In order to capture the variability of solar activity, which modulates the SEP flux, we selected input physical quantities that describe both the interplanetary medium and solar activity. These input features can be categorized into two groups: remote signatures and insitu measurements.
The remote signatures consist of the F10.7 index, as well as the longwavelength (XL) and short wavelength (XS) Xray fluxes. The F10.7 index represents the flux of solar radio emission at a wavelength of 10.7 cm, measured in solar flux units (sfu). To obtain the xray fluxes, we utilized 1 and 5minute averaged data from the Geostationary Operational Environmental Satellite (GOES) database^{2}, specifically at long wavelengths (1–8 Å) and short wavelengths (0.5–4.0 Å).
The insitu measurements encompass the nearEarth solar wind magnetic field and plasma parameters. These include the solar wind speed (in km s^{−1}), average interplanetary magnetic field (IMF) strength (in nT), and the integral SEP fluxes at three energy channels: >10, >30, and >60 MeV, which correspond to the GOES channels (in 1/cm^{2} sec ster). These SEP fluxes were obtained from multiple spacecraft stationed at L1 throughout the study period. In particular, the IMF and plasma data in the OMNI database are obtained from the IMP, Wind, and ACE missions, while the energetic particle fluxes are obtained from the IMP and GOES spacecraft^{3}.
To ensure a comprehensive dataset, we acquired hourlyaveraged data covering a timeframe from December 1976 to July 2019, which spans the past four solar cycles. These data were sourced from the Space Physics Data Facility (SPDF) OMNIWeb database^{4}, hosted by the Goddard Space Flight Center. This database provides a wealth of information, including integral proton fluxes, as well as an extensive range of solar wind plasma and magnetic field parameters. Lastly, the daily data on sunspot numbers were obtained from the Sunspot Index and Longterm Solar Observations (SILSO) archive^{5}, maintained by the World Data Center.
Figure 1 shows a plot for the time series data of all features. The top three panels are the logarithms of the SEP integral flux at the three energy channels (log_PF10, log_PF30, and log_PF60), then the sunspot number, the F10.7 index (F10_idx), the logarithms of the Xray fluxes (log_Xs and log_Xl), the solar wind speed (V_{sw}), and the average magnitude of the IMF (avg_IMF). Throughout this paper, we adopt the convention that “log” refers to the common logarithm with a base of 10. The gray shades refer to the timespan of solar cycles. The blue, orange, and gold colors refer to the training, validation, and test sets, respectively. The data split method will be explained shortly.
Figure 1 Data splitting for all input features, showing the training, validation, and testing sets. Daily data from 19761225 00:00 to 20190730 00:00. The gray shading labels the solar cycles from SC21 to SC24. 
Since the input SEP data have been compiled from various spacecraft, it may have artifacts even after processing. In particular, there are occasional jumps in the background level. There are also several daylong gaps in the OMNI solar wind parameters from the early 1980s to mid1990s where only IMP 8 data are available and this spacecraft spent part of each orbit in the magnetosphere. We are reasonably confident that these issues do not influence the overall analysis significantly.
In deep learning applications, the dataset is split into three sets; namely the training set, the validation set, and the test set. The training set is usually the largest chunk of data that is used to fit the model. The validation set is a smaller chunk of data used to finetune the model and evaluate its accuracy to ensure it is unbiased. The test set is the outofsample data exclusively used to assess the final model when performing on unseen data (Ripley, 1996).
After inspecting the correlation between the solar wind indices and the SEP integral fluxes in the OMNIWeb database, we chose the topcorrelated features with the SEP flux. The correlations were made between the SEP fluxes and the individual parameters. Hence we took only time series of logarithms of the protons’ integral flux at three energy channels (>10, >30, and >60 MeV), the time series of logarithm of the Xray fluxes, the F10.7 index, the sunspot number, the solar wind speed, and the average strength of the IMF as input parameters to our model. The log of the SEP flux was used across the whole study. The correlation matrices for the training, validation, and test sets are shown in Figure 2. The Xray and proton fluxes were converted into the logarithmic form because it was more convenient than the original form of data since the time series data were mostly quiet and had numerous sharp spikes, which correspond to solar events. Based on a previous experience with neural networks (NNs) (Nedal et al., 2019), we found that training separate models for each target (output) feature can lead to better results. This is because a dedicated model for each output feature can more easily learn the interrelationships between input features and make more accurate predictions. Therefore, in our current study, we trained three separate models, each one targeting the logarithm of the proton’s integral flux at a specific energy channel.
Figure 2 Correlation matrices show the correlation between the features in the training, validation, and test sets. 
In order to ensure consistency across all features, all durations of the time series data of the physical quantities were matched to be within the same time range. Subsequently, the dataset was resampled to obtain daily averaged data, resulting in a significant reduction of the dataset size by a factor of 24. This reduction facilitated expeditious training and yielded prompt results. There were missing data values in the original dataset for the B_{avg} (~10.7%), V_{sw} (~10.5%), F10.7index (~0.08%), shortband Xray flux (~8%), longband Xray flux (~9.8%), and proton fluxes (~4.3%). The data gaps were linearly interpolated.
In time series forecasting, it is a common practice to take a continuous set of data points from the main dataset to be the validation set and another smaller chunk of data to be the test set, for instance in Pala & Atici (2019), Benson et al. (2020), Zhang et al. (2022), and Zhu et al. (2022). From our experiments, we got decent results when we applied the same data split method, but the results were a bit biased toward the end of the solar cycle 24 and the testing set was biased towards a quiet period. So, we adapted the 921 strategy, that is taking from each year 9 months to be added to the training set, 2 months to be added to the validation set, and 1 month to be added to the test set. This is applied over the ~43 years of data (Fig. 1), which yields 74.29% of the data for the training set, 16.2% for the validation set, and 9.51% for the testing set. By doing so, we eliminated the need to do crossvalidation and hence, made the training more efficient. It is worth mention that the time series data must not be shuffled as that will break the temporal and logical order of measurements, which must be maintained.
3 Method
In this section, we introduce the data analysis methods used in this work. We start by explaining the model selection phase, followed by a discussion of the BiLSTM neural network architecture. The technical terminologies are described in Appendices A–E.
3.1 The BiLSTM Model
Recurrent neural networks (RNNs) that support processing input sequences both forward and backward are known as BiLSTM neural networks (Schuster & Paliwal, 1997). Regular RNNs (Hochreiter & Schmidhuber, 1997; Kolen & Kremer, 2001) depend on the prior hidden state and the current input to determine the output at a given time. The output of a BiLSTM network, on the other hand, is dependent on the input at a given moment as well as the previous and future hidden states. As a result, the network is able to make predictions using contexts from the past as well as the future. Hence, accuracy is improving. Each BiLSTM layer consists of two LSTM layers: a forward layer that processes the input sequences from the past to future and a backward layer that processes the input sequences from the future to the past, as illustrated in Figure 3, to capture information from both past and future contexts. The output from each layer is concatenated and fed to the next layer, which can be another BiLSTM layer or a fully connected layer for final prediction.
Figure 3 Architecture of a single BiLSTM layer. The blue circles at the bottom labeled by (x_{0}, x_{1}, x_{2}, …, x_{i}) are the input data values at multiple time steps. The purple circles, on the other hand, are the output data values at multiple time steps labeled by (y_{0}, y_{1}, y_{2}, …, y_{i}). The dark green and light green boxes are the activation units of the forward layer and the backward layer, respectively. The orange and yellow circles are the hidden states at the forward layer and the backward layer, respectively. Both the forward and backward layers compose a single hidden BiLSTM layer. The figure is adapted from Olah (2015). 
BiLSTM networks are more advantageous than traditional LSTM networks in a variety of aspects (Graves & Schmidhuber, 2005; Ihianle et al., 2020; Alharbi & Csala, 2021). First, as we demonstrate in this study, they are excellent for tasks like time series forecasting, as well as speech recognition and language translation (Wöllmer et al., 2013; Graves & Jaitly, 2014; Sundermeyer et al., 2014; Huang et al., 2018; Nammous et al., 2022) because they can capture longterm dependencies in the input sequence in both forward and backward directions. Second, unlike feedforward networks, BiLSTM networks do not demand fixedlength input sequences, thus being able to handle variablelength sequences better. Furthermore, by taking into account both past and future contexts, BiLSTM networks can handle noisy data. However, BiLSTM networks are computationally more expensive than regular LSTM networks due to the need for processing the input sequence in both directions. They also have a higher number of parameters and require more training data to achieve good performance.
The final dataset has 7 features, including the target feature, from December 25th 1976 to July 30th 2019, with a total of 15,558 samples (number of days). The training set has 11,558 samples, the validation set has 2520 samples, and the test set has 1480 samples.
The input horizon of 270 steps (30 days × 9 months) was used. A data batch size of 30 was used, which is the number of samples processed that result in one update to the model’s weights (Appendix B). The model consists of 4 BiLSTM layers with 64 neurons each, and an output dense layer with 3 neurons, representing the output forecasting horizon. The total number of trainable parameters is 333,699. The number of training epochs was set to 50 because from experiments, the model stopped improving remarkably after almost 50 epochs. Thus, there was no need to waste time and computational resources to train the model for more than 50 epochs.
The ModelCheckpoint callback function was used to register the model version with the minimal validation loss. The EarlyStopping callback function was used to halt the model run when detecting overfitting, with a patience parameter of 7. ReduceLROnPlateau callback function was used to reduce the learning rate when the validation loss stops improving, with a patience parameter of 5, a reduction factor of 0.1 and minimal learning rate of 1e^{−6}.
3.2 Model selection
To determine the most suitable model for our objective and provide justifiable reasons, we conducted the following analysis. First, we examined the naive (persistence) model, which is very simplistic and assumes that the time series values will remain constant in the future. In other words, it assumes that the future value will be the same as the most recent historical value. That was the baseline. Next, we examined the movingaverage model, which calculates the future values based on the average value of historical data within a specific time window. This gives a little bit lower error.
After that, we went towards the MLbased models. For all the ML models, we chose the Adaptive moment estimation (Adam) optimizer (Kingma & Ba, 2015) as the optimization algorithm due to its minimal memory requirements and high computational efficiency as it is wellsuited for applications that involve large number of parameters or large datasets. As a rule of thumb, we set the optimizer’s learning rate to be 0.001 as it is usually recommended (Zhang et al., 2022).
In order to prepare the data in a readable format for the ML models, we created a windowed dataset with an input horizon of 365 steps representing 1 year of data and an output horizon of 3 steps representing the forecast window of 3 days. We call this windowing method as MultiInput Multiple Output (MIMO) strategy, in which the entire output sequence is predicted in one shot. The MIMO strategy adopts the sliding window method that was mentioned in Benson et al. (2020) in which each sequence is shifted by one step with respect to the previous sequence until reaching the end of the available data (Fig. 4). This approach minimized the imbalance of active days, with high SEP fluxes, and quiet days.
Figure 4 Illustration of the sliding window technique for a sample of 10 timesteps, where each number denotes a distinct time step. As an example here, the input horizon (blue color) length is 4 timesteps and the output horizon length is three timesteps. The input window slides one time step at a time across the entire data sequence to generate four distinct input and forecast horizon pairs. The purple, orange, and green colors of the output horizon represent 1day, 2day, and 3day ahead forecasting, respectively. The timesteps of 1day ahead forecasting across the data sequences are then concatenated into a single timeseries list that is called 1day ahead prediction. The same for 2day and 3day ahead. 
After experiments with different loss functions and evaluating their performance on our dataset, we chose the Huber function (C1) as the loss function, and the Mean Absolute Error (MAE) was used as the metric function to monitor the model performance. We used the Huber function because it is robust and combines the advantages of both Mean Squared Error (MSE) and MAE loss functions. It is less sensitive to outliers than MSE, while still being differentiable and providing gradients, unlike MAE. Since our data is noisy and contains outliers that may negatively impact the model’s performance, the Huber loss function is a good choice.
We examined various neural network models to determine the optimal architecture for our task. Initially, we started with a simple linear model comprising of a single layer with a single neuron. However, this model did not yield satisfactory results. We then explored a dense ML model consisting of 2 hidden layers, each with 32 neurons and a RelU activation function. Next, we experimented with a simple RNN model with the same number of hidden layers and neurons. To find the optimal learning rate, we utilized the LearningRateScheduler callback function and discovered that a rate of 1.58e^{−4} under the basic settings minimized the loss. We proceeded to examine stateful versions of RNN, LSTM, and BiLSTM models with three hidden layers, each with 32 neurons and a learning rate of 1.58e^{−4}. In addition, we explored a hybrid model that consisted of a 1dimensional convolutional layer with 32 filters, a kernel size of 5, and a RelU activation function. We combined this with a twohidden layer LSTM network with 32 neurons each and a learning rate of 1.58e^{−4}. We experimented with Dropout layers but did not observe any significant improvement in the results. Finally, we evaluated a BiLSTM model with five hidden layers, 64 neurons each, and a learning rate of 0.001. Based on the evaluation of all the models on both the validation and test sets (Fig. 5 and Table D1 in Appendix D), we selected the BiLSTM model for further refinement. More details on the final model architecture and hyperparameters are explained in Appendix D. Figure 5 presents a comparative analysis of the Huber loss within the validation and testing sets across the ten aforementioned models. We used several evaluation measures to assess our models since each metric provides valuable insights into the accuracy and performance of the forecasts (Appendix C), helping to identify areas for improvement and adjust the forecasting models accordingly.
Figure 5 Benchmarking of 10 models, shows the Huber loss for the validation and test sets. 
4. Results and discussion
The benchmarking in Figure 5 showed that, in general, the MLbased methods were not much different. On the other hand, the persistence model and moving average model resulted in the highest errors compared with the MLbased models, and their results were close to some extent. As we see, the BiLSTM model performed the best over both the validation and test sets compared with the other models.
We developed and trained three BiLSTM models to forecast the integral flux of SEP, one model per energy channel. After the training was completed, we evaluated the performance of the models from the loss curve (Fig. 6) using the Huber loss (the left panel) and the metric MAE (the middle panel). During the training, the learning rate was reduced multiple times via the LearningRateScheduler callback function (the right panel). The left panel quantifies the discrepancy between the model’s predictions and the true values over time. It shows how the Huber loss function changes during the training iterations (Epochs) for the training and validation sets for the three energy channels so that each channel has one color. The middle panel shows how the model’s metric MAE changes with training epochs. It is used to evaluate the performance of the trained model by measuring the average absolute difference between the model’s predictions and the true values, providing a single numerical value that indicates the model’s error at a given epoch. The right panel shows how the learning rate of the model’s optimizer changes with epochs via the LearningRateScheduler callback function, which changes the learning rate based on a predefined schedule to improve training efficiency and convergence. The learning rate refers to the rate at which the model’s parameters are updated during the training process. We noticed that at the epochs where the learning rate has changed, there were bumps in the loss curves across all the energy channels, which is expected. This highlights the boundaries within which the learning rate yields better performance.
Figure 6 Left Panel – The Huber loss vs. the number of training epochs for the BiLSTM model for the validation and test sets, for the 3 energy channels. Middle Panel – The mean absolute error (MAE); the model’s metric vs. the number of training epochs. Right Panel – Shows how the learning rate of the Adam optimizer changes over the number of epochs. 
From experimentation, we found that the batch size and the optimizer learning rate are the most important hyperparameters that have a strong influence on the overall model’s performance (Greff et al., 2016). In addition, adding dropout layers as well as varying the number of hidden layers and hidden neurons resulted in only marginal improvements to the final model performance, while substantially increasing training time and requiring greater computational resources.
The term batch size refers to the number of data sequences processed in one iteration during the training of ML model (Goodfellow et al., 2016). Initially, a batch size of 64 was selected, however, we observed that the model produced better results when a batch size of 30 was used instead. This could be related to the Carrington rotation, which lasts for ~27 days. There were ~570 Carrington rotations between December 25th 1976 and July 30th 2019. Therefore, updating the model’s weights after every Carrington rotation could be a reasonable choice for improving its performance. Figure 7 shows how good the model predictions are (on the yaxis) compared with the observations of the validation set (on the xaxis). The blue, orange, and gold colors refer to 1day, 2day, and 3day ahead predictions, respectively. The top panel is for the >10 MeV channel, the middle panel is for the >30 MeV channel, and the bottom panel is for the >60 MeV channel. The left column is for the entire validation set, while the right column is for the observations points ≥10 proton flux units (pfu). That is the threshold value of proton flux as measured by the National Oceanic and Atmospheric Administration (NOAA) GOES spacecraft to indicate the severity of space weather events caused by SEP.
Figure 7 Correlation between the model predictions and observations for 1day, 2day, and 3day ahead for >10 MeV (top panel), >30 MeV (middle panel), and >60 MeV (bottom panel). The panels in the left column represent all the points of the validation set and those in the right column represent all the observation points with daily mean flux ≥10 pfu. 
We found that, overall, the models performed very well. The R correlation was >0.9 for all points of the validation set across the forecasting windows for the three energy channels. The R correlation was >0.7 for the observations points ≥10 pfu as well. However, the correlation between the modeled data and the observations exhibited a decline as the forecast horizon increased, in accordance with the anticipated result.
To confirm the validity of the models, we performed the same correlation analysis between the modeled data and the observations of the outofsample test set (Fig. 8), which was not given to the model. Again, we found a high correlation across the forecasting windows for the three energy channels. The points were more dispersed between 1 and 1.5 on the xaxis, which reflected in a bit lower correlation. This might be a limitation in the current version of the model between that range of SEP fluxes since the models underestimated the flux values within that range across all energy channels, possibly due to the relatively smaller training samples with fluxes above 10 pfu compared with the majority of the data.
Figure 8 Same as Figure 7 but for the test set. 
In order to see the temporal variation of the correlation between the modeled data and the observations, we applied a rolling window of 90 steps (3 months × 30 days/month = 1 season) that shows the seasonal variation of the correlation, as shown in Figure 9. Here, we show only the 1day ahead predictions for the test set, for the three energy channels. We observe drops in the correlation factor synchronized with the transition between solar cycles (e.g., particularly between ~1995 and 2000, which represents the declining phase of the solar cycle 22 and the rising phase of the solar cycle 23). This could be related to the fact that the low SEP fluxes during quiet times are more random and thus more difficult to forecast (Feynman et al., 1990; Gabriel et al., 1990; Rodriguez et al., 2010; Xapsos et al., 2012).
Figure 9 Comparison between the model outputs and observations of the test set for the 3 energy channels. In addition to the rollingmean window correlation for 1day ahead predictions. 
During periods of low solar activity, the forecasting of low SEP fluxes becomes more challenging due to their increased randomness. This difficulty arises from the reduced occurrence of conventional SEP drivers, such as solar flares and CMEs. Studies have suggested that the most significant solar eruptions tend to happen shortly before or after the solar cycle reaches its maximum (Švestka, 1995). Additionally, sporadic increases in solar activity have been observed (Kane, 2011), which might contribute to the diminished correlations observed in our research. There is clearly some factor that is influencing the correlation during certain periods where there are no or only weak SEP events. However, it is not obvious which physical phenomena are the cause rather than, for instance, some artifact of the data. Understanding the interplay between these factors and their influence on SEP fluxes during periods of reduced solar activity remains a critical area of research. It would be interesting to find what is reducing the correlations, thus more investigation is needed.
Overall, the modeled data were correlated the most with observations at >60 MeV, then the second rank was for the >10 MeV channel, and the third rank was for the >30 MeV channel. That could be related to the relatively larger extent of drops in correlation at the >30 MeV channel. The decline in correlation at the >30 MeV channel is consistent with the findings of Le & Zhang (2017). A summary of the performance results of the models for both the validation set and test set is presented in Table 1.
Summary of the performance results of the models for the validation and test sets.
From the visual inspection of the test set examples (Figs. 10–12), we found that the predicted onset time, the peak time, and end times of SEP events were highly correlated with those of the observations, which implies that the model captured the temporal variations, as well as the trends in SEP flux.
Figure 10 The model’s forecasts for the outofsample testing set for the >10 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 
Figure 11 The model’s forecasts for the outofsample testing set for the >30 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 
Figure 12 The model’s forecasts for the outofsample testing set for the >60 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 
To get further insight into the model’s performance, we conducted an assessment of various skill scores, including True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Additionally, skill score ratios such as Probability of Detection (POD), Probability of False Detection (POFD), False Alarm Rate (FAR), Critical Success Index (CSI), True Skill Statistic (TSS), and Heidke Skill Score (HSS). Detailed descriptions of these skill scores can be found in Appendix E. To extract individual SEP events from the test dataset, we implemented a thresholdbased clustering algorithm. This algorithm uses the NOAA/SWPC warning threshold value of 10 pfu for the E ≥ 10 MeV channel. Upon analysis, we identified the number of detected SEP events for each output forecasting window and calculated the skill scores (Table 2). In the true test set, we identified 12 SEP events.
Confusion matrix for the energy channel ≥10 MeV predictions in the test set.
The evaluation of the model revealed notable trends as the length of the output forecasting window increased. The POD and CSI exhibited a declining pattern, indicating a reduced ability of the model to accurately detect and capture positive events (SEP occurrences) as the forecasting horizon extended further into the future. This suggests that the model’s performance in identifying and capturing true positive instances diminishes with longer forecasting windows. Moreover, the POFD demonstrated an increasing trend, indicating an elevated rate of false positive predictions as the forecasting horizon lengthened. The model’s propensity to generate false alarms rose with the lengthening forecasting window, leading to incorrect identification of nonevents as positive events. Consequently, the TSS and HSS exhibited decreasing values, signifying a deterioration in the model’s overall skill in accurately capturing and distinguishing between positive and negative instances. Overall, our skill scores are comparable with those reported by previous studies (Table 3). Although the UMASEP model does better than ours (i.e., has a higher POD), our FAR is much lower, thus, making fewer false alarms than the UMASEP model.
Comparing the skill scores with previous models. The dashed entries mean the data is unavailable (see Whitman et al., 2022, for more details).
5 Conclusions
Forecasting the SEP flux is a crucial task in heliophysics since it affects satellite operations, astronaut safety, and groundbased communication systems. It is a challenging task due to its nonlinear, nonstationary, and complex nature. Machine learning techniques, particularly neural networks, have shown promising results in predicting SEP flux. In this study, we developed and trained BiLSTM neural network models to predict the dailyaveraged integral flux of SEP at 1day, 2day, and 3day ahead, for the energy channels >10 MeV, >30 MeV, and >60 MeV. We used a combination of solar and interplanetary magnetic field indices from the OMNIWeb database for the past 4 solar cycles as input to the model. We compared the models with baseline models and evaluated them using the Huber loss and the error metrics in Appendix C.
The data windowing method we used, based on the MIMO strategy, eliminates the need to feed the output forecast as input back into the model and that allows forecasting relatively far into the future while maintaining decent results (e.g., the MSE ranging between 0.007 and 0.015 for 1day forecasting in the test set, compared to an MSE of 0.236 for a persistence model. See Table 1). The results show that the model can make reasonably accurate predictions given the difficulty and complexity of the problem. The MSE was ranged between 0.009 and 0.055 for the validation set, and between 0.007 and 0.05 for the test set. The correlations between the observations and predictions were >0.9 for the validation and test sets (Figs. 7 and 8). Nevertheless, the mean temporal correlation was ~0.8 for the test set (Fig. 9). Although our models performed well, we observed a relatively large discrepancy between the predictions and the observations in the >30 MeV energy band.
The findings of this study underscore the challenges encountered by the forecasting model in accurately predicting SEP data over longer time periods. As the length of the output forecasting window increased, the model’s ability to detect true positives and its overall skill in differentiating positive and negative instances diminished. Additionally, the model displayed an elevated rate of false negative predictions, indicating an increased tendency to generate misses as the forecasting horizon extended. These results highlight the importance of carefully considering the appropriate forecasting window length for SEP data to ensure the model’s optimal performance. Our skill scores generally align with those from previous works (Table 3). There are variations in the metrics’ values across different studies, highlighting the complexities and nuances associated with each study. Nevertheless, it is important to acknowledge that the statistical significance of the results in this study is limited due to data averaging. Future studies should consider incorporating hourly data, as this is likely to result in a greater number of identified events. The model can provide shortterm predictions, which can be used to anticipate the behavior of the nearEarth space environment. These predictions have important implications for space weather forecasting, which is essential for protecting satellites, spacecraft, and astronauts from the adverse effects of solar storms.
Multiple techniques exist for identifying the optimal combination of hidden layers and neurons for a given task such as empirical methods, parametric methods, and the grid search crossvalidation method, which we will explore in future work. The observed reduction in correlation necessitates further investigation to determine its origin, whether stemming from tangible causal factors or potential aberrations within the model or data. We plan to expand upon this work by performing shortterm forecasting using hourlyaveraged data. This extension will involve integrating additional relevant features such as the location and area of active regions and coronal holes on the Sun.
BiLSTM networks are particularly useful for tasks involving sequential data such as time series forecasting. Given their capacity to handle input sequences in both directions in time and capture longterm dependencies, they are valuable in a broad range of applications. Nonetheless, one should carefully consider their data requirements and computational complexity before adopting them. Our results emphasize that the use of deep learning models in forecasting tasks in heliophysics are promising and encouraging, as pointed out by Zhang et al. (2022).
This work is a stepping stone towards realtime forecasting of SEP flux based on the publicavailable datasets. As an extension, we are currently working on developing a set of models that deliver nearreal time prediction of SEP fluxes at multiple energy bands, multiple forecasting windows, with hourlyaveraged data resolution, with a more sophisticated model architecture, as well as more features that address the state of solar activity more comprehensively. We plan to extend the analysis to include more recent data from solar cycle 25, in order to improve the accuracy of the models. In conclusion, our study highlights the potential of using BiLSTM neural networks for forecasting SEP integral fluxes. Our models provide a promising approach for predicting the nearEarth space environment too, which is crucial for space weather forecasting and ensuring the safety of our space assets. Our findings contribute to the growing body of literature on the applications of deep learning techniques in heliophysics and space weather forecasting.
Appendix A
Terminologies
In this section, we introduce the main concepts related to machine learning which are presented in this paper.
Crossvalidation: A technique used to evaluate the performance of a machine learning model by dividing the data into subsets and assessing the model on different combinations of these subsets.
Input horizon: The number of previous time steps considered as input to a model for time series forecasting. It represents the length of the historical sequence used for predictions.
Batch size: The number of samples processed together in a single iteration of the training algorithm. It affects training speed and memory requirements.
Updating the model’s weights: The process of adjusting the parameters of a neural network based on training data to minimize the difference between predicted and true outputs. The model’s weights represent the parameters that are learned during the training process.
Loss: A function that quantifies the difference between predicted and actual outputs. It guides the optimization process during training.
Minimum validation loss: The lowest value achieved by the loss function on a validation dataset during training. It indicates the most accurate predictions on unseen data.
Overfitting: When a model performs well on training data but fails to generalize to unseen data due to memorizing training examples instead of learning underlying patterns.
Learning rate: A hyperparameter that determines the step size at each iteration of the optimization algorithm during training. It affects learning speed and convergence. A high learning rate can cause the training process to converge quickly, but it may also result in overshooting the optimal solution or getting stuck in a suboptimal solution. On the other hand, a very low learning rate can make the training process slow and may struggle to find the optimal solution.
Reducing the learning rate when the validation loss stops improving: This concept involves adjusting the learning rate dynamically during the training process. When the validation loss reaches a plateau or stops improving, it indicates a suboptimal point. By reducing the learning rate, the model can take smaller steps in weight space, potentially finding a better solution. This technique, known as learning rate scheduling or learning rate decay, is commonly used to finetune the model’s performance.
Patience: A parameter used in training to determine the number of epochs to wait for an improvement in validation loss before stopping the training process.
Patience parameter of 7: In the context of early stopping, training will be stopped if the validation loss does not improve for 7 consecutive epochs.
Adam optimizer: A popular optimization algorithm in deep learning that combines Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSprop) to achieve efficient optimization.
Optimal architecture: The best configuration of a neural network, including the number of layers, neurons, and other choices, for optimal performance on a specific task.
Hyperparameters: Parameters set before training a model that control the learning algorithm’s behavior, such as learning rate, batch size, and activation functions.
Layer: A building block of a neural network that performs specific operations on input data. Includes input, hidden, output, fully connected, convolutional, recurrent, activation, and dropout layers. Here is a description for each layer:
Input layer: The first layer of a neural network that receives raw input data. It passes the input to subsequent layers for further processing. The number of nodes in the input layer is determined by the dimensionality of the input data.
Hidden layers: Intermediate layers between the input and output layers. They perform computations on the input data and capture higherlevel representations or abstractions. Hidden layers are not directly exposed to the input or output.
Output layer: The final layer of a neural network that produces model predictions or outputs based on computations from preceding layers. The number of neurons in the output layer depends on the problem being solved, such as regression or classification.
Fully connected layer (dense layer): Each neuron in this layer is connected to every neuron in the previous layer. It allows information flow between all neurons, enabling complex relationships to be learned.
Convolutional layer: Commonly used in Convolutional Neural Networks (CNNs) for analyzing gridlike data, such as images. It applies convolution operations using filters or kernels to learn spatial patterns or features.
Recurrent layer: Used in Recurrent Neural Networks (RNNs) to process sequential data. These layers have feedback connections that allow information to be passed from one step to the next, capturing temporal dependencies and maintaining memory of past inputs.
Activation layer: Applies a nonlinear function to the output of a layer, introducing nonlinearity into the neural network. Activation functions like Sigmoid, Hyperbolic Tangent (tanh), or Rectified Linear Unit (ReLU) determine neuron outputs based on weighted inputs.
Dropout layer: A regularization technique commonly used in deep learning models. It randomly sets a fraction of outputs from the previous layer to zero during training, preventing overfitting and improving generalization.
Layers play a crucial role in the information processing and learning capabilities of neural networks. The arrangement and combination of different layers determine the network’s architecture and ultimately its ability to solve specific tasks.
Stateful: A property of Recurrent Neural Networks (RNNs) where the hidden state is preserved between consecutive inputs, allowing the network to have memory.
Neuron: A computational unit in a neural network that receives input, applies weights, and passes the result through an activation function to produce an output.
Hidden neuron: A neuron in a hidden layer of a neural network that performs intermediate computations.
Callback function: A function used during model training to perform specific actions at certain points or conditions, such as saving the best model, adjusting learning rates, or early stopping.
LearningRateScheduler callback function: A function used in training to dynamically adjust the learning rate at specific points based on a predefined schedule or function. It improves training efficiency and convergence by allowing the model to make finer adjustments as it approaches the optimal solution.
Appendix B
Mathematical Representation of the LSTM NN Model
The computations inside one LSTM cell can be described by the following formulas (Ihianle et al., 2020):
where x_{t} is input data at time t. The input gate i_{t} determines which values from the updated cell states (candidate values) should be added to the cell state. It also takes into account the current input x_{t} and the previous output h_{t−1}, and is passed through a sigmoid activation function. represents the candidate values that are added to the cell state at time t. The forget gate activation vector f_{t} at time step t, which determines how much of the previous cell state should be retained. The cell state C_{t} at time t is updated based on the forget gate, input gate, and candidate values. The output gate o_{t} at time t determines how much of the cell state should be output. The output vector h_{t} at time t is calculated based on the cell state and the output gate values. h_{t−1} is the output vector at the previous time step t − 1. W_{f}, W_{i}, W_{c}, W_{o} are the weight matrices for the input vector x_{t}. U_{f}, U_{i}, U_{c}, U_{o} are the weight matrices for the output vector h_{t−1}. b_{f}, b_{t}, b_{c}, b_{o} are the bias vectors. The symbol ⊙ denotes a pointwise multiplication. The sigmoid function σ is used as the activation function for the gate vectors, and the hyperbolic tangent function tanh is used for the candidate values and the output vector.
Appendix C
Evaluation metrics
To evaluate the model performance, we used the following equations:
where y is the true value, is the predicted value, and δ is a threshold in the Huber loss function that controls the tradeoff between the mean squared error (MSE) and the mean absolute error (MAE). In this paper, it was set to 0.1, which was selected based on several experiments.
MSE is the mean squared error, which measures the difference between predicted and actual values by calculating the average of squared differences. It provides a measure of the average squared magnitude of the errors in your forecasts, which can be useful in penalizing larger errors more heavily than smaller errors.
MAPE is the mean absolute percentage error, which measures the difference between predicted and actual values by calculating the average of absolute differences. It provides a measure of the average magnitude of the errors, allowing to evaluate the overall accuracy of your forecasts.
RMSE is the root mean squared error, which measures the difference between predicted and actual values by taking the square root of the average of squared differences. It provides a measure of the accuracy of the forecasts in the same units as the original data, allowing to evaluate the magnitude of errors in the same scale as the data.
MAPE is the mean absolute percentage error, which measures the accuracy of a forecast by calculating the average of absolute percentage errors. It provides a measure of the accuracy of the forecasts in percentage terms, allowing to evaluate the magnitude of errors relative to the actual values. MSE, MAE, RMSE, and MAPE are often used in regression analysis to assess the accuracy of the model’s predictions.
Finally, R is the Pearson correlation coefficient, which measures the strength and direction of the relationship between two continuous variables, and can provide an indication of the extent to which changes in one variable may be related to changes in the other.
Appendix D
Model Configuration
The configurations for the ML models shown in Figure 5 and their performance on the validation set and the test set for the SEP integral flux >10 MeV are presented in Table D1. The batch size was set to be 64 and the number of training epochs was set to be 100. The EarlyStopping callback function, with a patience of 10, is used to help prevent overfitting during the training process by stopping training when the monitored metric has stopped improving for a certain number of epochs. The patience parameter controls how many epochs the training will continue without improvement before it is stopped. This is useful because if the validation loss stops getting better, the model has probably overfitted the training data and is not generalizing effectively to new data. By stopping the training early, we can avoid wasting time and resources on further training that is unlikely to improve the model’s performance.
Configuration of the ML model. (1) refers to the error value for 1day forecasting. Same for (2) refers to 2day forecasting, and (3) for 3day forecasting. *In the 1DCNN layer, 32 filters, a kernel size of 5, and strides of 1 were used.
We used the ModelCheckpoint callback function to save the best weights of the model during training so that they can be reused later. The LearningRateScheduler callback function allows to dynamically adjust the learning rate of the model during training using a function passed to it that will be called at the beginning of each epoch, and it should return the desired learning rate for that epoch. It can be useful when training deep neural networks, as it allows for a higher learning rate in the early stages of training when the model is still far from convergence, and a lower learning rate as the model approaches convergence, which can help it to converge more accurately. The downside might be the longer training time.
All the calculations and model runs were implemented under the framework of TensorFlow 2.3.0 (Singh & Manure, 2020) in Python 3.6.13. The models were executed on Ubuntu 20.04.1 LTS OS with 4 × GPUs (NVIDIA GeForce RTX 2080 Ti, 11019 MiB, 300 MHz). According to the Keras API guide (Ketkar, 2017), the requirements to use the cuDNN implementation are the activation function must be set to tanh and the recurrent activation must be set to sigmoid. We also set the seed number to 7 across all the model runs to maintain reproducibility.
Stateful RNNs can be difficult to work with when using callbacks in Keras because their hidden state must be manually managed across minibatch updates. When training a stateful RNN in Keras, the hidden state is carried over from the previous epoch and can cause problems with certain callbacks, such as EarlyStopping or ModelCheckpoint. To work around this issue, one can use stateless RNNs or manually reset the hidden state at the end of each epoch, but this can be complex and prone to errors.
Appendix E
Skill Scores
Skill scores and ratios are commonly used in evaluating the performance of classification models, particularly in binary classification tasks. They provide insights into the model’s ability to correctly predict positive and negative instances. Here is a brief description of each skill score and ratio, along with their formulas:
True positive (TP): The number of data points or intervals correctly identified as positive by the model. It represents instances where both the model and the ground truth indicate the presence of an event.
True negative (TN): The number of intervals correctly identified as negative by the model. It represents instances where both the model and the ground truth indicate the absence of an event.
False positive (FP): The number of intervals incorrectly identified as positive by the model. It occurs when the model predicts an event, but the ground truth indicates its absence.
False negative (FN): The number of intervals incorrectly identified as negative by the model. It occurs when the model fails to detect an event that the ground truth indicates its presence.
Accuracy: Represents the proportion of correct predictions out of total predictions.
Precision: Represents the proportion of positive predictions that are actually positive.
Probability of detection (POD) or recall: Represents the model’s ability to correctly identify positive instances.
Probability of false detection (POFD): Measures the model’s tendency to falsely predict positive instances when the ground truth indicates their absence.
False alarm rate (FAR): Indicates the ratio of false positive predictions to the total number of positive instances.
Critical success index (CSI): Measures the model’s ability to correctly predict both positive and negative instances.
True skill statistic (TSS): Takes into account both the model’s ability to detect positive instances and its ability to avoid false alarms.
Heidke skill score (HSS): Evaluates the model’s performance by comparing it with random chance. It takes into account the agreement between the model’s predictions and the observed data, considering both true positive and true negative predictions.
where
Acknowledgments
We thank the referees for their feedback. We acknowledge using data from the GSFC/SPDF OMNIWeb interface. This work is supported by the Modeling and ObServational Integrated Investigations of Coronal Solar Eruptions (MOSAIICS) project, funded under contract KP06DV8/18.12.2019 to the Institute of Astronomy and National Astronomical Observatory (NAO), the Bulgarian Academy of Sciences (BAS), under the National Scientific Program “VIHREN” in Bulgaria. The editor thanks Ian Richardson and two other anonymous reviewers for their assistance in evaluating this paper.
References
 Alharbi FR, Csala D. 2021. Shortterm solar irradiance forecasting model based on bidirectional long shortterm memory deep learning. In: 2021 International Conference on Electrical, Communication, and Computer Engineering (ICECCE), IEEE, pp. 1–6. https://doi.org/10.1109/ICECCE52056.2021.9514233. [Google Scholar]
 AminalragiaGiamini S, Raptis S, Anastasiadis A, Tsigkanos A, Sandberg I, Papaioannou A, Papadimitriou C, Jiggens P, Aran A, Daglis IA. 2021. Solar energetic particle event occurrence prediction using solar flare soft Xray measurements and machine learning. J Space Weather Space Clim 11: 59. https://doi.org/10.1051/swsc/2021043. [CrossRef] [EDP Sciences] [Google Scholar]
 Aschwanden MJ. 2002. Particle acceleration and kinematics in solar flares: a synthesis of recent observations and theoretical concepts (invited review). Space Sci Rev 101(1): 1–227. https://doi.org/10.1023/A:1019712124366. [CrossRef] [Google Scholar]
 Benson B, Pan WD, Prasad A, Gary GA, Hu Q. 2020. Forecasting solar cycle 25 using deep neural networks. Sol Phys 295(5): 65. https://doi.org/10.1007/s1120702001634y. [CrossRef] [Google Scholar]
 Bruno A, Richardson IG. 2021. Empirical model of 10–130 MeV solar energetic particle spectra at 1 AU based on coronal mass ejection speed and direction. Sol Phys 296(2): 36. https://doi.org/10.1007/s11207021017794. [CrossRef] [Google Scholar]
 Cohen C, Mewaldt R. 2018. The groundlevel enhancement event of September 2017 and other large solar energetic particle events of cycle 24. Space Weather 16(10): 1616–1623. https://doi.org/10.1029/2018SW002006. [NASA ADS] [CrossRef] [Google Scholar]
 Dalla S, Swalwell B, Battarbee M, Marsh MS, Laitinen T, Proctor SJ. 2017. Application of test particle simulations to solar energetic particle forecasting. In: Proceedings of the International Astronomical Union, Volume 13, Symposium S335: Space Weather of the Heliosphere: Processes and Forecasts, Foullon C, Malandraki OE, (Eds.) Cambridge University Press, pp. 268–271. https://doi.org/10.1017/S1743921317011012. [Google Scholar]
 Debrunner H, Flückiger E, Gradel H, Lockwood J, McGuire R. 1988. Observations related to the acceleration, injection, and interplanetary propagation of energetic protons during the solar cosmic ray event on February 16, 1984. J Geophys Res Space Phys 93(A7): 7206–7216. https://doi.org/10.1029/JA093iA07p07206. [CrossRef] [Google Scholar]
 Desai M, Giacalone J. 2016. Large gradual solar energetic particle events. Living Rev Sol Phys 13(1): 3. https://doi.org/10.1007/s4111601600025. [CrossRef] [Google Scholar]
 Dierckxsens M, Tziotziou K, Dalla S, Patsou I, Marsh M, Crosby N, Malandraki O, Tsiropoula G. 2015. Relationship between solar energetic particles and properties of flares and CMEs: statistical analysis of solar cycle 23 events. Sol Phys 290(3): 841–874. https://doi.org/10.1007/S1120701406414. [CrossRef] [Google Scholar]
 Engell A, Falconer D, Schuh M, Loomis J, Bissett D. 2017. SPRINTS: a framework for solardriven event forecasting and research. Space Weather 15(10): 1321–1346. https://doi.org/10.1002/2017SW001660. [CrossRef] [Google Scholar]
 Feynman J, Armstrong T, DaoGibner L, Silverman S. 1990. Solar proton events during solar cycles 19, 20, and 21. Sol Phys 126: 385–401. https://doi.org/10.1007/BF00153058. [CrossRef] [Google Scholar]
 Gabriel S, Evans R, Feynman J. 1990. Periodicities in the occurrence rate of solar proton events. Sol Phys 128: 415–422. https://doi.org/10.1007/BF00838476. [CrossRef] [Google Scholar]
 Goodfellow I, Bengio Y, Courville A. 2016. Deep learning. MIT Press. ISBN 0262035618. [Google Scholar]
 Gopalswamy N, Mäkelä P, Yashiro S, Thakur N, Akiyama S, Xie H. 2017. A hierarchical relationship between the fluence spectra and CME kinematics in large solar energetic particle events: A radio perspective. J Phys Conf Ser 900: 012009. https://doi.org/10.1088/17426596/900/1/012009. [CrossRef] [Google Scholar]
 Graves A, Jaitly N. 2014. Towards endtoend speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning, vol. 32 of Proceedings of Machine Learning Research, Xing EP, Jebara T (Eds.) PMLR, Bejing, China. pp. 1764–1772. https://proceedings.mlr.press/v32/graves14.html. [Google Scholar]
 Graves A, Schmidhuber J. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw 18(5–6): 602–610. https://doi.org/10.1016/j.neunet.2005.06.042. [CrossRef] [Google Scholar]
 Greff K, Srivastava RK, Koutnik J, Steunebrink BR, Schmidhuber J. 2016. LSTM: A search space odyssey. IEEE Trans Neural Netw Learn Syst 28(10): 2222–2232. https://doi.org/10.1109/TNNLS.2016.2582924. [Google Scholar]
 Hochreiter S, Schmidhuber J. 1997. Long shortterm memory. Neural Comput 9(8): 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735. [Google Scholar]
 Hu J, Li G, Ao X, Zank GP, Verkhoglyadova O. 2017. Modeling particle acceleration and transport at a 2D CMEdriven shock. J Geophys Res Space Phys 122(11): 10–938. https://doi.org/10.1002/2017JA024077. [Google Scholar]
 Huang X, Tan H, Lin G, Tian Y. 2018. A LSTMbased bidirectional translation model for optimizing rare words and terminologies. In: 2018 International Conference on Artificial Intelligence and Big Data (ICAIBD), Chengdu, China, 26–28 May. IEEE, pp. 185–189. https://doi.org/10.1109/ICAIBD.2018.8396191. [Google Scholar]
 Ihianle IK, Nwajana AO, Ebenuwa SH, Otuka RI, Owa K, Orisatoki MO. 2020. A deep learning approach for human activities recognition from multimodal sensing devices. IEEE Access 8: 179028–179038. https://doi.org/10.1109/ACCESS.2020.3027979. [CrossRef] [Google Scholar]
 Kahler S, Cliver E, Cane H, McGuire R, Reames D, Sheeley N Jr, Howard R. 1987. Solar energetic proton events and coronal mass ejections near solar minimum. Proceedings of the 20th International Cosmic Ray Conference Moscow 3: 121–123. [Google Scholar]
 Kahler S, Kazachenko M, Lynch B, Welsch B. 2017. Flare magnetic reconnection fluxes as possible signatures of flare contributions to gradual SEP events. J Phys Conf Ser 900: 012011. https://doi.org/10.1088/17426596/900/1/012011. [CrossRef] [Google Scholar]
 Kahler S, Sheeley N Jr, Howard R, Koomen M, Michels D, McGuire RE, von Rosenvinge T, Reames DV. 1984. Associations between coronal mass ejections and solar energetic proton events. J Geophys Res Space Phys 89(A11): 9683–9693. https://doi.org/10.1029/JA089iA11p09683. [CrossRef] [Google Scholar]
 Kane R. 2011. Solar activity during sunspot minimum. Indian J Radio Space Phys 40: 7–10. [Google Scholar]
 Kasapis S, Zhao L, Chen Y, Wang X, Bobra M, Gombosi T. 2022. Interpretable machine learning to forecast SEP events for solar cycle 23. Space Weather 20(2): e2021SW002842. https://doi.org/10.1029/2021SW002842. [CrossRef] [Google Scholar]
 Ketkar N. 2017. Introduction to keras. In: Deep learning with python: a handson introduction. Apress, Berkeley, CA, pp. 97–110. https://doi.org/10.1007/9781484227664_7. [CrossRef] [Google Scholar]
 Kingma DP, Ba J. 2015. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), San Diego. https://doi.org/10.48550/arXiv.1412.6980. [Google Scholar]
 Klein KL, Dalla S. 2017. Acceleration and propagation of solar energetic particles. Space Sci Rev 212: 1107–1136. https://doi.org/10.1007/s1121401703824. [CrossRef] [Google Scholar]
 Kolen JF, Kremer SC. 2001. Gradient flow in recurrent nets: the difficulty of learning LongTerm Dependencies. In: In: A field guide to dynamical recurrent networks, IEEE, pp. 237–243. [Google Scholar]
 Kóta J, Manchester W, Jokipii J, De Zeeuw D, Gombosi T. 2005. Simulation of SEP acceleration and transport at CMEdriven shocks. AIP Conf Proc 781: 201–206. https://doi.org/10.1063/1.2032697. [CrossRef] [Google Scholar]
 Kozarev K, Nedal M, Miteva R, Dechev M, Zucca P. 2022. A multievent study of earlystage sep acceleration by CMEdriven shocks – Sun to 1 AU. Front Astron Space Sci 9: 801429. https://doi.org/10.3389/fspas.2022.801429. [CrossRef] [Google Scholar]
 Kozarev KA, Davey A, Kendrick A, Hammer M, Keith C. 2017. The Coronal Analysis of SHocks and Waves (CASHeW) framework. J Space Weather Space Clim 7: A32. https://doi.org/10.1051/swsc/2017028. [CrossRef] [EDP Sciences] [Google Scholar]
 Kozarev KA, Dayeh MA, Farahat A. 2019. Earlystage solar energetic particle acceleration by coronal mass ejectiondriven shocks with realistic seed spectra. I. Low corona. Astrophys J 871(1): 65. https://doi.org/10.3847/15384357/aaf1ce. [CrossRef] [Google Scholar]
 Lavasa E, Giannopoulos G, Papaioannou A, Anastasiadis A, Daglis I, Aran A, Pacheco D, Sanahuja B. 2021. Assessing the predictability of solar energetic particles with the use of machine learning techniques. Sol Phys 296(7): 107. https://doi.org/10.1007/s1120702101837x. [CrossRef] [Google Scholar]
 Le GM, Zhang XF. 2017. Dependence of large SEP events with different energies on the associated flares and CMEs. Res Astron Astrophys 17(12): 123. https://doi.org/10.1088/16744527/17/12/123. [CrossRef] [Google Scholar]
 Li G, Shalchi A, Ao X, Zank G, Verkhoglyadova O. 2012. Particle acceleration and transport at an oblique CMEdriven shock. Adv Space Res 49(6): 1067–1075. https://doi.org/10.1016/j.asr.2011.12.027. [CrossRef] [Google Scholar]
 Li G, Zank G, Rice W. 2003. Energetic particle acceleration and transport at coronal mass ejection driven shocks. J Geophys Res Space Phys 108(A2): 1369. https://doi.org/10.1029/2002JA009666. [CrossRef] [Google Scholar]
 Lin R. 2005. Relationship of solar flare accelerated particles to solar energetic particles (SEPs) observed in the interplanetary medium. Adv Space Res 35(10): 1857–1863. https://doi.org/10.1016/j.asr.2005.02.087. [CrossRef] [Google Scholar]
 Lin R. 2011. Energy release and particle acceleration in flares: summary and future prospects. Space Sci Rev 159: 421–445. https://doi.org/10.1007/s1121401198010. [CrossRef] [Google Scholar]
 Malandraki OE, Crosby NB. 2018. Solar particle radiation storms forecasting and analysis: The HESPERIA HORIZON 2020 project and beyond. Springer Nature. ISBN 9783319600512. [Google Scholar]
 Miteva R, Klein KL, Malandraki O, Dorrian G. 2013. Solar energetic particle events in the 23rd solar cycle: interplanetary magnetic field configuration and statistical relationship with flares and CMEs. Sol Phys 282(2): 579–613. https://doi.org/10.1007/S1120701201952. [CrossRef] [Google Scholar]
 Nammous MK, Saeed K, Kobojek P. 2022. Using a small amount of textindependent speech data for a BiLSTM largescale speaker identification approach. J King Saud Univ Comput Inf Scie 34(3): 764–770. https://doi.org/10.1016/J.JKSUCI.2020.03.011. [Google Scholar]
 Nedal M, Mahrous A, Youssef M. 2019. Predicting the arrival time of CME associated with typeII radio burst using neural networks technique. Astrophys Space Sci 364(9): 161. https://doi.org/10.1007/s1050901936518. [CrossRef] [Google Scholar]
 Ng CK, Reames DV, Tylka AJ. 2012. Solar energetic particles: shock acceleration and transport through selfamplified waves. AIP Conf Proc 1436: 212–218. https://doi.org/10.1063/1.4723610. [Google Scholar]
 Nymmik R. 2007. To the problem on the regularities of solar energetic particle events occurrence. Adv Space Res 40(3): 321–325. https://doi.org/10.1016/J.ASR.2007.02.013. [CrossRef] [Google Scholar]
 Núñez M. 2011. Predicting solar energetic proton events (E >10 MeV). Space Weather 9(7): S07003. https://doi.org/10.1029/2010SW000640. [Google Scholar]
 Olah C. 2015. Neural networks, types, and functional programming [Blog post]. Available at http://colah.github.io/posts/201509NNTypesFP/. [Google Scholar]
 Pala Z, Atici R. 2019. Forecasting sunspot time series using deep learning methods. Sol Phys 294(5): 50. https://doi.org/10.1007/s1120701914346. [CrossRef] [Google Scholar]
 Papaioannou A, Anastasiadis A, Kouloumvakos A, Paassilta M, Vainio R, Valtonen E, Belov A, Eroshenko M Abunina, Abunin A. 2018. Nowcasting solar energetic particle events using principal component analysis. Sol Phys 293(7): 1–23. https://doi.org/10.1007/S1120701813207. [CrossRef] [Google Scholar]
 Papaioannou A, Vainio R, Raukunen O, Jiggens P, Aran A, Dierckxsens M, Mallios SA, Paassilta M, Anastasiadis A. 2022. The probabilistic solar particle event forecasting (PROSPER) model. J Space Weather Space Clim 12: 24. https://doi.org/10.1051/swsc/2022019. [CrossRef] [EDP Sciences] [Google Scholar]
 Ramstad R, Holmström M, Futaana Y, Lee CO, Rahmati A, Dunn P, Lillis RJ, Larson D. 2018. The September 2017 SEP event in context with the current solar cycle: Mars Express ASPERA3/IMA and MAVEN/SEP observations. Geophys Res Lett 45(15): 7306–7311. https://doi.org/10.1029/2018GL077842. [CrossRef] [Google Scholar]
 Reames DV. 2000. Particle acceleration by CMEdriven shock waves. AIP Conf Proc 516: 289–300. https://doi.org/10.1063/1.1291483. [CrossRef] [Google Scholar]
 Reames DV. 2013. The two sources of solar energetic particles. Space Sci Rev 175: 53–92. https://doi.org/10.1007/s1121401399589. [CrossRef] [Google Scholar]
 Richardson I, von Rosenvinge T, Cane H. 2016. North/south hemispheric periodicities in the >25 MeV solar proton event rate during the rising and peak phases of solar cycle 24. Sol Phys 291(7): 2117–2134. https://doi.org/10.1007/S1120701609484. [CrossRef] [Google Scholar]
 Ripley BD. 1996. Pattern recognition and neural networks. Cambridge University Press. ISBN 9780511812651. [CrossRef] [Google Scholar]
 Rodriguez J, Onsager T, Mazur J. 2010. The eastwest effect in solar proton flux measurements in geostationary orbit: a new GOES capability. Geophys Res Lett 37(7): L07109. https://doi.org/10.1029/2010GL042531. [Google Scholar]
 Schuster M, Paliwal KK. 1997. Bidirectional recurrent neural networks. IEEE Trans Signal Process 45(11): 2673–2681. https://doi.org/10.1109/78.650093. [CrossRef] [Google Scholar]
 Singh P, Manure A. 2020. Introduction to tensorflow 2.0. In: Learn TensorFlow 2.0: Implement Machine Learning and Deep Learning Models with Python. Apress, Berkeley, CA, pp. 1–24. https://doi.org/10.1007/9781484255582_1. [Google Scholar]
 Sundermeyer M, Alkhouli T, Wuebker J, Ney H. 2014. Translation modeling with bidirectional recurrent neural networks. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Moschitti A, Pang B, Daelemans W (Eds.), Association for Computational Linguistics, Doha, Qatar. pp. 14–25. https://doi.org/10.3115/V1/D141003. [CrossRef] [Google Scholar]
 Švestka Z. 1995. A comparison of solar activity during the decline of several solar cycles. Adv Space Res 16(9): 27–36. https://doi.org/10.1016/02731177(95)003112. [CrossRef] [Google Scholar]
 Trottet G, Samwel S, Klein KL, Dudok de Wit T, Miteva R. 2015. Statistical evidence for contributions of flares and coronal mass ejections to major solar energetic particle events. Sol Phys 290(3): 819–839. https://doi.org/10.1007/S1120701406281. [CrossRef] [Google Scholar]
 Truscott P, Lei F, Dyer C, Ferguson C, Gurriaran R, et al. 2000. Geant4 – a new Monte Carlo toolkit for simulating space radiation shielding and effects. In: 2000 IEEE Radiation Effects Data Workshop. Workshop Record. Held in conjunction with IEEE Nuclear and Space Radiation Effects Conference (Cat. No. 00TH8527), IEEE, pp. 147–152. https://doi.org/10.1109/REDW.2000.896281. [CrossRef] [Google Scholar]
 Whitman K, Egeland R, Richardson IG, Allison C, Quinn P, et al. 2022. Review of solar energetic particle models. Adv Space Res. https://doi.org/10.1016/j.asr.2022.08.006. [Google Scholar]
 Wilson JW, Townsend LW, Chun SY, Buck WW, Khan F, Cucinotta F. 1988. BRYNTRN: a baryon transport computer code, computation procedures and data base. Technical Report. Available at https://ntrs.nasa.gov/api/citations/19880014330/downloads/19880014330.pdf. [Google Scholar]
 Wöllmer M, Zhang Z, Weninger F, Schuller B, Rigoll G. 2013. Feature enhancement by bidirectional LSTM networks for conversational speech recognition in highly nonstationary noise. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, pp. 6822–6826. https://doi.org/10.1109/ICASSP.2013.6638983. [CrossRef] [Google Scholar]
 Xapsos MA, Stauffer CA, Jordan TM, Adams JH, Dietrich WF. 2012. Periods of high intensity solar proton flux. IEEE Trans Nucl Sci 59(4): 1054–1059. https://doi.org/10.1109/TNS.2012.2196447. [CrossRef] [Google Scholar]
 Zhang W, Zhao X, Feng X, Liu C, Xiang N, Li Z, Lu W. 2022. Predicting the daily 10.7cm solar radio flux using the long shortterm memory method. Universe 8(1): 30–111. https://doi.org/10.3390/universe8010030. [CrossRef] [Google Scholar]
 Zhu H, Zhu W, He M. 2022. Solar cycle 25 prediction using an optimized long shortterm memory mode with F10.7. Sol Phys 297(12): 157. https://doi.org/10.1007/s11207022020915. [CrossRef] [Google Scholar]
Cite this article as: Nedal M, Kozarev K, Arsenov N & Zhang P 2023. Forecasting solar energetic proton integral fluxes with bidirectional long shortterm memory neural networks. J. Space Weather Space Clim. 13, 26. https://doi.org/10.1051/swsc/2023026.
All Tables
Summary of the performance results of the models for the validation and test sets.
Comparing the skill scores with previous models. The dashed entries mean the data is unavailable (see Whitman et al., 2022, for more details).
Configuration of the ML model. (1) refers to the error value for 1day forecasting. Same for (2) refers to 2day forecasting, and (3) for 3day forecasting. *In the 1DCNN layer, 32 filters, a kernel size of 5, and strides of 1 were used.
All Figures
Figure 1 Data splitting for all input features, showing the training, validation, and testing sets. Daily data from 19761225 00:00 to 20190730 00:00. The gray shading labels the solar cycles from SC21 to SC24. 

In the text 
Figure 2 Correlation matrices show the correlation between the features in the training, validation, and test sets. 

In the text 
Figure 3 Architecture of a single BiLSTM layer. The blue circles at the bottom labeled by (x_{0}, x_{1}, x_{2}, …, x_{i}) are the input data values at multiple time steps. The purple circles, on the other hand, are the output data values at multiple time steps labeled by (y_{0}, y_{1}, y_{2}, …, y_{i}). The dark green and light green boxes are the activation units of the forward layer and the backward layer, respectively. The orange and yellow circles are the hidden states at the forward layer and the backward layer, respectively. Both the forward and backward layers compose a single hidden BiLSTM layer. The figure is adapted from Olah (2015). 

In the text 
Figure 4 Illustration of the sliding window technique for a sample of 10 timesteps, where each number denotes a distinct time step. As an example here, the input horizon (blue color) length is 4 timesteps and the output horizon length is three timesteps. The input window slides one time step at a time across the entire data sequence to generate four distinct input and forecast horizon pairs. The purple, orange, and green colors of the output horizon represent 1day, 2day, and 3day ahead forecasting, respectively. The timesteps of 1day ahead forecasting across the data sequences are then concatenated into a single timeseries list that is called 1day ahead prediction. The same for 2day and 3day ahead. 

In the text 
Figure 5 Benchmarking of 10 models, shows the Huber loss for the validation and test sets. 

In the text 
Figure 6 Left Panel – The Huber loss vs. the number of training epochs for the BiLSTM model for the validation and test sets, for the 3 energy channels. Middle Panel – The mean absolute error (MAE); the model’s metric vs. the number of training epochs. Right Panel – Shows how the learning rate of the Adam optimizer changes over the number of epochs. 

In the text 
Figure 7 Correlation between the model predictions and observations for 1day, 2day, and 3day ahead for >10 MeV (top panel), >30 MeV (middle panel), and >60 MeV (bottom panel). The panels in the left column represent all the points of the validation set and those in the right column represent all the observation points with daily mean flux ≥10 pfu. 

In the text 
Figure 8 Same as Figure 7 but for the test set. 

In the text 
Figure 9 Comparison between the model outputs and observations of the test set for the 3 energy channels. In addition to the rollingmean window correlation for 1day ahead predictions. 

In the text 
Figure 10 The model’s forecasts for the outofsample testing set for the >10 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 

In the text 
Figure 11 The model’s forecasts for the outofsample testing set for the >30 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 

In the text 
Figure 12 The model’s forecasts for the outofsample testing set for the >60 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the topleft side of the plots. 

In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.