Forecasting Solar Energetic Proton Integral Fluxes with Bi-Directional Long Short-Term Memory Neural Networks

Solar energetic particles are mainly protons and originate from the Sun during solar flares or coronal shock waves. Forecasting the Solar Energetic Protons (SEP) flux is critical for several operational sectors, such as communication and navigation systems, space exploration missions, and aviation flights, as the hazardous radiation may endanger astronauts', aviation crew and passengers' health, the delicate electronic components of satellites, space stations, and ground power stations. Therefore, the prediction of the SEP flux is of high importance to our lives and may help mitigate the negative impacts of one of the serious space weather transient phenomena on the near-Earth space environment. Numerous SEP prediction models are being developed with a variety of approaches, such as empirical models, probabilistic models, physics-based models, and AI-based models. In this work, we use the bi-directional long short-term memory (BiLSTM) neural network model architecture to train SEP forecasting models for 3 standard integral GOES channels (>10 MeV,>30 MeV, and>60 MeV) with 3 forecast windows (1-day, 2-day, and 3-day ahead) based on daily data obtained from the OMNIWeb database from 1976 to 2019. As the SEP variability is modulated by the solar cycle, we select input parameters that capture the short-term, typically within a span of a few hours, and long-term, typically spanning several days, fluctuations in solar activity. We take the F10.7 index, the sunspot number, the time series of logarithm of the x-ray flux, the solar wind speed, and the average strength of the interplanetary magnetic field as input parameters to our model. The results are validated with an out-of-sample testing set and benchmarked with other types of models.


Introduction
Solar Energetic Protons (SEP) are high-energy particles that are believed to be originated from the acceleration of particles in the solar corona during coronal mass ejections (CMEs) and solar flares (Aschwanden, 2002;Klein and Dalla, 2017;Lin, 2005Lin, , 2011;;Kahler et al., 2017).They are typically characterized by their high energy levels -with some particles having energies in the relativistic GeV/nucleon range -and their ability to penetrate through spacecraft shielding, causing radiation damage (Reames, 2013;Desai and Giacalone, 2016).The fluence and energy spectrum of SEP are influenced by several factors, including the strength of the solar flare or CME that produced them, and the conditions of the interplanetary environment (Kahler et al., 1984(Kahler et al., , 1987;;Debrunner et al., 1988;Miteva et al., 2013;Trottet et al., 2015;Dierckxsens et al., 2015;Le and Zhang, 2017;Gopalswamy et al., 2017).SEP exhibit a strong association with the solar cycle, with the frequency and flux of SEP events peaking during the maximum phase of the solar cycle (Reames, 2013).This is thought to be due to the increased activity of the Sun during this phase, which leads to more frequent and powerful flares and CMEs.Previous studies have shown a relationship between the occurrence frequency of SEP and the sunspot number (SN; Nymmik, 2007;Richardson et al., 2016).However, the exact relationship between the solar cycle and SEP is complex and not fully understood.Hence, more work is needed to better understand this connection, as previous studies have reported intense SEP events during relatively weak solar activity (Cohen and Mewaldt, 2018;Ramstad et al., 2018).
SEP have been a subject of interest and research in heliophysics for decades.It is hypothesized that shock waves generated in the corona can lead to an early acceleration of particles.However, SEP have sufficient energy to propagate themselves by surfing the interplanetary magnetic fields (IMF), and therefore, the expanding CME is not necessary for their transport (Reames, 2000;Kóta et al., 2005;Kozarev et al., 2019Kozarev et al., , 2022)).While this theory has gained acceptance, there is an ongoing debate among scientists over the specific mechanisms and conditions responsible for SEP production and acceleration.The creation, acceleration, and transport mechanisms of SEP are complex and involve a combination of magnetic reconnection, shock acceleration, and wave-particle interactions (Li et al., 2003(Li et al., , 2012;;Ng et al., 2012).The specific mechanisms responsible for SEP production and acceleration can vary depending on the type and strength of the solar event that triggered them.Further research is imperative to better understand the processes involved in the production and transport of SEP in the heliosphere.This will facilitate the development of more precise models that assist in minimizing the impact of SEP on astronauts and space-based assets.
Several models are available, or under development, for forecasting SEP, which use diverse approaches and serve different objectives.These models comprise computationally complex physicsbased models, quick and simple empirical models, Machine Learning (ML)-based models, and hybrid models that combine different approaches and produce different types of outputs, including deterministic, probabilistic, categorical, and binary.Deterministic models always generate the same output without any randomness or stochastic components, such as predicting the SEP flux at a specific moment or the arrival time of SEP.On the other hand, probabilistic models provide a probability value that reflects the likelihood of an SEP event occurring.However, replicating SEP fluxes at a specific time is still a significant challenge for current models.
An excellent review on SEP models and predictive efforts was recently published by Whitman et al. (2022), which summarizes the majority of the existing models.For instance, Papaioannou et al. (2022) introduced the Probabilistic Solar Particle Event Forecasting (PROSPER) model, which is incorporated into the Advanced Solar Particle Event Casting System (ASPECS) 1 .The PROSPER model utilizes a Bayesian approach and data-driven methodology to probabilistically predict SEP events for 3 integral energy channels >10, >30, and >100 MeV.The model's validation results indicate that the solar flare and CME modules have hit rates of 90% and 100%, respectively, while the combined flare and CME module has a hit rate of 100%.Bruno and Richardson (2021) developed an empirical model to predict the peak intensity and spectra of SEP at 1 AU between 10 and 130 MeV, using data from multiple spacecraft.The model is tested on 20 SEP events and shows good agreement with observed values.The spatial distribution of SEP intensities was reconstructed successfully, and they found a correlation between SEP intensities and CME speed.Hu et al. (2017) extended the Particle Acceleration and Transport in the Heliosphere (PATH) model to study particle acceleration and transport at CME-driven shocks.They showed that the model can be used to obtain simultaneous calculations of SEP characteristics such as time-intensity profiles, instantaneous particle spectra, and particle pitch angle distributions at multiple heliospheric locations.
Overall, results resemble closely those observed in situ near the Earth but also yield results at other places of interest, such as Mars, making it of particular interest to Mars missions.SPREAdFAST (Kozarev et al., 2017(Kozarev et al., , 2022) is a physics-based, data-driven framework that utilizes EUV observations and models to simulate SEP fluxes at 1 AU and to estimate energetic particle acceleration and transport to various locations in the inner heliosphere.It generates time-dependent histograms and movies distributing them through an online catalog.The accuracy and efficiency of the model were encouraging, but the highest energy fluxes showed disagreement with in situ observations by the SOHO/ERNE instrument.However, the framework has great potential for space weather science and forecasting.
In Aminalragia-Giamini et al. (2021), they used neural networks to provide probabilities for the occurrence of SEP based on soft X-rays data from 1988 to 2013.They obtained >85% for correct SEP occurrence predictions and >92% for correct no-SEP predictions.Lavasa et al. (2021) described a consistent approach to making a binary prediction of SEP events using ML and conventional statistical techniques.The study evaluated various ML models and concluded that random forests could be the best approach for an optimal sample comprising both flares and CMEs.The most important features for identifying SEP were found to be the CME speed, width, and flare soft X-ray fluence.Kasapis et al. (2022) employed ML techniques to anticipate the occurrence of a SEP event in an active region that generates flares.They utilized the Space-Weather MDI Active Region Patches (SMARP) dataset, which comprises observations of solar magnetograms between June 1996 and August 2010.The SMARP dataset had a success rate of 72% in accurately predicting whether an active region that produces a flare would result in a SEP event.Moreover, it provided a competitive lead time of 55.3 min in forecasting SEP events.Engell et al. (2017) introduced the Space Radiation Intelligence System (SPRINTS), a technology that uses pre-and post-event data to forecast solar-driven events such as SEP.It integrates automatic detections and ML to produce forecasts.Results show that SPRINTS can predict SEP with an 56% probability of detection and 34% false alarm rate.Nevertheless, the HESPERIA REleASE tools provide real-time predictions of the proton flux at L1 by using near-relativistic electrons as a warning for the later arrival of protons and have been set to operation (Malandraki and Crosby, 2018).Historical data analysis indicates high prediction accuracy, with a low false alarm rate of approximately 30% and a high probability of detection of 63% (Malandraki and Crosby, 2018).
Forecasting SEP is a critical task that serves operational needs and provides insight into the broader field of space weather science and heliophysics.As emphasized in previous works, a high precision forecasting model is urgently required to predict SEP flux within a period of time, given the risks associated with these events.This highlights the critical requirement for a dependable forecasting system that can mitigate the risks associated with SEP.
Scientists have been using physics-based and empirical models for decades to forecast SEP.
However, these models have certain limitations.Physics-based models require accurate input data and underlying physical assumptions.In addition, the complexity of the physics involved and incorrect parameters may introduce uncertainties that can lead to inaccurate predictions.On the other hand, empirical models rely on historical data to make predictions.While they can be accurate sometimes, they may be unable to account for changes in physical conditions related to the acceleration and propagation of SEP, which can influence prediction accuracy.ML models, however, provide a different approach to SEP forecasting.These models can analyze vast amounts of data, learning patterns from the data that are used, and connections that may not be obvious to experts.
Additionally, ML models can adapt to changes in underlying physical conditions, resulting in more accurate predictions as more data is collected; they also provide relatively rapid forecasts, which allows for incorporation into a real-time forecasting workflow.
In the upcoming sections, we will explore the limitations in accuracy that arise from dealing with an imbalanced dataset and low-resolution data.Specifically, the presence of intrinsic outliers in the time series data pertaining to SEP flux poses a significant challenge in modeling.These outliers correspond to occurrences of SEP events and, consequently, have an impact on the accuracy of predictions.Notably, they often lead to an underestimation of the SEP fluxes, primarily due to the predominance of relatively low values throughout the majority of the time interval.
In this paper, we present advanced deep learning models to forecast the daily integral flux of SEP over a 3-day forecasting window by using bi-directional long short-term memory (BiLSTM) neural networks, for 3 energy channels (>10, >30, and >60 MeV).Our models can forecast the timedependent development of SEP events in different energy domains, which can be used to model the space radiation profiles using frameworks such as BRYNTRN Wilson et al. (1988) and GEANT4 (Truscott et al., 2000).We describe the data selection and pre-processing in Section 2. We present an overview on the analysis methods and the models implemented in Section 3. Then we show the forecasting results in Section 4. Finally the summary and implications are presented in Section 5.

Data Preparation
In this section, we describe the physical quantities, the types of inputs and their sources, as well as the outputs we are forecasting.Some of the technical terms used in this study are explained further in the appendices.
In order to capture the variability of solar activity, which modulates the SEP flux, we selected input physical quantities that describe both the interplanetary medium and solar activity.These input features can be categorized into two groups: remote signatures and in-situ measurements.
The remote signatures consist of the F10.7 index, as well as the long-wavelength (X L ) and shortwavelength (X S ) x-ray fluxes.The F10.7 index represents the flux of solar radio emission at a wavelength of 10.7 cm, measured in solar flux units (sfu).To obtain the x-ray fluxes, we utilized 1and 5-minute averaged data from the Geostationary Operational Environmental Satellite (GOES) database2 , specifically at long wavelengths (1 -8 Å) and short wavelengths (0.5 -4.0 Å).
The in-situ measurements encompass the near-Earth solar wind magnetic field and plasma parameters.These include the solar wind speed (in km s −1 ), average interplanetary magnetic field (IMF) strength (in nT), and the integral SEP fluxes at three energy channels: >10, >30, and >60 MeV, which correspond to the GOES channels (in 1/cm 2 sec ster).These SEP fluxes were obtained from multiple spacecraft stationed at the first Lagrange point (L1) throughout the study period.In particular, the IMF and plasma data in the OMNI database are obtained from the IMP, Wind, and ACE missions, while the energetic particle fluxes are obtained from the IMP and GOES spacecraft3 .
To ensure a comprehensive dataset, we acquired hourly-averaged data covering a timeframe from December 1976 to July 2019, which spans the past four solar cycles.These data were sourced from the Space Physics Data Facility (SPDF) OMNIWeb database 4 , hosted by the Goddard Space Flight Center.This database provides a wealth of information, including integral proton fluxes, as well as an extensive range of solar wind plasma and magnetic field parameters.Lastly, the daily data on sunspot numbers were obtained from the Sunspot Index and Long-term Solar Observations (SILSO) archive5 , maintained by the World Data Center.
Figure 1 shows a plot for the timeseries data of all features.The top 3 panels are the logarithms of the SEP integral flux at the 3 energy channels (log PF10, log PF30, and log PF60), then the sunspot number, the F10.7 index (F10 idx), the logarithms of the x-ray fluxes (log Xs and log Xl), the solar wind speed (Vsw), and the average magnitude of the IMF (avg IMF).Throughout this paper, we adopt the convention that "log" refers to the common logarithm with a base of 10.The gray shades refer to the timespan of solar cycles.The blue, orange, and gold colors refer to the training, validation, and test sets, respectively.The data split method will be explained shortly.
Since the input SEP data have been compiled from various spacecraft, it may have artifacts even after processing.In particular, there are occasional jumps in the background level.There are also several day-long gaps in the OMNI solar wind parameters from the early 1980s to mid-1990s where only IMP 8 data are available and this spacecraft spent part of each orbit in the magnetosphere.We are reasonably confident that these issues do not influence the overall analysis significantly.
In deep learning applications, the dataset is split into 3 sets; namely the training set, the validation set, and the test set.The training set is usually the largest chunk of data that is used to fit the model.
The validation set is a smaller chunk of data used to fine-tune the model and evaluate its accuracy to ensure it is unbiased.The test set is the out-of-sample data exclusively used to assess the final model when performing on unseen data (Ripley, 1996).
After inspecting the correlation between the solar wind indices and the SEP integral fluxes in the OMNIWeb database, we chose the top-correlated features with the SEP flux.The correlations were made between the SEP fluxes and the individual parameters.Hence we took only timeseries of logarithms of the protons' integral flux at 3 energy channels (>10, >30, and >60 MeV), the timeseries of logarithm of the X-ray fluxes, the F10.7 index, the sunspot number, the solar wind speed, and the average strength of the IMF as input parameters to our model.The log of the SEP flux was used across the whole study.The correlation matrices for the training, validation, and test sets are shown in Figure 2. The X-ray and proton fluxes were converted into the logarithmic form because it was more convenient than the original form of data since the time series data were mostly quiet and had numerous sharp spikes, which correspond to solar events.Based on a previous experience with NNs (Nedal et al., 2019), we found that training separate models for each target (output) feature can lead to better results.This is because a dedicated model for each output feature can more easily learn the interrelationships between input features and make more accurate predictions.Therefore, in our current study, we trained 3 separate models, each one targeting the logarithm of the protons integral flux at a specific energy channel.
In order to ensure consistency across all features, all durations of the time series data of the physical quantities were matched to be within the same time range.Subsequently, the dataset was resampled to obtain daily averaged data, resulting in a significant reduction of the dataset size by a factor of 24.This reduction facilitated expeditious training and yielded prompt results.
In timeseries forecasting, it is a common practise to take a continuous set of data points from the main dataset to be the validation set and another smaller chunk of data to be the test set, for instance in Pala and Atici (2019); Benson et al. (2020); Zhang et al. (2022); Zhu et al. (2022).From our experiments, we got descent results when we applied the same data split method, but the results were a bit biased toward the end of the solar cycle 24 and the testing set was biased towards a quiet period.So, we adopted the 9-2-1 strategy, that is taking from each year 9 months to be added in the training set, 2 months to be added in the validation set, and 1 month to be added in the test set.This is applied over the ∼43 years of data (Fig. 1), which yields 74.29% of the data for the training set, 16.2% for the validation set, and 9.51% for the testing set.By doing so, we eliminated the need to do cross-validation and hence, made the training more efficient.It is worth to mention that the timeseries data must not be shuffled as that will break temporal and logical order of measurements, which must be maintained.

Methods
In this section, we introduce the data analysis methods used in this work.We start with explaining the model selection phase, followed by a discussion of the bidirectional long-short-term memory (BiLSTM) neural network architecture.The technical terminologies are described in the appendices.

The Bi-LSTM Model
Recurrent neural networks (RNNs) that support processing input sequences both forward and backward are known as Bidirectional Long Short-Term Memory (BiLSTM) neural networks (Schuster and Paliwal, 1997).Regular RNNs (Hochreiter and Schmidhuber, 1997;Kolen and Kremer, 2001) depend on the prior hidden state and the current input to determine the output at a given time.The  using contexts from the past as well as the future.Hence, accuracy is improving.Each BiLSTM layer consists of two LSTM layers; a forward layer that processes the input sequences from the past to future, and a backward layer that processes the input sequences from the future to the past, as illustrated in Figure 3, to capture information from both past and future contexts.The output from each layer is concatenated and fed to the next layer, which can be another BiLSTM layer or a fully connected layer for final prediction.
BiLSTM networks are advantageous than traditional LSTM networks in a variety of aspects (Graves and Schmidhuber, 2005;Ihianle et al., 2020;Alharbi and Csala, 2021).First, as we demonstrate in this study, they are excellent for tasks like timeseries forecasting, as well as speech recognition and language translation (Wöllmer et al., 2013;Graves and Jaitly, 2014;Sundermeyer et al., 2014;Huang et al., 2018;Nammous et al., 2022) 3. Architecture of a single BiLSTM layer.The blue circles at the bottom labeled by (x 0 , x 1 , x 2 , ..., x i ) are the input data values at multiple time steps.The purple circles, on the other hand, are the output data values at multiple time steps labeled by (y 0 , y 1 , y 2 , ..., y i ).The dark green and light green boxes are the activation units of the forward layer and the backward layer, respectively.The orange and yellow circles are the hidden states at the forward layer and the backward layer, respectively.Both the forward and backward layers composes a single hidden BiLSTM layer.The figure is adopted from Olah (2015) overfitting, with a patience parameter of 7. ReduceLROnPlateau callback function was used to reduce the learning rate when the validation loss stops improving, with a patience parameter of 5, a reduction factor of 0.1 and minimal learning rate of 1e −6 .

Model Selection
To determine the most suitable model for our objective and provide justifiable reasons, we conducted the following analysis.First we examined the naive (persistence) model, which is very simplistic and assumes that the timeseries values will remain constant in the future.In other words, it assumes that the future value will be the same as the most recent historical value.That was the baseline.Next we examined the moving-average model, which calculates the future values based on the average value of historical data within a specific time widow.This gives a little bit lower error.
After that, we went towards the machine learning (ML)-based models.For all the ML models, we chose the Adaptive moment estimation (Adam) optimizer (Kingma and Ba, 2015) as the optimization algorithm due to its minimal memory requirements and high computational efficiency as it is well-suited for applications that involve large number of parameters or large datasets.As a rule of thumb, we set the optimizer's learning rate to be 0.001 as it is usually recommended (Zhang et al., 2022).
In order to prepare the data in a readable format to the ML models, we created a windowed dataset with an input horizon of 365 steps representing 1 year of data and an output horizon of 3 steps representing the forecast window of three days.We call this windowing method as Multi-Input Multiple Output (MIMO) strategy, in which the entire output sequence is predicted in one shot.The MIMO strategy adopts the sliding window method that was mentioned in Benson et al. (2020) in which each sequence is shifted by one step with respect to the previous sequence until reaching the Fig. 4. Illustration of the sliding window technique for a sample of 10 timesteps, where each number denotes a distinct time step.As an example here, the input horizon (blue color) length is 4 timesteps and the output horizon length is 3 timesteps.The input window slides 1 time step at a time across the entire data sequence to generate 4 distinct input and forecast horizon pairs.The purple, orange, and green colors of the output horizon represent 1-day, 2-day, and 3-day ahead forecasting, respectively.The timesteps of 1-day ahead forecasting across the data sequences are then concatenated into a single timeseries list that is called 1-day ahead prediction.The same for 2-day and 3-day ahead.
end of the available data (Fig. 4).This approach minimized the imbalance of active days, with high SEP fluxes, and quiet days.After experiments with different loss functions and evaluate their performance on our dataset, we chose the Huber function C.1a as the loss function and the Mean Absolute Error (MAE) is used as the metric function to monitor the model performance.We used the Huber function because it is robust and combines the advantages of both Mean Squared Error (MSE) and MAE loss functions.
It is less sensitive to outliers than MSE, while still being differentiable and providing gradients, unlike MAE.Since our data is noisy and contains outliers that may negatively impact the model's performance, the Huber loss function is a good choice.
We examined various neural network models to determine the optimal architecture for our task.
Initially, we started with a simple linear model comprising of a single layer with a single neuron.However, this model did not yield satisfactory results.We then explored a dense ML model consisting of two hidden layers, each with 32 neurons and a RelU activation function.Next, we experimented with a simple RNN model with the same number of hidden layers and neurons.To find the optimal learning rate, we utilized the LearningRateScheduler callback function and discovered that a rate of 1.58e −4 under the basic settings minimized the loss.We proceeded to examine stateful versions of RNN, LSTM, and BiLSTM models with three hidden layers, each with 32 neurons and a learning rate of 1.58e −4 .In addition, we explored a hybrid model that consisted of a 1-dimensional convolutional layer with 32 filters, a kernel size of 5, and a RelU activation function.We combined this with a two-hidden layer LSTM network with 32 neurons each and a learning rate of 1.58e −4 .We experimented with Dropout layers but did not observe any significant improvement in the results.
Finally, we evaluated a BiLSTM model with five hidden layers, 64 neurons each, and a learning rate of 0.001.Based on the evaluation of all the models on both the validation and test sets (Fig. 5 and Table D.1), we selected the BiLSTM model for further refinement.More details on the final model architecture and hyperparameters are explained in the Appendix D. Figure 5 presents a comparative analysis of the Huber loss within the validation and testing sets across the ten aforementioned models.We used several evaluation measures to assess our models since each metric provides valuable insights into the accuracy and performance of the forecasts (Appendix C), helping to identify areas for improvement and adjust the forecasting models accordingly.

Results and Discussion
The benchmarking in Figure 5 showed that, in general, the ML-based methods were not much different.On the other hand, the persistence model and moving average model resulted in the highest errors compared with the ML-based models, and their results were close to some extent.As we see, the BiLSTM model performed the best over both the validation and test sets compared with the other models.
We developed and trained 3 BiLSTM models to forecast the integral flux of SEP, one model per energy channels.After the training was completed, we evaluated the performance of the models from the loss curve (Fig. 6) using the Huber loss (the left panel) and the metric MAE From experimentation, we found that the batch size and the optimizer learning rate are the most important hyperparameters that have a strong influence on the overall model's performance (Greff et al., 2016).In addition, adding dropout layers as well as varying the number of hidden layers and hidden neurons resulted in only marginal improvements to the final model performance, while substantially increasing training time and requiring greater computational resources.
The term batch size refers to the number of data sequences processed in one iteration during the training of a ML model (Goodfellow et al., 2016).Initially, a batch size of 64 was selected, however, we observed that the model produced better results when a batch size of 30 was used instead.This could be related to the Carrington rotation, which lasts for ∼27 days.There were ∼570 Carrington rotations between December 25 th 1976 and July 30 th 2019.Therefore, updating the model's weights after every Carrington rotation could be a reasonable choice for improving its performance.Figure 7 shows how good the model predictions are (on the y-axis) compared with the observations of the validation set (on the x-axis).The blue, orange, and gold colors refer to 1-day, 2-day, and 3-day ahead predictions, respectively.The top panel is for the >10 MeV channel, the middle panel is for the >30 MeV channel, and the bottom panel is for the >60 MeV channel.
The left column is for the entire validation set, while the right column is for the observations points ≥10 proton flux units (pfu).That is the threshold value of proton flux as measured by the National Oceanic and Atmospheric Administration (NOAA) GOES spacecraft to indicate severity of space weather events caused by SEP.
We found that, overall, the models performed very well.The R correlation was >0.9 for all points of the validation set across the forecasting windows for the 3 energy channels.The R correlation was >0.7 for the observations points ≥10 pfu as well.However, the correlation between the modeled data and the observations exhibited a decline as the forecast horizon increased, in accordance with the anticipated result.To confirm the validity of the models, we performed the same correlation analysis between the modeled data and the observations of the out-of-sample test set (Fig. 8), which was not given to the model.Again, we found a high correlation across the forecasting windows for the 3 energy channels.The points were more dispersed between 1 and 1.5 on the x-axis, which reflected the seasonal variation of the correlation, as shown in Figure 9. Here, we show only the 1-day ahead predictions for the test set, for the 3 energy channels.We observe drops in the correlation factor synchronized with the transition between solar cycles (e.g., particularly between ∼1995 -2000, which represents the declining phase of the solar cycle 22 and the rising phase of the solar cycle 23).This could be related to the fact that the low SEP fluxes during quiet times are more random and thus more difficult to forecast (Feynman et al., 1990;Gabriel et al., 1990;Rodriguez et al., 2010;Xapsos et al., 2012).
During periods of low solar activity, the forecasting of low SEP fluxes becomes more challenging due to their increased randomness.This difficulty arises from the reduced occurrence of conventional SEP drivers, such as solar flares and CMEs.Studies have suggested that the most significant solar eruptions tend to happen shortly before or after the solar cycle reaches its maximum ( Švestka, 1995).Additionally, sporadic increases in solar activity have been observed (Kane, 2011), which might contribute to the diminished correlations observed in our research.There is clearly some factor that is influencing the correlation during certain periods where there are no or only weak SEP events.However, it is not obvious which physical phenomena are the cause rather than, for instance, some artifact of the data.Understanding the interplay between these factors and their influence on SEP fluxes during periods of reduced solar activity remains a critical area of research.It would be interesting to find what is reducing the correlations, thus more investigation is needed.
Overall, the modeled data was correlated the most with observations at >60 MeV, then the second rank was for the >10 MeV channel, and the third rank was for the >30 MeV channel.That could be related to the relatively larger extent of drops in correlation at the >30 MeV channel.The decline in correlation at the >30 MeV channel is consistent with the findings of Le and Zhang (2017).A summary of the performance results of the models for both the validation set and test set is presented in Table 1.
From the visual inspection of the test set examples (Fig. 10, 11, and 12), we found that the predicted onset time, the peak time, and end times of SEP events were highly correlated with those of the observations, which implies that the model captured the temporal variations, as well as the trends in SEP flux.(TSS), and Heidke Skill Score (HSS).Detailed descriptions of these skill scores can be found in Appendix E. To extract individual SEP events from the test dataset, we implemented a thresholdbased clustering algorithm.This algorithm uses the NOAA/SWPC warning threshold value of 10 pfu for the E ≥10 MeV channel.Upon analysis, we identified the number of detected SEP events for each output forecasting window and calculated the skill scores (Table 2).In the true test set, we identified 12 SEP events.
The evaluation of the model revealed notable trends as the length of the output forecasting window increased.The POD and CSI exhibited a declining pattern, indicating a reduced ability of the model to accurately detect and capture positive events (SEP occurrences) as the forecasting horizon extended further into the future.This suggests that the model's performance in identifying and capturing true positive instances diminishes with longer forecasting windows.Moreover, the POFD demonstrated an increasing trend, indicating an elevated rate of false positive predictions as the forecasting horizon lengthened.The model's propensity to generate false alarms rose with the lengthening forecasting window, leading to incorrect identification of non-events as positive events.Consequently, the TSS and HSS exhibited decreasing values, signifying a deterioration in the model's overall skill in accurately capturing and distinguishing between positive and neg-  ative instances.Overall, our skill scores are comparable with those reported by previous studies (Table 3).Although the UMASEP model does better than ours (i.e., has a higher POD), our FAR is much lower, thus, making fewer false alarms than the UMASEP model.BiLSTM neural network models to predict the daily-averaged integral flux of SEP at 1-day, 2-day, and 3-day ahead, for the energy channels >10 MeV, >30 MeV, and >60 MeV.We used a combination of solar and interplanetary magnetic field indices from the OMNIWeb database for the past 4 solar cycles as input to the model.We compared the models with baseline models and evaluated them using the Huber loss and the error metrics in Appendix C.
The data windowing method we used, based on the MIMO strategy, eliminates the need to feed the output forecast as input back into the model and that allows to do forecasting relatively far into the future while maintaining decent results (e.g., the MSE is ranging between 0.007 and 0.015 for 1-day forecasting in the test set, compared to an MSE of 0.236 for a persistence model.See Table 1).
The results show that the model can make reasonably accurate predictions given the difficulty and complexity of the problem.The MSE was ranged between 0.009 and 0.055 for the validation set, and between 0.007 and 0.05 for the test set.The correlations between the observations and predictions were >0.9 for the validation and test sets (Fig. 7 and Fig. 8).Nevertheless, the mean temporal correlation was ∼0.8 for the test set (Fig. 9).Although our models performed well, we observed a relatively large discrepancy between the predictions and the observations in the >30 MeV energy band.
The findings of this study underscore the challenges encountered by the forecasting model in accurately predicting SEP data over longer time periods.As the length of the output forecasting window increased, the model's ability to detect true positives and its overall skill in differentiating positive and negative instances diminished.Additionally, the model displayed an elevated rate of false negative predictions, indicating an increased tendency to generate misses as the forecasting horizon extended.These results highlight the importance of carefully considering the appropriate forecasting window length for SEP data to ensure the model's optimal performance.Our skill scores generally align with those from previous works (Table 3).There are variations in the metrics' values across different studies, highlighting the complexities and nuances associated with each study.
Nevertheless, it is important to acknowledge that the statistical significance of the results in this study is limited due to data averaging.Future studies should consider incorporating hourly data, as this is likely to result in a greater number of identified events.The model can provide shortterm predictions, which can be used to anticipate the behavior of the near-Earth space environment.
These predictions have important implications for space weather forecasting, which is essential for protecting satellites, spacecraft, and astronauts from the adverse effects of solar storms.
Multiple techniques exist for identifying the optimal combination of hidden layers and neurons for a given task such as empirical methods, parametric methods, and the grid search cross-validation method, which we will explore in future work.The observed reduction in correlation necessitates further investigation to determine its origin, whether stemming from tangible causal factors or potential aberrations within the model or data.We plan to expand upon this work by performing short-term forecasting using hourly-averaged data.This extension will involve integrating additional relevant features such as the location and area of active regions and coronal holes on the Sun.
BiLSTM networks are particularly useful for tasks involving sequential data such as timeseries forecasting.Given their capacity to handle input sequences in both directions in time and capture long-term dependencies, they are valuable in a broad range of applications.Nonetheless, one should carefully consider their data requirements and computational complexity before adopting them.
Our results emphasize that the use of deep learning models in forecasting tasks in heliophysics are promising and encouraging, as pointed out by Zhang et al. (2022).
This work is a stepping stone towards real-time forecasting of SEP flux based on the publicavailable datasets.As an extension, we are currently working on developing a set of models that deliver near-real time prediction of SEP fluxes at multiple energy bands, multiple forecasting windows, with hourly-averaged data resolution, with a more sophisticated model architecture, as well as more features that address the state of solar activity more comprehensively.We plan to extend the analysis to include more recent data from solar cycle 25, in order to improve the accuracy of the models.In conclusion, our study highlights the potential of using BiLSTM neural networks for forecasting SEP integral fluxes.Our models provide a promising approach for predicting the near-Earth space environment too, which is crucial for space weather forecasting and ensuring the safety of our space assets.Our findings contribute to the growing body of literature on the applications of deep learning techniques in heliophysics and space weather forecasting.
• Learning Rate: A hyperparameter that determines the step size at each iteration of the optimization algorithm during training.It affects learning speed and convergence.A high learning rate can cause the training process to converge quickly, but it may also result in overshooting the optimal solution or getting stuck in a suboptimal solution.On the other hand, a very low learning rate can make the training process slow, and may struggle to find the optimal solution.
• Reducing the learning rate when the validation loss stops improving: This concept involves adjusting the learning rate dynamically during the training process.When the validation loss reaches a plateau or stops improving, it indicates a suboptimal point.By reducing the learning rate, the model can take smaller steps in weight space, potentially finding a better solution.This technique, known as learning rate scheduling or learning rate decay, is commonly used to finetune the model's performance.
• Patience: A parameter used in training to determine the number of epochs to wait for an improvement in validation loss before stopping the training process.
• Patience Parameter of 7: In the context of early stopping, training will be stopped if the validation loss does not improve for 7 consecutive epochs.
• Adam Optimizer: A popular optimization algorithm in deep learning that combines Adaptive Gradient Algorithm (AdaGrad) and Root Mean Square Propagation (RMSprop) to achieve efficient optimization.
• Optimal Architecture: The best configuration of a neural network, including the number of layers, neurons, and other choices, for optimal performance on a specific task.
• Hyperparameters: Parameters set before training a model that control the learning algorithm's behavior, such as learning rate, batch size, and activation functions.
• Layer: A building block of a neural network that performs specific operations on input data.
Includes input, hidden, output, fully connected, convolutional, recurrent, activation, and dropout layers.Here is a description for each layer: -Input Layer: The first layer of a neural network that receives raw input data.It passes the input to subsequent layers for further processing.The number of nodes in the input layer is determined by the dimensionality of the input data.
-Hidden Layers: Intermediate layers between the input and output layers.They perform computations on the input data and capture higher-level representations or abstractions.Hidden layers are not directly exposed to the input or output.
-Output Layer: The final layer of a neural network that produces model predictions or outputs based on computations from preceding layers.The number of neurons in the output layer depends on the problem being solved, such as regression or classification.
-Fully Connected Layer (Dense Layer): Each neuron in this layer is connected to every neuron in the previous layer.It allows information flow between all neurons, enabling complex relationships to be learned.RNNs or manually reset the hidden state at the end of each epoch, but this can be complex and prone to errors.

Fig. 5 .
Fig. 5. Benchmarking of 10 models, shows the Huber loss for the validation and test sets.

(Fig. 6 .
Fig. 6.Left Panel -The Huber loss vs. the number of training epochs for the BiLSTM model for the validation and test sets, for the 3 energy channels.Middle Panel -The mean absolute error (MAE); the model's metric vs. the number of training epochs.Right Panel -Shows how the learning rate of the Adam optimizer changes over the number of epochs.

Fig. 7 .Fig. 8 .
Fig. 7. Correlation between the model predictions and observations for 1-day, 2-day, and 3-day ahead for >10 MeV (top panel), >30 MeV (middle panel), and >60 MeV (bottom panel).The panels in the left column represent all the points of the validation set, those in the right column represent all the observations points with daily mean flux ≥10 pfu.

Fig. 9 .
Fig. 9. Comparison between the model outputs and observations of the test set for the 3 energy channels.In addition to the rolling-mean window correlation for 1-day ahead predictions.

Fig. 10 .
Fig.10.The model's forecasts for the out-of-sample testing set for the >10 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the top-left side of the plots.

Fig. 11 .
Fig. 11.The model's forecasts for the out-of-sample testing set for the >30 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the top-left side of the plots.

Fig. 12 .
Fig.12.The model's forecasts for the out-of-sample testing set for the >60 MeV channel are shown at forecast horizons of 1 day, 2 days, and 3 days ahead, using samples of data from December in selected years mentioned in the top-left side of the plots.
are the weight matrices for the output vector h t−1 .b f , b i , b c , b o are the bias vectors.The symbol ⊙ denotes a pointwise multiplication.The sigmoid function σ is used as the activation function for the gate vectors, and the hyperbolic tangent function tanh is used for the candidate values and the output vector.

--
Probability of Detection (POD) or Recall: Represents the model's ability to correctly identify positive instances.Probability of False Detection (POFD): Measures the model's tendency to falsely predict positive instances when the ground truth indicates their absence.Rate (FAR): Indicates the ratio of false positive predictions to the total number of positive instances.FAR = FP FP + T P (E.5) -Critical Success Index (CSI): Measures the model's ability to correctly predict both positive and negative instances.Statistic (TSS): Takes into account both the model's ability to detect positive instances and its ability to avoid false alarms.T S S = POD − FAR (E.7) -Heidke Skill Score (HSS): Evaluates the model's performance by comparing it with random chance.It takes into account the agreement between the model's predictions and the observed data, considering both true positive and true negative predictions.HS S = T P + T N − C T − C (E.8) where T = T P + T N + FP + FN C = (T P + FP)(T P + FN) + (T N + FP)(T N + FN) T The final dataset has 7 features, including the target feature, from December 25 th 1976 to July 30 th 2019, with a total of 15,558 samples (number of days).The training set has 11,558 samples, the validation set has 2,520 samples, and the test set has 1,480 samples.
because they can capture long-term dependencies in the input sequence in both forward and backward directions.Second, unlike feedforward networks, BiLSTM networks do not demand fixed-length input sequences, thus being able to handle variable-length sequences better.Furthermore, by taking into account both past and future contexts, BiLSTM networks can handle noisy data.However, BiLSTM networks are computationally more expensive than regular LSTM networks due to the need for processing the input sequence in both directions.They also have a higher number of parameters and require more training data to achieve good performance.modelstopped improving remarkably after almost 50 epochs.Thus, there was no need to waste time and computational resources to train the model for more than 50 epochs.The ModelCheckpoint callback function was used to register the model version with the minimal validation loss.The EarlyStopping callback function was used to halt the model run when detecting

Table 1 .
Summary of the performance results of the models for the validation and test sets.
To get further insight into the model's performance, we conducted an assessment of various skill scores, including True Positive (TP), True Negative (TN), False Positive (FP), and False Negative

Table 2 .
Confusion matrix for the energy channel ≥10 MeV predictions in the test set.

Table 3 .
(Whitman et al. (2022)ores with previous models.The dashed entries mean the data is unavailable(Whitman et al. (2022)for more details).
where x t is input data at time t.The input gate i t determines which values from the updated cell states (candidate values) Ct should be added to the cell state.It also takes into account the current input x t and the previous output h t−1 , and is passed through a sigmoid activation function.Ct represent the candidate values that are added to the cell state at time t.The forget gate activation vector f t at time step t, which determines how much of the previous cell state should be retained.The cell state C t at time t is updated based on the forget gate, input gate, and candidate values.The output gate o t at time t determines how much of the cell state should be output.The output vector h t at time t is calculated based on the cell state and the output gate values.h t−1 is the output vector at the previous time step t − 1. W f , W i , W c , W o are the weight matrices for the input vector x t .U f , U i , U c , U o Table D.1.Configuration of the ML model.(1)refers to the error value for 1-day forecasting.Same for (2) refers to 2-day forecasting, and (3) for 3-day forecasting.* In the 1D-CNN layer, 32 filters, a kernel size of 5, and strides of 1 were used.