J. Space Weather Space Clim.
Volume 14, 2024
Article Number: 28
Number of page(s): 11
DOI: https://doi.org/10.1051/swsc/2024027
Published online: 23 October 2024
Technical Article
Thermosphere model assessment for geomagnetic storms from 2001 to 2023
S. Bruinsma^{*} and S. Laurens
GET/CNES, Space Geodesy Office, 14 Av. Edouard Belin, 31400 Toulouse, France
^{*} Corresponding author: sean.bruinsma@cnes.fr
Received: 29 April 2024
Accepted: 14 August 2024
We present an updated study for thermosphere model assessment under geomagnetic storm conditions, defined as events with geomagnetic index ap = 80 or larger. Comparisons between five empirical models, NRLMSISE-00, JB2008, and three versions of DTM2020, and the CHAMP, GRACE, GOCE, Swarm-A, and GRACE-FO neutral density data sets for 152 storms are presented. The storms are categorized, according to ap, as single peak or multiple peaks. After applying a model debiasing procedure using the density data just before the onset of a storm, the models are on average only slightly biased, often by only a few percent. This is an unexpected and reassuring result for these relatively simple models, which were fitted to different observations. The standard deviations of these averages are however up to 12% (1σ), which places the small biases into perspective. The smallest biases are achieved at the lowest altitude when comparing with GOCE data, and the highest for GRACE. The best results, i.e. the smallest bias and standard deviation on average over all single-peak storms, over the entire 4-phase storm period are obtained with the DTM2020_Intermediate and DTM2020_Research models, while the oldest model, NRLMSISE-00, is the least precise. However, NRLMSISE-00 is the least biased for multiple-peak storms. As could be expected, multiple-peak storms are reproduced with less precision than single-peak storms. The assessment reveals that model precision decreases with altitude, but that bias is independent of altitude, at least in the range covered by the data, 250–550 km.
Key words: Thermosphere / Model assessment / Geomagnetic storm / Satellite drag
© S. Bruinsma & S. Laurens, Published by EDP Sciences 2024
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1 Introduction
Semi-empirical (SE) thermosphere models are used operationally in the determination and prediction of the ephemerides of active satellites and orbital debris. Conjunction analysis is becoming a major issue with the fast-growing number of objects in space. Good knowledge of thermosphere model performance, i.e. its uncertainty, notably for geomagnetic storms, is therefore necessary. The accuracy of the determination and prediction of ephemerides of objects in Low Earth Orbit (LEO; altitudes lower than 1000 km) depends largely on the quality of the atmospheric drag modeling. The drag force varies mainly due to fluctuations in total neutral density, and to a much lesser degree due to satellite characteristics (Doornbos, 2011; Mehta et al., 2017) and atmospheric composition and temperature. The variability of the thermosphere is driven by the changing solar Extreme Ultraviolet (EUV) emissions and the changing Joule heating and particle precipitation due to the interaction of the magnetosphere with the solar wind (commonly referred to as "solar activity" and "geomagnetic activity", respectively), whereas a small part is due to upward propagating perturbations that originate in Earth's lower atmosphere. The accuracy of the forecast of the drivers is crucial to the accuracy of the density forecast. We do not address errors in the driver forecasts in this paper, but rather the model performance with observed drivers (i.e. perfect forecasts of the proxies); that is, we quantify the precision of the combination of the model and the proxies it employs.
In order to assess and track the progress over time of current and future thermosphere models, appropriate metrics and high-quality neutral density data sets, preferably with high spatial resolution and covering long intervals of time (i.e. years, and ideally a complete solar cycle), are required. In a previous paper (Bruinsma et al., 2021), specific storm-time metrics were introduced and described, and then applied to a small number (13) of isolated storms. The neutral density data used in that study to verify model precision in space and time have been augmented with data from the missions Swarm-A (van den Ijssel et al., 2020) and GRACE-FO (Siemes et al., 2023), whereas a more precise GRACE dataset (Siemes et al., 2023) replaces the older one. In this study, a total of 152 storms from 2001 to 2023, defined as events with a Kp of 6, i.e., ap = 80 (units: 2 nT), and larger, is used to comprehensively assess the CIRA (COSPAR International Reference Atmospheres) models NRLMSISE-00 (hereafter MSIS00; Picone et al., 2002), JB2008 (Bowman et al., 2008), and DTM2020 (Bruinsma and Boniface, 2021). A modification of the storm-time metrics was necessary to also accommodate double- or multiple-peaked storms, e.g., the 2003 Halloween storm.
The density data and the adopted metrics allow objective benchmarking of the thermosphere models, with quantification and detailed description of improved (or degraded) performance. A major goal of this study is to provide the CCMC (Community Coordinated Modeling Center: https://ccmc.gsfc.nasa.gov) with the list of storms, and the density data, which will then be used to assess all thermosphere models that run there. Thermosphere model scorecards can then be published, applying consistent and always identical metrics, which can help users select the best model for their objective.
The following section briefly presents the models, the density data, and the storm-time metrics, with a small modification for double-peaked storms. The assessment results of the three models are presented in Section 3. After a summary, the conclusions are given in Section 4.
2 Model descriptions
2.1 Semi-empirical thermosphere models
Semi-empirical models are used in orbit computation and mission design and planning because they are easy to use and computationally fast. They provide density and temperature estimates for a single location and time (e.g., satellite positions along an orbit). These climatological (or "specification") models have a low spatial resolution of the order of thousands of kilometers and a temporal resolution of 1–3 h that is imposed by the cadence of the geomagnetic indices. Some kind of storm preconditioning is taken into account through a combination of delayed and averaged geomagnetic indices, notably in MSIS00, and the resulting density estimate is based on statistical fits to those drivers. SE models are constructed by optimally estimating the model coefficients with respect to their underlying databases (some combination of total density, temperature, and composition measurements). As a consequence, the SE thermosphere models adopt the scales of the satellite models used in the respective databases (e.g., DTM2020 is based on TU Delft satellite models, JB2008 on HASDM satellite models), and these are rarely consistent between modelers, as was shown in Bruinsma et al. (2022).
The DTM2020 Operational model (DTM2020_Ops) was developed recently, using F10.7 and Kp as drivers. DTM2020_Int is an unpublished intermediate model, using F30 but still Kp as drivers, that was created in the development of the DTM2020 Research model (DTM2020_Res), which is too complicated to run operationally. The backbone of the DTM2020 models is the complete CHAllenging Minisatellite Payload (CHAMP; Doornbos, 2011), Gravity Recovery and Climate Experiment (GRACE; Bruinsma, 2015), and Gravity field and steady-state Ocean Circulation Explorer (GOCE; Bruinsma et al., 2014) high-resolution accelerometer-inferred density datasets, as well as Swarm-A densities (van den Ijssel et al., 2020). The JB2008 model was developed by fitting to 10 years of US Air Force daily-mean densities, 5 years of output of the U.S. Space Force High Accuracy Satellite Drag Model (HASDM; Storz et al., 2005), 5 years of CHAMP data, and 4 years of GRACE data. The files with the solar drivers, which are computed and sometimes modified by the modelers, are regularly updated; the driver files (SOLFSMY and DTCFILE) that were online in January 2024 were used for the assessment given in this paper. MSIS00 is used in this assessment because it is widely used, and because its difference with MSIS 2.0 (Emmert et al., 2020) is very small for geomagnetic storms and takes the form of a bias (the geomagnetic model algorithm was not updated). MSIS00 was constructed with mass spectrometer, incoherent scatter radar, historical accelerometer total density, UV occultation, and various rocket measurements, along with global daily-mean orbit-derived thermospheric density and tabular lower-atmospheric temperature. The time resolution and drivers of the models are listed in Table 1.
Thermosphere and upper atmosphere models used in the assessment. The temporal resolution is listed in the last column.
2.2 Selected density data
The density variability in the thermosphere is very large, typically hundreds of percent, on long time scales (the solar cycle) but also on short time scales of hours to days in the event of geomagnetic storms. The variations in density depend on position, season, and the level of solar and geomagnetic activity. The high-resolution accelerometer-inferred densities and the GPS-inferred densities are in situ, along the satellite orbits, but a single satellite still provides a rather poor spatial distribution: the latitudinal extent depends on the orbital inclination, and the local time coverage is essentially limited to the times of its ascending and descending passes and the precession of the orbital plane (negligible over the course of a geomagnetic storm). While the measurement cadence typically is 10 s, a satellite passes the same latitude and local time (but at a different longitude) once per orbit of roughly 100 min.
Table 2 lists the essential information of the five selected datasets: CHAMP (Doornbos, 2011), GRACE and GRACE-FO (GFO; Siemes et al., 2023), GOCE (Bruinsma et al., 2014), and Swarm-A (van den Ijssel et al., 2020). These precise density datasets are selected because they are appropriate for storm-time evaluation, notably providing precise measurements from pole to pole. The Swarm-A densities are derived using precision orbit determination and have an along-track resolution of about 1000 km at best, when solar and geomagnetic activity is high. The accelerometer-inferred densities have a constant resolution of about 600 km after low-pass filtering, without data rejection.
Datasets selected for the model assessment under storm conditions. (i is inclination, and LST is the local solar time coverage and the approximate period, in days, to cover 24 h).
The model assessment is performed by comparing model densities with the observed densities of 152 storms with a Kp of 6, i.e., ap = 80, and higher. Table A1 lists the dates, minimum Dst, and maximum ap of the storms, and which satellites monitored them.
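As an illustration of the selection criterion above, the following sketch flags storm onsets in a 3-hourly ap series using the paper's threshold, ap ≥ 80 (Kp ≥ 6). This is our own minimal example, not the authors' code; the function name and inputs are hypothetical.

```python
AP_THRESHOLD = 80  # ap units of 2 nT; ap = 80 corresponds to Kp = 6

def storm_onsets(times, ap, threshold=AP_THRESHOLD):
    """Return the times where ap first crosses the threshold.

    times : sequence of timestamps at the 3-hourly ap cadence
    ap    : sequence of ap values, same length as times
    """
    onsets = []
    above = False
    for t, a in zip(times, ap):
        if a >= threshold and not above:
            onsets.append(t)  # candidate t0: first 3-h interval at/above 80
        above = a >= threshold
    return onsets

# Synthetic example: one storm whose ap rises to 111 and decays again.
ap_series = [12, 27, 80, 111, 94, 56, 22]
print(storm_onsets(range(len(ap_series)), ap_series))  # -> [2]
```

In the paper the candidate onsets would additionally be screened for data gaps and for secondary peaks (at least 9 h apart) to classify SP versus MP storms.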
2.3 Updated storm-time metrics for model-data comparison
Storms in this analysis are defined as Kp = 6 or larger (ap = 80 or larger). In Bruinsma et al. (2021), the storms were divided into four phases of fixed length, two before and two after the minimum Dst associated with the storm defined through Kp: pre-storm (1), onset (2), recovery (3), and post-storm (4). An update of the above phases is necessary in this study in order to accommodate storms with more than one peak in geomagnetic activity of Kp = 6 or larger in phases 3–4 (at least 9 h apart), of which the Halloween 2003 storm is a much-studied case. The intervals of such multiple-peaked storms can be unambiguously defined using ap/Kp, which was preferred over a more complicated definition based on both Dst and ap. For consistency, we modified the definition of t0 of the storm phases for all storms based on ap reaching 80, and Dst is no longer used. Figure 1 (top) illustrates the four phases of a single-peak (SP) storm. Phase 1, the pre-storm interval, is used to debias the models with respect to the observations via a scale factor, which is then applied to the model densities in all four phases. This debiasing procedure is used to minimize the effect of non-storm-related model differences and errors on the assessment. Density data for the 108 single-peak (SP) storms are selected from 30 h before to 48 h after the time ap reaches 80, which now defines t0 and the end of Phase 2 (storm onset). Phase 3 is the main and recovery phase, and Phase 4 is the post-storm phase. Figure 1 (bottom) illustrates the phases for double- or multiple-peaked (MP) storms, in this example for 10–16 July 2004. The t0 for the 44 MP storms is defined as for SP storms, i.e. as the time when ap first reaches 80. Phase 3 for MP storms, which in Figure 1 (bottom) is prolonged because ap = 80 is reached a second time at t = 1.4, is extended by a variable duration based on when ap falls below 80 again (at t of almost 3.0 in this example) plus 36 h. Phase 4 then extends 12 h after the end of Phase 3.
The phases and their durations for SP and MP storms are computed with respect to t0 as listed below.
Figure 1 The four phases of the assessment interval for SP (top) and MP (bottom) storms, with t0 centered on the time of the first peak in ap with a minimum of 80. 
Phase 1 (SP): t0 − 30 h to t0 − 18 h; (MP): t0 − 30 h to t0 − 18 h
Phase 2 (SP): t0 − 18 h to t0; (MP): t0 − 18 h to t0
Phase 3 (SP): t0 to t0 + 36 h; (MP): t0 to t0 + variable duration + 36 h
Phase 4 (SP): t0 + 36 h to t0 + 48 h; (MP): end of Phase 3 to end of Phase 3 + 12 h
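The phase windows above can be sketched in code as follows. This is a hedged illustration under our own naming conventions; the paper does not publish an implementation, and the MP Phase 3 end time is taken from the textual definition (the time ap falls back below 80, plus 36 h).

```python
def phase_windows(multi_peak=False, t_below_80=None):
    """Return {phase: (start_h, end_h)} in hours relative to t0.

    For MP storms, t_below_80 is the time (hours after t0) at which ap
    falls back below 80 after the last peak; Phase 3 then runs to
    t_below_80 + 36 h, and Phase 4 for a further 12 h.
    """
    windows = {1: (-30.0, -18.0), 2: (-18.0, 0.0)}  # same for SP and MP
    if not multi_peak:
        windows[3] = (0.0, 36.0)
        windows[4] = (36.0, 48.0)
    else:
        end3 = t_below_80 + 36.0
        windows[3] = (0.0, end3)
        windows[4] = (end3, end3 + 12.0)
    return windows

print(phase_windows())                                  # SP storm phases
print(phase_windows(multi_peak=True, t_below_80=72.0))  # example MP storm
```

For an SP storm the full assessment interval thus spans 78 h (t0 − 30 h to t0 + 48 h), while MP intervals stretch with the storm's duration.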
Because of the very large range in density values due to differences in altitude and solar activity (i.e. phase of the solar cycle), it is difficult to analyze and interpret a model's performance using absolute values. Therefore, relative precision is preferred in the form of ratios that are easy to comprehend. The models and data are compared by computing the observed-to-computed density ratios (O/C). The mean μ and standard deviation σ of the observed-to-computed density ratios are, due to their distribution, computed in log space following Sutton (2018):
$$\mathrm{mean}\left(\frac{O}{C}\right)=\mu =\exp\left(\frac{1}{N}\sum _{n=1}^{N}\ln\frac{{O}_{n}}{{C}_{n}}\right)$$
$$\mathrm{standard\ deviation}\left(\frac{O}{C}\right)=\sigma =\sqrt{\frac{1}{N}\sum _{n=1}^{N}{\left(\ln\frac{{O}_{n}}{{C}_{n}}-\ln\,\mathrm{mean}\left(\frac{O}{C}\right)\right)}^{2}}$$(1)
where N is the number of observations. A ratio of one indicates that on average the observations are reproduced perfectly, i.e. an unbiased model. Deviation from unity signifies underestimation (larger than one) or overestimation (smaller than one) by the model. The standard deviation (StD) of the observed-to-computed density ratios σ, expressed as a percentage by multiplying by 100, represents the combination of the ability of the model to reproduce observed density variations, the geophysical noise (e.g. waves, the short-duration effect of large flares), and the instrumental noise in the observations. The Pearson correlation coefficients R of observed and modeled densities are also computed, which are independent of model bias; R^{2} represents the fraction of observed variance reproduced by the model.
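Equation (1) translates directly into code. The sketch below (our naming, standard library only) computes μ and σ of the O/C ratios in log space; multiplying σ by 100 gives the StD in percent as used throughout the paper.

```python
import math

def log_space_stats(obs, calc):
    """Return (mu, sigma) of the O/C ratios per Eq. (1), Sutton (2018).

    obs  : observed densities O_n
    calc : model (computed) densities C_n, same length as obs
    """
    logs = [math.log(o / c) for o, c in zip(obs, calc)]
    n = len(logs)
    mean_log = sum(logs) / n
    mu = math.exp(mean_log)  # mean O/C; ln(mu) equals mean_log
    sigma = math.sqrt(sum((x - mean_log) ** 2 for x in logs) / n)
    return mu, sigma

# A model that matches the observations exactly gives mu = 1, sigma = 0:
mu, sigma = log_space_stats([1e-12, 2e-12], [1e-12, 2e-12])
print(mu, sigma)  # -> 1.0 0.0
```

Working in log space makes μ insensitive to the skewed, multiplicative character of density ratios: a factor-of-2 overestimation and a factor-of-2 underestimation cancel exactly.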
The number of SP and MP storms identified in this study, taking the available satellite density data into account, is 108 and 44, respectively. Storms for which data gaps are present in any of the four phases were rejected. Storms are often observed by more than one spacecraft, and the total number of SP and MP storm comparisons is 193 (CHAMP 48, GRACE 72, GOCE 16, Swarm-A 42, and GFO 15) and 79 (CHAMP 31, GRACE 35, GOCE 2, Swarm-A 9, and GFO 2), respectively.
The comparisons of each model with satellite density data are computed in three steps:
1. Calculate the model debiasing scale factor with density data in Phase 1, per storm and per satellite (e.g., 48 scale factors for CHAMP and SP storms);
2. Apply the debiasing and calculate μ_{i,s}, σ_{i,s}, and R_{i,s}, where i = 1 to the number of storms per satellite s (e.g., i runs to 48 for CHAMP and SP storms), for the four phases separately and for all four together. Observations that are ten times smaller or larger than the model densities are rejected;
3. Calculate the average of the mean O/C and the standard deviation of the μ_{i,s} per satellite, again with Eq. (1) (so calculated with the 48 μ_{i,s} in the case of CHAMP and SP storms), referred to hereafter as <O/C> and StD <O/C>, respectively. Calculate the arithmetic average of the StD of O/C (i.e., of the σ_{i,s}), and the arithmetic average of the R_{i,s} as computed in step 2.
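The three steps above can be sketched as follows. This is an illustrative outline under our own function names, not the authors' processing chain; it shows the Phase-1 debiasing, the per-storm statistics with the factor-of-ten outlier rejection, and the log-space aggregation of the per-storm means.

```python
import math

def phase1_scale_factor(obs_p1, model_p1):
    """Step 1: log-space mean O/C over Phase 1, used to debias the model."""
    logs = [math.log(o / c) for o, c in zip(obs_p1, model_p1)]
    return math.exp(sum(logs) / len(logs))

def storm_stats(obs, model, scale):
    """Step 2: debias, reject O/C outside [0.1, 10], return (mu, sigma)."""
    ratios = [o / (scale * c) for o, c in zip(obs, model)]
    logs = [math.log(r) for r in ratios if 0.1 <= r <= 10.0]
    n = len(logs)
    mean_log = sum(logs) / n
    sigma = math.sqrt(sum((x - mean_log) ** 2 for x in logs) / n)
    return math.exp(mean_log), sigma

def aggregate_mu(mus):
    """Step 3: <O/C>, the log-space average of the per-storm means (Eq. 1)."""
    logs = [math.log(m) for m in mus]
    return math.exp(sum(logs) / len(logs))
```

A model biased by a constant factor in Phase 1 is thus fully corrected before the storm-time statistics are computed, so <O/C> isolates the storm response from the background scale offset.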
3 Storm-time assessment results
Scale offsets between models are mainly due to the ingested density data. Therefore, assessment of the performance of models over geomagnetic storms requires debiasing. Table 3 presents the average of the mean O/C (<O/C>) and the StD of all mean O/C (StD <O/C>) of Phase 1, i.e. the mean of the scale factors applied to the models and their StD. The scale factors for the DTM models are close to unity because they were fitted to most of these datasets. MSIS00 and JB2008 were not fitted to the TU Delft densities, and consequently their scale factors are considerable, which demonstrates that debiasing is essential for a fair comparison.
The <O/C> and StD <O/C> for Phase 1, for debiasing, per model for the five density datasets and all 152 storms (single- and multiple-peak storms).
For users of thermosphere models, the main result of the assessment is the model performance over entire storms, i.e. from Phase 1 to Phase 4. The average of the 193 mean O/C (<O/C>) and the StD of all mean O/C (StD <O/C>) are displayed in Figure 2 for all SP storms, per satellite, for the five models. On average, the models perform rather well during storms, and the <O/C> per satellite density dataset are within 5% of unity for all models except MSIS00. Overall, the models only slightly underestimate during storms, by 1–5%. So at first sight, model performance is fairly equal. However, the StD <O/C>, indicated by the bars in Figure 2, shows that the spread around the biases is not equal. The smallest StD <O/C> are obtained with DTM2020_Int and DTM2020_Res. Figure 3 shows the arithmetic average StD of O/C for all SP storms, per satellite, for the five models; the bars represent the StD of the StD of O/C. The StD of O/C increases from GOCE to CHAMP to GRACE/Swarm-A/GFO, which clearly shows the dependency of model performance on altitude. It also clearly shows that the DTM models reproduce the storm variations with significantly smaller errors than MSIS00 and JB2008. However, a smaller StD of O/C does not necessarily improve the model performance when calculating satellite orbits, which is reflected by <O/C>.
Figure 2 <O/C> and StD <O/C> of phases 1–4 of the 108 SP storms, per satellite for the five models.
Figure 3 The arithmetic average StD of O/C of the 108 SP storms and per satellite for the five models. 
The average of the 79 mean O/C (<O/C>) and the StD of all mean O/C (StD <O/C>) are displayed in Figure 4 for all MP storms, per satellite, for the five models. On average, the models again perform well during MP storms, and the <O/C> per satellite is within 6% of unity for all models. Contrary to what was observed for SP storms, the models overestimate MP storms by a few percent (NB: only the CHAMP and GRACE results are based on a large number of storms). The model performance is significantly reduced for MP storms, which is reflected by the larger StD <O/C> compared with SP storms (Fig. 2): for GRACE, for example, it increases from 7.5% to 9.9% for DTM2020_Res and from 10.1% to 12.1% for JB2008. The arithmetic average StD of O/C for all MP storms, shown in Figure 5, is also significantly larger than that obtained for SP storms, shown in Figure 3, which confirms that MP storms are reproduced with lower fidelity than SP storms.
Figure 4 <O/C> and StD <O/C> of phases 1–4 of the 44 MP storms, per satellite for the five models.
Figure 5 The arithmetic average StD of O/C of the 44 MP storms and per satellite for the five models. 
The complete statistics per phase are given in tables, which enable identifying the strong and weak points of the models more specifically; this is probably not very pertinent for operational issues, but it is interesting for modelers. Table 4 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all single-peak storms. The results, per model, are listed for all four phases (bold), Phase 2, Phase 3, and Phase 4 (from top to bottom). Results for Phase 1 are not pertinent due to the debiasing and are not listed. The boldfaced numbers are displayed in Figure 2. The performance of DTM2020_Ops stands out for GFO, for which it is significantly more precise than DTM2020_Int and DTM2020_Res. The <O/C> per phase of DTM2020_Ops and DTM2020_Int do not present a specific tendency for Phases 2–4. MSIS00, on the other hand, underestimates in Phase 3 (recovery) and then overestimates in Phase 4 (post-storm). Except for GOCE, JB2008 underestimates in Phase 2, the onset. DTM2020_Res presents an overestimation in Phase 4. The StD <O/C> is predominantly smallest during Phase 2 (onset) for all models. It is mostly largest in Phase 3 for MSIS00, and in Phase 4 for JB2008 and the DTM models.
The <O/C> and StD <O/C> per model for the five density datasets and the 108 single-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3, and Phase 4.
Table 5 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all single-peak storms. Results are again listed for all four phases (bold), Phase 2, Phase 3, and Phase 4 (from top to bottom). The boldfaced average StD is displayed in Figure 3. DTM2020_Res now clearly has the smallest StD and highest correlations, also for GFO (not used in DTM), even if the DTM2020_Ops results are very close. The smallest StD of O/C is most often seen in Phase 4 for all models, i.e. in the post-storm phase, and the highest mostly in Phase 2 for the DTM models and JB2008, while it is seen in Phase 3 for MSIS00. For all models, the highest correlation is always seen for Phase 4, and the lowest for Phase 2. The StD and correlation also reveal a small dependency on altitude, i.e. in the order GOCE, CHAMP, GRACE. While slightly higher, the Swarm-A StD and correlation results are again better than for GRACE, but this is due to the different nature of the densities, i.e. low-resolution GPS-inferred densities versus high-resolution accelerometer-inferred densities. The GFO StD and correlation results are also better than for GRACE, most likely due to the much smaller number of storms for GFO.
The average StD of the O/C and correlation coefficient per model for the five density datasets of the 108 single-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3, and Phase 4. Green indicates best performance.
Table 6 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all multiple-peak storms. The Phase 3 results listed here for the multiple-peak storms are based on longer intervals (see Fig. 1). The boldfaced numbers are displayed in Figure 4. Note that the GOCE and GFO results are based on only 2 multiple-peak storms each, and are therefore not statistically significant. CHAMP and GRACE now have the smallest biases with MSIS00 (NB: for which they are largest for the SP storms), and Swarm-A with JB2008 and the DTM models. All models reveal a clear offset between CHAMP and GRACE compared with Swarm-A. The StD <O/C> for MP storms is noticeably higher than for SP storms.
The <O/C> and StD <O/C> per model for the five density datasets and the 44 multiple-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3, and Phase 4.
Table 7 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all multiple-peak storms. The boldfaced average StD is displayed in Figure 5. DTM2020_Res again clearly has the smallest StD and highest correlations. The smallest StD of O/C is, as for SP storms, most often found for Phase 2, while a clear tendency cannot be gleaned from Phases 3 and 4. For all models, the highest correlation is, as for SP storms, always seen in Phase 4; it is lowest in Phase 2 for CHAMP and GRACE, and mostly lowest in Phase 3 for Swarm-A.
The average StD of the O/C and correlation coefficient per model for the five density datasets and the 44 multiple-peak storms. Per model are given the results of all four phases (bold), Phase 2, Phase 3, and Phase 4.
Table 8 lists the <O/C> and StD <O/C> of the five models for the five density datasets and all 152 storms (single- and multiple-peak). In orbit prediction, the type of storm is not known beforehand, and these results are therefore the most comprehensive. Since Phases 2 and 3 do not represent exactly the same storm phases for SP and MP storms, only the assessments over the complete storms, i.e., over all four phases, are listed. The results are, of course, in between the SP and MP results, weighted toward the former because there are more SP than MP storms. DTM2020_Res and DTM2020_Int have the smallest biases, <O/C> within 2% of unity, and the smallest variations, StD <O/C>. JB2008 has only slightly higher biases, a maximum of 4%, but its StD are up to twice as large. MSIS00 has the highest biases, but these are still only 2–8%.
The <O/C> and StD <O/C> per model for the five density datasets and all 152 storms (single- and multiple-peak storms). Per model are given the results of all four phases.
Table 9 lists the average StD of O/C and the average correlation coefficients of the five models for the five density datasets and all storms (single- and multiple-peak). DTM2020_Res always has the smallest StD of O/C and the highest correlations, i.e., it reproduces the storms with the highest fidelity of the five models assessed in this study.
The average StD of the O/C and correlation coefficient per model for the five density datasets and all 152 storms (single- and multiple-peak storms). Per model, the results of all four phases are given.
4 Conclusions and outlook
The density data and indices for the 152 selected storms, as well as updated metrics for thermosphere model assessment, have been described in this paper. The mean and standard deviation of the density ratios and the correlations were computed using CHAMP, GRACE, GOCE, Swarm-A, and GFO density data and the CIRA models MSIS00, JB2008, and DTM2020_Ops/Int/Res.
Overall, taking all SP and MP storms into account, the DTM and JB2008 models only slightly underestimate during storms, by 2–4% at most, which is an unexpectedly good performance for semi-empirical models. The MSIS00 biases are somewhat higher and range from 2 to 8%. It is remarkable, and reassuring, that the MSIS, JB2008, and DTM models predict storm perturbations with comparably small biases, while they were fitted to different observations (very different for MSIS). That performance is reached on condition of debiasing the models before the storm onset, an option that is not available to all thermosphere model users, nor is it always possible. However, excepting the GOCE results, the StD <O/C> of all models ranges from 6 to 12%, i.e. the scatter is considerable and improvements are undeniably necessary.
The best results, i.e. the smallest bias and standard deviation on average over all SP storms, over the entire 4-phase storm period are obtained with the DTM2020_Int and DTM2020_Res models, while the oldest model, MSIS00 (which, unlike DTM and JB2008, has not benefitted from an overhaul of its geomagnetic model algorithm), is the least precise. However, MSIS00 is the least biased model for MP storms. The assessment clearly revealed that model precision (StD of O/C) decreases with altitude, but that bias (<O/C>) is independent of altitude, at least in the range of 250–550 km.
The updated metrics and the full list of SP and MP storms (Table A1) will eventually be implemented in CAMEL (Comprehensive Assessment of Models and Events based on Library tools) at NASA's Community Coordinated Modeling Center (CCMC). Then, assessment not only of the semi-empirical models tested in this study but also of many first-principles models such as TIE-GCM, WACCM-X, GITM, and WAM will be possible. This study showed that the average model bias can hardly be improved upon, but perhaps the physics-based models can succeed in driving down the standard deviations (StD <O/C>). Presently, an incomplete test version is already available, with fewer storms and not all model output (https://webserver1.ccmc.nasa.gov/camel/NeutralDensity). The final CAMEL version will also generate scorecards per model, based on selected reference storms and density data, so that model users will have access to objective and standardized model assessments.
Acknowledgments
We thank the reviewers for their time and help to improve the paper. We thank TU Delft (http://thermosphere.tudelft.nl) for making their total density data publicly available. We thank WDC Kyoto for the Dst index (http://wdc.kugi.kyoto-u.ac.jp/), and GFZ Potsdam for the Kp index (https://www.gfz-potsdam.de/en/kp-index/) and the Hpo index (https://www.gfz-potsdam.de/en/hpo-index/). We received support from CNES/SHM. The editor thanks Christian Siemes and an anonymous reviewer for their assistance in evaluating this paper.
Appendix
The 152 selected storm intervals from 2001 to 2023, storm class (cat: Single Peak or Multiple Peak), minimum Dst, maximum ap, and the satellite dataset (CH, GR, GO, SW, GFO = CHAMP, GRACE, GOCE, Swarm-A, and GRACE-FO).
References
 Bowman B, Tobiska WK, Marcos F, Huang C, Lin C, Burke W. 2008. A new empirical thermospheric density model JB2008 using new solar and geomagnetic indices. In: AIAA/AAS astrodynamics specialist conference, AIAA, Honolulu, HI. https://doi.org/10.2514/6.20086438. [Google Scholar]
 Bruinsma S, Siemes C, Emmert JT, Mlynczak MG. 2022. Description and comparison of 21st century thermosphere data. Adv Space Res 72(12): 5476–5489. https://doi.org/10.1016/j.asr.2022.09.038. [Google Scholar]
Bruinsma S, Boniface C, Sutton E, Fedrizzi M. 2021. Thermosphere modeling capabilities assessment: geomagnetic storms. J Space Weather Space Clim 11: 12. https://doi.org/10.1051/swsc/2021002. [CrossRef] [EDP Sciences] [Google Scholar]
 Bruinsma S, Boniface C. 2021. The DTM2020 thermosphere models. J Space Weather Space Clim 11: 47. https://doi.org/10.1051/swsc/2021032. [CrossRef] [EDP Sciences] [Google Scholar]
 Bruinsma SL. 2015. The DTM2013 thermosphere model. J Space Weather Space Clim 5: A1. https://doi.org/10.1051/swsc/2012005. [CrossRef] [EDP Sciences] [Google Scholar]
 Bruinsma SL, Doornbos E, Bowman BR. 2014. Validation of GOCE densities and thermosphere model evaluation. Adv Space Res 54: 576–585. https://doi.org/10.1016/j.asr.2014.04.008. [Google Scholar]
 Doornbos E. 2011. Thermospheric density and wind determination from satellite dynamics, Ph.D. Dissertation, University of Delft, 188 p. Available at http://repository.tudelft.nl/. [Google Scholar]
 Emmert JT, Drob DP, Picone JM, Siskind DE, Jones Jr M et al. 2020. NRLMSIS 2.0: a wholeatmosphere empirical model of temperature and neutral species densities. Earth Space Sci 8(3): e2020EA001321. https://doi.org/10.1029/2020EA001321. [Google Scholar]
 Mehta PM, Walker AC, Sutton EK, Godinez HC. 2017. New density estimates derived using accelerometers on board the CHAMP and GRACE satellites. Space Weather 15: 558–576. https://doi.org/10.1002/2016SW001562. [CrossRef] [Google Scholar]
 Picone JM, Hedin AE, Drob DP, Aikin AC. 2002. NRLMSISE00 empirical model of the atmosphere: statistical comparisons and scientific issues. J Geophys Res 107(A12): 1468. https://doi.org/10.1029/2002JA009430. [CrossRef] [Google Scholar]
 Siemes C, Borries C, Bruinsma S, FernandezGomez I, Hładczuk N, van den Ijssel J, Kodikara T, Vielberg K, Visser P. 2023. New thermosphere neutral density and crosswind datasets from CHAMP, GRACE and GRACEFO. J Space Weather Space Clim. https:/doi.org/10.1051/swsc/2023014. [Google Scholar]
 Storz MF, Bowman BR, Branson MJI, Casali SJ, Tobiska WK. 2005. High accuracy satellite drag model (HASDM). Adv Space Res 36: 2497–2505. https://doi.org/10.1016/j.asr.2004.02.020. [Google Scholar]
 Sutton EK. 2018. A new method of physicsbased data assimilation for the quiet and disturbed thermosphere. Space Weather 16: 736–753. https://doi.org/10.1002/2017SW00178. [CrossRef] [Google Scholar]
 Van den Ijssel J, Doornbos E, Iorfida E, March M, Siemes C, Montenbrück O. 2020. Thermosphere densities derived from Swarm GPS observations. Adv Space Res 65: 7. https://doi.org/10.1016/j.asr.2020.01.004. [Google Scholar]
Cite this article as: Bruinsma S & Laurens S. 2024. Thermosphere model assessment for geomagnetic storms from 2001 to 2023. J. Space Weather Space Clim. 14, 28. https://doi.org/10.1051/swsc/2024027.
All Tables
Thermosphere and upper atmosphere models used in the assessment. The temporal resolution is listed in the last column.
Datasets selected for the model assessment under storm conditions (i is the inclination; LST is the local solar time coverage, with the approximate period, in days, to cover 24 h).
The <O/C> and StD <O/C> for Phase 1, used for debiasing, per model for the five density datasets and all 152 storms (single- and multiple-peak storms).
The <O/C> and StD <O/C> per model for the five density datasets and the 108 single-peak storms. For each model, results are given for all four phases (bold) and for Phases 2, 3, and 4 separately.
The average StD of the O/C and correlation coefficient per model for the five density datasets of the 108 single-peak storms. For each model, results are given for all four phases (bold) and for Phases 2, 3, and 4 separately. Green indicates best performance.
The <O/C> and StD <O/C> per model for the five density datasets and the 44 multiple-peak storms. For each model, results are given for all four phases (bold) and for Phases 2, 3, and 4 separately.
The average StD of the O/C and correlation coefficient per model for the five density datasets and the 44 multiple-peak storms. For each model, results are given for all four phases (bold) and for Phases 2, 3, and 4 separately.
The <O/C> and StD <O/C> per model for the five density datasets and all 152 storms (single- and multiple-peak storms). For each model, results are given for all four phases.
The average StD of the O/C and correlation coefficient per model for the five density datasets and all 152 storms (single- and multiple-peak storms). For each model, results are given for all four phases.
The 152 selected storm intervals from 2001 to 2023, storm class (cat: Single Peak or Multiple Peak), minimum Dst, maximum ap, and the satellite dataset (CH, GR, GO, SW, GFO = CHAMP, GRACE, GOCE, Swarm A, and GRACE-FO).
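The debiasing statistic behind the tables above can be illustrated with a short sketch. It assumes (as described in the abstract) that the model is rescaled by the mean observed-to-computed (O/C) density ratio over Phase 1, just before storm onset, after which the residual bias <O/C> and its 1-sigma scatter StD <O/C> are reported in percent. The function name, array shapes, and toy densities are hypothetical, not taken from the paper.

```python
import numpy as np

def debias_and_score(obs, model, phase1_mask):
    """Sketch of Phase-1 debiasing (assumed implementation):
    scale the model densities by the mean O/C ratio over the
    pre-storm Phase 1, then report the residual mean bias and
    1-sigma standard deviation of O/C, both in percent."""
    oc = obs / model                       # O/C ratio per observation
    bias = oc[phase1_mask].mean()          # Phase-1 mean O/C (debias factor)
    oc_debiased = oc / bias                # equivalent to rescaling the model
    mean_pct = 100.0 * (oc_debiased.mean() - 1.0)  # residual bias, %
    std_pct = 100.0 * oc_debiased.std(ddof=1)      # 1-sigma scatter, %
    return mean_pct, std_pct

# Toy example: a model ~10% low relative to noisy observations
rng = np.random.default_rng(0)
model = np.full(200, 1.0e-12)                      # kg/m^3, hypothetical
obs = 1.1 * model * rng.normal(1.0, 0.05, 200)     # 5% observation noise
phase1 = np.arange(200) < 50                       # first 50 points = Phase 1
m, s = debias_and_score(obs, model, phase1)
```

After debiasing, the residual bias `m` is near zero even though the raw model is 10% low, while `s` retains the ~5% observation scatter, mirroring the paper's finding of small biases but non-negligible standard deviations.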
All Figures
Figure 1 The four phases of the assessment interval for SP (top) and MP (bottom) storms, with t0 centered on the time of the first peak in ap with a minimum of 80.
Figure 2 <O/C> and StD <O/C> of Phases 1–4 of the 108 SP storms, per satellite for the five models.
Figure 3 The arithmetic average StD of O/C of the 108 SP storms, per satellite for the five models.
Figure 4 <O/C> and StD <O/C> of Phases 1–4 of the 44 MP storms, per satellite for the five models.
Figure 5 The arithmetic average StD of O/C of the 44 MP storms, per satellite for the five models.