The operational and research DTM-2020 thermosphere models

Aims: The semi-empirical Drag Temperature Models (DTM) predict the Earth’s thermosphere’s temperature, density, and composition, especially for orbit computation purposes. Two new models were developed in the framework of the H2020 Space Weather Atmosphere Models and Indices (SWAMI) project. The operational model is driven by the trusted and established F10.7 and Kp indices for solar and geomagnetic activity. The so-called research model is more accurate, but it uses the indices F30 and the hourly Hpo, which are not yet accredited operationally.Methods: The DTM2020 models’ backbone comprises GOCE, CHAMP, and Swarm A densities, processed by TU Delft, and Stella processed in-house. They constitute the standards for absolute densities, and they are 20–30% smaller than the datasets used in the fit of DTM2013. Also, the global daily mean TLE densities at 250 km, spanning four solar cycles, were now used to improve solar cycle variations. The operational model employs the same algorithm as DTM2013, which was obtained through fitting all data in our database from 1967 to 2019. Because of the Hpo index, which is not available before 1995, the coefficients linked to the geomagnetic activity of the research model are fitted to data from 2000 to 2019. The algorithm was updated to take advantage of the higher cadence of Hpo. Both models are assessed with independent data and compared with the COSPAR International Reference Atmosphere models NRLMSISE-00, JB2008, and DTM2013. The bias and precision of the models are assessed through comparison with observations according to published metrics on several time scales. Secondly, binning of the density ratios are used to detect specific model errors. Results: The DTM2020 densities are on average 20–30% smaller than those of DTM2013, NRLMSISE-00, and JB2008. The assessment shows that the research DTM2020 is the least biased and most precise model compared to assimilated data. It is a significant improvement over DTM2013 under all conditions and at all altitudes. This is confirmed by the comparison with independent SET HASDM density data. The operational DTM2020 is always less accurate than the research model except at 800 km altitude. It has comparable or slightly higher precision than DTM2013, despite using F10.7 instead of F30 as solar activity driver. DTM, and semi-empirical models in general, can still be significantly improved on the condition of setting up a more complete and consistent total density, composition, and temperature database than available at this time by means of a well-conceived observing system.


Introduction
A major application of semi-empirical thermosphere specification models is in the computation of the atmospheric drag force in the orbit determination and prediction of spacecraft as well as debris. The models provide low spatial and temporal resolution average (climatological) predictions of the main constituents' temperature, total and partial densities as a function of location (altitude, latitude, longitude, local solar time), solar and geomagnetic activity, and season. Their accomplishment depends largely on the quality and distribution of the available atmospheric observations. They reproduce with variable accuracy through a simplified analytical algorithm based on parameterized equations and notably low degree and order spherical harmonic expansions. The spatial resolution that can be reached with the degree and order 3 or 4 expansions is typically 45-60 arc degrees. Consequently, smaller density variations (e.g., gravity waves, certain tides) cannot be reproduced, and they are sources of geophysical noise when comparing to observations.
The thermosphere operational and research Drag Temperature Model (DTM2020) models were developed in the framework of the Space Weather Atmosphere Models and Indices (SWAMI) project (http://swami-h2020.eu), which was a European Union Horizon 2020 Framework project. The main objectives of the SWAMI project were updating the DTM2013 thermosphere model (Bruinsma, 2015), extending the upper bound of the altitude range covered by the Met Office Unified Model (MO-UM), creation of a whole atmosphere model by blending of MO-UM output with DTM2020, and taking advantage of the new high-cadence driver for geomagnetic activity Hpo (Jackson et al., 2020).
The DTM2020 and COSPAR International Reference Atmosphere (CIRA) models described and assessed in this paper are constructed by optimally fitting model coefficients to different combinations of density, temperature, and composition measurements. The backbone of the DTM2020 models is the complete Challenging Minisatellite Payload (CHAMP; Doornbos, 2011), Gravity Recovery and Climate Experiment (GRACE; Bruinsma, 2015) and Gravity field and steady-state Ocean Circulation Explorer (GOCE; Doornbos et al., 2014) high-resolution accelerometer-inferred density datasets, as well as Swarm A densities (Van den Ijssel et al., 2020). A major difference with DTM2013 is that the density datasets were not adjusted to the US Space Force High Accuracy Satellite Drag Model (HASDM; Storz et al., 2005) by scaling factors described for GOCE data in . Instead, CHAMP, GOCE, and Swarm A densities were considered of higher absolute accuracy and used unmodified. Consequently, DTM2020 densities are on average 20-30% smaller than DTM2013 (and JB2008); these smaller absolute densities are consistent with NRLMSIS 2.0 results, which shows the same tendency when compared to NRLMSISE-00.
Another difference with DTM2013 is that two versions of DTM2020 were developed. The proxy for solar Extreme Ultra Violet (EUV) emissions, the solar radio flux at 30 cm F30 (Dudok de Wit et al., 2014), is not yet universally accepted in an operational environment as F10.7 is. The high-cadence Hpo geomagnetic indices (https://doi.org/10.5880/Hpo.0001) have been developed very recently and for that reason maybe also not yet be accredited for operational use. Therefore, it was decided to develop an operational model with F10.7 and Kp as drivers (hereafter referred to as DTM2020_Oper), and a more accurate but research model driven by F30 and Hp60 (DTM2020_Res). DTM2020, as well as the CIRA models for atmospheric drag NRLMSISE-00 (Picone et al., 2002), JB2008 (Bowman et al., 2008), DTM2013, and NRLMSIS 2.0 (Emmert et al., 2020), are assessed by comparison to total density data to demonstrate and quantify the improvements made.
The following section briefly reviews the density data and drivers used in this study. The model algorithm is described in Section 3. The assessment is presented in Section 4, and a summary and conclusions are given in Section 5.

Density data and drivers
The neutral density data used in the construction and validation of DTM2020 are listed in Table 1. The distributions with time (solar activity) and altitude of the main total mass density datasets are displayed in Figure 1, and they are described in more detail in the following subsections, while the last two subsections succinctly describe the new proxies used in DTM2020_Res (F10.7 and Kp are not presented).

GOCE and CHAMP
GOCE (Drinkwater et al., 2003) was launched in March 2009 in a 96.5°inclination, dawn-dusk orbit, and re-entered the atmosphere in November 2013. Thanks to drag compensation, the orbit was maintained at 255 km mean altitude for the largest part of the mission, and then it was lowered in four stages, ultimately to 224 km in May 2013. GOCE neutral density is an ESA product (https://earth.esa.int/eogateway/catalog/gocethermosphere-data) with a resolution of 80 km along the orbit and a precision of a few percent or better . For DTM2020, GOCE data were used as provided by ESA. The processing of GOCE data (by TU Delft under ESA contract) was done using high-fidelity satellite shape and numerical aerodynamic coefficient models, which lead to the most accurate absolute density. In the past, for DTM2009 and DTM2013, density data were adjusted to the High Accuracy Satellite Drag Model (HASDM; Storz et al., 2005), via multiplicative scale factors (e.g., 1.25 for GOCE; Bruinsma et al., 2014) for consistency, compatibility, and comparability. HASDM is an operational model that is corrected in near realtime through data assimilation.
The CHAMP (Reigber et al., 1996) data were also processed using high-fidelity satellite shape and aerodynamic coefficient models by TU Delft (Doornbos, 2011). The scale factor to best fit the HASDM densities along the CHAMP orbit is 1.23, which was applied in the construction of DTM2009 and DTM2013. The absolute densities predicted with a model are linked to the aerodynamic coefficients used in the derivation of atmospheric density. Unfortunately, there is no standard for aerodynamic coefficient modeling, but DTM2020 was constructed with the most accurate data available. Since the GOCE and CHAMP data are not scaled in the DTM2020 development, it is clear that DTM2020 densities will be 20-25% smaller than DTM2013.

Swarm A
The ESA Swarm mission (Friis-Christensen et al., 2008), launched in November 2013, consists of three identical satellites (A and C side-by-side, and B in the higher orbit) in near-polar orbits at about 460 and 510 km initially. They are equipped with accelerometers, but in this study, the GPS-derived densities (Van den Ijssel et al., 2020) are used because of the serious problems and data gaps in the accelerometer data (Siemes et al., 2016). The spatial resolution and precision depend on the level of atmospheric drag experienced in orbit, principally on altitude and solar activity. Only the density data of Swarm A (http://swarm-diss.eo.esa.int), one of the lower flying satellites, for high to medium solar activity (a cut-off of F10.7 less than 100 sfu was applied) was used in DTM2020; the spatial resolution of the selected data is estimated to be approximately 4000-8000 km along the orbit, i.e., comparable to the model resolution.

GRACE
Neutral densities inferred from accelerometer measurements on GRACE (Tapley et al., 2004) were computed using the methodology described in Bruinsma et al. (2004). Compared with the TU Delft processing, the satellite shape was a simple S. Bruinsma and C. Boniface: J. Space Weather Space Clim. 2021, 11, 47 macro model, and the analytical Sentman aerodynamic coefficient model (Sentman, 1961) was employed. The GRACE data cover the period 08/2002 through 12/2016 but starting in 2011, 2-3 gaps of 4 weeks each per year occurred, and data quality is slightly worse due to battery issues on the spacecraft. To make the GRACE data more consistent with the more accurate TU Delft densities, scaling factors were determined through comparison, after normalization to the mean altitude, with CHAMP and Swarm A densities when the orbital planes were nearly co-planar (less than 1-hour local time difference). The GRACE density scaling factors thus determined a decrease over the mission: 0.76 (2002-2005), 0.73 (2006), 0.70 (2007-2016).

TLE daily-mean global densities
The daily-mean global average mass densities at 250 and 325 km altitude derived from Two-Line orbital Element (TLE) sets covering the years 1967-2013 (Emmert, 2015) were used or partly used (from 2000 to 2013) in the case of the research model. They are at the low end of the ten altitude levels (250-575 km) on which the TLE densities were computed; the higher altitude data were not used because precision decreases and noise grows rapidly. Although data is provided with daily cadence, the temporal resolution is less because of the 3-day cubic spline smoothing involved in the density retrieval procedure (Emmert, 2009). Even if the noise is much larger than that in GOCE data at 250 km, the TLE dataset is essential because it covers four solar cycles. Because the scale of the TLE data is not consistent with the accelerometer-inferred and Swarm A densities, this dataset is used in the adjustment procedure to estimate relative density variations and not the absolute value. Consequently, a model bias with respect to this dataset is to be expected.

Stella and Starlette
The spherical satellites Stella and Starlette are in a 96°inclination and near-circular orbit at approximately 813 km altitude, and 49°inclination slightly eccentric orbit (e = 0.02) with a perigee altitude of 800 km, respectively. Because of their shape (no attitude-related errors), knowledge of the satellite characteristics (mass, surface, reflectivity), and the very accurate laser tracking by the International Laser Ranging Service (ILRS; Pearlman et al., 2002), they are suitable for density derivation despite their relatively high altitude. The densities were inferred from the analysis of orbit perturbations (Jacchia & Slowey, 1963), which essentially ties the observed decay to a mean density over the orbit in the case of Stella and mean density close to perigee for Starlette and span the period 2000-2019. Starlette was not used in the construction of DTM2020 and was reserved for validation. The scale of these two satellites is also considered accurate because the analytical closed form equations used are actually as accurate as the numerical models for spheres.

SET HASDM density database
Space Environment Technologies (SET) has made 20 years of densities available through the SET HASDM Database (Tobiska et al., 2021), which is derived from the US Space Force High Accuracy Satellite Drag Model (Storz et al., 2005). These densities are not observations but the output of a model corrected in near real-time through data assimilation. Densities are given every 3 h from 2000 to 2019 on [10°latitude Â 1 h local time] grids, and from 175 to 825 km altitude in steps of 25 km. The SET HASDM data on the altitudes 250, 400, 475, and 800 km are only used for model validation for the entire 20-year time span.
The scale of the SET HASDM data (and the data in the next section) is not consistent with the accelerometer-inferred and Swarm A densities. A model bias concerning this dataset is, therefore, to be expected.

Energy Dissipation Rate (EDR) densities
Daily-mean densities inferred from satellite drag data from the Cosmos satellites, Venuslander, Ov-3, and Hitchhiker-1, all in elliptical orbits with perigee heights in the 200-500 km range, are also reserved for assessment only. The densities, covering the period 2000-2010, were computed from the estimated energy dissipation rates derived from the orbits fitted directly to radar and optical tracking data. For a complete description of the method and the data, we refer to (Bowman et al., 2004).

The F30 radio flux
DTM2013 and DTM2020_Res are driven by F30 because the observed densities can be reconstructed with higher fidelity than with F10.7, most likely because of the much higher contribution of thermal Bremsstrahlung (Dudok de Wit et al., 2014). Similar to F10.7, it is a ground-based radio measurement, and therefore more robust and more accurately calibrated than satellite measurements of solar emissions (Vourlidas & Bruinsma, 2018). Still, the absolute calibration also of ground-based radio instruments remains a challenging issue. And although F30 is a better proxy for solar EUV than F10.7, it is still a proxy.
For convenience and to facilitate mission design and other simulations, F30 was used in DTM2013 after scaling to F10.7 via the linear regression shown in eq. (1): During the development of DTM2020, problems with fitting the density data over the complete 1967-2019 interval led to a renewed inspection of F30. The difference with F10.7 revealed a positive trend, as can be seen in Figure 2. Correcting F30 for the drift of about 10 sfu over the 53 years using the linear regression formula printed in Figure 2 resolved the density data fitting issue. The drift is most likely due to imperfect calibration of F30. In the following, and different from DTM2013, the drift-corrected F30 given below is used: in which year is the decimal calendar year (e.g., 2010.4928).

The Hp60/ap60 hourly planetary geomagnetic index
The research DTM2020 model uses a new driver for the geomagnetic activity of the Hpo index family (http://swami-h2020. eu/wp-content/uploads/2021/04/AGU-iPoster-Matzka.pdf) with higher temporal resolution than Kp, the hourly Hp60 index (https://doi.org/10.5880/Hpo.0001). Up to values of Kp = 9-, it employs the same (operational) algorithm used to derive the Kp index and is conceived to have the same frequency distribution, i.e., the same number of quiet days and geomagnetic storms. Compared with the Kp algorithm, another essential improvement besides the higher resolution is that it is no longer capped at 9 (all geomagnetic events above a certain threshold are Kp = 9). The Hp60 index is open-ended, and very high levels of geomagnetic activity can now be more accurately classified. For instance, Kp reached a value of 9 during the Halloween storm in 2009, while the Hp60 reaches a value of 12. However, it is not available before 1995 because it requires digital high-cadence magnetograms.
3 The development of DTM2020

The algorithms
In DTM, total density in the altitude range from 120 to approximately 1500 km is reconstructed by summing the partial densities of the main thermosphere constituents (N2, O2, O, He, H), under the hypothesis of independent static diffuse equilibrium. The total mass density q at altitude z is calculated as follows: The partial densities of the ith constituent, q i , are specified at 120 km altitude and are propagated to higher altitudes employing their specific height function f i (z), which is based on temperature. Variations in the temperature at 120 km, the exospheric temperature, and the partial densities are modeled using the S. Bruinsma and C. Boniface: J. Space Weather Space Clim. 2021, 11, 47 spherical harmonic function G(L), which takes the effect of environmental parameters L (latitude, longitude, local solar time, season, solar flux, and geomagnetic activity) into account.
The function G(L) used in DTM2013 and DTM2020_Oper is identical, and the model improvement is due to fitting to new data only. For the research model, it was updated to take advantage of the higher cadence of the Hp60 index. The modifications made were both logistic, to handle a larger number of indices, and algorithmic. Presently ten values of Hpo are used: Hpo delayed by 0-4 h (5 indices), the mean of the previous 24 h, and four mean values of three indices delayed by 6 (mean of Hpo delayed by 5, 6, and 7 h), 10, 15, and 20 h. The delay applied to the five Hp60 indices is a function of latitude, and it varies from 1-h at the pole to 4-h at the equator. The above delays and delayed mean values were determined by trial and error on a selection of storms. A further improvement is achieved by coupling the amplitude of the density enhancement due to geomagnetic activity with the mean solar activity.

Sequential modeling procedure
The DTM2020 models were constructed in a sequential procedure because Hp60 was unavailable before 1995, i.e., not for the entire density database (Table 1). In a preliminary step of the procedure, using DTM2013 as a background model, the temperature and partial densities were re-estimated via leastsquares adjustment to better fit the approximately 20-25% smaller observed densities than those used in DTM2013. That unpublished model, DTM2019, was the prior model used in the sequential development depicted in Figure 3. The first model developed was DTM2020_Oper, which is driven by F10.7 and Kp, and therefore all density data from the sixties to the present-day can be assimilated. Then in a second step, using DTM2020_Oper as background, DTM2020_Intermediate was constructed with F30 and Kp. Since F30 covers the entire period for which density data are available, all solar activity-related model coefficients are accurately estimated. The DTM2020_ Res model is developed in the third step, using DTM2020_ Intermediate as a-priori, and only re-estimating those model coefficients linked to Hp60 using the recent high-resolution density data (2000)(2001)(2002)(2003)(2004)(2005)(2006)(2007)(2008)(2009)(2010)(2011)(2012)(2013)(2014)(2015)(2016)(2017)(2018)(2019), thereby preserving the accuracy of the coefficients linked to solar activity.

The density uncertainty estimate
An optional new feature of DTM2020 is the 1-r uncertainty estimate for the density given as a percentage. The uncertainty model was constructed by binning and analyzing all accelerometer-inferred densities and Swarm A densities using manifold approximation, described in the companion paper by Boniface and Bruinsma (2021)

Results
All models are evaluated by computing the mean and standard deviation (StD) of the density ratios, defined as "observed-over-calculated" or "O/C", in natural log space (ln O/C)). The metrics are described in . The mean of the density ratios represents the bias of a model with respect to the dataset used in the assessment. The model bias thus defined is not an entirely objective quality indicator, and one must consider whether the dataset was ingested into the thermosphere model, and if that was done in an absolute or relative manner. Also, the processing standards applied in the production of the dataset, in particular, the satellite shape and aerodynamic coefficient, because the absolute values of the densities depend on them. The standard deviation of the density ratios, computed as a percentage of the observation, represents a combination of the ability of the model to reproduce observed density variations and (geophysical and instrumental) noise in the observations. The signal-to-noise ratio of the density data is inversely proportional to the density level, i.e., it increases with altitude.
In the following sections, the relative precisions of the models are evaluated by comparing to the complete datasets in terms of the mean and StD of the density ratios computed with all observations of a dataset, on a yearly and daily basis. The yearly statistics are computed by comparing to all data per calendar year, i.e., computing one density ratio per year. The StD is then computed with the yearly density ratios in log space. The daily statistics are obtained similarly. More specific information on errors can be obtained by binning the density ratios in latitude, local time, 81-day mean F10.7/F30, and Kp. Analysis of the mean and StD of the ratios may give clues about specific modeling errors or data inconsistencies. To eliminate obvious outliers, density ratios are rejected in all assessments if they do not lie in the interval from 0.4 to 2.5, and 0.333 to 3.0 for TLE data.
The density datasets may be inconsistent due to errors and approximations in the satellite geometry models, the aerodynamic force (drag and lift) coefficients, and the surface-to-mass ratios used in their derivation. This will cause the model to be biased concerning those datasets. This inconsistency will diminish when a more physically accurate calculation for the drag force is developed and applied consistently, and all datasets reprocessed using such a standard. However, it is a vast topic for which no consensus is reached, and it cannot be addressed in this paper.

Model precision: overall
In this first overall assessment, the relative precisions of the models are calculated by comparing the full datasets in terms of mean and StD of the density ratios computed with all observations of the datasets in Table 1. The results, listed in Table 2, are also given for the intermediate DTM2020 model constructed with F30 and Kp to quantify the better performance of DTM2020_Res compared to DTM2013 before 2000. Most of the DTM2020 intermediate and research model coefficients are identical, particularly those related to solar activity, and, therefore, the performance under pre-year 2000 conditions is nearly identical. The table is divided into two parts: the top part shows results of comparisons with ingested datasets, while the bottom part presents comparisons to datasets that were not used to construct the DTM models.
As explained in Section 2, in DTM2020, the CHAMP, GOCE, and Swarm A density datasets are presently ingested without scaling factors. The results in Table 2 are obtained using smaller densities, and the 0.2-0.3 bias in the MSIS00, JB2008, and DTM2013 density ratios attests to this. DTM2020 fits the data well, with mean density ratios very close to unity. The StD of the density ratios is the smallest (in bold) for the DTM2020 models. The biggest improvements are seen in comparisons with GOCE and Swarm A data, and the former densities at an approximate altitude of 250 km are now reproduced with less than 10% StD.
The TLE densities, on the contrary, fit NRLMSISE-00, JB2008, and DTM2013 well on average, whereas the DTM2020 models, as expected, underestimate them by 20-35%. The JB2008 and DTM2020_Res models cannot be compared to the complete TLE datasets because the drivers are not available that far back in the past, and therefore, the comparison was made from 2000 to 2013. DTM2020 reproduces the TLE densities with significantly smaller StD than DTM2013.
The rightmost column of Table 2 shows NRLMSIS 2.0, which except for Swarm A, always has a slightly higher StD than NRLMSISE-00. This is not surprising since essentially new stratosphere and mesosphere data were ingested, but no new thermosphere data. However, one of the consequences of the assimilation of lower atmosphere data in NRLMSIS 2.0 is the lower density in the thermosphere. While still higher than DTM2020, it is important to note that the models converge independently to lower densities.
The low-altitude Satellite Electrostatic Triaxial Accelerometer (SETA) and Atmosphere Explorer Miniature ElectroStatic Accelerometer (AE MESA) densities (below 250 km, and down to 130 km) are presently more precisely reconstructed with DTM2020 than with DTM2013, and the smallest StD is even achieved with the intermediate DTM2020. The best results are obtained with NRLMSISE-00, which was fitted to those datasets. Note that the biases of NRLMSIS 2.0, from 10-26% (for AE-C MESA), are in rather good agreement with Table 2. Mean and StD, the latter expressed as percentage (smallest values within 0.5% in bold), of the density ratios of the models comparing to full datasets. "DTM2020 Intermed" is the intermediate model displayed in Figure 3. DTM2020. Starlette densities at 800 km altitude are wellmodeled with DTM2020 and best with NRLMSISE-00; the StD of JB2008 is high, 39.3%, and this is due to the absence of predicting a winter Helium bulge (Keating & Prior, 1968). The StD with Stella data, also at about 800 km, understandably is high for the same reason. The HASDM data from 2000 to 2019, selected at four altitudes (250, 400, 475, and 800 km) typically covering GOCE, CHAMP, GRACE, and Stella, is used in the assessment. The data was partly used (2001)(2002)(2003)(2004)(2005) in the construction of JB2008, which has small biases. Below 500 km, DTM2020 densities are 17-28% smaller on average, whereas DTM2013 is 5-10% overestimating. The NRLMSISE-00 and NRLMSIS 2.0 mean density ratios decrease considerably from 250 to 475 km. The StD is smallest with DTM2020, in line with the results for GOCE, CHAMP, and GRACE. The unexpectedly high StD with JB2008 and the high data rejection result from very bad predictions shown in the next section for 2018 and 2019. At 800 km, the results need some clarification. The latitude-local time structure of the HASDM densities at 800 km does not agree with the MSIS and DTM models. Under solar minimum conditions, the winter Helium bulge (Keating & Prior, 1968) is absent in HASDM, which explains its absence in JB2008. The effect of Helium maximizing in the winter hemisphere, causing the total density maximum to be located in the winter hemisphere under (very) low solar activity conditions, can be directly observed even in Swarm and GRACE densities. Therefore, when comparing to the MSIS and DTM models, which both predict maximum density in the winter hemisphere, the StD reflects this altered structure which explains why it is so large. However, the mean density (on time scales of years) should be correctly determined in HASDM. The NRLMSIS models have negligible biases, and the DTM2013 bias of 12% is much reduced now with DTM2020.  S. Bruinsma and C. Boniface: J. Space Weather Space Clim. 2021, 11, 47 Finally, the EDR densities are reproduced with biases ranging from 15 to 29% with DTM2020, but the StD is smallest and meaningful progress compared to DTM2013 has been made.

Model precision: per year
The comparison in the previous section gives a good general impression of the model performance. Computing mean and StD of the density ratios per year makes more detailed analysis possible, particularly the performance over the solar cycle and as a function of altitude. The yearly mean density ratios are shown in Figure 4 for GOCE, CHAMP, GRACE, Swarm A, TLE 250 km, and HASDM at 250 and 475 km. The assessment is not done for NRLMSIS 2.0 because the results in Section 4.1 were shown only to place the lower DTM2020 densities in a present-day perspective. Observations before 2000 were not used because the JB2008 and DTM2020_Res drivers are inexistent.
The offset between DTM2020 and NRLMSISE-00, JB2008 and DTM2013 is evident in all frames of Figure 4 and reflects the 20-25% lower density estimates of the new models. Another salient feature common to all models is a dip, more or less deep, from 2007 to 2009 during the minimum transit from solar cycle 23 to 24. Due to the low solar EUV emissions, which were overestimated with all proxies, and most with F10.7 (Solomon et al., 2010). Consequently, all models overestimate density and NRLMSISE-00 most because it uses F10.7 and is not fitted to data in that period. F30, shown in the bottom left frame of Figure 4, also overestimated the  EUV flux during the last minimum, but it is smaller than F10.7. The dip is less obvious in JB2008 because one of its drivers, S10 and its 81-day centered mean S81c, is modified (decreased) for the solar minimum years to better-fit orbit data. A similar scheme seems to have been employed for the minimum years 2017-2019. But the extremely low values of S81c (less than 55 sfu) cause serious underestimation at 250 km so that applying our always identical thresholding in the comparison procedure led to more than 60% rejection of the 2018-2019 HASDM data at 475 km (also at 400 km; not shown). The JB2008 assessment presented in the DTM2013 paper (Bruinsma 2015), using the specific indices files made available, also revealed similar density under-estimation issues for 2010 and 2011. In general, DTM2020 behaves well under all solar activity conditions without having recourse to adjusting the single solar activity driver F30. Below 400 km during solar minima, density is approximately 10% overestimated. This is a consequence of using a proxy for EUV emissions. The DTM2020_Oper model struggles with GRACE and Swarm A and is not as stable over the years as DTM2020_Res or DTM2013 because it uses the F10.7 proxy. The comparison with independent HASDM data confirms the good performance of DTM2020 and DTM2013 (which uses the F30 proxy as well). The StD of the yearly mean density ratios displayed in Figure 4 (i.e., of the 10 CHAMP density ratios, 4 GOCE density ratios, etc.) is listed in Table 3. The overall lowest StD, indicative of the smallest scatter of the density ratios from year to year, is achieved with DTM2020_ Res, which is slightly lower than DTM2013. Figure 5 displays the StD per year for GOCE, Swarm A, and GRACE at the top, and HASDM at 250 and 475 km at the bottom. The StD in the top frames, which shows a comparison to ingested data, is nearly always smallest with DTM2020_ Res, and a significant improvement compared to DTM2013 is achieved. It also reveals a high StD of JB2008 for GRACE during solar minimum years 2007-2009, during which NRLMSISE-00 has the smallest StD, probably due to its more accurate modeling of Helium (consistent with Starlette results in Table 2). The HASDM results shown in the bottom frames of Figure 5 validate the low StD of GOCE with DTM2020. They also clearly show that StD and solar activity are inversely proportional. The external validation confirms that DTM2020 has the lowest StD for low to medium solar activitywhich are the conditions for most ingested datawhereas JB2008 has the smallest StD under high solar activity (also for CHAMP in 2001; not shown), especially at 250 km. However, for the decaying phase of solar cycle 24 to the minimum from 2017-2019, the StD of JB2008 goes off the chart.
The assessment at 800 km altitude is done with Stella (ingested) and the external Starlette and HASDM at 800 km data. Figure 6 presents the yearly mean density ratios for Stella, Starlette, and HASDM at 800 km and the StD per year for Starlette only. In this comparison, DTM2020 agrees with Stella and Starlette within ±10% except for 2001, which seems to be due to a problem in F30 (it is also visible in DTM2013). Comparison with HASDM shows a strong signal correlated with solar activity in JB2008 and the least in NRLMSISE-00. The StD of the yearly mean density ratios displayed in Figure 6 are also listed in Table 3; DTM2020 is considerably more accurate at 800 km than DTM2013. The yearly mean StD with Starlette confirms the highest precision of NRLMSISE-00 at 800 km, but the DTM models are close, whereas JB2008, which does not predict the winter Helium bulge, has a very large StD.

Model precision: per day
The StD of the daily mean density ratios is computed per day and listed in Table 4 for GOCE, CHAMP, TLE 250 km (from 2000 to 2013), and HASDM 250 km. DTM2020_Res has the lowest values, and a significant improvement of around 10% concerning DTM2013 has been made in reproducing daily variability. As expected, values lie between the overall (highest, using all high-cadence data) and yearly mean StD values. The yearly and overall statistics enable us to quantify how StD evolves on 3 h to yearly time scales and shorter to multi-year times scales through extrapolation (but this is a risky operation). An example is shown in Figure 7 using HASDM 250 km data. The JB2008 StD is the highest, varying from 17.4% at 3-h time scales to 12.4% at yearly scales. However, it is lowest when the years 2018 and 2019 are rejected, which have erroneous indices, varying from 2.9% to 11.1%. The figure clearly shows the climatological nature of the models, reflected by the decreasing StD when increasing the time scales.

Model precision: geomagnetic storms
The assessment under geomagnetic storm conditions is done according to the procedure and on the 13 storms described in , but with four additional storms at the solar cycle maximum (31/03/2001; 06/11/2001; 23/05/2002; 07/09/2002). In the procedure, storms are divided into four phases relative to the time of minimum Dst: 12 h preceding onset = phase 1; onset = phase 2; recovery = phase 3; poststorm = phase 4). The models are first de-biased concerning the observations via a scaling factor, then applied to the model densities. The scaling factor is the ratio of the sum of all observations to the sum of all model densities in phase 1. The de-biasing procedure aims to minimize the effect of scaling issues between datasets and non-storm related model errors on the assessment The mean and StD of the density ratios for each phase and the complete interval, as well as the amplitude and timing of the S. Bruinsma and C. Boniface: J. Space Weather Space Clim. 2021, 11, 47 maximum density peak, are computed. Orbit-averaged CHAMP, GRACE, and GOCE density data for 17 storms are used in this assessment. The results are given in Table 5, which lists the mean and StD of the density ratios for the complete events (phases 1-4), the storm phases 2 + 3, amplitude, the timing of the peak, and correlation coefficient. DTM2020_Res reproduces the storms significantly more accurately than DTM2013, with smaller biases and standard deviations, more accurate amplitude, and higher correlation. The density peak amplitude error at high solar activity is halved. The timing error is relatively important, unexpected since the model uses geomagnetic indices with hourly instead of 3-hourly cadence. DTM2020_Oper underestimates density and has nearly identical standard deviations and correlation as DTM2013, but more accurate amplitudes.

Specific comparisons (binning)
Binning the density ratios in certain model input parameters facilitates testing the model quality concerning specific variations. In the following sections, the most relevant and revealing comparisons are shown. Note that binning provides basic but incomplete information on modeling errors because sorting data in specific bins only partly averages out model errors of different origins.

Solar local time bins
In Figure 8, the density ratios of CHAMP and HASDM at 400 and 250 km are binned per 6-h local solar time. There are no systematic errors visible for any of the models. The StD of all models is always highest in the midnight to early morning sector. DTM2020_Res has the smallest StD in all sectors. Also, it has a clear minimum StD in the afternoon sector, which is in line with expectations about errors due to density variability at spatial scales smaller than the models' resolution (Bruinsma & Forbes, 2008).

81-day mean solar flux bins
The longest dataset, TLE 250 km, is used to bin the density ratios in 81-day mean solar radio flux. The DTM2020_Res model cannot be directly tested due to the absence of Hp60 over the entire interval, but the solar activity-related model coefficients are the same as in the intermediate DTM2020 model. Five mean solar flux bins are defined: 50-100, 100-130, 130-160, 160-190, and 190-290 sfu, and the results are shown   Clim. 2021, 11, 47 in Figure 9 (top left frame). DTM2013 reveals a small trend of about 5%, while the biases of the DTM2020 models hardly vary with solar activity. NRLMSISE-00 density ratios exhibit a clear dependence on solar activity level, and the trend from medium (100-130 sfu) to high (190-290 sfu) solar activity is about 10% (from 1.04 to 0.94). The StD of the density ratios is highest under low activities for all models. The DTM2020_Oper model has the lowest StDs except under low activity, for which the best result is obtained with DTM2020_Intermed. It is unclear presently why F30 does not achieve lower StD at high activity. HASDM densities at 250, 400, and 475 km covering almost two solar cycles have also been used in this assessment. The HASDM 250 km and TLE 250 km results do not fully agree, the small trends from low to high activity in particular, but differences are less than ±5%. The StD, while much larger with TLE 250 km, is largest for low solar activity and reveals a minimum at the medium or medium-high activity for the DTM models, while the StD of NRLMSISE-00 and JB2008 decreases from low to high activity monotonously. Except under low solar activity, JB2008 has the smallest StD. At 400 km, the DTM2020 density ratios do not present a trend, but a v-shaped signature of ±5% around the mean is visible; JB2008 displays a variation of a few percent. The StD of the models presents rather comparable behavior to the 250 km results, except that DTM2020_Res now clearly has the smallest StD for low and low-medium solar activity. The results at 475 km are comparable (not shown).
Comparison with Stella at about 800 km leads to similar v-shaped signatures in the DTM density ratios as with HASDM at 400 and 475 km, namely a minimum for the 130-160 sfu bin and higher values for low and very high activity. The StD, excepting JB2008, is now the smallest for low solar activity. This non-intuitive result may be due to the data weighting scheme (Stella weights optimized for Helium, i.e., low solar activity) and slight incompatibilities with GRACE and Swarm A for DTM, but it cannot explain the same tendency in NRLMSISE-00. The StD of JB2008 does exhibit the highest values for low activity since Helium variability as a function of solar activity is modeled incorrectly (i.e., Helium maximizing in the winter hemisphere).

Geomagnetic activity bins
The HASDM density ratios shown in Figure 10 at 250, 400, and 475 km are binned in Kp representing low (Kp < 2), medium (2 < Kp < 5), and high (Kp > 5) geomagnetic activity. At 250 km, the DTM2020 models do not have a large difference S. Bruinsma and C. Boniface: J. Space Weather Space Clim. 2021, 11, 47 in bias between low-medium and high activity, and the StD at high geomagnetic activity is 2% lower than for DTM2013. DTM2020_Res displays the same behavior at 400 and 475 km, and it has the lowest StD, whereas DTM2020_Oper underestimates density at the highest activity. The comparison with the independent HASDM data confirms the good DTM2020 results of the geomagnetic storm assessment presented in Section 4.4.

Summary and conclusions
Both DTM2020 models have been constructed using the same accelerometer-inferred densities and daily-mean density data after 2000, but DTM2020_Oper is also directly fitted to historical data already assimilated in DTM-2000 (mass spectrometer data in particular; Bruinsma et al., 2003), and the global mean densities at 250 km inferred from TLEs spanning four solar cycles. The CHAMP, GOCE, and Swarm A densities are the most accurate total density datasets thanks to detailed satellite shape models and high fidelity numerical models for the aerodynamic coefficients. They constitute the most accurate absolute density scale, 20-30% smaller than DTM2013 and JB2008 (HASDM) densities. Smaller absolute densities are also predicted with NRLMSIS 2.0 but through assimilation of data in the mesosphere instead of the accelerometer-inferred densities.
Two combinations of solar and geomagnetic indices were used to develop DTM2020: the classic combination of F10.7 and Kp is used in DTM2020_Oper, whereas the drivers for DTM2020_Res are the 30-cm radio flux F30 and the new planetary geomagnetic index Hp60. The latter model is more precise, but the F30 measurement and production are not considered sufficiently permanent and unfailing for use in an operational environment.
DTM2020_Res is the most precise when comparing to assimilated data. For CHAMP, GOCE, Swarm A, and TLE 250 km, it performs appreciably better than DTM2013, with StD of the density ratios 8%, 22%, 13%, and 9% smaller, respectively (Table 1). For Stella data at higher altitude, DTM2020_Res (10%) and DTM2020_Oper (59%) perform better than DTM2013, too, when considering the yearly means (Table 3). DTM2020 also reproduces the external density datasets more precisely than DTM2013. The EDR density datasets are precisely reproduced with DTM2020_Res, while DTM2020_Oper is significantly less precise but still better than DTM2013. Comparison to HASDM data at 250, 400, and 475 km altitudes confirms that DTM2020_Res is the most precise DTM model. Comparison to HASDM 800 km densities showed that DTM2020_Oper performs better than DTM2020_Res at higher altitude, that both are more accurate than DTM2013 (71 and 34 % more accurate, respectively), and that MSIS is the most stable over the 20 years (Table 3). We speculate that this is brought about by more realistic modeling of Helium.
Geomagnetic storms are also reproduced much more accurately with DTM2020_Res thanks to the new Hp60 indices than with DTM2020_Oper and DTM2013, which have equal performance (both use Kp). The mean and StD of the density ratios and the amplitude of the storm peak all support this conclusion (Table 5).
Taking all of the above into account, DTM2020_Res reproduces total mass neutral density most accurately overall, and it is a significant improvement over DTM2013 under all conditions. Applying corrections to F30 during solar cycle minimum conditions (i.e., decreasing F30 by a few percent), which was not done in this project, could improve model accuracy further by reducing or even eliminating biases. Because of the sparseness of precise and high-resolution density data, (semi-empirical) model improvement is still possible and necessary, particularly from 100 to 250 km and around 800 km (where most Earth observation satellites operate). However, high-resolution density data will remain a scarce commodity in the immediate future, and its provision will remain dependent on missions with other objectives, i.e., data of opportunity. Such high-resolution data is not available at 800 km at all. Secondly, it is essential that all density datasets available presently are processed according to a common standard to make them consistent. That will facilitate the adjustment and improve the accuracy of the resulting model because the model coefficients can be fitted more precisely.