Issue
J. Space Weather Space Clim.
Volume 6, 2016
Statistical Challenges in Solar Information Processing
Article Number A14
Number of page(s) 21
DOI https://doi.org/10.1051/swsc/2016007
Published online 04 March 2016

© M. Schöll et al., Published by EDP Sciences 2016

Licence Creative Commons
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. Introduction

Changes in the spectral solar irradiance (SSI) strongly affect the upper atmosphere, but also impact climate. In particular, a consistent dataset of SSI changes serves as an input for climate models (Lean et al. 1995; Ermolli et al. 2013; Thuillier et al. 2014) as well as a central parameter for space weather predictions. These models are in need of datasets spanning several decades of continuous and radiometrically accurate measurements. However, satellites measuring SSI usually have a life span of a few years up to a decade, before instrument degradation, equipment failure, or end of mission financing stops the data flow. Not a single instrument has measured SSI for two or more solar cycles, and even for those instruments that have measured an entire solar cycle, e.g. SUSIM,1 instrument stability is not sufficient to properly assess the existence of long-term trends.

To merge SSI observations and reconstruct missing observations, we need models. These models are often driven by solar proxies. Frequent choices include the sunspot number, which has been measured since the beginning of the 17th century, solar radio flux measurements, with a popular example being the 10.7 cm radio flux available since 1947, and spectral ratios such as the MgII core-to-wing ratio (Viereck et al. 2004). There are two kinds of models: empirical ones, where one or more solar proxies are directly fitted to data using some criteria of best-fit, and semi-empirical ones, which involve physical modeling of some solar phenomena such as surface flux transport of the solar photospheric magnetic field. In this latter case, the proxy data are coupled to some underlying physical quantity. For example, sunspot area is used to infer magnetic field strength, which in turn is used to estimate the SSI (Domingo et al. 2009; Ermolli et al. 2013).

Both empirical (e.g. Tobiska 2004) and semi-empirical (e.g. Yeo et al. 2014) designs have been used for various SSI models. Proxy-based models can be very accurate for the calibration time period, but they are usually calibrated and compared to a specific instrument. For example, the semi-empirical NRLSSI (Lean 2000) model is calibrated against UARS/SOLSTICE. Relying on a single instrument makes it difficult to distinguish between the solar signal and instrumental artifacts; it also limits the long-term temporal calibration to two solar minima.

A homogeneous uncertainty estimation is essential for the delivery of a single composite. Such a single composite is one of the explicit goals of the SOLID project. In Bayesian statistics, the influence of the data is proportional to the uncertainty of the data. As such, it is important that the same methods are used for each instrument to determine the uncertainty. However, even though data providers usually present the accompanying uncertainties, their uncertainty budget often accounts for different sources of uncertainty. These include the standard deviation over short (sub-daily) time spans, modeled instrument response functions, aspects of manufacturing, and calibration constants. All these sources determine the final uncertainty budget. Ideally, proxy-based models would be simultaneously calibrated to different instruments covering several time spans. Yet, there are many obstacles that have prevented scientists from doing so, including instrumental artifacts, uncertainty of the instrument stability, and the offset to the true zero. Addressing these challenges requires a common estimate of uncertainties, and a common data format that would allow for a combination or comparison of different instrumental data would certainly be helpful.

DeLand & Cebula (2008) already provide an SSI composite of the ultraviolet (UV) based on six instruments, specifically the SBUV instrument on Nimbus-7, SBUV/2 on NOAA-9 and NOAA-11, SME on board SOHO, and the SUSIM and SOLSTICE instruments on UARS, together with synthetic proxy-based models in the case of missing instrumental data. Out of these instruments they selected one specific instrument for different temporal (November 1978 to August 2005) and spectral (120–400 nm) regions. The uncertainties were directly derived from the original data. However, the instruments chosen limited the temporal and spectral ranges, and, due to the methods of combining the data, instrumental artifacts were not accounted for in the composite.

To obtain homogeneous SSI datasets, our approach is thus as follows. We develop a coherent methodology to estimate the precision and stability for all datasets, and we evaluate the uncertainties induced by each method. We rescale all data into a common time grid with daily resolution and remove non-physical values. We take advantage of spectral coherency to estimate missing data in one wavelength band from a combination of other bands that display a similar temporal variability. We detect instrumental artifacts by taking advantage of an autoregressive model and combine the model with proxies as a qualitative indicator. Finally, we use proxy interpolation to handle data gaps.

Our final product consists of three datasets per instrument: first, the original, unified dataset converted to common units including independent precision estimates; second, a homogenized and stratified “close-to-original” dataset that does not change any data except for re-gridding in time, accompanied by independent uncertainty estimates and quality flags; and third, a “best-estimate” dataset where missing data and identified outliers have been replaced by values obtained from proxy-based models. This systematic approach to attain homogeneous SSI datasets in a unified format is part of the First European Comprehensive SOLar Irradiance Data Exploitation project (SOLID, http://projects.pmodwrc.ch/solid), whose objective is to attain a long-term SSI composite of daily resolution covering the satellite space-age era by utilizing all available data. This can be used by modelers of the Earth atmosphere and climate, researchers in stellar physics, planetary science or astrobiology, and as a cross-comparison for other models and instruments.

Subsequently, Section 2 provides, after a short historic overview of SSI measurements, a description of all data incorporated in SOLID. Section 3 presents our uncertainty estimations. Section 4 details our pipeline approach, consisting of the pre-treatment of the data (Sect. 4.1), the interpolation of missing data (Sect. 4.2), the re-gridding of the data (Sect. 4.3), the outlier detection (Sect. 4.4), and the proxy-based interpolation (Sect. 4.5). Section 5 specifies the content, availability, and the future of the SOLID database, while Section 6 concludes our paper. For better readability, we include most of our tables in Appendix A, followed by a formal description of the data format in Appendix B and our algorithms in pseudo-code in Appendix C.

2. Data

Scientists have long struggled with making long-term SSI observations. As Schmidtke (2014) describes in his historical overview of SSI measurements, the first attempts at obtaining SSI were not suitable for long-term recording of the variable solar spectrum. While E. Regener and V. H. Regener, in 1932, used balloons flying up to an altitude of 33 km to obtain a spectrum of the visible to the far ultraviolet (UVC, 200 nm–280 nm) region, Baum et al. (1946, cited by Schmidtke) used V2 rockets to obtain the first SSI measurements in the extreme ultraviolet (EUV) band. Though innovative, their methods, based on photographic film, were restricted by the necessity of instrumental retrieval. Therefore, they represented only a snapshot of solar activity. Del Zanna & Andretta (2011) collected an exhaustive list of rocket flights, measuring the EUV. It was the development of photoelectric diodes in the 1960s that allowed for satellite-based long-term measurements. Using these diodes, the EUVS experiment aboard the AE-E satellite showed, for the first time in history, the variability of the EUV from 20 to 185 nm over the course of 3 years from a solar minimum in 1976 to a solar maximum in 1979 (Hinteregger 1981). This was the beginning of long-term SSI measurements. The first instrument observing a full solar cycle was UARS/SUSIM (Upper atmosphere research satellite/Solar Ultraviolet Spectral Irradiance Monitor), 1991–2005 (Rottman et al. 1993).

Starting with data from the AEE/EUVS experiment in 1967, we have collected and processed data from 26 instruments over a wavelength range from 4 nm to 2.4 μm. Our space-based observations cover the time range of 1977–2015. We also provide nine reference spectra, with our first spectrum by Arvesen et al. (1969) and our latest one by Woods et al. (2009), as well as 14 proxies of solar activity. We utilize these proxies to enhance instrumental data by detection of possible outliers (Sect. 4.4) and by interpolation of missing data (Sect. 4.5). Figure 1 presents an overview of all instruments, reference spectra, and proxies available in the SOLID database. Not all existing data have been included, either due to bad quality or missing availability. Notably, the Airglow Solar Spectrometer Instrument (ASSI) data from the San Marco mission and the EUVS from the Orbiting Solar Observatory (OSO) III mission are not included. Other data, such as the broadband filter radiometer LYRA on PROBA2, will be used to verify the composite.

thumbnail Fig. 1.

A graphical summary of Table A.1 (Instruments), A.3 (References), A.4 (Proxies), and A.5 (Composite datasets) representing almost all the datasets incorporated in SOLID versus time and wavelength. The upper panel presents all available proxies, together with two spectral composites. The lower blue line, labeled Radio, represents the 3.2, 8, 10.7, 15, and 30 cm radio fluxes from the Nobeyama and Penticton radio-observatories. The time range shown corresponds to the available times in the SOLID database, which truncates all proxies but the radio fluxes at 1969. The lower panel presents the temporal and spectral ranges for all instruments in the database. Vertical lines represent the time point and spectral range of the reference spectra. Instruments with fewer than 20 channels are shown as light boxes with a horizontal line at the wavelength of the central frequency of each channel. While the 21st century is well covered both in temporal and spectral dimensions, previous data are only available for selected wavelengths. Only the UV between 170 nm and 400 nm has been covered continuously since 8 November 1978.

To establish a common data format, we follow the Net Common Data Format (NetCDF) Climate Forecast standard, version 1.6,2 which provides standardized variable names, well-defined field names for global meta-information such as author, data source, and creation date, meta-information for each variable such as units, flag description for masks, and cross-references to other variables.

In Appendix A, we present an overview of all the data together with references, sources, and key information such as time spectral ranges below in tabular form. Table A.1 lists all instruments included in the SOLID database, Table A.3 contains all nine reference spectra, Table A.4 displays all proxies, and Table A.5 presents the three composite time series in the database. Two of the composite time series, the Woods et al. (2000) Lyman α composite and the Fröhlich (2006) TSI composite, are also used as proxies for SSI. The third composite, the aforementioned DeLand SSI composite has been included in the database, but will not be used for the construction of our composite. Instead we will use the original instrumental data. Altogether, we currently have 53 datasets in the database. Any new dataset incorporated into SOLID will also be made available at the SOLID homepage.

3. Uncertainty estimation

As discussed above, the definition of uncertainties and the inclusion of different uncertainty sources differ for each instrument. Hence, it is not surprising that the final uncertainty estimates vary considerably between instruments. A particularly pronounced example are the measurements of total solar irradiance, as shown in Figure 2, where uncertainties vary over three orders of magnitude and the highest uncertainties are given for the first fully-calibrated instrument, TIM, due to the inclusion of accuracy in its uncertainty estimates. In conclusion, any meaningful interinstrument comparison of uncertainties must take into account their sources and definitions.

thumbnail Fig. 2.

Instrument uncertainties for different TSI instruments. They differ by up to three orders of magnitude with the highest uncertainties for a modern instrument, TIM. This is due to different definitions used for what an instrumental uncertainty is. For that reason, these values cannot be meaningfully compared.

A few words on terminology are necessary first. While precision corresponds to what is commonly known as the random error or noise, stability and accuracy both make up the systematic error. Stability, here, describes a time-dependent estimate of the instrumental drift, in other words, a trend; accuracy stands for the offset from the “true zero” at the beginning of the mission, i.e. the bias of the absolute value. As the same concept often comes with different names, we provide our own naming convention in Table 1.

Table 1.

Different sources and properties of uncertainties, together with synonyms and their inverses. The unit of the uncertainties depends on the unit of the data, which is abbreviated as ud. Inverse is used in the sense that the larger the value of the uncertainty, the lower the value of the inverse. We also indicate whether we present our own estimation of the uncertainty and whether the uncertainty is time dependent. Time-dependent uncertainties are provided with the same temporal resolution as the data.

Since uncertainty estimates given by data providers vary enormously in definition and quality, it is necessary to provide an independent and homogeneous uncertainty estimate, which can be meaningfully compared between instruments. We provide two such estimates for precision and stability. There is no way to independently assess accuracy without resorting to external data, except as to refer to the absolute ground and on-board calibration. Therefore, only instrument teams can provide these estimates and we do not attempt to estimate them independently. If they are provided, we include them in the SOLID database.

As discussed above, uncertainty estimates provided by the PI (principal investigator) cannot be directly compared with each other. However, they do contain valuable information, such as satellite off-pointing or different integration time, and as such, they are also provided. Each PI presents these estimates differently: They can be provided as absolute or relative values for each data-point, as time- or wavelength-independent fractional values or percentages or as a global estimate. We considered two possibilities: either the uncertainty estimates are provided for each data-point of the time series, in which case we convert them (if needed) into absolute values, or the provided uncertainties are time- or wavelengths-independent, in which case they are expressed in fraction of the data. In the latter case, we did not multiply the uncertainty by each data-point (which would have provided an uncertainty estimate in the same format as in the first case) to reduce the file size. This results in the somewhat cumbersome situation that the uncertainties may differ in units and dimensions from one dataset to the other: in some cases, they have the same units as the timeseries, in other cases, they are unit-less fractions. The correct unit is always given in the corresponding field attribute.

As for precision, we provide both the original precision estimates given by data providers and our own estimate, which we describe here. Most data display either Poisson or Gaussian noise statistics or a mix of the two. Noise is generally estimated from the high-frequency component of the data, where its impact is often strongest. As a consequence, estimators usually involve some time differencing, followed by a measurement of the dispersion, for example by taking the standard deviation. Today, the wavelet methods (e.g. Mallat 2009) are among the most powerful means for estimating the noise level. The general idea, which was formalized by Donoho & Johnstone (1995), is to decompose the time series into different time scales by means of a discrete wavelet transform, then to consider wavelet coefficients wi that correspond to the smallest time scale only, and finally to use median (|wi(x)|)/0.6748 as an estimate of the noise level. The use of the median allows one to exclude the few unusually large wavelet coefficients that may be associated with sudden transients, such as sunspot darkenings, and to focus only on the bulk of the fluctuations. The 0.6748 correction factor allows this quantity to be equated with the standard deviation in the case of additive white noise. For details regarding the wavelet decomposition we refer to Donoho & Johnstone (1995). The choice of the mother wavelet, i.e. the unscaled wavelet, is not critical here, as long as its regularity is sufficient. In our case, the 4th order Daubechies wavelet gives satisfactory performance.

While this estimator works well for quantifying random errors stemming from counting processes (i.e. with a Poisson distribution) and, more generally, for white noise, it does not provide a good estimate of the colored noise level, whose power spectral density is not flat. Noise color is defined by the power β of the spectral density per unit frequency. White noise is defined by a flat spectral density (β = 0), while integrated white noise, also known as Brownian noise, corresponds to β = −2 (Pink noise corresponds to β = –1). On the other side of the spectrum we have differentiated white noise, namely violet noise, that is defined by β = 2 and describes noise that varies more strongly in the short term. Blue noise is defined as β = 1.

Using this definition of noise, we modify the Donoho noise estimator to take into account all the wavelet coefficients (named Donoho-FULL). This underestimates blue and violet noise levels. Our solution is to adapt the Donoho noise estimator by combining the Donoho and the Donoho-FULL estimator using a weighted average, yielding a good estimate of white, blue, and violet noise, as shown in Figure 3. The weights are determined empirically such that they provide a best fit over . This does not only estimate the white noise level correctly, but also blue and violet noise levels which may be present in the case of a positive feedback loop.

thumbnail Fig. 3.

Comparison of the three presented noise estimators for different colors of noise. We show the original Donoho noise estimator (solid black), a wavelet estimator that takes into account all wavelet scales (Donoho-FULL, short-dashed red), and our adapted Donoho estimator (long-dashed blue). The β value, representing the power of the spectral density, is varied from −2 (Brownian noise) to 2, where 2 is violet noise (1 would be blue noise). Each point represents the estimated noise level of a one thousand point randomly generated time series of the specified power. The spread indicates the standard deviation of the estimator from one hundred trial runs. While the classical Donoho estimator overestimates all noise for β > 0, but using all wavelet components underestimates the noise of the same data, the adapted Donoho estimator provides a better estimate.

However, we are not aware of a reliable method to determine Brownian noise level without resorting to external data. As such, we cannot take into account possible Brownian contribution, despite the fact that most observations are likely to have a negative β. Therefore, for consistency, and although we know that this is not optimal, we assume our data to be free from Brownian noise and use our adapted Donoho estimator as a systematic means for comparison. Our stability estimator does use external data in the form of proxies. Hence, the red noise can be estimated as part of our stability estimate.

Another important limitation of our precision estimator is that its upper limit equals the variability of the data. In other words, our signal-to-noise ratio never gets below 1, as shown in the lower panels of Figure 4 in the visible wavelength range.

thumbnail Fig. 4.

(a) Measured SSI (black) of UARS SOLSTICE at 248.5 nm (left) and SME at 214.5 nm, along with the precision provided by PIs (long-dashed red) and our own estimate of precision (short-dashed blue). We also mark the times where we have detected outliers (red circles). Two strong outliers are visible in June, while two possible, but unlikely outliers are around 20 February 1996. The interpolated data from end of January have also been correctly classified as outliers. (b) The 1 − σ standard deviation of the spectral variability (solid black) for each wavelength and the provided (long-dashed red) and estimated (short-dashed blue) averaged uncertainties per wavelength of the two instruments from (a), but for the full time-range (3 Oct 1991 to 29 Sept 2001 for UARS/SOLSTICE and 8 Oct 1981 to 12 Apr 1989 for SME/UV). As discussed in the text, our own estimates are, by definition, bounded by the variability of the data.

Thus, our precision estimation consists of this adapted Donoho estimator, calculated with a running window of a maximum length of 100 days, and the uncertainties induced by interpolation (Sects. 4.2 and 4.3) and the outlier detection routine (Sect. 4.4). The maximum window length of 100 days has been chosen to allow for temporal changes in the precision, while keeping the statistical properties stable. The full algorithm is given in Section C.2.4. We include estimates of these uncertainties for each data-point in the final uncertainty budget. The combined precision is presented in the NetCDF attribute solid_precision (Sect. B.3.6).

As for stability, that is the uncertainty on the drift, the problem is a bit different. Instruments that have on-board monitoring systems (using calibration lamps or star observations) can provide an estimate of the drift with its uncertainty. Not all datasets, however, contain such stability estimates. Furthermore, the stability estimates for some instruments rely on assumptions based on the correlation between irradiance and solar proxies (e.g. the SBUV datasets). Although we cannot provide a completely independent estimate of the stability, there is a need for a common and homogeneous stability estimate for each dataset. The estimation of the stability (i.e. long-term uncertainties) is a very challenging task and can only be formally done through instrument intercomparison. Meanwhile, we can estimate medium-term uncertainties (on a yearly time scale) and use this as a first guess for stability. We do this by fitting each time series with a combination of proxies (Magnesium II, daily sunspot area, and the radio fluxes at 3.2 cm, 10.7 cm, 15 cm, and 30 cm) and by distinguishing the short (<108 days) and long (>108 days) time scales, the 108-day limit corresponding to four solar rotations. The fitting is designed in a flexible way (e.g. by including proxies that capture various solar features) in order not to force the irradiance to follow specific solar proxies and to keep the independent nature of the measurements. The stability at each time step is defined as the difference in the yearly slope of the observed and fitted time series at each wavelength. This method and its results are described in a forthcoming paper (Kretzschmar et al., in preparation).

The different components of uncertainty need to be combined into a single value. Here we present two different methods to calculate the absolute and relative uncertainty. Equation (1) determines the absolute uncertainty for a point in time, that is, taken the given value, it determines how accurate the data are in absolute terms, given accuracy, precision, and stability, while Eq. (2) calculates the uncertainties between two points in time to determine the accuracy of a trend.

(1) (2)with p representing precision, a accuracy as estimated by the PI, s stability, and t time, where t = 0 corresponds to the time of calibration.

As the discussion above shows, the accompanying uncertainties are an integral part of the data. To sum up, we provide the uncertainties as given by data providers, converted to standardized units, as well as our own independent estimates of precision in the form of the adapted Donoho estimator and stability of the data as estimated by the proxy-based model.

4. Methods

Here we present the pipeline for the data preparation for each individual dataset, providing methods that can be applied to all datasets and uncertainty estimations that allow different datasets to be compared in a meaningful way. Our pipeline approach entails the following steps: pre-processing, interpolation of missing data, re-gridding in time, outlier detection, and, finally, a proxy-based interpolation of erroneous or missing data. We provide uncertainty estimates for each step in the processing pipeline.

Figure 5 presents a graphical overview of all steps to construct the final composite, marking all steps discussed in this paper.

thumbnail Fig. 5.

A graphical overview of all steps to construct the final composite. The input data (red) are provided by instrument teams (References can be found in Appendix A). This work discusses the construction of the homogeneous individual datasets (green, numbers link to the corresponding sections), while the stability estimation (magenta) is discussed in a separate paper. The combination of all these datasets will be provided in the future (blue).

Appendix B includes a complete and formal description of the data format, while Appendix C provides the formal description of all methods in meta-code. The following sections describe the methods and the underlying rationale for their selection.

4.1. Pre-treatment

Our first step in the processing queue consists of the pre-treatment of the data: setting up a unified format, re-gridding the data on a two-dimensional time-wavelength grid, and removing the most extreme outliers. First, we convert the data format into one common format using the NetCDF Climate Forecast standard, version 1.6.

Next, we arrange the data on a two-dimensional time-wavelength grid with the axes conforming to wavelength in nanometers and time in days since 1980 without further data adjustments. As we will describe below (Sect. 4.3), we regularize the temporal grid at a later point in our pipeline. This approach is justified since most statistical methods either require or are easier to use when applied to regularly gridded data without any missing data.

Finally, it is necessary to remove obvious errors, that is values that are well outside the physical realm, as some methods, notably least-square fits, can be heavily influenced by those outliers. Figure 6 displays an example of a time series with such outliers. We flag and remove all data that are farther than 16 σ away from the mean. Even though the choice of 16 σ is somewhat arbitrary, it eliminates the most extreme non-physical outliers, while guaranteeing that we do not eliminate large amplitude transients, which are more frequent in the UV bands. The goal here is not to detect all outliers, but to eliminate the possibility of a single value dominating the statistics of a dataset.

thumbnail Fig. 6.

UARS/SUSIM at 239.5 nm as provided by the instrument team and the 16 σ thresholds (dashed). All points outside of the thresholds are marked as outliers, replaced by linear interpolation. Further possible outliers are selected and handled in Section 4.4. We define their uncertainties as the difference from the original data to the interpolated points. In the case of SUSIM, these outliers occur directly after instrumental data gaps, giving further credence to the possibility of instrumental, non-physical, outliers.

All these changes are tracked in an accompanying quality map of the same dimension as the data field (formal description in Sect. B.3.5) with its entries representing the sum of binary flags. The numeric value of each flag is a unique power of two – 1, 2, 4, etc. – and the entry on the quality map is the sum of all the values of the present flags. For example, a value of 12 represents two flags, corresponding to the values 4 and 8. Any processing applied to the data is indicated in the quality map.

4.2. Interpolation of missing data

Most instruments incorporated in SOLID feature missing data with notable exceptions being SME/UV and SNOE/SXP. Missing data result from various problems, including faulty transmissions, eclipses, temporary device failures, and outages when the satellite is put in safe mode. Whatever its origin, we need to systematically deal with it for each instrument.

To reconstruct missing data, we rely on the high coherency of spectral solar variability. This coherency implies a strong redundancy of the observed variability, in the sense that temporal variations in a given spectral band can be adequately reconstructed by linear combination of typically 1–5 time series. This property has been exploited by Dudok de Wit (2011) to build an interpolation scheme that relies on Singular Value Decomposition (SVD, Kalman 1996). What follows is a description of this method. First, we temporarily flag and replace all missing data by a two-dimensional linear interpolation. Theoretically, any value can be used. Other possible choices are the average of the data or random sampling from the underlying distribution of the data. The improvement gained by a good estimate is a faster convergence of our method. Once we have a preliminary estimate, we recursively replace the missing data with a first rank SVD approximation until convergence is achieved. When converged, we increase the rank of the SVD by one and replace all missing data with the second rank SVD approximation until convergence. This is done repetitively up to the 10th rank. Missing data are also marked as such in the quality map, using the flag Interpolated. Figure 7 presents an example of the resulting interpolation of SORCE/SOLSTICE at 144 nm. While most data gaps can easily be interpolated due to neighboring channels providing data, gaps due to instrumental outage, which affect the whole instrument, are also flagged as missing and temporarily interpolated linearly. For the final data product, we use proxies to provide the temporal information. However at this step we refrain from inducing any external data into the dataset. The error induced by this method is estimated via bootstrapping.

thumbnail Fig. 7.

Coherency interpolation at different points in time for the 144 nm channel of SORCE SOLSTICE A, which contains 65 channels in total (115 nm–179 nm). Each color represents a reconstruction of the 144 nm channel using a sequence of datasets of which some contain gaps for the days marked by a black horizontal line that were filled using the SVD interpolation. We show the original data (solid thick black), interpolated data where only this channel has been removed and re-interpolated (long-short dashed, blue), data where not only the 144 nm channel has been removed, but also its lower and upper 10 channels, i.e. everything from 134 nm to 154 nm (short-dashed green), and data where everything but the outermost channels (the two channels at 115 nm and 179 nm) have been removed for the selected days (long-dashed red). We also show our estimated precision (dotted black). The lower plot displays the error introduced by the SVD interpolation, smoothed over four days for better visibility. Due to the high spectral coherency, the error in the reconstruction is not significantly affected when removing not only the neighboring two, but 21 channels. Removing all but two channels does increase the error up to 100%, yet most data are still below our own estimation of the precision.

Table A.2 presents an overview of all data adjustments, including the interpolation of data. These adjustments result in a dataset without any missing value on a two-dimensional time-wavelength grid with accompanying uncertainties.

4.3. Re-gridding in time

This section describes our regridding of the datasets onto a regular time grid. This procedure is necessary for several reasons. First, the autoregressive model utilized in our outlier detection, described in Section 4.4, requires a regular time grid. Also, a regular time grid is helpful to directly compare data with each other and to apply multiscale decomposition methods, as planned for future work.

Luckily, most datasets are already available with daily resolution, centered at noon. Only three datasets, namely AEE/EUV, ENVISAT/SCIAMACHY, and ISS/SolACES, need to be adjusted in time. Another dataset, NIMBUS7/SBUV, has multiple data values for the same time. We take the average of these and adjust the uncertainties by calculating the standard deviation. Table 2 provides the statistics of these datasets.3

Table 2.

An overview of all datasets that are interpolated in time. We present minimum, median, maximum, and standard deviation of the absolute time shifts. Time is given in hours (hh:mm).

We re-grid these datasets to a constant time grid of daily resolution centered at 12:00 UTC via a linear interpolation scheme and estimate the interpolation-induced uncertainty by a linear error regression of the variability. For this, we assume an independent error propagation, and, since the wavelet noise estimator is an accurate short-term noise estimate, we use the wavelet estimator of the original nearest neighbor as the non-weighted starting point. This estimate is weighted by the distance of the re-gridding. A proxy-based interpolation scheme is used for the final interpolation (Sect. 4.5); linear interpolation is used in intermediate steps for its cheap computational costs and simple statistical properties.

This procedure presents a classic chicken-and-egg problem. While the autoregressive model, utilized in our outlier detection, requires a regular time grid, the regularization of this time grid, based on interpolation, dilutes precisely these outliers. Though this problem may be solved through an iterative approach, we prefer to first interpolate the data including the outliers at the risk of diluting the latter and to later remove them. Our procedure yields similar results, but is more efficient than the iterative approach. Furthermore, since our final interpolation, described in Section 4.5, does not distinguish between missing data and outliers, this problem does not affect the final data values, but only the accompanying estimates of the precision.

4.4. Outlier detection

The following section describes our outlier detection routine based on an autoregressive model (AR, Chatfield 2003). Figure 8 provides a quick overview of this procedure, consisting of the aforementioned AR model together with a proxy-based quality estimator to differentiate physical outliers from instrumental artifacts.

thumbnail Fig. 8.

Flowchart of the outlier detection with the possibility of physical outliers. First, we construct an autoregressive model of the time series. Then we flag all values whose difference to the model is greater than a given threshold n as possible outlier candidates. To account for the possibility of physical outliers, we relax this condition if proxies flag the same data as outliers.

thumbnail Fig. 9.

A sample of the three provided levels of UARS/SUSIM, the original data, named level 0 (a), re-gridded and missing data interpolated as level 1 (b), and our best-estimate, level 2, where outliers have been removed and re-interpolated (c). One solar proxy, Mg II, is shown for comparison (black dots). The error bars represent our estimate of the precision, including the induced uncertainty of each method. Also shown are the values of all set flags. For example, a value of 24 means that bits 3 (23) and 4 (24) are set, marking this data-point as both being detected as an outlier (8) and re-interpolated in the final interpolation (16).

Before addressing the procedure above in greater detail, it is necessary to discuss the nature of outliers. Having removed the most extreme outliers during pre-treatment (Sect. 4.1), we now aim at distinguishing instrumental artifacts from peculiar data corresponding to actual physical phenomena such as solar flares and huge sunspots. As they may have the same magnitude, they cannot be distinguished by a simple threshold. We also want to include sunspots but flag solar flares due to their short time spans, and, as such, their inclusion into each instrumental dataset strongly depends on internal characteristics like integration time and data selection, i.e. whether, for example, averaged daily data or a single daily measurement has been provided.

We resort to an autoregressive model to detect both instrumental artifacts and physical outliers. The approach is conceptually similar to that developed by Mann & Lees (1996), and it assumes that each record can be modeled by a linear time-invariant model. The discrepancy between this model and the observations is then used for detecting outliers. An AR model is a linear model of p + 1 parameters, which assumes that each point is linearly dependent on the p previous points, , with yt corresponding to the data at time t, b the coefficients of the AR model, and εt the error at time t. These models are specified by their order p, and the parameter b can be estimated by a least-squares fit to y (Priestley 1981). The higher the model order, the better the fit, but also the higher its complexity. To determine the optimal order of the model, we apply the corrected Akaike information criterion (AICC, Burnham & Anderson 2002, see below Eq. (3) for a version applied to AR), which weights the goodness-of-fit versus the complexity of the model. The lower C in AICC, representing “corrected”, accounts for the possibility of overfitting a data series of small size by multiplying the number of model parameters, p + 1 with a correction factor.(3)with n the sample size, p the order of the tested AR model, and σp the standard deviation of the AR model of order p.

For most models, the estimated AICC reaches its maximum for 4th order models. Hence, here, we construct a stationary 4th order AR model. However, it should be noted that, especially for the visible light spectrum, this parameter can increase significantly and that an adaptive parameter may improve the predictability and, thereby, the quality of the model.

Our model assumes that the Sun is a steady system perturbed by magnetic active regions that remain stable over the course of a few days. However, the Sun may also display fast transients that cannot be predicted by an autoregressive model, e.g. rapidly emerging sunspots and flares. Here, we account for the possibility of such physical outliers using an external reference in the form of solar proxies to relax the outlier detection criteria. Yet solar flares, as another type of physical outliers, are removed from our datasets, as our solar proxies do not contain flare information and the flare detection strongly depends on the instrument.

We flag all instrumental outliers and increase their uncertainty in three steps, proportional to our confidence that they are actual outliers. First, we use the aforementioned 4th order AR model to flag all outlier candidates, both instrumental and physical, and calculate the standard deviation of the differences, preliminarily flagging all data where the difference of the model to the data is greater than , with n = 4. Second, to account for the possibility of outliers being of physical origin, the parameter n is increased by 2 if at least 70% of our solar proxies are also flagged as outliers. Third, all such flagged data, as well as data that have been modified in a previous method, are temporarily replaced by linear interpolation. This procedure is repeated until convergence is reached. The linear interpolation is replaced by a more advanced scheme in a later step. Typically, this process converges within three steps, but the number of steps necessary for convergence increases when long data gaps have been interpolated in a previous step. We have empirically determined the required parameters, that is n, the increase by 2, and the percentage threshold of the proxies.

Finally, we convert the replaced data into uncertainties while keeping the original data in place, with the final step, coherency interpolation, only applied after all processing is finished. This is described below in Section 4.5.

The estimated uncertainty induced by outliers is defined as the absolute difference between the original and predicted data.

Applying the outlier detection routine as discussed above to our datasets yields results which we found to be physically consistent. We present an application of our routine to SOHO-SEM in Figure 10. As this figure shows, after the removal of the most extreme outliers, our method still detects not only several obvious instrumental artifacts, but also data-points that are below any meaningful direct threshold. In some cases, as in the case of SUSIM (Fig. 6), they appear before a data gap, giving further credence that they are due to instrumental artifacts. Our routine also correctly flags all linearly interpolated data as non-physical. Some values are possibly flagged as outliers erroneously. However, the proposed replacement barely differs from the original value, and, hence, it only slightly affects the induced uncertainty.

thumbnail Fig. 10.

Time series of SOHO-SEM data after application of the AR outlier removal routine (blue [25 nm], green [30 nm]). We show the original data (black) without any processing done and the input data to the AR routine (red). Red dots mark points that are detected as outliers. This method picks obvious errors (e.g. around February 2002 there are some negative values), linearly interpolated data (mid-end of 1999, January 1999), and solar flares. However, it also detects some points that are not obviously (and necessarily) outliers. Their uncertainty is increased by the difference to the interpolated data. The inlet displays the whole time series, January 1996 to June 2014.

4.5. Final interpolation of missing data

Finally, we interpolate all data that have been flagged using the coherency interpolation scheme as described above in Section 4.2, this time including proxies as additional channels. While this interweaves instrumental data with proxy data for all data-points that are missing from the original dataset, it attains temporal variability during times of instrumental outages when no channel has data.

5. Results

Based upon these methods, we provide three kinds of datasets: first, the original data converted into the common data format; second, the original data accompanied by uncertainty estimates and interpolated data; and, third, a “best-estimate” data product with possible outliers replaced by proxy interpolation. Figure 9 displays these three datasets, together with the accompanying flags and precision estimation.

As our database now contains three different types of datasets, the question “Which one is best for a given purpose?” naturally arises. This depends largely on the chosen method, as the subsequent examples illustrate. As a simple least-square fit can be dominated by outliers, it is of vital importance that instrumental artifacts be removed. Thus, the third type of dataset should be used for least-square fitting.4 However, if a weighted leastsquare fit or, more generally, any method that takes into account the uncertainties of data is used, the second type of dataset is more appropriate. The first type of dataset serves primarily as a reference and starting point, as it does not include independent uncertainty estimates.

All datasets are available at ftp://www.pmodwrc.ch/pub/projects/solid/database. The database will be updated regularly as new data become available, including modeled data, which will complete temporal and spectral coverage. Using standardized methods to homogenize the data makes the inclusion of new data, or new versions of already available data, easily achievable.

6. Conclusion

As part of SOLID, we provide homogeneous datasets of available space-based SSI measurements with uncertainty estimates and accompanying meta-data in a unified format, following standard conventions.

Several challenges, however, remain. Even though we have put much effort into developing an independent and homogeneous uncertainty estimate, we do recognize that we have not yet solved the problem of combining these with the PI provided uncertainty estimates. As for now, we provide both sets of estimates, and it is up to the user to combine them. Furthermore, our methods are designed to be generic and applicable to all instruments incorporated in SOLID, but consequently, they do not reflect all properties specific to each instrument. Some datasets, for instance, are accompanied by flags describing the origin, the processing method, and the quality of the data or the number of measurements per data-point. Though provided in our database, these flags are not part of our processing.

Notwithstanding these challenges, our datasets are a starting point to merge all available data into one single homogeneous dataset. A follow-up paper will present a new approach to create such an SSI composite, which will also eventually be available at the above address. In any case, the present database will help to provide an SSI reconstruction of the satellite era, and, thereby, it will foster diverse research in solar science, space weather, and climate studies.

Acknowledgments

The authors acknowledge that the research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7 2012) under Grant Agreement No. 313188 (SOLID, http://projects.pmodwrc.ch/solid). The authors would like to thank all data providers who provided their data on-line, David Bolsée for the SOLSPEC data, Robert Schäfer for SolACES data, Gérard Thuillier for the ATLAS reference spectra, and Marty Snow for the helpful discussions during the SOLID meetings in Bremen and London. This research has made use of NASA’s Astrophysics Data System. The editor thanks two anonymous referees for their assistance in evaluating this paper.


1

The expansion of acronyms used in this work is given in Appendix A.

3

This time shift affects all data-points and is the only change applied to the original data not marked as such in the quality map solid_flags.

4

Alternatively, another fitting method such as the median fit alleviates the problem of outliers and can be used with any type of dataset.

References

Appendix A. Tables

We have moved most of our tables to this separate section for better readability purpose. Here we present all instruments (Table A.1), an overview of all flags and their percentage applied to each dataset (Table A.2), and the list of all reference spectra (Table A.3) and proxies (Table A.4).

Table A1.

The list of all instruments in the SOLID database, together with the data version if provided and the wavelengths and time ranges. Below each dataset name we expand the acronym of the mission and instrument name and provide a reference publication and the source URL if available. The dates correspond to the start and end dates of the database. These can differ from the mission dates. This is always the case for ongoing missions.

Table A2.

An overview of the amount of data adjusted by each method. Each flag corresponds to a specific method: interpolation of missing data (1, Sect. 4.2), averaging multiple data-points (2, Sect. 4.3), outlier detection (3, Sect. 4.4), and proxy interpolation (4, Sect. 4.5). The flags are further described in Section B.7. Generally, all interpolated data are detected as outliers and interpolated by our coherency interpolation scheme. The CDS instrument does not provide daily observations, and, as such, has most data points interpolated. The last column corresponds to the number of channels in the instrument.

Table A3.

A list of all reference spectra currently available in the SOLID database.

Table A4.

Total of 14 available proxies. For the SOLID database we truncate all proxies at 1969.

Table A5.

Composite time series, their wavelength ranges, and time intervals. Below each entry we provided a reference and the source URL.

Appendix B. Data format

All data are provided in NetCDF CF-1.6 (see Footnote 2) format. In the following we give the detailed information provided in NetCDF, including global attributes, the dimensions of the data, the notation of variables used, instrument precision, instrument stability, instrument accuracy, and the use of flags.

B.1. Global Attributes

Each NetCDF file has one set of global attributes, i.e. attributes that are not bound to a specific variable. For our file format they are as follows: 

creation_date Creation date of the dataset, given in “YYYY-MM-DD HH:MM:SS UTC+00”

title Name of the dataset

institution Name of the institution where the dataset was created, currently always LPC2E/CNRS.

reference Reference to the relevant publication that describes the original dataset.

reference_ads The reference key corresponding to the astrophysics data system key.

source The source of the data, this is usually a URL from where to retrieve the data.

instrument_version The data version as provided by the instruments team.

Conventions Convention used to name the variables and their properties. Currently this is the NetCDF Climate and Forecast (CF) Metadata Conventions, version 1.6 (see Footnote 2), given as “CF-1.6”.

history A history of the dataset, i.e. when was it converted to which level.

SOLID_data_type The data type can be one of Instrument, Proxy, Reference, Composite, or Model.

SOLID_creation_date Date of the file creation of the SOLID data file.

SOLID_version SOLID data version. This specifies version of the used processing program.

SOLID_system Name of the platform where the program was run, e.g. ‘Matlab 8.1.0.604 (R2013a)’

SOLID_level Processing level of the data, an integer between 0 and 5, corresponding to the processing steps described in this paper.

B.2. Dimensions

All our datasets have two dimensions, time and wavelength. For ease of use and generic application, the same scheme has also been applied to proxy data. In these cases, the wavelength dimension is the number of proxies in the data file and the wavelength entries are negative indices, that is −1 for the first proxy, −2 for the second, and so on. The names of the proxies are stored in the field proxy_names.

time The number of distinct time points.

wavelength The number of distinct wavelengths in the data file.

B.3. Variables

First, a general description of a variable with the main common attributes, followed by specific variables.

B.3.1. Variable Name(dimension1, dimension2, ...)

The name of the variable containing the following attributes. Here we give a short description of the attribute names generic to all variables. In the subsequent sections, we provide the actual values of these attributes.

standard_name The standard name as mandated by CF-1.6. This is a unique and precise version of the variable name.

long_name The long name as mandated by CF-1.6. This is the human-readable format.

units The unit of the variable. This may depend on the dataset.

valid_min, valid_max Minimum and maximum values allowed for this variable.

missing_value A value that defines a missing value in the dataset.

B.3.2. time(time)

A one-dimensional array containing the time given in days since January first, 1980.

standard_name ‘time’.

long_name ‘Time’.

units ‘days since 1980-01-01 00:00:00 UTC+00’.

calendar The calendar used, this is always ‘standard’.

B.3.3. wavelength(wavelength)

This variable contains the second dimension of the data variable. For irradiance datasets this contains the wavelength in nanometers, otherwise it contains negative indices. While the variable name wavelength does not make sense in the latter case, we keep it for consistency, but use the correct values for the attributes.

standard_name ‘radiation_wavelength’ or ‘data_index’.

long_name ‘Wavelength’ or ‘Index’.

units Wavelength are given in nanometers, ‘nm’, unless the dataset does not contain irradiances. This is the case for e.g. proxy data, in which case the unit is “negative index” and the variable wavelength contains -1, -2, ..., -n.

valid_min If the variable contains irradiance data, it is ‘0’, otherwise it is not set.

B.3.4. data(wavelength, time)

The actual data stored as a matrix of dimension time × wavelength. The values of most properties depend on the actual dataset. However, it does differ in the case of e.g. a proxy dataset.

standard_name For an irradiance dataset: ‘downwelling_spectral_irradiance_in_vacuum’, otherwise data dependent.

long_name ‘Spectral Irradiance’ for spectral irradiance, otherwise data dependent.

units For irradiance datasets we use ‘W/(m2 nm)’, otherwise data dependent.

valid_min For most datasets this is ‘0’.

ancillary_variables ‘solid_flag solid_precision solid_stability instrument_precisioninstrument_stability instrument_accuracy’ or a subset of these variables. This depends on the actual data.

SOLID_name The proxy tables have named indices that are given in this attribute, one for each wavelength index.

B.3.5. solid_flag(wavelength, time)

This is an ancillary variable of data, containing the quality flag mask of the dataset. Each processing step corresponds to an entry in the data mask. A description of all the values is presented in Section B.7. To determine the presence of a flag, one may either use the bitwise ‘and’ operator on integers and test for non-zero.

standard_name ‘downwelling_spectral_irradiance_in_vacuum status_flag’.

long_name ‘Spectral Irradiance Quality’.

units ‘level’.

valid_range ‘0, 15’.

flag_masks ‘1, 2, 4, 8’ as unsigned integers.

flag_meanings ‘Interpolated Multiple_Time_Values AR_Fitted FinalInterpolation’.

B.3.6. solid_precision(wavelength, time)

This is an ancillary variable of data, containing our estimated precision of the dataset.

missing_value ‘-1.e+99’.

_FillValue ‘-1.e+99’. We use the same values for _FillValue and missing_value to indicate that the data has no default value.

standard_name ‘downwelling_spectral_irradiance_in_vacuum standard_error’

long_name ‘Spectral Irradiance Precision’.

units same as the units attribute of the data variable.

description ‘Estimated Precision’.

B.3.7. solid_stability(wavelength, time)

This is an ancillary variable of data, containing our estimated stability of the dataset. Each entry corresponds to an entry in the data mask.

_FillValue ‘-1.e+99’.

missing_value ‘-1.e+99’.

standard_name ‘downwelling_spectral_irradiance_in_vacuum standard_error’

long_name ‘Spectral Irradiance Stability’.

units same as the units attribute of the data variable per year.

description ‘Estimated Stability’.

B.4. instrument_precision(wavelength, time)

The precision as provided. Either time or wavelength dimensions can be missing. In these cases, the missing dimensions have to be broadcasted.

_FillValue ‘-1.e+99’.

missing_value ‘-1.e+99’.

standard_name ‘downwelling_spectral_irradiance_in_vacuum standard_error’.

long_name ‘Spectral Irradiance Precision’.

units Either the same as units attribute of the data variable, or, if a dimension has been collapsed, it is given without units, in which case it is relative to the data.

description ‘Provided Precision’.

B.5. instrument_stability(wavelength, time)

The stability as provided. Either time or wavelength dimensions can be missing. In these cases, the missing dimensions have to be broadcasted.

_FillValue ‘-1.e+99’.

missing_value ‘-1.e+99’.

standard_name ‘downwelling_spectral_irradiance_in_vacuum standard_error’.

long_name ‘Spectral Irradiance Stability’.

units Either same as the units attribute of the data variable divided by year, or, if a dimension has been collapsed, it is ‘1/year’, in which case it is relative to the data per year.

description ‘Provided Stability’.

B.6. instrument_accuracy(wavelength, time)

The accuracy as provided. Either time or wavelength dimensions can be missing. In these cases, the missing dimensions have to be broadcasted.

_FillValue ‘-1.e+99’.

missing_value ‘-1.e+99’.

standard_name ‘downwelling_spectral_irradiance_in_vacuum standard_error’.

long_name ‘Spectral Irradiance Accuracy’.

units Either same as the units attribute of the data variable, or, if a dimension has been collapsed, it is given without units, in which case it is relative to the data.

description ‘Provided Accuracy’.

B.7. Flags

A list of all flags used in the variable solid_flags described in Section B.3.5. The flags are represented as binary values and stored as sums, i.e. if the quality flag for a specific time and wavelength is 6, this data point is composed of multiple time values and has been re-fitted by the autoregression model. To obtain the flag of value f, compute

1: Missing data or obvious outlier This flag is set if the data has been interpolated due to missing data or data that is well outside the physical range.

2: Multiple time values Some data provide multiple values for the same time. If so, the average of those is used.

4: Regrid in Time This value indicates a re-gridding of the data in time

8: Outlier The autoregression fit model set this flag if this data has been flagged as an outlier and replaced by linear interpolation.

16: Proxy Interpolate This data-point has been replaced by a proxy-based model.

Appendix C. Methods

C.1. General Processing Routines

Here we define a short description of helper routines that are used for the main routines below.

#x Number of elements in array x

Shape(x) The size of the matrix in the form [n, m, ...]

Onset(x) A matrix of ones with the same size as another matrix x, similar functions are Zeros, True, Nan, False.

C.2. Basic Processing Routines

C.2.1. Noise Estimators

Here we present several estimators for the precision that can be used to give an internal noise estimate. This estimator’s purpose is to estimate the noise introduced due to time shifting. In this work we used the wavelet noise estimator (Sect. C.2.4), which is a composite of the Donoho noise estimator (Sect. C.2.2) and the all component wavelet noise estimator (Sect. C.2.3). We did not use the proxy autoregression wavelet estimator (Sect. C.2.5) because despite the additional computational cost and, more importantly, its reliance on an external data source, it did not return significant different results.

C.2.2. Donoho Noise Estimator

The original wavelet noise estimator by Donoho. It calculates the median of the wavelet coefficients and normalizes it with regard to white noise. It works well for white noise, however, as discussed in Section 3 it underestimates low varying colored noise, e.g., brown and pink noise while overestimating fast varying noise, e.g., blue and violet noise.

1: def DonohoNoiseEstimator(d, wn ← (‘daubechies’, 8))

  Require: From WaveLab v8.05: MakeONFilter() and FWTPO()

2:   q ← MakeONFilter(wn)

3:   

4:   

5:   w ← FWTPO(ds, 1, q)

6:   return

C.2.3. All Component Wavelet Noise Estimator

An adapted Donoho wavelet noise estimator. It calculates the median of all wavelet components coefficients and normalizes it with regard to white noise. As in the case of the Donoho wavelet noise estimator, it works well for white noise, however it underestimates all colored noises, e.g., brown, pink, blue, and violet noise.

1: def FullWaveletNoiseEstimator(d, wn ← (‘daubechies’, 8))

Require:  WaveLab v8.05

2:   q ← MakeONFilter(wn)

3:   

4:   

5:   w ← FWTPO(ds, 1, q)

6:   return median(|w|)/0.6748

C.2.4. Wavelet Noise Estimator

By combining the Donoho’s wavelet noise estimator with the high-frequency noise estimator, it is possible to get a good estimate for all high-frequency components. This is the estimator used in this work.

1: def WaveletNoiseEstimator(d, wn ← (‘daubechies’, 8))

  Require: DonohoNoiseEstimator, FullWaveletNoiseEstimator

2:   nd ← DonohoNoiseEstimator(d, wn)

3:   nw ← FullWaveletNoiseEstimator(d, wn)

4:   return (nd + 1:2nw)/2.2

C.2.5. Proxy autoregressive Wavelet Noise Estimator

The simple wavelet noise estimator may overestimate the noise due to physical short-term variability and due to physical outliers. This can be compensated by removing physical signals using a multiple input single output autoregressor before estimating the noise via the wavelet noise estimator.

We use radio flux data as the additional multiple input component. While it would be possible to also use the neighboring wavelength as additional input, this approach has two drawbacks, first it makes the noise estimate dependent on the number of measured wavelengths by the instrument (and the spectral resolution) and it will remove noise that influences several wavelengths at once.

1: def WaveletAutoregressiveNoiseEstimator(d, wn ← (‘daubechies’, 8), p ← radio))

  Require: WaveletNoiseEstimator, Autoregressive MISO

2:   p′p – mean(p)

3:   d′← d – mean(d)

4:   dfit ← Autoregressive MISO(d, p′, q = 3, mode = ‘forward’)

5:   return WaveletNoise(d′ – dfit)

C.3. SVD-Interpolation

1: def SVD-Interpolation(ts, ϵ ← 10–8)

2:  d ← ts.Data

3:  m ← IsNanOrNeighboor(d)

4:  dm ← Interp(d¬m)

5:  for i ← 1..10:

6:   converged ← False

7:   while not ← converged:

8:    (u, s, v) ← svdi(d)

9:    dm ← (usv)m

10:    converged ← (||dmd|| < ϵ ||d||)

11:  return ts

1: def IsNanOrNeighboor(d)

2:  m′ ← IsNan(d)

3:  mmbitor bitor

   return m

C.4. Autoregression Outlier Detector

Here we describe a method to flag and remove outliers by an autoregressive model. For each wavelength we construct a 4th order AR model, calculate the error and flag, and replace all values temporarily by linear interpolation whose error lies outside a given σ. The σ can be adaptive by providing a proxy to detect real outliers, that is outliers caused by solar phenomena. The same procedure is applied to the proxy first. All outliers that are detected in 70% of the proxies increase σ by 2 for the detected outliers in the proxy data.

1: def AutoregressiveOutlierDetector(ts, n ← 3, proxy ← radio proxies)

  Require: n > 0

  Require: shape(n) ϵ {[1], shape(y)}

2:   if shape(n) = [1]:

3:    nnones(y)           ᐅ Explode n to the same size as y

4:   if proxy ≠ empty :

5:     pm ← AutoregressiveOutlierDetector(ts ← proxy, proxy ← [])           ᐅ First argument is the timeseries ‘proxy’

6:     +2.           ᐅ Relax condition for outliers detection

7:   qm ← ts.Quality = Interpolated)           ᐅ Interpolated = 1

8:   for λ in ts.Wavelength:

9:    (ts.Dataλ, qλ) ← RmOutliers(ts.Dataλ, n, )

10:   (ts.Quality ← ts.Quality bitor q

11:   return ts

12: def RmOutliers(y, n ← 3, mask ← ∅)

  Require: n > 0

  Require: shape(mask) ϵ {∅, shape(y)}

  Require: shape(n) ϵ {[1], shape(y)}

13:   f* ← inf

14:   q ← False(y)

15:   repeat

16:    ff*

17:    (ϵ, f) ← |y – arfit(y)|

18:    m ← (ϵ > y) or mask

19:    y(m) ← interplinear(ym))

20:    q(m) ← AR-Fitted           ᐅ AR-Fitted = 4

21:  until or maximum iteration limit is reached

22:  return (y, q)

Cite this article as: Schöll M, Dudok de Wit T, Kretzschmar M & Haberreiter M. Making of a solar spectral irradiance dataset I: observations, uncertainties, and methods. J. Space Weather Space Clim., 6, A14, 2016, DOI: 10.1051/swsc/2016007.

All Tables

Table 1.

Different sources and properties of uncertainties, together with synonyms and their inverses. The unit of the uncertainties depends on the unit of the data, which is abbreviated as ud. Inverse is used in the sense that the larger the value of the uncertainty, the lower the value of the inverse. We also indicate whether we present our own estimation of the uncertainty and whether the uncertainty is time dependent. Time-dependent uncertainties are provided with the same temporal resolution as the data.

Table 2.

An overview of all datasets that are interpolated in time. We present minimum, median, maximum, and standard deviation of the absolute time shifts. Time is given in hours (hh:mm).

Table A1.

The list of all instruments in the SOLID database, together with the data version if provided and the wavelengths and time ranges. Below each dataset name we expand the acronym of the mission and instrument name and provide a reference publication and the source URL if available. The dates correspond to the start and end dates of the database. These can differ from the mission dates. This is always the case for ongoing missions.

Table A2.

An overview of the amount of data adjusted by each method. Each flag corresponds to a specific method: interpolation of missing data (1, Sect. 4.2), averaging multiple data-points (2, Sect. 4.3), outlier detection (3, Sect. 4.4), and proxy interpolation (4, Sect. 4.5). The flags are further described in Section B.7. Generally, all interpolated data are detected as outliers and interpolated by our coherency interpolation scheme. The CDS instrument does not provide daily observations, and, as such, has most data points interpolated. The last column corresponds to the number of channels in the instrument.

Table A3.

A list of all reference spectra currently available in the SOLID database.

Table A4.

Total of 14 available proxies. For the SOLID database we truncate all proxies at 1969.

Table A5.

Composite time series, their wavelength ranges, and time intervals. Below each entry we provided a reference and the source URL.

All Figures

thumbnail Fig. 1.

A graphical summary of Table A.1 (Instruments), A.3 (References), A.4 (Proxies), and A.5 (Composite datasets) representing almost all the datasets incorporated in SOLID versus time and wavelength. The upper panel presents all available proxies, together with two spectral composites. The lower blue line, labeled Radio, represents the 3.2, 8, 10.7, 15, and 30 cm radio fluxes from the Nobeyama and Penticton radio-observatories. The time range shown corresponds to the available times in the SOLID database, which truncates all proxies but the radio fluxes at 1969. The lower panel presents the temporal and spectral ranges for all instruments in the database. Vertical lines represent the time point and spectral range of the reference spectra. Instruments with fewer than 20 channels are shown as light boxes with a horizontal line at the wavelength of the central frequency of each channel. While the 21st century is well covered both in temporal and spectral dimensions, previous data are only available for selected wavelengths. Only the UV between 170 nm and 400 nm has been covered continuously since 8 November 1978.

In the text
thumbnail Fig. 2.

Instrument uncertainties for different TSI instruments. They differ by up to three orders of magnitude with the highest uncertainties for a modern instrument, TIM. This is due to different definitions used for what an instrumental uncertainty is. For that reason, these values cannot be meaningfully compared.

In the text
thumbnail Fig. 3.

Comparison of the three presented noise estimators for different colors of noise. We show the original Donoho noise estimator (solid black), a wavelet estimator that takes into account all wavelet scales (Donoho-FULL, short-dashed red), and our adapted Donoho estimator (long-dashed blue). The β value, representing the power of the spectral density, is varied from −2 (Brownian noise) to 2, where 2 is violet noise (1 would be blue noise). Each point represents the estimated noise level of a one thousand point randomly generated time series of the specified power. The spread indicates the standard deviation of the estimator from one hundred trial runs. While the classical Donoho estimator overestimates all noise for β > 0, but using all wavelet components underestimates the noise of the same data, the adapted Donoho estimator provides a better estimate.

In the text
thumbnail Fig. 4.

(a) Measured SSI (black) of UARS SOLSTICE at 248.5 nm (left) and SME at 214.5 nm, along with the precision provided by PIs (long-dashed red) and our own estimate of precision (short-dashed blue). We also mark the times where we have detected outliers (red circles). Two strong outliers are visible in June, while two possible, but unlikely outliers are around 20 February 1996. The interpolated data from end of January have also been correctly classified as outliers. (b) The 1 − σ standard deviation of the spectral variability (solid black) for each wavelength and the provided (long-dashed red) and estimated (short-dashed blue) averaged uncertainties per wavelength of the two instruments from (a), but for the full time-range (3 Oct 1991 to 29 Sept 2001 for UARS/SOLSTICE and 8 Oct 1981 to 12 Apr 1989 for SME/UV). As discussed in the text, our own estimates are, by definition, bounded by the variability of the data.

In the text
thumbnail Fig. 5.

A graphical overview of all steps to construct the final composite. The input data (red) are provided by instrument teams (References can be found in Appendix A). This work discusses the construction of the homogeneous individual datasets (green, numbers link to the corresponding sections), while the stability estimation (magenta) is discussed in a separate paper. The combination of all these datasets will be provided in the future (blue).

In the text
thumbnail Fig. 6.

UARS/SUSIM at 239.5 nm as provided by the instrument team and the 16 σ thresholds (dashed). All points outside of the thresholds are marked as outliers, replaced by linear interpolation. Further possible outliers are selected and handled in Section 4.4. We define their uncertainties as the difference from the original data to the interpolated points. In the case of SUSIM, these outliers occur directly after instrumental data gaps, giving further credence to the possibility of instrumental, non-physical, outliers.

In the text
thumbnail Fig. 7.

Coherency interpolation at different points in time for the 144 nm channel of SORCE SOLSTICE A, which contains 65 channels in total (115 nm–179 nm). Each color represents a reconstruction of the 144 nm channel using a sequence of datasets of which some contain gaps for the days marked by a black horizontal line that were filled using the SVD interpolation. We show the original data (solid thick black), interpolated data where only this channel has been removed and re-interpolated (long-short dashed, blue), data where not only the 144 nm channel has been removed, but also its lower and upper 10 channels, i.e. everything from 134 nm to 154 nm (short-dashed green), and data where everything but the outermost channels (the two channels at 115 nm and 179 nm) have been removed for the selected days (long-dashed red). We also show our estimated precision (dotted black). The lower plot displays the error introduced by the SVD interpolation, smoothed over four days for better visibility. Due to the high spectral coherency, the error in the reconstruction is not significantly affected when removing not only the neighboring two, but 21 channels. Removing all but two channels does increase the error up to 100%, yet most data are still below our own estimation of the precision.

In the text
thumbnail Fig. 8.

Flowchart of the outlier detection with the possibility of physical outliers. First, we construct an autoregressive model of the time series. Then we flag all values whose difference to the model is greater than a given threshold n as possible outlier candidates. To account for the possibility of physical outliers, we relax this condition if proxies flag the same data as outliers.

In the text
thumbnail Fig. 9.

A sample of the three provided levels of UARS/SUSIM, the original data, named level 0 (a), re-gridded and missing data interpolated as level 1 (b), and our best-estimate, level 2, where outliers have been removed and re-interpolated (c). One solar proxy, Mg II, is shown for comparison (black dots). The error bars represent our estimate of the precision, including the induced uncertainty of each method. Also shown are the values of all set flags. For example, a value of 24 means that bits 3 (23) and 4 (24) are set, marking this data-point as both being detected as an outlier (8) and re-interpolated in the final interpolation (16).

In the text
thumbnail Fig. 10.

Time series of SOHO-SEM data after application of the AR outlier removal routine (blue [25 nm], green [30 nm]). We show the original data (black) without any processing done and the input data to the AR routine (red). Red dots mark points that are detected as outliers. This method picks obvious errors (e.g. around February 2002 there are some negative values), linearly interpolated data (mid-end of 1999, January 1999), and solar flares. However, it also detects some points that are not obviously (and necessarily) outliers. Their uncertainty is increased by the difference to the interpolated data. The inlet displays the whole time series, January 1996 to June 2014.

In the text

Current usage metrics show cumulative count of Article Views (full-text article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.

Data correspond to usage on the plateform after 2015. The current usage metrics is available 48-96 hours after online publication and is updated daily on week days.

Initial download of the metrics may take a while.