Semi-supervised classiﬁcation of lower-ionospheric perturbations using GNSS radio occultation observations from Spire Global’s Cubesat Constellation

– This study presents a new methodology to automatically classify perturbations in the lower ionosphere using GNSS radio occultation (RO) observations collected using Spire ’ s constellation of Cube-Sats. This methodology combines signal processing techniques with semi-supervised machine learning by applying spectral clustering in a metric space of wavelet spectra. A “ bottom-up ” algorithm was applied to extract E layer information directly from Spire ’ s high-rate (50 Hz) GNSS-RO pro ﬁ les by subtracting the effect of the F layers. This processing algorithm has been implemented in our ground segment to operationally produce high rate sTEC pro ﬁ les with a vertical resolution of better than 100 m. The key idea behind the semi-supervised classi ﬁ cation is to produce a database of labeled clusters that can be used to classify new unlabeled data by determining which cluster it belongs to. A dataset of more than 12,000 GNSS-RO pro ﬁ les collected in 2019 containing sTEC perturbations is used to ﬁ nd the initial clusters. This dataset is used to represent the climatology of ionospheric perturbations, such as MSTIDs and sporadic Es. The wavelet power spectrum (WPS) is computed for these pro ﬁ les, and a metric space is de ﬁ ned using the Earth mover ’ s distance (EMD) between the WPS. A self-tuning spectral clustering algorithm is used to cluster the pro ﬁ les in this metric space. These clusters are used as a reference database of perturbations to classify new sTEC pro ﬁ les by ﬁ nding the cluster of the closest pro ﬁ le of the clustered dataset in the EMD metric space. This new methodology is used to construct an automated system to monitor iono-spheric perturbations on a global scale


Introduction
The ionosphere remains an important region of the atmosphere for study due to its potential impact on radio systems (Angling et al., 2012) and its coupling to the wider geospace environment (Schunk & Nagy, 2009).The ionosphere is strongly coupled to the magnetosphere and solar wind as well as to the lower atmosphere.In the case of the latter, the ionosphere can be affected by events such as extreme terrestrial weather (Chou et al., 2017), earthquakes and tsunamis (Galvan et al., 2011;Savastano et al., 2017;Astafyeva, 2019), explosions (Jacobson et al., 1988;Huang et al., 2019) and rocket launches (Booker, 1961;Mendillo et al., 1975;Savastano et al., 2019).
This paper focuses on detecting and classifying perturbations in the low altitude ionosphere, i.e., between 80 and 150 km.We can define such perturbations as deviations from the median ionosphere that have been produced by a geophysical or anthropogenic driver.These perturbations can be generated by multiple mechanisms such as plasma instabilities (Fejer & Kelley, 1980), sporadic E (Es) (Haldoupis, 2011), and traveling ionospheric disturbances (TIDs) (Yeh & Liu, 1974).
Sporadic E (Es) layers are thin (0.5-5 km) layers of enhanced electron density occurring at E region heights (90-120 km).Its occurrence is strongly latitude and seasondependent (Whitehead, 1970(Whitehead, , 1989); Es appears mainly in the daytime at midlatitudes in the summer hemisphere, while Es rates are generally low in winter.The horizontal extent of Es can range from 10 to 1000 km (Wu, 2005), and layers may last from minutes to hours.At midlatitudes, sporadic E formation is a complex process of interaction between wind shears in the lower thermosphere and long-lived metallic ions originating from meteors (Arras, 2010;Haldoupis, 2011).
Topical Issue -Ionospheric plasma irregularities and their impact on radio systems TIDs are a type of ionospheric perturbations consisting of wave-like fluctuations of the electron density in the ionosphere.These disturbances are the ionospheric manifestation of atmospheric gravity waves that naturally occur at many different scales (Georges, 1967;Georges & Hooke, 1970).Based on their phase velocity, wave period, and horizontal wavelength, TIDs are often classified into medium-scale TID (MSTID) and large scale TID (LSTID) (Ogawa et al., 1987;Crowley & Rodrigues, 2012).
Es and TID climatologies have been well characterized through studies using ionosondes and other ground instruments over many years.The first global maps of sporadic E occurrence were produced during the International Geophysical Year in 1957-58 (Leighton et al., 1962).However, such measurement campaigns are limited in their geographic coverage and may also be limited in their vertical resolution.The high-resolution vertical structure of Es has been studied using sounding rockets (Aubry et al., 1966;MacKenzie & Sayers, 1966;Wakabayashi et al., 2005).However, such measurements are even more geographically sparse.
Beyond scientific studies, real-time monitoring of TIDs and Es layers is of great interest to many terrestrial applications, such as radio communications, global navigation satellite system (GNSS) users, space weather data assimilation models, and natural hazards warning systems (Savastano et al., 2017).However, this remains a challenging goal because, on a global scale, there are large parts of the ionosphere that are not surveyed by any ground-based instrumentation.One way to overcome this is to use space-based radio occultation (RO) measurements to sense the ionosphere.Such measurements are made by estimating the total electron content (TEC) along the line of sight between the GNSS transmitter in medium Earth orbit (MEO) and a receiver in low Earth orbit (LEO) (Hajj & Romans, 1998;Hajj et al., 2002).GNSS-RO studies have been used to study the climatology of Es (Wu et al., 2005;Arras, 2010;Tsai et al., 2018).Using high rate data, RO profiles can capture very fine structures in the E region of the ionosphere (Wu, 2018).
The use of GNSS-RO for routine perturbation monitoring presents its own difficulties.For example, the Spire RO CubeSat constellation currently collects in excess of 15k GNSS-RO profiles per day.It is not possible for each of these to be manually examined and classified with the type of perturbation (or none) that it may contain.Therefore, this paper aims to describe how machine learning methods may be used to group similar profiles into clusters and how these may then be used to classify new profiles in real-time.

Data collection system
Spire operates a rapidly growing constellation of more than 100, 3U CubeSats known as LEMURs (low-Earth multi-use receivers).Each LEMUR carries a software-defined GNSS-RO receiver (STRATOS) that is capable of making open-loop, dual-frequency, multi-constellation (GPS, GLONASS, QZSS, and Galileo) RO measurements.The LEMUR CubeSats fly in a range of LEO orbits and at altitudes between 400 km and 600 km.
Presently, Spire processes over 15k GNSS-RO profiles each day, with the expectation of further growth due to new satellite launches and an increased duty cycle of the existing fleet.Figure 1 shows an example of the coverage of Spire GNSS-RO profiles collected on 12 May 2020.A description of the measurement system can be found (Angling et al., 2021).
Space-based GNSS-RO sTEC measurements are collected in an atmospheric limb sounding geometry where the GNSS receiver (Rx) is on an LEO satellite (Angling et al., 2021).As the LEO satellite moves in its orbit, the GNSS satellite (Tx) is seen to rise above or set below the horizon.The ray path from the Tx to Rx is quasi horizontal and can be characterized by the height of its tangent point.Thus, at each time epoch, the sTEC measurement made along each ray path can be associated with the tangent height in order to construct an sTEC profile.

Dataset
Spire's GNSS-RO level-1b atmPhs files follow the COS-MIC (Constellation Observing System for Meteorology, Ionosphere, and Climate) standard file format defined by UCAR (University Corporation for Atmospheric Research).These files contain profiles of signal to noise (SNR), L1 and L2 excess phase measurements sampled at a rate of 50 Hz.In determining the excess phase, the error terms that are common to both frequencies (i.e., errors in the precise orbit determination (POD) and both transmitter and receiver clock noise) have been removed by the standard GNSS-RO processing system.Thus, a single frequency excess phase measurement can be used to estimate relative TEC without forming the L1-L2 linear combination.This is desirable since forming the combination introduces an error term due to the fact the L1 and L2 signals experience different bending and thus travel over slightly different ray paths.Although this is often neglected for space to ground paths, for RO, the path separation can be 100s of meters (Svehla, 2018); this is greater than the vertical resolution that is achieved for the E region measurements and cannot be ignored.Furthermore, the L1 signal has a higher signal-to-noise ratio (SNR) than the L2 signal and therefore provides the best phase measurement among the available GNSS-RO data.For these reasons, we used the bottom-up approach described in (Wu, 2018) in order to estimate high-rate (50 Hz sampling) relative G. Savastano et al.: J. Space Weather Space Clim. 2022, 12, 14 sTEC profiles in the E region from L1-only excess phase observations.The main assumption of the algorithm is that the contribution of the F region to the excess phase observations varies linearly with height when the tangent point is below 80 km altitude.The basis for the linear assumption is described in (Wu, 2018) Section 3, and the simulation validation results are given in (Wu, 2018) Appendix C. Figure 2 shows the bottom-up approach applied to high-rate 50 Hz excess phase data in order to estimate relative sTEC profiles.There are many advantages in analyzing the sTEC measurements instead of inverted quantities (such as total electron content).The main one is that traditional retrieval processes (e.g., the Abel inversion) often assume spherical symmetry (Hajj & Romans, 1998), which may not be valid for ionospheric perturbations, and make the retrieved quantities more difficult to interpret.
Limitations of the linear assumption in the presence of strong TEC gradients or ionospheric anisotropy in the F region may lead to unphysical relative TEC profiles (see top right panel of Fig. 3).However, since the proposed semi-supervised classification algorithm is based on cropped wavelet power spectra that do not include very long wavelengths (see Sect. 2.3.1), the classification results are not affected.The main reason for implementing the bottom-up algorithm is to create E-region perturbation profiles more easily physically interpretable.This is an important step for any semi-supervised machine learning algorithm which relies on labels created by scientists during the training phase (see Sect. 3).
GNSS-RO sTEC profiles are nearly instantaneous snapshots of the E region of the ionosphere with a typical measurement time of 1-2 min.This is much shorter than the typical timescales of atmospheric and ionospheric processes that may cause perturbations (Alexander et al., 2008).Therefore, we can treat the ionosphere as fixed in time during the measurement and perform an analysis of vertical structures while ignoring the temporal variability.
The 50 Hz sampling of the relative sTEC profile corresponds to a vertical resolution of around 100 m and thereby allows the detection of fine vertical structures in the E region ionosphere.An added benefit is that only single-frequency observations are needed for this technique.
In order to represent the climatology of ionospheric perturbations, such as MSTIDs (Duly et al., 2013) and sporadic E (Wu et al., 2005), we randomly selected 100 profiles per day for the whole year 2019.We implemented two selection criteria to reduce the number of corrupted profiles in our dataset.The first criterium is that the minimum of the signal to noise ratio (SNR) along the profile must be greater than 15 dB to avoid profiles affected by possible changes in the satellite attitude or by radio-frequency interference (RFI).The second criterium deals with the breakdown of the linear assumption of the bottom-up approach, which is avoided by computing the standard deviation of the sTEC profile below 70 km altitude and disregarding profiles having a standard deviation greater than 1 TECU.In our study, the number of GNSS-RO profiles where the linear assumption broke down was less than 1% of the total.
In order to select only profiles containing ionospheric perturbations, we analyzed the wavelet power spectrum (WPS) of each profile (see details in Sect.2.3), and we selected only profiles meeting the criteria: where the maximum and the sum of the WPS allows the selection of profiles with either localized or extended power intensity in the wavelet spectra.Figure 3 displays the final dataset of perturbed sTEC profiles used as input for the classification pipeline.The dataset contains more than 12,000 sTEC profiles, with a homogeneous geographic distribution in latitude and longitude.However, it is

ML-based classification pipeline
In this section, we present the pipeline used to classify perturbations in the lower ionosphere.This pipeline is based on a new methodology that combines signal processing techniques, such as wavelet analysis, with unsupervised machine learning clustering algorithms, such as spectral clustering.

Wavelet transformation
The first step of this pipeline is to resample the sTEC profiles into a regular vertical grid that extends from 50 to 145 km.This step allows the computation of physically meaningful wavelet spectra of the profiles.We use the scipy.interpolate.interp1dfunction, which implements several interpolation algorithms, such as nearest neighbor search (NNS) and spline functions, effectively enabling the downsampling of profiles to a 200 m vertical resolution.
After resampling the profiles, we compute the wavelet spectrum for each profile using the PyWavelets python package (Lee et al., 2019).We store each spectrum into a (n Â n x Â n y ) matrix, where n is the number of profiles and n x and n y are the number of columns and rows of the 2-D spectrum.The wavelet spectrum is cropped in wavelength from 1 to 10 km and in altitude from 70 to 125 km in order to avoid edge effects.
The wavelet spectra are computed using a complex Morlet wavelet, which provides a good compromise between scales and wavelength resolution, and it is defined by: where f b and f c represent the bandwidth parameter and the wavelet center frequency, respectively.In our pipeline, the wavelet center frequency was set to 1.0 Hz and the bandwidth parameter to 1.5 (Fig. 4).In order to cluster together perturbations occurring at different altitudes, we center the altitude of the spectrum around the maximum of the spectrum amplitude with an altitude window of ±25 km.In this way, we are unaffected by the absolute altitude values.For sTEC profiles where the maximum spectrum amplitude occurs at a distance <25 km from the edge, we zero-pad to fill the missing values in the wavelet spectrum space.
The next step is downsampling the wavelet power spectra.This step aims to speed up the clustering computations by reducing the number of pixels per spectra without losing important geophysical information.This is achieved by resizing the spectra using the INTER_AREA algorithm of the OpenCV python library.
Figure 5 shows the centered and downsampled wavelet power spectra for three sTEC profiles.The first column shows the perturbed sTEC profiles, the second column shows the original wavelet power spectra, and the third and fourth columns display the downsampled spectra to 20 Â 50 and 10 Â 25 pixels, respectively.In our pipeline, we have used the spectra downsampled to 20 Â 50.

Computing similarity between profiles
The earth mover's distance (EMD) is used to define a distance metric in the space of RO wavelet spectra.EMD is a method to compute the distance between two multi-dimensional probability distributions in some feature space D. The concept of using the EMD to measure perceptual similarity between gray-scale images was first explored by (Peleg et al., 1989).More recently, EMD has been utilized for coloror texturebased similarity (Greenspan et al., 2000;Rubner et al., 2000).EMD-based similarity analysis (EMDSA) is now an effective tool used in many pattern recognition (Grauman & Darrell, 2004) applications and has recently been applied to unsupervised learning problems in the physical sciences (Komiske et al., 2019).
The key concept is that the "distance" between two distributions is defined as the minimal amount of work that must be performed to transform one distribution into the other.If we consider the distributions as two different ways of piling up a certain amount of earth (hence "earth mover's distance") over the region D, the EMD is the minimum cost of turning one pile into the other; where the cost is assumed to be amount of earth moved times the distance by which it is moved (Fig. 6).
The EMD is notoriously expensive to compute for distributions in dimensions higher than one.As we will ultimately want to operationally calculate the distance to all profiles in the reference dataset for each new profile collected by the Spire constellation, it is not feasible to rely on a full EMD calculation for the two-dimensional wavelet spectra.By converting the wavelet spectra into an integer-weight point cloud representation, a proxy for the full EMD can be efficiently computed.In this representation, an EMD calculation can be cast as a transportation problem and solved with a minimum cost flow algorithm (Villani, 2003;Santambrogio, 2009;Korte & Vygen, 2012;Li et al., 2018).The point cloud representation is defined by considering each row along the altitude dimension as a point in an N-dimensional space (where N is given by the wavelet spectrum resolution along the wavelength dimension).Our EMD proxy is then computed using the Euclidean distance between the points to define a point cloud EMD calculation that can be efficiently solved with scipy.optimize.linear_sum_assignment.
Since we have already centered and cropped around the largest perturbation in the wavelet spectrum, the loss of altitude information does not impact the results significantly: on the reference dataset, both the Pearson and Spearman r correlation coefficients between the full EMD calculation and our point cloud EMD proxy is >0.99.This indicates that it is a good proxy and can be calculated more than 3 orders of magnitude faster than when using pyemd.emdon a modern laptop.The proxy calculation is inspired by more general approaches to efficiently approximate the EMD in higher dimensions, such as the sliced Wasserstein (Bonneel et al., 2015;Kolouri et al., 2019); however, we rely on our pre-processing pipeline rather than Monte Carlo sampling to reduce the complexity of the calculation without losing necessary information.The final distance matrix has the shape of (n Â n), where n is the number of profiles.

Clustering profiles
Spectral clustering is a technique with roots in spectral graph theory (Chung, 1997).The core idea is to perform a spectral analysis of a graph's Laplacian matrix to find clusters in the graph.
We can use a graph G = (V, E) as an abstraction of our dataset, where the nodes V represent the wavelet spectra of the sTEC profiles and the width of the edges E connecting the nodes represents the weight of each edge w(i; j), which in turn is a function of the similarity between nodes i and j. Figure 7 shows an example of a fully connected graph.Clustering a graph means partitioning the set of nodes V into disjoint sets V 1 , V 2 , . . ., V n , where the similarity among nodes in a set V i is high and across different sets V i , V j is low.
For example, a graph G = (V, E) can be partitioned into two disjoint sets A and B, A [ B = V, A \ B = ;, by removing those Fig. 8.A case where minimum cut gives a bad partition.Figure from (Shi & Malik, 1997).
G. Savastano et al.: J. Space Weather Space Clim. 2022, 12, 14 edges that connect the two clusters.The degree of dissimilarity between clusters can be computed as the total weight of the edges that have been removed.In graph theoretic language, this is called the cut and is defined as: wðu; vÞ: The optimal bi-partitioning of a graph is the one that minimizes this cut value.Although the number of potential partitions increases exponentially with the number of nodes, finding the minimum cut of a graph is a well-studied problem, and there exist efficient algorithms for solving it (Wu & Leahy, 1993).
The minimum cut criterium favors cutting small sets of isolated nodes in the graph.This is not surprising since the cut value increases with the number of edges crossing the cut.Figure 8 illustrates this case; assuming the edge weights are inversely proportional to the distance between any two nodes, we see that the cut that partitions out node n 1 or n 2 will have a very small value.In fact, any cut that partitions out individual nodes on the right half will have a smaller cut value than the cut that partitions the nodes into the left and right halves.
To avoid this problem of clustering small sets of isolated nodes in the graph, a disassociation measure known as normalized cut (Ncut) was defined in (Shi & Malik, 1997) as: where assoc(A, V) is the total connection from nodes in A to all nodes in the graph, and assoc(B, V) is similarly defined.
With this definition of the disassociation between the groups, the cut that partitions out small, isolated points will no longer have a small Ncut value since it will almost certainly be a large percentage of the total connection from that small set to all other nodes.For example, in the case illustrated in Figure 8, we see that the cut 1 value across node n 1 will be 100% of the total connection from that node.
The graph's Laplacian matrix L can be written as: where I is the incidence matrix of the graph, D is the degree matrix, and W is the adjacency matrix.Many methods, such as the normalized cuts algorithm (Shi & Malik, 1997), can be used to find k clusters in the graph using the first k eigenvectors of the graph's Laplacian matrix (Von Luxburg, 2007).When working with data that is not directly available in the form of a graph, the first step is to construct an affinity matrix A 2 R n Â n , where n is the number of nodes (i.e., profiles in our case).An affinity matrix is just like an adjacency matrix, except the entries express how similar nodes are to each other.We can use the EMD distance matrix as an affinity matrix, in which a value of zero means identical elements, and high values mean very dissimilar elements.This EMD distance matrix can be transformed into an affinity matrix that is well suited for the Ncut algorithm by applying a Gaussian kernel (Ng et al., 2002): where the scaling parameter r is a free parameter representing the width of the Gaussian kernel and controls how rapidly the  similarity matrix A falls off with the distance between the points p i and p j .The term d(p i , p j ) represents the distance function.As previously described, we compute the distance matrix by calculating a proxy for the pairwise earth mover's distance (EMD) between all the elements of the matrix of downsampled wavelet spectra obtained from the perturbed sTEC profiles.Instead of selecting a single arbitrary scaling parameter r, we compute a local scaling parameter r i for each point p i (Zelnik-manor & Perona, 2005).The Gaussian kernel can now be expressed as: Using a local scaling parameter for each point allows selftuning of the point-to-point distances according to the local statistics of the neighbourhood surrounding point i.A simple choice for r i is: where p b is the b'th neighbor of point p i .In our pipeline, we used the standard value of b = 7 (Zelnik-manor & Perona, 2005).
Once we have computed the adjacency matrix, we can apply the k-ways normalized cuts method to find the best k clusters (Shi & Malik, 1997) using the implementation in sklearn.cluster.SpectralCluster.The optimum number of clusters for a specific dataset can be estimated with the eigengap heuristic algorithm, which finds the number k such that all eigenvalues k 1 , . . ., k k are very small, but k k+1 is relatively large (Von Luxburg, 2007).

Results
In this section, we present the results from our ML-based classification pipeline, which found 145 clusters in our dataset.The next step of our semi-supervised machine learning pipeline  is to label and merge these clusters into a smaller number of classes of interest.For our application, we identify four main classes (Table 1).Some of the clusters that belong to the database of ionospheric perturbations are presented in the following figures.Each entry of the database also contains metadata that describes the type of ionospheric perturbation in that cluster.Figure 9 shows an example cluster that contains 258 reference profiles with small perturbations.The map on the left shows the geographic location of the sTEC profiles.The right-hand panel shows 8 profiles randomly selected from the cluster.All the profiles from this cluster contain similar smallscale sTEC ionospheric irregularities, therefore the cluster was assigned to class 3c.
Figure 10 shows another example cluster that contains 66 profiles.All the profiles from this cluster exhibit mediumscale wave-like ionospheric perturbations, therefore the cluster was assigned to class 2b.The final example in Figure 11 shows sporadic E structures.There are 17 profiles in this cluster that are assigned to class 1b.
Once the reference database of ionospheric perturbations has been constructed, we can use it to classify new profiles.First, we compute the wavelet spectrum of the new profile, then we compute the EMD distance between the new spectra and all the spectra in the reference database, and lastly, we assign the new profile to the same class as that of the closest profile (i.e., where the EMD to the reference data set is minimized).
Figure 12 shows the classification of two new profiles (in red).The black profiles represent the closest profile in the EMD space.Ten profiles of the same cluster are represented in blue.The EMD minimum value and the total number of profiles of the clusters are reported in the yellow box.
After deploying the automated classification system, it is possible to monitor the rate at which Es and TID events are detected.Figures 13 and 14 show the final output of Spire's ionosphere perturbations monitoring system (IPMS) based on the high-rate GNSS-RO profiles.The rates of Es and TID events show a clear hemispheric asymmetry, which agrees well with the climatology of these perturbations.In particular, we can clearly see that the highest rate of TIDs occurred near the  G. Savastano et al.: J. Space Weather Space Clim. 2022, 12, 14 Japanese region.These results agree well with the literature on summertime nighttime TIDs in Japan (Saito et al., 2001(Saito et al., , 2002;;Otsuka et al., 2007).

Conclusions
In this study, we extracted valuable information on the lower ionosphere from Spire's constellation of CubeSats.In particular, Spire Global's Lower Earth Orbit CubeSats produce high-rate (50-Hz) sTEC profiles derived using the GNSS-RO technique, which can be used to detect ionospheric features such as sporadic E and TIDs.We designed and implemented an innovative semi-supervised ML-based classification algorithm that combines wavelet signal processing and EMD-based similarity clustering.In this approach, the nonlinearity and nonstationarity of GNSS-RO data can be dealt with more rigorously than in methods using the traditional paradigm of constant frequency and amplitude (e.g., Fourier analysis).
Our results show the possibility of classifying different ionospheric perturbations, which then provides the opportunity for an automated system to monitor ionospheric perturbations on a global scale.This system is currently working operationally and may be of interest to many terrestrial applications, such as radio communications, global navigation satellite system (GNSS) users, space weather data assimilation models, and natural hazard warning systems.
In the future, we may explore the implementation of the ML-based algorithm onboard Spire CubeSats in order to autonomously detect anomalies and prioritize the downlink of RO profiles.G. Savastano et al.: J. Space Weather Space Clim. 2022, 12, 14

Fig. 2 .
Fig.2.Bottom-up processing applied to Spire high-rate (50 Hz) excess phase data in order to estimate relative sTEC profiles.The left panel shows the L1 excess phase profile (black) for Spire satellite FM100 from Galileo E13 and the linear contribution of the F region estimated from the profile below 80 km.The right panel shows the relative sTEC profile that is obtained after removal of the linear trend.

Fig. 3 .
Fig. 3. Dataset of 12,172 perturbed sTEC profiles.Upper left: location of all the sTEC profiles in latitude vs. longitude.Lower left: location of all the sTEC profiles in latitude vs. local time.Upper right: 30 randomly selected profiles.Lower right: local time histograms of sTEC profiles.

Fig. 4 .
Fig. 4. Complex Morlet wavelet function used in this pipeline and its frequency support.

Fig. 5 .
Fig. 5. Centered and downsampled wavelet spectra.(From the left) first column shows three sTEC profiles, second column shows the original wavelet power spectra, third and fourth columns show the downsampled spectra to 20 Â 50 and 10 Â 25 pixels, respectively.

Fig. 6 .
Fig. 6.Concept of the EMD between two probability distributions.

Fig. 7 .
Fig. 7.A partial representation (i.e., 50 nodes) of the fully connected graph obtained from our complete dataset (i.e., 12,172 nodes).Each node (red dot) represents an sTEC profile, and the edges of different widths represent the similarity between nodes, defined by the EMD distance.

Fig. 9 .
Fig. 9. Ionospheric perturbations cluster example.The left map shows the location of the 258 sTEC profiles within the cluster.On the right, 8 randomly selected profiles from the cluster.All the profiles show similar small-scale sTEC ionospheric perturbations.

Fig. 10 .
Fig. 10.Ionospheric perturbations example cluster.The left map shows the location of the 66 sTEC profiles within the cluster.On the right, 8 randomly selected profiles from the cluster.All the profiles show similar medium-scale TIDs structures.

Fig. 11 .
Fig. 11.Ionospheric perturbations example cluster.The left map shows the location of the 17 sTEC profiles within the cluster.On the right, 8 randomly selected profiles from the cluster.All the profiles show similar Es structures.

Fig. 12 .
Fig. 12. Classification of a new sTEC profile (in red) as Es (left panel) and TID (right panel).The black profiles represent the closest profile with minimum EMD distance.Ten profiles from the same cluster are displayed in blue.The minimum EMD value and the total number of profiles in that cluster are displayed in the yellow box.

Fig. 13 .
Fig. 13.Relative density of Es from Spire profiles to total number of profiles gathered globally from 2020/07/23 to 2020/07/30.The Es perturbations are picked up at fairly high rates at mid-latitude in the northern hemisphere (summertime), consistent with sporadic E climatological models.

Fig. 14 .
Fig.14.Relative density of TIDs from Spire profiles to total number of profiles gathered globally from 2020/07/23 to 2020/07/30.The TIDs are picked up at high rates (i.e., more than 50%) at mid-latitude in the northern hemisphere (summertime), especially in the Chinese and Japanese sectors, consistent with TID climatological studies.