Spherinator and HiPSter: Representation Learning for Unbiased Knowledge Discovery from Simulations

Kai L. Polsterer, Bernd Doser, Andreas Fehlner, Sebastian Trujillo-Gomez
2024-06-06
Abstract: Simulations are the best approximation to experimental laboratories in astrophysics and cosmology. However, the complexity, richness, and large size of their outputs severely limit the interpretability of their predictions. We describe a new, unbiased, and machine learning based approach to obtaining useful scientific insights from a broad range of simulations. The method can be used on today's largest simulations and will be essential to solve the extreme data exploration and analysis challenges posed by the Exascale era. Furthermore, this concept is so flexible, that it will also enable explorative access to observed data. Our concept is based on applying nonlinear dimensionality reduction to learn compact representations of the data in a low-dimensional space. The simulation data is projected onto this space for interactive inspection, visual interpretation, sample selection, and local analysis. We present a prototype using a rotational invariant hyperspherical variational convolutional autoencoder, utilizing a power distribution in the latent space, and trained on galaxies from IllustrisTNG simulation. Thereby, we obtain a natural Hubble tuning fork like similarity space that can be visualized interactively on the surface of a sphere by exploiting the power of HiPS tilings in Aladin Lite.
Instrumentation and Methods for Astrophysics, Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is: **How can useful scientific insights be extracted from large-scale astrophysical and cosmological simulations, given the interpretation and analysis challenges posed by the complexity and sheer size of their outputs?** Specifically, the authors propose a new, unbiased, machine-learning-based method for obtaining scientific insights from a wide range of simulations. The method can be applied to today's largest simulations and is intended to address the extreme data exploration and analysis challenges of the Exascale era (the era of exaflop-scale computing). It can also be used to explore observational data.

### Main problems and challenges

1. **Data complexity and scale**
   - Modern cosmological simulations (such as IllustrisTNG) model more than \(10^{11}\) particles and generate data volumes at the petabyte level.
   - Data at this scale is far beyond what humans can explore, synthesize, and interpret directly, and it outgrows traditional analysis techniques.
2. **Data representation and compression**
   - Simulation data is usually stored in catalogs in a highly compressed form, often reducing rich multi-dimensional data to a single scalar.
   - An automatic method is needed that learns a more effective compression and embedding of the original data, so that similar objects remain similar in the compressed representation.
3. **Visualization and interactive exploration**
   - Astronomy already has many tools for visualizing and processing data on the sphere.
   - A method is needed to map the representation learned by machine learning onto the sphere, so that the view can be refined iteratively and complex structured data becomes accessible for exploration.
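The third requirement, placing a learned representation on a sphere, can be illustrated with a minimal Python sketch. It assumes a three-dimensional latent code per object, which the summary does not state; the function names and the example vectors are hypothetical and are not taken from the paper or its software. Each code is normalized to unit length and converted to longitude and latitude, the kind of coordinates that sky-visualization tools consume.

```python
import numpy as np


def to_unit_sphere(latent):
    """Normalize each row of an (N, 3) array to unit length."""
    latent = np.asarray(latent, dtype=float)
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    return latent / norms


def to_lon_lat_deg(unit_vectors):
    """Convert unit 3-vectors to (longitude, latitude) in degrees."""
    x, y, z = unit_vectors[:, 0], unit_vectors[:, 1], unit_vectors[:, 2]
    lon = np.degrees(np.arctan2(y, x)) % 360.0
    lat = np.degrees(np.arcsin(np.clip(z, -1.0, 1.0)))
    return lon, lat


def angular_distance_deg(a, b):
    """Great-circle angle between two unit vectors, in degrees."""
    return np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0)))


# Hypothetical latent codes for three objects (illustrative values only).
codes = np.array([[2.0, 0.0, 0.0],
                  [0.0, 3.0, 0.0],
                  [0.0, 0.0, 0.5]])
points = to_unit_sphere(codes)
lon, lat = to_lon_lat_deg(points)
separation = angular_distance_deg(points[0], points[1])
```

On the sphere, the great-circle angle between two points plays the role of the similarity measure: objects with similar codes end up a small angle apart.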
### Solutions

The authors propose two main tools:

- **Spherinator**: A rotation-invariant variational autoencoder (VAE) built on a convolutional neural network (CNN), with a hyperspherical latent space. Using nonlinear dimensionality reduction, it projects the simulation data into a low-dimensional space for interactive inspection, visual interpretation, sample selection, and local analysis.
- **HiPSter**: Uses the HiPS (Hierarchical Progressive Survey) standard to create an explorable representation of the data. The hierarchical structure of HiPS allows the view to be refined step by step, providing exploratory access to complex data.

With these tools, the authors aim to process and analyze large-scale simulation data effectively in the Exascale era and to discover new scientific insights from it.
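The step-by-step refinement that HiPS offers can be made concrete with a short Python sketch. It relies on the HEALPix tessellation underlying HiPS, in which order k divides the sphere into 12 · 4^k equal-area tiles. The 512-pixel tile width is an assumed default, the function names are hypothetical, and how HiPSter actually fills the tiles is not described in this summary.

```python
import math


def hips_tile_count(order):
    """Number of HEALPix tiles covering the whole sphere at a given order."""
    return 12 * 4 ** order


def hips_pixel_size_arcsec(order, tile_width=512):
    """Approximate angular size of one pixel, in arcseconds.

    Assumes square tiles of `tile_width` pixels on a side (an assumption
    of this sketch) and equal-area pixels over the full 4*pi steradians.
    """
    n_pixels = hips_tile_count(order) * tile_width ** 2
    pixel_area_sr = 4.0 * math.pi / n_pixels
    return math.degrees(math.sqrt(pixel_area_sr)) * 3600.0


# Each additional order quadruples the tile count and halves the pixel size.
for order in range(4):
    print(order, hips_tile_count(order), round(hips_pixel_size_arcsec(order), 1))
```

Because every tile splits into four children at the next order, a viewer only needs to fetch the tiles inside the current field of view at the current zoom level, which is what keeps a very large representation explorable.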