The GalaxiesML dataset is designed for machine learning applications in astrophysics. It includes 286,401 galaxy images from the Hyper-Suprime-Cam (HSC) Survey PDR2 across five filters: g, r, i, z, y. Spectroscopically confirmed redshifts serve as ground truth, making the dataset ideal for tasks such as redshift estimation and galaxy morphology classification. This dataset supports upcoming large-scale surveys like LSST and Euclid.
Examples of 64x64 pixel galaxy images from GalaxiesML in the five g, r, i, z, y filters. Images at 127x127 pixels are also available.
The dataset is split into 127x127 pixel and 64x64 pixel image resolutions. The following files are available for each resolution:
- 5x127x127_training_with_morphology.hdf5: Training set with images and morphological parameters.
- 5x127x127_training_with_morphology.csv: Metadata and morphology for the training set.
- 5x127x127_validation_with_morphology.hdf5: Validation set with images and morphological parameters.
- 5x127x127_validation_with_morphology.csv: Metadata and morphology for the validation set.
- 5x127x127_testing_with_morphology.hdf5: Testing set with images and morphological parameters.
- 5x127x127_testing_with_morphology.csv: Metadata and morphology for the testing set.
- 5x64x64_training_with_morphology.hdf5: Training set with images and morphological parameters.
- 5x64x64_validation_with_morphology.hdf5: Validation set with images and morphological parameters.
- 5x64x64_testing_with_morphology.hdf5: Testing set with images and morphological parameters.

The dataset includes several galaxy morphology parameters, such as the Sérsic index, half-light radius, ellipticity, and Petrosian radius.
Additional descriptions of the galaxy tabular data are provided below: Tabular Data Description.
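The file names listed above follow a regular pattern, so they can be built from a resolution and a split name. The sketch below does that and, as a second step, lists the arrays stored in an HDF5 file. It assumes the third-party `h5py` package is installed; the helper names `galaxiesml_filename` and `summarize_hdf5` are hypothetical and not part of the dataset. The names of the arrays inside each file are not given in this text, so the sketch discovers them rather than assuming them.

```python
SPLITS = ("training", "validation", "testing")
RESOLUTIONS = (64, 127)


def galaxiesml_filename(resolution, split, ext="hdf5"):
    """Build a file name following the pattern used in the list above."""
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}")
    if split not in SPLITS:
        raise ValueError(f"split must be one of {SPLITS}")
    if ext not in ("hdf5", "csv"):
        raise ValueError("ext must be 'hdf5' or 'csv'")
    return f"5x{resolution}x{resolution}_{split}_with_morphology.{ext}"


def summarize_hdf5(path):
    """Return {dataset_name: (shape, dtype)} for every dataset in an HDF5 file."""
    import h5py  # third-party; imported lazily so the name helper works without it

    summary = {}
    with h5py.File(path, "r") as f:
        def visit(name, obj):
            # Groups are skipped; only array-like datasets are recorded.
            if isinstance(obj, h5py.Dataset):
                summary[name] = (obj.shape, str(obj.dtype))
        f.visititems(visit)
    return summary


print(galaxiesml_filename(127, "training"))
print(galaxiesml_filename(64, "testing"))
```

Once a file has been downloaded, calling `summarize_hdf5(galaxiesml_filename(64, "training"))` from the download directory should show which arrays hold the images and which hold the tabular parameters, along with their shapes.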
The GalaxiesML dataset is a valuable resource for both astrophysicists and data scientists. It provides a vast collection of galaxy images along with detailed photometric data and precise redshift measurements. Whether you're working on galaxy classification, redshift estimation, or other machine learning applications in astrophysics, this dataset offers the comprehensive data you need to drive your research forward. The dataset is available on Zenodo and can be accessed using the link below.
Access Dataset

If you use this dataset in your work, please cite the following references:
Explore our Convolutional Neural Network (CNN) example on GitHub to see how to leverage the GalaxiesML dataset for galaxy classification and redshift estimation tasks. This example provides a comprehensive codebase to help better understand machine learning applications in astrophysics.
View CNN Example on GitHub

Our latest research paper uses the GalaxiesML dataset to introduce a method based on Denoising Diffusion Probabilistic Models (DDPM), conditioned on redshift, to generate realistic galaxy images. We demonstrate that DDPM effectively captures the physical characteristics and evolutionary changes of galaxies, enhancing our understanding of cosmic phenomena through machine learning.
Download Research Paper

We provide three Jupyter notebooks to help you explore our generative models for galaxy images. The Generation notebook is available as an interactive Colab notebook, while the Training and Evaluation notebooks are provided as downloadable references.
Explore how we trained our Denoising Diffusion Probabilistic Model (DDPM) on the GalaxiesML dataset. This notebook details the training process and model architecture.
Download Training Notebook

Generate your own galaxy images using our pre-trained model! This interactive Colab notebook allows you to create new galaxy images by specifying redshift values and other parameters.
Open in Colab ↗

Review our model evaluation process through various metrics and visualizations. Understand how we assessed the quality of generated images against real galaxy distributions.
Download Evaluation Notebook

Download the required Python scripts and utilities needed to run the notebooks. Includes modules for data management, model architecture, and evaluation tools.
Download Tools

| Column Name | Units | Description |
|---|---|---|
| object_id | | Object ID from the HSC survey (unique 64-bit integer) |
| coord | (deg, deg, deg) | Coordinate used in coneSearch(coord, RA, DEC, RADIUS) |
| ra | deg | RA (J2000.0) of the image center |
| dec | deg | DEC (J2000.0) of the image center |
| {band}_cmodel_mag | mag | Magnitude of the central galaxy in filter {band} |
| {band}_cmodel_magsigma | mag | Uncertainty in the magnitude in filter {band} |
| skymap_id | | Location of the galaxy in internal survey position definition (tract, patch) |
| specz_name | | Name(s) of the galaxy in the spectroscopic survey(s) |
| specz_flag_homogeneous | | Homogenized spec-z flag (TRUE=secure, FALSE=insecure) |
| specz_mag_i | mag | i-band magnitude of the galaxy in the spectroscopic survey |
| specz_ra | deg | RA (J2000.0) of the galaxy in the spectroscopic survey |
| specz_dec | deg | DEC (J2000.0) of the galaxy in the spectroscopic survey |
| specz_redshift | | Spectroscopic redshift |
| specz_redshift_err | | Spectroscopic redshift uncertainty |
| {band}_central_image_pop_15px_rad | | Photometry within a 15-pixel radius in filter {band} |
| {band}_central_image_pop_10px_rad | | Photometry within a 10-pixel radius in filter {band} |
| {band}_central_image_pop_5px_rad | | Photometry within a 5-pixel radius in filter {band} |
| {band}_ellipticity | | Ellipticity of the object in filter {band} |
| {band}_half_light_radius | pixels | Radius containing 50% of the total flux in filter {band} |
| {band}_isophotal_area | pixels² | Isophotal area of the object in filter {band} |
| {band}_major_axis | pixels | Major axis of the detected object in filter {band} |
| {band}_minor_axis | pixels | Minor axis of the detected object in filter {band} |
| {band}_peak_surface_brightness | mag/arcsec² | Peak surface brightness in filter {band} |
| {band}_petro_rad | pixels | Petrosian radius in filter {band} |
| {band}_pos_angle | deg | Position angle of the object in filter {band} |
| {band}_sersic_index | | Sérsic index in filter {band} |
| {band}_total_galaxies | | Total number of galaxies detected in filter {band} |
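As an illustration of how the columns above might be used, the following sketch reads one of the CSV metadata files with Python's standard `csv` module, keeps only rows whose homogenized spec-z flag is secure, and computes a g − r color from the cmodel magnitudes. It assumes that `{band}` expands to the single-letter filter names (so the columns are `g_cmodel_mag` and `r_cmodel_mag`) and that the flag is stored as the text TRUE or FALSE; both are assumptions to check against the actual files. The function name `secure_g_minus_r` is hypothetical, and the sample rows are made up.

```python
import csv
import io


def secure_g_minus_r(csv_file):
    """Return object_id, redshift, and g-r color for rows with a secure spec-z flag."""
    selected = []
    for row in csv.DictReader(csv_file):
        # The table documents the flag as TRUE=secure. The comparison is
        # case-insensitive because the exact text encoding is an assumption.
        if row["specz_flag_homogeneous"].strip().lower() != "true":
            continue
        selected.append({
            "object_id": int(row["object_id"]),
            "specz_redshift": float(row["specz_redshift"]),
            # Assumes {band} expands to the filter letters g and r.
            "g_minus_r": float(row["g_cmodel_mag"]) - float(row["r_cmodel_mag"]),
        })
    return selected


# Tiny synthetic example (made-up values, not real catalog rows).
sample = io.StringIO(
    "object_id,specz_redshift,specz_flag_homogeneous,g_cmodel_mag,r_cmodel_mag\n"
    "1001,0.5,TRUE,21.5,20.75\n"
    "1002,1.25,FALSE,23.0,22.5\n"
)
for entry in secure_g_minus_r(sample):
    print(entry)
```

For a downloaded file, the same function can be applied to an open file handle, for example `open("5x127x127_training_with_morphology.csv", newline="")`.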
This dataset is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).