Overview

The GalaxiesML dataset is designed for machine learning applications in astrophysics. It includes 286,401 galaxy images from the Hyper Suprime-Cam (HSC) Survey Public Data Release 2 (PDR2) in five filters: g, r, i, z, y. Spectroscopically confirmed redshifts serve as ground truth, which makes the dataset well suited to tasks such as redshift estimation and galaxy morphology classification. It is intended to support work for large-scale surveys such as LSST and Euclid.

[Figure: HSC Survey field map. Source: Hyper Suprime-Cam Team, https://hsc.mtk.nao.ac.jp/ssp/survey/#survey_fields]

Features

  • 286,401 galaxy images in five photometric bands (g, r, i, z, y).
  • Spectroscopic redshifts for each galaxy, with redshift values ranging from 0.01 to 4.
  • Morphological parameters such as Sérsic index, half-light radius, and ellipticity.
  • Data provided in HDF5 format, along with CSV metadata.
  • Training, validation, and test splits available for machine learning applications.
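
Because the images are distributed as HDF5 files, a natural first step is to inspect what a file contains. The following is a minimal sketch, assuming the h5py package is installed. The file name in the usage comment is hypothetical, and no dataset names inside the file are assumed: the function only reports whatever it finds.

```python
import h5py


def summarize_hdf5(path):
    """Return {dataset_name: (shape, dtype)} for every dataset in an HDF5 file."""
    summary = {}

    def record(name, obj):
        # visititems calls this for every group and dataset; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    with h5py.File(path, "r") as f:
        f.visititems(record)
    return summary


# Usage (hypothetical file name; substitute a file downloaded from Zenodo):
# for name, (shape, dtype) in summarize_hdf5("galaxiesml_64x64.hdf5").items():
#     print(name, shape, dtype)
```

With five bands and 64x64 or 127x127 pixels, the image array should carry those dimensions, but the axis order and dataset names are best confirmed from the file itself rather than assumed.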

Galaxy Images

Examples of 64x64 pixel galaxy images from GalaxiesML in the five g, r, i, z, y filters. Images at 127x127 pixels are also available.

Data Files

The dataset is split into 127x127 pixel and 64x64 pixel image resolutions. The following files are available for each resolution:

127x127 Pixel Image Files

64x64 Pixel Image Files

Galaxy Morphology Parameters

The dataset includes several galaxy morphology parameters, such as the Sérsic index, half-light radius, and ellipticity.

Additional descriptions of the galaxy tabular data are provided in the Tabular Data section below.

Download the Dataset

GalaxiesML provides galaxy images together with photometric data and spectroscopic redshift measurements, for use in galaxy classification, redshift estimation, and other machine learning applications in astrophysics. The dataset is available on Zenodo and can be accessed using the link below.

Access Dataset

Citations

If you use this dataset in your work, please cite the following references:

CNN Example

Explore our Convolutional Neural Network (CNN) example on GitHub to see how to use the GalaxiesML dataset for galaxy classification and redshift estimation tasks. The example provides a complete codebase to help you better understand machine learning applications in astrophysics.

View CNN Example on GitHub

Generative Model of Galaxies

Our latest research paper uses the GalaxiesML dataset to introduce a novel method in which Denoising Diffusion Probabilistic Models (DDPMs) conditioned on redshift generate realistic galaxy images. We demonstrate that DDPMs effectively capture the physical characteristics and evolutionary changes of galaxies, enhancing our understanding of cosmic phenomena through machine learning.
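
For readers new to diffusion models, the sketch below illustrates the standard DDPM forward (noising) step, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps. It is a generic textbook illustration, not the code or settings from the paper: the linear beta schedule, its default values, and all function names are assumptions chosen for the example. In a redshift-conditioned model, the denoising network would additionally receive the redshift as an input; that part is not shown here.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (common defaults, assumed here)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """abar_t = product of (1 - beta_s) for s <= t."""
    alpha_bars = []
    product = 1.0
    for beta in betas:
        product *= 1.0 - beta
        alpha_bars.append(product)
    return alpha_bars


def add_noise(pixels, t, alpha_bars, rng):
    """Sample x_t from x_0 in one step; returns (noisy pixels, noise used)."""
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    noise = [rng.gauss(0.0, 1.0) for _ in pixels]
    noisy = [signal_scale * x + noise_scale * e for x, e in zip(pixels, noise)]
    return noisy, noise


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
pixels = [0.0, 0.5, 1.0]  # stand-in for flattened image pixel values
noisy, noise = add_noise(pixels, t=999, alpha_bars=alpha_bars, rng=rng)
print(alpha_bars[0], alpha_bars[-1])  # signal fraction at first and last step
```

During training, a network learns to predict `noise` from `noisy` and `t`; generation then runs the process in reverse, starting from pure noise.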

Download Research Paper

We provide three Jupyter notebooks to help you explore our generative models for galaxy images. The Generation notebook is available as an interactive Colab notebook, while the Training and Evaluation notebooks are provided as downloadable references.

Training Notebook

Explore how we trained our Denoising Diffusion Probabilistic Model (DDPM) on the GalaxiesML dataset. This notebook details the training process and model architecture.

Download Training Notebook

Generate Notebook

Generate your own galaxy images using our pre-trained model! This interactive Colab notebook allows you to create new galaxy images by specifying redshift values and other parameters.

Open in Colab ↗

Evaluate Notebook

Review our model evaluation process through various metrics and visualizations. Understand how we assessed the quality of generated images against real galaxy distributions.

Download Evaluation Notebook

Support Tools

Download the Python scripts and utilities needed to run the notebooks. These include modules for data management, model architecture, and evaluation.

Download Tools

Interactive Generation: The Generate notebook is set up in Google Colab with our pre-trained model, allowing you to create new galaxy images instantly. The Training and Evaluation notebooks and tools are provided as downloads for reference and documentation purposes.

NOTE: You must be signed in to a Google account, and the Colab extension must be installed for that account in order to access Colab.

Tip: Click the "Open With" drop-down menu, search for "Colab", and click "Install". Then refresh the page and select "Open with Google Colaboratory".

Papers using GalaxiesML

Tabular Data

Column Name | Units | Description
object_id | - | Object ID from the HSC survey. Unique ID in 64-bit integer
coord | (deg, deg, deg) | Coordinate used in coneSearch(coord, RA, DEC, RADIUS)
ra | deg | RA (J2000.0) of the image center
dec | deg | DEC (J2000.0) of the image center
{band}_cmodel_mag | mag | Magnitude of the central galaxy in filter {band}
{band}_cmodel_magsigma | mag | Uncertainty in the magnitude in filter {band}
skymap_id | - | Location of the galaxy in internal survey position definition (tract, patch)
specz_name | - | Name(s) of the galaxy in the spectroscopic survey(s)
specz_flag_homogeneous | - | Homogenized spec-z flag (TRUE=secure, FALSE=insecure)
specz_mag_i | mag | i-band magnitude of the galaxy in the spectroscopic survey
specz_ra | deg | RA (J2000.0) of galaxy in spectroscopic survey
specz_dec | deg | DEC (J2000.0) of galaxy in spectroscopic survey
specz_redshift | - | Spectroscopic redshift
specz_redshift_err | - | Spectroscopic redshift uncertainty
{band}_central_image_pol_15px_rad | - | Photometry within a 15-pixel radius in filter {band}
{band}_central_image_pop_10px_rad | - | Photometry within a 10-pixel radius in filter {band}
{band}_central_image_pop_5px_rad | - | Photometry within a 5-pixel radius in filter {band}
{band}_ellipticity | - | Ellipticity of the object in filter {band}
{band}_half_light_radius | pixels | Radius containing 50% of the total flux in filter {band}
{band}_isophotal_area | pixels² | Isophotal area of the object in filter {band}
{band}_major_axis | pixels | Major axis of the detected object in filter {band}
{band}_minor_axis | pixels | Minor axis of the detected object in filter {band}
{band}_peak_surface_brightness | mag/sq. arcsec | Peak surface brightness in filter {band}
{band}_petro_rad | pixels | Petrosian radius in filter {band}
{band}_pos_angle | deg | Position angle of the object in filter {band}
{band}_sersic_index | - | Sérsic index in filter {band}
{band}_total_galaxies | - | Total number of galaxies detected in filter {band}
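
The `{band}` placeholder in the column names stands for each of the five filters. The sketch below shows one way to expand such templates into concrete column names and to keep only rows with a secure spectroscopic redshift. It uses only the Python standard library; the helper names are made up for this example, the sample rows are invented stand-ins for the real CSV metadata, and it assumes the flag is stored as the text TRUE or FALSE (compared case-insensitively).

```python
import csv
import io

BANDS = ("g", "r", "i", "z", "y")


def expand_band_columns(template):
    """Expand a '{band}_...' template into one column name per filter."""
    return [template.format(band=b) for b in BANDS]


def is_secure(row):
    """True when the homogenized spec-z flag marks the redshift as secure."""
    return row["specz_flag_homogeneous"].strip().lower() == "true"


def select_secure(rows, z_min=0.01, z_max=4.0):
    """Keep rows with a secure spec-z inside [z_min, z_max]."""
    return [
        r for r in rows
        if is_secure(r) and z_min <= float(r["specz_redshift"]) <= z_max
    ]


# Tiny in-memory stand-in for the CSV metadata (values are made up).
sample = io.StringIO(
    "object_id,specz_redshift,specz_flag_homogeneous,g_cmodel_mag,r_cmodel_mag\n"
    "1,0.52,TRUE,22.1,21.3\n"
    "2,1.80,FALSE,24.0,23.5\n"
    "3,0.005,TRUE,18.2,17.9\n"
)
rows = list(csv.DictReader(sample))
secure = select_secure(rows)
print([r["object_id"] for r in secure])  # ['1']
print(expand_band_columns("{band}_sersic_index"))
```

Reading `object_id` as text, as `csv.DictReader` does, avoids the precision loss that can occur if 64-bit integer IDs are parsed as floating-point numbers.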

License

This dataset is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).