Entagma Tutorial: Importing Satellite Data into Houdini

I recorded this tutorial for Entagma. I hope you like it! There is a text version below where I go a bit more in-depth about the maths.

Importing Satellite Data into Houdini

Most research about space and astronomy is run by governments which, usually, make their research results and data available to the public. The two big space agencies, NASA and ESA, provide a huge amount of data through various internet portals.

In this blog post, I want to show you how you can bring such data into Houdini and make sense of it. There’s a little bit of maths involved, some VEX and a good amount of Python in- and outside of Houdini. This is going to be a long one, so go get a drink before you start.

You can download the Houdini scene and .bgeo file at the bottom of this page.

(Rendered image, @$2560 \times 1440 \space px$@)

The Data Source

Introduction

The European Space Agency (ESA) launched the GAIA Satellite in 2013. By their own words, Gaia is

[…] an ambitious mission to chart a three-dimensional map of our Galaxy, […] will provide unprecedented positional and radial velocity measurements […]

Sounds like fun! And as it turns out, getting the data is incredibly easy. The ESA released the first dataset recorded by GAIA in September 2016: The GAIA DR1. The dataset can be downloaded from the GAIA Archive. There are two sources available:

  • gaia_source/ – The full GAIA data, with a whopping 510GB¹, containing over 1.142 billion objects.
  • tgas_source/ – A subset of the GAIA data that is cross-matched and verified against the Tycho-2 Catalogue, with a volume of “merely” 1.6GB¹ and about 2.5 million stars.

¹ Data volume of the uncompressed CSV-formatted distribution. The GZip compression rate on the datasets is around 60%, so the actual download size is only ~40% of the uncompressed data size.

I chose to go with the TGAS dataset for this blog post. Even with a fast internet connection and computer, downloading ~200GB (the compressed size of the GAIA dataset) and processing 500GB is not that convenient when you’re just beginning to experiment! The TGAS dataset is thus much easier to get started with.

Downloading the TGAS dataset

Also check out my “NiklasRosenstein/dtools” repository, which already contains a script to download this data and more (such as data from the NASA Exoplanet Archive).

The TGAS dataset consists of only 16 parts, so you could easily download them manually through your browser. However, if you wanted to get the full GAIA dataset, you’d be sitting there for quite a while downloading all 5231 parts.

Writing a Python script to download all the files is incredibly easy. The TGAS dataset can be downloaded in CSV format from the following URL, where {PART} needs to be replaced by a value between 000 and 015.

http://cdn.gea.esac.esa.int/Gaia/tgas_source/csv/TgasSource_000-000-{PART}.csv.gz

Below is a script that retrieves all these files and saves them to the current directory. Make sure you have the requests library installed before you run the script.

get-tgas.py

from urllib.parse import urlparse  # Python 2: from urlparse import urlparse
import os
import posixpath
import requests

url_template = 'http://cdn.gea.esac.esa.int/Gaia/tgas_source/csv/TgasSource_000-000-{:03d}.csv.gz'

for i in range(16):
  url = url_template.format(i)
  filename = posixpath.basename(urlparse(url).path)
  if os.path.isfile(filename):
    # We already downloaded the file.
    print('Skipped "{}"'.format(filename))
    continue

  # Request the contents of the file.
  response = requests.get(url, stream=True)
  response.raise_for_status()

  print('Downloading "{}"...'.format(filename))
  try:
    with open(filename, 'wb') as fp:
      for chunk in response.iter_content(1024):
        fp.write(chunk)
  except:
    # If anything bad happened (or the program was interrupted), we don't
    # want an incomplete file lying around.
    try: os.remove(filename)
    except OSError as exc: print(exc)
    raise

$ pip3 install --user requests
$ mkdir ~/Desktop/TGAS && cd ~/Desktop/TGAS
$ python3 ./get-tgas.py
Downloading "TgasSource_000-000-000.csv.gz"...
Downloading "TgasSource_000-000-001.csv.gz"...
[...]

Note that the data is GZip compressed. We could decompress the data directly after the download, but that would leave us with a lot more space used on our hard drive. It might not matter for the TGAS dataset, but it will likely matter for the full GAIA dataset. Additionally, the time required to decompress the data on the fly is almost negated by the reduced IO load (60% less data read from the hard drive).

A quick tour into astronomy

Before we start extracting the data and pouring it into Houdini, we should take a moment to think about what data we actually need. Looking at the TGAS source documentation, we can see that there’s a lot of data available for every object (59 columns, to be precise). After spending some time reading the page and looking up the meaning of certain terms, I found that the columns most interesting to us are L (galactic longitude), B (galactic latitude), PARALLAX and PHOT_G_MEAN_MAG. I’ve included links to pages that explain the meaning of this data.

You should at least give these pages a quick glimpse; they are very interesting and incredibly helpful for understanding what we’re dealing with!

TL;DR We can use the PARALLAX to compute the distance of the star. The photometric magnitude PHOT_G_MEAN_MAG gives us the brightness of the star. And the latitude B and longitude L can be used to calculate the coordinates on the unit sphere in a cartesian coordinate system!
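
For reference, these are the relations we will end up with (each one is derived in detail in the sections below):

$$ d_{pc} = \frac{1000}{p_{mas}}, \qquad \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \cos b \cdot \cos l \\ \cos b \cdot \sin l \\ \sin b \end{pmatrix}, \qquad \frac{I_1}{I_{ref}} = 10^{-0.4 \cdot (m_1 - m_{ref})} $$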

Getting the data into Houdini

Introduction

When it comes to processing data, a very important step is to prepare the data and turn it into a format that makes it easier to work with. While reading the CSV data sure is easy, it is very slow (as you will notice yourself in just a bit).

Houdini’s binary geometry format (.bgeo) comes to the rescue! Instead of reading the CSV data and working on the result directly, we can export the Houdini geometry with all its point attributes, leaving us with a much smaller file that contains only the data we’re interested in and is much faster to read (by a factor of thousands!).

There is already a “Table Import” node in Houdini, but it cannot read GZip-compressed CSV files, let alone read multiple files and merge them.

Writing a Python SOP node

Create a Python SOP node and add the following two parameters to it:

  • maxrows – Integer, the maximum number of rows to read.
  • directory – File, the path to the directory that contains all our .csv.gz files.

The “maxrows” parameter is very useful during development and debugging of the node, since you won’t have to wait for all the data; a small portion is enough to verify that your code is working.

This is the default code in every Python SOP node.

node = hou.pwd()
geo = node.geometry()

# Add code to modify contents of geo.
# Use drop down menu to select examples.

We’ll need to do some filesystem operations, read CSV data and decompress with GZip, so let’s import all that stuff.

import csv
import gzip
import os

Let’s now declare which columns we are interested in. We’ll use that later to make it easier to extract the data from every row. Also, we should add attributes to the Houdini geometry so we can actually save the data. We’re going to save every row in the CSV data as a point on the geometry. Keep in mind that every point on a geometry already has at least three floating point attributes: P.x, P.y, P.z. If we ignore them, they would all be set to 0 and we’d waste a lot of memory. Instead, we will assign the values of the first three columns as the point position.

COLUMNS = 'l,b,parallax,phot_g_mean_mag'.split(',')
# Skip the first three columns; they will be fed into the point position.
for col in COLUMNS[3:]:
  geo.addAttrib(hou.attribType.Point, col, 0.0)

Moving on to the bigger picture, this is what our “main” procedure will look like:

  1. Read the node’s parameter values
  2. List all the files in the specified directory that are eligible to parse
  3. Parse the files until we’ve reached the maximum number of rows or there are no more files to be parsed

Since this will be a long-running operation, we should use the hou.InterruptableOperation class, which not only allows us to display a progress bar during the node’s cooking, but also lets us pause and interrupt the cooking!

def main():
  maxrows = node.parm('maxrows').eval()
  directory = node.parm('directory').eval()

  files = []
  for name in os.listdir(directory):
    if name.endswith('.csv.gz'):
      files.append(os.path.join(directory, name))
  files.sort()

  count = 0
  with hou.InterruptableOperation('Reading CSV', 'Reading Files', True) as op:
    for fi, filename in enumerate(files):
      # max() guards against a division by zero if only one file is found.
      op.updateLongProgress(fi / float(max(len(files) - 1, 1)), os.path.basename(filename))
      if count >= maxrows:
        break
      # Note: under Python 3, open with gzip.open(filename, 'rt') so csv.reader receives text.
      with gzip.open(filename) as fp:
        count += read_csv(fp, maxrows - count)

# Put this at the very end of your node's code.
main()

Moving on to our final step: the read_csv() function. It takes as input an open file and the maximum number of rows it should read. We quickly wrap the file in a csv.reader object, read the CSV header to figure out the column indices and then fill the geometry.

def read_csv(fp, maxrows):
  reader = csv.reader(fp)
  header = {name: index for index, name in enumerate(next(reader))}
  col_indices = [header[name] for name in COLUMNS]

  count = 0
  for row in reader:
    if count >= maxrows:
      break
    if len(row) != len(header):
      # There's an empty line after the header in the data. We also don't
      # want to be screwed over by lines that have less columns than the
      # header specifies.
      continue
    
    count += 1
    point = geo.createPoint()
    data = [float(row[index]) for index in col_indices]

    # Use the first three values as point position.
    point.setPosition(hou.Vector3(*data[:3]))

    # Transfer the rest to geometry attributes.
    for name, value in zip(COLUMNS[3:], data[3:]):
      point.setAttribValue(name, value)
  
  return count

Complete Python SOP code
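
For reference, here is the full code of the node, assembled from the snippets above:

node = hou.pwd()
geo = node.geometry()

import csv
import gzip
import os

COLUMNS = 'l,b,parallax,phot_g_mean_mag'.split(',')
# Skip the first three columns; they will be fed into the point position.
for col in COLUMNS[3:]:
  geo.addAttrib(hou.attribType.Point, col, 0.0)

def read_csv(fp, maxrows):
  reader = csv.reader(fp)
  header = {name: index for index, name in enumerate(next(reader))}
  col_indices = [header[name] for name in COLUMNS]

  count = 0
  for row in reader:
    if count >= maxrows:
      break
    if len(row) != len(header):
      # Skip the empty line after the header and any malformed rows.
      continue

    count += 1
    point = geo.createPoint()
    data = [float(row[index]) for index in col_indices]

    # Use the first three values as point position.
    point.setPosition(hou.Vector3(*data[:3]))

    # Transfer the rest to geometry attributes.
    for name, value in zip(COLUMNS[3:], data[3:]):
      point.setAttribValue(name, value)

  return count

def main():
  maxrows = node.parm('maxrows').eval()
  directory = node.parm('directory').eval()

  files = []
  for name in os.listdir(directory):
    if name.endswith('.csv.gz'):
      files.append(os.path.join(directory, name))
  files.sort()

  count = 0
  with hou.InterruptableOperation('Reading CSV', 'Reading Files', True) as op:
    for fi, filename in enumerate(files):
      # max() guards against a division by zero if only one file is found.
      op.updateLongProgress(fi / float(max(len(files) - 1, 1)), os.path.basename(filename))
      if count >= maxrows:
        break
      # Note: under Python 3, open with gzip.open(filename, 'rt') so csv.reader receives text.
      with gzip.open(filename) as fp:
        count += read_csv(fp, maxrows - count)

main()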

After you’ve finished the code in the Python SOP, raise the “maxrows” parameter and point the “directory” parameter to the folder where the downloaded TGAS files reside.

First results

Since we fill the point positions, there’s already something we can look at in the viewport! As it happens, we mapped the longitude, latitude and parallax to the X, Y and Z coordinates respectively. Due to the large point size in the viewport, it may look more like a random point cloud than our nearest 2.5 million stars. Give it a quick render and you’ll see it much better.

2,057,060 points, cooked in 1m 5s.

Export the geometry

As I described earlier, exporting the geometry to a .bgeo file has the advantage that loading the data from disk becomes dramatically faster. Unless we need to read additional columns from the table, we will not need to use the Python SOP again.
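
The quickest way to do this is to append a “ROP Geometry Output” node (or a “File” node in write mode) after the Python SOP, write out a .bgeo file, and load that with a “File” node from then on. The same can be done from the Houdini Python Shell; a minimal sketch, assuming the Python SOP lives at /obj/tgas/import_tgas (adjust the path to your scene):

import hou

# Grab the cooked geometry from the Python SOP and write it next to the .hip file.
geo = hou.node('/obj/tgas/import_tgas').geometry()
geo.saveToFile(hou.expandString('$HIP/tgas.bgeo'))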

A note about the GAIA dataset

You will need to update the node to catch possible ValueError exceptions being raised when attempting to convert the row data to numbers with the float() function. For example, the PARALLAX column will be empty for most rows in the GAIA dataset.
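
A minimal sketch of how the loop in read_csv() could be adapted; the conversion is moved before the point is created so that a bad row does not leave an empty point behind:

  for row in reader:
    if count >= maxrows:
      break
    if len(row) != len(header):
      continue

    try:
      data = [float(row[index]) for index in col_indices]
    except ValueError:
      # Skip rows with empty or unparsable values, e.g. a missing PARALLAX.
      continue

    count += 1
    point = geo.createPoint()
    point.setPosition(hou.Vector3(*data[:3]))
    for name, value in zip(COLUMNS[3:], data[3:]):
      point.setAttribValue(name, value)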

Converting the data

Here I will demonstrate how to calculate data that we can visualize meaningfully from what the TGAS dataset gives us. Depending on what and how we want to visualize, only some of these steps are necessary. Assume we use physically correct light sources with inverse-square falloff, which appear less bright the further they are from the observer:

  • When using the absolute magnitude, we must actually place the light sources at the correct distance from the observer.
  • When using the apparent magnitude, we must place the light sources on the unit sphere (representing 10 parsecs) to observe the correct brightness.

Below are sections that you can click to expand to get an explanation of how to interpret and convert the data.

1. Convert spherical to cartesian coordinates

This is pretty straightforward. Add a new “Attribute Wrangle” to the output of the “File” node and insert the VEX code below. There’s not much more to say about transforming spherical coordinates, other than “that’s how it works”.

// We saved the longitude and latitude in the geometry's point positions.
float lon = radians(@P.x);
float lat = radians(@P.y);

// Transform spherical coordinates to coordinates on the unit-sphere.
float x = cos(lat) * cos(lon);
float y = cos(lat) * sin(lon);
float z = sin(lat);

@P = set(x, y, z);

2. Convert parallax to distance

Remember that we stored the parallax in the P.z attribute, so we should compute the distance in the previously created “Attribute Wrangle”, because we will lose the parallax information once we write the new point position.

The TGAS source documentation states that the parallax is stored in “mas”, which stands for “milliarcseconds”, a unit of angle. A quick glance at the Wikipedia article on Stellar Parallax tells us that there is a very simple relation between the distance (in parsecs) and the parallax (in arcseconds):

$$ d_{pc} = 1/p_{as} $$

However, it is important to know the units that we’re dealing with! To be accurate, we have to remember that we’re dealing with milliarcseconds, so our relation is actually

$$ d_{pc} = 1000/p_{mas} $$

This would lead to very large numbers that are inconvenient to work with in Houdini. I’ve found that dividing all distances by @$1000$@ gives a reasonably sized point cloud. Now, this may look like the very same formula that we started with, but it is not: the distance we get is no longer in parsecs!

$$ \begin{align} d_{Hou} &= \frac{10^3/p_{mas}}{10^3} \\[8pt]
&= 1/p_{mas} \end{align} $$
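
For example (numbers picked purely for illustration): a star with a parallax of @$10 \space mas$@ lies at a distance of @$100$@ parsecs, but ends up only @$0.1$@ units away from the origin in Houdini:

$$ d_{pc} = \frac{1000}{10} = 100, \qquad d_{Hou} = \frac{1}{10} = 0.1 $$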

Modify the code from 1. Convert spherical to cartesian coordinates:

// We saved the longitude and latitude in the geometry's point positions.
float lon = radians(@P.x);
float lat = radians(@P.y);

// Transform spherical coordinates to coordinates on the unit-sphere.
float x = cos(lat) * cos(lon);
float y = cos(lat) * sin(lon);
float z = sin(lat);

float d = 1.0 / @P.z;
@P = set(x, y, z) * d;

Image: Cartesian Coordinates with Distance (Houdini Viewport)

3. Apparent or absolute magnitude?

Wikipedia has a good article about Magnitude in astronomy.

[…] The Sun has an apparent magnitude of −27, the full moon −13, the brightest planet Venus measures −5, and Sirius, the brightest visible star in the night sky, is at −1.5. […]

However, there is a difference between absolute and apparent magnitude, and the TGAS source documentation doesn’t state which of the two the PHOT_G_MEAN_MAG column is. We can figure that out easily, though. Just think about it: among the 2.5 million nearest stars, there must be a number of stars that are just as bright as or even brighter than the sun.

Do we have such low magnitude values in our dataset? Check the Geometry Spreadsheet.
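
If you prefer, you can also check this from the Houdini Python Shell instead of scrolling through the spreadsheet. A minimal sketch, assuming the node you want to inspect lives at /obj/tgas/OUT (adjust the path):

import hou

# Smallest apparent magnitude = brightest star in the dataset.
geo = hou.node('/obj/tgas/OUT').geometry()
print(min(geo.pointFloatAttribValues('phot_g_mean_mag')))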

We don’t! So this must be apparent magnitude.

4. Convert apparent to absolute magnitude

The page Apparent and Absolute Magnitudes describes the relationship of absolute and apparent magnitudes as well as the relationship of distance and magnitude.

Absolute magnitude @$M_v$@ is the apparent magnitude the star would have if it were placed at a distance of 10 parsecs from the Earth.

If we know the distance @$d$@ (in parsec) and apparent magnitude @$m$@, we can calculate the absolute magnitude @$M_v$@ with the following formula:

$$ M_v = m - 2.5 \cdot log_{10}\left[\left(\frac{d}{10}\right)^2\right] $$

Make sure that you pass in the distance in parsec! You can alter the code of Section 1. Convert spherical to cartesian coordinates to save a new attribute f@d that is the distance in parsecs.

  @d = 1000.0 / @P.z;
  @P = set(x, y, z) * (@d / 1000.0);

Now you can compute the magnitude as

  float m = @phot_g_mean_mag;
  float M = m - 2.5 * log10(pow(@d / 10.0, 2));
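
As a quick sanity check of the formula: a star that actually sits at @$10$@ parsecs keeps its magnitude unchanged, since

$$ M_v = m - 2.5 \cdot log_{10}\left[\left(\frac{10}{10}\right)^2\right] = m - 2.5 \cdot log_{10}(1) = m $$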

Continue with section 5. Convert magnitude to brightness.

5. Convert magnitude to brightness

Wikipedia shows the following formula about the relationship of two magnitudes and two intensities:

$$ m_1 - m_{ref} = -2.5 \cdot log_{10}\left(\frac{I_1}{I_{ref}}\right) $$

The parameters in this formula are

  • @$m_1$@ – Our current star’s magnitude (= what we get from the TGAS dataset)
  • @$I_1$@ – Our current star’s intensity (= what we’re looking for)
  • @$m_{ref}$@ – A reference star’s magnitude
  • @$I_{ref}$@ – A reference star’s intensity

We can solve the equation for @$I_1$@ by hand or using software like WolframAlpha, Maple, Mathway or SymPy. I’ve prepared the different approaches below.

Note that SymPy and WolframAlpha give a solution that is far less elegant than what we get by solving for @$I_1$@ by hand.

Using WolframAlpha

Link

And the answer is (same as SymPy):

$$ I_1 = I_{ref} \cdot e ^ {0.921034 \cdot (m_{ref} - m_1)} $$

Using SymPy

  $ pip install --user sympy
  $ python
  Python 3.5.2 (v3.5.2:4def2a2901a5, Jun 25 2016, 22:01:18) [MSC v.1900 32 bit (Intel)] on win32
  Type "help", "copyright", "credits" or "license" for more information.
  >>> from sympy import *
  >>> m1, i1, mref, iref = symbols('m1 i1 mref iref')
  >>> eq = -2.5 * log(i1 / iref, 10) - (m1 - mref)
  >>> solve(eq, i1)
  [iref*exp(-0.921034037197618*m1 + 0.921034037197618*mref)]

And the answer is (same as WolframAlpha):

$$ I_1 = I_{ref} \cdot e ^ {0.921034037197618 \cdot (m_{ref} - m_1)} $$

Step-by-step Solution

Here’s a step-by-step solution of solving for @$I_1$@.

$$ \begin{align} m_1 - m_{ref} &= -2.5 \cdot log_{10}\left(\frac{I_1}{I_{ref}}\right) \\[8pt] \frac{m_1 - m_{ref}}{-2.5} &= log_{10}\left(\frac{I_1}{I_{ref}}\right) \\[8pt]
-0.4m_1 + 0.4m_{ref} &= log_{10}\left(\frac{I_1}{I_{ref}}\right) \\[8pt]
10^{-0.4m_1 + 0.4m_{ref}} &= \frac{I_1}{I_{ref}} \end{align} $$

And the final equation is:

$$ I_1 = I_{ref} \cdot 10^{-0.4m_1 + 0.4m_{ref}} $$
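
This matches the classical definition of the magnitude scale, where a difference of @$5$@ magnitudes corresponds to a factor of @$100$@ in intensity:

$$ m_{ref} - m_1 = 5 \quad \Rightarrow \quad \frac{I_1}{I_{ref}} = 10^{0.4 \cdot 5} = 100 $$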

Now keep in mind that there is no such thing as “absolute” or “100%” brightness, thus we cannot actually define an exact @$m_{ref}$@ and @$I_{ref}$@ that yield the only possible solution. Brightness is relative, so every result we get is correct and we should choose one that fits our needs.

float Iref = 1.0;
float mref = chf('refmag');
v@Cd = Iref * pow(10, -0.4 * @phot_g_mean_mag + 0.4 * mref);

Sources

This is a list of pages that I used to get an understanding of the data.

Download
