46 changes: 23 additions & 23 deletions episodes/03-transform.md
@@ -9,15 +9,15 @@ exercises: 5
- Select rows and columns from an Astropy `Table`.
- Use Matplotlib to make a scatter plot.
- Use Gala to transform coordinates.
- Make a Pandas `DataFrame` and use a Boolean `Series` to select rows.
- Make a pandas `DataFrame` and use a Boolean `Series` to select rows.
- Save a `DataFrame` in an HDF5 file.

::::::::::::::::::::::::::::::::::::::::::::::::::

:::::::::::::::::::::::::::::::::::::::: questions

- How do we make scatter plots in Matplotlib?
- How do we store data in a Pandas `DataFrame`?
- How do we store data in a pandas `DataFrame`?

::::::::::::::::::::::::::::::::::::::::::::::::::

@@ -40,7 +40,7 @@ analysis, identifying stars with the proper motion we expect for GD-1.
2. Then we will transform the coordinates and proper motion data from
ICRS back to the coordinate frame of GD-1.

3. We will put those results into a Pandas `DataFrame`.
3. We will put those results into a pandas `DataFrame`.


::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -471,7 +471,7 @@ We started with a rectangle in the GD-1 frame. When
transformed to the ICRS frame, it is a non-rectangular region. Now,
transformed back to the GD-1 frame, it is a rectangle again.

## Pandas DataFrame
## pandas DataFrame

At this point we have two objects containing different sets of the
data relating to identifying stars in GD-1. `polygon_results` is the Astropy `Table` we downloaded from Gaia.
@@ -563,33 +563,33 @@ We could have: `proper_motion` contains the same data as

::::::::::::::::::::::::::::::::::::::::: callout

## Pandas `DataFrame`s versus Astropy `Table`s
## pandas `DataFrame`s versus Astropy `Table`s

Two common choices are the Pandas `DataFrame` and Astropy `Table`.
Pandas `DataFrame`s and Astropy `Table`s share many of the same characteristics
Two common choices are the pandas `DataFrame` and Astropy `Table`.
pandas `DataFrame`s and Astropy `Table`s share many of the same characteristics
and most of the manipulations that we do can be done with either. As you become
more familiar with each, you will develop a sense of which one you prefer for
different tasks. For instance, you may choose to use Astropy `Table`s to read
in data, especially astronomy-specific data formats, but Pandas `DataFrame`s to
in data, especially astronomy-specific data formats, but pandas `DataFrame`s to
inspect the data. Fortunately, Astropy makes it easy to convert between the
two data types. We will choose to use Pandas `DataFrame`, for two reasons:
two data types. We will choose to use pandas `DataFrame`, for two reasons:

1. It provides capabilities that are (almost) a superset of the other data
structures, so it is the all-in-one solution.

2. Pandas is a general-purpose tool that is useful in many domains,
2. pandas is a general-purpose tool that is useful in many domains,
especially data science. If you are going to develop expertise in one
tool, Pandas is a good choice.
tool, pandas is a good choice.

However, compared to an Astropy `Table`, Pandas has one big drawback:
However, compared to an Astropy `Table`, pandas has one big drawback:
it does not keep the metadata associated with the table, including the
units for the columns. Nevertheless, we think it's a useful data type
to be familiar with.


::::::::::::::::::::::::::::::::::::::::::::::::::

It is straightforward to convert an Astropy `Table` to a Pandas `DataFrame`.
It is straightforward to convert an Astropy `Table` to a pandas `DataFrame`.

```python
import pandas as pd
@@ -642,7 +642,7 @@ and consolidate them into a single function that we can use to take the
coordinates and proper motion that we get as an Astropy `Table` from our
Gaia query, add columns representing the reflex-corrected
GD-1 coordinates and proper motions, and transform it into a
Pandas `DataFrame`.
pandas `DataFrame`.
This is a general function that we will use multiple times as we build different
queries so we want to write it once and then call the function rather than having
to copy and paste the code over and over again.
@@ -653,7 +653,7 @@ def make_dataframe(table):

table: Astropy Table

returns: Pandas DataFrame
returns: pandas DataFrame
"""
# Create a SkyCoord object with the coordinates and proper motions
# in the input table
@@ -696,7 +696,7 @@ results_df = make_dataframe(polygon_results)

At this point we have run a successful query and combined the results into a single `DataFrame`. This is a good time to save the data.

To save a Pandas `DataFrame`, one option is to convert it to an
To save a pandas `DataFrame`, one option is to convert it to an
Astropy `Table`, like this:

```python
@@ -713,7 +713,7 @@ astropy.table.table.Table
Then we could write the `Table` to a FITS file, as we did in the
previous lesson.

But, like Astropy, Pandas provides functions to write DataFrames in other formats; to
But, like Astropy, pandas provides functions to write DataFrames in other formats; to
see what they are [find the functions here that begin with
`to_`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html).

@@ -733,10 +733,10 @@ And HDF5 stores the metadata associated with the table, including
column names, row labels, and data types (like FITS).

Finally, HDF5 is a cross-language standard, so if you write an HDF5
file with Pandas, you can read it back with many other software tools
file with pandas, you can read it back with many other software tools
(more than FITS).

We can write a Pandas `DataFrame` to an HDF5 file like this:
We can write a pandas `DataFrame` to an HDF5 file like this:

```python
filename = 'gd1_data.hdf'
@@ -760,19 +760,19 @@ file if it already exists rather than append another dataset to it.
In this episode, we re-loaded the Gaia data we saved from a previous query.

We transformed the coordinates and proper motion from ICRS to a frame
aligned with the orbit of GD-1, stored the results in a Pandas
aligned with the orbit of GD-1, stored the results in a pandas
`DataFrame`, and visualized them.

We combined all of these steps into a single function that we can reuse in the future to go straight from the output of a query with object coordinates in the ICRS reference frame directly to a Pandas DataFrame that includes object coordinates in the GD-1 reference frame.
We combined all of these steps into a single function that we can reuse in the future to go straight from the output of a query with object coordinates in the ICRS reference frame directly to a pandas DataFrame that includes object coordinates in the GD-1 reference frame.

We saved our results to an HDF5 file which we can use to restart the analysis from this stage or verify our results at some future time.

:::::::::::::::::::::::::::::::::::::::: keypoints

- When you make a scatter plot, adjust the size of the markers and their transparency so the figure is not overplotted; otherwise it can misrepresent the data badly.
- For simple scatter plots in Matplotlib, `plot` is faster than `scatter`.
- An Astropy `Table` and a Pandas `DataFrame` are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.
- To store data from a Pandas `DataFrame`, a good option is an HDF5 file, which can contain multiple Datasets (we'll dig in more in the Join lesson).
- An Astropy `Table` and a pandas `DataFrame` are similar in many ways and they provide many of the same functions. They have pros and cons, but for many projects, either one would be a reasonable choice.
- To store data from a pandas `DataFrame`, a good option is an HDF5 file, which can contain multiple Datasets (we'll dig in more in the Join lesson).

::::::::::::::::::::::::::::::::::::::::::::::::::

18 changes: 9 additions & 9 deletions episodes/04-motion.md
@@ -1,12 +1,12 @@
---
title: Plotting and Pandas
title: Plotting and pandas
teaching: 50
exercises: 15
---

::::::::::::::::::::::::::::::::::::::: objectives

- Use a Boolean Pandas `Series` to select rows in a `DataFrame`.
- Use a Boolean pandas `Series` to select rows in a `DataFrame`.
- Save multiple `DataFrame`s in an HDF5 file.

::::::::::::::::::::::::::::::::::::::::::::::::::
@@ -30,7 +30,7 @@ analysis, identifying stars with the proper motion we expect for GD-1.

## Outline

1. We will put those results into a Pandas `DataFrame`, which we will use
1. We will put those results into a pandas `DataFrame`, which we will use
to select stars near the centerline of GD-1.

2. Plotting the proper motion of those stars, we will identify a region
@@ -88,7 +88,7 @@ results_df = pd.read_hdf(filename, 'results_df')

## Exploring data

One benefit of using Pandas is that it provides functions for
One benefit of using pandas is that it provides functions for
exploring the data and checking for problems.
One of the most useful of these functions is `describe`, which
computes summary statistics for each column.
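
As a rough illustration of what `describe` returns, here is a minimal sketch. The small `example_df` and its values are made up for this example and merely stand in for the lesson's `results_df`; only the column names `phi1` and `phi2` echo names used in the lesson.

```python
import pandas as pd

# Hypothetical stand-in for results_df; the values are invented for illustration.
example_df = pd.DataFrame({
    'phi1': [-60.0, -55.0, -50.0, -45.0],
    'phi2': [-1.0, 0.5, 0.0, 2.5],
})

# describe() returns a new DataFrame with one column per numeric input column
# and one row per statistic (count, mean, std, min, quartiles, max).
summary = example_df.describe()
print(summary)
```

Scanning the `count`, `min`, and `max` rows is a quick way to spot missing values or values outside the expected range.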
@@ -236,7 +236,7 @@ type(phi2)
pandas.core.series.Series
```

The result is a `Series`, which is the structure Pandas uses to
The result is a `Series`, which is the structure pandas uses to
represent columns.

We can use a comparison operator, `>`, to compare the values in a
@@ -282,7 +282,7 @@ mask = (phi2 > phi2_min) & (phi2 < phi2_max)
## Logical operators

Python's logical operators (`and`, `or`, and `not`)
do not work with NumPy or Pandas. Both libraries use the bitwise
do not work with NumPy or pandas. Both libraries use the bitwise
operators (`&`, `|`, and `~`) to do elementwise logical operations
([explanation here](https://stackoverflow.com/questions/21415661/logical-operators-for-boolean-indexing-in-pandas)).
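
To make the difference concrete, the following sketch applies both kinds of operator to a small Boolean comparison. The `phi2` values and the limits are invented for this example, and `and_failed` is a hypothetical helper variable used only to record what happens.

```python
import pandas as pd

# Made-up values standing in for the phi2 column.
phi2 = pd.Series([-2.0, -0.5, 0.3, 1.5])
phi2_min, phi2_max = -1.0, 1.0

# Elementwise "and": each comparison yields a Boolean Series, and & combines them.
# The parentheses matter because & binds more tightly than > and <.
mask = (phi2 > phi2_min) & (phi2 < phi2_max)

# Elementwise "not": ~ flips every element of the mask.
outside = ~mask

# Python's `and` asks for a single truth value of the whole Series,
# which pandas refuses to guess, so it raises ValueError.
try:
    (phi2 > phi2_min) and (phi2 < phi2_max)
    and_failed = False
except ValueError:
    and_failed = True

print(mask.tolist(), outside.tolist(), and_failed)
```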

@@ -433,7 +433,7 @@ plt.plot(pm1_rect, pm2_rect, '-')
Now that we have identified the bounds of the cluster in proper motion,
we will use it to select rows from `results_df`.

We will use the following function, which uses Pandas operators to make
We will use the following function, which uses pandas operators to make
a mask that selects rows where `series` falls between `low` and
`high`.
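
The function itself falls outside this hunk, so the following is only a sketch of what such a helper might look like. The name `between`, the strict inequalities, and the sample values are assumptions made for illustration.

```python
import pandas as pd

def between(series, low, high):
    """Return a Boolean Series that is True where low < series < high.

    Hypothetical helper; the lesson's actual function may differ.
    """
    return (series > low) & (series < high)

# Made-up proper motion values and limits.
pm1 = pd.Series([-9.5, -8.0, -6.0, -3.0])
pm1_mask = between(pm1, -8.9, -6.9)

# The mask can then be used to select the matching rows.
selected = pm1[pm1_mask]
print(selected.tolist())
```

pandas also provides a built-in `Series.between` method, but its bounds are inclusive by default, so the two are not interchangeable at the boundaries.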

@@ -563,7 +563,7 @@ Recall that we chose HDF5 because it is a binary format producing small files th

Additionally, HDF5 files can contain more than one dataset and can store metadata associated with each dataset (such as column names or observatory information, like a FITS header).

We can add our existing Pandas `DataFrame` to an HDF5 file by omitting the `mode='w'` keyword like this:
We can add our existing pandas `DataFrame` to an HDF5 file by omitting the `mode='w'` keyword like this:

```python
filename = 'gd1_data.hdf'
@@ -662,7 +662,7 @@ the proper motion limits we identified in this lesson, which will allow us to ex
:::::::::::::::::::::::::::::::::::::::: keypoints

- A workflow is often prototyped on a small set of data which can be explored more easily and used to identify ways to limit a dataset to exactly the data you want.
- To store data from a Pandas `DataFrame`, a good option is an HDF5 file, which can contain multiple Datasets.
- To store data from a pandas `DataFrame`, a good option is an HDF5 file, which can contain multiple Datasets.

::::::::::::::::::::::::::::::::::::::::::::::::::

2 changes: 1 addition & 1 deletion episodes/05-select.md
@@ -435,7 +435,7 @@ our analysis at a later date we should save this information to a file.
There are several ways we could do that, but since we are already
storing data in an HDF5 file, we will do the same with these variables.

To save them to an HDF5 file we first need to put them in a Pandas object.
To save them to an HDF5 file we first need to put them in a pandas object.
We have seen how to create a `Series` from a column in a `DataFrame`.
Now we will build a `Series` from scratch.
We do not need the full `DataFrame` format with multiple rows and columns
6 changes: 3 additions & 3 deletions episodes/06-join.md
@@ -882,7 +882,7 @@ that for each candidate star we have identified exactly one source in
Pan-STARRS that is likely to be the same star.

To check whether there are any values other than `1`, we can convert
this column to a Pandas `Series` and use `describe`, which we saw
this column to a pandas `Series` and use `describe`, which we saw
in episode 3.

```python
@@ -979,7 +979,7 @@ getsize(filename) / MB

## Another file format - CSV

Pandas can write a variety of other formats, [which you can read about
pandas can write a variety of other formats, [which you can read about
here](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).
We won't cover all of them, but one other important one is
[CSV](https://en.wikipedia.org/wiki/Comma-separated_values), which
@@ -1064,7 +1064,7 @@ the CSV file also does not.
However, even if we had written a CSV file from an Astropy `Table`, which does contain data type,
data type would not appear in the CSV file, highlighting a limitation of this format.
Additionally, notice that the index in `candidate_df` has become an unnamed column
in `read_back_csv` and a new index has been created. The Pandas functions for writing and reading CSV
in `read_back_csv` and a new index has been created. The pandas functions for writing and reading CSV
files provide options to avoid that problem, but this is an example of
the kind of thing that can go wrong with CSV files.
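
One way to see the index problem and the options that avoid it is the sketch below. It writes to an in-memory string rather than a file, and `small_df` with its values and index labels is invented for illustration; only the column names echo those used in the lesson.

```python
import io
import pandas as pd

# Hypothetical stand-in for candidate_df, with a non-default index.
small_df = pd.DataFrame(
    {'g_mean_psf_mag': [17.9, 16.4], 'i_mean_psf_mag': [17.5, 15.7]},
    index=[3, 7],
)

# With no path argument, to_csv returns the CSV text as a string.
csv_text = small_df.to_csv()

# Reading it back naively turns the old index into an unnamed column.
naive = pd.read_csv(io.StringIO(csv_text))

# Option 1: tell read_csv that the first column holds the index.
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Option 2: do not write the index at all, if it carries no information.
no_index = pd.read_csv(io.StringIO(small_df.to_csv(index=False)))

print(list(naive.columns))
print(restored.index.tolist())
```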

2 changes: 1 addition & 1 deletion episodes/07-photo.md
@@ -185,7 +185,7 @@ stars. But the main sequence of GD-1 appears as an overdense region in the lowe

We want to be able to make this plot again, with any selection of Pan-STARRS photometry,
so this is a natural time to put it into a function that accepts as input
an Astropy `Table` or Pandas `DataFrame`, as long as
an Astropy `Table` or pandas `DataFrame`, as long as
it has columns named `g_mean_psf_mag` and `i_mean_psf_mag`. To do this we will change
our variable name from `candidate_df` to the more generic `dataframe`.

4 changes: 2 additions & 2 deletions index.md
@@ -3,7 +3,7 @@ permalink: index.html
site: sandpaper::sandpaper_site
---

The Foundations of Astronomical Data Science curriculum covers a range of core concepts necessary to efficiently study the ever-growing datasets developed in modern astronomy. In particular, this curriculum teaches learners to perform database operations (SQL queries, joins, filtering) and to create publication-quality data visualisations. Learners will use software packages common to the general and astronomy-specific data science communities ([Pandas](https://pandas.pydata.org), [Astropy](https://www.astropy.org), [Astroquery](https://astroquery.readthedocs.io/en/latest/)) combined with two astronomical datasets: the large, all-sky, multi-dimensional dataset from the [Gaia satellite](https://sci.esa.int/web/gaia), which measures the positions, motions, and distances of approximately a billion stars in our Milky Way galaxy with unprecedented accuracy and precision; and the [Pan-STARRS photometric survey](https://panstarrs.stsci.edu/), which precisely measures light output and distribution from many stars. Together, the software and datasets are used to reproduce part of the analysis from the article ["Off the beaten path: Gaia reveals GD-1 stars outside of the main stream"](https://arxiv.org/abs/1805.00425) by Drs. Adrian M. Price-Whelan and Ana Bonaca. This lesson shows how to identify and visualize the GD-1 stellar stream, which is a globular cluster that has been tidally stretched by the Milky Way.
The Foundations of Astronomical Data Science curriculum covers a range of core concepts necessary to efficiently study the ever-growing datasets developed in modern astronomy. In particular, this curriculum teaches learners to perform database operations (SQL queries, joins, filtering) and to create publication-quality data visualisations. Learners will use software packages common to the general and astronomy-specific data science communities ([pandas](https://pandas.pydata.org), [Astropy](https://www.astropy.org), [Astroquery](https://astroquery.readthedocs.io/en/latest/)) combined with two astronomical datasets: the large, all-sky, multi-dimensional dataset from the [Gaia satellite](https://sci.esa.int/web/gaia), which measures the positions, motions, and distances of approximately a billion stars in our Milky Way galaxy with unprecedented accuracy and precision; and the [Pan-STARRS photometric survey](https://panstarrs.stsci.edu/), which precisely measures light output and distribution from many stars. Together, the software and datasets are used to reproduce part of the analysis from the article ["Off the beaten path: Gaia reveals GD-1 stars outside of the main stream"](https://arxiv.org/abs/1805.00425) by Drs. Adrian M. Price-Whelan and Ana Bonaca. This lesson shows how to identify and visualize the GD-1 stellar stream, which is a globular cluster that has been tidally stretched by the Milky Way.

GD-1 is a stellar stream around the Milky Way. This means it is a collection of stars that we believe was once part of a bound clump, but the gravitational influence of the Milky Way has torn it apart and spread it over an arc that traces out its orbit on the sky. This is interesting, because if the original bound clump was a dwarf galaxy, understanding its orbit with sufficient precision allows us to measure the mass of the Milky Way, which is very important for understanding the future and past of the Milky Way as a whole. But that is much easier to do if we have a coordinate system aligned with the stream because that makes fitting the location of the stars much easier mathematically - it becomes more linear instead of some complicated curve. Additionally, this stream is especially interesting because it has "gaps", which have a natural interpretation as being caused by the influence of small clumps of dark matter passing near the stream. Knowing the typical rate of these gaps tells you about the typical size and density of these clumps, which turns out to be one of the best probes we have of the fine structure of dark matter.

@@ -13,7 +13,7 @@ This lesson can be taught in approximately 10 hours and covers the following top
- Using Astroquery to query a remote server in Python.
- Transforming coordinates between common coordinate systems using Astropy units and coordinates.
- Working with common astronomical file formats, including FITS, HDF5, and CSV.
- Managing your data with Pandas DataFrames and Astropy Tables.
- Managing your data with pandas DataFrames and Astropy Tables.
- Writing functions to make your work less error-prone and more reproducible.
- Creating a reproducible workflow that brings the computation to the data.
- Customising all elements of a plot and creating complex, multi-panel, publication-quality graphics.
2 changes: 1 addition & 1 deletion instructors/calculating_MIST_isochrone.md
@@ -185,7 +185,7 @@ expect to find stars in GD-1.
We will save this result so we can reload it later without repeating the
steps in this section.

So we can save the data in an HDF5 file, we will put it in a Pandas
So we can save the data in an HDF5 file, we will put it in a pandas
`DataFrame` first:

```python