Skip to main content
Version: 0.24.3

Summarize Large Datasets with Pivot and Reshape

Summarizing large datasets can help you to better understand, identify, and compare specific aspects of your data. In this example, we'll demonstrate how to create pivot tables and reshape large datasets to compare New York's and California's Covid-19 cases.

Load Our Data

To start, load the URL https://github.com/nytimes/covid-19-data/raw/HEAD/us-states.csv into the session. We're then given a sample of our dataset that looks something like this:

United States Covid-19 data

Extract Our Data

Let's prepare our data by using the Extract skill to find the quarter of the year for each row. Later, we'll use this information to dig into the data further. We can:

  1. Click Column > Extract to open the Extract form.
  2. Enter "date" for the column, "YYYYQ" for the date parts to extract, and "YearQuarter" for the name of the extracted column.
  3. Click Submit.

Extract form

This gives us the following table. As we can see, the dates of recorded cases are now listed with their corresponding quarters.

extracted quarters table

Create a Pivot Table

Currently, our dataset has over 39,000 rows. Let's create a pivot table to make this dataset more manageable. Let's say that we'd like to show the total number of cases in each quarter for each state. We can:

  1. Click Dataset > Pivot to open the Pivot form.
  2. Select "total" for the Aggregate and "cases" for the Column.
  3. Drag and drop "YearQuarter" to the bottom-left section and "state" to the top section.
  4. Click Submit.

pivot form

This step creates a new pivot table, "totalCasesPivot", that summarizes the total number of cases for each state in each quarter.

the pivot table

Reshape Our Data

We can then investigate these findings a bit more using the Reshape skill. This shows us a different way to look at our data, helping us to better visualize comparisons. To compare the Covid-19 cases data between New York and California:

  1. Click Dataset > Reshape to open the Reshape form.

  2. Select Long Form.

  3. Enter "YearQuarter" for the Row Identifier.

  4. Enter "California, New York" for the Values.

  5. Click Submit.

    reshape form

Our new table is called "longReshape" and looks something like this:

the reshape table

Reshape takes our previous table from wide form to long form, showing us a comparison between New York's and California's Covid-19 cases. From here, we can create a bar chart to illustrate the change in cases for both states over all quarters:

  1. Click Plot in the sidebar to open the Chart Builder.
  2. Select "Bar Chart" for type.
  3. Under Required Fields, enter "Variables" for the X-Axis and "Values" for the Y-Axis.
  4. Under Optional Fields, enter "YearQuarter" for Group.
  5. Click Submit.

the bar chart

Using this chart, we are able to view a side by side comparison of total Covid-19 cases between New York and California.

Full Recipe

Here is the entire recipe we used in this topic:

Load data from the URL <strong>https://github.com/nytimes/covid-19-data/raw/HEAD/us-states.csv</strong>
Extract the YYYYQ from the column date and call the output columns YearQuarter
Pivot on total cases creating rows from YearQuarter and columns from state
Reshape the dataset current to long form, creating row IDs for each YearQuarter, and creating a Values column from California, NewYork
Plot an EChart with the specification <strong>{"caption":"","typeVersion":2,"plot":{"series":[{"type":"bar","mark":{"x":"Variables","y":"Values","shape":"YearQuarter","color":"YearQuarter"},"group":{}}],"presentation":{"title":{"textStyle":{"fontSize":18},"text":["Average Values vs. Variables, for each: YearQuarter"]},"xaxis":{"nameTextStyle":{"fontSize":18},"axisLabel":{"fontSize":12},"text":"Variables"},"yaxis":{"nameTextStyle":{"fontSize":18},"axisLabel":{"fontSize":12},"text":"Average Values"},"annotations":[],"colorOverride":[]}},"values":{"complete":false,"metadata":{},"columns":[{"name":"Values","type":"Float","unique":28},{"name":"Variables","type":"String","unique":2},{"name":"YearQuarter","type":"String","unique":14}],"aggregate":[{"expression":"Average","columns":["Values"],"groupBy":["Variables","YearQuarter"]}],"transforms":[],"reference":{"type":"dataset","params":{"session_id":"70cxrpmqg9w3nyo9","dataset":"longReshape","version":1}}}}</strong>