By Natalya St. Clair
We live in a world in which we are surrounded by data—from web browsing and business applications to real estate systems and transportation infrastructure. But as massive datasets accumulate, so does the need for people who understand how to work with complex datasets and software. The workforce of tomorrow—everyone from historians to coders—will require the services and specialties of data scientists. We currently face the problem of educating people in the new field of data science, a problem that falls into the domain of the even newer field of data science education.
Thankfully, a new movement is working to make research findings more available for data science education and to develop new tools for, and approaches to, teaching data science.
Many of these tools center on data moves. At a recent Data Science Education Technology Conference, hosted by the Concord Consortium in Berkeley, CA, opening panelist Tim Erickson of EEPS Media spoke of data moves as an ingredient that separates data science from “regular” data. A data move is an action made by a data scientist, or a student, in order to work with and make sense of data, especially a large or unruly dataset. Examples include choosing a subset to find a pattern, defining new variables or new measures, making a new or unusual visualization, and reorganizing the data.
Let’s look at an example: this Common Online Data Analysis Platform (CODAP) document about birth dates in the U.S. from 2000–2014. (CODAP is free and open-source software that provides an easy-to-use web-based data analysis tool, geared toward middle and high school students.) This dataset from the U.S. Social Security Administration was made available through FiveThirtyEight’s GitHub repository.
To explore this dataset in a classroom, a few opening questions to pique student interest might be: What do you think the points represent? What patterns do you see? How popular is your birthday?
Before you and your students get started, take another look at the graph (Figure 1). Notice something strange? The data clusters in two places, one at the top and one at the bottom. Perhaps grouping the data will give us insights. Let’s color the data by day of the week. (In CODAP, drag and drop the attribute “day_of_week” from the table header to the middle of the graph, then click on “day_of_week” and set it to “Treat as categorical.”) We can think about this as a data move in which we modify a visualization in the hope of gaining insight into what is going on.
This simple data move leads to an interesting insight: the number of births is much smaller on weekends (Figure 2, pink and purple dots) than on weekdays. Ask your students for conjectures that might explain this observation. (A likely explanation is that many births are scheduled, either through induced labor or caesarean section, so they tend to fall on weekdays, when doctors and hospital staff are more available.)
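For readers who want to try the same grouping outside CODAP, here is a minimal Python sketch of the idea. It is not how CODAP works internally. The rows are made-up numbers chosen for illustration, not values from the Social Security Administration dataset, and the names `sample_rows` and `births_by_day_type` are hypothetical. The day of the week is computed from each date, which is itself a small data move: defining a new variable.

```python
from collections import defaultdict
from datetime import date
from statistics import mean

# Made-up (year, month, day, births) rows for illustration only.
# These are NOT values from the SSA dataset.
sample_rows = [
    (2014, 9, 8, 12000),   # Monday
    (2014, 9, 9, 13000),   # Tuesday
    (2014, 9, 10, 13000),  # Wednesday
    (2014, 9, 11, 13000),  # Thursday
    (2014, 9, 12, 12000),  # Friday
    (2014, 9, 13, 8000),   # Saturday
    (2014, 9, 14, 7000),   # Sunday
]


def births_by_day_type(rows):
    """Split birth counts into 'weekday' and 'weekend' groups and average each."""
    groups = defaultdict(list)
    for year, month, day, births in rows:
        # isoweekday() returns 1 for Monday through 7 for Sunday.
        is_weekend = date(year, month, day).isoweekday() >= 6
        groups["weekend" if is_weekend else "weekday"].append(births)
    return {kind: mean(values) for kind, values in groups.items()}


averages = births_by_day_type(sample_rows)
print(averages)
```

With real data, you would read the rows from the downloaded file instead of typing them in. The comparison between the two averages is the same one the colored graph makes visible.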
Let’s make another data move. It’s hard to see what’s going on for a given date for a span of years, but what if we calculate the median number of births for a given date?
To illustrate, draw a line at the median number of births (Figure 3). (This CODAP Document demonstrates how I created the graph of Figure 3.) Again, the data is interesting: there are some obvious dips on certain calendar days. There are far fewer birthdays on U.S. holidays! There’s also a dip on February 29. Ask your students why.
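The "median for a given date" move can be sketched in a few lines of Python. This is one possible approach, not the method used in the CODAP document. The rows below are invented for illustration and are not real birth counts, and `median_births_by_date` is a hypothetical helper name.

```python
from collections import defaultdict
from statistics import median

# Made-up (year, month, day, births) rows for illustration only.
# These are NOT values from the SSA dataset.
sample_rows = [
    (2012, 7, 3, 12000), (2013, 7, 3, 12400), (2014, 7, 3, 12200),
    (2012, 7, 4, 9000), (2013, 7, 4, 8600), (2014, 7, 4, 8800),
]


def median_births_by_date(rows):
    """Collapse the years: one median birth count per (month, day)."""
    by_date = defaultdict(list)
    for _year, month, day, births in rows:
        by_date[(month, day)].append(births)
    return {key: median(values) for key, values in sorted(by_date.items())}


medians = median_births_by_date(sample_rows)
for (month, day), value in medians.items():
    print(f"{month}/{day}: {value}")
```

Grouping by the (month, day) pair and ignoring the year is what turns fifteen years of daily points into a single value per calendar date.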
Now we’re back to our initial question: How popular is your birthday? To illustrate popular birth dates, we can graph the median number of births by month. (To do this, drag the attribute “month_name” onto the vertical axis of a new graph. Drag “median_births” onto the horizontal axis of the graph. Click on the ruler menu on the right of the graph. Check the “median” box to see the median line.)
As the graph in Figure 4 shows, September is the most popular month for birthdays.
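The monthly summary can be sketched the same way: take the per-date medians, group them by month, and take the median of each group. The values below are made up for illustration, so the result reflects only these toy numbers and says nothing about the real dataset. The attribute names "month_name" and "median_births" come from the CODAP document; the function name `median_by_month` is hypothetical.

```python
from collections import defaultdict
from statistics import median

# Made-up per-date medians keyed by (month_name, day), for illustration only.
# These are NOT values from the SSA dataset.
median_births = {
    ("August", 1): 12000, ("August", 2): 12200, ("August", 3): 11800,
    ("September", 1): 12500, ("September", 2): 12700, ("September", 3): 12300,
    ("December", 24): 9000, ("December", 25): 8000, ("December", 26): 11000,
}


def median_by_month(per_date):
    """Group per-date medians by month and take the median of each group."""
    by_month = defaultdict(list)
    for (month_name, _day), value in per_date.items():
        by_month[month_name].append(value)
    return {month: median(values) for month, values in by_month.items()}


monthly = median_by_month(median_births)
# The month whose median is largest in this toy sample.
most_popular = max(monthly, key=monthly.get)
print(monthly, most_popular)
```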
Analyzing and interpreting data is one of the key science and engineering practices of the Next Generation Science Standards (NGSS), and representing and interpreting data are featured throughout the Common Core State Standards (CCSS) for mathematics.
In the above example, we looked at birth dates from an open dataset within the CODAP web-based data analysis environment. Our data moves—adding color and lines, and defining new variables in order to run our analysis—helped to make sense of the data.
You can find additional open-source datasets on our CODAP Sample Documents webpage or in FiveThirtyEight’s GitHub repository, with data in many subject areas, including mathematics, social studies, environmental science, and more. Or try the scaffolded activities within CODAP that walk your students through several data moves. They can explore data in the Markov Game to master playing “Rock, Paper, Scissors” against the evil Dr. Markov and save Madeline the dog; experiment with dropping a parachute to see how changing the parachute’s size and the mass of its cargo changes the terminal velocity of the drop; investigate relationships between a mammal’s mass and its speed, lifespan, and more; or investigate the tracks of four specific elephant seals swimming in the Pacific.
One goal of data science education is to provide students with more opportunities to think analytically about data in the world around them. We hope this exciting new field will inspire students to become data scientists, whether as professionals or as informed citizens who are equipped with the tools and abilities to explore data for their own curiosity and to answer their own questions.