How to Prepare Datasets for Analysis

Steven L. Miller
Dec 23, 2020


6 Lessons Learned from Tidy Data

Understanding how to work with data is becoming a critical skill for success in a growing number of occupations.

In Tidy Data, Hadley Wickham outlines a set of principles that anyone working with data can use to do so more effectively.

These lessons are drawn from my original notes.

1. Tidy Data is a Framework

Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table. This framework makes it easy to tidy messy datasets because only a small set of tools is needed to deal with a wide range of untidy datasets.

Lesson Learned

Tidy Data is a concept that can be applied to any unstructured dataset to lay the foundation needed to analyze it.

2. Formatting of Tidy Data

Tidy data is a standard way of mapping the meaning of a dataset to its structure. A dataset is messy or tidy depending on how rows, columns and tables are matched up with observations, variables and types. In tidy data:

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Lesson Learned

For a dataset to be considered tidy, it must adhere to predefined rules for structuring data.
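To make the three rules concrete, here is a minimal sketch of reshaping a wide table into a tidy one. It is written in plain Python for illustration, not in the R tooling the article discusses. The `melt` helper, the column names, and the values are all hypothetical and made up for this example.

```python
# Hypothetical wide table: one row per person, one column per treatment.
# The values are invented purely for illustration.
wide = [
    {"name": "Person A", "treatment_a": None, "treatment_b": 2},
    {"name": "Person B", "treatment_a": 16, "treatment_b": 11},
]


def melt(rows, id_vars, var_name, value_name):
    """Turn every non-identifier column into its own (variable, value) row."""
    tidy_rows = []
    for row in rows:
        for column, value in row.items():
            if column in id_vars:
                continue
            # Copy the identifying variables, then add one variable/value pair.
            record = {key: row[key] for key in id_vars}
            record[var_name] = column
            record[value_name] = value
            tidy_rows.append(record)
    return tidy_rows


# After melting, each variable is a column and each observation is a row.
tidy = melt(wide, id_vars=["name"], var_name="treatment", value_name="result")
for record in tidy:
    print(record)
```

In the wide layout, the treatment names are hidden in the column headers. After melting, "treatment" is an ordinary column and every row describes exactly one observation.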

3. Get the Most out of R Using Tidy Data

Tidy data is particularly well suited to vectorised programming languages like R, because the layout ensures that values of different variables from the same observation are always paired. Fixed variables should come first, followed by measured variables, each ordered so that related variables are contiguous. Rows can then be ordered by the first variable, breaking ties with the second and subsequent (fixed) variables.

Most messy datasets, including kinds of messiness not described here, can be tidied with a small set of tools: melting, string splitting, and casting. The complete datasets and the R code used to tidy them are available online.

Lesson Learned

While the principles of Tidy Data are useful in a wide range of contexts, they are particularly well suited to certain applications, such as the R programming language.
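Two of the ideas above, string splitting and ordering rows by their fixed variables, can be sketched in a few lines. This is an illustration in plain Python, not the R code the article refers to. The `split_column` helper, the `group` encoding (sex and age range joined by an underscore), and the values are assumptions made up for the example.

```python
# Hypothetical rows where the "group" column packs two variables together:
# sex and age range, joined by an underscore. Values are invented.
molten = [
    {"country": "BB", "group": "m_0-14", "cases": 7},
    {"country": "AA", "group": "m_0-14", "cases": 5},
    {"country": "AA", "group": "f_15-24", "cases": 3},
]


def split_column(rows, column, into, sep):
    """Replace one packed column with several columns, keeping its position."""
    result = []
    for row in rows:
        new_row = {}
        for key, value in row.items():
            if key == column:
                # Split at most len(into) - 1 times so extra separators survive.
                parts = value.split(sep, len(into) - 1)
                new_row.update(zip(into, parts))
            else:
                new_row[key] = value
        result.append(new_row)
    return result


# String splitting: one column per variable, fixed variables before "cases".
split_rows = split_column(molten, column="group", into=["sex", "age"], sep="_")

# Ordering: sort by the first fixed variable, breaking ties with the next ones.
tidy = sorted(split_rows, key=lambda r: (r["country"], r["sex"], r["age"]))
for record in tidy:
    print(record)
```

The split keeps the fixed variables (country, sex, age) ahead of the measured variable (cases), which matches the column ordering recommended above.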

4. Four Fundamental Verbs of Data Manipulation

Filter: subsetting or removing observations based on some condition.

Transform: adding or modifying variables. These modifications can involve either a single variable (e.g., log-transformation), or multiple variables (e.g., computing density from weight and volume).

Aggregate: collapsing multiple values into a single value (e.g., by summing or taking means).

Sort: changing the order of observations.

Lesson Learned

Once a dataset is tidy, most data manipulation tasks can be expressed as some combination of these four fundamental verbs.
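The four verbs can be sketched on a small tidy dataset. This is a minimal illustration in plain Python, not a definitive implementation. The column names and measurements are made up, and the density calculation follows the weight and volume example mentioned under Transform.

```python
from collections import defaultdict

# Hypothetical tidy dataset: one row per measured sample. Values are invented.
samples = [
    {"material": "oak", "weight_g": 7.0, "volume_cm3": 10.0},
    {"material": "iron", "weight_g": 78.0, "volume_cm3": 10.0},
    {"material": "oak", "weight_g": 15.0, "volume_cm3": 20.0},
    {"material": "iron", "weight_g": 0.0, "volume_cm3": 0.0},  # unusable record
]

# Filter: remove observations that fail a condition (here, zero volume).
valid = [row for row in samples if row["volume_cm3"] > 0]

# Transform: add a new variable computed from two existing ones.
with_density = [
    {**row, "density": row["weight_g"] / row["volume_cm3"]} for row in valid
]

# Aggregate: collapse the densities of each material into a single mean.
groups = defaultdict(list)
for row in with_density:
    groups[row["material"]].append(row["density"])
means = [
    {"material": material, "mean_density": sum(values) / len(values)}
    for material, values in groups.items()
]

# Sort: change the order of observations (densest material first).
ranked = sorted(means, key=lambda r: r["mean_density"], reverse=True)
for record in ranked:
    print(record)
```

Each step takes rows in and gives rows back, so the verbs can be chained in any order the analysis needs.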

5. Manipulating Data in R

In R, filtering and transforming are performed by the base R functions subset() and transform(). These are input- and output-tidy. The aggregate() function performs group-wise aggregation. It is input-tidy. Provided that a single aggregation method is used, it is also output-tidy. The plyr package provides tidy summarise() and arrange() functions for aggregation and sorting.

Lesson Learned

Functions and packages are readily available in R for applying the four fundamental verbs of data manipulation to tidy datasets.

6. Tidy Visualisation Tools

Tidy visualisation tools only need to be input-tidy, as their output is visual.

Thank you for Reading


Originally published on December 23, 2020.


