R You Ready? Begin Your Data Analytics Journey with a Coin Flip Simulation

Random sampling, reproducibility, subsetting, and counting outcomes — all through a simple coin flip simulation. The concepts here apply everywhere in data analysis.

Christopher A. Rotunno Aug 26, 2024

The best way to learn data analysis fundamentals isn’t to start with a complex dataset — it’s to start with a problem simple enough that you can reason about the expected answer before running a single line of code. A coin flip is perfect.

You know that a fair coin should land on heads roughly 50% of the time. That means when you run a simulation, you can immediately tell whether your code is working correctly by checking whether the results look plausible. That feedback loop is invaluable when you’re learning.

Setting Up Reproducibility

Before anything else: set a random seed. Random simulations produce different results each run by default, which makes debugging and teaching difficult. set.seed() locks the random number generator to a specific starting point:

set.seed(42)

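With the seed set, every run of the script produces identical results. This matters for reproducibility in real analysis too: anyone who runs your code with the same seed should get the same output, provided their R version uses the same default random number generator (the default sampling algorithm changed in R 3.6.0).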
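A quick way to convince yourself is to reset the seed before each of two identical draws and compare them. This is a minimal sketch; first_run and second_run are illustrative names that are not part of the main script.

```r
# Draw the same ten flips twice, resetting the seed before each draw
set.seed(42)
first_run <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)

set.seed(42)
second_run <- sample(c("Heads", "Tails"), size = 10, replace = TRUE)

# identical() returns TRUE only if the two vectors match element for element
identical(first_run, second_run)
```

If you remove the second set.seed() call, the two draws will usually differ, because the second draw continues from wherever the first one left the generator. Run this check separately from the main script, or call set.seed(42) again afterwards, so the main simulation still starts from the same state.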

Running the Simulation

Use sample() to randomly select from a set of options with replacement:

n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

The replace = TRUE argument is essential: it returns each outcome to the pool after every draw, so each flip is independent of the others. Without it, sample() cannot draw 500 values from a set of only two, and R stops with an error.
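You can see what happens without replacement by wrapping the call in tryCatch(), which captures the error message instead of halting the script. This is a small sketch; outcome is a hypothetical variable name used only here.

```r
# Attempt the draw without replace = TRUE and capture the error message
# instead of letting it stop the script
outcome <- tryCatch(
  sample(c("Heads", "Tails"), size = 500),
  error = function(e) conditionMessage(e)
)

# outcome holds the text of R's error message, not 500 flips
print(outcome)
```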

Inspecting the Result

Before analyzing, inspect the structure of what you’ve created:

str(coin_flips)

 chr [1:500] "Heads" "Tails" "Heads" "Heads" "Tails" "Tails" "Heads" ...

This tells you: character vector, 500 elements. Exactly what you expected. Always check your data structure before proceeding — mismatched types are a common source of silent errors in R.
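Beyond str(), a few other base R functions answer the same questions one at a time. The sketch below recreates the flips so it runs on its own, and ends with a stopifnot() guard that halts the script if the structure is not what the later steps assume. The guard is an optional habit, not something the rest of the walkthrough requires.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

class(coin_flips)    # storage type of the vector
length(coin_flips)   # number of elements
unique(coin_flips)   # distinct values present
anyNA(coin_flips)    # TRUE if any element is missing

# Halt early if the structure is not what the later steps assume
stopifnot(is.character(coin_flips), length(coin_flips) == n_flips)
```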

Accessing Specific Elements

R uses bracket notation for indexing:

# First flip
coin_flips[1]

# Flips 1 through 10
coin_flips[1:10]

# The 250th flip
coin_flips[250]

Understanding indexing is foundational. The same syntax for accessing elements of a vector extends to rows and columns of data frames — master it here and it transfers everywhere.
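To illustrate that transfer, the sketch below puts the flips into a data frame and uses the same brackets with a row index and a column index. The data frame flip_data and its column names are hypothetical, created only for this example.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

# flip_data is a hypothetical data frame built only for this illustration
flip_data <- data.frame(
  flip_number = seq_len(n_flips),
  outcome = coin_flips,
  stringsAsFactors = FALSE
)

flip_data[1, ]              # first row, all columns
flip_data[1:10, ]           # rows 1 through 10
flip_data[250, "outcome"]   # one cell: row 250, column "outcome"
flip_data$outcome[250]      # the same cell, reached through the column vector
```

The only new idea is the comma: the index before it selects rows, and the index after it selects columns.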

Counting Outcomes

Two approaches for counting how many times “Tails” appeared:

Method 1: Logical comparison with sum()

n_tails <- sum(coin_flips == "Tails")
cat("Number of tails:", n_tails, "\n")
cat("Proportion:", n_tails / n_flips, "\n")

coin_flips == "Tails" creates a logical vector (TRUE/FALSE for each element). sum() counts the TRUEs, since TRUE = 1 and FALSE = 0 in numeric contexts.

Method 2: Subsetting with length()

tails_only <- coin_flips[coin_flips == "Tails"]
cat("Number of tails:", length(tails_only), "\n")

Both methods return the same count. With set.seed(42) and 500 flips, this produces 239 tails, or 47.8%. That is close to 50% but not exact, which is what you should expect: a finite sample rarely matches the theoretical probability exactly.
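As a check, the sketch below runs both methods on the same vector and confirms that they agree. It also shows a shortcut for the proportion: because TRUE counts as 1, the mean of a logical vector is the share of TRUE values. The names count_by_sum and count_by_length are illustrative.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

count_by_sum <- sum(coin_flips == "Tails")
count_by_length <- length(coin_flips[coin_flips == "Tails"])

# Both approaches count the same elements, so this is TRUE
count_by_sum == count_by_length

# The mean of a logical vector is the proportion of TRUE values
mean(coin_flips == "Tails")
```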

Analyzing Patterns in Position

What if you want to compare odd-numbered flips to even-numbered flips? The seq() function generates sequences:

# Odd-positioned flips (1, 3, 5, ...)
odd_flips <- coin_flips[seq(1, n_flips, by = 2)]

# Even-positioned flips (2, 4, 6, ...)
even_flips <- coin_flips[seq(2, n_flips, by = 2)]

cat("Tails in odd positions:", sum(odd_flips == "Tails"), "\n")
cat("Tails in even positions:", sum(even_flips == "Tails"), "\n")

Each subset contains 250 flips, so with a fair coin you’d expect roughly 125 tails in each. A dramatic difference between the two counts would suggest the random number generator isn’t producing independent samples, and that is a testable hypothesis.
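One way to run that test, though not necessarily the only sensible one, is a chi-squared test of independence between position (odd or even) and outcome. The sketch below uses base R's chisq.test(), which applies a continuity correction to 2 x 2 tables by default. The names position, position_table, and parity_test are illustrative.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

# Label each position as odd or even using the remainder after division by 2
position <- ifelse(seq_along(coin_flips) %% 2 == 1, "Odd", "Even")

# Cross-tabulate position against outcome: a 2 x 2 table of counts
position_table <- table(position, coin_flips)
print(position_table)

# Chi-squared test of independence between position and outcome
parity_test <- chisq.test(position_table)
parity_test$p.value
```

A large p-value means the data give no evidence that outcome depends on position. A small one would be grounds for a closer look, keeping in mind that small p-values also turn up by chance now and then.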

Visualizing the Distribution

# Table of counts
flip_table <- table(coin_flips)

# Bar plot
barplot(
  flip_table,
  col = c("#4E9AF1", "#F1A94E"),
  main = "Coin Flip Results (n = 500)",
  ylab = "Count",
  ylim = c(0, 300)
)

# Add a reference line at the expected 250
abline(h = 250, lty = 2, col = "gray40")

The dashed reference line at 250, the expected count for each outcome, makes the deviation from the theoretical probability immediately visible.
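If you want numbers to go with the picture, the same table can be turned into deviations and proportions. This is a short sketch; expected_count is an illustrative name for the 250 flips a fair coin would give each outcome on average.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

flip_table <- table(coin_flips)
expected_count <- n_flips / 2

flip_table - expected_count   # deviation of each count from the expected 250
prop.table(flip_table)        # each count as a share of all flips
```

The two deviations always cancel out, because every flip that is not heads is tails.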

Running More Simulations

A single run of 500 flips gives you one observation. Run it 1,000 times to see the distribution of outcomes:

n_simulations <- 1000
tail_counts <- replicate(n_simulations, {
  flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)
  sum(flips == "Tails")
})

hist(tail_counts, 
     breaks = 30,
     main = "Distribution of Tails Count Across 1,000 Simulations",
     xlab = "Number of Tails in 500 Flips",
     col = "#4E9AF1")
abline(v = 250, col = "red", lwd = 2)

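Each tails count is a sum of 500 independent flips, so the central limit theorem predicts an approximately bell-shaped distribution, and the histogram shows one centered near 250. Theory also puts the standard deviation at sqrt(500 × 0.5 × 0.5) ≈ 11.2, so most runs should land within roughly 22 tails of 250. The simulation makes an abstract statistical concept concrete in about 10 lines of code.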
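To compare the simulation against theory numerically, you can set the simulated mean and standard deviation next to the values for a binomial count with 500 trials and probability 0.5. This is a minimal sketch that resets the seed so it runs on its own, which means its random draws are not the same ones as in the walkthrough above. The names theoretical_mean and theoretical_sd are illustrative.

```r
# Self-contained rerun of the repeated simulation
set.seed(42)
n_flips <- 500
n_simulations <- 1000
tail_counts <- replicate(n_simulations, {
  flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)
  sum(flips == "Tails")
})

# Mean and standard deviation of a binomial count: n * p and sqrt(n * p * (1 - p))
theoretical_mean <- n_flips * 0.5
theoretical_sd <- sqrt(n_flips * 0.5 * 0.5)

c(simulated = mean(tail_counts), theoretical = theoretical_mean)
c(simulated = sd(tail_counts), theoretical = theoretical_sd)

# Share of runs within two theoretical standard deviations of the mean
mean(abs(tail_counts - theoretical_mean) <= 2 * theoretical_sd)
```

For a bell-shaped distribution, about 95% of values fall within two standard deviations of the mean, so the last line should come out near 0.95.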

What You Actually Learned

The mechanics here — sample(), indexing, logical comparison, seq(), replicate() — apply to almost every data analysis task in R:

Sampling underlies bootstrapping, cross-validation, and Monte Carlo methods
Logical indexing is how you filter rows in a data frame
Counting with sum() on logicals is how you compute frequencies, flag conditions, and build computed columns
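To make the first point concrete, here is a minimal bootstrap sketch built from the same pieces: sample() with replacement, a logical comparison, and replicate(). It resamples the observed flips to estimate how much the proportion of tails might vary. The names n_resamples and boot_props are illustrative, and 2,000 resamples is an arbitrary choice.

```r
# Recreate the flips so this snippet is self-contained
set.seed(42)
n_flips <- 500
coin_flips <- sample(c("Heads", "Tails"), size = n_flips, replace = TRUE)

# Resample the observed flips with replacement and record the share of tails
n_resamples <- 2000  # arbitrary choice for illustration
boot_props <- replicate(n_resamples, {
  resample <- sample(coin_flips, size = n_flips, replace = TRUE)
  mean(resample == "Tails")
})

# Middle 95% of the resampled proportions: a percentile bootstrap interval
quantile(boot_props, probs = c(0.025, 0.975))
```

Note that the resampling draws from coin_flips itself rather than from c("Heads", "Tails"), so the interval reflects the data you observed rather than an assumed fair coin.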

The coin flip is simple. The patterns are universal.