Data Manipulation in R: From Histograms to Winsorizing

Generate synthetic data, compute descriptive stats, filter by condition, and cap outliers using winsorization — the core data cleaning toolkit in base R.

Christopher A. Rotunno Aug 25, 2024

Real data is messy. Outliers, skewed distributions, and extreme values distort summary statistics and mislead models. Learning to identify and handle these programmatically is one of the most practical skills in any analyst’s toolkit.

This walkthrough covers four techniques that come up in virtually every data cleaning job: generating distributions for testing, computing summary stats, counting conditional values, and winsorizing.

Step 1: Generate a Normal Distribution

Start with reproducible synthetic data:

set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)

rnorm() generates normally distributed random values. The set.seed() call makes the results identical each time you run the script — essential for teaching and debugging.
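
A quick way to see what set.seed() does is to draw twice from the same seed and compare. This is a minimal sketch; the names first_draw, second_draw, and third_draw are illustrative and not part of the original script.

```r
# Draw five values, reset the seed, and draw again
set.seed(42)
first_draw <- rnorm(5)

set.seed(42)
second_draw <- rnorm(5)

# Same seed, same sequence
identical(first_draw, second_draw)  # TRUE

# Without resetting the seed, the generator keeps going and the values differ
third_draw <- rnorm(5)
identical(first_draw, third_draw)   # FALSE
```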

Step 2: Verify with Summary Statistics

Check that the generated data matches the specified parameters:

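# Round to five decimals so the printed values are easy to compare
cat("Mean:", round(mean(x), 5), "\n")
cat("SD:  ", round(sd(x), 5), "\n")
cat("Min: ", round(min(x), 5), "\n")
cat("Max: ", round(max(x), 5), "\n")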

Mean: 0.01569
SD:   0.99208
Min: -3.08041
Max:  3.47503

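With 1,000 samples, you won’t get exactly 0 and 1, but you’ll be close. The standard error of the mean here is 1/√1000 ≈ 0.032, so a sample mean of 0.016 sits about half a standard error from 0, and an SD of 0.992 is similarly close to 1. Both are well within expected sampling variation.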
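
The "close enough" judgment can be made concrete with the standard error of the mean, sd / sqrt(n). The following is a sketch that assumes the same seed and sample size as above; se_mean and z_mean are names introduced here for illustration.

```r
set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)

# Standard error of the mean under the true parameters: sd / sqrt(n)
se_mean <- 1 / sqrt(length(x))   # about 0.0316

# How many standard errors the sample mean sits from the true mean of 0
z_mean <- (mean(x) - 0) / se_mean

cat("Standard error:", round(se_mean, 4), "\n")
cat("Distance from 0 in standard errors:", round(z_mean, 2), "\n")
```

A z_mean between roughly -2 and 2 is unremarkable for a generator that is doing what it should.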

Step 3: Visualize the Distribution

hist(x,
     breaks = 30,
     col = "#A8D8EA",
     main = "Histogram of 1,000 Normal Values",
     xlab = "Value",
     probability = TRUE)

# Overlay the theoretical density curve.
# Inside curve(), x is the plotting variable, not the data vector above.
curve(dnorm(x, mean = 0, sd = 1),
      col = "#E23645",
      lwd = 2,
      add = TRUE)

Setting probability = TRUE converts the y-axis from counts to density, making it directly comparable to the theoretical curve. This visual check is a good habit — if your data is supposed to be normal and the histogram looks bimodal, something is wrong upstream.
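
To see what the density scaling means numerically, the histogram can be computed without drawing it. This sketch uses hist() with plot = FALSE and inspects the returned object; the variable name h is illustrative.

```r
set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)

# Compute the histogram without drawing it
h <- hist(x, breaks = 30, plot = FALSE)

# Counts add up to the sample size
sum(h$counts)                      # 1000

# Densities times bin widths add up to 1, as for any probability density
sum(h$density * diff(h$breaks))    # 1, up to floating-point rounding
```

Because the bars integrate to 1, they share a scale with dnorm(), which is why the overlay lines up.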

Step 4: Counting Values Meeting Conditions

How many values fall more than one standard deviation from the mean, that is, outside [-1, 1]?

# Values above +1
above_1 <- sum(x > 1)

# Values below -1
below_neg1 <- sum(x < -1)

# Total outside [-1, 1]
outside_1sd <- sum(abs(x) > 1)

cat("Above +1:  ", above_1, "\n")
cat("Below -1:  ", below_neg1, "\n")
cat("Total outside ±1:", outside_1sd, "(", round(outside_1sd/length(x)*100, 1), "%)\n")

Above +1:   157 
Below -1:   165 
Total outside ±1: 322 ( 32.2 %)

The theoretical expectation for a standard normal distribution is 31.7% outside ±1σ. Observing 32.2% is consistent with that, given the sampling variation in 1,000 draws. This kind of sanity check, comparing empirical results to theoretical expectations, should be automatic in your workflow.
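
The 31.7% figure comes straight from the normal CDF, and the same calculation gives a sense of how much the count should vary from sample to sample. A short sketch; p_outside, expected_count, and sd_count are names introduced here.

```r
# Probability that a standard normal value falls outside [-1, 1]
p_outside <- 2 * pnorm(-1)
round(p_outside, 4)  # 0.3173

# Expected count and binomial standard deviation for n = 1000 draws
n <- 1000
expected_count <- n * p_outside
sd_count <- sqrt(n * p_outside * (1 - p_outside))

cat("Expected outside ±1:", round(expected_count, 1), "\n")
cat("SD of that count:   ", round(sd_count, 1), "\n")
```

The expected count is about 317 with a standard deviation of about 15, so an observed 322 is roughly a third of a standard deviation above expectation.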

Step 5: Winsorizing Outliers

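Winsorization caps extreme values at a chosen threshold rather than removing them. This preserves the sample size while limiting the influence of outliers on means and model coefficients. In practice the caps are usually percentiles of the data (for example, the 5th and 95th); this walkthrough uses fixed caps at ±1 so the effect is easy to see, which is far more aggressive than typical use because it alters roughly a third of the values.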
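
For comparison, a percentile-based version might look like the sketch below. The helper winsorize_quantile() and its default 5th/95th percentile limits are hypothetical choices for illustration, not something from the original walkthrough, and it assumes a numeric vector with no missing values.

```r
# Hypothetical helper: cap values at the given lower and upper percentiles.
# Assumes v is numeric with no missing values.
winsorize_quantile <- function(v, probs = c(0.05, 0.95)) {
  limits <- quantile(v, probs = probs, names = FALSE)
  v[v < limits[1]] <- limits[1]
  v[v > limits[2]] <- limits[2]
  v
}

set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)
x_wins_pct <- winsorize_quantile(x)

# Sample size is unchanged; the range is pulled in to the percentile limits
length(x_wins_pct)
range(x)
range(x_wins_pct)
```

With these defaults about 10% of the values are altered, compared with about a third when capping at ±1.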

Method 1: Vectorized (Preferred)

x_wins <- x
x_wins[x_wins > 1] <- 1
x_wins[x_wins < -1] <- -1

cat("Original mean:    ", mean(x), "\n")
cat("Winsorized mean:  ", mean(x_wins), "\n")

Original mean:     0.01569
Winsorized mean:   0.00864

The vectorized approach is faster and idiomatic R. Logical indexing assigns to every element where the condition is TRUE in one operation, with no explicit loop.
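
Base R also offers pmin() and pmax(), which express the same capping in a single expression. The sketch below is an alternative to the indexing approach, not a replacement for it; x_clip is a name introduced here.

```r
set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)

# Logical-indexing version from the text
x_wins <- x
x_wins[x_wins > 1] <- 1
x_wins[x_wins < -1] <- -1

# Alternative: pmax() raises values below -1, then pmin() lowers values above 1
x_clip <- pmin(pmax(x, -1), 1)

# Both approaches should agree element by element
all.equal(x_wins, x_clip)
```

The one-liner leaves the original vector untouched and needs no intermediate copy, which some readers find easier to scan.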

Method 2: Loop-Based (For Comparison)

x_wins_loop <- x
for (i in seq_along(x_wins_loop)) {
  if (x_wins_loop[i] > 1) {
    x_wins_loop[i] <- 1
  } else if (x_wins_loop[i] < -1) {
    x_wins_loop[i] <- -1
  }
}

# Verify: should be identical to vectorized result
identical(x_wins, x_wins_loop)  # TRUE

Both produce identical results. The vectorized method runs faster, and the gap becomes noticeable on large inputs (roughly 100,000 values and up). The loop version spells out the logic step by step, which is useful for teaching.
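
One way to check the speed difference on your own machine is to wrap both methods in functions and time them with system.time(). The function names cap_vectorized() and cap_loop() and the vector size of one million are assumptions made for this sketch; actual timings depend on hardware and R version, so none are shown.

```r
set.seed(42)
big <- rnorm(1e6)

# Vectorized capping via logical indexing
cap_vectorized <- function(v) {
  v[v > 1] <- 1
  v[v < -1] <- -1
  v
}

# Element-by-element capping with an explicit loop
cap_loop <- function(v) {
  for (i in seq_along(v)) {
    if (v[i] > 1) {
      v[i] <- 1
    } else if (v[i] < -1) {
      v[i] <- -1
    }
  }
  v
}

# Elapsed time in seconds; results vary by machine and R version
t_vec  <- system.time(res_vec  <- cap_vectorized(big))["elapsed"]
t_loop <- system.time(res_loop <- cap_loop(big))["elapsed"]

cat("Vectorized:", t_vec, "s\n")
cat("Loop:      ", t_loop, "s\n")

# The two results should match exactly
identical(res_vec, res_loop)
```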

Step 6: Compare Before and After

# Save the current layout so it can be restored afterwards
old_par <- par(mfrow = c(1, 2))

hist(x, breaks = 30, col = "#A8D8EA",
     main = "Original Distribution", xlab = "Value")

hist(x_wins, breaks = 30, col = "#F4D03F",
     main = "Winsorized at ±1 SD", xlab = "Value")

par(old_par)  # Restore the previous layout

The winsorized histogram shows spikes at ±1 — all the capped values piling up at the boundary. That visual signature is how you confirm winsorization worked.
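
The same confirmation can be done numerically, which is handy in scripts that run without a graphics device. A minimal sketch that rebuilds x_wins as in Step 5 and compares counts:

```r
set.seed(42)
x <- rnorm(n = 1000, mean = 0, sd = 1)
x_wins <- x
x_wins[x_wins > 1] <- 1
x_wins[x_wins < -1] <- -1

# No value should lie outside the caps
range(x_wins)  # -1  1

# Values sitting exactly on each cap should match the counts that exceeded it
cat("At +1:", sum(x_wins == 1),  " originally above +1:", sum(x > 1),  "\n")
cat("At -1:", sum(x_wins == -1), " originally below -1:", sum(x < -1), "\n")
```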

When to Use Winsorizing vs. Removal

Winsorize when:

You want to preserve sample size
The outliers are likely real but extreme (legitimate edge cases rather than recording errors)
Your model is sensitive to extreme values (regression especially)

Remove when:

Outliers are clearly data errors (negative ages, impossible prices)
The outlier mechanism is different from the main population and should be modeled separately
Sample size is large enough that removal doesn’t matter

Neither approach is always right. The choice should be driven by whether the extreme values represent signal or noise in the context of your specific problem.
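
The trade-off is easy to see on a small example. The sketch below uses made-up data (99 normal values plus one injected value of 50) and a hypothetical cutoff of 3, both chosen only for illustration.

```r
set.seed(42)
# 99 ordinary values plus one injected extreme value (illustrative data)
y <- c(rnorm(99), 50)

cutoff <- 3  # hypothetical threshold chosen for this example

# Removal: drop anything beyond the cutoff
y_removed <- y[abs(y) <= cutoff]

# Winsorizing: keep every observation but cap it at the cutoff
y_wins <- pmin(pmax(y, -cutoff), cutoff)

cat("Original:   n =", length(y),         " mean =", round(mean(y), 3), "\n")
cat("Removed:    n =", length(y_removed), " mean =", round(mean(y_removed), 3), "\n")
cat("Winsorized: n =", length(y_wins),    " mean =", round(mean(y_wins), 3), "\n")
```

Removal shrinks the sample and discards the extreme observation entirely; winsorizing keeps all 100 rows but still lets the capped value pull the mean slightly toward it.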

The Bigger Picture

These four operations — generate, summarize, filter, transform — are the building blocks of every data cleaning pipeline, regardless of language or domain. Master them in base R and you’ll recognize the equivalent patterns in pandas, dplyr, and SQL immediately. The tools change; the logic doesn’t.