Titanic: Machine Learning from Disaster

The Titanic dataset is the classic ML classification benchmark. Here's a full walkthrough — feature engineering, Random Forest, Logistic Regression, and what the results actually mean.

The Titanic dataset has launched more data science careers than probably any other benchmark. That is not because predicting survival in a 1912 shipwreck is professionally useful. It is because the problem has just the right amount of complexity to teach real skills: messy data, meaningful features, and a binary classification target that everyone intuitively understands.

This is a full walkthrough from raw data to submission-ready predictions.


The Dataset

The training set has 891 passengers. The test set has 418. Each passenger record includes:

Variable    Description
Survived    Target: 0 = No, 1 = Yes
Pclass      Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd)
Name        Passenger name
Sex         male / female
Age         Age in years (missing for ~20%)
SibSp       # of siblings/spouses aboard
Parch       # of parents/children aboard
Ticket      Ticket number
Fare        Passenger fare
Cabin       Cabin number (mostly missing)
Embarked    Port of embarkation: C, Q, or S

Step 1: Load and Inspect

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

# Load a public copy of the training data (Data Science Dojo's datasets repository on GitHub)
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
df = pd.read_csv(train_url)

print(df.shape)
print(df.isnull().sum())

You’ll see Age missing for 177 rows, Cabin missing for 687, and Embarked missing for just 2. Age and Embarked can be imputed; Cabin is too sparse to use directly and is left out of the feature set below.
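To put those counts in proportion, it can help to look at the share of missing values per column rather than the raw totals. A minimal sketch, using a small invented frame (`toy` and its rows are illustrative, not real passengers) so it runs without downloading anything:

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic frame; these rows are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.46],
})

# isnull().mean() gives the fraction of missing values in each column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

On the real data, the same call on `df` should report roughly 0.20 for Age and 0.77 for Cabin, which follows from the counts above divided by 891 rows.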


Step 2: Feature Engineering

Encode Categorical Variables

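df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])  # fill the 2 missing ports with the most common one
df = pd.get_dummies(df, columns=['Sex', 'Embarked'], drop_first=True)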
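With `drop_first=True`, the first level of each column is dropped and becomes the implicit baseline. The sketch below, on invented rows, shows which columns come out. It also shows that a missing value yields zeros in every dummy column, so it is silently treated as the baseline level unless it is filled beforehand.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; the last one has no port recorded.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", np.nan],
})

# Levels are sorted, so "female" and "C" are the dropped baselines.
encoded = pd.get_dummies(toy, columns=["Sex", "Embarked"], drop_first=True)
print(encoded.columns.tolist())
print(encoded.astype(int))
```

This is why the feature list later refers to `Sex_male`, `Embarked_Q`, and `Embarked_S`, with no `Sex_female` or `Embarked_C`.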

Extract Titles from Names

Passenger names contain titles (Mr, Mrs, Miss, Master, Dr, etc.) that encode gender, age, and social status — all predictive of survival outcomes:

df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df['Title'] = df['Title'].replace(
    ['Lady','Countess','Capt','Col','Don','Dr','Major','Rev','Sir','Jonkheer','Dona'], 
    'Rare'
)
df['Title'] = df['Title'].replace({'Mlle': 'Miss', 'Ms': 'Miss', 'Mme': 'Mrs'})
df['Title'] = pd.Categorical(df['Title']).codes
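To see what the regular expression captures, here is a short sketch applied to a handful of names in the dataset's "Surname, Title. Given names" format. It also prints the integer codes that `pd.Categorical` assigns.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
    "Palsson, Master. Gosta Leonard",
])

# Capture the run of letters that follows a space and precedes a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)
print(titles.tolist())

# Categorical codes number the categories in sorted (alphabetical) order.
codes = pd.Categorical(titles).codes
print(dict(zip(titles, codes)))
```

The codes follow alphabetical order, which carries no meaning of its own. Tree-based models can split around that, but a linear model will read the codes as an ordered scale, so one-hot encoding `Title` may be the safer choice there.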

Bin Ages into Decade Groups

df['AgeBand'] = pd.cut(df['Age'].fillna(df['Age'].median()), bins=8, labels=False)

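With bins=8, pd.cut splits the observed age range (about 0.4 to 80 years) into eight equal-width bands of roughly ten years each. This reduces the noise from exact ages while preserving the signal that children had meaningfully better survival rates.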
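To check where the band boundaries actually fall, `pd.cut` can return the edges it computed via `retbins=True`. A small sketch on hypothetical ages (the values are made up to span 0 to 80):

```python
import pandas as pd

# Hypothetical ages spanning 0 to 80, for illustration only.
ages = pd.Series([0.0, 4.0, 9.0, 18.0, 29.0, 45.0, 62.0, 80.0])

# labels=False returns integer band indices; retbins=True also returns the edges.
bands, edges = pd.cut(ages, bins=8, labels=False, retbins=True)
print(bands.tolist())
print(edges.round(2).tolist())
```

Because the bins are equal-width, the edges depend on the minimum and maximum of the data passed in. If fixed boundaries matter (for example, a dedicated "child" band), passing an explicit list of edges to `bins` is an alternative.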


Step 3: Prepare Features

features = ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare',
            'Sex_male', 'Embarked_Q', 'Embarked_S', 'Title', 'AgeBand']

# Fill remaining nulls
df[features] = df[features].fillna(df[features].median())

X = df[features]
y = df['Survived']

Step 4: Train a Random Forest

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)
rf.fit(X_train, y_train)

print(f"Validation accuracy: {accuracy_score(y_val, rf.predict(X_val)):.4f}")

Step 5: Feature Importance

importances = pd.Series(rf.feature_importances_, index=features).sort_values(ascending=False)

importances.plot(kind='barh', figsize=(8, 5))
plt.title("Feature Importances — Random Forest")
plt.tight_layout()
plt.show()

Typically Sex_male, Fare, and Title dominate — reflecting the “women and children first” evacuation pattern and the reality that first-class passengers had better access to lifeboats.


Step 6: Logistic Regression Baseline

lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)

print(f"Logistic Regression accuracy: {accuracy_score(y_val, lr.predict(X_val)):.4f}")

Logistic Regression typically achieves 78–82% accuracy on this dataset — comparable to Random Forest and much easier to explain to stakeholders. The coefficients directly tell you which features push the survival probability up or down.
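As a rough illustration of reading coefficients, the sketch below fits a logistic regression on synthetic data (generated here, not the Titanic) in which being male and having a higher class number both lower the odds of the positive outcome. Exponentiating a coefficient gives the multiplicative change in odds per one-unit increase in that feature.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data, not the Titanic: the generating weights below are assumptions.
rng = np.random.default_rng(0)
n = 1000
toy = pd.DataFrame({
    "Sex_male": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
})
logit = 2.0 - 2.5 * toy["Sex_male"] - 0.8 * (toy["Pclass"] - 1)
survived = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

model = LogisticRegression(max_iter=1000).fit(toy, survived)

# A negative coefficient lowers the odds; an odds ratio below 1 says the same thing.
summary = pd.DataFrame(
    {"coef": model.coef_[0], "odds_ratio": np.exp(model.coef_[0])},
    index=toy.columns,
)
print(summary)
```

Coefficient sizes are only comparable across features on similar scales, so standardizing inputs such as `Fare` and `Age` before fitting is worth considering when the goal is interpretation.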


What the Results Tell Us

The models, together with the raw survival rates in the training data, confirm what historical records suggest:

  • Sex was the dominant factor. Women survived at roughly 74% vs. 19% for men.
  • Class mattered. First-class survival rate was ~63%; third-class was ~24%.
  • Traveling alone was a disadvantage. Passengers with family aboard survived at about 51% vs. roughly 30% for those traveling alone.
  • Fare correlates with survival — not directly, but because it proxies for cabin location and passenger class.
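Rates like these come from grouping on a column and averaging the 0/1 target. A minimal sketch on invented passengers (the numbers it prints are meaningless; `IsAlone` is a hypothetical helper column, not part of the dataset):

```python
import pandas as pd

# Invented passengers for illustration; on the real data, use the loaded frame instead.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 2],
    "SibSp": [1, 0, 0, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 2, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 target within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

# Hypothetical helper: True when no siblings, spouses, parents, or children are aboard.
toy["IsAlone"] = (toy["SibSp"] + toy["Parch"]) == 0
rate_by_alone = toy.groupby("IsAlone")["Survived"].mean()

print(rate_by_sex, rate_by_class, rate_by_alone, sep="\n\n")
```

In the walkthrough's frame, `Sex` has already been replaced by `Sex_male`, so group on that column or on a copy of the data loaded before encoding.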

Why This Problem Still Matters

The Titanic dataset teaches the skills that transfer everywhere:

  • Handling missing data without introducing leakage
  • Engineering features that carry real signal
  • Comparing model complexity vs. interpretability tradeoffs
  • Validating that your model captures real patterns, not noise
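On the first and last points: the median fills in this walkthrough are computed on the full frame before the train/validation split, which lets a little information from the validation rows into training. One common way to avoid that is to put the imputer inside a scikit-learn pipeline and score it with cross-validation. The sketch below assumes synthetic stand-in data (`X_toy`, `y_toy`) and is one possible setup, not the only one:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Synthetic stand-in data (not the Titanic): two numeric features, some ages missing.
rng = np.random.default_rng(42)
n = 300
X_toy = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
})
y_toy = (X_toy["Fare"] + rng.normal(0, 10, n) > 30).astype(int)
X_toy.loc[rng.random(n) < 0.2, "Age"] = np.nan

# The imputer sits inside the pipeline, so each fold learns its median
# from that fold's training rows only.
pipe = make_pipeline(
    SimpleImputer(strategy="median"),
    RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42),
)
scores = cross_val_score(pipe, X_toy, y_toy, cv=5, scoring="accuracy")
print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f}")
```

The spread across folds gives a rough sense of how much a single validation score can move by chance, which matters on a dataset as small as this one.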

The workflow — load, clean, engineer, model, evaluate — is the same whether you’re predicting shipwreck survival or customer churn. Get comfortable with the pattern here, then apply it to problems that actually affect someone’s bottom line.

Christopher A. Rotunno Grounded in Analytics

Data analytics engineer and BI leader. Building pipelines, models, and dashboards that turn raw data into clear decisions.

Copyright 2026 Christopher A. Rotunno. All Rights Reserved
