Ordinal Encoding — A Brief Guide

In machine learning, categorical data refers to variables that represent categories rather than numeric values—such as gender, education level, or product rating. While many algorithms can process numerical data effectively, they cannot inherently understand categorical text values. Feeding raw categorical data into models like decision trees, logistic regression, or neural networks often leads to errors or reduced performance.

Thus, encoding categorical variables into numerical format becomes an essential preprocessing step. Different encoding techniques exist, and ordinal encoding is specifically useful when there is an inherent order among categories. Choosing the right encoding strategy directly impacts model accuracy and learning efficiency.

What is Ordinal Encoding and Why is it Important?

Ordinal encoding is a technique that converts categorical variables with an inherent order into numerical values. In this method, each unique category is assigned an integer based on its rank or position. For example, education levels like “High School,” “Bachelor’s,” “Master’s,” and “PhD” might be encoded as 0, 1, 2, and 3 respectively.
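As a minimal sketch of the idea, the mapping above can be written out by hand with a dictionary and pandas. The names education_rank and levels are illustrative assumptions, not part of any library:

```python
import pandas as pd

# Hypothetical hand-written mapping: each label gets its rank in the ordering.
education_rank = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}

levels = pd.Series(["Master's", "High School", "PhD", "Bachelor's"])

# Series.map replaces each label with its integer rank.
encoded = levels.map(education_rank)
print(encoded.tolist())  # [2, 0, 3, 1]
```

A manual mapping like this keeps the order fully visible, which can be handy for a quick check before moving to an encoder class.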

It’s important to distinguish ordinal from nominal data:

  • Ordinal categorical data has a meaningful order (e.g., customer satisfaction ratings: Poor < Average < Good < Excellent).
  • Nominal categorical data has no inherent order (e.g., colors: Red, Blue, Green).

In ordinal encoding, preserving the natural order of categories is crucial because the numerical values imply a progression or ranking. If categories are incorrectly treated as unordered, models may misinterpret relationships between features and targets.
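One way to make the order explicit in pandas is an ordered categorical type. The sketch below assumes the satisfaction scale from the list above; the variable names are illustrative:

```python
import pandas as pd

rating_order = ["Poor", "Average", "Good", "Excellent"]
ratings = pd.Series(["Good", "Poor", "Excellent", "Average"])

# ordered=True tells pandas that the categories form a ranking.
ordered = ratings.astype(pd.CategoricalDtype(categories=rating_order, ordered=True))

# Integer position of each value within rating_order.
codes = ordered.cat.codes

# Comparisons like this are only meaningful because the type is ordered.
above_average = ordered > "Average"
```

With a nominal feature such as color, there is no sensible list to pass as categories, which is a quick signal that ordinal encoding may not be the right choice.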

This encoding is especially important for algorithms that are sensitive to feature magnitude or distance, such as:

  • Linear Regression
  • Logistic Regression
  • Support Vector Machines (SVMs)

When done correctly, ordinal encoding helps machine learning models capture the ordinal nature of the data, leading to better predictive accuracy and more meaningful insights.

Preparing Data Before Ordinal Encoding

Before applying ordinal encoding, it’s critical to identify variables that truly represent an ordered relationship. Not all categorical features should be ordinal encoded—only those where the sequence has a logical meaning (e.g., education level, customer satisfaction).

Handling missing values is equally important. Missing data should either be imputed carefully or assigned a separate category if its absence carries meaningful information.
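Both options can be sketched with plain pandas. The label "Missing" and the sample values are assumptions made for illustration, and where a "Missing" category belongs in the order is a domain decision:

```python
import pandas as pd

satisfaction = pd.Series(["Good", None, "Poor", "Good", None])

# Option 1: impute with the most frequent category (mode() ignores missing values).
imputed = satisfaction.fillna(satisfaction.mode()[0])

# Option 2: keep missingness as its own category (hypothetical label "Missing").
flagged = satisfaction.fillna("Missing")

print(imputed.tolist())
print(flagged.tolist())
```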

Additionally, applying domain knowledge is essential to correctly define the order. Incorrect ordering can mislead machine learning models and degrade performance. Proper preparation ensures that encoding accurately reflects the underlying data relationships.
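To see why the order must be stated explicitly, the sketch below compares scikit-learn's OrdinalEncoder (used in the steps that follow) with and without a category list. Without one, string categories are sorted alphabetically, which does not match the satisfaction scale:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

ratings = pd.DataFrame({"Satisfaction": ["Poor", "Average", "Good", "Excellent"]})

# No category list: labels are ordered alphabetically
# (Average, Excellent, Good, Poor).
default_codes = OrdinalEncoder().fit_transform(ratings).ravel()

# Explicit category list: labels are ordered as the domain dictates.
explicit = OrdinalEncoder(categories=[["Poor", "Average", "Good", "Excellent"]])
explicit_codes = explicit.fit_transform(ratings).ravel()

print(default_codes.tolist())   # [3.0, 0.0, 2.0, 1.0]
print(explicit_codes.tolist())  # [0.0, 1.0, 2.0, 3.0]
```

In the default result, "Excellent" receives a lower code than "Good", which is the kind of incorrect ordering described above.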

Step-by-Step Guide to Implementing Ordinal Encoding in Python

Step 1: Install Required Libraries

Before getting started, install the essential libraries if you haven’t already:

pip install pandas scikit-learn

These libraries will help you create datasets and apply encoding easily.

Step 2: Import Libraries

Import the necessary modules into your Python script:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

Pandas will handle data manipulation, while OrdinalEncoder from scikit-learn will perform the encoding.

Step 3: Create a Sample Dataset

Let’s create a simple dataset with ordered categories:

data = {
    'Education': ['High School', 'Bachelor', 'Master', 'PhD', 'Bachelor'],
    'Satisfaction': ['Poor', 'Good', 'Excellent', 'Average', 'Good']
}

df = pd.DataFrame(data)
print(df)

This example uses education levels and customer satisfaction ratings—both inherently ordered categories.

Step 4: Initialize and Apply OrdinalEncoder

Now, fit and transform the data using OrdinalEncoder:

# Define explicit orderings
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
satisfaction_order = ['Poor', 'Average', 'Good', 'Excellent']

# Initialize the encoder with category order
encoder = OrdinalEncoder(
    categories=[education_order, satisfaction_order],
    handle_unknown='use_encoded_value',
    unknown_value=-1,
)

# Apply encoding
df_encoded = encoder.fit_transform(df)
df_encoded = pd.DataFrame(df_encoded, columns=['Education', 'Satisfaction'])
print(df_encoded)

Note:

  • Setting handle_unknown='use_encoded_value' together with unknown_value=-1 ensures that any category not seen during fitting is encoded as -1 at transform time instead of raising an error.
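The effect of these settings can be checked with a small sketch. The label "Diploma" is a hypothetical unseen category chosen for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

train = pd.DataFrame({"Education": ["High School", "Bachelor", "Master", "PhD"]})
# "Diploma" is a hypothetical label that never appears in the training data.
new = pd.DataFrame({"Education": ["Master", "Diploma"]})

enc = OrdinalEncoder(
    categories=[["High School", "Bachelor", "Master", "PhD"]],
    handle_unknown="use_encoded_value",
    unknown_value=-1,
)
enc.fit(train)

# The known label keeps its rank; the unseen label falls back to -1.
new_codes = enc.transform(new).ravel()
print(new_codes.tolist())  # [2.0, -1.0]
```

Downstream code should treat -1 as "unknown" rather than as a rank below "High School".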

Step 5: Verify and Interpret Encoded Data

After encoding, check the transformed dataset:

print(df_encoded)

Each category is now replaced with an integer based on its specified order, preserving the ordinal nature of the features. This encoded dataset can now be safely fed into machine learning models for training or prediction.
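One way to verify the mapping, sketched here with the same sample data rebuilt so the snippet runs on its own, is to inspect the fitted encoder's categories_ attribute and round-trip the codes with inverse_transform:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "Education": ["High School", "Bachelor", "Master", "PhD", "Bachelor"],
    "Satisfaction": ["Poor", "Good", "Excellent", "Average", "Good"],
})
encoder = OrdinalEncoder(categories=[
    ["High School", "Bachelor", "Master", "PhD"],
    ["Poor", "Average", "Good", "Excellent"],
])
codes = encoder.fit_transform(df)

# categories_ holds, per column, the labels in the order the codes follow.
for column, cats in zip(df.columns, encoder.categories_):
    print(column, dict(enumerate(cats)))

# inverse_transform maps the codes back to the original labels.
restored = encoder.inverse_transform(codes)
print((restored == df.to_numpy()).all())
```

Note that the encoder returns floating-point codes (0.0, 1.0, and so on) rather than Python integers.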

Example 2: Using Ordinal Encoding on a Real Dataset

Let’s apply ordinal encoding to the Titanic dataset, a popular public dataset.

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Load the Titanic dataset
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic_df = pd.read_csv(url)

# Inspect the 'Pclass' feature (Passenger Class: 1 = First, 2 = Second, 3 = Third)
print(titanic_df['Pclass'].value_counts())

# Although already numerical, Pclass represents an ordinal category.
# Note that 1 is the highest class, so a smaller number means a higher class.
encoder = OrdinalEncoder()

# Apply encoder; ravel() flattens the (n, 1) result into a single column
titanic_df['Pclass_encoded'] = encoder.fit_transform(titanic_df[['Pclass']]).ravel()

# Compare before and after
print(titanic_df[['Pclass', 'Pclass_encoded']].head())

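In this example, the ‘Pclass’ feature, representing passenger class ranking, is treated as an ordinal variable. With no explicit category list, OrdinalEncoder sorts the numeric values, so classes 1, 2, and 3 become 0.0, 1.0, and 2.0. The order is preserved, but a smaller code still means a higher class, which matters when interpreting results from algorithms sensitive to feature magnitude.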
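If a larger value should mean a higher class, the direction can be flipped with an explicit mapping. The sketch below uses a small stand-in column instead of downloading the dataset; the sample values and the name Pclass_rank are assumptions for illustration:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small stand-in for the Titanic 'Pclass' column (illustrative values only).
sample = pd.DataFrame({"Pclass": [3, 1, 2, 3, 1]})

# Default encoding: sorted numeric order, so first class gets the smallest code.
sample["Pclass_encoded"] = OrdinalEncoder().fit_transform(sample[["Pclass"]]).ravel()

# Hypothetical reversed mapping: third class = 0, first class = 2.
sample["Pclass_rank"] = sample["Pclass"].map({3: 0, 2: 1, 1: 2})

print(sample)
```

Either direction preserves the order; the choice mainly affects how signs and magnitudes are read afterwards.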

Common Pitfalls to Avoid When Using Ordinal Encoding

  • Applying Ordinal Encoding to Non-Ordinal Data: Encoding nominal categories like colors or product IDs introduces false order, misleading models and degrading performance.
  • Ignoring Unseen Labels During Inference: By default, OrdinalEncoder raises an error when transform encounters a category that was not present during fitting, which can break a prediction pipeline. Always configure encoders to handle unknown values safely.
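For the first pitfall, one common alternative for nominal features is one-hot encoding, which creates a separate indicator column per category and so implies no order. This is a sketch using pandas, with a made-up color column, and is not the only option:

```python
import pandas as pd

colors = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Blue"]})

# One indicator column per color; no category is ranked above another.
one_hot = pd.get_dummies(colors, columns=["Color"], dtype=int)
print(one_hot)
```

Each row has exactly one indicator set to 1, so no color ends up numerically "greater" than another.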

Conclusion

Ordinal encoding is a vital technique in machine learning for converting ordered categorical variables into meaningful numerical representations. It ensures that the inherent ranking within the data is preserved, allowing models sensitive to magnitude, such as linear models or SVMs, to interpret features accurately.

It is particularly useful when working with features like education levels, satisfaction ratings, or product rankings, where the relative order carries important information. However, applying it thoughtfully, ensuring proper ordering, and handling unseen categories are all crucial. With careful use, ordinal encoding can improve model performance and make predictions more reliable.
