Making Sense of Categorical Data: Why It Matters for Machine Learning
Learn the best methods for encoding categorical data, including Label Encoding, One-Hot Encoding, and more!
Not all data is about numbers. Some of the most important details in a dataset come in the form of categories—things like customer types, product names, or payment methods.
This kind of data, called categorical data, helps describe the world in ways that numbers alone can’t.
But here’s the problem: computers don’t understand words like we do. If you give a machine learning model raw text—like "PayPal," "Credit Card," or "Cash"—it won’t know what to do with it.
Without properly converting this data into numbers, most models will either refuse to train at all or pick up patterns that aren’t really there, and the predictions will suffer.
So, how do we fix this? We have to translate categorical data into a format that machine learning models can actually use. The way we do this depends on what kind of categories we’re working with and how we want the model to interpret them.
In this article, I’ll break it all down—what categorical data is, why it’s important, and the best ways to turn it into numbers that make sense for machine learning. By the end, you’ll have a clear understanding of how to handle categorical data like the sensei you truly are.
👉 If you get value from this article, please help me out, leave it a ❤️, and share it with others who would enjoy this. Thank you so much!
Alright, let’s get right into categorical encoding.
What Is Categorical Data?
Categorical data is a type of data that represents groups or categories instead of measured quantities. Unlike numerical data, which deals with measurable values like age, height, or income, categorical data consists of labels (usually words) that describe different characteristics or traits.
Now to clear the air, there are two main types of categorical data:
Nominal Data (No Specific Order)
Nominal data includes categories that don’t have a natural order or ranking. These are simply labels used to classify things. For instance:
Colors: Orange, Purple, Blue
Car Brands: Porsche, Ford, Tesla
Customer Segments: New, Returning, VIP
Since there’s no meaningful sequence, you can’t say one category is “higher” or “better” than another. Any ranking you come up with is just your preference, or in other words, your opinion.
Ordinal Data (Has a Meaningful Order)
Ordinal data, on the other hand, includes categories that do follow a logical order. However, while the order matters, the exact difference between each level isn’t always clear. Here are a few examples:
Education Levels: High School, Bachelor's, Master's, PhD
Star Ratings: 1 star, 2 stars, 3 stars, 4 stars, 5 stars
Job Titles: Junior Analyst, Analyst, Senior Analyst, Manager
Even though these categories have a ranking, the gaps between them aren’t necessarily equal. For instance, the difference in knowledge between a Bachelor’s and a Master’s degree might not be the same as the difference between a Master’s and a PhD.
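To make the nominal vs. ordinal distinction concrete, here is a minimal sketch using pandas, which is my choice of library for illustration. The variable names and sample values are made up. It shows that an ordered categorical supports comparisons, while its integer codes only capture rank, not the size of the gap between levels.

```python
import pandas as pd

# Nominal: categories with no inherent order.
colors = pd.Categorical(["Orange", "Blue", "Purple", "Blue"])

# Ordinal: we state the ranking explicitly, lowest to highest.
education_levels = ["High School", "Bachelor's", "Master's", "PhD"]
education = pd.Categorical(
    ["Master's", "High School", "PhD", "Bachelor's"],
    categories=education_levels,
    ordered=True,
)

print(colors.ordered)     # False
print(education.ordered)  # True

# Ordered categoricals support comparisons and min/max.
print(education.min(), education.max())        # High School PhD
print((education > "Bachelor's").tolist())     # [True, False, True, False]

# The integer codes reflect rank only, not the distance between levels.
print(education.codes.tolist())                # [2, 0, 3, 1]
```

Asking for `colors.min()` would raise an error instead, since pandas has no ranking to work with for an unordered categorical.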
Why Do We Need to Encode Categorical Data?
To set our stage, let’s imagine you’re building a machine learning model to predict how much a customer will spend at a retail store. Your dataset has numerical columns like "total purchases" and "average spending per visit," but it also includes categorical data, like:
Customer Type: "New," "Returning," "Loyal"
Payment Method: "Credit Card," "Cash," "PayPal," "Bitcoin"
Product Category: "Electronics," "Clothing," "Furniture"
So, what’s the problem? Machine learning models don’t understand words, only numbers. If we try to feed this data directly into a model, it won’t know what to do with these text labels, and training will either fail or produce a poor model.
Why Can’t Machine Learning Models Use Text Data?
Think of it this way: If a model sees "Credit Card," "Bitcoin," and "Cash" as raw text, it has no way of knowing how they relate to each other or if they even matter.
Unlike humans, who understand that credit card transactions might have different spending patterns than cash purchases, a machine learning model sees them as meaningless strings.
To make this information useful, we need to convert categorical data into numbers—but in a way that preserves its meaning and relationships. The method we choose depends on the type of categorical data and how we want the model to interpret it.
To help you visualize why encoding is crucial, let’s say we try to train a model on a dataset whose "Payment Method" column still holds raw text.
If we don’t encode that column, our model won’t be able to process it. A machine learning model can’t do arithmetic on words, so most libraries will simply stop with an error rather than guess what "PayPal" means.
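To see this in practice, here is a minimal sketch that tries to fit a model on an unencoded text column. The dataset, the column names, and the choice of scikit-learn’s `LinearRegression` are all my assumptions for illustration, and the exact error message varies by library version.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "total_purchases": [5, 12, 3, 8],
    "payment_method": ["PayPal", "Credit Card", "Cash", "PayPal"],
    "amount_spent": [120.0, 340.0, 45.0, 210.0],
})

X = df[["total_purchases", "payment_method"]]  # still contains raw text
y = df["amount_spent"]

fit_error = None
try:
    LinearRegression().fit(X, y)
except (ValueError, TypeError) as exc:
    # scikit-learn cannot turn "PayPal" into a float, so fitting fails.
    fit_error = exc
    print("Model refused the raw text column:", exc)
```

Dropping the `payment_method` column would make the error go away, but it would also throw out whatever signal that column carries, which is why we encode it instead.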
Okay, but how should we convert these categories into numbers?
Should I just assign random numbers (e.g., Credit Card = 1, PayPal = 2, Cash = 3)?
Maybe I should create multiple columns (e.g., one column for each payment method, with values of 1 or 0 to indicate if a customer used it)?
Or should I rank them based on some meaningful order?
Each of these approaches is a different encoding method, and choosing the right one depends on the dataset and the problem you’re solving.
Types of Categorical Encoding
Not all ways of encoding data work the same. The method you choose can impact how well your model understands patterns and makes predictions. Let’s break down the most common encoding methods and when to use each one.
Label Encoding - assigns each category its own integer, a simple way to turn categories into numbers that machine learning models can process.
One-Hot Encoding - creates a separate 0/1 column for each unique value, so no category looks larger or smaller than another.
Ordinal Encoding - converts categories into numbers while keeping their natural order.
Frequency Encoding - turns categories into numbers based on how often they appear in a dataset.
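Here is a rough sketch of all four methods on one tiny table, using plain pandas. The data, the column names, and the `education_rank` mapping are hypothetical, and this is just one of several reasonable ways to do each encoding (scikit-learn has dedicated encoders as well).

```python
import pandas as pd

# Hypothetical toy data, invented for illustration only.
df = pd.DataFrame({
    "payment_method": ["Credit Card", "Cash", "PayPal", "Cash", "Bitcoin", "Cash"],
    "education": ["Bachelor's", "High School", "PhD", "Master's", "Bachelor's", "High School"],
})

# Label encoding: each category gets an arbitrary integer
# (pandas numbers them in sorted order of the category names).
df["payment_label"] = df["payment_method"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per unique value.
one_hot = pd.get_dummies(df["payment_method"], prefix="pay", dtype=int)

# Ordinal encoding: integers that follow a ranking we supply ourselves.
education_rank = {"High School": 0, "Bachelor's": 1, "Master's": 2, "PhD": 3}
df["education_ordinal"] = df["education"].map(education_rank)

# Frequency encoding: each category is replaced by its share of the rows.
frequencies = df["payment_method"].value_counts(normalize=True)
df["payment_freq"] = df["payment_method"].map(frequencies)

print(df)
print(one_hot)
```

Notice that the label-encoded column makes "PayPal" (3) look bigger than "Cash" (1), even though payment methods have no real order. That implied ranking is the main thing to watch for when using integer labels on nominal data.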
Hope you all have an amazing week nerds ~ Josh