
Word Embeddings for Tabular Data Feature Engineering
Introduction
It is hard to overstate how much word embeddings, dense vector representations of words, have revolutionized the field of natural language processing (NLP) by quantitatively capturing semantic relationships between words.
Models like Word2Vec and GloVe enable words with similar meanings to have similar vector representations, both supporting and uncovering the semantic similarities between words. While their primary application is in traditional language processing tasks, this tutorial explores a less conventional, yet powerful, use case: applying word embeddings to tabular data for feature engineering.
In traditional tabular datasets, categorical features are often handled with one-hot encoding or label encoding. However, these methods do not capture semantic similarities between the categories. For example, if a dataset contains a Product Category column with values like Electronics, Appliances, and Gadgets, one-hot encoding treats them as entirely, and equally, distinct. Word embeddings, if applicable, could represent Electronics and Gadgets as more similar than Electronics and Furniture, potentially enhancing model performance depending on the scenario.
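To make that contrast concrete, here is a minimal sketch using made-up three-dimensional vectors (purely illustrative values, not real embeddings): every pair of distinct one-hot vectors has zero cosine similarity, while embedding-style vectors can express graded similarity.

import numpy as np

def cosine(a, b):
    # cosine similarity between two vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# one-hot vectors: every pair of distinct categories is equally dissimilar
one_hot = {
    'electronics': np.array([1.0, 0.0, 0.0]),
    'gadgets':     np.array([0.0, 1.0, 0.0]),
    'furniture':   np.array([0.0, 0.0, 1.0]),
}
print(cosine(one_hot['electronics'], one_hot['gadgets']))    # 0.0
print(cosine(one_hot['electronics'], one_hot['furniture']))  # 0.0

# toy "embedding" vectors (illustrative values only): similarity is now graded
embedding = {
    'electronics': np.array([0.9, 0.8, 0.1]),
    'gadgets':     np.array([0.8, 0.9, 0.2]),
    'furniture':   np.array([0.1, 0.2, 0.9]),
}
print(cosine(embedding['electronics'], embedding['gadgets']))    # high
print(cosine(embedding['electronics'], embedding['furniture']))  # much lower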
This tutorial will guide you through a practical application of using pre-trained word embeddings to generate new features for a tabular dataset. We will focus on a scenario where a categorical column in our tabular data contains descriptive text that can be mapped to words for which embeddings exist.
Core Concepts
Before getting to the code, let’s review the core concepts:
- Word embeddings: Numerical representations of words in a vector space. Words with similar meanings are located closer together in this space.
- Word2Vec: A popular algorithm for creating word embeddings, developed by Google. It has two main architectures: Continuous Bag-of-Words (CBOW) and Skip-gram (see the short training sketch after this list).
- GloVe (Global Vectors for Word Representation): Another widely used word embedding model, which leverages global word-word co-occurrence statistics from a corpus.
- Feature engineering: The process of transforming raw data into features that better represent the underlying problem to a machine learning model, leading to improved model performance.
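As a quick illustration of the two Word2Vec architectures, gensim's Word2Vec class exposes an sg parameter (0 for CBOW, 1 for Skip-gram). The corpus below is a tiny made-up example, so the resulting vectors are not meaningful; this sketch only shows the mechanics.

from gensim.models import Word2Vec

# a tiny toy corpus of tokenized sentences -- far too small for useful vectors
sentences = [
    ["electronics", "gadget", "appliance"],
    ["tool", "kitchenware", "appliance"],
    ["electronics", "gadget", "phone"],
]

# sg=0 -> CBOW (predict a word from its surrounding context)
cbow_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)

# sg=1 -> Skip-gram (predict the context from a word)
skipgram_model = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(cbow_model.wv["electronics"].shape)      # (10,)
print(skipgram_model.wv["electronics"].shape)  # (10,)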
Our approach involves using a pre-trained Word2Vec model, such as one trained on Google News, to convert categorical text entries into their corresponding word vectors. These vectors then become new numerical features for our tabular data. This technique is particularly useful when the categorical values have inherent textual meaning that can be leveraged, as in our mock scenario, where a dataset contains a categorical text column whose values can be used to gauge the similarity of products. The same approach could be extended to, say, a product description text column if one existed, bolstering the possibility of similarity measurements, but at that point we are into much more “traditional” natural language processing territory.
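If downloading the multi-gigabyte Google News model is impractical, one lighter alternative (not what the tutorial below uses, just an option) is gensim's downloader API, which fetches smaller pre-trained vector sets such as glove-wiki-gigaword-50. A minimal sketch, assuming an internet connection and that the queried words appear in that model's vocabulary:

import gensim.downloader as api

# downloads and caches a relatively small set of pre-trained GloVe vectors
word_vectors = api.load("glove-wiki-gigaword-50")

# semantically related words should score higher than unrelated ones
print(word_vectors.similarity("electronics", "gadget"))
print(word_vectors.similarity("electronics", "furniture"))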
Practical Application: Feature Engineering with Word2Vec
Let’s consider a hypothetical dataset with a column called ItemDescription containing short phrases or single words describing an item. We’ll use a pre-trained Word2Vec model to convert these descriptions into numerical features. We’ll simulate a dataset for this purpose.
First, let’s import the libraries that we will need. It goes without saying that you will need to have these installed into your Python environment.
import pandas as pd
import numpy as np
from gensim.models import KeyedVectors
Now, let’s simulate a very simple tabular dataset with a categorical text column.
# create "data" as a dictionary
data = {
    'ItemID': [1, 2, 3, 4, 5, 6],
    'Price': [100, 150, 200, 50, 250, 120],
    'ItemDescription': ['electronics', 'gadget', 'appliance', 'tool', 'electronics', 'kitchenware'],
    'Sales': [10, 15, 8, 25, 12, 18]
}

# convert to Pandas dataframe
df = pd.DataFrame(data)

# output resulting dataset
print("Original DataFrame:")
print(df)
print("\n")
Next, we will load a pre-trained Word2Vec model for converting our text categories to embeddings.
For real results, you will want to download a large pre-trained model such as GoogleNews-vectors-negative300.bin.gz, available from the original Word2Vec project page: https://code.google.com/archive/p/word2vec/. For demonstration purposes, the code below falls back to a small dummy model if that file isn't present.
try:
    # replace "GoogleNews-vectors-negative300.bin" with your downloaded model path
    word_vectors = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
    print("Pre-trained Word2Vec model loaded successfully.")
except FileNotFoundError:
    # display a warning!
    import warnings
    warnings.warn("Using dummy embeddings! Download GoogleNews-vectors for real results.")

    # create dummy model
    from gensim.models import Word2Vec
    sentences = [["electronics", "gadget", "appliance", "tool", "kitchenware"],
                 ["phone", "tablet", "computer"]]
    dummy_model = Word2Vec(sentences, vector_size=10, min_count=1)
    word_vectors = dummy_model.wv
    print("Dummy Word2Vec model created.")
OK. With the above, we have either loaded a capable word embeddings model and can now use it, or we have created a very small dummy embeddings model of our own for the purposes of this tutorial only (it is useless elsewhere).
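Before moving on, it can be worth sanity-checking whichever model was loaded. With the real Google News vectors, the nearest neighbors below should look semantically sensible; with the dummy model, they will be essentially random. A minimal check:

# confirm a description is actually in the model's vocabulary
print('electronics' in word_vectors.key_to_index)

# inspect the nearest neighbors of a sample description
print(word_vectors.most_similar('electronics', topn=3))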
Now we create a function to fetch the word embedding for an item description (ItemDescription), which is essentially our item “category”. Note that we avoid the term “category” for these descriptions in order to keep our mock data as separate as possible from the concept of “categorical data” and to avoid any potential confusion.
def get_word_embedding(description, model):
    try:
        # query the embeddings "model" for the embedding matching the "description"
        return model[description]
    except KeyError:
        # return a zero vector if word not found
        return np.zeros(model.vector_size)
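Since ItemDescription values could, in principle, be short phrases rather than single words, a common variant (not required for this particular mock dataset) is to average the vectors of the in-vocabulary tokens. A minimal sketch of that idea:

def get_phrase_embedding(description, model):
    # collect vectors for each in-vocabulary token in the phrase
    tokens = description.lower().split()
    vectors = [model[t] for t in tokens if t in model.key_to_index]
    if not vectors:
        # no token found: fall back to a zero vector
        return np.zeros(model.vector_size)
    # average the token vectors into a single phrase vector
    return np.mean(vectors, axis=0)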
And now it’s time to actually apply the function to our dataset’s ItemDescription column.
# create new columns for each dimension of the word embedding
embedding_dim = word_vectors.vector_size
embedding_columns = [f'desc_embedding_{i}' for i in range(embedding_dim)]

# apply the function to each description
embeddings = df['ItemDescription'].apply(lambda x: get_word_embedding(x, word_vectors))

# expand the embeddings into separate columns
embeddings_df = pd.DataFrame(embeddings.tolist(), columns=embedding_columns, index=df.index)
With our newfound embedding features in hand, let’s go ahead and concatenate them to the original DataFrame while dropping the original, now-redundant ItemDescription column, and then print it out to have a look.
df_engineered = pd.concat([df.drop('ItemDescription', axis=1), embeddings_df], axis=1)

print("\nDataFrame after feature engineering with word embeddings:")
print(df_engineered)
Wrapping Up
By leveraging pre-trained word embeddings, we have transformed a categorical text feature into a rich, numerical representation that captures semantic information. This new set of features can then be fed into a machine learning model, potentially leading to improved performance, especially in tasks where the relationships between categorical values are nuanced and textual. Remember that the quality of your embeddings heavily depends on the pre-trained model and its training corpus.
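As a quick demonstration of that last point (purely illustrative, given only six rows), the engineered features can be dropped straight into a scikit-learn model; here is a minimal sketch predicting Sales with a random forest, assuming scikit-learn is installed:

from sklearn.ensemble import RandomForestRegressor

# features: everything except the identifier and the target
X = df_engineered.drop(columns=['ItemID', 'Sales'])
y = df_engineered['Sales']

# fit on the toy data -- with six rows this only demonstrates the mechanics
model = RandomForestRegressor(n_estimators=50, random_state=42)
model.fit(X, y)

print(model.predict(X[:2]))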
This technique is not limited to product descriptions. It can be applied to any categorical column containing descriptive text, such as JobTitle or CustomerFeedback (after appropriate text processing to extract keywords). The key is that the text in the categorical column should be meaningful enough to be represented by word embeddings.
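For instance, a hypothetical JobTitle column could be run through essentially the same pipeline, here using the phrase-averaging helper sketched earlier (the column values and helper name are illustrative, not from a real dataset, and the embeddings are only meaningful with a real pre-trained model rather than the dummy fallback):

# a small made-up table of job titles
jobs = pd.DataFrame({'JobTitle': ['data scientist', 'software engineer', 'accountant']})

# reuse the phrase-averaging helper to embed multi-word titles
job_embeddings = jobs['JobTitle'].apply(lambda x: get_phrase_embedding(x, word_vectors))
job_features = pd.DataFrame(job_embeddings.tolist(),
                            columns=[f'title_embedding_{i}' for i in range(word_vectors.vector_size)],
                            index=jobs.index)

print(pd.concat([jobs, job_features], axis=1).head())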