IMDB Movie Review Sentiment Analysis using PyTorch and Word2Vec Embeddings


In today’s world, sentiment analysis has become a crucial task for businesses and individuals to understand public opinion, customer feedback, and more. Whether it’s gauging the mood of social media posts, reviews, or comments, sentiment analysis can reveal valuable insights. In this tutorial, we will walk through the process of building a sentiment analysis model using PyTorch and Word2Vec embeddings.

We will use the IMDB movie reviews dataset to train our model to classify reviews as positive or negative. The approach leverages the power of pre-trained Word2Vec embeddings to convert words into numerical vectors that capture semantic meaning, which our PyTorch-based neural network will then use for sentiment prediction.

By the end of this tutorial, you will learn how to:

  • Preprocess text data and clean it for analysis.
  • Use Word2Vec embeddings to convert text into meaningful feature vectors.
  • Build a neural network model with PyTorch for sentiment classification.
  • Evaluate the model’s performance and apply it to real-world reviews.

Let’s dive into the details and start building our sentiment analysis model!

What is an embedding?

In machine learning, particularly in natural language processing (NLP), an embedding is a way of converting text (such as words, sentences, or even entire documents) into numerical vectors. These vectors capture semantic meaning, allowing models to understand and process the text effectively.

Instead of treating each word as a discrete symbol or using one-hot encoding (which results in high-dimensional sparse vectors), embeddings represent words as dense, continuous vectors in a lower-dimensional space. These dense vectors are typically learned through large datasets, capturing relationships and similarities between words.

For example:

  • Words with similar meanings (like “king” and “queen”) are mapped to vectors that are closer to each other in the embedding space.
  • Words that appear in similar contexts end up with similar vector representations.

Why Are Embeddings Important?

Traditional methods of processing text (like using one-hot encoding) can’t capture relationships between words. For example, “cat” and “dog” would be treated as entirely different entities in one-hot encoding, despite both being animals. Embeddings overcome this limitation by placing similar words closer together in the vector space.
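
To make this concrete, here is a minimal sketch (using NumPy, with made-up embedding values purely for illustration) showing that one-hot vectors carry no notion of similarity, while dense vectors can:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# One-hot encoding over a small vocabulary: "cat" and "dog" share nothing
cat_onehot = np.array([1, 0, 0, 0, 0])
dog_onehot = np.array([0, 1, 0, 0, 0])
print(cosine(cat_onehot, dog_onehot))  # 0.0 -- every word pair is equally dissimilar

# Toy dense embeddings (illustrative values, not from a real model)
cat_emb = np.array([0.8, 0.1, 0.3])
dog_emb = np.array([0.7, 0.2, 0.4])
print(cosine(cat_emb, dog_emb))  # ~0.98 -- similar animals, similar vectors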

Types of Word Embeddings

Some popular pre-trained word embedding models include:

  • Word2Vec: A model that learns word representations from large text corpora, available in two versions: Skip-gram and Continuous Bag of Words (CBOW).
  • GloVe (Global Vectors for Word Representation): An unsupervised learning algorithm for generating word vectors by capturing global word-word co-occurrence statistics from a corpus.
  • FastText: An extension of Word2Vec that represents each word as a bag of character n-grams, improving the representation of rare and out-of-vocabulary words.

Word2Vec Embeddings

Word2Vec is one of the most popular methods for learning word embeddings. It works by training a model on a large corpus of text, where it tries to predict the surrounding words for a given word (Skip-gram model) or predict a word based on its surrounding words (CBOW model). The resulting vector for each word captures semantic relationships, allowing words with similar meanings or usage to have similar vector representations.
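
As a small illustration of the two training objectives, Gensim lets you train either variant on a toy corpus by toggling the sg flag (this is only a sketch; real embeddings need a large corpus):

from gensim.models import Word2Vec

# A toy corpus: a list of tokenized sentences
sentences = [
    ["the", "movie", "was", "great"],
    ["the", "film", "was", "great"],
    ["the", "movie", "was", "terrible"],
]

# sg=1 trains Skip-gram (predict the context from the word);
# sg=0 trains CBOW (predict the word from its context)
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)

print(skipgram.wv["movie"].shape)           # (50,) -- one dense vector per word
print(cbow.wv.similarity("movie", "film"))  # noisy on a corpus this small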

In the context of our sentiment analysis model, we use Word2Vec embeddings to convert the words in movie reviews into numerical vectors, which the machine learning model can process to predict the sentiment of the review. This helps the model understand the relationships between words, improving its ability to classify the reviews accurately.

Dataset

For this tutorial, we will use the IMDB Movie Reviews Dataset, which contains 50,000 movie reviews labeled with their corresponding sentiment—positive or negative. This dataset is widely used for sentiment analysis tasks, providing a good balance between review length and diversity of language. The original release is split evenly into 25,000 training and 25,000 test reviews, although in this tutorial we will create our own split from the combined 50,000 reviews.

You can download the dataset from the following Kaggle link: IMDB Dataset of 50K Movie Reviews

Dataset Structure

The dataset is divided into two main components:

  1. Reviews: Each entry contains a text review of a movie.
  2. Sentiments: The sentiment of each review is labeled as either positive (1) or negative (0), which is the target variable for our model.

The dataset is relatively clean, but we still need to preprocess the text (e.g., tokenizing the words, removing special characters) and convert the words into embeddings before training a model. This step ensures that the model can understand the text in a numerical format, which is essential for training machine learning models.

By using this dataset, we can train our sentiment analysis model to predict whether a movie review is positive or negative based on the words in the review, providing valuable insights into how well a model can process and analyze human language.

Preparing the Dataset for Training and Evaluation

To train and evaluate our sentiment analysis model effectively, we need to divide our dataset into two distinct parts: one for training the model and the other for testing its performance. This ensures that the model learns from one portion of the data (the training set) and is evaluated on a separate unseen portion (the testing set), which is essential for understanding how well the model generalizes to new data.

To accomplish this, we leverage the powerful train_test_split function from sklearn.model_selection, which conveniently splits the dataset into two subsets: training and testing. Below is how this process is implemented:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the dataset (adjust the filename/path to wherever you saved the Kaggle CSV)
df = pd.read_csv("IMDB Dataset.csv")

# Map the textual labels to numerical values
df['sentiment'] = df['sentiment'].map({"negative": 0, "positive": 1})
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df['sentiment'])

Before splitting the data, the sentiment labels in the dataset are mapped from their original textual representation ("positive", "negative") to numerical values (1 for positive and 0 for negative). This conversion is crucial, as machine learning algorithms work with numerical data rather than text.

The train_test_split function is used to split the data into training and testing sets. Key parameters include:

  • test_size=0.2: This ensures that 20% of the dataset is reserved for testing, while the remaining 80% is used for training the model.
  • random_state=42: This parameter ensures that the data split is reproducible. By using the same seed value, you will get the same split each time you run the code.
  • stratify=df['sentiment']: This ensures that the sentiment distribution is preserved in both the training and testing sets. This is particularly important for ensuring the model has a balanced representation of both positive and negative reviews in each subset.

Why This Process Matters:

  • Training and Testing Split: This division allows the model to train on one subset of data while being evaluated on a separate, unseen subset. This is critical for testing the model’s ability to generalize to new data.
  • Stratified Sampling: By using stratified sampling, we ensure that both the training and testing sets reflect the same proportions of sentiment labels, thereby preventing any biases that could arise if the split were random. This results in a more reliable evaluation of the model’s performance.

With the data now split into training and testing sets, we are ready to proceed with training our model on the training data (train_df) and evaluating its performance on the testing data (test_df).
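
As a quick sanity check (assuming the split above has been run), you can verify that the label proportions match across the full dataset and both subsets:

# The normalized label counts should be (nearly) identical in all three
print(df['sentiment'].value_counts(normalize=True))
print(train_df['sentiment'].value_counts(normalize=True))
print(test_df['sentiment'].value_counts(normalize=True))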

Text Preprocessing and Tokenization

Before training a machine learning model for sentiment analysis, it’s crucial to preprocess the raw text data to ensure that it’s in a suitable format for analysis. Text data often contains irrelevant elements like special characters, HTML tags, or unnecessary whitespaces, which can negatively affect the model’s ability to learn meaningful patterns. Therefore, a comprehensive cleaning process is essential.

In this step, we utilize the nltk (Natural Language Toolkit) library for text cleaning and tokenization, which is a common practice in natural language processing (NLP). Here’s a breakdown of the text preprocessing and tokenization process:

Steps for Text Preprocessing:

  1. Lowercasing:
    Text data can contain words in various cases (upper, lower, or mixed). To avoid treating the same word as different (e.g., “Happy” vs. “happy”), we convert all text to lowercase.
  2. HTML Tag Removal:
    Many textual datasets contain HTML tags, especially if the data was scraped from websites. These tags do not carry useful information for sentiment analysis, so we remove them using a regular expression.
  3. Special Character Removal:
    Text often contains punctuation, numbers, or other non-alphabetic characters. For sentiment analysis, we typically remove these characters as they don’t contribute to the sentiment of the review. We do this by keeping only alphanumeric characters and spaces.
  4. Tokenization:
    Tokenization is the process of splitting text into smaller units, typically words, called tokens. This allows the model to analyze the frequency and structure of words, which is essential for understanding sentiment. We use the word_tokenize function from the nltk library to achieve this.
import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download('punkt')      # Download the tokenizer models
nltk.download('punkt_tab')  # Required by newer NLTK versions

def clean_text(text):
    text = text.lower()  # Convert to lowercase
    text = re.sub(r"<.*?>", "", text)  # Remove HTML tags
    text = re.sub(r"[^a-zA-Z0-9\s]", "", text)  # Remove special characters
    tokens = word_tokenize(text)  # Tokenize text
    return tokens

# Apply cleaning and tokenization
train_df['review'] = train_df['review'].apply(clean_text)
test_df['review'] = test_df['review'].apply(clean_text)

Explanation of the Code:

  • nltk.download('punkt') and nltk.download('punkt_tab'): These commands download the resources required for tokenization. punkt is a pre-trained tokenizer model that splits text into individual words; punkt_tab is an additional resource required by newer versions of NLTK.
  • re.sub(r"<.*?>", "", text): This regular expression removes any HTML tags from the text. It matches anything between < and > and replaces it with an empty string.
  • re.sub(r"[^a-zA-Z0-9\s]", "", text): This regular expression removes any characters that are not letters, numbers, or spaces.
  • word_tokenize(text): This function splits the cleaned text into individual tokens (words) that can be analyzed further.

Finally, the clean_text function is applied to each review in the dataset using the apply() method, ensuring that both the training and testing datasets are preprocessed and tokenized.

Why Tokenization and Cleaning Matter:

  • Consistency: Converting text to lowercase ensures that the model treats words like “happy” and “Happy” as the same.
  • Removal of Noise: Cleaning the text by removing HTML tags and special characters ensures that only meaningful words are fed into the model.
  • Effective Analysis: Tokenization breaks down the text into smaller, manageable pieces that the machine learning model can process, helping it learn more effectively.

By preprocessing and tokenizing the text data, we can significantly improve the quality of the features that will be used to train our sentiment analysis model. This is an essential step in building robust NLP systems.
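
As a quick check of the pipeline, here is what clean_text produces for a small, hypothetical review fragment:

sample = "I LOVED this film!!! <br /><br />10/10 would watch again."
print(clean_text(sample))
# ['i', 'loved', 'this', 'film', '1010', 'would', 'watch', 'again']

Note that "10/10" survives as "1010": the regex keeps alphanumeric characters but strips the slash.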

Converting Reviews into Word Embeddings

Now that we’ve cleaned and tokenized the text data, the next step is to convert these tokens into numerical representations, which the model can understand. In natural language processing (NLP), word embeddings are a way to represent words as vectors in a continuous vector space, where semantically similar words are mapped to nearby points. This transformation enables the model to learn relationships between words more effectively.

For this tutorial, we are going to use a Word2Vec model pre-trained on a large corpus of text, the Google News dataset, to generate high-quality word embeddings. These embeddings are 300-dimensional vectors that represent words in a way that captures their meaning based on context.

How Word2Vec Works:

The Word2Vec algorithm learns to map words to dense vectors, where each word is represented by a vector in a high-dimensional space. The similarity between words can be captured by the distance between their corresponding vectors in this space. For example, “king” and “queen” would have similar vector representations, reflecting their semantic similarity.

Code Implementation:

We will use the Gensim library to load a pre-trained Word2Vec model and apply it to our dataset. The Word2Vec model that we use is trained on the Google News dataset, which contains about 100 billion words, making it an excellent resource for generating word embeddings.

Here’s the implementation:

import gensim.downloader as api
import numpy as np

# Load the pre-trained Word2Vec model
word2vec = api.load("word2vec-google-news-300")  # 300-dimensional vectors
embedding_dim = 300  # Word2Vec vector size

def get_embedding(tokens, embedding_dim=300):
    # Generate word vectors for each word in the review if they exist in the Word2Vec vocabulary
    vectors = [word2vec[word] for word in tokens if word in word2vec]
    
    # If no word embeddings are found, return a zero vector
    if len(vectors) == 0:
        return np.zeros(embedding_dim)
    
    # Compute the average of all word vectors to represent the entire review
    return np.mean(vectors, axis=0)  

# Convert reviews to embeddings
train_df['vector'] = train_df['review'].apply(lambda x: get_embedding(x))
test_df['vector'] = test_df['review'].apply(lambda x: get_embedding(x))

Explanation of the Code:

  • word2vec = api.load("word2vec-google-news-300"): This loads the pre-trained Word2Vec model with 300-dimensional vectors trained on the Google News dataset (a quick sanity check of these vectors follows this list).
  • get_embedding(tokens, embedding_dim=300): This function generates the embedding for a list of tokens (words) by looking them up in the Word2Vec model.
    • If a word exists in the Word2Vec vocabulary, its corresponding vector is added to the list vectors.
    • If no vectors are found (i.e., none of the words in the review are in the Word2Vec model’s vocabulary), the function returns a zero vector.
    • If vectors are found, the function computes the average of all word vectors to generate a single vector representing the entire review.
  • train_df['vector'] = train_df['review'].apply(lambda x: get_embedding(x)): This applies the get_embedding function to each tokenized review in the training dataset and creates a new column vector containing the average word embeddings for each review.
  • test_df['vector'] = test_df['review'].apply(lambda x: get_embedding(x)): Similarly, this applies the embedding function to the test dataset, ensuring that the test data is also converted into vector form.
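
As a quick sanity check of the vectors we just loaded, Gensim's KeyedVectors interface exposes similarity queries (the exact neighbours and scores depend on the model):

print(word2vec.similarity("good", "great"))    # high cosine similarity expected
print(word2vec.most_similar("movie", topn=3))  # nearest neighbours in the vector space

# The classic analogy: king - man + woman is closest to queen
print(word2vec.most_similar(positive=["king", "woman"], negative=["man"], topn=1))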

Why Use Word Embeddings?

  • Capturing Semantic Meaning: By using pre-trained word embeddings like Word2Vec, we can capture the semantic meaning of words. Words that are semantically similar will have similar vector representations, allowing the model to understand relationships between words.
  • Dimensionality Reduction: Instead of using raw text data, which can have high dimensionality (e.g., one dimension per unique word), embeddings represent each word as a fixed-length vector, significantly reducing the dimensionality.
  • Improved Performance: Using embeddings helps improve the performance of NLP models, as it provides rich, dense representations of words rather than sparse, high-dimensional representations.

Now that we have transformed our reviews into word embeddings, the data is ready for training a machine learning model, and we can move on to building the model itself!

Creating a PyTorch Dataset and DataLoader

Now that we have transformed our text reviews into word embeddings, the next step is to create a custom PyTorch Dataset and DataLoader. These are crucial components when training models in PyTorch, as they help efficiently manage and load the data in batches, making it easier to train models on large datasets.

What is a Dataset in PyTorch?

In PyTorch, a Dataset is an abstract class that represents a dataset. To create a custom dataset, you need to inherit from this class and implement the following methods:

  • __init__: Initializes the dataset by accepting the data and labels.
  • __len__: Returns the number of samples in the dataset.
  • __getitem__: Returns a single sample (input and label) from the dataset at the specified index.

The custom dataset class allows us to handle the data in a structured way, particularly when we need to perform operations like loading word embeddings and labels.

What is a DataLoader in PyTorch?

A DataLoader in PyTorch is used to load a dataset in batches and shuffle the data. It helps in providing an iterator over the dataset, which allows easy access to data during training. You can also configure parameters like batch size and whether to shuffle the data for each epoch.

Now, let’s go ahead and define our custom ReviewDataset and create the necessary DataLoader for training and testing.

Code Implementation:

import torch
from torch.utils.data import Dataset, DataLoader

# Define a custom dataset class
class ReviewDataset(Dataset):
    def __init__(self, df):
        # Convert word embeddings and labels to torch tensors
        self.reviews = torch.tensor(np.array(df['vector'].tolist()), dtype=torch.float32)  # Stack vectors into one array first (faster than a list of arrays)
        self.labels = torch.tensor(df['sentiment'].tolist(), dtype=torch.float32)

    def __len__(self):
        # Return the length of the dataset
        return len(self.reviews)

    def __getitem__(self, index):
        # Return a specific sample (review and label) by index
        return self.reviews[index], self.labels[index]

# Create the training dataset
train_dataset = ReviewDataset(train_df)

# Create the testing dataset
test_dataset = ReviewDataset(test_df)

# Create DataLoader for training
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Create DataLoader for testing
test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False)

Explanation of the Code:

  1. Creating the Custom Dataset:
    • ReviewDataset(Dataset): This class inherits from torch.utils.data.Dataset. It takes a DataFrame (which contains word embeddings and sentiment labels) and converts them into PyTorch tensors.
    • self.reviews = torch.tensor(df['vector'].tolist(), dtype=torch.float32): This line converts the vector column, which contains the word embeddings for each review, into a PyTorch tensor with the float32 datatype.
    • self.labels = torch.tensor(df['sentiment'].tolist(), dtype=torch.float32): Similarly, the sentiment column, which contains the sentiment labels (0 for negative and 1 for positive), is also converted into a tensor.
    • __len__(self): This method returns the length of the dataset, i.e., the number of reviews in the dataset.
    • __getitem__(self, index): This method returns a single review (embedding) and its corresponding sentiment label at the specified index. This is how PyTorch will retrieve individual samples during training.
  2. Creating the DataLoader:
    • train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True): This line creates a DataLoader for the training dataset. We specify the batch_size as 32, meaning that the data will be loaded in batches of 32 samples. Setting shuffle=True ensures that the data is randomly shuffled at the start of each epoch, which helps the model generalize better and prevents overfitting.
    • test_loader = DataLoader(test_dataset, batch_size=32, shuffle=False): Similarly, we create a DataLoader for the test dataset with the same batch size. Since the test set is only used for evaluation, there is no need to shuffle it, so we set shuffle=False.

Why Use DataLoader?

  • Efficient Data Handling: When working with large datasets, loading all the data into memory at once might not be feasible. The DataLoader takes care of loading the data in manageable batches.
  • Random Shuffling: Shuffling the data ensures that the model doesn’t learn patterns specific to the order of the data, which helps with generalization.
  • Easy Batch Management: By using the DataLoader, we don’t need to manually manage the batching of data. The DataLoader handles batching and provides an iterator for easy iteration during training.

Now that we have our custom dataset and DataLoader set up, we are ready to feed the data into our neural network for training!
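
Before doing so, it can help to peek at a single batch and confirm the tensor shapes (a small sanity check, assuming the loaders defined above):

reviews_batch, labels_batch = next(iter(train_loader))
print(reviews_batch.shape)  # torch.Size([32, 300]) -- a batch of averaged embeddings
print(labels_batch.shape)   # torch.Size([32]) -- one sentiment label per review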

Defining the Neural Network Model

In this section, we will define the neural network architecture that will be used to predict the sentiment of the movie reviews. We will create a simple feed-forward neural network (FNN) using PyTorch’s nn.Module class. Our model will consist of several layers, including fully connected layers, batch normalization layers, dropout, and activation functions.

Understanding the Model Architecture:

We are building a binary classification model, where the goal is to predict whether a review has a positive or negative sentiment based on the word embeddings of the review.

Code Implementation:

import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

class SentimentModel(nn.Module):
    def __init__(self, input_dim):
        super(SentimentModel, self).__init__()
        # Fully connected layers
        self.fc1 = nn.Linear(input_dim, 512)  # First fully connected layer: input_dim -> 512 units
        self.bn1 = nn.BatchNorm1d(512)  # Batch normalization for the first layer
        self.fc2 = nn.Linear(512, 256)  # Second fully connected layer
        self.bn2 = nn.BatchNorm1d(256)  # Batch normalization for the second layer
        self.fc3 = nn.Linear(256, 128)  # Third fully connected layer
        self.fc4 = nn.Linear(128, 1)  # Final output layer (1 unit for binary classification)
        self.dropout = nn.Dropout(0.3)  # Dropout layer with a probability of 0.3
        self.sigmoid = nn.Sigmoid()  # Sigmoid activation function to output probabilities

    def forward(self, x):
        # Forward pass through the network
        x = F.relu(self.bn1(self.fc1(x)))  # ReLU activation + BatchNorm
        x = self.dropout(x)  # Dropout after the first layer
        x = F.relu(self.bn2(self.fc2(x)))  # ReLU activation + BatchNorm
        x = self.dropout(x)  # Dropout after the second layer
        x = F.relu(self.fc3(x))  # ReLU activation after the third layer
        x = self.fc4(x)  # No activation function yet, this will be passed to Sigmoid
        return self.sigmoid(x)  # Sigmoid for binary classification (0 or 1)

# Initialize the model with the embedding dimension as input size
model = SentimentModel(embedding_dim)

# Define loss function and optimizer
criterion = nn.BCELoss()  # Binary Cross Entropy Loss for binary classification
optimizer = optim.Adam(model.parameters(), lr=0.001)  # Adam optimizer with a learning rate of 0.001

Model Architecture Breakdown:

  1. Fully Connected Layers (fc1, fc2, fc3, fc4):
    • We have four fully connected layers in the network. The first layer takes the input embedding dimension and outputs 512 units. The subsequent layers progressively reduce the number of units to 256, 128, and finally output a single value representing the predicted sentiment (0 for negative, 1 for positive).
  2. Batch Normalization (bn1, bn2):
    • Batch normalization is applied after the first and second fully connected layers. This helps stabilize and speed up training by normalizing the activations flowing through the network.
  3. ReLU Activation:
    • ReLU (Rectified Linear Unit) is used as the activation function after each fully connected layer. It introduces non-linearity to the model, allowing it to learn more complex patterns.
  4. Dropout:
    • Dropout is added after the first two fully connected layers to prevent overfitting. With a dropout rate of 0.3, 30% of the neurons in the dropout layers will be randomly set to zero during each forward pass. This forces the network to generalize better by preventing over-reliance on specific neurons.
  5. Sigmoid Activation:
    • The final output layer uses the Sigmoid activation function to produce a probability score between 0 and 1, indicating the likelihood of the review being positive (1) or negative (0).

Loss Function and Optimizer:

  1. Loss Function:
    • Binary Cross-Entropy Loss (BCELoss) is used because this is a binary classification task. It computes the error between the predicted probabilities and the actual sentiment labels (0 or 1); see the quick check after this list.
  2. Optimizer:
    • Adam Optimizer is used to minimize the loss function. Adam is a popular choice because it adapts the learning rate for each parameter during training, making it efficient and effective for many types of neural networks.
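
To build intuition for BCELoss, here is a minimal check that reproduces it by hand for a single confident prediction:

import torch
import torch.nn as nn

# BCE for one example: -(y*log(p) + (1-y)*log(1-p))
p = torch.tensor([0.9])  # predicted probability of "positive"
y = torch.tensor([1.0])  # true label: positive

print(nn.BCELoss()(p, y))    # tensor(0.1054)
print(-torch.log(p).item())  # same value by hand: with y=1 the loss reduces to -log(p)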

Why This Architecture?

  • Scalability and Performance: By using multiple fully connected layers with batch normalization and dropout, the network is capable of learning complex patterns in the data while avoiding overfitting.
  • Binary Classification: Since we are working with binary sentiment (positive or negative), the final output layer has one unit with a sigmoid activation to produce a probability that will be classified as either positive or negative.

This model architecture will be used to predict the sentiment of movie reviews, and now that it’s defined, we can move forward with training and evaluation!
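
As a quick sanity check before training, you can print the architecture and count its trainable parameters (with the layer sizes above this comes to roughly 320,000):

print(model)
total_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {total_params:,}")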

Training Loop with Accuracy Evaluation

Now that we’ve defined our model and set up the data loaders, we can proceed to train the model. In this section, we will walk through the training loop and explain how the model learns, how the loss is calculated, and how accuracy is evaluated on both the training and test sets.

Training Loop Explained:

In the following code, the model is trained over multiple epochs, during which it learns to predict the sentiment of movie reviews. We compute the loss using the Binary Cross-Entropy Loss function and track the accuracy of the model on both the training and test datasets.

Code Implementation:

epochs = 10  # Number of epochs for training

for epoch in range(epochs):
    total_loss = 0
    correct_train = 0
    total_train = 0

    model.train()  # Set the model to training mode
    for reviews, labels in train_loader:
        labels = labels.float().unsqueeze(1)  # Convert labels to float and add a dimension so they match the model's (batch, 1) output shape
        optimizer.zero_grad()  # Zero the gradients of the model's parameters
        outputs = model(reviews)  # Forward pass: Get predictions from the model
        loss = criterion(outputs, labels)  # Compute the loss
        loss.backward()  # Backward pass: Compute gradients
        optimizer.step()  # Update model parameters

        total_loss += loss.item()  # Accumulate the total loss for this epoch

        # Compute training accuracy
        predicted = (outputs > 0.5).float()  # Convert model output to binary class predictions (0 or 1)
        correct_train += (predicted == labels).sum().item()  # Count correct predictions
        total_train += labels.size(0)  # Count the total number of labels (batch size)

    # Calculate training accuracy
    train_accuracy = (correct_train / total_train) * 100  # Convert to percentage

    # Evaluate on the test set
    model.eval()  # Set the model to evaluation mode (disables dropout; batch norm uses running stats)
    correct_test = 0
    total_test = 0
    with torch.no_grad():  # No need to compute gradients during evaluation
        for reviews, labels in test_loader:
            labels = labels.float().unsqueeze(1)
            outputs = model(reviews)
            predicted = (outputs > 0.5).float()  # Convert output to binary predictions
            correct_test += (predicted == labels).sum().item()  # Count correct predictions
            total_test += labels.size(0)  # Count the total number of labels

    # Calculate test accuracy
    test_accuracy = (correct_test / total_test) * 100  # Convert to percentage

    # Print loss and accuracy for this epoch
    print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_loader):.4f}, "
          f"Train Acc: {train_accuracy:.2f}%, Test Acc: {test_accuracy:.2f}%")

Explanation of the Code:

  1. Epochs and Loop:
    • The training process runs for a set number of epochs (epochs = 10 in this case). Each epoch represents one full pass over the entire training dataset.
  2. Training Mode (model.train()):
    • The model is set to training mode at the beginning of each epoch. This ensures that layers like dropout and batch normalization behave appropriately during training.
  3. Data Loading:
    • We iterate over the batches of reviews and labels provided by the train_loader (which was defined earlier with the DataLoader). For each batch, the following operations occur:
      • Zeroing gradients: optimizer.zero_grad() clears the gradients from the previous step.
      • Forward pass: The input data (reviews) is passed through the model, and predictions (outputs) are generated.
      • Loss computation: The loss function calculates the difference between the predicted and actual labels, which is then used to compute gradients.
      • Backward pass: The loss is backpropagated to update the gradients.
      • Optimizer step: The optimizer updates the model’s parameters based on the gradients.
  4. Training Accuracy Calculation:
    • After each batch, we calculate how many predictions were correct. The predicted output is compared with the actual labels, and the number of correct predictions is summed up. The training accuracy is calculated as the percentage of correct predictions over the total number of predictions for the epoch.
  5. Evaluation on Test Set:
    • After each epoch, we evaluate the model on the test set using model.eval(). This disables the dropout layers and switches batch normalization to use its running statistics, ensuring that predictions are made without any random behavior.
    • We again calculate the accuracy on the test set using the same method as for the training set, but without updating gradients (torch.no_grad()).
  6. Printing Metrics:
    • For each epoch, we print the average loss and both the training and test accuracies. This gives us an insight into how well the model is learning over time.

Why Evaluate on the Test Set?

  • Evaluating the model on the test set after each epoch helps to monitor whether the model is overfitting or generalizing well. A large gap between training and test accuracy may indicate overfitting, while similar accuracies suggest the model is performing well on both the training and unseen data.

This training loop, along with the accuracy evaluations, will help us track the model’s progress during training and assess its generalization ability.
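
Once training finishes, a common follow-up is to persist the learned weights so inference doesn't require retraining. A minimal sketch (the filename is arbitrary):

# Save only the learned parameters (the recommended PyTorch pattern)
torch.save(model.state_dict(), "sentiment_model.pt")

# Later: rebuild the architecture and load the weights back
loaded_model = SentimentModel(embedding_dim)
loaded_model.load_state_dict(torch.load("sentiment_model.pt"))
loaded_model.eval()  # Switch to evaluation mode before inference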

Sentiment Prediction Function

Finally, we need a way to make predictions with the trained model. In this part of the tutorial, we define a predict_sentiment() function that takes a movie review as input and predicts whether the sentiment is positive or negative.

Code Implementation:

import torch

def predict_sentiment(model, review_text):
    # Preprocess the review
    tokens = clean_text(review_text)
    
    # Convert to embedding
    review_vector = get_embedding(tokens)
    
    # Convert to tensor
    review_tensor = torch.tensor(review_vector, dtype=torch.float32).unsqueeze(0)  # Add batch dimension
    
    # Make prediction
    model.eval()  # Set model to evaluation mode
    with torch.no_grad():
        output = model(review_tensor)
    
    # Interpret result
    sentiment = "Positive" if output.item() > 0.5 else "Negative"
    
    return sentiment, output.item()

# Example review
sample_review = "This movie was abysmal"
sentiment, confidence = predict_sentiment(model, sample_review)
print(f"Predicted Sentiment: {sentiment} (Confidence: {confidence:.4f})")

Explanation:

  1. Preprocessing:
    • The first step in predicting the sentiment is to clean and tokenize the input review. This is done using the clean_text() function, which converts the review to lowercase, removes special characters, and tokenizes the text.
  2. Embedding:
    • After tokenization, the review is converted into a vector representation using the get_embedding() function (which was defined earlier in the tutorial). This vector represents the review in a format that the model can understand.
  3. Tensor Conversion:
    • The vector is then converted into a PyTorch tensor, which is necessary for feeding the data into the model. The .unsqueeze(0) adds a batch dimension, as the model expects inputs in batch format, even if it’s a single review.
  4. Model Evaluation:
    • We set the model to evaluation mode using model.eval(). This is important because layers like dropout behave differently during training and evaluation. During evaluation, dropout is disabled.
    • The torch.no_grad() context is used to prevent PyTorch from calculating gradients during inference, making the prediction process faster and more memory-efficient.
  5. Prediction:
    • The model outputs a score between 0 and 1, representing the probability that the review has a positive sentiment. If this score is greater than 0.5, we classify the sentiment as positive; otherwise, it’s classified as negative.
  6. Output:
    • The sentiment (“Positive” or “Negative”) and the confidence score (the raw model output) are returned. The score is the model’s predicted probability that the review is positive, so values near 1 indicate a confident positive prediction and values near 0 a confident negative one.

Example:

For the example review "This movie was abysmal", the function would output a prediction of negative sentiment along with the confidence score.

Predicted Sentiment: Negative (Confidence: 0.0172)

This indicates that the model predicts the sentiment to be negative with a high degree of confidence (a score close to 0 means a confident negative prediction). You can test the function with other reviews to see how well the model performs, for example:
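
# Try a few more reviews (hypothetical examples; outputs depend on your trained model)
for review in [
    "An absolute masterpiece with stunning performances",
    "Two hours of my life I will never get back",
    "It was okay, nothing special",
]:
    sentiment, confidence = predict_sentiment(model, review)
    print(f"{sentiment} ({confidence:.4f}): {review}")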

Conclusion

In this tutorial, we’ve explored how to build a sentiment analysis model using the IMDB Movie Review dataset. We walked through the entire process, starting from data preprocessing, tokenization, and embedding generation, to creating a neural network model in PyTorch, training it, and finally predicting the sentiment of unseen movie reviews. By applying techniques like word embeddings, batch normalization, dropout, and accurate evaluation methods, we’ve built a robust sentiment classification model.

This is a great foundation for anyone looking to dive into natural language processing (NLP) and sentiment analysis using deep learning. You can experiment with different architectures, hyperparameters, or even extend the model to handle multi-class classification for other types of sentiment.

If you’re interested in exploring the full code or implementing your own version, feel free to visit the GitHub repository:

Full Code on GitHub

Happy coding, and good luck with your deep learning journey!
