RNN vs. CNN vs. Autoencoder vs. Attention/Transformer: A Practical Guide with PyTorch

Deep learning has evolved rapidly, offering a toolkit of neural architectures for various data types and tasks. Among the most influential are Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Autoencoders, and the modern Attention/Transformer models.
But how do they differ? When should you use each? Let’s break them down with simple PyTorch code examples!


Table of Contents

  1. RNNs (Recurrent Neural Networks)
  2. CNNs (Convolutional Neural Networks)
  3. Autoencoders
  4. Attention & Transformer Models
  5. Summary Table

1. RNNs: Sequential Data Specialists


RNNs are designed for sequence modeling, where inputs are ordered and past context matters—think text, speech, time series.

Core idea:

  • Maintain a hidden state that is updated as the sequence progresses.
  • Handle variable-length inputs, but struggle with long-range dependencies (the network can “forget” context from far back in the sequence).
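
The hidden-state update above can be sketched in a few lines. This is a minimal illustration of the recurrence h_t = tanh(W_xh·x_t + W_hh·h_{t-1} + b); the weight names are ad hoc, not from any library:

```python
import torch

torch.manual_seed(0)
input_size, hidden_size = 4, 3
W_xh = torch.randn(hidden_size, input_size) * 0.1   # input-to-hidden weights
W_hh = torch.randn(hidden_size, hidden_size) * 0.1  # hidden-to-hidden weights
b_h = torch.zeros(hidden_size)

seq = torch.randn(5, input_size)  # a 5-step sequence
h = torch.zeros(hidden_size)      # initial hidden state
for x_t in seq:
    # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b)
    h = torch.tanh(W_xh @ x_t + W_hh @ h + b_h)
print(h.shape)  # torch.Size([3])
```

Because the same weights are reused at every step, the model handles sequences of any length; this repeated multiplication is also the source of the vanishing-gradient problem.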

Common use-cases:

  • Language modeling, text generation, sentiment analysis, speech recognition, forecasting.

PyTorch Example: Simple Character-level RNN

import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.embedding(x)         # [batch, seq_len] -> [batch, seq_len, embed_size]
        out, _ = self.rnn(x)          # [batch, seq_len, hidden_size]
        out = self.fc(out[:, -1, :])  # Take last time step
        return out

# Example usage
model = SimpleRNN(vocab_size=100, embed_size=32, hidden_size=64, num_classes=10)
inputs = torch.randint(0, 100, (8, 20))  # batch_size=8, seq_len=20
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
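
In practice, plain nn.RNN is usually swapped for nn.LSTM or nn.GRU, whose gating mitigates the “forgetting” issue mentioned above. A drop-in sketch with the same shapes as the example:

```python
import torch
import torch.nn as nn

class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.embedding(x)           # [batch, seq_len, embed_size]
        out, (h_n, c_n) = self.lstm(x)  # LSTM also returns a cell state
        return self.fc(out[:, -1, :])   # take last time step

model = SimpleLSTM(vocab_size=100, embed_size=32, hidden_size=64, num_classes=10)
outputs = model(torch.randint(0, 100, (8, 20)))
print(outputs.shape)  # torch.Size([8, 10])
```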

2. CNNs: The Grid Data Pros

CNNs shine on grid-like data, especially images, but also 1D signals (audio, time series) and even text (for local feature extraction).

Core idea:

  • Use convolutional filters to extract local patterns, hierarchically combining them.
  • Exploit spatial/local correlations.

Common use-cases:

  • Image classification, object detection, medical imaging, audio, some text tasks.

PyTorch Example: Simple Image CNN

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))  # [batch, 16, 14, 14]
        x = self.pool(torch.relu(self.conv2(x)))  # [batch, 32, 7, 7]
        x = x.view(x.size(0), -1)                 # Flatten
        x = self.fc1(x)
        return x

# Example usage
model = SimpleCNN(num_classes=10)
inputs = torch.randn(8, 1, 28, 28)  # batch_size=8, 1 channel, 28x28 image
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
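
The same idea carries over to the 1D signals mentioned earlier (audio, time series, text). A minimal sketch using nn.Conv1d, with illustrative shapes:

```python
import torch
import torch.nn as nn

class Simple1DCNN(nn.Module):
    def __init__(self, in_channels=1, num_classes=10):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, 16, kernel_size=5, padding=2)
        self.pool = nn.AdaptiveAvgPool1d(1)  # global average pooling over time
        self.fc = nn.Linear(16, num_classes)

    def forward(self, x):             # x: [batch, channels, length]
        x = torch.relu(self.conv(x))  # [batch, 16, length]
        x = self.pool(x).squeeze(-1)  # [batch, 16]
        return self.fc(x)

model = Simple1DCNN()
outputs = model(torch.randn(8, 1, 100))  # batch of 1-channel signals, length 100
print(outputs.shape)  # torch.Size([8, 10])
```

Global average pooling makes the classifier independent of the input length, so the same model handles signals of varying duration.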

3. Autoencoders: Unsupervised Compressors

Autoencoders learn to encode data to a compact representation and reconstruct it—great for dimensionality reduction, denoising, and unsupervised feature learning.

Core idea:

  • Consists of an encoder (compresses input) and decoder (reconstructs input).
  • Forces learning of salient features in the bottleneck.

Common use-cases:

  • Denoising images, anomaly detection, unsupervised pre-training, generative tasks.

PyTorch Example: Simple MLP Autoencoder for MNIST

import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28*28, 128),
            nn.ReLU(),
            nn.Linear(128, 32)
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, 28*28),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)         # Flatten
        z = self.encoder(x)
        out = self.decoder(z)
        out = out.view(x.size(0), 1, 28, 28)
        return out

# Example usage
model = SimpleAutoencoder()
inputs = torch.randn(8, 1, 28, 28)  # batch_size=8
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 1, 28, 28])
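
Training is straightforward: the reconstruction target is the input itself, so no labels are needed. For denoising, feed a noise-corrupted copy and reconstruct the clean original. A hedged sketch using random data in place of MNIST and a compact model (hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

# Compact MLP autoencoder, same spirit as SimpleAutoencoder above
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 32), nn.ReLU(),   # encoder: bottleneck of 32
    nn.Linear(32, 28 * 28), nn.Sigmoid(),  # decoder
    nn.Unflatten(1, (1, 28, 28)),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

clean = torch.rand(8, 1, 28, 28)               # stand-in for an MNIST batch
noisy = clean + 0.2 * torch.randn_like(clean)  # corrupted input

for step in range(5):                # tiny demo loop
    recon = model(noisy)
    loss = criterion(recon, clean)   # target is the *clean* image
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(recon.shape)  # torch.Size([8, 1, 28, 28])
```

Because the bottleneck cannot memorize the noise, the network is pushed toward features that capture the underlying signal.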

4. Attention & Transformer: Long-Context Masters

Transformers (powered by attention mechanisms) revolutionized NLP and are now conquering vision, speech, and more.

Core idea:

  • Self-attention: Each element attends to all others, modeling global dependencies directly (at a cost quadratic in sequence length).
  • Processes sequences in parallel (unlike RNNs).
  • Scales well with data and compute, and enables transfer learning via pretrained models (BERT, GPT, ViT).

Common use-cases:

  • Language modeling, translation, summarization, question answering, code completion, vision transformers.

PyTorch Example: Tiny Transformer for Classification

import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_classes, max_len=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.fc = nn.Linear(embed_dim, num_classes)
        self.max_len = max_len

    def forward(self, x):
        # x: [batch, seq_len]
        x = self.embedding(x)                          # [batch, seq_len, embed_dim]
        seq_len = x.size(1)
        x = x + self.pos_embedding[:, :seq_len, :]     # Add positional encoding
        x = x.transpose(0, 1)                          # Transformer expects [seq_len, batch, embed_dim]
        out = self.transformer(x)
        out = out[0]                                   # Use the output at position 0 (could use pooling)
        out = self.fc(out)
        return out

# Example usage
model = SimpleTransformer(vocab_size=100, embed_dim=64, num_heads=4, num_classes=10)
inputs = torch.randint(0, 100, (8, 32))  # batch_size=8, seq_len=32
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
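
Under the hood, each encoder layer computes scaled dot-product attention, softmax(QKᵀ/√d)·V. A from-scratch sketch with a single head and no learned projections, purely for illustration:

```python
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: [batch, seq_len, d]
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # [batch, seq_len, seq_len]
    weights = torch.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                           # weighted sum of values

x = torch.randn(8, 32, 64)                   # batch=8, seq_len=32, d=64
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # torch.Size([8, 32, 64])
```

The [seq_len, seq_len] score matrix is what lets every position see every other position in one step, and also why memory grows quadratically with sequence length.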

5. Summary Table

| Architecture | Data Type | Pros | Cons | Example Use-case |
|---|---|---|---|---|
| RNN | Sequences (text, time series) | Captures order; works for variable length | Hard to train on long sequences (vanishing gradients) | Language modeling |
| CNN | Images, grid-like data | Efficient; local feature detection | Not suited for long-range dependencies | Image classification |
| Autoencoder | Any (usually images/tabular) | Unsupervised; learns features | Not for sequence tasks | Denoising, anomaly detection |
| Transformer | Sequences (NLP, vision, audio) | Captures long-range/global dependencies; parallelizable | Needs large data and compute | Translation, summarization, ViT |

Which to Use When?

  • RNN: Time-ordered, sequence tasks (text, audio) when sequence length isn’t huge.
  • CNN: Images or short, fixed-length signals.
  • Autoencoder: When you want to compress, denoise, or learn representations unsupervised.
  • Transformer: Most NLP tasks, especially with long dependencies or need for transfer learning; now strong in vision, audio, and more.

