RNN vs. CNN vs. Autoencoder vs. Attention/Transformer: A Practical Guide with PyTorch
Deep learning has evolved rapidly, offering a toolkit of neural architectures for various data types and tasks. Among the most influential are Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), Autoencoders, and the modern Attention/Transformer models.
But how do they differ? When should you use each? Let’s break them down with simple PyTorch code examples!
Table of Contents
- RNNs (Recurrent Neural Networks)
- CNNs (Convolutional Neural Networks)
- Autoencoders
- Attention & Transformer Models
- Summary Table
1. RNNs: Sequential Data Specialists

RNNs are designed for sequence modeling, where inputs are ordered and past context matters—think text, speech, time series.
Core idea:
- Maintain a hidden state that is updated as the sequence progresses.
- Handle variable-length input, but struggle with long-range dependencies (plain RNNs can "forget" information from far back; gated variants like LSTM and GRU mitigate this).
Common use-cases:
- Language modeling, text generation, sentiment analysis, speech recognition, forecasting.
PyTorch Example: Simple Character-level RNN
```python
import torch
import torch.nn as nn

class SimpleRNN(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.embedding(x)         # [batch, seq_len] -> [batch, seq_len, embed_size]
        out, _ = self.rnn(x)          # [batch, seq_len, hidden_size]
        out = self.fc(out[:, -1, :])  # classify from the last time step
        return out

# Example usage
model = SimpleRNN(vocab_size=100, embed_size=32, hidden_size=64, num_classes=10)
inputs = torch.randint(0, 100, (8, 20))  # batch_size=8, seq_len=20
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
```
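As noted above, plain RNNs struggle with long-range dependencies. In practice, `nn.LSTM` (or `nn.GRU`) is a near drop-in replacement whose gated memory cells mitigate vanishing gradients. A minimal sketch of the swap (the `SimpleLSTM` name is ours, not a library class):

```python
import torch
import torch.nn as nn

class SimpleLSTM(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # Same interface as nn.RNN; nn.LSTM additionally returns a cell state
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        x = self.embedding(x)            # [batch, seq_len, embed_size]
        out, (h_n, c_n) = self.lstm(x)   # h_n: [num_layers, batch, hidden_size]
        return self.fc(h_n[-1])          # classify from the final hidden state

model = SimpleLSTM(vocab_size=100, embed_size=32, hidden_size=64, num_classes=10)
inputs = torch.randint(0, 100, (8, 20))
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
```

The only changes from `SimpleRNN` are the layer swap and unpacking the extra cell state.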
2. CNNs: The Grid Data Pros

CNNs shine on grid-like data, especially images, but also 1D signals (audio, time series) and even text (for local feature extraction).
Core idea:
- Use convolutional filters to extract local patterns, hierarchically combining them.
- Exploit spatial/local correlations.
Common use-cases:
- Image classification, object detection, medical imaging, audio, some text tasks.
PyTorch Example: Simple Image CNN
```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = self.pool(torch.relu(self.conv1(x)))  # [batch, 16, 14, 14]
        x = self.pool(torch.relu(self.conv2(x)))  # [batch, 32, 7, 7]
        x = x.view(x.size(0), -1)                 # flatten
        x = self.fc1(x)
        return x

# Example usage
model = SimpleCNN(num_classes=10)
inputs = torch.randn(8, 1, 28, 28)  # batch_size=8, 1 channel, 28x28 image
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
```
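Where does the `32 * 7 * 7` input size of `fc1` come from? Each 3x3 convolution with `padding=1` preserves height and width, and each 2x2 max-pool halves them, so 28x28 becomes 14x14 and then 7x7. A quick shape check confirms the arithmetic:

```python
import torch
import torch.nn as nn

# Trace the spatial sizes that determine fc1's input dimension (32 * 7 * 7)
conv1 = nn.Conv2d(1, 16, kernel_size=3, padding=1)  # padding=1 keeps H, W unchanged
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
pool = nn.MaxPool2d(2, 2)                           # halves H and W

x = torch.randn(1, 1, 28, 28)
x = pool(conv1(x))
print(x.shape)  # torch.Size([1, 16, 14, 14])
x = pool(conv2(x))
print(x.shape)  # torch.Size([1, 32, 7, 7])
```

If you change the input resolution or add layers, rerun a trace like this to get the new flattened size.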
3. Autoencoders: Unsupervised Compressors

Autoencoders learn to encode data to a compact representation and reconstruct it—great for dimensionality reduction, denoising, and unsupervised feature learning.
Core idea:
- Consists of an encoder (compresses input) and decoder (reconstructs input).
- Forces learning of salient features in the bottleneck.
Common use-cases:
- Denoising images, anomaly detection, unsupervised pre-training, generative tasks.
PyTorch Example: Simple MLP Autoencoder for MNIST
```python
import torch
import torch.nn as nn

class SimpleAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(28 * 28, 128),
            nn.ReLU(),
            nn.Linear(128, 32)
        )
        self.decoder = nn.Sequential(
            nn.Linear(32, 128),
            nn.ReLU(),
            nn.Linear(128, 28 * 28),
            nn.Sigmoid()
        )

    def forward(self, x):
        x = x.view(x.size(0), -1)  # flatten
        z = self.encoder(x)        # 32-dim bottleneck
        out = self.decoder(z)
        out = out.view(x.size(0), 1, 28, 28)
        return out

# Example usage
model = SimpleAutoencoder()
inputs = torch.randn(8, 1, 28, 28)  # batch_size=8
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 1, 28, 28])
```
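Training an autoencoder needs no labels: the target is the input itself. A minimal training sketch on random stand-in data, assuming inputs normalized to [0, 1] to match the `Sigmoid` output (in practice you would iterate over a `DataLoader` of MNIST batches):

```python
import torch
import torch.nn as nn

# Compact equivalent of the encoder/decoder above, built inline so the
# sketch is self-contained; nn.Flatten handles the [8, 1, 28, 28] input.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 32),    # encoder
    nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 28 * 28),    # decoder
    nn.Sigmoid(),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()  # nn.BCELoss also pairs well with Sigmoid outputs

batch = torch.rand(8, 1, 28, 28)  # values in [0, 1], like normalized MNIST
target = batch.view(8, -1)        # reconstruction target = flattened input
for step in range(5):
    recon = model(batch)
    loss = criterion(recon, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Anomaly detection follows the same recipe: train on normal data only, then flag inputs whose reconstruction error is unusually high.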
4. Attention & Transformer: Long-Context Masters

Transformers (powered by attention mechanisms) revolutionized NLP and are now conquering vision, speech, and more.
Core idea:
- Self-attention: Each element attends to all others, modeling global dependencies directly.
- Processes sequences in parallel (unlike RNNs), which makes training fast on modern hardware.
- Enables transfer learning via large pretrained models (BERT, GPT, ViT); note that vanilla attention cost grows quadratically with sequence length, so very long contexts get expensive.
Common use-cases:
- Language modeling, translation, summarization, question answering, code completion, vision transformers.
PyTorch Example: Tiny Transformer for Classification
```python
import torch
import torch.nn as nn

class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, num_classes, max_len=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.pos_embedding = nn.Parameter(torch.randn(1, max_len, embed_dim))
        encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.fc = nn.Linear(embed_dim, num_classes)
        self.max_len = max_len

    def forward(self, x):
        # x: [batch, seq_len]
        x = self.embedding(x)                        # [batch, seq_len, embed_dim]
        seq_len = x.size(1)
        x = x + self.pos_embedding[:, :seq_len, :]   # add learned positional encoding
        x = x.transpose(0, 1)                        # default layout is [seq_len, batch, embed_dim]
        out = self.transformer(x)
        out = out[0]                                 # output at position 0 (mean pooling also works)
        out = self.fc(out)
        return out

# Example usage
model = SimpleTransformer(vocab_size=100, embed_dim=64, num_heads=4, num_classes=10)
inputs = torch.randint(0, 100, (8, 32))  # batch_size=8, seq_len=32
outputs = model(inputs)
print(outputs.shape)  # torch.Size([8, 10])
```
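The core operation inside each encoder layer is scaled dot-product self-attention. Stripped of multi-head bookkeeping, it fits in a few lines; this is an illustrative sketch (the `self_attention` function and weight names are ours), not what `nn.TransformerEncoderLayer` literally runs:

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # Project the input into queries, keys, and values
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d_k = q.size(-1)
    # Compare every position's query with every position's key
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # [batch, seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ v                              # weighted mix of all values

batch, seq_len, d = 2, 5, 16
x = torch.randn(batch, seq_len, d)
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # torch.Size([2, 5, 16])
```

The [seq_len, seq_len] score matrix is where both the power (every position sees every other) and the quadratic cost come from.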
5. Summary Table
| Architecture | Data Type | Pros | Cons | Example Use-case |
|---|---|---|---|---|
| RNN | Sequences (text, time series) | Captures order, works for variable length | Hard to train on long sequences (vanishing gradients) | Language modeling |
| CNN | Images, grid-like | Efficient, local feature detection | Not suited for long dependencies | Image classification |
| Autoencoder | Any (usually images/tabular) | Unsupervised, learns features | Reconstruction objective may ignore task-relevant details | Denoising, anomaly detection |
| Transformer | Sequences (NLP, vision, audio) | Captures long-range/global dependencies, parallelizable | Data- and compute-hungry; quadratic attention cost | Translation, summarization, ViT |
Which to Use When?
- RNN: Time-ordered, sequence tasks (text, audio) when sequence length isn’t huge.
- CNN: Images or short, fixed-length signals.
- Autoencoder: When you want to compress, denoise, or learn representations unsupervised.
- Transformer: Most NLP tasks, especially with long dependencies or need for transfer learning; now strong in vision, audio, and more.