Transformers

Attention is all I need

My attempt at staying relevant in 2025.

Based on the 2017 paper "Attention Is All You Need", this project implements the popular Transformer model architecture in TypeScript/JavaScript using TensorFlow.js. The goal is to gain a deeper understanding of how this foundational technology works by building it from scratch.

🎯 Overview

The Transformer is a deep learning architecture introduced in 2017 that has become the foundation for modern AI models such as GPT and BERT. This project provides a functional, educational implementation of the complete encoder-decoder architecture.

What makes Transformers special?

Parallelizable: Unlike RNNs/LSTMs, processes all positions simultaneously
Long-range dependencies: Captures relationships across entire sequences
Attention mechanism: Learns what to focus on automatically
Scalable: Can be trained on massive datasets efficiently

✨ Features

✅ Complete Transformer Architecture: Full encoder-decoder implementation
✅ Multi-Head Attention: Parallel attention mechanisms for learning diverse patterns
✅ Positional Encoding: Sine/cosine position embeddings
✅ Layer Normalization & Residual Connections: For stable deep network training
✅ Configurable Hyperparameters: Easily customize model size and capacity
✅ Masking Support: Padding masks and look-ahead masks for proper training
✅ TypeScript: Fully typed for better development experience
✅ TensorFlow.js: Runs in Node.js (or browser with minor modifications)
✅ Extensive Documentation: Every component thoroughly explained with comments

🏗️ Architecture

The Transformer follows the encoder-decoder architecture:

Input Sequence                     Target Sequence (shifted right)
     ↓                                       ↓
[Embedding + Positional Encoding]  [Embedding + Positional Encoding]
     ↓                                       ↓
┌─────────────┐                    ┌─────────────┐
│   ENCODER   │                    │   DECODER   │
│  (N layers) │───────────────────→│  (N layers) │
│             │   Cross-Attention  │             │
│  - Self     │                    │  - Masked   │
│    Attention│                    │    Self-Attn│
│  - FFN      │                    │  - Cross-   │
│             │                    │    Attention│
│             │                    │  - FFN      │
└─────────────┘                    └─────────────┘
                                          ↓
                                   [Linear Layer]
                                          ↓
                                   [Output Logits]

Key Components

Encoder: Processes the input sequence and builds contextualized representations
- Multi-head self-attention
- Position-wise feed-forward networks
- Layer normalization and residual connections
Decoder: Generates output sequence one token at a time
- Masked multi-head self-attention (can't look ahead)
- Multi-head cross-attention (attends to encoder output)
- Position-wise feed-forward networks
- Layer normalization and residual connections
Attention Mechanism: The core innovation
- Scaled dot-product attention
- Multi-head attention for parallel pattern learning
Positional Encoding: Adds position information to embeddings
- Uses sine and cosine functions at different frequencies

🚀 Installation

# Clone the repository
git clone https://github.com/nunsie/transformers.git
cd transformers

# Install dependencies
npm install

💻 Usage

Basic Example

import { Transformer, TransformerConfig } from './src/transformer';
import * as tf from '@tensorflow/tfjs-node';

// Configure the model
const config: TransformerConfig = {
    numLayers: 2,              // Number of encoder/decoder layers
    dModel: 128,               // Model dimension
    numHeads: 8,               // Number of attention heads
    dff: 512,                  // Feed-forward dimension
    inputVocabSize: 5000,      // Input vocabulary size
    targetVocabSize: 5000,     // Target vocabulary size
    maxPositionEncoding: 1000, // Maximum sequence length
    dropoutRate: 0.1,          // Dropout rate
};

// Create the transformer
const transformer = new Transformer(config);

// Prepare input data (token IDs)
// tf.tensor2d takes (values, shape, dtype), so the shape comes before the dtype
const input = tf.tensor2d([[1, 45, 234, 12, 89, 0, 0]], [1, 7], 'int32');
const target = tf.tensor2d([[2, 56, 123, 78, 0, 0, 0]], [1, 7], 'int32');

// Forward pass
const output = transformer.call(input, target, false);

console.log('Output shape:', output.shape);
// Expected: [batch_size, target_seq_len, target_vocab_size]

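// Get predictions
const predictions = tf.argMax(output, -1);
// arraySync() reads the values synchronously, so this also works where top-level await is unavailable
console.log('Predictions:', predictions.arraySync());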

// Cleanup
transformer.dispose();

Running the Example

# Build the project
npm run build

# Run the example
npm run dev

📁 Project Structure

transformers/
├── src/
│   ├── attention.ts           # Multi-head attention implementation
│   ├── decoder.ts             # Decoder layer and stack
│   ├── encoder.ts             # Encoder layer and stack
│   ├── feedforward.ts         # Position-wise feed-forward network
│   ├── positional-encoding.ts # Positional encoding utilities
│   ├── transformer.ts         # Main Transformer model
│   ├── example.ts             # Example usage
│   └── index.ts               # Public API exports
├── package.json
├── tsconfig.json
└── README.md

🔍 How It Works

1. Embedding Layer

Converts token IDs to dense vectors:

Token ID: 45 → Embedding: [0.1, 0.3, -0.5, ..., 0.2] (dModel dimensions)
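
To make the lookup concrete, here is a minimal sketch in plain TypeScript arrays rather than TensorFlow.js tensors. The names `createEmbeddingTable` and `embed` are made up for this illustration and are not taken from `src/`; the table values are random placeholders, whereas a real model learns them during training.

```typescript
// Build a [vocabSize, dModel] table of placeholder values.
function createEmbeddingTable(vocabSize: number, dModel: number): number[][] {
  return Array.from({ length: vocabSize }, () =>
    Array.from({ length: dModel }, () => Math.random() * 0.2 - 0.1)
  );
}

// Each token ID selects one row of the table.
function embed(tokenIds: number[], table: number[][]): number[][] {
  return tokenIds.map((id) => {
    const row = table[id];
    if (row === undefined) {
      throw new RangeError(`Token ID ${id} is outside the vocabulary`);
    }
    return [...row]; // copy so callers cannot mutate the table
  });
}

const embeddingTable = createEmbeddingTable(100, 4);
const embedded = embed([1, 45, 45], embeddingTable);
console.log(embedded.length, embedded[0].length); // 3 4
```

The same token ID always maps to the same vector, which is why position information has to be added separately in the next step.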

2. Positional Encoding

Adds position information since Transformers process all positions in parallel:

PE(pos, 2i)   = sin(pos / 10000^(2i/dModel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dModel))
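
The two formulas above can be sketched directly in plain TypeScript. This is an illustration with a hypothetical `positionalEncoding` helper, not the code in `src/positional-encoding.ts`, and it returns nested arrays instead of a tensor.

```typescript
// Returns a [maxPosition, dModel] table of sinusoidal position encodings.
function positionalEncoding(maxPosition: number, dModel: number): number[][] {
  const table: number[][] = [];
  for (let pos = 0; pos < maxPosition; pos++) {
    const row: number[] = [];
    for (let j = 0; j < dModel; j++) {
      // Dimensions 2i and 2i+1 share the same frequency.
      const i = Math.floor(j / 2);
      const angle = pos / Math.pow(10000, (2 * i) / dModel);
      // Even dimensions use sine, odd dimensions use cosine.
      row.push(j % 2 === 0 ? Math.sin(angle) : Math.cos(angle));
    }
    table.push(row);
  }
  return table;
}

const peTable = positionalEncoding(50, 8);
console.log(peTable[0]); // position 0: sin(0) = 0 on even dims, cos(0) = 1 on odd dims
```

The resulting row for a position is added element-wise to the embedding of the token at that position.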

3. Scaled Dot-Product Attention

Core attention mechanism:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

Q (Query): What am I looking for?
K (Key): What information is available?
V (Value): The actual information
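
Here is a small sketch of the formula for a single sequence, again using plain arrays so each step is visible. The function names and the mask convention (`true` means "may attend") are assumptions made for this example; the implementation in `src/attention.ts` may differ.

```typescript
type Matrix = number[][];

// a: [n, d], b: [m, d] -> a times b-transposed: [n, m]
function matmulTransposed(a: Matrix, b: Matrix): Matrix {
  return a.map((rowA) =>
    b.map((rowB) => rowA.reduce((sum, value, idx) => sum + value * rowB[idx], 0))
  );
}

// Numerically stable softmax over one row.
function softmaxRow(row: number[]): number[] {
  const rowMax = Math.max(...row);
  const exps = row.map((value) => Math.exp(value - rowMax));
  const total = exps.reduce((sum, value) => sum + value, 0);
  return exps.map((value) => value / total);
}

function scaledDotProductAttention(
  q: Matrix,
  k: Matrix,
  v: Matrix,
  mask?: boolean[][] // assumed convention: true = query i may attend to key j
): { output: Matrix; weights: Matrix } {
  const dK = k[0].length;
  // QK^T / sqrt(d_k), with blocked positions pushed to a large negative score.
  const scores = matmulTransposed(q, k).map((row, i) =>
    row.map((score, j) => (mask && !mask[i][j] ? -1e9 : score / Math.sqrt(dK)))
  );
  const weights = scores.map(softmaxRow);
  // Weighted sum of the value vectors.
  const output = weights.map((w) =>
    v[0].map((_, col) => w.reduce((sum, wj, j) => sum + wj * v[j][col], 0))
  );
  return { output, weights };
}

const attnQ: Matrix = [[1, 0]];
const attnK: Matrix = [[1, 0], [0, 1]];
const attnV: Matrix = [[10, 0], [0, 10]];
const attnResult = scaledDotProductAttention(attnQ, attnK, attnV);
console.log(attnResult.weights); // one row that sums to 1 and favors the first key
```

The query matches the first key more closely, so the output leans toward the first value vector.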

4. Multi-Head Attention

Runs multiple attention mechanisms in parallel:

Different heads learn different types of relationships
Outputs are concatenated and projected
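
The split-and-concatenate part can be sketched as follows. This leaves out the learned Q/K/V and output projections and only shows how the `dModel` dimension is divided among heads; `splitHeads` and `concatHeads` are hypothetical helper names for this example.

```typescript
// Split [seqLen, dModel] into numHeads matrices of shape [seqLen, depth].
function splitHeads(x: number[][], numHeads: number): number[][][] {
  const dModel = x[0].length;
  if (dModel % numHeads !== 0) {
    throw new Error(`dModel (${dModel}) must be divisible by numHeads (${numHeads})`);
  }
  const depth = dModel / numHeads;
  return Array.from({ length: numHeads }, (_, h) =>
    x.map((row) => row.slice(h * depth, (h + 1) * depth))
  );
}

// Inverse of splitHeads: join the per-head slices back into [seqLen, dModel].
function concatHeads(heads: number[][][]): number[][] {
  return heads[0].map((_, pos) =>
    heads.reduce<number[]>((joined, head) => joined.concat(head[pos]), [])
  );
}

const headsInput = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
];
const heads = splitHeads(headsInput, 2);
console.log(JSON.stringify(heads)); // [[[1,2],[5,6]],[[3,4],[7,8]]]
console.log(JSON.stringify(concatHeads(heads))); // [[1,2,3,4],[5,6,7,8]]
```

In the full mechanism, attention runs independently on each head's slice before the concatenated result goes through a final linear projection.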

5. Feed-Forward Network

Two linear transformations with ReLU activation:

FFN(x) = max(0, xW1 + b1)W2 + b2
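
A tiny worked instance of this formula, with hand-picked weights so the result can be checked by hand. The `linear` and `feedForward` helpers are written for this example only and operate on plain arrays.

```typescript
// x: [n, inDim], w: [inDim, outDim], b: [outDim] -> xW + b: [n, outDim]
function linear(x: number[][], w: number[][], b: number[]): number[][] {
  return x.map((row) =>
    b.map((bias, j) => row.reduce((sum, value, i) => sum + value * w[i][j], bias))
  );
}

// FFN(x) = max(0, xW1 + b1)W2 + b2, applied to every position independently.
function feedForward(
  x: number[][],
  w1: number[][],
  b1: number[],
  w2: number[][],
  b2: number[]
): number[][] {
  const hidden = linear(x, w1, b1).map((row) => row.map((value) => Math.max(0, value)));
  return linear(hidden, w2, b2);
}

const ffnOut = feedForward(
  [[1, -1]],                    // one position, dModel = 2
  [[1, 0, -1], [0, 1, 1]],      // W1: [2, 3]
  [0, 0, 0],                    // b1
  [[2, 0], [0, 2], [1, 1]],     // W2: [3, 2]
  [0.5, -0.5]                   // b2
);
console.log(ffnOut); // [ [ 2.5, -0.5 ] ]
```

Here xW1 is [1, -1, -2], ReLU zeroes the negative entries to give [1, 0, 0], and the second layer maps that to [2.5, -0.5].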

6. Residual Connections & Layer Normalization

For stable deep network training:

output = LayerNorm(x + Sublayer(x))
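
As a rough sketch of that line, the following normalizes each position's vector to zero mean and unit variance after adding the sublayer output. It omits the learned scale and shift parameters that layer normalization usually carries, and `addAndNorm` is a name invented for this example.

```typescript
// Normalize one vector to zero mean and (approximately) unit variance.
function layerNorm(row: number[], epsilon: number = 1e-6): number[] {
  const mean = row.reduce((sum, value) => sum + value, 0) / row.length;
  const variance =
    row.reduce((sum, value) => sum + (value - mean) * (value - mean), 0) / row.length;
  return row.map((value) => (value - mean) / Math.sqrt(variance + epsilon));
}

// output = LayerNorm(x + Sublayer(x))
function addAndNorm(
  x: number[][],
  sublayer: (input: number[][]) => number[][]
): number[][] {
  const sub = sublayer(x);
  return x.map((row, i) => layerNorm(row.map((value, j) => value + sub[i][j])));
}

// Toy sublayer that doubles its input, so x + Sublayer(x) = 3x.
const normOut = addAndNorm([[1, 2, 3, 4]], (input) =>
  input.map((row) => row.map((value) => 2 * value))
);
console.log(normOut[0]); // values centered on 0 with variance close to 1
```

The residual path (`x + ...`) lets gradients flow around the sublayer, and the normalization keeps activations on a similar scale from layer to layer.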

⚙️ Configuration

The TransformerConfig interface allows you to customize the model:

| Parameter | Description | Typical Value |
|---|---|---|
| `numLayers` | Number of encoder/decoder layers | 6 |
| `dModel` | Model dimension (embedding size) | 512 |
| `numHeads` | Number of attention heads | 8 |
| `dff` | Feed-forward hidden dimension | 2048 |
| `inputVocabSize` | Size of input vocabulary | 10000 |
| `targetVocabSize` | Size of target vocabulary | 10000 |
| `maxPositionEncoding` | Maximum sequence length | 5000 |
| `dropoutRate` | Dropout rate for regularization | 0.1 |

Note: dModel must be divisible by numHeads.
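
A quick way to catch a bad configuration before building the model is a small validation helper like the one below. This is a sketch: `TransformerConfigSketch` mirrors the fields shown in the Basic Example, and `validateConfig` is a hypothetical function, not something exported by this project.

```typescript
// Local copy of the config fields, so the sketch stands on its own.
interface TransformerConfigSketch {
  numLayers: number;
  dModel: number;
  numHeads: number;
  dff: number;
  inputVocabSize: number;
  targetVocabSize: number;
  maxPositionEncoding: number;
  dropoutRate: number;
}

// Returns a list of problems; an empty list means the config looks usable.
function validateConfig(config: TransformerConfigSketch): string[] {
  const problems: string[] = [];
  const positiveInts: Array<keyof TransformerConfigSketch> = [
    "numLayers", "dModel", "numHeads", "dff",
    "inputVocabSize", "targetVocabSize", "maxPositionEncoding",
  ];
  for (const key of positiveInts) {
    const value = config[key];
    if (!Number.isInteger(value) || value <= 0) {
      problems.push(`${key} must be a positive integer`);
    }
  }
  // Each head works on dModel / numHeads dimensions, so the split must be exact.
  if (config.dModel % config.numHeads !== 0) {
    problems.push(
      `dModel (${config.dModel}) must be divisible by numHeads (${config.numHeads})`
    );
  }
  if (config.dropoutRate < 0 || config.dropoutRate >= 1) {
    problems.push("dropoutRate must be in [0, 1)");
  }
  return problems;
}

const typicalConfig: TransformerConfigSketch = {
  numLayers: 6, dModel: 512, numHeads: 8, dff: 2048,
  inputVocabSize: 10000, targetVocabSize: 10000,
  maxPositionEncoding: 5000, dropoutRate: 0.1,
};
console.log(validateConfig(typicalConfig)); // []
console.log(validateConfig({ ...typicalConfig, dModel: 100 }).length); // 1
```

With the typical values from the table, each of the 8 heads gets 512 / 8 = 64 dimensions.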

📝 Examples

Machine Translation Example

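// English to French translation
// `tokenize` is assumed to map text to an array of token IDs;
// it is not defined in this snippet.
const englishTokens = tokenize("The cat sat on the mat");
const frenchTokens = tokenize("<start> Le chat");

// tf.tensor2d takes (values, shape, dtype); the shape is inferred when left undefined
const output = transformer.call(
    tf.tensor2d([englishTokens], undefined, 'int32'),
    tf.tensor2d([frenchTokens], undefined, 'int32'),
    false
);

// Get next token prediction: slice out the logits at the last target position
const lastPosition = output.shape[1] - 1;
const nextTokenProbs = tf.softmax(output.slice([0, lastPosition, 0], [1, 1, -1]), -1);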
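
Predicting one next token is a single step; generating a full translation repeats that step and feeds each prediction back into the decoder. The loop below is a sketch of greedy decoding under stated assumptions: `NextTokenLogits` is a hypothetical callback standing in for "run the model and return the logits at the last target position", and `toyModel` is a fake model used only to make the example runnable.

```typescript
// Hypothetical callback: given source and target-so-far token IDs,
// return one logit per vocabulary entry for the next token.
type NextTokenLogits = (source: number[], targetSoFar: number[]) => number[];

// Index of the largest value.
function argMax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Greedy decoding: always pick the most likely next token.
function greedyDecode(
  source: number[],
  nextLogits: NextTokenLogits,
  startId: number,
  endId: number,
  maxLength: number
): number[] {
  const target = [startId];
  while (target.length < maxLength) {
    const next = argMax(nextLogits(source, target));
    target.push(next);
    if (next === endId) break; // stop once the end token is produced
  }
  return target;
}

// Toy stand-in for a trained model: always predicts (current length + 1), capped at 4.
const toyModel: NextTokenLogits = (_source, targetSoFar) => {
  const logits = new Array<number>(5).fill(0);
  logits[Math.min(targetSoFar.length + 1, 4)] = 1;
  return logits;
};

console.log(greedyDecode([7, 8, 9], toyModel, 1, 3, 10)); // [ 1, 2, 3 ]
```

Greedy decoding is the simplest strategy; it commits to the top token at each step and never revisits earlier choices.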

Custom Masking

import { createPaddingMask, createLookAheadMask } from './src/transformer';

// Ignore padding tokens
const paddingMask = createPaddingMask(inputSequence);

// Prevent looking at future tokens
const lookAheadMask = createLookAheadMask(sequenceLength);
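
To show what these two masks contain, here is a plain-array sketch. The `Sketch` suffix marks these as illustrative helpers rather than the project's own `createPaddingMask` and `createLookAheadMask`; the convention used here (`true` means the position may be attended to, token ID 0 is padding) is an assumption, and the real functions may encode masks differently.

```typescript
// true where the token is real, false where it is padding (assumed pad ID: 0).
function createPaddingMaskSketch(tokenIds: number[], padId: number = 0): boolean[] {
  return tokenIds.map((id) => id !== padId);
}

// Lower-triangular mask: position i may attend to positions 0..i only.
function createLookAheadMaskSketch(sequenceLength: number): boolean[][] {
  return Array.from({ length: sequenceLength }, (_, i) =>
    Array.from({ length: sequenceLength }, (_, j) => j <= i)
  );
}

const paddingMaskSketch = createPaddingMaskSketch([1, 45, 234, 12, 89, 0, 0]);
console.log(paddingMaskSketch); // [ true, true, true, true, true, false, false ]

const lookAheadMaskSketch = createLookAheadMaskSketch(3);
console.log(lookAheadMaskSketch);
// [ [ true, false, false ], [ true, true, false ], [ true, true, true ] ]
```

The decoder's self-attention typically combines both: a position is visible only if it is not padding and not in the future.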

📖 Resources

Original Paper: Attention Is All You Need (Vaswani et al., 2017)
TensorFlow.js: Official Documentation
The Illustrated Transformer: Visual Guide
Annotated Transformer: Harvard NLP

🤝 Contributing

This is a personal learning project, but suggestions and improvements are welcome! Feel free to open issues or submit pull requests.

📄 License

ISC License - see package.json for details.

👤 Author

Nusrath Khan

GitHub: @nunsie

Built with ❤️ to understand the technology that's changing the world
