Transformers

Attention is all I need

My attempt at staying relevant in 2025.

Based on the 2017 paper "Attention Is All You Need", this project implements the Transformer model architecture in TypeScript/JavaScript using TensorFlow.js. The goal is to gain a deeper understanding of how this foundational technology works by building it from scratch.

🎯 Overview

The Transformer is a deep learning architecture introduced in 2017 that has become the foundation for modern AI models such as GPT and BERT. This project provides a fully functional, educational implementation of the complete encoder-decoder architecture.

What makes Transformers special?

  • Parallelizable: Unlike RNNs/LSTMs, it processes all positions of a sequence simultaneously
  • Long-range dependencies: Attention relates any two positions directly, however far apart they are
  • Attention mechanism: Learns which positions to focus on
  • Scalable: Parallel computation makes training on large datasets efficient

✨ Features

  • Complete Transformer Architecture: Full encoder-decoder implementation
  • Multi-Head Attention: Parallel attention mechanisms for learning diverse patterns
  • Positional Encoding: Sine/cosine position embeddings
  • Layer Normalization & Residual Connections: For stable deep network training
  • Configurable Hyperparameters: Easily customize model size and capacity
  • Masking Support: Padding masks and look-ahead masks for proper training
  • TypeScript: Fully typed for better development experience
  • TensorFlow.js: Runs in Node.js (or browser with minor modifications)
  • Extensive Documentation: Every component thoroughly explained with comments

🏗️ Architecture

The Transformer follows the encoder-decoder architecture:

Input Sequence                     Target Sequence (shifted right)
     ↓                                       ↓
[Embedding + Positional Encoding]  [Embedding + Positional Encoding]
     ↓                                       ↓
┌─────────────┐                    ┌─────────────┐
│   ENCODER   │                    │   DECODER   │
│  (N layers) │───────────────────→│  (N layers) │
│             │   Cross-Attention  │             │
│  - Self     │                    │  - Masked   │
│    Attention│                    │    Self-Attn│
│  - FFN      │                    │  - Cross-   │
│             │                    │    Attention│
│             │                    │  - FFN      │
└─────────────┘                    └─────────────┘
                                          ↓
                                   [Linear Layer]
                                          ↓
                                   [Output Logits]

Key Components

  1. Encoder: Processes the input sequence and builds contextualized representations

    • Multi-head self-attention
    • Position-wise feed-forward networks
    • Layer normalization and residual connections
  2. Decoder: Generates output sequence one token at a time

    • Masked multi-head self-attention (can't look ahead)
    • Multi-head cross-attention (attends to encoder output)
    • Position-wise feed-forward networks
    • Layer normalization and residual connections
  3. Attention Mechanism: The core innovation

    • Scaled dot-product attention
    • Multi-head attention for parallel pattern learning
  4. Positional Encoding: Adds position information to embeddings

    • Uses sine and cosine functions at different frequencies

🚀 Installation

# Clone the repository
git clone https://github.com/nunsie/transformers.git
cd transformers

# Install dependencies
npm install

💻 Usage

Basic Example

import { Transformer, TransformerConfig } from './src/transformer';
import * as tf from '@tensorflow/tfjs-node';

// Configure the model
const config: TransformerConfig = {
    numLayers: 2,              // Number of encoder/decoder layers
    dModel: 128,               // Model dimension
    numHeads: 8,               // Number of attention heads
    dff: 512,                  // Feed-forward dimension
    inputVocabSize: 5000,      // Input vocabulary size
    targetVocabSize: 5000,     // Target vocabulary size
    maxPositionEncoding: 1000, // Maximum sequence length
    dropoutRate: 0.1,          // Dropout rate
};

// Create the transformer
const transformer = new Transformer(config);

// Prepare input data (token IDs)
// tf.tensor2d takes (values, shape, dtype), so the dtype is the third argument
const input = tf.tensor2d([[1, 45, 234, 12, 89, 0, 0]], [1, 7], 'int32');
const target = tf.tensor2d([[2, 56, 123, 78, 0, 0, 0]], [1, 7], 'int32');

// Forward pass
const output = transformer.call(input, target, false);

console.log('Output shape:', output.shape);
// Expected: [batch_size, target_seq_len, target_vocab_size]

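// Get predictions (arraySync avoids the need for top-level await)
const predictions = tf.argMax(output, -1);
console.log('Predictions:', predictions.arraySync());

// Cleanup: release tensors as well as the model
tf.dispose([input, target, output, predictions]);
transformer.dispose();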

Running the Example

# Build the project
npm run build

# Run the example
npm run dev

📁 Project Structure

transformers/
├── src/
│   ├── attention.ts           # Multi-head attention implementation
│   ├── decoder.ts             # Decoder layer and stack
│   ├── encoder.ts             # Encoder layer and stack
│   ├── feedforward.ts         # Position-wise feed-forward network
│   ├── positional-encoding.ts # Positional encoding utilities
│   ├── transformer.ts         # Main Transformer model
│   ├── example.ts             # Example usage
│   └── index.ts               # Public API exports
├── package.json
├── tsconfig.json
└── README.md

🔍 How It Works

1. Embedding Layer

Converts token IDs to dense vectors:

Token ID: 45 → Embedding: [0.1, 0.3, -0.5, ..., 0.2] (dModel dimensions)
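
The lookup can be pictured as reading one row per token from a table. The sketch below uses plain arrays rather than TensorFlow.js so it runs without dependencies; `embed` and the tiny table are illustrative and not taken from this repository. The multiplication by √dModel follows the original paper, and whether this project applies it is an assumption.

```typescript
// Embedding lookup as a table read (illustrative; plain arrays, no TensorFlow.js).
// `table` has one row per vocabulary entry and dModel columns.
function embed(tokenIds: number[], table: number[][]): number[][] {
  const dModel = table[0].length;
  // The paper scales embeddings by sqrt(dModel); assumed here, not confirmed for this repo.
  const scale = Math.sqrt(dModel);
  return tokenIds.map((id) => {
    if (!Number.isInteger(id) || id < 0 || id >= table.length) {
      throw new RangeError(`token ID ${id} is outside the vocabulary`);
    }
    return table[id].map((v) => v * scale);
  });
}

// Toy vocabulary of 3 tokens with dModel = 4 (so the scale factor is 2).
const embeddingTable: number[][] = [
  [0.1, 0.2, 0.3, 0.4],
  [0.5, 0.6, 0.7, 0.8],
  [1, 0, -1, 0],
];
const embedded = embed([2, 0], embeddingTable);
console.log(embedded.length, embedded[0].length); // 2 4
```

In a trained model the table values are learned parameters; here they are made-up numbers chosen only to show the shapes.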

2. Positional Encoding

Adds position information since Transformers process all positions in parallel:

PE(pos, 2i)   = sin(pos / 10000^(2i/dModel))
PE(pos, 2i+1) = cos(pos / 10000^(2i/dModel))
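
As a rough illustration of the two formulas, the following sketch builds the encoding table with plain arrays. The function name `positionalEncoding` is hypothetical and the code is independent of `src/positional-encoding.ts`, which may be organized differently.

```typescript
// Sinusoidal positional encoding on plain arrays (illustrative sketch).
// Returns a [maxLen][dModel] table: even columns use sin, odd columns use cos.
function positionalEncoding(maxLen: number, dModel: number): number[][] {
  const table: number[][] = [];
  for (let pos = 0; pos < maxLen; pos++) {
    const row: number[] = new Array<number>(dModel).fill(0);
    for (let j = 0; j < dModel; j++) {
      // Columns 2i and 2i+1 share the same frequency.
      const i = Math.floor(j / 2);
      const angle = pos / Math.pow(10000, (2 * i) / dModel);
      row[j] = j % 2 === 0 ? Math.sin(angle) : Math.cos(angle);
    }
    table.push(row);
  }
  return table;
}

const peTable = positionalEncoding(4, 8);
console.log(peTable[0]); // position 0: every sine is 0 and every cosine is 1
```

The table is added element-wise to the embeddings, so each position receives a distinct, deterministic offset with no learned parameters.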

3. Scaled Dot-Product Attention

Core attention mechanism:

Attention(Q, K, V) = softmax(QK^T / √d_k) V

  • Q (Query): What am I looking for?
  • K (Key): What information is available?
  • V (Value): The actual information
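
To make the formula concrete, here is a minimal sketch for a single sequence using plain arrays. The helper names (`matmul`, `transpose`, `softmaxRow`, `attention`) are illustrative, and the mask convention (1 means "blocked", implemented by subtracting a large number before the softmax) is an assumption rather than something confirmed for `src/attention.ts`.

```typescript
type Matrix = number[][];

// Naive matrix product: (n x m) times (m x p) gives (n x p).
function matmul(a: Matrix, b: Matrix): Matrix {
  return a.map((row) =>
    b[0].map((_, j) => row.reduce((sum, v, k) => sum + v * b[k][j], 0))
  );
}

function transpose(m: Matrix): Matrix {
  return m[0].map((_, j) => m.map((row) => row[j]));
}

// Numerically stable softmax over one row.
function softmaxRow(row: number[]): number[] {
  const max = Math.max(...row);
  const exps = row.map((v) => Math.exp(v - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// softmax(Q K^T / sqrt(d_k)) V, with an optional mask where 1 = blocked (assumed convention).
function attention(q: Matrix, k: Matrix, v: Matrix, mask?: Matrix): { output: Matrix; weights: Matrix } {
  const dK = k[0].length;
  const scores = matmul(q, transpose(k)).map((row, i) =>
    row.map((s, j) => {
      const scaled = s / Math.sqrt(dK);
      return mask && mask[i][j] === 1 ? scaled - 1e9 : scaled;
    })
  );
  const weights = scores.map(softmaxRow);
  return { output: matmul(weights, v), weights };
}

const attnQ: Matrix = [[1, 0], [0, 1]];
const attnK: Matrix = [[1, 0], [0, 1]];
const attnV: Matrix = [[1, 2], [3, 4]];
const attnResult = attention(attnQ, attnK, attnV);
console.log(attnResult.weights); // each row sums to 1; the matching key gets the larger weight
```

Dividing by √d_k keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.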

4. Multi-Head Attention

Runs multiple attention mechanisms in parallel:

  • Different heads learn different types of relationships
  • Outputs are concatenated and projected
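
The bookkeeping behind "split into heads, then concatenate" can be sketched as below. This shows only how the dModel columns are divided and rejoined; the learned Q/K/V projections, the per-head attention, and the final output projection are left out. `splitHeads` and `concatHeads` are hypothetical names, not this repository's API.

```typescript
// Split a [seqLen][dModel] matrix into numHeads matrices of shape [seqLen][dModel / numHeads].
function splitHeads(x: number[][], numHeads: number): number[][][] {
  const dModel = x[0].length;
  if (dModel % numHeads !== 0) {
    throw new Error(`dModel (${dModel}) must be divisible by numHeads (${numHeads})`);
  }
  const depth = dModel / numHeads;
  return Array.from({ length: numHeads }, (_, h) =>
    x.map((row) => row.slice(h * depth, (h + 1) * depth))
  );
}

// Inverse of splitHeads: join the per-head columns back into [seqLen][dModel].
function concatHeads(heads: number[][][]): number[][] {
  return heads[0].map((_, pos) =>
    heads.reduce<number[]>((acc, head) => acc.concat(head[pos]), [])
  );
}

const headInput: number[][] = [
  [1, 2, 3, 4],
  [5, 6, 7, 8],
];
const heads = splitHeads(headInput, 2);
console.log(heads);              // [[[1,2],[5,6]], [[3,4],[7,8]]]
console.log(concatHeads(heads)); // [[1,2,3,4],[5,6,7,8]]
```

Each head sees only its own slice of the projected vectors, which is why the model dimension has to divide evenly by the number of heads.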

5. Feed-Forward Network

Two linear transformations with ReLU activation:

FFN(x) = max(0, xW1 + b1)W2 + b2
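
A small worked instance of this formula, written with plain arrays. The weights are made-up numbers chosen so the ReLU visibly zeroes a negative activation, and `feedForward` is an illustrative name rather than the function exported by `src/feedforward.ts`.

```typescript
// Position-wise FFN: the same two linear layers are applied to every position (row) independently.
// Shapes: x [seqLen][dModel], w1 [dModel][dff], b1 [dff], w2 [dff][dModel], b2 [dModel].
function feedForward(
  x: number[][],
  w1: number[][],
  b1: number[],
  w2: number[][],
  b2: number[]
): number[][] {
  return x.map((row) => {
    // hidden = max(0, x W1 + b1)
    const hidden = b1.map((bias, j) =>
      Math.max(0, row.reduce((sum, v, k) => sum + v * w1[k][j], bias))
    );
    // output = hidden W2 + b2
    return b2.map((bias, j) => hidden.reduce((sum, h, k) => sum + h * w2[k][j], bias));
  });
}

// dModel = 2, dff = 3
const ffnW1 = [[1, -1, 0], [0, 1, 1]];
const ffnB1 = [0, 0, -1];
const ffnW2 = [[1, 0], [0, 1], [1, 1]];
const ffnB2 = [0.5, 0.5];

// Pre-activations for [2, 1] are [2, -1, 0]; ReLU turns them into [2, 0, 0].
const ffnOut = feedForward([[2, 1]], ffnW1, ffnB1, ffnW2, ffnB2);
console.log(ffnOut); // [[2.5, 0.5]]
```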

6. Residual Connections & Layer Normalization

For stable deep network training:

output = LayerNorm(x + Sublayer(x))
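
The "add and norm" step can be sketched as follows. This simplified version omits the learned gain and bias that a layer normalization layer normally carries, and the epsilon value is an assumption; `layerNorm` and `addAndNorm` are illustrative names.

```typescript
// Normalize one vector to zero mean and (approximately) unit variance.
// Simplified: no learned gain/bias parameters.
function layerNorm(x: number[], epsilon: number = 1e-6): number[] {
  const mean = x.reduce((a, b) => a + b, 0) / x.length;
  const variance = x.reduce((a, b) => a + (b - mean) * (b - mean), 0) / x.length;
  return x.map((v) => (v - mean) / Math.sqrt(variance + epsilon));
}

// output = LayerNorm(x + Sublayer(x)), applied row by row.
function addAndNorm(
  x: number[][],
  sublayer: (input: number[][]) => number[][]
): number[][] {
  const y = sublayer(x);
  return x.map((row, i) => layerNorm(row.map((v, j) => v + y[i][j])));
}

// Toy sublayer that doubles its input, so x + Sublayer(x) = 3x.
const normed = addAndNorm([[1, 2, 3, 4]], (m) => m.map((r) => r.map((v) => 2 * v)));
console.log(normed[0]); // values centered on 0, in increasing order
```

The residual path (`x + ...`) lets gradients flow around each sublayer, while the normalization keeps activations on a comparable scale from layer to layer.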

⚙️ Configuration

The TransformerConfig interface allows you to customize the model:

| Parameter | Description | Typical Value |
|---|---|---|
| numLayers | Number of encoder/decoder layers | 6 |
| dModel | Model dimension (embedding size) | 512 |
| numHeads | Number of attention heads | 8 |
| dff | Feed-forward hidden dimension | 2048 |
| inputVocabSize | Size of input vocabulary | 10000 |
| targetVocabSize | Size of target vocabulary | 10000 |
| maxPositionEncoding | Maximum sequence length | 5000 |
| dropoutRate | Dropout rate for regularization | 0.1 |

Note: dModel must be divisible by numHeads.
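
One way to catch a bad configuration early is a small check run before the model is built. The sketch below is hypothetical: `validateConfig` is not part of this repository, and the interface is a local copy of the field names from the table above, so the real `TransformerConfig` may differ.

```typescript
// Local copy of the documented fields (assumed shape; the repo's interface may differ).
interface TransformerConfig {
  numLayers: number;
  dModel: number;
  numHeads: number;
  dff: number;
  inputVocabSize: number;
  targetVocabSize: number;
  maxPositionEncoding: number;
  dropoutRate: number;
}

// Returns a list of human-readable problems; an empty list means the config looks sane.
function validateConfig(config: TransformerConfig): string[] {
  const problems: string[] = [];
  const positiveInts: Array<keyof TransformerConfig> = [
    'numLayers', 'dModel', 'numHeads', 'dff',
    'inputVocabSize', 'targetVocabSize', 'maxPositionEncoding',
  ];
  for (const key of positiveInts) {
    const value = config[key];
    if (!Number.isInteger(value) || value <= 0) {
      problems.push(`${key} must be a positive integer, got ${value}`);
    }
  }
  if (config.dModel % config.numHeads !== 0) {
    problems.push(`dModel (${config.dModel}) must be divisible by numHeads (${config.numHeads})`);
  }
  if (config.dropoutRate < 0 || config.dropoutRate >= 1) {
    problems.push(`dropoutRate must be in [0, 1), got ${config.dropoutRate}`);
  }
  return problems;
}

const baseConfig: TransformerConfig = {
  numLayers: 2, dModel: 128, numHeads: 8, dff: 512,
  inputVocabSize: 5000, targetVocabSize: 5000,
  maxPositionEncoding: 1000, dropoutRate: 0.1,
};
console.log(validateConfig(baseConfig));                     // []
console.log(validateConfig({ ...baseConfig, numHeads: 6 })); // one problem: 128 is not divisible by 6
```

With dModel = 128 and numHeads = 8, each head works on 16 dimensions; with 6 heads the split would leave a remainder, which is what the divisibility rule prevents.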

📝 Examples

Machine Translation Example

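// English to French translation
// `tokenize` stands in for your own tokenizer (returning number[] token IDs);
// `transformer` is the model created in the basic example.
const englishTokens = tokenize("The cat sat on the mat");
const frenchTokens = tokenize("<start> Le chat");

const output = transformer.call(
    tf.tensor2d([englishTokens], undefined, 'int32'),
    tf.tensor2d([frenchTokens], undefined, 'int32'),
    false
);

// Get next token prediction from the logits at the last target position
// (tf.slice does not support negative begin indices)
const lastPos = output.shape[1] - 1;
const nextTokenProbs = tf.softmax(output.slice([0, lastPos, 0], [1, 1, -1]), -1);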
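
Repeating the "predict the next token, append it, run again" step gives a full translation. The sketch below shows that loop as greedy decoding against a hypothetical `nextLogits` callback, which stands in for running the model and taking the logits at the last target position. The toy model, the token IDs, and the function names are all invented for illustration.

```typescript
// Hypothetical callback: given the source tokens and the target generated so far,
// return one logit per vocabulary entry for the next token.
type NextTokenLogits = (source: number[], targetSoFar: number[]) => number[];

// Index of the largest value (first one wins on ties).
function argMax(values: number[]): number {
  let best = 0;
  for (let i = 1; i < values.length; i++) {
    if (values[i] > values[best]) best = i;
  }
  return best;
}

// Greedy decoding: always pick the most likely next token until endId or maxLen.
function greedyDecode(
  source: number[],
  nextLogits: NextTokenLogits,
  startId: number,
  endId: number,
  maxLen: number
): number[] {
  const target: number[] = [startId];
  while (target.length < maxLen) {
    const next = argMax(nextLogits(source, target));
    target.push(next);
    if (next === endId) break;
  }
  return target;
}

// Toy "model" over a 5-token vocabulary: always prefers (last token + 1) mod 5.
const toyModel: NextTokenLogits = (_source, targetSoFar) => {
  const last = targetSoFar[targetSoFar.length - 1];
  return Array.from({ length: 5 }, (_, id) => (id === (last + 1) % 5 ? 1 : 0));
};

const decoded = greedyDecode([7, 8, 9], toyModel, 1, 4, 10);
console.log(decoded); // [1, 2, 3, 4]
```

Greedy decoding is the simplest strategy; beam search or sampling are common alternatives when picking the single most likely token at each step gives poor results.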

Custom Masking

import { createPaddingMask, createLookAheadMask } from './src/transformer';

// Ignore padding tokens
const paddingMask = createPaddingMask(inputSequence);

// Prevent looking at future tokens
const lookAheadMask = createLookAheadMask(sequenceLength);
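
For intuition about what the two masks contain, here is a plain-array sketch. It assumes that padding uses token ID 0 and that a mask value of 1 means "do not attend here"; the repository's `createPaddingMask` and `createLookAheadMask` may use different shapes or conventions. The names `paddingMask` and `lookAheadMask` below are illustrative.

```typescript
// 1 marks padding positions that attention should ignore (assumed convention; padId assumed to be 0).
function paddingMask(tokenIds: number[][], padId: number = 0): number[][] {
  return tokenIds.map((seq) => seq.map((id) => (id === padId ? 1 : 0)));
}

// Upper-triangular mask: position i may not attend to any position j > i.
function lookAheadMask(size: number): number[][] {
  return Array.from({ length: size }, (_, i) =>
    Array.from({ length: size }, (_, j) => (j > i ? 1 : 0))
  );
}

console.log(paddingMask([[1, 45, 234, 0, 0]])); // [[0, 0, 0, 1, 1]]
console.log(lookAheadMask(3));                  // [[0, 1, 1], [0, 0, 1], [0, 0, 0]]
```

In the decoder's self-attention the two are typically combined, so a position is blocked if it is either padding or in the future.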

🤝 Contributing

This is a personal learning project, but suggestions and improvements are welcome! Feel free to open issues or submit pull requests.

📄 License

ISC License - see package.json for details.

👤 Author

Nusrath Khan


Built with ❤️ to understand the technology that's changing the world
