CS6140 Machine Learning

HW7 - Attention, Transformer

Make sure you check the syllabus for the due date.




Setup: Machine Translation Spanish->English using seq2seq models: data, dataloader, starter-code

In all problems you can tune various parameters:
- data params : num_steps, sample_percent, batch_size, min_freq
- model params : num_layers, num_blocks, num_hiddens, ffn_num_hiddens, num_heads, dropout, embed_size
- optional : try the beam search (implemented, needs work) instead of greedy/argmax for decoding test examples
- optional : try an adaptive LR (learning rate; not implemented) that allows parameters to update at different rates; also try an LR decay schedule that decreases the LR over epochs (see the sketch after this list)
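For the LR options, here is a minimal sketch, independent of the starter code: Adam's per-parameter adaptive steps already let parameters update at different rates, and an exponential scheduler decays the base learning rate after every epoch. The `nn.Linear` model and dummy loss are stand-ins for illustration only.

```python
import torch
from torch import nn

# Minimal sketch, not part of the starter code: Adam adapts the effective step
# per parameter; ExponentialLR decays the base LR after each epoch.
model = nn.Linear(8, 8)                       # stand-in for the seq2seq model
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for epoch in range(10):
    loss = model(torch.randn(4, 8)).pow(2).mean()   # dummy loss for illustration
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # Adam's per-parameter update
    scheduler.step()                                # lr <- lr * 0.9
    print(epoch, scheduler.get_last_lr())
```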

This code can be computationally intensive, so consider the following options:
- run on a local Nvidia GPU + cuda
- run on Google Colab or similar services
- adjust the data size (sampling_percent param, batch size, etc.). If you are on a Mac, try to make torch use MPS (see the device-selection sketch after this list)
- adjust model params
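A minimal device-selection sketch using standard PyTorch calls (nothing specific to the starter code): prefer CUDA if an Nvidia GPU is present, fall back to MPS on Apple Silicon, otherwise use the CPU.

```python
import torch

# Pick the fastest available backend: CUDA (Nvidia), then MPS (Apple), then CPU.
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(device)
# Move the model and each batch with .to(device) before training.
```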

PROBLEM 1 Cross-attention (from Decoder to Encoder) [50 points]

(A) Complete the #TODO-s in [DotAttention] and in [Seq2SeqAttentionDecoder] before running the simple-attention notebook: implements dot-product cross-attention in an RNN Decoder, whose states query the (key, value) pairs from the encoder output. Observe the attention weights per query (a sketch of the general mechanism follows the optional note below).
* optional, no credit: try substituting [DotProductAttention] with [AdditiveAttention] (already implemented) and recheck the query-key weights
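As a reference for what the TODOs are after, here is a sketch of the general scaled dot-product mechanism (not the exact starter-code class): each query is scored against all keys, the scores are softmaxed into weights, and the output is a weighted sum of the values. In cross-attention the decoder's current hidden state is the query and the encoder outputs supply both keys and values.

```python
import math
import torch
from torch import nn

# Sketch of scaled dot-product attention. Shapes: queries (batch, num_q, d),
# keys/values (batch, num_kv, d).
class DotProductAttentionSketch(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)

    def forward(self, queries, keys, values):
        d = queries.shape[-1]
        # scores: (batch, num_q, num_kv); scale by sqrt(d) to keep the softmax stable
        scores = torch.bmm(queries, keys.transpose(1, 2)) / math.sqrt(d)
        self.attention_weights = scores.softmax(dim=-1)   # inspect these per query
        return torch.bmm(self.dropout(self.attention_weights), values)

q = torch.randn(2, 1, 16)     # one decoding step (the query)
kv = torch.randn(2, 7, 16)    # 7 encoder time steps (keys and values)
out = DotProductAttentionSketch()(q, kv, kv)   # (2, 1, 16)
```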
(B) Complete the #TODO-s in [MultiheadAttention] and in [MultiHeadSeq2SeqDecoder] before running the multihead-attention notebook: implements the same mechanism as in (A), but with multi-head attention (see the reference sketch below)
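The assignment asks you to fill in the multi-head class yourself; as a behavioral reference only, PyTorch's built-in layer shows the expected shapes when the decoder state queries the encoder outputs across several heads (the hidden dimension is split across heads and the per-head results are concatenated).

```python
import torch
from torch import nn

# Reference for expected behavior, not the starter-code class.
mha = nn.MultiheadAttention(embed_dim=32, num_heads=4, batch_first=True)

dec_state = torch.randn(2, 1, 32)    # query: current decoder step
enc_out = torch.randn(2, 7, 32)      # keys and values: encoder outputs
context, weights = mha(dec_state, enc_out, enc_out)
print(context.shape, weights.shape)  # (2, 1, 32) and (2, 1, 7); weights averaged over heads
```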


PROBLEM 2 Self-attention (inside RNN) vs Cross-attention [50 points]

(A) Complete the #TODO-s in [SelfAttentionAugmentedEncoder] before running the encoder-self-attention-hybrid notebook: keeps the cross-attention from Problem 1 and adds self-attention inside the GRU-RNN encoder (a sketch of the hybrid idea follows below)
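One plausible shape for this hybrid, shown only as an assumed structure for illustration (the starter class may organize it differently): run the GRU as usual, then let every encoder position attend to every other position and mix the two representations.

```python
import torch
from torch import nn

# Hypothetical sketch of a self-attention-augmented GRU encoder.
class SelfAttentiveEncoderSketch(nn.Module):
    def __init__(self, vocab_size=100, embed_size=32, num_hiddens=32, num_heads=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.GRU(embed_size, num_hiddens, batch_first=True)
        self.self_attn = nn.MultiheadAttention(num_hiddens, num_heads, batch_first=True)

    def forward(self, X):                       # X: (batch, num_steps) token ids
        outputs, state = self.rnn(self.embedding(X))
        # self-attention: queries = keys = values = the GRU outputs
        attended, _ = self.self_attn(outputs, outputs, outputs)
        return outputs + attended, state        # residual mix of RNN and attention

enc = SelfAttentiveEncoderSketch()
Y, state = enc(torch.randint(0, 100, (2, 7)))
print(Y.shape)   # (2, 7, 32)
```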
(B) Complete the #TODO-s in class [SelfAttentiveGRUDecoder] before running the decoder-self-attention-hybrid notebook: keeps the cross-attention and adds self-attention to both the RNN-Encoder and the RNN-Decoder (note the masking caveat sketched below)
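One detail worth flagging on the decoder side, assuming the decoder self-attends over the whole target sequence at once during teacher forcing: position t must not attend to future tokens, which is usually enforced with a causal mask.

```python
import torch

# Boolean causal mask: True entries are blocked, so step t only attends to
# steps <= t. Pass it as attn_mask to nn.MultiheadAttention, or add -inf at
# the masked score positions before the softmax in your own implementation.
num_steps = 5
causal_mask = torch.triu(torch.ones(num_steps, num_steps, dtype=torch.bool), diagonal=1)
print(causal_mask)
```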


PROBLEM 3 Attention + Transformer [30 points]

(A) Complete the #TODO-s in [TransformerDecoderBlock] and [TransformerDecoder] before running the transformer-decoder-on-gru-encoder notebook: removes the RNN recurrence from the Decoder, using positional encoding to keep track of token order in the sequence (a sketch of the standard sinusoidal encoding follows below)
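A sketch of the standard sinusoidal positional encoding from "Attention Is All You Need" (the general formula; the starter code may already provide an equivalent class): with no recurrence, these values are added to the token embeddings so the model can still distinguish positions.

```python
import torch

# P[pos, 2i]   = sin(pos / 10000^(2i/d))
# P[pos, 2i+1] = cos(pos / 10000^(2i/d))
def positional_encoding(num_steps, num_hiddens):
    P = torch.zeros(num_steps, num_hiddens)
    pos = torch.arange(num_steps, dtype=torch.float32).unsqueeze(1)
    div = torch.pow(10000.0, torch.arange(0, num_hiddens, 2, dtype=torch.float32) / num_hiddens)
    P[:, 0::2] = torch.sin(pos / div)
    P[:, 1::2] = torch.cos(pos / div)
    return P                                   # (num_steps, num_hiddens)

P = positional_encoding(10, 32)
print(P.shape)   # inside the decoder: embeddings + P[:seq_len]
```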
(B) (optional, no credit) Run the transformer notebook: removes recurrence (RNN) from both the Encoder and the Decoder: "Attention Is All You Need".