Transformer
Your transformer is composed of n transformer encoders and m transformer decoders, where n can be different from m.
- Each decoder must be connected to at least 1 encoder
- Each decoder can be connected to more than 1 encoder
- Connections are described by the matrix encoder_decoder_connections (dimension: n x m): if encoder_decoder_connections[i][j] != 0 then decoder j is connected to encoder i (see the sketch after this list)
- The initialization of the transformer_encoder structure requires an "input_dimension" parameter. That parameter doesn't indicate the dimension of the input of the encoder, but of the attention layer inside the encoder.
- The initialization of the transformer_decoder structure requires an "input_dimension" parameter. That parameter doesn't indicate the dimension of the input of the decoder, but of the first attention layer inside the decoder.
- The initialization of the transformer_decoder structure requires a "left_dimension" parameter. That parameter indicates the dimension of the input for the second attention layer of the decoder; look at the figure at the bottom to get a clearer understanding.
- The initialization of the transformer_decoder structure requires an "encoder_input_dimension" parameter. That parameter indicates the dimension of the second attention mechanism.
- The model inside the decoder and the encoder must contain only convolutional layers with kernel_rows = 1 and kernel_cols = 1; n_kernels can be whatever number you want.
- The softmax outside the decoder output must be applied to each token with the same linear function (it is considered as an embedding matrix for the output), so try using a convolutional network there.
- The model for the encoder and the decoder must be a convolutional network. My advice: channels = 1, rows = the flattened dimension that comes out of the linear after the attention layer, cols = 1, kernel_cols = 1, kernel_rows = the token dimension of the input for the encoder (not 1 as the paper suggests!), n_kernels = whatever you want, stride_rows = the token dimension of the input, stride_cols = 1 (a worked example follows this list).
- The model after the attention outputs for the decoder and the encoder should have only 1 fully connected layer, with normalization and dropout if you want, but you must freeze the biases (training_mode = FREEZE_BIASES).
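As a quick sketch of the connection matrix described in the list above (the names n, m and encoder_decoder_connections come from this page; the check itself is only an illustration, not the library's API):

```c
#include <stdio.h>

#define N 3 /* number of encoders */
#define M 2 /* number of decoders */

int main(void) {
    /* encoder_decoder_connections[i][j] != 0 means decoder j is connected
       to encoder i. Every column must contain at least one non-zero entry,
       because each decoder needs at least 1 encoder. */
    int encoder_decoder_connections[N][M] = {
        {1, 0},   /* encoder 0 feeds decoder 0        */
        {1, 1},   /* encoder 1 feeds decoders 0 and 1 */
        {0, 1}    /* encoder 2 feeds decoder 1        */
    };

    for (int j = 0; j < M; j++) {
        int connected = 0;
        for (int i = 0; i < N; i++)
            if (encoder_decoder_connections[i][j] != 0)
                connected++;
        printf("decoder %d is connected to %d encoder(s)\n", j, connected);
        if (connected == 0)
            printf("error: decoder %d must be connected to at least 1 encoder\n", j);
    }
    return 0;
}
```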
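And here is a worked example of the convolutional layout advised above. The numbers are made up, and the assumption that the linear after the attention keeps the token dimension is only for the illustration:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only (not taken from the library). */
    int sequence_length = 10;  /* number of tokens                             */
    int token_dimension = 64;  /* token dimension of the input for the encoder */

    /* Flattened dimension coming out of the linear after the attention layer;
       this illustration assumes the linear keeps the token dimension. */
    int rows        = sequence_length * token_dimension;
    int channels    = 1;
    int cols        = 1;
    int kernel_rows = token_dimension;  /* not 1, as advised above */
    int kernel_cols = 1;
    int stride_rows = token_dimension;
    int stride_cols = 1;
    int n_kernels   = 128;              /* whatever you want */

    /* With this layout every kernel application covers exactly one token,
       so each kernel slides token by token over the flattened input. */
    int positions_per_kernel = (rows - kernel_rows) / stride_rows + 1;
    printf("each of the %d kernels is applied %d times, once per token\n",
           n_kernels, positions_per_kernel);
    printf("channels=%d cols=%d kernel_cols=%d stride_cols=%d\n",
           channels, cols, kernel_cols, stride_cols);
    return 0;
}
```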
You must have a number of fully connected layers for the encoder equal to 3 * the number of heads of the multi-headed attention mechanism. The input of each fully connected layer must match the dimension of the input given during the feed forward and back propagation. The output of each fully connected layer must match the input_dimension/n_head given in the initialization.
For the decoder you must have a number of fully connected layers equal to 3*(n_head1 + n_head2): the first 3*n_head1 fully connected layers are used for the first attention layer, the others for the second one. The input of the first 3*n_head1 fully connected layers must match the dimension of the input given during the feed forward and back propagation. Their output must instead match input_dimension/n_head1.
For the fully connected layers of the second attention layer, instead, there is a separation. The layers that satisfy i % 3 != 2, where i is the index of the fully connected layer, must have an input dimension that matches left_dimension; for the others the input dimension must match the input_dimension of the decoder. The output must match encoder_input_dimension/n_head2.
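The following sketch only spells out the bookkeeping of the three paragraphs above; the dimension and head names come from this page, while the numbers and the loop are illustrative and are not the library's initialization code:

```c
#include <stdio.h>

int main(void) {
    /* Illustrative numbers only. */
    int token_input_dim         = 512; /* dimension of the input given during feed forward / back propagation */
    int input_dimension         = 512; /* "input_dimension" of the encoder/decoder initialization             */
    int left_dimension          = 512; /* decoder: input of the second attention layer                         */
    int encoder_input_dimension = 512; /* decoder: dimension of the second attention mechanism                 */
    int n_head  = 8;                   /* encoder heads                                                         */
    int n_head1 = 2, n_head2 = 2;      /* decoder heads (first / second attention layer)                       */

    /* Encoder: 3 * n_head fully connected layers,
       each one mapping token_input_dim -> input_dimension / n_head. */
    printf("encoder: %d fcls, each %d -> %d\n",
           3 * n_head, token_input_dim, input_dimension / n_head);

    /* Decoder: 3 * (n_head1 + n_head2) fully connected layers. */
    int total = 3 * (n_head1 + n_head2);
    for (int i = 0; i < total; i++) {
        if (i < 3 * n_head1) {
            /* First attention layer. */
            printf("decoder fcl %2d: %d -> %d\n", i, token_input_dim, input_dimension / n_head1);
        } else {
            /* Second attention layer: i % 3 != 2 uses left_dimension,
               i % 3 == 2 uses the decoder's input_dimension. */
            int in = (i % 3 != 2) ? left_dimension : input_dimension;
            printf("decoder fcl %2d: %d -> %d\n", i, in, encoder_input_dimension / n_head2);
        }
    }
    return 0;
}
```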
You can have a linear feed forward + activation + another linear in the feed-forward passage, as in the classic way, but I decided to insert a model structure in that passage, so you can use everything a model can offer. For example, if you want a convolutional layer + pooling + padding + other residual connections + group normalization + a fully connected layer with dropout and so on, your transformer can have such a structure. Another difference: for the normalization passage I used scaled L2 normalization. It has been shown to have the same effect as layer normalization and it is also faster. Furthermore, the linear computation for the attention mechanism is created using fully connected layers, so you can add everything a fully connected layer can offer (even though it should be just a linear passage).
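For reference, a minimal sketch of one common form of scaled L2 normalization (a single learned scale divided by the L2 norm of the vector); whether the library uses exactly this variant is an assumption here:

```c
#include <math.h>
#include <stdio.h>

/* One common form of scaled L2 normalization:
   y[i] = g * x[i] / ||x||_2, with a single learned scalar g.
   (Assumed variant; the library may differ in the details.) */
void scaled_l2_normalization(const float *x, float *y, int n, float g) {
    float norm = 0.0f;
    for (int i = 0; i < n; i++)
        norm += x[i] * x[i];
    norm = sqrtf(norm) + 1e-6f;  /* small epsilon to avoid division by zero */
    for (int i = 0; i < n; i++)
        y[i] = g * x[i] / norm;
}

int main(void) {
    float x[4] = {1.0f, -2.0f, 3.0f, 0.5f};
    float y[4];
    scaled_l2_normalization(x, y, 4, 1.0f);
    for (int i = 0; i < 4; i++)
        printf("%f\n", y[i]);
    return 0;
}
```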
For further information please visit this link: https://huggingface.co/blog/encoder-decoder