Isaac Godfried
Nov 1, 2019


Glad you liked the article. I’m still in the process of working out the PyTorch transformer example, as there are many subtle things to figure out. Are you planning on using just self-attention or the full transformer architecture? If it is the former, it is relatively simple (at least if you are using standard self-attention). You just need either a convolutional layer or a standard dense layer to generate the initial embedding, then follow that with a standard multi-head attention layer. For instance, here is some of the code I have for Attend and Diagnose (warning: this is very preliminary and I have only tested it on toy data; also, I’m using layer norm instead of the dense interpolation).
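Roughly, it looks something like this (module names and dimensions are placeholders, and again this is untested beyond toy data):

```python
import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    """Initial embedding (a Conv1d over the time axis) followed by
    multi-head self-attention, with layer norm in place of the
    dense interpolation from the paper."""
    def __init__(self, n_features: int, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        # kernel_size=1 acts as a per-time-step linear embedding;
        # a larger kernel would mix neighboring time steps instead
        self.embed = nn.Conv1d(n_features, d_model, kernel_size=1)
        self.attn = nn.MultiheadAttention(d_model, n_heads)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        # x: (batch, seq_len, n_features)
        emb = self.embed(x.transpose(1, 2)).transpose(1, 2)
        # nn.MultiheadAttention expects (seq_len, batch, d_model)
        emb = emb.transpose(0, 1)
        attn_out, _ = self.attn(emb, emb, emb)
        out = self.norm(emb + attn_out)  # residual connection + layer norm
        return out.transpose(0, 1)       # back to (batch, seq_len, d_model)

# toy usage: 32 series, 48 time steps, 8 features -> (32, 48, 64)
block = SelfAttentionBlock(n_features=8)
y = block(torch.randn(32, 48, 8))
```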

The full-scale transformer is a lot more complex, as it has multi-head attention in both the encoder and the decoder. Additionally, in NLP tasks the actual “target” is sent to the transformer decoder; however, as I stated above, a mask is employed to mask out future tokens. So some sort of mask has to be applied for time series as well, though I’m not sure at the moment if it is exactly the same (I’ve sketched the standard causal mask below). I’m still digging into the finer details of the “Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting” paper, but it is hard since at the moment there is no code. There are also a whole bunch of other questions, like how long to make each sequence and how to represent things like month, day, and time in the positional encodings. I promise that I will write up a full tutorial on it as soon as I get it finished, though that may still be a month or so away, as I have other work going on too.

Also have a look at DA-RNN if you haven’t already. It is in PyTorch, and although it doesn’t use self-attention, it does utilize a standard attention mechanism. I’m actually currently using it as a baseline on some of my flow forecasting and found it works pretty well. Hope that helps.
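On the masking point above: for reference, the standard causal (“subsequent”) mask used in NLP decoders looks like this in PyTorch; whether time series needs exactly this form is the part I’m still verifying:

```python
import torch

def subsequent_mask(seq_len: int) -> torch.Tensor:
    """Causal mask in the format nn.MultiheadAttention's attn_mask expects:
    -inf above the diagonal blocks attention to future positions."""
    return torch.triu(
        torch.full((seq_len, seq_len), float("-inf")), diagonal=1
    )

# e.g. subsequent_mask(4):
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
print(subsequent_mask(4))
```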
