We use dataset packing, so sequences shorter than the max sequence length are concattenated with a token to separate them. If an example is greater than maximum sequence length, it is split so the entire example is used, but in exclusive subsets.
Updated over 9 months ago