We use dataset packing, so sequences shorter than the max sequence length are concattenated with a token to separate them. If an example is greater than maximum sequence length, it is split so the entire example is used, but in exclusive subsets.
Updated over 11 months ago