Situation: Every day a batch of JSON files is generated and placed in Azure Blob Storage. Also every day, an Azure Data Factory copy job does a lookup in the blob storage and applies a “Filter by last modified”:
Start time: @adddays(utcnow(),-2)
End time: @utcnow()
The matching files are then copied to Azure Data Lake Storage Gen2 (a simplified sketch of the source settings is included below).
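For reference, the filter window corresponds roughly to the following source settings on the copy activity. This is a simplified sketch: I have assumed Binary datasets for a straight file copy and left out the dataset references, so the exact source and sink types may differ from the actual pipeline.

{
  "source": {
    "type": "BinarySource",
    "storeSettings": {
      "type": "AzureBlobStorageReadSettings",
      "recursive": true,
      "wildcardFileName": "*.json",
      "modifiedDatetimeStart": {
        "value": "@adddays(utcnow(),-2)",
        "type": "Expression"
      },
      "modifiedDatetimeEnd": {
        "value": "@utcnow()",
        "type": "Expression"
      }
    }
  },
  "sink": {
    "type": "BinarySink",
    "storeSettings": {
      "type": "AzureBlobFSWriteSettings"
    }
  }
}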
On normal days, with 50-100 new JSON files, the copy job runs fine, but on the last day of every quarter the number of new JSON files increases to 10,000+, and then the copy job fails with the message “ErrorCode=SystemErrorFailToInsertSubJobForTooLargePayload,…”
Therefore I have built a new copy job that uses a ForEach loop to run copy activities in parallel. This can handle much larger volumes of files, but it still takes a couple of minutes per file, and I have not seen more than around 500 files per hour being copied, so it is still not fast enough.
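The ForEach setup looks roughly like this. Again a simplified sketch: the activity names “FilterNewFiles” and “CopySingleFile” are placeholders, the inner copy uses datasets parameterized with @item() to target one file per iteration, and the dataset references are omitted.

{
  "name": "CopyNewFilesInParallel",
  "type": "ForEach",
  "typeProperties": {
    "items": {
      "value": "@activity('FilterNewFiles').output.Value",
      "type": "Expression"
    },
    "isSequential": false,
    "batchCount": 50,
    "activities": [
      {
        "name": "CopySingleFile",
        "type": "Copy",
        "typeProperties": {
          "source": {
            "type": "BinarySource",
            "storeSettings": { "type": "AzureBlobStorageReadSettings" }
          },
          "sink": {
            "type": "BinarySink",
            "storeSettings": { "type": "AzureBlobFSWriteSettings" }
          }
        }
      }
    ]
  }
}

As far as I know, batchCount is capped at 50 parallel iterations, and each iteration starts its own copy activity with its own queue and startup time, which would line up with the roughly 500 files per hour I am seeing.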
So I am still looking for more ways to optimize the copy. I have included a couple of screenshots, but I can give more details on specifics.