I'm trying to understand the upload process better and I'm seeing some interesting behavior.
Repro:
1. Upload (syn.store) file1.txt to a project.
2. Delete file1.txt (with or without skipTrashcan)
3. Upload (syn.store) the same file to the same project. I have tried purging the item from the trashcan and deleting the local .synapseCache directory.
Result:
The file does not upload but it's almost immediately added back to the project. Is Synapse somehow restoring the deleted file without having to upload it again?
If I make a change to the file and upload it again I see the upload progress and the updated file in the Project.
Any info on how this works would be appreciated.
Thanks,
Patrick
Created by pstoutdev
Thanks for the explanation @brucehoff and @larssono. That helps. I just needed to understand what I was seeing and make sure I wasn't doing something wrong.
Patrick:
Bruce is exactly right about the general recommendation. Besides staying out of the way of the direct upload to S3, we have implemented optimizations that avoid uploading (or downloading, for that matter) the same content multiple times, which is what you are observing. Here is specifically why you see the behavior you describe in steps 1, 2, 3. When a file is uploaded to Synapse it is split into smaller chunks that are each uploaded to S3 separately and then "stitched" together into the complete file. As part of the upload, the md5 of the whole file and of each part is verified. If you attempt to re-upload a chunk that you have already uploaded recently (I can't remember the max time delay), the client will skip that portion of the upload. When you edit the file after step 2, the content of the file uploaded in step 3 is identical to the file uploaded in step 1 and all of the parts have already been uploaded, so the only delay is the communication between the local client and Synapse verifying that the parts exist and have been stitched together into the complete file.
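To make the idea concrete, here is a minimal sketch of a chunked upload that skips parts the store has already received. It is only an illustration, not Synapse's actual implementation: `FakePartStore`, `upload_in_parts`, the part size, and the choice to key parts by (whole-file md5, part number) are all assumptions made for the example.

```python
import hashlib

PART_SIZE = 5 * 1024 * 1024  # hypothetical part size, not Synapse's real value


class FakePartStore:
    """In-memory stand-in for the remote store (hypothetical)."""

    def __init__(self):
        # Assumption: parts are remembered by (whole-file md5, part number).
        self.parts = {}

    def has_part(self, file_md5, part_number):
        return (file_md5, part_number) in self.parts

    def put_part(self, file_md5, part_number, data):
        self.parts[(file_md5, part_number)] = data


def upload_in_parts(data, store, part_size=PART_SIZE):
    """Send `data` in parts, skipping any part the store already holds.

    Returns (whole_file_md5, number_of_parts_actually_sent).
    """
    file_md5 = hashlib.md5(data).hexdigest()
    sent = 0
    for part_number, offset in enumerate(range(0, len(data), part_size), start=1):
        if store.has_part(file_md5, part_number):
            continue  # already uploaded earlier: nothing to transfer
        store.put_part(file_md5, part_number, data[offset:offset + part_size])
        sent += 1
    return file_md5, sent
```

In this sketch, uploading the same bytes a second time sends zero parts, while changing even one byte changes the whole-file md5 and every part is sent again, which matches the difference you see between an instant re-upload and one that shows progress.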
You can also search Synapse for any content that you have locally using the md5 of the file (deleted items in Synapse will not be returned, however). For example, if you do:
```synapse show /path/to/local/file```
with the command line it will tell you which files in Synapse match your local file and what Synapse knows about the file (e.g. annotations, provenance, etc.).
@pstoutdev Both the Synapse client and server have optimizations regarding file transfer, including avoiding transferring files already transferred and restarting failed transfers using partial results. (I do not have the details at front of mind.) You ask:
> I'm trying to understand the upload process better
Fundamentally, file upload is direct from your client to the underlying data store (usually an AWS S3 bucket). There's a bit of overhead added to "talk to Synapse" but for large files the time to upload to Synapse should be pretty close to (just a little longer than) the time to upload to S3 over the same connection. So when you profile upload speed (at least for large files) you are mostly profiling your network connection to AWS and the performance of your client machine.
For best performance we recommend using the Synapse Python client which uses multithreading/parallelization to optimize upload. The 'synapser' R package wraps the Python client but is constrained by R's architecture to work in a single thread, so it's necessarily slower. Similarly, constraints of the web browser as a client platform mean that you should expect slower upload using the web browser than using our Python package.
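If you are profiling upload speed, one way to keep the timings from being skewed by content that was already transferred is to upload freshly generated data on every run. The following is a rough sketch under that assumption; `make_unique_file` and `time_upload` are hypothetical helper names, and `upload` is any callable you supply that takes a file path.

```python
import os
import tempfile
import time


def make_unique_file(size_bytes, directory=None):
    """Write `size_bytes` of random data to a new temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".bin", dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(os.urandom(size_bytes))
    return path


def time_upload(upload, size_bytes, runs=3):
    """Time `upload(path)` on a new random file for each run.

    Random content gives each file a different md5, so no run can be
    satisfied by data uploaded in a previous run.
    """
    timings = []
    for _ in range(runs):
        path = make_unique_file(size_bytes)
        try:
            start = time.perf_counter()
            upload(path)
            timings.append(time.perf_counter() - start)
        finally:
            os.remove(path)  # clean up the temp file even if upload fails
    return timings
```

With the Python client this could be called as `time_upload(lambda p: syn.store(synapseclient.File(p, parent="syn123")), 100 * 1024 * 1024)`, where `syn` is a logged-in client and `syn123` is a placeholder for your own project ID.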
Hope this helps!
I'm seeing the same behavior when not deleting but just updating.
1. Upload file. I see the progress and it takes 10 seconds. Entity Version == 1
2. Remove a line from the file. Upload again. I see the progress and it takes 10 seconds. Entity Version == 2
3. Add the removed line back to the file so file size and MD5 are the same as Version 1 of the file. Upload again. No progress, completes immediately. Entity Version == 3
@brucehoff I do not see the upload progress display and I'm seeing 100MB files appear to upload immediately. If I change the same file and re-upload I see the upload progress and it takes 10 seconds or so to upload. This seems to indicate that the file is not being uploaded but is somehow restored.
For reference I'm trying to benchmark some upload processes. I'm benchmarking by uploading the same files to a project. I delete the files from the project for each test and I noticed this behavior.
@pstout, why do you say "The file does not upload but it's almost immediately added back to the project."? That is, if you use "Synapse.store()" and see that your file is added to your project, why do you say "the file does not upload"? Can you elaborate on what problem it is that you face?