Recursively searching and downloading objects from AWS S3 with the `aws` CLI
Yesterday I was asked if we had some (very) old files from a project. They were originally generated by one of our freelancers who, at the end of the project, had given us a backup of his external drive for archiving.
My go-to for this kind of long-term storage is AWS S3 or Glacier. I’d backed up the drive to a bucket and had forgotten about it until now.
The `aws` CLI is powerful but inscrutable, given that it can do anything the web console can, for any AWS service.
The files were from Sibelius (`.sib`) and for Clarinet, Flute and Sax. Cue some googling about the `aws s3api list-objects` command.
You can combine search terms in the `--query` parameter to limit the results of the `list-objects` call. Searches are case sensitive, and I wasn’t sure how the files would have been named originally, so I went for:
--query "Contents[?(contains(Key,'larinet') || contains(Key,'lute') || contains(Key,'ax')) && contains(Key,'.sib')]"
This gives me the best chance of finding `Clarinet` or `clarinet`, etc. (no good if they’re named `CLARINET`, but what kind of monster would do that?).
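If the monster scenario is a real worry, one option is to list every key under the prefix and do the case-insensitive matching locally instead. This is only a sketch: `filter_keys` is a name I made up, not part of the `aws` CLI, and it assumes one key per line on stdin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the aws CLI): reads S3 keys on stdin, one per
# line, and keeps those that name an instrument in any letter case and also
# contain ".sib" (that second match stays case sensitive, as in the query above).
filter_keys() {
  grep -iE 'clarinet|flute|sax' | grep -F '.sib'
}

# Assumed usage, with the placeholder bucket and prefix from this post:
#   aws s3api list-objects --bucket "my-bucket-name" --prefix 'path/to/files/' \
#     --query 'Contents[].Key' --output json | jq -r '.[]' | filter_keys
```

The trade-off is that every key under the prefix is printed and filtered on your side, which is probably fine for a single archived drive but noisier for a large bucket.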
The only field I need from each result for the `get-object` call is `Key`, the path to the S3 object. Putting the query together with the bucket and prefix:
aws s3api list-objects --bucket "my-bucket-name" --prefix 'path/to/files/' --query "Contents[?(contains(Key,'larinet') || contains(Key,'lute') || contains(Key,'ax')) && contains(Key,'.sib')]"
This gives me an array of JSON objects like:
[
  {
    "Key": "path/to/files/Project backup/Clarinets/example.sib",
    "LastModified": "2017-09-18T21:01:19+00:00",
    "ETag": "\"e9abc61fab413ec5de1126e7ab48ac38\"",
    "Size": 226180,
    "StorageClass": "STANDARD",
    "Owner": {
      "DisplayName": "owner-name",
      "ID": "d9fc4a52c54711ec9d640242ac120002f54ce640c54711ec9d640242ac120002"
    }
  },
  ...,
  ...,
]
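Since `Key` is the only field needed, the query itself can also return just the keys by appending `.Key` to the filter expression. A minimal sketch, where `list_matching_keys` is a hypothetical wrapper name of mine and the bucket and prefix are the same placeholders used elsewhere in this post:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: the same filter as the command above, with .Key
# appended so the CLI returns a JSON array of key strings rather than
# full object records.
list_matching_keys() {
  local bucket="$1" prefix="$2"
  aws s3api list-objects --bucket "$bucket" --prefix "$prefix" \
    --query "Contents[?(contains(Key,'larinet') || contains(Key,'lute') || contains(Key,'ax')) && contains(Key,'.sib')].Key"
}

# Example call with the placeholder names from this post:
#   list_matching_keys "my-bucket-name" 'path/to/files/'
```

With this shape, the result is a flat array of strings, so a later `jq` step would read `.[]` rather than `.[] | .Key`.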
We can then loop through these, downloading the files. Instead of just dumping them all in one directory, let’s replicate the path structure they’ve been saved to:
# Get the list of objects to download. Note that the Key search is case sensitive,
# hence using larinet instead of clarinet or Clarinet, etc.
aws s3api list-objects --bucket "my-bucket-name" --prefix 'path/to/files/' --query "Contents[?(contains(Key,'larinet') || contains(Key,'lute') || contains(Key,'ax')) && contains(Key,'.sib')]" > files.json
# Loop through the list, grabbing each object and putting it in the relevant folder.
# IFS= and -r stop read from trimming whitespace or mangling backslashes in keys.
jq -r '.[] | .Key' files.json | while IFS= read -r key
do
  dir="$(dirname "$key")"
  file="$(basename "$key")"
  # Recreate the object's folder structure locally before downloading into it
  mkdir -p "$dir"
  aws s3api get-object --bucket "my-bucket-name" --key "$key" "${dir}/${file}"
done
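For comparison, the higher-level `aws s3 cp` command can do a similar job in one go using its `--recursive`, `--exclude` and `--include` filters. This is a rough sketch rather than what I actually ran: `download_matching` is a made-up wrapper name, and the glob patterns are slightly stricter than the `--query` above because they require `.sib` at the end of the key.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around `aws s3 cp --recursive`: exclude everything,
# then re-include only keys matching the instrument patterns. Any extra
# arguments (for example --dryrun, to preview without downloading) are
# passed straight through to the CLI.
download_matching() {
  local bucket="$1" prefix="$2"
  shift 2
  aws s3 cp "s3://${bucket}/${prefix}" . --recursive \
    --exclude "*" \
    --include "*larinet*.sib" \
    --include "*lute*.sib" \
    --include "*ax*.sib" \
    "$@"
}

# Example call with the placeholder names from this post:
#   download_matching "my-bucket-name" 'path/to/files/' --dryrun
```

As far as I understand it, the files land in folders relative to the prefix rather than under the full key path, so the local layout differs a little from the loop above.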
All links, in order of mention:
- AWS S3: https://aws.amazon.com/s3/
- Glacier: https://aws.amazon.com/s3/storage-classes/glacier/
- bucket: https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingBucket.html
- aws s3api list-objects command: https://docs.aws.amazon.com/cli/latest/reference/s3api/list-objects-v2.html
- --query parameter: https://docs.aws.amazon.com/cli/latest/userguide/cli-usage-filter.html#cli-usage-filter-client-side