One thing that's very exciting about the new company is that I get to use quite a bit of Ruby, and that we're hosted entirely on Amazon Web Services. We currently use EBS, ELB, EC2, S3 and CloudFront for our environment. The last time I used AWS in a professional setting, it didn't even have Elastic IPs, much less EBS with snapshots and all the other nice stuff that makes it viable for a production environment. I did, however, manage to keep abreast of changes using my own personal AWS account.
Of course the combination of Ruby and AWS really means one thing - Fog. And lots of it.
When EngineYard announced the sponsorship of the project, I dove headlong into the code base and spent what time I could trying to contribute code back. The half-assed GoGrid code in there right now? Sadly, some of it is mine. Time is hard to come by these days. Regardless, I'm no stranger to Fog, so when I had to dive into the environment and start getting it documented and automated, Fog was the first tool I pulled out. And when the challenge of verifying our EBS snapshots came up (we're currently at a little over 700 of them), I had no choice but to automate it.
A little bit about the environment:
- A total of 9 EBS volumes are snapshotted each day
- 8 of the EBS volumes make up RAID 0 MySQL data stores across two DB servers (4 disks on one, 4 disks on the other)
- The remaining EBS volume is a single MySQL data volume
- Filesystem is XFS and backups are done using the Alestic ec2-consistent-snapshot script (which currently doesn't support tags)
The end goal is a rolling set of validated snapshots: 7 daily, 3 weekly, 2 monthly. Fun!
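To make the retention numbers concrete, here's a rough sketch in plain Ruby of how a 7 daily / 3 weekly / 2 monthly selection could work. This is not the code from the actual setup: the method names `snapshots_to_keep` and `snapshots_to_expire` are made up for illustration, and it assumes that "weekly" means the newest snapshot in each ISO week and "monthly" means the newest snapshot in each calendar month.

```ruby
require 'date'
require 'set'

# Hypothetical helper: given the dates of verified snapshots, return the
# set of dates to keep under a daily/weekly/monthly retention policy.
# Assumption: the weekly and monthly picks are the newest snapshot in
# each ISO week and each calendar month respectively.
def snapshots_to_keep(dates, daily: 7, weekly: 3, monthly: 2)
  sorted = dates.uniq.sort.reverse # newest first

  dailies = sorted.first(daily)

  # group_by preserves insertion order, so groups come out newest first
  # and the first member of each group is the newest in that period.
  weeklies = sorted.group_by { |d| [d.cwyear, d.cweek] }
                   .values.map(&:first).first(weekly)
  monthlies = sorted.group_by { |d| [d.year, d.month] }
                    .values.map(&:first).first(monthly)

  Set.new(dailies + weeklies + monthlies)
end

# Hypothetical helper: everything not covered by the policy, oldest first.
def snapshots_to_expire(dates, **policy)
  keep = snapshots_to_keep(dates, **policy)
  dates.uniq.reject { |d| keep.include?(d) }.sort
end
```

Note that the three buckets overlap (the newest snapshot is a daily, a weekly and a monthly all at once), so the number of dates kept is usually smaller than 7 + 3 + 2.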
Mapping It Out
Here was the attack plan I came up with:
- Identify snapshots and groupings where appropriate (RAID 0, remember?)
- Create volumes from snapshots
- Create an m1.xlarge EC2 instance to test the snapshots
- Attach volume groups to the test instance
- Assemble the array on the test instance
- Start MySQL using the snapshotted data directory
- Run some validation queries using some timestamp columns in our schema
- Stop MySQL, unmount volume, stop the array
- Detach and destroy the volumes from the test instance
- Tag the snapshots as "verified"
- Roll off any old snapshots based on retention policy
- Automate all of the above!
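The first step of that plan, grouping snapshots into sets that belong together, matters because a RAID 0 array is only restorable if every member disk has a snapshot from the same run. Here's a minimal sketch of that grouping in plain Ruby. Everything named here is an assumption for illustration: the `Snapshot` struct, the volume IDs, the group names, and the hand-maintained volume-to-group map (used because the snapshot script doesn't support tags). In practice the snapshot records would come from the EC2 API.

```ruby
# Hypothetical snapshot record with just the fields the grouping needs.
Snapshot = Struct.new(:id, :volume_id, :started_on)

# Hypothetical map of EBS volume IDs to logical groups: two 4-disk
# RAID 0 arrays plus one standalone data volume.
VOLUME_GROUPS = {
  'vol-db1a' => 'db1-raid0', 'vol-db1b' => 'db1-raid0',
  'vol-db1c' => 'db1-raid0', 'vol-db1d' => 'db1-raid0',
  'vol-db2a' => 'db2-raid0', 'vol-db2b' => 'db2-raid0',
  'vol-db2c' => 'db2-raid0', 'vol-db2d' => 'db2-raid0',
  'vol-solo' => 'single'
}.freeze

# Returns a hash of [group, day] => [snapshots], keeping only sets in
# which every volume of the group has a snapshot for that day.
def complete_snapshot_sets(snapshots, volume_groups = VOLUME_GROUPS)
  # How many distinct volumes each group is expected to have.
  expected = Hash.new(0)
  volume_groups.each_value { |group| expected[group] += 1 }

  known = snapshots.select { |s| volume_groups.key?(s.volume_id) }
  sets = known.group_by { |s| [volume_groups[s.volume_id], s.started_on] }

  sets.select do |(group, _day), members|
    members.map(&:volume_id).uniq.size == expected[group]
  end
end
```

Only the complete sets would move on to the create-volume and attach steps; an incomplete set (say, 3 of 4 disks) is dropped here rather than failing later when the array won't assemble.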
I've got lots of code samples and screenshots, so I'm breaking this up into multiple posts. Hopefully part 2 will be up some time tomorrow.