AWS DFIR

I previously discussed some of the challenges and opportunities of forensic response in a public cloud. Since you don’t own the datacenter, traditional tactics of grabbing boxes and imaging drives with write-blockers don’t apply. The goal is still to capture an unaltered image of both disk and live memory, preferably without any interaction with the target system. Network traffic is another useful optic, and in traditional models it can be captured out-of-band through taps without disturbing the target. In reality, it is often inefficient to answer your questions, especially fleet-wide, by capturing and analyzing full live memory, in which case you might also consider live response capabilities to query the running system. The danger is that by interacting with the system you might alter the evidence or tip off the adversary. Ideally you will have already logged the relevant activity and stored it off the target. All of these optics are valuable, and their value compounds when combined. Deception is difficult to sustain across multiple channels: the OS might lie, but that lie is hard to hold up against correlation with network data.

In AWS the snapshot API for EBS-backed instances is an excellent tool. The snapshot process is well described in several places, whether using the web console or the aws cli. Capturing live memory is more difficult. While Xen supports memory image export, AWS has not exposed this functionality via an API, so you have to move to a live response model and run the capture from the target instance, which an attacker could detect. Similarly, there is no API for performing network capture beyond VPC Flow Logs, so if you need that optic you have to run tcpdump locally on the target or route your traffic through another instance acting as a proxy. Where you need a live response model, AWS does offer the SSM send_command API if you run the SSM agent on your instances. Otherwise you can fall back on traditional methods: agents or service accounts. I hope to share more on how we are automating this as I am able to generalize our Netflix-specific scripts.
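As a rough illustration, a minimal boto3 sketch of both capture paths might look like the following. The volume ID, instance ID, and the memory-capture script are placeholders, and the SSM call assumes the standard AWS-RunShellScript document.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
ssm = boto3.client("ssm", region_name="us-east-1")

# Disk: snapshot the EBS volume backing the suspect instance.
# No interaction with the guest OS is required.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",          # placeholder volume ID
    Description="DFIR evidence capture - case 1234",
)
print("Snapshot started:", snapshot["SnapshotId"])

# Memory: there is no snapshot-style API, so fall back to live response via SSM.
# This runs a capture tool *on* the target, which an attacker could observe.
response = ssm.send_command(
    InstanceIds=["i-0123456789abcdef0"],       # placeholder instance ID
    DocumentName="AWS-RunShellScript",
    Parameters={"commands": ["/opt/forensics/capture_memory.sh"]},  # hypothetical capture script
)
print("Command sent:", response["Command"]["CommandId"])
```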

With its microservices architecture, Netflix is well positioned to take advantage of differential analysis across Auto Scaling Groups (ASGs). Within any given ASG an application is composed of tens to thousands of copies of the same instance running in parallel. Requests are randomly distributed across this group, and members are added and removed on the fly. Our Chaos Monkey will terminate instances at random, and our performance team will terminate them if they seem to be going off the rails. This gives us a lot of flexibility in automating response actions on suspicious instances, since the cost of knocking one offline is low. In the traditional model there is always that one business-critical server with five years of uptime that, if knocked over, may never come back.
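To sketch what an automated response action could look like, the hedged boto3 snippet below detaches a suspicious instance from its ASG (so a healthy replacement is launched) and swaps its security group for a restrictive quarantine group. The ASG name, instance ID, and security group ID are placeholders, not our actual tooling.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")
ec2 = boto3.client("ec2", region_name="us-east-1")

INSTANCE_ID = "i-0123456789abcdef0"     # placeholder: the suspicious instance
ASG_NAME = "myapp-v042"                 # placeholder: its Auto Scaling Group
QUARANTINE_SG = "sg-0123456789abcdef0"  # placeholder: deny-all "quarantine" security group

# Detach the instance so the ASG launches a replacement and stops
# routing new requests to it.
autoscaling.detach_instances(
    InstanceIds=[INSTANCE_ID],
    AutoScalingGroupName=ASG_NAME,
    ShouldDecrementDesiredCapacity=False,
)

# Swap its security groups for the quarantine group so it can no longer
# talk to the rest of the fleet while we snapshot and analyze it.
ec2.modify_instance_attribute(InstanceId=INSTANCE_ID, Groups=[QUARANTINE_SG])
```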

Machine learning has been heavily marketed in the security space recently, but one area where ML is actually effective is anomaly detection using various clustering algorithms. Our hypothesis is that if an attacker is able to exploit their way onto an instance, but not able to exploit them all simultaneously (due to out-of-band load balancing), we should be able to detect changes in behavior relative to the herd. The microservices model also gives us a set of ‘likely-good’ systems to compare against. Pre-deployment, we can profile a fresh instance during the bake-and-deploy pipeline. Then, while instances are running, we can constantly compare them to this profile and to the rest of the herd as they handle requests. If one of the instances begins to deviate from the herd, an alert is generated in our monitoring pipeline, and forensic response and analysis should be greatly simplified through differential techniques.
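As a toy example of the herd comparison, assume we already collect a small numeric feature vector per instance (e.g. process count, listening ports, outbound connections). The sketch below flags instances whose robust z-score against the herd median exceeds a threshold; the feature names and threshold are illustrative, not a description of our actual pipeline.

```python
import numpy as np

def flag_outliers(features: dict[str, list[float]], threshold: float = 3.5) -> list[str]:
    """Return instance IDs whose feature vectors deviate from the herd.

    `features` maps an instance ID to a numeric feature vector, e.g.
    [process_count, listening_ports, outbound_connections].
    Uses a robust z-score (median / MAD) so a single compromised instance
    does not drag the baseline toward itself.
    """
    ids = list(features)
    matrix = np.array([features[i] for i in ids], dtype=float)

    median = np.median(matrix, axis=0)
    mad = np.median(np.abs(matrix - median), axis=0)
    mad[mad == 0] = 1e-9                      # avoid division by zero on constant features

    # 0.6745 scales the MAD so scores are comparable to standard deviations.
    scores = 0.6745 * np.abs(matrix - median) / mad
    return [iid for iid, row in zip(ids, scores) if row.max() > threshold]

# Example: nine instances behave alike, one has far more outbound connections.
herd = {f"i-{n:03d}": [120.0, 7.0, 40.0] for n in range(9)}
herd["i-999"] = [130.0, 7.0, 400.0]
print(flag_outliers(herd))   # ['i-999']
```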

Differential techniques can apply to both live memory and disk images. Consider a file system disk image: blob_extractor could be run on each captured image and only the differential files forwarded to an analyst for review. This could be further refined by suppressing files that are known to change but contain little relevant data. New or suspicious executable files could be sent on for further processing, such as checking their hashes against known databases, evaluating them with static analysis tools, or live detonation. If configuration changes were detected, they could be compared against CIS and other baselines to determine whether the system had been weakened. Heuristic triggers, such as File Integrity Monitoring around user, network, and credential configurations, could be highlighted for analysts… along with many more ideas security professionals might come up with.
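A rough sketch of the file-level diff, assuming the baseline and captured images are already mounted as directory trees (the mount points and noise list are placeholders):

```python
import hashlib
import os

# Paths that change on every instance but rarely carry forensic value (placeholders).
NOISE_PREFIXES = ("var/log/", "tmp/", "var/cache/")

def hash_tree(root: str) -> dict[str, str]:
    """Map relative file path -> SHA-256 digest for every regular file under root."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if rel.startswith(NOISE_PREFIXES) or not os.path.isfile(full):
                continue
            with open(full, "rb") as fh:
                digests[rel] = hashlib.sha256(fh.read()).hexdigest()
    return digests

def differential_files(baseline_root: str, capture_root: str) -> list[str]:
    """Files that are new or modified relative to the baseline image."""
    baseline = hash_tree(baseline_root)
    capture = hash_tree(capture_root)
    return sorted(rel for rel, digest in capture.items() if baseline.get(rel) != digest)

# e.g. mounts built from the baked AMI and the captured EBS snapshot (hypothetical paths)
for path in differential_files("/mnt/baseline", "/mnt/evidence"):
    print(path)
```

Only the files this step surfaces would then flow to hash lookups, static analysis, or detonation.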

In terms of live memory, we should be able to pursue a similar approach with Volatility by running plugins across multiple captured images and looking for non-trivial variations. We could also potentially discard large redundant sections of the memory map to scale processing and storage. These same differential techniques at the process level could give us a shot at detecting in-memory attackers.
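For live memory, a hedged sketch of the same idea using Volatility 2’s command-line interface is below. The profile name and image paths are placeholders, and in practice you would parse the plugin output rather than diff raw lines, since fields like PIDs and timestamps differ legitimately between instances.

```python
import subprocess

PROFILE = "LinuxUbuntu1604x64"   # placeholder Volatility profile matching the fleet's AMI

def plugin_lines(image_path: str, plugin: str = "linux_pslist") -> set[str]:
    """Run a Volatility plugin against one memory image and return its output lines."""
    result = subprocess.run(
        ["vol.py", "-f", image_path, f"--profile={PROFILE}", plugin],
        capture_output=True, text=True, check=True,
    )
    return set(result.stdout.splitlines())

def herd_diff(images: dict[str, str], plugin: str = "linux_pslist") -> dict[str, set[str]]:
    """Return, per instance, the plugin output lines that no other instance produced."""
    outputs = {iid: plugin_lines(path, plugin) for iid, path in images.items()}
    anomalies = {}
    for iid, lines in outputs.items():
        others = set().union(*(v for k, v in outputs.items() if k != iid))
        anomalies[iid] = lines - others
    return anomalies

# Hypothetical capture paths, one memory image per instance in the ASG.
images = {"i-001": "/evidence/i-001.lime", "i-002": "/evidence/i-002.lime"}
for iid, extra in herd_diff(images).items():
    if extra:
        print(iid, "has unique entries:", extra)
```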

Our efforts in this area are just getting started and we are always looking for folks to collaborate with!