AWS Ops Rig

Every responder needs a solid platform from which to operate: something fast, secure, and pre-loaded with relevant tools. Conducting incident response in a public cloud follows the same principles as traditional IR, but the mechanics differ. The public cloud is, after all, just other people’s servers; the ‘other people’s’ part is what changes things. You can’t roll into Amazon’s data center and grab a disk out of the rack, but Amazon provides an API for that. Their ‘shared responsibility’ model means they handle the hypervisor and below; everything above that is on you. We are moving towards a serverless approach to forensic response leveraging Lambda functions and AWS APIs; however, there is always a need for a solid ops box stocked with tools and enough permissions to get the job done.

Design Goals:
  1. AWS EC2 native - leverage AWS APIs; minimize external dependencies - can’t rely on much in a crisis.

  2. Ephemeral instances - spin up fresh, clean, and quick, and kill them off once the needed data is persisted to S3.

  3. Tools for response on Debian/Ubuntu AMIs running as headless servers (no GUI).

  4. Use corporate services day-to-day, but provide fallback options if they are down.

Selecting a base AMI

I first considered a minimal or grsecurity-hardened Linux distribution, like Alpine, to provide a bit of system diversity that might thwart highly sophisticated attacks against our production kernel. In the end this was more trouble (baking a custom AMI from an ISO for every update) than the potential security benefit was worth, so we went with the same family as the target environment, but leveraged the latest public build and launched through built-in AWS features rather than our standard CI/CD pipeline. Jeremy Heffner created a useful script for finding the latest build in a given region (since the AMI ID varies across regions).
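If you only need the lookup itself, a single describe-images call gets you most of the way there. The sketch below is an assumption about how such a lookup might work, not Jeremy’s script; the name filter is a placeholder you would pin to your chosen release.

    # Minimal sketch: find the newest official Ubuntu Server HVM AMI in the
    # current CLI region. Owner 099720109477 is Canonical; tighten the name
    # filter to the release you standardize on.
    aws ec2 describe-images \
      --owners 099720109477 \
      --filters 'Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-*-amd64-server-*' \
                'Name=state,Values=available' \
      --query 'sort_by(Images, &CreationDate)[-1].ImageId' \
      --output text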

Configuration

Many traditional IR Linux distributions are over-provisioned with tools that do not apply in EC2 (wireless sniffers, for example). We started with popular forensic collections and down-selected to EC2-relevant sets. The goal is speedy initialization of a forensic AMI. In most cases our response will be conducted with access to the AWS-hosted mirrors of the apt repositories, so in our ‘light’ configuration we only download essential tools and then rely on apt-get to pull down additional resources depending on where the case takes us. Note that we still have a ‘heavy’ fallback instance in case the repos are down.

Potentially relevant toolkits:

In selecting from these lists we considered disk, memory and network forensic capabilities.

The core of the disk forensic toolkit is sleuthkit plus plaso (log2timeline), which help analyze the relevant Linux disk artifacts. You also need the AWS CLI to capture and mount disk snapshots for analysis. Many of the usual recovery tools are not needed: EBS volumes are backed by redundant hardware, making accidental data loss unlikely. At the ext4 level, undelete tooling is useful for recovering intentionally deleted files, along with various compression and viewing tools.
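The snapshot-and-mount workflow is only a few CLI calls. The sketch below is a hedged example with placeholder volume, snapshot, instance, and device IDs, and it assumes the new volume is created in the forensic instance’s availability zone.

    # 1. Snapshot the suspect EBS volume (the instance can keep running).
    aws ec2 create-snapshot --volume-id vol-0123456789abcdef0 \
      --description "case-1234 suspect root volume"

    # 2. Create a new volume from that snapshot in the forensic instance's AZ.
    aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
      --availability-zone us-east-1a

    # 3. Attach the new volume to the forensic instance.
    aws ec2 attach-volume --volume-id vol-0fedcba9876543210 \
      --instance-id i-0aaaabbbbccccdddd --device /dev/sdf

    # 4. Mount it read-only (noload skips ext4 journal replay) for analysis
    #    with sleuthkit/plaso.
    sudo mkdir -p /mnt/evidence
    sudo mount -o ro,noexec,noload /dev/xvdf1 /mnt/evidence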

In terms of memory tools, volatility (or rekall) is key. Although there is no API for capturing memory from an instance, you can always SSH in and run LiME. Volatility currently struggles with paravirtualized memory structures, but HVM instances work well and the developers are working towards a solution for PV.
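A hedged sketch of that workflow follows; the host names, bucket, module path, and volatility profile name are all placeholders, and it assumes a LiME module already built for the target’s kernel version.

    # Load LiME on the target over SSH and dump memory in lime format.
    ssh ubuntu@<target-instance> \
      'sudo insmod ./lime-$(uname -r).ko "path=/tmp/mem.lime format=lime"'

    # Pull the capture back and persist it to the evidence bucket.
    scp ubuntu@<target-instance>:/tmp/mem.lime ./mem.lime
    aws s3 cp ./mem.lime s3://<ir-evidence-bucket>/case-1234/mem.lime

    # Analyze with a volatility profile matching the target kernel.
    vol.py --profile=LinuxUbuntu1604x64 -f mem.lime linux_pslist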

Network tools. There is a limited amount of network data that can be gathered from outside the instance, as AWS does not provide network capture. You can SSH to an instance and run local tools to generate PCAP. We run forensic boxes headless, so tcpdump and tshark are the backbone of our toolchain, although one could run X11 as well.
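Something along these lines works for an over-SSH capture; the interface, host, and filter are placeholders, and the filter assumes you want to keep your own SSH session out of the capture.

    # Capture on the target and stream the pcap back to the forensic box.
    ssh ubuntu@<target-instance> \
      'sudo tcpdump -i eth0 -U -w - "not port 22"' > case-1234.pcap

    # Quick first-pass triage on the headless instance: IP conversation summary.
    tshark -r case-1234.pcap -q -z conv,ip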

You can find an example cloud-init script here. You can pass this script as ‘user data’ when launching an instance from the AMI, and it will pull in the relevant packages and establish a user account. You will still need to configure network, security group, and IAM permissions for the instance, so create launch profiles for each region if you don’t have an automated script.
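The linked script is the real thing; as a rough illustration of the shape, a ‘light’ user-data script might look like the sketch below, where the account name, SSH key, and package list are assumptions rather than our exact configuration.

    #!/bin/bash
    # Hedged sketch of a 'light' user-data script: create a responder account
    # and install the core toolset from the AWS-hosted apt mirrors.
    set -euo pipefail

    # Responder account with sudo and a pre-staged SSH key (placeholder key).
    useradd -m -s /bin/bash -G sudo responder
    install -d -m 700 -o responder -g responder /home/responder/.ssh
    echo "ssh-ed25519 AAAA...placeholder responder@sirt" \
      > /home/responder/.ssh/authorized_keys
    chown responder:responder /home/responder/.ssh/authorized_keys
    chmod 600 /home/responder/.ssh/authorized_keys

    # Essentials only; everything else comes down via apt as the case develops.
    apt-get update
    apt-get install -y sleuthkit extundelete tcpdump tshark awscli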

Heavy Fallback

For daily use we spin up a freshly built instance; however, in case the AWS apt repos go down, we also stage a ‘heavy’ AMI with a wider range of tools. The downside to that approach is that the baked AMI doesn’t get rebuilt against the latest base image as often, and AMIs have to be provisioned per region: you either need one in every region you operate in, or you are forced to bring large capture files across regions, which takes time (and costs money). In the future we may create a Lambda function to rebake and pre-stage AMIs in each region every time a new Ubuntu base image is released. To get started, the light approach with a cloud-init script will get you up and running with just the EC2 web console.
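Pre-staging the heavy image is itself just a copy per region; a hedged sketch, with placeholder AMI ID and regions, is below.

    # Copy the baked 'heavy' AMI from its home region into another operating
    # region so large captures never have to cross regions later.
    aws ec2 copy-image \
      --source-region us-east-1 \
      --source-image-id ami-0123456789abcdef0 \
      --region us-west-2 \
      --name sirt-heavy-forensics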

Permissions

Security folks love least privilege, except of course when it is applied to them. The forensic instance will need some pretty capable privileges to get the job done, but they should be used sparingly. It is best to decompose the privileges into roles: the instance itself should start with almost no privilege, but be allowed to assume into roles in order to get things done. This is especially important if you are running across zones and multiple accounts.
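In practice that means the instance profile carries little beyond sts:AssumeRole, and the responder (or a script) picks up scoped credentials on demand. A hedged sketch with a placeholder account ID and role name:

    # Assume the scoped forensics role in the target account for this case only.
    aws sts assume-role \
      --role-arn arn:aws:iam::111122223333:role/SIRTForensicsRole \
      --role-session-name case-1234

    # Export the returned temporary credentials (AccessKeyId, SecretAccessKey,
    # SessionToken), or configure a named profile with role_arn/source_profile
    # in ~/.aws/config so subsequent CLI calls use the assumed role.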

The forensic instances should run from a privileged account, perhaps their own dedicated account. Access to the instances can be controlled either manually, by adding SSH keys, or through your corporate identity solution. For day-to-day activity I prefer the corporate services for the added value they bring in account control, audit, detection, etc. It is, however, important to maintain a ‘break glass’ capability in the event the corporate identity server goes down. This could be as simple as a shared SIRT SSH key stored securely.

Instances can be launched with launch profiles or by a Lambda function triggered via the API Gateway. They should sit in a restrictive security group, and their initial role should only allow limited read access to a dedicated S3 bucket (or buckets). You may want to pre-stage some useful scripts in S3 (if you don’t want to pull from git). The security group should whitelist IPs from the responders’ VPN concentrator (or desktop network if no VPN is available) and block all other traffic. If there is a need to SSH into other instances, the AWS CLI should be used with the forensics role to open up those routes.
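The ingress side of that security group is a single rule; a hedged sketch with a placeholder group ID and VPN CIDR follows (security groups deny ingress by default, so nothing else needs to be blocked explicitly).

    # Allow SSH to the forensic instance only from the responder VPN range.
    aws ec2 authorize-security-group-ingress \
      --group-id sg-0123456789abcdef0 \
      --protocol tcp --port 22 \
      --cidr 198.51.100.0/24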

There should be a role in each account and region that the ForensicInstanceRole can assume into, providing permissions for API actions to achieve the following results:

  1. SSH access to an instance under investigation.

    1. Establish a network route. In a VPC, add a security group to the instance, or even a new network interface.

    2. Push SSH keys if they are not pre-staged with the SSM agent, or via more involved tricks with EBS.

  2. Capture a disk image using the snapshot API. Mount that volume read-only on the forensic instance and restrict other access.

  3. Capture memory of an instance and store it in S3. Use SSH access, or Run Command for SSM-managed instances, to execute LiME. Copy the output to an S3 bucket that is write-only for the target instance but readable by the ForensicInstanceRole.

  4. Inspect an instance and gather AWS information on it using describe* calls, CloudWatch, CloudTrail, VPC Flow Logs, and any other available resource.

  5. Quarantine an instance. In a VPC, revoke the existing security groups and add a new one with egress only through a SIRT-controlled proxy and SIRT inbound SSH (see the sketch after this list).

  6. Terminate an instance.
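For the quarantine step, a hedged sketch with placeholder instance and security group IDs; it assumes a pre-built quarantine group that only allows SIRT SSH in and proxy-bound traffic out.

    # Replace the instance's security groups with the single quarantine group
    # (sg-0fedcba9876543210 standing in for the SIRT quarantine group).
    aws ec2 modify-instance-attribute \
      --instance-id i-0aaaabbbbccccdddd \
      --groups sg-0fedcba9876543210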

We are working on Lambda functions for each of these outcomes, accessed through an authenticated API call rather than requiring CLI-fu, but both approaches are useful. If your service is multi-region, you will want to provision and launch instances (and Lambdas) in the same region as the instance under investigation. Any data-intensive capture (memory snapshot, EBS snapshot, network capture) will be more efficient in terms of cost and time if it lands in a write-only S3 bucket in the same region. You will also need to duplicate security groups and roles across all regions and accounts. Stay tuned for examples of IAM roles and boto scripts that achieve these results in future posts (as we generalize our internal tools).