Security Monitoring in Public Cloud

What should security monitoring look like in an AWS EC2 environment? Consider an EC2 environment running standard Linux AMIs with two use cases: detection and response. Detection is best supported by streams of real-time events emitted by the endpoints. You can run signatures, heuristics, and ML on these streams to detect suspicious conditions and trigger automated response: block, bounce, increase the level of monitoring, etc. Response generally involves a human who wants to ask questions about the entire fleet, a subset, or a single instance. With public cloud there is a new layer of relevant information, in our case from CloudTrail, CloudWatch, VPC flow logs, and other AWS sources. More on those later; we also need traditional system-level monitoring of instances from our side of the shared responsibility model.

For detection, ‘pull’ is not ideal. Many solutions rely on polling at some interval, which leaves a window between pulls. An attacker who is aware of your logging then has an opportunity to cover their tracks. For detection I prefer push, or event-driven, solutions: a system event occurs, a log entry is created and immediately sent off-instance for remote analysis and storage. The drawback is the limited context available at the moment of generation with which to determine relevance. Filtering and selection are thus limited, and a greater number of events must be sent to the mothership.
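As a minimal sketch of the push model, rsyslog can forward every event to a remote collector the moment it is written, buffering to disk if the collector is unreachable (the hostname and port here are placeholders):

```conf
# /etc/rsyslog.d/60-forward.conf -- illustrative; collector address is hypothetical
# Disk-assisted queue so events survive a collector outage.
$ActionQueueType LinkedList
$ActionQueueFileName fwdq
$ActionResumeRetryCount -1
# Forward everything over TCP (@@) as soon as it is generated.
*.* @@log-collector.example.internal:514
```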

Response queries are by nature request-driven, so polling is a non-issue; however, in the context of AWS there is little value in leaving logs on-instance, given that traffic within a region is free and storage in S3 is cheaper than EBS. So long as log generation and shipment don’t overly impact performance (the subject of another post), you should get data off-instance as fast as possible. Your response queries can then be run across the central repository rather than against each endpoint.
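To make the central-repository model concrete, here is the kind of fleet-wide question a responder might ask. The table and column names are hypothetical, assuming audit events have been landed in a SQL-queryable store:

```sql
-- Hypothetical schema: one row per audit event, tagged with instance_id.
-- "Which instances executed wget in the last 24 hours, and with what arguments?"
SELECT instance_id, ts, exe, args
FROM audit_events
WHERE syscall = 'execve'
  AND exe LIKE '%wget%'
  AND ts > now() - interval '24' hour;
```

The same question asked endpoint-by-endpoint would require every instance to be up and reachable; against the warehouse it is a single query.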

Activities on a Linux system require interaction with the kernel through system calls, so that is a good place to monitor. Linux provides many options for capturing syscalls, including the audit framework, which consists of a kernel component (kauditd) and a userland daemon (auditd), plus various utilities. kauditd emits filtered events to userland via the auditd daemon. auditd generates syslog- and Basic Security Module (BSM)-compatible output; for JSON fans, go-audit is another option. Paired with rsyslog for reliable transport off-instance and a data pipeline, you have the basic plumbing for log generation and storage.
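A flavor of what audit rules look like, as an illustrative fragment rather than a complete policy (what to monitor is a future post):

```conf
# /etc/audit/rules.d/example.rules -- illustrative, not a recommended ruleset
# Record every execve() on 64- and 32-bit ABIs, keyed for later search.
-a always,exit -F arch=b64 -S execve -k exec
-a always,exit -F arch=b32 -S execve -k exec
# Watch credential files for writes and attribute changes.
-w /etc/passwd -p wa -k identity
-w /etc/shadow -p wa -k identity
```

The `-k` keys tag matching events so downstream filtering and search can group them without re-parsing the syscall fields.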

The data pipeline used to move these events should provide reliable transport and scale along with your service. Amazon offers Kinesis as a streaming platform, which Airbnb has used successfully. Netflix already has a robust logging solution for application logs built around Kafka, so for us it makes more sense to leverage that and point rsyslog at a Kafka topic.
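Pointing rsyslog at Kafka can be done with the omkafka output module; a sketch, where the broker address and topic name are placeholders for your own pipeline:

```conf
# Illustrative rsyslog -> Kafka wiring; broker and topic are hypothetical.
module(load="omkafka")
action(
  type="omkafka"
  broker=["kafka-1.example.internal:9092"]
  topic="security-events"
  confParam=["compression.codec=snappy"]
)
```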

Alternatives

There are other options on the host, like OSSEC, OSQuery, and Sysdig. While these have more cachet than ugly old BSM, they serve different purposes. There are also emerging kernel capabilities around eBPF that show promise.

OSSEC can collect and ship logs, but its primary value proposition is as a Host Intrusion Detection System (HIDS): it can take action on its own using pre-built logic. For ephemeral microservices we can be more aggressive in our response actions (instances are cattle, not pets) and exert external control to terminate instances that appear to be sick. A heavyweight HIDS running locally must itself be protected and requires more resources to operate, so remote analysis is a better fit in a public cloud microservice architecture.

OSQuery collates several information sources and presents them through an easily queried SQL interface. This is excellent for response activities driven by human-generated questions. It is not designed as a streaming service, but the daemon can perform recurring queries at any interval. OSQuery shines if your endpoints have intermittent connectivity (user machines), but in a data center connectivity is seldom the issue. In the public cloud I prefer to get events off-instance as quickly as possible and perform both monitoring and response queries against the data warehouse rather than across the fleet of endpoints.
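For comparison, an osquery scheduled query looks like the following configuration fragment (the pack name and interval are illustrative; `listening_ports` is a real osquery table):

```json
{
  "schedule": {
    "listening_ports": {
      "query": "SELECT pid, port, protocol FROM listening_ports;",
      "interval": 300
    }
  }
}
```

Note the model: osqueryd re-runs the query every 300 seconds and reports differences, which is polling with a diff on top, not a stream of events as they occur.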

Sysdig is the latest addition. They have created a kernel module that passes events to a ring buffer shared with user space. Filtering and selection occur in user space, which has a performance cost compared with filtering in the kernel but gives you more flexibility and safety. The introduction of another third-party kernel module makes me a bit nervous; there are already enough ways to hook system calls in the mainline kernel.

eBPF implements a virtual machine within the kernel that can just-in-time (JIT) compile filtering rules, delivered as bytecode, into native code, potentially yielding very flexible and performant queries. This is an active area of kernel development that we are following, and our own Brendan Gregg has written extensively on it. In Xenial it works out of the box.

Conclusion

On current mainline kernels the best bet is to use the built-in audit framework and native logging (auth.log) to generate security-relevant data, and rsyslog to connect to a data pipeline. More on what to monitor, performance tuning, and how to respond in future posts.