70. Monitoring, Privacy, and Security in Public Cloud
Hosted by Robert Blumen, with guest Sean Porter.
We're used to monitoring our applications for metrics like performance and availability. But increasingly, teams are also extending monitoring practices towards security and privacy of their systems and data. Whether it's ensuring that there isn't a wide configuration drift, or even ensuring that there are no anomalies on how a server is being accessed, monitoring internal server metrics can help provide engineers and DevOps teams enhanced peace of mind. Sean Porter, the CTO of Sensu, discusses these monitoring tactics with Robert Blumen, a DevOps engineer at Salesforce.
Robert Blumen is a DevOps engineer at Salesforce, and he's interviewing Sean Porter, the CTO of Sensu, a cloud monitoring platform. Monitoring your infrastructure often looks like keeping track of the four golden signals: latency, throughput, error rate, and saturation. To that, Sean advocates identifying data specifically catered to security and privacy. For example, with regards to intrusion detection, a company could track the rate at which unauthorized attempts are being made, and where they're coming from. This could signal potential weak spots in the system or software which malicious actors are probing. Armed with this data and analysis, one could reinforce their security.
More broadly, intrusion detection is really about monitoring changes to your system's state. You could take a snapshot of your entire file system, from permissions of folders to the individual bytes of each binary; by recording the information of a known "good" state, you can track any changes that are occurring. You would be able to identify the rate at which your servers are undergoing configuration drift, or be notified if key system software, such as ssh or ps, has been tampered with. Monitoring your security is about taking a proactive approach to observing any state change on a machine, not necessarily whether unauthorized ports are being sniffed.
With regards to privacy, you could build some auditing functionality to ensure that you're not exposing any user information you shouldn't be. One approach might be to monitor whether numbers that look like a credit card are accidentally showing up in your logs. It's also important to be mindful of compliance with regulations like GDPR. GDPR stipulates that users must give explicit permission for the ways in which you store and make use of their information. Sean points out that there are tracing systems which can track a user's movement from their browser navigation through each microservice they transparently access. Your monitoring system would want to keep an eye on these flows and ensure that every system is behaving appropriately.
Links from this episode
- Sensu is a platform that automates monitoring workflows
- Sensu Go open source projects
- Sean Porter articles on DevOps
- Sean Porter technical blog
- Sean Porter talk on monitoring architecture patterns
- Google SRE book (free online version)
- CPU vulnerabilities like Spectre are a new breed of attack on public cloud systems
- Terraform offers tools for managing configuration drift
Robert: For Code[ish], this is Robert Blumen. I'm a DevOps engineer at Salesforce, and with me today is Sean Porter. Sean has over a decade of experience in DevOps and infrastructure, and is the co-founder of Sensu, a provider of open-source monitoring. We'll be talking about monitoring, privacy, and security in public cloud. Sean, welcome to Code[ish].
Sean: Thanks for having me, Robert.
Robert: We are going to be talking about monitoring, privacy, and security, but I want to start out with a more general discussion about monitoring. Monitoring has its origins in infrastructure monitoring, performance, and availability. The Google SRE book, which has been very influential, talks about the four golden signals which are latency, throughput, error rate, and saturation. None of those are either security or privacy. How do you extend the idea of monitoring to security and privacy?
Sean: I think the first... While the golden signals are good for identifying that there's an issue with the system, distributed or not, at any scale, more context is always required in order to troubleshoot the issues that those golden signals identify. It comes down to like context is really key to quicker resolutions and more effective retrospectives. To collect this context, we trace, we log. We also periodically probe our services to measure and record that data. That process of data collection and analysis really extends beyond just those normal signals.
Robert: It does. Let me ask you another question, which I think might help illuminate this. I think of monitoring as systems collect metrics and publish them, and then you have tools that ingest the metrics, save them, and you're defining what is a normal condition and what is an anomaly so that you could apply that to your memory, or latency, or running out of disk space. But you could also apply that to security and privacy if you can define what is normal and what is an anomaly.
Sean: Mm-hmm (affirmative).
Robert: Is that where you were going with your discussion?
Sean: Yeah. Yeah, absolutely. I mean, we can look at a number of different tools, and approaches, and practices in the space of security and privacy. If we look at one example, intrusion detection, you start with like an assumed correct, safe state, and then any unplanned changes to that state are to be considered violations. Those violations could be considered the anomaly similar to your time series data, your metrics, and anything that constitutes an anomaly needs attention. Same can be said for intrusion detection. If we think of even what everyone has on their home PC, virus scanning or vulnerability scanning. If you look at the enterprise, we have tools in the space for scanning all of our systems and our applications for known vulnerabilities. Those vulnerabilities could be considered an anomaly, either an automated action or a human being needs to be involved to take action on that. Again, it's very much in line with considering monitoring to just being time series data, but the same pattern can be applied across security and privacy.
Robert: If you're talking about states of infrastructure and if you've built your infrastructure, you could scan it and look for vulnerabilities. But if you fixed them, then ideally, they should stay fixed, unless you're changing things. So are you talking about unplanned changes like why is that port open on that server when we didn't configure it that way?
Sean: Yeah. I mean, I guess unplanned changes is generic or broad by intention. It could be somebody reverted a change in a git repository that applied a set of changes or somebody jumped on the individual machine and like, "Oh, I really need to troubleshoot this issue using Telnet," and then punched a hole or what have you, and then forgot to revert it, or it could be actually a malicious attacker that's gotten on your cloud infrastructure and now they're starting to wreak havoc and instrument your infrastructure with their own tooling, if you will.
Sean: I think if you look at some of the practices over the last 10 years, namely with like config management, that was used as a way to combat some of these unplanned changes through consecutive runs. Chef, or Puppet, or Ansible will just revert any anomalies. The problem with that is even if you revert the anomaly or you address it, your system may have already been compromised. So even in that case, it's important that we raise those signals so that people or systems looking for them can pay attention and take appropriate action. If that port was open for long enough for somebody to replace binaries on your system that you use every day, that's a pretty serious condition.
Robert: Now, it is very common with infrastructure-as-code tools that you would run them either on a schedule or periodically. If they see configuration drift, they'll revert to a known good state, and they might generate some kind of report, and maybe someone will look at it or not. Are you advocating that these tools, when they run, they should create some kind of useful metrics like, "This is how many configurations were reverted," or where are you going with that?
Sean: Yeah. I think metrics or time series as it relates to system configuration works really great for like a high-level overview or a top-level signal so that... Yeah. I think your idea or example of recording the number of resources touched or modified between runs is fantastic because it will give you an idea of how efficient and effective your config management is. Perhaps you have a little bit of debt where not all those resources needed to be touched every run, so that's a good measurement to use for improving the quality of your infrastructure as code, your config management.
Sean: But I think the same as with the general monitoring and those golden signals to see when things are going awry or need attention, context is king. Same thing applies to config management and drift is you still want those like context-rich reports to be saved and archived somewhere at least so you can look at them and throw them away later intentionally. But those high-level metrics are really just for you at a glance to know if something is wrong. Then, you can dig into these and really understand why these changes are occurring and what corrective actions could be taken, so you definitely need both. Just like with regular systems monitoring, you need your time series, and then you need your event data, your context-rich event data too.
Robert: In the case of... Let's take one of the four golden signals, saturation. I should never see a disk fill up, and that's always an error. With configuration management, if you're making changes to your configuration, the next time it runs, you would expect to see modifications as they get applied across the system. But if nobody is making any changes and there were a lot of modifications, maybe something is going on. How can you reasonably find what is an anomaly and what is expected? It seems a bit harder than in the case of some of the infrastructure metrics.
Sean: It matters how you're approaching config management. For example, if you're using like a golden image that kind of acts as a baseline for your systems, you expect like config management tools or any scripting that are attuned to them to only touch so many things. So you have your standard baseline, then you have a preconceived or set expectation in a matter of how many things are going to get modified. So that's one thing because you have your golden image that sets your baseline.
Sean: The other is like from a raw bootstrap. So you start from a VM image that's just like an Ubuntu 18.04 or RHEL 7. In that case, your baseline is only really established after your first config run, and then you're looking for patterns. What's really interesting is if you continuously run your config management and you're storing these metrics like modified objects as time series, you will see a pattern emerge. We as human beings are great at pattern recognition, even if it might not be there, but we could see it on a chart and know what is not normal. So I think those are two examples. So there's golden images, which establish your baseline, and then you have straight raw bootstrap, which only creates a baseline after it's had an initial bootstrap run.
Robert: Some organizations use the approach of immutable infrastructure where you build images using a pipeline and you never update them. If you need to make a change, you build everything new. That is in contrast to the approach where you run Puppet every 30 minutes to apply your changes. Does it make sense to combine these approaches, to build immutable infrastructure and then run Puppet? If it sees any changes, it reverts them, but you should never see any changes. So if you see even one, that's an anomaly?
Sean: Yeah, that's a perfectly valid practice. I mean, and if you do, you could use Puppet to bootstrap the image that you then snapshot and use as your complete image for a particular service. Then, there should be zero modifications. Yeah. You could continue to use Puppet for that purpose, or you could use like a simple intrusion detection system, some low-cost means of just monitoring on an ongoing basis to make sure that that immutable image remains immutable and your application isn't leaving the boundaries that you set in terms of space where it's allowed to write to disk or any storage medium.
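The baseline idea Sean describes here can be sketched in a few lines of Python. This is a minimal illustration and not something from the episode: the function name `is_anomalous`, the three-standard-deviation threshold, and the sample run counts are all assumptions chosen for the example.

```python
from statistics import mean, pstdev


def is_anomalous(history, latest, threshold=3.0):
    """Return True if `latest` deviates from the baseline formed by `history`.

    history: counts of resources modified in earlier config-management runs.
    latest: the count reported by the most recent run.
    threshold: how many standard deviations count as "not normal" (assumed).
    """
    if len(history) < 2:
        # A raw bootstrap has no baseline until a few runs have completed.
        return False
    baseline = mean(history)
    spread = pstdev(history)
    if spread == 0:
        # Perfectly steady history, e.g. a golden image that always
        # reports 0 changes: any different value is an anomaly.
        return latest != baseline
    return abs(latest - baseline) / spread > threshold


# Hypothetical run history: a handful of resources touched per run.
print(is_anomalous([3, 4, 3, 5, 4], 4))   # within the usual pattern
print(is_anomalous([3, 4, 3, 5, 4], 40))  # far outside the usual pattern
```

With a golden or immutable image the history is all zeros, so a single modified resource is flagged. With a raw bootstrap the check stays quiet until enough runs have been recorded to form a baseline.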
Robert: Can you describe what is an intrusion detection system, and how would that tie into monitoring?
Sean: So intrusion detection at the basic is just looking for simple changes in state. It's really all about... It boils down to take a snapshot of your system, usually your entire file system, and observe how many bytes are in each binary, permissions of directories and files within, and record all that information, and then periodically and possibly aggressively probe all those things to see if they get modified. If they get modified, compare what was modified with a static list of severities, if you will. So you could say, "Oh, if these particular groups of files or binaries get switched, it's a higher severity."
Sean: For example, if somebody replaces SSH binary or ps on your system, that's pretty high severity. You should know about that because its surface area is quite large. So at a glance, that's what intrusion detection is, and because it's like this really simple system that says, "These things have changed. Here's the signal," that signal is really easy to interpret as a monitoring check, and then that monitoring check could result in firing off an alert through PagerDuty as an example to wake somebody up or saving that signal to time series database or a document store like Elastic or Splunk. Really, the options are endless. There are so many things you could do with that.
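To make the snapshot-and-compare idea concrete, here is a minimal Python sketch using only the standard library. It is an illustration under stated assumptions, not how any particular intrusion detection product works: the `SEVERITY_RULES` patterns, the severity names, and the function names are hypothetical, and a real tool would also need to protect the baseline itself from tampering.

```python
import hashlib
import os
import stat
from fnmatch import fnmatch

# Hypothetical severity rules: the first matching glob pattern wins.
SEVERITY_RULES = [
    ("*/bin/ssh", "critical"),
    ("*/bin/ps", "critical"),
    ("*/etc/*", "warning"),
]


def snapshot(root):
    """Record size, permission bits, and SHA-256 digest of each regular file."""
    state = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            info = os.lstat(path)
            if not stat.S_ISREG(info.st_mode):
                continue  # skip symlinks, sockets, and device files
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            state[path] = (info.st_size, stat.S_IMODE(info.st_mode), digest)
    return state


def severity(path):
    """Look the path up in the static severity list."""
    for pattern, level in SEVERITY_RULES:
        if fnmatch(path, pattern):
            return level
    return "info"


def compare(baseline, current):
    """Return (path, change, severity) for every difference between snapshots."""
    findings = []
    for path in sorted(baseline.keys() | current.keys()):
        if path not in current:
            change = "removed"
        elif path not in baseline:
            change = "added"
        elif baseline[path] != current[path]:
            change = "modified"
        else:
            continue
        findings.append((path, change, severity(path)))
    return findings
```

A monitoring check could call `snapshot` once for the known good state, call it again on each run, and turn any non-empty result from `compare` into an alert or a stored event, with the highest severity in the findings deciding how loudly to signal.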
Robert: I went through the pipeline of metrics, anomalies, and then you introduced alerts, and that goes to an operator. What actions does the operator take when they receive an alert of this type?
Sean: So what's interesting is you don't have to have an operator there. They could be an observer for a more automated approach. For example, you could do the common practice of what we refer to as auto-remediation. So if there is an intrusion or a modification detected, you could trigger CM runs to override, or if you're going the very like golden image or complete image approach, "Ah, something's been mutated," nuke this VM or container completely and stand up something anew. Really, all these automated actions are just being measured and presented to an operator to observe that, A, they happen, and B, they didn't have secondary adverse effects on the overall system.
Sean: If that isn't the case, usually, an operator would have to use their operator human brain and be like, "Well, does this look severe or not? Why did this occur? Do we have any code changes that applied right around the time of that getting modified?" You just go more into like a retrospective or a post-mortem kind of workflow of investigating why it occurred, and then see if something actually needs to be remediated.
Robert: Suppose an attacker has gained access to a server, they've replaced some executables with programs that contain some sort of Trojan horse or backdoor, they're hoping that someone will run those programs. But immediately, the instance is terminated and replaced with a good instance. How effective is that type of auto-remediation? Do we have any real evidence or hard data on how quickly does this repel or prevent attacks?
Sean: The problem is generally when you replace something with a new copy of the same thing, the vulnerability is still present in the way that the attacker actually achieved or almost achieved their goal is still going to work, and then it also creates an opportunity, a new form of denial of service by using that leverage, leveraging that vulnerability to not necessarily successfully get your binary on the machine or modify anything beyond a simple file. But if you did that en masse on other things that looked like that, you can effectively cause like a restart loop of that application for a particular business. So I don't think it's wise to use as a general practice. I think it's an interesting practice that we can exercise and learn more about how it actually operates and behaves in the wild. I think it's a noble idea that we can continue to automate more of our daily work though.
Robert: Before nuking, where I'd first go is: are there incidents where it makes sense to sequester it for forensics to identify how it was able to be attacked in the first place?
Sean: Yeah, absolutely. I think your first thing in auto-remediation may be to capture more systems data, report in an out-of-band manner like collect data, or label it, or tag it elsewhere. You're likely already logging system logs and other things, so you have some signals already saved off of that machine. If you want to go further, I think that's a great idea of essentially unplugging or disconnecting it in a way for forensic analysis later is a cool practice. The only thing is it may not scale well. If you're a small team managing hundreds or thousands of machines, the unfortunate reality is that might not come to pass, and you might not be able to take the time to analyze that machine.
Robert: Some of the discussion we're having is fairly general. Other aspects rely on features of public cloud. Could you highlight a bit how this monitoring of security is different in public cloud than in first-party or on prem?
Sean: Yeah. I think public cloud is really interesting in regards to security mainly because there are several large players or vendors in the space, and they each have their own very different offerings from one another. Some examples of these differences between the players are authentication, authorization, key and policy management. As an operator, you need to know the intricacies of each provider and how best to leverage those things, and do it in a particular way that is proper and correct.
Sean: The same could be said about networking. Each of them provide you different building blocks to provide different layers of isolation, and then that isolation has to even apply with multi-region support. So it's just a lot to take in. If you're just using one public cloud offering and that's where your entire infrastructure lives, it's less of an issue. But as soon as you start to go the multi-cloud approach, you now have to understand or have a deep understanding of each of those providers, and the offerings they provide, and the security ramifications of each of those things, and making sure you're doing it right.
Robert: Public cloud providers have their own building blocks for implementing security, and they have a certain amount of metrics that you can use in generally alerting or importing metrics into your own tools. Do they provide enough for a solution to monitor these types of security anomalies, or are you looking at building your own security anomaly monitoring on top of these building blocks?
Sean: You have to do both. I think it comes down to... There has to be a certain level of trust with the vendors and that they're providing these services and doing things the right way. I would be willing to bet they're probably doing it better than myself or you, or many of the listeners would be able to do it, but they still have their limitations where they are supporting an inherently multi-tenant environment. They're still human beings. They still write software. Human beings are flawed. Our software is flawed. The best code is no code kind of scenarios, so I think you need to be able to leverage these things, and you can leverage the metrics and the systems that they provide to ensure a secure operating environment.
Sean: Same thing applies like with AWS. They have notifications that say, "Hey, we noticed your S3 bucket policy. It could have some public access." That is monitoring. That's a notification right there, so that's great. But we as operators leveraging these public services need to also leverage our own security tooling within the host systems as best we can. We have to leverage our own monitoring systems, and we can tie them all together so that we're still getting that single, unified visibility. But ultimately, we still have to do some of the old practices that we are doing when we're running these things in-house. The fact that it's a multi-tenant system also is terrifying because every few months now, we're learning of a new silicon vulnerability. Right? Like if we're all sharing the same CPU with 20 to 100 other tenants, you start to take that into account, so it gets pretty deep. Get your tinfoil and armadillo hat on because... That's why I think public cloud is so, so interesting.
Robert: The other big area we're going to be talking about is monitoring privacy. Before we get into monitoring privacy, could you give a brief overview of what are privacy concerns that we face today, especially in public cloud?
Sean: Where a lot of organizations started to get concerned around privacy challenges was with the rise of GDPR and how it impacts all the businesses of today. So the GDPR or General Data Protection Regulation is really a set of laws within the EU, but they have such a global effect. North American companies have to deal with it. EU users as the consumers are the real winners here because now there's regulation that states they have the right to certain things about their data that companies collect and use.
Sean: GDPR, the introduction, and the fact that it's EU regulation that affects North American companies was pretty significant. The fact that nowadays it's so trivial to run an application or business globally in multiple regions, you now have to comply with these, and GDPR is just one. There's actually several other countries and states that have similar regulations. I mean, Brazil has LGPD, which is modeled after the GDPR. Australia has NDB, the Notifiable Data Breaches. It's really about breach of privacy disclosures, making sure that users know when their data has been exposed. California has CCPA, the California Consumer Privacy Act. So the list goes on and on, and privacy protection regulation is becoming more prevalent, so we really have to be in tune to that as organizations, as businesses.
Sean: What makes this interesting for us as operators is to understand how our systems are operating. We're on an ongoing basis collecting more and more data about our systems, building out that rich set of context. That context may leak into the user experience and possibly have some crossover that contains some privacy information. So now you've got regulatory regulation applying to our data pipelines, and visibility, monitoring, and observability tools. So we have this effect, which has regional regulation that has global effect that has an impact on our day-to-day as operators of these large systems.
Robert: The area of privacy protection is quite broad. I do want to narrow focus down to what is the contribution of monitoring to privacy protection. Could you drill down into that?
Sean: Earlier, we touched on the umbrella of monitoring, and I made the case for security monitoring to come underneath that umbrella, so intrusion detection, and vulnerability scanning, and that kind of thing. I think the anomalies or the signals that they produce are fairly straightforward, and I think we can consider it monitoring. I think with privacy, it's really looking through our data collection systems for monitoring and observability, and ensuring that information that's going through those systems doesn't look like privacy information. That's one. The other would be actually leveraging monitoring as a form of compliance of privacy regulation within our applications and the like. So you can use it as an ongoing audit capability, and then notify operators or take automated action when something is in violation of that regulation.
Robert: You've made two points there. Let's go into more detail in those a bit. You talked about data collection systems. What are some of these systems you're thinking of?
Sean: We can start really broad and general. I mean, even syslog is a data collection system, if you're sending it centrally. We have trace systems now that span everywhere from the user opening a web browser and viewing a page to trickling down through all your microservices. So that's some really rich information there. So really, it's like logs, trace. I consider even time series data with good tagging and labeling. There's a fairly good source of data there. Basically, anywhere where we're scraping and funneling information into a central database for analysis. So that's what I consider to be general data collection.
Robert: You're concerned about these data collection systems as places where there may be privacy leakages. Is that right?
Sean: That's part of it. So yeah, just being mindful that you may expose the organization to some risk by collecting all that information if it crosses over into private information, particularly for users that live or work within a certain region with their own policies and law. Then, the other piece is leveraging still that data collection process and pipeline to monitor the rest of the business to make sure it's complying with regulation.
Robert: So are we talking about if someone accidentally logs a credit card number into a log aggregation system, or would that be an example of the type of thing we're talking about?
Sean: Not quite. I'm thinking more along the lines of if I'm capturing you are a user of the system, you are a customer of mine. I am observing how you're using the application, and I want to contextualize all the data that's coming out around your particular experience and your session with your information so that it goes through my system. I incorrectly use private information maybe even just your email address and a few facts about you or I accidentally capture that, that goes through. I've now violated a regulation because it probably crossed regions, crossed bounds. It's stored in a new database. When you sign up as a user, I did not say we will use your information to better understand how our systems work. Yadda, yadda, yadda. I think the crux of the issue of it is that when you signed up, you basically didn't sign up for somebody to leverage your personal information to contextualize monitoring data.
Robert: So are we talking about then if any private information flows through different points of the system that you'll have the ability to detect that and alert on that based on your own guidelines of what should or shouldn't be moving across different boundaries?
Sean: So it really depends on your approach to data collection and if you're doing like localized aggregation first before pushing outside of a particular region. Perhaps if you're pooling, you're doing log aggregation on a per-region basis first. Then, if you were to just observe those streams for particular information that looks like privacy information, then you could catch it before pushing that data outside of the region, and you can delete and deal with it there.
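One way to picture observing a regional log stream for information that "looks like" private data is a small filter that runs before anything is forwarded. The Python sketch below is only an illustration: the two detectors (card-like numbers that pass the Luhn checksum, and email-shaped strings), the regular expressions, and the function names are assumptions, and real private data takes many more forms than these.

```python
import re

# Assumed patterns: 13 to 19 digits with optional space or dash separators,
# and a deliberately simple email shape.
CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){13,19}\b")
EMAIL = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")


def luhn_valid(digits):
    """Apply the Luhn checksum used by payment card numbers."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_private_data(line):
    """Return the kinds of private-looking data found in one log line."""
    findings = []
    for match in CARD_CANDIDATE.finditer(line):
        digits = re.sub(r"\D", "", match.group())
        if luhn_valid(digits):
            findings.append("card_number")
    if EMAIL.search(line):
        findings.append("email")
    return findings


def partition(lines):
    """Split lines into those safe to forward and those held in the region."""
    forward, hold = [], []
    for line in lines:
        (hold if find_private_data(line) else forward).append(line)
    return forward, hold
```

Lines that land in the held list could be dropped, redacted, or raised as a monitoring event, depending on what the applicable regulation requires. The Luhn check is there to cut down false positives from long numeric IDs that merely resemble card numbers.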
Sean: Ultimately, I think no matter what, if you do detect a violation, you have violated the privacy regulation, and it really depends on the regulation of what actions need to be taken as well. Perhaps the user needs to be notified and told that there's been... Not a data breach, but misuse of their personal information. So I think that's just interesting. The problem with this is it's hard for me to answer it in a general way because it's so specific to the actual regulation that's being applied.
Robert: So Sean, you and I were talking offline before we recorded about an incident that you can't describe in detail because it would compromise privacy of a customer, but privacy violation was found by a privacy audit that occurred after the fact. If you are thinking about how to set up a system to detect these things through monitoring and alert on them, how would you do that?
Sean: During this retrospective specifically, the security audit was looking at data collected and just going through the usual outputs which is report data, dashboard visualizations, and the day-to-day analysis of that data. It became clear that some personal information had leaked through as context as part of that regular operation and analysis. So really, it's like there's two ways that I think you could learn from this experience or this example, which is perhaps you have a monitoring tool or mechanism to, on an ongoing basis, look for these common signs, make sure that data coming into this system is being anonymized or perhaps generating a unique ID for the user in their session, which codifies some of the unique attributes about them, but not actually leaking through that sensitive information.
Sean: So you can monitor at the edge perhaps in SDKs or the libraries that are being used. Perhaps put some constraints in place to protect developers from having privacy data get into that data pipeline. You could also have a polling base at the aggregation points or the final destination to look for what looks like privacy information, or you could drink the Kool-Aid and experiment with things like machine learning to identify personal information as it comes through the system. Perhaps then, that would be actually the most effective means of dealing with it.
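The idea of generating a unique ID for the user and session at the edge, instead of passing personal details into the monitoring pipeline, could look roughly like the Python sketch below. It is an assumption-laden illustration: the keyed-hash approach, the function names, the 16-character truncation, and the example field names are all hypothetical, and whether a pseudonymous ID like this is sufficient under a given regulation is a separate question.

```python
import hashlib
import hmac


def pseudonymous_id(user_email, session_id, secret_key):
    """Derive a stable ID for a user session without exposing the email.

    secret_key is bytes and must be kept out of the monitoring pipeline.
    """
    message = f"{user_email.strip().lower()}|{session_id}".encode("utf-8")
    return hmac.new(secret_key, message, hashlib.sha256).hexdigest()[:16]


def monitoring_context(user_email, session_id, secret_key, **fields):
    """Build the context attached to monitoring events for this session."""
    context = dict(fields)
    context["subject"] = pseudonymous_id(user_email, session_id, secret_key)
    return context


# Hypothetical usage inside an SDK or client library.
key = b"example-key-kept-outside-the-pipeline"
event = monitoring_context("alice@example.com", "session-42", key, route="/checkout")
print(event)
```

Events from the same session still correlate through the `subject` field, which keeps the context useful for troubleshooting, while the email address itself never enters the data pipeline.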
Robert: Last major topic I want to hit on is tooling. For conventional infrastructure monitoring, there are a number of popular tools that people are familiar with, and as I said, matter of DevOps, and then there are vendors who can provide dashboards in their own set of tools. When we're getting into monitoring security and privacy, are the existing tools adequate, or are there other tools people should be looking at?
Sean: Yeah. I mean, it depends how you present some of the data. So in the case of taking security monitoring where we were capturing a number of files changed over time, intrusion detection violations, those kinds of things, they can actually quite easily be visualized as time series. So we have tools like Grafana that can, and we have time series databases, so we can easily store and then visualize that information. So I think that's great, and Grafana actually has a number of different visualization tools. Then, tools like Sensu that can collect information from a number of different formats and run your probes on your services and applications, and then output the events in different formats. I think data collection is quite easy to do there. One example is the other day, I was using Sensu to execute Tripwire runs, and then aggregate the results, and then write them into Elastic.
Sean: So I think we have a number of different tools that can be combined to address security and regulatory monitoring. But the reporting side, I think there's still a void there of how we can combine these best of breed tools that we already use for time series, and service checks, and other facets of monitoring, but combine them in a way that we can still provide meaningful reporting around security, and compliance, and privacy. I think we're good on time series. I think we're good on the data collection and processing pipeline. I think that the missing piece now is how do we present useful reports that people are used to in terms of security and privacy regulation.
Robert: All right. Sean Porter, thank you so much for speaking to Code[ish].
Sean: Thanks for having me, Robert.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Lead DevOps Engineer, Salesforce
Robert Blumen is a dev ops engineer at Salesforce and podcast host for Code[ish] and for Software Engineering Radio.