Level zero - Nothing
Level 1 - Self healing
Level 2 - Instant provisioning
Level 3 - Attack immunity
Level 4 - Retaliation
Level 5 - Community
There are no automated processes in place, or what is in place is ad hoc. All work is performed by highly skilled engineers.
The actual paper script used in Level Zero has been converted into a set of automatons which “perform function Y when condition X occurs.” At this level your systems have the ability to self-repair.
Demand processes are kicked off when someone “asks” for something. At this level there is no delay between a request being approved and it being implemented. We call that instant provisioning.
There is a reason this is not level 1: at this level your automation carries the risk of causing damage, which means you must have a solid QA process built around your automation efforts before you can attempt it.
If a firewall ever receives an unexpected packet (which, by the way, will fail to pass due to the strict whitelist firewall design), automation will immediately shun that interface. Everything behind that now-shunned firewall interface is disconnected entirely from the network.
This begins with automated shun rules being written to the Internet firewall as packets are being received that are deemed malicious. It continues up to and including having the data centers swap roles.
Servers are provisioned on demand of the systems under which they operate.
You must address all three threats to achieve level 3. Having reached level 3 means you are no longer vulnerable to an attack.
As you are writing that shun rule to the Internet firewall, fire off an automaton that will attempt to gain access to the IP in question with the intent of:
Determining which IPs this node is in communication with
Fixing this machine.
As soon as you've fixed 1 machine, your efforts have paid off in full.
You share your experience and your methodology with anyone who cares to ask. You help others trying to get to your level. At this final level, you are shaping things.
Where is your organization?
None or ad hoc
I seriously doubt you even have self healing. The why may surprise you.
Here's where you (and nearly everyone else on the planet) are in terms of automation. Everybody, and I do mean EVERYBODY, agrees that “we need more automation.” However, nothing systemic has ever been done to forward that proposal. What you have is a collection of what should be considered toys. These are typically Perl or Python scripts that perform a particular, very narrow function. They are not automation because they have no sensor. They have to be executed by the engineer. They are half an automaton: the actuator.
Now here's the problem with your situation. Go to the network engineer that built that little toy in Python that performs task 'y' and ask them, “Who peer reviewed this code?” I doubt the engineer will even know what peer review means, since that's a developer term, not an engineering term. I can guarantee you the only person who has laid eyes on that code is that one engineer. Beyond the likelihood of just plain bad code, can I get an “Amen” around unforeseen consequences? You do realize that every developer knows they can't QA their own code. So then, network engineer who thinks they are a developer, who was your QA? You know, the one that threw the kitchen sink at your code to try and get it to break. And then reported back to you about all the ways your code broke. That you have to fix now. And this is the eighth round of this. If you've never been through this, you aren't a developer. Can you see now why network engineers can't be developers? Wrong skill set. Being able to code does not make you a developer.
That's part of the why. Wrong skill set. No idea what “Release Management” even is. No concept of QA in any form. Now, you need this resource. They are the ones who can tell you what needs to be done, and how to do it. That is then given to a real developer who follows a framework and has a QA department alongside them. When they get to a release candidate, the engineers are reengaged to see if the automation performs as expected. They have final release approval. That's how you do automation. And, “No,” I've never actually seen that approach used. Not over a 30-year career working with many hundreds of companies.
Here's the other part of the why. Go to any IT engineer you like, any discipline, and ask them this question: “How much of your daily job is repetitive work?” You will get an answer somewhere between 50% and 85%. What that means is that putting in automation to relieve people from repetitive tasks may just put them out of a job. “Yes, it will,” says the person who has actually done that on several occasions. I've lost support contracts because I left enough automation in place that I simply wasn't needed any longer. Engineers are builders. We build things. Why have we been asked to take on the role of maintainer? Particularly given the fact that the best maintainer is automation.
You are at level zero. You and everybody else. How do we get to 1? This one is easy because you already have the script. Every level 1 NOC tech ever hired was sat at their station and given two pieces of paper. One, a Post-it note with their credentials on it. The other, an 8½x11 sheet of paper with the top 20 things that might go wrong and what to do when each occurred. To get to level 1, you hand that same piece of paper to a developer, and with some vocabulary help from an engineer they turn it into 20 automatons.
What you will end up with is a collection of “When condition 'X' occurs (sensor trips), perform task 'Y' (actuator is executed).” When you turn those 20 automatons on, you will then remove 90+% of all your incident tickets. Just gone. The automation will be fixing things before the monitoring system can register a fault. It's a question of time. Three successive failures on a 5-minute polling cycle give the automation 15 minutes to fix anything. In that much time I can spin up a cold, blank device and configure it for a role as long as it is already plugged in. So you should have a few cold, blank devices wired into your network so they can take over for any fully failed device. But spinning up cold, blank devices and configuring them for a role on the fly is a level 3 process. For now just get the top 20 things on the sheet of paper you hand to every tech.
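A minimal sketch of one such automaton, assuming a polling model. The condition and the fix (`check_disk_full`, `rotate_logs`) are hypothetical stand-ins, not taken from any real runbook; the shape is what matters: sensor plus actuator, coupled.

```python
def make_automaton(sensor, actuator):
    """Couple a sensor (condition X) to an actuator (task Y)."""
    def run_once(state):
        if sensor(state):      # condition X occurs (sensor trips)
            actuator(state)    # task Y performed (actuator executes)
            return True
        return False
    return run_once

# Example: "when the disk is over 90% full, rotate the logs"
def check_disk_full(state):
    return state["disk_pct"] > 90

def rotate_logs(state):
    state["disk_pct"] = 10                  # stand-in for the real cleanup
    state["actions"].append("rotated logs")

automaton = make_automaton(check_disk_full, rotate_logs)
state = {"disk_pct": 95, "actions": []}
automaton(state)   # sensor trips, actuator fires
```

Build one of these per line on the sheet of paper and you have your 20 automatons.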
Then go and fire every last level 1 tech you have on staff. Unnecessary. Every time you go up a level you will be firing a lot of people. All those whose jobs have just been eliminated by your automation.
Level 2 is demand processes. Anything anyone “asks” for. The “ask” is your sensor so in this realm these will all be half-automation actuators that are somehow kicked off based on a user interaction. Let's start with one of the most common demands, new server.
Now, first off, what kinds of servers you can ask for and what parameters you can adjust about that server is entirely within the purview of your organization, and is irrelevant here. What you ask for isn't the question. The question is: given a request which has endured whatever processes are in place to get it to the state of “Approved,” how long will it be before the request is executed? The correct answer is moments at most. Sub-second, typically. By the time you've hung up the phone with your manager, who had to click “Approved” on your request, you should be able to log in to your new server.
You do this one process at a time to its completion. Until you've done them all. Then there are no longer any humans involved in “building more things.” Which means that process happens instantly now. No one goes to “Sally” or “Ralph” to get an 'X' anymore. Sally and Ralph don't work here anymore. People go to a portal and fill out a request, which must then be approved (somehow, irrelevant) so that the automation can create the resource requested. There should be no delay between approval, and creation.
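A hedged sketch of that demand process, assuming a hypothetical portal that calls `on_approval()` the instant a request is marked “Approved.” The `provision_server()` body is a stand-in for the real build-out:

```python
provisioned = []

def provision_server(request):
    # Real version: clone the image, assign addresses, register DNS, etc.
    provisioned.append("server-%d" % request["id"])

def on_approval(request):
    # The approved "ask" is the sensor; provisioning is the actuator.
    # No queue, no ticket, no Sally or Ralph in between.
    if request["status"] == "Approved":
        provision_server(request)

on_approval({"id": 42, "status": "Approved"})
# The requester can log in to server-42 before they hang up the phone.
```

The point of the sketch: approval and creation are the same event, so the delay between them is whatever the code path costs, which is sub-second.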
It also means you have a whole lot of people who used to build things, and run the processes around building things, that suddenly have nothing to do. If you fired around 20 level 1 NOC technicians, you will be firing around 200 of these “service” people. It's an order of magnitude jump at each level.
This level is about threat mitigation. There are three kinds of threats: Internal, External, and Process. You have to address all three threats to achieve level 3. The reason this level is dangerous is because at this level human intervention is no longer possible. Your automatons are going to do whatever they are going to do, and you can't stop them.
The correct response to an internal threat is to quarantine that node. The way to limit the scope of that quarantine to the fewest downstream endpoints is to make the limit 1. Every endpoint exists on a /31 address space on a point-to-point link with its default gateway, which is a firewall interface. Nothing enters your network except having traversed a firewall. Which means quarantine domains are 1 node.
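The /31 arithmetic can be checked with Python's standard `ipaddress` module (RFC 3021 is what permits /31 on point-to-point links). The addresses here are just example values:

```python
import ipaddress

# A /31 holds exactly two addresses: one for the endpoint and one for the
# firewall interface that is its default gateway. The quarantine domain
# is therefore exactly one node.
link = ipaddress.ip_network("10.0.0.0/31")
endpoint, gateway = list(link)
print(endpoint, gateway, link.num_addresses)  # 10.0.0.0 10.0.0.1 2
```

Two addresses, one endpoint, one firewall interface: shun the interface and exactly one machine goes dark.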
Now that firewall interface is configured using a very strict whitelist approach. Only a few well-known traffic patterns are allowed for this device, based on its role. Any deviation from those prescribed traffic patterns will result in a dropped packet, and a shunned interface.
Let's dig into this one a bit. First thing you have to do is accomplish the following: Nothing may enter the network except having passed through a firewall. Firewalls are at the perimeter of the network, not within it. This is going to be a significant wiring change. Just this step is its own project, and you can't actually start any work on automation until this project is completed.
Now let me help you out here. First up what you are going to do is get six copies of a virtualized firewall appliance and install those six onto each hardware compute resource as guests. These firewalls take control of all the hardware networking interfaces. You can only enter or leave the hardware via these firewalls. That's how you do servers. The reason there are six is because you have two sets of three firewalls. One of the sets is between the servers and everything else on the network. The other set is for the isolated management network connections. You do have an isolated management network, right? Within a set you have one that is active, one that is standby, and one that is cold.
Users likely need actual hardware appliances installed. Now I can argue that quarantining several or even many users is acceptable, so perhaps you don't have to go down to the /31 subnet with client machines just yet. But I think you can understand that your current design is not going to be acceptable. You are using a /24 address space within each user VLAN. That's 254 people who all go down when any one of them “clicks on that link.” That will last as long as it takes to involve a hotheaded manager. At which point you will be forced to break up your user VLANs. So plan ahead and work that out now.
Once you have completely redesigned your network so that “Nothing enters the network except having traversed a firewall”, now you can begin automation efforts. All those firewalls are going to be reporting all kinds of information. You need to tap into that stream with a sensor looking for a rule-based dropped packet. Get the offending interface identification and then write the shun to the firewall. Now you have to have a process to undo a shun, which I'm sure would involve removal and investigation of the offending machine.
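A sketch of that shun automaton under the whitelist design above. The event fields and the `shun_interface()` call are hypothetical; a real version would consume the firewall's log stream and use the vendor's management API to write the shun:

```python
shunned = set()

def shun_interface(iface):
    shunned.add(iface)   # stand-in for the vendor-specific shun command

def on_firewall_event(event):
    # Sensor: any rule-based drop means the whitelist was violated.
    if event["action"] == "drop" and event["reason"] == "rule":
        shun_interface(event["interface"])

events = [
    {"action": "permit", "reason": "rule", "interface": "eth3"},
    {"action": "drop",   "reason": "rule", "interface": "eth7"},
]
for event in events:
    on_firewall_event(event)
# eth7 is now shunned; the one node behind it is off the network
```

The un-shun path is deliberately absent here: as the text says, that is a human process involving removal and investigation of the offending machine.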
The point here is that the automation is almost trivial compared with the effort needed to make the automation available. This is what happens when the design principles put into place in 1990 are still in effect in 2021. And /24 for users (1990 predates VLANs) is right out of the original ICRC book.
Because this is automation we can do things which aren't available to human engineers. Case in point: we actually can have a simple process that writes shun rules to the Internet firewall as packets are received which are deemed malicious. You couldn't do that using human engineers. They could never keep up. So the front line is auto-shun automation. Likely this is activated by the Internet firewall itself, but also takes input from various honeypots. So as an attacker you get one shot. One packet. Go.
Sometimes the situation can get so bad that the appropriate remedy is to swing the data centers. This should let your real customers through while you deal with the attackers. Now this is automated, not a weekend event involving tens of people and months of planning. It just flips because there is a security sensor that says, “Too much impact to our real customers; Need to flip now” which then kicks off what is likely an army of actuators. These do whatever your human engineers used to do they just do it correctly, cooperatively, and quickly.
So honeypots, automated shun, and automated data center cutover. Yeah, it's that last one, isn't it?
How you got to where you are is because of organic, unplanned growth. You are now suffering under all the unintended consequences of that lack of design. So the next time you are beginning your planning for the next data center swing, you actually get everybody in the room and talk to them. What are you doing? Why are you doing that? You just need to capture it all from every perspective. And take anything at this point. It's all good right now. Once you have it all written down you can then look at it as a process. You aren't doing that now because everyone is doing their own thing, using their own tools and processes in isolation. No one has ever tried to pull it all together. That's your job. As soon as you do that, optimizations will just fall out. Do those. Then run your next switch. A bit better this time, yes? Okay, now you have people's attention. For the next run, get them to help you put it together from their perspectives. They will add details you didn't have before, and now that there are several of you working this problem, you are more likely to spot additional optimizations.
This is going to be a repetitive refinement effort, over many instances. But it is possible.
You probably aren't used to thinking of this as threat mitigation. This is provisioning. But it is threat mitigation because of how it is done. Being overwhelmed is a threat to the service. For instance, when a cluster exceeds 75% overall utilization, another server of that kind is spun up to join the cluster. Which means you can now handle things like unforeseen inbound traffic. The number of member servers will be increased to mitigate the load. This is called making the service automatically elastic. It grows and shrinks based on current demand. And this is every service.
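An illustrative sketch of the elasticity rule. The 75% grow threshold comes from the text; the 40% shrink threshold is an assumed example value, and the function returns a member count rather than calling a real provisioning API:

```python
def desired_members(members, total_load, grow_at=0.75, shrink_at=0.40):
    """Apply the grow/shrink rule once and return the new member count."""
    utilization = total_load / members
    if utilization > grow_at:
        return members + 1        # spin up another server of this kind
    if utilization < shrink_at and members > 1:
        return members - 1        # demand fell; release a member
    return members

print(desired_members(4, 3.2))   # 80% utilization -> grows to 5
print(desired_members(5, 1.5))   # 30% utilization -> shrinks to 4
```

Run that rule on every cluster, every polling cycle, and every service becomes elastic.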
The other form of process threat deals with servers that fail or are removed from service. They have to be replaced with new ones. That means spinning up another 'X' role server. Which means spinning up an 'X' role server isn't even a demand process anymore. No one “asked” for it. It happens based on need and without human intervention. Humans slow things down, make mistakes, don't follow the process, take a break, go on vacation and leave the company. None of your automatons will ever do that.
To pull this off you are going to be conducting interviews. “What do you do when 'this' occurs?” For the NOC techs we had their script in hand already. For this effort we have to create the script. Which means we will be asking people about how they do their job, to the point where they will understand that they are being replaced. Otherwise you wouldn't need that level of detail. So be aware of the non-zero probability that someone will:
Lie to you
Hold things back
Simply be unavailable
You are going to make them unemployed by virtue of automation. I can't fault them for being upset.
This is going to be the single largest project your organization has ever attempted. And you are going to fail repeatedly. Be prepared. You are replacing your entire IT organization with automation.
What is going on here is a coupling between your monitoring systems and your automation. Basically every possible event that the monitoring system can generate has a responding automation actuator. In this instance the monitoring system is the sensor of all the automatons. The various actuators were built from your interviews.
And since you (likely) have more than one monitoring system, you need to have automation built for every event EVERY ONE of your systems might produce. And I do mean 'might.' Here's why: whatever obscure, never-gonna-happen event you decide doesn't need automation; that's the one that will fire.
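A sketch of that monitoring-to-automation coupling. The event names and handlers are hypothetical; the point is that every event maps to an actuator, with a catch-all so the “never gonna happen” event still lands somewhere when it fires:

```python
log = []

def restart_service(evt):
    log.append("restarted " + evt["target"])

def rebuild_node(evt):
    log.append("rebuilt " + evt["target"])

def page_for_triage(evt):
    # The obscure event you skipped is the one that fires.
    log.append("unmapped event: " + evt["type"])

ACTUATORS = {
    "service_down": restart_service,
    "node_failed":  rebuild_node,
}

def dispatch(evt):
    # The monitoring system is the sensor for every automaton at once.
    ACTUATORS.get(evt["type"], page_for_triage)(evt)

dispatch({"type": "service_down", "target": "web01"})
dispatch({"type": "unheard_of_event", "target": "db02"})
```

Your interviews fill in the `ACTUATORS` table; coverage is complete when the catch-all never fires.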
So we moved all of our digital demand processes (The ones we built to get to level 2) into self healing, elastic matrices of services. Which means all of the IT folks now have nothing to do.
No, really. Your entire IT department has just been replaced by automation. You don't need any of them any longer.
And that's level 3. Only at this point are you immune to attack. There are two more levels, albeit there's no one left to fire.
We were never really engineers. We were fulfillment. We spun up the servers, allocated the LUNs, wrote reports that no one ever read, and watched screens to see if anything bad was happening. All of that work is now performed by automation. All of it. If you haven't gotten to “all of it” yet, you aren't running at level 3. At this level you should not have an IT department. Or at least whatever you had at level zero in terms of an IT department, is now gone. That functionality has been entirely replaced by automation. Automation doesn't need managers or directors.
You will likely occasionally need to bring in an engineer to make changes. You don't have any on staff. You don't need them continuously anymore. The only reason you thought you needed them continuously is because you were having them perform a different job. Maintainer. Engineers build, they do not maintain.
Let's discuss ethics. I understand how breaking into a machine via any method which could be construed as unauthorized is unethical. I get that. I am not the problem. The problem is the metrics around how security is managed and how attacks are conducted favor the attacker to an extreme. Which means we have a systemic bias in play. That bias is ethics. We have them. The attackers do not. Now as much as I am repulsed by the thought of “descending to their level”, I would like to remind you that there are no other options. There is only our current level where we enjoy our ethics to the detriment of our networks; and their level where at least we are all on the same playing field. I am arguing for you to ignore your ethics based entirely on the question of efficacy.
Now, likely the IP addresses of the actual attackers contacting your site are secondary machines compromised by malware. These are not the perpetrators themselves. These are their minions. Here's how you fight minions. You have automation that attacks the minion to attempt to gain control of it, by scanning it for malware and using that for entry. You are attempting to use the same malware the perpetrators used to initially take over the machine.
Once inside, you ascertain which IPs this machine is currently communicating with, and those IPs go onto the blacklist and are assigned an automaton.
Then what you do is fix this machine. You remove it from the attacker's arsenal by reverting it to a pristine, uninfected state. One down.
Now since your services (like the Retaliation service) are elastic matrices, they can reasonably respond to a significant onslaught. Many orders of magnitude more so than a group of human engineers could. This is the advantage of having removed process threats. We are now using our ability to respond, to respond directly to this threat. And keep in mind here, perfect isn't the goal nor is it even relevant. Any of the secondary machines I can wrest control of from the perpetrators is one less they can use against me, or anyone else. Any = success.
What retaliation requires is twofold. First, you need an intimate understanding of various malware: how to use them and how they infect the machine. Second, you need access to a set of “clean” things which can be used to replace infected versions. Commonly a Windows system file is infected. Copy down the original from an actual copy of Windows. Or rather, from this exact build of Windows. There is a lot of research required for this step, which is why it's fourth.
This is the point at which you are making the world a better place because of your existence.
First requirement is that you have two organizations at level 4. Nope. So clearly we are smoking things at this point.
However, here's how it should (likely) be:
By the time you get to level 4, you should be pretty aware of things. Unavoidable. Which means you should be aware of others trying to get to your stature. Help them. Be a human.
Find somebody you can converse with at your level. Doesn't matter where they are on the planet, what sector or field they are in, only that they are at or are approaching level 4. Share. You will be better because of your interaction with them. “We” are better than “any of us.”
This level isn't about anything you are doing within your organization. It's about what you are doing outside of it. Because you are responsible for us.
You can actually achieve level 5. You are going to endure some seriously radical changes in your organization, but the goal is achievable.
You need to treat automation like any other software development effort. Automation is software. Use appropriate resources.
You can achieve some seriously high-end goals by instituting some very basic automation. Think of just the security impact of enacting the auto-quarantine service. Level 1, where all you did was enact the script you hand to every NOC tech, means you've achieved self healing. That's just level 1.
“Buy/use 'X' automation software.”
I'm actually saying, write it yourself.
“Do your automation this way.”
I don't care how your automation is built, only that it exists. Even the sensor/actuator model isn't necessarily the only one that works. The measure is that it works, not how it is built. Efficacy, not standards.
“Only write automation in language 'Z'.”
Why would anyone ever say anything like this? On its face it is obviously false.
And yet I can find these or very similar statements in many publicized treatises on the subject of automation. Just forget you ever heard of Chef or Ansible. Irrelevant. They are not part of this discussion. They might be a tool that the engineers used to build your services, but once your services have been built, whatever thing is underneath them is invisible to you.