Let’s lay some groundwork. Step 1: There are overarching principles which apply to every network built. These include:
Step 2: There is a “best” design that applies to every situation, and to everyone which derives directly from the principles above. You’ve likely never seen anything like it, but it does exist. Again, this is either an agreement we have or we have nothing further to discuss. You have to have a ‘Valhalla’ you are at least working towards. Call this ‘Ideal’ if you like. Nevertheless, this is the goal. Move towards it. You will likely not have a choice in the future.
Step 3: If you adopt this design you are going to have some people who used to do work for you that now have nothing to do. What you do with these people is your business. Just be aware this is coming and plan appropriately.
When you are done with this effort, and you’ve enacted all the prescribed automation, what you will have is a network that simply never goes down, and simply cannot be hacked. It will also have a LOT fewer people running it, likely no one currently involved. They all have the wrong skill set.
This principle states that in all things you are seeking a minimum. Least number of hops. Least number of IPs used on networking gear. Least number of firewall rules to write. Least number of 10G links aggregated onto a 40G link. Always summarize your routes so you send the least you can in your updates. The list is nearly endless, which is why it’s a principle.
The difficulty comes when you have competing leasts, say least amount of money spent on new gear versus least number of aggregators based on port densities. This is why there are other principles.
There are a few of these that are fundamental, such as Least privilege. This is both to say you have been granted no more privilege than you actually need, and also that initially you were granted no privilege at all. The default security policy is “Access denied.” This continues the idea of “White list” to the policy side of things. Least privilege means you have to do something to gain privilege. You may need to log in a second time, with other credentials, to a privileged machine (a Hop box) which has access to the isolated Mgmt network. This is where your applications are tied to your login processes. And again, Principle of Separation: your credentials that are used to get onto the network and get your email are different from the credentials you use to manage things. And the mgmt credentials don’t work to log you onto the network; they are system accounts.
Let’s talk about one least in particular. Least number of endpoints on a VLAN. Your ‘typical’, and very broken allocation plan uses a /24 subnet for any particular VLAN. 254 possible hosts, all of which have nothing between them. Compromise one of them and the other 253 are available immediately. Let’s not do that anymore, shall we? The least is 1. In this case what you have is a system speaking to its default gateway over a point-to-point link using a /31 address space for that VLAN. Since these are all point-to-point links, that VLAN could be numerically replicated on another device somewhere in the network because all links are point-to-point, and no VLANs are transitive. VLANS only exist within an aggregator.
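A minimal sketch of that allocation, assuming one edge firewall owns an (invented) 10.20.0.0/22 block and carves it into /31 point-to-point links, one per endpoint VLAN:

```python
import ipaddress

# Hypothetical /22 owned by a single edge firewall, carved into /31
# point-to-point links. Every address here is invented for illustration.
zone = ipaddress.ip_network("10.20.0.0/22")
links = list(zone.subnets(new_prefix=31))  # 512 candidate point-to-point links

# Each /31 holds exactly two addresses: the firewall subinterface
# (default gateway) and the lone endpoint behind it.
gw, endpoint = tuple(links[0])
```

Python’s `ipaddress` module handles /31 networks per RFC 3021, so both addresses in the pair are usable hosts.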
Now for security's sake, that default gateway is actually a firewall interface. That’s right, every endpoint has a firewall between it and the rest of the network. That way if you misbehave, we can quarantine you at your port. What that also means is that there is a firewall at every door (entry way) into the network. That means once you’ve hit the network, you don’t need any more firewalls. You just need to pass through one when you EXIT the network. Firewalls are by definition a perimeter device. So in this design they only exist at the perimeter. I am sure you’ve heard of the term “Security Zones” which are what is terminated at a firewall. All this is doing is including internal compute resources as being “outside.” Which is quite frankly how they should be treated. Suspect. So two zones, “Inside” where routing occurs and “Outside” where all the endpoints are.
Let’s talk about another least – Least number of hops. I have actually encountered networks that had violated this one so badly there were portions of the network that couldn’t ping other portions because the 30 hop-count limit had been exceeded. Let me give you a wiring walk-through so you can see what the actual minimum is. So using our firewall-as-default-gateway scenario, what happens is that the network within a building is going to collapse down dramatically. What you need is nothing more than a core router connecting the uplinks of various firewalls. One hop. You might have a second if, for instance, you ran between buildings, core to core. Now don’t get ahead of me and start complaining about “only 1 core router” when you know what I mean is one routing hop, likely built of a collection of redundant devices. We’ll cover the principle of resiliency later.
Another least, routing updates. Summarize, summarize, summarize. You allocate a /22 to a particular firewall because that is the limit it has for VLANs. That device only advertises that /22, not all the /31’s that are actually in use. Beyond bandwidth consumption issues, there is another reason why this is a best practice.
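The summarization rule can be sanity-checked mechanically: every /31 actually in use must sit inside the one /22 the firewall advertises. A sketch, with invented addresses:

```python
import ipaddress

# The firewall advertises only its /22 summary; any /31 in use must fall
# inside it. Addresses are illustrative, not from a real plan.
summary = ipaddress.ip_network("10.20.0.0/22")
in_use = [ipaddress.ip_network(s) for s in
          ("10.20.0.0/31", "10.20.1.128/31", "10.20.3.254/31")]

# Every live /31 is covered by the summary...
assert all(summary.supernet_of(link) for link in in_use)

# ...so one route entry replaces hundreds of /31s in the updates.
advertised = [summary]
```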
Another least, knowledge. I don’t know anything more about you than I need to know. Because the less I know about you, the less dependent I am on you. If I knew nothing, then anyone could replace you. I need to know your addresses. Destination. But then anything exhibiting your addresses could replace you. So here’s what you are going to do. You are going to remove every last password you have on maintenance protocols. You have no reason to mistrust the connectivity of any of your peers. That’s all a password on OSPF says – connectivity is correct, not that the other speaker is or isn’t doing anything correctly. They are all under the same management domain, are they not? So we don’t share any extra knowledge, like a password or a certificate chain or an MD5 hash, which are unnecessary for the function at hand. Make it easier to swap things out. Stop trying to solve problems that don’t exist.
Which leads directly to, Least enablement. If you are only using 11 ports in this 24 port switch, then the other 13 are configured down and error-disabled. Thus the only thing that has a live port is a live compute node, and the process to enable a port is significant. Which is why you don’t need to worry about anyone randomly sniffing your traffic. How would they do that if they can’t get a live port? And all the compute nodes in this network are behind a firewall, so no sniffing here. This has a direct impact on performance, as it reduces memory consumption of unneeded processes. It also eliminates one of the most common attack vectors, unused services. Turn off or disable things that aren’t being used. Case in point, sshd. Unless your system is a hop box, it is unlikely that SSH is a protocol that is part of the service this server provides. Turn it off. You can always manage the system via the console, which is itself a virtual thing. Now think about the security stance of a situation where the only servers listening on port 22 were hop boxes. Yet another attack vector squashed.
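The port side of least enablement is simple enough to automate. A sketch for a hypothetical 24-port switch; the interface names and CLI keywords are illustrative, not vendor-exact:

```python
# Generate "shut down the unused ports" config for a hypothetical 24-port
# switch. Interface naming and keywords are illustrative only.
def disable_unused(live_ports, total=24):
    lines = []
    for port in range(1, total + 1):
        if port not in live_ports:
            lines.append(f"interface GigabitEthernet0/{port}")
            lines.append("  shutdown")
    return lines

# 11 live ports out of 24 -> 13 unused ports, two config lines each.
config = disable_unused({1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11})
```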
I could go on. The basic premise is, no matter what you are talking about, better of it means less of it.
Redundancy is not resiliency. Redundancy is every device is connected to two others in every direction. Resiliency is when there are three or more. In resiliency you still have redundancy even during a failure condition. Neither remaining device had to process the entire load that was dropped by the failed one. This principle finds direct application with servers, more so than with networking gear. Networking gear has a much more deterministic failure path making redundancy nearly always sufficient. The one exception is the core router. This needs to nearly always be a quad, and sometimes a cube (8).
Redundancy is every device is connected to two others in every direction. Notice how there are no exceptions to which devices participate. Either they all do, or you don’t have redundancy. Now redundancy can be measured easily enough. I should be able to yank the power cord from any single device in your network, and your network should behave as if nothing has happened. Your monitoring service may be busy but your packet switching network is not affected. That’s the measure. You can’t tell anything happened. Which means BTW that your ability to bring on a replacement has to be faster than the (typical) connection timeout of 30 seconds. So, for those instances like firewalls where your only option is standby mode, you need to be able to have the standby unit take over live connections in < 30 seconds. Nearly every firewall does that just fine and no one has to reconnect. Make sure however, that whatever firewall you buy says specifically, “No one needs to reconnect on standby promotion.”
Resiliency is 3 or more. Every service provided by a server should be accessed using a load balancer. No exceptions. Here’s why. Every service should have a minimum collection of nine servers. Three are live right now via the load balancer VIP. Three are on standby. Three are used for QA. Which group of three each server is in changes over time so that every server serves in every role at least four times a year. That means your process for changing roles better be pretty tight, and be automated. This cannot be a weekend maintenance window level event. Not if EVERY service runs on this design. It should JUST HAPPEN. No one kicks it off (or forgets to kick it off.) Just every 90 days, this service has its standby set promoted to online, the QA set is redirected to being standby, and the former online becomes QA. Tick. It should just happen.
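The 90-day rotation is a fixed permutation of three sets, so the automation is almost trivial. A sketch with invented server names:

```python
# One 90-day tick of the rotation described above: the standby set is
# promoted to online, the QA set is redirected to standby, and the former
# online set becomes QA. Server names are hypothetical.
def rotate(roles):
    return {
        "online": roles["standby"],   # standby set promoted to online
        "standby": roles["qa"],       # QA set redirected to standby
        "qa": roles["online"],        # former online becomes QA
    }

roles = {"online": {"s1", "s2", "s3"},
         "standby": {"s4", "s5", "s6"},
         "qa": {"s7", "s8", "s9"}}
roles = rotate(roles)  # tick: one 90-day cycle
```

Three ticks bring every server back to its starting role, which is how each set serves in every role four times a year.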
Your core is likely handling nothing but 40G or 100G connections. Firewalls pretty much top out at 10G which means you will need aggregator devices to groom multiple 10G firewall uplinks onto a single 100G link to the core. These aggregators should NOT be a routing hop. They are pure layer 2 switching. VLANs, nothing more. Within an aggregator should be a single subnet, allocated evenly among the downstream firewalls so that the router can advertise this entire aggregator as a single route entry. The /22 networks on the firewalls are all part of the /16 of the aggregator.
Now let’s discuss the core. First up, you have no transitive VLANs which means you aren’t running spanning tree. At all. Anywhere. All of your VLANs are point-to-point connections. Now the reason we have a quad is because that gives us four zones, each of which connects to two of the physical routers. Four routers A, B, C, D. First zone connects to A & B. Second connects to B & C. Third C & D. Fourth D & A. Four zones. Users, Devices, Surveillance, Internet. I.e. Untrusted, Trusted, High Security, External. Now since you aren’t running spanning tree, I’d suggest switching to MPLS instead. You have the fast failover therein. Also, if the ‘core’ is really an MPLS PE, then traversing between buildings wouldn’t incur a second hop. You would just exit the MPLS using one hop.
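The quad’s homing pattern is just “zone i attaches to router i and router (i+1) mod 4.” A sketch of that mapping:

```python
# Zone homing on the quad: each zone attaches to an adjacent pair of the
# four physical routers, ring-style.
routers = ["A", "B", "C", "D"]
zones = ["Untrusted", "Trusted", "High Security", "External"]

# Untrusted -> A & B, Trusted -> B & C, High Security -> C & D,
# External -> D & A.
homing = {zone: (routers[i], routers[(i + 1) % 4])
          for i, zone in enumerate(zones)}
```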
If you need more than four zones, then you use a cube. In this you either get 6 via a quad of connections, or you get 12 edges if using only 2 connections per zone. Now I can make the case for six by splitting both the Users and Devices in the previous list. So, Guest users and Internal users. Or “Infrastructure” devices and “BYOD” devices. I can’t really make the case for 12. And think about the resiliency of a quad of connections leaving every zone. You’ve moved resiliency down one level which is a positive.
Again, all these ‘core’ routers are MPLS PE’s so there are some P routers somewhere. This is how you get between buildings. Every building’s core is connected to every other building’s core via the corporate MPLS network. The P router in this building has the actual circuits leaving the building. Given this you can construct the cube across multiple buildings if that is appropriate.
So resiliency is measured in behavior. Basically, you never go down, and there are always two paths available.
You can only access what you need to access. Your access has no influence on my access. The path by which service is provided differs from that with which the service runs or was built; which is different yet from how things are managed and monitored. You can spell "virtualize." One server failing does not influence the other servers in this collection. The overarching principle here is that every instance is autonomous.
Let’s start with networks. There isn’t ‘a’ network, there are several of them. All of which are wholly isolated from each other. Starting with “Front”, this is the network you have now. It is connected to the Internet and is using IPv4 addressing and is running an MTU of 1500. Front is how users connect to services. As soon as we change to a server talking with another server, we change networks. “Back” is the network used by every server to communicate with another server or infrastructure components. A couple of weird things about Back:
Which means we don’t have to do anything to let the devices figure out how to communicate between front and back. Front is IPv4. Back is IPv6. Your DNS servers need to answer on both based on who asked and either provide normal A records, or AAAA records for IPv6 queries.
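The split-answer behavior can be sketched as a toy resolver. The name and addresses are invented, and a real deployment would select by the querier’s source network (split-horizon views), not just by query type:

```python
# Toy sketch of the split answer: Front (IPv4) clients are handed A records
# pointing at load balancer VIPs; Back (IPv6) clients get AAAA records for
# the servers themselves. Name and addresses are invented.
RECORDS = {
    "app.example.com": {
        "A": "10.1.2.3",     # load balancer VIP on Front
        "AAAA": "fd00::10",  # server address on Back
    },
}

def answer(name, qtype):
    # Real DNS would key this off the querier's source network (views),
    # not merely the record type requested.
    return RECORDS.get(name, {}).get(qtype)
```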
The third network is Management “Mgmt.” This includes monitoring. Your monitoring service will have two interfaces; one on Front so you can get to it, the other on Mgmt so it can get to the devices. Mgmt is IPv4 but uses an RFC1918 subnet unused on Front. Generally, this is the 172.16.0.0/12 subnet. 10.0.0.0/8 is used for Front as is 192.168.0.0/16, which is used for DMZ. (Part of Front but not routable to/from 10.0.0.0/8)
You MAY have a fourth network. This is a sniffer network. There are a few OEMs in this space, perhaps most notably Gigamon, but Arista plays here too. The idea is you collect a span port from each aggregator device into its own network so you can deliver that stream to multiple clients. The primary recipients are IPS systems, but honestly any reasonable security approach would take this data and look for aberrations.
So 3, possibly 4 networks. Front, Back, Mgmt and Sniffer. No they are not in any way connected to each other, although any one server may connect to one or more of them. There are in fact planned services which connect Front to Mgmt so that people doing maintenance can accomplish that feat. Servers have no Front connection. Only a Back and Mgmt.
Why are we doing this? Having the traffic that builds the service (from a DB or such) run across the same interface we got the client request from both over-utilizes that interface and prevents us from using things like jumbo frames to increase our performance. Having IPv4 out one interface and IPv6 out the other makes things like transitioning trivial. It also makes all the tools of the typical hacker worthless. There is no IPv4 on this side. There have also been a few reports of increases in performance when using IPv6 only on a particular interface. This has to do with “No more ARP” and the built-in multicasting.
So Front is where users are. They only speak to load balancer VIPs. The load balancers then use Back to communicate with their member servers, and the whole process behind those member servers is entirely IPv6 and Jumbo framed. Again, if you front every service you have with a load balancer, which you should do, then no server has a link to the Front network. Now think about what has just happened to the poor fool who managed to get one of your employees to click on that link in the email. There is nothing here. There aren’t any servers on this network. There are only firewalled-off other users and load balancer VIPs. That’s why we are doing this. Even if one of the humans does something stupid, it won’t matter.
Now play this out all the way down to the workstation itself. Now I’m not actually arguing for a thin client here. What I am arguing for is an ephemeral workstation. One that is PXE booted from the network and has no local storage. Applications are served via any of several remote execution mechanisms. Similarly, file storage is only available as network storage. You may not insert a USB drive and have it work, although your USB headphones will work just fine. This is a portal machine. It looks and behaves like every other portal machine your company has. They are all identical which means you have a set of cold spares in the closet. Should something happen to a particular machine all you do is shut it off and power up one of the spares in its place. You login again and return to where you left off. Network storage is both protected and backed up (Principle of Resiliency.)
This is separating the machine being used, from the user who is using it. There is nothing personal ALLOWED onto the machine. And Oh BTW – flipping all your applications to remote execution (think Windows Remote Desktop Connection or VNC) means the device in front of the user doesn’t have to perform that task. You have wisely spent money on the servers providing the remote applications. The bit of whatever you have in front of the user needs only to be essentially a browser. Nothing more. And isn’t it great that you have all the same $300 client machine on every desktop. There’s no reason to give the CEO a fancy laptop. It won’t perform any better and now you have to configure a windows machine to not allow local storage.
You control the boot image that every client machine uses from a central location. You can make updates to it as needed. And if anyone’s system gets befuddled, you literally just reboot. The machine is entirely virtual. If the machine becomes compromised you (wait for it) reboot and it goes back to factory default.
Can you see why we are doing this? Oh and another little addition. No matter what, you have now converted all your licensed applications to “Concurrent use” model. You buy 3 licenses of X which are running as three copies of it on an RDP server. Anyone allowed to run that application can run it, just only three of you at any one time. I didn’t have to load software onto 20 machines because there are 20 people in this group. I loaded it three times on the server and I only bought 3 licenses, not 20. So doing this is going to save you money immediately.
Principle of Separation includes:
I suppose the principle could have been written as that of Isolation, but I think that may be a bit too far. I think it would be advisable to use the knowledge you have of how your application is supposed to work, to build a sensor for when it is misbehaving. Which isn’t purely isolated. But I do think the sentiment of ‘Isolated’ is valid. Case in point, Hop boxes. Nothing that is connected to two networks is allowed to forward packets between them. Test that with actual packets. Similarly for the Mgmt network: with the exception of things like syslog and SNMP trap messages, all endpoints receive only. Which means an endpoint attempting to create a connection over the management network would trip the whitelist ACL on that firewall interface. Which would trip quarantining. So yes, separate by direction as well.
No two components of your design may depend upon each other. Don’t help the attackers. Obfuscate. Randomize.
The one that shows up nearly always is the link between IP address subnet, and VLAN. I’ve seen this taken to the point that the third octet of the address WAS the VLAN number. Don’t do that. Keep in mind we just made your VLAN plan really difficult because we are now allocating /31 VLANs not /24. So we are now done with +10 and 500’s for Printers kinds of allocation plans.
Now there are two sets of aggregators. There are the core facing aggregators which bring things up to 100G from 10G. Then there is a separate aggregation layer between users, servers and their firewall. This is because the firewall only speaks 10G and users are 1G. And since servers are virtual, we are just building new paths through the existing virtual environment. The aggregator for servers is the switching functionality within whatever the virtual machine system uses.
Now since the edge firewall’s 10G uplink towards the core is being groomed onto a 100G link by the core aggregation layer, that means whatever VLANs are in use on that side don’t affect me. I can’t see them. So every edge firewall gets to reuse all 4096 VLANs. This is because there are no transitive VLANs. All VLANs are point-to-point interfaces. A VLAN only has relevance between a firewall and its downstream aggregator. So now that VLAN doesn’t matter (VLAN is completely dissociated in this design) best to use random. Which means there is automation in existence such that when a new connection needs to be built in this zone (we are installing a new server) then a random VLAN is allocated and groomed onto the aggregator through to a firewall interface. Within the virtual server system we spin up the server and attach it to the new VLAN, which the firewall already recognizes when the server is booted.
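That allocation step is a good candidate for the automation. A sketch, assuming a usable ID range of 2–4094 and a hypothetical per-firewall set of VLANs already in use:

```python
import secrets

# Random VLAN allocation per edge firewall. The usable range is assumed to
# be 2-4094; the in_use set is hypothetical per-firewall state.
def allocate_vlan(in_use, lo=2, hi=4094):
    if len(in_use) >= hi - lo + 1:
        raise RuntimeError("VLAN space exhausted on this firewall")
    while True:
        vlan = secrets.randbelow(hi - lo + 1) + lo
        if vlan not in in_use:
            in_use.add(vlan)  # record it so the next allocation differs
            return vlan

in_use = {100, 200}
new_vlan = allocate_vlan(in_use)  # random ID, never a duplicate
```

`secrets` rather than `random` because the whole point is that the ID should not be predictable.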
Obfuscate < randomize. Obfuscate means you have an algorithm which can be compromised. That makes it less. Random is random. Use random whenever you can. Try to use random even when you think you can’t. You’re likely wrong in that assumption. Remember this axiom: “If things don’t make sense, check your assumptions. One or more of them is wrong.” Many of your assumptions are in fact wrong. It would be of value for you to actually test them.
No part of your design should inform any other part of your design. If so, you did it wrong.
You have a firewall which is expressing public IP NATs for internal addresses, speaking BGP with at least one internet router.
Internet connects to:
DMZ Internet firewall
DMZ Router – 192.168.0.0/16 - This is a simple 10G router
DMZ Edge Firewall 1 – these are your zones. Each is likely a /24 or so which describes the collection of /31 subnets in use on this firewall.
DMZ Edge Firewall 2
DMZ Edge Firewall 3
Load balancer VIP interface. - Notice how there are no actual servers in this network.
The load balancer gets to its member servers over the Back network, not the Front. Which means your load balancer must be very IPv6 aware and functional. You need an IPv4 VIP attached to IPv6 member servers. No, I don’t think all of the available load balancers can accomplish that so now we have a criterion based on function for picking them out.
Internet connects to:
Outbound Internet firewall
Edge FW 1 – These are your zones. Each firewall IS a zone and everything behind it is in that zone. Each firewall manages a /22 subnet from which all of its internal connections are derived. Interesting consequence of this design is your firewall must be willing to be a DHCP server to all the devices behind it. Yet another criterion based on function for purchasing decisions.
Edge FW 2 – e.g., Database Server VIPs. These zones are based on traffic patterns. The Principle of Leasts says least number of services per zone, i.e. 1. So if you had both MySQL and Oracle database services, those would be in different zones.
Edge FW 3 – e.g., Users. (BTW, all users are Users. There is only one zone for humans. There is NOT any form of “admin” zone for privileged users. That would violate the Principle of Leasts – Least privilege. It would also violate the Principle of Separation – the default security policy for an Admin zone would not be “Access denied”; that’s the whole reason such zones exist. Don’t do that. This is every attacker’s dream situation.)
Notice how we’ve dropped the whole Core/Distribution/Access model. That design was meant to sell gear not build a functional network. Our design is a firewall at every entrance to the network, and then a 1 hop routed network between the firewalls.
Users are the only zone that has physical aggregation switches. Servers’ aggregation switches are the ones internal to the virtual server system. Which means no two firewalls share the same downstream aggregator. Which means every firewall can use all 4096 VLANs. The firewall fronting users is likely physical but really needn’t be. It’s just a bit simpler if it is. Also keep in mind that Users are on Front, so Front is basically your existing physical network. Most of Back and Mgmt are virtual, at least until you get to the core aggregators. Then they are VDCs.
For wireless, use thin APs please. The way you would do that is the APs themselves are only connected to the Back network. They then tunnel the IPv4 connection with the client back to the controller via IPv6. Which again means we have a purchasing criterion: thin APs must be able to tunnel back to the controller over IPv6. Then there is Guest access. Guests can only communicate with the Internet firewall. Nothing else internal. Nothing else on 10.0.0.0/8. The guest firewall ensures that. The other purchasing criterion is that each client is tunneled separately. Each client is part of its own /31 subnet with an edge firewall subinterface. Principle of Separation: separate tunnels for each client. I don’t believe this is a common feature among wireless vendors.
I would also suggest that at this point you need to assume all these firewalls are themselves virtual. There are simply too many to have them be physical. I think the only physical ones you have are the Internet firewalls, and perhaps the firewall for wired users. Which means the only VM on the system which can access the hardware 10G port is the firewall VM. All other traffic is entirely within the memory of the virtual server system. So now the hardware pair of teamed 10G interfaces is groomed directly onto the core aggregation layer. So yes, you need vPC (Cisco) kind of teaming. Teaming to two different upstream switches. Because otherwise you’ve violated the Principle of Separation, and the Principle of Resiliency. Your two upstream links should go to two different devices. Always. We now have another functionality-based purchasing criterion. You have to speak IEEE 802.3ad to my system hardware correctly using two endpoints. This is where those dual-ended patch cable assemblies show up. The ones I talked about in Resiliency. At the system side, the cables are actually physically attached to each other, with a 1-foot-long lead at the end. (The attachment is 1 foot back from the ends of the cable.)
Back looks very much like Front from a wiring standpoint, right down to the edge firewall, which is itself virtual and is the only VM on the system that can access the hardware networking interfaces connected to the Back core aggregation layer. The reason these two networks look alike is because they serve the same purpose. Each is built to “provide services.” Front provides services to users. Back provides services to servers. So Back is a one-for-one duplicate of Front, it’s just configured differently. For those devices which are capable of it, using a VDC (Cisco again) for each role the device fulfills makes sense. Therefore, every hardware device has three roles which exist as isolated VDCs: Front, Back and Mgmt. Doing it this way means all three networks are wired identically, use exactly the same gear, but are still completely and wholly isolated from each other.
Mgmt – You guessed it, looks identical to Front as well. By design.
So, every hardware platform used for hosting of VMs must:
Now let’s talk about network services. Starting with DNS. First off, all the A records in your DNS refer to load balancer VIPs, not servers. Servers have no connection to Front so they have no IPv4 addresses. To get a server address you have to be an IPv6 client and ask for an AAAA record. So you can see here the two DNS systems are isolated from each other. They have nothing in common. Therefore, there is no reason to have the two use the same DNS system. Have separate ones. And Oh, BTW you don’t need load balancers in IPv6. It’s called Anycast.
So in this Valhalla you have something like a Chromebook on every desk, that PXE boots. The only networking hardware you have is a quad of 100G-capable core routers, two sets of Internet firewalls, some L2 aggregation hardware to groom 10G onto 100G, and your existing switching infrastructure to get user connections groomed onto 10G uplinks. That’s about a dozen, maybe 18 total hardware devices (plus the existing user switches) in the network. Big beefy ones, but not very many of them. Certainly not the 100’s (1000’s? 10,000’s?) of devices you are paying maintenance on now. The various 10G connections leaving the virtual server system hardware are connected directly to the core aggregation layer. The core router and the aggregation devices run three isolated roles. The virtual server system runs six firewall instances which then own all the hardware networking interfaces on the system. All other server communication runs entirely in memory. All applications are remote execute. All writeable space is only network storage. You can swap out any device at any time.
And remember those zones we had in Front? (Untrusted, Trusted, High Security, External) These apply to Back and Mgmt as well. External-Back would be where you connect to third parties for back-end services. External-Mgmt would be where you involve third-party services such as remote device management, monitoring, or IPS. You treat all of these Externals the same way you treat the Internet on Front. Which classifies the servers in direct contact with the external endpoints as "Untrusted". This is your DMZ within Back. Front and Back really are the same.
The two tasks that every network engineer performs during a Sunday early morning maintenance window is software upgrades and management information updates; passwords, etc. Neither of these tasks should involve a human being. This is all robot work. Not a robot in terms of a physical entity but rather a piece of automation that behaves robotically. You need someone to build you automation that fires off every 90 days that does the following:
Now the steps here that your engineers aren’t doing are steps 2, 5, and 6. In my experience, the idea of a QA or acceptance test wasn’t a concept network engineers could conceive of. Developers will get it immediately. The terms spring from their discipline, which is why it is a foreign concept for network engineers.
Now if we are doing software updates with automation, which requires a reboot, then anything else you might want to do should be done by automation.
Update all the enable passwords – robot work.
Swing these three servers to another VLAN so they take on a different role – robot work.
Stand up three servers as THIS role in THAT zone – robot work.
Update all the “Description” fields to contain...
wait a minute.
We don’t need description fields any longer. We don’t have humans logging into networking gear and reading configurations because you can’t make changes to this network. The only time you can make a change is if you have an active fault ticket in hand and you need to make a change to repair the fault. Suddenly the whole “readability” criterion for configurations is gone. The only person who needs to read the configuration is the network engineer that set it up in the first place, and they only needed that information while they were setting things up. Now that it is setup, that information is no longer of use. No one, and I mean NO ONE will ever read the configuration of your network devices.
Here’s why. You aren’t employing Network Engineers any longer. You don’t need that resource; you need another. We have replaced everything the Network engineer was doing for you with automation. Again, not a physical entity sitting in their chair. What you do with this employee is your business. Keep in mind you don’t need this kind of resource any longer.
The resource you need is a diagnostician. “What happened and why did that happen?” are the questions they are tasked with. These are the humans that work in your NOC that service trouble tickets. They do NOT service Incident tickets. That's robot work again. They are searching for “Root cause.” They write the reports and make the systemic changes recommendations. They do not enact those recommendations (Separation of Duties, yet another Separation.) Network engineers are contracted to come in and make the changes. Network engineers are now like plumbers. You call them in when you need them. They are not on staff because you don’t need them continuously. If they have a brain in their head, Network engineers will unionize soon.
Now here’s the thing about tickets. There is another layer of automation in your NOC. This replaces the Level 1 technicians who follow a script: when X occurs, perform Y. The automation is a combination of a sensor for condition X, which triggers an actuator that performs function Y. Here’s the thing: you already have the script in hand for this. You hand it to every Level 1 technician you hire on an 8 1/2x11 sheet of paper. The last time I turned one of these kinds of systems on, the incident rate dropped 93% on the first day. The automation was repairing things before the monitoring system could notice a fault. This is a time function. Most monitoring systems work on a five minute polling cycle and have to fail three times to note a failure. I can do a lot of work with a script in 15 minutes. Heck, in this much time I can power up a stone-cold, blank replacement and configure it for a particular function.
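The sensor-plus-actuator pairing described above is easy to sketch: the Level 1 script becomes a table of (condition, action) pairs that a polling loop walks. This is a minimal illustration in Python with made-up condition names and recorded actions, not anyone's real NOC tooling:

```python
# A runbook entry is a sensor (condition X) paired with an actuator
# (function Y). All condition names and actions here are hypothetical.

RUNBOOK = []  # list of (sensor, actuator) pairs

def rule(sensor):
    """Register an actuator to fire when its sensor trips."""
    def register(actuator):
        RUNBOOK.append((sensor, actuator))
        return actuator
    return register

@rule(lambda state: state.get("interface_down", False))
def restart_interface(state):
    # A real actuator would call the device API; here we just record it.
    state["actions"].append("restart_interface")

@rule(lambda state: state.get("bgp_session_idle", False))
def clear_bgp_session(state):
    state["actions"].append("clear_bgp_session")

def poll_once(state):
    """One polling pass: fire every actuator whose sensor trips."""
    for sensor, actuator in RUNBOOK:
        if sensor(state):
            actuator(state)

state = {"interface_down": True, "actions": []}
poll_once(state)
print(state["actions"])  # ['restart_interface']
```

In practice the sensors would read live telemetry and the actuators would drive device APIs; the point is that the 8 1/2x11 sheet already is the specification, so encoding it is transcription, not invention.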
Which means you should have a few stone-cold, blank replacement devices wired into your network. One or two for each model in use. This is how you go from Redundant to Resilient with networking gear. You have cold, unconfigured spares that can be placed into service in any appropriate role by enabling and configuring them. They are already fully connected; they are just blank and powered off. Which means when a redundant pair member dies, it is replaced by the cold spare. You have maintained redundancy in a failure situation, which is the definition of resiliency. And all of this is automated. You, the human being, have nothing to do in this realm. Someone set it up ahead of time and QA’d the processing like you would any software endeavor. What you, the human, get is an alert that essentially says “Cold spare has taken over the function of device X.” Which means you, the diagnostician, are now going to log into device X and see what is going on. Device X has already been quarantined off the network, so you can reboot it, whatever you like. The only thing you can really do is blank out this device and have it become the new cold spare, or RMA the device if it is a true failure. Which means you also need automation which blanks out a device.
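The cold-spare promotion described above is just a pair of state transitions: promote a blank same-model spare into the failed role, and blank a quarantined device back into the pool. A sketch of that logic, where the device models, roles, and pool are all hypothetical stand-ins, not a real device API:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Device:
    name: str
    model: str
    role: Optional[str] = None   # None = blank cold spare
    powered: bool = False
    quarantined: bool = False

def promote_cold_spare(failed: Device, spares: List[Device]) -> Optional[Device]:
    """Replace a failed device with a blank, same-model spare: power it
    on, take over the role, and quarantine the failed unit."""
    for spare in spares:
        if spare.model == failed.model and spare.role is None:
            spare.powered = True
            spare.role = failed.role   # stands in for pushing the config
            failed.quarantined = True  # isolated for the diagnostician
            failed.role = None
            spares.remove(spare)
            return spare
    return None  # no spare of this model left: raise an alert instead

def blank_device(dev: Device, spares: List[Device]) -> None:
    """Wipe a quarantined device and return it to the cold-spare pool."""
    dev.role = None
    dev.powered = False
    dev.quarantined = False
    spares.append(dev)

core_a = Device("core-a", "X9000", role="core-router", powered=True)
spares = [Device("spare-1", "X9000")]
replacement = promote_cold_spare(core_a, spares)
print(replacement.name, replacement.role)  # spare-1 core-router
blank_device(core_a, spares)
print([d.name for d in spares])            # ['core-a']
```

Note the symmetry: the failed device, once blanked, is simply the new cold spare, which is why the blanking automation is worth building even though it looks trivial.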
Here’s how you do automation. Any process you perform twice is a candidate for automation. Repetition is the criterion. So even something as trivial as blanking out a device has automation built for it, because you will likely do that function again. We make software updates, so that is automation. We make administrative updates, so that is automation. We build new server instances, so that is a “Demand” automaton. Demand automation uses a sensor which requires a human to kick it off. In fact, nearly every “demand” process you have is automation. If the words ‘fulfill’ or ‘allocate’ appear in the functional description of your process, then this is a demand process. Then the question is “atoms” or “bits.” You can’t build automation (easily) if you are handling atoms. If this is bits, then no human should be involved.
There is no change management. Changes are not allowed. You do still have maintenance windows, though. Here’s how that works. Whenever one of the automatons wants to make a change, it must first request a window. The system it queries decides based on the current state of the network. Is the network currently in a state where it can tolerate a maintenance window? Yes or no? How that decision is made and what criteria it is based on is the programming behind that system. But it returns either a yes or a no. If yes, then the maintenance window is currently active and you may proceed. Now, from an engineering standpoint, you need to wrap your head around a 5 minute maintenance window. If at the end of 5 minutes you haven’t reported “Success” back to me, then I will cause your process to be killed and we will execute the backout automation for your process. This process requires positive confirmation at each step. If I allocated a window for you, you have to tell me when you are done and whether or not you were successful. If you were not successful, then I will instigate the backout function.
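That window arbiter is small enough to sketch: a yes/no grant based on network health, a 5-minute deadline, and mandatory positive confirmation with automatic backout on failure. Everything here (the class, the health callable, the change ids) is a hypothetical illustration of the protocol, not a real product:

```python
import time

WINDOW_SECONDS = 5 * 60  # the 5-minute window

class WindowArbiter:
    """Grants maintenance windows and enforces positive confirmation."""

    def __init__(self, network_healthy):
        self.network_healthy = network_healthy  # callable: () -> bool
        self.active = None  # (change_id, deadline, backout) or None

    def request_window(self, change_id, backout, now=None):
        """Yes or no: can the network tolerate a window right now?"""
        now = time.time() if now is None else now
        if self.active is not None or not self.network_healthy():
            return False
        self.active = (change_id, now + WINDOW_SECONDS, backout)
        return True

    def report(self, change_id, success, now=None):
        """Positive confirmation. A failure, a late report, or a wrong
        change id all trigger the backout automation."""
        now = time.time() if now is None else now
        if self.active is None:
            return "no_window"
        cid, deadline, backout = self.active
        self.active = None
        if not success or now > deadline or cid != change_id:
            backout()
            return "backed_out"
        return "committed"

# A change that reports failure inside its window gets backed out.
events = []
arbiter = WindowArbiter(network_healthy=lambda: True)
granted = arbiter.request_window("chg-001", backout=lambda: events.append("backout"))
result = arbiter.report("chg-001", success=False)
print(granted, result, events)  # True backed_out ['backout']
```

A real arbiter would also run a timer that kills the process and fires the backout when the deadline passes with no report at all; the sketch only checks the deadline at report time.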
Automation can make changes, because these are the changes that were determined ahead of time to be valid at any time. These are the changes that would normally not be required to go through typical change mgmt. They’re the freebies. They are sometimes referred to as “Standard Changes.” Those are the only changes ever made on your network. The freebies. And because they are the freebies, they are repetitive and thus have automation attached to them.
Final Automation needs
Now here’s the last bit of automation you need to put into place. You have all these firewalls. They report whenever they drop a packet because it wasn’t allowed. Here’s what you do whenever that happens, and I do mean IN ALL CASES: you quarantine the offending node. Always. It is now connected to nothing because its default gateway has shunned it. Now, since this is an automaton-based response, you will get a response no later than 1 second after the offending initial packet was ejected by the workstation. So your employee clicked on the link in the email, that code immediately tried to contact its command node on some random port, which triggered the firewall to drop a packet, which triggered the shunning 1 second later. You suddenly can’t hack this network. You have 1 second. Go.
Now watch this: have the firewall always be willing to honor a PXE boot request, even if the port has been quarantined. Because my machine is virtual. If it gets compromised and I’ve been quarantined (which I can tell because I can’t connect to anything), what I do is reboot. When the hardware sends out the PXE boot request, the shun is lifted and the port returns to operational mode, because the device is getting a clean image to run. Not only did we completely prevent an attacker from being able to do anything, our recovery process was a simple reboot and we returned to operational stance using all the same hardware.
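The shun-on-drop rule plus the PXE exception together form a very small state machine. A sketch, assuming hypothetical event hooks (`on_firewall_drop`, `on_pxe_request`) that a real firewall integration would wire up:

```python
class QuarantineController:
    """Shun-on-drop with a single exception: a PXE boot request always
    lifts the shun, because the device is returning with a clean image."""

    def __init__(self):
        self.shunned = set()

    def on_firewall_drop(self, src_node):
        # A dropped packet means the node attempted something not on the
        # white list: quarantine it unconditionally, no human in the loop.
        self.shunned.add(src_node)

    def on_pxe_request(self, src_node):
        # PXE requests are always honored; the shun is lifted and the
        # port returns to operational mode.
        self.shunned.discard(src_node)

    def can_talk(self, src_node):
        return src_node not in self.shunned

qc = QuarantineController()
qc.on_firewall_drop("ws-42")   # the employee clicked the link
print(qc.can_talk("ws-42"))    # False: shunned within a second
qc.on_pxe_request("ws-42")     # the user simply reboots the VM
print(qc.can_talk("ws-42"))    # True: clean image, back in service
```

The design choice worth noticing: recovery requires no ticket and no operator action, because the only path out of quarantine is the one that guarantees a clean image.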
Now do you understand why we are doing all these things? Now can you understand how f’d up your current situation is? No one, and I mean not a single entity anywhere can accomplish that feat. And yet it falls out of this design.
And again, don’t forget the labor aspect of this. You won’t have any form of Engineer on your staff any longer. They aren’t needed. They set things up, and then they leave to move on to their next project. You may engage them again in the future if your Diagnosticians come up with a change that needs to be made. They also wrote all the automation, or at least the same company did. Automation is created cooperatively between engineers and developers. Automation is primarily a software endeavor, so it falls under the stricture of Release Management. Anything that a former human engineer would have done for you is now performed by automation. That’s why you have no engineers on staff.
In this environment monitoring is joined to automation, so there is no work to do. Things come up or are spun down as needed. Everything is configured so it can be adjusted at runtime, and then adjustments are made based on current conditions. Even a data center swing is kicked off by security monitoring.
The federal government will step in to regulate how networks are built.
At some point the federal government is going to be forced by recent circumstances to step in and regulate how networks and computer systems are built. They will have to. The fully distributed, independent chaos we have now isn’t working, and we are reaching the limit of its capabilities as we speak. I suspect we will get the first attempts at this from the Biden administration, as they are already telegraphing their thoughts on regulating how things are built around infrastructure.
What will happen is this. A very substantial break-in will occur, something really bad and really pervasive. I think the recent issue with Solarwinds is the right kind of event. Now imagine one where that kind of break-in occurs at 1/4 or 1/3 of every system in the United States at once. The public would cry out for accountability, and it will be determined that there is no mechanism to accomplish that. So they will make one. Building-code-level strictures that every network must adhere to. PCI-DSS goes away. No longer relevant. And inspection is performed by governmental bureaucrats, not companies trying to make a deal to give you a "Pass" on your inspection. (I actually worked for one of those at one point. I'm sure you can understand how my intrinsic brutal honesty was a problem there.)
Since this is building code, you can’t occupy (use) your network until it has passed inspection. Or you can’t connect to the Internet until you provide your ISP with a passing Inspection report.
Since this is building code, all networks become essentially the same, because they are all having to pass the same inspection. Which means any Network Engineer can walk in, sight unseen, and be able to make reasonable changes to the system. After all, that's how plumbers work, right? Any plumber can be expected to make positive changes to any plumbing situation. Which means:
To those of you out there who consider yourself a Network engineer:
Here’s the thing: we have almost without exception never actually been Network Engineers. NEs know how to untie an OSPF mismatch, or how to set up a switching architecture to minimize (Principle of Leasts) latency. When have you ever been asked to do something like that? In your career? Once? Twice maybe? So now here’s the thing: we are suddenly going to need a whole lot of diagnosticians in our NOCs. If you believe that job to be beneath you, I will politely ask you to step off your high horse and realize that in the new economy we are not going to need even 1/10 the number of people who currently bear the title Network Engineer. You will be competing with 15 other engineers for every seat in the companies that move forward. You better be that good. You have to be able to configure any device with an IP address, from any manufacturer. You have to have enough Python to communicate with the Dev team about automation.
Here’s the other thing. Unionize. Now. The reason you have to be on call 24x7x365 is because you are fighting for your humanity by yourself. This is exactly what unions were created to combat. Employees being exploited. If you are done being exploited, unionize.
Once unionized, then become a trade. This is the Apprentice, Journeyman, Master kind of promotional situation. And being a trade union means you are likely moving up in your ranks by virtue of licensure. A governmental body has verified that you are operating at a certain level, not a device vendor. Look, we have always been plumbers. Even our math is the same as plumber’s math. We were just never treated like plumbers. We were exploited. We were classified as exempt salary so that we weren't eligible for overtime pay. Then we were both asked to work 60 hours, and asked to track our time. If I am tracking my time, then I am hourly, not salary. You need to pay me 1.5x for the 20 hours over 40 that I worked last week.
So if that sounds familiar, I will ask you again:
Are you done being exploited?
Take your salary and divide it by the actual number of hours you worked. This is your effective hourly rate. Do the math. Is that acceptable?