Tag: ITIL4

  • CrowdStrike Outage “Not What You Thought”

    It’s been six months since the CrowdStrike outage – enough time to reflect on the incident and take stock. I had lunch with my CISO about a week after the outage. It was the first time we had seen each other in several weeks. “So,” I asked sheepishly, “how have you been since the outage?” “I’ve been fine. But the Service Desk has been swamped. Since my security team wasn’t that busy, we pitched in to help remediate the outage. They touched 15,000 servers and client machines in three days.” I inquired further. His role focused on the management of encryption keys that were necessary to unlock and manually patch the operating systems of the affected machines. “The hard part of the recovery was managing the keys,” he said. As his team was jointly responsible for the security of those keys, that was the extent of his involvement. You see, CrowdStrike pushed a bad patch – one file – but an important one that loads at the kernel level. This caused all of those Windows machines to “blue screen.”

    Something didn’t compute. I thought he was going to be falling asleep at the table, eyes bloodshot, bags under them, a quart jug of coffee in his hand. Instead, he seemed rather chipper. Then it hit me. This wasn’t a security incident. Rather, it’s what we call in ITSM a deployment and release management issue. It’s not that Security Management wasn’t involved; they were. But it was apparent early in the Problem cycle that this wasn’t a cyberattack.

    The response from our university IT was quick and appropriate. Within thirty seconds of the patches being applied, customers began to call and report “blue screens.” This spawned a number of related incidents at the Service Desk. These incidents were quickly correlated into a Problem record, which was upgraded to a major incident (i.e., outage) record in less than an hour, all of this happening around midnight on July 19th. During the early morning hours, an incident response team did a root cause analysis and quickly determined the problem was a vendor patch.

    The vendor response was quick and the patch was available by early morning, although the CEO of CrowdStrike was criticized in subsequent days for not issuing a timely apology. The damage to CrowdStrike’s reputation was done. After all, the outage affected roughly 8.5 million computers. CrowdStrike was quickly seen as the responsible party and IT folks around the world became heroes as the outage response progressed. But Microsoft was also responsible for letting CrowdStrike play in the Windows kernel. Microsoft distanced itself from responsibility by asserting, “Although this was not a Microsoft incident, given it impacts our ecosystem, we want to provide an update on the steps we’ve taken with CrowdStrike and others to remediate and support our customers.” Here, Microsoft was acting as an integrator, more specifically, as a Service Guardian, in that it both managed a third-party vendor (CrowdStrike) and provided services (Windows). In this situation, ITIL best practices dictate that we maintain a high level of communication and trust with the integrator, but also acknowledge that our customers will hold us – not our vendors – responsible. After all, who are our customers going to blame – us or our vendor?

    I see a double failure here. CrowdStrike failed by deploying a service with a critical bug in it, which they should’ve uncovered in their acceptance testing. This is not George Kurtz’s first high-visibility failure. In 2010, he was CTO of McAfee when a similar outage occurred. The second failure was Microsoft’s mismanagement of their vendor. One may ask why they allowed a vendor to deploy a file at the kernel level without sufficient testing. You would also expect Microsoft to have caught the error prior to approving the release of the errant file. Was Microsoft’s trust of CrowdStrike so great that they didn’t do acceptance testing and simply passed the updates through? If so, they need to review their Deployment and Release Management practices. Of course, this is pure speculation.

    Meanwhile, back at “the ranch,” the IRT created a Change Request that included testing of the patch on a number of machines. Procedures to apply the patch were documented at both the individual asset level and the more strategic coordination level. On the communication side, customer communication began as soon as the Problem was identified, about an hour into the incident, with a number of communications happening in the early morning hours via IT staff in the colleges and university communications to stakeholders. Communication continued through the next few days as the incidents were remediated and non-reported servers and endpoints patched. An After Action Review was conducted less than a week after the initial incident was reported. Lessons learned were documented. DONE!! YAY!!!

    Since I retired from IT, I’m an “observer” these days and I can tell you that I don’t miss the excitement surrounding outages. Been there, done that, got the t-shirt. But I must say that I’m very proud of the way our university handled this major incident – responsive, professional, by the book. I don’t think our response would’ve been as good five years ago. We’ve come a long way in our journey in understanding ITSM.

    In summary, what ITSM practice areas were involved in this outage?

    1. Service Desk
    2. Incident Management
    3. Problem Management
    4. Continuity Management (via Major Incident/Outage)
    5. Vendor Management
    6. Asset Management
    7. Relationship Management (i.e., communication with stakeholders)
    8. Change Management
    9. Security Management (indirectly)

    This is a pretty impressive slice of the ITIL ITSM Practices for a single issue. I think our IT folks would report that we have varying levels of maturity in each of the Practice areas, but I can tell you from experience that this kind of outage hones our skills to respond better the next time. Iron sharpens iron.

  • Lost Improvements: An Analogy to Defects

    Defects are not free. Somebody makes them, and gets paid for making them.

    W. Edwards Deming

    To summarize Deming’s teaching on defects, they cost an organization thrice. First, the defect is made, which robs the organization of a “working” product or service. Second, the defect must be identified, which also takes time and resources. Lastly, the defect must be resolved, thus taking more resources away from producing non-defective products and services. If this isn’t bad enough, these costs don’t include opportunity costs which could be mitigated with improvements.

    In manufacturing (and IT ;-)), a defect happens because of a quality failure either at the source or somewhere upstream. Once a defect is built into a product, there are two ways to detect it. First, it may be detected prior to shipping. Second, the customer may see the defect, which is significantly worse from a CX perspective. To draw the analogy to lost improvements, if there is no system in place to record improvements, that’s the equivalent of allowing a defect to get to the customer. Lack of improvement causes more technical debt and operational overhead down the line and will be reflected in much of the work that is done by the organization. These defects will be visible to customers, one way or another. How does an organization create a culture of continual improvement?

    First, an organization must embrace a culture of improvement. According to ITIL4, a culture of improvement requires three things: transparency, managing by example, and building trust (CDS, 2.3.4, 2.3.8). I’ll treat these three topics in more detail in a future post, but suffice it to say that my perspective is that the former are dependent on the latter – that is, trust is the “coin of the realm” and other aspects of an improvement culture are dependent on it. For example, organizations that have a high degree of trust manifest a corresponding high level of transparency.

    Second, an organization must provide mechanisms for conserving, prioritizing, and executing improvement initiatives. Starting with a Continual Improvement Register (CIR) is a good first step. If systems are too prescriptive, or improvement processes are not defined, team members don’t feel empowered (or able) to record improvement ideas. Without improvement, the organization will continue to produce defects. Making the CIR accessible at all levels of the organization is also recommended. Appointing a dedicated improvement person or small team responsible for prioritizing and executing on those improvement opportunities closes the loop. Communicating the status of improvement opportunities creates buy-in from the organization and keeps the suggestions rolling in. In my experience, organizations go awry in the second requirement. They may build a culture of trust and improvement, but that culture must be operationalized to realize the true benefits.
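
    To make the idea of a register concrete, here is a minimal sketch in Python of what a CIR might track. Everything in it is an assumption for illustration only: the field names, the 1-5 benefit and effort scores, and the benefit-divided-by-effort priority rule are hypothetical, not something ITIL prescribes. A shared spreadsheet or a queue in an ITSM tool would serve the same purpose.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PROPOSED = "proposed"
    IN_PROGRESS = "in progress"
    DONE = "done"


@dataclass
class Improvement:
    title: str
    submitted_by: str
    benefit: int  # hypothetical 1-5 score: expected value to customers
    effort: int   # hypothetical 1-5 score: cost to implement
    status: Status = Status.PROPOSED

    @property
    def priority(self) -> float:
        # Illustrative scoring rule (an assumption): favor high benefit, low effort.
        return self.benefit / self.effort


@dataclass
class ImprovementRegister:
    entries: list[Improvement] = field(default_factory=list)

    def record(self, item: Improvement) -> None:
        """Conserve the idea: anyone in the organization may add an entry."""
        self.entries.append(item)

    def backlog(self) -> list[Improvement]:
        """Prioritize: ideas not yet started, highest priority first."""
        open_items = [e for e in self.entries if e.status is Status.PROPOSED]
        return sorted(open_items, key=lambda e: e.priority, reverse=True)

    def status_report(self) -> dict[str, int]:
        """Communicate: number of entries in each status."""
        report = {s.value: 0 for s in Status}
        for e in self.entries:
            report[e.status.value] += 1
        return report


cir = ImprovementRegister()
cir.record(Improvement("Automate password resets", "service desk", benefit=5, effort=2))
cir.record(Improvement("Rewrite onboarding guide", "HR liaison", benefit=3, effort=3))
cir.record(Improvement("Consolidate monitoring tools", "operations", benefit=4, effort=4))

top = cir.backlog()[0]
top.status = Status.IN_PROGRESS  # execute: the improvement team picks up the top item
print(top.title)
print(cir.status_report())
```

    The three methods map onto the three mechanisms named above: record conserves ideas, backlog prioritizes them, and status_report gives the organization something to communicate back to the people who submitted them.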

  • The Non-Technical Economy

    It seems that everything these days is about AI and how the world as we know it will end. Some are prophesying that entire swaths of our economy will be replaced by AI. Writers are lining up on both sides of the argument. For those of us who are IT professionals, much has been written (or assumed) about the use of AI in IT.

    In 2019, Brian Merchant wrote, “A robot is not ‘coming for’, or ‘stealing’ or ‘killing’ or ‘threatening’ to take away your job. Management is.” Is that true?

    While most people are focusing on those things that AI can do to replace human work, there are a significant number of things that AI can’t do better than humans. Interestingly, I asked AI (i.e., Microsoft Copilot) for a list of things that humans do better. Here’s the list:

    1. Emotional Intelligence: Humans are naturally equipped with the ability to understand, interpret, and respond to emotions in a way that AI cannot. This includes empathy, compassion, and the ability to read subtle social cues, which are essential for meaningful human interactions.
    2. Creativity and Innovation: While AI can generate new content based on existing data, humans have the innate ability to think outside the box, imagine completely new concepts, and create innovative solutions to problems.
    3. Ethical and Moral Judgment: Humans can navigate complex ethical dilemmas and make decisions based on a nuanced understanding of morality, societal norms, and personal values. AI lacks the consciousness to make such judgments.
    4. Adaptability and Flexibility: Humans can easily adapt to new situations, learn from diverse experiences, and apply knowledge in various contexts. AI systems, on the other hand, can struggle with tasks that deviate from their training data or require adaptation to new environments.
    5. Complex Problem-Solving: Humans are capable of solving poorly defined or novel problems that require intuition, judgment, and the ability to find connections between seemingly unrelated information. AI systems are typically limited to solving algorithmic problems and can struggle with tasks that cannot be systematically defined.
    6. Physical Dexterity and Mobility: Humans have fine motor skills and the ability to navigate complex physical environments, which are challenging for AI and robotics to replicate with the same level of proficiency.
    7. Interpersonal Skills: The ability to build relationships, collaborate effectively, and navigate social networks is a distinctly human trait. AI can assist with communication but cannot fully replicate the depth of human interaction.
    8. Learning from Limited Data: Humans can learn new skills or understand new concepts from a few examples, whereas AI often requires large datasets to learn effectively.
    9. Understanding Context and Nuance: Humans excel at understanding context, sarcasm, irony, and nuanced language, which can be challenging for AI to interpret correctly.

    It’s important to recognize that AI is a tool designed to augment human abilities, not replace them. The collaboration between human intelligence and AI has the potential to enhance productivity and innovation across various fields.

    What’s interesting about this list is that most of these skills are closely related to those needed to provide excellent IT service management. As the emphasis in IT has grown over the last three decades from technical to customer-service competencies, the identification of these soft skills has been one of the ways the profession has defined and delineated itself. Take, for example, the list of skills necessary to provide excellent service desk support (ITIL4 Foundation Training, 2024):

    • Customer service
    • Empathy
    • Incident analysis and prioritization
    • Effective communication
    • Emotional Intelligence

    It would appear, at least at this moment in time, that AI will not be able to do some of the fundamental things we do in IT service management. Indeed, a survey of the industries considered most susceptible to “takeover” by AI includes manufacturing, finance, healthcare, cybersecurity, and education. Note that these fields don’t rely heavily on stakeholder interactions to be effective.

    So why are “managers” still trying to replace us? I think the answer is that they are thinking in a binary way – either we use AI to do work or we use humans. The real answer is that AI will augment and complement humans in IT service management, not replace them. The collaboration between human intelligence and AI has the potential to enhance productivity and innovation across various fields. This is reflected in the newest ITIL4 Create, Deliver, Support curriculum which stresses the effective integration of AI, among other tools. Mature IT Managers will realize that AI is a tool that can automate steps of the value stream, but at the end of the day, customers will have better outcomes and realize more value if humans are left to do what humans do best.

  • ITIL 4 and Aggregation Theory

    Back in the days of ITILv3, focusing on process was the right thing to do at the time. Building out robust, documented, repeatable processes went a long way toward consistent service delivery, and for many years, this approach to service management was enough. Then in the late two-thousands, significant changes in availability of IT service suppliers and the flattening of service delivery created a situation in which our customers, who had historically been a “captive” audience, now had choices. They quickly learned that we weren’t the only game in town. They had choices from outside the organization. Enter shadow IT. Were we still relevant to our customers? If our role wasn’t service provision, what was it?

    When ITIL 4 came around, the framework transitioned from an internal process-heavy focus to an external, customer-centered focus. At the time, the shift toward customer value “felt” right, but I couldn’t put my finger on the reason why. For a number of years, I had noticed that our customers were reasonably happy with the services we provided. But when we started engaging them strategically with BRM (Business Relationship Management) by fostering a relationship in order to understand their business and what they really valued, their happiness increased significantly. This practice worked in a big way, but why?

    Today, I made a connection between the outsized results we reaped with BRM and Aggregation Theory. The basic idea of aggregation theory is that value chains have three different groups: suppliers, distributors, and consumers/users [1]. Before the Internet disrupted everything, distribution was expensive. Take the example of newspapers. Newspapers had to be physically distributed. Competitive advantage was gained by the distributors (e.g., The New York Times, The Washington Post, etc.) integrating the suppliers (i.e., journalists). This worked because customers outnumbered suppliers. A distributor that integrated supplier relationships had a significant advantage over distributors that didn’t. This was integration up the value stream.

    Post-Internet 2.0, the cost of the customer transaction decreased to practically zero as distribution became aggregated. Using our example, newspapers moved to digital editions and the cost of distribution decreased. But along with lowering customer transaction costs came de-personalization of the relationship. I missed the sight of my paperboy meandering down the street on his bike only to toss my paper in the bushes. In the new era, customers became weary of thousands of scattershot email solicitations, the rampant buying and selling of their information, and the always annoying automated feedback requests.

    “You’ve been chosen as one of our special customers to give us feedback today. For your time and effort, you’ll be eligible to receive a totally worthless coupon that you can’t redeem unless you stand on your head, pat your belly, and cough three times.”

    Customers actually missed drop-in visits from support team members, calls from their sales reps, and conversations with the engineering teams. The ubiquity of low-value customer connections had increased the value of the personal relationship. And it wasn’t just the relationship, it was the nature of what we did for them. While we continued to provide some (if not all) of their IT services, our role had to shift to that of a strategic partner. We had to grieve that we would no longer have the exclusive affections of our customers and accept that they had become poly-amorous, so to speak.

    This is why the focus on value and relationship has taken center stage today. Successful organizations will be those that provide the best user experience. This means an increased focus on customer relationships and a careful curation of customer experiences – integrating customers down the value stream. It means continuously understanding what the customer really values. It means getting out and talking to our customers, and I don’t mean our robots talking to their robots. I mean WE have to talk with THEM.


    [1] Incidentally, ITIL 4 simplifies this model by describing two top-level roles: providers and consumers, and then extending the concept to the three-part model by stating that organizations are both consumers and providers. ITIL 4 focuses on the relationships between organizations in the service relationship model.