Strategies to succeed in cyber-resilience for grids

So we all want to build 100% renewable grids in the next 20 years and beyond. And we want them to be big and interconnected to hundreds of millions of devices. OK fine, but how the hell do we make something this big and complex resilient to cyber attacks and failures from the interweb? How do we start taking the first steps on such a monumental task?

We need a strategy for building big complex things that don’t break.

Building complex things is tough

Building big things that don’t break is hard. The Romans built aqueducts, which were a huge triumph for the empire. But as they found out, operating them required continuous and costly maintenance. And towards the end of the empire, the aqueducts became a weak point that invading armies could use to cut off water supplies and inflict huge economic damage that could not be easily repaired or replaced.

In the next decade, most OECD countries will completely rely on their electricity infrastructure for survival. We will need to set aside sufficient funds to keep it well maintained and operating smoothly. And we also need to design it to be resilient to damage, intentional or otherwise, and not a weak link for foreign invaders. Whilst cybersecurity and internet-based technologies are the new weapons on the block, these threats to our cities date back thousands of years, as do the approaches to building resilient infrastructure.

Now the blog topic here is not about the Romans – it is about large and complex digital software systems that are essential for everyday survival. Think air traffic control software, industrial control systems, swarms of controllable IoT devices – things that we use and need every day. How can we trust that they will continue to work this way into the future? Who can we trust to make sure these are well maintained, and are resilient against the worst possible attacks?

Maintaining complex things is tougher

But why is it that big things are so expensive to maintain? It’s generally because they are really complex. Nature hates complexity so much that it made a law against things staying ordered for too long (the second law of thermodynamics). That means pieces of any big system will inevitably fail, and in complex systems such as aqueducts and software control systems, that leads to a whole range of potential cascading failures that can bring the whole system to its knees.

For an electricity system to be fully resilient in practice, we would need to predict all the different ways in which the system can fail, and be prepared for each and every eventuality. In theory, if I have 10 interconnected software components that could fail in any order, that’s 10 factorial, or around 3.6 million, failure sequences that we might have to contend with.

In practice, for a modern electricity grid that might have hundreds of millions of components, the number of permutations is far too large to enumerate – but nor are most of any consequence. There will only be a small subset of these failure modes that require some detection, repair and restoration, and an even smaller number that are critical to keeping the entire system running at all times. The point here though is that complexity is the enemy of resilience, and the number of ways the system can fail grows at least exponentially with the number of moving parts. This is particularly acute when they are tightly coupled and interconnected, such as decentralised energy grids that are connected to the internet.
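To make the arithmetic concrete, here is a minimal Python sketch of the counting argument. It assumes the simplest possible model, where every component fails exactly once and only the order matters, so the count is n factorial. The function names are illustrative only and not taken from any grid tool.

```python
import math


def failure_sequences(n: int) -> int:
    """Number of distinct orders in which n components can all fail (n!)."""
    return math.factorial(n)


def digits_in_failure_sequences(n: int) -> int:
    """Approximate decimal digit count of n!, using log-gamma so the
    huge integer itself never has to be built."""
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1


print(failure_sequences(10))  # 3628800, the "around 3.6 million" figure
print(digits_in_failure_sequences(100_000_000))  # digit count only, for a grid-scale n
```

For a grid-scale number of components, even the count of digits in the answer runs to hundreds of millions, which is why enumerating every sequence is off the table and prioritising the few critical failure modes is the only workable approach.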

The point is, the more moving parts the more complexity there is – and complexity is the arch enemy of resilience.

Learn how to fail gracefully

We all remember the times when we tripped over something while walking down the street, stumbled awkwardly and just managed to avoid falling flat on our face – if we were lucky – with embarrassment avoided. Being able to recover quickly from a fault, failure or even an attack is crucial for big things that we depend on. Our modern airliners have multiple redundancies in all mechanical, electrical and digital systems, multiple trained pilots who can adapt to unexpected situations, and emergency systems for when it all goes bad. Redundancy, training and preparedness, monitoring, detection, rapid response and recovery, and emergency backup are all part of establishing the resilience of airline operations.

Translating this thinking into the rapidly shifting world of decentralised energy, IoT, and Cloud-enabled electricity grids that are a clear target for nefarious actors, we need to start preparing ourselves for the millions of possible failure and recovery scenarios. Of course, we have to do the basics – training staff to detect and adapt to failures or attacks, building in redundancies and emergency backup systems, and so on. But in trying to identify, plan for, detect and respond to the worst-case scenarios, we can no longer assume it’s only the attacks on our largest thermal generators or wind farms that we need to plan for; we also need to plan for the cascading dominoes of smaller things that can have big impacts.

To fall over gracefully in the context of a 100% renewable cyber-resilient electricity grid, it will become necessary to track and monitor all of the possible ways in which the system has failed or may fail, whether of its own accord or through outside influences. Cyber threats in the electricity sector will continue to rise. So too will new forms of AI that can automate intrusions and employ social engineering on a daunting scale. And there’s the fact that our power systems are plugging in millions of new ways to fail every year. Maintaining the graceful resilience of this giant, real-time, globally interconnected electricity supply chain is a thankless task that would leave even the Romans scratching their heads.

So how do we make a start eating this elephant?

Collect data on critical systems

As we continue down this journey to 100% renewables, we are rapidly learning that to tame the beast, we need more data and visibility into what assets are responding to market conditions, and more detail on how they might fail.

When operators dispatch distributed energy portfolios, such as battery storage and EV charging systems, into energy and ancillary markets, it comes as little surprise that standing data is required upon registration, along with some real-time telemetry for operational visibility. Two recent examples from AEMO are the VPP trials in the FCAS markets, and Scheduled Lite, which is intended to allow VPP dispatch in the 5-minute energy spot market. Similarly for distribution networks, the network flex markets run by the UK DNOs and the flexible exports constraint regulations in Australia both have non-trivial standing and operational data requirements.

So it should be no surprise that to reach a state of blissful cyber-resilience against any type of unintentional failure or malicious attack, we need more data. So, what data do we need to collect to understand the myriad of failure modes, and how do we prioritise them? Obviously, the biggest portfolios of these decentralised assets will have the biggest swing factor on power system security, so they’re clearly a high priority. But so too are Cloud-based data services that other dispatch systems depend on, such as weather feeds, price feeds and so on.

The answer to this is some combination of Threat Intelligence (the actors and attack vectors), Systems Intelligence (the products in the field), and Grid inventories (the assets and network locations). That’s not a very concise answer, but it suggests some general directions for operators and regulators on where we will need to go in the coming decade.
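As a rough illustration of how those three data sets might fit together, the Python sketch below joins a threat record to an asset inventory through the product in the field, then ranks network locations by exposed capacity. Everything in it is hypothetical: the field names, the product and feeder labels, and the choice of megawatts as the ranking measure are assumptions made for this example, not an existing standard or schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    """Threat intelligence: who, how, and which product is affected."""
    actor: str
    attack_vector: str
    affected_product: str


@dataclass(frozen=True)
class Asset:
    """Grid inventory entry, tagged with the product deployed in the field."""
    asset_id: str
    product: str            # systems intelligence: the product in the field
    network_location: str   # grid inventory: where it connects
    capacity_mw: float


def exposed_capacity_by_location(threats, assets):
    """Sum the MW of assets running a threatened product, per network
    location, and return the locations with the largest exposure first."""
    affected = {t.affected_product for t in threats}
    totals = {}
    for a in assets:
        if a.product in affected:
            totals[a.network_location] = totals.get(a.network_location, 0.0) + a.capacity_mw
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical sample data, for illustration only.
threats = [Threat("unknown", "firmware update channel", "inverter-x")]
assets = [
    Asset("a1", "inverter-x", "feeder-1", 5.0),
    Asset("a2", "inverter-x", "feeder-2", 12.5),
    Asset("a3", "battery-y", "feeder-1", 20.0),
    Asset("a4", "inverter-x", "feeder-1", 2.5),
]
ranked = exposed_capacity_by_location(threats, assets)
print(ranked)
```

The design choice worth noting is the join key. Without a shared product identifier across the threat feed and the inventory, the three data sets cannot be combined at all, which is one reason standardised data matters.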

Establish who is in charge

Cyber resilience is clearly an important capability that must be amped up this decade. So who has overall responsibility for grid cyber-resilience? Who has responsibility over specific asset portfolios? Well, the answer is pretty murky…

For the East Australian National Electricity Market, it looks something like this:

For the big end of town – the large generators, transmission and distribution networks, and others that fall under the Security of Critical Infrastructure (SOCI) Act – there are delegations between the federal government agencies and the Responsible Entities under the Act. Cybersecurity is still the domain of each organisation (with some step-in powers in rare circumstances).
Analysis of faults and failures of large generators and loads sits with AEMO, which as the Transmission System Operator must ensure continuity of the power system in the case of trips or outages, and ensure there are sufficient reserves.
For individual organisations, each has a responsibility to meet the appropriate information and cyber security standards and practices. Of course, most of these are voluntary, and based on maturity rather than prescriptive obligations, so that means it’s a pretty mixed bag. Sharing information is again not required, and very non-standard, which means it’s hard to communicate intelligence and reliability data to other parties in the energy system. As for business continuity, that means varying things to different organisations, and there’s even less consistency there.
For smaller scale energy assets (less than 30 MW), there are no specific provisions under the rules for faults, failures or cyberattacks, other than penalties for non-performance under the market participation rules. This is perhaps one of the more pressing challenges, as aggregations of these smaller assets are now in the hundreds of MW and they still fall outside any cyber compliance obligations.
Then we have the distribution network businesses, each with the task of looking after their IT and OT infrastructure assets, but only as far as the grid connection, with no insight into or responsibility for what happens on the other side of the connection point. IoT devices are often complicated, as in some instances they can be seen as OT systems, and in other instances as IoT systems that sit on the customer site.
And finally, we have the OEMs and technology partner companies in supply chains, who may be (and usually are) operating out of data centres all over the world. Many do not fall under Australian law, and are largely unregulated.

So I think you get the point. It’s complicated. And there are plenty of ways to fail that would go undetected for some time, and could even lead to large-scale outages.

To start on the road to cyber resilience, regulators and governments need to be clearer about who is responsible when things go wrong. We often assume that these scenarios are covered by the largest bodies, such as the Transmission System Operator or even the national cybersecurity centres. However, the reality is that cyber resilience of our electricity grids needs a decade-long plan, with clear governance and ownership.

Start moving on regulation

Whilst cybersecurity is a red-hot topic, few grid operators anywhere in the world have figured out how to plan for massively complex cyber resilience and the role of small and medium scale energy resources. However, there are a few areas regulators can start looking at to make some small but meaningful steps.

Aggregator obligations. Whilst not promoting regulation as the only way to solve this problem, it is perplexing that there are regional and global aggregators of gigawatts of DER and IoT capacity, yet no legal mechanisms by which governments and market operators can impose operational requirements on these participants. If ever there were a reason to regulate, then both power system security and national security risks should put this high on the agenda for regulators.
Clearer responsibilities based on impact. Transmission system operators need to take oversight of the protection, detection and response to the various failure scenarios that impact power system security, but will also need to coordinate with all other levels in the system. Not rocket science, but each of them needs to look after their own security realms of identify/protect/detect/respond, and communicate threat intelligence to others affected. The energy regulators need to publish clear regulatory guidelines and certification processes that define the types of activities to be undertaken by each party, and standard message formats for real-time communications between the layers and their participants.
Clearer responsibilities on the supply chain. Like it or not, the electrification journey will make more things interdependent, and make it very hard to predict the ripple effects of a failure or system compromise unless all parties share in this responsibility. The key ingredient here is data on the software lifecycles and interdependencies of critical software systems, and then methods for reporting issues with the supply chains when things go bad. This too will be a matter for regulators, who will need to ensure that systems large enough to cause large impacts have adequate threat intel and real-time communications established in each market in which they operate.

These are a few strategies that will be imperative in the years to come. But it’s a decade-long endeavour to build in cyber resilience, and we are likely to have a crisis or two on the way. If we establish who is in charge, and put in place the most basic incident response processes, we will grow
