Continuous Architecture in Practice Notes

The book Continuous Architecture in Practice gives a solid idea of how to weigh other quality attributes while dealing with security. It sheds light on scalability, reliability, performance, maintainability, cost, and security, and it is a must-read when dealing with architecture reviews. Please support the authors by buying and reading this wonderful book. In this post I have shared the notes I took while reading it. The Addison-Wesley Signature Series is a good set of books to read for current software trends and technology.

  1. Principle 1: Architect products; evolve from projects to products. Architecting products is more efficient than just designing point solutions to projects and focuses the team on its customers.
     Principle 2: Focus on quality attributes, not on functional requirements. Quality attribute requirements drive the architecture.
     Principle 3: Delay design decisions until they are absolutely necessary. Design architectures based on facts, not on guesses. There is no point in designing and implementing capabilities that may never be used—it is a waste of time and resources.
     Principle 4: Architect for change—leverage the “power of small.” Big, monolithic, tightly coupled components are hard to change. Instead, leverage small, loosely coupled software elements.
     Principle 5: Architect for build, test, deploy, and operate. Most architecture methodologies focus exclusively on software building activities, but we believe that architects should be concerned about testing, deployment, and operation, too, in order to support continuous delivery.
     Principle 6: Model the organization of your teams after the design of the system you are working on. The way teams are organized drives the architecture and design of the systems they are working on.
     Those principles are complemented by a number of well-known architecture tools, such as utility trees and decision logs.

  2. We would argue that creating a software architecture entails making a series of compromises between the requirements, the decisions, the blueprints, and even the ultimate architecture artifact—the executable code itself.

  3. The Continuous Architecture approach focuses on delivering software rather than documentation. Unlike traditional architecture approaches, we view artifacts as a means, not as an end.

  4. The time dimension is a key aspect of Continuous Architecture. Architectural practices should be aligned with agile practices and should not be an obstacle. In other words, we are continuously developing and improving the architecture rather than doing it once.

  5. This brings us to the question, What is the unit of work of an architect (or architectural work)? Is it a fancy diagram, a logical model, a running prototype? Continuous Architecture states that the unit of work of an architect is an architectural decision. As a result, one of the most important outputs of any architectural activity is the set of decisions made along the software development journey. We are always surprised that so little effort is spent in most organizations on arriving at and documenting architectural decisions in a consistent and understandable manner, though we have seen a trend in the last few years to rectify this gap.

  6. A problem in many modern systems is that the quality attributes cannot be accurately predicted. Applications can grow exponentially in terms of users and transactions. On the flip side, we can overengineer the application for expected volumes that might never materialize. We need to apply principle 3, Delay design decisions until they are absolutely necessary, to avoid overengineering.

  7. The technological singularity is defined as the point at which computer (or artificial) intelligence exceeds human capacity. After this point, events become unpredictable.

  8. Another challenge we have seen is how principles are written. At times, they are truisms—for example, “all software should be written in a scalable manner.” It is highly unlikely that a team would set out to develop software that is not scalable. Principles should be written in a manner that enables teams to make decisions.

  9. Although we state that architecture is becoming more of a skill than a role, it is still good to have a definition of the role. As mentioned earlier, in The Mythical Man-Month, Brooks talks about the conceptual integrity of a software product. This is a good place to start for defining the role of architects—basically, they are accountable for the conceptual integrity of the entity that is being architected or designed.

  10. Following are a few key characteristics that should be considered in determining your approach. Simplicity: Diagrams and models should be easy to understand and should convey the key messages. A common technique is to use separate diagrams to depict different concerns (logical, security, deployment, etc.). Accessibility to target audience: Each diagram has a target audience and should be able to convey the key message to them. Consistency: Shapes and connections used should have the same meaning. Having a key that identifies the meaning of each shape and color promotes consistency and enables clearer communication among teams and stakeholders.

  11. A key benefit of principle 6, Model the organization of your teams after the design of the system you are working on, is that it directs you to create cohesive teams structured around components. This principle greatly increases the ability of the teams to create a common language and is aligned with the concepts of bounded context and ubiquitous language.

  12. As a side note on microservices, we want to emphasize that the overall objective is to define loosely coupled components that can evolve independently (principle 4, Architect for change—leverage the “power of small”). However, micro can be interpreted as a desire to have components with an extremely small scope, which can result in unnecessary complexity of managing dependencies and interaction patterns.

  13. This challenge is elegantly summarized in Eric Brewer’s CAP theorem. It states that any distributed system can guarantee only two out of the following three properties: consistency, availability, and partition tolerance. Given that most distributed systems inherently deal with a partitioned architecture, it is sometimes simplified as a tradeoff between consistency and availability. If availability is prioritized, then all readers of data are not guaranteed to see the latest updates at the same time. However, at some unspecified point in time, the consistency protocol will ensure that the data is consistent across the system. This is known as eventual consistency, which is a key tactic in NoSQL databases for both scalability and performance. The CAP theorem is further expanded in Daniel Abadi’s PACELC, which provides a practical interpretation of this theorem: If a partition (P) occurs, a system must trade availability (A) against consistency (C). Else (E), in the usual case of no partition, a system must trade latency (L) against consistency (C).

  14. First stop is Postel’s law, also known as the robustness principle: Be conservative in what you do, be liberal in what you accept.

  15. Now that we have the first version of the security design for our system, we’re in a good position to understand our security risks and to be confident that we have chosen the right set of mitigations (mechanisms). However, this is only the first version of the threat model, and as we do for most architectural activities, we want to take the “little and often” approach so that the threat modeling starts early in the life cycle when we have very little to understand, analyze, and secure. Then, as we add features to our system, we keep incrementally threat modeling to keep our understanding up to date and to ensure we mitigate new threats as they emerge; this is an example of principle 3, Delay design decisions until they are absolutely necessary. This approach allows threat modeling, and security thinking in general, to become a standard part of how we build our system, and by going through the process regularly, our team members rapidly become confident and effective threat modelers.

  16. The problem with dealing with a ransomware attack is that, by the time the organization realizes it has a problem, it is probably too late to stop the attack, and so it becomes a damage limitation and recovery exercise.

  17. There are three main parts of secrets management: choosing good secrets, keeping them secret, and changing them reliably.

  18. Sensitive operations involving security or large financial transfers should be candidates for “four-eyes” controls, so that two people must agree to them.

  19. We have added some complexity to our system, and possibly operational cost to maintain the security mechanisms, but unfortunately, some complexity and cost are inevitable if we want effective security.

  20. One way to look at scalability is to use a supply-and-demand frame of reference. Demand-related forces cause increases in workload, such as increases in the number of users, user sessions, transactions, and events. Supply-related forces relate to the number and organization of the resources needed to meet increases in demand, such as the way technical components (e.g., memory, persistence, events, messaging, data) are handled.

  21. As Abbott and Fisher point out, “The absolute best way to handle large traffic volumes and user requests is to not have to handle it at all. . . . The key to achieving this is through the pervasive use of something called a cache.” Caching is a powerful technique for solving some performance and scalability issues. It can be thought of as a method of saving results of a query or calculation for later reuse. This technique has a few tradeoffs, including more complicated failure modes, as well as the need to implement a cache invalidation process to ensure that obsolete data is either updated or removed.
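
A minimal sketch of the idea, assuming a simple in-process cache with a time-to-live (TTL) and explicit invalidation; the `load_exchange_rate` function and the data it returns are hypothetical stand-ins for an expensive query or remote call.

```python
import time

class TTLCache:
    """Tiny in-process cache: entries expire after ttl_seconds or when explicitly invalidated."""
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self._store = {}  # key -> (value, expiry_timestamp)

    def get(self, key, loader):
        value, expires_at = self._store.get(key, (None, 0))
        if time.time() < expires_at:
            return value                      # cache hit: skip the expensive call
        value = loader(key)                   # cache miss: fetch or compute the result
        self._store[key] = (value, time.time() + self.ttl)
        return value

    def invalidate(self, key):
        self._store.pop(key, None)            # obsolete data is removed, forcing a refresh next time

# Hypothetical usage: cache the result of a slow lookup for 30 seconds.
rates = TTLCache(ttl_seconds=30)
def load_exchange_rate(currency):
    return {"USD": 1.0, "EUR": 0.92}.get(currency)

print(rates.get("EUR", load_exchange_rate))   # first call hits the loader
print(rates.get("EUR", load_exchange_rate))   # second call is served from the cache
rates.invalidate("EUR")                       # the invalidation step the tradeoff above refers to
```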

  22. Any message processing logic is implemented in the services that use the message bus, following what is sometimes called a smart endpoints and dumb pipes design pattern. This approach decreases the dependency of services on the message bus and enables the use of a simpler communication bus between services.

  23. So why do stateful services create scalability challenges? For example, HTTP is considered stateless because it doesn’t need any data from the previous request to successfully execute the next one. Stateful services need to store their state in the memory of the server they are running on. This isn’t a major issue if vertical scalability is used to handle higher workloads, as the stateful service would execute successive requests on the same server, although memory usage on that server may become a concern as workloads increase. As we saw earlier in this chapter, horizontal scalability is the preferred way of scaling for a cloud-based application such as TFX. Using this approach, a service instance may be assigned to a different server to process a new request, as determined by the load balancer. A stateful service instance processing that request would then be unable to access the state and therefore unable to execute correctly. One possible remedy would be to ensure that requests to stateful services are always routed to the same service instance, regardless of the server load. This could be acceptable if an application includes a few seldom-used stateful services, but it would create issues otherwise as workloads increase.
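
The alternative usually preferred over pinning requests to one instance is to move the state out of the service's memory into a shared store, so that any instance behind the load balancer can serve the next request. A rough sketch of that idea follows; the use of Redis, the key naming, and the session structure are my own assumptions for illustration, not the book's TFX design.

```python
import json
import redis  # assumes a Redis server is reachable at this host/port

store = redis.Redis(host="localhost", port=6379)

def handle_request(session_id, item):
    """Any service instance can run this: the session state lives in Redis, not in local memory."""
    raw = store.get(f"session:{session_id}")
    state = json.loads(raw) if raw else {"items": []}
    state["items"].append(item)
    store.set(f"session:{session_id}", json.dumps(state), ex=1800)  # expire idle sessions after 30 min
    return state
```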

  24. It is important to remember that for cloud-based systems, scalability (like performance) isn’t the problem of the cloud provider. Software systems need to be architected to be scalable, and porting a system with scalability issues to a commercial cloud will probably not solve those issues.

  25. One way to look at performance is to see it as a contention-based model, where the system’s performance is determined by its limiting constraints, such as its operating environment. As long as the system’s resource utilization does not exceed its constraints, performance remains roughly linearly predictable, but when resource utilization exceeds one or more constraints, response time increases in a roughly exponential manner.

  26. When we discuss performance, we are concerned about timing and computational resources, and we need to define how to measure those two variables. It is critical to obtain clear, realistic, and measurable objectives from our business partners in order to evaluate the performance of a system. Two groups of measurements are usually monitored for this quality attribute. The first group of measurements defines the performance of the system in terms of timings from the end-user viewpoint under various loads (e.g., full-peak load, half-peak load). The requirement may be stated from the end-user viewpoint; however, the measurements should be made at a finer-grained level (e.g., for each service or computational resource). The software system load is a key component of this measurement set, as most software systems have an acceptable response time under light load.

  27. However, a common fallacy is that performance is the problem of the cloud provider. Performance issues caused by poor application design are not likely to be addressed by running the software system in a container on a commercial cloud, especially if the architecture of the system is old and monolithic. Attempting to solve this kind of performance challenge by leveraging a commercial cloud is not likely to be successful and probably won’t be cost effective.

  28. Caching is a powerful tactic for solving performance issues, because it is a method of saving results of a query or calculation for later reuse. Caching tactics are covered in Chapter 5, so only a brief summary is given here (a small sketch of the first tactic follows this list):
      Database object cache: This technique is used to fetch the results of a database query and store them in memory. The TFX team has the option to implement this technique for several TFX services. Because early performance testing has shown that some TFX transactions, such as an L/C payment, may experience performance challenges, they consider implementing this type of cache for the Counterparty Manager, the Contract Manager, and the Fees and Commissions Manager components.
      Application object cache: This technique caches the results of a service that uses heavy computational resources for later retrieval by a client process. However, the TFX team isn’t certain that they need this type of cache, so they apply the third Continuous Architecture principle, Delay design decisions until they are absolutely necessary, and defer implementing it.
      Proxy cache: This technique caches retrieved Web pages on a proxy server so that they can be quickly accessed the next time they are requested, either by the same user or by a different user. It may provide some valuable performance benefits at a modest cost, but because the team does not yet have any specific issues that this technique would address, they decide to defer its implementation, again following principle 3.
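
A rough sketch of the database object cache tactic, using Python's standard sqlite3 module as a stand-in for the real database; the table, query, and keying scheme are hypothetical, and a production version would also need invalidation when the underlying rows change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fees (counterparty TEXT, amount REAL)")
conn.execute("INSERT INTO fees VALUES ('ACME', 125.0)")

_query_cache = {}  # (sql, params) -> rows

def cached_query(sql, params=()):
    """Database object cache: keep query results in memory, keyed by the query and its parameters."""
    key = (sql, params)
    if key not in _query_cache:
        _query_cache[key] = conn.execute(sql, params).fetchall()
    return _query_cache[key]

rows = cached_query("SELECT amount FROM fees WHERE counterparty = ?", ("ACME",))  # hits the database
rows = cached_query("SELECT amount FROM fees WHERE counterparty = ?", ("ACME",))  # served from memory
```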

  29. Materialized views can be considered a type of precompute cache. A materialized view is a physical copy of a common set of data that is frequently requested and requires database functions (e.g., joins) that are input/output (I/O) intensive. It is used to ensure that reads for the common dataset can be returned in a performant manner, with a tradeoff of increased space and reduced update performance. Most traditional SQL databases support materialized views.
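
A hedged sketch of the tactic, assuming PostgreSQL accessed through the psycopg2 driver; the connection string, tables, and view are hypothetical.

```python
import psycopg2  # assumes a reachable PostgreSQL instance

conn = psycopg2.connect("dbname=tfx user=app")  # hypothetical connection details
cur = conn.cursor()

# Precompute an I/O-intensive join once, instead of running it on every read.
cur.execute("""
    CREATE MATERIALIZED VIEW fees_by_counterparty AS
    SELECT c.name, SUM(f.amount) AS total_fees
    FROM counterparties c JOIN fees f ON f.counterparty_id = c.id
    GROUP BY c.name
""")

# Reads become a cheap scan of the precomputed copy...
cur.execute("SELECT * FROM fees_by_counterparty")

# ...at the cost of extra storage and of refreshing the copy when the base tables change.
cur.execute("REFRESH MATERIALIZED VIEW fees_by_counterparty")
conn.commit()
```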

  30. Traditionally, people talk about system availability in terms of the “number of 9s” that are required, referring to the percentage availability where 99 percent (two nines) means about 14 minutes unavailability per day, 99.99 percent means about 8.5 seconds per day, and 99.999 percent (the fabled five nines) means less than a second of unavailability per day. We have observed three significant problems with the common use of these measures. First, in most situations, it is assumed that more is better, and every business situation claims to need five nines availability, which is expensive and complex to achieve. Second, the practical difference between most of these levels is close to zero for most businesses (e.g., the business impact of 86 seconds [99.9%] and 1 second [99.999%] unavailability in a day is often not materially different). Third, these measures don’t consider the timing of the unavailability. For most customers, there is a huge difference in having a system outage for 1 minute in a retail system during the Christmas shopping season and a 20-minute outage of the same system early on a Sunday morning. This is why we have found the number of nines to be less meaningful than most people assume.
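
The arithmetic behind the "number of nines" is simple enough to sketch: this little calculation just converts an availability percentage into the permitted downtime per day and per year, matching the figures quoted above.

```python
def downtime_allowed(availability_pct, period_seconds):
    """Return the unavailability budget, in seconds, for a given availability percentage."""
    return period_seconds * (1 - availability_pct / 100)

DAY, YEAR = 24 * 3600, 365 * 24 * 3600
for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_allowed(pct, DAY):.1f} s/day, "
          f"{downtime_allowed(pct, YEAR) / 3600:.2f} h/year")
# 99.0%   -> 864.0 s/day, 87.60 h/year   (about 14 minutes per day)
# 99.9%   ->  86.4 s/day,  8.76 h/year
# 99.99%  ->   8.6 s/day,  0.88 h/year
# 99.999% ->   0.9 s/day,  0.09 h/year
```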

  31. “TTR is more important than TBF (for most types of F).” This means that when creating a resilient system, the time to recover is at least as important as the time between failures and is probably easier to optimize for. What these companies found was that, in practice, minimizing the time to recover resulted in higher availability than maximizing time between failures. The reason for this appears to be that, because they faced a very unpredictable environment, problems were inevitable and out of their control. Therefore, they could control only the time taken to recover from problems, not the number of them.

  32. The key concepts we need to think about for data availability in our system are the recovery point objective (RPO) and the recovery time objective (RTO). The RPO defines how much data we are prepared to lose when a failure occurs in the system. The RTO defines how long we are prepared to wait for the recovery of our data after a failure occurs, which obviously contributes to the mean time to recover (MTTR).

  33. All of the data related to payments is clearly critical and will need a very low RPO, aiming for no data loss. Therefore, we might need to accept a relatively long RTO for the Payment Gateway service, perhaps 20 or 30 minutes. Other parts of the system, such as the Fees and Commissions Manager’s data, could accept the loss of recent updates (perhaps up to half a day’s worth) without significant business impact, as re-creating the updates would not be particularly difficult. This means that we might be able to recover most of the system very quickly, probably within a minute, but with some services taking longer to recover and the system offering only partial service until they are available (e.g., offering partial service for a period after a major incident by recovering most of the system very quickly but not allowing payment dispatch or document updates for a longer period). A related point: retrospective analysis must be based on data collected at the time of the incident. A common problem with retrospectives without data is that they produce conclusions based on incomplete memories combined with opinions rather than on an analysis of facts. This rarely leads to valuable insights!

  34. The concrete activities that we need to achieve measurement and learning are as follows:
      Embed measurement mechanisms in the system from the beginning so that they are standard parts of the implementation and data is collected from all the system’s components.
      Analyze data regularly, ideally automatically, to understand what the measurements are telling us about the current state of the system and its historical performance.
      Perform retrospectives on both good and bad periods of system operation to generate insight and learning opportunities for improvement.
      Identify learning opportunities from data and retrospectives, and for each, decide what tangible changes are required to learn from it.
      Improve continuously and intentionally by building a culture that wants to improve through learning and by prioritizing improvement (over functional changes, if needed) to implement the changes required to achieve this.

  35. The tactics are grouped according to which element of resilience they primarily address: recognizing failures, isolating failures, protecting system components to prevent failures and allow recovery, and mitigating failures if they do occur. We don’t have tactics for the resolution of failures, as failure resolution is generally an operational and organizational element of resilience that isn’t directly addressed by architectural tactics.

  36. One of the hallmarks of a resilient system is its ability to gracefully degrade its overall service level when unexpected failures occur within it or unexpected events occur in its environment. An important tactic for achieving graceful degradation is to limit the scope of a failure and so limit the amount of the system that is affected (or “limit the blast radius,” as people sometimes say, rather dramatically).

  37. Bulkheads are a conceptual idea, but they can contribute significantly to limiting the damage from an unexpected fault or event. The idea of a bulkhead, borrowed from mechanical engineering environments such as ships, is to contain the impact of a fault to a specific part of the system to prevent it from becoming a widespread failure of service.

  38. Whenever we are considering the resilience of our system, we need to keep asking, what will happen if that service is unavailable for a period? Defaults and caches are mechanisms that can help us provide resiliency in some situations when services that provide us with data suffer failures. A cache is a copy of data that allows us to reference the previous result of a query for information without going back to the original data source. Naturally, caches are suitable for relatively slow-changing information; something that varies significantly between every call to the service is not a useful value to cache. Caches can be either inline, implemented as proxies (so the caller is unaware of the cache), or lookaside, where the caller explicitly checks and updates the cache value. Of course, there is always a tradeoff involved in any architectural mechanism and, as the old joke goes, there are only two hard things in computer science: cache invalidation, naming things, and off-by-one errors. The big problem with caches is knowing when to discard the cached value and call the service again to refresh it. If you do this too often, you lose the efficiency value of the cache; if you do it too infrequently, you are using unnecessarily stale data, which may be inaccurate. There isn’t really a single answer to this problem; it depends on how often the data is likely to change in the service and how sensitive the caller is to that change.
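
A minimal sketch of a lookaside cache used as a resilience mechanism: the caller checks the cache itself and, when the downstream service fails, deliberately falls back to the last known (possibly stale) value. The `fetch` callable and the `max_age` value are illustrative assumptions.

```python
import time

_cache = {}  # key -> (value, cached_at)

def get_with_fallback(key, fetch, max_age=60):
    """Lookaside cache: prefer a fresh call, but serve the last known value if the service fails."""
    entry = _cache.get(key)
    if entry and time.time() - entry[1] < max_age:
        return entry[0]                      # fresh enough: avoid the remote call entirely
    try:
        value = fetch(key)                   # hypothetical call to the downstream service
        _cache[key] = (value, time.time())
        return value
    except Exception:
        if entry:
            return entry[0]                  # degrade gracefully: stale data rather than total failure
        raise                                # no cached value to fall back on
```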

  39. Suppose we do have an asynchronous system but one where the clients sometimes produce workload a lot faster than the servers can process it. What happens? We get very long request queues, and at some point, the queues will be full. Then our clients will probably block or get unexpected errors, which is going to cause problems. A useful resilience tactic in an asynchronous system is to allow some sort of signaling back through the system so that the clients can tell that the servers are overloaded and there is no point in sending more requests yet. This is known as creating backpressure.
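
A small sketch of backpressure using a bounded queue: when the queue is full, `put` with a short timeout fails, and that failure is the signal telling the producer to back off, retry later, or shed the work. The queue size, timings, and names are illustrative.

```python
import queue
import threading
import time

requests_q = queue.Queue(maxsize=10)         # bounded queue: the buffer cannot grow without limit

def produce(item):
    try:
        requests_q.put(item, timeout=0.01)   # blocks briefly while the queue is full...
    except queue.Full:
        # ...and this failure is the backpressure signal flowing back to the client
        print(f"overloaded, backing off on item {item}")

def consume():
    while True:
        item = requests_q.get()              # the slower server drains the queue at its own pace
        time.sleep(0.05)                     # stand-in for real processing
        requests_q.task_done()

threading.Thread(target=consume, daemon=True).start()
for i in range(100):                         # a fast producer soon fills the queue and feels pushback
    produce(i)
```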

  40. Just as the easiest way to achieve better performance in a system is to identify a design that requires less work to achieve the required output, a valuable architectural tactic for resilience is to simply reject workload that can’t be processed or that would cause the system to become unstable. This is commonly termed load shedding.
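
A rough sketch of load shedding based on the number of in-flight requests; the threshold and the HTTP-style status codes are made-up values for illustration.

```python
import threading

MAX_IN_FLIGHT = 50
_in_flight = 0
_lock = threading.Lock()

def handle(request, process):
    """Reject work outright when the system is already at capacity, rather than degrading for everyone."""
    global _in_flight
    with _lock:
        if _in_flight >= MAX_IN_FLIGHT:
            return 503, "server busy, please retry later"   # shed the load with an explicit error
        _in_flight += 1
    try:
        return 200, process(request)
    finally:
        with _lock:
            _in_flight -= 1
```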

  41. A variant of load shedding that is sometimes identified as a distinct approach is rate limiting (or “throttling”). Rate limiting is also normally provided by infrastructure software that handles network requests. The difference between the two is that load shedding rejects inbound requests because of the state of the system (e.g., request handling times increasing or total inbound traffic levels), whereas rate limiting is usually defined in terms of the rate of requests arriving from a particular source (e.g., a client ID, a user, or an IP address) in a time period. As with load shedding, once the limit is exceeded, additional requests are rejected with a suitable error code.
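
To contrast with the load-shedding sketch above, here is a per-client rate limiter using a simple token bucket: the decision depends only on how fast a given source is sending, not on the overall state of the system. The rate, burst size, and the suggestion of HTTP 429 are illustrative assumptions.

```python
import time
from collections import defaultdict

RATE = 10      # tokens added per second, per client
BURST = 20     # maximum bucket size

_buckets = defaultdict(lambda: {"tokens": BURST, "last": time.time()})

def allow(client_id):
    """Token bucket: each request costs one token; tokens refill at a fixed rate per client."""
    bucket = _buckets[client_id]
    now = time.time()
    bucket["tokens"] = min(BURST, bucket["tokens"] + (now - bucket["last"]) * RATE)
    bucket["last"] = now
    if bucket["tokens"] >= 1:
        bucket["tokens"] -= 1
        return True
    return False   # the caller would reject the request, e.g., with HTTP 429 Too Many Requests
```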

  42. Timeouts are a very familiar concept, and we just need to remind ourselves that when making requests to a service, whether synchronously or asynchronously, we don’t want to wait forever for a response. A timeout defines how long the caller will wait, and we need a mechanism to interrupt or notify the caller that the timeout has occurred so that it can abandon the request and perform whatever logic is needed to clean things up.
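
One common way to express this in code is shown below, using the widely used requests library; the base URL and the two-second limit are placeholders, and the fallback behaviour in the exception handler is whatever clean-up logic the caller needs.

```python
import requests

def fetch_document_status(doc_id, base_url="https://api.example.com"):
    """Never wait forever: give up after roughly 2 seconds and let the caller decide what to do next."""
    try:
        response = requests.get(f"{base_url}/documents/{doc_id}", timeout=2)
        return response.json()
    except requests.Timeout:
        # the timeout fired: clean up, log, return a default, or surface the error to the caller
        return None
```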

  43. There are many ways to design a circuit breaker proxy to protect a service from overload, but this example uses three states: Normal, Checking, and Tripped. The proxy starts in Normal state and passes all requests through to the service. Then, if the service returns an error, the proxy moves into Checking state. In this state, the proxy continues to call the service, but it keeps a tally of how many errors it is encountering, balanced by the number of successful calls it makes, in the errors statechart variable. If the number of errors returns to zero, the proxy returns to Normal state. If the errors variable increases beyond a certain level (10 in our example), then the proxy moves into Tripped state and starts rejecting calls to the service, giving the service time to recover and preventing potentially useless requests from piling up, which would be likely to make matters worse. Once a timeout expires, the proxy moves back into Checking state to again check whether the service is working. If it is not, the proxy moves back into Tripped state; if the service has returned to normal, the errors level will rapidly reduce, and the proxy moves back into Normal state. Finally, there is a manual reset option that moves the proxy from the Tripped state straight back to Normal state if required. Both timeouts and circuit breakers are best implemented as reusable library code, which can be developed and tested thoroughly before being used throughout the system. This helps reduce the likelihood of the embarrassing situation of a bug in the resilience code bringing down the system someday.
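
A condensed sketch of the three-state proxy described above (Normal, Checking, Tripped), with an error tally, a trip threshold of 10, a recovery timeout, and a manual reset; the exact details are illustrative rather than the book's statechart.

```python
import time

class CircuitBreaker:
    def __init__(self, call, threshold=10, recovery_timeout=30):
        self.call, self.threshold, self.recovery_timeout = call, threshold, recovery_timeout
        self.state, self.errors, self.tripped_at = "NORMAL", 0, 0.0

    def request(self, *args):
        if self.state == "TRIPPED":
            if time.time() - self.tripped_at < self.recovery_timeout:
                raise RuntimeError("circuit open: rejecting call to give the service time to recover")
            self.state = "CHECKING"                   # timeout expired: probe the service again
        try:
            result = self.call(*args)
            if self.state == "CHECKING":
                self.errors = max(0, self.errors - 1) # successes balance out the error tally
                if self.errors == 0:
                    self.state = "NORMAL"             # service looks healthy again
            return result
        except Exception:
            self.errors += 1
            if self.state == "NORMAL":
                self.state = "CHECKING"
            if self.errors > self.threshold:
                self.state, self.tripped_at = "TRIPPED", time.time()
            raise

    def reset(self):                                   # manual reset straight back to Normal
        self.state, self.errors = "NORMAL", 0
```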

  44. You may remember that we discussed in Chapter 3, “Data Architecture”, another related tactic known as eventual consistency. The difference between the two is that compensation is a replacement for a distributed transaction and so actively tries to minimize the window of inconsistency using a specific mechanism: compensation transactions. Eventual consistency mechanisms vary, but in general, the approach accepts that the inconsistency exists and lives with it, allowing it to be rectified when convenient. Eventual consistency is also more commonly used to synchronize replicas of the same data, whereas compensation is usually used to achieve a consistent update of a set of related but different data items that need to be changed together. Compensation sounds like a straightforward idea, but it can be difficult to implement in complex situations. The idea is simply that for any change to a database (a transaction), the caller has the ability to make another change that undoes the change (a compensating transaction). Now, if we need to update three databases, we go ahead and update the first two. If the update to the third fails (perhaps the service is unavailable or times out), then we apply our compensating transactions to the first two services. A saga is a collection of compensating transactions, which organizes the data needed to perform the compensation in each database and simplifies the process of applying them when needed. When implemented by a reusable library, they can provide an option for implementing compensating transactions at a relatively manageable level of additional complexity.
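
A simplified sketch of the saga idea: each step is paired with a compensating action, and if a later step fails, the compensations already recorded are applied in reverse order. The toy in-memory "databases" and the deliberately failing third step are illustrative assumptions.

```python
class Saga:
    """Run a sequence of (action, compensation) pairs; undo completed steps if a later one fails."""
    def __init__(self):
        self._undo = []  # compensations for the steps that have already succeeded

    def run(self, steps):
        try:
            for action, compensation in steps:
                action()
                self._undo.append(compensation)
        except Exception:
            for compensation in reversed(self._undo):
                compensation()        # compensating transactions undo the earlier updates
            raise

# Toy stand-ins for separate stores that must change together.
accounts = {"A": 500, "B": 200}

def transfer(amount):
    def record_fee():                 # the third update fails, e.g., its service is unavailable
        raise RuntimeError("fees service unavailable")
    Saga().run([
        (lambda: accounts.update(A=accounts["A"] - amount),
         lambda: accounts.update(A=accounts["A"] + amount)),
        (lambda: accounts.update(B=accounts["B"] + amount),
         lambda: accounts.update(B=accounts["B"] - amount)),
        (record_fee, lambda: None),
    ])

try:
    transfer(100)
except RuntimeError:
    pass
print(accounts)  # back to {'A': 500, 'B': 200}: the first two updates were compensated
```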

  45. Finally, there is the question of resilience in the face of the difficult problem of data corruption within our databases. Data corruption can be a result of defects in our system’s software or problems or defects in the underlying database technology we are using. Such problems are thankfully relatively rare, but they do happen, so we need to have a strategy for them. The three tactics to use to mitigate data corruption are regular checking of database integrity, regular backup of the databases, and fixing corruption in place.

  46. All of these challenges mean that, while testing specific scenarios in the development cycle is still valuable, one of the most effective ways we know of to gain confidence in a system’s resilience is to deliberately introduce failures and problems into it and check that it responds appropriately. This is the idea behind the famous Netflix Simian Army project, a set of software agents that can be configured to run amok over your system and introduce all sorts of problems by killing VMs, introducing latency into network requests, using all the CPU on a VM, detaching storage from a VM, and so on. Approaches like this, which inject deliberate faults into the system at runtime, have become a key part of what is known as chaos engineering (see Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones, and Ali Basiri, Chaos Engineering, O’Reilly Media, 2020; also Principles of Chaos Engineering, last updated May 2018, https://www.principlesofchaos.org). This approach can also be useful for testing other quality attributes, such as scalability, performance, and security. The original Simian Army project (https://github.com/Netflix/SimianArmy) has since been retired and replaced by similar tools from Netflix and others, such as the commercial Gremlin platform (https://www.gremlin.com).

  47. Achieving good operational visibility allows us to detect problematic conditions and recover from them. Analyzing our monitoring data from the past and considering what it is showing us and what may happen in the future allows us to predict possible future problems and mitigate them before they happen.

  48. Michael Nygard’s book Release It! Design and Deploy Production-Ready Software, 2nd ed. (Pragmatic Bookshelf, 2018) provides a rich seam of experience-based advice on designing systems that will meet the demands of production operation, including improving availability and resilience. A number of our architectural tactics come directly from Michael’s insights.
      A good text on classical resilience engineering is Resilience Engineering in Practice: A Guidebook (CRC Press, 2013) by Jean Pariès, John Wreathall, and Erik Hollnagel, three well-known figures in that field. It does not talk about software specifically but provides a lot of useful advice on organizational and complex-system resilience.
      A good introduction to observability of modern systems is Cindy Sridharan’s Distributed Systems Observability (O’Reilly, 2018). A good introduction to chaos engineering can be found in Chaos Engineering: System Resiliency in Practice (O’Reilly, 2020) by former Netflix engineers Casey Rosenthal and Nora Jones.
      Two books from teams at Google that we have found valuable are Site Reliability Engineering: How Google Runs Production Systems (O’Reilly Media, 2016) by Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, which is particularly useful for its discussion of recognizing, handling, and resolving production incidents and of creating a culture of learning using postmortems; and Building Secure and Reliable Systems (O’Reilly Media, 2020) by Ana Oprea, Betsy Beyer, Paul Blankinship, Heather Adkins, Piotr Lewandowski, and Adam Stubblefield.
      You can find useful guidance on running learning ceremonies known as retrospectives in Esther Derby and Diana Larsen’s Agile Retrospectives: Making Good Teams Great (Pragmatic Bookshelf, 2006). Aino Vonge Corry’s Retrospective Antipatterns (Addison-Wesley, 2020) presents a unique summary of what not to do in a world gone agile.
      The following offer abundant information on data concerns: Pat Helland, “Life beyond Distributed Transactions: An Apostate’s Opinion,” in Proceedings of the 3rd Biennial Conference on Innovative Data Systems Research (CIDR), 2007, pp. 132–141 (also available at https://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf); Ian Gorton and John Klein, “Distribution, Data, Deployment: Software Architecture Convergence in Big Data Systems,” IEEE Software 32, no. 3 (2014): 78–85; and Martin Kleppmann, Designing Data-Intensive Applications (O’Reilly Media, 2017).
      Finally, an idea that seems to come and go in the software resilience area is antifragility, a term coined by the well-known author and researcher Nassim Nicholas Taleb in Antifragile: Things That Gain from Disorder (Random House, 2012). The core idea of antifragility is that some types of systems become stronger when they encounter problems, examples being an immune system and a forest ecosystem that becomes stronger when it suffers fires. Hence, antifragile systems are by their nature highly resilient. The original book doesn’t mention software systems, but several people have applied the idea to software, even proposing an antifragile software manifesto; see Daniel Russo and Paolo Ciancarini, “A Proposal for an Antifragile Software Manifesto,” Procedia Computer Science 83 (2016): 982–987. In our view, this is an interesting idea, but a lot of the discussion we’ve seen of it in the software field seems vague, and it is hard to separate these proposals from better established ideas (e.g., chaos engineering and continuous improvement).
