Scaling

Vertical vs Horizontal

In horizontal scaling, the team adds more servers to handle increasing workloads.

In vertical scaling, the team adds compute, storage capacity, and memory to the same server to handle increasing workloads.

Predictive vs Reactive

Predictive scaling is useful for scenarios with predictable traffic patterns. By analyzing historical data, organizations can forecast traffic trends and adjust their resources accordingly.

Reactive scaling is essential for dealing with unforeseen traffic surges, which might be significantly higher than the norm. By monitoring traffic and resource utilization, organizations can scale up in response to a spike and back down when the extra capacity is no longer needed.
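
The reactive pattern above can be sketched as a simple decision loop. The thresholds and limits here are illustrative assumptions, not values from any particular platform:

```python
# Sketch of one iteration of a reactive scaling loop (illustrative thresholds).
def desired_capacity(current_servers, cpu_utilization,
                     scale_up_at=0.75, scale_down_at=0.30,
                     min_servers=2, max_servers=20):
    """Return the new server count for one monitoring interval."""
    if cpu_utilization > scale_up_at:
        return min(current_servers + 1, max_servers)   # spike: scale out
    if cpu_utilization < scale_down_at:
        return max(current_servers - 1, min_servers)   # idle: scale in
    return current_servers                             # within band: hold
```

A real autoscaler would also apply cooldown periods so it does not oscillate between scaling out and in on every interval.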

Scaling Static Content

It's essential to scale and distribute static content efficiently. Content delivery networks (CDNs) cache this content closer to users, reducing latency and speeding up load times.
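
Whether a CDN edge caches a response is typically driven by the `Cache-Control` header the origin sends. A minimal sketch of that policy, assuming a simple extension-based split between static assets and dynamic pages:

```python
# Illustrative Cache-Control policy an origin might apply so that a CDN
# caches static assets aggressively but revalidates dynamic pages.
def cache_headers(path):
    static = (".css", ".js", ".png", ".jpg", ".mp4")
    if path.endswith(static):
        # Long-lived, publicly cacheable: edge nodes may serve it for a year.
        return {"Cache-Control": "public, max-age=31536000, immutable"}
    # Dynamic content: revalidate with the origin on every request.
    return {"Cache-Control": "no-cache"}
```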

Database Scaling

While NoSQL databases are designed to scale horizontally, relational databases require more care. You can use read replicas to add nodes that serve queries, but all writes are still limited to a single node. To scale a relational database further, you need to redesign it and divide it into shards by applying partitions. Each shard can grow independently; the application chooses a partition key that determines which shard stores a given user's data and directs each record to the correct partition.
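
The partition-key routing described above can be sketched with a stable hash, so every application instance maps the same key to the same shard. The key format and shard count are assumptions for illustration:

```python
import hashlib

def shard_for(partition_key: str, num_shards: int = 4) -> int:
    """Map a partition key (e.g. a user ID) to a shard deterministically.

    Uses a stable hash (sha256) rather than Python's built-in hash(),
    which is randomized per process and would break routing.
    """
    digest = hashlib.sha256(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# All records for one user land on the same shard:
# shard_for("user-1234") always returns the same shard index.
```

Note that simple modulo routing reshuffles most keys when `num_shards` changes; schemes such as consistent hashing reduce that movement.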

Elastic Architecture

Elasticity is required to right-size your infrastructure: you expand your server fleet to meet rising demand and contract it when the load diminishes.

HA and Resilient Architecture

To achieve HA, plan workloads across isolated physical locations. Should an outage occur in one location, your application replica can operate from another. As the saying goes, “design for failure, and nothing will fail.”

A resilient architecture means your application remains available to customers while recovering from failure. Making your architecture resilient includes applying best practices to recover your application from increased loads, malicious attacks, and component failures. A resilient architecture should recover within a defined amount of time.

To make your architecture resilient, you need to define the time of recovery and address the following points:

  1. Identify and implement redundant architectural components wherever required.
  2. Understand when to fix versus when to replace architectural components. For example, fixing a server issue might take longer than replacing it with the same machine image.
  3. Deploy server clusters across different racks, extend them to multiple data centers within the same region, and go further across geographic regions. This geographical distribution protects against localized and regional disasters and reduces latency for a global user base.
  4. Incorporate intelligent load balancing and global traffic management, such as DNS-based routing with health checks, so that users are always served from the optimal location. Achieve database resiliency through strategic replication, with automated failover mechanisms to maintain database availability and integrity.

Best practices need to be applied to create a redundant environment: 

  • Use a CDN to distribute and cache static content such as videos, images, and static web pages near the user’s location, so that your application remains available.
  • Once traffic reaches a region, use a load balancer to route it to a fleet of servers, so that your application can keep running even if one location within the region fails.
  • Use autoscaling to add or remove servers based on user demand, so that your application is not impacted by individual server failures.
  • Create a standby database to ensure high availability of the database, so that your application remains available in the event of a database failure.
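
The load-balancer practice above relies on health checks: traffic is routed only to servers that pass them. A minimal round-robin sketch, where `is_healthy` stands in for whatever health-check probe the platform provides:

```python
import itertools

def make_router(servers, is_healthy):
    """Return a route() function that picks the next healthy server,
    skipping failed ones (round-robin over the server list)."""
    ring = itertools.cycle(servers)
    def route():
        for _ in range(len(servers)):          # try each server at most once
            server = next(ring)
            if is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")
    return route
```

In practice the health state would be refreshed by periodic probes rather than checked synchronously on every request.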

Fault-Tolerant Architecture

Fault tolerance goes beyond high availability. In an HA setup with two data centers running two servers each, if one data center goes down, the application is still available but at lower capacity. Fault tolerance means overprovisioning: in this case, four servers per data center, so that if one data center goes down the application can still operate at full capacity.
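
The overprovisioning arithmetic generalizes: to survive the loss of any one location at full capacity, the surviving locations must carry the whole load between them. A small sketch of that calculation:

```python
import math

def servers_per_zone(required_servers: int, zones: int) -> int:
    """Servers to place in EACH zone so that losing any one zone
    still leaves `required_servers` running (requires zones >= 2)."""
    surviving = zones - 1
    return math.ceil(required_servers / surviving)

# The example from the text: 4 servers needed, 2 data centers ->
# 4 servers in each, so one data center alone carries full load.
```

With three or more zones the overprovisioning cost per zone drops, which is one reason fault-tolerant designs often spread across more than two locations.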

Designing for Performance

The following are considerations for adding caching to various layers of your application design: 

  1. Use the browser cache on the user’s system to load frequently requested web pages. 
  2. Use the DNS cache for quick website lookup. 
  3. Use the CDN cache for high-resolution images and videos that are near the user’s location. 
  4. At the server level, maximize the memory cache to serve user requests. 
  5. Use cache engines such as Redis and Memcached to serve frequent queries from the caching engine. 
  6. Use the database cache to serve frequent queries from memory. 
  7. Take care of cache expiration, which is the process by which data stored in the cache becomes outdated and is marked for update or removal. Cache eviction, on the other hand, is the process by which data is removed from the cache, typically to make room for new data.
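
The expiration and eviction behaviors described in the last point can be illustrated with a minimal in-memory cache. This is a sketch of the two mechanisms, not how Redis or Memcached implement them internally:

```python
import time
from collections import OrderedDict

class TTLCache:
    """Minimal cache illustrating expiration (TTL) and LRU eviction."""

    def __init__(self, max_items=128, ttl=300.0):
        self._data = OrderedDict()          # key -> (value, expires_at)
        self.max_items, self.ttl = max_items, ttl

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:   # expiration: entry is stale
            del self._data[key]
            return None
        self._data.move_to_end(key)         # mark as recently used
        return value

    def put(self, key, value):
        if len(self._data) >= self.max_items:
            self._data.popitem(last=False)  # eviction: drop least recently used
        self._data[key] = (value, time.monotonic() + self.ttl)
```

Expiration removes entries because they are outdated; eviction removes entries purely to make room, here using a least-recently-used policy.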

Security

Consider the following security aspects during the design phase:

Physical security of the data center: All IT resources in data centers should be secured against unauthorized access.

Network security: The network should be secured to prevent any unauthorized server access.

Identity and Access Management (IAM): Only authenticated users should have access to the application, and they should only be able to perform activities permitted by their authorization.

Data security in transit: Data should be secure while traveling over the network or the internet.

Data security at rest: Data should be secure while stored in the database or any other storage.

Security monitoring: Any security incident should be captured, and the team should be alerted to act.

Disaster Recovery

When planning disaster recovery, a solutions architect must understand an organization’s Recovery Time Objective (RTO) and Recovery Point Objective (RPO). 

RTO measures how much downtime a business can sustain without significant impact.

RPO indicates how much data loss a business can tolerate.
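
One practical consequence of RPO: with periodic backups, the worst-case data loss equals the backup interval, since a disaster can strike just before the next backup completes. A small illustration of that check (function names are mine, for illustration):

```python
def worst_case_rpo_minutes(backup_interval_minutes: int) -> int:
    """Worst-case data loss when backing up every N minutes: a disaster
    just before the next backup loses the whole interval's data."""
    return backup_interval_minutes

def meets_rpo(backup_interval_minutes: int, rpo_minutes: int) -> bool:
    """True if backing up this often can satisfy the stated RPO."""
    return worst_case_rpo_minutes(backup_interval_minutes) <= rpo_minutes

# Hourly backups cannot satisfy a 15-minute RPO, but 10-minute backups can.
```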

Common disaster recovery plans:

  • Backup and restore: The least costly option, with the maximum RTO and RPO. All the servers’ machine images and database snapshots are stored at the disaster recovery site. In a disaster, the team restores the disaster recovery site from backup.
  • Pilot light: In this plan, all the servers’ machine images are stored as backups, and a small database server is maintained at the disaster recovery site with continual data synchronization from the primary site. Other critical services, such as Active Directory, may be running on small instances. In a disaster, the team brings up the servers from the machine images and scales up the database. Pilot light is more costly but has lower RTO and RPO than backup and restore.
  • Warm standby: In this plan, all the application and database server instances (running at low capacity) at the disaster recovery site continually sync with the primary site. In a disaster, the team scales up all the servers and databases. Warm standby is costlier than the pilot light option but has lower RTO and RPO.
  • Multi-site: This plan is the most expensive and has near-zero RTO and RPO. It maintains a replica of the primary site at a disaster recovery site with equal capacity that actively serves user traffic. In a disaster, all traffic is routed to the alternate location.
