Key considerations & best practices for AWS cloud native applications
Millions of customers leverage AWS cloud products and solutions to build sophisticated applications with increased flexibility, scalability and reliability.
AWS has established itself as a leading platform for enterprise development teams, offering a variety of command-line tools and software development kits (SDKs) to deploy and manage applications and services. AWS SDKs are available for a wide range of platforms and programming languages, including Java, PHP, Python, Node.js, Ruby, C++, Android, and iOS.
Developed and deployed properly, cloud native applications provide substantial benefits including high-level services, greater elasticity, on-demand delivery, and streamlined global deployment.
The AWS development platform is aligned entirely with the cloud native approach and all its benefits.
CNCF Cloud Native Definition v1.0
AWS provides scalable platform, integration, development and deployment support for web-centric languages and technologies such as Ruby, Node.js, JavaScript, CSS, HTML, Python and more. It also provides platforms such as Elastic Beanstalk to build and run this code.
Cloud applications require a mechanism to scale up and down as web traffic increases or decreases, making them flexible and dynamic in nature. Amazon ECS, Auto Scaling, and Amazon EKS provide full-featured functionality to address dynamic scalability in a cloud native stack.
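To make the scaling mechanism concrete, here is a minimal sketch using the AWS SDK for Python (boto3) to attach a target-tracking scaling policy to an ECS service; the cluster name, service name, capacity limits and CPU target are illustrative assumptions, not prescriptions.

```python
import boto3

# Application Auto Scaling manages scaling for ECS services (assumes credentials/region are configured)
autoscaling = boto3.client("application-autoscaling")

# Hypothetical cluster and service names, used for illustration only
resource_id = "service/demo-cluster/demo-web-service"

# Register the service's desired task count as a scalable target
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

# Scale out and in automatically to keep average CPU utilization near 50%
autoscaling.put_scaling_policy(
    PolicyName="demo-cpu-target-tracking",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 50.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
        },
        "ScaleInCooldown": 60,
        "ScaleOutCooldown": 60,
    },
)
```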
Amazon SNS helps coordinate the delivery of push notifications for apps to subscribing endpoints or clients. Amazon SQS is a message queuing service that helps developers decouple and scale distributed systems, serverless apps, and microservices deployed on cloud native stacks. Amazon Cognito facilitates control of user authentication on the cloud native stack.
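As an illustration of this decoupling, the sketch below publishes an event to an SNS topic and has a separate worker consume it from an SQS queue with boto3; the topic ARN, queue URL and message format are hypothetical, and the queue is assumed to be subscribed to the topic.

```python
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Hypothetical topic and queue created ahead of time for this example
topic_arn = "arn:aws:sns:us-east-1:123456789012:order-events"
queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/order-worker-queue"

# Producer: publish an event to the topic (fanned out to all subscribed endpoints)
sns.publish(TopicArn=topic_arn, Message='{"orderId": "1001", "status": "CREATED"}')

# Consumer: a decoupled worker polls its own queue for messages
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,
    WaitTimeSeconds=20,  # long polling reduces empty responses
)
for message in response.get("Messages", []):
    print("Processing:", message["Body"])
    # Delete the message only after it has been handled successfully
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
```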
Cloud native applications require storage and network capabilities that are mutable and flexible in nature. Amazon S3 storage helps provide online backup as well as archival of data and application programs. Amazon VPC, subnets, and Elastic IP addresses help build flexible network segmentation in the cloud to keep it secure and safe from unauthorized external access.
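For example, a simple backup-and-restore flow against S3 might look like the hedged sketch below; the bucket name, object key and file names are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and file names, for illustration only
bucket = "example-app-backups"
backup_key = "backups/app-data-backup.tar.gz"

# Online backup: upload a local archive to S3
s3.upload_file("app-data-backup.tar.gz", bucket, backup_key)

# Restore: download the same object when it is needed again
s3.download_file(bucket, backup_key, "restored-app-data-backup.tar.gz")
```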
Amazon ECS (Elastic Container Service) is a highly scalable, high-performance container orchestration service that supports Docker containers, enabling you to run and scale containerized applications on the cloud native stack. Because it can run containers without requiring you to manage servers, it is secure and cost effective. In addition, Amazon EKS runs the Kubernetes management infrastructure across multiple Availability Zones to eliminate a single point of failure.
AWS also offers CodeDeploy, CodeCommit, and CodePipeline to manage code in the cloud native stack. CodeDeploy helps deploy code at scale, with a focus on rapid development and rapid deployment in mission-critical situations where the cost of failure is high. CodeCommit is a managed revision control service that hosts Git repositories and works with all Git-based tools, and CodePipeline is used to model and automate the software release process on the cloud native stack.
Codebase – The main objective of this factor is to keep all cloud application code in revision control. If events are related (for example, they share a common Amazon API Gateway API), the Lambda function code for those events can be kept in the same repository. Otherwise, Lambda functions can be broken out, along with their event sources, into their own repositories.
Dependencies – To build special processing or business logic, the best solution may be to create a purposeful library and manage its dependencies with each language's package manager – Node.js: npm, Python: pip, Java: Maven, C#: NuGet, and Go: go get.
Config – AWS Lambda and Amazon API Gateway can be used to set configuration information using the environment in which each service runs (a minimal handler sketch follows this list).
Backing Services – AWS Lambda has a default model for this factor. Typically, any database or data store can be referenced as an external resource via an HTTP endpoint or DNS name.
Build, Release, Run – AWS CodeDeploy, CodeCommit, and CodePipeline can be utilized for this factor.
Port Binding – AWS Lambda can be used with one of three invocation modes: synchronous, asynchronous, and stream-based.
Concurrency – AWS Lambda can be used, as it provides massive concurrency and scaling capacity.
Stateless Processes – AWS Lambda functions can be used and treated as stateless, despite the ability to store some information locally between execution environment re-uses.
Disposability – AWS best practices help identify where to place certain logic, how to re-use execution environments, and how configuring a function with more memory provides a proportional increase in available CPU. With AWS X-Ray, it is possible to gather insight into what a function is doing during an execution and make adjustments accordingly.
Environment Parity – The AWS Serverless Application Model (SAM) allows users to model serverless applications in a greatly simplified AWS CloudFormation syntax. With this model, CloudFormation capabilities such as Parameters and Mappings can be used to build dynamic templates.
Logs – Amazon CloudWatch Logs can be used, and API Gateway provides two different types of logs: 1) execution logs and 2) access logs.
Admin Processes – Typically, cloud functions are scoped down to single or limited use cases, with individual functions for different components of the web application. Even if they share a common invoking resource, such as an API Gateway endpoint and stage, the individual API resources and actions can be separated into their own Lambda functions, making it possible to build a design that meets the requirement.
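As a minimal sketch of the Config and Stateless Processes factors above, the Lambda handler below reads its settings from environment variables and keeps no request-specific state between invocations; the table name, environment variable and event shape are assumptions made for the example.

```python
import json
import os

import boto3

# Config factor: configuration comes from the execution environment;
# TABLE_NAME is a hypothetical environment variable set on the function.
TABLE_NAME = os.environ["TABLE_NAME"]

# Clients created outside the handler may be reused across invocations,
# but no request-specific state is kept (Stateless Processes factor).
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table(TABLE_NAME)


def handler(event, context):
    # Each invocation works only with the data carried in the event
    order = json.loads(event["body"])
    table.put_item(Item={"orderId": order["orderId"], "status": "RECEIVED"})
    return {"statusCode": 201, "body": json.dumps({"orderId": order["orderId"]})}
```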
Functionally, there should be no change in the test approach between web-based and cloud-based environments. However, more focus may be required in certain aspects such as performance, security, environment availability, integration with interfaces, customer experience, and adherence to SLAs when testing in a cloud-based environment.
The test strategy and test plan document may also change depending on the service models and implementation models selected. Choice of tools and techniques should be similar to that of web-based applications, however, during tool selection it is important to make sure that they are available on cloud and that they support the pay-as-you-use model.
CI Job: Unit/Component Testing
Testing Framework – JUnit, NUnit, RSpec
Automation Tools: Pact/QMetry Automation Framework
1. Boundary conditions
2. Status code and status message verification
3. Positive and negative business flows

CI Job: Integration Testing
Testing Framework – JUnit, NUnit, RSpec
Automation Tools: Pact/QMetry Automation Framework
1. Response chaining
2. Status code and status message verification at the end level
3. Session management
4. Positive and negative scenarios with respect to third parties
5. Consumer scenarios to verify nothing breaks after service deployment

CI Job: Contract Testing
Automation Tools: Pact/QMetry Automation Framework
1. Each consumer needs to verify each contract:
• Key and value pairs
• Value types
2. Positive and negative contract verification
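For illustration, the sketch below shows the kind of status-code, boundary-condition and positive/negative checks a unit/component CI job might run against a service endpoint; the URL, payloads and expected limits are placeholders, and Python's unittest with the requests library stands in for the frameworks listed above.

```python
import unittest

import requests

BASE_URL = "https://api.example.com/v1/members"  # placeholder endpoint


class MemberServiceComponentTests(unittest.TestCase):
    def test_valid_member_returns_200(self):
        # Positive business flow: a known member id returns success
        response = requests.get(f"{BASE_URL}/1001", timeout=5)
        self.assertEqual(response.status_code, 200)
        self.assertEqual(response.json()["memberId"], "1001")

    def test_unknown_member_returns_404(self):
        # Negative flow: status code and status message verification
        response = requests.get(f"{BASE_URL}/does-not-exist", timeout=5)
        self.assertEqual(response.status_code, 404)

    def test_name_length_boundary(self):
        # Boundary condition: maximum allowed name length (assumed to be 64 characters)
        response = requests.post(BASE_URL, json={"name": "x" * 64}, timeout=5)
        self.assertIn(response.status_code, (200, 201))


if __name__ == "__main__":
    unittest.main()
```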
End-user business scenarios
Validation against data breach (data swap)
Multi-user/Concurrent testing
Validation of session continuity
Validation of auto scaling
Proper FPT Testing
The comparison below shows the differences and impact of a proper FPT approach in most enterprises.
Without a formalized FPT approach:
1. Very limited functional testing under load (e.g. data swap)
2. Very limited multi-user testing (e.g. in the Appt Center booking flow)
3. Lack of awareness among team members of edge/corner conditions, resulting in:
a) Potential high risk of a bad member experience
b) Compromise of member PHI data
4. Lack of enforcement as a practice across the squad members

With a formalized FPT approach:
1. Formalized and documented approach for Functional Performance Testing (FPT)
2. FPT suite contains a collection of tests, including data breach validations
3. Functional verification of business use cases while the application is under load
4. Enforcement of FPT via DoD checklist (squad-level and release QA checklists)
FPT Testing Types
1. Data Swap Validation (multi-user, negative concurrency)
Helps validate data integrity for end-user data, with a single user or multiple users (see the sketch after these testing types).
Multi-User Testing → up to 5 TPS for the service layer via live services
Concurrent Users (High Traffic) → up to 50 TPS for the service layer via virtualized services
2. Auto-Scaling Policy Validation
Helps identify potential issues related to data loss, session management, session continuity, and performance degradation during auto scaling of the infrastructure.
Verify that the application behaves as expected when infrastructure resources are updated via an auto scaling (scale-in or scale-out) policy.
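As a rough sketch of the data swap validation described above, the snippet below fires concurrent requests for several test users and asserts that each response contains only that user's own data; the endpoint, bearer-token scheme and test identities are assumptions made for illustration.

```python
import concurrent.futures

import requests

BASE_URL = "https://api.example.com/v1/profile"  # placeholder endpoint

# Hypothetical test users mapped to the member id each should see
TEST_USERS = {
    "user-a-token": "member-001",
    "user-b-token": "member-002",
    "user-c-token": "member-003",
}


def fetch_profile(token):
    response = requests.get(
        BASE_URL, headers={"Authorization": f"Bearer {token}"}, timeout=10
    )
    response.raise_for_status()
    return token, response.json()["memberId"]


# Issue the requests concurrently to simulate simultaneous users
with concurrent.futures.ThreadPoolExecutor(max_workers=len(TEST_USERS)) as pool:
    for token, member_id in pool.map(fetch_profile, TEST_USERS):
        # Data swap check: the response must belong to the requesting user
        assert member_id == TEST_USERS[token], f"Data swap detected for {token}"
```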
Security is absolutely key when working in the cloud, and there are a number of important factors to consider when deciding how to approach it.
There are two types of Security to focus on – User-Based and Resource-Based.
User-Based Security
Is managed by IAM, which helps determine what kinds of API calls should be allowed from a specific user according to their authorization level.
Resource-Based Security
Depends on every resource, according to its communication method and data usage method.
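As an illustration of the user-based model, the hedged sketch below creates a narrowly scoped IAM policy and attaches it to a specific user with boto3; the policy contents, user name and bucket are assumptions for the example.

```python
import json

import boto3

iam = boto3.client("iam")

# Hypothetical read-only policy limiting which API calls the user may make
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::example-app-backups",
                "arn:aws:s3:::example-app-backups/*",
            ],
        }
    ],
}

policy = iam.create_policy(
    PolicyName="example-backup-read-only",
    PolicyDocument=json.dumps(policy_document),
)

# User-based security: attach the policy to a specific IAM user
iam.attach_user_policy(
    UserName="test-engineer",
    PolicyArn=policy["Policy"]["Arn"],
)
```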
Why Load and Performance Test?
Does the application respond quickly enough for the number of users?
Will the application handle load beyond expected user demand?
Is the application stable under expected and unexpected user load?
Do we believe the application will handle the required amount of load?
Load Testing
Tests the application for its performance under normal and peak usage by checking that responses to user requests are consistently within the accepted tolerance.

Stress Testing
Involves testing beyond normal operational capacity, often to a breaking point, in order to observe the results. This kind of test starts from a good load test that the application has already passed; the load is then increased slowly up to the point where the system fails to respond.

Spike Testing
Involves increasing the system load rapidly using bursts of concurrent users to check the behavior of the system. The goal is to determine whether performance will suffer, the system will fail, or it is able to handle dramatic changes in load.

Volume Testing
Involves exchanging huge volumes of data with the database and testing its relation to the performance of the application. This can be performed over short and long durations.

Endurance Testing
Also known as soak testing, this is typically done to determine whether the system can sustain the continuous expected load. During tests, memory utilization is monitored to detect potential leaks. For example, a system may behave exactly as expected when tested for 1 hour, but when the same system is tested for 3 hours, problems such as memory leaks cause it to fail or behave unpredictably.
AWS Load and Performance Testing: Parameters
Response Time – the total amount of time it takes to respond to a request for service. That service can be anything from a memory fetch to a disk I/O, a complex database query, or loading a full web page.
Throughput – calculated as requests/unit of time. The time is calculated from the start of the first sample to the end of the last sample. This includes any intervals between samples, as it is supposed to represent the load on the server.
The formula is: Throughput = (number of requests) / (total time).
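For example, 1,000 requests completed over a 50-second window (including the intervals between samples) give a throughput of 1,000 / 50 = 20 requests per second.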
Latency – a measure of responsiveness that represents the time taken to complete the execution of a request. It can also represent the sum of the latencies of several subtasks.
90% Line (Percentile) – The 90th percentile is the value for which 90% of the data points are smaller. This is a standard statistical measure.
Baseline – a set of tests run to capture performance metric data for the purpose of evaluating the effectiveness of subsequent tuning activities performed on the application.
Benchmarking – a process of comparing system performance against the baseline that is created internally or against an industry standard recognized by another organization.
Transaction Response Time – represents the time taken for the application to complete a defined transaction or business process.
Hits per Second – hits are requests of any kind made from the virtual client to the application being tested (client to server). The metric counts the number of hits; the higher the hits per second, the more requests the application is handling per second.
Throughput – the amount of data transferred across the network. It considers only the amount of data transferred from the server to the client and is measured in bytes/sec.
Collecting key project information including performance testing goals (expected users, response time, etc.), required credentials, application URL, tool preference, etc.
Creating complete test plans that incorporate objectives, scope, approach, and focus of the software testing effort. Test planning includes planning load test with the reference of Capacity Planning conducted at the time of Application Development. Capacity Planning helps determine what type of hardware and software configuration is required to meet application needs, and how many resources an application uses. The main goal is to identify the right amount of resources required to meet service demands now and in the future.
In this step, the actual scripts are created according to the business flows using performance testing tools such as JMeter, LoadRunner, etc. A sample execution with 1-5 users is typically done to verify the created script.
Users then execute the created script with the provided parameters (such as number of users, ramp-up and ramp-down time, etc.). This requires monitoring the servers for memory utilization, CPU utilization, disk I/O, etc.
In this phase, users take the test results, perform analysis, and create a report that identifies performance bottlenecks such as memory leaks, server errors and response times.
The above reports enable users to fix these bottlenecks and continue to tune the application until specified goals are met.
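As a small sketch of the analysis step, the snippet below derives the average response time, throughput and 90% line from a list of per-request timings; the sample values, total duration and the way results are exported from the load testing tool are assumptions.

```python
# Hypothetical per-request response times in seconds, exported from the load testing tool
samples = [0.21, 0.34, 0.29, 0.45, 1.20, 0.38, 0.27, 0.95, 0.33, 0.41]
total_test_duration = 5.0  # seconds from the start of the first sample to the end of the last (assumed)

average_response_time = sum(samples) / len(samples)

# Throughput = (number of requests) / (total time), as defined above
throughput = len(samples) / total_test_duration

# 90% line: nearest-rank 90th percentile of the sampled response times
ordered = sorted(samples)
rank = max(int(round(0.9 * len(ordered))) - 1, 0)
percentile_90 = ordered[rank]

print(f"Average response time: {average_response_time:.2f} s")
print(f"Throughput: {throughput:.1f} requests/s")
print(f"90% line: {percentile_90:.2f} s")
```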
Preventive controls – aimed at preventing an event from occurring
Corrective controls – aimed at fixing a system in case of a negative event or disaster
Detective controls – aimed at detecting and discovering negative events
The terms are similar, but fundamentally different. RTO is the maximum length of time after an outage that your company is willing to wait for the recovery process to finish. RPO is the maximum amount of data loss your company is willing to accept as measured in time.
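For example, an RTO of four hours means the business can tolerate the recovery process taking up to four hours after an outage, while an RPO of 15 minutes means it can afford to lose at most the last 15 minutes of data.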
Key steps for backup and recovery:
Key steps in the Preparation Phase include:
Key steps include:
Key steps for recovery in a Multi-Site DR approach include:
Distance between the sites – longer distances are typically subject to more latency or jitter
Available bandwidth – the breadth and variability of the interconnections
Data rate required by the application – should be lower than the available bandwidth
Replication technology – should be parallel (so that it can use the network effectively)
Working together, Infostretch and AWS are helping enterprises leverage the cloud to develop, test and deploy new digital products and services more quickly, flexibly, and efficiently while assuring the highest quality and service levels. We help delivery organizations unearth the full potential of cloud computing by enabling them to capitalize on the agility and security that DevOpsSecTest brings to accelerate their digital transformation initiatives.
Our close collaboration with our AWS partner team brings value to clients’ cloud migration journey at every phase of the project covering design, build, test and delivery. We utilize time-tested frameworks, standards and procedures that are instrumental to realizing cloud migration objectives. Our experience and expertise cover a broad cross-section of industry verticals, such as wearables, digital health and IoT for the gaming industry to name a few, as well as a wide variety of technical domains such as DevOpsSecTest, IoT and microservices.
Our AWS cloud hosting and software services are primarily focused on architecting and implementing in four key areas: Re-hosting, Re-Platforming, Replacing, and Re-factoring. We help customers plan their migration path to the AWS Cloud based on actual system conditions and criticality to operations by analyzing various aspects of their environment, including security, networking, reliability and failover, services and support requirements, billing, licensing, transfer cost and third-party tool support.
Our cloud migration plan is focused on identifying the mission and business-critical functions of the customer and relaying that to a delivery roadmap that can produce time-bound expected ROI.
The key pillars of our delivery roadmap include:
Our project plan for AWS cloud hosting and related software services are classified into three areas: Gap Analysis; Migrate & Adopt; and Sustain & Scale. The planning process helps customers build collaboration between product-as-a-service, data, and cloud services to take full advantage of the scalability, cost and agility advantages of the AWS platform.
Infostretch helps customers classify their applications on a spectrum of complexity and criticality based on direct impact to business-critical functions. We also develop migration plans that cater to each customer’s specific business needs. In addition, Infostretch’s cloud factory model and readily available accelerators help customers gain a holistic picture of “what it really takes” in terms of resources, skills, effort, timelines and risks as they progress on their AWS cloud journey.
We would love to hear more about your project.
Even a short phone call can help us explain how our solutions can accelerate your mobility, jump start your continuous delivery and help reduce costs. And that’s just for starters, understanding more about your project will enable us to build a solution that fits your objectives, infrastructure and aspirations!
Contact us