Create a realistic load on clusters before running tests

Registered by Denis Klepikov

How to create a pre-production load on a cluster before starting tests

Horizon - H
Tenant - T
Controller node - CTRL
Compute node - CN

1 Controller nodes:

1.1 Create 10 tenants (T) (5 using Horizon, 5 using CLI)
1.2 Log in to Horizon (H) as each created tenant and navigate through all tabs and menus in H
1.3 Create 10 users inside each T from step 1.2
1.4 Log in to Horizon (H) as each created user (step 1.3) and navigate through all tabs and menus in H
1.5 Upload 2 images from each T using the H UI (at least 5 uploads should start at the same time)
1.6 Under each T, start creating firewall rules
1.7 Under each T, start creating and downloading keypairs

1.8 Log in as root to one of the CTRL nodes and upload 20 images to Glance one by one
1.9 Delete images 1, 3, 5, 7, 9, 11, 13, 15, 17, 19
1.10 Upload 2 images, check them with glance image-list, then delete them.

1.11 Log in as root to one of the CTRL nodes and create 20 cinder volumes
1.12 Put 50 files with their MD5 sums into each volume (5 files larger than 5G each and 45 files larger than 120M each; on a small cluster, 1 file > 5G, 4 files > 120M, and 45 files > 12M each)

1.13 Create volume from image #4 from step 1.8
1.14 Create volume from image #20 from step 1.8
1.15 Boot 2 instances using the bootable volume from step 1.13
1.16 Boot 2 instances using the bootable volume from step 1.14

[Using admin credentials in H]
1.17 Create 20 different flavors
1.18 Modify the 20 flavors from step 1.17
1.19 Create 20 routers
1.20 Create 20 networks and connect them to the created routers
1.21 Create a new tenant
1.22 Log in to H using the credentials from step 1.21, then log out
1.23 As admin, change the password for T from step 1.21
1.24 Log in to H using the new credentials from step 1.23, then log out
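
The file set described in step 1.12 could be generated along these lines. This is only a sketch: the function name, the plan constants, the file names, and the MD5SUMS manifest name are hypothetical, "G" and "M" are assumed to mean GiB and MiB, and the target directory is assumed to be a path where the volume is already mounted.

```python
import hashlib
import os
from pathlib import Path

# (file count, minimum size in bytes) per group, taken from step 1.12.
# Assumption: "G" and "M" are read as GiB and MiB.
FULL_PLAN = [(5, 5 * 1024**3), (45, 120 * 1024**2)]
SMALL_PLAN = [(1, 5 * 1024**3), (4, 120 * 1024**2), (45, 12 * 1024**2)]


def write_files_with_manifest(target_dir, plan, chunk_size=1024 * 1024):
    """Write random files per `plan` and an md5sum-style manifest next to them."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    lines = []
    index = 0
    for count, min_size in plan:
        for _ in range(count):
            index += 1
            name = f"file_{index:03d}.bin"  # hypothetical naming scheme
            digest = hashlib.md5()
            remaining = min_size + 1  # one byte over, so the file is "more than" the minimum
            with open(target_dir / name, "wb") as fh:
                while remaining > 0:
                    chunk = os.urandom(min(chunk_size, remaining))
                    fh.write(chunk)
                    digest.update(chunk)  # hash while writing to avoid a second read
                    remaining -= len(chunk)
            # "<hash>  <name>" is the line format that `md5sum -c` accepts.
            lines.append(f"{digest.hexdigest()}  {name}")
    manifest = target_dir / "MD5SUMS"
    manifest.write_text("\n".join(lines) + "\n")
    return manifest
```

Inside the mounted volume, the manifest can later be checked with `md5sum -c MD5SUMS`, which fits the MD5 check in step 3.5.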

2 Load inside instances:

To calculate the instance count, we can use a simple formula:

CN CPU core count - 3 CPU cores (for the CN itself) - 4 CPU cores (for tests, to prevent overcommit)
So if we have a CN with 40 CPU cores:
40-3-4=33 instances per CN
RAM should also be calculated to prevent overcommit
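
The formula above can be written as a small helper. This is a sketch that assumes one vCPU per instance, as the formula implies; the function name and the RAM parameters are hypothetical, since the blueprint does not say how RAM should be accounted for.

```python
def instances_per_compute_node(cpu_cores, ram_gb=None, ram_per_instance_gb=None,
                               host_cores=3, test_cores=4, host_ram_gb=0):
    """Return how many one-vCPU instances fit on a compute node without overcommit."""
    # CPU limit: total cores minus cores reserved for the node and for the tests.
    by_cpu = max(cpu_cores - host_cores - test_cores, 0)
    if ram_gb is None or ram_per_instance_gb is None:
        return by_cpu
    # RAM limit (assumption): memory left after the host reservation,
    # divided by the flavor's RAM size.
    by_ram = int(max(ram_gb - host_ram_gb, 0) // ram_per_instance_gb)
    return min(by_cpu, by_ram)


print(instances_per_compute_node(40))  # 40 - 3 - 4 = 33
```

When both limits are given, the smaller one wins, so neither CPU nor RAM is overcommitted.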

2.1 IO load - copy a folder inside the instance and check the MD5 sums of the files in it after copying. All commands should be run with the time command.
2.2 Network load - download some files and their MD5 sums from a remote server to the instance and check the MD5 sums; upload the files to a remote server and check the MD5 sums. All commands should be run with the time command.
2.3 CPU load - upload prepared MySQL sources to the instance, install the software needed for building, build MySQL (make ….), and check the MD5 sums of the copied files. All commands should be run with the time command.
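
The IO step (2.1) could look roughly like this when scripted rather than run by hand under `time`. It is a sketch only: the function names are hypothetical, and the elapsed time is measured in-process instead of with the time command.

```python
import hashlib
import shutil
import time
from pathlib import Path


def md5_of(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timed_copy_and_verify(src_dir, dst_dir):
    """Copy src_dir to dst_dir; return (copy time in seconds, mismatched files)."""
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    # Record the MD5 sums of the source files before copying.
    expected = {p.relative_to(src_dir): md5_of(p)
                for p in src_dir.rglob("*") if p.is_file()}
    start = time.monotonic()
    shutil.copytree(src_dir, dst_dir)  # dst_dir must not exist yet
    elapsed = time.monotonic() - start
    # Compare every copied file against the recorded sum.
    mismatched = [str(rel) for rel, digest in expected.items()
                  if md5_of(dst_dir / rel) != digest]
    return elapsed, mismatched
```

An empty mismatch list means every copied file matched its source checksum.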

3 Compute nodes (CN)

FOR EACH COMPUTE NODE
 All commands should be run with the time command.
3.1 Create 2 instances
3.2 Assign floating IP
3.3 Check ping from instance, download some file
3.4 Mount volume from step 1.12
3.5 Check files on volumes for MD5
3.6 Suspend instances from step 3.1
3.7 Resume instances from step 3.1
3.8 Check that instances are up and running (repeat 3.3)
3.9 Reboot instances from step 3.1
3.10 Check that instances are up and running (repeat 3.3)
3.11 Terminate instances from step 3.1

4 Ceph cluster

The Ceph cluster should be filled up to 60% during tests.
You can total up all the images, volumes with files, and instances that will be created while reproducing the pre-load, and from that calculate how much more data you have to put into the Ceph cluster to fill it.
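
The fill calculation can be sketched as follows. The function and parameter names are hypothetical, all sizes must be in the same unit, and the replication factor is an assumption that applies to replicated pools, where raw usage is roughly the logical data size times the number of replicas.

```python
def extra_data_needed(raw_capacity, raw_used, preload_logical,
                      target_fill=0.60, replication=1):
    """Return the logical amount of data still to upload to reach target_fill."""
    target_raw = raw_capacity * target_fill
    # Data the pre-load itself will create (images, volumes, instances),
    # converted to raw usage.
    preload_raw = preload_logical * replication
    missing_raw = target_raw - raw_used - preload_raw
    # Convert back to logical size; never return a negative amount.
    return max(missing_raw / replication, 0.0)


# Example with made-up numbers: 100 TB raw capacity, 10 TB already used,
# 20 TB expected from the pre-load, no replication.
print(extra_data_needed(100, 10, 20))
```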

In case of using S3 API or Swift API via radosgw:

4.1 Create 10*CN buckets under different accounts
4.2 Upload 10 files to each bucket (filesize > 20M)
4.3 Download files (see 4.2) and check MD5
4.4-4.7 Upload 100, 200, 500, and 1000 files to each bucket (file size 1k-1M)
4.8-4.11 Download files (see 4.4-4.7) and check MD5 for several files
4.12 Clear all data in all buckets

NOTE: All commands should be started with the 'time' command.

Tests 1.1-1.7 should run continuously, and all created tenants should be removed when test 1.7 ends.

Test 1.10 should run continuously in a loop.

Tests 1.13-1.16 should run continuously in a loop, starting at the end of 1.8.

Tests 1.17-1.24 should run continuously in a loop.

Tests 2.1-2.3 should run continuously in a loop, starting at the end of 1.8. After the checks, the files should be deleted and the tests should start again from the beginning.

Tests 3.1-3.11 should run continuously in a loop.

Tests 4.1-4.12 should run continuously in a loop.

During all tests, data about every successful or unsuccessful test should be logged on a separate server and include the MD5 check results, execution time, and timestamp.
This data will help us identify bottlenecks and assess service stability under close-to-production load.
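
One possible shape for such a log record is a JSON line per test run. In this sketch the function name and field names are hypothetical, and the log target is any writable stream. Shipping the lines to the separate server is left out.

```python
import io
import json
import time
from datetime import datetime, timezone


def run_and_record(test_id, action, log_stream):
    """Run `action`, then write one JSON line with result, duration, and timestamp.

    `action` is expected to return True when its MD5 checks passed.
    """
    timestamp = datetime.now(timezone.utc).isoformat()
    start = time.monotonic()
    try:
        md5_ok = bool(action())
        error = None
    except Exception as exc:  # a failed test is logged, not re-raised
        md5_ok = False
        error = str(exc)
    record = {
        "test": test_id,
        "success": error is None and md5_ok,
        "md5_ok": md5_ok,
        "elapsed_seconds": round(time.monotonic() - start, 3),
        "timestamp": timestamp,
        "error": error,
    }
    log_stream.write(json.dumps(record) + "\n")
    return record


# Usage: an in-memory stream stands in for the real log destination.
buffer = io.StringIO()
run_and_record("2.1", lambda: True, buffer)
print(buffer.getvalue(), end="")
```

One line per run keeps the log easy to append to from many loops at once and easy to filter by test number later.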

Also we can create a config file with several workload configurations:
- pre-loaded cluster (70% of compute CPU is used)
- stress test (100%)
- overload (110-120%)
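
These workload configurations could be expressed as a small mapping from profile name to target CPU usage. The profile names and the helper below are hypothetical, and the overload profile is set to 120%, the upper end of the 110-120% range.

```python
# Target CPU usage per workload profile, in percent of usable compute cores.
WORKLOADS = {
    "preloaded": 70,
    "stress": 100,
    "overload": 120,  # assumption: upper end of the 110-120% range
}


def instances_for_workload(usable_cores, workload, vcpus_per_instance=1):
    """Return the instance count that loads `usable_cores` to the profile's target."""
    percent = WORKLOADS[workload]
    # Integer arithmetic rounds down, so the target is never exceeded.
    return (usable_cores * percent) // (100 * vcpus_per_instance)


# With 33 usable cores per compute node (40 - 3 - 4):
for name in WORKLOADS:
    print(name, instances_for_workload(33, name))
```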

To load all compute nodes at the same rate, you can calculate flavors for groups of compute nodes with the same hardware.

Blueprint information

Status:
Not started
Approver:
Nastya Urlapova
Priority:
High
Drafter:
Denis Klepikov
Direction:
Approved
Assignee:
None
Definition:
New
Series goal:
Accepted for newton
Implementation:
Not started
Milestone target:
10.0

Whiteboard

I suggest we use Rally to create the necessary load on the test environment.
