1. Introduction

As part of the ARMv8 bring-up work, the Kernel Working Group took on porting the Cortex Strings library to the kernel.

2. Code

The revised kernel code is available at kernel_stringlib_for_arm64_git.

3. Test Results

Tests for this code are available at cortex_string_test_git.

The tests were run on the ARMv8-A Fast Model with one core.

Several implementations of each function are compared in the tests:

  • cortex-function name : the original function from the Cortex Strings library, such as cortex-memset.

  • updated-function name : the new function to be upstreamed, such as updated-memset.

  • existing-function name : the current aarch64 kernel function, such as existing-memset.

3.1. memcmp

3.1.1. source0 offset is unequal to source1's

The updated-memcmp only optimizes the case where the alignment offsets of the two source addresses are unequal.

3.1.1.1. source0 offset is bigger than source1's

s13 means the source0 address is 13, so its offset from an 8-byte boundary is 5. d10 means the source1 address is 10, so its offset from an 8-byte boundary is 2.
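The offsets quoted above follow from the address arithmetic. As a minimal sketch, the helper below (the name align_offset is hypothetical, not taken from the test code) computes the offset of an address from the previous power-of-two boundary. The boundary of 8 is inferred from the quoted numbers (13 gives 5, 10 gives 2).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: offset of addr from the previous align-byte boundary.
 * align must be a power of two (8 matches the offsets quoted above). */
static inline uintptr_t align_offset(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return addr & (align - 1);
}
```

With this helper, align_offset(13, 8) is 5 and align_offset(10, 8) is 2, matching the s13 and d10 labels.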

memcmp-t0-data-src0-gt-s13-d10

T0-memcmp-mini-time-src0-gt-s13-d10-small

T0-memcmp-mini-time-src0-gt-s13-d10-all

  • For small lengths, the updated-memcmp is about 100% more efficient in some length ranges.

3.1.1.2. source0 offset is less than source1's

memcmp-t2-data-src0-lt-s11-d12

T2-memcmp-mini-time-src0-lt-s11-d12-small

T2-memcmp-mini-time-src0-lt-s11-d12-all

  • The updated-memcmp is more than 100% more efficient, and the gain grows as the length increases.

3.1.2. source0 offset is equal to source1's

Compared with the unequal-offset case, memcmp is much more efficient when both addresses are aligned (all the ldp accesses are aligned).
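To illustrate why wide loads help, here is a minimal C sketch of a compare loop that reads 8 bytes at a time and only drops to byte compares once a mismatching word is found. It is not the assembly under test: the function name memcmp_words is hypothetical, and the real routines use ldp (16 bytes per load pair) rather than the portable memcpy-based loads used here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: compare n bytes, 8 at a time where possible. */
static int memcmp_words(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;

    assert(n == 0 || (a != NULL && b != NULL));

    while (n >= 8) {
        uint64_t x, y;
        memcpy(&x, p, 8);   /* portable 8-byte load */
        memcpy(&y, q, 8);
        if (x != y)
            break;          /* the mismatch is inside this word */
        p += 8;
        q += 8;
        n -= 8;
    }

    /* Byte-wise compare of the mismatching word or of the short tail. */
    for (; n > 0; n--, p++, q++) {
        if (*p != *q)
            return (int)*p - (int)*q;
    }
    return 0;
}
```

When both pointers share the same alignment, every wide load in such a loop can be made to fall on a boundary, which is the situation the aligned test case measures.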

memcmp-t4-data-s0-d0

T4-memcmp-mini-time-s0-d0-alllen

3.2. memset

3.2.1. Small length memory set

3.2.1.1. aligned start address

The memory start address is zero, and the byte value written is also 0. The test invoked each function 1000 times and recorded the minimal time, to reflect the best-case performance. The tested memory lengths are less than 64.
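The measurement procedure described above can be sketched in C as follows. This is an illustrative harness under stated assumptions, not the actual test code: the names memset_fn, now_ns and min_time_ns are hypothetical, and it uses the POSIX clock_gettime call and reports nanoseconds, whereas the timing source used on the Fast Model is not stated here.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical function-pointer type matching memset's signature. */
typedef void *(*memset_fn)(void *s, int c, size_t n);

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Call fn(buf, c, len) `runs` times and return the smallest elapsed time. */
static uint64_t min_time_ns(memset_fn fn, void *buf, int c, size_t len, int runs)
{
    uint64_t best = UINT64_MAX;

    assert(runs > 0);
    for (int i = 0; i < runs; i++) {
        uint64_t t0 = now_ns();
        fn(buf, c, len);
        uint64_t t1 = now_ns();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum rather than the mean filters out runs disturbed by cache misses or interrupts, which is why it reflects the best-case performance.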

  • testcase0 raw data

The data in the following file are the minimal times over 1000 memset executions, in microseconds.

memset-t0-data-less64-s0-d0

The memory length is in the first column.

  • testcase0 performance graph

T0-memset-minimal-time-s0-d0-aligned

  • According to the test result graph, the following conclusions were drawn:
    • The current aarch64 memset is not as good as the new memset at some memory lengths, because it needs more memory write operations, especially when the length is not a multiple of 8 or 16.

3.2.1.2. non-aligned start address

The memory start address is 34, so its offset from a 16-byte boundary is 2. The test again executed memset 1000 times and recorded the minimal time.

  • testcase1 raw data

memset-t0-data-less64-s2-d0

  • testcase1 performance graph

T1-memset-minimal-time-s2-d0

  • According to the test result graph, the following conclusions were drawn:
    • The new memset performs similarly to the original cortex-strings memset, but in the new memset most memory write operations are aligned accesses, rather than all being non-aligned as in the original cortex-strings memset.
    • The current aarch64 memset is not as good as the new memset at some memory lengths, because it needs more memory write operations. With a non-aligned start address, all accesses are non-aligned.
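The idea of making most writes aligned can be sketched in portable C: write the unaligned head first, so that every bulk store that follows starts on a 16-byte boundary. This is only an illustration of the principle; the function name memset_aligned_body is hypothetical, and the actual assembly uses stp and may handle the head and tail differently (for example with wider or overlapping stores).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: fill n bytes at s with c, keeping bulk stores aligned. */
static void *memset_aligned_body(void *s, int c, size_t n)
{
    unsigned char *p = s;
    unsigned char chunk[16];

    assert(n == 0 || s != NULL);
    memset(chunk, c, sizeof chunk);

    /* Head: byte stores until p reaches a 16-byte boundary. */
    while (n > 0 && ((uintptr_t)p & 15) != 0) {
        *p++ = (unsigned char)c;
        n--;
    }

    /* Body: every 16-byte store here starts on a 16-byte boundary. */
    while (n >= 16) {
        memcpy(p, chunk, 16);
        p += 16;
        n -= 16;
    }

    /* Tail: the remaining bytes, fewer than 16. */
    while (n > 0) {
        *p++ = (unsigned char)c;
        n--;
    }
    return s;
}
```

A routine that instead issues 16-byte stores starting directly at an unaligned address makes every one of them an unaligned access.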

3.2.2. bigger length memory set

The memory length is not less than 64.

3.2.2.1. aligned start address

  • testcase2 raw data

memset-t2-data-notless64-s0-d0

  • testcase2 performance graph

T2-memset-minimal-time-s0-d0

  • According to the test result graph, the following conclusions were drawn:
    • As the length increases, the efficiency of the current kernel memset gets worse.

3.2.2.2. non-aligned start address

  • testcase3 raw data

memset-t3-data-notless64-s2-d0

  • testcase3 performance graph

T3-memset-minimal-time-s2-d0

3.3. memcpy

3.3.1. source offset is equal to destination offset

In test case 0 and test case 2, the source start address is 20, as is the destination start address.

Test case 0 covers small-length memcpy.

  • testcase0 raw data

memcpy-t0-data-less64-s4-d4

  • testcase0 performance graph

T0-memcpy-minimal-time-s4-d4

  • The new memcpy performs slightly better than the others (the use of stp/ldp contributes to this result).
  • testcase2 raw data

Test case 2 covers larger-length memcpy.

memcpy-t2-data-less64-s4-d4

  • testcase2 performance graph

T2-memcpy-minimal-time-s4-d4

  • testcase4 raw data

In test case 4, the source start address is 16, as is the destination start address, which means both addresses are 16-byte aligned.

memcpy-t4-data-less64-s0-d0

  • testcase4 performance graph

T4-memcpy-minimal-time-gt64-s0-d0

  • The new memcpy performs better than the current kernel memcpy, and the advantage grows as the length increases.
  • The efficiency of the new memcpy is close to that of the original cortex-strings memcpy, except that the new memcpy mostly accesses memory at aligned addresses.

3.3.2. source offset is unequal to destination offset

Test case 1 covers small-length memcpy.

  • testcase1 raw data

memcpy-t1-data-less64-s1-d7

  • testcase1 performance graph

T1-memcpy-minimal-time-s1-d7

Test case 3 covers larger-length memcpy.

  • testcase3 raw data

memcpy-t3-data-notless64-s1-d7

  • testcase3 performance graph

T3-memcpy-minimal-time-s1-d7

  • When the memory length is larger, the new memcpy is a bit better than the current kernel memcpy, but the improvement is less pronounced than in the case where the offsets are equal. (At the same length, the elapsed time when the offsets are equal is much smaller than when the offsets are unequal.)
  • The write operations to the destination are not aligned accesses.

3.4. memmove

Depending on the difference between the source address and the destination address, there are two main test cases. When the source is bigger than the destination, cortex-memmove and updated-memmove follow different paths, so some additional test cases were run to compare their performance.

3.4.1. source is bigger than destination

The updated-memmove actually calls the corresponding memcpy to perform the move. The cortex-memmove instead takes one of two branches depending on the difference between source and destination: when the difference is less than 16, the move is handled in the corresponding memmove code; otherwise it is handled by memcpy.
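The reason a forward, memcpy-style copy can be reused when the source is above the destination is that each source byte is read before the copy can overwrite it. A minimal byte-wise sketch of the direction choice follows. The name memmove_sketch is hypothetical, and the sketch deliberately ignores the 16-byte threshold and the wide loads used by the real routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: overlap-safe move that picks the copy direction. */
static void *memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));

    if ((uintptr_t)d <= (uintptr_t)s) {
        /* Source at or above destination: a forward copy reads every
         * byte before it can be overwritten. */
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        /* Source below destination: copy backward, from the last byte. */
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

For example, moving 7 bytes from offset 3 to offset 0 of "abcdefghij" takes the forward path and leaves "defghijhij" in the buffer.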

  • testcase raw data

s29 means the source start address is 29, and d22 means the destination start address is 22. Test case 0 and test case 1 (s29-d22, difference 7) cover the case where the difference between source and destination is less than 16. Test case 2 and test case 3 (s29-d6, difference 23) cover the case where the difference is bigger than 16.

memmove-t0-data-less64-s29-d22

memmove-t1-data-notless64-s29-d22

memmove-t2-data-less64-s29-d6

memmove-t3-data-notless64-s29-d6

  • testcase performance graph

T0-memmove-lt64-mini-time-s29-d22

T1-memmove-notless64-mini-time-s29-d22

T2-memmove-mini-time-lt64-s29-d6

T3-memmove-notless64-mini-time-s29-d6

  • The updated-memmove performs slightly better than the existing-memmove.
  • The performance of updated-memmove and cortex-memmove is close in these test cases, but all source accesses in updated-memmove are aligned.
  • When the source alignment offset equals the destination's, the updated-memmove is more efficient than cortex-memmove. Because the move is done through memcpy, please refer to the memcpy test results.

3.4.2. source is less than destination

All of the processing is done in the corresponding memmove code rather than in memcpy.

  • testcase raw data

memmove-t5-data-less64-s17-d23

memmove-t6-data-notless64-s17-d23

T5-memmove-mini-time-lt64-s17-d23

  • The cortex-memmove is better than the updated-memmove at several lengths.

T6-memmove-mini-time-notless64-s17-d23

  • The updated-memmove is better than the existing-memmove at some lengths, and the advantage grows as the length increases.

3.5. strcmp

The updated-strcmp only optimizes the case where the alignment offsets of the two source addresses are unequal.

3.5.1. both strings' align offsets are equal

s0 means the source0 address is 0, so its alignment offset is 0. d0 means the source1 address is 0, so its alignment offset is also 0.

strcmp-t0-data-aligned-s0-d0

T0-strcmp-mini-time-lt64-s0-d0

T0-strcmp-mini-time-s0-d0-big

  • The updated strcmp has much better efficiency than the current kernel strcmp.
  • In this case, the efficiency is equal between updated-strcmp and cortex-strcmp.

3.5.2. source0 offset is unequal to source1's

s31 means the source0 address is 31, so its offset from an 8-byte boundary is 7. d25 means the source1 address is 25, so its offset from an 8-byte boundary is 1.

strcmp-t2-data-nonalign-s31-d25

T2-strcmp-mini-time-lt64-s31-d25

T2-strcmp-mini-time-s31-d25-big

  • The updated-strcmp has the best performance, and the efficiency gap widens as the length increases.

3.6. strncmp

The count parameter passed here is bigger than the length of both strings (for example, 1K).
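With a count larger than both string lengths, the NUL terminator ends the comparison before the count does, so strncmp returns a result with the same sign as strcmp. The short C sketch below checks that property using the standard library functions. The helper names sign and strncmp_matches_strcmp are hypothetical and are not part of the test suite.

```c
#include <assert.h>
#include <string.h>

/* Reduce a comparison result to -1, 0 or 1. */
static int sign(int v)
{
    return (v > 0) - (v < 0);
}

/* Hypothetical helper: returns 1 when strncmp with `count` agrees in sign
 * with strcmp on the same strings, 0 otherwise. */
static int strncmp_matches_strcmp(const char *a, const char *b, size_t count)
{
    assert(a != NULL && b != NULL);
    return sign(strncmp(a, b, count)) == sign(strcmp(a, b));
}
```

With a count of 1024, "kernel" and "kernem" compare the same way under both functions. With a count of 5, strncmp reports equality while strcmp does not, so only a large count exercises the full string path.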

3.6.1. both strings' align offsets are equal

strncmp-t0-data-aligned-s0-d0

T0-strncmp-mini-time-small-s0-d0

T0-strncmp-mini-time-s0-d0-big

3.6.2. source0 offset is unequal to source1's

strncmp-t2-data-nonalign-s31-d25

T2-strncmp-mini-time-lt64-s31-d25

  • The updated-strncmp is slightly worse than cortex-strncmp in the short length range (from 9 to 13).

T2-strncmp-mini-time-s31-d25-big

3.7. strlen

3.7.1. string address is aligned

Here, s16 means the string address is 16, which is 16-byte aligned.

strlen-t0-data-aligned-s16

T0-strlen-mini-time-aligned-s16

3.7.2. string address is not aligned

Here, s27 means the string address is 27, so its offset from a 16-byte boundary is 11.

strlen-t2-data-non-align-s11

T2-strlen-mini-time-non-align-s11

WorkingGroups/Kernel/ARMv8/cortex-strings (last modified 2013-12-12 03:46:48)