Быстрая замена BSF

Тема в разделе "WASM.A&O", создана пользователем Onix-Studio, 1 ноя 2006.

  1. Onix-Studio

    Onix-Studio New Member

    Публикаций:
    0
    Регистрация:
    23 окт 2006
    Сообщения:
    22
    В программе часто приходится употреблять эту инструкцию, поэтопу решил поискать в сети оптимизацию и натолкнулся на следующее:

    -----------------------------------------------------------------------------------------
    This macro utilizes an algorithm published in the news group comp.arch by Robert Harley in 1996. The algorithm converts the general problem of finding the position of the least significant set bit into a special case of counting the number of bits in a contiguous block of set bits. By computing x^(x-1), where x is the original input argument, a right-aligned contiguous group of set bits is created, whose cardinality equals the position of the least significant set bit in the original input plus 1.

    The input x is of the form (regular expression): {x}n1{0}m. x-1 has the form {x}n{0}(m+1), and x^(x-1) has the form {0}n{1}(m+1). This step is pretty similar to the one used by macro PREPBSF, only that PREPBSF creates a right-aligned group of set bits whose cardinality equals exactly the position of the least significant set bit.

    Harley's algorithm then employs a special method to count the number of bits in the right-aligned contiguous block of set bits. I am not sure upon which number theoretical argument it is founded, and it wasn't explained in the news group post.
    According to Harley, if a 32-bit number of the form 00...01...11 is multiplied by the "magic" number (7*255*255*255), then bits <31:26> of the result uniquely identify the number of set bits in that number. A 64 entry table is used to map that unique result to the bit count. Here, I have modified the table to reflect the bit position of the least significant set bit in the original argument, which is one less than the bit count of the intermediate result.

    I have tested the macro EMBSF5 exhaustively for all 2^32-1 possible inputs, i.e. for all 32-bit numbers except zero.
    Place the following table in the data segment:
    Код (Text):
    1. table  db      0, 0, 0,15, 0, 1,28, 0,16, 0, 0, 0, 2,21,29, 0
    2.          db      0, 0,19,17,10, 0,12, 0, 0, 3, 0, 6, 0,22,30, 0
    3.          db     14, 0,27, 0, 0, 0,20, 0,18, 9,11, 0, 5, 0, 0,13
    4.          db     26, 0, 0, 8, 0, 4, 0,25, 0, 7,24, 0,23, 0,31, 0
    And here follows the actual macro:
    Код (Text):
    1. ;
    2. ; emulate bsf instruction
    3. ;
    4. ; input:
    5. ;   eax = number to preform a bsf on ( != 0 )
    6. ;
    7. ; output:
    8. ;   edx = result of bsf operation
    9. ;
    10. ; destroys:
    11. ;   ecx
    12. ;   eflags
    13. ;
    14.  
    15. MACRO   EMBSF5      
    16.  
    17.         mov     edx,eax         ; do not disturb original argument
    18.         dec     edx             ; n-1
    19.         xor     edx,eax         ; n^(n-1), now EDX = 00..01..11
    20.  
    21. IFDEF FASTMUL
    22.         imul    edx,7*255*255*255       ; multiply by Harley's magic number
    23. ELSE
    24.         mov     ecx,edx         ; do multiply using shift/add method
    25.         shl     edx,3
    26.         sub     edx,ecx
    27.         mov     ecx,edx
    28.         shl     edx,8
    29.         sub     edx,ecx
    30.         mov     ecx,edx
    31.         shl     edx,8
    32.         sub     edx,ecx
    33.         mov     ecx,edx
    34.         shl     edx,8
    35.         sub     edx,ecx
    36. ENDIF
    37.  
    38.         shr     edx,26          ; extract bits <31:26>
    39.         movzx   edx,[table+edx] ; translate into bit count - 1
    40.  
    41. ENDM
    Note: FASTMUL can be defined if your CPU has a fast integer multiplicator, like the AMD. The IMUL replacement should run in about 8-9 cycles.
    -----------------------------------------------------------------------------------------

    Но что-то не заметил особого ускорения (AMD Duron 1300). Кто-нибудь сталкивался c этой оптимизацией? Может по-другому можно?
     
  2. leo

    leo Active Member

    Публикаций:
    0
    Регистрация:
    4 авг 2004
    Сообщения:
    2.542
    Адрес:
    Russia
    Для всех современных процев (P6, P4, AMD) цепочка imul + shr + movzx r,m выполняется за то же или бОльшее время, чем bsf -> шило на мыло ;)