Tips for Code & ASM

Postby infuzion on Thu Sep 08, 2011 2:08 pm

trogluddite wrote:I usually reserve a block of syntax definitions in Notepad++ for variable names, to make sure my spelling is consistent - hence I thought my code was thoroughly checked. But of course, a duplicate is still spelled right, so my check wasn't as thorough as I thought!

My usual practice is to use CamelCase variable/array names that are descriptive; it makes debugging so much easier - e.g. LeftBuffer[DelayCount], AllDoneFlag, etc...
Using a capital first letter also helps separate variables visually from SM functions & ASM commands.

Using a real text editor allows you to save .ASM & Code separately, global search & replace, user editable code highlighting, macros, & much more!
Need help? First search the forum & WiKi, then post in the help forum with a clear topic, request, & OSM. Then please WiKi the correct solution. If you want my personal assistance, I charge by the hour or for an exchange of services.
infuzion
smstar
smstar
 
Posts: 6163
Joined: Wed May 04, 2005 8:02 pm
Location: Earth, USA, CO, Denver

Re: Tips for Code & ASM

Postby trogluddite on Thu Sep 08, 2011 3:33 pm

Second that! Once you have tried it, you will wonder how you ever managed without it.
I've used Notepad++ for a long time now, very friendly and configurable (and free!) - my favourite feature is being able to have the same code open in two windows that are both editable, so that I don't have to keep scrolling up to add new variable declarations!

There's a Notepad++ syntax definition file for SM's assembly primitive in this old post so that you can get up and running quickly.
When I'm back on my main machine, I'll have a poke around, I think I've updated it a bit since then.
Feel free to use any schematics and algorithms I post on the forum in your own designs - a credit is appreciated (but not a requirement).
Don't stagnate, mutate to create. Without randomness and serendipity the earth would be just another barren rock.
User avatar
trogluddite
smychopath
 
Posts: 3028
Joined: Mon Oct 20, 2008 3:52 pm
Location: Yorkshire, UK

Re: Tips for Code & ASM

Postby infuzion on Thu Sep 08, 2011 6:06 pm

trogluddite wrote:There's a Notepad++ syntax definition file for SM's assembly primitive in this old post so that you can get up and running quickly.
Here is mine; now finish that trigger tutorial please! /cracks whip ;)
Attachments
NotepadPlusPlus_asmSM.7z
(884 Bytes) Downloaded 214 times

Re: Tips for Code & ASM

Postby trogluddite on Thu Sep 08, 2011 8:28 pm

infuzion wrote:Here is mine; now finish that trigger tutorial please! /cracks whip

Yes, I'll try to do the next lesson this weekend, boss! I couldn't quite face SM while my latest baby was poorly with array bugs and random crashes, but Malc made it all better - so I shall repay his efforts by helping out with the missing SM documentation! ;)

Re: Tips for Code & ASM

Postby mwvdlee on Tue Sep 27, 2011 4:57 pm

Seems I found a solution for my long-standing bug. It's not a generic solution - the envelope control is still buggy - but it works specifically for my situation. It's possible it may apply to some other people's code, so I'll try and work out a more generic demonstration.

In the meantime, I was trying to optimize some ASM code and found the following:
Code...
Code: Select all
dif = abs(in);

translates in ASM to:
Code: Select all
movaps xmm0,in;
movaps smIntVarTemp,xmm0;
cmpps xmm0,smIntVarZero,6;
andps xmm0,smIntVarTemp;
addps xmm0,xmm0;
subps xmm0,smIntVarTemp;
movaps dif,xmm0;


IMHO, the following code does exactly the same:
Code: Select all
dif = max(in, 0-in);

But translates to somewhat less assembly:
Code: Select all
movaps xmm0,F0;
subps xmm0,in;
movaps xmm1,xmm0;
maxps xmm1,in;
movaps dif,xmm1;

This last one even has a redundant movaps!

Perhaps maxps is incredibly slow or something?
Am I missing something or would this be a generic improvement worthy of replacing the current abs() implementation?
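
A quick way to check that the two forms really agree is to mirror both instruction sequences in ordinary code. The following is only a Python sketch of the arithmetic, not SM code; the function names are made up for illustration, and it assumes the cmpps predicate 6 acts as "greater than zero" for non-NaN inputs.

```python
def abs_sm(x: float) -> float:
    """Mirror SM's generated abs(): keep the positive part, double it, subtract x."""
    positive_part = x if x > 0.0 else 0.0       # cmpps ...,6 builds the mask, andps applies it
    return (positive_part + positive_part) - x  # addps xmm0,xmm0 then subps xmm0,temp


def abs_max(x: float) -> float:
    """Mirror dif = max(in, 0-in): subtract from zero, then take the maximum."""
    return max(x, 0.0 - x)


# Compare both forms against Python's own abs() on a few sample values.
samples = [-2.5, -1.0, 0.0, 0.25, 3.0]
for s in samples:
    assert abs_sm(s) == abs_max(s) == abs(s)
```

Both give the same results for these samples, so any difference between them is one of speed rather than output.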
My current top SynthMaker bug:
    1. MIDI Input issue (showstopper, no workaround)
    2. All my previous bugs in SM1.7, because bug 1 makes SM2 worse than SM1.7
User avatar
mwvdlee
smanatic
 
Posts: 552
Joined: Thu Dec 03, 2009 8:42 am
Location: NL

Re: Tips for Code & ASM

Postby mwvdlee on Tue Sep 27, 2011 5:25 pm

Division

SM translates this division in code:
Code: Select all
out = total / 4410;

Into this ASM:
Code: Select all
movaps xmm0,total;
divps xmm0,F4410;
movaps out,xmm0;

Only three statements. But one of them, divps, can be VERY expensive.

If you don't need the full precision of divps and can settle for about 12 bits of precision, you can multiply by the reciprocal instead.
Code: Select all
movaps xmm0,F4410;
rcpps xmm0,xmm0;
mulps xmm0,total;
movaps out,xmm0;

One instruction more, but both rcpps and mulps are cheap.

This little trick, applied to just ONE case, decreased CPU from ~5.2% to ~5.0%; a relative improvement of about 4%!

FYI, In this context, the reciprocal of a number is the same as dividing 1 by that number; reciprocal of A is 1/A.

If you do more than one division by the same number, you could even take the time to calculate the reciprocal of the divisor in high precision and multiply by that number:
Code: Select all
out = A / 1234;
out = B / 1234;

Translates to:
Code: Select all
movaps xmm0,A;
divps xmm0,F1234;
movaps out,xmm0;
movaps xmm0,B;
divps xmm0,F1234;
movaps out,xmm0;

But, calculating the reciprocal of 1234 first:
Code: Select all
R1234 = 1/1234;
out = A * R1234;
out = B * R1234;
or in ASM...
Code: Select all
movaps xmm0,F1;
divps xmm0,F1234;
movaps R1234,xmm0;
movaps xmm0,A;
mulps xmm0,R1234;
movaps out,xmm0;
movaps xmm0,B;
mulps xmm0,R1234;
movaps out,xmm0;

Even though the latter uses more code, the advantage of having only one divps and two mulps versus two divps is quite high.

Of course, if the divisor is static or coming from outside the code module, you should precalculate the reciprocal :)
Doing that took my CPU load down a further 0.1%, to ~4.9% (whilst improving quality, if the reciprocal is rounded to sufficient digits). These sound like minor improvements, but if you add up many of these simple changes, the performance difference can be dramatic.
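
To get a feel for how little accuracy a precalculated reciprocal costs, the comparison can be sketched outside SM. This is a Python illustration under assumptions, not SM code: `f32` is a hypothetical helper that rounds to 32-bit float the way a single SSE channel would, and the test values are arbitrary.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def divide(a: float, divisor: float) -> float:
    """out = a / divisor, one channel of what divps computes."""
    return f32(f32(a) / f32(divisor))


def multiply_by_reciprocal(a: float, reciprocal: float) -> float:
    """out = a * R, where R was calculated once beforehand."""
    return f32(f32(a) * reciprocal)


R1234 = f32(1.0 / 1234.0)  # calculated once, outside the per-sample code

# Track the largest relative difference between the two methods.
worst = 0.0
for a in (0.001, 0.5, 1.0, 617.0, 1234.0, 44100.0):
    exact = divide(a, 1234.0)
    fast = multiply_by_reciprocal(a, R1234)
    worst = max(worst, abs(fast - exact) / abs(exact))

print(f"worst relative difference: {worst:.2e}")
```

The two results can differ only in the last bit or so of a 32-bit float, which is far closer than the roughly 12 bits that rcpps gives.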

Re: Tips for Code & ASM

Postby Mo on Tue Sep 27, 2011 5:51 pm

mwvdlee wrote:Of course, if the divisor is static or coming from outside the code module, you should precalculate the reciprocal
Also it can be calculated in stage(0).
User avatar
Mo
essemilian
 
Posts: 439
Joined: Thu Jan 24, 2008 2:00 pm
Location: Copenhagen

Re: Tips for Code & ASM

Postby mwvdlee on Tue Sep 27, 2011 6:00 pm

Mo wrote:
mwvdlee wrote:Of course, if the divisor is static or coming from outside the code module, you should precalculate the reciprocal
Also it can be calculated in stage(0).

Indeed. At least that would guarantee good quality for those numbers that don't round out well.
Otherwise, why waste the cycles? ;) AFAIK stage(0) is called for every triggered poly (it should also be called for every re-triggered poly, but it isn't).

I've got some other tricks I'd like to share here; most are probably well known already, but it doesn't hurt to document them for those who don't know them.

Re: Tips for Code & ASM

Postby stw on Tue Sep 27, 2011 7:41 pm

Hi mwvdlee,
good to hear you could succeed with that annoying bug!
About the abs routines:
Though your math is a clever solution, I guess it won't make much of a difference compared to the SM code. I think behind the scenes maxps is doing exactly what the code describes: a cmp/and/add.
The biggest saving for abs is the so-called "abs mask" trick, discovered by sambean decades ago, IIRC!

Code: Select all
streamin in;
streamout out;
int      absmask=2147483647;

movaps  xmm0,in;
andps xmm0,absmask;
movaps out,xmm0;


There are some threads relating to ASM optimizing, e.g.:
viewtopic.php?f=5&t=5379
viewtopic.php?f=12&t=5359

Regarding the fastest code, I recently found this list, which covers a variety of comparisons for (many) different chips.
http://www.agner.org/optimize/instruction_tables.pdf
maybe it's useful for you or anyone else


oh...and new tips&tricks are always appreciated! :D
stw
smanatic
 
Posts: 640
Joined: Mon Jun 30, 2008 2:55 pm

Re: Tips for Code & ASM

Postby trogluddite on Tue Sep 27, 2011 7:49 pm

mwvdlee wrote:dif = abs(in);

There is an even cheaper, sneaky, way to do this...
Code: Select all
int absmask=2147483647;  //must be exactly this number!!

movaps xmm0,in;
andps xmm0,absmask;
movaps dif,xmm0;

The integer 'absmask' creates a bitmask which allows all of a float number through the bitwise andps - except for the sign bit, which gets set to zero (the number, 0x7FFFFFFF in hex, is the largest possible positive 32-bit integer).
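
The same bit trick can be tried outside SM to see why the exact number matters. This is a Python sketch only; `float_bits`, `bits_float` and `abs_by_mask` are names invented for the illustration, and it works on one 32-bit float rather than four SSE channels.

```python
import struct

ABS_MASK = 2147483647  # 0x7FFFFFFF: every bit set except bit 31, the float sign bit


def float_bits(x: float) -> int:
    """Return the raw 32-bit pattern of x stored as a single-precision float."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def abs_by_mask(x: float) -> float:
    """Clear the sign bit with a bitwise AND, as andps does with absmask."""
    return bits_float(float_bits(x) & ABS_MASK)


for value in (-1.5, 2.0, -0.0, -44100.0):
    print(value, "->", abs_by_mask(value))
```

Any other mask value would either leave the sign bit set or clear bits of the exponent or mantissa, which is why the constant has to be exactly this number.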

mwvdlee wrote:you could even take the time to calculate the reciprocal

Another closely related tip...
Try and avoid this kind of dependency...
Code: Select all
divps xmm0,number;
addps xmm0,AnotherNumber;
mulps xmm0,Avariable;

In this case, the add and multiply have to wait until the division is finished before they can begin their calculations. But with a modern CPU, there are multiple execution units that can get on with executing some other instructions while the divps does its stuff - as long as those instructions are not waiting for the division result.
So if you really need a divide, try and set it running 'early', and put some instructions from an unrelated part of the code immediately after it - so that something else useful is being done while the divide calculates.

These kinds of 'dependency chains' occur with all instructions, but are particularly bad when following a divide. I was quite stunned when I went through some of my code (following infuzion's advice), and 'interleaved' some of the lengthy equations - letting the CPU execute some instructions out-of-order to allow some breathing space for the greedy ones. I've seen savings of nearly 20% on some algorithms!
(But if you do this, document it well - it does make assembly a lot harder to read when your algorithm is 'leap-frogging' over itself a lot!)

Re: Tips for Code & ASM

Postby mwvdlee on Tue Sep 27, 2011 8:19 pm

Come to think, shouldn't these tips be in the wiki somewhere? Just a generic ASM optimizing topic ranging from simple to more specialized tips? Since ASM performance is relatively important to SM, it should be well documented.

And Malc should just implement more SSE instructions so we don't need to use workarounds for something already natively supported by the CPU. I find it incredible (and inexcusable) that something as obvious as XORPS is missing, let alone some of the more specialized stuff. I guess he's just afraid of ASM support becoming powerful enough for us not to have to rely on Outsim for new stuff. Which would be fine if they actually made new stuff.

Re: Tips for Code & ASM

Postby infuzion on Wed Sep 28, 2011 1:08 am

Lots of great tips!
trogluddite wrote:Try and avoid this kind of dependency...
Code: Select all
divps xmm0,number;
addps xmm0,AnotherNumber;
mulps xmm0,Avariable;
In this case, the add and multiply have to wait until the division is finished before they can begin their calculations. But with a modern CPU, there are multiple execution units that can get on with executing some other instructions while the divps does its stuff - as long as those instructions are not waiting for the division result. So if you really need a divide, try and set it running 'early', and put some instructions from an unrelated part of the code immediately after it - so that something else useful is being done while the divide calculates.
(But if you do this, document it well - it does make assembly a lot harder to read when your algorithm is 'leap-frogging' over itself a lot!)
Good tip, but if you move opcodes around, make sure that the results are stored in an XMM register that is not being used around the new location.

mwvdlee wrote:Division
Code: Select all
divps xmm0,F4410;
If you don't need the precision of divps, but could settle for 12 bit precision, you could multiply the reciprocal instead.
Code: Select all
rcpps xmm0,xmm0;
mulps xmm0,total;

One instruction more, but both rcpps and mulps are cheap.
Not that much cheaper;
Core2Duo 65nm latency: divps 6-18, rcpps + mulps 7
AMD K10: divps 18, rcpps + mulps 7
Newest CPUs are likely to have faster divps. You do save CPU with rcpps, but the gap is sometimes smaller than people assume. If you feel that you need the extra bits of precision (e.g. filters, mastering), then use the full divps.

mwvdlee wrote:Come to think, shouldn't these tips be in the wiki somewhere? Just a generic ASM optimizing topic ranging from simple to more specialized tips? Since ASM performance is relatively important to SM, it should be well documented.
Anyone with a forum logon can edit the WiKi, so you nominate yourself?

mwvdlee wrote:And Malc should just implement more SSE instructions so we don't need to use workarounds for something already natively supported by the CPU. I find it incredible (and inexcusable) that something as obvious as XORPS is missing, let alone some of the more specialized stuff.
I want to email Malc with a clear & prioritized plan. You can help by posting on the new 64bit audio request thread here.

Re: Tips for Code & ASM

Postby mwvdlee on Wed Sep 28, 2011 6:51 pm

infuzion wrote:Not that much cheaper;
Core2Duo 65nm latency: divps 6-18, rcpps + mulps 7
AMD K10: divps 18, rcpps + mulps 7
Newest CPUs are likely to have faster divps. You do save CPU with rcpps, but the gap is sometimes smaller than people assume. If you feel that you need the extra bits of precision (e.g. filters, mastering), then use the full divps.

This (http://www.intel.com/content/dam/doc/ma ... manual.pdf) recent (July 2011) document by Intel claims the rcpps method is about 6x faster. It claims DIVPS takes 14 cycles and, more importantly, is non-pipelined. (On that topic: I know MULPS and ADDPS can be pipelined and DIVPS cannot, but what about the other ops? SUBPS? ANDPS?)
I should note that precision is much higher if you can reasonably precalculate the reciprocal yourself: mulps xmm0,FHALF is about as accurate as divps xmm0,F2, but significantly faster.
The Intel document also gives a "Newton-Raphson" approximation, which is precise to 22 bits:
Code: Select all
rcpps xmm3, xmm1
movaps xmm2, xmm3
addps xmm3, xmm2
mulps xmm2, xmm2
mulps xmm2, xmm1
subps xmm3, xmm2
mulps xmm0, xmm3

Apparently "only" 2.7x faster than plain DIVPS, but almost as precise (and likely more than you need for 16-bit output).
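
The refinement step in that listing works out to x1 = 2*x0 - d*x0*x0, where x0 is the rough reciprocal of the divisor d. Below is a Python sketch of the same steps, under stated assumptions: `rough_reciprocal` is a made-up stand-in for rcpps that simply clears the low 11 mantissa bits (real hardware uses a lookup table), and `f32` rounds every intermediate to 32-bit float.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rough_reciprocal(d: float) -> float:
    """Stand-in for rcpps (assumption): 1/d with the low 11 mantissa bits cleared,
    which leaves a relative error below 2**-12."""
    bits = struct.unpack("<I", struct.pack("<f", 1.0 / d))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFF800))[0]


def refined_reciprocal(d: float) -> float:
    """One Newton-Raphson step, in the same order as the listing."""
    x3 = rough_reciprocal(d)  # rcpps  xmm3, xmm1
    x2 = x3                   # movaps xmm2, xmm3
    x3 = f32(x3 + x2)         # addps  xmm3, xmm2   -> 2*x0
    x2 = f32(x2 * x2)         # mulps  xmm2, xmm2   -> x0*x0
    x2 = f32(x2 * f32(d))     # mulps  xmm2, xmm1   -> d*x0*x0
    return f32(x3 - x2)       # subps  xmm3, xmm2   -> 2*x0 - d*x0*x0


def divide_nr(a: float, d: float) -> float:
    """a / d via the refined reciprocal (the final mulps xmm0, xmm3)."""
    return f32(f32(a) * refined_reciprocal(d))


def relative_error(approx_reciprocal: float, d: float) -> float:
    """How far approx_reciprocal * d is from exactly 1."""
    return abs(approx_reciprocal * d - 1.0)


for d in (3.0, 1234.0, 4410.0):
    print(d, relative_error(rough_reciprocal(d), d), relative_error(refined_reciprocal(d), d))
```

Each step roughly squares the relative error of the estimate, which is how one pass takes a 12-bit reciprocal to around 22 bits.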

Sadly, most other tricks in that Intel document don't work, simply because SM lacks some of the most basic instructions such as JMP, SQRTPS, XORPS, CMP*PS.

I also tried to interleave SSE ops with scalar (FP, X87, "old") ops and got a slight improvement of about 0.1%. Not much, but if you use lots of scalar code, it may be worthwhile, and it's very easy to do.

Re: Tips for Code & ASM

Postby trogluddite on Wed Sep 28, 2011 8:42 pm

Something else which has saved a considerable number of cycles for me recently is optimising arrays and buffers.

Deleting the scalar op's that relate to SSE channels that you aren't using is pretty commonplace - but for arrays and buffers, it's worth thinking about the redundant memory too (since arrays always reserve space for all 4 channels even if you don't use them).

For example if you are using mono4 to save CPU when processing stereo signals, consider 'streamlining' the way that arrays are accessed so that you don't have 2 channels worth of redundant empty array space (e.g. shl eax,3; instead of the usual shl eax,4;, and halve the array size )
As well as the 50% saving on memory, and fewer scalar ops to fill the 'silent' SSE channels, this can seriously cut down on cache misses. Effectively, twice as much useful data is now present on one cache line, so it will be longer before this data is all read out, resulting in fewer calls to the main memory (loading a new cache line can cost dozens of clock cycles).
And, of course, for modules intended for mono streams, you cut everything down to 1/4 of the original size.
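
The effect of the shift amount on memory layout can be worked through with a little arithmetic. This is a Python sketch, not SM code; the 64-byte cache line is an assumption (typical for x86 CPUs, but check your own target), and the function names are invented for the illustration.

```python
CACHE_LINE_BYTES = 64  # assumption: typical x86 cache line size


def byte_offset(index: int, shift: int) -> int:
    """Mirror 'shl eax,shift': turn an array index into a byte offset."""
    return index << shift


def indices_per_cache_line(shift: int) -> int:
    """How many consecutive array indices fit on one cache line."""
    return CACHE_LINE_BYTES >> shift


# shift 4 = 16 bytes per index (all 4 SSE channels stored)
# shift 3 =  8 bytes per index (2 channels, e.g. stereo packed into mono4)
# shift 2 =  4 bytes per index (1 channel, mono)
for shift, label in ((4, "4 channels"), (3, "2 channels"), (2, "1 channel")):
    print(label, "- offset of index 100:", byte_offset(100, shift),
          "- indices per cache line:", indices_per_cache_line(shift))
```

With the default layout only 4 indices share a cache line; packing 2 channels doubles that to 8, and a mono layout fits 16, so new cache lines need loading far less often.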

Similarly if you have a block of ASM or code that has multiple arrays, it can be worth trying to put the data into one, larger, array. i.e. keep data that will be accessed close together (in time) as near neighbours (in memory). Combining multiple code primitives into one larger block can also help, for the same reason of 'closeness' of the data in memory for any given sample.

When using the CPU analyser on delays and buffers, you can often see the effect quite clearly - the CPU graph will mostly show a pretty steady reading - but with regular(ish) spikes, lasting only one audio sample, on iterations where the code stalls as it waits for a new cache line to be loaded.

Re: Tips for Code & ASM

Postby Andrew J on Wed Sep 28, 2011 10:51 pm

mwvdlee wrote:Sadly, most other tricks in that Intel document don't work, simply because SM lacks some of the most basic instructions such as JMP, SQRTPS, XORPS, CMP*PS.


SM does have sqrtps!
Andrew J
smanatic
 
Posts: 616
Joined: Tue May 29, 2007 4:53 am
Location: Australia
