2022-03-18 09:04:35 I like the word 'tib', it sounds funny
2022-03-18 09:05:11 backwards for 'bit'
2022-03-18 09:39:12 My dog thinks 'tib' means 'treat'
2022-03-18 10:58:56 :-)
2022-03-18 11:51:52 I suppose it's just "text input buffer." That's what I've always presumed, at least.
2022-03-18 11:52:26 So I did an exhaustive test of that mod 255 word. It delivers the right answer for all 32-bit numbers; the first failure is at 0x1000000FF.
2022-03-18 11:54:33 I don't completely understand why exactly it fails there - that's not a large enough number to overflow the 64-bit cell into negative numbers at any point.
2022-03-18 12:22:30 I did some measurements - looks like the mod 255 word I wrote takes 1.4 ns, whereas a simple-minded version based on idiv takes 6 ns.
2022-03-18 13:40:47 Looks like using imul instead of shifts and adds is faster - takes about 1 ns instead of 1.5.
2022-03-18 13:44:41 oooh interesting
2022-03-18 13:44:46 how did you time them?
2022-03-18 13:46:17 Yes; first timed the "loop framework" with no mod word in it, then with my mod word, then with the idiv mod word. Then just now with the imul mod word - it wins.
2022-03-18 13:46:37 The best version is about 6x faster than an idiv-based implementation.
2022-03-18 13:46:56 This is the code, if you're interested:
2022-03-18 13:46:58 code "255%", p255
2022-03-18 13:47:01 xor dx, dx
2022-03-18 13:47:03 mov rax, 0x1010101
2022-03-18 13:47:05 imul rrTOS
2022-03-18 13:47:06 i wonder how much of it is the speed up from mixing adds and multiplies
2022-03-18 13:47:07 shr rax, 32
2022-03-18 13:47:09 mov rdx, rax
2022-03-18 13:47:11 shl rdx, 8
2022-03-18 13:47:13 sub rdx, rax
2022-03-18 13:47:15 sub rrTOS, rdx
2022-03-18 13:47:17 cmp rrTOS, 0xFF
2022-03-18 13:47:19 je p255a
2022-03-18 13:47:21 next
2022-03-18 13:47:23 p255a: xor rrTOS, rrTOS
2022-03-18 13:47:25 next
2022-03-18 13:47:27 (rrTOS is an alias for rcx).
2022-03-18 13:47:51 Well, it looks like the "clever approach" gains a factor of four to six.
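The multiply-based trick above can be sketched in Python (illustrative only; the function name is mine, not from the code above). The key identity is 0x1010101 * 255 = 0xFFFFFFFF, so multiplying by 0x1010101 and shifting right 32 bits approximates division by 255:

```python
def mod255(x):
    # Approximate quotient: 0x1010101 / 2**32 is just under 1/255.
    q = (x * 0x01010101) >> 32
    # r = x - q*255, with q*255 computed as (q << 8) - q,
    # mirroring the shl/sub sequence in the assembly above.
    r = x - ((q << 8) - q)
    # Exact multiples of 255 come out as 255 rather than 0,
    # hence the final compare in the assembly.
    return 0 if r == 255 else r

# Correct for every 32-bit input, but the scaled reciprocal undershoots
# 1/255 by 1/(255 * 2**32), so for larger inputs the quotient can be off
# by more than the single step the final compare fixes. The first wrong
# answer is at 0x1000000FF (= 2**32 + 255), where it returns 256 instead of 1.
assert mod255(0xFFFFFFFF) == 0xFFFFFFFF % 255
assert mod255(0x100000000) == 1          # still correct
assert mod255(0x1000000FF) == 256        # should be 1 - the first failure
```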
Four for shifts/adds, six for imul.
2022-03-18 13:48:24 Too bad I can't refer to the high-order 32 bits of rax as a 32-bit register; that would save that 32-bit right shift.
2022-03-18 13:49:02 But as far as I know AH is the only register you can refer to that starts somewhere other than the bottom.
2022-03-18 13:50:19 ya i think youre right about AH
2022-03-18 13:51:24 the interesting and weird thing about x86 is how other stuff in the instruction stream affects performance. doesnt look like it applies here, but if you have some muls followed by adds, you can get a good speed boost by interspersing them
2022-03-18 13:51:42 assuming the data dependencies allow it
2022-03-18 13:53:27 or on the other hand if you have a mul followed by an add and it's faster, you cant generalize about the mul in cases where it doesnt have an add near it, since it could be slower
2022-03-18 13:55:26 I ought to time a Forth implementation too, : 255% 255 % ; just for grins.
2022-03-18 13:59:44 Oh wow.
2022-03-18 13:59:46 Nice.
2022-03-18 14:01:08 So, the timed loop (2^32 iterations) using an idiv-based 255-specific approach took 42 seconds. A loop using that Forth definition I just showed has the extra overhead (above and beyond an idiv-based modulus word %) of calling the extra definition, loading 255 onto the stack, and returning to the caller.
2022-03-18 14:01:13 But it still took just 44 seconds.
2022-03-18 14:01:47 So that call/return overhead plus loading the 255 is practically nil - half a nanosecond.
2022-03-18 14:09:46 I tuned it just a bit - I was wasting an add/subtract instruction.
2022-03-18 14:14:30 Oh, that xor rdx, rdx isn't needed either - that's needed for divides, not multiplies.
2022-03-18 14:19:58 Ok, with those tweaks it's coming in at about 1.2 ns.
2022-03-18 14:20:34 Give or take a few percent - when the test only lasts 15-20 seconds, my ability to hit the stopwatch button matters a little.
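The "half a nanosecond" figure follows directly from the numbers quoted: a 2-second difference spread over 2^32 iterations. A quick check of the arithmetic (Python, purely illustrative):

```python
# Numbers from the measurements above: 2**32 iterations,
# 42 s for the idiv-based loop, 44 s going through : 255% 255 % ;
iterations = 2**32
plain_s = 42
forth_def_s = 44
overhead_ns = (forth_def_s - plain_s) / iterations * 1e9
print(f"{overhead_ns:.2f} ns per call")  # prints "0.47 ns per call"
```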
2022-03-18 16:05:19 KipIngram: That's not the real cost if it's in a loop, because it will be able to predict/pipeline etc. more
2022-03-18 16:09:33 I don't immediately follow - explain? I mean, I generally know all kinds of shenanigans go on where "cases" get recognized and optimized, but I'm not clicking to a specific thing here.
2022-03-18 16:10:18 I'm timing a loop that either contains or does not contain a single Forth word, and calling the time difference divided by the loop count the time consumed by that word.
2022-03-18 16:10:30 I know it'll get cached, of course.
2022-03-18 16:10:43 So I'm really measuring its "already in the cache" performance.
2022-03-18 16:11:06 I'm sure calling it once out of the blue, with it not in the cache, would take a lot longer.
2022-03-18 16:11:50 Because it's doing one thing over and over it will more perfectly predict the branching, and for all I know it will optimise the internal code more, which is not characteristic of normal performance
2022-03-18 16:12:17 Oh, I deliberately wrote it so that 254 out of 255 times it won't take the branch.
2022-03-18 16:12:28 There are a lot of reasons, but the true way to optimise is always to profile something *in practice*
2022-03-18 16:12:46 A loop like this isn't how it will be used in practice, so it can't be used to "fine tune" it, just to get a rough idea
2022-03-18 16:12:55 Yeah - it's almost impossible to know all the tricks that might or might not get played.
2022-03-18 16:13:35 And chances are it doesn't matter which version is fastest, and the version that seems fastest in a loop might be slower in practice but you'd never even be able to measure it
2022-03-18 16:13:37 Anyway, it's a hell of a lot faster than idiv - I don't think the measurement could be so bogus as to mis-report that.
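The measurement scheme being debated here - time the bare loop, time the loop with the word in it, attribute the difference to the word - looks like this in outline (a Python stand-in; the actual tests above timed native code with a stopwatch):

```python
import time

N = 1_000_000

def time_loop(body):
    """Run body N times and return elapsed wall-clock seconds."""
    start = time.perf_counter()
    for x in range(N):
        body(x)
    return time.perf_counter() - start

empty = time_loop(lambda x: None)          # the "loop framework" alone
with_word = time_loop(lambda x: x % 255)   # framework plus the operation under test
per_op_ns = (with_word - empty) / N * 1e9
print(f"~{per_op_ns:.1f} ns per operation")
```

As the discussion above notes, this measures hot-cache, perfectly-predicted behavior - a best case, not what the operation costs scattered through real code.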
2022-03-18 16:13:50 All of optimisation is a game of narrowing down where your time is worth being spent optimising
2022-03-18 16:14:04 Yeah, I don't disagree with that
2022-03-18 16:14:07 And yes - the percentage of time this thing will be used will be small in practice, so it really doesn't matter that much.
2022-03-18 16:14:16 I guess I felt like spending an afternoon being anal.
2022-03-18 16:14:19 :-)
2022-03-18 16:14:23 I understand
2022-03-18 16:14:36 I do the same, but I am just honest that it's a waste of time :P
2022-03-18 16:14:39 Yes - I am planning to equip this thing with a good profiling ability sooner or later.
2022-03-18 16:14:50 Because you're absolutely right about that bit.
2022-03-18 16:15:06 A lot of time in software development gets spent on unnecessary optimization.
2022-03-18 16:15:34 Probably for much the same reasons as me doing this - it gives someone an opportunity to feel clever.
2022-03-18 16:15:54 It's a learning experience as well
2022-03-18 18:10:11 Hey, BLOCK seems to be working in all respects now. Proper caching, proper respect for each buffer's dirty bit, etc.
2022-03-18 18:10:33 BLK.DAT is growing the right way when I write a higher block than I've previously written.
2022-03-18 18:11:00 Seems sound. I haven't actually validated the written data yet, though. Just checked "around the boundary," sort of.
2022-03-18 20:48:17 Ok, just loaded source from a disk block for the first time with this current implementation. A little milestone. :-)
2022-03-18 20:48:30 Now I can go get my editor source from the old system and port it over and make it work.