2023-09-29 00:00:52 I don't actually think they do - GendaMo just mentioned returning to "the one before that" and those are words I might use to describe my double return. I was also scratching my head over how that would fit with coroutines.
2023-09-29 00:01:13 but I couldn't think of what else "the one before that" would mean.
2023-09-29 00:01:20 Other than the caller's caller.
2023-09-29 00:01:57 I think coroutines are sometimes implemented by exchanging the program counter with the top item of the return stack.
2023-09-29 00:02:40 That's how Chuck's F18A implements EXECUTE as well.
2023-09-29 00:02:55 And his top return item is cached in a register, so it's really just a reg/reg swap.
2023-09-29 00:03:23 Which is a pretty cool way to do it when you think about it.
2023-09-29 00:27:48 KipIngram: Long lines? He keeps lines within 80 columns and averages less than 40 characters wide.
2023-09-29 00:28:52 Just did a quick analysis on the tokenizer (TK.aplf) and parser (PS.aplf).
2023-09-29 00:35:49 oh, well, I saw stuff that looked like wrapped lines. I mean, it spanned over eight or nine lines, and they all looked basically full. Then every now and again you'd see a short line - I assumed that was where a long line was "ending."
2023-09-29 00:36:07 I was just seeing that on some of his slides.
2023-09-29 00:37:51 Oh, I see. Definitely likely that slides got blown up.
2023-09-29 00:38:42 Hsu calls the Co-dfns architecture/style a "Linear Data Flow" model.
2023-09-29 00:39:10 So, he clearly referred to a 74-line compiler. You're saying that's 74 lines no longer than 80 characters? That was part of what "primed" me to think the lines might be long too - the prospect of getting a compiler into that few lines.
2023-09-29 00:39:23 Input literally just goes in at the top and flows straight down, line-by-line, to the end.
2023-09-29 00:39:43 Yeah, he stressed the fact that there were no loops, no branches, etc.
2023-09-29 00:39:54 Yup. A 500-ish character compiler :)
2023-09-29 00:43:14 Loops and branches are kind of anathema to good GPU code, since they create data dependencies that can't be vectorized.
2023-09-29 00:46:27 To be clear, that 500-char count is just the compilation pass, i.e. generating the AST. It doesn't include tokenization or code generation.
2023-09-29 00:47:38 The code generation is a bunch of C that gets passed off to your friendly neighborhood C compiler and depends on ArrayFire for the low-level GPU stuff.
2023-09-29 00:53:30 The SLOC ratio between Algol-like languages and Hsu-style APL tends to be on the order of 10X for small things and 100X for full systems.
2023-09-29 00:54:41 His modern, full-fledged PS.aplf is more like 250 SLOC (~10 kchars), counting error handling and all that.
2023-09-29 00:55:52 100X that, i.e. 25,000 lines, for a production-level compiler, is on the smaller side.
2023-09-29 00:57:27 One crack-brained idea I have for marrying Forth's implementation simplicity with APL is to write a Forth backend for Co-dfns.
2023-09-29 02:15:46 Oh, neat. I don't know exactly what my mashup is going to look like yet, but I want some of all of these things.
2023-09-29 02:16:22 Although I haven't given any serious thought to GPU operation. I recognize it as the best computing resource in most computers, though, for real number crunching.
2023-09-29 02:17:01 I guess provided you have the right kind of problem.
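The PC/return-stack swap mentioned near the top of this excerpt (for coroutines, and for the F18A's EXECUTE) can be sketched in a few lines of C. This is a toy token-threaded inner interpreter invented for illustration, not the F18A's actual implementation; the word CO simply swaps the instruction pointer with the top return-stack item, and calling CO again from the other side swaps back.

    /* Minimal sketch, assuming a toy token-threaded VM: EXECUTE-style
     * transfer and coroutine hand-off are both just a swap of the
     * "program counter" with the top item of the return stack. */
    #include <stdio.h>

    typedef void (*word_t)(void);

    static word_t *ip;          /* instruction pointer: next token to run    */
    static word_t *rstack[32];  /* return stack holds saved instruction ptrs */
    static int     rsp = -1;    /* top-of-return-stack index                 */

    /* CO: swap IP with the top return-stack item.  Two code sequences then
     * ping-pong between each other -- a bare-bones coroutine transfer.     */
    static void co(void) {
        word_t *tmp = rstack[rsp];
        rstack[rsp] = ip;
        ip = tmp;
    }

    static void hello(void) { puts("hello from routine A"); }
    static void world(void) { puts("hello from routine B"); }
    static void bye(void)   { ip = NULL; }   /* stop the inner interpreter  */

    static word_t routine_a[] = { hello, co, hello, co, bye };
    static word_t routine_b[] = { world, co, world, co, bye };

    int main(void) {
        ip = routine_a;
        rstack[++rsp] = routine_b;   /* the other coroutine's resume point  */
        while (ip) {                 /* inner interpreter                   */
            word_t w = *ip++;
            w();
        }
        return 0;
    }

Running it alternates output between the two routines, which is all a coroutine transfer really is at this level; on hardware with the return-stack top cached in a register, the same operation collapses to a reg/reg swap.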
2023-09-29 02:35:08 I have thought about trying to learn more about this:
2023-09-29 02:35:10 https://www.khronos.org/api/spir
2023-09-29 02:35:25 and see if I could make a Forth GPU interface based on it.
2023-09-29 02:35:50 Cut the middle layers out. It seems like the closest thing to "GPU machine code" I can find.
2023-09-29 02:53:29 As a Forth person, what I really WANT to do is directly produce the binary "stuff" that guides the GPU through its actions. I have no idea if that's even possible though, realistically.
2023-09-29 03:21:18 KipIngram: not being able to do this is one of the reasons why I don't like GPU programming
2023-09-29 03:21:27 there is sort of a portable machine language for GPUs
2023-09-29 03:21:32 but it's once again an abstraction
2023-09-29 03:31:37 Yeah. I probably should still look into it sometime, but... it just doesn't carry the appeal of CPU-oriented work.
2023-09-29 03:31:53 I mean, you're chatting with the guy who wishes Intel would open the microcode up.
2023-09-29 03:32:01 I wanna program THE HARDWARE.
2023-09-29 03:32:45 But I suppose the microcode probably changes details with every new CPU, so I guess I get why they don't do that.
2023-09-29 03:32:57 It would just create another layer of backward compatibility fences.
2023-09-29 03:33:32 And maybe would expose some of their IP as well.
2023-09-29 03:33:54 I also wish FPGA vendors would open their bitstream format.
2023-09-29 03:34:15 you can't really change the microcode
2023-09-29 03:34:16 And make it possible to change "pieces" of the FPGA's configuration, on the fly.
2023-09-29 03:34:29 you can only patch it a little
2023-09-29 03:34:32 Malleable FPGA hardware, anyone? :-)
2023-09-29 03:34:46 also, most parts of modern Intel CPUs are not actually microcoded
2023-09-29 03:34:52 I didn't know - I figured it was just "reloadable."
2023-09-29 03:34:57 no
2023-09-29 03:35:05 it's extremely sophisticated
2023-09-29 03:35:12 I'm sure.
2023-09-29 03:35:24 the microcode patch RAM is structured a bit like a cache
2023-09-29 03:35:36 i.e. it has tags matching addresses in the microcode ROM, and data
2023-09-29 03:35:56 on microcode fetch, the cache is consulted. A hit is loaded from RAM, a miss from ROM.
2023-09-29 03:36:06 So by loading cache lines into the patch RAM, you can patch the microcode.
2023-09-29 03:36:15 Ah, that makes sense.
2023-09-29 03:36:24 There's also a mechanism to pattern-match non-microcoded instructions and make them microcoded.
2023-09-29 03:36:36 so you can fix instructions that are not microcoded.
2023-09-29 03:36:45 This comes at a big performance penalty though.
2023-09-29 03:36:49 Interesting. So it's designed to let them "tweak" things.
2023-09-29 03:36:53 Not totally re-write things.
2023-09-29 03:36:53 yeah
2023-09-29 03:36:56 correct
2023-09-29 03:37:02 That's sensible, actually - given their needs.
2023-09-29 03:37:05 a full microcode RAM would be way too big and power hungry.
2023-09-29 03:37:13 another thing Intel does is "chicken bits"
2023-09-29 03:37:22 every new HW feature has a configuration bit somewhere to turn it off
2023-09-29 03:37:40 Yeah. We have a big RAM in our SSDs at work for logical-to-physical page translation and they're always bellyaching over not wanting it to get any bigger.
2023-09-29 03:37:46 so when a new feature doesn't work, the BIOS can flip the chicken bit and make the CPU work
2023-09-29 03:38:11 But when you have to map every 16kB page of a 100 TB logical drive, that's a pretty big RAM.
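The patch-RAM-as-cache description above translates almost directly into code. A minimal sketch, with made-up sizes and field names (the real structures are undocumented and far more involved): every microcode fetch consults a small tag-matched store first, and a hit overrides the ROM entry.

    /* Illustration of the patch-RAM idea: a tag-matched store consulted on
     * every microcode fetch.  Sizes and names are invented for the sketch. */
    #include <stdint.h>
    #include <stdio.h>

    #define UROM_SIZE   256   /* pretend microcode ROM  */
    #define PATCH_SLOTS 8     /* pretend patch RAM      */

    typedef struct {
        int      valid;
        uint16_t tag;         /* microcode ROM address being overridden */
        uint32_t uop;         /* replacement micro-op                   */
    } patch_entry;

    static uint32_t    urom[UROM_SIZE];   /* "burned in" micro-ops */
    static patch_entry patch[PATCH_SLOTS];

    /* On fetch, the patch RAM is consulted like a cache: a tag hit supplies
     * the micro-op from RAM, a miss falls back to the ROM. */
    static uint32_t ufetch(uint16_t uaddr) {
        for (int i = 0; i < PATCH_SLOTS; i++)
            if (patch[i].valid && patch[i].tag == uaddr)
                return patch[i].uop;       /* hit: patched  */
        return urom[uaddr];                /* miss: original */
    }

    /* Loading a microcode update amounts to filling patch slots. */
    static void load_patch(int slot, uint16_t uaddr, uint32_t uop) {
        patch[slot] = (patch_entry){ .valid = 1, .tag = uaddr, .uop = uop };
    }

    int main(void) {
        urom[0x42] = 0xDEAD;                      /* buggy micro-op        */
        printf("before patch: %04x\n", (unsigned)ufetch(0x42));
        load_patch(0, 0x42, 0xBEEF);              /* fix it in the field   */
        printf("after patch:  %04x\n", (unsigned)ufetch(0x42));
        return 0;
    }

The size argument in the chat falls out of this shape: a handful of tag-matched slots is cheap, whereas a RAM big enough to replace the whole microcode ROM would cost far too much area and power.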
2023-09-29 03:38:16 oh yes
2023-09-29 03:38:43 Actually, it doesn't fit in RAM any more as of this latest generation - it's paged in and out of flash.
2023-09-29 03:39:23 And our older stuff used 4kB logical pages - the switch to 16kB was to reduce that RAM requirement.
2023-09-29 03:39:56 NAND flash turns out to be super interesting stuff.
2023-09-29 03:40:08 Enough complexity in the things to give them some really involved characteristics.
2023-09-29 03:40:48 makes sense
2023-09-29 03:40:56 so cache locality may now start to matter for flash.
2023-09-29 03:41:22 Does the SSD expose 16 kB sectors to the host?
2023-09-29 03:41:39 Yes - that's what the host sees. An array of 16kB logical pages.
2023-09-29 03:42:05 We have on-the-fly hardware compression, though - we can store several logical pages in one physical page.
2023-09-29 03:42:09 sometimes.
2023-09-29 03:42:27 And there's also a layer of error detection and correction in there.
2023-09-29 03:42:43 So what actually gets written to the physical flash is "code pages" that come out of the ECC encoder.
2023-09-29 03:42:56 They're around 8kB.
2023-09-29 03:43:33 So that logical table has to identify how the page you've asked for is mapped to code pages.
2023-09-29 03:43:56 It's nice when it's in just one code page.
2023-09-29 03:44:19 Which it often is if you have compressible data.
2023-09-29 03:44:31 But it can be spread across as many as three, worst case.
2023-09-29 03:44:51 So those all get fetched in parallel, and the FPGA stitches the pieces together and shoves it through the ECC decoder.
2023-09-29 03:45:09 goes from there to the decompressor and then to the host via NVMe.
2023-09-29 03:45:36 Well, sorry - that's out of order.
2023-09-29 03:46:03 Each code page goes through the ECC, and then what comes out of there is stitched into a logical page and decompressed.
2023-09-29 03:46:49 And all of that is done in hardware - there's no firmware involved unless some sort of a failure occurs.
2023-09-29 03:47:15 There's some firmware in the NVMe part.
2023-09-29 03:48:09 Anyway, thanks for the info on the CPU stuff - that's interesting.
2023-09-29 03:48:32 Now I don't have to feel like I'm missing out on something. :-)
2023-09-29 04:44:26 Isn't "portable machine language for GPUs" basically the same thing as an ISA?
2023-09-29 04:46:41 I don't think there's a strong technical difference between having an ISA and having hardware that abstracts away the Maxwell/QED equations for us.
2023-09-29 04:46:54 hmm... what would be in such an ISA? some way to write vertex and fragment shaders, no?
2023-09-29 04:47:54 xelxebar: you have not seen a simple computer made up of air pressure control units, then
2023-09-29 04:48:32 no QED or Maxwell equations there
2023-09-29 04:51:05 Replaced by Laplace's equation. To get the behavior of the water/air/mechanical analogues correct, you have to mess with a tangle of differential equations either way.
2023-09-29 04:52:21 I mean, you don't *have* to. You can fiddle and tinker with things experimentally, but the point stands that the hardware is providing a kind of abstracted interface.
2023-09-29 16:11:48 I think calling it an ISA isn't unreasonable.
2023-09-29 16:12:42 I wonder, though, if we would find a lot less standardization among GPUs than CPUs? Which would necessitate a "driver layer" between the stuff you specify and the bits that actually control the hardware?
2023-09-29 16:12:56 That's a guess - maybe it's more standard than I think.
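A rough sketch of the logical-to-physical bookkeeping described above, with an invented entry layout (the real firmware's structures are certainly packed differently): each 16kB logical page maps to between one and three physical code pages, and at roughly 6 billion entries for a 100 TB drive, even a few bytes per entry lands in the tens of gigabytes, which is why the map no longer fits in RAM and gets paged in and out of flash.

    /* Illustration only: an assumed L2P entry shape and the table-size
     * arithmetic behind "that's a pretty big RAM". */
    #include <stdint.h>
    #include <stdio.h>

    #define LOGICAL_CAPACITY  (100ULL * 1000 * 1000 * 1000 * 1000) /* 100 TB */
    #define LOGICAL_PAGE_SIZE (16 * 1024)                          /* 16 kB  */
    #define MAX_CODE_PAGES    3   /* worst case: logical page spread over 3  */

    /* One L2P entry: where the (compressed, ECC-encoded) pieces of a
     * logical page live in physical flash.  Layout is illustrative; real
     * entries are packed much tighter. */
    typedef struct {
        uint8_t  n_code_pages;                   /* 1..3               */
        uint64_t code_page_addr[MAX_CODE_PAGES]; /* physical locations */
    } l2p_entry;

    int main(void) {
        uint64_t entries = LOGICAL_CAPACITY / LOGICAL_PAGE_SIZE; /* ~6e9 */
        printf("L2P entries:        %llu\n", (unsigned long long)entries);
        /* Even at an assumed 4 bytes per packed entry, the map is ~24 GB. */
        printf("map @ 4 B/entry:    %.1f GB\n", entries * 4 / 1e9);
        return 0;
    }

On a read, the drive looks up the entry, fetches the one-to-three code pages in parallel, runs each through the ECC decoder, stitches the results back into a logical page, decompresses it, and hands it to the host over NVMe, as described above.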
2023-09-29 16:13:40 https://en.wikipedia.org/wiki/Standard_Portable_Intermediate_Representation
2023-09-29 16:14:42 Yeah, I was looking at SPIR-V last night.
2023-09-29 16:14:57 Guessing that it might be the closest we can get.
2023-09-29 16:15:44 I haven't found a reference yet, though, that describes the relationship between SPIR-V and the GPU hardware, the way a CPU instruction set description would be presented. It seems to more describe the interface between SPIR-V and various other software layers.
2023-09-29 16:16:10 I'm afraid the GPU itself still looks an awful lot like a black box to me.
2023-09-29 16:17:04 Know any links that describe typical GPUs in a "parallel computer architecture" way?
2023-09-29 16:17:43 I.e., how all the logic is actually laid out and interconnected?
2023-09-29 16:18:14 Ultimately it's a big digital circuit - surely there are detailed descriptions of that circuitry around somewhere.
2023-09-29 16:18:44 It's a multicore CPU with different tradeoffs
2023-09-29 16:19:03 Simpler cores (but more of them) with big vector units
2023-09-29 16:19:05 Yes. Simpler processors but a lot more of them.
2023-09-29 16:19:12 :-)
2023-09-29 16:19:31 Is there an echo in here? :-)
2023-09-29 16:19:39 :P
2023-09-29 16:19:50 But... details.
2023-09-29 16:20:09 Like I said, maybe it varies vendor to vendor - I have no idea.
2023-09-29 16:20:39 Just because we can buy CPUs that behave "just alike" from Intel and AMD doesn't mean that's the case with GPUs too.
2023-09-29 16:22:29 I see this:
2023-09-29 16:22:31 https://core.vmware.com/resource/exploring-gpu-architecture#section1
2023-09-29 16:22:51 Pretty high level, though.
2023-09-29 16:23:35 https://download.nvidia.com/developer/cuda/seminar/TDCI_Arch.pdf
2023-09-29 16:23:56 Much better.
2023-09-29 16:24:07 That will keep me busy for a little while.
2023-09-29 16:25:35 It still doesn't get down to the level of "poke this information in via these registers to get this result" though.
2023-09-29 16:25:48 Which is what I would expect from an instruction set architecture.
2023-09-29 16:26:31 I'll put it this way. For an x64 CPU, I can learn precisely what data I need to get into RAM, somehow, to get a particular behavior from the CPU.
2023-09-29 16:26:52 Granted, most of the time we need various tools to actually do that, but in principle at least it's specified down to that level.
2023-09-29 16:27:14 I'd like the same for the GPU. Forget the tools - what data needs to be put where to elicit what behaviors?
2023-09-29 16:28:25 If I flip this particular bit, how will it change what happens? Etc. A "pure hardware" level description.
2023-09-29 16:28:38 Maybe look at AMD's stuff
2023-09-29 16:29:00 I believe they provide more information about their GPUs
2023-09-29 16:29:03 Yeah, I guess now I've got myself a little motivated - I'll dig around some this weekend.
2023-09-29 16:29:17 Wife's out of town visiting her sister - it'll give me something to do.
2023-09-29 16:33:57 https://www.amd.com/en/search/documentation/hub.html#sortCriteria=%40amd_release_date%20descending&f-amd_product_type=Graphics
2023-09-29 16:38:40 Ken Thompson wrote Unix while his wife was away
2023-09-29 16:39:18 Yeah, but to be fair that was something like three or four weeks :P
2023-09-29 16:46:58 Ah, that link is good.
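For what "simpler cores with big vector units" means in practice, here is a C-level cartoon of the SIMT execution model (an illustration only, not any vendor's actual hardware): lanes step in lockstep, and a data-dependent branch turns into executing both sides under a per-lane mask, which ties back to the earlier remark that loops and branches are anathema to good GPU code.

    /* SIMT cartoon: one "core" stepping a wide batch of lanes in lockstep. */
    #include <stdio.h>

    #define LANES 8   /* pretend warp/wavefront width */

    int main(void) {
        float x[LANES] = {1, -2, 3, -4, 5, -6, 7, -8};
        float y[LANES];

        /* Straight-line code vectorizes trivially: one instruction, all lanes. */
        for (int lane = 0; lane < LANES; lane++)
            y[lane] = 2.0f * x[lane];

        /* A data-dependent branch cannot be taken per-lane.  The hardware
         * effectively runs BOTH sides and masks off inactive lanes, so
         * divergent code pays for every path it touches. */
        for (int lane = 0; lane < LANES; lane++) {
            int   taken    = (x[lane] < 0.0f);  /* per-lane predicate          */
            float then_val = -x[lane];          /* "then" side, always computed */
            float else_val =  x[lane];          /* "else" side, always computed */
            y[lane] = taken ? then_val : else_val;
        }

        for (int lane = 0; lane < LANES; lane++)
            printf("%g ", y[lane]);             /* prints |x| per lane */
        printf("\n");
        return 0;
    }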
2023-09-29 16:56:40 This is definitely the sort of info I was talking about:
2023-09-29 16:56:43 https://www.amd.com/content/dam/amd/en/documents/radeon-tech-docs/instruction-set-architectures/rdna2-shader-instruction-set-architecture.pdf
2023-09-29 18:04:21 Ugh. The MAX32655 flash memory architecture is different from the sort of thing I'm familiar with from work. I think it's a NOR flash, readable as though it's RAM and fully mapped into the processor's address space.
2023-09-29 18:04:34 8kB pages, 128-bit words.
2023-09-29 18:04:45 I'm going to have to think a little about how best to use it.
2023-09-29 18:11:06 It's a 512kB flash, and is stated to have a life of 10k write cycles. So that's around 5 GB of total lifetime writes, max.
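Back-of-the-envelope helpers for the geometry quoted above (512 kB NOR flash, 8 kB erase pages, 128-bit program words, 10k write cycles). The base address below is an assumption for illustration; the real value comes from the device datasheet and linker script.

    /* MAX32655 flash arithmetic sketch -- geometry from the chat above,
     * base address assumed for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define FLASH_BASE  0x10000000u       /* ASSUMED memory-mapped base     */
    #define FLASH_SIZE  (512u * 1024u)    /* 512 kB                         */
    #define PAGE_SIZE   (8u * 1024u)      /* erase granularity: 8 kB pages  */
    #define WRITE_WORD  16u               /* 128-bit program word           */
    #define ENDURANCE   10000u            /* rated write/erase cycles       */

    /* Erase works on whole 8 kB pages, so any update scheme needs to know
     * which page an address lives in. */
    static uint32_t page_base(uint32_t addr) {
        return addr & ~(PAGE_SIZE - 1u);
    }

    int main(void) {
        printf("pages:           %u\n", FLASH_SIZE / PAGE_SIZE);   /* 64  */
        printf("words per page:  %u\n", PAGE_SIZE / WRITE_WORD);   /* 512 */
        /* Total lifetime write budget: 512 kB * 10k cycles ~= 5 GB, matching
         * the figure above -- and only if writes are spread across pages. */
        printf("lifetime budget: %.1f GB\n",
               (double)FLASH_SIZE * ENDURANCE / 1e9);
        printf("page of base+0x3FF0: 0x%08x\n",
               (unsigned)page_base(FLASH_BASE + 0x3FF0u));
        return 0;
    }

The practical consequence of the page/word mismatch is the usual NOR pattern: reads are plain memory-mapped loads, but updates mean buffering, erasing an 8 kB page, and reprogramming it in 128-bit words, so some wear-leveling or log-structured layout is worth thinking about up front.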