2023-06-05 03:41:52 Yeah it's an interesting passion project
2023-06-05 03:44:46 hand-written machine code :(
2023-06-05 05:26:55 You say that like there's something wrong with it. :-)
2023-06-05 05:28:33 Looks like a nice little project to me.
2023-06-05 05:39:01 "Passion project." I like the phrase.
2023-06-05 05:39:18 I guess that's what mine is. I only work on it when it's calling to me. Comes in binges.
2023-06-05 05:58:57 https://en.wikipedia.org/wiki/Double_hashing
2023-06-05 06:00:24 I'm not thrilled with the prospect of having to occasionally slow things down by re-hashing to a double-size table. That's the same sort of thing as batch garbage collection - it destroys "real time" behavior.
2023-06-05 06:01:24 I suppose a really smooth design could do that work alongside continued operation, but that would probably take some real care.
2023-06-05 06:01:28 I have settled on garbage collection for everything to do with defining and compiling words but none for actually running them
2023-06-05 06:01:57 Yeah that's why I say passion project. IMO hand-written machine code is pointless, but I respect the author's reasons, which obviously come down to it being fun to them
2023-06-05 06:02:04 During that period you'd continue to use the smaller hash table for ongoing work, and you'd have to keep the new one up to date as well.
2023-06-05 06:02:30 If you want real time behaviour you need to know the max capacity in advance
2023-06-05 06:02:33 MrMobieus: that's my typical way out of this kind of thing too.
2023-06-05 06:02:36 And be more intelligent with choice of hash too
2023-06-05 06:02:41 Compile time vs. runtime.
2023-06-05 06:03:01 Well, that link I just shared goes into choice of hash.
2023-06-05 06:03:22 It seems easy to design hash functions that will deliver optimum performance, if you're willing to divide.
2023-06-05 06:03:32 i.e. use a perfect hash for the data set, manually or with a perfect hash generator
2023-06-05 06:03:54 Unfortunately the easy solutions demand dividing by prime numbers, so you're not going to get a nice "shift" solution.
2023-06-05 06:04:16 I don't see how you can have a perfect hash when you're adding new words constantly.
2023-06-05 06:04:34 So you can't use that kind of technique for a dictionary
2023-06-05 06:04:36 They exist and they sound great, but you have to know all your strings in advance to find the thing.
2023-06-05 06:04:53 Yes, I agree that perfect hashes are... perfect, when you can use one.
2023-06-05 06:04:59 But you can still choose a sensible max number of words, or allow tuning it
2023-06-05 06:05:23 Yes, and the idea here is that you can change that max from time to time as your system grows.
2023-06-05 06:06:01 The empty space in your hash table is your resource.
2023-06-05 06:06:33 Have you read how chained records work in polyforth's database support library?
2023-06-05 06:06:42 We deal with this same kind of issue in our drives - we have to reserve some of the flash capacity for our own "management" of the drive. The more we reserve, the better performance is, but... the less we can offer to the user.
2023-06-05 06:07:10 We've typically reserved somewhere between 10% and 20% - it's varied from model to model.
2023-06-05 06:07:22 And when the user hasn't filled up his space yet, we use that too.
2023-06-05 06:07:27 I recommend reading how that whole database library works because it's quite interesting and involves a lot of fixed-size allocations (and workarounds where that's not appropriate)
2023-06-05 06:07:51 Sure, I'll take a look.
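A rough sketch of the "keep using the smaller table while the new one fills in" idea above - not anyone's actual implementation, just an illustration in C with invented names (dict_t, migrate_step), power-of-two capacities, and FNV-1a standing in for the hash. Each lookup migrates a couple of old entries, so the resize cost is spread out instead of arriving as one long pause:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { const char *key; void *val; } slot_t;

typedef struct {
    slot_t *old_tab, *new_tab;   /* old table stays live while the new one fills */
    size_t  old_cap, new_cap;    /* both assumed powers of two */
    size_t  migrate_pos;         /* next old slot to move */
} dict_t;

static uint64_t hash_str(const char *s) {      /* FNV-1a, one common choice */
    uint64_t h = 1469598103934665603ULL;
    while (*s) { h ^= (unsigned char)*s++; h *= 1099511628211ULL; }
    return h;
}

static void put(slot_t *tab, size_t cap, const char *k, void *v) {
    size_t i = (size_t)hash_str(k) & (cap - 1);
    while (tab[i].key) i = (i + 1) & (cap - 1);      /* linear probe to a hole */
    tab[i].key = k; tab[i].val = v;
}

/* Move a few entries per call so the resize cost is amortized across normal
   operations.  Old slots are left in place; lookups simply prefer the new table. */
static void migrate_step(dict_t *d, size_t budget) {
    while (d->old_tab && budget--) {
        if (d->migrate_pos == d->old_cap) {          /* old table fully drained */
            free(d->old_tab); d->old_tab = NULL; return;
        }
        slot_t *s = &d->old_tab[d->migrate_pos++];
        if (s->key) put(d->new_tab, d->new_cap, s->key, s->val);
    }
}

/* Lookups consult the new table first, then fall back to the old one. */
static void *find(dict_t *d, const char *k) {
    migrate_step(d, 2);                              /* spread the move out */
    uint64_t h = hash_str(k);
    for (int pass = 0; pass < 2; pass++) {
        slot_t *tab = pass ? d->old_tab : d->new_tab;
        size_t  cap = pass ? d->old_cap : d->new_cap;
        if (!tab) continue;
        size_t i = (size_t)h & (cap - 1);
        while (tab[i].key) {
            if (strcmp(tab[i].key, k) == 0) return tab[i].val;
            i = (i + 1) & (cap - 1);
        }
    }
    return NULL;
}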
2023-06-05 06:07:55 That's what I'd expect, reserving 10-20%
2023-06-05 06:08:22 Just google "greenarrays polyforth" if you don't have the manual, I think that PDF should come up
2023-06-05 06:08:24 Lea's memory allocator is pretty cool too - it also uses a smorgasbord of combined methods in pursuit of the "best practical outcome."
2023-06-05 06:08:37 Yeah but it solves a different problem
2023-06-05 06:08:45 It does.
2023-06-05 06:08:46 It's a good general-purpose RAM allocator
2023-06-05 06:08:52 Just "qualitatively similar" in the thinking.
2023-06-05 06:09:11 Understanding how they all work and where they fit into the problems we face is useful
2023-06-05 06:09:31 Yep.
2023-06-05 06:09:50 Anyway, the goal with this hashed dictionary is to "improve on linked list performance," and that's not particularly hard to imagine doing.
2023-06-05 06:10:12 Even if you do several table probes on average (i.e., let your table fill up a bit), you're still way ahead of a linked list.
2023-06-05 06:10:18 With hundreds of words in it.
2023-06-05 06:10:55 Not to mention that the key comparisons on those probes are going to be a lot faster than full string compares.
2023-06-05 06:11:59 And as I mentioned a few days ago, once I have these routines, I can probably use them in my block buffer management too and get improved average "heavy use" performance.
2023-06-05 06:12:10 The thing is I rarely wind up doing heavy disk use.
2023-06-05 06:12:14 I "think about it," though.
2023-06-05 06:14:14 I will definitely take a look at polyforth - I've had a "background interest" in databases for a while anyway.
2023-06-05 06:14:35 For most of my life I paid them no attention - they always struck me as a whole lot of fuss that I never needed.
2023-06-05 06:14:58 Then I started doing this performance work and needed to permanently archive my test results, and I very quickly became a big fan of good databases.
2023-06-05 06:15:46 It was a pleasant experience - I designed a little database for that purpose, and one of the design goals was to not "freeze in" my list of quantities that it stored - I felt I'd want to add new stuff over time.
2023-06-05 06:16:06 And that turned out to be true; after a while they started wanting me to measure power consumption, temperature, and a number of other things.
2023-06-05 06:16:15 Those things just folded right in with no muss or fuss.
2023-06-05 06:16:23 And I can find old data really fast.
2023-06-05 06:16:55 The tool I used originally, before I did any of that, was pathetic in some ways. It would let you run a test and it would give you graphs of that test.
2023-06-05 06:16:59 It did that well.
2023-06-05 06:17:17 But there wasn't even a facility for comparing two runs - putting the graphs from both of them on the same piece of paper.
2023-06-05 06:17:46 I wrote a bunch of Python that scraped up all the rather spread-out output of those runs and dumped it all in this database.
2023-06-05 06:17:53 And then wrote reporting tools around that.
2023-06-05 06:18:30 Later I stopped using that tool and switched to an open source tool called fio - that just required I re-write that "gathering and dumping into db" part, but otherwise nothing changed.
2023-06-05 06:19:58 After eight years it's beginning to get a little slow - I probably should archive my tables and make new ones with, say, the last year of data in them.
2023-06-05 06:27:37 Hmmm. That Wikipedia article discussed "extended double hashing" which seems to avoid certain numerical pathologies. The way it's presented, though, is as a little block of code that computes an entire array of hash probe indices - that code might compute more than you actually needed in some particular case.
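On the earlier point that key comparisons on those probes can be much cheaper than full string compares - a small illustrative sketch (hypothetical names, C for concreteness) that stores the full hash in each slot and only falls back to strcmp when the stored hash already matches:

#include <stdint.h>
#include <string.h>

typedef struct {
    uint64_t    hash;   /* full hash of the name, 0 = empty slot */
    const char *name;
    void       *xt;     /* whatever the entry points at (execution token, etc.) */
} entry_t;

static void *dict_find(entry_t *tab, size_t cap, const char *name, uint64_t h) {
    size_t i = (size_t)h & (cap - 1);          /* cap assumed a power of two */
    while (tab[i].hash) {
        /* cheap integer compare first; the expensive strcmp runs only on a
           genuine hash match, which is almost always the right entry */
        if (tab[i].hash == h && strcmp(tab[i].name, name) == 0)
            return tab[i].xt;
        i = (i + 1) & (cap - 1);
    }
    return NULL;                               /* not found */
}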
2023-06-05 06:28:09 Seems like it would be nice to weave that work in with the actual table accesses, so you could stop when you had what you needed.
2023-06-05 06:28:40 But computing those same results one at a time from scratch would be inefficient - that little routine they give does it very efficiently.
2023-06-05 06:30:06 What it lets you do is use a second hash function that's of the same form as the first one (but with differences of specific values) - then one routine could do both.
2023-06-05 06:31:05 Looks like if this stuff is coded right, then you actually COULD fill your table up completely. It would take a lot of probes, but it would all "work."
2023-06-05 06:57:27 IMO the main point of double hashing is security: it protects against pathologies introduced maliciously. You can usually avoid the need for it (with e.g. pseudorandom hash algorithm, or modifying hash algorithm for expected data)
2023-06-05 06:58:14 I think I have four operations on this: insert(name, voc), find(name, voc), wipe(), and forget(name, voc).
2023-06-05 06:58:19 Yes, that's true.
2023-06-05 06:58:43 Most of the time standard random reality would keep a simpler algorithm out of trouble.
2023-06-05 06:59:29 The MOST straightforward approach, though, can wind up with clusters. I don't really know how likely they are.
2023-06-05 06:59:56 To be clear: a pseudorandom hash (or even crypto hash) won't solve the security issue, it just solves the non-malicious issues
2023-06-05 07:00:02 Before someone slaps me
2023-06-05 07:00:05 You just want two different strings to have not only different initial slots but also different slot sequences, even if they happen to start on the same one.
2023-06-05 07:00:30 No no - I hear exactly what you're saying.
2023-06-05 07:00:43 I'm starting to think security is something that is just a component of every single thing we do, and needs to be taught about alongside any algorithm
2023-06-05 07:00:47 You're saying the weaknesses are more of a concern as "exploits" than anything else.
2023-06-05 07:01:30 Like "Input, Output, Algorithm, Complexity, Security" should be the five things we learn about for any algorithm or data structure
2023-06-05 07:01:39 Yeah, in a world where the intent is for systems to be available to the public, that's certainly true. And I guess in any organization big enough that you can't actually have full trust in all of your members.
2023-06-05 07:01:52 That sounds right to me.
2023-06-05 07:02:18 Well the workaround I suggest... pseudorandom, or even cryptographic hashing, doesn't actually avoid *all* pathological input, just *expected* pathological input
2023-06-05 07:02:20 And to underscore that, they really haven't mentioned "vulnerability to exploits" at all in these lectures I've been watching.
2023-06-05 07:02:35 So if someone is working maliciously they can produce contrived input to break it
2023-06-05 07:02:48 Of which there really won't be any, since I'm writing this primarily for my own use.
2023-06-05 07:03:02 Double hashing allows algorithms where it's impossible to do this consistently
2023-06-05 07:03:22 I mean, I guess I might decide someday to deliberately see how easily I can break it, but that would just be an exercise for fun.
2023-06-05 07:03:34 Put on my black hat for a few minutes.
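For concreteness, here is a minimal double-hashing probe loop in C (names invented): the second hash only supplies the step size, and each probe index is computed inside the loop, so the search stops as soon as it finds the word or an empty slot instead of precomputing a whole array of indices. With a power-of-two table and an odd step, the sequence visits every slot, which is why the table really can be filled completely:

#include <stdint.h>
#include <string.h>

#define CAP 128   /* table size; with a power of two the step must be odd */

typedef struct { const char *name; void *val; } cell_t;

static void *probe_find(cell_t tab[CAP], const char *name,
                        uint64_t h1, uint64_t h2) {
    size_t idx  = (size_t)h1 & (CAP - 1);
    size_t step = (size_t)(h2 | 1) & (CAP - 1);   /* odd step => visits every slot */
    for (size_t n = 0; n < CAP; n++) {
        if (tab[idx].name == NULL)
            return NULL;                  /* empty slot: not in the table */
        if (strcmp(tab[idx].name, name) == 0)
            return tab[idx].val;          /* found */
        idx = (idx + step) & (CAP - 1);   /* next probe, woven into the loop */
    }
    return NULL;                          /* table completely full, no match */
}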
2023-06-05 07:04:22 The cleverness the world has exhibited in pursuit of such input has impressed me quite a bit.
2023-06-05 07:04:24 This stuff absolutely matters. Imagine I have a website where people can upload pastes and choose a name. Say I store the name database at runtime in a hash table of the names. Then people can choose many names that will slow my whole website down, if they can see my algorithm!
2023-06-05 07:04:45 Yes.
2023-06-05 07:04:46 Say goodbye to 'my first C website'
2023-06-05 07:05:31 And if money is involved it WILL happen. May happen anyway just because there are pathological personalities out there that get off on doing such things.
2023-06-05 07:06:51 Well I don't want to accuse hackers of being 'pathological', I think it's a legit activity especially if they do no actual harm and disclose things ethically
2023-06-05 07:07:06 An often thankless endeavour
2023-06-05 07:07:21 Sure, in fact you might regard that as a "service."
2023-06-05 07:07:30 Notifying people of weaknesses.
2023-06-05 07:07:57 What's interesting is that the basic "plug it in and use it" hash table in C++'s standard library is vulnerable to this kind of attack
2023-06-05 07:08:08 And it's not even that performant
2023-06-05 07:08:17 It makes me think... what were they thinking?
2023-06-05 07:08:32 If it's not even fast, or memory-efficient, why not at least make it secure
2023-06-05 07:09:03 Yeah. The whole open source ecosystem, though, seems like it's naturally going to arrive at an "average" quality level.
2023-06-05 07:09:08 I wonder about Rust too
2023-06-05 07:09:19 I'm pretty sure Lua's table is vulnerable as well
2023-06-05 07:09:33 The more people you have involved in a project, the harder it becomes to maintain a really high standard.
2023-06-05 07:09:52 Kind of a second law of thermo type thing.
2023-06-05 07:09:53 Although it's probably easy to implement double hashing within Lua's tables, if you want it
2023-06-05 07:10:22 Yeah, I was about to say that my usual outcome on such things is to implement "the best I know how that's easy and efficient."
2023-06-05 07:10:41 If it's cheap and simple to add in an extra layer of quality, then I'm likely to do it (if I know how).
2023-06-05 07:11:19 I just try to avoid outright torpedoing performance in the name of some perfect nirvana.
2023-06-05 07:11:39 It goes to show you need to understand *the whole stack* if you want to write good code
2023-06-05 07:11:46 Or at least that's what I'd LIKE to do - sometimes I get to overthinking some pieces of it.
2023-06-05 07:11:47 Security is one of the reasons
2023-06-05 07:11:58 I could not agree more.
2023-06-05 07:12:10 And it's why whole-stack programming like Forth is more relevant/informative than you might think
2023-06-05 07:12:12 That's the biggest problem with the existing culture around all this.
2023-06-05 07:12:24 than one* might think, I know *you* think :P
2023-06-05 07:12:26 People operate in niches.
2023-06-05 07:13:16 :-)
2023-06-05 07:13:25 Let's say I'm writing a database for internal use; then I don't need to worry about this kind of security issue, and I can immediately discard all this complexity
2023-06-05 07:13:54 But if I'm writing a website database? Then I will either use the double hashing or figure out an alternative to hashing (more likely)
2023-06-05 07:13:58 Of course, that's how we've wound up with an insecure internet in the first place.
2023-06-05 07:14:10 When the whole thing was born, security was less of a factor.
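One mitigation touched on above - a pseudorandom (seeded) hash so that the *expected* pathological input goes away - might look something like the sketch below. The names and the FNV-1a constants are just illustrative; as the conversation notes, this only raises the bar for an attacker who knows the algorithm but not the seed, and it is not the kind of worst-case guarantee a balanced tree gives.

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint64_t hash_seed;   /* chosen once at startup, unknown to outsiders */

void hash_init(void) {
    /* any unpredictable source will do for a sketch; a real service would
       want an OS-provided random source rather than the clock */
    hash_seed = (uint64_t)time(NULL) ^ ((uint64_t)clock() << 32);
}

uint64_t seeded_hash(const char *s) {
    uint64_t h = 1469598103934665603ULL ^ hash_seed;   /* FNV-1a offset + seed */
    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 1099511628211ULL;
    }
    return h;
}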
2023-06-05 07:14:20 I don't think internal databases tend to go external
2023-06-05 07:14:29 Then we decided to take the whole thing and open it up to the world, and no one wanted to re-write things from scratch.
2023-06-05 07:14:33 It's like internal source code becoming open source, it's extremely rare and tricky
2023-06-05 07:14:52 Yeah - if you're not careful you wind up giving away more than you meant to.
2023-06-05 07:15:19 It's why I'm very impressed when MS open-sources stuff, especially legacy stuff
2023-06-05 07:15:22 There was a HUGE ordeal of "bluewashing" (an IBM internal name) the little company's software when IBM came in and took over.
2023-06-05 07:15:26 But also it explains their restraint
2023-06-05 07:15:34 Getting it up to their standards was difficult.
2023-06-05 07:15:55 Some uses of open-source routines had to be purged and re-written.
2023-06-05 07:16:06 A lot of it was just properly documenting things, but some new work had to get done.
2023-06-05 07:17:13 Note that my list of operations above didn't contain "delete."
2023-06-05 07:17:23 This system won't ever delete individual items.
2023-06-05 07:17:36 .wipe and forget are both "batch" operations.
2023-06-05 07:18:23 IMO a good compromise alternative for 'secure' or 'realtime' hashing is just using a provably fast-enough balanced tree algorithm
2023-06-05 07:18:41 That's probably simpler to implement than a secure double hashing algorithm
2023-06-05 07:19:09 Yeah. log(n) isn't const(n), but it's also "not bad."
2023-06-05 07:19:16 It's basically const(n)
2023-06-05 07:19:31 log() gets fairly flat.
2023-06-05 07:19:45 Because log(n) might as well be replaced with '64' for all feasible computers
2023-06-05 07:20:39 Within a certain bound all memory on earth for the foreseeable future is logarithmically reduced to some number under 100
2023-06-05 07:21:01 So it's pretty much a reasonable constant for practical purposes
2023-06-05 07:22:40 It's not the most performant but at least it has a consistent and predictable maximum runtime
2023-06-05 07:23:04 I'd rather that than exploitable
2023-06-05 07:23:24 Yeah - "demonstrably more secure" has value.
2023-06-05 07:23:37 More secure and also not complicated
2023-06-05 07:31:21 The double hashing process is still "interesting," though. :-)
2023-06-05 07:31:40 It's the right algorithm for common use-cases
2023-06-05 07:32:08 But I think it's probably rarely the right algorithm for Forth because it is just more complicated than alternatives with worse security or performance
2023-06-05 07:34:08 There would be some complexities with a tree approach too. Especially around .wipe.
2023-06-05 07:34:24 Not insurmountable.
2023-06-05 07:34:37 It's my guess that it's easier, I could be wrong
2023-06-05 07:34:56 I don't think I have a guess - I'd have to think about it.
2023-06-05 07:35:29 I am currently writing a simple hash table in Forth
2023-06-05 07:35:30 But removing a node in the middle of a tree is at least a little cumbersome if the removed node has two children.
2023-06-05 07:35:33 I'll share it if I finish
2023-06-05 07:36:36 I remember that in a b-tree you always split a node that's full when deleting just to avoid that case causing trouble, regardless of whether it was necessary
2023-06-05 07:37:06 I'm not that familiar with balanced trees though, one of the algorithms I've spent the least time implementing
2023-06-05 07:38:16 Do I mean deleting or inserting? Maybe both, I can't remember. I've never implemented a B-tree myself
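Going back to the insert/find/wipe/forget list above, here is a hedged sketch in C (the entry layout is invented) of why "forget" can stay a batch operation with no per-item delete: in a Forth, FORGET discards a word and everything defined after it, so the table can simply be swept and every slot pointing at or past the forgotten definition cleared.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    const char *name;   /* NULL marks an empty slot */
    void       *body;   /* address of the word's definition in the dictionary */
} dent_t;

/* Clear every entry whose definition sits at or beyond 'mark' (the body of the
   word being forgotten).  Because the removed entries are always the newest
   ones, surviving entries were inserted before them, so their probe chains
   never ran through the slots being cleared - no tombstones are needed.
   A full wipe is the same sweep with mark set to the start of the dictionary. */
void dict_forget(dent_t *tab, size_t cap, void *mark) {
    for (size_t i = 0; i < cap; i++)
        if (tab[i].name && (uintptr_t)tab[i].body >= (uintptr_t)mark) {
            tab[i].name = NULL;
            tab[i].body = NULL;
        }
}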
2023-06-05 07:38:39 There's a "rotation" operation involved with keeping it balanced.
2023-06-05 07:38:45 Or close to balanced, at least.
2023-06-05 07:38:59 It's guaranteed to be balanced to a certain extent I think
2023-06-05 07:39:17 Not necessarily optimally, but within some factor of optimal
2023-06-05 07:39:24 Yeah, unless you get some of that "designed" input.
2023-06-05 07:39:40 I think it's regardless of input
2023-06-05 07:39:58 I think it's balanced for any and all input, there is a worst case that's still within a factor of being balanced
2023-06-05 07:40:25 But I could deliberately choose an insertion sequence that just shot one branch out to whatever length I wanted.
2023-06-05 07:40:52 But if the insertion algorithm strategically does those "rotations" it can circumvent that.
2023-06-05 07:41:25 I disagree, it would be balanced to an extent still
2023-06-05 07:41:48 Not sure what you mean
2023-06-05 07:41:55 What if I started with an empty tree and just inserted ascending integers?
2023-06-05 07:42:05 Yeah it would balance that
2023-06-05 07:42:09 A B-tree would
2023-06-05 07:42:14 And a red-black tree
2023-06-05 07:42:28 Ok fair enough - you've got some of those "correction ops" in that.
2023-06-05 07:42:35 Yeah that's what I mean
2023-06-05 07:42:36 I was thinking of just a straight sorted tree.
2023-06-05 07:42:44 I am talking about balanced tree algorithms, not unbalanced
2023-06-05 07:42:49 Fair fair - I agree. It's "not hard" to keep it balanced.
2023-06-05 07:42:56 At relatively low cost.
2023-06-05 07:43:23 It's log(n)
2023-06-05 07:43:24 And if you do it along the way those correction operations happen out near the leaves and are fairly inexpensive.
2023-06-05 07:45:28 That's why I assumed we're talking about balanced tree algorithms, because generally B-Trees are presented as balanced algorithms only
2023-06-05 07:45:31 And you said log(n)
2023-06-05 07:45:42 If it's unbalanced, operations aren't logarithmic
2023-06-05 08:18:45 Yeah, I thought for a minute you were talking about a "binary tree with no extra care."
2023-06-05 08:19:48 Which would "probably" stay close to log(n), just due to natural random variation, but it'd be easy to break.
2023-06-05 08:20:35 I would expect it to be unbalanced, as you say even just increasing numbers will break it
2023-06-05 08:20:50 Lots of typical/natural input would break it
2023-06-05 08:21:44 Yep.
2023-06-05 08:21:57 You'd have to try to keep it balanced.
2023-06-05 08:22:40 How likely would I be to insert 1-7 as 4 3 6 1 2 5 7?
2023-06-05 08:23:12 I'd think not even randomly likely.
2023-06-05 08:52:12 You know, I just need to get over my craving for memory regions lined up on perfect power-of-two boundaries. :-|
2023-06-05 08:52:36 It's been easy to give myself that with the very "limited" memory management I've done in the past. But a generic allocator just isn't going to give that to me.
2023-06-05 08:52:59 And in the end all it saves me is a single addition instruction.
2023-06-05 08:53:23 I can still do all my calculations with nice even numbers, and then just offset it at the very end.
2023-06-05 09:03:41 I think generic allocators are overrated
2023-06-05 09:19:41 I agree that specialty allocators make sense in a lot of situations.
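The "single addition instruction" point might be illustrated like this (hypothetical names, C for concreteness): keep the index arithmetic in clean power-of-two shifts and masks, and add the possibly-unaligned base the allocator returned only at the very end.

#include <stdint.h>
#include <stdlib.h>

#define SLOT_SHIFT 6                     /* 64-byte slots */
#define NSLOTS     1024

typedef struct { uint8_t *base; } region_t;

int region_init(region_t *r) {
    r->base = malloc((size_t)NSLOTS << SLOT_SHIFT);  /* no alignment demanded */
    return r->base != NULL;
}

uint8_t *slot_addr(region_t *r, unsigned n) {
    size_t off = (size_t)(n & (NSLOTS - 1)) << SLOT_SHIFT;  /* pure shift/mask math */
    return r->base + off;                /* the single addition, at the very end */
}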
2023-06-05 09:24:32 I think if you've got an 'OS' with any kind of paging or virtual memory then each process can probably get away with simple ALLOT-style allocation
2023-06-05 09:24:43 each process/task
2023-06-05 09:25:29 And then if you don't have that you either manually allocate memory carefully or you probably run out of memory anyway with a generic allocator
2023-06-05 09:27:00 In those kinds of situations it's probably best to use fixed-size chunks of allocated memory, which changes how you use dynamic allocation significantly, so that you avoid waste from fragmentation
2023-06-05 09:27:16 If you really can't structure it more intelligently, anyway
2023-06-05 09:28:34 If you look at more complicated programs like e.g. a web browser, you realise that it is at least as complicated as an OS anyway and could work as I said with ALLOT, but with many tasks per tab/resource and virtual memory
2023-06-05 09:33:04 And maybe... just maybe... there's an argument for simulating virtual memory in your Forth if you need to do this kind of allocation on an embedded machine without virtual memory
2023-06-05 09:34:01 With simulated virtual memory, crafted for the specific application, you can probably avoid much more wasted memory than you would avoid with a generic allocator; potentially without too much runtime overhead
2023-06-05 09:46:22 I do think that a PROPER Forth would directly control the memory-related hardware, making the stuff you just said true. Unfortunately I'm not that far along yet. I am living inside an OS. I do try to minimize my reliance on it, though. Contain it in as tight a circle as possible.
2023-06-05 09:47:53 And I'm a pretty big fan of fixed-size allocators. They're simple and fast and just "get the job done."
2023-06-05 09:48:42 I think it's reasonable when hosted on an OS to just map new memory regions for tasks, presuming a 'real' Forth would let you do the same if running as the OS
2023-06-05 09:49:51 But it's not something you can assume on any Forth platform
2023-06-05 09:50:14 Not that I've "done it" yet, but I want to think that if my Forth had control of the hardware then every thread would get an address space that started at zero.
2023-06-05 09:50:29 Or something along those lines.
2023-06-05 09:50:30 I would probably start at page 1
2023-06-05 09:50:38 Possibly, yeah.
2023-06-05 09:50:40 Zero's a useful address to trap misuse
2023-06-05 09:50:46 Yes.
2023-06-05 09:51:05 I mostly meant "consistent" - they'd all see the same thing.
2023-06-05 09:51:18 I agree
2023-06-05 09:51:22 Simplifies the code
2023-06-05 09:51:24 And there might be a shared block mapped in for IPC.
2023-06-05 09:51:31 Maybe.
2023-06-05 09:52:18 Well one method of IPC used in Forths is BLOCKs
2023-06-05 09:53:02 Yeah.
2023-06-05 09:53:24 That's what I'd prefer for simple IPC
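As a concrete picture of the fixed-size-chunk style being praised here, a minimal free-list pool in C (sizes and names arbitrary): one static arena carved into equal chunks, allocation and release each a couple of pointer moves, and no fragmentation beyond the fixed chunk size.

#include <stddef.h>

#define CHUNK_SIZE 64                 /* payload size of each chunk */
#define NCHUNKS    256

typedef union chunk {
    union chunk  *next;               /* link used while the chunk is free */
    unsigned char bytes[CHUNK_SIZE];  /* payload while it's in use */
} chunk_t;

static chunk_t  arena[NCHUNKS];       /* one static arena, no heap needed */
static chunk_t *free_list;

void pool_init(void) {
    free_list = NULL;
    for (size_t i = 0; i < NCHUNKS; i++) {   /* thread every chunk onto the list */
        arena[i].next = free_list;
        free_list = &arena[i];
    }
}

void *pool_alloc(void) {
    chunk_t *c = free_list;
    if (!c) return NULL;              /* pool exhausted: a known, bounded failure */
    free_list = c->next;              /* pop the head: O(1), no searching */
    return c;
}

void pool_free(void *p) {
    chunk_t *c = p;
    c->next = free_list;              /* push back onto the list: also O(1) */
    free_list = c;
}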