Note: This is satirical in nature. Slight CW if you are at a point in life where “Office Space” has unveiled itself as a disturbing existential horror movie. This taps into that same darkness. A tale of six brave Internet pioneers: Senior Business Founder / Senior CEO – Zach; Senior Tech Lead / Senior Architect / Senior CTO – Kevin; Senior Backend dev; Senior Frontend dev – Erin; Two Senior UX engineers| www.marginalia.nu
How you engage with the world changes how you experience the world, and how the world experiences you. A snarky and cynical approach, by its default assumption that things are shit, or if they are not yet shit will inevitably turn to; such an approach will give your world a malodorous brownish tint. Granted, snark gives you plausible deniability, a motte-and-bailey that protects you from direct criticism, encountering backlash you can always backpedal and say it was just a joke that you accid...| www.marginalia.nu
The Marginalia Search index has been partially rewritten to perform much better, using new data structures designed to make better use of modern hardware. This post will cover the new design, and will also touch upon some of the unexpected and unintuitive performance characteristics of NVMe SSDs when it comes to read sizes. The index is already fairly large, but can sometimes feel smaller than it is, and paradoxically, query performance is a big part of why.| www.marginalia.nu
As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change. This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data quality, and for detecting when websites have significant changes including ownership transfers and parking. Table Of Contents Feature Rationale Data Representation Live Data Event Data Change Detection ...| www.marginalia.nu
You wake at 05:30 in the morning, feeling somewhat groggy. Instead of the alarm clock ringing like it normally does, a cheerful hologram appears: “Hi! I’m Kyle, your new alarm clock assistant!” You get dressed as Kyle explains all of the fantastic things he is capable of. You head over to the coffee machine. “Hey there! I’m Evan! Are you ready for AI in your coffee? But first - tell me about yourself!| www.marginalia.nu
The most recent change to the search engine is a system that profiles websites based on their rendered DOM. The goal is identifying advertisements, trackers, nuisance popovers, and similar elements. The search engine already tries to do this, but isn’t very good at it because it’s only looking at static code. It turns out to be somewhat difficult to determine what a website that has non-trivial javascript will look like based on its source code alone, as this would require us to among other ...| www.marginalia.nu
The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format. It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”.| www.marginalia.nu
Some time ago, I migrated the crawler off the okhttp library, to use Java’s built-in HTTP client. This seemed like a good idea at the time, but has led to a fair number of headaches. Java’s HttpClient has one damning flaw, and that is that it doesn’t support socket timeouts. Its only supported timeout values are time to connect, and time until first byte of the response. This means the client can get stuck on a read call if a server stops responding, potentially for a very long time!| www.marginalia.nu
A problem the search engine’s crawler has struggled with for some time is that it takes a fairly long time to finish up, usually spending several days wrapping up the final few domains. This has been actualized recently, since the migration to slop crawl data has dropped memory requirements of the crawler by something like 80%, and as such I’ve been able to increase the number of crawling tasks, which has led to a bizarre case where 99.| www.marginalia.nu
I’m happy and grateful to announce that the Marginalia Search project has been accepted for a second nlnet grant. All the details are not yet finalized, but tentatively the grant will go toward addressing most of the items in the project roadmap for 2025. I’ve already been working full time on the project since summer 2023, and this grant secures additional development time, and extends the runway to a comfortable degree.| www.marginalia.nu
This text is satirical in nature. Tech news is abuzz with rude AI crawlers that forge their user-agent and ignore robots.txt. In my opinion, if this is all the AI startups can muster, they’re losing their touch. wget can do this. You need to up your game, get that crawler really rolling coal. Flagrant disregard for externalities is an important signal to the investors that your AI startup is the one.| www.marginalia.nu
This update is a few days late, the canonical birth date of the project is Feb 26. It has been another year of Marginalia Search. The project is still ongoing, still my full time job, although the project is entering a somewhat more mature phase of development, most of the big pieces are in place and do a decent job at what they do. The roadmap for the project is available on GitHub.| www.marginalia.nu
A while back an update went live that, with some caveats, changes the time it takes for an update on a website to reflect in the search engine index from up to 2 months to 1-2 days. The condition is that the website has an RSS or Atom feed. The big crawl job takes about two months, and is run partition by partition, meaning there’s typically a slice of the index that is two months stale at any given point in time.| www.marginalia.nu
I recently put together a small library called Slop, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before. This post isn’t so much an attempt to convince anyone else to use this library, as it makes trade-offs catering to a fairly niche use case, but to explore some of its design ideas, as it all came together very nicely, in the hopes that other libraries can draw ideas from it.| www.marginalia.nu
Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document exactly in the same order as they do in the query. This is a write-up about implementing this change. This is going to be a relatively long post, as it represents about 4 months of work. I’m also happy and grateful to announce that the nlnet people reached out after the run of the gra...| www.marginalia.nu
Article URL: https://www.marginalia.nu/log/a_110_java_io/ Comments URL: https://news.ycombinator.com/item?id=41616653 Points: 51 # Comments: 24| Hacker News: Newest
As an experiment, I’ve reduced my coffee intake to a single cup a day for about a week now. It’s made an enormous difference in sleep, mood and energy. I get tired at night, fall asleep quickly, and wake up refreshed. As mentioned previously in the context of morning sunlight exposure—another thing that’s aided my sleeping habits, but is somewhat less practical to sustain as it requires fair weather—I’ve always been slow to get going in the morning, active at night, bad at getting...| Weblog on marginalia.nu
A neat property of the parquet file format is that it’s designed with block I/O in mind, so that when you are interested in only parts of the contents of a file, it’s possible to some extent to only read that data. Many tools are aware of this property, and DuckDB is one of them. Depending on which circles you run in, a lesser known aspect of HTTP is range requests, where you specify which bytes of a file to retrieve.| Weblog on marginalia.nu
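To make the range request idea in the excerpt above concrete, here is a minimal sketch that builds such a request without sending it. The URL and the helper name `build_range_request` are made up for illustration; the only protocol detail relied on is the standard `Range: bytes=start-end` header, where both offsets are inclusive.

```python
import urllib.request


def build_range_request(url: str, start: int, end: int) -> urllib.request.Request:
    """Build (but do not send) a GET request for bytes start..end, inclusive."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    # Range offsets are inclusive: bytes=0-99 asks for the first 100 bytes.
    return urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})


# Hypothetical URL, for illustration only; nothing is fetched here.
req = build_range_request("https://example.com/data.parquet", 0, 99)
```

A server that supports ranges answers such a request with status 206 (Partial Content) and only the requested bytes; a server that does not may ignore the header and return the whole file.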
The project has been haunted by a mysterious bug since sometime in February. It relates to the code that constructs the index, particularly the code that merges partial indices. In short the search engine constructs the reverse index through successive merging of smaller indices, which reduces the overall memory requirement. You can conceptualize the reverse index itself as two files, one with offset pointers into another file, which has sorted numbers. This code runs after each partition finishes...| www.marginalia.nu
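The "two files" picture in the excerpt above can be sketched with two in-memory lists. This is a toy model under stated assumptions, not the search engine's actual on-disk layout: the term ids, the extra trailing offset, and the helper names `postings` and `contains` are all invented for illustration.

```python
from bisect import bisect_left

# Hypothetical toy data for three terms with ids 0, 1 and 2.
# "offsets" stands in for the pointer file, "doc_ids" for the file of sorted numbers.
offsets = [0, 3, 5, 9]  # term t owns doc_ids[offsets[t]:offsets[t + 1]]
doc_ids = [2, 7, 11, 4, 7, 1, 2, 7, 30]


def postings(term: int) -> list[int]:
    """Return the sorted run of document ids recorded for a term."""
    return doc_ids[offsets[term]:offsets[term + 1]]


def contains(term: int, doc: int) -> bool:
    """Binary-search the term's sorted run for a document id."""
    run = postings(term)
    i = bisect_left(run, doc)
    return i < len(run) and run[i] == doc
```

Because each run is sorted, membership tests are logarithmic in the run length, and two such structures can be merged run by run.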
I set out a little over a week ago to add a service registry to Marginalia Search, primarily to reduce its dependence on docker. I would like it to be able to run on bare metal as well, which poses a problem since configuring the application manually is a bit of a headache with dozens of ports that need to be set up. It would also be desirable to be able to run multiple instances of important services in order to eliminate downtime during upgrades.| Weblog on marginalia.nu
It’s been three years since the inception of Marginalia Search, then a dinky experiment to find where the heck the cool Internet has gone, now my full time job. While there’s always things that can be improved, it’s fair to say the search engine has never worked as well as it does right now. A great number of milestones have been reached, perhaps biggest of all the search engine has moved out of my living room and into a proper enterprise server.| Weblog on marginalia.nu
One of the great joys of working on a search engine is that you get to reverse engineer SEO spam, and overall study how it evolves over time. I’ve been noticing the search engine spam strategy of adding ‘reddit’ to page titles for a few years now, but it feels like it’s been growing a lot recently. I don’t think it’s actually working, but it’s so cute that they are trying.| Weblog on marginalia.nu
I get significantly more work done when I unplug my computer from the Internet. It’s not that my productive output drops to zero when I’m plugged in, but more like 70%. Despite many of the tools that I use requiring a connection, and certainly the Internet containing a wealth of information that might expedite my work, these benefits are drastically outweighed by the wealth of distractions also available. It’s very appealing, when the code is compiling or the docker containers restartin...| Weblog on marginalia.nu
You have a hobby you’ve been into for a decade or more. You like talking about your hobby, and your friends and family, after listening to you for as long as you’ve been into it, maybe aren’t as excited to always hear about it as you are about discussing it, so in an act of compassion you create a youtube channel where you can monologue about your passion instead.| Weblog on marginalia.nu
This is a bit of a retrospective of every project I’ve worked on, as far as I remember them. I’ve tried to unearth any artifacts that remain. Far from everything is flattering or a resounding success, but then again, maybe that’s good. There are definitely patterns in the things that didn’t pan out. Earliest Traces: I was definitely programming stuff, but I don’t think it ever amounted to anything tangible. It was more like playing house, I built GUIs that looked like real application...| Weblog on marginalia.nu
Marginalia Search very recently gained the ability to filter results by Autonomous System, not only searching by ASN but by the organization information for that AS. At a glance this seems like a somewhat frivolous feature, but it has interesting effects. Autonomous Systems are part of the Internet’s routing infrastructure. If your mental model of an IP number is that they are the phone number of the computer, this is something akin to a postal code.| Weblog on marginalia.nu
A simple guide to reading in 9 simple steps: Navigate to the desired article. Dismiss the GDPR banner. It may seem safe to start reading, but you need to wait about 10 seconds as the various ad auctions resolve and scripts load in. Wait while the article is populated with ads. While the article is in front of you, there is no point in starting to read yet, as the minute’s worth of layout shift will make you lose your place.| Weblog on marginalia.nu
The Marginalia Crawler has seen improvements! A long term problem with the crawler design is that if for whatever reason the crawler shuts down, then it needs to re-start fetching whatever domains it was currently traversing during the termination from zero. This isn’t fantastic, since not only does crawling a website take a fair bit of time, it’s a nuisance for the server admins to re-crawl stuff that was already fetched, and a real liability for ending up in robots.| Weblog on marginalia.nu
I’ve been working on getting anchor tag keywords into the search engine, basically using link texts to complement the keywords on a webpage. The problem I’m attempting to address is that many websites don’t really describe themselves particularly well. As Steve Ballmer’s stage performance once illustrated, merely repeating a word doesn’t on its own make what you’re saying relevant to the term. Another good example of how it falls short is PuTTY’s website, which will be used as a...| Weblog on marginalia.nu
So a bit of an update on what I’ve been working on. This will be adapted into release notes in a while, but I haven’t quite wrapped a bow on the change set yet. Still, it has certainly been a few weeks. Didn’t quite land how busy I’ve been until I sat down to draft this post. Them’s some changes, and I’m skipping a few to keep this meandering post at a sane length.| Weblog on marginalia.nu
As a general observation, I tend to be more productive when I know what to do next at any given moment. There are days when I’ve seemingly gotten a “week” of work done in an afternoon, those are the days when what I needed to do was very clear, and I basically just had a list of items to tick off one by one. There have admittedly also been those ignoble weeks when I’ve gotten an afternoon’s work done, mostly they are weeks when it’s not been at all clear what to do next.| Weblog on marginalia.nu
I ran into a bit of a puzzling situation yesterday, testing some of the new index construction changes before they go live in a few days. The process crashed with a pretty non-descript stack trace complaining about illegal instructions, so at first glance it looked more like it was within the realm of freak JVM bug, cosmic ray, hardware error maybe. I was doing this on my developer workstation, which also spawned a popup complaining that the hard drive it was working on had nearly run ou...| Weblog on marginalia.nu
I’m happy to announce that the generous people at FUTO have granted the project $15,000 with no strings attached to help the search engine out with some more server power. FUTO is a young Austin, TX-based organization “dedicated to developing, both through in-house engineering and investment, technologies that frustrate centralization and industry consolidation”. It’s one to keep an eye on, I believe their heart is in the right place and they have every possibility of making a real di...| Weblog on marginalia.nu
So… I’ve had the most unreal week of coding. Zero exaggeration, I’ve halved the RAM requirements of the search engine, removed the need to take the system offline during an upgrade, removed hard limits on how many documents can be indexed, and quadrupled soft limits on how many keywords can be in the corpus. It’s been a long term goal to keep it possible to run and operate the system on low-powered hardware, and so far improvements have been made, to the point where my 32 Gb RAM devel...| www.marginalia.nu
I’ve started going on a long walk each morning immediately as I wake up, and it’s had the unexpected side-effect of fixing my broken circadian rhythm. For decades, as long as I can remember, I’ve been what you might consider a serious night owl. Regardless of how long I slept or when I woke up, I would get nothing requiring any sort of thought done until sometime after lunch, and it wasn’t really until late at night that my brain really kicked into gear.| Weblog on marginalia.nu
I recently got a smaller computer screen. It’s actually not that small, it’s 27", but the resolution is modest compared to what is available. And in short, it’s fantastic. It’s not an expensive screen, it’s not a fancy screen; but it’s comparatively a small screen. For a few years I was using a 34" ultra-wide monitor, which has been causing me nothing but grief. It’s sort of crept up on me that so many small annoyances in my computer-use all originated from using this screen.| Weblog on marginalia.nu
This is a bit of a “what I’ve been working on” style of post. It’s also a bit of a complement for the release notes of the upcoming release which should be dropping in a week or so. There’s some spit and polish still missing from these things, but if I don’t write about them now too much will have been ejected from the cache to make a well written post about it.| Weblog on marginalia.nu
I’m working on Marginalia Search full time. I left the office for the last time today, and it’s the strangest feeling. I’ve quit jobs, taken time off work, been laid off, but this is different from any of those things. This is deliberate. There’s a note of relief. I’ve essentially been working two pretty demanding jobs; one for pay and one for passion and the joy of making a difference.| Weblog on marginalia.nu
I use print debugging all the time. I know how to use a debugger. I use a debugger sometimes, but most of my debugging is done by print statements that are like A B C . . 5 D . D , {true, 30} . , . , 10 E I think Clean Code makes some valid points. I don’t think it should be your bible or treated as infallible; having seen the sort of code that came before it, yeah, Uncle Bob got some things right.| Weblog on marginalia.nu
I killed the old memex.marginalia.nu site. Not because it wasn’t great, but because I don’t have the time to maintain the software, which was quite janky, and perhaps most of all I wasn’t really feeling it. The new site looks superficially similar, but it’s actually just a Hugo template that emulates some of the memex’ capabilities. Although some of the coolest stuff is sadly gone as a result. Thankfully I decided to use an extremely portable markup format when building the original...| Weblog on marginalia.nu
I’ve come to think LLMs/GPTs/whatever are a threat to conventional search engines because the modern web is an unbelievably annoying dumpster fire. They don’t really provide better or faster answers, what they provide is an experience that is not a complete pain in the ass. This frog has been simmering for a long while now and we’re so used to it that seeing literally anything else seems revolutionary. You visit a website and need to dismiss a cookie policy notification, a request to sh...| Weblog on marginalia.nu
I’ve moved Marginalia’s sources to Github. Can’t pick every battle. The main reason is I’m kind of tired of the amount of spam bots that keep signing up to my Gitea. The juice of self-hosting a public-access git forge, even locked down to prevent arbitrary repo creation, that juice just isn’t worth the squeeze. This is not without some consideration. To be blunt, I don’t like Github. Their use of dark patterns leaves a real nasty after-taste.| Weblog on marginalia.nu
This is a bit of a follow up to the previous post. The Grand Code Restructuring [ 2023-03-17 ] Marginalia’s search result quality has, for a long while, been pretty good as long as your search query is a single term, but for multiple search terms it’s been a bit hit-and-miss. Marginalia was never great at this, but the quality of results in this usage pattern has taken a bit of a dive recently due to a re-write of the index last fall.| Weblog on marginalia.nu
This is a very brief post announcing a fascinating discovery. It appears to be possible to use the cosine similarity approach powering explore2.marginalia.nu as a substitute for the link graph in an eigenvector-based ranking algorithm (i.e. PageRank). The original PageRank algorithm can be conceptualized as a simulation of where a random visitor would end up if they randomly clicked links on websites. With this model in mind, the modification replaces the link-clicking with using explore2 for...| Weblog on marginalia.nu
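The random-visitor picture of PageRank described in the excerpt above can be sketched as a small power iteration over a weight matrix. This is a generic illustration, not the search engine's ranking code: the function name `rank`, the damping factor of 0.85, the iteration count and the toy matrix are all assumptions. The weights could come from links or, as the excerpt suggests, from a similarity measure.

```python
def rank(weights: list[list[float]], damping: float = 0.85, iterations: int = 50) -> list[float]:
    """Power iteration: weights[src][dst] is how strongly src points at dst."""
    n = len(weights)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        nxt = [(1.0 - damping) / n] * n
        for src, row in enumerate(weights):
            total = sum(row)
            if total == 0:
                # A node with no outgoing weight spreads its score evenly.
                for dst in range(n):
                    nxt[dst] += damping * scores[src] / n
                continue
            for dst, w in enumerate(row):
                nxt[dst] += damping * scores[src] * w / total
        scores = nxt
    return scores


# Hypothetical three-site graph: site 0 points at 1 and 2, which both point back at 0.
toy_scores = rank([[0, 1, 1], [1, 0, 0], [1, 0, 0]])
```

In this toy graph site 0 ends up with the highest score, since both other sites funnel their visitors to it.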
I don’t know if I’m just imagining it, but has the Internet gone progressively more crazy the last decade or so? It’s like everyone is so damn angry all the time. If they aren’t angry they’re bitter and resentful. And when they aren’t angry or bitter, they’re so depressed they’re barely able to crawl out of bed. And if they aren’t angry, bitter, or depressed, they have crippling anxiety. Every other week there’s some public blow-out where some person or another just loses ...| Weblog on marginalia.nu
The most common (and most costly) operation of the marginalia search engine’s index is something like: given a set of documents containing one keyword, find each document containing another keyword. The naive approach is to just iterate over each document identifier in the first set and do a membership test in the b-tree containing the second. This is an O(m log n)-operation, which on paper is pretty fast. It turns out it can be made faster.| Weblog on marginalia.nu
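The naive approach from the excerpt above can be sketched as follows. As a stand-in for the b-tree, this uses a sorted Python list with binary search, which has the same logarithmic lookup cost; the function name `intersect_naive` and the sample ids are made up for illustration.

```python
from bisect import bisect_left


def intersect_naive(first: list[int], second_sorted: list[int]) -> list[int]:
    """For each of the m ids in `first`, do an O(log n) membership test in `second_sorted`."""
    result = []
    for doc in first:
        i = bisect_left(second_sorted, doc)
        if i < len(second_sorted) and second_sorted[i] == doc:
            result.append(doc)
    return result


# Hypothetical document ids for two keywords.
common = intersect_naive([1, 4, 7, 9], [2, 4, 6, 7, 8])
```

Each lookup starts over from the top of the structure, which is where the O(m log n) total comes from.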
While this post is about programming, it also draws an extended analogy to Dungeons and Dragons, specifically two of its classes, that correspond to two attitudes toward programming. In D&D, wizards study magic. They prepare their magic spells ahead of time. While they may learn a large number of magic spells, they need to prepare them ahead of time and can’t just cast them at will. Wizard programmers prefer up-front design.| Weblog on marginalia.nu
I get my best ideas when I’m not working. This seems paradoxical, but past a point, the more I work on a project the slower it seems to go. I’ll find changes to do, but lose any sort of vision. If I’m not programming at all, I rarely get good ideas either. There appears to be some magic stoichiometric mixture where I work on a project for a while, then force myself to take a break somewhere far away from any keyboard for a day or two, and the ideas start to roll in at a pace where I can bar...| Weblog on marginalia.nu
One of the more common feature requests I’ve gotten for Marginalia Search is the ability to search by date. I’ve been a bit reluctant because this has the smell of a surprisingly hard problem. Or rather, a surprisingly large number of easy problems. The initial hurdle we’ll encounter is that among structured data, pubDate is available in RDFa, OpenGraph, JSON+LD, and Microdata. A few examples: <meta property="datePublished" content="2022-08-24" /> <meta itemprop="datePublished" conten...| Weblog on marginalia.nu
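As a rough illustration of one of the "easy problems" in the excerpt above, the sketch below pulls `datePublished` values out of `<meta>` tags using Python's standard `html.parser`. It only covers the `property` and `itemprop` attribute forms shown in the excerpt; the class and function names are invented, the second sample tag is made up, and this is not how the search engine itself extracts dates.

```python
from html.parser import HTMLParser


class PubDateParser(HTMLParser):
    """Collect content values of <meta> tags that declare datePublished."""

    def __init__(self) -> None:
        super().__init__()
        self.dates: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        # Accept either attribute form seen in the wild.
        declares_date = "datePublished" in (attr.get("property"), attr.get("itemprop"))
        if declares_date and attr.get("content"):
            self.dates.append(attr["content"])


def find_pub_dates(html: str) -> list[str]:
    parser = PubDateParser()
    parser.feed(html)
    parser.close()
    return parser.dates


# The first tag is from the excerpt; the second is a made-up sample.
sample = (
    '<meta property="datePublished" content="2022-08-24" />'
    '<meta itemprop="datePublished" content="2021-01-02">'
)
found = find_pub_dates(sample)
```

The returned strings are raw attribute values; normalizing the many date formats found in practice is a separate problem.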
By which I mean there are deeply problematic assumptions in the very notion of scaling: Scaling changes the rules, and scaling problems exist in both directions. If what you are doing effortlessly scales up, it almost always means it’s egregiously sub-optimal given your present needs. These assertions are all very abstract. I’ll illustrate with several examples, to try and build an intuition for scaling. You most likely already know what I’m saying is true, but you may need reminding th...| Weblog on marginalia.nu
A very brief note to announce reaching a long term goal and major milestone for marginalia search. The search engine now indexes 106,857,244 documents! The previous record was a bit south of seventy million. A hundred million has been a pie-in-the-sky goal for a very long time. It’s seemed borderline impossible to index that many documents on a PC. Turns out it’s not. It’s more than possible. Twice this may even be technically doable, but is way past the pain point of sheer logistics.| Weblog on marginalia.nu
In the primordial days of Marginalia Search, it used a dynamic approach to crawling the Internet. It ran a number of crawler threads, 32 or 64 or some such, that fetched jobs from a director service, that grabbed them straight out of the URL database, these jobs were batches of 100 or so documents that needed to be crawled. Crawling was not planned ahead of time, but rather decided through a combination of how much of a website had been visited, and the quality score of that website determine...| Weblog on marginalia.nu
I discovered someone has made a cryptocurrency called “Memex Marginalia Inu”. It appears to have been created February 23, which is around when the entry “I Have No Capslock And I Must Scream” went absurdly viral to the point where Elon Musk tweeted a link to it. I Have No Capslock… Mr Musk’s twitter orbit is exceptionally strange. The tweet was followed by a deluge of bizarre activity, strange emails with calls about stonk canine lunar expeditions, and apparently also a cryptocur...| Weblog on marginalia.nu
Bots are absolutely crippling the Internet ecosystem. The “future” in the film Terminator 2 is set in the 2020s. If you apply its predictions to the running of a website, it’s honestly very accurate. Modern bot traffic is virtually indistinguishable from human traffic, and can pummel any self-hosted service into the ground, flood any form with comment spam, and is a chronic headache for almost any small scale web service operator.| Weblog on marginalia.nu
I’d like to discuss a mental somersault that I’ve found has caused me a lot of grief in the past, which is prescriptive descriptions. Let’s break this down a bit: A descriptive statement is a statement about how something is. A prescriptive statement is a statement about how something must be, a rule or a law. If I stay up a bit late, do most of my work in the evenings, wake up tired and just sort of putter about until noon, I might describe myself as a night person because of this.| Weblog on marginalia.nu
I found myself effectively without a job on short notice. I’m not at all worried about finding another one, I have savings, and I have experience, and I have demonstrable skill. What I am concerned about is finding a source of income that’s compatible with putting some time into my personal projects. Last bunch of years, I’ve been working 32 hour weeks, which is a pretty sweet deal especially combined with the zero hour commute you get working from home during the pandemic.| Weblog on marginalia.nu
I’m going to think out loud for a moment about a problem I’m considering. RAM is a precious resource on any server. Look at VPS servers, and you’ll be hard pressed to find one with much more than 32 Gb. Look at leasing a dedicated server, and it’s the RAM that really drives up the price. My server has 128 Gb, and it’s so full it needs to unbutton its pants to sit down comfortably.| Weblog on marginalia.nu
This is mostly a post to complain about something that chafes. I wish there was a programming language (ideally several) that acknowledged that computers have hard drives, not just a processor, RAM and other_devices[]. Something that has struck me when I’ve been working with the search engine is how unfinished the metaphor for accessing physical disks is in most programming languages. It feels like an after-thought, half left to the operating system to figure out, a byzantine relic of the d...| Weblog on marginalia.nu
The search engine index has grown quite considerably the last few weeks. It’s actually surpassed 50 million documents, which is quite some milestone. In February it was sitting at 27-28 million or so. About 80% of this is side-loading all of stackoverflow and stackexchange, and part of it is additional crawling. The crawler has to date fetched 91 million URLs, but only about a third of what is fetched actually qualifies for indexing for various reasons, some links may be dead, some may be r...| Weblog on marginalia.nu
Let’s define a simple mathematical function, the function will perform integer factoring. It will take an integer, and return two integers, the product of which is the first integer. F(int32 n) = (int32 A, int32 B) so that A*B = n This is fairly straightforward, mathematical, objective. Let’s examine some answers an implementation might give. F 50 = (5, 10) on ARM; F 50 = (10, 5) on Intel| Weblog on marginalia.nu
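Both answers in the excerpt above satisfy the stated definition, because the definition does not pin down a unique result. A minimal sketch that enumerates every valid answer makes this visible; it assumes positive integers only, and the name `factor_pairs` is made up for illustration.

```python
def factor_pairs(n: int) -> list[tuple[int, int]]:
    """Return every ordered pair (a, b) of positive integers with a * b == n."""
    pairs = []
    for a in range(1, n + 1):
        if n % a == 0:
            pairs.append((a, n // a))
    return pairs


valid_answers = factor_pairs(50)
```

Any single pair from `valid_answers` is a correct return value for F 50, including (5, 10) and (10, 5), so two implementations can disagree while both meet the specification.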
I’ve caught some bug and don’t have the energy to write more than a brief note. I want to commemorate the fact that work on the Marginalia search engine started one year ago. The first commit was on February 26th 2021, and contained a sketch for a website crawler and some data models. In many ways, the paint is barely dry, yet it feels like this project has been around for a long while.| Weblog on marginalia.nu
Not what I had intended to do this Saturday, but a hard drive failed on the server this morning, or at least so it seemed. MariaDB server went down, dmesg was full of error messages for the nvme drive it’s running off. That’s a pretty important drive. The drive itself may actually be okay, the working hypothesis is either the drive itself or the bus overheated and reset. After a reboot the system seems fine.| Weblog on marginalia.nu
Black hat SEO is an endlessly fascinating phenomenon to study. This post is about some tactics they use to make their sites rank higher. The goal of blackhat SEO is to boost the search engine ranking of a page nobody particularly wants to see, usually ePharma, escort services, online casinos, shitcoins, hotel bookings; the bermuda pentagon of shady websites. The theory behind most modern search engines is that if you get links from a high ranking domain, then your domain gets a higher ranking as...| Weblog on marginalia.nu
I’ve been thinking a lot about how difficult it has become to discover quality content on the Internet, not because it isn’t there, but because the signal to noise ratio is really bad, and most venues of discovery don’t seem to be able to handle it. Recommendation algorithms seem to work almost too well, to the point where it’s all kind of just showing you things you already like, rarely anything new that you might like.| Weblog on marginalia.nu
A person might think I’m elusive, writing and working under a pseudonym. It’s not that I’m hiding, if you send me an email, I’ll respond to you with an email address containing a decent chunk of my real name. It’s not out of shame I wear clothes. Besides bringing utility, marginalia.nu is an experiment, a bit of an art project, a place to challenge conventions and see what is and isn’t necessary.| Weblog on marginalia.nu
As is often the case these dark winter seasons, I’ve fallen into a bit of a funk. Inspiration it seems is as rare as sunlight, and sunlight is scarce indeed in the winters of the north. I do know what is missing, novelty. I’ve fallen into consuming “content”. Infinite scroll is the torture rack of the spirit. What is necessary is doing new things and seeing new inspiring sights, exposing myself to new inspiring thoughts.| Weblog on marginalia.nu
At a previous job, we had a new and fancy office. The light switches were state of the art. There was an on button, and a separate off button. When you pressed the on button, the lights would fade on. When you pressed the off button, they would fade off. In the cloud somewhere were two functions that presumably looked a bit like this: fun turnOnLamp() { while (!bright()) increaseBrightness(); } fun turnOffLamp() { while (!| Weblog on marginalia.nu
The phenomenon of “normies” is an interesting one. The term itself is a bit problematic and not one I’d typically use, but as a phenomenon they are still worth investigating. Their perhaps biggest distinguishing feature is that they don’t get “it”, whatever it is. It’s tempting to think that these are an especially mindless type of person with no personality and little in terms of thought going on. I have a theory that normies may not actually exist.| Weblog on marginalia.nu
I’ve been thinking recently about the emphasis put on “new”, specifically for search engines, but the discussion has some merit even in a wider context. I will start wide and narrow down. It is common to conflate new with good, and most people who were young sometime between 1950 and 2000 will indeed have seen marvellous improvements in quality of life and technology with each passing year. In the light of that, it’s at least easy to explain how one might confuse the two.| Weblog on marginalia.nu
This is a response to the post “Making Gemini Easy” over on ~tomasino, and the title is a bit tongue-in-cheek haha-but-no-really. gemini://tilde.team/~tomasino/journal/20211103-making-gemini-easy.gmi I think the idea that we need to shield the users from how technology works is a terrible, terrible mistake. It disempowers the users, and concentrates power in the hands of a technological elite, and that divide is only going to grow. We already have an alarming number of people working with...| Weblog on marginalia.nu
I want you to consider for a moment all the human lifetime wasted in ideological stalemates on the Internet, all that energy, all that anger and frustration. Imagine if you took even a fraction of that time, and put it into creating something constructive instead, learning skills, doing anything meaningful. It boggles the mind, doesn’t it? It must amount to entire human lifetimes every week. Ideology, or in a wider sense, ethics, is all about what should be done.| Weblog on marginalia.nu
You are invited to a dinner party. After talking for a while food is served on the table. You pounce. “Haha, suckers!”, you think, and load all the food on your plate and leave nothing but scraps for the hosts. You feel victorious. Serves them right for inviting you into their home. You wolf down the food with ravenous appetite while they look on. That was tasty, but now you’ve got a piece of meat stuck between your teeth so you go to the bathroom and borrow some floss and use one of the hosts...| Weblog on marginalia.nu
There has been a bit of discussion over on Gemini recently regarding poorly behaved bots. I feel I need to add some perspective from the other side, as a bot operator (even though I don’t operate Gemini bots). Writing a web spider is pretty easy on paper. You have your standards, and you can test against your own servers to make sure it behaves before you let it loose. You probably don’t want to pound the server into silicon dust, so you add a crawl delay and parallelize the crawling, and...| Weblog on marginalia.nu
A recurring problem when searching for text is identifying which parts of the text are in some sense useful. A first order solution is to just extract every word from the text, and match documents against whether they contain those words. This works really well if you don’t have a lot of documents to search through, but as the corpus of documents grows, so does the number of matches. It’s possible to bucket the words based on where they appear in the document, but this is not something I...| Weblog on marginalia.nu
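The first-order solution described in that excerpt can be sketched roughly as follows. This is a minimal illustration and not the search engine's actual code: the tokenization rule, the function names, and the sample documents are all assumptions made for the example.

```python
import re
from collections import defaultdict


def extract_words(text):
    """Lowercase the text and return the set of alphanumeric words in it."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in extract_words(text):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = extract_words(query)
    if not words:
        return set()
    # .get() avoids creating empty entries for unknown words
    matches = [index.get(word, set()) for word in words]
    return set.intersection(*matches)


# Hypothetical sample corpus
documents = {
    1: "The quick brown fox",
    2: "A quick guide to search engines",
    3: "Search engines index the web",
}
index = build_index(documents)
```

With only three documents every query is selective; the growth in matches as the corpus gets larger is exactly what this naive approach does nothing about.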
Optimization is arguably a lot about intuition. You have a hunch, and see if it sticks. Sure, you can use profilers and instrumentation, but they are more like hunch generators than anything else. This one wasn’t as intuitive, at least not to me, but it makes sense when you think about it. I have an 8 GB file of dense binary data. This data consists of 4 KB chunks and is an unsorted list containing first a URL identifier with metadata and then a list of word identifiers.| Weblog on marginalia.nu
I’ve been dealing with a botnet for the last few days that’s been sending junk search queries at an increasingly aggressive rate. They were reasonably easy to flag and block, but they just kept increasing the rate until that stopped working. Long story short, my patience ran out and I put my website behind Cloudflare. I didn’t want to have to do this, because it does introduce a literal man in the middle and that kinda undermines the whole point of HTTPS, but I just don’t see any way around it.| Weblog on marginalia.nu
An idea I’ve had for a long time with regards to navigating the web is to find a way to browse it. “Browse” is a difficult word to use, because it has a newer connotation of just using a web browser. I mean it in the old pre-Internet sense: browse like when you flip through a magazine, or peruse an antiques shop, not really looking for anything in particular, just sort of seeing if anything catches your eye.| Weblog on marginalia.nu
Since my search engine has expanded its scope to include blogs as well as primordial text documents, I’ve done some thinking about how to keep up with newer websites that actually grow and see updates. Otherwise, as the crawl goes on, it tends to find fewer and fewer interesting web pages, and as the interesting pages are inevitably crawled to exhaustion, accumulate an ever growing amount of junk. Re-visiting each page and looking for new links in previously visited pages is probably off th...| Weblog on marginalia.nu
The last few days I’ve felt like my first attempt at a ranking algorithm for the search engine was pretty good, like it was producing some pretty interesting results. It felt close to what I wanted to accomplish. The first ranking algorithm was a simple link-counting algorithm that did some weighting to promote pages that look a certain way. It did seem to keep the page quality up, but as a strange side effect it also seemed to promote very “1996”-looking websites.| Weblog on marginalia.nu
Been working on improving Marginalia Search query parsing and understanding. This is going to be a pretty long update, as it’s a few months’ work. Apart from cleaning up the somewhat messy query parsing code, a problem I’m trying to address is that the search engine is currently only good at dealing with fairly focused queries, they don’t need to be short, but if you try to qualify a search that is too broad by adding more terms, it often doesn’t produce anything useful.| www.marginalia.nu
So the search engine is moving to a new server soon, thanks to the generous grant mentioned recently. If you visit search.marginalia.nu now, it may or may not use the old or new server. It’ll be like this for a while, since I need them both for testing and maintenance type work. I’ll also apologize if this post is a bit chaotic. It is a reflection of a very chaotic couple of weeks that apart from setting up this migration also involved a very short notice invitation for a presentation at ...| www.marginalia.nu
A year ago I walked out of the office for the last time. I handed in my corpo laptop, said some good-byes, and since then I have been my own boss. This first year has been funded by an NLnet grant, which I’m in the midst of wrapping up. As of now, the work is all done, the final request for payment has been sent. There’s a similar last-day-of-school levity to both these events.| www.marginalia.nu
The best description of my problem solving process is the Feynman algorithm, which is sometimes presented as a joke where the hidden subtext is “be smart”, but I disagree. The “algorithm” is a surprisingly lucid description of how thinking works in the context of hard problems where the answer can’t simply be looked up or trivially broken down, iterated upon in a bottom-up fashion, or approached with similar methods. Feynman’s thinking algorithm is described like this:| www.marginalia.nu
I’ve experimentally replaced some of the Java implementations of quicksort and binary search with calls to C++ code, and saw huge benefits for the sorting code but the same or worse performance for binary search. The Marginalia Search engine is mainly written in Java, which is a language that is good at many things, but not particularly pleasant to work with when it comes to low level systems programming. Unfortunately, a part of building an internet search engine involves database-adjacent l...| www.marginalia.nu
There is an episode of Star Trek where a character is for plot reasons trapped in a shrinking parallel universe. As time passes, people she knows one by one just vanish and she is the only one who seems to notice. Eventually it gets to an absurd point. She asks if it really makes sense if a ship made for a thousand people would have a crew of a few people, and everyone just sort of like shrugs and looks at her like she’s crazy.| www.marginalia.nu
In general I don’t like to fuss over code, but this is exactly what I’ve been doing in preparation of the NLnet funded work. I’ve spent the last month restructuring Marginalia’s code base. It’s not completely done, but I’ve made great headway. Things got the way they got because in general for experimental solo-development projects, I think it makes sense to be fairly tolerant of technical debt. Since refactoring is something that is extremely difficult to break up into parallel t...| www.marginalia.nu
No time like the project’s two year anniversary to drop this particular bomb… Marginalia’s gotten an NLnet grant. This means I’ll be able to work full time on this project for at least a year. https://nlnet.nl/project/Marginalia/ This grant is essentially the best-case scenario for funding this project. It’ll be able to remain independent, open-source, and non-profit. I won’t start in earnest for a few months as I’ve got loose ends to tie up before I can devote that sort of time.| www.marginalia.nu
For clarification, this is discussing no other thing called Memex than memex.marginalia.nu, the website you’re probably visiting right now. That, or you’re reading this over gemini at marginalia.nu, which is serving the same content over a different protocol. I wanted to build a cross-protocol static site generator designed in a way that is equally understandable by both humans and machines. This groundedness is an appealing property I really admire about the gemini protocol and gemtext f...| www.marginalia.nu
This is a write-up about an experiment from a few months ago, in how to find websites that are similar to each other. Website similarity is useful for many things, including discovering new websites to crawl, as well as suggesting similar websites in the Marginalia Search random exploration mode. A link to a slapdash interface for exploring the experimental data. The approach chosen was to use the link graph to look for websites that are linked to from the same websites.| www.marginalia.nu
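As a rough illustration of finding websites that are linked to from the same websites, the sketch below scores two sites by how much their sets of inbound linking sites overlap. The Jaccard measure, the function names, and the example domains are assumptions made for the example; the excerpt does not say which similarity measure the experiment actually used.

```python
def inbound_sets(links):
    """Invert (source, target) link pairs into target -> set of linking sources."""
    inbound = {}
    for source, target in links:
        inbound.setdefault(target, set()).add(source)
    return inbound


def similarity(inbound, a, b):
    """Jaccard overlap between the sets of sites linking to a and to b.

    Jaccard is an assumed stand-in; other overlap measures would slot in here.
    """
    sources_a = inbound.get(a, set())
    sources_b = inbound.get(b, set())
    union = sources_a | sources_b
    return len(sources_a & sources_b) / len(union) if union else 0.0


# Hypothetical link graph as (linking site, linked site) pairs
links = [
    ("blog-a.example", "site-x.example"),
    ("blog-a.example", "site-y.example"),
    ("blog-b.example", "site-x.example"),
    ("blog-b.example", "site-y.example"),
    ("blog-c.example", "site-z.example"),
]
inbound = inbound_sets(links)
```

Here `site-x.example` and `site-y.example` share both of their inbound linkers and score as highly similar, while `site-z.example` shares none with either.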
Anchor texts are a very useful source of keywords for a search engine, and an older version of the search engine used the text of such hyperlinks as a supplemental source of keywords, but due to a few redesigns, this feature has fallen off. The last few days have been spent trying to re-implement it in a new and more powerful fashion. This has largely been enabled by a crawler re-design from a few months ago, which offers the crawled data in a lot more useful fashion and allows ...| www.marginalia.nu
After a bit of soul searching with regards to the future of the website, I’ve decided to open source the code for marginalia.nu, all of its services, including the search engine, encyclopedia, memex, etc. A motivating factor is the search engine has sort of grown to a scale where it’s becoming increasingly difficult to productively work on as a personal solo project. It needs more structure. What’s kept me from open sourcing it so far has also been the need for more structure.| www.marginalia.nu
There are a lot of ways of building software, there are many languages you could choose to build it with, many libraries to rely on, many frameworks to leverage, many architectural approaches, many platforms to choose, many paradigms of daily operations to follow. It takes years to get in-depth experience with just one permutation of these options. I’ve been programming for over twenty years, only half the time professionally, but that is how long I’ve been building software.| www.marginalia.nu
I’ve been working lately on a bit of an overhaul of how the search engine does indexing. How it indexes its indices. “Index” is a bit of an overloaded term here, and it’s not the first that will crop up. Let’s start from the beginning and build up and examine the problem of searching for a number in a list of numbers. You have a long list of numbers, let’s sort them because why not.| www.marginalia.nu
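Once the list is sorted, the textbook way to look up a number is binary search. The excerpt stops before naming a method, so treating binary search as the next step is an assumption here, and the helper name and sample numbers are made up for the example.

```python
from bisect import bisect_left


def contains(sorted_numbers, target):
    """Binary search: report whether target occurs in an ascending sorted list."""
    # bisect_left returns the leftmost position where target could be inserted
    i = bisect_left(sorted_numbers, target)
    return i < len(sorted_numbers) and sorted_numbers[i] == target


# Hypothetical data, sorted up front so that binary search applies
numbers = sorted([42, 7, 19, 3, 88, 61])
```

Each lookup halves the remaining range, so it needs on the order of log2(n) comparisons instead of scanning the whole list.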
In a near future, a team of desktop computer designers are looking at the latest telemetry and updating the schematics of the hardware-as-a-service self-assembling nanohardware. Steve: “Hmm, they don’t seem to be using the power button very often.” Bob: “Compared to the other buttons, it’s only used 0.1% of the time” Steve: “Remove it?” Bob: “Remove it!” Computers now instantly boot up when plugged into the wall, and run until the plug is pulled.| www.marginalia.nu
It’s been a productive several weeks. I’ve got the feature pulling updates from RSS working, as mentioned earlier. I’ve spent the last weeks working on the search engine’s web design, and did the MEMEX too for good measure. It needed to be done, as the blog theme that the design was previously based on had several problems, including loading a bunch of unnecessary fonts, and not using the screen space of desktop browsers well at all.| www.marginalia.nu
This entry is about a few problems the search engine has been struggling with lately, and how I’ve been attempting to remedy them. Before the article starts, I wanted to share an amusing new thing in the world of Internet spam. For a while, people have been adding things like “reddit” to the end of their Google queries to get less blog spam. Well, guess what? The blog spammers are adding “reddit” to the end of their titles now.| www.marginalia.nu
Search results are only as good as the search engine’s ability to figure out what a page is about. Sure, a keyword may appear in a page, but is it the topic of the page, or just some off-hand mention? I didn’t really know anything about data mining or keyword extraction starting out, so I’ve had to learn on the fly. I’m just going to briefly list some of my first naive attempts at keyword extraction, just to give some context.| www.marginalia.nu
This is a reply to a series of posts on anglo-centrism in programming languages that have been floating around in Gemini lately. gemini://nytpu.com/gemlog/2021-10-31.gmi gemini://alsd.eu/en/2021-11-04-thoughts-anglocentrism-cs.gmi Around thirty years ago I was a kid with a computer. I learned to program quite a few years before I learned English. I also used DOS without understanding English. I knew what to type to do things, but I didn’t know what the words meant. I could start programs, ...| www.marginalia.nu
There are a lot of small websites on the Internet: Interesting websites, beautiful websites, unique websites. Unfortunately they are incredibly hard to find. You cannot find them on Google or Reddit, and while you can stumble onto them with my search engine, it is not in a very directed fashion. It is an unfortunate state of affairs. Even if you do not particularly care for becoming the next big thing, it’s still discouraging to put work into a website and get next to no traffic beyond the ...| www.marginalia.nu
This is a theory that’s previously been stated in log/39-normie-hypothesis.gmi, but I think it’s worth expanding on as it’s become very relevant with the recent Reddit shit-show actualizing just how bad that website has gotten along with social media in general. I think the model demonstrates how the ’enshittification’ process is an inevitability with any social media that is run on a venture capital model. An online community can be like a village, where you have familiar faces, col...| www.marginalia.nu