This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. What is this search engine’s name? Let’s call it Marginalia Search as that’s what most people seem to do. There is some confusion, perhaps a self-inflicted problem, as I’m not really into branding and logos, and to make matters worse I’ve used a lot of different internal names, including Astrolabe and Edge Crawler. But most people seem to favor “marginalia search”.| Marginalia Search on marginalia.nu
This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. An API for the search engine is available through api.marginalia.nu. The API is simple enough to be self-explanatory. Examples: https://api.marginalia.nu/public/ https://api.marginalia.nu/public/search/json+api https://api.marginalia.nu/public/search/json+api?index=0 https://api.marginalia.nu/public/search/json+api?index=0&count=10 The ‘index’ parameter selects the search index, corre...| Marginalia Search on marginalia.nu
This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. This search engine is a small non-profit operation, and I don’t want it to cause any inconvenience. If it is indeed being a nuisance, please let me know! Send an email to kontakt@marginalia.nu and I’ll do my best to fix it as soon as possible. Telling me lets me fix whatever problem there is much faster, and if you are experiencing problems, then so are probably others as well.| Marginalia Search on marginalia.nu
This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. I’m just one guy building all of this on my own. I’d like to expand the search engine and make it more useful. My hope is that it will bring value to its users and enable a thriving independent Internet. The search engine doesn’t have any secret sauce: all the source code is publicly available and, as far as is legally and logistically possible, the data is also available.| www.marginalia.nu
As some of the work planned for Marginalia Search this year has been progressing a bit faster than anticipated, there was time to implement an unplanned change. This post details the implementation of a system for detecting when servers are online, to avoid serving dead links and improve data quality, and for detecting when websites have significant changes including ownership transfers and parking. Table Of Contents Feature Rationale Data Representation Live Data Event Data Change Detection ...| www.marginalia.nu
The most recent change to the search engine is a system that profiles websites based on their rendered DOM. The goal is identifying advertisements, trackers, nuisance popovers, and similar elements. The search engine already tries to do this, but isn’t very good at it because it’s only looking at static code. It turns out to be somewhat difficult to determine what a website that has non-trivial javascript will look like based on its source code alone, as this would require us to among other ...| www.marginalia.nu
The search engine has recently gained the ability to index the PDF file format. The change will deploy over a few months. Extracting text information from PDFs is a significantly bigger challenge than it might seem. The crux of the problem is that the file format isn’t a text format at all, but a graphical format. It doesn’t have text in the way you might think of it, but more of a mapping of glyphs to coordinates on “paper”.| www.marginalia.nu
A problem the search engine’s crawler has struggled with for some time is that it takes a fairly long time to finish up, usually spending several days wrapping up the final few domains. This has become more relevant recently, since the migration to slop crawl data has dropped memory requirements of the crawler by something like 80%, and as such I’ve been able to increase the number of crawling tasks, which has led to a bizarre case where 99.| www.marginalia.nu
I’m happy and grateful to announce that the Marginalia Search project has been accepted for a second NLnet grant. Not all the details are finalized yet, but tentatively the grant will go toward addressing most of the items in the project roadmap for 2025. I’ve already been working full time on the project since summer 2023, and this grant secures additional development time, and extends the runway to a comfortable degree.| www.marginalia.nu
This update is a few days late; the canonical birth date of the project is Feb 26. It has been another year of Marginalia Search. The project is still ongoing and still my full time job, although it is entering a somewhat more mature phase of development: most of the big pieces are in place and do a decent job at what they do. The roadmap for the project is available on GitHub.| www.marginalia.nu
A while back an update went live that, with some caveats, changes the time it takes for an update on a website to reflect in the search engine index from up to 2 months to 1-2 days. The condition is that the website has an RSS or Atom feed. The big crawl job takes about two months, and is run partition by partition, meaning there’s typically a slice of the index that is two months stale at any given point in time.| www.marginalia.nu
I recently put together a small library called Slop, for intermediate on-disk data representation for the search engine, replacing a few ad-hoc formats I had in place before. This post isn’t so much an attempt to convince anyone else to use this library, as it makes trade-offs catering to a fairly niche use case, but to explore some of its design ideas, as it all came together very nicely, in the hopes that other libraries can draw ideas from it.| www.marginalia.nu
Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document in exactly the same order as they do in the query. This is a write-up about implementing this change. This is going to be a relatively long post, as it represents about 4 months of work. I’m also happy and grateful to announce that the NLnet people reached out after the run of the gra...| www.marginalia.nu
The project has been haunted by a mysterious bug since sometime in February. It relates to the code that constructs the index, particularly the code that merges partial indices. In short, the search engine constructs the reverse index through successive merging of smaller indices, which reduces the overall memory requirement. You can conceptualize the reverse index itself as two files, one with offset pointers into another file, which has sorted numbers. This code runs after each partition finishes...| www.marginalia.nu
I set out a little over a week ago to add a service registry to Marginalia Search, primarily to reduce its dependence on Docker. I would like it to be able to run on bare metal as well, which poses a problem since configuring the application manually is a bit of a headache with dozens of ports that need to be set up. It would also be desirable to be able to run multiple instances of important services in order to eliminate downtime during upgrades.| Weblog on marginalia.nu
Been working on improving Marginalia Search query parsing and understanding. This is going to be a pretty long update, as it’s a few months’ work. Apart from cleaning up the somewhat messy query parsing code, a problem I’m trying to address is that the search engine is currently only good at dealing with fairly focused queries. They don’t need to be short, but if you try to qualify a search that is too broad by adding more terms, it often doesn’t produce anything useful.| www.marginalia.nu
A year ago I walked out of the office for the last time. I handed in my corpo laptop, said some good-byes, and since then I have been my own boss. This first year has been funded by an NLnet grant, which I’m in the midst of wrapping up. As of now, the work is all done, the final request for payment has been sent. There’s a similar last-day-of-school levity to both these events.| www.marginalia.nu
I’ve experimentally replaced some of the Java implementations of quicksort and binary search with calls to C++ code, and saw huge benefits for the sorting code but the same or worse performance for binary search. The Marginalia Search engine is mainly written in Java, which is a language that is good at many things, but not particularly pleasant to work with when it comes to low level systems programming. Unfortunately, a part of building an internet search engine involves database-adjacent l...| www.marginalia.nu
This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. Last Updated: 2024-05-21. This privacy policy is in effect on search.marginalia.nu. Technology use: Javascript: Minimal; Cookies: No; Local Storage: No; Tracking Pixels: No; Social Media Buttons: No; Third Party Requests: No; CDN: No; Access Log Retention: Up to 24h. Search queries are very sensitive and private and as a result logging is at a minimum. No information about which links are clicked is gather...| www.marginalia.nu
This information is outdated. The marginalia search project info now lives on about.marginalia-search.com. Ever feel like the Internet has gotten a bit… I don’t know, samey? There’s funny images scrolling by and you blow some air through your nose and keep scrolling and then someone has done something upsetting and you write an angry comment and then you scroll some more. Remember when you used to explore the Internet, when you used to discover cool little websites made by people and it ...| www.marginalia.nu