Blog/2026-04-27/Fuzzy Wiki Search

Now, with semantic search

One thing that has annoyed me about this wiki is that the search bar shows inline results based on a prefix search, and asking for a full search gets you a full-text search. But sometimes I only remember something I've written about by the vague idea it relates to, and then it becomes quite hard to find. These days we have vector embeddings, so I thought I'd add embeddings and vector search to my wiki.

To be honest, I've found the results quite pleasing. So here's what I did, which is pretty bog-standard stuff. In my case, I had a few constraints:

  • I wanted the search to be fast enough to run inline
  • I didn't want to pay a variable cost per search
  • I don't mind load-shedding, but I do mind the service disappearing entirely
  • I wanted it to be amenable to LLM-assisted building

These constraints mostly pushed me towards a local solution that doesn't rely on cloud services. GPT embeddings are good, but they have a variable cost, I'd have to implement rate-limiting, and I'd need to make sure I always have credits; I already have a lot of billing fatigue from the various credit models everywhere. The other convenience is that this site runs on an old AMD EPYC 7402P 24-core system with 128 GB of RAM, which is far more than it needs. It's barely exercised, the fans barely spin, and it generally has plenty of room to do more work.

Platform


These days, when I write code, I prefer it to be either Rust or Go: some degree of static validation lets LLMs iterate until they produce acceptable code. For this particular project, Hugging Face has a minimal ML framework called Candle that works well with the models on their site, so I decided on Rust.

One big downside of Rust is that cargo crates can include custom build.rs build scripts, which are completely unsandboxed. This is a problem because any supply-chain attack can compromise the entire machine. I stuck to a handful of first-class crates here, but in the future I'll probably have to build some kind of constrained build environment for the ecosystems that allow arbitrary code execution on dev machines: npm, cargo, pip, and so on.

Embedding Models


It turns out there are a lot of these models that run quite well on a CPU for a corpus this small. I'd have to make some trade-offs, since the various models offer different functionality. These are the ones I considered:

Open-weight embedding models considered

Model                              | Max tokens | Dim                               | Notes
BAAI/bge-small-en-v1.5             | 512        | 384                               | Loads as-is via candle's stock BertModel
BAAI/bge-base-en-v1.5 (in use)     | 512        | 768                               | Same
BAAI/bge-large-en-v1.5             | 512        | 1024                              | Same
mixedbread-ai/mxbai-embed-large-v1 | 512        | 1024 (Matryoshka → 768/512/256)   | Same; output vector can be truncated client-side
nomic-ai/nomic-embed-text-v1.5     | 8192       | 768 (Matryoshka → 512/256/128/64) | Doesn't load with the stock BertModel; uses rotary position embeddings

If I want whole documents properly embedded, I need a large context size, but rotary position embeddings aren't supported by the stock BertModel in upstream candle, so I settled for the simplest model that would work: start with bge-base-en-v1.5, and if it's good enough, keep it.
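
As a rough sketch of what the embedding path looks like with candle's stock BertModel: the file paths below are placeholders for a locally downloaded copy of bge-base-en-v1.5, I've used mean pooling (bge's reference usage pools the [CLS] token instead), and the exact forward() signature depends on which candle-transformers version you pin.

  // Embed one string with candle's stock BertModel (sketch, not the real service code).
  use anyhow::{Error as E, Result};
  use candle_core::{Device, Tensor};
  use candle_nn::VarBuilder;
  use candle_transformers::models::bert::{BertModel, Config, DTYPE};
  use tokenizers::Tokenizer;

  fn embed(text: &str) -> Result<Vec<f32>> {
      let device = Device::Cpu;
      // Placeholder paths: config.json, tokenizer.json and model.safetensors
      // fetched ahead of time from BAAI/bge-base-en-v1.5.
      let config: Config = serde_json::from_str(&std::fs::read_to_string("bge-base/config.json")?)?;
      let tokenizer = Tokenizer::from_file("bge-base/tokenizer.json").map_err(E::msg)?;
      let vb = unsafe {
          VarBuilder::from_mmaped_safetensors(&["bge-base/model.safetensors"], DTYPE, &device)?
      };
      let model = BertModel::load(vb, &config)?;

      // Tokenize; texts longer than 512 tokens need truncation enabled on the
      // tokenizer (or truncated beforehand).
      let enc = tokenizer.encode(text, true).map_err(E::msg)?;
      let ids = Tensor::new(enc.get_ids(), &device)?.unsqueeze(0)?;
      let type_ids = ids.zeros_like()?;

      // Recent candle-transformers versions take an optional attention mask as
      // a third argument; older ones only take the first two.
      let hidden = model.forward(&ids, &type_ids, None)?;

      // Mean-pool over tokens, then L2-normalise so cosine similarity is a dot product.
      let (_b, n_tokens, _h) = hidden.dims3()?;
      let pooled = (hidden.sum(1)? / (n_tokens as f64))?;
      let normed = pooled.broadcast_div(&pooled.sqr()?.sum_keepdim(1)?.sqrt()?)?;
      Ok(normed.squeeze(0)?.to_vec1::<f32>()?)
  }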

Context Limits


The BERT models only use the first 512 tokens of context, roughly the first 1.8k characters or so, which makes them better on shorter documents than longer ones. This left quite a few documents effectively missing from search results, but then I realized something: every post already has a <meta> description tag, written by Claude as a summary of the document. I could chunk each document, store k vectors per chunk, and max-merge them all, but perhaps it would be sufficient to max-merge just two vectors: the document text, as far as it fits in the context window, and the LLM-written description. Surprisingly, this worked quite well, and I'm going to stick with it.
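
Here's a minimal sketch of that max-merge idea, with struct and function names that are mine rather than the plugin's: each page keeps two L2-normalised vectors, and a query's score for the page is the better of the two cosine similarities.

  // Max-merge scoring sketch: one vector for the (truncated) body text, one for
  // the LLM-written description; a page scores as the max of the two.
  struct PageVectors {
      title: String,
      body: Vec<f32>,        // embedding of the first ~512 tokens of the page
      description: Vec<f32>, // embedding of the <meta> description
  }

  // Vectors are L2-normalised at embed time, so the dot product is cosine similarity.
  fn dot(a: &[f32], b: &[f32]) -> f32 {
      a.iter().zip(b).map(|(x, y)| x * y).sum()
  }

  fn score(query: &[f32], page: &PageVectors) -> f32 {
      dot(query, &page.body).max(dot(query, &page.description))
  }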

The description tags are written by a bot, AutoDescriptor (Special:Contributions/AutoDescriptor), that goes around adding them, though I override or write them myself where necessary.

Structure


So in the end what I settled for is:

  • a Semantic Search Service that holds the model code and the embedded vectors and responds to queries (sketched in code below this list)
  • a MediaWiki plugin that has:
    • a hook that re-embeds a page on update
    • an inline search hook that, once 3 or more characters have been typed, queries the semantic search service and appends its results
    • a search hook that also adds the semantic search results to the bottom of the full search results page
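
To make the shape concrete, here's a hedged sketch of the service's two entry points, leaving out the HTTP layer and reusing embed(), PageVectors, and score() from the earlier sketches; in the real service the model is loaded once at startup rather than inside embed().

  use std::collections::HashMap;
  use std::sync::RwLock;

  // In-memory index: page title -> stored vectors. The corpus is small, so a
  // brute-force scan per query is plenty fast.
  struct SearchService {
      index: RwLock<HashMap<String, PageVectors>>,
  }

  impl SearchService {
      // Called from the MediaWiki update hook: re-embed a page after it is saved.
      fn upsert(&self, title: &str, body: &str, description: &str) -> anyhow::Result<()> {
          let page = PageVectors {
              title: title.to_string(),
              body: embed(body)?,
              description: embed(description)?,
          };
          self.index.write().unwrap().insert(title.to_string(), page);
          Ok(())
      }

      // Called from the inline-search and full-search hooks once the query is
      // three or more characters: embed it and rank every page by max-merge score.
      fn query(&self, q: &str, k: usize) -> anyhow::Result<Vec<(String, f32)>> {
          let qv = embed(q)?;
          let index = self.index.read().unwrap();
          let mut scored: Vec<_> = index
              .values()
              .map(|p| (p.title.clone(), score(&qv, p)))
              .collect();
          scored.sort_by(|a, b| b.1.total_cmp(&a.1));
          scored.truncate(k);
          Ok(scored)
      }
  }

The plugin hooks then just call these two operations over HTTP and render whatever comes back.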

And the results are quite nice. The whole thing runs super-fast because the corpus is small and the model is light enough not to be the bottleneck. Overall, I'm quite pleased with the outcome!