The first three iterations of RawWeb.org's tech stack
RawWeb.org is a search engine project I launched in August 2024. The initial goal was to help more people discover personal digital gardens that are often overlooked by mainstream search engines; I also wanted to explore some interesting tech stacks through hands-on implementation.
Currently, it has indexed 17k sites and 615k articles. Feel free to submit your favorite independent blogs.
This article only represents my personal experience and views.
Middleware
PostgreSQL is used as the database instead of SQLite, because I might need Postgres’s rich extension ecosystem in the future. Redis is used for caching, and RabbitMQ as the message queue.
Additionally, a search engine requires crawler and full-text search capabilities.
Elasticsearch is used for full-text search. The reason for not implementing inverted indexing myself or using lightweight solutions like Meilisearch is that ES has better Chinese tokenizers.
To reduce potential risks and development complexity, the crawler only obtains data from websites’ RSS feeds. Therefore, the crawler is simply implemented as an HTTP requester and RSS parser.
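For illustration, here’s a minimal sketch of what such a crawler can look like in Go. The gofeed library and the feed URL are my own choices for the example; the article doesn’t specify which RSS parser is actually used.

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/mmcdole/gofeed"
)

func main() {
	// A plain HTTP client with a timeout is enough; no headless browser needed.
	fp := gofeed.NewParser()
	fp.Client = &http.Client{Timeout: 15 * time.Second}

	// Hypothetical feed URL, for illustration only.
	feed, err := fp.ParseURL("https://example.com/feed.xml")
	if err != nil {
		panic(err)
	}

	for _, item := range feed.Items {
		// Each feed item maps roughly to one article to be indexed.
		fmt.Println(item.Title, item.Link)
	}
}
```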
To keep things simple, all of the above components are deployed as single nodes, without any optimization tricks (I wouldn’t know how anyway).
Multi-language Content Support
This is a search engine capable of indexing content in multiple languages, where tokenization quality determines search result quality.
To configure specialized tokenizers for different languages, multiple fields are set up in Elasticsearch, such as content-en and content-zh, to store content in each language.
This involves:
- Natural language detection
- Routing content to dedicated fields with specialized tokenizers
First, clean the raw content:
- Parse HTML, remove useless tags like style, script;
- Remove code, URLs, and other content as much as possible to avoid affecting language detection accuracy;
- Remove HTML and XML tags to get plain text;
- Remove excess whitespace characters.
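Here is a minimal sketch of that cleaning step, using golang.org/x/net/html. The exact set of tags to drop (script, style, code, pre) is an assumption; the point is just to keep visible text and collapse whitespace before detection.

```go
package main

import (
	"fmt"
	"strings"

	"golang.org/x/net/html"
)

// Tags whose contents should never reach the language detector.
var skip = map[string]bool{"script": true, "style": true, "code": true, "pre": true}

// extractText walks the DOM and collects visible text, dropping skipped subtrees.
func extractText(n *html.Node, sb *strings.Builder) {
	if n.Type == html.ElementNode && skip[n.Data] {
		return
	}
	if n.Type == html.TextNode {
		sb.WriteString(n.Data)
		sb.WriteString(" ")
	}
	for c := n.FirstChild; c != nil; c = c.NextSibling {
		extractText(c, sb)
	}
}

// CleanHTML parses raw HTML and returns plain text with whitespace collapsed.
func CleanHTML(raw string) (string, error) {
	doc, err := html.Parse(strings.NewReader(raw))
	if err != nil {
		return "", err
	}
	var sb strings.Builder
	extractText(doc, &sb)
	return strings.Join(strings.Fields(sb.String()), " "), nil
}

func main() {
	text, _ := CleanHTML(`<p>Hello <b>world</b><script>ignored()</script></p>`)
	fmt.Println(text) // Hello world
}
```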
Then identify the content’s language. There are two approaches:
The first is lingua, which has implementations in Python, Go, and other languages. It has excellent performance and accuracy, and allows selective loading of language models. The downside is that it increases the executable size by about 100MB.
The second is Elasticsearch’s built-in lang_ident_model_1, which has to be called through an ingest pipeline. In testing, its accuracy was good but performance was an issue: on the same data it was 4 times slower than the Python version of lingua, even though lingua was running on lower-spec hardware. I suspect this is because lang_ident_model_1 has to test against all supported languages, while lingua only needs to load a few language models.
Considering performance and flexibility, lingua was ultimately chosen. Lingua has high- and low-accuracy modes; low accuracy offers roughly a 2x performance improvement without significant accuracy loss for inputs over 120 characters. So currently a hybrid of high- and low-accuracy detection is used, with the title plus a sample of the content as input. In actual testing, detecting one article takes only about 100μs.
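A sketch of what such hybrid detection could look like with lingua-go. The 120-character threshold comes from the observation above; the routing rule and the candidate-language list are assumptions, not the actual implementation.

```go
package main

import (
	"fmt"

	"github.com/pemistahl/lingua-go"
)

// Only the languages that get dedicated Elasticsearch fields are loaded,
// which keeps model size and detection time down.
var candidates = []lingua.Language{
	lingua.Chinese, lingua.English, lingua.Spanish, lingua.Russian,
	lingua.German, lingua.French, lingua.Japanese,
}

var (
	highAccuracy = lingua.NewLanguageDetectorBuilder().
			FromLanguages(candidates...).
			Build()
	lowAccuracy = lingua.NewLanguageDetectorBuilder().
			FromLanguages(candidates...).
			WithLowAccuracyMode().
			Build()
)

// DetectLanguage routes long samples through the faster low-accuracy mode
// and short ones (e.g. just a title) through the high-accuracy mode.
// The threshold and routing rule are assumptions.
func DetectLanguage(sample string) (lingua.Language, bool) {
	if len([]rune(sample)) > 120 {
		return lowAccuracy.DetectLanguageOf(sample)
	}
	return highAccuracy.DetectLanguageOf(sample)
}

func main() {
	if lang, ok := DetectLanguage("languages are awesome"); ok {
		fmt.Println(lang) // English
	}
}
```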
Once the content’s language is determined, the best tokenizer can be set for it. Based on W3Techs’ estimated internet content distribution, separate tokenizers are set for the most mainstream languages - Chinese, English, Spanish, Russian, German, French, and Japanese, while other languages use the default tokenizer.
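For example, the per-language fields could be mapped roughly like this. The analyzer names are assumptions: "english" is built into Elasticsearch, while "smartcn" (Chinese) and "kuromoji" (Japanese) require the corresponding analysis plugins, and the article doesn’t say which tokenizers are actually installed.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// Assumed index mapping. The remaining mainstream languages would get similar
// fields; everything else falls into the default "content" field.
const articleMapping = `{
  "mappings": {
    "properties": {
      "content-en": { "type": "text", "analyzer": "english" },
      "content-zh": { "type": "text", "analyzer": "smartcn" },
      "content-ja": { "type": "text", "analyzer": "kuromoji" },
      "content":    { "type": "text" }
    }
  }
}`

func main() {
	// Plain HTTP keeps the example independent of any Elasticsearch SDK.
	req, err := http.NewRequest(http.MethodPut, "http://localhost:9200/articles",
		strings.NewReader(articleMapping))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("create index:", resp.Status)
}
```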
Backend
The crawler is a simple Go program. The main backend went through three iterations with Django, Nest.js, and Go.
v1 - Django
Tech stack:
- Django v5
- django-ninja as API endpoint
- huey as task queue, though I only used it for managing scheduled tasks
- uv as package manager
Django had been recommended to me multiple times before, and I wanted to learn a batteries-included framework through this project. Since it ships with Django Admin, I used it for prototype development.
Django’s documentation quality is among the best I’ve seen, making it very pleasant to read. But since the project has a separate frontend and backend and doesn’t use built-in plugins like auth and views, Django’s “batteries” didn’t reduce my workload, and the overall development experience wasn’t particularly exciting.
Considering the framework’s stability and thriving community, I would probably like Django if I were a dynamic-language enthusiast. Unfortunately, I’ve been deeply influenced by Go’s philosophy, and Django’s level of “magic” exceeds my comfort zone, like building query conditions out of field name + double underscore + method name. BTW, it’s hard to believe I once wanted to learn RoR.
Finally, after development was complete, even with all built-in plugins disabled, async used as much as possible, and Uvicorn serving the app, the load-test results were far below my expectations. So I started looking into rebuilding with Node.js.
v2 - Nest.js
Tech stack:
- TypeScript
- Components wrapped by Nest
- Prisma as ORM
Since the main latency in a search request comes from waiting for Elasticsearch, with the web service mainly acting as a request forwarder, this I/O-intensive scenario is very suitable for Node.js.
Popular frameworks include Nest.js and Adonis.js, and I ultimately chose the more popular Nest. Don’t ask why not Express or Fastify - they’re not frameworks.
Nest seems more like a dependency injector plus multiple officially maintained components (modules). Although it covers common concerns like caching and message queues, from my observation most of these are wrappers around third-party libraries, so that Nest users don’t have to piece things together themselves. However, even with official wrappers, I was still unfortunately bitten by a breaking change in an underlying library (cache-manager@6).
For developers with Java/Spring background, Nest might be great. But for me, Nest’s various decorators, pipes, and other concepts created a heavy mental burden. When switching back to a Nest project after two or three months, I needed to review the documentation to confirm their usage.
Additionally, while the documentation appears comprehensive, its quality is far below Django’s. For example, I couldn’t understand the module lifecycle part from the documentation alone, and finally had to rely on an article analyzing the source code to roughly figure it out.
Exploring new technology is always good, but choosing Nest for this project was a mistake because the project’s complexity was even less than the complexity Nest introduced.
v3 - Go
Tech stack:
- Echo as API endpoint
- GORM Gen as ORM
Based on the previous two experiences, batteries-included frameworks have, for now at least, lost their mystique for me. After coming full circle, I found my true love was still the original: Go.
I previously had two main complaints about web development with Go:
- The syntax is too basic, making CRUD uncomfortable
- Lack of good ORM or SQL builder
Fortunately, both issues have been largely resolved.
Thanks to the development of LLMs and AI IDEs, Go’s basic syntax is no longer a disadvantage and has even become somewhat of an advantage (to me): LLMs understand the code very easily, and AI completion is very accurate.
Regarding ORMs, a quite popular opinion in the Go community is that “ORMs are harmful,” preferring approaches like sqlc, which generates Go code from SQL, or sqlx, which uses SQL directly. ORMs do sometimes make simple things complex; for example, Prisma only recently started supporting true database-level JOINs. However, a well-designed, type-safe ORM can greatly improve the CRUD experience.
GORM Gen made me fall in love with GORM again. Through code generation, it not only achieves type safety but, more importantly, can generate Go code from custom SQL, meaning I have almost full SQL capabilities.
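Here is a hedged sketch of that workflow: the Querier interface and the articles table below are hypothetical, but they show how GORM Gen turns an SQL template in a comment into a type-safe, generated Go method.

```go
package main

import (
	"gorm.io/driver/postgres"
	"gorm.io/gen"
	"gorm.io/gorm"
)

// Querier is a hypothetical example: the SQL in the comment is a template,
// and GORM Gen generates a type-safe implementation for it.
type Querier interface {
	// SELECT * FROM @@table WHERE site_id=@siteID ORDER BY published_at DESC LIMIT @limit
	RecentBySite(siteID int64, limit int) ([]gen.T, error)
}

func main() {
	// DSN, output path, and table name are placeholders for illustration.
	db, err := gorm.Open(postgres.Open("postgres://user:pass@localhost:5432/rawweb"))
	if err != nil {
		panic(err)
	}

	g := gen.NewGenerator(gen.Config{OutPath: "./internal/query"})
	g.UseDB(db)

	// Generate a model struct from the existing table, then attach the
	// custom-SQL interface to it.
	article := g.GenerateModel("articles")
	g.ApplyBasic(article)
	g.ApplyInterface(func(Querier) {}, article)

	g.Execute()
}
```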
Thus, this code refactoring with Go was very enjoyable, except for the disastrous official Elasticsearch SDK.
Go also reduced the infra burden: no more multi-stage Dockerfile builds (without a CI server or GitHub Actions, the previous two tech stacks required building Docker images on the production server after pushing code).
To keep things simple, I also removed RabbitMQ, storing tasks in a database table instead and providing an API for the crawler to sync data. Since Redis itself might be simplified away in the future, I didn’t use Redis as a message queue here either.
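As an illustration, a database-backed queue can be as simple as a tasks table that workers claim rows from. The table layout and status values below are assumptions, not the actual schema; the interesting part is the SKIP LOCKED claim query.

```go
package main

import (
	"fmt"

	"gorm.io/driver/postgres"
	"gorm.io/gorm"
)

// Task is a hypothetical row layout for the queue table.
type Task struct {
	ID      int64
	Kind    string
	Payload string
	Status  string
}

// claimTasks atomically marks up to n pending tasks as running and returns them.
// FOR UPDATE SKIP LOCKED lets several workers poll the same table without
// blocking each other.
func claimTasks(db *gorm.DB, n int) ([]Task, error) {
	var tasks []Task
	err := db.Raw(`
		UPDATE tasks SET status = 'running'
		WHERE id IN (
			SELECT id FROM tasks
			WHERE status = 'pending'
			ORDER BY id
			LIMIT ?
			FOR UPDATE SKIP LOCKED
		)
		RETURNING id, kind, payload, status`, n).Scan(&tasks).Error
	return tasks, err
}

func main() {
	db, err := gorm.Open(postgres.Open("postgres://user:pass@localhost:5432/rawweb"))
	if err != nil {
		panic(err)
	}
	tasks, err := claimTasks(db, 10)
	if err != nil {
		panic(err)
	}
	fmt.Println("claimed", len(tasks), "tasks")
}
```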
Alternatives
There are some interesting options I passed on but might try in the future:
- C# & .NET. I’ve heard C# is very enjoyable to write, and .NET is a great enterprise framework. But I’m not interested in OOP, and I’m concerned about whether Microsoft might make risky moves in .NET open-source work again (Hot Reload removed from dotnet watch - Why?).
- Elixir & Phoenix. Elixir’s features seem very suitable for high-concurrency scenarios, and the development experience is very good. But I currently don’t have the energy to learn functional programming.
Easter egg
Are you looking for Rust? Haha, I'll never learn it for web development.
Frontend
The frontend uses my favorite SvelteKit, compiled into hybrid SSG and SPA pages. UI components are from shadcn-svelte.
React is good, but I equally dislike most things in its ecosystem, especially Next.js. I don’t understand why the community keeps getting “richer” while making developers more miserable. Svelte is currently my painkiller, and I recommend you try it too.
Infrastructure
To avoid vendor lock-in, only generic infra technologies are used:
- Backend services are orchestrated with Docker Compose, built and deployed to the VPS by a simple shell script
- Main backend services run on Hetzner’s Arm VPSes, currently two Debian instances with 2 vCPUs + 4GB RAM (great value for money; feel free to use my aff link to register and you’ll get €20 credit)
- Crawler service is on another budget VPS
- Web pages, CDN, DNS are on Cloudflare
- Monitoring service uses self-hosted Uptime Kuma, and some services are connected to New Relic
Future plans include setting up a Prometheus + Grafana observability system to visualize metrics like search volume and new indexing volume.