Restarting with Content-Based Document Recommendations

Paulo Roberto Cruz França — Fri, 17 May 2024 23:54:44 GMT

When I was finishing college, I had to complete a year-long capstone project to graduate. It was supposed to span two semesters, with grades at the end of each one. For me, that project became a document recommendation system — and it has stayed with me ever since.

TL;DR

Back in college, I built a content-based recommendation system for large text documents using about 100k academic papers from Brazilian universities. At the time, I used Latent Semantic Indexing (LSI) to generate recommendations.

Now, years later, I want to revisit this project with a new approach:

- Create a document server to manage papers
- Experiment with different vectorization methods
- Store document embeddings in a vector database (like pgvector)
- Expose recommendations through a web API

A bit of history

When I first worked on this project, I was nearing the end of a long (and sometimes painful) journey through Computer Science at the Federal University of Tocantins. I officially graduated in the second academic semester of 2018 (although, by the calendar, it was already 2019).

Here’s what I built back then:

- Crawled ~100k academic documents from three federal universities
- Used LSI (Latent Semantic Indexing) to represent documents as vectors
- Calculated similarities using Euclidean distance or cosine similarity
- Stored recommendations for each document in a Postgres database
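
To make the vector and similarity steps above concrete, here is a minimal sketch in plain Python. It is not the original project code. It builds TF-IDF vectors and compares them with cosine similarity and Euclidean distance. LSI would add a truncated SVD step on top of a term-document matrix like this one, and that step is left out here to keep the example dependency-free. The function names and the whitespace tokenizer are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents each term appears in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two sparse vectors (0.0 = identical)."""
    terms = set(a) | set(b)
    return math.sqrt(sum((a.get(t, 0.0) - b.get(t, 0.0)) ** 2 for t in terms))


def recommend(vectors, index, k=5):
    """Return up to k (other_index, similarity) pairs, most similar first."""
    scores = [
        (other, cosine_similarity(vectors[index], vec))
        for other, vec in enumerate(vectors)
        if other != index
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]
```

Calling `recommend(tfidf_vectors(docs), 0)` ranks every other document against the first one. Note that the IDF weights depend on the whole collection, so adding a document changes every vector. The SVD step in LSI has the same property.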

It worked — but with a big caveat: if someone wanted to add a new document, I had to recompute LSI for the entire dataset, which was both slow and painful.

What I want to do now

This time, I want a cleaner, more scalable solution:

Document Server
- A Django-based web app with an API to register documents
- Each document will have metadata (title, description, source URLs)
- The server will keep both a private copy (for processing) and the public link (to send users back to the source)
Vectorization Alternatives
- Instead of relying only on LSI, I’ll experiment with other approaches (plain TF-IDF, word and document embeddings, transformer-based models, etc.)
Vector Database
- Store document embeddings in something like pgvector to allow efficient similarity search without recomputing everything
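
As a rough sketch of how the vector database part of the plan could look, the snippet below builds the SQL for a pgvector-backed table and a nearest-neighbor query. The `documents` table, its columns, and the 384-dimension embedding size are all assumptions made for illustration, not decisions from the project. The `vector` type and the `<=>` (cosine distance) and `<->` (Euclidean distance) operators come from pgvector itself. The snippet only produces SQL strings and parameters, so it runs without a database.

```python
# Hypothetical schema: table name, columns, and dimension are assumptions.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS documents (
    id bigserial PRIMARY KEY,
    title text NOT NULL,
    source_url text,
    embedding vector(384)
);
"""


def to_vector_literal(embedding):
    """Format a sequence of numbers as a pgvector text literal like '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def insert_document_query(title, source_url, embedding):
    """Build a parameterized INSERT for one new document and its embedding."""
    sql = (
        "INSERT INTO documents (title, source_url, embedding) "
        "VALUES (%s, %s, %s::vector)"
    )
    return sql, (title, source_url, to_vector_literal(embedding))


def nearest_documents_query(embedding, limit=5):
    """Build a parameterized query for the `limit` closest documents.

    `<=>` is pgvector's cosine-distance operator; swapping in `<->`
    would rank by Euclidean (L2) distance instead.
    """
    sql = (
        "SELECT id, title, source_url "
        "FROM documents "
        "ORDER BY embedding <=> %s::vector "
        "LIMIT %s"
    )
    return sql, (to_vector_literal(embedding), limit)
```

The `%s` placeholders follow the psycopg convention, so with an open cursor the result could be run as `cursor.execute(*nearest_documents_query(vec))`. With a fixed embedding model, registering a new paper is then a single INSERT, and similarity is computed at query time instead of being precomputed for the whole dataset.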

Where to follow along

I’ll be sharing progress here, but you can also follow along on my other channels:

GitHub: github.com/paulorobertocruz
YouTube: My Channel

I’m excited to bring this project back to life — not just as a throwback to my graduation work, but as a way to explore how modern tools make large-scale recommendations more practical.

Tech by Paulo Roberto
