<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Tech by Paulo Roberto]]></title><description><![CDATA[Tech by Paulo Roberto]]></description><link>https://pauloroberto.heetoo.dev</link><generator>RSS for Node</generator><lastBuildDate>Sat, 30 May 2026 16:42:43 GMT</lastBuildDate><atom:link href="https://pauloroberto.heetoo.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Restarting with Content-Based Document Recommendations]]></title><description><![CDATA[When I was finishing college, I had to complete a year-long capstone project to graduate. It was supposed to span two semesters, with grades at the end of each one. For me, that project became a document recommendation system — and it has stayed with...]]></description><link>https://pauloroberto.heetoo.dev/restarting-with-content-based-document-recommendations</link><guid isPermaLink="true">https://pauloroberto.heetoo.dev/restarting-with-content-based-document-recommendations</guid><category><![CDATA[Django]]></category><category><![CDATA[web scraping]]></category><category><![CDATA[LLM's ]]></category><category><![CDATA[vector database]]></category><category><![CDATA[vector embeddings]]></category><category><![CDATA[pgvector]]></category><category><![CDATA[lsi]]></category><dc:creator><![CDATA[Paulo Roberto Cruz França]]></dc:creator><pubDate>Fri, 17 May 2024 23:54:44 GMT</pubDate><content:encoded><![CDATA[<p>When I was finishing college, I had to complete a year-long capstone project to graduate. It was supposed to span two semesters, with grades at the end of each one. For me, that project became a <strong>document recommendation system</strong> — and it has stayed with me ever since.</p>
<h3 id="heading-tldr">TL;DR</h3>
<p>Back in college, I built a <strong>content-based recommendation system</strong> for large text documents using about <strong>100k academic papers</strong> from Brazilian universities. At the time, I used Latent Semantic Indexing (LSI) to generate recommendations.</p>
<p>Now, years later, I want to revisit this project with a new approach:</p>
<ul>
<li><p>Create a <strong>document server</strong> to manage papers</p>
</li>
<li><p>Experiment with <strong>different vectorization methods</strong></p>
</li>
<li><p>Store document embeddings in a <strong>vector database</strong> (like <code>pgvector</code>)</p>
</li>
<li><p>Expose recommendations through a <strong>web API</strong></p>
</li>
</ul>
<h3 id="heading-a-bit-of-history">A bit of history</h3>
<p>When I first worked on this project, I was nearing the end of a long (and sometimes painful) journey through Computer Science at the <strong>Federal University of Tocantins</strong>. I officially graduated in the second academic semester of 2018, although by the calendar it was already 2019.</p>
<p>Here’s what I built back then:</p>
<ul>
<li><p>Crawled ~100k academic documents from three federal universities</p>
</li>
<li><p>Used <strong>LSI (Latent Semantic Indexing)</strong> to represent documents as vectors</p>
</li>
<li><p>Calculated similarities using <strong>Euclidean distance</strong> or <strong>cosine similarity</strong></p>
</li>
<li><p>Stored recommendations for each document in a <strong>Postgres database</strong></p>
</li>
</ul>
<p>It worked — but with a big caveat: if someone wanted to add a <em>new</em> document, I had to <strong>recompute LSI for the entire dataset</strong>, which was both slow and painful.</p>
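<p>The two similarity measures mentioned above can be sketched in plain Python. This is a minimal illustration, not the original project code: the function names and the tiny three-dimensional vectors are made up for the example, and a real LSI pipeline would produce vectors with far more dimensions.</p>

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between two vectors; 0.0 means identical."""
    return math.dist(a, b)


def top_k_similar(doc_id, vectors, k=2):
    """Return the ids of the k documents most similar to doc_id."""
    target = vectors[doc_id]
    scores = [
        (cosine_similarity(target, vec), other_id)
        for other_id, vec in vectors.items()
        if other_id != doc_id
    ]
    scores.sort(reverse=True)  # highest similarity first
    return [other_id for _, other_id in scores[:k]]


# Hypothetical document vectors, invented for illustration only.
vectors = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.5, 0.5, 0.0],
    "paper-d": [0.0, 0.0, 1.0],
}

recommendations = top_k_similar("paper-a", vectors, k=2)
```

<p>Precomputing a list like <code>recommendations</code> for every document and saving it in a table is roughly what "storing recommendations in Postgres" amounts to, which is also why every new document forces the lists to be rebuilt.</p>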
<h3 id="heading-what-i-want-to-do-now">What I want to do now</h3>
<p>This time, I want a cleaner, more scalable solution:</p>
<ol>
<li><p><strong>Document Server</strong></p>
<ul>
<li><p>A Django-based web app with an API to register documents</p>
</li>
<li><p>Each document will have metadata (title, description, source URLs)</p>
</li>
<li><p>The server will keep both a private copy (for processing) and the public link (to send users back to the source)</p>
</li>
</ul>
</li>
<li><p><strong>Vectorization Alternatives</strong></p>
<ul>
<li>Instead of relying only on LSI, I’ll experiment with other approaches (plain TF-IDF, word and sentence embeddings, transformer-based models, etc.)</li>
</ul>
</li>
<li><p><strong>Vector Database</strong></p>
<ul>
<li>Store document embeddings in something like <strong>pgvector</strong> to allow efficient similarity search without recomputing everything</li>
</ul>
</li>
</ol>
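<p>To make the plan above a bit more concrete, here is a rough sketch under stated assumptions. The <code>Document</code> record, the <code>hashed_embedding</code> function, and the <code>InMemoryIndex</code> class are all hypothetical names: the hashing vectorizer stands in for whichever embedding method ends up being used, and the in-memory index stands in for a vector database such as pgvector. It does not use Django or pgvector APIs; it only illustrates why adding a document no longer requires recomputing the whole dataset.</p>

```python
import math
import zlib
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record mirroring the metadata listed above."""
    title: str
    description: str
    source_url: str    # public link, used to send users back to the source
    private_path: str  # private copy kept for processing
    embedding: list = field(default_factory=list)


def hashed_embedding(text, dims=1024):
    """Feature-hashing stand-in for a real embedding model.

    Each token is mapped to a bucket by a stable hash, so the vector depends
    only on the document's own text, never on the rest of the corpus.
    """
    vec = [0.0] * dims
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


class InMemoryIndex:
    """Toy replacement for a vector database, for illustration only."""

    def __init__(self):
        self.documents = []

    def add(self, doc):
        # Only the new document is vectorized; existing vectors are untouched.
        doc.embedding = hashed_embedding(doc.title + " " + doc.description)
        self.documents.append(doc)

    def nearest(self, query, k=3):
        # Vectors are unit length, so the dot product equals cosine similarity.
        q = hashed_embedding(query)
        scored = [
            (sum(a * b for a, b in zip(q, d.embedding)), d.title)
            for d in self.documents
        ]
        scored.sort(reverse=True)
        return [title for _, title in scored[:k]]


# Example usage with made-up documents and placeholder URLs.
index = InMemoryIndex()
first = Document(
    "Neural networks for image classification", "deep learning",
    "https://example.org/a", "/tmp/a.pdf",
)
index.add(first)
snapshot = list(first.embedding)
index.add(Document(
    "Soil analysis in the Cerrado", "agronomy study",
    "https://example.org/b", "/tmp/b.pdf",
))
closest = index.nearest("neural networks", k=1)
```

<p>The design point is in <code>add</code>: because each vector is computed from a single document, registering a new paper is one insert rather than a full recomputation. In a real setup the linear scan in <code>nearest</code> would be replaced by a nearest-neighbour query in the vector database.</p>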
<h3 id="heading-where-to-follow-along">Where to follow along</h3>
<p>I’ll be sharing progress here, but you can also follow along on my other channels:</p>
<ul>
<li><p>GitHub: <a target="_blank" href="http://github.com/paulorobertocruz">github.com/paulorobertocruz</a></p>
</li>
<li><p>YouTube: <a target="_blank" href="https://www.youtube.com/channel/UCTaSw4fi0jFaPtUWRtX5zyg">My Channel</a></p>
</li>
</ul>
<p>I’m excited to bring this project back to life — not just as a throwback to my graduation work, but as a way to explore how modern tools make large-scale recommendations more practical.</p>
]]></content:encoded></item></channel></rss>