AI-Driven Book Recommendation App with Natural Language User Input

This project is a book recommendation service that suggests books based on a user's input of genre and book titles. It is built on a database of 7,000 books retrieved from Kaggle. Using OpenAI as the large language model provider, vector embeddings were created from the Kaggle dataset to enable fast vector search that finds semantically similar books from natural language input.

Tools / Frameworks / Services

Software Dev:
- Next.js, React
- Tailwind CSS
- Vercel (hosting)

Large Language Model:
- Vector Similarity Search
- Text Generation
- OpenAI API, Cohere API
- Weaviate vector database

Key Points

Input a genre and book titles to get AI-powered book recommendations

The OpenAI text embedding model vectorizes book descriptions and user preferences, enabling accurate searches for matching books and improving the user experience over traditional book recommendation systems

LLM-Related Services and Data Pipeline Building

Beyond implementing the pipeline in the web development environment, I also created a Python workflow to configure, access, and manage vector embeddings in the Weaviate vector database

Minimalistic experimental interface

This project is experimental and technology-focused, so I streamlined the interface to prioritize and deliver the core functionalities.

Key Features
This process is designed to be simple and user-friendly: users enter natural language to describe their book preferences and interests, set the number of top search results they'd like to see, and view AI-generated explanations for each book recommendation.
One.
Natural Language Search
User A often feels confused by the artistic nature of book titles when searching for books, frequently turning to Google for recommendations from others. Now, with BookBuddy, he can simply describe the type of content he enjoys—no matter how detailed—and receive tailored suggestions effortlessly.
Two.
Book Recommendation Reason
User A wants to understand why a particular book might be of interest to him. BookBuddy provides a clear explanation of the possible reasons, helping him gain a better understanding of the book's key aspects in the process.
Three.
Book Details and Purchase Link
User A found his ideal book from the list of recommendations and decided to purchase it online. BookBuddy thoughtfully provides a convenient Amazon purchase link to make the process seamless.
LLM Technical Details
I use diagrams to illustrate the principles of RAG (Retrieval-Augmented Generation) and provide a step-by-step explanation of how I applied RAG and Vector Similarity Search to construct the application's data pipeline. Sample code snippets are included for clarity, and the complete code is also available in BookBuddy's GitHub repository.
One.
Retrieval-Augmented Generation
Retrieval-augmented generation (RAG) is a powerful technique that retrieves relevant data and provides it to large language models (LLMs) as context, along with the task prompt. It is also called generative search or, in some cases, in-context learning.

The first step is to retrieve relevant data through a query. In the second step, the LLM is prompted with a combination of the retrieved data and the user-provided query. This provides in-context learning for the LLM, so it uses relevant, up-to-date data rather than relying on recall from its training or, even worse, producing hallucinated outputs.
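As a concrete illustration of the second step, here is a minimal sketch of how retrieved context and a user query can be combined into one prompt. The function name, stubbed retrieval results, and prompt wording are placeholders for illustration, not BookBuddy's actual code.

```python
# Illustrative sketch of the prompt-assembly step in RAG.
# Names, sample data, and prompt wording are placeholders, not BookBuddy's actual code.

def build_rag_prompt(user_query: str, retrieved_descriptions: list[str]) -> str:
    """Combine retrieved context with the user's query before prompting the LLM."""
    context = "\n".join(f"- {d}" for d in retrieved_descriptions)
    return (
        "You are a book recommendation assistant.\n"
        f"Candidate books found by vector search:\n{context}\n\n"
        f"User request: {user_query}\n"
        "Recommend the best matches and explain why."
    )

# Step 1 (retrieval) is performed by the vector database; the results are stubbed here.
retrieved = [
    "A detective solves a murder in a quiet seaside village.",
    "A cozy mystery featuring a bookshop owner and her cat.",
]
print(build_rag_prompt("cozy mysteries set in small towns", retrieved))
```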

In BookBuddy, I used Weaviate's integration with Cohere's APIs, which allows the AI models' capabilities to be accessed directly from Weaviate and reduces development complexity.

Using RAG, BookBuddy generates recommendation reasons by analyzing the information from the books with the highest similarity scores returned by the vector similarity search.

- Configure a Weaviate vector database collection to use OpenAI for text embeddings (see the configuration sketch after this list)
- Weaviate performs the search, retrieves the most relevant objects, and then passes them to the Cohere-provided generative model to generate outputs
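Below is a minimal sketch of this configuration, assuming the Weaviate Python client (v4 syntax). The cluster URL, environment variable names, collection name, and property list are illustrative assumptions rather than BookBuddy's exact setup.

```python
import os
import weaviate
from weaviate.classes.config import Configure, Property, DataType

# Connect to a Weaviate Cloud instance. API keys are passed as headers so Weaviate
# can call OpenAI (embeddings) and Cohere (generation) on the application's behalf.
client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"],
        "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"],
    },
)

# Create a "Book" collection that embeds text with OpenAI and generates with Cohere.
client.collections.create(
    name="Book",
    vectorizer_config=Configure.Vectorizer.text2vec_openai(),
    generative_config=Configure.Generative.cohere(),
    properties=[
        Property(name="title", data_type=DataType.TEXT),
        Property(name="description", data_type=DataType.TEXT),
        Property(name="isbn10", data_type=DataType.TEXT),
    ],
)

client.close()
```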
Three.
7K Book Dataset from Kaggle
The dataset used for embeddings was sourced from Kaggle and includes 7,000 books. Each entry contains 12 data fields, such as title, author, description, published year, rating, thumbnail (used to display the book cover), ISBN-10 (book identifier, used to build the Amazon purchase URL), ISBN-13 (book identifier), and more.
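Below is a sketch of how the dataset could be imported into the collection in batches, again assuming the Weaviate Python client v4; the CSV filename and column names are assumptions based on the fields listed above.

```python
import os
import pandas as pd
import weaviate

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    headers={"X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"]},
)
books = client.collections.get("Book")

# The CSV filename and column names are assumptions based on the fields described above.
df = pd.read_csv("books.csv").fillna("")

# Batch import; Weaviate calls OpenAI to embed each object's text properties on insert.
with books.batch.dynamic() as batch:
    for _, row in df.iterrows():
        batch.add_object(properties={
            "title": row["title"],
            "description": row["description"],
            "isbn10": str(row["isbn10"]),
        })

# For most print books the ISBN-10 doubles as the Amazon product id,
# so a purchase link can be built directly from it.
def amazon_url(isbn10: str) -> str:
    return f"https://www.amazon.com/dp/{isbn10}"

client.close()
```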
Four.
Configure and Search Data Pipeline
In this project, I utilized Python in conjunction with Weaviate's API to configure the vector database, which serves as the foundation for subsequent application development. Additionally, I implemented a Python-based search script leveraging this database to simulate user search behavior. This approach ensures the seamless operation of the entire data pipeline, mitigating potential risks during the later stages of web development.
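A sketch of such a search script, assuming the collection configured above: it runs a vector similarity search for a natural-language query and asks the Cohere generative model for a per-book recommendation reason. The example query, result limit, and prompt wording are illustrative.

```python
import os
import weaviate

client = weaviate.connect_to_weaviate_cloud(
    cluster_url=os.environ["WEAVIATE_URL"],
    auth_credentials=weaviate.auth.AuthApiKey(os.environ["WEAVIATE_API_KEY"]),
    headers={
        "X-OpenAI-Api-Key": os.environ["OPENAI_API_KEY"],
        "X-Cohere-Api-Key": os.environ["COHERE_API_KEY"],
    },
)
books = client.collections.get("Book")

# Simulate a user search: vector similarity search plus a generated recommendation reason.
user_query = "a heartwarming fantasy about found family"
response = books.generate.near_text(
    query=user_query,
    limit=5,  # the "top search results" count the user can set in the UI
    single_prompt=(
        "In two sentences, explain why the book '{title}' ({description}) "  # Weaviate fills {title}/{description}
        f"might appeal to a reader looking for: {user_query}."
    ),
)

for obj in response.objects:
    print(obj.properties["title"])
    print(obj.generated)  # the Cohere-generated recommendation reason

client.close()
```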
A Next.js API route that implements the vector similarity search query is included in BookBuddy's GitHub repository.