SVD Recommendation System in Ruby

One day, a bunch of friends, who happened to be big Family Guy fans, decided to put together a site to rank and share their thoughts on the show. Soon thereafter they had a Rails site up and running, and all was well, and other fans joined in hordes. A web 2.0 success! Then one day they realized that they could no longer track everyone's ratings, their user-base was too large, and so it occurred to one of the developers: "Wouldn't it be cool if we could use the collective knowledge of our whole community to recommend and rank episodes for each user individually?"

Sounds familiar, right? In fact, recommendation systems are a billion-dollar industry, and growing. In academic jargon this problem is known as Collaborative Filtering, and a lot of ink has been spilled on the matter. Netflix, for one, announced a 1 million dollar competition last year for a system that beats their algorithm by 10%. It goes without saying that a lot of different systems have been proposed and explored in theory and practice. However, one of the most successful and widely used approaches to this day also happens to be one of the simplest: Singular Value Decomposition (SVD), also affectionately referred to in the literature as LSI (Latent Semantic Indexing), dimensionality reduction, or projection.

Linear Algebra Refresher

SVD methods are a direct consequence of a theorem in linear algebra:

Any MxN matrix A whose number of rows M is greater than or equal to its number of columns N can be written as the product of an MxM column-orthogonal matrix U, an MxN diagonal matrix W with positive or zero elements (the singular values), and the transpose of an NxN orthogonal matrix V.

More intuitively, assume that we have a matrix where every column represents a user, and every row represents a product (or a Family Guy season, in our case). Thus, with N users and M products, we are looking at an MxN matrix (one row per product, one column per user). The theorem simply states that we can decompose such a matrix into three components: (MxM) call it U, (MxN) call it S, and (NxN) call it V. More importantly, we can use this decomposition to approximate the original MxN matrix. By keeping only the first k singular values of the matrix S, we can effectively obtain a compressed representation of the data. So why do we care? (Mathies click here, we'll wait.)
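In symbols, this is just the decomposition and its rank-k approximation (a sketch using the names above):

    A (MxN)   =  U (MxM)   *  S (MxN)   *  V^T (NxN)
    A_k (MxN) ~  U_k (Mxk) *  S_k (kxk) *  V_k^T (kxN)

where U_k and V_k keep only the first k columns of U and V, and S_k keeps only the k largest singular values.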

Machine Learning & Information Retrieval

One of the most fundamental, and fun, properties of Machine Learning is its close relationship to the concept of data compression - if we can identify significant concepts (clusters of users, for example) then we can represent a large dataset with fewer bits. However, this logic also works in reverse! If we can represent our data with fewer bits (compress our data), then we have identified 'significant' concepts! I bet you see where we're headed - SVDs allow us to compress a large matrix by approximating it in a lower-dimensional space.

SVDs have found wide application in the field of Information Retrieval (IR), where this process is often referred to as Latent Semantic Indexing (LSI). In these applications the columns of the matrix are the documents, and the rows are the individual words. Running SVD allows us to collapse this matrix into a lower-dimensional space where highly correlated items (for example, words that often occur together) are captured as a single feature. Essentially, we are discarding the noise and keeping the signal. In practice, the IR guys usually collapse their ginormous matrices to 100, 200, or 300 dimensions (from the original 10,000+) and then perform similarity calculations. In case you're curious, this same method has also found many uses in image compression and computer vision applications.

Dimensionality Reduction

Back to our Family Guy developers. For the sake of brevity we will use a very simple example with only 4 users and 6 seasons; the User x Rating matrix is shown below. Cranking this matrix through the SVD yields three different components: matrix U (6x6), matrix S (6x4), and matrix V (4x4). We will then collapse this data from a (6x4) space into a 2-dimensional one. To do this, we simply take the first two columns of U and V, and the top 2x2 block of S.
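Here is that matrix (the same data as the m matrix in the code listing below):

                Ben  Tom  John  Fred
    Season 1     5    5    0     5
    Season 2     5    0    3     4
    Season 3     3    4    0     3
    Season 4     0    0    5     3
    Season 5     5    4    4     5
    Season 6     5    4    5     5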

Now, because we are working with a 2-dimensional space, we can plot our results (below). We can treat the first column of U as x, and the second column as y - these are the seasons. The same process is repeated for matrix V - these are the users.
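In terms of the code listing further down, this amounts to nothing more than reading off the rows of u2 and v2 (a minimal sketch, assuming those variables from the listing below):

    # Each row of u2 is a season's [x, y] point; each row of v2 is a user's [x, y] point
    season_points, user_points = [], []
    u2.rows.each { |r| season_points << r.to_a.flatten }
    v2.rows.each { |r| user_points   << r.to_a.flatten }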

Do you see what happened? Because we are working with a small example it's hard to call two users a 'cluster', but you will nonetheless notice that Ben and Fred are located very close to each other - now compare their respective ratings in our original matrix. Very cool, huh! The same pattern recurs for Seasons 5 and 6. Our dimensionality reduction technique effectively captured the fact that Ben and Fred seem to have similar taste - we're halfway there!

Finding Similar Users

Next, Bob joins the site and shares with us a few of his season ratings ([5,5,0,0,0,5] for seasons 1-6) - it's our goal to give him a recommendation based on this data. Intuitively, we want to find users similar to Bob, thus if we can 'embed' Bob into our 2-Dimensional space and look where he is located, we will be able to answer this question. To do this, we perform the following calculation:
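In symbols (this is exactly the bobEmbed line in the code listing below):

    bob_2D  =  bob  *  U_2  *  S_2^-1

where bob is his 1x6 ratings vector, U_2 is the 6x2 matrix of the first two columns of U, and S_2^-1 is the inverse of the 2x2 matrix of the top two singular values.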

This is the general formula to project a new user into our space - I won't motivate the math behind it, but if you're interested, check the document referenced in the Linear Algebra Refresher section. The important result is that we now have x and y coordinates for Bob. Let's add them to our earlier graph:

The green triangle represents Bob. It's not immediately evident which user is closest, but if we extend the vector (from the origin - green line), we can see that Ben's and Fred's vectors are, in fact, very similar. A common way to judge similarity between any two vectors is to look at the angles separating them: cosine similarity. From our graph we can intuitively tell that the angle between Ben and Bob is smaller than the one between Ben and Fred. To determine this, let's iterate over all users and compute their cosine similarities. Furthermore, let's discard anyone whose similarity is less than 0.90 (outside of the shaded region). We get: Ben (0.987), Fred (0.955). Hence, we conclude that Ben and Bob have the most similar tastes, though Fred is pretty close also!
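For reference, the cosine similarity between two vectors a and b is simply:

    cos(a, b)  =  (a . b) / (||a|| * ||b||)

which is exactly what the user_sim loop in the code listing below computes for Bob against each user.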

What happens now is up to you. Here is one very simple strategy: find the most similar user and compare his/her items against those of the new user; take the items that the similar user has rated and the new user has not, and return them in decreasing order of ratings. Thus, Ben rated every season except 4, and Bob rated seasons 1, 2 and 6. We take the set difference ([1,2,3,5,6] - [1,2,6] = [3,5]), which gives the seasons Ben rated but Bob hasn't seen, and return them in decreasing order of Ben's ratings: Season 5 (5 stars), Season 3 (3 stars).
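As a minimal sketch of this strategy in plain Ruby (the variable names here are illustrative; the full listing below works directly on the matrices):

    # Seasons each person has rated, plus Ben's ratings for the candidate seasons
    ben_rated  = [1, 2, 3, 5, 6]
    bob_rated  = [1, 2, 6]
    ben_scores = { 3 => 3, 5 => 5 }

    # Seasons Ben rated but Bob hasn't seen, highest-rated first
    recommendations = (ben_rated - bob_rated).sort_by { |season| -ben_scores[season] }
    # => [5, 3], i.e. Season 5 (5 stars), then Season 3 (3 stars)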

Will you just give me the code already?

For the brave ones who made it this far, below is the equivalent of what we just did on paper... in Ruby. First, install the linalg library, and you're ready to roll:

require 'linalg'

users = { 1 => "Ben", 2 => "Tom", 3 => "John", 4 => "Fred" }
m = Linalg::DMatrix[
           #Ben, Tom, John, Fred
            [5,5,0,5], # season 1
            [5,0,3,4], # season 2
            [3,4,0,3], # season 3
            [0,0,5,3], # season 4
            [5,4,4,5], # season 5
            [5,4,5,5]  # season 6
            ]

# Compute the SVD Decomposition
u, s, vt = m.singular_value_decomposition
vt = vt.transpose

# Take the rank-2 approximation of the matrix
#   - Take the first and second columns of u  (6x2)
#   - Take the first and second columns of vt (4x2)
#   - Take the first two singular values      (2x2)
u2 = Linalg::DMatrix.join_columns [u.column(0), u.column(1)]
v2 = Linalg::DMatrix.join_columns [vt.column(0), vt.column(1)]
eig2 = Linalg::DMatrix.columns [s.column(0).to_a.flatten[0,2], s.column(1).to_a.flatten[0,2]]

# Here comes Bob, our new user
bob = Linalg::DMatrix[[5,5,0,0,0,5]]
bobEmbed = bob * u2 * eig2.inverse

# Compute the cosine similarity between Bob and every other User in our 2-D space
user_sim, count = {}, 1
v2.rows.each { |x|
    user_sim[count] = (bobEmbed.transpose.dot(x.transpose)) / (x.norm * bobEmbed.norm)
    count += 1
  }

# Remove all users who fall below the 0.90 cosine similarity cutoff and sort by similarity
similar_users = user_sim.delete_if {|k,sim| sim < 0.9 }.sort {|a,b| b[1] <=> a[1] }
similar_users.each { |u| printf "%s (ID: %d, Similarity: %0.3f) \n", users[u[0]], u[0], u[1]  }

# We'll use a simple strategy in this case:
#   1) Select the most similar user
#   2) Compare all items rated by this user against your own and select items that you have not yet rated
#   3) Return the ratings for items I have not yet seen, but the most similar user has rated
similarUsersItems = m.column(similar_users[0][0]-1).transpose.to_a.flatten
myItems = bob.transpose.to_a.flatten

not_seen_yet = {}
myItems.each_index { |i|
  not_seen_yet[i+1] = similarUsersItems[i] if myItems[i] == 0 and similarUsersItems[i] != 0
}

printf "\n %s recommends: \n", users[similar_users[0][0]]
not_seen_yet.sort {|a,b| b[1] <=> a[1] }.each { |item|
  printf "\tSeason %d .. I gave it a rating of %d \n", item[0], item[1]
}

print "We've seen all the same seasons, bugger!" if not_seen_yet.size == 0

svd-recommender-gsl.rb - a Ruby/GSL version of the same script, courtesy of Joshua Bassett.

Running our algorithm produces:

Ben (ID: 1, Similarity: 0.987)
Fred (ID: 4, Similarity: 0.955)

Ben recommends:
  Season 5 .. I gave it a rating of 5
  Season 3 .. I gave it a rating of 3

That's it! A ~50-line SVD recommendation / collaborative filtering system for a Rails app, with the help of some simple linear algebra.

In other iterations: Decision Tree Learning, Bayes Classification, Support Vector Machines
