CS209-Course-Notes

These notes cover the main types of online advertising — search ads, native ads, and display ads — and how they work and are implemented. They also cover key techniques in information retrieval systems, such as inverted indexes and query processing.

Lecture 1:

Ads types

Search Ads:

  • logic: match the ad’s keywords to the user’s query
  • ad format: text, image
  • ad position: mainline, sidebar, top or bottom of the search results

Native Ads:

  • logic: match the ad’s keywords to the context of the web page or app
  • ad format: text, image; the style should also match the context of the web page or app
  • ad position: embedded in the original content of the web page or app

Display Ads:

  • logic: match the user’s demographics and interests to the ad’s category; interests are collected from user behavior: page dwell time, clicks, video engagement time
  • ad format: image, animation (GIF), video, audio
  • ad position: sidebar, top or bottom of the page or app

Ads data structure

What is a campaign:

  • A campaign focuses on a theme or a group of products
  • Set a budget
  • Choose your audience
  • Write your ad, including keywords and ad content

Ad:

  • AdID
  • CampaignID
  • Keywords
  • Bid
  • Description
  • LandingPage

Campaign

  • CampaignID
  • Budget
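The ad and campaign records above can be sketched as simple data classes. The field names follow the notes; the types and example values are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Campaign:
    campaign_id: int
    budget: float  # total spend limit for the campaign

@dataclass
class Ad:
    ad_id: int
    campaign_id: int     # links the ad back to its Campaign
    keywords: list[str]  # terms matched against the user's query
    bid: float           # price offered per click
    description: str     # text shown in the ad
    landing_page: str    # URL the ad click leads to

# hypothetical example records
camp = Campaign(campaign_id=1, budget=500.0)
ad = Ad(ad_id=10, campaign_id=camp.campaign_id,
        keywords=["beach", "furniture"], bid=0.35,
        description="Outdoor beach furniture sale",
        landing_page="https://example.com/beach")
```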

Search Ads Workflow


Lecture 2:

Information Retrieval(IR)

Finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (stored on computers), e.g. web search, e-mail search, etc.

Inverted index

For each term t, we must store a list of all documents that contain t; each document is identified by a docID.

Inverted index construction

  • Tokenization
    Cut the character sequence into word tokens
  • Normalization
    Map text and query terms to the same form: lowercase, U.S.A. -> USA
  • Stemming
    We may wish different forms of a word to match: am, are, is -> be; cars, car’s, cars’ -> car
  • Stop words
    Omit very common words such as prepositions: of, on

How to process a Boolean query like A AND B, e.g. “Alice AND Bruce”?
- Locate each term’s postings and merge (intersect) them; the index has the form (term : docs).
How to process a phrase query like “A B”, e.g. “star wars”?
- Use a positional index of the form (term, num of docs; doc1: pos1, pos2, …; doc2: pos1, pos2, …) and check the distance between positions in shared documents.
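The two query types can be sketched with a small positional inverted index. This is a minimal illustration (tiny hard-coded documents, no normalization), not production code:

```python
from collections import defaultdict

docs = {1: "star wars is a space opera",
        2: "the cold wars",
        3: "a star is born"}

# positional index: term -> {docID -> [positions]}
index = defaultdict(dict)
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split()):
        index[term].setdefault(doc_id, []).append(pos)

def and_query(a, b):
    """A AND B: intersect the two postings lists."""
    return sorted(index[a].keys() & index[b].keys())

def phrase_query(a, b):
    """Phrase "A B": in shared docs, check B appears right after A."""
    hits = []
    for doc_id in and_query(a, b):
        if any(p + 1 in index[b][doc_id] for p in index[a][doc_id]):
            hits.append(doc_id)
    return hits

print(and_query("star", "is"))       # -> [1, 3]
print(phrase_query("star", "wars"))  # -> [1]
```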

Application of IR in search ads

  • build an inverted index for ads: key -> term in keywords, value -> list(AdID)
  • build a forward index for ad detail info
  • process the query
  • rank the ad candidates

How to rank ads candidates?

  • Relevance score = number of query words matched in the ad’s keywords / total number of keywords
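The four steps above can be sketched end to end with the relevance formula just given. The ad data and field names are made-up assumptions:

```python
from collections import defaultdict

# forward index: AdID -> ad detail info (here just the keywords)
ads = {
    101: {"keywords": ["beach", "furniture", "outdoor"]},
    102: {"keywords": ["office", "furniture"]},
    103: {"keywords": ["beach", "towel"]},
}

# inverted index: keyword term -> list(AdID)
inverted = defaultdict(list)
for ad_id, ad in ads.items():
    for term in ad["keywords"]:
        inverted[term].append(ad_id)

def rank_ads(query):
    terms = query.lower().split()
    # candidate retrieval via the inverted index
    candidates = {ad_id for t in terms for ad_id in inverted.get(t, [])}
    scored = []
    for ad_id in candidates:
        kws = ads[ad_id]["keywords"]
        matched = sum(1 for t in terms if t in kws)
        scored.append((matched / len(kws), ad_id))  # relevance score
    return sorted(scored, reverse=True)

print(rank_ads("outdoor beach furniture"))  # ad 101 matches all its keywords
```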

Web service

Web services are client and server applications that communicate over the WWW via the Hypertext Transfer Protocol (HTTP).

Components:

  • Client: PC, mobile phone, tablet
  • Protocol: HTTP
  • Web Server: Tomcat, nginx, IIS, Jetty
  • Data Layer: SQL database, NoSQL, document store

HTTP is an application-level protocol for distributed, collaborative, hypermedia information systems. It is used to deliver data on the WWW.

  • Connectionless
    The HTTP client, i.e. a browser, initiates an HTTP request; after the request is made, the client disconnects from the server and waits for a response. The server processes the request and re-establishes the connection with the client to send the response back.
  • Media independent
    Any type of data can be sent by HTTP as long as both the client and the server know how to handle the data content.
  • Stateless
    HTTP is connectionless, which is a direct result of HTTP being a stateless protocol. The server and client are aware of each other only during the current request.

How does a web server handle an HTTP request?

  • AuthTrans
    Verify any authorization info sent in the request
  • NameTrans
    Translate the logical URL into a local file system path
  • PathCheck
    Check the local file system path for validity and check that the requester has access privileges to the requested resource on the file system
  • ObjectType
    Determine the Multipurpose Internet Mail Extensions (MIME) type of the requested resource
  • ParseParams
    Process incoming request data read by the Service step
  • Service (generate response)
    Generate and return the response to the client
  • Error
    If an error happens, the server logs the error message and aborts the process
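Several of these steps can be sketched with Python’s standard-library `http.server`. This is an illustrative static-file handler only (the document root is an assumption; AuthTrans, ParseParams, and protections such as rejecting ".." paths are omitted):

```python
import mimetypes
import os
from http.server import BaseHTTPRequestHandler, HTTPServer

DOC_ROOT = "/var/www/html"  # assumption: the server's document root

class StaticHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        try:
            # NameTrans: translate the logical URL into a local file path
            path = os.path.join(DOC_ROOT, self.path.lstrip("/"))
            # PathCheck: validity and access privileges
            if not os.path.isfile(path) or not os.access(path, os.R_OK):
                self.send_error(404)
                return
            # ObjectType: determine the MIME type of the resource
            mime, _ = mimetypes.guess_type(path)
            # Service: generate and return the response
            with open(path, "rb") as f:
                body = f.read()
            self.send_response(200)
            self.send_header("Content-Type", mime or "application/octet-stream")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        except Exception as exc:
            # Error: log the message and abort this request
            self.log_error("request failed: %s", exc)
            self.send_error(500)

# To run: HTTPServer(("", 8080), StaticHandler).serve_forever()
```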

Map Reduce

  • Map
    Divides the input into ranges and creates a map task to transform each partition
    input: any string
    output: (key, value) pairs
  • Shuffle
    Distribute partitions to different machines by key
  • Reduce
    Collects the various results and combines them to answer the larger problem that the master node needs to solve
    input: key, list(value)
    output: combined value per key
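The classic word-count example shows the three phases on a single machine (the shuffle here is just a grouping by key, standing in for distribution across machines):

```python
from collections import defaultdict

# Map: input any string, output (key, value) pairs
def map_fn(line):
    for word in line.split():
        yield word, 1

# Reduce: input (key, list(value)), output the combined result
def reduce_fn(key, values):
    return key, sum(values)

lines = ["to be or not to be", "to do is to be"]

# Map phase
mapped = [kv for line in lines for kv in map_fn(line)]

# Shuffle phase: group pairs by key
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce phase
counts = dict(reduce_fn(k, vs) for k, vs in shuffled.items())
print(counts)  # counts["to"] == 4, counts["be"] == 3
```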

Lecture 3:

Query Rewrite

  • Goal
    Find queries related to the issued one, which would allow us to retrieve relevant ads that were not matched by the original query
  • Approach
    Find the K nearest neighbors of the original query: semantically similar queries
  • Intuition
    If we can find a vector representation of the query, then we can calculate similarity as the cosine of the two vectors

Normally, a customer generates a query like “outdoor beach furniture”; we first find its K nearest neighbors (similar queries) and compare their similarity. To generate a vector, the input one-hot vector is used to look up the word vector, and the word vector is multiplied by the output layer. For example, the word vector of “ant” times the output-layer weights of the word “car” gives a value, which is passed to the softmax layer to calculate a probability.
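The cosine-similarity intuition can be shown directly. The vectors below are made up for illustration; in practice they would be averaged word2vec embeddings of the query terms:

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# toy query vectors (hypothetical values)
q1 = [0.9, 0.1, 0.3]  # "outdoor beach furniture"
q2 = [0.8, 0.2, 0.4]  # "patio furniture"  (assumed similar)
q3 = [0.1, 0.9, 0.0]  # "car insurance"    (assumed unrelated)

print(cosine(q1, q2) > cosine(q1, q3))  # the semantically closer query scores higher
```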

word2vec

  • skip-gram model
    • for a given word in a sentence, what is the probability of each and every other word in our vocabulary appearing anywhere within a small window around the input word?
    • for example, given the word “trump”, a trained model will say that words like “president”, “elect” and “donald” have a high probability of appearing nearby, while unrelated words like “cook” and “movie” have a low probability
  • skip-gram model training
    • training data: vocabulary of V unique words
    • input word representation: a one-hot vector for each word; this vector has V components (one for every word in the vocabulary), with a “1” in the position corresponding to the word and 0s in all other positions
    • output: a single vector containing, for every word in the vocabulary, the probability that that word would appear near the input word
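A minimal sketch of the forward pass: the one-hot input selects one row of the input matrix (the word vector), which is scored against every output-layer vector, and softmax turns the scores into probabilities. The vocabulary and weights below are tiny made-up values, not trained parameters:

```python
import math

vocab = ["ant", "car", "insect", "drive"]
# input matrix: one 2-d "word vector" per word (made-up values)
W_in = {"ant": [0.5, 0.1], "car": [0.2, 0.9],
        "insect": [0.6, 0.2], "drive": [0.1, 0.8]}
# output-layer vector per word (made-up values)
W_out = {"ant": [0.4, 0.0], "car": [0.0, 0.7],
         "insect": [0.5, 0.1], "drive": [0.1, 0.6]}

def skipgram_probs(word):
    h = W_in[word]  # the one-hot input just selects this row
    scores = {w: h[0] * v[0] + h[1] * v[1] for w, v in W_out.items()}
    z = sum(math.exp(s) for s in scores.values())
    return {w: math.exp(s) / z for w, s in scores.items()}  # softmax

p = skipgram_probs("ant")
print(round(sum(p.values()), 6))  # 1.0 -- softmax probabilities sum to one
```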

How to do query rewriting with word2vec?

  • term level: replace a query term with similar terms
  • phrase level: extract a phrase from the query, embed it, and replace it with similar phrases

Query Intent Extraction

  • Goal
    Generate sub-queries which best preserve the intent of the original query and allow us to retrieve more relevant ads
  • Approach
    A logistic regression classifier is used to determine the goodness of each sub-query
  • Intuition
    A historically good sub-query has more clicks on relevant ads that contain the terms of the sub-query

How to generate sub-queries?

  • remove stop words
  • generate n-grams as sub-queries (2 <= n <= N - 1, where N is the number of tokens)
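The two steps above can be sketched as follows; the stop-word list is a small assumed sample:

```python
STOP_WORDS = {"the", "a", "an", "of", "on", "for", "in"}  # assumed sample list

def sub_queries(query):
    # step 1: remove stop words
    tokens = [t for t in query.lower().split() if t not in STOP_WORDS]
    n_tokens = len(tokens)
    subs = []
    # step 2: n-grams with 2 <= n <= N - 1
    for n in range(2, n_tokens):
        for i in range(n_tokens - n + 1):
            subs.append(" ".join(tokens[i:i + n]))
    return subs

print(sub_queries("stella artois beer prices"))
# -> ['stella artois', 'artois beer', 'beer prices',
#     'stella artois beer', 'artois beer prices']
```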

How to quantify a good sub-query?
- Mutual Click Intent (MCI)

MCI Features

  • Click Intent Rank (CIR)
    CIR quantifies the contribution of each token to the query intent and indicates how important token v is in the query
    intuition: important tokens can generate good sub-queries
    example query: stella artois beer prices

CIR

Apply the PageRank algorithm over the query’s tokens to compute CIR.
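The notes do not specify how the token graph is built; a common choice (assumed here) is a graph over query tokens with co-occurrence-based edges, on which PageRank’s power iteration runs. The edge structure below is a made-up example for the query “stella artois beer prices”:

```python
def pagerank(links, damping=0.85, iters=50):
    """Power iteration; links maps node -> list of outgoing neighbors."""
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n, outs in links.items():
            share = damping * rank[n] / len(outs)
            for m in outs:
                new[m] += share
        rank = new
    return rank

# hypothetical token graph for "stella artois beer prices"
tokens = {
    "stella": ["artois", "beer"],
    "artois": ["stella", "beer"],
    "beer":   ["stella", "artois", "prices"],
    "prices": ["beer"],
}
cir = pagerank(tokens)
print(max(cir, key=cir.get))  # the best-connected token gets the highest CIR
```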
