Google/foobar - Spy snippets

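This article presents an algorithm problem: how to find the shortest snippet of a given document that contains all of the specified search terms. It covers the problem background, the specific requirements, and some initial ideas for a solution.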

Original link: http://hankerzheng.com/blog/google-foobar-spy-snippet

Question: 

Write a function called answer(document, searchTerms) which returns the shortest snippet of the document, containing all of the given search terms. The search terms can appear in any order.

The length of a snippet is the number of words in the snippet. For example, the length of the snippet "tastiest color of carrot" is 4. (Who doesn't like a delicious snack!)

The document will be a string consisting only of lower-case letters [a-z] and spaces. Words in the string will be separated by a single space. A word could appear multiple times in the document.
searchTerms will be a list of words, each word comprised only of lower-case letters [a-z]. All the search terms will be distinct.

Search terms must match words exactly, so "hop" does not match "hopping".

Return the first sub-string if multiple sub-strings are shortest. For example, if the document is "world there hello hello where world" and the search terms are ["hello", "world"], you must return "world there hello".

The document will be guaranteed to contain all the search terms.

The number of words in the document will be at least one, will not exceed 500, and each word will be 1 to 10 letters long. Repeat words in the document are considered distinct for counting purposes.
The number of words in searchTerms will be at least one, will not exceed 100, and each word will not be more than 10 letters long.

Languages
=========

To provide a Python solution, edit solution.py
To provide a Java solution, edit solution.java

Test cases
==========

Inputs:
    (string) document = "many google employees can program"
    (string list) searchTerms = ["google", "program"]
Output:
    (string) "google employees can program"

Inputs:
    (string) document = "a b c d a"
    (string list) searchTerms = ["a", "c", "d"]
Output:
    (string) "c d a"

Thoughts:

The first step is to find every keyword in the document and, at the same time, record the positions at which each keyword appears. That gives:

     # fragment from the body of answer(document, searchTerms)
     if len(searchTerms) == 0:
          return ""
     search_kw = []
     # pre-process the search terms, prune the repeating terms
     for kw in searchTerms:
          if kw not in search_kw:
               search_kw.append(kw)

     # generate search_dict: search_dict[keyword_index] = [posi1, posi2, ...]
     word_pool = document.split()
     search_dict = {k: [] for k in range(len(search_kw))}
     for i, word in enumerate(word_pool):
          for j, kw in enumerate(search_kw):
               if kw == word:
                    search_dict[j].append(i)
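
As a rough illustration of the pre-processing step above, the sketch below wraps it in a helper and prints the position lists for the second test case. The name `build_position_lists` is hypothetical (it does not appear in the original post), and using `dict.fromkeys` to drop repeated terms assumes Python 3.7+, where dicts keep insertion order.

```python
def build_position_lists(document, search_terms):
    """Return the word list and one sorted list of word indices per distinct term."""
    words = document.split()
    # Assumption: Python 3.7+, so dict.fromkeys keeps first-seen order
    # while dropping repeated search terms.
    terms = list(dict.fromkeys(search_terms))
    index_of = {term: j for j, term in enumerate(terms)}
    positions = [[] for _ in terms]
    for i, word in enumerate(words):
        j = index_of.get(word)  # exact match only: "hop" never matches "hopping"
        if j is not None:
            positions[j].append(i)
    return words, positions


words, positions = build_position_lists("a b c d a", ["a", "c", "d"])
print(positions)  # [[0, 4], [2], [3]]
```

Each inner list is already sorted because positions are appended in document order, which the later steps rely on.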
     
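The second step is to work on the resulting dictionary to obtain the string the problem asks for. The problem can therefore be abstracted as:
Given n arrays, choose one integer from each array to form a new array. How do we choose so that (max - min) of the new array is as small as possible?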
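
To make the abstract objective concrete, here is a minimal brute-force sketch that tries every combination of one position per array. It is only meant to pin down what "minimize max - min" means, not to be efficient; the function name `min_cover_brute_force` is hypothetical. Ties are broken by the smaller start, which mirrors the "return the first sub-string" rule.

```python
from itertools import product


def min_cover_brute_force(arrays):
    """Try every choice of one position per array; return the best (start, end)."""
    best = None
    for combo in product(*arrays):
        start, end = min(combo), max(combo)
        # Compare by range first, then by start, so the earliest shortest wins.
        candidate = (end - start, start, end)
        if best is None or candidate < best:
            best = candidate
    return best[1], best[2]


# Position lists for document "a b c d a" and terms ["a", "c", "d"].
print(min_cover_brute_force([[0, 4], [2], [3]]))  # (2, 4)
```

The result `(2, 4)` corresponds to the words at indices 2 through 4, that is, "c d a". The number of combinations grows as the product of the array lengths, so this is only usable on tiny inputs.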
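Idea 1:
Simply be greedy. The pseudo-code follows. The main idea: take one value from the first array as the reference (calibration) position, then from every other array pick the value closest to that reference. Repeat this for every value in the first array and output the selection with the smallest cover range.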
     min_len = infinity
     for cali_posi in arrays[0]:
          chosen_posi = [cali_posi]
          for array_index in range(1, len(arrays)):
               num = get_min_dist_to_cali(arrays[array_index], cali_posi)
               chosen_posi.append(num)
          length = cover_range(chosen_posi)
          if length < min_len:
               min_len = length
               final_chosen = [min(chosen_posi), max(chosen_posi)]
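But this method is wrong; this kind of greedy choice does not solve the problem.
If the input arrays are [4], [2,5], [1], the greedy result is [4,5,1] (range 5 - 1 = 4), while the global minimum is [4,2,1] (range 4 - 1 = 3).
Even changing the selection criterion from "closest to the reference" to "shortest overall string so far" does not help, because each choice depends on the choices made before it. As far as I can tell, neither this greedy selection nor a dynamic program over the arrays solves the problem.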
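
The counterexample can be reproduced with a small runnable version of the greedy pseudo-code. This is a sketch for illustration only; the name `greedy_cover` is hypothetical, and the helper simply inlines "closest to the reference" as a `min` with a distance key.

```python
def greedy_cover(arrays):
    """Greedy from Idea 1: fix a reference in arrays[0], pick the nearest value elsewhere."""
    best = None
    for cali_posi in arrays[0]:
        chosen = [cali_posi]
        for other in arrays[1:]:
            # Value in this array closest to the reference position.
            chosen.append(min(other, key=lambda p: abs(p - cali_posi)))
        candidate = (max(chosen) - min(chosen), chosen)
        if best is None or candidate < best:
            best = candidate
    return best


length, chosen = greedy_cover([[4], [2, 5], [1]])
print(length, chosen)  # 4 [4, 5, 1]

optimal = [4, 2, 1]
print(max(optimal) - min(optimal))  # 3
```

Greedy picks 5 because it is nearer to 4 than 2 is, but 5 stretches the range upward while the forced choice 1 already stretches it downward, so the combined range is worse.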

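Idea 2:
I asked 二妈 about this problem, and 二妈 suggested a method that is much better than brute force. The main idea: first sort every array, then pick the smallest number from each array to get an initial sequence n0 and compute its cover_range. Find the smallest element in n0 and advance it to the next value in its own array, which gives a new sequence n1; compute the cover_range of n1. Continue in the same way.
The loop terminates when the smallest element cannot be advanced any further.
The smallest of all the cover_range values is the answer to the problem.
It's too late today; I'll post the code tomorrow!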
===================================
Code update:
def answer(document, searchTerms):
    # Pre-processing: with nothing to search for, return an empty snippet.
    if len(searchTerms) == 0:
        return ""
    # Prune repeated search terms while keeping their order.
    search_kw = []
    for kw in searchTerms:
        if kw not in search_kw:
            search_kw.append(kw)
    # Build search_arrays: search_arrays[keyword_index] = [posi1, posi2, ...]
    word_pool = document.split()
    search_arrays = [[] for k in range(len(search_kw))]
    for i, word in enumerate(word_pool):
        for j, kw in enumerate(search_kw):
            if kw == word:
                search_arrays[j].append(i)
    if [] in search_arrays:
        # A keyword never appears in the document.
        return ""
    # cali_array[i] is the index currently chosen inside search_arrays[i].
    # Note: each array is already sorted, since positions were appended in document order.
    final_choices = []
    cali_array = [0 for x in range(len(search_arrays))]
    while True:
        try:
            choices = [search_arrays[i][item] for i, item in enumerate(cali_array)]
        except IndexError:
            # The smallest position cannot be advanced any further.
            break
        min_posi, max_posi = min(choices), max(choices)
        final_choices.append([max_posi - min_posi, min_posi, max_posi])  # [len, start, end]
        # Advance the array that currently holds the minimum position.
        array_index_for_min_posi = choices.index(min_posi)
        cali_array[array_index_for_min_posi] += 1
    # Lists compare lexicographically, so ties in length go to the earliest start.
    result = min(final_choices)
    return ' '.join(word_pool[result[1]:result[2]+1])
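
For comparison, here is a sketch of the same "advance the minimum" idea using a heap, so the current minimum is found without rescanning every array on each step. This is a variation for illustration and not the code from the original post: the name `answer_heap` is hypothetical, it assumes Python 3.7+ (for ordered `dict.fromkeys`), and it uses the standard-library `heapq` module.

```python
import heapq


def answer_heap(document, search_terms):
    """Shortest snippet containing all terms, via a min-heap over current positions."""
    words = document.split()
    terms = list(dict.fromkeys(search_terms))  # drop repeated terms, keep order
    index_of = {t: j for j, t in enumerate(terms)}
    positions = [[] for _ in terms]
    for i, w in enumerate(words):
        j = index_of.get(w)
        if j is not None:
            positions[j].append(i)
    if not terms or any(not p for p in positions):
        return ""
    # Heap entries: (word position, array index, index inside that array).
    heap = [(p[0], j, 0) for j, p in enumerate(positions)]
    heapq.heapify(heap)
    current_max = max(p[0] for p in positions)
    best = None
    while True:
        start, j, k = heap[0]
        candidate = (current_max - start, start, current_max)
        if best is None or candidate < best:
            best = candidate  # shorter, or same length with an earlier start
        if k + 1 == len(positions[j]):
            break  # the minimum cannot be advanced any further
        nxt = positions[j][k + 1]
        current_max = max(current_max, nxt)
        heapq.heapreplace(heap, (nxt, j, k + 1))
    return " ".join(words[best[1]:best[2] + 1])


print(answer_heap("many google employees can program", ["google", "program"]))
print(answer_heap("a b c d a", ["a", "c", "d"]))
print(answer_heap("world there hello hello where world", ["hello", "world"]))
```

On the three examples from the problem statement this prints "google employees can program", "c d a", and "world there hello". With at most 500 words the difference from the list-scanning version is unlikely to matter in practice; the heap mainly makes the "take the minimum, advance it" step explicit.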

