Storing hundreds of millions of simple key-value pairs in Redis

To solve the problem of quickly mapping 300 million photos back to their creators' user IDs, Instagram took advantage of Redis hashes for efficient storage and retrieval. By bucketing photo IDs and storing the mapping in hashes, memory usage dropped sharply, from 21GB to about 5GB, while keeping O(1) lookups.

Reposted from: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs

When transitioning systems, sometimes you have to build a little scaffolding. At Instagram, we recently had to do just that: for legacy reasons, we need to keep around a mapping of about 300 million photos back to the user ID that created them, in order to know which shard to query (see more info about our sharding setup). While eventually all clients and API applications will have been updated to pass us the full information, there are still plenty who have old information cached. We needed a solution that would:

  1. Look up keys and return values very quickly
  2. Fit the data in memory, and ideally within one of the EC2 high-memory types (the 17GB or 34GB, rather than the 68GB instance type)
  3. Fit well into our existing infrastructure
  4. Be persistent, so that we wouldn’t have to re-populate it if a server died

One simple solution to this problem would be to store them as a bunch of rows in a database, with “Media ID” and “User ID” columns. However, a SQL database seemed like overkill given that these IDs were never updated (only inserted), didn’t need to be transactional, and didn’t have any relations with other tables.

Instead, we turned to Redis, an advanced key-value store that we use extensively here at Instagram (for example, it powers our main feed). Redis is a key-value Swiss Army knife; rather than just the plain “set key, get key” mechanics of Memcached, it provides powerful aggregate types like sorted sets and lists. It has a configurable persistence model, where it background-saves at a specified interval, and can be run in a master-slave setup. All of our Redis deployments run in master-slave, with the slave set to save to disk about every minute.
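As a rough sketch of what that looks like in practice (the host addresses and save policy below are placeholders, not our actual configuration), a slave can be pointed at its master and told to snapshot roughly once a minute using standard Redis commands, driven here through redis-py:

import redis

# connect to the slave instance (placeholder address)
slave = redis.StrictRedis(host='10.0.0.2', port=6379)

# replicate from the master (placeholder address)
slave.slaveof('10.0.0.1', 6379)

# snapshot to disk if at least one key changed in the last 60 seconds
slave.config_set('save', '60 1')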

At first, we decided to use Redis in the simplest way possible: for each ID, the key would be the media ID, and the value would be the user ID:

SET media:1155315 939
GET media:1155315
> 939

While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to the 300,000,000 we would eventually need, it was looking to be around 21GB worth of data—already bigger than the 17GB instance type on Amazon EC2.
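A minimal sketch of this kind of measurement, assuming redis-py and a local Redis instance (the values here are dummy user IDs; the Gist linked at the end of this post has the script we actually used):

import redis

r = redis.StrictRedis(host='localhost', port=6379)

# write 1,000,000 plain string keys, pipelined in batches for speed
pipe = r.pipeline(transaction=False)
for media_id in range(1000000):
    pipe.set('media:%d' % media_id, media_id % 100000)  # dummy user ID
    if media_id % 10000 == 0:
        pipe.execute()
pipe.execute()

# ask Redis how much memory this layout uses
print(r.info()['used_memory_human'])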

We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that can be encoded in memory very efficiently; the Redis setting ‘hash-max-zipmap-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
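One way to see this threshold in action is to ask Redis how a hash is currently encoded; a minimal sketch, assuming redis-py, a local instance, and that the max-entries setting mentioned above has been raised to exactly 1000 (the encoding name reported varies by Redis version):

import redis

r = redis.StrictRedis(host='localhost', port=6379)
r.delete('mediabucket:test')

# 1000 fields: still within the limit, so the hash keeps its compact encoding
for i in range(1000):
    r.hset('mediabucket:test', str(i), i)
print(r.object('encoding', 'mediabucket:test'))  # e.g. 'zipmap' or 'ziplist'

# one field over the limit: Redis converts it to a regular hash table
r.hset('mediabucket:test', '1000', 1000)
print(r.object('encoding', 'mediabucket:test'))  # 'hashtable'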

To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key *within* the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):

HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"

The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each), Redis only needs 16MB to store the information. Expanding to 300 million keys, the total is just under 5GB, which in fact fits in the much cheaper m1.large instance type on Amazon, at about 1/3 the cost of the larger instance we would have needed otherwise. Best of all, lookups in hashes are still O(1), making them very quick.

If you’re interested in trying these combinations out, the script we used to run these tests is available as a Gist on GitHub (we also included Memcached in the script for comparison; it took about 52MB for the million keys). And if you’re interested in working on these sorts of problems with us, drop us a note, we’re hiring!

