Storing hundreds of millions of simple key-value pairs in Redis

To solve the problem of quickly mapping roughly 300 million photos back to the user IDs that created them, Instagram leaned on Redis hashes for efficient storage and retrieval. By bucketing the photo IDs and storing them in hashes, memory usage dropped sharply, from about 21GB to under 5GB, while keeping lookups O(1).

Reposted from: http://instagram-engineering.tumblr.com/post/12202313862/storing-hundreds-of-millions-of-simple-key-value-pairs

When transitioning systems, sometimes you have to build a little scaffolding. At Instagram, we recently had to do just that: for legacy reasons, we need to keep around a mapping of about 300 million photos back to the user ID that created them, in order to know which shard to query (see more info about our sharding setup). While eventually all clients and API applications will have been updated to pass us the full information, there are still plenty who have old information cached. We needed a solution that would:

  1. Look up keys and return values very quickly
  2. Fit the data in memory, and ideally within one of the EC2 high-memory types (the 17GB or 34GB, rather than the 68GB instance type)
  3. Fit well into our existing infrastructure
  4. Be persistent, so that we wouldn’t have to re-populate it if a server died

One simple solution to this problem would be to store them as a bunch of rows in a database, with “Media ID” and “User ID” columns. However, a SQL database seemed like overkill given that these IDs were never updated (only inserted), didn’t need to be transactional, and didn’t have any relations with other tables.

Instead, we turned to Redis, an advanced key-value store that we use extensively here at Instagram (for example, it powers our main feed). Redis is a key-value Swiss Army knife; rather than just normal “set key, get key” mechanics like Memcached, it provides powerful aggregate types like sorted sets and lists. It has a configurable persistence model, where it background-saves at a specified interval, and can be run in a master-slave setup. All of our Redis deployments run in master-slave, with the slave set to save to disk about every minute.
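For concreteness, a minimal redis.conf sketch of that kind of slave setup might look like the lines below; the exact snapshot interval and the master hostname here are illustrative assumptions, not values from this post.

# Take an RDB snapshot roughly every 60 seconds if at least one key changed (assumed interval)
save 60 1
# Replicate from the master (hypothetical hostname, default port)
slaveof redis-master.internal 6379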

At first, we decided to use Redis in the simplest way possible: for each ID, the key would be the media ID, and the value would be the user ID:

SET media:1155315 939
GET media:1155315
> 939
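As a rough illustration only, a minimal redis-py sketch of this per-key approach could look like the following; the connection details and helper names are ours, not from the original post.

import redis

# Connect to a local Redis instance (host/port are assumptions)
r = redis.Redis(host="localhost", port=6379)

def set_owner(media_id, user_id):
    # One top-level key per photo: "media:<id>" -> user ID
    r.set("media:%d" % media_id, user_id)

def get_owner(media_id):
    value = r.get("media:%d" % media_id)
    return int(value) if value is not None else None

set_owner(1155315, 939)
print(get_owner(1155315))  # 939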

While prototyping this solution, however, we found that Redis needed about 70 MB to store 1,000,000 keys this way. Extrapolating to the 300,000,000 we would eventually need, it was looking to be around 21GB worth of data—already bigger than the 17GB instance type on Amazon EC2.

We asked the always-helpful Pieter Noordhuis, one of Redis’ core developers, for input, and he suggested we use Redis hashes. Hashes in Redis are dictionaries that can be encoded in memory very efficiently; the Redis setting ‘hash-zipmap-max-entries’ configures the maximum number of entries a hash can have while still being encoded efficiently. We found this setting was best around 1000; any higher and the HSET commands would cause noticeable CPU activity. For more details, you can check out the zipmap source file.
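In the redis.conf of that era, the relevant knobs would look roughly like this; the 1000-entry threshold is the one from our testing above, while the per-value size limit shown is simply the assumed default.

# Keep a hash zipmap-encoded while it has at most 1000 entries
hash-zipmap-max-entries 1000
# ...and while each value stays small (assumed default limit, in bytes)
hash-zipmap-max-value 64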

To take advantage of the hash type, we bucket all our Media IDs into buckets of 1000 (we just take the ID, divide by 1000 and discard the remainder). That determines which key we fall into; next, within the hash that lives at that key, the Media ID is the lookup key *within* the hash, and the user ID is the value. An example, given a Media ID of 1155315, which means it falls into bucket 1155 (1155315 / 1000 = 1155):

HSET "mediabucket:1155" "1155315" "939"
HGET "mediabucket:1155" "1155315"
> "939"

The size difference was pretty striking; with our 1,000,000 key prototype (encoded into 1,000 hashes of 1,000 sub-keys each), Redis only needs 16MB to store the information. Expanding to 300 million keys, the total is just under 5GB—which in fact, even fits in the much cheaper m1.large instance type on Amazon, about 1/3 of the cost of the larger instance we would have needed otherwise. Best of all, lookups in hashes are still O(1), making them very quick.

If you’re interested in trying these combinations out, the script we used to run these tests is available as a Gist on GitHub (we also included Memcached in the script, for comparison—it took about 52MB for the million keys). And if you’re interested in working on these sorts of problems with us, drop us a note; we’re hiring!

