Scaling Digg and Other Web Applications

This article looks at how Digg handled large-scale traffic, adopting technologies such as MemcacheDB to solve its database write bottleneck, and shares its experience in distributed system design.

Joe Stump, Lead Architect at Digg, gave this presentation at the Web 2.0 Expo. I couldn't find the actual presentation, but fortunately Kris Jordan took some great notes. That's how key moments in history are accidentally captured forever. Joe was also kind enough to respond to my email questions with a phone call.

In this first part of the post Joe shares some timeless wisdom that you may or may not have read before. I of course take some pains to extract all the wit from the original presentation in favor of simple rules. What really struck me, however, was how Joe thought MemcacheDB will be the biggest new kid on the block in scaling. MemcacheDB has been around for a little while and I've never thought of it in that way. We'll learn why Joe is so excited by MemcacheDB at the end of the post.

Impressive Stats

80th-100th largest site in the world
26 million uniques a month
30 million users.
Uniques are only half that traffic. Traffic = unique web visitors + APIs + Digg buttons.
2 billion requests a month
13,000 requests a second, peak at 27,000 requests a second.
3 Sys Admins, 2 DBAs, 1 Network Admin, 15 coders, QA team
Lots of servers.

Scaling Strategies

Scaling is specialization. When off the shelf solutions no longer work at a certain scale you have to create systems that work for your particular needs.
Lesson of web 2.0: people love making crap and sharing it with the world.
Web 2.0 sucks for scalability. Web 1.0 was flat with a lot of static files. Additional load is handled by adding more hardware. Web 2.0 is heavily interactive. Content can be created at a crushing rate.
Languages don't scale. 100% of the time bottlenecks are in
IO. Bottlenecks aren't in the language when you are handling so many simultaneous requests. Making PHP 300% faster won't matter. Don't optimize PHP by using single quotes instead of double quotes when
the database is pegged.
Don’t share state. Decentralize. Partitioning is required to process a high number of requests in parallel.
Scale out instead of up. Expect failures. Just add boxes to scale and avoid the fail.
Database-driven sites need to be partitioned to scale both horizontally and vertically. Horizontal partitioning means storing a subset of rows on different machines. It is used when there's more data than will fit on one machine. Vertical partitioning means putting some columns in one table and some columns in another table. This allows you to add data to the system without downtime.
Data are separated into separate clusters: User Actions, Users, Comments, Items, etc.
Build a data access layer so partitioning is hidden behind an API.
With partitioning comes the CAP Theorem: you can only pick two of the following three: Strong Consistency, High Availability, Partition Tolerance.
Partitioned solutions require denormalization, which has become a big problem at Digg. Denormalization means data is copied into multiple objects and must be kept synchronized.
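The partitioning ideas above can be sketched as a thin data access layer that routes each row to a shard, so callers never see the partitioning. Everything here (the shard names, the modulo routing rule, the dict-backed store) is a hypothetical stand-in, not Digg's actual scheme.

```python
# Sketch of a data access layer hiding horizontal partitioning behind an API.
# Shard names and modulo routing are illustrative assumptions.

SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]

def shard_for(user_id: int) -> str:
    """Route a row to a shard; callers never see this decision."""
    return SHARDS[user_id % len(SHARDS)]

class UserStore:
    """The rest of the codebase talks to this API, not to any one database."""
    def __init__(self):
        # Dicts stand in for real per-shard database connections.
        self._data = {name: {} for name in SHARDS}

    def save(self, user_id: int, row: dict) -> None:
        self._data[shard_for(user_id)][user_id] = row

    def load(self, user_id: int) -> dict:
        return self._data[shard_for(user_id)][user_id]

store = UserStore()
store.save(42, {"name": "kevin"})
print(store.load(42))   # {'name': 'kevin'}
print(shard_for(42))    # users_db_2
```

Adding capacity means adding shards and re-routing; the callers of `UserStore` never change.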
MySQL replication is used to scale out reads.
Use an asynchronous queuing architecture for near-term processing.
- This approach pushes chunks of processing to another service and lets that service schedule the processing on a grid of processors.
- It's faster and more responsive than cron and only slightly less responsive than real-time.
- For example, issuing 5 synchronous database requests slows you down. Do them in parallel.
- Digg uses Gearman. An example use is to get a permalink. Three operations are done in parallel: get the currently logged-in user, get the permalink, and grab the comments. All three are then combined into a single answer returned to the client. It's also used for site crawling and logging. It's a different way of thinking.
- See Flickr - Do the Essential Work Up-front and Queue the Rest and The Canonical Cloud Architecture for more information.
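The parallel fan-out described above can be sketched with a thread pool standing in for Gearman workers; the three fetch functions are hypothetical stand-ins for real service calls.

```python
# Sketch of the fan-out/fan-in pattern: run the three lookups at once
# instead of serially, then merge one combined answer for the client.
from concurrent.futures import ThreadPoolExecutor

def get_current_user():
    return {"user": "joe"}              # stand-in: look up the logged-in user

def get_permalink():
    return {"permalink": "/story/123"}  # stand-in: resolve the permalink

def get_comments():
    return {"comments": ["first!"]}     # stand-in: grab the comments

# Fan out: submit all three jobs to the pool at the same time.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(f) for f in (get_current_user, get_permalink, get_comments)]
    response = {}
    for fut in futures:
        response.update(fut.result())   # fan in: merge into one answer

print(response)
```

The total latency is roughly the slowest of the three calls rather than their sum.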
Bottlenecks are in IO so you have to tune the database. When the database is bigger than RAM the disk is hit all the time, which kills performance. As the database gets larger the table can't be scanned anymore. So you have to:
- denormalize
- avoid joins
- avoid large scans across databases by partitioning
- cache
- add read slaves
- don't use NFS
Run numbers before you try and fix a problem to make sure things actually will work.
Files like icons and photos are handled by using MogileFS, a distributed file system. DFSs support high request rates because files are distributed and replicated around a network.
Cache forever and explicitly expire.
Cache fairly static content in a file based cache.
Cache changeable items in memcached.
Cache rarely changed items in APC. APC is a local cache. It's not distributed, so no other program has access to the values.
For caching use the Chain of Responsibility pattern. Cache in MySQL, memcached, APC, and PHP globals. First check PHP globals as the fastest cache. If not present check APC, memcached and on up the chain.
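The caching chain above can be sketched as a Chain of Responsibility: check the fastest tier first, fall through toward MySQL, and backfill the faster tiers on the way out. Plain dicts stand in for PHP globals, APC, memcached, and the database.

```python
# Sketch of a Chain of Responsibility cache lookup. Dicts stand in for
# the real tiers: PHP globals, APC, memcached, MySQL (fastest first).
php_globals, apc, memcached_tier, mysql = {}, {}, {}, {"story:1": "cached page"}
CHAIN = [php_globals, apc, memcached_tier, mysql]

def cached_get(key):
    for i, tier in enumerate(CHAIN):
        if key in tier:
            value = tier[key]
            for faster in CHAIN[:i]:   # backfill the faster tiers on the way out
                faster[key] = value
            return value
    return None  # a real app would compute the value here and cache it forever

print(cached_get("story:1"))      # served from the "MySQL" tier the first time
print("story:1" in php_globals)   # True: now warmed in the fastest tier
```

The second lookup for the same key never leaves process memory.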
Digg's recommendation engine is a custom graph database that is eventually consistent. Eventually consistent means that writes to one partition will eventually make it to all the other partitions. After a write, reads made one after another don't have to return the same value, as they could be handled by different partitions. This is a more relaxed constraint than strict consistency, which means changes must be visible at all partitions simultaneously and reads made one after another will always return the same value.
Assume 1 million people a day will bang on any new feature, so make it scalable from the start. Example: the About page on Digg did a live query against the master database to show all employees. It was just a quick hack to get the page out. Then a spider went crazy and took the site down.

Miscellaneous

Digg buttons were a major key to generating traffic.
Uses Debian Linux , Apache , PHP, MySQL.
Pick a language you enjoy developing in, pick a coding standard, add inline documentation that's extractable, use a code repository, and a bug tracker. Likes PHP, Trac, and SVN.
You are only as good as your people. You have to trust the guy next to you that he's doing his job. To cultivate trust, empower people to make decisions. Trust that people have it handled and they'll take care of it. That cuts down on meetings because you know people will do the job right.
Completely a Mac shop.
Almost all developers are local. Some people are remote to offer 24 hour support.
Joe's approach is pragmatic. He doesn't have a language fetish. People went from PHP, to Python/Ruby, to Erlang. Uses vim. Develops from the command line. Has no idea how people constantly change tool sets all the time. It's not very productive.
Services (SOA) decoupling is a big win. Digg uses REST. Internal services return a vanilla structure that's mapped to JSON, XML, etc. Put the version in the URL because it costs you nothing, for example: /1.0/service/id/xml. Version both internal and external services.
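The versioned-URL idea can be sketched as below. The route shape and field names are illustrative; the point is that internal services build one vanilla structure, with the requested format only a final serialization step.

```python
# Sketch of version-in-URL routing like /1.0/service/id/xml.
# Route shape and payload fields are illustrative assumptions.
import json

def handle(path: str) -> str:
    version, service, obj_id, fmt = path.strip("/").split("/")
    # The "vanilla structure" every internal service returns.
    payload = {"service": service, "id": obj_id, "version": version}
    if fmt == "xml":
        fields = "".join(f"<{k}>{v}</{k}>" for k, v in payload.items())
        return f"<response>{fields}</response>"
    return json.dumps(payload)  # default to JSON

print(handle("/1.0/stories/42/xml"))
print(handle("/1.0/stories/42/json"))
```

Bumping to /2.0/ later leaves every /1.0/ client untouched, which is why versioning up front costs nothing.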
People don't understand how many moving parts are in a website. Something is going to happen and it will go down.

MemcacheDB: Evolutionary Step for Code, Revolutionary Step for Performance

Imagine Kevin Rose, the founder of Digg, who at the time of this presentation had 40,000 followers. If Kevin diggs just once a day that's 40,000 writes. As the most active diggers are the most followed it becomes a huge performance bottleneck. Two problems appear.

You can't update 40,000 follower accounts at once. Fortunately the queuing system we talked about earlier takes care of that.

The second problem is the huge number of writes that happen. Digg has a write problem. If the average user has 100 followers that's 300 million diggs a day. That's 3,000 writes per second, 7GB of storage per day, and 5TB of data spread across 50 to 60 servers.
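A quick back-of-envelope check of those numbers:

```python
# Rough arithmetic behind the write-rate claim above.
writes_per_day = 300_000_000
seconds_per_day = 24 * 60 * 60   # 86,400
# Averages out to roughly 3,500 writes/second, in the same ballpark as
# the quoted 3,000/second (peaks would run higher than the average).
print(round(writes_per_day / seconds_per_day))
```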

With such a heavy write load MySQL wasn't going to work for Digg. That's where MemcacheDB comes in. In initial tests on a laptop MemcacheDB was able to handle 15,000 writes a second. MemcacheDB's own benchmark shows it capable of 23,000 writes/second and 64,000 reads/second. At those write rates it's easy to see why Joe was so excited about MemcacheDB's ability to handle their digg deluge.

What is MemcacheDB? It's a distributed key-value storage system designed for persistence. It is NOT a cache solution, but a persistent storage engine for fast and reliable key-value based object storage and retrieval. It conforms to the memcache protocol (not completely), so any memcached client can connect to it. MemcacheDB uses Berkeley DB as a storage backend, so lots of features including transactions and replication are supported.

Before you get too excited keep in mind this is a key-value store. You read and write records by a single key. There aren't multiple indexes and there's no SQL. That's why it can be so fast.

Digg uses MemcacheDB to scale out the huge number of writes that happen when data is denormalized. Remember it's a key-value store. The value is usually a complete application level object merged together from a possibly large number of normalized tables. Denormalizing introduces redundancies because you are keeping copies of data in multiple records instead of just one copy in a nicely normalized table. So denormalization means a lot more writes as data must be copied to all the records that contain a copy. To keep up they needed a database capable of handling their write load. MemcacheDB has the performance, especially when you layer memcached's normal partitioning scheme on top.
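The denormalized write pattern can be sketched as below. A plain dict stands in for MemcacheDB (get/set by a single key, no SQL, no secondary indexes), and the follower ids and key format are made up for illustration.

```python
# Sketch of denormalized fan-out writes into a key-value store.
# The dict stands in for MemcacheDB behind a memcached-protocol client.
import json

user_row = {"id": 7, "name": "kevin"}
digg_row = {"item": "story:123", "digger_id": 7}

kv_store = {}  # stand-in for MemcacheDB: one key, one value, nothing else

def record_digg(user, digg):
    # Merge normalized rows into one application-level object...
    obj = {"digger": user["name"], "item": digg["item"]}
    # ...then copy it into every follower's feed: one write per follower.
    for follower_id in (101, 102, 103):     # hypothetical follower ids
        key = f"feed:{follower_id}:{digg['item']}"
        kv_store[key] = json.dumps(obj)

record_digg(user_row, digg_row)
print(len(kv_store))  # 3 writes for 3 followers: why the write load explodes
```

With 40,000 followers the same digg becomes 40,000 writes, which is exactly the load the queue plus MemcacheDB combination is meant to absorb.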

I asked Joe why he didn't turn to one of the in-memory data grid solutions. Some of the reasons were:

This data is generated from many different databases and takes a long time to generate. So they want it in a persistent store.
MemcacheDB uses the memcache protocol. Digg already uses memcache so it's a no-brainer to start using MemcacheDB. It's easy to use and easy to setup.
Operations is happy with deploying it into the datacenter as it's not a new setup.
They already have memcached high availability and failover code so that stuff already works.
Using a new system would require more ramp-up time.
If there are any problems with the code you can take a look. It's all open source.
Not sure those other products are stable enough.

So it's an evolutionary step for code and a revolutionary step for performance. Digg is looking at using MemcacheDB across the board.
