NoSQL Ecosystem

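This article analyzes the challenges facing traditional relational databases and explains the background and characteristics of the rise of NoSQL databases, including their advantages for large-scale data storage and processing. It discusses how NoSQL databases differ in data model, query method, and persistence design, and emphasizes choosing among them according to business needs. It also mentions The Rackspace Cloud's contributions to the NoSQL ecosystem and its outlook.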


Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years.  Collectively, these alternatives have become known as “NoSQL databases.”

The fundamental problem is that relational databases cannot handle many modern workloads.  There are three specific problem areas: scaling out to data sets like Digg’s (3 TB for green badges) or Facebook’s (50 TB for inbox search) or eBay’s (2 PB overall), per-server performance, and rigid schema design.

Businesses, including The Rackspace Cloud, need to find new ways to store and scale large amounts of data. I recently wrote a post on Cassandra, a non-relational database we have committed resources to. There are other non-relational databases being worked on, and collectively we call this the “NoSQL movement.”

The “NoSQL” term was actually coined by a fellow Racker, Eric Evans, when Johan Oskarsson of Last.fm wanted to organize an event to discuss open source distributed databases. The name and concept both caught on.

Some people object to the NoSQL term because it sounds like we’re defining ourselves based on what we aren’t doing rather than what we are. That’s true, to a degree, but the term is still valuable because when a relational database is the only tool you know, every problem looks like a thumb.  NoSQL is making people aware that there are other options out there. But we’re not against relational databases when they really are the best tool for the job; it’s “Not Only SQL,” rather than “No SQL at all.”

One real concern with the NoSQL name is that it’s such a big tent that there is room for very different designs.  If this is not made clear when discussing the various products, it results in confusion.  So I’d like to suggest three axes along which to think about the many database options: scalability, data and query model, and persistence design.

I have chosen 10 NoSQL databases as examples.  This is not an exhaustive list, but the concepts discussed are crucial for evaluating others as well.

Scalability

Scaling reads is easy with replication, so when we’re talking about scaling in this context, we mean scaling writes by automatically partitioning data across multiple machines.  We call systems that do this “distributed databases.”  These include Cassandra, HBase, Riak, Scalaris, Voldemort, and more.  If your write volume or data size is more than one machine can handle, then these are your only options if you don’t want to manage partitioning manually.  (You don’t.)
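To make “automatically partitioning data across multiple machines” concrete, here is a minimal sketch of one common approach, a consistent-hash ring. This is an illustration under my own assumptions, not how any of the databases above necessarily implement it; the `HashRing` class, the node names, and the `vnodes` parameter are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable hash so every client maps a key to the same ring position.
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring (hypothetical names, not a real product's API)."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several positions so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # The first position at or after the key's hash owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

# Adding a machine to the "live cluster" only reassigns keys to the new node.
ring.add_node("node-d")
moved = [k for k in keys if ring.node_for(k) != before[k]]
```

The point of the design is in the last lines: when a node joins, only the keys that now belong to the new node change owner, so the application does not have to reshuffle everything by hand.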

There are two things to look for in a distributed database: 1) support for multiple datacenters and 2) the ability to add new machines to a live cluster transparently to your applications.

Non-distributed NoSQL databases include CouchDB, MongoDB, Neo4j, Redis, and Tokyo Cabinet.  These can serve as persistence layers for distributed systems; MongoDB provides limited support for sharding, as does a separate Lounge project for CouchDB, and Tokyo Cabinet can be used as a Voldemort storage engine.

Data and Query Model

There is a lot of variety in the data models and query APIs in NoSQL databases.

(Respective Links: Thrift, map/reduce views, Thrift, Cursor, Graph, Collection, Nested hashes, get/put, get/put, get/put)

Some highlights:

The columnfamily model shared by Cassandra and HBase is inspired by the one described in Google’s Bigtable paper, section 2.  (Cassandra drops historical versions, and adds supercolumns.) In both systems, you have rows and columns like you are used to seeing, but the rows are sparse: each row can have as many or as few columns as desired, and columns do not need to be defined ahead of time.
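As a rough illustration of what “sparse rows” means, the sketch below models a columnfamily as a map from row key to a map of columns. It is a toy under my own assumptions (the `ColumnFamily` class and its methods are hypothetical, and it ignores timestamps, supercolumns, and storage entirely); it is not the Cassandra or HBase API.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy sparse row store: row key -> {column name -> value}."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def insert(self, row_key, column, value):
        # Columns come into existence on first write; no schema is declared.
        self._rows[row_key][column] = value

    def get_row(self, row_key):
        # Return only the columns this row actually has, sorted by name.
        return dict(sorted(self._rows.get(row_key, {}).items()))


users = ColumnFamily()
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Austin")
users.insert("bob", "email", "bob@example.com")  # bob has no "city" column
```

Here `alice` has two columns and `bob` has one, and neither row pays any cost for columns it does not use.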

The key/value model is the simplest and easiest to implement but inefficient when you are only interested in querying or updating part of a value.  It’s also difficult to implement more sophisticated structures on top of distributed key/value.

Document databases are essentially the next level of key/value, allowing nested values associated with each key.  Document databases support querying those nested values more efficiently than simply returning the entire blob each time.
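The difference between the two models can be sketched with plain Python dictionaries. This is only an illustration under my own assumptions: `kv_store`, `doc_store`, and the three helper functions are hypothetical names, and real document databases use indexes rather than the linear scan shown here.

```python
import json

kv_store = {}   # key -> opaque serialized blob
doc_store = {}  # key -> nested dict the store can see into


def kv_update_field(key, field, value):
    # Opaque values force a read-modify-write of the entire blob.
    record = json.loads(kv_store[key])
    record[field] = value
    kv_store[key] = json.dumps(record)


def doc_update_field(key, path, value):
    # A document store can walk a dotted path and touch one nested field.
    node = doc_store[key]
    *parents, leaf = path.split(".")
    for part in parents:
        node = node.setdefault(part, {})
    node[leaf] = value


def doc_find(path, expected):
    # Query on a nested field; the caller never deserializes whole values.
    matches = []
    for key, doc in doc_store.items():
        node = doc
        for part in path.split("."):
            node = node.get(part) if isinstance(node, dict) else None
        if node == expected:
            matches.append(key)
    return matches


profile = {"name": "Ann", "address": {"city": "Austin"}}
kv_store["u1"] = json.dumps(profile)
doc_store["u1"] = {"name": "Ann", "address": {"city": "Austin"}}

kv_update_field("u1", "name", "Anna")             # rewrites the whole blob
doc_update_field("u1", "address.city", "Dallas")  # touches one nested field
```

With key/value, changing one field means shipping the whole value back and forth; with documents, the store understands the structure well enough to update or query a single nested field.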

Neo4j has a unique data model, storing objects and relationships as nodes and edges in a graph.  For queries that fit this model (e.g., hierarchical data) it can be 1000s of times faster than alternatives.
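To show the kind of hierarchical query that fits a graph model, here is a small sketch that stores relationships as labeled edges and walks them breadth-first. It is not Neo4j’s API; the `Graph` class, the `MANAGES` and `KNOWS` relationship names, and the sample data are all made up for illustration.

```python
from collections import defaultdict, deque


class Graph:
    """Toy property-less graph: node -> [(relationship, node), ...]."""

    def __init__(self):
        self._edges = defaultdict(list)

    def relate(self, src, relationship, dst):
        self._edges[src].append((relationship, dst))

    def descendants(self, start, relationship):
        # Breadth-first walk that follows only one relationship type.
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for rel, nxt in self._edges.get(node, []):
                if rel == relationship and nxt not in seen:
                    seen.add(nxt)
                    order.append(nxt)
                    queue.append(nxt)
        return order


org = Graph()
org.relate("ceo", "MANAGES", "vp1")
org.relate("ceo", "MANAGES", "vp2")
org.relate("vp1", "MANAGES", "eng1")
org.relate("vp2", "KNOWS", "eng1")

reports = org.descendants("ceo", "MANAGES")
```

Each hop follows edges directly from the current node, which is why this shape of query suits a graph store better than repeated self-joins on a table.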

Scalaris is unique in offering distributed transactions across multiple keys.  (Discussing the trade-offs between consistency and availability is beyond the scope of this post, but that is another aspect to keep in mind when evaluating distributed systems.)

Persistence Design

By persistence design I mean, “how is data stored internally?”

The persistence model tells us a lot about what kind of workloads these databases will be good at.

In-memory databases are very, very fast (Redis achieves over 100,000 operations per second on a single machine), but cannot work with data sets that exceed available RAM.  Durability (retaining data even if a server crashes or loses power) can also be a problem; the amount of data you can expect to lose between flushes (copying the data to disk) is potentially large.  Scalaris, the other in-memory database on our list, tackles the durability problem with replication, but since it does not support multiple data centers your data will still be vulnerable to things like power failures.

Memtables and SSTables buffer writes in memory (a “memtable”) after writing to an append-only commit log for durability.  When enough writes have been accepted, the memtable is sorted and written to disk all at once as an “sstable.”  This provides close to in-memory performance since no seeks are involved, while avoiding the durability problems of purely in-memory approaches.  (This is described in more detail in sections 5.3 and 5.4 of the previously-referenced Bigtable paper, as well as in The log-structured merge-tree.)
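The write path described above can be sketched in a few dozen lines. This is a deliberately simplified toy, not the Bigtable, Cassandra, or HBase implementation: the `TinyLSM` class, the JSON file format, and the `memtable_limit` threshold are my own assumptions, and it keeps a copy of each sstable in memory and skips compaction and crash recovery to stay short.

```python
import bisect
import json
import os
import tempfile


class TinyLSM:
    """Toy memtable/sstable store; all names here are hypothetical."""

    def __init__(self, directory, memtable_limit=3):
        self.directory = directory
        self.memtable_limit = memtable_limit
        self.memtable = {}
        self.sstables = []  # newest last; each is a sorted list of (key, value)
        self.log_path = os.path.join(directory, "commit.log")

    def put(self, key, value):
        # 1) Append to the commit log first so the write survives a crash.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps([key, value]) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2) Then apply the write to the in-memory table.
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once and write sequentially: no random seeks on the write path.
        rows = sorted(self.memtable.items())
        path = os.path.join(self.directory, f"sstable-{len(self.sstables)}.json")
        with open(path, "w", encoding="utf-8") as out:
            json.dump(rows, out)
        self.sstables.append(rows)
        self.memtable = {}
        open(self.log_path, "w").close()  # flushed data no longer needs the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Newest sstable wins; binary search works because rows are sorted.
        for rows in reversed(self.sstables):
            i = bisect.bisect_left(rows, (key,))
            if i < len(rows) and rows[i][0] == key:
                return rows[i][1]
        return None


with tempfile.TemporaryDirectory() as tmp:
    db = TinyLSM(tmp, memtable_limit=2)
    db.put("b", "1")
    db.put("a", "2")   # second write reaches the limit and triggers a flush
    db.put("a", "3")   # newer value lives in the memtable and shadows the sstable
    latest_a = db.get("a")
    flushed_b = db.get("b")
    sstable_written = os.path.exists(os.path.join(tmp, "sstable-0.json"))
```

Every write is an append to the log plus an in-memory update, and disk writes of the data itself happen in large sorted batches, which is where the “no seeks” property comes from.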

B-Trees have been used in databases since practically the beginning of time.  They provide robust indexing support, but performance is poor on rotational disks (which are still by far the most cost-effective) because of the multiple seeks involved in reading or writing anything.

An interesting variant is CouchDB’s append-only B-Trees, which avoid the overhead of seeks at the cost of limiting CouchDB to one write at a time.

Conclusion

The NoSQL movement has exploded in 2009 as an increasing number of businesses wrestle with large data volumes.  The Rackspace Cloud is pleased to have played an early role in the NoSQL movement, and continues to commit resources to Cassandra and support events like NoSQL East.

NoSQL conference announcements and related discussion can be found on the Google discussion group.

From http://www.rackspace.com/blog/nosql-ecosystem/
