关于NoSQL的一些理论

本文探讨了大数据的三大特征:容量、种类和速度,并详细介绍了NoSQL数据库的发展背景及四种主要类型:键值存储、列族存储、文档存储和图数据库。此外,还对比了ACID与BASE数据存储模型的特点,并解释了CAP理论在分布式系统中的应用。

1.     The three Vs of big data

Volume: High volumes of data ranging from dozens of terabytes, and even petabytes.

Variety: Data that's organized in multiple structures, ranging from raw text (which, from a computer's perspective, has little or no discernible structure — many people call this unstructured data) to log files (commonly referred to as being semistructured) to data ordered in strongly typed rows and columns (structured data). To make things even more confusing, some data sets even include portions of all three kinds of data. (This is known as multistructured data.)

Velocity: Data that enters your organization and has some kind of value for a limited window of time — a window that usually shuts well before the data has been transformed and loaded into a data warehouse for deeper analysis (for example, financial securities ticker data, which may reveal a buying opportunity, but only for a short while). The higher the volumes of data entering your organization per second, the bigger your velocity challenge.

2.     NoSQL Theories

NoSQL Data Stores

NoSQL data stores originally subscribed to the notion "Just Say No to SQL" (to paraphrase from an anti-drug advertising campaign in the 1980s), and they were a reaction to the perceived limitations of (SQL-based) relational databases. It's not that these folks hated SQL, but they were tired of forcing square pegs into round holes by solving problems that relational databases weren't designed for. A relational database is a powerful tool, but for some kinds of data (like key-value pairs, or graphs) and some usage patterns (like extremely large scale storage) a relational database just isn't practical. And when it comes to high-volume storage, relational database can be expensive, both in terms of database license costs and hardware costs. (Relational databases are designed to work with enterprise-grade hardware.) So, with the NoSQL movement, creative programmers developed dozens of solutions for different kinds of thorny data storage and processing problems. These NoSQL databases typically provide massive scalability by way of clustering, and are often designed to enable high throughput and low latency.

 REMEMBER  The name NoSQL is somewhat misleading because many databases that fit the category do have SQL support (rather than "NoSQL" support). Think of its name instead as "Not Only SQL."

The NoSQL offerings available today can be broken down into four distinct categories, based on their design and purpose:

·         Key-value stores: This offering provides a way to store any kind of data without having to use a schema. This is in contrast to relational databases, where you need to define the schema (the table structure) before any data is inserted. Since key-value stores don't require a schema, you have great flexibility to store data in many formats. In a key-value store, a row simply consists of a key (an identifier) and a value, which can be anything from an integer value to a large binary data string. Many implementations of key-value stores are based on Amazon's Dynamo paper.

·         Column family stores: Here you have databases in which columns are grouped into column families and stored together on disk.

 TECHNICAL STUFF  Strictly speaking, many of these databases aren't column-oriented, because they're based on Google's BigTable paper, which stores data as a multidimensional sorted map. (For more on the role of Google's BigTable paper on database design, see Chapter 12.)

·         Document stores: This offering relies on collections of similarly encoded and formatted documents to improve efficiencies. Document stores enable individual documents in a collection to include only a subset of fields, so only the data that's needed is stored. For sparse data sets, where many fields are often not populated, this can translate into significant space savings. By contrast, empty columns in relational database tables do take up space. Document stores also enables schema flexibility, because only the fields that are needed are stored, and new fields can be added. Again, in contrast to relational databases, table structures are defined up front before data is stored, and changing columns is a tedious task that impacts the entire data set.

·         Graph databases: Here you have databases that store graph structures — representations that show collections of entities (vertices or nodes) and their relationships (edges) with each other. These structures enable graph databases to be extremely well suited for storing complex structures, like the linking relationships between all known web pages. (For example, individual web pages are nodes, and the edges connecting them are links from one page to another.) Google, of course, is all over graph technology, and invented a graph processing engine called Pregel to power its PageRank algorithm. (And yes, there's a white paper on Pregel.) In the Hadoop community, there's an Apache project called Giraph (based on the Pregel paper), which is a graph processing engine designed to process graphs stored in HDFS.

 REMEMBER  The data storage and processing options available in Hadoop are in many cases implementations of the NoSQL categories listed here. This will help you better evaluate solutions that are available to you and see how Hadoop can complement traditional data warehouses.

ACID versus BASE Data Stores

One hallmark of relational database systems is something known as ACID compliance. As you might have guessed, ACID is an acronym — the individual letters, meant to describe a characteristic of individual database transactions, can be expanded as described in this list:

·         Atomicity: The database transaction must completely succeed or completely fail. Partial success is not allowed.

·         Consistency: During the database transaction, the RDBMS progresses from one valid state to another. The state is never invalid.

·         Isolation: The client's database transaction must occur in isolation from other clients attempting to transact with the RDBMS.

·         Durability: The data operation that was part of the transaction must be reflected in nonvolatile storage (computer memory that can retrieve stored information even when not powered – like a hard disk) and persist after the transaction successfully completes. Transaction failures cannot leave the data in a partially committed state.

Certain use cases for RDBMSs, like online transaction processing, depend on ACID-compliant transactions between the client and the RDBMS for the system to function properly. A great example of an ACID-compliant transaction is a transfer of funds from one bank account to another. This breaks down into two database transactions, where the originating account shows a withdrawal, and the destination account shows a deposit. Obviously, these two transactions have to be tied together in order to be valid so that if either of them fail, the whole operation must fail to ensure both balances remain valid.

Hadoop itself has no concept of transactions (or even records, for that matter), so it clearly isn't an ACID-compliant system. Thinking more specifically about data storage and processing projects in the entire Hadoop ecosystem (we tell you more about these projects later in this chapter), none of them is fully ACID-compliant, either. However, they do reflect properties that you often see in NoSQL data stores, so there is some precedent to the Hadoop approach.

One key concept behind NoSQL data stores is that not every application truly needs ACID-compliant transactions. Relaxing on certain ACID properties (and moving away from the relational model) has opened up a wealth of possibilities, which have enabled some NoSQL data stores to achieve massive scalability and performance for their niche applications. Whereas ACID defines the key characteristics required for reliable transaction processing, the NoSQL world requires different characteristics to enable flexibility and scalability. These opposing characteristics are cleverly captured in the acronym BASE:

·         Basically Available: The system is guaranteed to be available for querying by all users. (No isolation here.)

·         Soft State: The values stored in the system may change because of the eventual consistency model, as described in the next bullet.

·         Eventually Consistent: As data is added to the system, the system's state is gradually replicated across all nodes. For example, in Hadoop, when a file is written to the HDFS, the replicas of the data blocks are created in different data nodes after the original data blocks have been written. For the short period before the blocks are replicated, the state of the file system isn't consistent.

The acronym BASE is a bit contrived, as most NoSQL data stores don't completely abandon all the ACID characteristics — it's not really the polar opposite concept that the name implies, in other words. Also, the Soft State and Eventually Consistent characteristics amount to the same thing, but the point is that by relaxing consistency, the system can horizontally scale (many nodes) and ensure availability.

CAP Theory

TECHNICAL STUFF  No discussion of NoSQL would be complete without mentioning the CAP theorem, which represents the three kinds of guarantees that architects aim to provide in their systems:

·         Consistency: Similar to the C in ACID, all nodes in the system would have the same view of the data at any time.

·         Availability: The system always responds to requests.

·         Partition tolerance: The system remains online if network problems occur between system nodes.

The CAP theorem states that in distributed networked systems, architects have to choose two of these three guarantees — you can't promise your users all three. That leaves you with the three possibilities shown in Figure 11-1:

·         Systems using traditional relational technologies normally aren't partition tolerant, so they can guarantee consistency and availability. In short, if one part of these traditional relational technologies systems is offline, the whole system is offline.

·         Systems where partition tolerance and availability are of primary importance can't guarantee consistency, because updates (that destroyer of consistency) can be made on either side of the partition. The key-value stores Dynamo and CouchDB and the column-family store Cassandra are popular examples of partition tolerant/availability (PA) systems.

·         Systems where partition tolerance and consistency are of primary importance can't guarantee availability because the systems return errors until the partitioned state is resolved.

 REMEMBER  Hadoop-based data stores are considered CP systems (consistent and partition tolerant). With data stored redundantly across many slave nodes, outages to large portions (partitions) of a Hadoop cluster can be tolerated. Hadoop is considered to be consistent because it has a central metadata store (the NameNode) which maintains a single, consistent view of data stored in the cluster. We can't say that Hadoop guarantees availability, because if the NameNode fails applications cannot access data in the cluster.

 

在自媒体领域,内容生产效率与作品专业水准日益成为从业者的核心关切。近期推出的Coze工作流集成方案,为内容生产者构建了一套系统化、模块化的创作支持体系。该方案通过预先设计的流程模块,贯穿选题构思、素材整理、文本撰写、视觉编排及渠道分发的完整周期,显著增强了自媒体工作的规范性与产出速率。 经过多轮实践验证,这些标准化流程不仅精简了操作步骤,减少了机械性任务的比重,还借助统一的操作框架有效控制了人为失误。由此,创作者得以将主要资源集中于内容创新与深度拓展,而非消耗于日常执行事务。具体而言,在选题环节,系统依据实时舆情数据与受众偏好模型生成热点建议,辅助快速定位创作方向;在编辑阶段,则提供多套经过验证的版式方案与视觉组件,保障内容呈现兼具美学价值与阅读流畅性。 分发推广模块同样经过周密设计,整合了跨平台传播策略与效果监测工具,涵盖社交网络运营、搜索排序优化、定向推送等多重手段,旨在帮助内容突破单一渠道局限,实现更广泛的受众触达。 该集成方案在提供成熟模板的同时,保留了充分的定制空间,允许用户根据自身创作特性与阶段目标调整流程细节。这种“框架统一、细节可变”的设计哲学,兼顾了行业通用标准与个体工作习惯,提升了工具在不同应用场景中的适应性。 从行业视角观察,此方案的问世恰逢其时,回应了自媒体专业化进程中对于流程优化工具的迫切需求。其价值不仅体现在即时的效率提升,更在于构建了一个可持续迭代的创作支持生态。通过持续吸纳用户反馈与行业趋势,系统将不断演进,助力从业者保持与行业发展同步,实现创作质量与运营效能的双重进阶。 总体而言,这一工作流集成方案的引入,标志着自媒体创作方法向系统化、精细化方向的重要转变。它在提升作业效率的同时,通过结构化的工作方法强化了内容产出的专业度与可持续性,为从业者的职业化发展提供了坚实的方法论基础。 资源来源于网络分享,仅用于学习交流使用,请勿用于商业,如有侵权请联系我删除!
评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值