The Difference Between Trees and Hash Tables

本文详细解析了Java中哈希表(HashMap和HashSet)与树(TreeMap和TreeSet)的区别,包括它们的内部实现、性能特点及适用场景。哈希表提供了快速的插入、删除和查找操作,而树结构则适用于需要保持数据有序或进行复杂搜索的情况。

The Difference Between Trees and Hash Tables

Last updated Mar 14, 2003.

Java provides two primary Set implementations: the TreeSet and the HashSet. Furthermore, it provides two primary Map implementations: the TreeMap and the HashMap. When would you want to use one over the other? What’s the difference?

The short answer is that the TreeMap and TreeSet classes store their data (or keys in the case of the TreeMap) in a Tree whereas the HashMap and HashSet store their data in a Hash Table. The longer, and more important answer, lies in the difference between a tree and a hashtable. The difference can be described in two ways: (1) implementation details and (2) behavior, where behavior describes the functionality that is available as well as the performance under different conditions.

I’ll start with the hash table. A hash table is a contiguous region of memory, similar to an array, in which objects are hashed (a numerical value is computed for the object) into an index in that memory. For example, consider a hash table that holds 100 elements. A hash function would compute a number between 0 and 99 and then insert the item at that location. When that item is requested then the same hash function would compute the same number and know exactly where to go to retrieve that object. There are some details that I’ve overlooked, such as sizing a hash table appropriately, the effect of resizing a hash table, and how to handle collisions (two objects hash to the same value), but I’ll address those as they come up. The key takeaways for using a hash table are:

  • Inserting an object into a hash table is a constant time operation
  • Retrieving an object from a hash table is a constant time operation

Trees, or more specifically binary search trees, store their data in hierarchical a tree. Figure 1shows an example of a tree.

Figure 1

Figure 1. The Structure of a Tree

From figure 1, the top node is the “root” of the tree. The root has one child to its left and one child to its right, in which the child to its left has a lower value and the child to the right has a higher value. This requirement is maintained throughout the entirety of the tree. In this example, the root node has a value of 7: all of its left children are less than 7 and all of its right children are greater than 7. To add a new item to the tree, such as 3, we would examine the root node 7, see that 3 is less than 7, so we would go it its left child. The left child is 5, so we would again traverse to the left child. Finally, because 3 is less than 4, the new node is inserted as the left child of 4. A tree support inserting new elements at a speed that is of the order log(n), meaning that as long as the tree maintains its balance (each node has approximately the same number of children in its left subtree as in its right subtree), you will need to examine the logarithm (base 2) of the number of elements (n) contained within the tree.

Recall that log base 2 is defined as follows:

2^n = number of nodes

So in this case we have 7 nodes, so

2^n = 7, which is just a little under 3 (2^3 = 8)

This means that in order to find an object in a tree with 7 (or 8) nodes, you would need to examine 3 nodes. And the size of the tree grows rapidly with this exponent: 2^4 = 16, 2^5 = 32, 2^6 = 64, 2^7 = 128, and so on. If your tree has 128 elements, it will take 7 operations to find a location to insert a new item and it will take a maximum of 7 operations to find an object. It is not as fast as a hash table (constant time means one operation), but it is still rather fast.

And, as with hash tables, trees are not a panacea because you might have noticed that I added the caveat “as long as the tree maintains its balance.” Unfortunately great effort must be expended in order to keep a tree balanced, which complicates its implementation, but there are several well established algorithms to keep a tree balanced.

Given this background that hash tables store its data in an array like structure with constant time inserts, searches, and deletes and trees store its data in a hierarchical tree with log(n) inserts, log(n) searches, and log(n) deletes, when would you use one over the other?

Looking at performance alone, the hash table is the preferred data structure to hold your set and map data, but it has some significant drawbacks:

  • It requires more memory than is needed to hold its data. This is true because as a hash table becomes more and more populated, the chances for collision (two objects hashing to the same index), become more and more likely. In general, a hash table should not be more than 75% - 80% full. This means that if you want to hold 100 items, you need to have a hash table that can hold 125 – 133 elements.
  • You need to know approximately how many elements you want to hold in the hash table to avoiding having to resize the hash table. For example, consider a hash table that can hold 125 elements, has 100 elements and a load factor (how full it can become before it is resized) or 80%, and the 101st is added. The hash table has crossed that 80% threshold and must be resized: simply operation, right? Wrong! The hash function computes the index for an item in the table, based on the size of the table. Therefore, as the table is resized, to say 150, all 100 objects must be rehashed to an index between 0 and 149 (instead of the previous hash of values between 0 and 124.) Rehashing is very expensive operation that requires “n” operations – so when it occurs, your constant time insert becomes of order O(n), which is far worse than the performance of a tree.
  • Because hash tables compute seemingly arbitrary values for objects, there is no way to extract objects in any natural order. For example, the numbers 1, 2, 3, and 4 may come out as 2, 1, 3, 4 or 4, 1, 3, 2 and so forth. The order is determined by where the hashing algorithm inserts the objects, not on any natural order

Contrasting hash tables with trees yields the following benefits that trees provide:

  • Trees only use the amount of memory needed to hold its items
  • Because trees can grow and shrink as needed, they never suffer from needing to be rehashed – the balancing operation may become complex, but at no point do the insert operation approach an order of O(n) (it never needs to examine every object in the tree during an insert)
  • Because trees store objects in a hierarchical structure, it is simple to extract items in their natural order (as long as that natural order defines the condition by which the tree established its hierarchical structure.)

In short, if you know approximately how many items you need to maintain in your collection, you can maintain additional unused memory in order to avoid hash collisions, and you do not need to extract items in a natural order, then hash tables are your best choice because they offer constant time insertion, search, and deletion. On the other hand, if memory is tight, you do not know how many items you may need to store in memory, and/or you need to be able to extract objects in a natural order, then a tree is a better choice.

Another condition you might want to consider when making your decision is the balance of operations that will be performed on your collection. For example, you may not know how many items you need to store in your collection (so you wouldn’t know how to best size a hash table), but you know that data will be loaded and inserted into your collection on startup and then all future operations will be searches. In this situation, you might be willing to accept rehashing your table (possibly multiple times) during startup to gain constant time search. On the other hand, if items will be consistently added and removed and you do not know how many items will be stored in your collection, then you might be willing to accept O(log n) search time to avoid potentially rehashing a table at runtime.

The point is that there is no general correct answer to whether a hash table or a tree is better, the nature of the data and the nature of the operations you will be performing on that data will determine the correct solution. The key is for you to understand how both solutions work and the benefits and tradeoffs of both solutions so that given the constraints of the problem, you are equipped to make the best decision.

内容概要:本文介绍了基于贝叶斯优化的CNN-LSTM混合神经网络在时间序列预测中的应用,并提供了完整的Matlab代码实现。该模型结合了卷积神经网络(CNN)在特征提取方面的优势与长短期记忆网络(LSTM)在处理时序依赖问题上的强大能力,形成一种高效的混合预测架构。通过贝叶斯优化算法自动调参,提升了模型的预测精度与泛化能力,适用于风电、光伏、负荷、交通流等多种复杂非线性系统的预测任务。文中还展示了模型训练流程、参数优化机制及实际预测效果分析,突出其在科研与工程应用中的实用性。; 适合人群:具备一定机器学习基基于贝叶斯优化CNN-LSTM混合神经网络预测(Matlab代码实现)础和Matlab编程经验的高校研究生、科研人员及从事预测建模的工程技术人员,尤其适合关注深度学习与智能优化算法结合应用的研究者。; 使用场景及目标:①解决各类时间序列预测问题,如能源出力预测、电力负荷预测、环境数据预测等;②学习如何将CNN-LSTM模型与贝叶斯优化相结合,提升模型性能;③掌握Matlab环境下深度学习模型搭建与超参数自动优化的技术路线。; 阅读建议:建议读者结合提供的Matlab代码进行实践操作,重点关注贝叶斯优化模块与混合神经网络结构的设计逻辑,通过调整数据集和参数加深对模型工作机制的理解,同时可将其框架迁移至其他预测场景中验证效果。
评论
成就一亿技术人!
拼手气红包6.0元
还能输入1000个字符
 
红包 添加红包
表情包 插入表情
 条评论被折叠 查看
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值