This blog has stopped updating; please go to http://codethoughts.info
This blog is no longer updated. New posts will be published at http://codethoughts.info; you are welcome to subscribe.
2015-04-04 11:43:48
308
[Original] Basic Data Structures and Algorithms 14: Directed Graphs
In directed graphs, edges are one-way: the pair of vertices that defines each edge is an ordered pair that specifies a one-way adjacency. Many applications (for example, graphs that represent the ...
2013-12-15 22:40:30
979
[Original] Basic Data Structures and Algorithms 13: Undirected Graphs (2)
Design pattern for graph processing. Since we consider a large number of graph-processing algorithms, our initial design goal is to decouple our implementations from the graph representation. T...
2013-12-13 22:51:29
330
[Original] Basic Data Structures and Algorithms 13: Undirected Graphs
A graph is a set of vertices and a collection of edges that each connect a pair of vertices. Vertex names are not important to the definition, but we need a way to refer to vertices. By convention, ...
2013-12-13 20:15:59
393
[Original] Ruby array slicing - weird behavior
If you play around with array slicing in irb, it behaves as shown below:
irb(main):027:0> a = [1,2,3]
=> [1, 2, 3]
irb(main):028:0> a[2,1]
=> [3]
irb(main):029:0> a[4,1] ...
2013-12-12 09:42:43
129
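Usage Examples of Enumerable#inject in Ruby
Enumerable#inject is a concise yet powerful API in Ruby's core library. After reading a neat piece of code today, I became very interested in this API, so I looked up some material and summarized its usage. The code is as follows:
def text_at(*args)
  args.inject(@feed) { |s, r| s.send(:at, r) }.inner_text
end
What this code does: it extracts X...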
2013-12-06 17:36:50
262
[Original] Trapped by String#split of Ruby
Today I was trapped by a rather weird behavior of Ruby's String#split. Here's an example:
def parse_inline_styles(text)
  segments = text.split(%r{(</?.*?>)}).reject {|x| x.empty?}
  segments...
2013-12-05 18:33:21
118
[Original] Basic Data Structures and Algorithms 12: Hash table
Search algorithms that use hashing consist of two separate parts. The first part is to compute a hash function that transforms the search key into an array index. Ideally, different keys would map...
2013-12-02 22:06:01
236
[Original] Basic Data Structures and Algorithms 11: Red-black binary search tree
The insertion algorithm for 2-3 trees just described is not difficult to understand; now, we will see that it is also not difficult to implement. We will consider a simple representation known as...
2013-12-01 12:12:35
273
[Original] Basic Data Structures and Algorithms 10: 2-3 search tree
Binary search trees work well for a wide variety of applications, but they have poor worst-case performance. Now we introduce a type of binary search tree where costs are guaranteed to be logarit...
2013-11-30 11:07:02
406
[Original] Basic Data Structures and Algorithms 9: Binary Search Tree
A binary search tree (BST) is a binary tree where each node has a Comparable key (and an associated value) and satisfies the restriction that the key in any node is larger than the keys in all no...
2013-11-28 22:39:46
236
[Original] Basic Data Structures and Algorithms 8: Binary search
Binary search needs an ordered array so that it can use array indexing to dramatically reduce the number of compares required for each search, using the classic and venerable binary search algorithm...
2013-11-28 21:21:41
212
[Original] Basic Data Structures and Algorithms 7: Priority queue & Heap sort
Some important applications of priority queues include simulation systems, where the keys correspond to event times, to be processed in chronological order; job scheduling, where the keys correspond...
2013-11-27 19:47:59
424
[Original] Basic Data Structures and Algorithms 6: Quick sort
Quick sort is probably used more widely than any other sorting algorithm. It is popular because it is not difficult to implement, works well for a variety of different kinds of input data, and is substantially faster...
2013-11-21 19:33:14
277
[Original] Basic Data Structures and Algorithms 5: Merge sort
One of mergesort’s most attractive properties is that it guarantees to sort any array of N items in time proportional to N * log N. Its prime disadvantage is that it uses extra space proportional...
2013-11-20 21:44:40
273
[Original] Basic Data Structures and Algorithms 4: Shell sort
Shellsort is a simple extension of insertion sort that gains speed by allowing exchanges of array entries that are far apart, to produce partially sorted arrays that can be efficiently sorted, ev...
2013-11-20 19:11:30
226
[Original] Comparing two sorting algorithms
Generally we compare algorithms by
■ Implementing and debugging them
■ Analyzing their basic properties
■ Formulating a hypothesis about comparative performance
■ Running experiments to validate...
2013-11-19 21:16:59
153
[Original] Basic Data Structures and Algorithms 3: Insertion Sort
As in selection sort, the items to the left of the current index are in sorted order during the sort, but they are not in their final position, as they may have to be moved to make room for smaller ...
2013-11-19 21:06:47
208
[Original] Basic Data Structures and Algorithms 2: Selection sort
One of the simplest sorting algorithms works as follows: First, find the smallest item in the array and exchange it with the first entry (itself if the first entry is already the smallest). Then,...
2013-11-19 20:57:06
170
[Original] Basic Data Structures and Algorithms 1: UnionFind
The problem that we consider is not a toy problem; it is a fundamental computational task, and the solution that we develop is of use in a variety of applications, from percolation in physical ch...
2013-11-19 20:47:04
140
[Original] Availability and Reliability with HBase
Availability
Availability in the context of HBase can be defined as the ability of the system to handle failures. The most common failures cause one or more nodes in the HBase cluster to fall off t...
2013-08-25 10:53:19
153
[Original] Failed to Run Pig Script with Macro
Pig version:
[root@n8 examples]# pig -version
Apache Pig version 0.11.0-cdh4.3.0 (rexported) compiled May 27 2013, 20:48:21
Hadoop version:
[root@n8 examples]# hadoop version
Hadoop 2.0.0-cd...
2013-08-16 19:44:29
182
[Original] Solution to Hive Thrift Client Hang without Any Return
Env:
Cloudera Manager 4.6.1 with CDH4.3
Hadoop 2.0.0-CDH4.3
Hive 0.10.0-CDH4.3
CentOS 6.4 X86_64
Hive started successfully:
[root@n8 hive]# netstat -anlp | grep 10000
tcp 0 0 0.0.0.0:...
2013-08-12 19:38:33
141
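[Original] How to Create Hive Data Files
While learning Hive, a problem I often run into is the lack of suitable data files. For example, when reading the book "Programming Hive", I was frustrated that no sample data was provided for the Employees table. Hive by default uses '\001' (Ctrl+A) as the field separator, '\002' (Ctrl+B) as the collection item separator, and '\003' as the key/value separator for Map types. ...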
2013-08-10 12:05:04
171
[Original] Hive - Failed to Create an Index, Cause Still Unknown
Environment: Cloudera Hive 0.10-CDH4
The Hive installation on my machine contains the following table:
hive (human_resources)> describe formatted employees;
OK
col_name data_type comment
# col_name data_type comment ...
2013-08-10 00:08:46
1091
[Original] Cascading Terminology and Concepts
Cascading is a data processing API and processing query planner used for defining, sharing, and executing data-processing workflows on a single computing node or distributed computing cluster. On a ...
2013-08-02 23:17:37
152
[Original] Cascading Kick Start: Word Counting
If you know Hadoop, you have undoubtedly seen WordCount before; WordCount serves as a hello world for Hadoop apps. This simple program provides a great test case for parallel processing: It req...
2013-07-31 19:36:29
133
[Original] Joins with Apache Crunch
Apache Crunch is a Java library for creating MapReduce pipelines that is based on Google's FlumeJava library. Like other high-level tools for creating MapReduce jobs, such as Apache Hive, Apache Pig...
2013-07-30 19:46:21
135
[Original] Getting Started with Apache Crunch
The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to wr...
2013-07-29 23:10:34
132
[Original] Accelerating Comparison by Providing RawComparator
When a job is in the sorting or merging phase, Hadoop leverages RawComparator for the map output key to compare keys. Built-in Writable classes such as IntWritable have byte-level implementations that are...
2013-07-27 21:25:07
111
[Original] MapReduce Algorithm - Secondary Sort
Secondary sort is used to allow some records to arrive at a reducer ahead of other records. It requires an understanding of both data arrangement and data flow (partitioning, sorting and gro...
2013-07-25 19:34:46
175
[Original] MapReduce Algorithm - Semi-joins
In the relational world, a semi-join can be defined as a join between two tables that returns rows from the first table where one or more matches are found in the second table. The difference between a semi-jo...
2013-07-25 18:15:04
116
[Original] MapReduce Algorithm - Another Way to Do Map-side Join
Map-side join is also known as replicated join, and gets its name from the fact that the smallest of the datasets is replicated to all the map hosts. You can find an implementation in Hadoop in Action...
2013-07-25 17:51:08
171
[Original] Homework - HBase Shell, Java Client and MapReduce Job
Env:
Single Node with CentOS 6.2 x86_64, 2 processors, 4 GB memory
CDH4.3 with Cloudera Manager 4.5
HBase 0.94.6-cdh4.3.0
HBase shell exercise:
[root@n8 ~]# hbase shel...
2013-07-21 23:36:22
218
[Original] Running MapReduce Job with HBase
Generally there are three different ways of interacting with HBase from a MapReduce application. HBase can be used as a data source at the beginning of a job, as a data sink at the end of a job or as ...
2013-07-21 01:50:23
125
[Original] Adding HBase Library into Java Classpath
Suppose you write some Java code to operate on HBase via the HBase Java client interface, and you compile and package the Java source code into a jar called examples.jar. In a Hadoop cluster you can use "hbase c...
2013-07-20 14:17:36
111
[Original] Moving Data in/out of Hadoop Filesystem
Hadoop has a number of built-in mechanisms that can facilitate ingress and egress operations, to name a few:
Embedded NameNode HTTP server
WebHDFS and Hadoop interfaces
HBase built-in API, be sp...
2013-07-18 23:11:51
112
[Original] Enabling Oozie Web Console in CDH3, CDH4 with/without Cloudera Manager
To enable Oozie's web console, you must download and add the ExtJS library to the Oozie server. If you have not already done this, proceed as follows. If you use CDH3, you must do the following: Download th...
2013-07-16 23:36:37
107
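[Original] Specifying the Flume Log Classification Level
When receiving syslog-format logs over UDP or TCP, for example:
flume dump 'syslogUdp(5140)'
this command receives logs over UDP on port 5140. If you then want to test from the command line whether logs are received successfully:
echo '<37>Hello from cmd.' |nc -u localhost 5140
you must prefix the test text with <37> to classify the log, otherwise flum...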
2013-07-16 08:41:14
804
[Original] PageRank Algorithm in MapReduce
Chapter 5 of Data-Intensive Text Processing with MapReduce introduces how to implement the PageRank algorithm the MapReduce way. Here I am not going to talk more about PageRank itself; please refe...
2013-07-14 12:12:29
192