Tez Part 4: Writing a Tez Input/Processor/Output

The previous couple of posts covered Tez concepts and APIs. This post details what is required to write a custom Input / Processor / Output, along with examples of existing I/P/Os provided by the Tez runtime library.

Tez Task

[Figure: tez1 – a single Tez task]

A Tez task is composed of all the Inputs on its incoming edges, the Processor configured for the vertex, and all the Output(s) on its outgoing edge(s).

The number of tasks for a vertex is equal to the parallelism of that vertex, which is set at DAG construction time or modified at runtime by user plugins running in the AM.

The diagram shows a single task. The vertex is configured to run Processor1, has two incoming edges – whose data is consumed via Input1 and Input2 respectively – and has a single outgoing edge, fed by Output1. There will be n such task instances created per vertex, depending on the parallelism.

Initialization of a Tez task

The following steps are followed to initialize and run a Tez task.

1. The Tez framework first constructs instances of the specified Input(s), Processor and Output(s), using a 0-argument constructor.

2. For a LogicalInput and a LogicalOutput, the framework sets the number of physical connections using the respective setNumPhysicalInputs and setNumPhysicalOutputs methods.

3. The Input(s), Processor and Output(s) are then initialized via their respective initialize methods. Configuration and context information is made available to the I/P/Os via this call. More information on the context classes is available in the JavaDoc for TezInputContext, TezProcessorContext and TezOutputContext.

4. The Processor's run method is called, with the initialized Inputs and Outputs passed in as arguments (as a Map from connected vertex name to Input/Output).

5. Once the run method completes, the Input(s), Processor and Output(s) are closed, and the task is considered complete.

Notes for I/P/O writers:

  • Each Input / Processor / Output must provide a 0-argument constructor.
  • No assumptions should be made about the order in which the Inputs, Processor and Outputs will be initialized or closed.
  • No assumptions should be made about how initialize, close and the Processor run will be invoked – i.e. on the same thread or on multiple threads.

Common Interfaces to be implemented by Input/Processor/Output

  • List<Event> initialize(Tez*Context) – This is where an I/P/O receives its corresponding context object. It can, optionally, return a list of events.
  • handleEvents(List<Event> events) – Any events generated for the specific I/P/O are passed in via this interface. Inputs receive DataMovementEvent(s) generated by corresponding Outputs on this interface, and need to interpret them to retrieve data. At the moment, this can be ignored for Outputs and Processors.
  • List<Event> close() – Any cleanup or final commits will typically be implemented in the close method. This is generally a good place for Outputs to generate DataMovementEvent(s). More on these events later. A skeleton covering these methods follows this list.
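To make the lifecycle concrete, here is a minimal skeleton of a custom LogicalInput covering these methods. This is a sketch against the interface and context names used in this post (exact signatures vary across Tez versions); CustomInput and its fields are illustrative, not part of Tez.

    import java.util.Collections;
    import java.util.List;

    import org.apache.tez.runtime.api.Event;
    import org.apache.tez.runtime.api.LogicalInput;
    import org.apache.tez.runtime.api.Reader;
    import org.apache.tez.runtime.api.TezInputContext;

    public class CustomInput implements LogicalInput {

      private TezInputContext context;
      private int numPhysicalInputs;

      // Tez instantiates I/P/Os via the mandatory 0-argument constructor.
      public CustomInput() {
      }

      @Override
      public void setNumPhysicalInputs(int numInputs) {
        // The number of 'physical' incoming edges feeding this LogicalInput.
        this.numPhysicalInputs = numInputs;
      }

      @Override
      public List<Event> initialize(TezInputContext inputContext) {
        // The context provides the user payload, counters, and a way to send events.
        this.context = inputContext;
        return Collections.emptyList(); // optionally return events here
      }

      @Override
      public void handleEvents(List<Event> inputEvents) {
        // DataMovementEvent(s) from the paired Output arrive here; interpret
        // their payloads to locate and fetch the actual data.
      }

      @Override
      public Reader getReader() {
        // Expose the fetched data to the Processor via a Reader implementation.
        throw new UnsupportedOperationException("sketch only");
      }

      @Override
      public List<Event> close() {
        // Cleanup; an Input typically has no events to return on close.
        return Collections.emptyList();
      }
    }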

Providing User Information to an Input / Processor / Output

Information specified in the bytePayload associated with an Input/Processor/Output is made available to the respective I/P/O via their context objects.

Users provide this information as a byte array, and can specify any information that may be required at runtime by the I/P/O. This could include configuration, execution plans for Hive/Pig, etc. As an example, the current Inputs use a Hadoop Configuration instance for backward compatibility. Hive may choose to send its vertex execution plan as part of this field, instead of using the distributed cache provided by YARN.

Typically, Inputs and Outputs exist as a pair – the Input knows how to process DataMovementEvent(s) generated by the corresponding Output, and how to interpret the data. This information will generally be encoded into some form of configuration (specified via the userPayload) used by the Output-Input pair, and must match on both sides. As an example, the key type configured on an Output should match the key type expected by the corresponding Input.
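As a sketch of the two ends of this contract – assuming the byte-array payload APIs from this era of Tez (InputDescriptor.setUserPayload and the TezUtils configuration helpers used by the MR inputs); my.custom.input.option is an illustrative key:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.tez.common.TezUtils;
    import org.apache.tez.dag.api.InputDescriptor;

    public class PayloadSetup {
      // DAG-construction side: pack configuration bytes into the descriptor.
      public static InputDescriptor buildDescriptor() throws java.io.IOException {
        Configuration inputConf = new Configuration(false);
        inputConf.set("my.custom.input.option", "value");
        return new InputDescriptor(CustomInput.class.getName())
            .setUserPayload(TezUtils.createUserPayloadFromConf(inputConf));
      }

      // Runtime side, inside CustomInput.initialize(inputContext):
      //   byte[] payload = inputContext.getUserPayload();
      //   Configuration conf = TezUtils.createConfFromUserPayload(payload);
    }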

Writing a Tez LogicalOutput

A LogicalOutput has two main responsibilities: 1) dealing with the actual data provided by the Processor – partitioning it across the 'physical' edges, serializing it, etc. – and 2) providing information to Tez (in effect, the subsequent Input) on where this data is available.

Processing the Data

Depending on the connection pattern being used, an Output will generate data for a single 'physical' edge or for multiple 'physical' edges. A LogicalOutput is responsible for partitioning the data across these 'physical' edges.

It typically works in conjunction with the configured downstream Input, writing data in a specific format understood by that Input – including the serialization mechanism, compression, etc.

As an example: OnFileSortedOutput, which is the Output used for a MapReduce shuffle, makes use of a Partitioner to partition the data into n partitions ('physical' edges), where n corresponds to the number of downstream tasks. It also sorts the data per partition, and writes it out as key-value pairs using Hadoop serialization, which is understood by the downstream Input (ShuffledMergedInput in this case).
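The Partitioner itself is part of the Output's configuration; a minimal hash partitioner of the kind such an Output might default to is sketched below (illustrative only – the real partitioner is pluggable via the user payload):

    public class HashPartitionerSketch {
      // Maps a key to one of numPartitions 'physical' edges.
      public static int partitionFor(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
      }
    }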

Providing information on how the data is to be retrieved

A LogicalOutput needs to send out information on how data is to be retrieved by the corresponding downstream Input defined on an edge. This is done by generating DataMovementEvent(s). These events are routed by the AM, based on the connection pattern, to the relevant LogicalInputs.

These events can be sent at any time using the TezOutputContext with which the Output was initialized. Alternatively, they can be returned as part of the initialize() or close() calls. More on DataMovementEvent(s) further down.

Continuing with the OnFileSortedOutput example: this Output generates one event per partition, and the sourceIndex of each event is the partition number. This particular Output makes use of the MapReduce ShuffleHandler, which requires downstream Inputs to pull data over HTTP. The payload for these events contains the hostname and port of the HTTP server, as well as an identifier which uniquely identifies the specific task and Output instance that generated the data.

In the case of OnFileSortedOutput, these events are generated during the close() call.
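A sketch of what such a close() might look like. The DataMovementEvent (sourceIndex, payload) constructor follows the pre-0.5 API; host, port, uniqueId and serializeShuffleInfo are hypothetical stand-ins for the shuffle-specific details:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.tez.runtime.api.Event;
    import org.apache.tez.runtime.api.LogicalOutput;
    import org.apache.tez.runtime.api.events.DataMovementEvent;

    public abstract class ShuffleStyleOutputSketch implements LogicalOutput {

      private int numPartitions; // set via setNumPhysicalOutputs
      private String host;       // where the ShuffleHandler-style server runs
      private int port;
      private String uniqueId;   // identifies this task's output

      @Override
      public List<Event> close() {
        List<Event> events = new ArrayList<Event>(numPartitions);
        for (int partition = 0; partition < numPartitions; partition++) {
          // sourceIndex = partition number, matching the physical output that
          // produced the data; the AM fills in the targetIndex during routing.
          events.add(new DataMovementEvent(partition,
              serializeShuffleInfo(host, port, uniqueId, partition)));
        }
        return events;
      }

      // Hypothetical helper: encodes the location info in whatever form the
      // paired downstream Input knows how to decode.
      protected abstract byte[] serializeShuffleInfo(String host, int port,
          String id, int partition);
    }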

View OnFileSortedOutput.java

Specific interface for a LogicalOutput

  • setNumPhysicalOutputs(int) – This is where a LogicalOutput is informed of the number of outgoing physical edges.
  • Writer getWriter() – An implementation of the Writer interface, which can be used by a Processor to write to this Output.

Writing a Tez LogicalInput

The main responsibilities of a LogicalInput are 1) obtaining the actual data over the 'physical' edges, and 2) interpreting the data and providing a single 'logical' view of it to the Processor.

Obtaining the Data

A LogicalInput receives the DataMovementEvent(s) generated by the corresponding LogicalOutput, and needs to interpret these events to get hold of the data. The number of DataMovementEvent(s) a LogicalInput receives is typically equal to the number of physical edges it is configured with, and this count is used as a termination condition.
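Continuing the CustomInput skeleton from earlier, the receiving side might look as follows. getTargetIndex/getUserPayload are the DataMovementEvent accessors; startFetch is a hypothetical helper that retrieves the data:

    // Inside CustomInput (imports as before, plus
    // org.apache.tez.runtime.api.events.DataMovementEvent):

    private int eventsSeen = 0;

    @Override
    public void handleEvents(List<Event> inputEvents) {
      for (Event event : inputEvents) {
        if (event instanceof DataMovementEvent) {
          DataMovementEvent dme = (DataMovementEvent) event;
          int physicalInput = dme.getTargetIndex();   // which of my physical edges
          byte[] locationInfo = dme.getUserPayload(); // Output-defined payload
          startFetch(physicalInput, locationInfo);    // hypothetical fetch helper
          eventsSeen++;
        }
      }
      // Termination condition: typically one DataMovementEvent per physical
      // edge, so eventsSeen == numPhysicalInputs means all locations are known.
    }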

 

As an example: ShuffledMergedInput (the Input on the OnFileSortedOutput–ShuffledMergedInput O-I edge) fetches data from the ShuffleHandler by interpreting the host, port and identifier from the DataMovementEvent(s) it receives.

Providing a view of the data to the Processor

A LogicalInput will typically expose the data to the Processor via a Reader interface. This may involve interpreting the data and manipulating it if required – decompression, serialization/deserialization, etc.

Continuing with the ShuffledMergedInput example: this Input fetches all the data – one sorted chunk per source task per partition. It then merges the sorted chunks, and makes the data available to the Processor only after this step – via a KeyValues reader implementation.
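From the Processor's point of view, draining such an Input looks roughly like this (KeyValuesReader and its next/getCurrentKey/getCurrentValues methods are the KV interfaces from the Tez runtime library):

    import org.apache.tez.runtime.api.LogicalInput;
    import org.apache.tez.runtime.library.api.KeyValuesReader;

    public class KeyValuesConsumerSketch {
      // Walks a LogicalInput whose Reader is a KeyValuesReader (as with
      // ShuffledMergedInput): one key at a time, with its grouped values.
      public static void consume(LogicalInput input) throws Exception {
        KeyValuesReader reader = (KeyValuesReader) input.getReader();
        while (reader.next()) {
          Object key = reader.getCurrentKey();
          for (Object value : reader.getCurrentValues()) {
            // process(key, value) – application logic goes here
          }
        }
      }
    }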

View ShuffledMergedInput.java

View ShuffledUnorderedKVInput.java

Specific interface for a LogicalInput

  • setNumPhysicalInputs(int) – This is where a LogicalInput is informed about the number of physical incoming edges.
  • Reader getReader() – An implementation of the Reader interface, which can be used by a Processor to read from this Input.

Writing a Tez LogicalIOProcessor

A LogicalIOProcessor receives the configured LogicalInput(s) and LogicalOutput(s). It is responsible for reading source data from the Input(s), processing it, and writing data out to the configured Output(s).

A Processor is aware of which vertex (via the vertex name) a specific Input comes from. Similarly, it is aware of the destination vertex (via the vertex name) associated with a specific Output. It would typically validate the Input and Output types, process the Inputs based on their source vertex, and generate output for the various destination vertices.

As an example: the MapProcessor validates that it is configured with only a single Input of type MRInput – since that is the only Input it knows how to work with. It also validates that the Output is an OnFileSortedOutput or a MROutput. It then obtains a KeyValue reader from the MRInput, and a KeyValue writer from the OnFileSortedOutput or MROutput. The KeyValueReader instance is used to walk all the keys in the Input, for each of which the user-configured map function is called with a MapReduce output collector backed by the KeyValueWriter instance.

Specific interface for a LogicalIOProcessor

  • run(Map<String, LogicalInput> inputs, Map<String, LogicalOutput> outputs) – This is where a Processor should implement its compute logic. It receives initialized Input(s) and Output(s), along with the names of the vertices to which these Input(s) and Output(s) are connected (see the sketch below).
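A minimal run() in the spirit of the MapProcessor description above – a sketch for the single-Input, single-Output case, using the KV reader/writer interfaces from the Tez runtime library:

    import java.util.Map;

    import org.apache.tez.runtime.api.LogicalInput;
    import org.apache.tez.runtime.api.LogicalOutput;
    import org.apache.tez.runtime.library.api.KeyValueReader;
    import org.apache.tez.runtime.library.api.KeyValueWriter;

    public class PassThroughProcessorSketch {
      public void run(Map<String, LogicalInput> inputs,
          Map<String, LogicalOutput> outputs) throws Exception {
        // Map keys are the names of the connected vertices; a real processor
        // would look up specific vertex names and validate the I/O types.
        LogicalInput input = inputs.values().iterator().next();
        LogicalOutput output = outputs.values().iterator().next();

        KeyValueReader reader = (KeyValueReader) input.getReader();
        KeyValueWriter writer = (KeyValueWriter) output.getWriter();

        while (reader.next()) {
          // Identity 'map' step: real processors transform the data here.
          writer.write(reader.getCurrentKey(), reader.getCurrentValue());
        }
      }
    }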

DataMovementEvent

DataMovementEvent is used to communicate between Outputs and Inputs to convey location information. A byte payload field is available for this – the contents of which should be understood by the communicating Outputs and Inputs. This byte payload can also be interpreted by user plugins running within the AM to modify the DAG (auto reduce-parallelism, as an example).

DataMovementEvent(s) are typically generated per physical edge between the Output and Input. The event generator sets the sourceIndex on the event, which identifies the physical Output/Input that generated it. Based on the ConnectionPattern specified for the DAG, Tez sets the targetIndex, so that the event receiver knows which physical Input/Output the event is meant for. An example of the data movement events generated by a ScatterGather connection pattern (shuffle) follows, with values specified for the source and target indices.

[Figure: tez2 – DataMovementEvent routing for a ScatterGather (shuffle) edge]

In this case the upstream vertex (Vertex1) has 3 tasks, and the downstream vertex (Vertex2) has 2 tasks. Each upstream task generates one partition (physical output) per downstream task, and each downstream task consumes the same partition from each of the upstream tasks.

Vertex1, Task1 will generate two DataMovementEvents – E1 and E2.
E1, sourceIndex = 0 (since it is generated by the 1st physical output)
E2, sourceIndex = 1 (since it is generated by the 2nd physical output)

Similarly Vertex1, Task2 and Task3 will generate two data movement events each.
E3 and E5, sourceIndex=0
E4 and E6, sourceIndex=1

Based on the ScatterGather ConnectionPattern, the AM will route the events to relevant tasks.
E1, E3, E5 with sourceIndex 0 will be sent to Vertex2, Task1
E2, E4, E6 with sourceIndex 1 will be sent to Vertex2, Task2

The destinations will see the following targetIndex values (based on the physical edges between the tasks – the arrows in the diagram):
E1, targetIndex=0 – first physical input to V2, Task1
E3, targetIndex=1 – second physical input to V2, Task1
E5, targetIndex=2 – third physical input to V2, Task1
Similarly, E2, E4, E6 will have target indices 0,1 and 2 respectively – i.e. first, second and third physical input to V2 Task2.

DataMovementEvents generated by an Input are routed to the corresponding upstream Output defined on the edge. Similarly, DataMovementEvents generated by an Output are routed to the corresponding downstream Input defined on the edge.

If the Output is one of the Leaf Outputs for a DAG – it will typically not generate any events.

Error Handling

Reporting errors from an Input/Processor/Output

  • Fatal Errors – Fatal errors can be reported to Tez via the fatalError method available on the context instance with which the I/P/O was initialized (see the sketch below this list). Alternatively, throwing an Exception from the initialize, close or run methods is considered fatal. Fatal errors cause the currently running task to be killed.
  • Actionable Non-Fatal Errors – Inputs can report the failure to obtain data from a specific physical connection by sending an InputReadErrorEvent via the InputContext. Depending on the edge configuration, this may trigger a retry of the previous-stage task which generated the data.
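A sketch of the fatal-error path referenced above – fatalError's (Throwable, String) shape follows the context API of this era, and doWork is a hypothetical processing step:

    import org.apache.tez.runtime.api.TezProcessorContext;

    public class FatalErrorReportingSketch {
      private TezProcessorContext context; // saved during initialize()

      private void process() {
        try {
          doWork();
        } catch (Exception e) {
          // Report and let Tez fail this task attempt. Equivalently, the
          // exception could simply propagate out of initialize()/run()/close().
          context.fatalError(e, "Failed while processing data");
        }
      }

      private void doWork() throws Exception {
        // application logic
      }
    }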
Errors reported to an Input

If the AM determines that data generated by a previous task is no longer available, Inputs which require this data are informed via an InputFailedEvent. The sourceIndex, targetIndex and attemptNumber information on this event corresponds to the DataMovementEvent with the same values. The Input will typically handle this event by not attempting to obtain data based on that DataMovementEvent, and by waiting for an updated DataMovementEvent for the same data.

Notes on Reader and Writer

Tez does not enforce any interface on Readers and Writers, in order to stay data-format agnostic. Specific Writers and Readers can be implemented for key-value, record, or other data formats. KeyValue and KeyValues Reader/Writer interfaces and implementations, based on Hadoop serialization, are used by the shuffle Input/Output provided by the Tez runtime library.
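Since only marker-style Reader/Writer contracts are imposed (assuming the Reader/Writer marker interfaces in the runtime API), an Input/Output pair is free to define its own data-format-specific interfaces. For example, a record-oriented pair might look like this – Record and all method names here are purely illustrative:

    import java.io.IOException;

    import org.apache.tez.runtime.api.Reader;
    import org.apache.tez.runtime.api.Writer;

    // Hypothetical record type understood by both sides of the edge.
    class Record { /* fields, serialization, etc. */ }

    // A record-oriented Reader an Input might expose to the Processor.
    interface RecordReader extends Reader {
      boolean next() throws IOException; // advance; false when exhausted
      Record getCurrentRecord();         // record at the current position
    }

    // The matching Writer an Output might hand to the Processor.
    interface RecordWriter extends Writer {
      void write(Record record) throws IOException;
    }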


http://hortonworks.com/blog/writing-a-tez-inputprocessoroutput-2/
