Connectors
Connectors are the "data sources" of Presto queries. Even if the underlying data source has no actual tables, its data can be queried as long as the connector implements the APIs Presto requires.
ConnectorFactory
The plugin's getConnectorFactory() returns a ConnectorFactory, which is used to create a Connector instance that exposes:
ConnectorMetadata
ConnectorSplitManager
ConnectorHandleResolver
ConnectorRecordSetProvider
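The plugin -> factory -> connector chain above can be sketched as follows. This is a simplified illustration with stand-in types, not the real presto-spi interfaces (the actual SPI methods take transaction handles, session properties, and configuration maps):

```java
import java.util.List;

// Minimal stand-ins for the presto-spi interfaces; names mirror the SPI,
// but these empty versions exist only to show the wiring.
interface ConnectorMetadata {}
interface ConnectorSplitManager {}
interface ConnectorRecordSetProvider {}

interface Connector {
    ConnectorMetadata getMetadata();
    ConnectorSplitManager getSplitManager();
    ConnectorRecordSetProvider getRecordSetProvider();
}

interface ConnectorFactory {
    String getName();
    Connector create(String connectorId);
}

public class PluginWiring {
    // A plugin exposes factories; Presto selects one by the catalog's
    // connector.name property and calls create() to build the Connector.
    static List<ConnectorFactory> getConnectorFactories() {
        return List.of(new ConnectorFactory() {
            public String getName() { return "example-http"; }
            public Connector create(String connectorId) {
                return new Connector() {
                    public ConnectorMetadata getMetadata() { return new ConnectorMetadata() {}; }
                    public ConnectorSplitManager getSplitManager() { return new ConnectorSplitManager() {}; }
                    public ConnectorRecordSetProvider getRecordSetProvider() { return new ConnectorRecordSetProvider() {}; }
                };
            }
        });
    }

    public static void main(String[] args) {
        ConnectorFactory factory = getConnectorFactories().get(0);
        Connector connector = factory.create("example");
        System.out.println(factory.getName() + " -> " + (connector.getMetadata() != null));
    }
}
```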
ConnectorMetadata
Provides Presto with the methods for listing a particular data source's schemas, tables, columns, and other metadata. For reference implementations, see the Example HTTP Connector and the Cassandra connector.
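The listing methods can be illustrated with a nested map, SchemaName -> (TableName -> columns) — the same shape ExampleClient uses further below. This is a self-contained sketch, not the real ConnectorMetadata interface:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch of ConnectorMetadata-style listing backed by a
// nested map: SchemaName -> (TableName -> column names).
public class MetadataSketch {
    private final Map<String, Map<String, List<String>>> schemas;

    MetadataSketch(Map<String, Map<String, List<String>>> schemas) {
        this.schemas = schemas;
    }

    List<String> listSchemaNames() {
        return new ArrayList<>(schemas.keySet());
    }

    List<String> listTables(String schema) {
        Map<String, List<String>> tables = schemas.get(schema);
        return tables == null ? List.of() : new ArrayList<>(tables.keySet());
    }

    List<String> listColumns(String schema, String table) {
        return schemas.getOrDefault(schema, Map.of()).getOrDefault(table, List.of());
    }

    public static void main(String[] args) {
        MetadataSketch metadata = new MetadataSketch(Map.of(
                "example", Map.of("numbers", List.of("text", "value"))));
        System.out.println(metadata.listSchemaNames());              // [example]
        System.out.println(metadata.listTables("example"));          // [numbers]
        System.out.println(metadata.listColumns("example", "numbers")); // [text, value]
    }
}
```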
ConnectorSplitManager
The SplitManager partitions a table's data into chunks (splits) that Presto distributes to workers for processing.
For example:
- The Hive connector lists the files in each Hive partition and creates one or more splits per file.
- For unpartitioned data, a reasonably good strategy is to treat the entire table as a single split (this is the approach taken by the Example HTTP connector).
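Both strategies can be sketched in a few lines. The Split type here is a hypothetical simplification (a real ConnectorSplit also carries host addresses and accessibility flags):

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the two split strategies described above.
public class SplitSketch {
    // Simplified stand-in for ConnectorSplit: just the data location.
    static final class Split {
        final URI source;
        Split(URI source) { this.source = source; }
    }

    // Hive-style: one split per file listed under the table's partitions.
    static List<Split> splitsPerFile(List<URI> partitionFiles) {
        List<Split> splits = new ArrayList<>();
        for (URI file : partitionFiles) {
            splits.add(new Split(file));
        }
        return splits;
    }

    // Unpartitioned data: the whole table as a single split
    // (the strategy the Example HTTP connector uses).
    static List<Split> singleSplit(URI tableSource) {
        return List.of(new Split(tableSource));
    }
}
```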
ConnectorRecordSetProvider
Given a split and a list of columns, the RecordSetProvider is responsible for delivering the data to Presto's execution engine.
It builds a RecordSet, which in turn creates a RecordCursor that Presto uses to read the column values row by row (similar to JDBC).
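The cursor pattern looks like the following in-memory sketch: the cursor is advanced one position at a time and column values are read by field index, much like iterating a JDBC ResultSet. These are simplified stand-ins, not the real SPI types:

```java
import java.util.List;

// Illustrative sketch of the RecordSet -> RecordCursor reading pattern.
public class CursorSketch {
    interface RecordCursor {
        boolean advanceNextPosition(); // move to the next row, false at end
        String getString(int field);   // read a column of the current row
        long getLong(int field);
    }

    // Hypothetical in-memory cursor over pre-materialized rows.
    static class InMemoryCursor implements RecordCursor {
        private final List<Object[]> rows;
        private int position = -1;

        InMemoryCursor(List<Object[]> rows) { this.rows = rows; }

        public boolean advanceNextPosition() { return ++position < rows.size(); }
        public String getString(int field) { return (String) rows.get(position)[field]; }
        public long getLong(int field) { return (Long) rows.get(position)[field]; }
    }

    public static void main(String[] args) {
        RecordCursor cursor = new InMemoryCursor(List.of(
                new Object[]{"one", 1L},
                new Object[]{"two", 2L}));
        // The engine drives this loop: advance, then read each column.
        while (cursor.advanceNextPosition()) {
            System.out.println(cursor.getString(0) + " | " + cursor.getLong(1));
        }
    }
}
```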
Example HTTP Connector
example.properties
presto-main/etc/catalog/example.properties
connector.name=example-http
metadata-uri=http://s3.amazonaws.com/presto-example/v2/example-metadata.json
Sample queries
presto:default> show schemas from example;
Schema
--------------------
example
information_schema
tpch
(3 rows)
presto:default> show tables from example.example;
Table
---------
numbers
(1 row)
presto:default> select * from example.example.numbers;
text | value
-------+-------
one | 1
two | 2
three | 3
ten | 10
eleven | 11
twelve | 12
(6 rows)
presto:default> show tables from example.information_schema;
Table
-------------------------
__internal_partitions__
columns
schemata
table_privileges
tables
views
(6 rows)
Example HTTP connector code
//==column
public final class ExampleColumn
{
private final String name;
private final Type type;
public final class ExampleColumnHandle
implements ColumnHandle
{
private final String connectorId;
private final String columnName;
private final Type columnType;
private final int ordinalPosition;
//=======table
public class ExampleTable
{
private final String name;
private final List<ExampleColumn> columns;
private final List<ColumnMetadata> columnsMetadata;
private final List<URI> sources;
public final class ExampleTableHandle
implements ConnectorTableHandle
{
private final String connectorId;
private final String schemaName;
private final String tableName;
public class ExampleTableLayoutHandle
implements ConnectorTableLayoutHandle
{
private final ExampleTableHandle table;
//=======split
public class ExampleSplit
implements ConnectorSplit
{
private final String connectorId;
private final String schemaName;
private final String tableName;
private final URI uri;
private final boolean remotelyAccessible;
private final List<HostAddress> addresses;
public class ExampleSplitManager
implements ConnectorSplitManager
{
private final String connectorId;
private final ExampleClient exampleClient;
public ConnectorSplitSource getSplits(...){
}
//===Record
public class ExampleRecordSetProvider
implements ConnectorRecordSetProvider{
public RecordSet getRecordSet(...) {
// build a RecordSet from the split and the requested columns
return new ExampleRecordSet(exampleSplit, handles.build());
}
}
public class ExampleRecordSet
implements RecordSet
{
private final List<ExampleColumnHandle> columnHandles; // all of the columns
private final List<Type> columnTypes; // types of the columns
private final ByteSource byteSource;
@Override
public List<Type> getColumnTypes()
{
return columnTypes;
}
@Override
public RecordCursor cursor()
{
return new ExampleRecordCursor(columnHandles, byteSource);
}
public class ExampleRecordCursor
implements RecordCursor
{
// iterates row by row, returning the column values of each row
}
//=====Plugin
// mainly parses file paths to organize them into schemas and tables
public class ExampleClient
{
/**
* SchemaName -> (TableName -> TableMetadata)
*/
private final Supplier<Map<String, Map<String, ExampleTable>>> schemas;
}
// a facade over ExampleClient
public class ExampleMetadata
implements ConnectorMetadata
{
private final String connectorId;
private final ExampleClient exampleClient;
}
// returns an ExampleConnectorFactory
public class ExamplePlugin
implements Plugin
{
@Override
public Iterable<ConnectorFactory> getConnectorFactories()
{
return ImmutableList.of(new ExampleConnectorFactory());
}
}
//=====Connector
public class ExampleConnector
implements Connector
{
private static final Logger log = Logger.get(ExampleConnector.class);
private final LifeCycleManager lifeCycleManager;
private final ExampleMetadata metadata;
private final ExampleSplitManager splitManager;
private final ExampleRecordSetProvider recordSetProvider;
}
Miscellaneous
Presto's connector architecture creates an abstraction layer for anything that can be represented in a columnar or row-like format, such as HDFS, Amazon S3, Azure Storage, NoSQL stores, relational databases, Kafka streams, and even proprietary data stores.
Today Presto is not capable of pushing down aggregations and joins into MySQL. In many cases a simple workaround for this limitation is the creation of views inside MySQL that will be referenced by Presto queries. Such views should contain aggregations and/or joins of MySQL tables. The views are processed inside MySQL (along with any column/filters pushed down by Presto) and the resulting intermediate data is streamed back to Presto for the final processing.
Views are expanded inline during query analysis, so querying a view behaves exactly as if you had written the view's definition as a subquery.