REDUCE
The REDUCE statement is very powerful and is the only way to guarantee that Scope will pass all Row objects sharing a key to a processing object in a single call. For example, if the input RowSet looks like:
Key | Value |
a | a1 |
b | b1 |
a | a2 |
c | c1 |
d | d1 |
b | b2 |
c | c2 |
a | a3 |
b | b3 |
The Reducer object will be called 4 times with RowSet objects containing:
First Call
In this call, the Row objects are grouped on Key="a".
Key | Value |
a | a3 |
a | a1 |
a | a2 |
Second Call
In this call, the Row objects are grouped on Key="b".
Key | Value |
b | b1 |
b | b3 |
b | b2 |
Third Call
In this call, the Row objects are grouped on Key="c".
Key | Value |
c | c1 |
c | c2 |
Fourth Call
In this call, the Row objects are grouped on Key="d".
Key | Value |
d | d1 |
Note: Scope does not guarantee that the calls to the Reducer will occur in any particular order. If you need additional sorting of the rows within a group, use the PRESORT clause.
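The grouping behavior described above can be modeled in plain Python (an illustrative sketch, not Scope itself): sort the rows by key, then hand each key's group to the Reducer in one call.

```python
from itertools import groupby

# The input RowSet from the table above, as (Key, Value) tuples.
rows = [("a", "a1"), ("b", "b1"), ("a", "a2"), ("c", "c1"),
        ("d", "d1"), ("b", "b2"), ("c", "c2"), ("a", "a3"), ("b", "b3")]

# REDUCE partitions the rows by key and makes one call per distinct key;
# groupby requires the rows to be sorted by key first.
calls = [(key, [value for _, value in group])
         for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                                   key=lambda r: r[0])]

print(len(calls))   # 4 calls, one per distinct key
print(calls[0])     # ('a', ['a1', 'a2', 'a3'])
```

Note that within each group the rows arrive in no guaranteed order; the stable sort here just happens to preserve input order.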
Syntax
All REDUCE statements require:
- REDUCE [input]
- PRODUCE column(s)
- USING reducer
- ON key(s)
As with other commands, users can supply a WHERE or HAVING clause to filter the input or the output, respectively. The REDUCE command also allows users to sort additional non-key fields by providing a PRESORT clause.
Example
Let's look at an example:
REDUCE
PRODUCE Key, CountOfKey
USING MyReducer(-count)
ON Key
PRESORT Value ASC
WHERE Key != "filter"
HAVING CountOfKey > 2;
What this is saying is:
- Input the RowSet produced by the previous command
- Filter the input so that only rows having Key not equal to "filter" will be input
- Group on the input field "Key"
- Sort the input on field "Value" in an ascending manner
- Use the Reducer called "MyReducer" with the argument "-count" passed in
- Output the columns "Key" and "CountOfKey"
- Filter the output so that only rows having CountOfKey greater than 2 will be output
Note: The WHERE, HAVING, and PRESORT clauses are optional, but the other clauses are required.
The output RowSet from this command will be:
Key | CountOfKey |
a | 3 |
b | 3 |
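The whole pipeline of this example can be simulated in Python (an illustrative model; `MyReducer(-count)` is assumed to emit one (Key, CountOfKey) row per group):

```python
from itertools import groupby

rows = [("a", "a1"), ("b", "b1"), ("a", "a2"), ("c", "c1"),
        ("d", "d1"), ("b", "b2"), ("c", "c2"), ("a", "a3"), ("b", "b3")]

# WHERE: filter the input before grouping.
filtered = [r for r in rows if r[0] != "filter"]

# ON Key + PRESORT Value ASC: group on Key, rows within a group ordered by Value.
presorted = sorted(filtered, key=lambda r: (r[0], r[1]))

# USING MyReducer(-count): one output row (Key, CountOfKey) per group.
reduced = [(key, sum(1 for _ in group))
           for key, group in groupby(presorted, key=lambda r: r[0])]

# HAVING: filter the reducer's output.
output = [(key, n) for key, n in reduced if n > 2]

print(output)  # [('a', 3), ('b', 3)]
```

The groups for "c" (2 rows) and "d" (1 row) are dropped by the HAVING filter, which is why they do not appear in the output table above.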
Reducers
Reducers take a RowSet and produce a RowSet (just like a Processor), with the big difference that the input data is grouped on a particular key. Use Reducers for things like:
- Generating output based on user sessions
- Aggregating across queries
- Looking at IP statistics
Internally, the SELECT aggregators (i.e. COUNT, SUM, MIN, ...) are implemented through Reducers.
Please see the REDUCE page for a more comprehensive discussion about how the REDUCE command works.
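The idea that aggregators reduce to per-group folds can be sketched in Python (an illustrative model; the names here are not the actual Scope internals):

```python
from itertools import groupby

rows = [("a", 1), ("a", 5), ("b", 2), ("b", 4)]

# An "aggregator" is just a fold applied to the rows of one key group,
# which is exactly the shape of work a Reducer performs.
def aggregate(rows, fold):
    out = []
    for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                              key=lambda r: r[0]):
        out.append((key, fold(v for _, v in group)))
    return out

print(aggregate(rows, sum))                          # SUM per key
print(aggregate(rows, lambda g: sum(1 for _ in g)))  # COUNT per key
```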
Syntax
Let's look at the following script:
REDUCE
PRODUCE A,B,C,D
USING MyReducer(-a)
ON X,Y
WHERE X != "scope"
HAVING A != "filter";
What this is saying is:
- Input the RowSet produced by the previous command
- Filter the input so that only rows having X not equal to "scope" will be further processed.
- Group on the input fields X and Y
- Use the Reducer called "MyReducer" with the argument "-a" passed in
- Output the columns "A", "B", "C", and "D"
- Filter the output so that only rows having A not equal to "filter" will be output
Note: The WHERE and HAVING clauses are optional, but the other clauses are required.
Writing a Reducer
Syntax
Note that the implementations of a Processor and a Reducer are very similar. This is by design - but remember that the semantics (and therefore the overhead) are vastly different: Reducers require much more work behind the scenes.
The easiest way to get started is to right-click in the editor and select Implement-Reducer. Here's an example of a Reducer that outputs the number of rows associated with each key:
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // The output schema is identical to the input schema
        return inputSchema.Clone();
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        int count = 0;
        foreach (Row row in input.Rows)
        {
            // Copy the data from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
            }
        }
        // Write the row count into the fourth column (index 3);
        // this assumes the input schema has at least four columns
        outputRow[3].Set(count);
        yield return outputRow;
    }
}
#ENDCS
The two methods that need to be implemented are:
- Produces
- Reduce
Tips
- The outputRow object is already handed to the Reducer; do not create your own Row object
- Make your code efficient
Advanced
In certain cases you may want to exploit the IsRecursive overload, which can increase the performance of a Reducer significantly. You can use this functionality if:
- The input and output Schema share the same number of columns and have matching types
- Rows in the input RowSet sharing a common key can be processed in any order and in potentially different groups
- Partial output RowSets can be combined with partial input RowSets and still produce the same results
- The Reducer is idempotent
Please see Map/Reduce for more information on this topic.
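The conditions above amount to saying that the Reducer's partial outputs can be fed back in as inputs. A toy Python model (illustrative names only, not the Scope API) shows why a count Reducer qualifies:

```python
# A count reducer over rows of the form (key, count): re-reducing partial
# outputs must give the same result as reducing all rows at once.
def reduce_counts(rows):
    """rows: (key, count) pairs, all sharing one key."""
    rows = list(rows)
    key = rows[0][0]
    return (key, sum(count for _, count in rows))

# Reducing all four rows in one call...
full = reduce_counts([("a", 1), ("a", 1), ("a", 1), ("a", 1)])

# ...matches reducing them in two partial groups and combining the results.
part1 = reduce_counts([("a", 1), ("a", 1)])
part2 = reduce_counts([("a", 1), ("a", 1)])
combined = reduce_counts([part1, part2])

print(full == combined)  # True: safe to run recursively
```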
Example1:
//Script GUID:6ad8ab74-26f8-4087-a95c-407131f7ff16
//Used for tracking history
test = EXTRACT key, value
FROM "/local/Dev/V-LinTao/test.txt"
USING DefaultTextExtractor;
output =
REDUCE test
PRODUCE key, countofkey
USING SampleReducer
ON key
;
OUTPUT TO "/local/Dev/V-LinTao/count.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // The output schema has two columns: the key and its row count
        string outputschema = "key, countofkey";
        return new Schema(outputschema);
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        int count = 0;
        foreach (Row row in input.Rows)
        {
            // Copy the key from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
            }
        }
        // Write the row count into the second column (index 1)
        outputRow[1].Set(count);
        yield return outputRow;
    }
}
#ENDCS
// Generated by ScopeStudio, version 1.7.8000.1!
Test data: Test.txt
a | a1 |
b | b1 |
a | a2 |
c | c1 |
d | d1 |
b | b2 |
c | c2 |
a | a3 |
b | B3 |
Result: count.txt
a | 3 |
b | 3 |
c | 2 |
d | 1 |
Example2:
test = EXTRACT key, value
FROM "/local/Dev/V-LinTao/test1.txt"
USING DefaultTextExtractor;
output =
REDUCE test
PRODUCE *
USING SampleReducer("key,k111,k222")
ON key
;
OUTPUT TO "/local/Dev/V-LinTao/test2output_freq.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // Build the output schema from the comma-separated argument,
        // e.g. "key,k111,k222" -> columns key, k111, k222
        string[] schema = args[0].Split(',');
        StringBuilder outputSchema = new StringBuilder(schema[0]);
        for (int i = 1; i < schema.Length; i++)
        {
            outputSchema.AppendFormat(",{0}", schema[i]);
        }
        return new Schema(outputSchema.ToString());
    }

    // Reset all count columns to "0" (column 0 holds the key)
    private void RefreshRows(string[] rows)
    {
        for (int i = 1; i < rows.Length; i++)
        {
            rows[i] = "0";
        }
    }

    // Copy the accumulated values into the output Row
    private void SetRows(Row output, string[] rows)
    {
        for (int i = 0; i < rows.Length; i++)
        {
            output[i].Set(rows[i]);
        }
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        string[] schema = args[0].Split(',');
        int count = 0;
        string[] UserRow = new string[schema.Length];
        foreach (Row row in input.Rows)
        {
            // On the first Row of the group, record the key and zero the counts
            if (++count == 1)
            {
                UserRow[0] = row[0].ToString();
                RefreshRows(UserRow);
            }
            // Increment the count of every column whose name contains
            // the row's value, e.g. value "111" matches column "k111"
            for (int i = 1; i < schema.Length; i++)
            {
                if (schema[i].Contains(row[1].ToString()))
                {
                    UserRow[i] = (1 + int.Parse(UserRow[i])).ToString();
                }
            }
        }
        SetRows(outputRow, UserRow);
        yield return outputRow;
    }
}
#ENDCS
Test data: test1.txt
a | 111 |
b | 222 |
c | 111 |
d | 222 |
e | 111 |
a | 111 |
b | 222 |
Result: test2output_freq.txt
a | 2 | 0 |
b | 0 | 2 |
c | 1 | 0 |
d | 0 | 1 |
e | 1 | 0 |
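The counting logic of Example2 can be checked against the test data with a short Python model (illustrative only; it mirrors the C# check `schema[i].Contains(row[1].ToString())`):

```python
from itertools import groupby

# The Reducer argument and the rows from test1.txt above.
schema = "key,k111,k222".split(",")
rows = [("a", "111"), ("b", "222"), ("c", "111"), ("d", "222"),
        ("e", "111"), ("a", "111"), ("b", "222")]

output = []
for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                          key=lambda r: r[0]):
    counts = [0] * (len(schema) - 1)
    for _, value in group:
        for i, col in enumerate(schema[1:]):
            # A value matches a column whose name contains it,
            # e.g. "111" matches "k111"
            if value in col:
                counts[i] += 1
    output.append((key, *counts))

print(output)
# [('a', 2, 0), ('b', 0, 2), ('c', 1, 0), ('d', 0, 1), ('e', 1, 0)]
```

This reproduces the result table above.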
AXConversionFilter_process =
SELECT UserID,
DeviceType,
PageGroup,
ExposureTime,
ConversionTime,
ConversionTime.Subtract(ExposureTime).TotalSeconds AS DateDiff_Seconds
FROM AXConversionFilter_output;
OUTPUT
TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt";
//
//AXConversionFilter_process =
// EXTRACT UserID,
// DeviceType,
// PageGroup,
// ExposureTime,
// ConversionTime,
// DateDiff_Seconds
// FROM @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt"
// USING DefaultTextExtractor;
AXConversion_Final =
REDUCE AXConversionFilter_process
PRODUCE UserID,
DeviceType,
PageGroup,
ExposureTime,
ConversionTime,
DateDiff_Seconds
USING SampleReducer
ON UserID, PageGroup;
OUTPUT
TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversion_Final.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        return inputSchema.Clone();
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        // For each (UserID, PageGroup) group, keep the conversion whose
        // DateDiff_Seconds is the smallest non-negative value; if every
        // diff is negative, keep the one closest to zero.
        double diffseconds = 0;
        double diff = 0;
        int count = 0;
        DateTime exposureTime = DateTime.Now;
        DateTime conversionTime = DateTime.Now;
        foreach (Row row in input.Rows)
        {
            // Copy the data from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
                exposureTime = row[3].DateTime;
                conversionTime = row[4].DateTime;
                diffseconds = row[5].Double;
            }
            else
            {
                diff = row[5].Double;
                // Both negative: prefer the diff closer to zero
                if (diff > diffseconds && (diff <= 0 && diffseconds <= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // Both non-negative: prefer the smaller diff
                else if (diff < diffseconds && (diff >= 0 && diffseconds >= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // A non-negative diff always beats a negative one
                else if (diff > diffseconds && (diff >= 0 && diffseconds <= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // An exact match (diff == 0) always wins
                if (diff == 0)
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
            }
        }
        outputRow[3].Set(exposureTime.ToString());
        outputRow[4].Set(conversionTime.ToString());
        outputRow[5].Set(diffseconds);
        yield return outputRow;
    }
}
#ENDCS
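The branch cascade above appears to implement a simple selection rule per key, which can be stated much more compactly. A Python sketch of that rule (an illustrative reading of the C# logic, not the Scope code itself):

```python
# Given the DateDiff_Seconds values of one (UserID, PageGroup) group,
# pick the diff the Reducer above would keep: the smallest non-negative
# value, or, if all diffs are negative, the one closest to zero.
def pick_diff(diffs):
    non_negative = [d for d in diffs if d >= 0]
    if non_negative:
        return min(non_negative)
    return max(diffs)  # all negative: the largest is closest to zero

print(pick_diff([120.0, 5.0, -30.0]))  # 5.0  (smallest non-negative)
print(pick_diff([-300.0, -12.0]))      # -12.0 (closest to zero)
```

In other words, the Reducer keeps the conversion closest after the exposure, falling back to the closest one before it.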