Scope Instruction - Reducer

REDUCE

 

The REDUCE statement is very powerful and is the only way to guarantee that Scope will pass all Row objects sharing a key to a processing object at one time.  For example, if the input RowSet looks like:

 

Key

Value

a

a1

b

b1

a

a2

c

c1

d

d1

b

b2

c

c2

a

a3

b

b3

The Reducer object will be called 4 times with RowSet objects containing:

 

First Call

 

In this call, the Row objects are grouped on Key="a".

 

Key

Value

a

a3

a

a1

a

a2

Second Call

 

In this call, the Row objects are grouped on Key="b".

 

Key

Value

b

b1

b

b3

b

b2

Third Call

 

In this call, the Row objects are grouped on Key="c".

 

Key

Value

c

c1

c

c2

Fourth Call

 

In this call, the Row objects are grouped on Key="d".

 

Key

Value

d

d1

Note: Scope does not guarantee that the calls to the Reducer will be in any particular order.  Furthermore, if you need additional sorting you can use the PRESORT clause.

 

Syntax

 

All REDUCE statements require:

  • REDUCE [input]
  • PRODUCE column(s)
  • USING reducer
  • ON key(s)

As with other commands, users can supply a WHERE or HAVING statement to either filter the input or output.  The REDUCE command also allows user to sort additional non-key fields by providing a PRESORT clause.

Example

Let's look at an example:

REDUCE

PRODUCE Key, CountOfKey

USING MyReducer(-count)

ON Key

PRESORT Value ASC

WHERE Key != "filter"

HAVING CountOfKey > 2;

What this is saying is:

  1. Input the RowSet produced by the previous command
  2. Filter the input so that only      rows having X not equal to "filter" will be input
  3. Group on the input field      "Key"
  4. Sort the input on field      "Value" in an ascending manner
  5. Use the Reducer called      "MyReducer" with the argument "-count" passed in
  6. Output the columns      "Key" and "CountOfKey"
  7. Filter the output so that      only rows having CountOfKey greater than 2 will be      output

Note: The WHERE, HAVING, and PRESORT clauses are optional, but the other clauses are required. 

The output RowSet from this command will be:

 

Key

CountOfKey

a

3

b

3

 

Reducers

 

Reducers are used to take a RowSet and produce a RowSet (just like a Processor) with a big difference that the data will be grouped on a particular key.  Use reducers for things like:

  • Generating output based on      user sessions
  • Aggregating across Query
  • Looking at IP statistics

Internally, the SELECT aggregators (i.e. COUNT, SUM, MIN, ...) are implemented through Reducers.

Please see the REDUCE page for a more comprehensive discussion about how the REDUCE command works.

Syntax

 

Let's look at the following script:

REDUCE

PRODUCE A,B,C,D

USING MyReducer(-a)

ON X,Y

WHERE X != "scope"

HAVING A != "filter";

What this is saying is:

  1. Input the RowSet produced by the previous command
  2. Filter the input so that only      rows having X not equal to "scope" will be further      processed.
  3. Group on the input fields X      and Y
  4. Use the Reducer called      "MyReducer" with the argument "-a" passed in
  5. Output the columns      "A", "B", "C", and "D"
  6. Filter the output so that      only rows having A not equal to "filter" will be output

Note: The WHERE and HAVING clauses are optional, but the other clauses are required. 

Writing a Reducer

 

 

Syntax

 

Note that the implementation of a Processor and a Reducer are very similar.  This is by design - but please remember that the semantics (and therefore overhead) are vastly different.  Reducers require much more behind the scenes.

 

The easiest thing to do is to right-click in the editor and select Implement-Reducer.  Here's an example of a Reducer that will input the number of rows associated with the key:

#CS

public class SampleReducer: Reducer

{

         public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)

         {

               return inputSchema.Clone();

         }

  

         public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)

         {

               int count = 0; 

               foreach(Row row in input.Rows)

               {

                      //  Copy the data on the first Row

                      if (++count == 1)

                     {

                            row.Copy(outputRow); 

                     }

                     

               }

               outputRow[3].Set(count);

               yield return outputRow;

         }

}

#ENDCS

The two methods that need to be implemented are:

  1. Produces
  2. Reduce

Tips

  • The outputRow object is      already handed to the Reducer.
  • Do not create your own Row      object
  • Make your code efficient

Advanced

 

Certain cases may wish to exploit the IsRecursive overload which would increase the performance of their reducer significantly.  Users can use this functionality if:

  • The input and output Schema share the same number of      columns and have matching types
  • Rows in the input RowSet sharing a common key can be      processed in any order and in potentially different groups
  • Partial output RowSets can be combined with partial      input RowSets and still produce the same      results
  • The Reducer is idempotent

Please see Map/Reduce for more information on this topic.

Example1:

 //Script GUID:6ad8ab74-26f8-4087-a95c-407131f7ff16

//Used for tracking history

 

test = EXTRACT key, value

FROM "/local/Dev/V-LinTao/test.txt"

USING DefaultTextExtractor;

 

output =

REDUCE test

PRODUCE key, countofkey

USING SampleReducer

ON key

;

 

OUTPUT TO "/local/Dev/V-LinTao/count.txt";

 

#CS

public class SampleReducer: Reducer

{

         public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)

         {

               string outputschema = "key, countofkey";

               return new Schema(outputschema);              

         }  

 

         public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)

         {

               int count = 0; 

               foreach(Row row in input.Rows)

               {

                      //  Copy the data on the first Row

                      if (++count == 1)

                     {

 

                            row.Copy(outputRow);

                     }                    

               }

               outputRow[1].Set(count);

               yield return outputRow;

         }

}

#ENDCS

 

// Generated by ScopeStudio, version 1.7.8000.1!

 

测试数据 Test.txt

a

a1

b

b1

a

a2

c

c1

d

d1

b

b2

c

c2

a

a3

b

B3

 

运行结果 count.txt

a

3

b

3

c

2

d

1

 

Example2:

 test = EXTRACT key, value

FROM "/local/Dev/V-LinTao/test1.txt"

USING DefaultTextExtractor;

 

output =

REDUCE test

PRODUCE *

USING SampleReducer("key,k111,k222")

ON key

;

 

OUTPUT TO "/local/Dev/V-LinTao/test2output_freq.txt";

 

#CS

public class SampleReducer: Reducer

{

     public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)

     {

               string[] schema = args[0].Split(',');

    StringBuilder outputSchema = new StringBuilder(schema[0]);

for (int i = 1; i < schema.Length; i++)

{

 outputSchema.AppendFormat(",{0}", schema[i]);

}

     return new Schema(outputSchema.ToString());           

      }

 

       private void RefreshRows(string[] rows)

        {

            for (int i = 1; i < rows.Length; i++)

            {

                rows[i] = "0";

            }

        }

 

        private void SetRows(Row output, string[] rows)

        {

            for (int i = 0; i < rows.Length; i++)

            {

                output[i].Set(rows[i]);

            }

        }

 

         public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)

         {

               string[] schema = args[0].Split(',');

               int count = 0; 

               string[] UserRow = new string[schema.Length];          

 

               foreach(Row row in input.Rows)

               {

                      //  Copy the data on the first Row

                      if (++count == 1)

                     {

                        UserRow[0] = row[0].ToString();

                        RefreshRows(UserRow);

                     }

                        for(int i = 1; i < schema.Length; i++)

                     {

                        if(schema[i].Contains(row[1].ToString()))

                        {

                            UserRow[i] = (1+int.Parse(UserRow[i].ToString())).ToString();

                        }

                      }

               }

               SetRows(outputRow, UserRow);

               yield return outputRow;

         }

}

#ENDCS

 

测试数据:test1.txt

a

111

b

222

c

111

d

222

e

111

a

111

b

222

 

运行结果:test2output_freq.txt

a

2

0

b

0

2

c

1

0

d

0

1

e

1

0

 

 

 

AXConversionFilter_process =

    SELECT UserID,

           DeviceType,

           PageGroup,

           ExposureTime,

           ConversionTime,

           ConversionTime.Subtract(ExposureTime).TotalSeconds AS DateDiff_Seconds      

    FROM AXConversionFilter_output;

 

OUTPUT

TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt";

 

//

//AXConversionFilter_process =

//    EXTRACT UserID,

//            DeviceType,

//            PageGroup,

//            ExposureTime,

//            ConversionTime,

//            DateDiff_Seconds

//    FROM @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt"

//    USING DefaultTextExtractor;

 

AXConversion_Final =

    REDUCE AXConversionFilter_process

    PRODUCE UserID,

            DeviceType,

            PageGroup,

            ExposureTime,

            ConversionTime,

            DateDiff_Seconds

    USING SampleReducer

    ON UserID, PageGroup;

 

OUTPUT

TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversion_Final.txt";

 

#CS

public class SampleReducer: Reducer

{

         public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)

         {

               return inputSchema.Clone();

         }

  

         public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)

         {

               double diffseconds = 0;

                    double diff = 0;

                   int count = 0;

                    DateTime exposureTime = DateTime.Now;

                    DateTime conversionTime = DateTime.Now;

               foreach(Row row in input.Rows)

               {

                      //  Copy the data on the first Row

                     if (++count == 1)

                     {

                            row.Copy(outputRow); 

                                exposureTime = row[3].DateTime;

                                conversionTime = row[4].DateTime;

                                diffseconds = row[5].Double;

                     }           

                         else

                            {

                               diff = row[5].Double;

                               if(diff > diffseconds && (diff <= 0 && diffseconds <= 0))

                                {                               

                                exposureTime = row[3].DateTime;

                                conversionTime = row[4].DateTime;

                                diffseconds = diff;                           

                                }

                               else if(diff < diffseconds && (diff >= 0 && diffseconds >= 0))

                                {

                                exposureTime = row[3].DateTime;

                                conversionTime = row[4].DateTime;

                                diffseconds = diff;

                                }

                                else if(diff > diffseconds && (diff >= 0 && diffseconds <= 0))

                                {

                                exposureTime = row[3].DateTime;

                                conversionTime = row[4].DateTime;

                                diffseconds = diff;

                                }

                                if(diff == 0)

                                {

                                exposureTime = row[3].DateTime;

                                conversionTime = row[4].DateTime;

                                diffseconds = diff;

                                }

                    

                            }

               }

 

                     outputRow[3].Set(exposureTime.ToString());

                     outputRow[4].Set(conversionTime.ToString());

                     outputRow[5].Set(diffseconds);

 

              

               yield return outputRow;

         }

}

#ENDCS

转载于:https://www.cnblogs.com/yuqiyang/p/4536157.html

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值