REDUCE
The REDUCE statement is very powerful and is the only way to guarantee that Scope will pass all Row objects sharing a key to a processing object in a single call. For example, if the input RowSet looks like:
Key | Value |
a | a1 |
b | b1 |
a | a2 |
c | c1 |
d | d1 |
b | b2 |
c | c2 |
a | a3 |
b | b3 |
The Reducer object will be called 4 times with RowSet objects containing:
First Call
In this call, the Row objects are grouped on Key="a".
Key | Value |
a | a3 |
a | a1 |
a | a2 |
Second Call
In this call, the Row objects are grouped on Key="b".
Key | Value |
b | b1 |
b | b3 |
b | b2 |
Third Call
In this call, the Row objects are grouped on Key="c".
Key | Value |
c | c1 |
c | c2 |
Fourth Call
In this call, the Row objects are grouped on Key="d".
Key | Value |
d | d1 |
Note: Scope does not guarantee that the calls to the Reducer will occur in any particular order. If you need additional sorting of the rows within a group, use the PRESORT clause.
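The grouping behavior described above can be modeled in plain Python (an illustrative sketch, not Scope itself): sort the rows by key, then hand each key's group to the Reducer in one call.

```python
from itertools import groupby

# The input RowSet from the table above, as (Key, Value) tuples.
rows = [("a", "a1"), ("b", "b1"), ("a", "a2"), ("c", "c1"),
        ("d", "d1"), ("b", "b2"), ("c", "c2"), ("a", "a3"), ("b", "b3")]

# REDUCE partitions the rows by key and makes one call per distinct key;
# groupby requires the rows to be sorted by key first.
calls = [(key, [value for _, value in group])
         for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                                   key=lambda r: r[0])]

print(len(calls))   # 4 calls, one per distinct key
print(calls[0])     # ('a', ['a1', 'a2', 'a3'])
```

Note that within each group the rows arrive in no guaranteed order; the stable sort here just happens to preserve input order.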
Syntax
All REDUCE statements require:
- REDUCE [input]
- PRODUCE column(s)
- USING reducer
- ON key(s)
As with other commands, users can supply a WHERE or HAVING clause to filter the input or the output, respectively. The REDUCE command also allows users to sort additional non-key fields by providing a PRESORT clause.
Example
Let's look at an example:
REDUCE
PRODUCE Key, CountOfKey
USING MyReducer(-count)
ON Key
PRESORT Value ASC
WHERE Key != "filter"
HAVING CountOfKey > 2;
What this is saying is:
- Input the RowSet produced by the previous command
- Filter the input so that only rows having Key not equal to "filter" will be input
- Group on the input field "Key"
- Sort the input on field "Value" in an ascending manner
- Use the Reducer called "MyReducer" with the argument "-count" passed in
- Output the columns "Key" and "CountOfKey"
- Filter the output so that only rows having CountOfKey greater than 2 will be output
Note: The WHERE, HAVING, and PRESORT clauses are optional, but the other clauses are required.
The output RowSet from this command will be:
Key | CountOfKey |
a | 3 |
b | 3 |
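The whole pipeline of this example can be simulated in Python (an illustrative model; `MyReducer(-count)` is assumed to emit one (Key, CountOfKey) row per group):

```python
from itertools import groupby

rows = [("a", "a1"), ("b", "b1"), ("a", "a2"), ("c", "c1"),
        ("d", "d1"), ("b", "b2"), ("c", "c2"), ("a", "a3"), ("b", "b3")]

# WHERE: filter the input before grouping.
filtered = [r for r in rows if r[0] != "filter"]

# ON Key + PRESORT Value ASC: group on Key, rows within a group ordered by Value.
presorted = sorted(filtered, key=lambda r: (r[0], r[1]))

# USING MyReducer(-count): one output row (Key, CountOfKey) per group.
reduced = [(key, sum(1 for _ in group))
           for key, group in groupby(presorted, key=lambda r: r[0])]

# HAVING: filter the reducer's output.
output = [(key, n) for key, n in reduced if n > 2]

print(output)  # [('a', 3), ('b', 3)]
```

The groups for "c" (2 rows) and "d" (1 row) are dropped by the HAVING filter, which is why they do not appear in the output table above.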
Reducers
Reducers take a RowSet and produce a RowSet (just like a Processor), with the big difference that the input data is grouped on a particular key. Use Reducers for things like:
- Generating output based on user sessions
- Aggregating across queries
- Looking at IP statistics
Internally, the SELECT aggregators (i.e. COUNT, SUM, MIN, ...) are implemented through Reducers.
Please see the REDUCE page for a more comprehensive discussion about how the REDUCE command works.
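The idea that aggregators reduce to per-group folds can be sketched in Python (an illustrative model; the names here are not the actual Scope internals):

```python
from itertools import groupby

rows = [("a", 1), ("a", 5), ("b", 2), ("b", 4)]

# An "aggregator" is just a fold applied to the rows of one key group,
# which is exactly the shape of work a Reducer performs.
def aggregate(rows, fold):
    out = []
    for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                              key=lambda r: r[0]):
        out.append((key, fold(v for _, v in group)))
    return out

print(aggregate(rows, sum))                          # SUM per key
print(aggregate(rows, lambda g: sum(1 for _ in g)))  # COUNT per key
```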
Syntax
Let's look at the following script:
REDUCE
PRODUCE A,B,C,D
USING MyReducer(-a)
ON X,Y
WHERE X != "scope"
HAVING A != "filter";
What this is saying is:
- Input the RowSet produced by the previous command
- Filter the input so that only rows having X not equal to "scope" will be further processed.
- Group on the input fields X and Y
- Use the Reducer called "MyReducer" with the argument "-a" passed in
- Output the columns "A", "B", "C", and "D"
- Filter the output so that only rows having A not equal to "filter" will be output
Note: The WHERE and HAVING clauses are optional, but the other clauses are required.
Writing a Reducer
Syntax
Note that the implementations of a Processor and a Reducer are very similar. This is by design - but remember that the semantics (and therefore the overhead) are vastly different: Reducers require much more work behind the scenes.
The easiest way to get started is to right-click in the editor and select Implement-Reducer. Here's an example of a Reducer that outputs the number of rows associated with each key:
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // The output schema is identical to the input schema
        return inputSchema.Clone();
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        int count = 0;
        foreach (Row row in input.Rows)
        {
            // Copy the data from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
            }
        }
        // Write the row count into the fourth column (index 3);
        // this assumes the input schema has at least four columns
        outputRow[3].Set(count);
        yield return outputRow;
    }
}
#ENDCS
The two methods that need to be implemented are:
- Produces
- Reduce
Tips
- The outputRow object is already handed to the Reducer; do not create your own Row object
- Make your code efficient
Advanced
In certain cases you may want to exploit the IsRecursive overload, which can increase the performance of a Reducer significantly. You can use this functionality if:
- The input and output Schema share the same number of columns and have matching types
- Rows in the input RowSet sharing a common key can be processed in any order and in potentially different groups
- Partial output RowSets can be combined with partial input RowSets and still produce the same results
- The Reducer is idempotent
Please see Map/Reduce for more information on this topic.
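The conditions above amount to saying that the Reducer's partial outputs can be fed back in as inputs. A toy Python model (illustrative names only, not the Scope API) shows why a count Reducer qualifies:

```python
# A count reducer over rows of the form (key, count): re-reducing partial
# outputs must give the same result as reducing all rows at once.
def reduce_counts(rows):
    """rows: (key, count) pairs, all sharing one key."""
    rows = list(rows)
    key = rows[0][0]
    return (key, sum(count for _, count in rows))

# Reducing all four rows in one call...
full = reduce_counts([("a", 1), ("a", 1), ("a", 1), ("a", 1)])

# ...matches reducing them in two partial groups and combining the results.
part1 = reduce_counts([("a", 1), ("a", 1)])
part2 = reduce_counts([("a", 1), ("a", 1)])
combined = reduce_counts([part1, part2])

print(full == combined)  # True: safe to run recursively
```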
Example1:
//Script GUID:6ad8ab74-26f8-4087-a95c-407131f7ff16
//Used for tracking history
test = EXTRACT key, value
FROM "/local/Dev/V-LinTao/test.txt"
USING DefaultTextExtractor;
output =
REDUCE test
PRODUCE key, countofkey
USING SampleReducer
ON key
;
OUTPUT TO "/local/Dev/V-LinTao/count.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // The output schema has two columns: the key and its row count
        string outputschema = "key, countofkey";
        return new Schema(outputschema);
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        int count = 0;
        foreach (Row row in input.Rows)
        {
            // Copy the key from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
            }
        }
        // Write the row count into the second column (index 1)
        outputRow[1].Set(count);
        yield return outputRow;
    }
}
#ENDCS
// Generated by ScopeStudio, version 1.7.8000.1!
Test data: Test.txt
a | a1 |
b | b1 |
a | a2 |
c | c1 |
d | d1 |
b | b2 |
c | c2 |
a | a3 |
b | B3 |
Result: count.txt
a | 3 |
b | 3 |
c | 2 |
d | 1 |
Example2:
test = EXTRACT key, value
FROM "/local/Dev/V-LinTao/test1.txt"
USING DefaultTextExtractor;
output =
REDUCE test
PRODUCE *
USING SampleReducer("key,k111,k222")
ON key
;
OUTPUT TO "/local/Dev/V-LinTao/test2output_freq.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // Build the output schema from the comma-separated argument,
        // e.g. "key,k111,k222" -> columns key, k111, k222
        string[] schema = args[0].Split(',');
        StringBuilder outputSchema = new StringBuilder(schema[0]);
        for (int i = 1; i < schema.Length; i++)
        {
            outputSchema.AppendFormat(",{0}", schema[i]);
        }
        return new Schema(outputSchema.ToString());
    }

    // Reset all count columns to "0" (column 0 holds the key)
    private void RefreshRows(string[] rows)
    {
        for (int i = 1; i < rows.Length; i++)
        {
            rows[i] = "0";
        }
    }

    // Copy the accumulated values into the output Row
    private void SetRows(Row output, string[] rows)
    {
        for (int i = 0; i < rows.Length; i++)
        {
            output[i].Set(rows[i]);
        }
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        string[] schema = args[0].Split(',');
        int count = 0;
        string[] UserRow = new string[schema.Length];
        foreach (Row row in input.Rows)
        {
            // On the first Row of the group, record the key and zero the counts
            if (++count == 1)
            {
                UserRow[0] = row[0].ToString();
                RefreshRows(UserRow);
            }
            // Increment the count of every column whose name contains
            // the row's value, e.g. value "111" matches column "k111"
            for (int i = 1; i < schema.Length; i++)
            {
                if (schema[i].Contains(row[1].ToString()))
                {
                    UserRow[i] = (1 + int.Parse(UserRow[i])).ToString();
                }
            }
        }
        SetRows(outputRow, UserRow);
        yield return outputRow;
    }
}
#ENDCS
Test data: test1.txt
a | 111 |
b | 222 |
c | 111 |
d | 222 |
e | 111 |
a | 111 |
b | 222 |
Result: test2output_freq.txt
a | 2 | 0 |
b | 0 | 2 |
c | 1 | 0 |
d | 0 | 1 |
e | 1 | 0 |
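The counting logic of Example2 can be checked against the test data with a short Python model (illustrative only; it mirrors the C# check `schema[i].Contains(row[1].ToString())`):

```python
from itertools import groupby

# The Reducer argument and the rows from test1.txt above.
schema = "key,k111,k222".split(",")
rows = [("a", "111"), ("b", "222"), ("c", "111"), ("d", "222"),
        ("e", "111"), ("a", "111"), ("b", "222")]

output = []
for key, group in groupby(sorted(rows, key=lambda r: r[0]),
                          key=lambda r: r[0]):
    counts = [0] * (len(schema) - 1)
    for _, value in group:
        for i, col in enumerate(schema[1:]):
            # A value matches a column whose name contains it,
            # e.g. "111" matches "k111"
            if value in col:
                counts[i] += 1
    output.append((key, *counts))

print(output)
# [('a', 2, 0), ('b', 0, 2), ('c', 1, 0), ('d', 0, 1), ('e', 1, 0)]
```

This reproduces the result table above.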
AXConversionFilter_process =
SELECT UserID,
DeviceType,
PageGroup,
ExposureTime,
ConversionTime,
ConversionTime.Subtract(ExposureTime).TotalSeconds AS DateDiff_Seconds
FROM AXConversionFilter_output;
OUTPUT
TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt";
//
//AXConversionFilter_process =
// EXTRACT UserID,
// DeviceType,
// PageGroup,
// ExposureTime,
// ConversionTime,
// DateDiff_Seconds
// FROM @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversionFilter_process.txt"
// USING DefaultTextExtractor;
AXConversion_Final =
REDUCE AXConversionFilter_process
PRODUCE UserID,
DeviceType,
PageGroup,
ExposureTime,
ConversionTime,
DateDiff_Seconds
USING SampleReducer
ON UserID, PageGroup;
OUTPUT
TO @"E:\custom request\Jaminat\201503\O2_UK_Confirmation\AXConversion_Final.txt";
#CS
public class SampleReducer : Reducer
{
    public override Schema Produces(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        return inputSchema.Clone();
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outputRow, string[] args)
    {
        // For each (UserID, PageGroup) group, keep the conversion whose
        // DateDiff_Seconds is the smallest non-negative value; if every
        // diff is negative, keep the one closest to zero.
        double diffseconds = 0;
        double diff = 0;
        int count = 0;
        DateTime exposureTime = DateTime.Now;
        DateTime conversionTime = DateTime.Now;
        foreach (Row row in input.Rows)
        {
            // Copy the data from the first Row of the group
            if (++count == 1)
            {
                row.Copy(outputRow);
                exposureTime = row[3].DateTime;
                conversionTime = row[4].DateTime;
                diffseconds = row[5].Double;
            }
            else
            {
                diff = row[5].Double;
                // Both negative: prefer the diff closer to zero
                if (diff > diffseconds && (diff <= 0 && diffseconds <= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // Both non-negative: prefer the smaller diff
                else if (diff < diffseconds && (diff >= 0 && diffseconds >= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // A non-negative diff always beats a negative one
                else if (diff > diffseconds && (diff >= 0 && diffseconds <= 0))
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
                // An exact match (diff == 0) always wins
                if (diff == 0)
                {
                    exposureTime = row[3].DateTime;
                    conversionTime = row[4].DateTime;
                    diffseconds = diff;
                }
            }
        }
        outputRow[3].Set(exposureTime.ToString());
        outputRow[4].Set(conversionTime.ToString());
        outputRow[5].Set(diffseconds);
        yield return outputRow;
    }
}
#ENDCS
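The branch cascade above appears to implement a simple selection rule per key, which can be stated much more compactly. A Python sketch of that rule (an illustrative reading of the C# logic, not the Scope code itself):

```python
# Given the DateDiff_Seconds values of one (UserID, PageGroup) group,
# pick the diff the Reducer above would keep: the smallest non-negative
# value, or, if all diffs are negative, the one closest to zero.
def pick_diff(diffs):
    non_negative = [d for d in diffs if d >= 0]
    if non_negative:
        return min(non_negative)
    return max(diffs)  # all negative: the largest is closest to zero

print(pick_diff([120.0, 5.0, -30.0]))  # 5.0  (smallest non-negative)
print(pick_diff([-300.0, -12.0]))      # -12.0 (closest to zero)
```

In other words, the Reducer keeps the conversion closest after the exposure, falling back to the closest one before it.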