Hard Problem 29: 41. First Missing Positive

This post covers an algorithm problem: find the first missing positive integer in an unsorted integer array. Using a HashSet for storage and lookup, the solution finishes in O(n) time.

Problem statement:
Given an unsorted integer array, find the first missing positive integer.

For example,
Given [1,2,0] return 3,
and [3,4,-1,1] return 2.
Problem summary:
Given an unsorted integer array, find the first missing positive integer.
Analysis:
(This one really is original.) A HashSet does the job: check whether the set contains n but not n+1.
Source code (language: java):

```java
import java.util.HashSet;

public class Solution {
    public int firstMissingPositive(int[] nums) {
        HashSet<Integer> set = new HashSet<Integer>();
        int max = 0;
        for (int num : nums) {
            set.add(num);
            if (num > max)
                max = num;
        }
        if (!set.contains(1))
            return 1;
        for (int i = 1; i < max; i++) {
            if (set.contains(i) && !set.contains(i + 1))
                return i + 1;
        }
        return max + 1;
    }
}
```
Results:
3 ms, 3.77%; 1 ms, 82.72%
cmershen's remarks:
The problem asks for O(n) time and O(1) space complexity, but this algorithm uses a HashSet. The solutions online seem to use some bucket-sort-based algorithm, which I don't yet understand.
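The O(1)-space algorithm mentioned above is usually less a bucket sort than an in-place index-placement (sometimes called cyclic sort) trick: swap every value v in the range 1..n into slot v-1, then scan for the first slot whose value does not match. A sketch in Python (not part of the original post, whose own solution is the Java HashSet version above):

```python
def first_missing_positive(nums):
    """O(n) time, O(1) extra space: the array itself acts as the 'buckets'."""
    n = len(nums)
    for i in range(n):
        # Swap nums[i] toward its home slot (value v belongs at index v-1)
        # until the value here is out of range, a duplicate, or in place.
        while 1 <= nums[i] <= n and nums[nums[i] - 1] != nums[i]:
            home = nums[i] - 1
            nums[i], nums[home] = nums[home], nums[i]
    for i in range(n):
        if nums[i] != i + 1:
            return i + 1
    return n + 1

print(first_missing_positive([1, 2, 0]))      # 3
print(first_missing_positive([3, 4, -1, 1]))  # 2
```

Each element is swapped into its home slot at most once overall, so the nested loop is still O(n) in total.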

Quickstart

Note

The data files used in the quickstart guide are updated from time to time, which means that the adjusted close changes and, with it, the close (and the other components). The actual output may therefore differ from what was put in the documentation at the time of writing.

Using the platform

Let's run through a series of examples (from an almost empty one to a fully fledged strategy), but not before roughly explaining two basic concepts when working with backtrader.

Lines

Data Feeds, Indicators and Strategies have lines. A line is a succession of points that, when joined together, form this line. When talking about the markets, a Data Feed usually has the following set of points per day: Open, High, Low, Close, Volume, OpenInterest. The series of "Open"s along time is a Line, and therefore a Data Feed usually has 6 lines. If we also consider "DateTime" (which is the actual reference for a single point), we could count 7 lines.

Index 0 Approach

When accessing the values in a line, the current value is accessed with index 0, and the "last" output value is accessed with -1. This is in line with Python conventions for iterables (a line can be iterated and is therefore an iterable), where index -1 is used to access the "last" item of the iterable/array. In our case it is the last output value that gets accessed. Since index 0 comes right after -1, it is used to access the current moment in the line.

With that in mind, if we imagine a Strategy featuring a Simple Moving Average created during initialization:

```python
self.sma = SimpleMovingAverage(.....)
```

The easiest and simplest way to access the current value of this moving average is:

```python
av = self.sma[0]
```

There is no need to know how many bars/minutes/days/months have been processed, because "0" uniquely identifies the current instant.
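To make the 0/-1 convention concrete, here is a hypothetical sketch (plain Python, not actual backtrader internals) of a line-like buffer in which [0] is the newest point and negative indices walk backwards in time:

```python
class Line:
    """Toy 'line': grows as bars arrive; [0] is current, [-1] previous."""

    def __init__(self):
        self._values = []

    def append(self, value):
        # A new bar arrived: this value becomes the current point
        self._values.append(value)

    def __getitem__(self, ago):
        # ago=0 -> current bar, ago=-1 -> previous bar, ago=-2 -> ...
        if ago > 0 or -ago >= len(self._values):
            raise IndexError('ago out of range')
        return self._values[len(self._values) - 1 + ago]

sma = Line()
for v in (10.0, 11.0, 12.0):
    sma.append(v)

current, previous = sma[0], sma[-1]  # 12.0 and 11.0
```

The point of the convention is exactly what the text above says: the strategy never needs an absolute bar counter, because 0 always means "now".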
Following pythonic tradition, the "last" output value is accessed using -1:

```python
previous_value = self.sma[-1]
```

Of course, earlier output values can be accessed with -2, -3, …

From 0 to 100: the samples

Basic Setup

Let's get running.

```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import backtrader as bt

if __name__ == '__main__':
    cerebro = bt.Cerebro()

    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    cerebro.run()

    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 10000.00
Final Portfolio Value: 10000.00

In this example:

- backtrader was imported
- The Cerebro engine was instantiated
- The resulting cerebro instance was told to run (loop over data)
- The resulting outcome was printed out

Although it doesn't seem like much, let's point out something explicitly shown:

- The Cerebro engine created a broker instance in the background
- The instance already has some cash to start with

This behind-the-scenes broker instantiation is a constant trait in the platform, meant to simplify the life of the user. If no broker is set by the user, a default one is put in place, and 10K monetary units is a usual starting value with some brokers.

Setting the Cash

In the world of finance, for sure only "losers" start with 10k. Let's change the cash and run the example again.

```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import backtrader as bt

if __name__ == '__main__':
    cerebro = bt.Cerebro()

    cerebro.broker.setcash(100000.0)

    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    cerebro.run()

    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
Final Portfolio Value: 100000.00

Mission accomplished. Let's move to tempestuous waters.
Adding a Data Feed

Having cash is fun, but the purpose behind all this is to let an automated strategy multiply the cash without moving a finger, by operating on an asset which we see as a Data Feed. Ergo: no Data Feed, no fun. Let's add one to the ever-growing example.

```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import datetime  # For datetime objects
import os.path  # To manage paths
import sys  # To find out the script name (in argv[0])

# Import the backtrader platform
import backtrader as bt

if __name__ == '__main__':
    # Create a cerebro entity
    cerebro = bt.Cerebro()

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt')

    # Create a Data Feed
    data = bt.feeds.YahooFinanceCSVData(
        dataname=datapath,
        # Do not pass values before this date
        fromdate=datetime.datetime(2000, 1, 1),
        # Do not pass values after this date
        todate=datetime.datetime(2000, 12, 31),
        reverse=False)

    # Add the Data Feed to Cerebro
    cerebro.adddata(data)

    # Set our desired cash start
    cerebro.broker.setcash(100000.0)

    # Print out the starting conditions
    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    # Run over everything
    cerebro.run()

    # Print out the final result
    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
Final Portfolio Value: 100000.00

The amount of boilerplate has grown slightly, because we added:

- Finding out where our example script is, to be able to locate the sample Data Feed file
- datetime objects to filter which data from the Data Feed we will be operating on

Aside from that, the Data Feed is created and added to cerebro. The output has not changed, and it would be a miracle if it had.
Note

Yahoo Online sends the CSV data in date-descending order, which is not the standard convention. The reverse=False parameter takes into account that the CSV data in this sample file has already been reversed and is in the standard, expected date-ascending order.

Our First Strategy

The cash is in the broker and the Data Feed is there. It seems like risky business is just around the corner. Let's put a Strategy into the equation and print the "Close" price of each day (bar).

DataSeries (the underlying class in Data Feeds) objects have aliases to access the well-known OHLC (Open, High, Low, Close) daily values. This should ease the creation of our printing logic.

```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import datetime  # For datetime objects
import os.path  # To manage paths
import sys  # To find out the script name (in argv[0])

# Import the backtrader platform
import backtrader as bt


# Create a Strategy
class TestStrategy(bt.Strategy):

    def log(self, txt, dt=None):
        ''' Logging function for this strategy'''
        dt = dt or self.datas[0].datetime.date(0)
        print('%s, %s' % (dt.isoformat(), txt))

    def __init__(self):
        # Keep a reference to the "close" line in the data[0] dataseries
        self.dataclose = self.datas[0].close

    def next(self):
        # Simply log the closing price of the series from the reference
        self.log('Close, %.2f' % self.dataclose[0])


if __name__ == '__main__':
    # Create a cerebro entity
    cerebro = bt.Cerebro()

    # Add a strategy
    cerebro.addstrategy(TestStrategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt')

    # Create a Data Feed
    data = bt.feeds.YahooFinanceCSVData(
        dataname=datapath,
        # Do not pass values before this date
        fromdate=datetime.datetime(2000, 1, 1),
        # Do not pass values after this date
        todate=datetime.datetime(2000, 12, 31),
        reverse=False)

    # Add the Data Feed to Cerebro
    cerebro.adddata(data)

    # Set our desired cash start
    cerebro.broker.setcash(100000.0)

    # Print out the starting conditions
    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    # Run over everything
    cerebro.run()

    # Print out the final result
    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
2000-01-03T00:00:00, Close, 27.85
2000-01-04T00:00:00, Close, 25.39
2000-01-05T00:00:00, Close, 24.05
...
...
...
2000-12-26T00:00:00, Close, 29.17
2000-12-27T00:00:00, Close, 28.94
2000-12-28T00:00:00, Close, 29.29
2000-12-29T00:00:00, Close, 27.41
Final Portfolio Value: 100000.00

Someone said the stock market was risky business, but it doesn't seem so. Let's explain some of the magic:

- Upon __init__ being called, the strategy already has a list of the datas present in the platform. This is a standard Python list, and datas can be accessed in the order they were inserted.
- The first data in the list, self.datas[0], is the default data for trading operations and the one that keeps all strategy elements synchronized (it's the system clock).
- self.dataclose = self.datas[0].close keeps a reference to the close line. Only one level of indirection is later needed to access the close values.
- The strategy's next method will be called on each bar of the system clock (self.datas[0]). This is true until other things come into play, like indicators, which need some bars before they start producing an output.
More on that later.

Adding some Logic to the Strategy

Let's try a crazy idea we had by looking at some charts: if the price has been falling 3 sessions in a row … BUY BUY BUY!!!

```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import datetime  # For datetime objects
import os.path  # To manage paths
import sys  # To find out the script name (in argv[0])

# Import the backtrader platform
import backtrader as bt


# Create a Strategy
class TestStrategy(bt.Strategy):

    def log(self, txt, dt=None):
        ''' Logging function for this strategy'''
        dt = dt or self.datas[0].datetime.date(0)
        print('%s, %s' % (dt.isoformat(), txt))

    def __init__(self):
        # Keep a reference to the "close" line in the data[0] dataseries
        self.dataclose = self.datas[0].close

    def next(self):
        # Simply log the closing price of the series from the reference
        self.log('Close, %.2f' % self.dataclose[0])

        if self.dataclose[0] < self.dataclose[-1]:
            # current close less than previous close

            if self.dataclose[-1] < self.dataclose[-2]:
                # previous close less than the previous close

                # BUY, BUY, BUY!!! (with all possible default parameters)
                self.log('BUY CREATE, %.2f' % self.dataclose[0])
                self.buy()


if __name__ == '__main__':
    # Create a cerebro entity
    cerebro = bt.Cerebro()

    # Add a strategy
    cerebro.addstrategy(TestStrategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt')

    # Create a Data Feed
    data = bt.feeds.YahooFinanceCSVData(
        dataname=datapath,
        # Do not pass values before this date
        fromdate=datetime.datetime(2000, 1, 1),
        # Do not pass values after this date
        todate=datetime.datetime(2000, 12, 31),
        reverse=False)

    # Add the Data Feed to Cerebro
    cerebro.adddata(data)

    # Set our desired cash start
    cerebro.broker.setcash(100000.0)

    # Print out the starting conditions
    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    # Run over everything
    cerebro.run()

    # Print out the final result
    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
2000-01-03, Close, 27.85
2000-01-04, Close, 25.39
2000-01-05, Close, 24.05
2000-01-05, BUY CREATE, 24.05
2000-01-06, Close, 22.63
2000-01-06, BUY CREATE, 22.63
2000-01-07, Close, 24.37
...
...
...
2000-12-20, BUY CREATE, 26.88
2000-12-21, Close, 27.82
2000-12-22, Close, 30.06
2000-12-26, Close, 29.17
2000-12-27, Close, 28.94
2000-12-27, BUY CREATE, 28.94
2000-12-28, Close, 29.29
2000-12-29, Close, 27.41
Final Portfolio Value: 99725.08

Several "BUY" creation orders were issued and our portfolio value decreased. A couple of important things are clearly missing: the order was created, but it is unknown whether it was executed, when, and at what price. The next example will build upon that by listening to notifications of order status.

The curious reader may ask how many shares are being bought, what asset is being bought, and how orders are being executed.
Where possible (and in this case it is), the platform fills in the gaps:

- self.datas[0] (the main data, aka system clock) is the target asset if no other one is specified
- The stake is provided behind the scenes by a position sizer, which uses a fixed stake with a default of "1". It will be modified in a later example
- The order is executed "At Market". The broker (shown in previous examples) executes this using the opening price of the next bar, because that's the first tick after the bar currently under examination
- The order is, so far, executed without any commission (more on that later)

Do not only buy … but SELL

After knowing how to enter the market (going long), an "exit concept" is needed, along with an understanding of whether the strategy is in the market.

- Luckily, a Strategy object offers access to a position attribute for the default data feed
- The buy and sell methods return the created (not yet executed) order
- Changes in an order's status will be notified to the strategy via a notify method

The "exit concept" will be an easy one: exit after 5 bars (on the 6th bar) have elapsed, for good or for worse. Please notice that there is no "time" or "timeframe" implied: just a number of bars. The bars can represent 1 minute, 1 hour, 1 day, 1 week or any other time period. Although we know the data source is a daily one, the strategy makes no assumption about that.

Additionally, and to simplify: only allow a buy order if not yet in the market.

Note

The next method gets no "bar index" passed, so it may seem obscure how to know when 5 bars have elapsed, but this has been modeled in a pythonic way: call len on an object and it will tell you the length of its lines. Just write down (save in a variable) at which length an operation took place, and see if the current length is 5 bars away.
```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import datetime  # For datetime objects
import os.path  # To manage paths
import sys  # To find out the script name (in argv[0])

# Import the backtrader platform
import backtrader as bt


# Create a Strategy
class TestStrategy(bt.Strategy):

    def log(self, txt, dt=None):
        ''' Logging function for this strategy'''
        dt = dt or self.datas[0].datetime.date(0)
        print('%s, %s' % (dt.isoformat(), txt))

    def __init__(self):
        # Keep a reference to the "close" line in the data[0] dataseries
        self.dataclose = self.datas[0].close

        # To keep track of pending orders
        self.order = None

    def notify_order(self, order):
        if order.status in [order.Submitted, order.Accepted]:
            # Buy/Sell order submitted/accepted to/by broker - Nothing to do
            return

        # Check if an order has been completed
        # Attention: broker could reject order if not enough cash
        if order.status in [order.Completed]:
            if order.isbuy():
                self.log('BUY EXECUTED, %.2f' % order.executed.price)
            elif order.issell():
                self.log('SELL EXECUTED, %.2f' % order.executed.price)

            self.bar_executed = len(self)

        elif order.status in [order.Canceled, order.Margin, order.Rejected]:
            self.log('Order Canceled/Margin/Rejected')

        # Write down: no pending order
        self.order = None

    def next(self):
        # Simply log the closing price of the series from the reference
        self.log('Close, %.2f' % self.dataclose[0])

        # Check if an order is pending ... if yes, we cannot send a 2nd one
        if self.order:
            return

        # Check if we are in the market
        if not self.position:

            # Not yet ... we MIGHT BUY if ...
            if self.dataclose[0] < self.dataclose[-1]:
                # current close less than previous close

                if self.dataclose[-1] < self.dataclose[-2]:
                    # previous close less than the previous close

                    # BUY, BUY, BUY!!! (with default parameters)
                    self.log('BUY CREATE, %.2f' % self.dataclose[0])

                    # Keep track of the created order to avoid a 2nd order
                    self.order = self.buy()

        else:

            # Already in the market ... we might sell
            if len(self) >= (self.bar_executed + 5):
                # SELL, SELL, SELL!!! (with all possible default parameters)
                self.log('SELL CREATE, %.2f' % self.dataclose[0])

                # Keep track of the created order to avoid a 2nd order
                self.order = self.sell()


if __name__ == '__main__':
    # Create a cerebro entity
    cerebro = bt.Cerebro()

    # Add a strategy
    cerebro.addstrategy(TestStrategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt')

    # Create a Data Feed
    data = bt.feeds.YahooFinanceCSVData(
        dataname=datapath,
        # Do not pass values before this date
        fromdate=datetime.datetime(2000, 1, 1),
        # Do not pass values after this date
        todate=datetime.datetime(2000, 12, 31),
        reverse=False)

    # Add the Data Feed to Cerebro
    cerebro.adddata(data)

    # Set our desired cash start
    cerebro.broker.setcash(100000.0)

    # Print out the starting conditions
    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    # Run over everything
    cerebro.run()

    # Print out the final result
    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
2000-01-03T00:00:00, Close, 27.85
2000-01-04T00:00:00, Close, 25.39
2000-01-05T00:00:00, Close, 24.05
2000-01-05T00:00:00, BUY CREATE, 24.05
2000-01-06T00:00:00, BUY EXECUTED, 23.61
2000-01-06T00:00:00, Close, 22.63
2000-01-07T00:00:00, Close, 24.37
2000-01-10T00:00:00, Close, 27.29
2000-01-11T00:00:00, Close, 26.49
2000-01-12T00:00:00, Close, 24.90
2000-01-13T00:00:00, Close, 24.77
2000-01-13T00:00:00, SELL CREATE, 24.77
2000-01-14T00:00:00, SELL EXECUTED, 25.70
2000-01-14T00:00:00, Close, 25.18
...
...
...
2000-12-15T00:00:00, SELL CREATE, 26.93
2000-12-18T00:00:00, SELL EXECUTED, 28.29
2000-12-18T00:00:00, Close, 30.18
2000-12-19T00:00:00, Close, 28.88
2000-12-20T00:00:00, Close, 26.88
2000-12-20T00:00:00, BUY CREATE, 26.88
2000-12-21T00:00:00, BUY EXECUTED, 26.23
2000-12-21T00:00:00, Close, 27.82
2000-12-22T00:00:00, Close, 30.06
2000-12-26T00:00:00, Close, 29.17
2000-12-27T00:00:00, Close, 28.94
2000-12-28T00:00:00, Close, 29.29
2000-12-29T00:00:00, Close, 27.41
2000-12-29T00:00:00, SELL CREATE, 27.41
Final Portfolio Value: 100018.53

Blistering Barnacles!!! The system made money … something must be wrong.

The broker says: show me the money! And the money is called "commission". Let's add a reasonable 0.1% commission rate per operation (both for buying and selling … yes, the broker is avid …)

A single line will suffice for it:

```python
# 0.1% ... divide by 100 to remove the %
cerebro.broker.setcommission(commission=0.001)
```

Being experienced with the platform, we want to see the profit or loss after a buy/sell cycle, with and without commission.
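Before running it, a quick back-of-envelope check (plain arithmetic, using the execution prices from the previous run's log) of what a 0.1% rate per side means for the first 1-share buy/sell cycle (bought at 23.61, sold at 25.70):

```python
commission_rate = 0.001  # 0.1% expressed as a fraction

buy_price = 23.61   # BUY EXECUTED price from the previous run (1 share)
sell_price = 25.70  # matching SELL EXECUTED price

# Each leg of the round trip pays the rate on its own cash value
buy_comm = buy_price * commission_rate    # ~0.02
sell_comm = sell_price * commission_rate  # ~0.03

gross = sell_price - buy_price            # 2.09 gross profit
net = gross - buy_comm - sell_comm        # ~2.04 net of both commissions
```

So roughly 0.05 of the 2.09 gross profit goes to the broker on this cycle.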
```python
from __future__ import (absolute_import, division, print_function,
                        unicode_literals)

import datetime  # For datetime objects
import os.path  # To manage paths
import sys  # To find out the script name (in argv[0])

# Import the backtrader platform
import backtrader as bt


# Create a Strategy
class TestStrategy(bt.Strategy):

    def log(self, txt, dt=None):
        ''' Logging function for this strategy'''
        dt = dt or self.datas[0].datetime.date(0)
        print('%s, %s' % (dt.isoformat(), txt))

    def __init__(self):
        # Keep a reference to the "close" line in the data[0] dataseries
        self.dataclose = self.datas[0].close

        # To keep track of pending orders and buy price/commission
        self.order = None
        self.buyprice = None
        self.buycomm = None

    def notify_order(self, order):
        if order.status in [order.Submitted, order.Accepted]:
            # Buy/Sell order submitted/accepted to/by broker - Nothing to do
            return

        # Check if an order has been completed
        # Attention: broker could reject order if not enough cash
        if order.status in [order.Completed]:
            if order.isbuy():
                self.log(
                    'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' %
                    (order.executed.price,
                     order.executed.value,
                     order.executed.comm))

                self.buyprice = order.executed.price
                self.buycomm = order.executed.comm
            else:  # Sell
                self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' %
                         (order.executed.price,
                          order.executed.value,
                          order.executed.comm))

            self.bar_executed = len(self)

        elif order.status in [order.Canceled, order.Margin, order.Rejected]:
            self.log('Order Canceled/Margin/Rejected')

        self.order = None

    def notify_trade(self, trade):
        if not trade.isclosed:
            return

        self.log('OPERATION PROFIT, GROSS %.2f, NET %.2f' %
                 (trade.pnl, trade.pnlcomm))

    def next(self):
        # Simply log the closing price of the series from the reference
        self.log('Close, %.2f' % self.dataclose[0])

        # Check if an order is pending ... if yes, we cannot send a 2nd one
        if self.order:
            return

        # Check if we are in the market
        if not self.position:

            # Not yet ... we MIGHT BUY if ...
            if self.dataclose[0] < self.dataclose[-1]:
                # current close less than previous close

                if self.dataclose[-1] < self.dataclose[-2]:
                    # previous close less than the previous close

                    # BUY, BUY, BUY!!! (with default parameters)
                    self.log('BUY CREATE, %.2f' % self.dataclose[0])

                    # Keep track of the created order to avoid a 2nd order
                    self.order = self.buy()

        else:

            # Already in the market ... we might sell
            if len(self) >= (self.bar_executed + 5):
                # SELL, SELL, SELL!!! (with all possible default parameters)
                self.log('SELL CREATE, %.2f' % self.dataclose[0])

                # Keep track of the created order to avoid a 2nd order
                self.order = self.sell()


if __name__ == '__main__':
    # Create a cerebro entity
    cerebro = bt.Cerebro()

    # Add a strategy
    cerebro.addstrategy(TestStrategy)

    # Datas are in a subfolder of the samples. Need to find where the script is
    # because it could have been called from anywhere
    modpath = os.path.dirname(os.path.abspath(sys.argv[0]))
    datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt')

    # Create a Data Feed
    data = bt.feeds.YahooFinanceCSVData(
        dataname=datapath,
        # Do not pass values before this date
        fromdate=datetime.datetime(2000, 1, 1),
        # Do not pass values after this date
        todate=datetime.datetime(2000, 12, 31),
        reverse=False)

    # Add the Data Feed to Cerebro
    cerebro.adddata(data)

    # Set our desired cash start
    cerebro.broker.setcash(100000.0)

    # Set the commission - 0.1% ... divide by 100 to remove the %
    cerebro.broker.setcommission(commission=0.001)

    # Print out the starting conditions
    print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue())

    # Run over everything
    cerebro.run()

    # Print out the final result
    print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue())
```

After the execution the output is:

Starting Portfolio Value: 100000.00
2000-01-03T00:00:00, Close, 27.85
2000-01-04T00:00:00, Close, 25.39
2000-01-05T00:00:00, Close, 24.05
2000-01-05T00:00:00, BUY CREATE, 24.05
2000-01-06T00:00:00, BUY EXECUTED, Price: 23.61, Cost: 23.61, Comm 0.02
2000-01-06T00:00:00, Close, 22.63
2000-01-07T00:00:00, Close, 24.37
2000-01-10T00:00:00, Close, 27.29
2000-01-11T00:00:00, Close, 26.49
2000-01-12T00:00:00, Close, 24.90
2000-01-13T00:00:00, Close, 24.77
2000-01-13T00:00:00, SELL CREATE, 24.77
2000-01-14T00:00:00, SELL EXECUTED, Price: 25.70, Cost: 25.70, Comm 0.03
2000-01-14T00:00:00, OPERATION PROFIT, GROSS 2.09, NET 2.04
2000-01-14T00:00:00, Close, 25.18
...
...
...
2000-12-15T00:00:00, SELL CREATE, 26.93
2000-12-18T00:00:00, SELL EXECUTED, Price: 28.29, Cost: 28.29, Comm 0.03
2000-12-18T00:00:00, OPERATION PROFIT, GROSS -0.06, NET -0.12
2000-12-18T00:00:00, Close, 30.18
2000-12-19T00:00:00, Close, 28.88
2000-12-20T00:00:00, Close, 26.88
2000-12-20T00:00:00, BUY CREATE, 26.88
2000-12-21T00:00:00, BUY EXECUTED, Price: 26.23, Cost: 26.23, Comm 0.03
2000-12-21T00:00:00, Close, 27.82
2000-12-22T00:00:00, Close, 30.06
2000-12-26T00:00:00, Close, 29.17
2000-12-27T00:00:00, Close, 28.94
2000-12-28T00:00:00, Close, 29.29
2000-12-29T00:00:00, Close, 27.41
2000-12-29T00:00:00, SELL CREATE, 27.41
Final Portfolio Value: 100016.98

God Save the Queen!!! The system still made money.
Before moving on, let's notice something by filtering the "OPERATION PROFIT" lines:

2000-01-14T00:00:00, OPERATION PROFIT, GROSS 2.09, NET 2.04
2000-02-07T00:00:00, OPERATION PROFIT, GROSS 3.68, NET 3.63
2000-02-28T00:00:00, OPERATION PROFIT, GROSS 4.48, NET 4.42
2000-03-13T00:00:00, OPERATION PROFIT, GROSS 3.48, NET 3.41
2000-03-22T00:00:00, OPERATION PROFIT, GROSS -0.41, NET -0.49
2000-04-07T00:00:00, OPERATION PROFIT, GROSS 2.45, NET 2.37
2000-04-20T00:00:00, OPERATION PROFIT, GROSS -1.95, NET -2.02
2000-05-02T00:00:00, OPERATION PROFIT, GROSS 5.46, NET 5.39
2000-05-11T00:00:00, OPERATION PROFIT, GROSS -3.74, NET -3.81
2000-05-30T00:00:00, OPERATION PROFIT, GROSS -1.46, NET -1.53
2000-07-05T00:00:00, OPERATION PROFIT, GROSS -1.62, NET -1.69
2000-07-14T00:00:00, OPERATION PROFIT, GROSS 2.08, NET 2.01
2000-07-28T00:00:00, OPERATION PROFIT, GROSS 0.14, NET 0.07
2000-08-08T00:00:00, OPERATION PROFIT, GROSS 4.36, NET 4.29
2000-08-21T00:00:00, OPERATION PROFIT, GROSS 1.03, NET 0.95
2000-09-15T00:00:00, OPERATION PROFIT, GROSS -4.26, NET -4.34
2000-09-27T00:00:00, OPERATION PROFIT, GROSS 1.29, NET 1.22
2000-10-13T00:00:00, OPERATION PROFIT, GROSS -2.98, NET -3.04
2000-10-26T00:00:00, OPERATION PROFIT, GROSS 3.01, NET 2.95
2000-11-06T00:00:00, OPERATION PROFIT, GROSS -3.59, NET -3.65
2000-11-16T00:00:00, OPERATION PROFIT, GROSS 1.28, NET 1.23
2000-12-01T00:00:00, OPERATION PROFIT, GROSS 2.59, NET 2.54
2000-12-18T00:00:00, OPERATION PROFIT, GROSS -0.06, NET -0.12

Adding up the "NET" profits, the final figure is:

15.83

But the system said the following at the end:

2000-12-29T00:00:00, SELL CREATE, 27.41
Final Portfolio Value: 100016.98

And obviously 15.83 is not 16.98. There is no error whatsoever. The "NET" profit of 15.83 is already cash in the bag.

Unfortunately (or fortunately, to better understand the platform), there is an open position on the last day of the Data Feed. Even if a SELL operation has been sent … IT HAS NOT YET BEEN EXECUTED.
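The NET figures from those "OPERATION PROFIT" lines can be re-added with a few lines of Python to confirm the 15.83 total:

```python
# NET values copied from the filtered "OPERATION PROFIT" lines above
net_profits = [
    2.04, 3.63, 4.42, 3.41, -0.49, 2.37, -2.02, 5.39,
    -3.81, -1.53, -1.69, 2.01, 0.07, 4.29, 0.95, -4.34,
    1.22, -3.04, 2.95, -3.65, 1.23, 2.54, -0.12,
]

total = round(sum(net_profits), 2)
print(total)  # 15.83
```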
The "Final Portfolio Value" calculated by the broker takes into account the "Close" price on 2000-12-29. The actual execution price would have been set on the next trading day, which happened to be 2001-01-02. Extending the Data Feed to take this day into account, the output is:

2001-01-02T00:00:00, SELL EXECUTED, Price: 27.87, Cost: 27.87, Comm 0.03
2001-01-02T00:00:00, OPERATION PROFIT, GROSS 1.64, NET 1.59
2001-01-02T00:00:00, Close, 24.87
2001-01-02T00:00:00, BUY CREATE, 24.87
Final Portfolio Value: 100017.41

Now, adding the previous NET profit to the completed operation's net profit: 15.83 + 1.59 = 17.42. Which (discarding rounding errors in the print statements) is the extra portfolio value above the initial 100000 monetary units the strategy started with.

Customizing the Strategy: Parameters

It would be a bit impractical to hardcode some of the values in the strategy with no chance of changing them easily. Parameters come in handy here. Defining parameters is easy and looks like:

```python
params = (('myparam', 27), ('exitbars', 5),)
```

This being a standard Python tuple with some tuples inside it, the following may look more appealing to some:

```python
params = (
    ('myparam', 27),
    ('exitbars', 5),
)
```

With either formatting, parametrization of the strategy is allowed when adding the strategy to the Cerebro engine:

```python
# Add a strategy
cerebro.addstrategy(TestStrategy, myparam=20, exitbars=7)
```

Note

The setsizing method below is deprecated. This content is kept here for anyone looking at old samples of the sources. The sources have been updated to use:

```python
cerebro.addsizer(bt.sizers.FixedSize, stake=10)
```

Please read the section about sizers.

Using the parameters in the strategy is easy, as they are stored in a "params" attribute.
If we for example want to set the stake fix, we can pass the stake parameter to the position sizer like this durint init: # Set the sizer stake from the params self.sizer.setsizing(self.params.stake) We could have also called buy and sell with a stake parameter and self.params.stake as the value. The logic to exit gets modified: # Already in the market ... we might sell if len(self) >= (self.bar_executed + self.params.exitbars): With all this in mind the example evolves to look like: from __future__ import (absolute_import, division, print_function, unicode_literals) import datetime # For datetime objects import os.path # To manage paths import sys # To find out the script name (in argv[0]) # Import the backtrader platform import backtrader as bt # Create a Stratey class TestStrategy(bt.Strategy): params = ( ('exitbars', 5), ) def log(self, txt, dt=None): ''' Logging function fot this strategy''' dt = dt or self.datas[0].datetime.date(0) print('%s, %s' % (dt.isoformat(), txt)) def __init__(self): # Keep a reference to the "close" line in the data[0] dataseries self.dataclose = self.datas[0].close # To keep track of pending orders and buy price/commission self.order = None self.buyprice = None self.buycomm = None def notify_order(self, order): if order.status in [order.Submitted, order.Accepted]: # Buy/Sell order submitted/accepted to/by broker - Nothing to do return # Check if an order has been completed # Attention: broker could reject order if not enough cash if order.status in [order.Completed]: if order.isbuy(): self.log( 'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.buyprice = order.executed.price self.buycomm = order.executed.comm else: # Sell self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.bar_executed = len(self) elif order.status in [order.Canceled, order.Margin, order.Rejected]: 
self.log('Order Canceled/Margin/Rejected') self.order = None def notify_trade(self, trade): if not trade.isclosed: return self.log('OPERATION PROFIT, GROSS %.2f, NET %.2f' % (trade.pnl, trade.pnlcomm)) def next(self): # Simply log the closing price of the series from the reference self.log('Close, %.2f' % self.dataclose[0]) # Check if an order is pending ... if yes, we cannot send a 2nd one if self.order: return # Check if we are in the market if not self.position: # Not yet ... we MIGHT BUY if ... if self.dataclose[0] < self.dataclose[-1]: # current close less than previous close if self.dataclose[-1] < self.dataclose[-2]: # previous close less than the previous close # BUY, BUY, BUY!!! (with default parameters) self.log('BUY CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.buy() else: # Already in the market ... we might sell if len(self) >= (self.bar_executed + self.params.exitbars): # SELL, SELL, SELL!!! (with all possible default parameters) self.log('SELL CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.sell() if __name__ == '__main__': # Create a cerebro entity cerebro = bt.Cerebro() # Add a strategy cerebro.addstrategy(TestStrategy) # Datas are in a subfolder of the samples. 
Need to find where the script is # because it could have been called from anywhere modpath = os.path.dirname(os.path.abspath(sys.argv[0])) datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt') # Create a Data Feed data = bt.feeds.YahooFinanceCSVData( dataname=datapath, # Do not pass values before this date fromdate=datetime.datetime(2000, 1, 1), # Do not pass values before this date todate=datetime.datetime(2000, 12, 31), # Do not pass values after this date reverse=False) # Add the Data Feed to Cerebro cerebro.adddata(data) # Set our desired cash start cerebro.broker.setcash(100000.0) # Add a FixedSize sizer according to the stake cerebro.addsizer(bt.sizers.FixedSize, stake=10) # Set the commission - 0.1% ... divide by 100 to remove the % cerebro.broker.setcommission(commission=0.001) # Print out the starting conditions print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue()) # Run over everything cerebro.run() # Print out the final result print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue()) After the execution the output is: Starting Portfolio Value: 100000.00 2000-01-03T00:00:00, Close, 27.85 2000-01-04T00:00:00, Close, 25.39 2000-01-05T00:00:00, Close, 24.05 2000-01-05T00:00:00, BUY CREATE, 24.05 2000-01-06T00:00:00, BUY EXECUTED, Size 10, Price: 23.61, Cost: 236.10, Commission 0.24 2000-01-06T00:00:00, Close, 22.63 ... ... ... 2000-12-20T00:00:00, BUY CREATE, 26.88 2000-12-21T00:00:00, BUY EXECUTED, Size 10, Price: 26.23, Cost: 262.30, Commission 0.26 2000-12-21T00:00:00, Close, 27.82 2000-12-22T00:00:00, Close, 30.06 2000-12-26T00:00:00, Close, 29.17 2000-12-27T00:00:00, Close, 28.94 2000-12-28T00:00:00, Close, 29.29 2000-12-29T00:00:00, Close, 27.41 2000-12-29T00:00:00, SELL CREATE, 27.41 Final Portfolio Value: 100169.80 In order to see the difference, the print outputs have also been extended to show the execution size. 
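The bar-counting exit used in the listing (sell a fixed number of bars after the buy executed) can be sketched in plain Python. This is a simplified toy, assuming the buy fills on the signal bar itself rather than on the next bar as in the real backtest:

```python
# Toy version of the example's entry/exit logic: buy after two consecutive
# lower closes, then sell 'exitbars' bars after the bar of execution.
# Simplification: the fill happens on the signal bar itself.
def simulate(closes, exitbars=5):
    """Return (buy_bar, sell_bar) index pairs produced by the toy logic."""
    trades = []
    in_market = False
    bar_executed = None
    for i in range(2, len(closes)):
        if not in_market:
            # current close < previous close < the close before that
            if closes[i] < closes[i - 1] < closes[i - 2]:
                in_market, bar_executed = True, i
        elif i >= bar_executed + exitbars:
            trades.append((bar_executed, i))
            in_market = False
    return trades

closes = [10, 9, 8, 9, 10, 11, 12, 13, 14, 15]
print(simulate(closes))  # [(2, 7)] -> entry at bar 2, exit 5 bars later
```

The point of the sketch is only the bookkeeping: len(self) in the strategy plays the role of the loop index i here, so the exit condition compares "bars seen so far" against "bar of execution plus exitbars".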
Having multiplied the stake by 10, the obvious has happened: the profit and loss has been multiplied by 10. Instead of 16.98, the surplus is now 169.80 Adding an indicator Having heard of indicators, the next thing anyone would add to the strategy is one of them. For sure they must be much better than a simple “3 lower closes” strategy. Inspired by one of the examples from PyAlgoTrade, here is a strategy using a Simple Moving Average. Buy “AtMarket” if the close is greater than the Average If in the market, sell if the close is smaller than the Average Only 1 active operation is allowed in the market Most of the existing code can be kept in place. Let’s add the average during init and keep a reference to it: self.sma = bt.indicators.MovingAverageSimple(self.datas[0], period=self.params.maperiod) And of course the logic to enter and exit the market will rely on the Average values. Look in the code for the logic. Note The starting cash will be 1000 monetary units to be in line with the PyAlgoTrade example and no commission will be applied. from __future__ import (absolute_import, division, print_function, unicode_literals) import datetime # For datetime objects import os.path # To manage paths import sys # To find out the script name (in argv[0]) # Import the backtrader platform import backtrader as bt # Create a Strategy class TestStrategy(bt.Strategy): params = ( ('maperiod', 15), ) def log(self, txt, dt=None): ''' Logging function for this strategy''' dt = dt or self.datas[0].datetime.date(0) print('%s, %s' % (dt.isoformat(), txt)) def __init__(self): # Keep a reference to the "close" line in the data[0] dataseries self.dataclose = self.datas[0].close # To keep track of pending orders and buy price/commission self.order = None self.buyprice = None self.buycomm = None # Add a MovingAverageSimple indicator self.sma = bt.indicators.SimpleMovingAverage( self.datas[0], period=self.params.maperiod) def notify_order(self, order): if order.status in [order.Submitted,
order.Accepted]: # Buy/Sell order submitted/accepted to/by broker - Nothing to do return # Check if an order has been completed # Attention: broker could reject order if not enough cash if order.status in [order.Completed]: if order.isbuy(): self.log( 'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.buyprice = order.executed.price self.buycomm = order.executed.comm else: # Sell self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.bar_executed = len(self) elif order.status in [order.Canceled, order.Margin, order.Rejected]: self.log('Order Canceled/Margin/Rejected') self.order = None def notify_trade(self, trade): if not trade.isclosed: return self.log('OPERATION PROFIT, GROSS %.2f, NET %.2f' % (trade.pnl, trade.pnlcomm)) def next(self): # Simply log the closing price of the series from the reference self.log('Close, %.2f' % self.dataclose[0]) # Check if an order is pending ... if yes, we cannot send a 2nd one if self.order: return # Check if we are in the market if not self.position: # Not yet ... we MIGHT BUY if ... if self.dataclose[0] > self.sma[0]: # BUY, BUY, BUY!!! (with all possible default parameters) self.log('BUY CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.buy() else: if self.dataclose[0] < self.sma[0]: # SELL, SELL, SELL!!! (with all possible default parameters) self.log('SELL CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.sell() if __name__ == '__main__': # Create a cerebro entity cerebro = bt.Cerebro() # Add a strategy cerebro.addstrategy(TestStrategy) # Datas are in a subfolder of the samples. 
Need to find where the script is # because it could have been called from anywhere modpath = os.path.dirname(os.path.abspath(sys.argv[0])) datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt') # Create a Data Feed data = bt.feeds.YahooFinanceCSVData( dataname=datapath, # Do not pass values before this date fromdate=datetime.datetime(2000, 1, 1), # Do not pass values after this date todate=datetime.datetime(2000, 12, 31), reverse=False) # Add the Data Feed to Cerebro cerebro.adddata(data) # Set our desired cash start cerebro.broker.setcash(1000.0) # Add a FixedSize sizer according to the stake cerebro.addsizer(bt.sizers.FixedSize, stake=10) # Set the commission cerebro.broker.setcommission(commission=0.0) # Print out the starting conditions print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue()) # Run over everything cerebro.run() # Print out the final result print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue()) Now, before skipping to the next section LOOK CAREFULLY at the first date which is shown in the log: It’s no longer 2000-01-03, the first trading day in the year 2K. It’s 2000-01-24 … Who has stolen my cheese? The missing days are not missing. The platform has adapted to the new circumstances: An indicator (SimpleMovingAverage) has been added to the Strategy. This indicator needs X bars to produce an output: in the example: 15 2000-01-24 is the day on which the 15th bar occurs The backtrader platform assumes that the Strategy has the indicator in place for a good reason: to use it in the decision making process. And it makes no sense to try to make decisions if the indicator is not yet ready and producing values. next will first be called when all indicators have already reached the minimum needed period to produce a value In the example there is a single indicator, but the strategy could have any number of them.
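The warm-up behaviour can be illustrated outside backtrader. A minimal sketch, assuming nothing more than a plain list of closes, of why a 15-period SMA makes next start on the 15th bar:

```python
# Minimal warm-up illustration (plain Python, not backtrader): a 15-period
# simple moving average has no value until 15 closes have been seen, so a
# strategy using it cannot act before the 15th bar.
def sma_first_valid_bar(closes, period):
    """Return (bar_index, sma_value) for the first bar where the SMA exists."""
    for i in range(len(closes)):
        if i + 1 >= period:
            window = closes[i - period + 1:i + 1]
            return i, sum(window) / period
    return None, None  # not enough data

closes = [float(c) for c in range(20, 40)]  # 20 dummy daily closes
bar, value = sma_first_valid_bar(closes, period=15)
print(bar)    # 14 -> the 15th bar (0-based), matching the 2000-01-24 start
print(value)  # 27.0, the average of the first 15 dummy closes
```

With several indicators, the platform waits for the largest of the minimum periods before the first call to next.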
After the execution the output is: Starting Portfolio Value: 1000.00 2000-01-24T00:00:00, Close, 25.55 2000-01-25T00:00:00, Close, 26.61 2000-01-25T00:00:00, BUY CREATE, 26.61 2000-01-26T00:00:00, BUY EXECUTED, Size 10, Price: 26.76, Cost: 267.60, Commission 0.00 2000-01-26T00:00:00, Close, 25.96 2000-01-27T00:00:00, Close, 24.43 2000-01-27T00:00:00, SELL CREATE, 24.43 2000-01-28T00:00:00, SELL EXECUTED, Size 10, Price: 24.28, Cost: 242.80, Commission 0.00 2000-01-28T00:00:00, OPERATION PROFIT, GROSS -24.80, NET -24.80 2000-01-28T00:00:00, Close, 22.34 2000-01-31T00:00:00, Close, 23.55 2000-02-01T00:00:00, Close, 25.46 2000-02-02T00:00:00, Close, 25.61 2000-02-02T00:00:00, BUY CREATE, 25.61 2000-02-03T00:00:00, BUY EXECUTED, Size 10, Price: 26.11, Cost: 261.10, Commission 0.00 ... ... ... 2000-12-20T00:00:00, SELL CREATE, 26.88 2000-12-21T00:00:00, SELL EXECUTED, Size 10, Price: 26.23, Cost: 262.30, Commission 0.00 2000-12-21T00:00:00, OPERATION PROFIT, GROSS -20.60, NET -20.60 2000-12-21T00:00:00, Close, 27.82 2000-12-21T00:00:00, BUY CREATE, 27.82 2000-12-22T00:00:00, BUY EXECUTED, Size 10, Price: 28.65, Cost: 286.50, Commission 0.00 2000-12-22T00:00:00, Close, 30.06 2000-12-26T00:00:00, Close, 29.17 2000-12-27T00:00:00, Close, 28.94 2000-12-28T00:00:00, Close, 29.29 2000-12-29T00:00:00, Close, 27.41 2000-12-29T00:00:00, SELL CREATE, 27.41 Final Portfolio Value: 973.90 In the name of the King!!! A winning system turned into a losing one … and that with no commission. It may well be that simply adding an indicator is not the universal panacea. Note The same logic and data with PyAlgoTrade yields a slightly different result. Looking at the entire printout reveals that some operations are not exactly the same. The culprit is again the usual suspect: rounding. PyAlgoTrade does not round the datafeed values when applying the “adjusted close” adjustment to the data feed values.
The Yahoo Data Feed provided by backtrader rounds the values down to 2 decimals after applying the adjusted close. Upon printing, the values seem the same, but it’s obvious that sometimes that 5th decimal place plays a role. Rounding down to 2 decimals seems more realistic, because Market Exchanges only allow a limited number of decimals per asset (usually 2 decimals for stocks). Note The Yahoo Data Feed (starting with version 1.8.11.99) allows specifying whether rounding has to happen and how many decimals. Visual Inspection: Plotting A printout or log of the actual whereabouts of the system at each bar-instant is good, but humans tend to be visual, and therefore it seems right to offer a view of the same whereabouts as a chart. Note To plot you need to have matplotlib installed. Once again, defaults for plotting are there to assist the platform user. Plotting is, incredibly, a 1-line operation: cerebro.plot() Its place is, naturally, after cerebro.run() has been called. In order to display the automatic plotting capabilities and a couple of easy customizations, the following will be done: A 2nd MovingAverage (Exponential) will be added. The defaults will plot it (just like the 1st) with the data. A 3rd MovingAverage (Weighted) will be added. Customized to plot in its own subplot (even if not sensible). A Stochastic (Slow) will be added. No change to the defaults. A MACD will be added. No change to the defaults. An RSI will be added. No change to the defaults. A MovingAverage (Simple) will be applied to the RSI. No change to the defaults (it will be plotted with the RSI). An AverageTrueRange will be added. Changed defaults to avoid it being plotted.
The entire set of additions to the init method of the Strategy: # Indicators for the plotting show bt.indicators.ExponentialMovingAverage(self.datas[0], period=25) bt.indicators.WeightedMovingAverage(self.datas[0], period=25).subplot = True bt.indicators.StochasticSlow(self.datas[0]) bt.indicators.MACDHisto(self.datas[0]) rsi = bt.indicators.RSI(self.datas[0]) bt.indicators.SmoothedMovingAverage(rsi, period=10) bt.indicators.ATR(self.datas[0]).plot = False Note Even if indicators are not explicitly added to a member variable of the strategy (like self.sma = MovingAverageSimple…), they will autoregister with the strategy and will influence the minimum period for next and will be part of the plotting. In the example only RSI is added to a temporary variable rsi with the sole intention of creating a MovingAverageSmoothed on it. The example now: from __future__ import (absolute_import, division, print_function, unicode_literals) import datetime # For datetime objects import os.path # To manage paths import sys # To find out the script name (in argv[0]) # Import the backtrader platform import backtrader as bt # Create a Strategy class TestStrategy(bt.Strategy): params = ( ('maperiod', 15), ) def log(self, txt, dt=None): ''' Logging function for this strategy''' dt = dt or self.datas[0].datetime.date(0) print('%s, %s' % (dt.isoformat(), txt)) def __init__(self): # Keep a reference to the "close" line in the data[0] dataseries self.dataclose = self.datas[0].close # To keep track of pending orders and buy price/commission self.order = None self.buyprice = None self.buycomm = None # Add a MovingAverageSimple indicator self.sma = bt.indicators.SimpleMovingAverage( self.datas[0], period=self.params.maperiod) # Indicators for the plotting show bt.indicators.ExponentialMovingAverage(self.datas[0], period=25) bt.indicators.WeightedMovingAverage(self.datas[0], period=25, subplot=True) bt.indicators.StochasticSlow(self.datas[0]) bt.indicators.MACDHisto(self.datas[0]) rsi =
bt.indicators.RSI(self.datas[0]) bt.indicators.SmoothedMovingAverage(rsi, period=10) bt.indicators.ATR(self.datas[0], plot=False) def notify_order(self, order): if order.status in [order.Submitted, order.Accepted]: # Buy/Sell order submitted/accepted to/by broker - Nothing to do return # Check if an order has been completed # Attention: broker could reject order if not enough cash if order.status in [order.Completed]: if order.isbuy(): self.log( 'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.buyprice = order.executed.price self.buycomm = order.executed.comm else: # Sell self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.bar_executed = len(self) elif order.status in [order.Canceled, order.Margin, order.Rejected]: self.log('Order Canceled/Margin/Rejected') # Write down: no pending order self.order = None def notify_trade(self, trade): if not trade.isclosed: return self.log('OPERATION PROFIT, GROSS %.2f, NET %.2f' % (trade.pnl, trade.pnlcomm)) def next(self): # Simply log the closing price of the series from the reference self.log('Close, %.2f' % self.dataclose[0]) # Check if an order is pending ... if yes, we cannot send a 2nd one if self.order: return # Check if we are in the market if not self.position: # Not yet ... we MIGHT BUY if ... if self.dataclose[0] > self.sma[0]: # BUY, BUY, BUY!!! (with all possible default parameters) self.log('BUY CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.buy() else: if self.dataclose[0] < self.sma[0]: # SELL, SELL, SELL!!! 
(with all possible default parameters) self.log('SELL CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.sell() if __name__ == '__main__': # Create a cerebro entity cerebro = bt.Cerebro() # Add a strategy cerebro.addstrategy(TestStrategy) # Datas are in a subfolder of the samples. Need to find where the script is # because it could have been called from anywhere modpath = os.path.dirname(os.path.abspath(sys.argv[0])) datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt') # Create a Data Feed data = bt.feeds.YahooFinanceCSVData( dataname=datapath, # Do not pass values before this date fromdate=datetime.datetime(2000, 1, 1), # Do not pass values before this date todate=datetime.datetime(2000, 12, 31), # Do not pass values after this date reverse=False) # Add the Data Feed to Cerebro cerebro.adddata(data) # Set our desired cash start cerebro.broker.setcash(1000.0) # Add a FixedSize sizer according to the stake cerebro.addsizer(bt.sizers.FixedSize, stake=10) # Set the commission cerebro.broker.setcommission(commission=0.0) # Print out the starting conditions print('Starting Portfolio Value: %.2f' % cerebro.broker.getvalue()) # Run over everything cerebro.run() # Print out the final result print('Final Portfolio Value: %.2f' % cerebro.broker.getvalue()) # Plot the result cerebro.plot() After the execution the output is: Starting Portfolio Value: 1000.00 2000-02-18T00:00:00, Close, 27.61 2000-02-22T00:00:00, Close, 27.97 2000-02-22T00:00:00, BUY CREATE, 27.97 2000-02-23T00:00:00, BUY EXECUTED, Size 10, Price: 28.38, Cost: 283.80, Commission 0.00 2000-02-23T00:00:00, Close, 29.73 ... ... ... 
2000-12-21T00:00:00, BUY CREATE, 27.82 2000-12-22T00:00:00, BUY EXECUTED, Size 10, Price: 28.65, Cost: 286.50, Commission 0.00 2000-12-22T00:00:00, Close, 30.06 2000-12-26T00:00:00, Close, 29.17 2000-12-27T00:00:00, Close, 28.94 2000-12-28T00:00:00, Close, 29.29 2000-12-29T00:00:00, Close, 27.41 2000-12-29T00:00:00, SELL CREATE, 27.41 Final Portfolio Value: 981.00 The final result has changed even if the logic hasn’t. This is true, but the logic has not been applied to the same number of bars. Note As explained before, the platform will first call next when all indicators are ready to produce a value. In this plotting example (very clear in the chart) the MACD is the last indicator to be fully ready (all 3 lines producing an output). The 1st BUY order is no longer scheduled during Jan 2000 but close to the end of Feb 2000. The chart: [image] Let’s Optimize Many trading books say each market and each traded stock (or commodity or ...) have different rhythms. That there is no such thing as a one-size-fits-all. Before the plotting sample, when the strategy started using an indicator, the period default value was 15 bars. It’s a strategy parameter and this can be used in an optimization to change the value of the parameter and see which one better fits the market. Note There is plenty of literature about Optimization and associated pros and cons. But the advice will always point in the same direction: do not overoptimize. If a trading idea is not sound, optimizing may end up producing a positive result which is only valid for the backtested dataset. The sample is modified to optimize the period of the Simple Moving Average.
For the sake of clarity, any output with regard to Buy/Sell orders has been removed. The example now: from __future__ import (absolute_import, division, print_function, unicode_literals) import datetime # For datetime objects import os.path # To manage paths import sys # To find out the script name (in argv[0]) # Import the backtrader platform import backtrader as bt # Create a Strategy class TestStrategy(bt.Strategy): params = ( ('maperiod', 15), ('printlog', False), ) def log(self, txt, dt=None, doprint=False): ''' Logging function for this strategy''' if self.params.printlog or doprint: dt = dt or self.datas[0].datetime.date(0) print('%s, %s' % (dt.isoformat(), txt)) def __init__(self): # Keep a reference to the "close" line in the data[0] dataseries self.dataclose = self.datas[0].close # To keep track of pending orders and buy price/commission self.order = None self.buyprice = None self.buycomm = None # Add a MovingAverageSimple indicator self.sma = bt.indicators.SimpleMovingAverage( self.datas[0], period=self.params.maperiod) def notify_order(self, order): if order.status in [order.Submitted, order.Accepted]: # Buy/Sell order submitted/accepted to/by broker - Nothing to do return # Check if an order has been completed # Attention: broker could reject order if not enough cash if order.status in [order.Completed]: if order.isbuy(): self.log( 'BUY EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.buyprice = order.executed.price self.buycomm = order.executed.comm else: # Sell self.log('SELL EXECUTED, Price: %.2f, Cost: %.2f, Comm %.2f' % (order.executed.price, order.executed.value, order.executed.comm)) self.bar_executed = len(self) elif order.status in [order.Canceled, order.Margin, order.Rejected]: self.log('Order Canceled/Margin/Rejected') # Write down: no pending order self.order = None def notify_trade(self, trade): if not trade.isclosed: return self.log('OPERATION PROFIT, GROSS %.2f, NET
%.2f' % (trade.pnl, trade.pnlcomm)) def next(self): # Simply log the closing price of the series from the reference self.log('Close, %.2f' % self.dataclose[0]) # Check if an order is pending ... if yes, we cannot send a 2nd one if self.order: return # Check if we are in the market if not self.position: # Not yet ... we MIGHT BUY if ... if self.dataclose[0] > self.sma[0]: # BUY, BUY, BUY!!! (with all possible default parameters) self.log('BUY CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.buy() else: if self.dataclose[0] < self.sma[0]: # SELL, SELL, SELL!!! (with all possible default parameters) self.log('SELL CREATE, %.2f' % self.dataclose[0]) # Keep track of the created order to avoid a 2nd order self.order = self.sell() def stop(self): self.log('(MA Period %2d) Ending Value %.2f' % (self.params.maperiod, self.broker.getvalue()), doprint=True) if __name__ == '__main__': # Create a cerebro entity cerebro = bt.Cerebro() # Add a strategy strats = cerebro.optstrategy( TestStrategy, maperiod=range(10, 31)) # Datas are in a subfolder of the samples. 
Need to find where the script is # because it could have been called from anywhere modpath = os.path.dirname(os.path.abspath(sys.argv[0])) datapath = os.path.join(modpath, '../../datas/orcl-1995-2014.txt') # Create a Data Feed data = bt.feeds.YahooFinanceCSVData( dataname=datapath, # Do not pass values before this date fromdate=datetime.datetime(2000, 1, 1), # Do not pass values after this date todate=datetime.datetime(2000, 12, 31), reverse=False) # Add the Data Feed to Cerebro cerebro.adddata(data) # Set our desired cash start cerebro.broker.setcash(1000.0) # Add a FixedSize sizer according to the stake cerebro.addsizer(bt.sizers.FixedSize, stake=10) # Set the commission cerebro.broker.setcommission(commission=0.0) # Run over everything cerebro.run(maxcpus=1) Instead of calling addstrategy to add a strategy class to Cerebro, the call is made to optstrategy. And instead of a single value, a range of values is passed. One of the “Strategy” hooks is added, the stop method, which will be called when the data has been exhausted and backtesting is over. It’s used to print the final net value of the portfolio in the broker (it was done in Cerebro previously). The system will execute the strategy for each value of the range.
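How a range of values expands into individual runs can be sketched with a hypothetical helper (expand_opt_params is illustrative only, not backtrader internals):

```python
# Sketch of optstrategy-style parameter expansion: every keyword receives an
# iterable of values and each combination becomes one backtest run.
from itertools import product

def expand_opt_params(**param_ranges):
    """Yield one params dict per combination of the given value ranges."""
    names = list(param_ranges)
    for combo in product(*(param_ranges[n] for n in names)):
        yield dict(zip(names, combo))

runs = list(expand_opt_params(maperiod=range(10, 31)))
print(len(runs))  # 21 runs: one per period from 10 to 30 inclusive
print(runs[0])    # {'maperiod': 10}
```

Note that range(10, 31) stops at 30, so the optimization covers 21 periods; passing several ranges would multiply the number of runs accordingly.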
The following will be output: 2000-12-29, (MA Period 10) Ending Value 880.30 2000-12-29, (MA Period 11) Ending Value 880.00 2000-12-29, (MA Period 12) Ending Value 830.30 2000-12-29, (MA Period 13) Ending Value 893.90 2000-12-29, (MA Period 14) Ending Value 896.90 2000-12-29, (MA Period 15) Ending Value 973.90 2000-12-29, (MA Period 16) Ending Value 959.40 2000-12-29, (MA Period 17) Ending Value 949.80 2000-12-29, (MA Period 18) Ending Value 1011.90 2000-12-29, (MA Period 19) Ending Value 1041.90 2000-12-29, (MA Period 20) Ending Value 1078.00 2000-12-29, (MA Period 21) Ending Value 1058.80 2000-12-29, (MA Period 22) Ending Value 1061.50 2000-12-29, (MA Period 23) Ending Value 1023.00 2000-12-29, (MA Period 24) Ending Value 1020.10 2000-12-29, (MA Period 25) Ending Value 1013.30 2000-12-29, (MA Period 26) Ending Value 998.30 2000-12-29, (MA Period 27) Ending Value 982.20 2000-12-29, (MA Period 28) Ending Value 975.70 2000-12-29, (MA Period 29) Ending Value 983.30 2000-12-29, (MA Period 30) Ending Value 979.80 Results: For periods below 18 the strategy (commissionless) loses money. For periods between 18 and 25 (both included) the strategy makes money. From 26 onwards money is lost again (period 26 ends at 998.30, just below the starting cash). And the winning period for this strategy and the given data set is: 20 bars, which wins 78.00 units over 1000 $/€ (7.8%). Note The extra indicators from the plotting example have been removed and the start of operations is only influenced by the Simple Moving Average which is being optimized. Hence the slightly different results for period 15. Conclusion The incremental samples have shown how to go from a barebones script to a fully working trading system which even plots the results and can be optimized.
A lot more can be done to try to improve the chances of winning: Self-defined indicators: creating an indicator is easy (and even plotting it is easy). Sizers: money management is for many the key to success. Order types (limit, stop, stoplimit). Some others. To ensure all the above items can be fully utilized, the documentation provides an insight into them (and other topics). Look in the table of contents and keep on reading … and developing. Best of luck
""" Outlier Detection Toolbox ========================= This is a single-file distribution (for ease of preview) of a production-grade outlier/anomaly detection toolbox intended to be split into a small package: outlier_detection/ ├── __init__.py ├── utils.py ├── statistical.py ├── distance_density.py ├── model_based.py ├── deep_learning.py ├── ensemble.py ├── visualization.py └── cli.py --- NOTE --- This code block contains *all* modules concatenated (with file headers) so you can preview and copy each file out into separate .py files. When you save them as separate files the package will work as expected. Design goals (what you asked for): - Detailed, well-documented functions (purpose, math, applicability, edge-cases) - Robust handling of NaNs, constant columns, categorical data - Functions return structured metadata + masks + scores so you can inspect - Utilities for ensemble combining methods and producing a readable report - Optional deep learning methods (AutoEncoder/VAE) with clear dependency instructions and graceful error messages if libraries are missing. Dependencies (recommended): pip install numpy pandas scipy scikit-learn matplotlib joblib tensorflow>=2.0 If you prefer PyTorch for deep models you can adapt deep_learning.py accordingly. 
""" # --------------------------- # File: outlier_detection/__init__.py # --------------------------- __version__ = "0.1.0" # make it easy to import core helpers from typing import Dict from .utils import ensure_dataframe, OutlierResult, summarize_results, recommend_methods from .statistical import z_score_method, modified_z_score, iqr_method, grubbs_test from .distance_density import lof_method, mahalanobis_method, dbscan_method, knn_distance_method from .model_based import ( isolation_forest_method, one_class_svm_method, pca_reconstruction_error, gmm_method, elliptic_envelope_method, ) # deep_learning module is optional (heavy dependency) try: from .deep_learning import autoencoder_method, vae_method except Exception: # graceful: user may not have TF installed; import will raise at use time autoencoder_method = None vae_method = None from .ensemble import ensemble_methods, aggregate_scores from .visualization import plot_boxplot, plot_pair_scatter __all__ = [ "__version__", "ensure_dataframe", "OutlierResult", "summarize_results", "recommend_methods", "z_score_method", "modified_z_score", "iqr_method", "grubbs_test", "lof_method", "mahalanobis_method", "dbscan_method", "knn_distance_method", "isolation_forest_method", "one_class_svm_method", "pca_reconstruction_error", "gmm_method", "elliptic_envelope_method", "autoencoder_method", "vae_method", "ensemble_methods", "aggregate_scores", "plot_boxplot", "plot_pair_scatter", ] # --------------------------- # File: outlier_detection/utils.py # --------------------------- """ Utilities for the outlier detection package. 
Key responsibilities: - Input validation and type normalization - Handling numeric / categorical separation - Standardization and robust scaling helpers - A consistent result object shape used by all detectors """ from typing import Dict, Any, Tuple, Optional, List import numpy as np import pandas as pd import logging logger = logging.getLogger(__name__) # A simple, documented result schema for detector functions. # Each detector returns a dict with these keys (guaranteed): # - 'mask': pd.Series[bool] same index as input rows; True means OUTLIER # - 'score': pd.Series or pd.DataFrame numeric score (bigger usually means more anomalous) # - 'method': short string # - 'params': dict of parameters used # - 'explanation': short textual note about interpretation OutlierResult = Dict[str, Any] def ensure_dataframe(X) -> pd.DataFrame: """ Convert input into a pandas DataFrame with a stable integer index. Accepts: pd.DataFrame, np.ndarray, list-of-lists, pd.Series. Returns DataFrame with numeric column names if necessary. """ if isinstance(X, pd.DataFrame): df = X.copy() elif isinstance(X, pd.Series): df = X.to_frame() else: # try to coerce df = pd.DataFrame(X) # if no index or non-unique, reset if df.index is None or not df.index.is_unique: df = df.reset_index(drop=True) # name numeric columns if unnamed df.columns = [str(c) for c in df.columns] return df def numeric_only(df: pd.DataFrame, return_cols: bool = False) -> pd.DataFrame: """ Select numeric columns and warn if non-numeric columns are dropped. If no numeric columns found raises ValueError. """ df = ensure_dataframe(df) numeric_df = df.select_dtypes(include=["number"]).copy() non_numeric = [c for c in df.columns if c not in numeric_df.columns] if non_numeric: logger.debug("Dropping non-numeric columns for numeric-only detectors: %s", non_numeric) if numeric_df.shape[1] == 0: raise ValueError("No numeric columns available for numeric detectors. 
Consider encoding categoricals.") if return_cols: return numeric_df, list(numeric_df.columns) return numeric_df def handle_missing(df: pd.DataFrame, strategy: str = "drop", fill_value: Optional[float] = None) -> pd.DataFrame: """ Handle missing values in data before passing to detectors. Parameters ---------- df : DataFrame strategy : {'drop', 'mean', 'median', 'zero', 'constant', 'keep'} - 'drop' : drop rows with any NaN (useful when most values are present) - 'mean' : fill numeric columns with mean - 'median' : fill numeric with median - 'zero' : fill with 0 - 'constant' : fill with supplied fill_value - 'keep' : keep NaNs (many detectors can handle NaN rows implicitly) fill_value : numeric (used when strategy=='constant') Returns ------- DataFrame cleaned according to strategy. Original index preserved. Notes ----- - Some detectors (LOF, IsolationForest) do NOT accept NaNs; choose strategy accordingly. """ df = df.copy() if strategy == "drop": return df.dropna(axis=0, how="any") elif strategy == "mean": return df.fillna(df.mean()) elif strategy == "median": return df.fillna(df.median()) elif strategy == "zero": return df.fillna(0) elif strategy == "constant": if fill_value is None: raise ValueError("fill_value must be provided for strategy='constant'") return df.fillna(fill_value) elif strategy == "keep": return df else: raise ValueError(f"Unknown missing value strategy: {strategy}") def robust_scale(df: pd.DataFrame) -> pd.DataFrame: """ Scale numeric columns using median and IQR (robust to outliers). Returns a DataFrame of same shape with scaled values. """ df = numeric_only(df) med = df.median() q1 = df.quantile(0.25) q3 = df.quantile(0.75) iqr = q3 - q1 # avoid division by zero iqr_replaced = iqr.replace(0, 1.0) return (df - med) / iqr_replaced def create_result(mask: pd.Series, score: pd.Series, method: str, params: Dict[str, Any], explanation: str) -> OutlierResult: """ Wrap mask + score into the standard result dict. 
""" # ensure index alignment if not mask.index.equals(score.index): # try to reindex score = score.reindex(mask.index) return { "mask": mask.astype(bool), "score": score, "method": method, "params": params, "explanation": explanation, } def summarize_results(results: Dict[str, OutlierResult]) -> pd.DataFrame: """ Given a dict of results keyed by method name, return a single DataFrame where each column is that method's boolean flag and another column is the score (if numeric). Also returns a short per-row summary like how many detectors flagged the row. """ # Collect masks and scores masks = {} scores = {} for k, r in results.items(): masks[f"{k}_flag"] = r["mask"].astype(int) # flatten score: if DataFrame use mean across columns sc = r["score"] if isinstance(sc, pd.DataFrame): sc = sc.mean(axis=1) scores[f"{k}_score"] = sc masks_df = pd.DataFrame(masks) scores_df = pd.DataFrame(scores) combined = pd.concat([masks_df, scores_df], axis=1) combined.index = next(iter(results.values()))["mask"].index combined["n_flags"] = masks_df.sum(axis=1) combined["any_flag"] = combined["n_flags"] > 0 return combined def recommend_methods(X: pd.DataFrame) -> List[str]: """ Heuristic recommender: returns a short list of methods to try depending on data shape. 
    Rules (simple heuristics):
    - single numeric column: ['iqr', 'modified_z']
    - low-dimensional (n_features <= 10) and numeric: ['mahalanobis', 'lof', 'isolation_forest']
    - high-dimensional (n_features > 10): ['isolation_forest', 'pca', 'autoencoder']
    """
    df = ensure_dataframe(X)
    n_features = df.select_dtypes(include=["number"]).shape[1]
    if n_features == 0:
        raise ValueError("No numeric features to recommend methods for")
    if n_features == 1:
        return ["iqr", "modified_z"]
    elif n_features <= 10:
        return ["mahalanobis", "lof", "isolation_forest"]
    else:
        return ["isolation_forest", "pca", "autoencoder"]


# ---------------------------
# File: outlier_detection/statistical.py
# ---------------------------
"""
Statistical / univariate outlier detectors.
Each function focuses on single-dimension input (pd.Series) or will operate
column-wise if given a DataFrame (then returns DataFrame of scores / masks).
"""
from typing import Union

import numpy as np
import pandas as pd
from scipy import stats

from .utils import create_result, numeric_only, OutlierResult


def _as_series(x: Union[pd.Series, pd.DataFrame], col: str = None) -> pd.Series:
    if isinstance(x, pd.DataFrame):
        if col is None:
            raise ValueError("If passing DataFrame, must pass column name")
        return x[col]
    return x


def z_score_method(x: Union[pd.Series, pd.DataFrame], threshold: float = 3.0) -> OutlierResult:
    """
    Z-Score method (univariate)

    Math: z = (x - mean) / std
    Flag where |z| > threshold.

    Applicability: single numeric column, approximately normal distribution.
    Not robust to heavy-tailed distributions.

    Returns OutlierResult with score = |z| (higher => more anomalous).
""" if isinstance(x, pd.DataFrame): # apply per-column and return a DataFrame score masks = pd.DataFrame(index=x.index) scores = pd.DataFrame(index=x.index) for c in x.columns: res = z_score_method(x[c], threshold=threshold) masks[c] = res["mask"].astype(int) scores[c] = res["score"] # Derive a combined mask: any column flagged mask_any = masks.sum(axis=1) > 0 combined_score = scores.mean(axis=1) return create_result(mask_any, combined_score, "z_score_dataframe", {"threshold": threshold}, "Applied z-score per-column and combined by mean score and any-flag") s = x.dropna() if s.shape[0] == 0: mask = pd.Series([False]*len(x), index=x.index) score = pd.Series([0.0]*len(x), index=x.index) return create_result(mask, score, "z_score", {"threshold": threshold}, "Empty or all-NaN series") mu = s.mean() sigma = s.std(ddof=0) if sigma == 0: score = pd.Series(0.0, index=x.index) mask = pd.Series(False, index=x.index) explanation = "Zero variance: no z-score possible" return create_result(mask, score, "z_score", {"threshold": threshold}, explanation) z = (x - mu) / sigma absz = z.abs() mask = absz > threshold score = absz.fillna(0.0) explanation = f"z-score with mean={mu:.4g}, std={sigma:.4g}; flag |z|>{threshold}" return create_result(mask, score, "z_score", {"threshold": threshold}, explanation) def modified_z_score(x: Union[pd.Series, pd.DataFrame], threshold: float = 3.5) -> OutlierResult: """ Modified Z-score using median and MAD (robust to extreme values). 
    Formula: M_i = 0.6745 * (x_i - median) / MAD
    Where MAD = median(|x_i - median|)
    Recommended threshold: 3.5 (common in literature)
    """
    if isinstance(x, pd.DataFrame):
        masks = pd.DataFrame(index=x.index)
        scores = pd.DataFrame(index=x.index)
        for c in x.columns:
            res = modified_z_score(x[c], threshold=threshold)
            masks[c] = res["mask"].astype(int)
            scores[c] = res["score"]
        mask_any = masks.sum(axis=1) > 0
        combined_score = scores.mean(axis=1)
        return create_result(mask_any, combined_score, "modified_z_dataframe",
                             {"threshold": threshold},
                             "Applied modified z per-column and combined")
    s = x.dropna()
    if len(s) == 0:
        return create_result(pd.Series(False, index=x.index),
                             pd.Series(0.0, index=x.index),
                             "modified_z", {"threshold": threshold}, "empty")
    med = np.median(s)
    mad = np.median(np.abs(s - med))
    if mad == 0:
        # all equal or too small
        score = pd.Series(0.0, index=x.index)
        mask = pd.Series(False, index=x.index)
        return create_result(mask, score, "modified_z", {"threshold": threshold},
                             "mad==0: no variation")
    M = 0.6745 * (x - med) / mad
    score = M.abs().fillna(0.0)
    mask = score > threshold
    return create_result(mask, score, "modified_z",
                         {"threshold": threshold, "median": med, "mad": mad},
                         "Robust modified z-score; higher => more anomalous")


def iqr_method(x: Union[pd.Series, pd.DataFrame], k: float = 1.5) -> OutlierResult:
    """
    IQR (boxplot) method. Flags points outside [Q1 - k*IQR, Q3 + k*IQR].
    k=1.5 is common; use larger k for fewer false positives.
""" if isinstance(x, pd.DataFrame): masks = pd.DataFrame(index=x.index) scores = pd.DataFrame(index=x.index) for c in x.columns: res = iqr_method(x[c], k=k) masks[c] = res["mask"].astype(int) scores[c] = res["score"] mask_any = masks.sum(axis=1) > 0 combined_score = scores.mean(axis=1) return create_result(mask_any, combined_score, "iqr_dataframe", {"k": k}, "Applied IQR per column") s = x.dropna() if s.shape[0] == 0: return create_result(pd.Series(False, index=x.index), pd.Series(0.0, index=x.index), "iqr", {"k": k}, "empty") q1 = np.percentile(s, 25) q3 = np.percentile(s, 75) iqr = q3 - q1 lower = q1 - k * iqr upper = q3 + k * iqr mask = (x < lower) | (x > upper) # score: distance from nearest fence normalized by iqr (if iqr==0 use abs distance) if iqr == 0: score = (x - q1).abs().fillna(0.0) else: score = pd.Series(0.0, index=x.index) score[x < lower] = ((lower - x[x < lower]) / (iqr + 1e-12)) score[x > upper] = ((x[x > upper] - upper) / (iqr + 1e-12)) return create_result(mask.fillna(False), score.fillna(0.0), "iqr", {"k": k, "q1": q1, "q3": q3}, f"IQR fences [{lower:.4g}, {upper:.4g}]") def grubbs_test(x: Union[pd.Series, pd.DataFrame], alpha: float = 0.05) -> OutlierResult: """ Grubbs' test for a single outlier (requires approx normality). This test is intended to *detect one outlier at a time*. Use iteratively (recompute after removing detected outlier) if you expect multiple outliers, but be careful with multiplicity adjustments. Returns mask with at most one True (the most extreme point) unless alpha is very large. """ # For simplicity operate only on a single series. 
    # If DataFrame provided, run per-column and combine (like other funcs)
    if isinstance(x, pd.DataFrame):
        masks = pd.DataFrame(index=x.index)
        scores = pd.DataFrame(index=x.index)
        for c in x.columns:
            res = grubbs_test(x[c], alpha=alpha)
            masks[c] = res["mask"].astype(int)
            scores[c] = res["score"]
        mask_any = masks.sum(axis=1) > 0
        combined_score = scores.mean(axis=1)
        return create_result(mask_any, combined_score, "grubbs_dataframe", {"alpha": alpha},
                             "Applied Grubbs per column")
    from math import sqrt
    s = x.dropna()
    n = len(s)
    if n < 3:
        return create_result(pd.Series(False, index=x.index),
                             pd.Series(0.0, index=x.index),
                             "grubbs", {"alpha": alpha}, "n<3: cannot run")
    mean = s.mean()
    std = s.std(ddof=0)
    if std == 0:
        return create_result(pd.Series(False, index=x.index),
                             pd.Series(0.0, index=x.index),
                             "grubbs", {"alpha": alpha}, "zero std")
    # compute G statistic for max dev
    deviations = (s - mean).abs()
    max_idx = deviations.idxmax()
    G = deviations.loc[max_idx] / std
    # critical value from t-distribution
    t_crit = stats.t.ppf(1 - alpha / (2 * n), n - 2)
    G_crit = ((n - 1) / sqrt(n)) * (t_crit / sqrt(n - 2 + t_crit ** 2))
    mask = pd.Series(False, index=x.index)
    mask.loc[max_idx] = G > G_crit
    score = pd.Series(0.0, index=x.index)
    score.loc[max_idx] = float(G)
    explanation = f"G={G:.4g}, Gcrit={G_crit:.4g}, alpha={alpha}"
    return create_result(mask, score, "grubbs", {"alpha": alpha, "G": G, "Gcrit": G_crit}, explanation)


# ---------------------------
# File: outlier_detection/distance_density.py
# ---------------------------
"""
Distance and density based detectors (multivariate-capable).
Functions generally accept a numeric DataFrame X and return OutlierResult.
""" from sklearn.neighbors import LocalOutlierFactor, NearestNeighbors from sklearn.cluster import DBSCAN from sklearn.covariance import EmpiricalCovariance from .utils import ensure_dataframe, create_result, numeric_only def lof_method(X, n_neighbors: int = 20, contamination: float = 0.05) -> OutlierResult: """ Local Outlier Factor (LOF). Returns score = -lof. LOF API returns negative_outlier_factor_. We negate so higher score => more anomalous. Applicability: medium-dimensional data, clusters of varying density. Beware: LOF does not provide a predictable probabilistic threshold. """ X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < 2: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "lof", {"n_neighbors": n_neighbors}, "too few samples") lof = LocalOutlierFactor(n_neighbors=min(n_neighbors, max(1, Xnum.shape[0]-1)), contamination=contamination) y = lof.fit_predict(Xnum) negative_factor = lof.negative_outlier_factor_ # higher -> more anomalous score = (-negative_factor) score = pd.Series(score, index=Xnum.index) mask = pd.Series(y == -1, index=Xnum.index) return create_result(mask, score, "lof", {"n_neighbors": n_neighbors, "contamination": contamination}, "LOF: higher score more anomalous") def knn_distance_method(X, k: int = 5) -> OutlierResult: """ k-NN distance based scoring: compute distance to k-th nearest neighbor. Points with large k-distance are candidate outliers. Returns score = k-distance (bigger => more anomalous). """ X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < k + 1: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "knn_distance", {"k": k}, "too few samples") nbrs = NearestNeighbors(n_neighbors=k + 1).fit(Xnum) distances, _ = nbrs.kneighbors(Xnum) # distances[:, 0] is zero (self). 
    # take the k-th neighbor
    kdist = distances[:, k]
    score = pd.Series(kdist, index=Xnum.index)
    # threshold: e.g., mean + 2*std
    thr = score.mean() + 2 * score.std()
    mask = score > thr
    return create_result(mask, score, "knn_distance", {"k": k, "threshold": thr},
                         "k-distance method")


def mahalanobis_method(X, threshold_p: float = 0.01) -> OutlierResult:
    """
    Mahalanobis distance based detection. Computes D^2 for each point.
    One can threshold by the chi-square quantile with df=n_features:
    P(D^2 > thresh) = threshold_p.
    We return score = D^2.

    Applicability: data approximately elliptical (multivariate normal-ish).
    """
    X = ensure_dataframe(X)
    Xnum = numeric_only(X)
    n, d = Xnum.shape
    if n <= d:
        # covariance ill-conditioned; apply shrinkage or PCA beforehand
        explanation = "n <= n_features: covariance may be singular, consider PCA or regularization"
    else:
        explanation = ""
    cov = EmpiricalCovariance().fit(Xnum)
    mahal = cov.mahalanobis(Xnum)
    score = pd.Series(mahal, index=Xnum.index)
    # default threshold: chi2 quantile
    from scipy.stats import chi2
    thr = chi2.ppf(1 - threshold_p, df=d) if d > 0 else np.inf
    mask = score > thr
    return create_result(mask, score, "mahalanobis",
                         {"threshold_p": threshold_p, "chi2_thr": float(thr)}, explanation)


def dbscan_method(X, eps: float = 0.5, min_samples: int = 5) -> OutlierResult:
    """
    DBSCAN clustering: points labeled -1 are considered noise -> outliers.
    Applicability: non-spherical clusters, variable density; choose eps carefully.
""" X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < min_samples: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "dbscan", {"eps": eps, "min_samples": min_samples}, "too few samples") db = DBSCAN(eps=eps, min_samples=min_samples).fit(Xnum) labels = db.labels_ mask = pd.Series(labels == -1, index=Xnum.index) # score: negative of cluster size (noise points get score 1) # To keep simple: noise -> 1, else 0 score = pd.Series((labels == -1).astype(float), index=Xnum.index) return create_result(mask, score, "dbscan", {"eps": eps, "min_samples": min_samples}, "DBSCAN noise points flagged") # --------------------------- # File: outlier_detection/model_based.py # --------------------------- """ Model-based detectors: tree ensembles, SVM boundary, PCA reconstruction, GMM These functions are intended for multivariate numeric data. """ from sklearn.ensemble import IsolationForest from sklearn.svm import OneClassSVM from sklearn.decomposition import PCA from sklearn.mixture import GaussianMixture from sklearn.covariance import EllipticEnvelope from .utils import ensure_dataframe, numeric_only, create_result def isolation_forest_method(X, contamination: float = 0.05, random_state: int = 42) -> OutlierResult: """ Isolation Forest Returns mask and anomaly score (higher => more anomalous). Good general-purpose method for medium-to-high dimensional data. 
""" X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < 2: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "isolation_forest", {"contamination": contamination}, "too few samples") iso = IsolationForest(contamination=contamination, random_state=random_state) iso.fit(Xnum) pred = iso.predict(Xnum) # decision_function: higher -> more normal, so we invert raw_score = -iso.decision_function(Xnum) score = pd.Series(raw_score, index=Xnum.index) mask = pd.Series(pred == -1, index=Xnum.index) return create_result(mask, score, "isolation_forest", {"contamination": contamination}, "IsolationForest: inverted decision function as score") def one_class_svm_method(X, kernel: str = "rbf", nu: float = 0.05, gamma: str = "scale") -> OutlierResult: """ One-Class SVM for boundary-based anomaly detection. Carefully tune nu and gamma; not robust to large datasets without subsampling. """ X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < 5: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "one_class_svm", {"nu": nu}, "too few samples") ocsvm = OneClassSVM(kernel=kernel, nu=nu, gamma=gamma) ocsvm.fit(Xnum) pred = ocsvm.predict(Xnum) # decision_function: positive => inside boundary (normal); invert raw_score = -ocsvm.decision_function(Xnum) score = pd.Series(raw_score, index=Xnum.index) mask = pd.Series(pred == -1, index=Xnum.index) return create_result(mask, score, "one_class_svm", {"nu": nu, "kernel": kernel}, "OneClassSVM: invert decision_function for anomaly score") def pca_reconstruction_error(X, n_components: int = None, explained_variance: float = None, threshold: float = None) -> OutlierResult: """ PCA-based reconstruction error. If n_components not set, choose the minimum components to reach explained_variance (if provided). Otherwise uses min(n_features, 2). Score: squared reconstruction error per sample. Default threshold: mean+3*std. 
""" X = ensure_dataframe(X) Xnum = numeric_only(X) n, d = Xnum.shape if n == 0 or d == 0: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "pca_recon", {}, "empty data") if n_components is None: if explained_variance is not None: temp_pca = PCA(n_components=min(n, d)) temp_pca.fit(Xnum) cum = np.cumsum(temp_pca.explained_variance_ratio_) n_components = int(np.searchsorted(cum, explained_variance) + 1) n_components = max(1, n_components) else: n_components = min(2, d) pca = PCA(n_components=n_components) proj = pca.fit_transform(Xnum) recon = pca.inverse_transform(proj) errors = ((Xnum - recon) ** 2).sum(axis=1) score = pd.Series(errors, index=Xnum.index) if threshold is None: threshold = score.mean() + 3 * score.std() mask = score > threshold return create_result(mask, score, "pca_recon", {"n_components": n_components, "threshold": float(threshold)}, "PCA reconstruction error") def gmm_method(X, n_components: int = 2, contamination: float = 0.05) -> OutlierResult: """ Gaussian Mixture Model based anomaly score (log-likelihood). Score: negative log-likelihood (bigger => less likely => more anomalous). Threshold: empirical quantile of scores. """ X = ensure_dataframe(X) Xnum = numeric_only(X) if Xnum.shape[0] < n_components: return create_result(pd.Series(False, index=X.index), pd.Series(0.0, index=X.index), "gmm", {}, "too few samples") gmm = GaussianMixture(n_components=n_components) gmm.fit(Xnum) logprob = gmm.score_samples(Xnum) score = pd.Series(-logprob, index=Xnum.index) thr = score.quantile(1 - contamination) mask = score > thr return create_result(mask, score, {"n_components": n_components, "threshold": float(thr)}, "gmm", "GMM negative log-likelihood") def elliptic_envelope_method(X, contamination: float = 0.05) -> OutlierResult: """ EllipticEnvelope fits a robust covariance (assumes data come from a Gaussian-like ellipse). Flags outliers outside the ellipse. 
""" X = ensure_dataframe(X) Xnum = numeric_only(X) ee = EllipticEnvelope(contamination=contamination) ee.fit(Xnum) pred = ee.predict(Xnum) # decision_function: larger -> more normal; invert raw_score = -ee.decision_function(Xnum) score = pd.Series(raw_score, index=Xnum.index) mask = pd.Series(pred == -1, index=Xnum.index) return create_result(mask, score, "elliptic_envelope", {"contamination": contamination}, "EllipticEnvelope") # --------------------------- # File: outlier_detection/deep_learning.py # --------------------------- """ Deep learning based detectors (AutoEncoder, VAE). These require TensorFlow/Keras installed. If not present, importing this module will raise an informative ImportError. Design: a training function accepts X (numpy or DataFrame) and returns a callable `score_fn(X_new) -> pd.Series` plus a threshold selection helper. """ from typing import Callable import numpy as np import pandas as pd # lazy import to avoid hard TF dependency if user doesn't need it try: import tensorflow as tf from tensorflow.keras import layers, models, backend as K except Exception as e: raise ImportError("TensorFlow / Keras is required for deep_learning module. Install with `pip install tensorflow`. 
Error: " + str(e)) from .utils import ensure_dataframe, create_result def _build_autoencoder(input_dim: int, latent_dim: int = 8, hidden_units=(64, 32)) -> models.Model: inp = layers.Input(shape=(input_dim,)) x = inp for h in hidden_units: x = layers.Dense(h, activation='relu')(x) z = layers.Dense(latent_dim, activation='relu', name='latent')(x) x = z for h in reversed(hidden_units): x = layers.Dense(h, activation='relu')(x) out = layers.Dense(input_dim, activation='linear')(x) ae = models.Model(inp, out) return ae def autoencoder_method(X, latent_dim: int = 8, hidden_units=(128, 64), epochs: int = 50, batch_size: int = 32, validation_split: float = 0.1, threshold_method: str = 'quantile', threshold_val: float = 0.99, verbose: int = 0) -> OutlierResult: """ Train an AutoEncoder on X and compute reconstruction error as anomaly score. Parameters ---------- X : DataFrame or numpy array (numeric) threshold_method : 'quantile' or 'mean_std' threshold_val : if quantile -> e.g. 0.99 means top 1% flagged; if mean_std -> number of stds Returns ------- OutlierResult where score = reconstruction error and mask = score > threshold Notes ----- - This trains on the entire provided X. For actual anomaly detection, it's common to train the autoencoder only on "normal" data. If you have labels, pass only normal subset for training. - Requires careful scaling of inputs before training (robust_scale recommended). 
""" Xdf = ensure_dataframe(X) Xnum = Xdf.select_dtypes(include=['number']).fillna(0.0) input_dim = Xnum.shape[1] if input_dim == 0: return create_result(pd.Series(False, index=Xdf.index), pd.Series(0.0, index=Xdf.index), "autoencoder", {}, "no numeric columns") # convert to numpy arr = Xnum.values.astype(np.float32) ae = _build_autoencoder(input_dim=input_dim, latent_dim=latent_dim, hidden_units=hidden_units) ae.compile(optimizer='adam', loss='mse') ae.fit(arr, arr, epochs=epochs, batch_size=batch_size, validation_split=validation_split, verbose=verbose) recon = ae.predict(arr) errors = np.mean((arr - recon) ** 2, axis=1) score = pd.Series(errors, index=Xdf.index) if threshold_method == 'quantile': thr = float(score.quantile(threshold_val)) else: thr = float(score.mean() + threshold_val * score.std()) mask = score > thr return create_result(mask, score, "autoencoder", {"latent_dim": latent_dim, "threshold": thr}, "AutoEncoder reconstruction error") def vae_method(X, latent_dim: int = 8, hidden_units=(128, 64), epochs: int = 50, batch_size: int = 32, threshold_method: str = 'quantile', threshold_val: float = 0.99, verbose: int = 0) -> OutlierResult: """ Variational Autoencoder (VAE) anomaly detection. Implementation note: VAE is more involved; here we provide a simple implementation that uses reconstruction error as score. For strict probabilistic anomaly scoring one would use the ELBO / likelihood; this minimal implementation keeps it practical. """ # For brevity we reuse autoencoder path (a more complete VAE impl is possible) return autoencoder_method(X, latent_dim=latent_dim, hidden_units=hidden_units, epochs=epochs, batch_size=batch_size, threshold_method=threshold_method, threshold_val=threshold_val, verbose=verbose) # --------------------------- # File: outlier_detection/ensemble.py # --------------------------- """ Combine multiple detectors and produce an aggregated report. 
Provides strategies: union, intersection, majority voting, weighted sum of normalized scores.
"""
import logging
from typing import List, Dict

import numpy as np
import pandas as pd

from .utils import ensure_dataframe, create_result

logger = logging.getLogger(__name__)


def normalize_scores(scores: pd.DataFrame) -> pd.DataFrame:
    """Min-max normalize each score column to [0, 1]."""
    sc = scores.copy()
    for c in sc.columns:
        col = sc[c]
        mn = col.min()
        mx = col.max()
        if mx == mn:
            sc[c] = 0.0
        else:
            sc[c] = (col - mn) / (mx - mn)
    return sc


def aggregate_scores(results: Dict[str, Dict], method: str = 'weighted',
                     weights: Dict[str, float] = None) -> Dict:
    """
    Aggregate multiple OutlierResult dictionaries produced by detectors.

    Returns an OutlierResult-like dict with:
    - mask (final boolean by threshold on aggregate score)
    - score (aggregate numeric score)

    Aggregation methods:
    - 'union'        : any detector flagged => outlier (score = max of normalized scores)
    - 'intersection' : flagged by all detectors => outlier
    - 'majority'     : flagged by >50% of detectors
    - 'weighted'     : weighted sum of normalized scores (weights provided or equal)
    """
    # collect masks and scores into DataFrames
    masks = pd.DataFrame({k: v['mask'].astype(int) for k, v in results.items()})
    raw_scores = pd.DataFrame({k: (v['score'] if isinstance(v['score'], pd.Series)
                                   else pd.Series(v['score']))
                               for k, v in results.items()})
    raw_scores.index = masks.index
    norm_scores = normalize_scores(raw_scores)
    if method == 'union':
        agg_score = norm_scores.max(axis=1)
    elif method == 'intersection':
        agg_score = norm_scores.min(axis=1)
    elif method == 'majority':
        agg_score = masks.sum(axis=1) / max(1, masks.shape[1])
    elif method == 'weighted':
        if weights is None:
            weights = {k: 1.0 for k in results.keys()}
        # align weights
        w = pd.Series({k: weights.get(k, 1.0) for k in results.keys()})
        # make sure weights sum to 1
        w = w / w.sum()
        agg_score = (norm_scores * w).sum(axis=1)
    else:
        raise ValueError("Unknown aggregation method")
    # default threshold: 0.5
    mask = agg_score > 0.5
    return create_result(mask, agg_score, f"ensemble_{method}", {"method": method},
                         "Aggregated ensemble score")


def ensemble_methods(X, method_list: List[str] = None, method_params: Dict = None) -> Dict[str, Dict]:
    """
    Convenience: run multiple detectors by name and return a dict of results.

    method_list: list of names from ['iqr', 'modified_z', 'z_score', 'lof',
    'mahalanobis', 'isolation_forest', ...]
    method_params: optional dict mapping method name to params
    """
    from . import statistical, distance_density, model_based, deep_learning
    X = ensure_dataframe(X)
    if method_list is None:
        method_list = ['iqr', 'modified_z', 'isolation_forest', 'lof']
    if method_params is None:
        method_params = {}
    results = {}
    for m in method_list:
        params = method_params.get(m, {})
        try:
            if m == 'iqr':
                results[m] = statistical.iqr_method(X, **params)
            elif m == 'modified_z':
                results[m] = statistical.modified_z_score(X, **params)
            elif m == 'z_score':
                results[m] = statistical.z_score_method(X, **params)
            elif m == 'lof':
                results[m] = distance_density.lof_method(X, **params)
            elif m == 'mahalanobis':
                results[m] = distance_density.mahalanobis_method(X, **params)
            elif m == 'dbscan':
                results[m] = distance_density.dbscan_method(X, **params)
            elif m == 'knn':
                results[m] = distance_density.knn_distance_method(X, **params)
            elif m == 'isolation_forest':
                results[m] = model_based.isolation_forest_method(X, **params)
            elif m == 'one_class_svm':
                results[m] = model_based.one_class_svm_method(X, **params)
            elif m == 'pca':
                results[m] = model_based.pca_reconstruction_error(X, **params)
            elif m == 'gmm':
                results[m] = model_based.gmm_method(X, **params)
            elif m == 'elliptic':
                results[m] = model_based.elliptic_envelope_method(X, **params)
            elif m == 'autoencoder':
                results[m] = deep_learning.autoencoder_method(X, **params)
            else:
                logger.warning("Unknown method requested: %s", m)
        except Exception as e:
            logger.exception("Method %s failed: %s", m, e)
    return results


# ---------------------------
# File: outlier_detection/visualization.py
#
# ---------------------------
"""
Simple plotting helpers for quick inspection.

Note: plotting is intentionally minimal; for report-quality figures users can
adapt styles. The functions return the matplotlib Figure object so they can be
further customized.
"""
import matplotlib.pyplot as plt
import pandas as pd

from .utils import ensure_dataframe


def plot_boxplot(series: pd.Series, show: bool = True):
    df = ensure_dataframe(series)
    col = df.columns[0]
    fig, ax = plt.subplots()
    ax.boxplot(df[col].dropna())
    ax.set_title(f"Boxplot: {col}")
    if show:
        plt.show()
    return fig


def plot_pair_scatter(X, columns: list = None, show: bool = True):
    X = ensure_dataframe(X)
    if columns is not None:
        X = X[columns]
    cols = X.columns.tolist()[:4]  # avoid huge plots
    fig, axes = plt.subplots(len(cols) - 1, len(cols) - 1,
                             figsize=(4 * (len(cols) - 1), 4 * (len(cols) - 1)))
    for i in range(1, len(cols)):
        for j in range(i):
            ax = axes[i - 1, j]
            ax.scatter(X[cols[j]], X[cols[i]], s=8)
            ax.set_xlabel(cols[j])
            ax.set_ylabel(cols[i])
    fig.suptitle("Pairwise scatter (first 4 numeric cols)")
    if show:
        plt.show()
    return fig


# ---------------------------
# File: outlier_detection/cli.py
# ---------------------------
"""
A very small CLI to run detectors on a CSV file and output a CSV report.
Usage (example):
    python -m outlier_detection.cli detect input.csv output_report.csv --methods iqr,isolation_forest
"""
import argparse

import pandas as pd

from .ensemble import ensemble_methods, aggregate_scores


def main():
    parser = argparse.ArgumentParser(description='Outlier detection CLI')
    sub = parser.add_subparsers(dest='cmd')
    det = sub.add_parser('detect')
    det.add_argument('input_csv')
    det.add_argument('output_csv')
    det.add_argument('--methods', default='iqr,modified_z,isolation_forest,lof')
    args = parser.parse_args()

    df = pd.read_csv(args.input_csv)
    methods = args.methods.split(',')
    results = ensemble_methods(df, method_list=methods)
    agg = aggregate_scores(results, method='weighted')
    summary = pd.concat([pd.DataFrame({k: v['mask'].astype(int) for k, v in results.items()}),
                         pd.DataFrame({k: v['score'] for k, v in results.items()})],
                        axis=1)
    summary['ensemble_score'] = agg['score']
    summary['ensemble_flag'] = agg['mask'].astype(int)
    summary.to_csv(args.output_csv, index=False)
    print(f"Wrote report to {args.output_csv}")


if __name__ == '__main__':
    main()
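The univariate scores above are easy to sanity-check by hand. The following standalone sketch reproduces the modified z-score and IQR-fence arithmetic on a toy series without importing the package; the helper names `modified_z` and `iqr_fences` are ours, introduced for illustration only.

```python
import pandas as pd


def modified_z(x: pd.Series) -> pd.Series:
    # M_i = 0.6745 * (x_i - median) / MAD, MAD = median(|x_i - median|)
    med = x.median()
    mad = (x - med).abs().median()
    return (0.6745 * (x - med) / mad).abs()


def iqr_fences(x: pd.Series, k: float = 1.5):
    # boxplot fences [Q1 - k*IQR, Q3 + k*IQR]
    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


s = pd.Series([10.0, 11.0, 10.5, 9.8, 10.2, 50.0])  # 50.0 is the planted outlier
m = modified_z(s)
lower, upper = iqr_fences(s)
outlier_mask = (s < lower) | (s > upper)
print(m.idxmax())  # -> 5 (the planted outlier has by far the largest score)
```

Both rules agree here: the modified z-score of the planted point far exceeds the 3.5 threshold, and the point also lies above the upper IQR fence.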
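The aggregation rules in `aggregate_scores` can likewise be checked on hand-made detector outputs. This is a minimal sketch of the 'majority' and 'weighted' strategies; the detector names 'a'/'b', their masks, scores, and the equal weights are all invented for the example.

```python
import pandas as pd

# two fake detectors: binary flags and raw (differently scaled) scores
masks = pd.DataFrame({"a": [1, 0, 1], "b": [1, 0, 0]})
scores = pd.DataFrame({"a": [0.9, 0.1, 0.8], "b": [10.0, 2.0, 4.0]})

# min-max normalize each score column to [0, 1], as normalize_scores does
norm = (scores - scores.min()) / (scores.max() - scores.min())

# 'majority': fraction of detectors flagging each row
majority = masks.sum(axis=1) / masks.shape[1]

# 'weighted': weighted sum of normalized scores (equal weights summing to 1)
weighted = (norm * pd.Series({"a": 0.5, "b": 0.5})).sum(axis=1)

print(majority.tolist())  # -> [1.0, 0.0, 0.5]
```

Note how normalization matters: detector 'b' scores on a 2-10 scale, yet after min-max scaling it contributes on equal footing with 'a'.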