Python for Finance (Python金融大数据分析): Chapter 4 Notes on Data Types and Structures

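These are notes on Chapter 4 of Python for Finance. They cover the basic data types (integers, floats, and strings), with a closer look at floating-point precision and the decimal module. They also introduce the tuple, list, dictionary, and set data structures, in particular list slicing and the unordered nature of dictionaries. Finally, they cover NumPy data structures, including arrays, structured arrays, and memory layout, and explain why vectorizing code matters.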


Chapter 4: Data Types and Structures

4.1 Basic Data Types

4.1.1 Integers

a = 10
type(a)  # int
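# Call the bit_length method to get the number of bits needed to represent the int object
a.bit_length()  # 4
a = 100000
# The larger the integer assigned to the object, the more bits are needed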
a.bit_length()  # 17
googol = 10 ** 100
googol.bit_length()  # 333
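
As a minimal sketch of what bit_length() reports, the count can be reproduced by hand by repeatedly halving the value. The helper name bits_needed is hypothetical and only serves this illustration.

```python
def bits_needed(n: int) -> int:
    """Count the binary digits of a positive int by repeated halving (illustrative helper)."""
    count = 0
    while n > 0:
        n //= 2
        count += 1
    return count


for value in (10, 100000, 10 ** 100):
    # The manual count agrees with bit_length() and with the length of the binary string
    assert value.bit_length() == bits_needed(value) == len(bin(value)) - 2

print(bin(10))                    # 0b1010
print((10 ** 100).bit_length())   # 333
```

Python integers grow as needed, so even the googol is stored exactly; only the number of bits changes.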

4.1.2 Floats

type(1 / 4)  # float
b = 0.35
type(b)  # float
b + 0.1  # 0.44999999999999996

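The reason for this result is that floats are represented internally in binary. That is, a decimal number n (0 < n < 1) is represented by a series of the form n = x/2 + y/4 + z/8 + ... With a fixed number of bits, only finitely many terms of this series are available, so some values can only be stored approximately.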

c = 0.5
c.as_integer_ratio()  # (1, 2)

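as_integer_ratio() expresses a float as a fraction and returns a two-element tuple.
0.5 can be stored exactly because it has an exact (finite) binary representation: 0.5 = 1/2. However, b = 0.35 differs from the expected real number 0.35 = 7/20: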

b.as_integer_ratio()  # (3152519739159347, 9007199254740992)
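
To make the gap between the stored value and 7/20 visible, the standard-library fractions and decimal modules can convert a float to its exact value. This is a small sketch and not part of the original notes; the variable names are illustrative.

```python
from decimal import Decimal
from fractions import Fraction

b = 0.35
exact = Fraction(b)  # exact rational value of the stored double

print(exact == Fraction(*b.as_integer_ratio()))  # True: same numerator and denominator
print(exact == Fraction(7, 20))                  # False: 0.35 is not stored exactly
print(exact.denominator == 2 ** 53)              # True: the denominator is a power of two
print(Decimal(b))                                # full decimal expansion of the stored value

print(Fraction(0.5) == Fraction(1, 2))           # True: 0.5 has a finite binary form
```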

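Precision depends on the number of bits used to represent the number. In general, all platforms that Python runs on use the IEEE 754 double-precision standard (64 bits) for the internal representation, which corresponds to a relative accuracy of about 15 decimal digits. Because this topic matters in several areas of finance, it is sometimes necessary to make sure that numbers are represented exactly, or at least as well as possible. Summing a large set of numbers is one example: representation errors of a certain kind or magnitude can accumulate and produce a significant deviation from a benchmark value.
The decimal module provides an arbitrary-precision object for floating-point numbers, along with several options for handling precision issues when working with such numbers: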

import decimal
from decimal import Decimal

decimal.getcontext()
# Context(prec=28, rounding=ROUND_HALF_EVEN, Emin=-999999999, Emax=999999999, capitals=1, flags=[], traps=[Overflow, InvalidOperation, DivisionByZero])
# (output as printed in the book; newer Python versions print slightly different fields)
d = Decimal(1) / Decimal(11)  # Decimal('0.09090909090909090909090909091')

decimal.getcontext().prec = 4  # lower precision than default
e = Decimal(1) / Decimal(11)  # Decimal('0.09091')
decimal.getcontext().prec = 50  # higher precision than default
f = Decimal(1) / Decimal(11)  # Decimal('0.090909090909090909090909090909090909090909090909091')

g = d + e + f  # Decimal('0.27272818181818181818181818181909090909090909090909')
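
The accumulation of representation errors mentioned earlier can be shown with a small sketch: adding 0.1 ten times with plain floats, with math.fsum, and with Decimal. The example and its variable names are illustrative and not taken from the book. The explicit loop is used on purpose, because the built-in sum() may apply its own error compensation in newer Python versions.

```python
import math
from decimal import Decimal

values = [0.1] * 10

# Plain float accumulation: each addition rounds to the nearest double
naive = 0.0
for v in values:
    naive += v

compensated = math.fsum(values)                         # tracks the lost low-order bits
exact = sum((Decimal('0.1') for _ in values), Decimal(0))  # decimal arithmetic

print(naive == 1.0)             # False
print(compensated == 1.0)       # True
print(exact == Decimal('1.0'))  # True
```

Which remedy fits depends on the use case: math.fsum keeps float speed and float inputs, while Decimal avoids the binary representation error from the start as long as the inputs are created from strings.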

4.1.3 Strings

t = 'this is a string object'
t.capitalize()  # 'This is a string object'
t.split()  # ['this', 'is', 'a', 'string', 'object']
t.find('string')  # 10
t.find('Python')  # -1
t.replace(' ', '|')  # 'this|is|a|string|object'
'http://www.python.org'.strip('htp:/')  # 'www.python.org'

Selected string methods

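Method	Arguments	Returns/result
capitalize	()	Copy of the string with the first character in uppercase
count	(sub[, start[, end]])	Number of occurrences of the substring
decode	([encoding[, errors]])	Decoded version of the string, using encoding (e.g., UTF-8); in Python 3 this is a method of bytes, not str
encode	([encoding[, errors]])	Encoded version of the string
find	(sub[, start[, end]])	(Lowest) index at which the substring is found
join	(seq)	Concatenation of the strings in the sequence seq
replace	(old, new[, count])	Replaces the first count occurrences of old with new
split	([sep[, maxsplit]])	List of the words in the string, with sep as separator
splitlines	([keepends])	List of the lines in the string; line endings are kept if keepends is true
strip	([chars])	Copy of the string with leading and trailing characters in chars removed
upper	()	Copy of the string with all letters in uppercase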
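
A few of the methods in the table that the notes do not demonstrate can be tried as follows. This is a short sketch for Python 3, where encode returns a bytes object and decode is called on that bytes object; the variable names are illustrative.

```python
t = 'this is a string object'

print(t.count('i'))             # 3
print('|'.join(t.split()))      # this|is|a|string|object
print(t.replace(' ', '|', 2))   # this|is|a string object
print(t.upper())                # THIS IS A STRING OBJECT

lines = 'a\nb\n'
print(lines.splitlines())       # ['a', 'b']
print(lines.splitlines(True))   # ['a\n', 'b\n']

raw = t.encode('utf-8')         # str -> bytes
print(type(raw).__name__)       # bytes
print(raw.decode('utf-8') == t) # True: the round trip restores the string
```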

Regular expressions

import re

series = """
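'01/18/2014 13:00:00',100,'1st';
'01/18/2014 13:30:00',110,'2nd';
'01/18/2014 14:00:00',120,'3rd';
"""
# The single quotes at both ends of the pattern are required; without them the matches fall apart into fragments.
# A raw string keeps \s from being treated as an invalid string escape.
dt = re.compile(r"'[0-9/:\s]+'")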
result = dt.findall(series)
# ["'01/18/2014 13:00:00'", "'01/18/2014 13:30:00'", "'01/18/2014 14:00:00'"]

from datetime import datetime

pydt = datetime.strptime(result[0].replace("'", ""), '%m/%d/%Y %H:%M:%S')
print(pydt)  # 2014-01-18 13:00:00
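
The same two steps can be applied to every match at once. The sketch below is an assumption-based variation, not the book's code: it adds a capture group to the pattern so that findall returns the timestamps without the surrounding quotes, and it redefines the sample data so that it runs on its own.

```python
import re
from datetime import datetime

series = """
'01/18/2014 13:00:00',100,'1st';
'01/18/2014 13:30:00',110,'2nd';
'01/18/2014 14:00:00',120,'3rd';
"""

# The parentheses form a capture group, so findall returns only the text between the quotes
pattern = re.compile(r"'([0-9/:\s]+)'")
stamps = [datetime.strptime(s, '%m/%d/%Y %H:%M:%S')
          for s in pattern.findall(series)]

for stamp in stamps:
    print(stamp)

# datetime objects support arithmetic, e.g. the gap between the first two rows
print(stamps[1] - stamps[0])  # 0:30:00
```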

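4.2 Basic Data Structures

4.2.1 Tuples

A tuple is an advanced data structure, yet it is still quite simple and limited in its applications. It is defined by providing objects in parentheses: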

t = (1, 2.5, 'data')
type(t)  # tuple
t = 1, 2.5, 'data'  # 可以去掉括号
type(t)  # tuple
# Python uses zero-based numbering, so the third element of the tuple is at index position 2
t[2]  # 'data'
type(t[2])  # str
t.count('data')  # 1
t.inde