Python Basics Summary and Data Analysis Code Notes (Part I: Environment Setup and Data Structures; Part II: Data Collection and Manipulation)

Link to this article: https://blog.youkuaiyun.com/jwy19900622/article/details/125044942

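Study log

20220530----Data analysis with Python; this document collects the code summaries and worked operations from the daily course
20220601----Continued organizing
20220603----Python data analysis basics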

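I. Environment Setup and Data Structures

(I) Python Basics

Every value in Python (numbers, strings, data structures, functions, classes, modules, and so on) is an object.
Variables need no type declaration, yet Python is still a strongly typed language; the type is simply not written out explicitly. Note that in Python a variable has no type, only the object does: in x = 5, x is a name that refers to the int object 5.
Mutable and immutable objects
Many Python objects are mutable, e.g. lists, dicts, and instances of user-defined classes.
Strings and tuples are immutable.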
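
The points above can be sketched in a few lines. This is a minimal illustration that assumes standard CPython behaviour; the names x, nums, alias, and point are made up for the example.

```python
# A name has no type of its own; it refers to an object that does.
x = 5
type_before = type(x)      # int
x = 'five'                 # the same name can be rebound to a str object
type_after = type(x)       # str

# Lists are mutable: an in-place change is visible through every reference.
nums = [1, 2, 3]
alias = nums
alias.append(4)
print(nums)                # [1, 2, 3, 4]

# Tuples are immutable: item assignment raises TypeError.
point = (1, 2)
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(type_before.__name__, type_after.__name__, tuple_changed)   # int str False
```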

def print_info():
    """
    Print a greeting.

    A triple-quoted string is typically used for a docstring or a multi-line string.
    """
    print('Hello Python')

print_info()
print('%d %s cost me $%.2f.' % (2, 'books', 20))  # string formatting with %

s_val = '3.1415926'  # type conversion with str(), bool(), int(), float()
print('string value: %s' % s_val)
f_val = float(s_val)
print('float value: %f' % f_val)
i_val = int(f_val)  # int() truncates toward zero
print('int value: %i' % i_val)

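(II) Encodings

ASCII: the encoding early computers used to store English characters
GB2312: a Chinese extension of ASCII
GBK/GB18030: includes everything in GB2312 and adds nearly 20,000 new Chinese characters and symbols
Unicode: a character set covering the symbols of the world's writing systems. It assigns each symbol a code point; a fixed-width encoding such as UTF-32 spends 4 bytes on every symbol, which wastes space
UTF-8: a variable-length encoding (1 to 4 bytes per character) and the most widely used implementation of Unicode on the internet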
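
To make the size differences concrete, the sketch below encodes the same character with several codecs from the standard library. The sample character and variable names are chosen only for illustration.

```python
text = '中'   # a single Chinese character

utf8_bytes = text.encode('utf-8')       # variable length: 3 bytes for this character
gbk_bytes = text.encode('gbk')          # GBK stores it in 2 bytes
utf32_bytes = text.encode('utf-32-le')  # fixed width: 4 bytes per character

print(len(utf8_bytes), len(gbk_bytes), len(utf32_bytes))   # 3 2 4
print(len('a'.encode('utf-8')))                            # 1, ASCII characters stay one byte

# Decoding with the matching codec restores the original string.
restored = utf8_bytes.decode('utf-8')
print(restored == text)                                    # True
```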

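(III) Magic Commands (IPython/Jupyter)

%timeit [x for x in range(10)]  # line magic: run one statement many times and report the average time

%%time
# %%time is a cell magic: it must be the first line of a cell and times the whole cell
l = []
for x in range(100000):
    l.append(x)

# Delete all variables in the current namespace (the comment sits on its own line so it is not passed to %reset as an argument)
%reset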

(IV) Date and Time Module

from datetime import datetime
dt = datetime(2022, 5, 30, 22, 30, 15)
print(dt.month, dt.day, dt.hour, dt.minute, dt.second)

print(dt.strftime('%Y/%m/%d %H:%M'))  # strftime formats a datetime as a string: 2022/05/30 22:30
print(dt.strftime('%d/%m/%Y %H:%M'))  # 30/05/2022 22:30
print(datetime.strptime('20220530', '%Y%m%d'))  # strptime parses a string into a datetime: 2022-05-30 00:00:00

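(IV) Loops and Control Flow

m = 100  # control flow: if / elif / else
if m < 0:
    print('negative')
elif m > 0:
    print('positive')
else:
    print('zero')

l = range(20)  # 0, 1, 2, ..., 19
for i in l:  # for loop with continue and break
    if i % 2 == 0:
        continue  # skip even numbers
    if i == 15:
        break  # stop once 15 is reached
    print(i)  # prints 1, 3, 5, ..., 13

i = 1  # while loop
total = 0  # named total so the built-in sum() is not shadowed
while i <= 100:
    total += i  # total = total + i
    print(total)  # running total; the last value printed is 5050
    i += 1  # i = i + 1

print(list(range(1, 11)))  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(list(range(0, 10, 3)))  # [0, 3, 6, 9]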

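(V) Exception Handling

def divide(x, y):
    """
    Divide x by y; return an error message if the division fails.
    """
    try:
        return x / y
    except (ZeroDivisionError, TypeError):  # catch specific errors rather than using a bare except
        return 'error happens'

print(divide(8, 2))  # 4.0
print(divide('8', 2))  # error happens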

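(VI) References in Python

import copy
a = [[1, 2, 3], [4, 5, 6]]
b = a                 # b is another name for the same list object
c = copy.copy(a)      # shallow copy: new outer list, inner lists shared with a
d = copy.deepcopy(a)  # deep copy: the inner lists are copied as well
print('a-id:', id(a))
print('b-id:', id(b))  # same id as a
print('c-id:', id(c))  # different id
print('d-id:', id(d))  # different id
a.append(15)   # changes the outer list: visible through a and b only
a[1][2] = 10   # changes a shared inner list: visible through a, b, and c
print('processed...')
print(a)  # [[1, 2, 3], [4, 5, 10], 15]
print(b)  # [[1, 2, 3], [4, 5, 10], 15]
print(c)  # [[1, 2, 3], [4, 5, 10]]
print(d)  # [[1, 2, 3], [4, 5, 6]]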

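(VII) Python Data Structures

1. Tuple

tup1 = 1, 2, 3  # create a tuple: a one-dimensional, fixed-length, immutable sequence of objects
print(tup1)
tup2 = (1, 2, 3), (4, 5)  # nested tuple
print(tup2)
l = [1, 2, 3]
print(tuple(l))  # convert to a tuple: list -> tuple, string -> tuple
s = 'Hello'  # named s so the built-in str is not shadowed
print(tuple(s))
tup3 = tuple(s)  # access tuple elements by index
print(tup3[0])
print(tup1 + tup2)  # concatenate tuples: (1, 2, 3, (1, 2, 3), (4, 5))

def return_multiple():  # a function can return several values as a tuple
    t = (1, 2, 3)
    return t
a, _, c = return_multiple()  # unpacking; _ marks a value that is ignored
print(c)

tuple_lst = [(1, 2), (3, 4), (5, 6)]  # iterate over a list of tuples
for x, y in tuple_lst:
    print(x + y)
print(tup3.count('o'))  # count returns how many times a value occurs: 1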

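2. List

lst_1 = [1, 2, 3, 'a', 'b', (4, 5)]  # create a list with [ ]
print(lst_1)  # lists are variable-length and mutable
lst_2 = list(range(1, 9))
print(lst_2)

lst_3 = list(tup3)  # convert to a list: tuple -> list
print(lst_3)
lst_4 = list(range(10))
print(lst_4)
# add and remove elements: append, insert, pop, remove
lst_4.append(11)  # append an element at the end
print(lst_4)
lst_4.insert(0, 100)  # insert at a given position
print(lst_4)
item = lst_4.pop(0)  # remove the element at a given position and return it
print(item)
print(lst_4)
lst_4.remove(1)  # remove the first occurrence of a value; note that 1 here is a value, not a position
print(lst_4)

lst_3 = lst_1 + lst_2  # + builds a new list; lst_1 and lst_2 are unchanged
print(lst_3)
print(lst_1)
print(lst_2)
lst_1.extend(lst_2)  # extend appends the elements of lst_2 to lst_1 in place
print(lst_1)

import random  # sorting
lst_5 = list(range(10))
print(lst_5)
random.shuffle(lst_5)  # random.shuffle shuffles the elements of a list in place
print(lst_5)
lst_5.sort()  # sorts in place; no new object is created
print(lst_5)
lst_6 = ['Welcome', 'to', 'Python', 'Data', 'Analysis', 'Course']
lst_6.sort()  # strings sort by character code, so uppercase letters come before lowercase
print(lst_6)
lst_6.sort(key=len, reverse=True)  # sort by length, longest first
print(lst_6)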

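Slicing: [start_idx : stop_idx : step]. Note: the result does not include the element at stop_idx.

step: may be positive or negative. Its absolute value sets the stride, and its sign sets the direction: positive takes values from left to right, negative from right to left. When step is omitted it defaults to 1.
start_idx: the starting index (the element at this index is included). When omitted, slicing starts from one end of the sequence; which end depends on the sign of step: the beginning if step is positive, the end if step is negative.
stop_idx: the stopping index (the element at this index is excluded). When omitted, slicing runs all the way to one end of the sequence, again determined by the sign of step: the end if step is positive, the beginning if step is negative.
A full slice expression contains two colons separating the three parameters start_idx, stop_idx, and step. With only one colon, step defaults to 1. With no colon at all, the expression is plain indexing rather than slicing and returns the single element at that index.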

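lst_5 = list(range(10))
print(lst_5)
print(lst_5[1:5])    # [1, 2, 3, 4]
print(lst_5[5:])     # [5, 6, 7, 8, 9]
print(lst_5[:5])     # [0, 1, 2, 3, 4]
print(lst_5[-5:])    # [5, 6, 7, 8, 9]
print(lst_5[-5:-2])  # [5, 6, 7]
print(lst_5[::2])    # [0, 2, 4, 6, 8]
print(lst_5[::-1])   # [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]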
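
The listing above does not cover a negative step combined with explicit bounds. The short sketch below fills that in as an illustration of the rules described earlier; the variable names are arbitrary.

```python
lst = list(range(10))

backwards_by_two = lst[7:2:-2]   # start at index 7, walk left in steps of 2, stop before index 2
wrong_direction = lst[2:7:-1]    # start is left of stop but step is negative, so nothing is selected
from_the_end = lst[:3:-1]        # start omitted with a negative step: begin at the last element
single_item = lst[3]             # no colon: plain indexing, returns one element

print(backwards_by_two)   # [7, 5, 3]
print(wrong_direction)    # []
print(from_the_end)       # [9, 8, 7, 6, 5, 4]
print(single_item)        # 3
```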

3. Common Sequence Functions

lst_6 = ['Welcome', 'to', 'Python', 'Data', 'Analysis', 'Course']
for i, item in enumerate(lst_6):  # enumerate tracks the index while looping, yielding tuples (i, item) such as (0, 'Welcome')
    print('%i-%s' % (i, item))
str_dict = dict((i, item) for i, item in enumerate(lst_6))
print(str_dict)

import random
lst_5 = list(range(10))
print(lst_5)
random.shuffle(lst_5)
print(lst_5)

lst_5_sorted = sorted(lst_5)  # sorted returns a new sorted list, whereas list.sort() sorts in place
print(lst_5)
print(lst_5_sorted)

lst_6 = ['Welcome', 'to', 'Python', 'Data', 'Analysis', 'Course']
lst_7 = list(range(5))  # [0, 1, 2, 3, 4]
lst_8 = ['a', 'b', 'c']
zip_1st = list(zip(lst_6, lst_8, lst_7))  # zip groups the elements at the same position of several sequences into tuples, stopping at the shortest
zip_2st = tuple(zip(lst_6, lst_8, lst_7))
print(zip_1st)
print(zip_2st)
print(tuple(zip(*zip_1st)))  # zip(*list_of_tuples) "unzips", the inverse of zip

print(lst_6)
print(list(reversed(lst_6)))  # reversed iterates in reverse order; wrap it in list() to get a reversed list

4. Dictionary (dict)

empty_dict = {}  # create a dict
dict1 = {'a': 1, 2: 'b', '3': [1, 2, 3]}
print(empty_dict)
print(dict1)

dict1[4] = (4, 5)  # insert an item
print(dict1)
del dict1[2]  # delete an item with del
print(dict1)
a_value = dict1.pop('a')  # delete an item with pop, which returns the removed value
print(a_value)
print(dict1)

print(dict1.keys())  # get the keys
print(dict1.values())  # get the values
print(dict1)
dict2 = {10: 'new1'}
print(dict2)
dict1.update(dict2)  # merge dict2 into dict1
print(dict1)

dict_3 = {}  # conventional approach: build a dict from several sequences
l1 = range(10)
l2 = list(reversed(range(10)))
for i1, i2 in zip(l1, l2):
    dict_3[i1] = i2
print(dict_3)
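
As an alternative to the explicit loop, the same mapping can be built in one step. The sketch below is illustrative only; the names keys, values, dict_from_zip, and dict_from_comp do not appear in the original notes.

```python
keys = range(10)
values = list(reversed(range(10)))

# dict() accepts an iterable of (key, value) pairs, which is what zip produces.
dict_from_zip = dict(zip(keys, values))

# The equivalent dict comprehension, useful when keys or values need transforming.
dict_from_comp = {k: v for k, v in zip(keys, values)}

print(dict_from_zip)
print(dict_from_zip == dict_from_comp)   # True
```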

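5. Advanced Features

(1) List comprehension
A list comprehension builds a new list in a single expression and can include filtering, transformation, and similar operations.

# syntax: [exp for item in collection if condition]

%%timeit
# conventional approach (%%timeit must be the first line of its cell)
result1 = []
for i in range(10000):
    if i % 2 == 0:
        result1.append(i)

str_lst = ['Welcome', 'to', 'Python', 'Data', 'Analysis', 'Course']
result3 = [x.upper() for x in str_lst if len(x) > 4]
print(result3)  # ['WELCOME', 'PYTHON', 'ANALYSIS', 'COURSE']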
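
Outside a notebook, the loop and the comprehension can be compared with the standard timeit module. This is a rough sketch rather than a benchmark: the function names are hypothetical, and the timings depend on the machine, so no particular result is claimed here.

```python
import timeit

def evens_with_loop(n=10000):
    """Collect even numbers with an explicit loop."""
    result = []
    for i in range(n):
        if i % 2 == 0:
            result.append(i)
    return result

def evens_with_comprehension(n=10000):
    """Collect even numbers with a list comprehension."""
    return [i for i in range(n) if i % 2 == 0]

# Both approaches must produce the same list before timing is meaningful.
same_result = evens_with_loop() == evens_with_comprehension()

loop_time = timeit.timeit(evens_with_loop, number=100)
comp_time = timeit.timeit(evens_with_comprehension, number=100)
print('same result: %s' % same_result)
print('loop: %.4fs, comprehension: %.4fs' % (loop_time, comp_time))
```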

(2) Dict comprehension

# syntax: {key_exp: value_exp for item in collection if condition}
dict1 = {key: value for key, value in enumerate(reversed(range(10)))}
print(dict1)  # {0: 9, 1: 8, 2: 7, 3: 6, 4: 5, 5: 4, 6: 3, 7: 2, 8: 1, 9: 0}

(3) Set comprehension

# syntax: {exp for item in collection if condition}
set_1 = {i for i in range(10)}
print(set_1)

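(4) Nested list comprehension (read the for clauses in the order they are nested)

lists = [list(range(10)), list(range(10, 20))]
print(lists)
print(type(lists))
evens = [item for lst in lists for item in lst if item % 2 == 0]
print(evens)  # [0, 2, 4, 6, 8, 10, 12, 14, 16, 18]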

(5) Applying a list of functions

str_lst = ['$1.123', ' $1123.454', '$899.12312']  # strings to clean

def remove_space(s):
    """
    Remove spaces.
    """
    return s.replace(' ', '')

def remove_dollar(s):
    """
    Remove the $ sign.
    """
    if '$' in s:
        return s.replace('$', '')
    else:
        return s

def clean_str_lst(str_lst, operations):
    """
    Apply every operation, in order, to each string in the list.
    """
    result = []
    for item in str_lst:
        for op in operations:
            item = op(item)
        result.append(item)
    return result

clean_operations = [remove_space, remove_dollar]
result = clean_str_lst(str_lst, clean_operations)
print(result)  # ['1.123', '1123.454', '899.12312']

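(VIII) Functional Programming

Functional programming: a function can itself be assigned to a variable, after which the variable is that function; a function can be passed as an argument to another function; and a function can return another function.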
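
The three properties just listed can be sketched as follows. The function names square, apply_twice, and make_multiplier are hypothetical and chosen only for this illustration.

```python
def square(x):
    return x * x

# 1. A function can be assigned to a variable; the variable then is that function.
f = square
print(f(4))                      # 16

# 2. A function can be passed as an argument to another function.
def apply_twice(func, value):
    return func(func(value))

print(apply_twice(square, 3))    # 81

# 3. A function can return another function (here a closure over factor).
def make_multiplier(factor):
    def multiply(x):
        return x * factor
    return multiply

triple = make_multiplier(3)
print(triple(5))                 # 15
```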

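1. map/reduce

1. map(func, lst) applies the function func to every element of lst. In Python 3 it returns a lazy iterator (a map object); wrap it in list() to obtain a list.
2. reduce(func, lst), where func must take two arguments. Each result of func is combined with the next element of the sequence in a running computation. In Python 3, reduce lives in the functools module.
lst = [a1, a2, a3, …, an]
reduce(func, lst) = func(…func(func(a1, a2), a3)…, an)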

import math
x_1_lst = [x**2 for x in range(10)]
print(x_1_lst)
x_sqrt_lst = map(math.sqrt, x_1_lst)
print(list(x_sqrt_lst))
x_2_float_lst = list(map(float, x_1_lst))  # stored as a list so it can be reused below
print(x_2_float_lst)
x_3_str_lst = map(str, x_2_float_lst)
print(list(x_3_str_lst))
str_lst = map(str, range(5))  # yields '0', '1', ..., '4'
print(list(str_lst))
print(str_lst)  # a map object is an iterator; printing it shows <map object at ...>, not its items
name_lst = ['poNNY MA', 'rObIN li', 'steve JOBS', 'bILL gates']
standard_name_lst = map(str.title, name_lst)
print(list(standard_name_lst))  # normalized strings: ['Ponny Ma', 'Robin Li', 'Steve Jobs', 'Bill Gates']
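
The listing above exercises map but not reduce. A minimal sketch of reduce, using functools.reduce from the standard library, might look like this; the sample data and variable names are assumptions for illustration.

```python
from functools import reduce
import operator

numbers = [1, 2, 3, 4, 5]

# reduce folds the sequence from the left: ((((1 + 2) + 3) + 4) + 5)
total = reduce(operator.add, numbers)

# The two-argument function can also be a lambda.
product = reduce(lambda x, y: x * y, numbers)

# An optional third argument supplies the starting value.
total_from_100 = reduce(operator.add, numbers, 100)

print(total, product, total_from_100)   # 15 120 115
```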

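2. filter (filtering a sequence)

• filter(func, lst) applies func to every element of lst and keeps or discards each element depending on whether the return value is True or False. In Python 3 it returns a lazy iterator.
• The function passed to map, reduce, or filter can be an anonymous function (lambda).

number_lst = range(-10, 10)

def is_negative(x):
    return x < 0

filtered_lst = filter(is_negative, number_lst)
print(list(number_lst))
print(list(filtered_lst))  # [-10, -9, -8, -7, -6, -5, -4, -3, -2, -1]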

3. map/reduce/filter with anonymous functions

x_lst = range(10)
result_lst = map(lambda item: item**2 + item**3, x_lst)
print(list(x_lst))
print(list(result_lst))

number_lst = range(-10, 10)
filtered_lst = filter(lambda x: x < 0, number_lst)
print(list(number_lst))
print(list(filtered_lst))

(IX) Environment Setup (Python Package Management)

Install: pip install xxx, conda install xxx
Uninstall: pip uninstall xxx, conda uninstall xxx
Upgrade: pip install --upgrade xxx, conda update xxx

II. Data Collection and Manipulation

(I) Reading and Writing Files

txt_filename = './files/python_baidu.txt'
file_obj = open(txt_filename, 'r', encoding='utf-8')  # open the file
all_content = file_obj.read()  # read the whole file
file_obj.close()  # close the file
print(all_content)

txt_filename = './files/python_baidu.txt'  # open the file and read it line by line
file_obj = open(txt_filename, 'r', encoding='utf-8')
line1 = file_obj.readline()  # read one line
print(line1)
line2 = file_obj.readline()  # read the next line
print(line2)
file_obj.close()  # close the file

txt_filename = './files/python_baidu.txt'  # open the file and read all lines into a list
file_obj = open(txt_filename, 'r', encoding='utf-8')
lines = file_obj.readlines()
for i, line in enumerate(lines):
    print('{}: {}'.format(i, line))
file_obj.close()  # close the file

txt_filename = './files/test_write.txt'
file_obj = open(txt_filename, 'w', encoding='utf-8')  # open the file for writing
file_obj.write("《Python数据分析》")  # write a whole string
file_obj.close()

txt_filename = './files/test_write.txt'
file_obj = open(txt_filename, 'w', encoding='utf-8')  # open the file for writing
lines = ['这是第%i行\n' % n for n in range(100)]
file_obj.writelines(lines)  # write a list of strings
file_obj.close()
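
The examples above close each file by hand. A common alternative is the with statement, which closes the file automatically even if an error occurs. The sketch below assumes a temporary directory rather than the ./files folder used above, so that it can run anywhere; the sample lines are made up.

```python
import os
import tempfile

# A throwaway directory stands in for './files' (an assumption for this sketch).
tmp_dir = tempfile.mkdtemp()
txt_filename = os.path.join(tmp_dir, 'test_write.txt')

lines = ['line %i\n' % n for n in range(3)]

# The file is closed automatically when the with block ends.
with open(txt_filename, 'w', encoding='utf-8') as file_obj:
    file_obj.writelines(lines)

# Iterating over the file object reads it line by line without loading it all at once.
with open(txt_filename, 'r', encoding='utf-8') as file_obj:
    read_back = [line.rstrip('\n') for line in file_obj]

print(read_back)         # ['line 0', 'line 1', 'line 2']
print(file_obj.closed)   # True
```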

(II) CSV (Comma-Separated Values)

import pandas as pd
filename = './files/gender_country.csv'
df = pd.read_csv(filename, encoding='utf-16')  # read: df_obj = pd.read_csv() returns a DataFrame
print(type(df))
print(df.head())

filename = './files/pandas_output.csv'
# df.to_csv(filename)
df.to_csv(filename, index=False, encoding='utf-8')  # write: df_obj.to_csv(); index=False leaves out the row index

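(III) Pandas (built on NumPy; when printed, the index is on the left and the values on the right; pandas creates the index automatically if none is given)

Data structures
• Series: an object similar to a one-dimensional array.
• DataFrame: a tabular data structure in which each column can have a different data type; it can represent two-dimensional or higher-dimensional data.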

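(IV) JSON (JavaScript Object Notation)

A lightweight data-interchange format
• Syntax rules
• Data is stored as key-value pairs
• Items are separated by commas
• { } holds an object, e.g. {key1: val1, key2: val2}
• [ ] holds an array, e.g. [val1, val2, …, valn]

Reading
• json.load(file_obj)
• The return value is a dict when the top-level JSON value is an object
• Type conversion: json -> csv
Encoding
• json.dumps()
• Encoding note: pass ensure_ascii=False to keep non-ASCII characters as they are instead of escaping them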

import json
filename = './files/global_temperature.json'
with open(filename, 'r') as f_obj:
    json_data = json.load(f_obj)
print(type(json_data))  # the return value is a dict
print(json_data.keys())

print(json_data['data'].keys())
print(json_data['data'].values())

year_str_lst = json_data['data'].keys()  # convert the keys
year_lst = [int(year_str) for year_str in year_str_lst]
print(year_lst)
temp_str_lst = json_data['data'].values()  # convert the values
temp_lst = [float(temp_str) for temp_str in temp_str_lst]
print(temp_lst)

import pandas as pd
year_se = pd.Series(year_lst, name='year')  # build a DataFrame from two Series
temp_se = pd.Series(temp_lst, name='temperature')
result_df = pd.concat([year_se, temp_se], axis=1)
print(result_df.head())
result_df.to_csv('./files/json_to_csv.csv', index=False)  # save as CSV without the row index

book_dict = [{'书名':'无声告白', '作者':'伍绮诗'}, {'书名':'我不是潘金莲', '作者':'刘震云'}, {'书名':'沉默的大多数 (王小波集)', '作者':'王小波'}]
filename = './files/json_output.json'  # write JSON data
with open(filename, 'w', encoding='utf-8') as f_obj:
    f_obj.write(json.dumps(book_dict, ensure_ascii=False))  # ensure_ascii=False keeps the Chinese text readable in the file
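
To see what ensure_ascii changes, the sketch below serializes the same data both ways and parses it back. It works on strings with json.dumps and json.loads so that no file is needed; the variable names are illustrative.

```python
import json

books = [{'书名': '无声告白', '作者': '伍绮诗'}]

escaped = json.dumps(books)                        # default: non-ASCII characters become \uXXXX escapes
readable = json.dumps(books, ensure_ascii=False)   # characters are written as they are

print(escaped)
print(readable)

# Both strings decode to the same Python objects.
round_trip_ok = json.loads(escaped) == json.loads(readable) == books
print(round_trip_ok)   # True

# The Python type returned depends on the top-level JSON value.
obj_type = type(json.loads('{"a": 1}')).__name__
arr_type = type(json.loads('[1, 2]')).__name__
print(obj_type, arr_type)   # dict list
```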

(V) Introduction to Web Crawlers

import urllib.request

test_url = "https://www.sina.com.cn/"
response = urllib.request.urlopen(test_url)  # download directly from a URL
print(response.getcode())  # 200 means the request succeeded
print(response.read())  # the page content as bytes

request = urllib.request.Request(test_url)  # access through a Request object
request.add_header("user-agent", "Mozilla/5.0")
response = urllib.request.urlopen(request)
print(response.getcode())  # 200 means the request succeeded
print(response.read())

# access with cookie handling
import http.cookiejar
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
urllib.request.install_opener(opener)
response = urllib.request.urlopen(test_url)
print(response.getcode())  # 200 means the request succeeded
print(response.read())
print(cookie_jar)

To be continued

20220530 Lecture01
20220531Lecture01
20220601Lecture01
20220603codes/codes/examples/
20220603Pandas_codeslect03_codes/examples

                                                                                   June 5, 2022
                                                                                     Xining, Qinghai