pandas.DataFrame usage notes:
1 df[df.Datatype=='train'] returns a DataFrame. The == comparison inside the brackets returns a boolean Series, which carries both an index and values.
2 df['Class'] returns a Series: type(tr['Class']) is <class 'pandas.core.series.Series'>
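The two notes above can be checked on a small made-up frame. This is only a sketch: `toy_df` and its values are hypothetical, and only the column names `Datatype` and `Class` mirror the ones used in this post.

```python
import pandas as pd

# Toy data (hypothetical); only the column names mirror the post.
toy_df = pd.DataFrame({
    "Datatype": ["train", "test", "train", "orig"],
    "Class": [0, 1, 1, -1],
    "c_1": [0.40, 0.87, 0.59, 0.10],
})

mask = toy_df.Datatype == "train"  # element-wise comparison -> boolean Series
train_rows = toy_df[mask]          # boolean indexing -> DataFrame, original index labels kept
labels = toy_df["Class"]           # single-column selection -> Series

print(type(mask).__name__, mask.dtype)
print(type(train_rows).__name__, list(train_rows.index))
print(type(labels).__name__)
```

Note that the filtered frame keeps the original row labels (here 0 and 2), which is why the `tr` output further down skips index 1.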
####source code###############################################################################
import pandas as pd

# Takes in dataframes and a list of selected features (column names)
# and returns (train_x, train_y), (test_x, test_y)
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''

    # put the text data and the features side by side (rows are aligned on the index)
    df = pd.concat([complete_df, features_df], axis=1)

    # get the training features
    tr = df[df.Datatype == 'train']
    print("type(sf=", type(tr))  # DataFrame
    print("tr=", tr)
    print("type(df.Datatype=='train')=", type(df.Datatype == 'train'))  # boolean Series
    train_x = tr[selected_features].values
    print("train_x=", train_x)

    # And training class labels (0 or 1)
    # debug only: this compares the Datatype values with the string "Class",
    # so the mask is all False in the output below and is not used afterwards
    t = df.Datatype == "Class"
    print("type(t)=", type(t), "t=", t)
    train_y = tr['Class'].values

    # get the test features and labels
    test = df[df.Datatype == 'test']
    test_x = test[selected_features].values
    test_y = test['Class'].values

    return (train_x, train_y), (test_x, test_y)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
# complete_df, features_df and tests are assumed to come from earlier notebook cells
test_selection = list(features_df)[:2]  # first couple columns as a test
print("test_selection=", test_selection)

# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)
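To see what shapes the function hands back without the notebook data, here is a minimal sketch on made-up frames. `toy_complete` and `toy_features` are hypothetical stand-ins for `complete_df` and `features_df`. It uses `to_numpy()`, which the pandas documentation recommends over `.values`; for the numeric columns here both give a NumPy array.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for complete_df and features_df (same index, different columns).
toy_complete = pd.DataFrame({
    "Class": [0, 1, 1, 0],
    "Datatype": ["train", "test", "train", "train"],
})
toy_features = pd.DataFrame({
    "c_1": [0.40, 0.87, 0.59, 0.54],
    "c_2": [0.08, 0.72, 0.27, 0.12],
})

# axis=1 places the frames side by side, aligning rows on the index.
combined = pd.concat([toy_complete, toy_features], axis=1)
train = combined[combined["Datatype"] == "train"]

train_x = train[["c_1", "c_2"]].to_numpy()  # list of columns -> 2-D array
train_y = train["Class"].to_numpy()         # single column   -> 1-D array

print(train_x.shape, train_y.shape)  # (3, 2) (3,)
```

Selecting with a list of column names yields a 2-D array (rows by features), while selecting one column by name yields a 1-D array of labels.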
#result##############################################################################################
test_selection= ['c_1', 'c_2']
type(sf= <class 'pandas.core.frame.DataFrame'>
tr=           File  Task  Category  Class  \
0   g0pA_taska.txt     a         0      0
2   g0pA_taskc.txt     c         2      1
3   g0pA_taskd.txt     d         1      1
4   g0pA_taske.txt     e         0      0
5   g0pB_taska.txt     a         0      0
..             ...   ...       ...    ...
89  g4pD_taske.txt     e         1      1
90  g4pE_taska.txt     a         1      1
91  g4pE_taskb.txt     b         2      1
92  g4pE_taskc.txt     c         3      1
93  g4pE_taskd.txt     d         0      0

                                                 Text  Datatype       c_1  \
0   inheritance is a basic concept of object orien...     train  0.398148
2   the vector space model also called term vector...     train  0.869369
3   bayes theorem was names after rev thomas bayes...     train  0.593583
4   dynamic programming is an algorithm design tec...     train  0.544503
5   inheritance is a basic concept in object orien...     train  0.329502
..                                                ...       ...       ...
89  dynamic programming is a method of providing s...     train  0.845188
90  object oriented programming is a style of prog...     train  0.485000
91  pagerankalgorithm is also known as link analys...     train  0.950673
92  the definition of term depends on the applicat...     train  0.551220
93  bayes theorem or bayes rule or something cal...       train  0.361257

         c_2       c_3       c_4       c_5       c_6  lcs_word
0   0.079070  0.009346  0.000000  0.000000  0.000000  0.191781
2   0.719457  0.613636  0.515982  0.449541  0.382488  0.846491
3   0.268817  0.156757  0.108696  0.081967  0.060440  0.316062
4   0.115789  0.031746  0.005319  0.000000  0.000000  0.242574
5   0.053846  0.007722  0.003876  0.000000  0.000000  0.161172
..       ...       ...       ...       ...       ...       ...
89  0.546218  0.400844  0.347458  0.302128  0.273504  0.643725
90  0.105528  0.025253  0.005076  0.000000  0.000000  0.242718
91  0.878378  0.823529  0.800000  0.780822  0.761468  0.839506
92  0.328431  0.285714  0.252475  0.233831  0.220000  0.283019
93  0.031579  0.000000  0.000000  0.000000  0.000000  0.161765

[70 rows x 13 columns]
type(df.Datatype=='train')= <class 'pandas.core.series.Series'>
train_x= [[0.39814815 0.07906977]
[0.86936937 0.71945701]
[0.59358289 0.2688172 ]
[0.54450262 0.11578947]
[0.32950192 0.05384615]
[0.59030837 0.15044248]
[0.75977654 0.50561798]
[0.51612903 0.07027027]
[0.44086022 0.11891892]
[0.97945205 0.91724138]
[0.95138889 0.7972028 ]
[0.97647059 0.85798817]
[0.81176471 0.55621302]
[0.44117647 0.03030303]
[0.48888889 0.06741573]
[0.81395349 0.67058824]
[0.61111111 0.15492958]
[1. 1. ]
[0.63402062 0.20207254]
[0.58293839 0.29047619]
[0.63793103 0.42857143]
[0.42038217 0.07692308]
[0.68776371 0.40677966]
[0.67664671 0.31927711]
[0.76923077 0.53355705]
[0.71226415 0.37914692]
[0.62992126 0.33992095]
[0.71573604 0.26020408]
[0.33206107 0.03065134]
[0.71721311 0.36213992]
[0.87826087 0.71179039]
[0.52980132 0.35548173]
[0.57211538 0.14009662]
[0.31967213 0.04115226]
[0.53 0.13567839]
[0.78 0.65829146]
[0.65269461 0.18674699]
[0.44394619 0.15315315]
[0.66502463 0.39108911]
[0.72815534 0.30731707]
[0.76204819 0.54984894]
[0.94701987 0.67333333]
[0.36842105 0.0619469 ]
[0.53289474 0.09933775]
[0.61849711 0.16860465]
[0.51030928 0.09326425]
[0.57983193 0.11814346]
[0.40703518 0.06565657]
[0.51546392 0.09310345]
[0.58454106 0.27669903]
[0.6171875 0.33858268]
[1. 0.96153846]
[0.99166667 0.96638655]
[0.5505618 0.15819209]
[0.41935484 0.07608696]
[0.83516484 0.45555556]
[0.92708333 0.69473684]
[0.492891 0.05714286]
[0.70873786 0.52682927]
[0.86338798 0.66483516]
[0.96060606 0.92097264]
[0.43801653 0.08333333]
[0.73366834 0.35353535]
[0.51388889 0.09302326]
[0.48611111 0.07906977]
[0.84518828 0.54621849]
[0.485 0.10552764]
[0.95067265 0.87837838]
[0.55121951 0.32843137]
[0.36125654 0.03157895]]
type(t)= <class 'pandas.core.series.Series'> t= 0 False
1 False
2 False
3 False
4 False
...
95 False
96 False
97 False
98 False
99 False
Name: Datatype, Length: 100, dtype: bool
Tests Passed!
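The `t` Series printed above shows only `False` because it compares the values stored in `Datatype` with the string "Class", which is a column name rather than one of the values that appear in that column. A quick way to sanity-check a mask before using it is to ask whether it selects anything at all. The sketch below uses a made-up Series (`toy`, hypothetical) rather than the real data.

```python
import pandas as pd

# Hypothetical Datatype column; the values are made up for illustration.
toy = pd.Series(["train", "test", "train", "orig"], name="Datatype")

t = toy == "Class"                   # no value equals "Class" -> all False
print(t.any())                       # False: the mask selects nothing
print((toy == "train").sum())        # 2: True counts as 1, so sum() counts matches
print(toy.value_counts().to_dict())  # which values actually occur, and how often
```

Checking `mask.any()` or `mask.sum()` makes an accidentally empty selection visible before it turns into an empty DataFrame downstream.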
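This post walks through using the pandas DataFrame object for conditional row filtering, column selection, and train/test splitting, as applied to data preprocessing and feature engineering.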