pandas.DataFrame usage summary: when indexing returns a DataFrame and when it returns a Series

Latest recommended article published 2022-01-16 01:10:25

License: CC 4.0 BY-SA

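This article shows how to work with the pandas DataFrame object: filtering rows by condition, selecting columns, and splitting data into training and test sets. These operations come up in data preprocessing and feature engineering.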

pandas.DataFrame usage summary:

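1 df[df.Datatype=='train'] returns a DataFrame. The comparison inside the brackets, df.Datatype=='train', returns a boolean Series, which carries both an index and values.

2 df['Class'] returns a Series: type(tr['Class']) is <class 'pandas.core.series.Series'>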
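
The two rules above can be checked on a small frame. The sketch below uses made-up toy data (only the column names Datatype, Class, c_1, c_2 come from this article) and also shows one case not listed above: indexing with a list of column names returns a DataFrame, even when the list has a single element.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the article.
df = pd.DataFrame({
    "Datatype": ["train", "test", "train", "orig"],
    "Class": [0, 1, 1, -1],
    "c_1": [0.40, 0.87, 0.59, 0.00],
    "c_2": [0.08, 0.72, 0.27, 0.00],
})

mask = df.Datatype == "train"      # comparison -> boolean Series, same index as df
tr = df[mask]                      # boolean mask -> DataFrame (rows filtered)
labels = tr["Class"]               # single column label -> Series
features = tr[["c_1", "c_2"]]      # list of labels -> DataFrame
one_col_frame = tr[["Class"]]      # one-element list -> still a DataFrame

for name, obj in [("mask", mask), ("tr", tr), ("labels", labels),
                  ("features", features), ("one_col_frame", one_col_frame)]:
    print(f"{name:14s}{type(obj).__name__}")
```

Note that the filtered frame keeps the original row labels (0 and 2 here), which is why the tr output in the result section below skips index values.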

####source code###############################################################################

import pandas as pd

# Takes in dataframes and a list of selected features (column names)
# and returns (train_x, train_y), (test_x, test_y)
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and
    returns tuples for training and test features and their corresponding class labels.
    :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
    :param features_df: A dataframe of all computed, similarity features
    :param selected_features: An array of selected features that correspond to certain columns in `features_df`
    :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''

    # join the two frames column-wise, then keep only the training rows
    df = pd.concat([complete_df, features_df], axis=1)
    tr = df[df.Datatype == 'train']  # boolean-mask indexing returns a DataFrame

    print("type(sf=", type(tr))
    print("tr=", tr)
    # the comparison itself is a boolean Series
    print("type(df.Datatype=='train')=", type(df.Datatype == 'train'))
    # selecting a list of columns gives a DataFrame; .values turns it into a 2-D array
    train_x = tr[selected_features].values
    print("train_x=", train_x)

    # debug only: any comparison yields a boolean Series
    # (all False here, because no Datatype value equals "Class")
    t = df.Datatype == "Class"
    print("type(t)=", type(t), "t=", t)

    # And training class labels (0 or 1): a single column is a Series, .values is 1-D
    train_y = tr['Class'].values

    # get the test features and labels
    test = df[df.Datatype == 'test']
    test_x = test[selected_features].values
    test_y = test['Class'].values

    return (train_x, train_y), (test_x, test_y)

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
test_selection = list(features_df)[:2] # first couple columns as a test
print("test_selection=",test_selection)
# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)

#result##############################################################################################

test_selection= ['c_1', 'c_2']
type(sf= <class 'pandas.core.frame.DataFrame'>
tr=               File Task  Category  Class  \
0   g0pA_taska.txt    a         0      0   
2   g0pA_taskc.txt    c         2      1   
3   g0pA_taskd.txt    d         1      1   
4   g0pA_taske.txt    e         0      0   
5   g0pB_taska.txt    a         0      0   
..             ...  ...       ...    ...   
89  g4pD_taske.txt    e         1      1   
90  g4pE_taska.txt    a         1      1   
91  g4pE_taskb.txt    b         2      1   
92  g4pE_taskc.txt    c         3      1   
93  g4pE_taskd.txt    d         0      0   

                                                 Text Datatype       c_1  \
0   inheritance is a basic concept of object orien...    train  0.398148   
2   the vector space model also called term vector...    train  0.869369   
3   bayes theorem was names after rev thomas bayes...    train  0.593583   
4   dynamic programming is an algorithm design tec...    train  0.544503   
5   inheritance is a basic concept in object orien...    train  0.329502   
..                                                ...      ...       ...   
89  dynamic programming is a method of providing s...    train  0.845188   
90  object oriented programming is a style of prog...    train  0.485000   
91  pagerankalgorithm is also known as link analys...    train  0.950673   
92  the definition of term depends on the applicat...    train  0.551220   
93   bayes theorem or bayes rule  or something cal...    train  0.361257   

         c_2       c_3       c_4       c_5       c_6  lcs_word  
0   0.079070  0.009346  0.000000  0.000000  0.000000  0.191781  
2   0.719457  0.613636  0.515982  0.449541  0.382488  0.846491  
3   0.268817  0.156757  0.108696  0.081967  0.060440  0.316062  
4   0.115789  0.031746  0.005319  0.000000  0.000000  0.242574  
5   0.053846  0.007722  0.003876  0.000000  0.000000  0.161172  
..       ...       ...       ...       ...       ...       ...  
89  0.546218  0.400844  0.347458  0.302128  0.273504  0.643725  
90  0.105528  0.025253  0.005076  0.000000  0.000000  0.242718  
91  0.878378  0.823529  0.800000  0.780822  0.761468  0.839506  
92  0.328431  0.285714  0.252475  0.233831  0.220000  0.283019  
93  0.031579  0.000000  0.000000  0.000000  0.000000  0.161765  

[70 rows x 13 columns]
type(df.Datatype=='train')= <class 'pandas.core.series.Series'>
train_x= [[0.39814815 0.07906977]
 [0.86936937 0.71945701]
 [0.59358289 0.2688172 ]
 [0.54450262 0.11578947]
 [0.32950192 0.05384615]
 [0.59030837 0.15044248]
 [0.75977654 0.50561798]
 [0.51612903 0.07027027]
 [0.44086022 0.11891892]
 [0.97945205 0.91724138]
 [0.95138889 0.7972028 ]
 [0.97647059 0.85798817]
 [0.81176471 0.55621302]
 [0.44117647 0.03030303]
 [0.48888889 0.06741573]
 [0.81395349 0.67058824]
 [0.61111111 0.15492958]
 [1.         1.        ]
 [0.63402062 0.20207254]
 [0.58293839 0.29047619]
 [0.63793103 0.42857143]
 [0.42038217 0.07692308]
 [0.68776371 0.40677966]
 [0.67664671 0.31927711]
 [0.76923077 0.53355705]
 [0.71226415 0.37914692]
 [0.62992126 0.33992095]
 [0.71573604 0.26020408]
 [0.33206107 0.03065134]
 [0.71721311 0.36213992]
 [0.87826087 0.71179039]
 [0.52980132 0.35548173]
 [0.57211538 0.14009662]
 [0.31967213 0.04115226]
 [0.53       0.13567839]
 [0.78       0.65829146]
 [0.65269461 0.18674699]
 [0.44394619 0.15315315]
 [0.66502463 0.39108911]
 [0.72815534 0.30731707]
 [0.76204819 0.54984894]
 [0.94701987 0.67333333]
 [0.36842105 0.0619469 ]
 [0.53289474 0.09933775]
 [0.61849711 0.16860465]
 [0.51030928 0.09326425]
 [0.57983193 0.11814346]
 [0.40703518 0.06565657]
 [0.51546392 0.09310345]
 [0.58454106 0.27669903]
 [0.6171875  0.33858268]
 [1.         0.96153846]
 [0.99166667 0.96638655]
 [0.5505618  0.15819209]
 [0.41935484 0.07608696]
 [0.83516484 0.45555556]
 [0.92708333 0.69473684]
 [0.492891   0.05714286]
 [0.70873786 0.52682927]
 [0.86338798 0.66483516]
 [0.96060606 0.92097264]
 [0.43801653 0.08333333]
 [0.73366834 0.35353535]
 [0.51388889 0.09302326]
 [0.48611111 0.07906977]
 [0.84518828 0.54621849]
 [0.485      0.10552764]
 [0.95067265 0.87837838]
 [0.55121951 0.32843137]
 [0.36125654 0.03157895]]
type(t)= <class 'pandas.core.series.Series'> t= 0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: Datatype, Length: 100, dtype: bool
Tests Passed!
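
The output above depends on notebook objects (complete_df, features_df, tests) that are not defined in this post. As a rough, self-contained sketch of the same split, the example below uses tiny made-up frames and a hypothetical helper named split; it is not the notebook's code. It illustrates why train_x prints as a 2-D array while the labels are 1-D: a list of columns gives a DataFrame, a single column gives a Series. It uses to_numpy(), which serves the same purpose as .values here.

```python
import pandas as pd

# Hypothetical stand-ins for the notebook's complete_df and features_df.
complete_df = pd.DataFrame({
    "Class": [0, 1, 1, 0],
    "Datatype": ["train", "test", "train", "test"],
})
features_df = pd.DataFrame({
    "c_1": [0.40, 0.87, 0.59, 0.54],
    "c_2": [0.08, 0.72, 0.27, 0.12],
})

def split(complete_df, features_df, selected_features, datatype):
    """Return (x, y) arrays for the rows whose Datatype equals `datatype`."""
    # Both frames share the default 0..n-1 index, so axis=1 lines rows up.
    df = pd.concat([complete_df, features_df], axis=1)
    part = df[df["Datatype"] == datatype]      # boolean mask -> DataFrame
    x = part[selected_features].to_numpy()     # list of columns -> 2-D array
    y = part["Class"].to_numpy()               # single column -> 1-D array
    return x, y

train_x, train_y = split(complete_df, features_df, ["c_1", "c_2"], "train")
test_x, test_y = split(complete_df, features_df, ["c_1", "c_2"], "test")

print(train_x.shape, train_y.shape)
print(test_x.shape, test_y.shape)
```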

#https://notebookinstance.notebook.us-east-2.sagemaker.aws/notebooks/CN-ML_SageMaker_Studies/Project_Plagiarism_Detection/2_Plagiarism_Feature_Engineering.ipynb