pandas.DataFrame usage summary: when you get a DataFrame and when you get a Series

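This post walks through how the DataFrame object in the pandas library behaves under conditional filtering, column selection, and train/test splitting, as used in data preprocessing and feature engineering.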

pandas.DataFrame usage summary:

1  df[df.Datatype=='train'] returns a DataFrame. The == comparison inside the brackets returns a boolean Series, which carries both an index and values.

2  df['Class'] returns a Series: type(tr['Class']) is <class 'pandas.core.series.Series'>
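
The two rules above can be checked on a small example. This is a minimal sketch on made-up toy data (the values are hypothetical; only the column names Datatype, Class, and c_1 come from this post). It also shows one related case not listed above: indexing with a list of column names gives a DataFrame even when the list has a single entry.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the ones used in this post
df = pd.DataFrame({
    "Datatype": ["train", "test", "train"],
    "Class": [0, 1, 1],
    "c_1": [0.40, 0.87, 0.59],
})

mask = df.Datatype == "train"   # comparison -> boolean Series (index + values)
rows = df[mask]                 # boolean mask -> DataFrame with the matching rows
one_col = df["Class"]           # single column label -> Series
one_col_df = df[["Class"]]      # list of labels -> DataFrame, even with one column

for name, obj in [("mask", mask), ("rows", rows),
                  ("one_col", one_col), ("one_col_df", one_col_df)]:
    print(name, type(obj).__name__)
```

The filtered rows keep their original index labels, which is why the printed `tr` further down skips index 1.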

 

####source code###############################################################################

import pandas as pd

# Takes in dataframes and a list of selected features (column names)
# and returns (train_x, train_y), (test_x, test_y)
def train_test_data(complete_df, features_df, selected_features):
    '''Gets selected training and test features from given dataframes, and 
       returns tuples for training and test features and their corresponding class labels.
       :param complete_df: A dataframe with all of our processed text data, datatypes, and labels
       :param features_df: A dataframe of all computed, similarity features
       :param selected_features: An array of selected features that correspond to certain columns in `features_df`
       :return: training and test features and labels: (train_x, train_y), (test_x, test_y)'''
    
    # get the training features
    df = pd.concat([complete_df,features_df],axis=1)
    tr = df[df.Datatype=='train']
    
    print("type(sf=",type(tr))
    print("tr=",tr)
    print("type(df.Datatype=='train')=",type(df.Datatype == 'train'))
    train_x = tr[selected_features].values
    print("train_x=",train_x)
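    # Debug check only: this compares Datatype against the string "Class".
    # No row matches, so t is an all-False boolean Series; it is not used below.
    t = df.Datatype == "Class"
    print("type(t)=",type(t),"t=",t)
    # And training class labels (0 or 1)
    train_y = tr['Class'].values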
    
    # get the test features and labels
    test= df[df.Datatype == 'test']
    test_x = test[selected_features].values
    test_y = test['Class'].values
    
    return (train_x, train_y), (test_x, test_y)
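
The function relies on the fact that selecting a list of columns yields a DataFrame while selecting one label yields a Series, so the two conversions produce arrays of different rank. A minimal sketch on hypothetical toy data follows. It uses `.to_numpy()`, which the pandas documentation recommends over `.values`; that is a substitution for illustration, not what the function above does.

```python
import pandas as pd

# Hypothetical stand-in for the filtered training frame `tr`
tr = pd.DataFrame({"c_1": [0.40, 0.87], "c_2": [0.08, 0.72], "Class": [0, 1]})
selected_features = ["c_1", "c_2"]

train_x = tr[selected_features].to_numpy()  # DataFrame -> 2-D array (rows, features)
train_y = tr["Class"].to_numpy()            # Series -> 1-D array (rows,)

print(train_x.shape, train_y.shape)  # (2, 2) (2,)
```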

 

"""
DON'T MODIFY ANYTHING IN THIS CELL THAT IS BELOW THIS LINE
"""
test_selection = list(features_df)[:2] # first couple columns as a test
print("test_selection=",test_selection)
# test that the correct train/test data is created
(train_x, train_y), (test_x, test_y) = train_test_data(complete_df, features_df, test_selection)

# params: generated train/test data
tests.test_data_split(train_x, train_y, test_x, test_y)

#result##############################################################################################

test_selection= ['c_1', 'c_2']
type(sf= <class 'pandas.core.frame.DataFrame'>
tr=               File Task  Category  Class  \
0   g0pA_taska.txt    a         0      0   
2   g0pA_taskc.txt    c         2      1   
3   g0pA_taskd.txt    d         1      1   
4   g0pA_taske.txt    e         0      0   
5   g0pB_taska.txt    a         0      0   
..             ...  ...       ...    ...   
89  g4pD_taske.txt    e         1      1   
90  g4pE_taska.txt    a         1      1   
91  g4pE_taskb.txt    b         2      1   
92  g4pE_taskc.txt    c         3      1   
93  g4pE_taskd.txt    d         0      0   

                                                 Text Datatype       c_1  \
0   inheritance is a basic concept of object orien...    train  0.398148   
2   the vector space model also called term vector...    train  0.869369   
3   bayes theorem was names after rev thomas bayes...    train  0.593583   
4   dynamic programming is an algorithm design tec...    train  0.544503   
5   inheritance is a basic concept in object orien...    train  0.329502   
..                                                ...      ...       ...   
89  dynamic programming is a method of providing s...    train  0.845188   
90  object oriented programming is a style of prog...    train  0.485000   
91  pagerankalgorithm is also known as link analys...    train  0.950673   
92  the definition of term depends on the applicat...    train  0.551220   
93   bayes theorem or bayes rule  or something cal...    train  0.361257   

         c_2       c_3       c_4       c_5       c_6  lcs_word  
0   0.079070  0.009346  0.000000  0.000000  0.000000  0.191781  
2   0.719457  0.613636  0.515982  0.449541  0.382488  0.846491  
3   0.268817  0.156757  0.108696  0.081967  0.060440  0.316062  
4   0.115789  0.031746  0.005319  0.000000  0.000000  0.242574  
5   0.053846  0.007722  0.003876  0.000000  0.000000  0.161172  
..       ...       ...       ...       ...       ...       ...  
89  0.546218  0.400844  0.347458  0.302128  0.273504  0.643725  
90  0.105528  0.025253  0.005076  0.000000  0.000000  0.242718  
91  0.878378  0.823529  0.800000  0.780822  0.761468  0.839506  
92  0.328431  0.285714  0.252475  0.233831  0.220000  0.283019  
93  0.031579  0.000000  0.000000  0.000000  0.000000  0.161765  

[70 rows x 13 columns]
type(df.Datatype=='train')= <class 'pandas.core.series.Series'>
train_x= [[0.39814815 0.07906977]
 [0.86936937 0.71945701]
 [0.59358289 0.2688172 ]
 [0.54450262 0.11578947]
 [0.32950192 0.05384615]
 [0.59030837 0.15044248]
 [0.75977654 0.50561798]
 [0.51612903 0.07027027]
 [0.44086022 0.11891892]
 [0.97945205 0.91724138]
 [0.95138889 0.7972028 ]
 [0.97647059 0.85798817]
 [0.81176471 0.55621302]
 [0.44117647 0.03030303]
 [0.48888889 0.06741573]
 [0.81395349 0.67058824]
 [0.61111111 0.15492958]
 [1.         1.        ]
 [0.63402062 0.20207254]
 [0.58293839 0.29047619]
 [0.63793103 0.42857143]
 [0.42038217 0.07692308]
 [0.68776371 0.40677966]
 [0.67664671 0.31927711]
 [0.76923077 0.53355705]
 [0.71226415 0.37914692]
 [0.62992126 0.33992095]
 [0.71573604 0.26020408]
 [0.33206107 0.03065134]
 [0.71721311 0.36213992]
 [0.87826087 0.71179039]
 [0.52980132 0.35548173]
 [0.57211538 0.14009662]
 [0.31967213 0.04115226]
 [0.53       0.13567839]
 [0.78       0.65829146]
 [0.65269461 0.18674699]
 [0.44394619 0.15315315]
 [0.66502463 0.39108911]
 [0.72815534 0.30731707]
 [0.76204819 0.54984894]
 [0.94701987 0.67333333]
 [0.36842105 0.0619469 ]
 [0.53289474 0.09933775]
 [0.61849711 0.16860465]
 [0.51030928 0.09326425]
 [0.57983193 0.11814346]
 [0.40703518 0.06565657]
 [0.51546392 0.09310345]
 [0.58454106 0.27669903]
 [0.6171875  0.33858268]
 [1.         0.96153846]
 [0.99166667 0.96638655]
 [0.5505618  0.15819209]
 [0.41935484 0.07608696]
 [0.83516484 0.45555556]
 [0.92708333 0.69473684]
 [0.492891   0.05714286]
 [0.70873786 0.52682927]
 [0.86338798 0.66483516]
 [0.96060606 0.92097264]
 [0.43801653 0.08333333]
 [0.73366834 0.35353535]
 [0.51388889 0.09302326]
 [0.48611111 0.07906977]
 [0.84518828 0.54621849]
 [0.485      0.10552764]
 [0.95067265 0.87837838]
 [0.55121951 0.32843137]
 [0.36125654 0.03157895]]
type(t)= <class 'pandas.core.series.Series'> t= 0     False
1     False
2     False
3     False
4     False
      ...  
95    False
96    False
97    False
98    False
99    False
Name: Datatype, Length: 100, dtype: bool
Tests Passed!
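
The all-False Series printed for `t` above is still a valid mask. As a small sketch on hypothetical toy data, indexing with such a mask returns a DataFrame with zero rows rather than raising an error, so the returned type follows the same rule as any other boolean filter.

```python
import pandas as pd

# Hypothetical toy data; no Datatype value equals the string "Class"
df = pd.DataFrame({"Datatype": ["train", "test"], "Class": [0, 1]})

t = df.Datatype == "Class"   # boolean Series, every entry False
empty = df[t]                # still a DataFrame, just with no rows

print(type(t).__name__, bool(t.any()))
print(type(empty).__name__, empty.shape)
```

Checking `t.any()` before filtering is one simple way to catch a mask that was built against the wrong value.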

 

#https://notebookinstance.notebook.us-east-2.sagemaker.aws/notebooks/CN-ML_SageMaker_Studies/Project_Plagiarism_Detection/2_Plagiarism_Feature_Engineering.ipynb
