Naive Bayesian

The Naive Bayesian classifier is based on Bayes' theorem with independence assumptions between predictors. A Naive Bayesian model is easy to build, with no complicated iterative parameter estimation, which makes it particularly useful for very large datasets. Despite its simplicity, the Naive Bayesian classifier often does surprisingly well and is widely used, frequently outperforming more sophisticated classification methods.
 
Algorithm
Bayes' theorem provides a way of calculating the posterior probability, P(c|x), from P(c), P(x), and P(x|c); the theorem is written out after the list below. The Naive Bayes classifier assumes that the effect of the value of a predictor (x) on a given class (c) is independent of the values of the other predictors. This assumption is called class conditional independence.

  • P(c|x) is the posterior probability of the class (target) given the predictor (attribute).
  • P(c) is the prior probability of the class.
  • P(x|c) is the likelihood, which is the probability of the predictor given the class.
  • P(x) is the prior probability of the predictor.
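
Written out (the source page shows these formulas as images; this reconstruction follows the standard statement of the theorem):

\[
P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}
\]

and, under class conditional independence, for a predictor vector X = (x_1, \dots, x_n):

\[
P(c \mid X) \propto P(x_1 \mid c) \times P(x_2 \mid c) \times \cdots \times P(x_n \mid c) \times P(c)
\]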
 
Example:
The posterior probability can be calculated by first constructing a frequency table for each attribute against the target, then transforming the frequency tables into likelihood tables, and finally using the Naive Bayesian equation to calculate the posterior probability for each class. The class with the highest posterior probability is the outcome of the prediction.
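
As a minimal sketch of these three steps in Python (the Outlook counts follow the 14-day Play Golf dataset used on the source page; treat the exact numbers as illustrative):

```python
from collections import Counter

# 14 days of (Outlook, Play Golf) observations from the golf example.
data = [
    ("sunny", "yes"), ("sunny", "yes"), ("sunny", "yes"),
    ("sunny", "no"), ("sunny", "no"),
    ("overcast", "yes"), ("overcast", "yes"),
    ("overcast", "yes"), ("overcast", "yes"),
    ("rainy", "yes"), ("rainy", "yes"),
    ("rainy", "no"), ("rainy", "no"), ("rainy", "no"),
]

# Frequency tables: class counts, joint value-class counts, value counts.
class_counts = Counter(c for _, c in data)
joint_counts = Counter(data)
value_counts = Counter(x for x, _ in data)
n = len(data)

def posterior(value, cls):
    """P(cls | value) = P(value | cls) * P(cls) / P(value)."""
    likelihood = joint_counts[(value, cls)] / class_counts[cls]  # likelihood table entry
    prior = class_counts[cls] / n
    evidence = value_counts[value] / n
    return likelihood * prior / evidence

print(posterior("sunny", "yes"))  # (3/9) * (9/14) / (5/14) = 0.6
```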

 
The zero-frequency problem
When an attribute value (e.g. Outlook=Overcast) does not occur with every class value (e.g. Play Golf=no), add 1 to the count of every attribute value-class combination (the Laplace estimator); otherwise the zero count forces the whole posterior product to zero.
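
A small sketch of the Laplace estimator, applied to the Outlook=Overcast / Play Golf=no case from the text (the count of 0 out of 5 "no" days and the 3 distinct Outlook values are assumptions based on the standard Play Golf data):

```python
def smoothed_likelihood(value_count, class_count, n_values):
    """P(value | class) with the Laplace (add-one) estimator.

    value_count: times this attribute value occurs with the class
    class_count: total instances of the class
    n_values:    number of distinct values the attribute can take
    """
    return (value_count + 1) / (class_count + n_values)

# Outlook=Overcast never occurs with Play Golf=no (0 of 5 days),
# but smoothing keeps the likelihood non-zero:
print(smoothed_likelihood(0, 5, 3))  # (0+1)/(5+3) = 0.125 instead of 0
```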
 
Numerical Predictors
Numerical variables need to be transformed into their categorical counterparts (binning) before their frequency tables are constructed. The other option is to use the distribution of the numerical variable to estimate the likelihood directly. For example, one common practice is to assume a normal distribution for numerical variables.
 
The probability density function for the normal distribution is defined by two parameters (mean and standard deviation).
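
With mean \mu and standard deviation \sigma, the density is:

\[
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
\]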

 
Example:
 

Humidity

Play Golf | Humidity values                     | Mean | StDev
yes       | 86, 96, 80, 65, 70, 80, 70, 90, 75  | 79.1 | 10.2
no        | 85, 90, 70, 95, 91                  | 86.2 | 9.7
 

 
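Using the "yes" row above, the likelihood of a given humidity reading on a "yes" day can be read off the fitted normal density. A minimal sketch (Humidity = 74 is just an illustrative query value):

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# P(Humidity = 74 | Play Golf = yes) under the fitted normal:
print(normal_pdf(74, 79.1, 10.2))  # ≈ 0.0344
```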
Predictors Contribution
Kononenko's information gain, expressed as a sum of information contributed by each attribute, can offer an explanation of how the values of the predictors influence the class probability.

The contribution of predictors can also be visualized by plotting a nomogram. A nomogram plots the log odds ratio for each value of each predictor; the length of a line corresponds to the span of the odds ratios, suggesting the importance of the related predictor. It also shows the impact of individual values of the predictor.
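
A rough sketch of the per-value quantity a nomogram plots, assuming two classes and taking the log odds ratio of the class-conditional likelihoods (the counts reuse the illustrative golf data from earlier):

```python
import math

def log_odds_ratio(p_value_given_yes, p_value_given_no):
    """Contribution of one predictor value, as plotted on a nomogram axis."""
    return math.log(p_value_given_yes / p_value_given_no)

# Outlook=Sunny with the golf counts above: P(sunny|yes) = 3/9, P(sunny|no) = 2/5
print(log_odds_ratio(3 / 9, 2 / 5))  # ≈ -0.18: sunny slightly favours "no"
```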

 

Reference: http://www.saedsayad.com/naive_bayesian.htm

Here is how you can perform sentiment analysis of restaurant comments with the steps described:

```python
import pandas as pd
import jieba
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score

# Load the data
df = pd.read_csv('restaurant_comments.csv', encoding='gb18030')

# Create the sentiment label: 1 for positive (more than 3 stars), 0 otherwise
df['sentiment'] = df['star'].apply(lambda x: 1 if x > 3 else 0)

# Tokenize the Chinese comments with jieba
df['comments'] = df['comments'].apply(lambda x: ' '.join(jieba.cut(x)))

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    df['comments'], df['sentiment'], test_size=0.2)

# Create a pipeline with CountVectorizer and MultinomialNB
pipeline = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('classifier', MultinomialNB())
])

# Train the model
pipeline.fit(X_train, y_train)

# Test the trained model on the held-out set and print the accuracy
y_pred = pipeline.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print('Accuracy:', accuracy)
```

In this code, we first load the data using pandas with encoding='gb18030', then create the sentiment label from the 'star' column. We tokenize the Chinese comments using jieba and split the dataset into training and test sets with train_test_split. We then build a pipeline of CountVectorizer and MultinomialNB, fit it on the training data, and evaluate it on the test set with accuracy_score. You can try other models as well and tune the hyperparameters to improve the accuracy.