Combining Probabilities

http://www.mathpages.com/home/kmath267/kmath267.htm

Suppose Mr. Smith, who is correct 75% of the time, claims that a certain event X will not occur. It would seem on this basis that the probability of X occurring is 0.25. On the other hand, Mr. Jones, who is correct 60% of the time, claims that X will occur. Given both of these predictions, with their respective reliabilities, what is the probability that X will occur?

 

Essentially the overall context has 3 parameters, S (Smith's prediction), J (Jones's prediction), R (the real outcome). Thus, letting "y" and "n" respectively denote X and not X, the eight possibilities along with their probabilities are

    S  J  R    probability
    n  n  n        p0
    n  n  y        p1
    n  y  n        p2
    n  y  y        p3
    y  n  n        p4
    y  n  y        p5
    y  y  n        p6
    y  y  y        p7

where

    p0 + p1 + p2 + p3 + p4 + p5 + p6 + p7 = 1

Also, since S = R 75% of the time, we have

    p0 + p2 + p5 + p7 = 3/4

and since J = R 60% of the time, we have

    p0 + p3 + p4 + p7 = 3/5

From this we want to determine the probability that R = y given that S = n and J = y. Thus, we need to find the value of p3/(p2 + p3), which is the probability of [n y y] divided by the probability of [n y *], where "*" indicates "either y or n". Clearly the problem is underspecified. Setting A = p0 + p7, B = p1 + p6, C = p2 + p5, and D = p3 + p4, the conditions can be written as

    A + B + C + D = 1
    A + C = 3/4
    A + D = 3/5
which is three linear equations in four unknowns (with the extra constraint that each probability is in the interval 0 to 1), so there are infinitely many solutions. For example, we can set A = 0.50, B = 0.15, C = 0.25, and D = 0.10, and satisfy all three equations, but we could also set A = 0.60, B = 0.25, C = 0.15, and D = 0.00. Furthermore, even if we arbitrarily select one of these solutions, there are still infinitely many ways of partitioning C and D to give the values of p2 and p3.
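
To make the under-determination concrete, here is a small Python check (not part of the original article) that both of the assignments above satisfy the three constraints:

```python
# Check that multiple (A, B, C, D) assignments satisfy the three constraints:
#   A + B + C + D = 1,  A + C = 3/4 (Smith correct),  A + D = 3/5 (Jones correct).
solutions = [
    (0.50, 0.15, 0.25, 0.10),
    (0.60, 0.25, 0.15, 0.00),
]
for A, B, C, D in solutions:
    assert abs(A + B + C + D - 1.00) < 1e-9
    assert abs(A + C - 0.75) < 1e-9
    assert abs(A + D - 0.60) < 1e-9
print("both solutions satisfy all three constraints")
```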

 

For example, suppose we take the solution with C = 0.25 and D = 0.10. We then have  p2 + p5 = 0.25  and  p3 + p4 = 0.10. If we take p4 = 0.10 and p5 = 0.00, we have p3 = 0.00 and p2 = 0.25, so the probability of X is 0. On the other hand, we can equally well take p4 = 0.00 and p5 = 0.25, which gives p3 = 0.10 and p2 = 0.00, so the probability of X is 1. Thus, any answer from 0 to 1 is strictly consistent with the stated conditions.

 

Nevertheless, in real life the problems we confront are often (always?) underspecified, and yet we must still make decisions. Are there any "reasonable" assumptions we could make, in the absence of more information, that would enable us to give a "reasonable" answer? One approach would be to estimate (guess) how much correlation exists between the correctness of J and S. For example, since S seems to be smarter than J, we might assume that S is correct whenever J is correct, as well as being correct on some experiments when J is incorrect. This would imply that p3 = p4 = 0 and p2 > 0, so the probability of X is 0.

 

Another approach would be to assume that the correctness of S's and J's predictions is statistically independent, in the sense that each is just as likely to be right regardless of whether the other is right or wrong. This assumption implies

    P(S = R and J = R) = P(S = R) P(J = R),   i.e.,   A = (A + C)(A + D)

and

    P(S ≠ R and J ≠ R) = P(S ≠ R) P(J ≠ R),   i.e.,   B = (B + D)(B + C)

Letting u = 0.6 and v = 0.75 denote the probabilities of correctness for Jones and Smith respectively, these equations together with the previous constraints uniquely determine the four sums

    A = p0 + p7 = uv         = 0.45
    B = p1 + p6 = (1-u)(1-v) = 0.10
    C = p2 + p5 = (1-u)v     = 0.30
    D = p3 + p4 = u(1-v)     = 0.15

but this still doesn't uniquely determine the value of p3/(p2 + p3). We need at least one more assumption. I would suggest that we assume symmetry between "y" and "n". In other words, assume that the probability of any combination is equal to the probability of the complementary combination, given by changing each "y" to an "n" and vice versa. This amounts to the assumption that X (the real answer) has an a priori probability of 1/2, and that the probability of predicting correctly is the same regardless of whether R is "y" or "n". On this basis we have p0 = p7, p1 = p6, p2 = p5, and p3 = p4, so we have p2 = C/2 = 6/40, p3 = D/2 = 3/40, and p3/(p2 + p3) = 1/3.

 

Therefore, assuming S and J are not correlated, and assuming "y" and "n" are symmetrical, the probability of X, given that S[75%] says X will not occur and J[60%] says it will, is 33.3%. Any (positive) correlation between S and J would tend to lower this probability.
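
To check the arithmetic, here is a small Python sketch (the enumeration scheme is mine) that builds all eight state probabilities under the independence and symmetry assumptions and evaluates p3/(p2 + p3):

```python
u, v = 0.60, 0.75          # reliabilities of Jones (J) and Smith (S)
x = 0.5                    # a priori probability of X, from the symmetry assumption

# States are triples (S, J, R); under independence + symmetry the probability of
# a state factors as P(R) * P(S right or wrong) * P(J right or wrong).
p = {}
for S in "ny":
    for J in "ny":
        for R in "ny":
            pr = x if R == "y" else 1 - x
            pr *= v if S == R else 1 - v
            pr *= u if J == R else 1 - u
            p[(S, J, R)] = pr

# P(X | Smith says n, Jones says y) = p[nyy] / (p[nyn] + p[nyy])
prob_X = p[("n", "y", "y")] / (p[("n", "y", "n")] + p[("n", "y", "y")])
print(prob_X)   # 0.3333... = 1/3
```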

 

Notice that our resulting value for the probability of X is not equal to the a priori value of 1/2 that we assumed by imposing symmetry between "y" and "n". If we have some a priori reason to believe the probability of X is something different from 1/2, we could re-do the calculation using this value. (Of course, we cannot use the computed probability of X for the particular conditions at hand, because the a priori probability of X applies to all possible conditions, not just when Smith says it won't occur and Jones says it will.) To account for this additional information (if we have it), we can let x denote the a priori probability of X, and then write the individual state probabilities as

    p0 = (1-x) v u
    p1 = x (1-v)(1-u)
    p2 = (1-x) v (1-u)
    p3 = x (1-v) u
    p4 = (1-x)(1-v) u
    p5 = x v (1-u)
    p6 = (1-x)(1-v)(1-u)
    p7 = x v u

On this basis the probability of X is

    P(X) = p3/(p2 + p3) = x u (1-v) / [x u (1-v) + (1-x) v (1-u)]

Naturally if we have no knowledge of the a priori probability of X, we just assume x = 1/2, and this formula reduces to the one given previously.
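
In code, with x left as a free parameter (a minimal sketch; the function name is mine):

```python
def prob_X(x, u=0.60, v=0.75):
    """P(X | Smith[v] says no, Jones[u] says yes), given a priori P(X) = x."""
    return x * u * (1 - v) / (x * u * (1 - v) + (1 - x) * v * (1 - u))

print(prob_X(0.5))   # 0.3333... (reduces to the previous result)
print(prob_X(0.8))   # a higher prior for X raises the posterior accordingly
```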

 

For a slightly more complicated case, suppose Mr. Red's ability to correctly identify the outcome of a TRUE/FALSE experiment is 75%, Mr. Green's is 60%, and Mr. Blue's is 55%. If Mr. Blue, Mr. Green, and Mr. Red all agree that the outcome of the experiment is TRUE, is the resulting probability of "TRUE" 75%, or is it weighted somewhere between 75% and 55%?

 

Again this is underspecified, but if we impose the assumptions of (1) pairwise independence and (2) "y"/"n" symmetry, then in the general case of N prognosticators these two assumptions are sufficient to uniquely determine the answer. In other words, if N people with reliabilities r1, r2, ..., rN have each predicted the outcome will be "TRUE", and if we assume the correctness of their predictions is uncorrelated, and that there is symmetry between TRUE and FALSE outcomes, then the probability of a "TRUE" outcome is

    P(TRUE) = (r1 r2 ... rN) / [(r1 r2 ... rN) + (1-r1)(1-r2)...(1-rN)]

Thus, in the particular example described above with r1 = 3/4, r2 = 3/5, and r3 = 11/20, the probability of "TRUE" is 11/13 (i.e., about 84.6%).
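
This arithmetic can be verified exactly with Python's Fraction type (a minimal check):

```python
from fractions import Fraction
from math import prod

r = [Fraction(3, 4), Fraction(3, 5), Fraction(11, 20)]  # Red, Green, Blue
num = prod(r)                           # all three right, given outcome TRUE
den = num + prod(1 - ri for ri in r)    # ... plus all three wrong, given FALSE
print(num / den)                        # 11/13, i.e. about 84.6%
```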

 

Let Q = [q1, q2, ..., qN] denote a logical vector (i.e., each component qj is either +1 for "TRUE" or −1 for "FALSE") and let Q' denote the negation (logical complement) of Q. Also, define

    f(ri, qi) = ri  if qi = +1,        f(ri, qi) = 1 - ri  if qi = -1

and let F(Q) denote the product of f(ri, qi), i = 1 to N. Then the probability that the outcome will be TRUE given the predictions Q is given by

    P(TRUE | Q) = F(Q) / [F(Q) + F(Q')]

Incidentally, a more perspicacious (but equivalent) way of expressing these relations is in terms of the odds of a TRUE outcome:

    P/(1 - P) = [r1/(1-r1)]^q1 [r2/(1-r2)]^q2 ... [rN/(1-rN)]^qN

so that, taking logarithms, the log-odds contributed by the individual predictions simply add.

These results are formally correct, given the stated assumptions, but as discussed earlier, the most important thing to realize about these problems is that they are underspecified and have no definite answer. For example, if the a priori probability of the outcome "TRUE" is known to be x, then the above formula becomes

    P(TRUE | Q) = x F(Q) / [x F(Q) + (1-x) F(Q')]

Given various sets of assumptions, all of which satisfy the stated conditions of the problem, the correct probability can have any value from 0.0 to 1.0.
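
These relations are straightforward to implement. The sketch below (the function names are mine) computes P(TRUE | Q) from the reliabilities, the prediction vector Q, and an optional a priori probability x, and confirms that the odds-product form agrees with the direct form when x = 1/2:

```python
from math import prod

def F(r, q):
    """Product of f(ri, qi), with f(ri, +1) = ri and f(ri, -1) = 1 - ri."""
    return prod(ri if qi == +1 else 1 - ri for ri, qi in zip(r, q))

def p_true(r, q, x=0.5):
    """P(TRUE | predictions q) = x F(Q) / [x F(Q) + (1-x) F(Q')]."""
    q_neg = [-qi for qi in q]            # Q', the logical complement of Q
    return x * F(r, q) / (x * F(r, q) + (1 - x) * F(r, q_neg))

def p_true_odds(r, q):
    """Same result (for x = 1/2) via the odds product of (ri/(1-ri))**qi."""
    odds = prod((ri / (1 - ri)) ** qi for ri, qi in zip(r, q))
    return odds / (1 + odds)

r = [0.75, 0.60, 0.55]        # Red, Green, Blue
q = [+1, +1, +1]              # all three predict TRUE
print(p_true(r, q))           # 0.84615... = 11/13
print(p_true_odds(r, q))      # identical value
```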

 

The formula P = F(Q)/[F(Q) + F(Q')] is valid only for one specific set of assumptions, and those assumptions are not particularly realistic. It assumes that the correctness of Smith's predictions is totally uncorrelated with the correctness of Jones's predictions, which would almost certainly not be the case in any realistic situation. (It's much more likely that Jones and Smith use at least some of the same criteria for making their predictions.) To really answer the original question we would need to supply more information, specifically, the probabilities of each of the eight possible combinations of predictions and outcomes, as discussed previously.

 

For another example, suppose the Yankees and the Red Sox are playing, and the Red Sox have won 70% of their games, and the Yankees have won 50% of their games. What is the probability that the Yankees will win? Again the context is clearly underspecified, because the conditions of the question can be met by many different contexts, leading to many different outcome distributions. However, if we need to assign a probability based on this information alone, it's clear that our answer must assume the probability of Y beating R is some function of y and r (the fractions of games won by Y and R respectively). Thus we need a function F(y,r) such that

    Pr{Y beats R} = F(y,r)        and        F(y,r) + F(r,y) = 1

It follows that

    F(x,x) = 1/2

and 0 ≤ F(x,y) ≤ 1 for any x,y in [0,1]. One class of functions that satisfies this requirement is

    F(y,r) = f(y) / [f(y) + f(r)]

where f is any mapping from [0,1] to [0,+inf). For example, suppose y = 0.5 and r = 0.7. Taking f(x) = x, this gives Y a 41.7% chance of winning and R a 58.3% chance of winning. More generally, if we set f(x) = x^k and let the exponent k approach 0, the probabilities approach 50/50, whereas as k increases without bound the probability of Y (the team with the weaker record) winning goes to zero.
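
As a quick illustration of this one-parameter family (a sketch; win_prob is my name for the weighted rule):

```python
def win_prob(y, r, k=1.0):
    """Pr{Y beats R} under the weighting f(x) = x**k."""
    return y**k / (y**k + r**k)

y, r = 0.5, 0.7
print(win_prob(y, r, k=1))      # 0.4167 -- the 41.7% quoted above
print(win_prob(y, r, k=0.01))   # ~0.5   -- k near 0 washes out the records
print(win_prob(y, r, k=50))     # ~0.0   -- large k makes the better record decisive
```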

 

What is the "best" or optimal choice for f(x)? We might assume each team has a "skill level", and this level is distributed binomially. Then, given the percentage of games won by a certain team we could infer the skill level by integration over the whole population, assuming that each team plays every other team the same number of times, and assuming Pr{i beats j} = si/(si + sj).

 

Another approach is to use an expression of the form (Y')(R)/[(Y')(R) + (Y)(R')], where Y' = 1 - Y and R' = 1 - R are the conjugates of Y and R. The two possible outcomes are Y-wins-R-loses and Y-loses-R-wins. To find the probability (only from the win/loss records) of R winning we would then have

    Pr{R beats Y} = r(1-y) / [r(1-y) + (1-r)y]

This formula has a certain aesthetic appeal, but it also has some possibly counter-intuitive consequences. For example, suppose the two best teams in the league, X and Y, win x = 99% and y = 97% of their games, respectively. We might expect these two teams to be fairly evenly matched, which would be consistent with the formula

    Pr{X beats Y} = x / (x + y) = 0.99 / (0.99 + 0.97) = 0.505

In contrast, the alternative formula gives

    Pr{X beats Y} = x(1-y) / [x(1-y) + (1-x)y] = (0.99)(0.03) / [(0.99)(0.03) + (0.01)(0.97)] = 0.754

It isn't obvious that a 99% team should be this heavily favored over a 97% team. If this really were the applicable formula, then the presence of a 99% team in the league would almost preclude the existence of a 97% team, depending on how many teams are in the league and how often these teams play each other.
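
The contrast is easy to reproduce numerically (a small sketch using the two formulas just given):

```python
x, y = 0.99, 0.97   # season records of the two best teams

simple = x / (x + y)                               # f(z) = z weighting
conjugate = x*(1 - y) / (x*(1 - y) + (1 - x)*y)    # "conjugate" expression

print(round(simple, 3))      # 0.505 -- nearly even match
print(round(conjugate, 3))   # 0.754 -- X favored about 3 to 1
```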

 

One possible objection to the simple weighting function f(y)/[f(y) + f(r)] with f(x) = x is that the "system" will tend towards equilibrium. In a league of more than two teams whose records feed back into this rule, the teams will always come to equilibrium regardless of the initial conditions; in other words, each team converges to the same win/loss record. On the other hand, a team with a winning percentage of 0.800 will have ample opportunity to sustain its winning ways by using the alternative expression

    Pr{Y beats R} = y(1-r) / [y(1-r) + (1-y)r]

as a model. It's a good idea to impose the overall equilibrium requirement on the whole population when deriving a model. Of course, the second model is really a special case of the "simple weighted" model. In other words, we have

    Pr{X beats Y} = x(1-y) / [x(1-y) + (1-x)y]

and dividing the numerator and denominator by (1-x)(1-y) gives the equivalent form

    Pr{X beats Y} = f(x) / [f(x) + f(y)]

where f(z) = z/(1-z). This particular function f(z) is not unique in giving a self-consistent population.
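
A quick numerical confirmation of this equivalence (the test values are arbitrary):

```python
def conjugate(x, y):
    return x*(1 - y) / (x*(1 - y) + (1 - x)*y)

def weighted(x, y, f=lambda z: z / (1 - z)):
    return f(x) / (f(x) + f(y))

# The two expressions agree for any records strictly between 0 and 1.
for x, y in [(0.8, 0.6), (0.99, 0.97), (0.5, 0.7)]:
    assert abs(conjugate(x, y) - weighted(x, y)) < 1e-12
print("f(z) = z/(1-z) reproduces the conjugate formula")
```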

 

A more fundamental approach would be to model the underlying process. For example, suppose there are 256 ranked players in the world, with skill levels ranging from 1 to 9 distributed binomially as follows

    skill level:          1    2    3    4    5    6    7    8    9
    number of players:    1    8   28   56   70   56   28    8    1

Of course, "skill" might be a matrix rather than a scalar, and you could get into all sorts of interesting interactions (scissors cuts paper, paper wraps stone, stone breaks scissors, etc), but let's just assume that "skill" in this game can be modeled by a simple scalar. 

 

Now we must also specify to what extent skill determines the outcome of a contest. If the game's outcome is largely determined by chance, then the world's most skillful player may only beat the least skillful player 60% of the time. One way of modeling this is to say that the probability of player Pm beating player Pn is

    Pr{Pm beats Pn} = sm^k / (sm^k + sn^k)

where sj is the skill of player Pj and the constant k determines the importance of skill in this game. As k goes to 0 all the probabilities go to 0.5, meaning that the outcome of a game is only weakly determined by skill. If k is very large, then the more skillful player will almost always win.

 

Now we have a simple but complete model for which we can compute the long-term win/loss records of each skill level. In general, for a league of 2^N players with binomially distributed skill levels and assuming a skill factor of k (and every player plays every other player equally often), the "winning percentage" of a player with skill level q is

    Win(q) = (1/2^N) SUM[j=0 to N] C(N,j) q^k / (q^k + (j+1)^k)

where C(N,j) is the binomial coefficient N!/((N-j)! j!). Taking N = 8 and k = 2, the winning percentages for each of the 9 skill levels are as shown below:

    skill level q:    1      2      3      4      5      6      7      8      9
    Win(q):          5.1%  16.6%  29.6%  41.6%  51.8%  60.0%  66.7%  72.0%  76.3%

Of course, the weighted average of all these winning percentages is 50%. Also, since Win(q) is invertible, it follows that for any system of this general type the formula for predicting winners can be expressed in the form f(x)/[f(x) + f(y)].
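
Both facts are easy to verify numerically. The following sketch (assuming the skill model Pr{Pm beats Pn} = sm^k/(sm^k + sn^k) given above) evaluates Win(q) for N = 8 and k = 2, and checks the population-weighted average:

```python
from math import comb

def win(q, N=8, k=2):
    """Long-run winning percentage of a player with skill level q in a league
    of 2**N players whose skill level j+1 occurs with multiplicity C(N, j)."""
    return sum(comb(N, j) * q**k / (q**k + (j + 1)**k) for j in range(N + 1)) / 2**N

N, k = 8, 2
table = {q: win(q, N, k) for q in range(1, N + 2)}
for q, w in table.items():
    print(q, round(100 * w, 1))    # matches the percentages listed above

# Weighted average over the binomial population is exactly 1/2:
avg = sum(comb(N, q - 1) * w for q, w in table.items()) / 2**N
print(round(avg, 6))               # 0.5
```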

 


 

