
By: Johannes Korbmacher

Logic and probability


In Chapters 1. Logic and AI and 2. Valid inference, we’ve distinguished between deductive inference (the truth of the premises guarantees the truth of the conclusion) and inductive inference (the truth of the premises makes the truth of the conclusion more likely). So far, we’ve only focused on models of deductive inference. In this chapter, you’ll learn more about how to model inductive inference.

We’ll begin by looking at how to model probabilities in a logical setting. We’ll then go through a useful method for calculating probabilities using probability truth-tables. After this, we turn to conditional probabilities, which then allow us to discuss a simple model of inductive validity. To conclude the chapter, we look at some examples of valid and invalid inductive inferences in our model.

Models of probability


The probability of an event describes how likely it is that the event occurs. Think, for example, of how likely it is that you roll a 3 on a 6-sided die, or how likely it is that it rains tomorrow.

What this really means is actually a matter of substantial debate. There are different schools of thought on the interpretation of probability: frequentists take probabilities to be long-run relative frequencies of events, subjectivists (or Bayesians) take them to be rational degrees of belief, and there are various further interpretations besides.

Here, we won’t take sides in this debate. We’ll work with a pre-theoretic notion of probability and focus on the formalism.

Probabilities play many important roles in AI research. On the one hand, they are the basis for the statistical methods used in modern subsymbolic machine learning. On the other hand, probabilities are the main method for dealing with inductive inference and uncertainty in symbolic settings, such as expert systems.

Different applications of probabilities in AI use different (though equivalent) ways of developing the formalism. Here, we’ll develop a symbolic approach to probabilities, which models probabilities as real numbers assigned to formulas in formal languages.

But you should be aware that mathematical probability theory, in general, is a broader theory that can be applied in many different settings, both symbolic and subsymbolic.

For now, we work (again) with a propositional language $\mathcal{L}$ with variables $p_1,\dots,p_n$ and connectives $\neg,\land,\lor$.

A probability for $\mathcal{L}$ is a function $Pr:\mathcal{L}\to\mathbb{R}$, which assigns real numbers to formulas, subject to the following three conditions:

  1. $Pr(A)\geq 0$, for all $A\in\mathcal{L}$

  2. $Pr(A)=1$, whenever $\vDash A$ (i.e. $A$ is a classical logical truth)

  3. $Pr(A\lor B)=Pr(A)+Pr(B)$, whenever $\vDash \neg (A\land B)$ (i.e. $A\land B$ is unsatisfiable)

These are the so-called Kolmogorov axioms for classical probability theory. Note that they use notions from classical logic (logical truth and satisfiability).1

These are only the basic axioms; there are also derived laws. For example, it’s easy to show that for all formulas $A$, we have: $$Pr(\neg A)=1-Pr(A).$$

To see this, we can reason as follows: $A\lor\neg A$ is a classical logical truth, so by the second axiom, $Pr(A\lor\neg A)=1$. Moreover, $A\land\neg A$ is unsatisfiable, so by the third axiom, $Pr(A\lor\neg A)=Pr(A)+Pr(\neg A)$. Putting the two together, we get $Pr(A)+Pr(\neg A)=1$, and so $Pr(\neg A)=1-Pr(A)$.

In a similar way, you can show rules like the following: $$Pr(A\lor B)=Pr(A)+Pr(B)-Pr(A\land B)$$ $$Pr(A)\leq Pr(B)\text{, whenever }A\vDash B$$

These kinds of laws hold for all probabilities. And importantly, there are many possible probability measures, which correspond to different modeling situations. But how do we give a concrete probability? Well, we have to give a rule that specifies the probability of every formula. Let’s look at an example.

Suppose we’re modeling the throw of a single, 6-sided die. Correspondingly, we have six propositional variables in our language: $$\mathsf{RESULT}_1,\mathsf{RESULT}_2,\mathsf{RESULT}_3,\mathsf{RESULT}_4,\mathsf{RESULT}_5,\mathsf{RESULT}_6,$$ where $\mathsf{RESULT}_i$ means that the result of the die-roll is $i$.

If we’re dealing with a fair die, we can define a corresponding probability measure $Pr_{\text{fair}}$ as follows:

$$Pr_{\text{fair}}(\mathsf{RESULT}_i)=\frac{1}{6}\text{, for }i=1,\dots,6$$ $$Pr_{\text{fair}}(\mathsf{RESULT}_i\land \mathsf{RESULT}_j)=0\text{, for all }i\neq j$$ $$Pr_{\text{fair}}(\neg\mathsf{RESULT}_1\land \dots\land\neg\mathsf{RESULT}_6)=0$$

The third rule ensures that we have at least one result: it’s impossible that the result isn’t $1$, isn’t $2$, $\dots$, isn’t $6$.

The second rule formalizes the fact that a die can’t show two numbers at the same time: it’s impossible that the result is both a $1$ and a $6$, say. In other words, we assume that the results are mutually exclusive.

The first rule, then, captures the fact that each result is equally likely, i.e. the die is fair. Since there are $6$ mutually exclusive results, the only way to achieve this is by assigning $\frac{1}{6}$ to each.

It turns out that these three rules are sufficient to define a value $Pr_\text{fair}(A)$ for each formula $A$. Showing this is not hard, but not trivial either.2 Here we won’t go into the details, but instead highlight that probabilities are not recursive.

That is, the first rule is not sufficient to calculate the probabilities of all formulas. To calculate the probability of the die being 2 or 4, for example,

$$Pr_\text{fair}(\mathsf{RESULT}_2\lor \mathsf{RESULT}_4),$$

we need both the first and the second rule:

$$Pr_\text{fair}(\mathsf{RESULT}_2\lor \mathsf{RESULT}_4)=\frac{1}{6}+\frac{1}{6}=\frac{1}{3}$$

But note that, crucially, we needed both rules in this calculation. The probabilities of the propositional variables are not enough; we also need the values of their conjunctions.3

In our simple example, this was easy: no two distinct results could be true together, so all their conjunctions got probability $0$. But in more complex modeling scenarios, this makes specifying probabilities explicitly fairly difficult. This is why we’ll now describe a method for defining probabilities that’s easier to implement.

Probability truth-tables


It turns out that we can use truth-tables to specify probabilities for a whole language.4 To see how this works, let’s discuss the abstract theory first, and then go through a concrete example.

The basic idea is that we can specify probabilities by saying how likely different valuations are. Here, we think of valuations again as logically possible scenarios and of the likelihood of a valuation as how likely it is that the things in the actual world turn out exactly like in the scenario.

So, if we have $\mathsf{RAIN}$ and $\mathsf{SUN}$ as our propositional variables, then we have 4 different valuations:

| | $\mathsf{RAIN}$ | $\mathsf{SUN}$ |
|---|---|---|
| $\nu_1$ | 1 | 1 |
| $\nu_2$ | 1 | 0 |
| $\nu_3$ | 0 | 1 |
| $\nu_4$ | 0 | 0 |

These correspond to the “world” where it rains and the sun shines ($\nu_1$), it rains but the sun doesn’t shine ($\nu_2$), and so forth.

We now measure with a probability weight $m$ how likely each of these scenarios is. For example, we could say:

$$m(\nu_1)=0.5\qquad m(\nu_2)=0.3\qquad m(\nu_3)=0.2\qquad m(\nu_4)=0$$

What this means is that the likelihood of it raining and the sun shining is 0.5, the likelihood of it raining and the sun not shining is 0.3, and so forth.

We could use any weight distribution here, as long as the weights sum up to $1$. This expresses the assumption that at least one of the different scenarios must happen.

So, more generally, a probability weight for our language is a function $m$, which assigns to each valuation $\nu$ a non-negative value $m(\nu)\geq 0$ such that these values sum up to one: $$\sum_{\nu}m(\nu)=1$$

Once we have these values, we can calculate the probability of each formula by adding the probability weights of the worlds where the formula is true: $$Pr_m(A)=\sum_{\nu(A)=1}m(\nu).$$

So, in our example, to get the probability of $\mathsf{RAIN}$, for example, we’d calculate $$Pr(\mathsf{RAIN})=\sum_{\nu(\mathsf{RAIN})=1} m(\nu)=m(\nu_1)+m(\nu_2)=0.5+0.3=0.8$$

That is, in our model, the probability of it raining is 0.8. This is because the only scenarios where it rains are $\nu_1$ and $\nu_2$ and they have a combined likelihood of 0.8.
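To make this concrete, here’s a minimal Python sketch of the weight-based definition. The tuple-based formula representation and the helper names (`evaluate`, `probability`) are our own illustrative choices, not part of the formalism above:

```python
# A minimal sketch: formulas as nested tuples,
# ("var", name), ("not", A), ("and", A, B), ("or", A, B).
def evaluate(formula, valuation):
    """Classical truth value of a formula under a valuation (a dict of booleans)."""
    op = formula[0]
    if op == "var":
        return valuation[formula[1]]
    if op == "not":
        return not evaluate(formula[1], valuation)
    if op == "and":
        return evaluate(formula[1], valuation) and evaluate(formula[2], valuation)
    if op == "or":
        return evaluate(formula[1], valuation) or evaluate(formula[2], valuation)
    raise ValueError(f"unknown connective: {op}")

def probability(formula, weights):
    """Pr_m(A): sum the weights m(v) of the valuations v with v(A) = 1."""
    return sum(m for valuation, m in weights if evaluate(formula, valuation))

# The weather example: the four valuations and their weights.
RAIN, SUN = ("var", "RAIN"), ("var", "SUN")
weights = [
    ({"RAIN": True,  "SUN": True},  0.5),   # nu_1
    ({"RAIN": True,  "SUN": False}, 0.3),   # nu_2
    ({"RAIN": False, "SUN": True},  0.2),   # nu_3
    ({"RAIN": False, "SUN": False}, 0.0),   # nu_4
]
print(probability(RAIN, weights))  # 0.8 (up to floating-point rounding)
```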

It turns out, as a matter of mathematical fact, that every probability function $Pr$ can be expressed in this way. We shall not go into the proof of this, but rather look at how to use this fact as a convenient method for specifying probabilities.

The point is that it’s much easier to specify a weight for each valuation than to assign a value to each formula: there are typically way fewer valuations than relevantly distinct formulas. Which brings us to the connection with truth-tables.

Remember from Chapter 5.2 that in a truth-table, each line corresponds to a valuation; for our weather language, the four lines of the truth-table are exactly the valuations $\nu_1,\dots,\nu_4$ from above.

This means that in the table, we can conveniently give the information of our measure by adding a column for the weights:

| | $\mathsf{RAIN}$ | $\mathsf{SUN}$ | $m$ |
|---|---|---|---|
| $\nu_1$ | 1 | 1 | 0.5 |
| $\nu_2$ | 1 | 0 | 0.3 |
| $\nu_3$ | 0 | 1 | 0.2 |
| $\nu_4$ | 0 | 0 | 0 |

So far, so good. What makes this method really powerful is that it allows us to calculate the probability of a formula: simply calculate the values of the formula for each row in the usual way and then sum up the $m$-values for each row where the formula gets value $1$.

So, in our example, if we want to know the probability of $\mathsf{RAIN}\lor\neg\mathsf{SUN}$, we’d proceed as follows:

| | $\mathsf{RAIN}$ | $\mathsf{SUN}$ | $\mathsf{RAIN}\lor\neg\mathsf{SUN}$ | $m$ |
|---|---|---|---|---|
| $\nu_1$ | 1 | 1 | 1 | 0.5 |
| $\nu_2$ | 1 | 0 | 1 | 0.3 |
| $\nu_3$ | 0 | 1 | 0 | 0.2 |
| $\nu_4$ | 0 | 0 | 1 | 0 |

Then, we can see that: $$Pr(\mathsf{RAIN}\lor\neg\mathsf{SUN})=m(\nu_1)+m(\nu_2)+m(\nu_4)=0.5+0.3+0=0.8$$
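Using the sketch from above, this whole calculation is a single call:

```python
# Pr(RAIN or not-SUN): the formula is true in rows nu_1, nu_2, and nu_4.
print(probability(("or", RAIN, ("not", SUN)), weights))  # 0.5 + 0.3 + 0 = 0.8
```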

This is a powerful method for representing and calculating probabilities. But it also has its limits: for example, it suffers from the same problem of combinatorial explosion as the truth-table method for validity checking. To write the full truth-table for our die example from above, we’d need $2^6=64$ rows. That means we need to use “tricks” to handle such tables efficiently. In this example, we can use the fact that every row where more than one $\mathsf{RESULT}_i$ gets value $1$ will have $m$-value $0$, so we can leave all those rows out of the table:

| | $\mathsf{RESULT}_1$ | $\mathsf{RESULT}_2$ | $\mathsf{RESULT}_3$ | $\mathsf{RESULT}_4$ | $\mathsf{RESULT}_5$ | $\mathsf{RESULT}_6$ | $m$ |
|---|---|---|---|---|---|---|---|
| $\nu_1$ | 1 | 0 | 0 | 0 | 0 | 0 | $\frac{1}{6}$ |
| $\nu_2$ | 0 | 1 | 0 | 0 | 0 | 0 | $\frac{1}{6}$ |
| $\nu_3$ | 0 | 0 | 1 | 0 | 0 | 0 | $\frac{1}{6}$ |
| $\nu_4$ | 0 | 0 | 0 | 1 | 0 | 0 | $\frac{1}{6}$ |
| $\nu_5$ | 0 | 0 | 0 | 0 | 1 | 0 | $\frac{1}{6}$ |
| $\nu_6$ | 0 | 0 | 0 | 0 | 0 | 1 | $\frac{1}{6}$ |
| $\nu_7$ | 0 | 0 | 0 | 0 | 0 | 0 | $0$ |
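Continuing the Python sketch from above (reusing `probability`), one way to implement this trick is to list only the rows that carry non-zero weight; since the omitted rows contribute nothing to any sum, the results are unchanged:

```python
from fractions import Fraction

# Sparse table for the fair die: only the six rows where exactly one
# RESULT_i is true carry non-zero weight (1/6 each); all other rows
# have weight 0 and can simply be left out of the list.
die_weights = [
    ({f"RESULT_{i}": i == j for i in range(1, 7)}, Fraction(1, 6))
    for j in range(1, 7)
]

R2, R4 = ("var", "RESULT_2"), ("var", "RESULT_4")
print(probability(("or", R2, R4), die_weights))  # 1/6 + 1/6 = 1/3
```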

Unfortunately, computationally efficient implementations of probabilities are beyond what we can discuss in this setting.

Conditional probabilities


Having outlined what probabilities are and how we can describe them, we now turn our attention to modeling inductive inference using probabilities. As we’ve discussed already in Chapter 2. Valid inference, the idea is to use conditional probabilities for this purpose.

The conditional probability of one formula $A$ given another formula $B$ is, intuitively, a measure of how likely it is that $A$ is true assuming that $B$ is true. This idea is formally captured in the definition:

$$Pr(A\mid B)=\frac{Pr(A\land B)}{Pr(B)}$$

Here we need to assume that $Pr(B)\neq 0$ to avoid division by 0.

Let’s look at how this works in our weather example. The conditional probability of sun given rain is: $$Pr(\mathsf{SUN}\mid\mathsf{RAIN})=\frac{Pr(\mathsf{SUN}\land\mathsf{RAIN})}{Pr(\mathsf{RAIN})}=\frac{0.5}{0.8}=0.625$$ Since $Pr(\mathsf{SUN})=0.7$, learning that it rains makes sunshine less likely in this model.
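Continuing the Python sketch from above, conditional probability is a thin wrapper around the `probability` helper (again, the names are our own):

```python
def conditional_probability(a, b, weights):
    """Pr(A | B) = Pr(A and B) / Pr(B); undefined when Pr(B) = 0."""
    pr_b = probability(b, weights)
    if pr_b == 0:
        raise ValueError("Pr(B) = 0, so Pr(A | B) is undefined")
    return probability(("and", a, b), weights) / pr_b

# The weather example: Pr(SUN | RAIN) = 0.5 / 0.8 = 0.625.
print(conditional_probability(SUN, RAIN, weights))
```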

Besides helping us to define valid inductive inference, which we’ll turn to in the next section, conditional probabilities have many theoretical uses.

One important notion that underlies many practical applications is probabilistic independence. We say that two formulas $A,B$ are probabilistically independent iff $$Pr(A\mid B)=Pr(A).$$

If $A,B$ are probabilistically independent, then whether $B$ occurs “doesn’t matter” for whether $A$ occurs.

For example, with the following weight distribution, whether it rains and whether the sun shines are independent:

| | $\mathsf{RAIN}$ | $\mathsf{SUN}$ | $m$ |
|---|---|---|---|
| $\nu_1$ | 1 | 1 | 0.5 |
| $\nu_2$ | 1 | 0 | 0.3 |
| $\nu_3$ | 0 | 1 | 0.125 |
| $\nu_4$ | 0 | 0 | 0.075 |

To see this, consider:

$$Pr(\mathsf{SUN}\mid \mathsf{RAIN})=\frac{Pr(\mathsf{SUN}\land\mathsf{RAIN})}{Pr(\mathsf{RAIN})}=\frac{0.5}{0.5+0.3}=0.625=0.5+0.125=Pr(\mathsf{SUN})$$

It’s worth noting that if $A,B$ are probabilistically independent, we can calculate the probability of $A\land B$ from the probabilities of $A$ and $B$: $$Pr(A\land B)=Pr(A)\times Pr(B)$$

This holds in our example, since $$Pr(\mathsf{RAIN})=0.5+0.3=0.8$$ $$Pr(\mathsf{SUN})=0.5+0.125=0.625$$ $$Pr(\mathsf{RAIN}\land\mathsf{SUN})=0.5=0.8\times 0.625$$
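Building on the helpers from the sketches above, we can check independence numerically (the tolerance absorbs floating-point rounding):

```python
def independent(a, b, weights):
    """A and B are probabilistically independent iff Pr(A | B) = Pr(A)."""
    return abs(conditional_probability(a, b, weights) - probability(a, weights)) < 1e-9

# The independent weight distribution from above.
independent_weights = [
    ({"RAIN": True,  "SUN": True},  0.5),    # nu_1
    ({"RAIN": True,  "SUN": False}, 0.3),    # nu_2
    ({"RAIN": False, "SUN": True},  0.125),  # nu_3
    ({"RAIN": False, "SUN": False}, 0.075),  # nu_4
]
print(independent(SUN, RAIN, independent_weights))  # True
print(independent(SUN, RAIN, weights))              # False (the original table)
```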

But it’s important to note that this doesn’t hold in general. In the original table, where $m(\nu_3)=0.2$ and $m(\nu_4)=0$, we have:

$$Pr(\mathsf{RAIN})=0.5+0.3=0.8$$ $$Pr(\mathsf{SUN})=0.5+0.2=0.7$$ $$Pr(\mathsf{RAIN}\land\mathsf{SUN})=0.5\neq 0.8\times 0.7=0.56$$

Inductive validity


It is now time to turn to inductive validity. Before we give an account, though, it is important to note that inductive logic is a broad field, closely related to statistical inference. The account of inductive validity we present here is just one model of inductively valid inference, in a narrow sense.

The situation is similar to the one we face in deductive logic: Boolean logic, many-valued logic, fuzzy logic, etc. all give slightly different models of deductively valid inference, which are applicable in different circumstances, under different assumptions, etc.

Correspondingly, there are different models of inductive inference. The model we’ll discuss in this section is a simple version of what’s known as Bayesian inference. But we won’t go into the details of Bayesian statistics and its differences from other paradigms. Instead, we’ll focus on the idea of using conditional probability as a measure of inductive support, which is especially important in Bayesian statistics but, on a general level, is characteristic of inductive inference.

The basic idea we’re pursuing, as discussed in Chapter 2. Valid inference, is that an inference is inductively valid just in case the premises make the conclusion more likely. We interpret this as: the conditional probability of the conclusion given the premises is at least as high as the unconditional probability of the conclusion.

Using the symbol $\mid\approx$ for inductive validity, the definition is: $$ P_1,P_2,\dots\mid\approx C \Leftrightarrow Pr(C\mid P_1\land P_2\land \dots)\geq Pr(C)$$ This concept is essentially what Rudolf Carnap calls the increase of firmness concept of evidential support: the premises, if we were to add them to our knowledge base, would raise the degree to which our KB supports the conclusion.

There are a few things to note about the definition at this point. First, for the conditional probability to be defined, we need $Pr(P_1\land P_2\land\dots)\neq 0$: premises that have probability $0$ don’t inductively support anything. Second, since the definition uses $\geq$ rather than $>$, it is a weak notion: premises that leave the probability of the conclusion unchanged still count as inductively supporting it.
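As a sketch of how the definition can be operationalized, here is a version built on the Python helpers from above (`inductively_valid` is our own illustrative name):

```python
from functools import reduce

def inductively_valid(premises, conclusion, weights):
    """P1, ..., Pn |~ C  iff  Pr(C | P1 and ... and Pn) >= Pr(C)."""
    conjunction = reduce(lambda a, b: ("and", a, b), premises)
    return (conditional_probability(conclusion, conjunction, weights)
            >= probability(conclusion, weights))

# In the original weather table: Pr(RAIN | not-SUN) = 0.3 / 0.3 = 1 >= Pr(RAIN) = 0.8,
# so not-SUN inductively supports RAIN.
print(inductively_valid([("not", SUN)], RAIN, weights))  # True
```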

We’ll continue thinking about the simple increase of firmness concept of inductive validity $\mid\approx$. We can note the following inductive laws, which are independent of the choice of $Pr$ (assuming the relevant conditional probabilities are defined):

  1. If $P_1,P_2,\dots\vDash C$, then $P_1,P_2,\dots\mid\approx C$. That is, deductively valid inferences are inductively valid.

  2. If $C\vDash P_1\land P_2\land\dots$, then $P_1,P_2,\dots\mid\approx C$. That is, if the conclusion implies the premises, the inference is inductively valid. To see this, note that in this case $Pr(C\land P_1\land P_2\land\dots)=Pr(C)$, so $$Pr(C\mid P_1\land P_2\land\dots)=\frac{Pr(C)}{Pr(P_1\land P_2\land\dots)}\geq Pr(C),$$ since $Pr(P_1\land P_2\land\dots)\leq 1$.

The latter law will allow us to observe the inductive validity of some well-known principles, one of which we’ll discuss in the following section.

The law is a sort of mathematical vindication of the so-called hypothetico-deductive model of confirmation, which, though not unproblematic, has been hugely influential in thinking about what confirmation is.

Validities and fallacies


To conclude our theoretical discussion of inductive inference, let’s discuss two concrete examples of inductive inference, one valid, one invalid.

Enumerative induction


Enumerative induction is the principle that we can infer $\forall x P(x)$ from sufficiently many and well-chosen instances $P(a_1), P(a_2),\dots$. In our simple setting, we can express this as: $$P(a_1),P(a_2),\dots\mid\approx \forall xP(x).$$ The simple, standard example of this inference is that all observed swans, sampled well and wide, were white, so all swans are white. Of course, this inference is deductively invalid—it can always happen that there’s a non-white swan somewhere—but we can show, mathematically, that the inference is inductively valid in the (weak) increase of firmness sense.

This follows from the previous law that an inference is inductively valid if the conclusion implies the premises. Let’s suppose for a moment that we have extended our treatment of probabilities to a FOL language, which is rather straightforward but takes a lot of space.5 Assuming that we still get our theorem (which we do), we can infer the validity of enumerative induction from the simple fact that $$\forall xP(x)\vDash P(a_i),$$ for each $i$, by the law of universal instantiation.

Even more, given some reasonable assumptions, the function $$Pr(\forall xP(x)\mid \dots)$$ grows monotonically as we plug in ever-increasing numbers of instances:6 $$P(a_1)$$ $$P(a_1)\land P(a_2)$$ $$P(a_1)\land P(a_2)\land \dots$$ So we will eventually “crack” any threshold $\epsilon\geq 0.5$ we might choose.

This means that on our simple model of inductive validity, for sufficiently large $n$, $$P(a_1),\dots,P(a_n)\mid\overset{!}{\approx}\forall xP(x)$$
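To get a feel for this growth, here is a small self-contained Python experiment. The model (a uniform prior over all extensions of $P$ on $n$ individuals) is our own toy assumption for illustration, not the FOL construction the text alludes to:

```python
from fractions import Fraction
from itertools import product

n = 6  # individuals a_1, ..., a_n

# A "world" fixes the extension of P: world[i] says whether P(a_{i+1}) holds.
worlds = list(product([True, False], repeat=n))
prior = Fraction(1, len(worlds))  # uniform prior over all 2^n extensions

def pr(event):
    """Probability of an event (a predicate on worlds) under the uniform prior."""
    return sum(prior for w in worlds if event(w))

def pr_universal_given(k):
    """Pr(forall x P(x) | P(a_1) and ... and P(a_k)).

    Since the universal claim entails the evidence, the numerator
    is just Pr(forall x P(x)).
    """
    return pr(lambda w: all(w)) / pr(lambda w: all(w[:k]))

for k in range(n + 1):
    print(k, pr_universal_given(k))  # 1/64, 1/32, ..., 1/2, 1: monotone growth
```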

Conjunction fallacy


Here’s the original Linda problem from Tversky and Kahneman:

Linda is 31 years old, single, outspoken, and very bright. She majored in philosophy. As a student, she was deeply concerned with issues of discrimination and social justice, and also participated in anti-nuclear demonstrations.

Which is more probable?

  1. Linda is a bank teller.
  2. Linda is a bank teller and is active in the feminist movement.

In experimental surveys, people reliably judge option 2 to be more probable than option 1, which is a probabilistic fallacy.

We can now see why: Since $$\mathsf{BANK}\land\mathsf{FEMINIST}\vDash \mathsf{BANK},$$ we can infer that $$Pr(\mathsf{BANK}\land\mathsf{FEMINIST})\leq Pr(\mathsf{BANK}),$$ for any $Pr$ we could choose.
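A quick randomized sanity check, reusing the `probability` helper from the sketch above: across many randomly drawn weight functions, the conjunction never comes out more probable than the single conjunct.

```python
import random
from itertools import product

BANK, FEMINIST = ("var", "BANK"), ("var", "FEMINIST")

# Draw 1000 random weight functions over the four valuations and check
# that Pr(BANK and FEMINIST) <= Pr(BANK) holds in every single one.
for _ in range(1000):
    raw = [random.random() for _ in range(4)]
    total = sum(raw)
    linda_weights = [
        ({"BANK": b, "FEMINIST": f}, w / total)
        for (b, f), w in zip(product([True, False], repeat=2), raw)
    ]
    assert probability(("and", BANK, FEMINIST), linda_weights) <= probability(BANK, linda_weights)
print("no counterexample found")
```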

There is an extensive psychological literature that tries to explain why people commit this and related fallacies. But for us, from a logical perspective, the crucial point is that people do. In symbolic inductive reasoning, we have to be careful not to fall into the trap of these fallacies, and the best way is to use logical methods to model inductive reasoning.

Further readings


An excellent introduction to the formalism of Bayesian epistemology is:

Michael G. Titelbaum. Fundamentals of Bayesian Epistemology. Oxford: Oxford University Press, 2022.

Chapters 12 and 13 of Russell and Norvig’s Artificial Intelligence: A Modern Approach are an excellent introduction to more detailed uses of probabilistic reasoning in AI.

Notes:


  1. There are also non-classical probability theories, which build on alternative background logics. For example, there is strong Kleene probability theory. Check out this paper↩︎

  2. The proof involves a different normal form than the CNFs we used in Chapter 5. Boolean satisfiability, called disjunctive normal forms (DNFs). It also uses the facts that equivalent formulas have the same probability and that if one formula implies another, the probability of the first is at most that of the second. ↩︎

  3. … for example. There are other ways of specifying a probability, for instance using the probabilities of all disjunctions. But that’s not the important point. ↩︎

  4. The only constraint is that we only have finitely many propositional variables $p_1,\dots,p_n$ and not infinitely many $p_1,p_2,\dots$. But in concrete, AI modeling situations, this isn’t usually a practical problem. ↩︎

  5. See the Titelbaum book in the suggested readings for how to do this precisely. ↩︎

  6. This mathematical fact is easy to show but we omit it here for brevity’s sake. Can you fill the gap? ↩︎

Last edited: 21/10/2024