Chapter 11 Less known features
This final chapter is a more practical and applied one, describing in detail the rest of the graphical user interface as well as the most used functions in package QCA. Although widely used, these functions still have some not so obvious features that remain unused in most cases. In a way, this is a chapter about all the little secrets of each function that, in many situations, are capable of making the usage experience as straightforward as possible.
Both the written functions and the graphical user interface were designed with this very purpose: to give the impression of extreme simplicity, allowing users to concentrate less on the R code and focus more on the substantive, theoretical part of the analysis.
A lot of effort has been spent writing the R code to allow all possible styles of specifying causal expressions. Negation, for example, can be specified in three different ways:
- by subtracting from 1, using the 1 - A type of expression, when the condition A is a fuzzy numeric or even a logical vector
- by using the universal negation exclamation sign !A, for logical vectors
- by using a tilde in front of the condition's name, like ~A
The exclamation sign is already part of base R, and it is implemented in most programming languages. Using lower and upper case letters was the original style from the beginnings of QCA, but it is no longer used. Using a tilde is the new standard, which is the reason the previously used argument use.tilde is now deprecated from all functions: a tilde is now the default way of signaling a negation.
Not all of these possibilities are native to a software like R; they had to be custom developed. While there are specific tips and tricks for each function, there are also some overall features that are valid for multiple functions across the package.
For example, in the past some functions had an argument called incl.cut, while others had a similar argument called incl.cut1, although both were in fact referring to one and the same thing: an inclusion cut-off. The former argument incl.cut1 is still possible (all functions are backwards compatible with older arguments), but the modern versions of the package have made these arguments uniform across functions, in this case settling on the more universal incl.cut.
Specifying multiple inclusion cut-offs is made possible by R's vectorized nature. Many function arguments accept one value as input, and sometimes more values in a vector. In fact, one simple but often forgotten R feature is that single values are themselves vectors of length 1 (scalars). There is thus nothing wrong with providing multiple values to one argument, as in the case of the directional expectations argument dir.exp, or of the inclusion cut-offs: supplying two values in a single vector made the two former separate arguments incl.cut1 and incl.cut0 redundant, in favor of a single argument incl.cut. This is just one example of small but very effective ways to simplify the user experience as much as possible.
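As a minimal sketch of this vectorized style (using the fuzzy Lipset data that reappears later in this chapter), both cut-offs can be supplied in a single vector:

data(LF)
# the first value plays the role of the former incl.cut1, and the
# second that of the former incl.cut0
ttLF <- truthTable(LF, outcome = "SURV", incl.cut = c(0.8, 0.6))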
11.1 Boolean expressions
There are multiple functions that deal with arbitrary sum of products (SOP) type of expressions, specified as text: pof(), fuzzyor(), fuzzyand(), compute(), simplify(), to name a few. Basically, these types of strings are recognized throughout the package; for instance, the function truthTable() recognizes whether the outcome is negated or not, using a tilde sign.
The standard way of specifying strings in R is to enclose them within double quotes, and this is recommended as good practice. In addition, some of the functions recognize a negation of an object using a tilde sign, even without the quotes.
Most of these functions rely heavily on a less known function called translate(), which transforms any SOP expression into a corresponding matrix with the names of the conditions on the columns and the values from the expression in the cells. This function is the workhorse for almost all other functions that deal with such expressions, and it is capable of detecting negations such as:
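# presumed command for the matrix below (the exact call is a reconstruction)
translate("A + ~B*C", snames = "A, B, C")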
       A  B  C
A      1
~B*C      0  1
In this example, the condition B is negated because it has a tilde sign in front. In previous versions it was important, and good practice, to name all objects using upper case letters, but since version 3.7 of the package this is no longer required.
Most of the following examples are going to use this function translate(), even though it has nothing to do with the QCA methodology itself. It is just a stand-in for any of the QCA specific functions that deal with string expressions. What works for this function works for all of them, for instance being able to recognize multi-value sets, and even to negate them if the number of levels is known for each set name:
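# presumed command, assuming the set B has three levels:
translate("A[1] + ~B[1]*C[1]", snames = "A, B, C", noflevels = c(2, 3, 2))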
             A    B    C
A[1]         1
~B[1]*C[1]       0,2   1
The output is relatively straightforward, with one row for each of the disjunctive terms in the expression, and the cells representing the values of each set after translating the expression. All of these values are subsequently used in the recoding step: fuzzy sets are inverted (if negated), while binary and multi-value crisp sets are transformed (recoded) to binary crisp sets. In this example, values 0 and 2 of the set B are going to be recoded to 1, and the former value 1 is going to be recoded to 0.
The expression ~B[1] is translated as "all values in set B except 1", and since the set B has 3 values, the others cannot be anything else but 0 and 2, assuming the set is properly calibrated and values always start from 0.
Using a star sign "*" to separate conjunctions is recommended, but not needed for multi-value expressions, where the set names are already separated by the square brackets notation for the values. The same happens when providing the set names:
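# presumed command: the set names allow parsing the juxtaposed letters
translate("AB + ~CD", snames = "A, B, C, D")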
      A  B  C  D
AB    1  1
~CD         0  1
Specifying the set names has another useful side effect: it orders the terms in the expression according to the order of their names in the snames argument. In the absence of the set names, the conditions are by default sorted alphabetically, as in the following example:
S1: B~DERUV + BI~LRTU + ~DEILTV
By contrast, when the set names are provided, they are properly sorted in the output. Here, the condition URB precedes the condition LIT, and the first term of the expression changes:
S1: ~DEV*LIT + URB*~LIT
If a dataset is provided, neither the number of levels nor the set names are needed because they are taken directly from the dataset:
[1] 1 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1
The example above uses the multi-value version of the Lipset data, where the causal condition DEV has three values. Negating the value 0 implies all other values, 1 and 2. The same effect is obtained by directly specifying them between the curly brackets:
[1] 1 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1
There are many ways to specify the causal conditions in an expression, and the functions from package QCA work well with those from the companion package admisc:
[1] 1 1 1 0 1 1 1 0 0 1 0 1 0 0 0 0 1 1
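The three identical outputs above could plausibly have been produced by commands along these lines, using the multi-value version of the Lipset data (LM) from package QCA; the exact calls are assumptions:

data(LM)
compute("~DEV[0]", data = LM)            # negating value 0 of DEV
compute("DEV[1,2]", data = LM)           # directly specifying values 1 and 2
admisc::compute("DEV[1,2]", data = LM)   # the same function from package admisc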
Out of all functions in package QCA that deal with Boolean expressions, the function pof(), which calculates parameters of fit, is by far the most versatile. Just like compute(), it evaluates a Boolean expression with either objects from the workspace or column names from a specific dataset, and it does so relative to a certain outcome, for the sufficiency or the necessity relation. This function is the Swiss army knife of the entire package, with many kinds of input and many purposes, and it is employed by all other functions that require parameters of fit (most notably the output of the main minimization function).
The most basic type of input, when thinking about inclusion, coverage and all other parameters of fit, involves two calibrated vectors: the first plays the role of the causal condition and the second the role of the outcome. In fact, the first two arguments of the function pof() are called setms and outcome.
The name setms denotes "set membership scores", to emphasize that almost any object can contain such scores: a data frame, a matrix of implicants, a simple vector, and this argument can even be a string containing a Boolean expression. It can also be a numeric vector containing row numbers from the implicants matrix, in which situation the setms argument is automatically transformed into the corresponding set membership scores. Since it can carry so many things, the function is specifically programmed to recognize what kind of operations should be performed on which kind of input.
The argument outcome can also be a string containing the name of the outcome (which will be searched for in the list of objects or in a specified dataset), or it can be a proper vector containing the membership scores for the outcome set. To begin an example, the fuzzy version of the CVF data will be used (Cebotari and Vink 2013):
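# presumed setup (object names are illustrative): load the data, then
# create separate objects for the conditions and the outcome
data(CVF)
conditions <- CVF[, 1:5]
PROTEST <- CVF$PROTEST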
Here, the conditions are the first five columns of the CVF dataset, and the outcome is the column called PROTEST. Both are now separate objects in the workspace, and can be used as such and negated using the 1- notation:
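# plausible reconstruction of the command producing the output below
pof(1 - conditions, PROTEST, relation = "sufficiency")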
inclS PRI covS covU
----------------------------------------
1 ~DEMOC 0.601 0.354 0.564 0.042
2 ~ETHFRACT 0.614 0.337 0.661 0.036
3 ~GEOCON 0.601 0.246 0.317 0.000
4 ~POLDIS 0.493 0.250 0.631 0.035
5 ~NATPRIDE 0.899 0.807 0.597 0.025
----------------------------------------
A number of things have happened, even with such a simple command. First of all, the function pof() automatically detected that the input for the setms argument is a dataset. Consequently, it determined that each column contains set membership scores and calculated parameters of fit for each. Last but not least, it determined that each column should be negated (because all conditions in the dataset were subtracted from 1), and their corresponding names in the output have a tilde in front to signal their negation.
This is a simple dataset containing single columns, but there are examples of datasets containing set membership scores for more complex expressions, such as the solutions resulting from the function minimize() or those resulting from the function superSubset(). As in the case of most functions, the output of these two functions is a list containing all sorts of components, the relevant ones for this section being the component pims (prime implicants membership scores) from the function minimize() and the component coms (combination membership scores) from the function superSubset().
For more complex expressions, it applies the negate() function to present their negated counterparts in the output.
ttCVF <- truthTable(CVF, outcome = "PROTEST", incl.cut = 0.8)
cCVF <- minimize(ttCVF, details = TRUE)
colnames(cCVF$pims)
[1] "DEMOC*ETHFRACT*GEOCON"
[2] "ETHFRACT*GEOCON*POLDIS"
[3] "DEMOC*ETHFRACT*POLDIS*~NATPRIDE"
[4] "DEMOC*GEOCON*POLDIS*NATPRIDE"
[5] "~DEMOC*~ETHFRACT*GEOCON*~POLDIS*~NATPRIDE"
The component coms from the function superSubset() has similarly complex column names, derived from the resulting (necessary) expressions:
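# hedged sketch (the exact cut-off is an assumption): a necessity analysis
# of the fuzzy Lipset data, inspecting the column names of the coms component
sus <- superSubset(LF, outcome = "SURV", incl.cut = 0.9)
colnames(sus$coms)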
[1] "STB" "LIT*STB" "DEV+URB+IND"
In these examples, the component pims contains the set membership scores for the prime implicants resulting in the conservative solution. One possible use case is to check whether their negation is also sufficient for the outcome:
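# presumed command: negate the pims columns, then test sufficiency
pof(1 - cCVF$pims, CVF$PROTEST, relation = "sufficiency")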
inclS PRI covS covU
---------------------------------------------------------------------
1 ~DEMOC+~ETHFRACT+~GEOCON 0.575 0.362 0.841 0.000
2 ~ETHFRACT+~GEOCON+~POLDIS 0.508 0.288 0.790 0.000
3 ~DEMOC+~ETHFRACT+~POLDIS+NATPRIDE 0.526 0.334 0.892 0.000
4 ~DEMOC+~GEOCON+~POLDIS+~NATPRIDE 0.542 0.351 0.893 0.011
5 DEMOC+ETHFRACT+~GEOCON+POLDIS+NATPRIDE 0.567 0.388 0.945 0.044
---------------------------------------------------------------------
The resulting rows in the output are equivalent to the negated expressions from the component pims. Naturally, the same negated expression can be supplied manually, or obtained directly using the function negate():
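# plausibly, negating the first prime implicant before passing it to pof()
pof(negate("DEMOC*ETHFRACT*GEOCON"), CVF$PROTEST, relation = "sufficiency")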
inclS PRI covS covU
-----------------------------------------
1 ~DEMOC 0.601 0.354 0.564 0.137
2 ~ETHFRACT 0.614 0.337 0.661 0.210
3 ~GEOCON 0.601 0.246 0.317 0.029
4 expression 0.575 0.362 0.841 -
-----------------------------------------
The command from the previous example is unnecessarily complicated, however. As shown in chapters 5 and 6, the function pof() accepts fully self-contained string expressions, using the left arrow notation "<-" for the necessity relation and the right arrow "->" for the sufficiency relation:
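# presumed equivalent command, with a self-contained string expression
pof("~DEMOC + ~ETHFRACT + ~GEOCON -> PROTEST", data = CVF)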
inclS PRI covS covU
-----------------------------------------
1 ~DEMOC 0.601 0.354 0.564 0.137
2 ~ETHFRACT 0.614 0.337 0.661 0.210
3 ~GEOCON 0.601 0.246 0.317 0.029
4 expression 0.575 0.362 0.841 -
-----------------------------------------
All of these different examples show just how versatile this function is, accepting almost any type of input that carries, or can generate, set membership scores. And there is even more to reveal: its output is an object of class "pof" that has a dedicated print method.
For instance, the output of the minimization process usually prints the parameters of fit for the solution model(s), which usually reside in the component named IC (inclusion and coverage):
data(LF) # if not already loaded
ttLF <- truthTable(LF, "SURV", incl.cut = 0.7)
cLF <- minimize(ttLF, details = TRUE)
cLF$IC
inclS PRI covS covU
-----------------------------------------------
1 DEV*~URB*LIT*STB 0.809 0.761 0.433 0.196
2 DEV*LIT*IND*STB 0.843 0.821 0.622 0.385
-----------------------------------------------
M1 0.871 0.851 0.818
Since neither truthTable() nor minimize() specified showing the cases for each solution term, these are not printed. In this situation, either rerun the commands with the argument show.cases = TRUE or, perhaps even simpler, ask the printing function itself to show the cases:
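# the print method has its own show.cases argument
print(cLF$IC, show.cases = TRUE)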
inclS PRI covS covU cases
-------------------------------------------------------------------
1 DEV*~URB*LIT*STB 0.809 0.761 0.433 0.196 FI,IE; FR,SE
2 DEV*LIT*IND*STB 0.843 0.821 0.622 0.385 FR,SE; BE,CZ,NL,UK
-------------------------------------------------------------------
M1 0.871 0.851 0.818
11.2 Negate expressions
The discussion about Boolean expressions would not be complete without a more in-depth look at the different ways to negate such expressions. Previous chapters have already used the function negate() in several places, but it was always in the context of another topic being introduced, so it never got properly explained.
The structure of the function is rather simple and self-explanatory:
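# approximate structure (argument names as described below, defaults assumed)
negate(expression, snames = "", noflevels, simplify = TRUE)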
The first argument is a Boolean expression which, as previously shown, needs the set names if the causal conditions in the expression are not separated by formal conjunctive "*" or disjunctive "+" signs. To negate multi-value expressions, it needs to know the number of levels, and all of these arguments are passed to the underlying function translate() that does the heavy lifting.
The final argument simplify passes the output expression to the function simplify() (by default), but that is left to the discretion of the user.
The negation of Boolean expressions is possible thanks to Augustus De Morgan, a British mathematician who lived in the 19th century. De Morgan formulated two laws that have passed the test of time and are widely used in formal logic and computer programming:
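\[\begin{equation} \sim(A + B) = \phantom{.}\sim\!A \cdot \sim\!B \qquad\qquad \sim(A \cdot B) = \phantom{.}\sim\!A + \sim\!B \end{equation}\]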
In plain language, these two laws are translated quite simply as:
- the negation of a disjunction is a conjunction of negations
- the negation of a conjunction is a disjunction of negations
An example of the first law could be: “not(smartphone or tablet)”, which after negation is easily understood as: “not smartphone and not tablet”.
The second law is a bit more tricky but just as simple. For instance, a Boolean expression such as "not young male" is also understood as "not(young and male)", which after negation becomes: "not young or not male". The person can be older, or female, just not both young and male at the same time.
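# for instance, negating a conjunction yields a disjunction of negations
negate("A*B")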
N1: ~A + ~B
Conjunctions in Boolean expressions are formulated in two ways. Usually, they are expressed with a star sign "*", but there are situations, especially when the set names consist of a single letter, where conjunctions are expressed by a simple juxtaposition of the letters:
N1: ~AC + ~B~C
In this example, it is clear to a human reader that there are three sets A, B and C, but this is far from trivial for a computer program. All functions dealing with Boolean expressions in package QCA can detect such situations most of the time, but it is good practice to always use a star sign for conjunctions, even when the meaning seems very clear.
The function negate() does more than interpret a simple Boolean expression: it can also detect an object resulting from the minimization process (which has a class QCA_min). When the input is such an object, it searches through all its solution components and negates all of them:
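# hedged reconstruction: the binary crisp version of the Lipset data
# plausibly yields the parsimonious model DEV*STB shown below
data(LC)
pLC <- minimize(truthTable(LC, outcome = "SURV"), include = "?")
negate(pLC)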
M1: DEV*STB
N1: ~DEV + ~STB
It can even detect and negate intermediate solutions, as shown in section 8.9:
data(LF)
ttLF <- truthTable(LF, outcome = SURV, incl.cut = 0.8)
iLF <- minimize(ttLF, include = "?", dir.exp = "1,1,1,1,1")
negate(iLF)
M1-C1P1-1: DEV*URB*LIT*STB + DEV*LIT*~IND*STB
N1: ~DEV + ~LIT + ~STB + ~URB*IND
This output has a specific codification, C1P1-1, which means that at the combination between the first (and only) conservative solution C1 and the first (and only) parsimonious solution P1, a single intermediate solution (the final -1) is generated.
But there are situations with many intermediate solutions, in which case there will be multiple numbers in place of -1, and all of those for all possible combinations of conservative and parsimonious solutions, which can themselves be generated in multiple numbers depending on the level of ambiguity found in the data.
11.3 Factorize expressions
Factorizing a SOP (sum of products) expression leads to what might be seen as its opposite, a POS (product of sums) expression: it finds all possible combinations of common factors in a given expression.
The common factors in a SOP expression, especially when the expression describes a sufficiency relation, are all INUS conditions. The need to factorize an expression is therefore a quest to find those INUS conditions which appear in as many solution terms as possible, revealing their relative importance for the presence of the outcome.
Similar to the negate() function, the conditions in the expression are treated in alphabetical order, unless otherwise specified using the same argument snames, or separated using the * sign:
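# reconstructed input (the original term order may have differed)
factorize("~one*two*~four + ~one*three + three*~four")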
F1: ~four*(three + ~one*two) + ~one*three
F2: ~four*three + ~one*(three + ~four*two)
F3: ~four*~one*two + three*(~four + ~one)
Wherever possible, the input can be transformed into a complete POS expression, where all common factors are combined with exactly the same other INUS conditions. It is not always guaranteed that such a POS expression exists, but it can be searched for by activating the argument pos:
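# presumed command: the input is the expansion of the POS output below
factorize("~a*~c + ~a*d + ~b*~c + ~b*d", pos = TRUE)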
F1: (~a + ~b)(~c + d)
Naturally, such a factorization is also possible using a minimization object directly, in which case it is applied to every model in the output:
data(CVF)
pCVF <- minimize(CVF, outcome = PROTEST, incl.cut = 0.8,
include = "?", use.letters = TRUE)
factorize(pCVF)
M1: ~E + ~A*B*D + A*B*C + A*C*D
F1: ~E + ~A*B*D + A*C*(B + D)
F2: ~E + A*C*D + B*(~A*D + A*C)
F3: ~E + A*B*C + D*(~A*B + A*C)
M2: ~E + ~A*B*D + A*B*~D + A*C*D
F1: ~E + ~A*B*D + A*(B*~D + C*D)
F2: ~E + A*C*D + B*(~A*D + A*~D)
F3: ~E + A*B*~D + D*(~A*B + A*C)
M3: ~E + A*B*C + A*C*D + B*C*D
F1: ~E + B*C*D + A*C*(B + D)
F2: ~E + A*C*D + B*C*(A + D)
F3: ~E + C*(A*B + A*D + B*D)
F4: ~E + A*B*C + C*D*(A + B)
M4: ~E + A*B*~D + A*C*D + B*C*D
F1: ~E + B*C*D + A*(B*~D + C*D)
F2: ~E + A*C*D + B*(A*~D + C*D)
F3: ~E + A*B*~D + C*D*(A + B)
Finally, the negation of a certain model might be interesting to factorize:
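# hedged sketch: negating the first model (M1 above), then factorizing it
factorize(negate("~E + ~A*B*D + A*B*C + A*C*D"))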
F1: E*(~A*~B + ~A*~D + A*~C + ~B*~D)
F2: ~A*E*(~B + ~D) + E*(A*~C + ~B*~D)
F3: ~B*E*(~A + ~D) + E*(~A*~D + A*~C)
F4: ~D*E*(~A + ~B) + E*(~A*~B + A*~C)
11.4 More parameters of fit
The parameters of fit traditionally revolve around inclusion and coverage (raw and unique), plus PRI for sufficiency and RoN for necessity. These are the standard parameters, as far as the current best practices indicate.
But standards are always subject to possible change. Through sheer testing by the community, special situations can be found where the standard measures do not seem to provide all the answers. Haesebrouck (2015), for instance, believes that QCA's most heavily used parameter of fit - the consistency measure - is significantly flawed, because:
…inconsistent cases with small membership scores exert greater bearing on the consistency score than inconsistent cases with large membership scores. In consequence, the measure does not accurately express the degree to which empirical evidence supports statements of sufficiency and necessity.
Starting from the classical consistency formula from equation (6.3):
\[\begin{equation} inclS_{X\phantom{.}\rightarrow\phantom{.}Y\phantom{.}} = \frac{\sum{min(\mbox{X}, \phantom{.}\mbox{Y})}}{\sum{\mbox{X}}} \end{equation}\]
Haesebrouck observes that cases with a large membership score in X have a greater impact on the consistency measure than cases with a low membership score in X, even if their inconsistent parts are equal: X = 1 and Y = 0.75 versus X = 0.25 and Y = 0. The impact of the inconsistent part (0.25) in the first situation is lower than the impact of an equal inconsistent part (0.25) in the second situation. He also observes that the membership score in X can be redefined as the sum of the consistent and inconsistent parts:
\[\begin{equation} inclS_{X\phantom{.}\rightarrow\phantom{.}Y\phantom{.}} = \frac{\sum{min(\mbox{X}, \phantom{.} \mbox{Y})}}{\sum{(min(\mbox{X}, \phantom{.}\mbox{Y})\phantom{.} + \phantom{.}max(\mbox{X} - \mbox{Y}, 0))}} \end{equation}\]
The consistent part of X is equal to the numerator of the fraction, while the inconsistent part is either equal to 0 (if X is completely consistent) or equal to the difference between X and Y, when X is larger than Y. As a consequence, he proposes to increase the impact of the inconsistent part for higher membership in X, by first multiplying it with the value of X, then taking the square root:
\[\begin{equation} inclH_{X\phantom{.}\rightarrow\phantom{.}Y\phantom{.}} = \frac{\sum{min(\mbox{X}, \phantom{.}\mbox{Y})}}{\sum{(min(\mbox{X}, \mbox{Y})\phantom{.} +\phantom{.} \sqrt{max(\mbox{X} - \mbox{Y}, \phantom{.} 0)\cdot{\mbox{X}}})}} \tag{11.1} \end{equation}\]
This is a possible improvement over the standard formula, and a candidate for a new standard if it gains enough recognition and usage in the academic community. However, it is difficult for any new candidate to replace the state of the art, since most software offers only the standard measures.
For any such situation, the function pof() offers the possibility to add alternative measures via the argument add. It accepts a function, minimally defined with two parameters x and y, that returns any sort of inner calculation based on these two parameters.
The example below defines such a function and assigns it to an object called inclH. The object is served to the function pof(), which augments the output with the new function name:
inclH <- function(x, y) {
sum(fuzzyand(x, y)) /
sum(fuzzyand(x, y) + sqrt(fuzzyor(x - y, 0)*x))
}
pof("DEV -> SURV", data = LF, add = inclH)
         inclS   PRI   covS  covU  inclH
----------------------------------------
1  DEV   0.775  0.743  0.831   -   0.720
----------------------------------------
If more such functions need to be provided, the argument add also accepts a list object, containing one function per component. The names of the components become the names of the new parameters of fit in the augmented output, trimmed to the first five characters.
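A minimal sketch of this list input, with illustrative component names and an arbitrary second measure:

measures <- list(inclH = inclH,
                 amean = function(x, y) mean(fuzzyand(x, y))) # illustrative
pof("DEV -> SURV", data = LF, add = measures)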
While still at the parameters of fit, for most situations a SOP (sum of products) expression should be sufficient, when referring to specific column names from a dataset. This is a rather new improvement of this function; previously, such expressions could also be provided in their matrix equivalent.
For instance, an expression such as DEV\(\cdot{\sim}\)IND + URB\(\cdot\)STB could also be written in the matrix form:
DS <- matrix(c(1, -1, -1, 0, -1,
-1, 1, -1, -1, 1), ncol = 5, byrow = TRUE)
colnames(DS) <- colnames(LF)[1:5]
DS
DEV URB LIT IND STB
[1,] 1 -1 -1 0 -1
[2,] -1 1 -1 -1 1
This matrix uses a standard value of 1 for the presence of a condition, a value of 0 for the absence of the condition, and a value of -1 if the condition is minimized. In this example, since DEV\(\cdot{\sim}\)IND + URB\(\cdot\)STB is the only parsimonious solution for the outcome SURV (at a 0.75 inclusion cut-off), its parameters of fit could (in previous versions of the QCA package) be obtained through this matrix, but that is equivalent to the direct SOP expression:
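# presumed equivalent command using the direct SOP expression
pof("DEV*~IND + URB*STB -> SURV", data = LF)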
inclS PRI covS covU
-----------------------------------------
1 DEV*~IND 0.815 0.721 0.284 0.194
2 URB*STB 0.874 0.845 0.520 0.430
3 expression 0.850 0.819 0.714 -
-----------------------------------------
As mentioned in section 11.1, the functions minimize() and superSubset() generate in their output the components pims (prime implicants membership scores) and coms (combination membership scores), which can be used as input for the parameters of fit:
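# presumably produced along these lines (the 0.75 cut-off mentioned above;
# the object name pLF is illustrative)
pLF <- minimize(truthTable(LF, outcome = "SURV", incl.cut = 0.75),
                include = "?")
head(pLF$pims)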
DEV*~IND URB*STB
AU 0.27 0.12
BE 0.00 0.89
CZ 0.10 0.91
EE 0.16 0.07
FI 0.58 0.03
FR 0.19 0.03
These are the membership scores in the sets defined by the prime implicants DEV\(\cdot{\sim}\)IND and URB\(\cdot\)STB, which can be verified using the function compute():
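# verifying the first column of the pims component
compute("DEV*~IND", data = LF)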
[1] 0.27 0.00 0.10 0.16 0.58 0.19 0.04 0.04 0.07 0.72 0.34 0.06 0.02
[14] 0.01 0.01 0.03 0.33 0.00
The component pims from the output of the function minimize() can be used to calculate the parameters of fit directly, with the same results:
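# presumed command: the pims component as input, sufficiency relation
pof(pLF$pims, LF$SURV, relation = "sufficiency")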
inclS PRI covS covU
---------------------------------------
1 DEV*~IND 0.815 0.721 0.284 0.194
2 URB*STB 0.874 0.845 0.520 0.430
---------------------------------------
11.5 XY plots
An XY plot is a scatterplot between two objects measured as fuzzy sets. It is a visualization tool to inspect the extent to which one set is a subset of the other, in order to assess the sufficiency and/or necessity of one set relative to the other.
There are multiple ways of obtaining such a plot, but the most straightforward is to use the function XYplot() from package QCA. It has the simplest possible structure of arguments but, most importantly, it offers all the flexibility and the wealth of parameters from the native function plot(), including all graphical parameters available via ?par.
The function has the following structure:
XYplot(x, y, data, relation = "sufficiency", mguides = TRUE,
jitter = FALSE, clabels = NULL, enhance = FALSE, model = FALSE, ...)
Similar to all other functions in this package, it has a set of default values, but the only one with a visible effect is the relation argument, calculating the parameters of fit for the sufficiency relation. The total number of parameters is kept to a minimum, and most of them are deactivated by default (not jittering the points with the argument jitter, not enhancing the XY plot via the argument enhance, etc.)
The argument mguides adds two lines (a horizontal and a vertical one) through the middle of the XY plot, dividing the plot region into four areas corresponding to a 2 \(\times\) 2 cross-table. This is a very useful way to visualize where the fuzzy coordinates of the points would be located among the binary crisp cells of a table.
The two main arguments of this function are x and y, and they accept a variety of things as possible inputs. The most straightforward interpretation of these arguments is two numerical vectors containing fuzzy values between 0 and 1. The first (x) is going to be used for the horizontal axis, and the second (y) for the vertical axis.
For a first exemplification, we are going to use the same dataset CVF that was already loaded in section 11.1. The outcome of this data is called PROTEST (more exactly, ethnopolitical protest), and we might be interested in which of the causal conditions might be sufficient. A first condition to draw attention is called NATPRIDE (national pride), with the following plausible hypothesis: the absence of national pride is a sufficient condition for ethnic protests.
This hypothesis could be visualized with the following command:
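# a plausible form of the command behind figure 11.1
XYplot(1 - CVF$NATPRIDE, CVF$PROTEST)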
As can be seen in figure 11.1, most of the points are located in the upper left part of the plot, indicating sufficiency. This is confirmed by a relatively high inclusion score of 0.899, reported right below the "Sufficiency relation" title, with a similarly high PRI score of 0.807, both confirming that an absence of national pride is indeed associated with ethnopolitical protests.
However, it cannot be concluded that this is a causal relationship, because the coverage is rather small (0.597), which means that although clearly sufficient, the absence of national pride is not a necessary condition for the appearance of ethnopolitical protests.
Returning to the command, the condition NATPRIDE was negated by subtraction from 1 (as usual in fuzzy sets), and this always works when the input is already represented by numerical vectors. But the same result would have been obtained with the following (unevaluated) command:
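# presumed equivalent command, using column names from the data
XYplot(1 - NATPRIDE, PROTEST, data = CVF)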
In this example, the input for the argument x is not a numerical vector, but the name of the condition (NATPRIDE), which is to be found in the data called CVF (hence the need to specify the additional argument data).
Similar to many other functions in package QCA, especially the function pof(), these column names can be provided with or without double quotes, and they can be negated using the 1- notation (subtracting the set membership scores from 1) or using a preceding tilde.
This plot is interesting for its usage of row names as case labels added to the plot. A bit off-topic, but this is an example of how not to use row names. Where possible (and most QCA data refer to countries as cases), the best approach is to use a two letter abbreviation for each case and explain somewhere in the help file what each abbreviation stands for.
For countries this is rather easy, because there are already established standards for the two letter names of the countries. But even here that would have been possible, for instance using CB instead of CroatsBosnia. The net effect of using very long names is their likely overlap, especially in datasets with a large number of cases. The argument jitter adds some random noise to the position of each point, to avoid a complete overlap (especially when the values are very close), but it does not completely solve the problem.
As the case labels in this plot are very long and very dense, perhaps a better approach is to use the row numbers as case identifiers. This is the purpose of the argument clabels, which can be a vector of case labels with the same length as the numeric vectors specified via the arguments x and y. Alternatively, it can be a logical vector with the same length as the number of rows in the data argument, to add only those row names where the vector has a true value.
For this plot, the argument clabels has been supplied with a vector of row numbers instead of country names, to avoid overlapping labels, especially for very close points. R has a native, base function called jitter(), which in turn has two additional parameters called factor and amount to control how much the points are randomly jittered. These arguments are not part of the formal arguments of the function XYplot(), but they can still be used because they are automatically captured and interpreted via the three dots ... argument.
This argument with three dots ... is a special one; among others, it can be used to access all graphical parameters from the native plot function. The most useful, to give just a few examples, are: col to add specific colors to the points, bg to fill the points with specific colors, pch (point character) to plot the points using various other signs like squares, triangles or diamonds, and cex (character expansion, defaulted to 0.8) to control the size of the points and their associated labels.
The figure below is such an example, mimicking the enhanced plot presented by C. Q. Schneider and Rohlfing (2016) to aid process tracing, by using different point characters for the various quadrants of the plot. The input, to demonstrate a complete range of possibilities for this function, is a SOP expression:
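# hedged guess at the command for this figure (the exact SOP expression
# is an assumption, a conjunction from the CVF analysis)
XYplot("DEMOC*ETHFRACT*GEOCON", "PROTEST", data = CVF, enhance = TRUE)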
The choice of point characters is very close to the description from the original paper by Schneider and Rohlfing:
- the dark filled points in zone 1 (upper right quadrant, above the diagonal) indicate the typical cases
- the cross, also in zone 1, indicates the most typical case (closest to the main diagonal)
- the light filled points in zone 2 (upper right quadrant, below the diagonal) indicate the deviant cases consistency in degree
- there are no deviant cases consistency in kind to show in zone 3 (lower right quadrant), otherwise they would have a diamond shape
- the triangles in zone 4 (entire lower left quadrant) indicate the individually irrelevant cases
- the squares in zone 5 (upper left quadrant) indicate the deviant cases coverage
By default, the enhanced plot only displays the labels for the typical cases and for the deviant cases consistency in degree (zones 1 and 2), a specific choice for the sufficiency relation. When the expression is the result of a minimization process, however, the advice is to add the labels for the points in zones 4 and 5, where the expression has a set membership score below 0.5.
sol <- "natpride + DEMOC*GEOCON*POLDIS + DEMOC*ETHFRACT*GEOCON"
XYplot(sol, "PROTEST", data = CVF, enhance = TRUE, model = TRUE)
For this purpose, the function XYplot() has another logical argument called model, which produces effects only when the enhance argument is activated. The XY plot above corresponds to one of the parsimonious solutions of the CVF data, using an inclusion cut-off value of 0.85 to construct the truth table.
Activating the argument model is only necessary when the input is an expression; otherwise, the function XYplot() accepts entire objects containing the result of a minimization. For instance, since there are two parsimonious solutions, the same output could have been obtained by using this equivalent command:
ttCVF <- truthTable(CVF, PROTEST, incl.cut = 0.85)
pCVF <- minimize(ttCVF, include = "?")
XYplot(pCVF$solution[1], CVF$PROTEST, enhance = TRUE)
In this command, the function XYplot() automatically detects that the input is a solution model (from the containing object pCVF), so there is no need to specifically switch model = TRUE to produce the same figure as above. For the second solution, only the number between the square brackets must be changed: [2]. For the intermediate solutions, the minimization object has a component called i.sol, which contains all combinations of conservative and parsimonious solutions (e.g. C1P1), each containing its own solutions.
All these figures have been produced using predefined settings, but users are otherwise free to create their own choice of points, or to decide which cases from which zones should be labeled. In this direction, it is perhaps worth remembering that the argument clabels also accepts a logical vector with the same length as the number of rows in the data. In this scenario, the XY plot will print the case labels only for those rows where the argument clabels has a true value, thus giving direct control over which points should be labeled.
When the argument clabels is logical, the case labels for the points will be taken from the row names of the data (if provided); otherwise, an automatic sequence of case numbers will be produced. This is clearly a manual operation (different from the automatic settings above), but on the other hand it gives full flexibility over all possible choices with respect to the plot. Below is a possible final example of an XY plot, using all these features:
sol <- compute(sol, data = CVF) # turn expression to set membership scores
col <- rep("darkgreen", nrow(CVF)) # define a vector of colors
col[sol < 0.5 & CVF$PROTEST < 0.5] <- "red" # zone 4
col[sol < 0.5 & CVF$PROTEST >= 0.5] <- "blue" # zone 5
clabels <- logical(nrow(CVF))
clabels[c(2, 5, 11)] <- TRUE # only these three labels to print
XYplot(sol, CVF$PROTEST, enhance = TRUE, clabels = clabels,
col = col, bg = col, xlab = "Solution model")
Naturally, the graphical user interface has a dedicated menu to construct such diagrams:
Graphs / XY plots
The dialog is less sophisticated compared to the written command (especially with respect to the very latest enhancements), but on the other hand it compensates with much more interactivity. Figure 11.7 shows all possibilities, from negating conditions and outcome with a single click, changing the relation using the dedicated radio button below the outcome selection box, to jittering points and rotating their labels to improve visibility.
In the default setup, the points are not labeled, but respond with a label upon a mouse-over event. The entire dialog is designed for a quick look over the possible set relations between one condition and one outcome (from the same dataset), with their associated parameters of fit.
Future versions of the dialog should include selecting minimization objects (not just data frames) and, of course, enhance the plot to enable process tracing. Another possible idea is to provide an export button to SVG, to a bitmap version (PNG or BMP, among others), or even to the most portable, a PDF version of the plot.
11.6 Venn diagrams
Venn diagrams are another type of visualization tool, particularly useful for the truth table analysis. As it turns out, there are exactly as many intersections in a Venn diagram as there are rows in a truth table.
This is by no means a coincidence, since both truth tables and Venn diagrams gravitate around the same number \(2^k\), where k is the number of causal conditions. Figure 11.8 is the simplest example, with two sets and \(2^2 = 4\) intersections.
It should be stated, however, that truth tables can have more than \(2^k\) rows when at least one of the causal conditions has more than two levels; therefore, truth tables can be visualized with a Venn diagram only for binary crisp sets or fuzzy sets (in the latter case, the fuzzy values are transformed to binary crisp values anyway).
There are many possibilities to draw Venn diagrams in R, using various functions from packages gplots, eVenn, Vennerable and, probably one of the best, package VennDiagram. An overview of such possibilities was once written by Murdoch (2004), but things have changed a lot in the meantime. A more recent presentation was written by Wilkinson (2011).
This section is going to introduce a different package with the same name, called venn, version 1.5 (Dusa 2017). Perhaps confusingly, it has exactly the same name as the one written and described by Murdoch, but his version was either not submitted to CRAN (the official R package repository), or it was abandoned and removed many years before this new package appeared.
Either way, the initial attempt by Murdoch could only draw up to three sets, and more recent packages can deal with up to five sets, while the current package venn can draw up to seven sets. For up to 3 sets the shapes can be circular, but using circles for more than 3 sets is not possible. For 4 and 5 sets the shapes can be ellipsoidal, although the default shapes are even better, while for more than 5 sets the shapes cannot be continuous (they might be monotone, but not continuous). The 7 sets diagram is called "Adelaide" (Ruskey and Weston 2005).
To complete the presentation of available R packages, it is worth mentioning venneuler and eulerr, which facilitate drawing not only Venn but also Euler diagrams. While Venn diagrams always have \(2^k\) intersections, Euler diagrams are not bound by this restriction and can draw sets completely outside of each other (if there is no intersection between them). Euler diagrams are not suitable for QCA research, but one of their interesting features is proportional area drawing (larger sets or larger intersections appearing larger in the plot).
The complete syntax for the function venn()
is presented below:
venn(x, snames = "", ilabels = FALSE, ellipse = FALSE, zcolor = "bw",
opacity = 0.3, plotsize = 15, ilcs = 0.6, sncs = 0.85, borders = TRUE,
box = TRUE, par = TRUE, ggplot = FALSE, ...)
There are a number of arguments to describe, one of them having already been used above. ilabels stands for intersection labels, and adds a unique number to each intersection; for instance, 12 means the intersection between the first and the second sets, the number 1 means what is unique to the first set, and the number 0 means what is outside all sets. When a four set diagram is drawn, the intersection of all four sets will have the number 1234.
Unlike previous approaches that use complicated mathematical approximations to draw (proportional) intersections between sets, the package venn employs a static approach. All intersections between all combinations of up to 7 sets are precalculated and shipped with the package as a dataset containing x and y coordinates for every point defining a particular shape. This makes the drawing extremely fast, with the added advantage that the predefined shapes maximize the area of each intersection to the largest extent possible.
The first argument x can be many things, but most importantly (and similar to the sibling package QCA) it can be a SOP - sum of products - expression, where intersections (conjunctions) are indicated with an asterisk "*" sign, while unions (disjunctions) are indicated with a plus "+" sign. For this reason, it should be combined with the argument snames to determine the total number of sets (and their order) to be drawn in the diagram.
Like all Venn diagram functions, venn() can also use colors to highlight certain intersections, and even use transparencies to show when two areas overlap (the default value of the argument opacity is 0.3, where more opacity means less transparency). An example could be the union of two sets A and B, from a universe containing 4 sets:
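# presumed command (the zone color is an assumption)
venn("A + B", snames = "A, B, C, D", zcolor = "red")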
The predefined color of the union can be changed using the argument zcolor (zone color). It is less an intersection color than a "zone" color, because in the example above the union of A and B contains multiple unique intersections, and the entire zone is the union of all these unique shapes.
The predefined value of the zcolor argument is "bw" (black and white), used when no expression is given and only the shapes of the sets are drawn, as in figure 11.8. While zcolor accepts a vector of colors with the same length as the total number of sets, it has yet another predefined value called "style", which uses a set of unique and distinct colors generated by the excellent function colorRampPalette() from package grDevices.
For 4 and 5 sets, there are alternative ellipsoidal shapes that are produced by activating the argument ellipse, but from 6 sets onwards the shapes are monotone, yet not continuous.
Although sets are usually understood as causal conditions in QCA (each column of a data frame representing a set), the function venn() also accepts lists with unequal numbers of elements. Each set is represented by a list component, and the task is to calculate all possible intersections where the sets have common elements, a quite normal scenario in bioinformatics. A possible example is:
set.seed(12345)
x <- list(First = 1:20, Second = 10:30, Third = sort(sample(25:50, 15)))
x
venn(x) # argument counts is automatically activated
When the input is a list, the function invisibly returns (not printed on the screen, but can be assigned to an object) the truth table and the counts for each intersection, with an attribute called "intersections" containing their elements. There are 11 common elements between the First and the Second set, and 4 elements between the Second and the Third set. No common elements exist between all three sets, and the number 0 is not printed in such situations, to distinguish it from the value 0 of the argument ilabels (which means elements outside all sets).
The counts for each intersection (similar to the intersection labels) have a small default font size, despite the large space available. That is necessary to have a uniform font size for all possible intersections, large and small: the more sets are added to the diagram, the smaller the intersections become. But this is not a restriction, since users can manually change the size of these labels using the argument ilcs, which has an effect similar to the cex argument from the base package, applied to the intersection labels (il). There is also another related argument called sncs, to adjust the font size of the set names. The actual base cex argument is used to control the width of the set border lines.
Similar to the function XYplot() presented in the previous section, the function venn() has a final three dots ... parameter to allow the usage of all other graphical parameters from the base plot functions. For instance, one could customize a plot by changing the line type via the parameter lty, as well as its color via the parameter col:
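# illustrative values for the line type and color
venn(5, lty = 2, col = "navyblue")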
This is one of the best shapes that accommodates 5 sets in a single diagram. It is also possible to use an ellipse as a base shape, obtaining some intersections that are very small in size, while others are very large. The light bulb shape is unique and currently offered only by package venn, enlarging the size of all intersections to the maximum extent possible.
All of these diagrams look very nice, but the most important feature of the function venn() is its ability to recognize a truth table object and plot all of its intersections according to the output value. Using the object ttCVF produced in the previous section (a truth table of the CVF data, using a 0.85 inclusion cut-off), the command to plot this object is as simple as:
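venn(ttCVF)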
When plotting a truth table, the choice of colors is fixed, in order to have an easily recognizable standard: the colors are associated with the positive output configurations (green), the negative output configurations (orange), contradictions (blue), and the rest of the intersections with no empirical evidence (remainders, white). The legend explaining these colors is drawn at the bottom.
The argument opacity is optional, but it can use the inclusion scores of each configuration when drawing the colors, which means a higher inclusion makes the color of a particular intersection more opaque (less transparent). This of course means that positive output configurations will always be more opaque, and more significant differences are noticeable only between the orange intersections of the negative output configurations, where the range of inclusion scores is larger.
In figure 11.12, many of the intersections from the set NATPRIDE are associated with the negative output, while many of the intersections from the set GEOCON are associated with the positive output. In the middle part, there are intersections (configurations) belonging to multiple sets that are associated with either one of the outputs.
The counts in each intersection represent the number of cases associated with that configuration, but they are not important when deriving the final solution. It is the configuration itself, not the number of cases associated with it, that has a contribution to the final solution(s), as long as the configuration passes the frequency cut-off. However, this is useful information to visualize how the empirical information from the dataset is distributed over the truth table configurations. The minimized solutions are:
M1: ~NATPRIDE + DEMOC*GEOCON*POLDIS + (DEMOC*ETHFRACT*GEOCON)
-> PROTEST
M2: ~NATPRIDE + DEMOC*GEOCON*POLDIS + (DEMOC*ETHFRACT*~POLDIS)
-> PROTEST
In a similar vein to the function XYplot(), and consistent with the general approach in package QCA, the function venn() is built to automatically recognize a solution if it comes from a minimization object (such as pCVF). As a consequence, there is no need to specify the set names, because they are taken directly from the minimization object:
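# plausibly the same command repeated at the end of this section
venn(pCVF$solution[1], zcol = "#ffdd77, #bb2020, #1188cc")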
In figure 11.13, the first term of the solution (natpride, the absence of national pride, i.e. the area outside the set NATPRIDE) is drawn using a pale green color, the intersection DEMOC*GEOCON*POLDIS in red, and DEMOC*ETHFRACT*GEOCON in blue. In principle, the information from this diagram confirms the previous one, where NATPRIDE was mainly associated with a negative output (here, the absence of NATPRIDE is sufficient for the presence of the outcome), and GEOCON is present in both of the other conjunctions from the solution. There is one particular area where all colors overlap: at the intersection between the first four sets, outside the fifth.
In the graphical user interface, a Venn diagram can be produced via the following menu:
Graphs / Venn diagram
In the absence of a truth table object, this menu will open an empty dialog. It first starts by looking for such objects in the workspace, and creates a diagram only if they exist (from the truth table dialog, the option Assign has to be checked, with a suitable name). If multiple such objects exist, the Venn diagram will be drawn for the most recent one. It is also worth mentioning that truth tables are also produced automatically by the minimization function (objects of class "qca" contain truth table objects of class "tt"), so they all count for the Venn diagram dialog.
Figure 11.14 is very similar to the previous figure 11.12, just produced in the graphical user interface. It has the intersection labels instead of the counts, but the most important difference is the ability to interactively explore which cases are associated with each intersection. On hovering the mouse over an intersection, an event is triggered showing a message containing the names of the associated cases. Such an event is naturally impossible in the normal R graphical window, and it demonstrates the complementary features that HTML based, Javascript events can bring to enhance the experience of normal R graphics.
The particular region that is hovered, at the intersection between the first four sets, is also the intersection where all three terms from the first solution overlap in figure 11.13. Perhaps interesting is the case associated with this intersection (AlbaniansFYROM), which is the same case identified as the most typical case in the enhanced XY plot from figure 11.4. Further research is needed to confirm whether this is a fact or just a coincidence, but the information seems to converge.
Custom labels
The function venn() has a predefined set of labels for the intersections, either numbers representing the sets to which an intersection belongs, or counts of how many cases belong to a certain intersection (configuration).
There are, however, situations when users need to add custom labels to various intersections, either with the names of the cases belonging to a specific configuration, or with any other text description. In order to do that, it is important to understand the conceptual difference between an intersection, a "zone" and an "area".
A zone is a union of set intersections. If a zone contains all intersections from a particular set, the zone is equivalent to the set itself. For instance, in figure 11.8, the set A has two intersections: A~B (what is inside A but not in B) and AB (the intersection between A and B). We could also say the set A is the union between A~B and AB (in other words, A~B + AB = A).
An area can have one or multiple zones, depending on the complexity of the Venn diagram, with more sets increasing the complexity. For instance, there are four sets in figure 11.9, and the area B~C (what is inside B but not in C) has two zones: the first consists of the intersections 2 and 12 (~AB~C~D + AB~C~D), and the second zone of the intersections 24 and 124 (~AB~CD + AB~CD). This happens because the set B is transversally sectioned by the set C. An area can also be specified by an entire solution, which by definition has multiple zones, one for each solution term.
Figure 11.13 could be improved by adding a label containing the cases for the solution term DEMOC*ETHFRACT*GEOCON, which is the second term from the first solution of the object pCVF. Such a label should be located in the diagram using a set of coordinates for the X and Y axes, therefore we need to calculate these coordinates:
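# a plausible reconstruction: identify the zone(s) of the term, then its
# centroid; the snames order follows the columns of the CVF data
coords <- unlist(getCentroid(getZones("DEMOC*ETHFRACT*GEOCON",
    snames = "DEMOC, ETHFRACT, GEOCON, POLDIS, NATPRIDE")))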
This is facilitated by the function getCentroid() from package venn, which returns a list of coordinates for each zone determined by the function getZones(). In this example, there is a single zone for this particular solution term, and the result is unlisted to obtain a vector of two numbers for the coordinates.
From the inclusion and coverage scores table, we can see the cases associated with this term: "HungariansRom", "CatholicsNIreland", "AlbaniansFYROM" and "RussiansEstonia". Having the coordinates of the centroid, adding the label is now a simple matter of:
venn(pCVF$solution[1], zcol = "#ffdd77, #bb2020, #1188cc")
cases <- paste(c("HungariansRom", "CatholicsNIreland", "AlbaniansFYROM",
"RussiansEstonia"), collapse = "\n")
text(coords[1], coords[2], labels = cases, cex = 0.85)
The specification of collapse = "\n" in the function paste() above makes sure that all four cases are printed below each other (the string "\n" is interpreted similarly to pressing the Enter key after each case name). The argument cex adjusts the font size of the text in the label, according to circumstances.