While we developed the Genderartnet map, we worked in parallel with Anne-Laure Buisson, a statistician, who used her skills to extract information and relationships from the texts collected from the artists' URLs. Anne-Laure applied a method called Principal Component Analysis (PCA) to the dataset. Together with the designer Ludivine Loiseau, she created two graphs displaying the results of her research. Click on the thumbnails to view them comfortably.
To appreciate the richness of these graphs, you may want some background on the technique and on the PCA conventions that appear in these maps. This is why we describe below the general principle of PCA and have compiled some information to help you find your way.
The Principal Component Analysis (PCA)
To get to know the logic of PCA, one can think of it as a way to choose the most appropriate angle from which to take a picture. If one wants to photograph a table so that it is easy to recognize as such, one may choose to take the picture from the side of the object rather than from above. The real object, which exists in three dimensions, is reduced to an image in two dimensions, but we lose less information when we choose the right angle of view.
PCA consists of reducing an object that exists in thousands of dimensions to a graph in two dimensions. The result is necessarily imperfect, but it is legible.
PCA starts from the idea that it is possible to represent the artists as points in a multidimensional space. Every word constitutes one dimension of this space, and an artist's coordinate on that axis is equal to the number of times she uses the word. Here, as the lexicon extracted from the artists' websites is very large, one can imagine a space with thousands of dimensions, as many as there are words in the lexicon. Of course, it is impossible to “see” this space; one can only try to imagine it. The sole objective of PCA is therefore to find a representation of this cloud of points, as complete as possible, in only two dimensions: to keep a maximum of information while summarizing it enough to make it legible.
To represent the semantic relationships between artists, we have chosen one criterion: the words they use the most on their web pages. We have therefore tried to summarize these thousands of words/dimensions in order to find affinities (common characteristics) between the different words, to see which words are used differently or, on the contrary, in the same way by different artists, and to condense this disparate information. The software used to produce the PCAs and generate the graphs is the free software “R” (r-cran project).
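As a minimal sketch of this representation, and not the procedure actually used for the project, the Python snippet below turns a few texts into points in a word space. The sample texts and the helper `word_count_vectors` are invented for illustration, and splitting on whitespace is a deliberately naive assumption about how words are extracted.

```python
from collections import Counter

def word_count_vectors(texts):
    """Turn {artist: text} into a shared lexicon and one count vector per artist."""
    counts = {artist: Counter(text.lower().split()) for artist, text in texts.items()}
    # The lexicon is the sorted set of all words; each word is one dimension.
    lexicon = sorted(set().union(*counts.values()))
    # An artist's coordinate on a word's axis is the number of times she uses it.
    vectors = {artist: [c[word] for word in lexicon] for artist, c in counts.items()}
    return lexicon, vectors

# Hypothetical texts, standing in for the words collected from the artists' web pages.
texts = {
    "A": "smile smile house",
    "B": "free house free free",
}
lexicon, vectors = word_count_vectors(texts)
print(lexicon)       # ['free', 'house', 'smile']
print(vectors["A"])  # [0, 1, 2]
print(vectors["B"])  # [3, 1, 0]
```

With a real lexicon the vectors would have thousands of entries, most of them zero, which is the cloud of points that the PCA then summarizes.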
Illustration
We are going to illustrate all this using a small number of words and artists. The following table presents, for a totally imaginary example, the number of times each word is used by each artist; the artists are represented by letters.
|   | Gilroy | smile | politique | beautiful | free | house |
|---|---|---|---|---|---|---|
| A | 6 | 10 | 2 | 0 | 5 | 0 |
| B | 15 | 11 | 3 | 1 | 6 | 1 |
| C | 8 | 12 | 4 | 2 | 7 | 2 |
| E | 0 | 0 | 15 | 15 | 2 | 0 |
| F | 1 | 1 | 20 | 20 | 3 | 1 |
| I | 5 | 0 | 0 | 0 | 10 | 0 |
| L | 8 | 6 | 3 | 3 | 13 | 3 |
One can make the following observations about the table:
Artist A uses the word “Gilroy” 6 times, while F uses it only once.
Artists E and F use the words in a similar fashion.
If we now compare the columns of the words “smile” and “Gilroy”, we can see that they are used in nearly the same way by the different artists, as are “politique” and “beautiful”.
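These observations can also be checked numerically. The sketch below is written in Python rather than in R, the software used for the project, and uses the Pearson correlation as one possible measure of “used in the same way”, which is an assumption on our part and not something the table itself prescribes. The names `WORDS`, `COUNTS`, `pearson` and `column` are illustrative.

```python
import math

WORDS = ["Gilroy", "smile", "politique", "beautiful", "free", "house"]
# Counts copied from the imaginary example table (one row per artist).
COUNTS = {
    "A": [6, 10, 2, 0, 5, 0],
    "B": [15, 11, 3, 1, 6, 1],
    "C": [8, 12, 4, 2, 7, 2],
    "E": [0, 0, 15, 15, 2, 0],
    "F": [1, 1, 20, 20, 3, 1],
    "I": [5, 0, 0, 0, 10, 0],
    "L": [8, 6, 3, 3, 13, 3],
}

def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def column(word):
    """All artists' counts for one word."""
    i = WORDS.index(word)
    return [row[i] for row in COUNTS.values()]

# Words used in a similar way across artists give a value close to 1.
print("Gilroy / smile:       ", round(pearson(column("Gilroy"), column("smile")), 2))
print("politique / beautiful:", round(pearson(column("politique"), column("beautiful")), 2))
# Artists with similar word profiles also give a value close to 1.
print("E / F:                ", round(pearson(COUNTS["E"], COUNTS["F"]), 2))
```

A value near 1 means two profiles rise and fall together, a value near 0 means they are unrelated, and a negative value means one tends to be high where the other is low.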
We ran a PCA on this small table, and the graph below presents the results.
![]()
We can see how the remarks above are translated into the graph.
A is in the region of the graph where “Gilroy” is located, while F is at the opposite end.
E and F are quite close in the graph.
The words “Gilroy” and “smile” are very close, as are “politique” and “beautiful”.
Incidentally, we can make observations that weren't easy to make when reading the table: L and “house” are very close and quite isolated from the other words and artists. This is because, although L uses “house” only a few times (3 times), “house” is a rarely used word, so it strongly distinguishes L from the other artists. The farther a word is from the center of the graph, the more “characteristic” it is; and the farther an artist is from the center of the graph, the more she is defined by the words present in the same region, the center of the graph being an average zone. Of course, we also lose some information in this representation. For instance, we no longer know how many times each word is used. But this representation makes the results visible graphically and shows in particular the similarities between the words and between the artists, as well as the relationships between the words and the artists.
Linear combination
What do the axes produced by the PCA represent? Each of them is a linear combination of the original dimensions (the words collected at the start): the more important a word is for an axis, the larger the weight it receives. The weight of the different words can be expressed by the correlation between each axis of the PCA and each dimension/word, presented in the table below for the six axes of our example.
|   | Axis 1 | Axis 2 | Axis 3 | Axis 4 | Axis 5 | Axis 6 |
|---|---|---|---|---|---|---|
| Gilroy | -0.45 | 0.1 | 0.328 | 0.826 | 0.014 | 0.002 |
| smile | -0.38 | 0.17 | 0.634 | -0.472 | -0.440 | 0.094 |
| politique | 0.47 | -0.24 | 0.328 | 0.162 | -0.345 | -0.683 |
| beautiful | 0.48 | -0.29 | 0.225 | 0.207 | -0.266 | 0.724 |
| free | -0.38 | -0.44 | -0.507 | 0.060 | -0.634 | -0.017 |
| house | -0.24 | -0.79 | 0.273 | -0.152 | 0.463 | -0.027 |
A number close to zero indicates a near absence of correlation between the word and the axis, for instance for “Gilroy” on axis 2. Conversely, the stronger the value, whether positive or negative, for instance for “house” on the second axis, the stronger the relationship between the word and the axis. The important words in the definition of an axis are therefore those which have a strong correlation (either positive or negative). We see that for axis 1, the horizontal one, the words “Gilroy” and “smile”, negatively (on the left), and “politique” and “beautiful”, positively (on the right), are the most important, “house” having the least influence. But for the second axis, the vertical one, “house” has the strongest correlation coefficient, and it is the principal word that serves to distribute words and artists along this axis. For axes 3 and 4, “smile” and “Gilroy” have more importance. We can, in fact, read these values off the graph by projecting the words onto the axis concerned: the farther a word is from the center of the graph, the more influence it has in that dimension. Note that the orientation of the axes (up/down, right/left) is arbitrary and can vary from one software package to another.
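To make the idea of a linear combination concrete, here is a small sketch in Python (not the R code used for the project) that computes each artist's position on axis 1 as a weighted sum of her word counts. It assumes that each word column is centered and scaled to unit variance before weighting; the text does not say how the data were prepared, so this is only one plausible reading. The weights are those of the Axis 1 column in the table above, and the function names are illustrative.

```python
import math

WORDS = ["Gilroy", "smile", "politique", "beautiful", "free", "house"]
# Counts copied from the imaginary example table (one row per artist).
COUNTS = {
    "A": [6, 10, 2, 0, 5, 0],
    "B": [15, 11, 3, 1, 6, 1],
    "C": [8, 12, 4, 2, 7, 2],
    "E": [0, 0, 15, 15, 2, 0],
    "F": [1, 1, 20, 20, 3, 1],
    "I": [5, 0, 0, 0, 10, 0],
    "L": [8, 6, 3, 3, 13, 3],
}
# Weights of the six words on axis 1, copied from the table of axes.
AXIS1 = [-0.45, -0.38, 0.47, 0.48, -0.38, -0.24]

def standardized_rows(counts):
    """Center and scale every word column (assumed preprocessing step)."""
    cols = list(zip(*counts.values()))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(cols, means)]
    return {artist: [(v - m) / s for v, m, s in zip(row, means, sds)]
            for artist, row in counts.items()}

def axis_score(z_row, weights):
    """Linear combination: sum of weight * standardized count over all words."""
    return sum(w * z for w, z in zip(weights, z_row))

z = standardized_rows(COUNTS)
scores = {artist: axis_score(row, AXIS1) for artist, row in z.items()}
for artist, score in scores.items():
    side = "right" if score > 0 else "left"
    print(f"{artist}: {score:+.2f} ({side} of the center on axis 1)")
```

Under these assumptions E and F receive positive scores and A, B and C negative ones, which matches the reading of the graph: the heavy users of “politique” and “beautiful” end up on one side, the heavy users of “Gilroy” and “smile” on the other.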
How many axes?
Until now we have presented PCA as using only two axes, which is in fact a simplification made for pedagogical purposes. At the end of a PCA, we generally obtain as many axes as there are dimensions. The two axes presented above are the two best axes in terms of the quantity of information they carry. We could also use axes 3, 4, 5 and 6, which also contain information, but less and less of it.
The “eigenvalues” graph shows the percentage of information carried by each axis of the PCA: as we can see, the first axis contains a large part of the information (nearly 60%), whereas the last contains none.
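The share of information per axis can be sketched as each eigenvalue divided by the sum of all eigenvalues. The Python code below (again a stand-in for the R software actually used) computes these shares for the example table. It assumes the PCA is run on the correlation matrix of the word columns, which the text does not state, and it uses a hand-written Jacobi rotation routine only to avoid external dependencies; all function names are illustrative.

```python
import math

# Counts copied from the imaginary example table (rows A, B, C, E, F, I, L).
COUNTS = [
    [6, 10, 2, 0, 5, 0],
    [15, 11, 3, 1, 6, 1],
    [8, 12, 4, 2, 7, 2],
    [0, 0, 15, 15, 2, 0],
    [1, 1, 20, 20, 3, 1],
    [5, 0, 0, 0, 10, 0],
    [8, 6, 3, 3, 13, 3],
]

def correlation_matrix(rows):
    """Correlation between every pair of columns (words)."""
    n = len(rows)
    centered = [[v - sum(c) / n for v in c] for c in zip(*rows)]
    norms = [math.sqrt(sum(v * v for v in c)) for c in centered]
    return [[sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
             for cj, nj in zip(centered, norms)]
            for ci, ni in zip(centered, norms)]

def jacobi_eigenvalues(matrix, sweeps=100, tol=1e-18):
    """Eigenvalues of a symmetric matrix via cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Angle that zeroes the off-diagonal entry a[p][q].
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

def information_shares(eigenvalues):
    """Fraction of the total information carried by each axis."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]

shares = information_shares(jacobi_eigenvalues(correlation_matrix(COUNTS)))
for axis, share in enumerate(shares, start=1):
    print(f"axis {axis}: {100 * share:.1f}% of the information")
```

The shares always add up to 100%, and sorting the eigenvalues in decreasing order is what makes axis 1 the most informative and the last axis the least. Whether the printed figures match the graph exactly depends on the preprocessing assumption stated above.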

But in this example, we also see that axes 2 and 3 carry nearly the same quantity of information. Presenting axes 1 and 3 would therefore have been almost as informative as presenting axes 1 and 2. For information, we present below the biplots for axes 3 and 4, and for axes 5 and 6.