A Workflow for network analysis
Aurélien Goutsmedt and Alexandre Truc
Source:vignettes/workflow-network.Rmd
In this article, we go over the different steps to create a bibliometric network, manipulate it, prepare it for plotting, and finally plot it.
First step: creating the network and keeping the main component
As a point of departure, you need your bibliometric data to be prepared in a certain format:1
- You need a `nodes` table. For instance, it may be a list of articles with metadata (author(s), title, journal, etc.). Each node must have a unique identifier, and all the information about a node must be gathered on a single row (for articles, you need one row per article).
- You need a `directed_edges` table, that is, a table that links your nodes with another variable that will be used to build the edges between your nodes. For instance, the table could link articles (your nodes) with the references cited by these articles. The linking variable can also be a journal or a list of authors (if you are interested in collaboration). In your `directed_edges` table, you need the identifier of the nodes (also present in the `nodes` table) and the unique identifier of the categories (references cited, journals, authors…) the nodes are linked to.
As soon as you have a nodes file and an edges file (see the `biblionetwork` package for creating such files), you can create a graph using `tidygraph` and its `tbl_graph()` function. The next step, common to many network analyses and notably to bibliometric networks such as bibliographic coupling networks, is to keep only the main component of your network. This can be done in one step with the `tbl_main_component()` function of `networkflow`. As the warning in the output below indicates, `tbl_main_component()` was deprecated in networkflow 0.1.0 in favor of `filter_components()`.
library(networkflow)
## basic example code
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
#> Warning: `tbl_main_component()` was deprecated in networkflow 0.1.0.
#> ℹ Please use `filter_components()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
print(graph)
#> # A tbl_graph: 145 nodes and 2593 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 145 × 6 (active)
#> Id Author Year Author_date Title Journal
#> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 96284971 ALBANESI-S 2003 ALBANESI-S-2003 Expectation traps and mo… "The R…
#> 2 37095547 ATESOGLU-H 1980 ATESOGLU-H-1980 Inflation and its accele… "Journ…
#> 3 46282251 ATESOGLU-H 1982 ATESOGLU-H-1982 WAGES AND STAGFLATION "JOURN…
#> 4 214927 BALL-L 1991 BALL-L-1991 The Genesis of Inflation… "Journ…
#> 5 2207578 BALL-L 1995 BALL-L-1995a Relative-Price Changes a… "The Q…
#> 6 10729971 BALL-L 1995 BALL-L-1995b Time-consistent policy a… "Journ…
#> 7 95575910 BARSKY-R 2002 BARSKY-R-2002 Do we really know that o… "NBER …
#> 8 105203318 BARSKY-R 2004 BARSKY-R-2004 Oil and the Macroeconomy… "Journ…
#> 9 140987361 BENATI-L 2011 BENATI-L-2011 Would the Bundesbank hav… "JOURN…
#> 10 43590636 BERNANKE-B 1997 BERNANKE-B-1997a Systematic monetary poli… "Brook…
#> # ℹ 135 more rows
#> #
#> # Edge Data: 2,593 × 5
#> from to weight Source Target
#> <int> <int> <dbl> <int> <int>
#> 1 4 5 0.146 214927 2207578
#> 2 4 65 0.0408 214927 5982867
#> 3 4 46 0.0973 214927 8456979
#> # ℹ 2,590 more rows
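To make concrete what keeping the main component means, the following sketch does the same kind of filtering by hand with `tidygraph` on a toy graph. It is an illustration under assumptions, not the internal code of `tbl_main_component()`: the toy data and the `component` column name are hypothetical.

```r
library(tidygraph)
library(dplyr)

# Toy data (made up): nodes a-c form one component, d-e a second, smaller one.
nodes <- data.frame(Id = c("a", "b", "c", "d", "e"), stringsAsFactors = FALSE)
edges <- data.frame(
  from = c("a", "b", "d"),
  to = c("b", "c", "e"),
  weight = c(1, 0.5, 0.2),
  stringsAsFactors = FALSE
)

# node_key tells tbl_graph() which node column the from/to values refer to.
toy_graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE, node_key = "Id")

# group_components() numbers components by decreasing size, so 1 is the largest.
# Filtering nodes also drops the edges attached to the removed nodes.
main_component <- toy_graph %>%
  activate(nodes) %>%
  mutate(component = group_components()) %>%
  filter(component == 1)

print(main_component)
```

Keeping several components instead of one would amount to replacing the filter with `component <= k` for some hypothetical `k`.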
The `nb_components` parameter lets you choose the number of components you want to keep. Because the goal is usually to keep only the main component, it is set to 1 by default.
However, in some networks (for instance co-authorship networks), the second-largest component can be quite large. To avoid removing large components unknowingly, `tbl_main_component()` issues a warning whenever a removed secondary component gathers more than a given share of the total number of nodes. This share is controlled by the `threshold_alert` parameter, which is set to 0.05 by default; you can lower it if you also want to be warned about the removal of smaller components.
## basic example code
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", threshold_alert = 0.001)
#> Warning in tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, :
#> Warning: you have removed a component gathering more than 0.001% of the nodes
print(graph)
#> # A tbl_graph: 145 nodes and 2593 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 145 × 6 (active)
#> Id Author Year Author_date Title Journal
#> <chr> <chr> <int> <chr> <chr> <chr>
#> 1 96284971 ALBANESI-S 2003 ALBANESI-S-2003 Expectation traps and mo… "The R…
#> 2 37095547 ATESOGLU-H 1980 ATESOGLU-H-1980 Inflation and its accele… "Journ…
#> 3 46282251 ATESOGLU-H 1982 ATESOGLU-H-1982 WAGES AND STAGFLATION "JOURN…
#> 4 214927 BALL-L 1991 BALL-L-1991 The Genesis of Inflation… "Journ…
#> 5 2207578 BALL-L 1995 BALL-L-1995a Relative-Price Changes a… "The Q…
#> 6 10729971 BALL-L 1995 BALL-L-1995b Time-consistent policy a… "Journ…
#> 7 95575910 BARSKY-R 2002 BARSKY-R-2002 Do we really know that o… "NBER …
#> 8 105203318 BARSKY-R 2004 BARSKY-R-2004 Oil and the Macroeconomy… "Journ…
#> 9 140987361 BENATI-L 2011 BENATI-L-2011 Would the Bundesbank hav… "JOURN…
#> 10 43590636 BERNANKE-B 1997 BERNANKE-B-1997a Systematic monetary poli… "Brook…
#> # ℹ 135 more rows
#> #
#> # Edge Data: 2,593 × 5
#> from to weight Source Target
#> <int> <int> <dbl> <int> <int>
#> 1 4 5 0.146 214927 2207578
#> 2 4 65 0.0408 214927 5982867
#> 3 4 46 0.0973 214927 8456979
#> # ℹ 2,590 more rows
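The check behind `threshold_alert` can be pictured as comparing the share of nodes held by each removed component with a threshold. The sketch below reproduces that idea with `igraph` on a toy graph. It is not the package's implementation: the data, the `threshold` variable, and the assumption that the threshold is a share between 0 and 1 are all hypothetical.

```r
library(igraph)

# Toy graph (made up) with three components of sizes 6, 3 and 1.
toy_graph <- graph_from_literal(A-B-C-D-E-F, G-H-I, J)

# Share of the total number of nodes held by each component, largest first.
sizes <- sort(components(toy_graph)$csize, decreasing = TRUE)
shares <- sizes / sum(sizes)

# Hypothetical values playing the roles of threshold_alert and nb_components.
threshold <- 0.05
nb_components <- 1

# Components that would be dropped when keeping only the largest ones.
removed <- shares[-seq_len(nb_components)]
alert <- any(removed > threshold)

# message() is used here instead of warning() to keep the sketch side-effect free.
if (alert) {
  message("A removed component gathers more than ", threshold * 100, "% of the nodes")
}
```

With these toy values, the two removed components hold 30% and 10% of the nodes, so the alert is triggered.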
Second step: finding communities
Once you have your tidygraph graph, an important step is to run a community detection algorithm to group the nodes according to their links. This package uses the `find_partition()` function of the `leidenAlg` package to implement the Leiden algorithm (Traag, Waltman, and van Eck 2019). The `leiden_workflow()` function of our package runs the Leiden algorithm and assigns a community number to each node in the `Com_ID` column, and also to each edge (depending on whether the `from` and `to` nodes are within the same community).
# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
# finding communities
graph <- leiden_workflow(graph)
print(graph)
You can observe that the function also gives the size of the community, by calculating the share of total nodes that are in each community.
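The two ideas above, giving an edge a community only when both of its ends share one and measuring each community as a share of all nodes, can be sketched with `tidygraph` and `dplyr`. This is a hedged illustration, not the code of `leiden_workflow()`: the communities are assigned by hand rather than by the Leiden algorithm, and the `com_from`, `com_to`, `edge_com`, `nb_nodes` and `share` column names are hypothetical.

```r
library(tidygraph)
library(dplyr)

# Toy graph (made up): two triangles joined by one bridge (c-d).
# Communities are assigned by hand instead of being detected.
nodes <- data.frame(
  Id = c("a", "b", "c", "d", "e", "f"),
  Com_ID = c("01", "01", "01", "02", "02", "02"),
  stringsAsFactors = FALSE
)
edges <- data.frame(
  from = c("a", "a", "b", "d", "d", "e", "c"),
  to = c("b", "c", "c", "e", "f", "f", "d"),
  stringsAsFactors = FALSE
)
toy_graph <- tbl_graph(nodes = nodes, edges = edges, directed = FALSE, node_key = "Id")

# An edge gets a community only when both of its ends belong to the same one.
# .N() gives access to the node table while the edges are active.
toy_graph <- toy_graph %>%
  activate(edges) %>%
  mutate(
    com_from = .N()$Com_ID[from],
    com_to = .N()$Com_ID[to],
    edge_com = if_else(com_from == com_to, com_from, NA_character_)
  )

# Size of each community as a share of the total number of nodes.
community_shares <- toy_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  count(Com_ID, name = "nb_nodes") %>%
  mutate(share = nb_nodes / sum(nb_nodes))

print(community_shares)
```

Only the bridge between the two triangles ends up without a community, since its two ends belong to different ones.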
The function also lets you adjust the `resolution` parameter of the `find_partition()` function of `leidenAlg`. Varying the resolution of the algorithm results in a different partition and a different number of communities: a lower resolution means fewer communities, and a higher resolution means more. The main resolution of `leiden_workflow()` is set by `res_1` and equals 1 by default. You can vary this parameter, and also try a second resolution with `res_2` and a third one with `res_3`:
# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
# finding communities
graph <- leiden_workflow(graph, res_1 = 0.5, res_2 = 2, res_3 = 3)
print(graph)
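To see the effect of the resolution in isolation, the sketch below calls `find_partition()` from `leidenAlg` directly on a toy `igraph` graph at two resolutions. It assumes that `leidenAlg` and `igraph` are installed; the graph, the `count_communities()` helper and the chosen resolution values are hypothetical, and the exact partitions may vary between runs.

```r
library(igraph)
library(leidenAlg)

# Toy graph (made up): four 5-node cliques arranged in a ring,
# with a single bridge between consecutive cliques.
k5 <- make_full_graph(5)
toy_graph <- disjoint_union(k5, k5, k5, k5)
toy_graph <- add_edges(toy_graph, c(5, 6, 10, 11, 15, 16, 20, 1))
E(toy_graph)$weight <- 1

# Hypothetical helper: number of communities found at a given resolution.
count_communities <- function(resolution) {
  membership <- find_partition(
    toy_graph,
    edge_weights = E(toy_graph)$weight,
    resolution = resolution
  )
  length(unique(membership))
}

nb_low <- count_communities(0.05)
nb_high <- count_communities(1)

# A lower resolution should not yield more communities than a higher one here.
print(c(low = nb_low, high = nb_high))
```

Comparing the two counts on your own network is a quick way to choose values for `res_1`, `res_2` and `res_3` before running the full workflow.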
Once you have detected the communities in your network, you are well on the way to plotting your graph, but two important steps should be carried out first. The first is to assign a color to each community. These colors will be used for your nodes and edges when you plot your graph with `ggraph`. The `community_colors()` function of the `networkflow` package does this. You just have to give it a palette (with as many
colors as the number of communities for a better visualisation).2
# loading a palette with many colors
palette <- c("#1969B3","#01A5D8","#DA3E61","#3CB95F","#E0AF0C","#E25920","#6C7FC9","#DE9493","#CD242E","#6F4288","#B2EEF8","#7FF6FD","#FDB8D6","#8BF9A9","#FEF34A","#FEC57D","#DAEFFB","#FEE3E1","#FBB2A7","#EFD7F2","#5CAADA","#37D4F5","#F5779B","#62E186","#FBDA28","#FB8F4A","#A4B9EA","#FAC2C0","#EB6466","#AD87BC","#0B3074","#00517C","#871B2A","#1A6029","#7C4B05","#8A260E","#2E3679","#793F3F","#840F14","#401C56","#003C65","#741A09","#602A2A","#34134A","#114A1B","#27DDD1","#27DD8D","#4ADD27","#D3DD27","#DDA427","#DF2935","#DD27BC","#BA27DD","#3227DD","#2761DD","#27DDD1")
# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
# finding communities
graph <- leiden_workflow(graph)
# attributing colors
graph <- community_colors(graph, palette, community_column = "Com_ID")
print(graph)
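Conceptually, attributing colors amounts to building a lookup table from community identifiers to palette entries and joining it to the nodes. The sketch below shows this on a plain data frame with `dplyr`. It is an assumption about the general idea, not the implementation of `community_colors()`; the toy data and the `color` column name are hypothetical.

```r
library(dplyr)

# Toy node table with a community column (made-up data).
nodes <- tibble(
  Id = c("a", "b", "c", "d", "e"),
  Com_ID = c("01", "01", "02", "03", "02")
)

# First four colors of the palette used above.
palette <- c("#1969B3", "#01A5D8", "#DA3E61", "#3CB95F")

# One color per community, following the sorted community identifiers.
communities <- sort(unique(nodes$Com_ID))
stopifnot(length(palette) >= length(communities))
color_table <- tibble(Com_ID = communities, color = palette[seq_along(communities)])

# Attach its community's color to every node.
nodes <- left_join(nodes, color_table, by = "Com_ID")
print(nodes)
```

The explicit length check reflects the advice above: with fewer colors than communities, some communities could not be told apart on the plot.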
The next step is to name your communities automatically. The `community_names()` function does this: it gives each community the label of the node, within that community, that has the highest score on the statistic you choose. In the next example, we calculate the degree of each node, and each community is named after the label of its highest-degree node.
library(magrittr)
library(dplyr)
library(tidygraph)
# calculating the degree of nodes
graph <- graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree())
# giving names to communities
graph <- community_names(graph, ordering_column = "degree", naming = "Author_date", community_column = "Com_ID")
print(graph)
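The naming rule described above (each community takes the label of its highest-scoring node) can be sketched on a plain data frame with `dplyr`. This is only an illustration of the rule, not the code of `community_names()`: the toy labels, degrees, and the `Community_name` column name are hypothetical.

```r
library(dplyr)

# Toy node table (made-up labels, communities and degrees).
nodes <- tibble(
  Author_date = c("SMITH-J-1990", "DOE-A-1995", "LEE-K-2001", "KIM-S-2004", "ROY-P-2010"),
  Com_ID = c("01", "01", "02", "02", "02"),
  degree = c(3, 7, 2, 5, 4)
)

# For each community, keep the label of the node with the highest degree.
# with_ties = FALSE keeps a single node when several share the top score.
community_labels <- nodes %>%
  group_by(Com_ID) %>%
  slice_max(degree, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Com_ID, Community_name = Author_date)

# Attach the community name to every node.
nodes <- left_join(nodes, community_labels, by = "Com_ID")
print(nodes)
```

Any other node statistic could replace `degree` as the ordering column, which is the role played by `ordering_column` in the call above.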