A Workflow for network analysis

In this article, we will go over the different steps needed to create a bibliometric network, manipulate it, prepare the plot and eventually plot it.

First step: creating the network and keeping the main component

As a point of departure, you need your bibliometric data to be prepared in a certain format:¹

you need a nodes table. For instance, it may be a list of articles with metadata (author(s), title, journal, etc.). Each node must have a unique identifier, and all the information about a node is gathered on a single row (for articles, you need one row per article).
you need a directed_edges table, that is, a table that links your nodes to another variable that will be used to build the edges between your nodes. For instance, the table could link articles (your nodes) to the references cited by these articles. The variable can also be a journal or a list of authors (if you are interested in collaboration). The directed_edges table must contain the identifier of the nodes (also present in the nodes table) and the unique identifier of the categories (references cited, journals, authors…) the nodes are linked to.
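To make the expected format concrete, here is a minimal sketch of the two tables in base R. The column names (article_id, reference_id) and all the values are made up for illustration; your own data will use its own identifiers.

```r
# Toy nodes table: one row per article, with a unique identifier.
# All names and values are hypothetical.
nodes <- data.frame(
  article_id = c("A1", "A2", "A3"),
  author = c("SMITH-J", "DOE-A", "LEE-K"),
  year = c(1991L, 1995L, 2003L),
  stringsAsFactors = FALSE
)

# Toy directed_edges table: one row per (article, cited reference) pair.
directed_edges <- data.frame(
  article_id = c("A1", "A1", "A2", "A2", "A3"),
  reference_id = c("R1", "R2", "R1", "R3", "R3"),
  stringsAsFactors = FALSE
)

# Sanity checks before building any network:
# identifiers are unique, and every edge points to a known node.
stopifnot(!anyDuplicated(nodes$article_id))
stopifnot(all(directed_edges$article_id %in% nodes$article_id))

# Number of references cited by each article
table(directed_edges$article_id)
```

Running these two checks early avoids silent mismatches between the nodes and the edges once the graph is built.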

As soon as you have a nodes file and an edges file (see the biblionetwork package for creating such files), you can create a graph, using tidygraph and its tbl_graph() function. The next step, recurrent in many network analyses and notably in bibliometric networks such as bibliographic coupling networks, is to keep only the main component of your network. This can be done in one step using the tbl_main_component() function of networkflow. Note that, as the warning in the output below indicates, tbl_main_component() is deprecated since networkflow 0.1.0 in favour of filter_components().

library(networkflow)

## basic example code

graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)
#> Warning: `tbl_main_component()` was deprecated in networkflow 0.1.0.
#> ℹ Please use `filter_components()` instead.
#> This warning is displayed once every 8 hours.
#> Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
#> generated.
print(graph)
#> # A tbl_graph: 145 nodes and 2593 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 145 × 6 (active)
#>    Id        Author      Year Author_date      Title                     Journal
#>    <chr>     <chr>      <int> <chr>            <chr>                     <chr>  
#>  1 96284971  ALBANESI-S  2003 ALBANESI-S-2003  Expectation traps and mo… "The R…
#>  2 37095547  ATESOGLU-H  1980 ATESOGLU-H-1980  Inflation and its accele… "Journ…
#>  3 46282251  ATESOGLU-H  1982 ATESOGLU-H-1982  WAGES AND STAGFLATION     "JOURN…
#>  4 214927    BALL-L      1991 BALL-L-1991      The Genesis of Inflation… "Journ…
#>  5 2207578   BALL-L      1995 BALL-L-1995a     Relative-Price Changes a… "The Q…
#>  6 10729971  BALL-L      1995 BALL-L-1995b     Time-consistent policy a… "Journ…
#>  7 95575910  BARSKY-R    2002 BARSKY-R-2002    Do we really know that o… "NBER …
#>  8 105203318 BARSKY-R    2004 BARSKY-R-2004    Oil and the Macroeconomy… "Journ…
#>  9 140987361 BENATI-L    2011 BENATI-L-2011    Would the Bundesbank hav… "JOURN…
#> 10 43590636  BERNANKE-B  1997 BERNANKE-B-1997a Systematic monetary poli… "Brook…
#> # ℹ 135 more rows
#> #
#> # Edge Data: 2,593 × 5
#>    from    to weight Source  Target
#>   <int> <int>  <dbl>  <int>   <int>
#> 1     4     5 0.146  214927 2207578
#> 2     4    65 0.0408 214927 5982867
#> 3     4    46 0.0973 214927 8456979
#> # ℹ 2,590 more rows
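To see what keeping the main component amounts to, here is a minimal sketch that does it by hand with tidygraph on a toy graph. It is not the implementation of tbl_main_component(); the data, the id column and the n_keep variable are hypothetical.

```r
library(tidygraph)
library(dplyr)

# Toy data (hypothetical): two components, {a, b, c} and {d, e}
toy_nodes <- data.frame(id = c("a", "b", "c", "d", "e"),
                        stringsAsFactors = FALSE)
toy_edges <- data.frame(
  from = c("a", "b", "d"),
  to = c("b", "c", "e"),
  weight = c(1, 0.5, 0.2),
  stringsAsFactors = FALSE
)

# Build an undirected graph; character from/to values are matched
# against the `id` column of the nodes table
toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges,
                       directed = FALSE, node_key = "id")

# Label the components (group_components() numbers them by decreasing
# size, the largest being 1) and keep the n_keep largest ones
n_keep <- 1
main_component <- toy_graph %>%
  activate(nodes) %>%
  mutate(component = group_components()) %>%
  filter(component <= n_keep)

print(main_component)
```

Filtering the nodes also drops the edges attached to the removed nodes, so the result only contains the a, b, c component.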

The nb_components parameter lets you choose the number of components you want to keep. It is set to 1 by default, as you generally want to keep only the main component.

However, in some networks (for instance co-authorship networks), the second biggest component can be quite large. To avoid removing large components without noticing, the tbl_main_component() function raises a warning when a removed secondary component gathers more than a given share of the total number of nodes. This share is set by the threshold_alert parameter, which equals 0.05 by default, but you can lower it if you really want to avoid removing relatively big components.


## basic example code

graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", threshold_alert = 0.001)
#> Warning in tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, :
#> Warning: you have removed a component gathering more than 0.001% of the nodes
print(graph)
#> # A tbl_graph: 145 nodes and 2593 edges
#> #
#> # An undirected simple graph with 1 component
#> #
#> # Node Data: 145 × 6 (active)
#>    Id        Author      Year Author_date      Title                     Journal
#>    <chr>     <chr>      <int> <chr>            <chr>                     <chr>  
#>  1 96284971  ALBANESI-S  2003 ALBANESI-S-2003  Expectation traps and mo… "The R…
#>  2 37095547  ATESOGLU-H  1980 ATESOGLU-H-1980  Inflation and its accele… "Journ…
#>  3 46282251  ATESOGLU-H  1982 ATESOGLU-H-1982  WAGES AND STAGFLATION     "JOURN…
#>  4 214927    BALL-L      1991 BALL-L-1991      The Genesis of Inflation… "Journ…
#>  5 2207578   BALL-L      1995 BALL-L-1995a     Relative-Price Changes a… "The Q…
#>  6 10729971  BALL-L      1995 BALL-L-1995b     Time-consistent policy a… "Journ…
#>  7 95575910  BARSKY-R    2002 BARSKY-R-2002    Do we really know that o… "NBER …
#>  8 105203318 BARSKY-R    2004 BARSKY-R-2004    Oil and the Macroeconomy… "Journ…
#>  9 140987361 BENATI-L    2011 BENATI-L-2011    Would the Bundesbank hav… "JOURN…
#> 10 43590636  BERNANKE-B  1997 BERNANKE-B-1997a Systematic monetary poli… "Brook…
#> # ℹ 135 more rows
#> #
#> # Edge Data: 2,593 × 5
#>    from    to weight Source  Target
#>   <int> <int>  <dbl>  <int>   <int>
#> 1     4     5 0.146  214927 2207578
#> 2     4    65 0.0408 214927 5982867
#> 3     4    46 0.0973 214927 8456979
#> # ℹ 2,590 more rows
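If you want to inspect the size of the components yourself before removing anything, a sketch along the following lines may help. It uses igraph on a toy graph; the threshold variable is a hypothetical stand-in for threshold_alert and is treated here as a proportion, which is an assumption about how the package interprets it.

```r
library(igraph)

# Toy data (hypothetical): two components, {a, b, c} and {d, e}
toy_edges <- data.frame(
  from = c("a", "b", "d"),
  to = c("b", "c", "e"),
  stringsAsFactors = FALSE
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Share of the total number of nodes held by each component,
# from the largest to the smallest
comp <- components(toy_graph)
shares <- sort(comp$csize / vcount(toy_graph), decreasing = TRUE)
print(shares)

# Hypothetical threshold, treated as a proportion (an assumption)
threshold <- 0.05
removed <- shares[-1]  # every component except the largest one
if (any(removed > threshold)) {
  message("A removed component holds more than ",
          threshold * 100, "% of the nodes")
}
```

Here the second component holds two nodes out of five, so the message is displayed.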

Second step: finding communities

Once you have your tidygraph graph, an important step is to run a community detection algorithm to group the nodes depending on their links. This package uses the leidenAlg package, and its find_partition() function, to implement the Leiden algorithm (Traag, Waltman, and van Eck 2019). The leiden_workflow() function of our package runs the Leiden algorithm and assigns a community number to each node in the Com_ID column, but also to each edge (depending on whether the from and to nodes are within the same community).


# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph)
print(graph)

You can observe that the function also gives the size of each community, calculated as the share of total nodes that belong to it.
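The share of nodes per community can be reproduced by hand with dplyr, as in the following sketch. The toy data and the community_share column name are hypothetical, and the column created by leiden_workflow() may be named differently.

```r
library(dplyr)

# Toy nodes table (hypothetical) with a community identifier
toy_nodes <- data.frame(
  Id = c("n1", "n2", "n3", "n4", "n5"),
  Com_ID = c("01", "01", "01", "02", "02"),
  stringsAsFactors = FALSE
)

# Share of the total number of nodes in each community
n_total <- nrow(toy_nodes)
toy_nodes <- toy_nodes %>%
  group_by(Com_ID) %>%
  mutate(community_share = n() / n_total) %>%
  ungroup()

print(toy_nodes)
```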

The function also lets you adjust the resolution parameter of the find_partition() function of leidenAlg. Varying the resolution of the algorithm results in a different partition and a different number of communities: a lower resolution means fewer communities, and a higher resolution more communities. The default resolution of leiden_workflow() is set by res_1 and equals 1. You can vary this parameter, but also try a second resolution with res_2 and a third one with res_3:


# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph, res_1 = 0.5, res_2 = 2, res_3 = 3)
print(graph)
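To get an intuition of why a lower resolution tends to produce fewer communities, the following base R sketch computes a modularity score with a resolution parameter gamma on a toy graph. This assumes a modularity-type quality function; the exact function optimised by leiden_workflow() is not stated here and may differ. The graph and the modularity_gamma() helper are hypothetical.

```r
# Toy graph (hypothetical): two triangles (1-2-3 and 4-5-6)
# joined by a single edge 3-4
edges <- rbind(c(1, 2), c(1, 3), c(2, 3),
               c(4, 5), c(4, 6), c(5, 6),
               c(3, 4))
adj <- matrix(0, 6, 6)
adj[edges] <- 1
adj[edges[, 2:1]] <- 1  # make the adjacency matrix symmetric

# Modularity with a resolution parameter gamma:
# Q = 1/(2m) * sum_ij (A_ij - gamma * k_i * k_j / (2m)) * [c_i == c_j]
modularity_gamma <- function(adj, membership, gamma = 1) {
  two_m <- sum(adj)
  k <- rowSums(adj)
  same <- outer(membership, membership, "==")
  sum((adj - gamma * outer(k, k) / two_m) * same) / two_m
}

one_community <- rep(1, 6)
two_communities <- c(1, 1, 1, 2, 2, 2)

# Compare the two partitions at a low and at the default resolution
for (gamma in c(0.1, 1)) {
  q_one <- modularity_gamma(adj, one_community, gamma)
  q_two <- modularity_gamma(adj, two_communities, gamma)
  cat("gamma =", gamma,
      ": one community", round(q_one, 3),
      "| two communities", round(q_two, 3), "\n")
}
```

With the low resolution, the single merged community gets the higher score; with a resolution of 1, the split into two triangles does.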

Once you have detected the communities in your network, you are well on the way to plotting your graph, but two important steps should be carried out first. First, you have to assign a color to each community. These colors will be used for your nodes and edges when you plot your graph with ggraph. The community_colors() function of the networkflow package does that. You just have to give it a palette (with as many colors as there are communities, for a better visualisation).²

# loading a palette with many colors
palette <- c("#1969B3","#01A5D8","#DA3E61","#3CB95F","#E0AF0C","#E25920","#6C7FC9","#DE9493","#CD242E","#6F4288","#B2EEF8","#7FF6FD","#FDB8D6","#8BF9A9","#FEF34A","#FEC57D","#DAEFFB","#FEE3E1","#FBB2A7","#EFD7F2","#5CAADA","#37D4F5","#F5779B","#62E186","#FBDA28","#FB8F4A","#A4B9EA","#FAC2C0","#EB6466","#AD87BC","#0B3074","#00517C","#871B2A","#1A6029","#7C4B05","#8A260E","#2E3679","#793F3F","#840F14","#401C56","#003C65","#741A09","#602A2A","#34134A","#114A1B","#27DDD1","#27DD8D","#4ADD27","#D3DD27","#DDA427","#DF2935","#DD27BC","#BA27DD","#3227DD","#2761DD","#27DDD1")

# creating again the graph
graph <- tbl_main_component(nodes = Nodes_coupling, edges = Edges_coupling, directed = FALSE, node_key = "ItemID_Ref", nb_components = 1)

# finding communities
graph <- leiden_workflow(graph)

# attributing colors
graph <- community_colors(graph, palette, community_column = "Com_ID")
print(graph)
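The idea of matching one color to each community can be sketched in base R as follows. This is only an illustration on hypothetical data, not the implementation of community_colors(), and it only handles node colors.

```r
# Toy nodes table (hypothetical) with a community identifier
toy_nodes <- data.frame(
  Id = c("n1", "n2", "n3", "n4"),
  Com_ID = c("01", "02", "01", "03"),
  stringsAsFactors = FALSE
)
toy_palette <- c("#1969B3", "#01A5D8", "#DA3E61")

# One color per community: the palette must be long enough
communities <- sort(unique(toy_nodes$Com_ID))
stopifnot(length(toy_palette) >= length(communities))

color_table <- data.frame(
  Com_ID = communities,
  color = toy_palette[seq_along(communities)],
  stringsAsFactors = FALSE
)

# Attach to each node the color of its community
toy_nodes$color <- color_table$color[match(toy_nodes$Com_ID, color_table$Com_ID)]
print(toy_nodes)
```

Checking the palette length first makes the failure explicit when there are more communities than colors.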

Next, you will want to name your communities automatically. The community_names() function does that: it gives each community the label of the node, within the community, that has the highest score on the statistic you choose. In the next example, we calculate the degree of each node, and each community takes as its name the label of its highest-degree node.


library(magrittr)
library(dplyr)
library(tidygraph)

# calculating the degree of nodes
graph <- graph %>%
  activate(nodes) %>%
  mutate(degree = centrality_degree())

# giving names to communities
graph <- community_names(graph, ordering_column = "degree", naming = "Author_date", community_column = "Com_ID")
print(graph)
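The naming logic can be sketched with dplyr on a plain table: within each community, pick the node with the highest value of the ordering column and use its label as the community name. The degree values below are made up, the community_name column is hypothetical, and this is not the implementation of community_names().

```r
library(dplyr)

# Toy nodes table (hypothetical degree values)
toy_nodes <- data.frame(
  Author_date = c("BALL-L-1991", "BALL-L-1995a", "BARSKY-R-2002", "BENATI-L-2011"),
  Com_ID = c("01", "01", "02", "02"),
  degree = c(12, 30, 7, 5),
  stringsAsFactors = FALSE
)

# For each community, keep the label of its highest-degree node
community_labels <- toy_nodes %>%
  group_by(Com_ID) %>%
  slice_max(degree, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  select(Com_ID, community_name = Author_date)

# Attach the community name to every node
toy_nodes <- left_join(toy_nodes, community_labels, by = "Com_ID")
print(toy_nodes)
```

Setting with_ties = FALSE guarantees a single name per community even when two nodes share the highest degree.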

Third step: plotting the network

Preparing the plot

References

Traag, Vincent A, Ludo Waltman, and Nees Jan van Eck. 2019. “From Louvain to Leiden: Guaranteeing Well-Connected Communities.” Scientific Reports 9 (1): 1–12.

Aurélien Goutsmedt and Alexandre Truc
