vignettes/Using_biblionetwork.Rmd
This vignette introduces you to the different functions of the package with the data integrated in the package.
The biblio_coupling() function is the most general function of the package. It takes as input a direct citation data frame (entities, like articles, authors or institutions, citing references) and produces an edge list for a bibliographic coupling network, with the number of references that different articles share, as well as the coupling angle value of each edge (Sen and Gan 1983). This is a standard way to build a bibliographic coupling network using Salton's cosine measure: it divides the number of references that two articles share by the square root of the product of the two articles' bibliography lengths. This avoids giving too much importance to articles with a large bibliography. It looks like:
\[ \frac{R(A) \bullet R(B)}{\sqrt{L(A).L(B)}} \]
with \(R(A)\) and \(R(B)\) the references of documents A and B, \(R(A) \bullet R(B)\) being the number of shared references by A and B, and \(L(A)\) and \(L(B)\) the length of the bibliographies of documents A and B.
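The formula can be checked by hand on a small example. The sketch below uses base R only and a made-up citation table (the column names citing and cited and all values are hypothetical, not taken from the package data); it illustrates the measure and is not the package's implementation.

```r
# Toy direct citation table (hypothetical data): one row per article/reference pair
citations <- data.frame(
  citing = c("A", "A", "A", "A", "B", "B", "B", "C", "C"),
  cited  = c("r1", "r2", "r3", "r4", "r1", "r2", "r5", "r5", "r6"),
  stringsAsFactors = FALSE
)

# Incidence matrix: rows are citing articles, columns are references (1 if cited)
incidence <- unclass(table(citations$citing, citations$cited))

# Number of shared references for every pair of articles
shared <- incidence %*% t(incidence)

# Bibliography lengths L(A) are on the diagonal
bib_length <- diag(shared)

# Coupling angle: shared references divided by sqrt(L(A) * L(B))
coupling_angle <- shared / sqrt(outer(bib_length, bib_length))

coupling_angle["A", "B"]  # 2 shared references, so 2 / sqrt(4 * 3)
```

Here A and B share two references out of bibliographies of length 4 and 3, while A and C share none and get a weight of zero. Note that biblio_coupling() itself does not return a matrix like this one.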
The output of biblio_coupling() is an edge list linking nodes together (see the from and to columns), with a weight for each edge being the coupling angle measure. If normalized_weight_only is set to FALSE, another column displays the number of references shared by the two nodes.
The following example uses the Ref_stagflation data frame.
library(biblionetwork)
biblio_coupling(Ref_stagflation,
source = "Citing_ItemID_Ref",
ref = "ItemID_Ref",
normalized_weight_only = FALSE,
weight_threshold = 1)
#> from to weight nb_shared_references Source
#> 1: 214927 2207578 0.14605935 4 214927
#> 2: 214927 5982867 0.04082483 1 214927
#> 3: 214927 8456979 0.09733285 3 214927
#> 4: 214927 10729971 0.29848100 7 214927
#> 5: 214927 16008556 0.04714045 1 214927
#> ---
#> 2712: 1111111161 1111111172 0.03434014 1 1111111161
#> 2713: 1111111161 1111111180 0.02003610 1 1111111161
#> 2714: 1111111161 1111111183 0.04050542 2 1111111161
#> 2715: 1111111172 1111111180 0.03646625 1 1111111172
#> 2716: 1111111182 1111111183 0.27060404 8 1111111182
#> Target
#> 1: 2207578
#> 2: 5982867
#> 3: 8456979
#> 4: 10729971
#> 5: 16008556
#> ---
#> 2712: 1111111172
#> 2713: 1111111180
#> 2714: 1111111183
#> 2715: 1111111180
#> 2716: 1111111183
This function is relatively general and can also be used for co-citation, by inverting the source and ref columns, although the dedicated biblio_cocitation() function is better suited to that purpose.
The function keeps only the edges whose non-normalized weight (the number of shared references) is at least equal to the weight_threshold. In a large bibliographic coupling network, you may consider, for instance, that sharing only one reference is not significant enough for two articles to be linked together. Raising this parameter also helps to avoid creating intractable networks with too many edges.
biblio_coupling(Ref_stagflation,
source = "Citing_ItemID_Ref",
ref = "ItemID_Ref",
weight_threshold = 3)
#> from to weight Source Target
#> 1: 214927 2207578 0.14605935 214927 2207578
#> 2: 214927 8456979 0.09733285 214927 8456979
#> 3: 214927 10729971 0.29848100 214927 10729971
#> 4: 214927 19627977 0.11202241 214927 19627977
#> 5: 1021902 12824456 0.06537205 1021902 12824456
#> ---
#> 958: 1111111147 1111111156 0.17325923 1111111147 1111111156
#> 959: 1111111147 1111111161 0.13333938 1111111147 1111111161
#> 960: 1111111156 1111111161 0.08580846 1111111156 1111111161
#> 961: 1111111159 1111111171 0.24333213 1111111159 1111111171
#> 962: 1111111182 1111111183 0.27060404 1111111182 1111111183
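Judging from the two outputs above (the pair sharing exactly three references survives weight_threshold = 3, and the pairs sharing one reference do not), the filter appears to be an inclusive comparison on the raw count of shared references rather than on the normalized weight. A minimal base R sketch of that reading, on a hypothetical edge list:

```r
# Hypothetical edge list holding both the raw count and the normalized weight
edges <- data.frame(
  from = c("A", "A", "B"),
  to = c("B", "C", "C"),
  nb_shared_references = c(4, 1, 3),
  weight = c(0.15, 0.04, 0.10),
  stringsAsFactors = FALSE
)

weight_threshold <- 3

# Filter on the number of shared references, not on the normalized weight
kept <- edges[edges$nb_shared_references >= weight_threshold, ]
kept
```

The A-C edge is dropped because the two nodes share a single reference, whatever its normalized weight.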
As explained above, you can use the biblio_coupling() function to create a co-citation network: you just have to put the references in the source column (they will be the nodes of your network) and the citing articles in ref. As this is likely to create some confusion, the package also provides a biblio_cocitation() function, which has a similar structure to biblio_coupling() but is explicitly meant for co-citation: citing articles stay in source and references stay in ref. The next example shows that both produce the same results:
biblio_coupling(Ref_stagflation,
source = "ItemID_Ref",
ref = "Citing_ItemID_Ref")
#> from to weight Source Target
#> 1: 49248 180162 1.0000000 49248 180162
#> 2: 49248 804988 0.3162278 49248 804988
#> 3: 49248 1999903 1.0000000 49248 1999903
#> 4: 49248 2031010 1.0000000 49248 2031010
#> 5: 49248 3580645 0.7071068 49248 3580645
#> ---
#> 87664: 1111112223 1111112225 1.0000000 1111112223 1111112225
#> 87665: 1111112223 1111112227 1.0000000 1111112223 1111112227
#> 87666: 1111112224 1111112225 1.0000000 1111112224 1111112225
#> 87667: 1111112224 1111112227 1.0000000 1111112224 1111112227
#> 87668: 1111112225 1111112227 1.0000000 1111112225 1111112227
biblio_cocitation(Ref_stagflation,
source = "Citing_ItemID_Ref",
ref = "ItemID_Ref")
#> from to weight Source Target
#> 1: 49248 180162 1.0000000 49248 180162
#> 2: 49248 804988 0.3162278 49248 804988
#> 3: 49248 1999903 1.0000000 49248 1999903
#> 4: 49248 2031010 1.0000000 49248 2031010
#> 5: 49248 3580645 0.7071068 49248 3580645
#> ---
#> 87664: 1111112223 1111112225 1.0000000 1111112223 1111112225
#> 87665: 1111112223 1111112227 1.0000000 1111112223 1111112227
#> 87666: 1111112224 1111112225 1.0000000 1111112224 1111112225
#> 87667: 1111112224 1111112227 1.0000000 1111112224 1111112227
#> 87668: 1111112225 1111112227 1.0000000 1111112225 1111112227
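Why swapping the two columns is enough can be seen on a toy table: the same cosine-normalized co-occurrence computation gives coupling when the nodes are citing articles, and co-citation when the nodes are references. The helper cosine_edges() and the data below are hypothetical and only illustrate the symmetry; they are not part of the package.

```r
# Toy direct citation table (hypothetical data)
citations <- data.frame(
  citing = c("A", "A", "B", "B", "C"),
  cited  = c("r1", "r2", "r1", "r2", "r1"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: cosine-normalized co-occurrence of `node` values
# across the `link` values they have in common
cosine_edges <- function(df, node, link) {
  incidence <- unclass(table(df[[node]], df[[link]]))
  shared <- incidence %*% t(incidence)
  size <- diag(shared)
  shared / sqrt(outer(size, size))
}

# Coupling: nodes are citing articles, linked through shared references
coupling <- cosine_edges(citations, node = "citing", link = "cited")

# Co-citation: nodes are references, linked through the articles citing both
cocitation <- cosine_edges(citations, node = "cited", link = "citing")

cocitation["r1", "r2"]  # cited together twice, so 2 / sqrt(3 * 2)
```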
coupling_strength() function
The coupling_strength() function calculates the coupling strength measure (Shen et al. 2019) from a direct citation data frame. It is a refinement of biblio_coupling(): it takes into account how frequently a reference shared by two articles has been cited in the whole corpus. In other words, the most cited references weigh less in the link between two articles than references that have rarely been cited. To a certain extent, it is similar to the tf-idf measure. It looks like:
\[ \frac{1}{L(A)}.\frac{1}{L(B)}\sum_{j}(log({\frac{N}{freq(R_{j})}})) \]
with \(N\) the number of articles in the whole dataset and \(freq(R_{j})\) the number of times the reference \(j\) (which is shared by documents A and B) is cited in the whole corpus.
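A small worked example may help to see the effect of the logarithmic term. The sketch below is base R on made-up data, assumes a natural logarithm and a product of the two bibliography lengths in the denominator, and uses a hypothetical helper coupling_strength_pair(); it is an illustration, not the package's code.

```r
# Toy direct citation table (hypothetical data, no duplicated rows)
citations <- data.frame(
  citing = c("A", "A", "A", "B", "B", "C", "C", "D"),
  cited  = c("r1", "r2", "r3", "r1", "r2", "r1", "r4", "r1"),
  stringsAsFactors = FALSE
)

n_articles <- length(unique(citations$citing))                  # N
ref_freq   <- lengths(split(citations$citing, citations$cited)) # freq(R_j)
bib_length <- lengths(split(citations$cited, citations$citing)) # L(A)

# Hypothetical helper: coupling strength for one pair of articles
coupling_strength_pair <- function(a, b) {
  refs_a <- citations$cited[citations$citing == a]
  refs_b <- citations$cited[citations$citing == b]
  shared <- intersect(refs_a, refs_b)
  # Rarely cited shared references get a larger log(N / freq) term
  idf <- log(n_articles / ref_freq[shared])
  sum(idf) / (bib_length[[a]] * bib_length[[b]])
}

coupling_strength_pair("A", "B")  # shares r1 (cited by all) and r2 (cited twice)
coupling_strength_pair("A", "C")  # shares only r1, cited by every article
```

Reference r1 is cited by all four articles, so its term is log(4/4) = 0: the A-C pair, which shares only r1, gets a weight of zero, whereas A-B owes its whole weight to the less common r2.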
coupling_strength(Ref_stagflation,
source = "Citing_ItemID_Ref",
ref = "ItemID_Ref",
weight_threshold = 1)
#> from to weight Source Target
#> 1: 214927 2207578 0.019691698 214927 2207578
#> 2: 214927 5982867 0.005331122 214927 5982867
#> 3: 214927 8456979 0.011752248 214927 8456979
#> 4: 214927 10729971 0.046511251 214927 10729971
#> 5: 214927 16008556 0.008648490 214927 16008556
#> ---
#> 2712: 1111111161 1111111172 0.005067554 1111111161 1111111172
#> 2713: 1111111161 1111111180 0.001168603 1111111161 1111111180
#> 2714: 1111111161 1111111183 0.002580798 1111111161 1111111183
#> 2715: 1111111172 1111111180 0.003870999 1111111172 1111111180
#> 2716: 1111111182 1111111183 0.037748271 1111111182 1111111183
Rather than focusing on documents, you may want to study the relationships between authors, institutions/affiliations or journals. The coupling_entity() function allows you to do that. Coupling links are calculated using the coupling angle measure (like biblio_coupling()) or the coupling strength measure (like coupling_strength()). They depend on the number of references two authors share, taking into account the minimum number of times each of the two authors cites each reference. For instance, if two entities share a reference, the first one citing it twice (in other words, citing it in two different articles) and the second one three times, the function takes two as the minimum value. In addition to the features of the coupling strength measure or the coupling angle measure, this means that, if two entities share two references, the first one cited at least four times by each entity and the second one cited at least once, the first reference contributes more to the edge weight than the second. This use of the minimum number of shared citations for entity coupling comes from Zhao and Strotmann (2008). With the coupling strength measure, it looks like:
\[ \frac{1}{L(A)}.\frac{1}{L(B)}\sum_{j} Min(C_{Aj},C_{Bj}).(log({\frac{N}{freq(R_{j})}})) \]
with \(C_{Aj}\) and \(C_{Bj}\) the number of times entities A and B cite the reference \(j\).
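The minimum term can be illustrated with the case described above, where one author cites a reference in two articles and another in three. The sketch below is base R on hypothetical data (the author, article and cited columns are made up) and only computes the \(Min(C_{Aj},C_{Bj})\) part; the remaining terms follow the coupling strength or coupling angle formulas.

```r
# Hypothetical data: one row per (author, article, reference) citation
entity_citations <- data.frame(
  author  = c("X", "X", "X", "Y", "Y", "Y", "Y"),
  article = c("a1", "a2", "a1", "a3", "a4", "a5", "a3"),
  cited   = c("r1", "r1", "r2", "r1", "r1", "r1", "r2"),
  stringsAsFactors = FALSE
)

# C_Aj: number of articles in which each author cites each reference
counts <- unclass(table(entity_citations$author, entity_citations$cited))

# Min(C_Aj, C_Bj) for the pair (X, Y), reference by reference
min_counts <- pmin(counts["X", ], counts["Y", ])
min_counts

# Non-normalized link between X and Y: r1 counts twice, r2 once
sum(min_counts)
```

Author X cites r1 in two articles and Y in three, so r1 enters the link with a count of two, while r2, cited once by each, enters with a count of one.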
The following example uses the Ref_stagflation and the Authors_stagflation data frames.
# merging the references data with the citing author information in Authors_stagflation
entity_citations <- merge(Ref_stagflation,
                          Authors_stagflation,
                          by.x = "Citing_ItemID_Ref",
                          by.y = "ItemID_Ref",
                          allow.cartesian = TRUE)
# allow.cartesian is needed because articles can have several authors, so the merged
# result is longer than the longer of the two merged data frames
coupling_entity(entity_citations,
source = "Citing_ItemID_Ref",
ref = "ItemID_Ref",
entity = "Author.y",
method = "coupling_angle")
#> from to weight Source Target
#> 1: ALBANESI-S CHARI-V 0.032897585 ALBANESI-S CHARI-V
#> 2: ALBANESI-S CHRISTIANO-L 0.025302270 ALBANESI-S CHRISTIANO-L
#> 3: ALBANESI-S BALL-L 0.024296477 ALBANESI-S BALL-L
#> 4: ALBANESI-S MANKIW-G 0.038924947 ALBANESI-S MANKIW-G
#> 5: ALBANESI-S ROTEMBERG-J 0.030457245 ALBANESI-S ROTEMBERG-J
#> ---
#> 3461: WILLIAMS-J YOUNG-W 0.008684168 WILLIAMS-J YOUNG-W
#> 3462: WILLIAMS-J WILLIAMS-N 0.014002801 WILLIAMS-J WILLIAMS-N
#> 3463: WILLIAMS-J ZHA-T 0.014002801 WILLIAMS-J ZHA-T
#> 3464: WILLIAMS-N ZHA-T 0.040000000 WILLIAMS-N ZHA-T
#> 3465: WOODFORD-M YOUNG-W 0.020672456 WOODFORD-M YOUNG-W
#> Weighting_method
#> 1: coupling_angle
#> 2: coupling_angle
#> 3: coupling_angle
#> 4: coupling_angle
#> 5: coupling_angle
#> ---
#> 3461: coupling_angle
#> 3462: coupling_angle
#> 3463: coupling_angle
#> 3464: coupling_angle
#> 3465: coupling_angle
The biblionetwork package contains bibliometric data built by Goutsmedt (2021). These data gather the academic articles and books, published between 1975 and 2013, that endeavoured to explain the United States stagflation of the 1970s. They also gather all the references cited by these articles and books on stagflation. The Nodes_stagflation file contains information about the academic articles and books on stagflation (the stagflation documents), as well as about the references cited by at least two of these stagflation documents. Ref_stagflation is a data frame of direct citations, with the identifiers of citing documents and the identifiers of cited documents. Authors_stagflation is a data frame listing the documents explaining the US stagflation and all the authors of these documents (Nodes_stagflation only keeps the first author of each document).
I take authors as an example here, but the function could also be used for calculating a co-authorship network with institutions or countries as nodes.