PageRank on R CRAN packages

I have been experimenting with R and Python to compute the PageRank score of each R package.

First, I needed to scrape every package description page and parse the “Depends”, “Imports”, and “Reverse Depends” sections to learn the relationships between packages. I used Python with scrapemark for convenience.

 

This is “scrape.py” (left unoptimized so it is easy to follow).

#coding=utf-8 
 
from scrapemark import scrape
from urlparse import urljoin
import urllib2
 
indexpage = "http://cran.r-project.org/web/packages/available_packages_by_date.html"
 
def mainpage(url):
    ret = scrape("""
        <table border="1" summary="Available CRAN packages by date.">
        <tr> <th align="left"> Date </th> <th align="left"> Package </th> <th align="left"> Title </th> </tr>
        {*
            <tr>
            <td>{{ [date] }}</td><td><a href=' {{ [pkglinks] }}'> {{ [pkgname] }}</a></td><td>{{ [Title] }}</td>
        </tr>
        *}
           </table>
        """,
        url=url)
    return ret
 
 
def getDepRedep(pkglink):
    html = urllib2.urlopen(pkglink).read()
    imps = scrape("""
    <table summary=''>
    {*   
        <tr><td valign=top>Imports:</td>
        <td>{* <a >{{ [imps] }}</a> *}</td>
        </tr>
     *}
    </table>
    """, html)
    deps = scrape("""
    <table summary=''>
    {*   
        <tr><td valign=top>Depends:</td>
        <td>{* <a>{{ [deps] }}</a> *}</td>
        </tr>
     *}
    </table>
    """, html)
    revdeps = scrape("""
    <h4>Reverse dependencies:</h4>
    <table summary="">
    {*
        <tr><td valign=top>Reverse&nbsp;depends:</td>
        <td>{* <a>{{ [revdeps] }}</a> *}</td>
        </tr>
    *}
    </table>
    """, html)
    return [imps, deps, revdeps]
 
if __name__ == "__main__":
    import sys
    pkgdic = mainpage(indexpage)
    # Resolve relative package links against the index page URL.
    pkglinks = [urljoin(indexpage, l) for l in pkgdic["pkglinks"]]
    pkgnames = pkgdic["pkgname"]
    # Pair every link with its package name (zip keeps the last package too).
    for link, name in zip(pkglinks, pkgnames):
        imps, deps, revdeps = getDepRedep(link)
        # Each output line is one edge: dependent package, then the package it uses.
        if imps is not None:
            for imp in imps["imps"]:
                sys.stdout.write("%s\t%s\n" % (name, imp))
        if deps is not None:
            for dep in deps["deps"]:
                sys.stdout.write("%s\t%s\n" % (name, dep))
        if revdeps is not None:
            for revdep in revdeps["revdeps"]:
                sys.stdout.write("%s\t%s\n" % (revdep, name))

Just run “python scrape.py > resultfile.txt”. “resultfile.txt” will then contain the edge list of the R CRAN packages, one tab-separated edge per line.
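One thing worth checking before ranking: the script writes an edge from a package's own “Depends” section and again from the other package's “Reverse depends” section, so the same edge can appear twice in the file. The sketch below is not part of the original workflow; it is a minimal illustration (the function name `unique_edges` is made up here) of dropping repeated edges from the tab-separated lines while keeping first-seen order.

```python
def unique_edges(lines):
    """Return (source, target) pairs from tab-separated lines, duplicates removed."""
    seen = set()
    edges = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 2:
            continue  # skip rows that are not exactly "source<TAB>target"
        edge = (parts[0].strip(), parts[1].strip())
        if edge not in seen:
            seen.add(edge)
            edges.append(edge)
    return edges


# Toy input in the same format as the scraper's output.
sample = ["car\tMASS\n", "car\tMASS\n", "lme4\tMatrix\n", "\n"]
print(unique_edges(sample))  # [('car', 'MASS'), ('lme4', 'Matrix')]
```

To apply it to the real file, pass an open file object instead of the list, for example `unique_edges(open("resultfile.txt"))`, and write the pairs back out with a tab between them.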

 

This edge list can be read directly by most social network analysis (SNA) packages. In my case, I used igraph.

The R code:

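library(igraph)

# Read the tab-separated edge list into a two-column character matrix.
# Newer igraph releases name these functions graph_from_edgelist() and page_rank().
g <- graph.edgelist(
  matrix(
    scan(file="resultfile.txt", what=character(0), sep="\t"),
    ncol=2, byrow=TRUE))

# One row per package, sorted by descending PageRank.
pr <- data.frame(pkg=as.vector(V(g)$name), pkgindex=as.vector(V(g)), pagerank=page.rank(g)$vector)
pr <- pr[order(pr$pagerank, decreasing=TRUE),]
head(pr, n=30)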

 

The output of “head” looks like this.

 

            pkg pkgindex    pagerank
105        MASS      104 0.044617865
115     mvtnorm      114 0.014133438
168      Matrix      167 0.011896352
79      lattice       78 0.009421894
604       rJava      603 0.008692808
4      survival        3 0.008073552
595       Hmisc      594 0.006965379
490        nlme      489 0.006819586
256        lme4      255 0.006506075
339       e1071      338 0.006162781
457         XML      456 0.006084811
18          car       17 0.004911839
574       abind      573 0.004501252
809    sandwich      808 0.004372066
464       TSdbi      463 0.004189485
113    multcomp      112 0.004094053
22      cluster       21 0.003680852
350        mgcv      349 0.003574597
505      fields      504 0.003455358
463         zoo      462 0.003405279
196    maptools      195 0.002977273
397        Rcpp      396 0.002874943
721      digest      720 0.002849751
1097    esd4all     1096 0.002843424
847  pairwiseCI      846 0.002814689
1279    foreach     1278 0.002803063
997  KernSmooth      996 0.002791697
61        rgdal       60 0.002784822
748         pls      747 0.002768121
1154      akima     1153 0.002742152

In the R package world, MASS is the most valuable package by this measure. That is, MASS is the package most often used by other valuable packages.
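To make “used by other valuable packages” concrete, here is a small pure-Python sketch of PageRank by power iteration. It is an illustration only, not igraph's implementation: the damping factor of 0.85, the fixed iteration count, and the toy edges (package names borrowed from the table, relationships made up) are all assumptions.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) edges; scores sum to 1."""
    nodes = sorted({name for edge in edges for name in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        # Every other node splits its rank equally among the packages it uses.
        for src in nodes:
            if out_links[src]:
                share = damping * rank[src] / len(out_links[src])
                for dst in out_links[src]:
                    new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: each edge points from a package to one it uses.
toy_edges = [("car", "MASS"), ("lme4", "MASS"),
             ("lme4", "Matrix"), ("multcomp", "MASS")]
scores = pagerank(toy_edges)
print(max(scores, key=scores.get))  # MASS
```

A package scores highly when many packages point to it, and more so when those packages themselves score highly, which is the sense in which MASS tops the list.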

It would be nice to plot the graph from this edge list, but I expect that so many vertices would fill all the space. It would be better to plot it by component.
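If “components” is read as connected components, the edge list can be split up before plotting. The following is a rough sketch under that assumption, ignoring edge direction (weakly connected components) and using a small union-find; the function name `weak_components` is invented for this example.

```python
def weak_components(edges):
    """Group nodes into weakly connected components, largest first."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    # Merge the two groups joined by each edge, ignoring direction.
    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(groups.values(), key=len, reverse=True)


# Toy edges: two separate groups of packages.
parts = weak_components([("a", "b"), ("b", "c"), ("x", "y")])
print([len(p) for p in parts])  # [3, 2]
```

Each returned set could then be plotted on its own, so that small groups are not crowded out by the largest one.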

CC BY-NC 4.0 Pagerank on R cran packages by from __future__ import dream is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.