I’ve been experimenting with R and Python to compute a PageRank score for each R package.
First, I needed to scrape every package description page and parse the “Depends”, “Imports”, and “Reverse depends” sections to get the relations between packages. I used Python with scrapemark for convenience.
This is “scrape.py” (the code is left unoptimized for easier understanding).
```python
# coding=utf-8
from scrapemark import scrape
from urlparse import urljoin
import urllib2

indexpage = "http://cran.r-project.org/web/packages/available_packages_by_date.html"

def mainpage(url):
    # Scrape the date / package / title table from the index page.
    ret = scrape("""
        <table border="1" summary="Available CRAN packages by date.">
        <tr>
        <th align="left"> Date </th>
        <th align="left"> Package </th>
        <th align="left"> Title </th>
        </tr>
        {*
        <tr>
        <td>{{ [date] }}</td><td><a href='{{ [pkglinks] }}'>{{ [pkgname] }}</a></td><td>{{ [Title] }}</td>
        </tr>
        *}
        </table>
        """, url=url)
    return ret

def getDepRedep(pkglink):
    # Scrape the Imports / Depends / Reverse depends fields of one package page.
    html = urllib2.urlopen(pkglink).read()
    imps = scrape("""
        <table summary=''>
        {*
        <tr><td valign=top>Imports:</td>
        <td>{* <a>{{ [imps] }}</a> *}</td>
        </tr>
        *}
        </table>
        """, html)
    deps = scrape("""
        <table summary=''>
        {*
        <tr><td valign=top>Depends:</td>
        <td>{* <a>{{ [deps] }}</a> *}</td>
        </tr>
        *}
        </table>
        """, html)
    revdeps = scrape("""
        <h4>Reverse dependencies:</h4>
        <table summary="">
        {*
        <tr><td valign=top>Reverse depends:</td>
        <td>{* <a>{{ [revdeps] }}</a> *}</td>
        </tr>
        *}
        </table>
        """, html)
    return [imps, deps, revdeps]

if __name__ == "__main__":
    import sys
    pkgdic = mainpage(indexpage)
    pkglinks = map(lambda l: urljoin(indexpage, l), pkgdic["pkglinks"])
    pkgnames = pkgdic["pkgname"]
    # map each package link to its name
    # (range(len(pkgnames)) so the last package is not dropped)
    pkginfos = dict((pkglinks[i], pkgnames[i]) for i in range(len(pkgnames)))
    for link, name in pkginfos.items():
        ret = getDepRedep(link)
        if ret[0] is not None:
            for imp in ret[0]["imps"]:
                sys.stdout.write("%s\t%s\n" % (name, imp))
        if ret[1] is not None:
            for deps in ret[1]["deps"]:
                sys.stdout.write("%s\t%s\n" % (name, deps))
        if ret[2] is not None:
            # reverse depends point the other way: revdeps -> name
            for revdeps in ret[2]["revdeps"]:
                sys.stdout.write("%s\t%s\n" % (revdeps, name))
```
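On some CRAN pages the dependency field comes back as a single comma-separated string (with version constraints in parentheses) rather than as individual links. A minimal sketch of splitting such a string into bare package names; `parse_dep_field` is a hypothetical helper for illustration, not part of scrape.py:

```python
import re

def parse_dep_field(field):
    """Split a CRAN 'Depends'/'Imports' string such as
    'R (>= 2.10), MASS, lattice (>= 0.20)' into bare package names.
    Version constraints in parentheses are dropped; 'R' itself is skipped
    because it is not a package vertex in the dependency graph."""
    names = []
    for part in field.split(","):
        name = re.sub(r"\(.*?\)", "", part).strip()
        if name and name != "R":
            names.append(name)
    return names
```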
Just run “python scrape.py > resultfile.txt”. “resultfile.txt” will then contain the edge list of CRAN packages.
This edge list can easily be consumed by most SNA packages; in my case, igraph.
The R code:
```r
library(igraph)

g <- graph.edgelist(
  matrix(
    scan(file = "resultfile.txt", what = character(0), sep = "\t"),
    ncol = 2, byrow = T))

pr <- data.frame(pkg = as.vector(V(g)$name),
                 pkgindex = as.vector(V(g)),
                 pagerank = page.rank(g)$vector)
pr <- pr[order(pr$pagerank, decreasing = T), ]
head(pr, n = 30)
```
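As a sanity check on what `page.rank` computes, the same score can be approximated with a few lines of power iteration. This is a minimal Python sketch on a toy edge list, not the real CRAN graph, and the `pagerank` function below is my own illustration rather than igraph's implementation:

```python
def pagerank(edges, d=0.85, iters=100):
    """Power-iteration PageRank over a directed edge list.

    edges: list of (src, dst) pairs. As in the scraped data, an edge
    src -> dst means src depends on dst, so heavily-depended-on
    packages (like MASS) accumulate rank.
    """
    nodes = sorted({n for e in edges for n in e})
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # every node keeps a (1 - d) share of uniform "teleport" mass
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src in nodes:
            if out[src]:
                share = d * pr[src] / len(out[src])
                for dst in out[src]:
                    new[dst] += share
            else:
                # dangling node: spread its mass uniformly
                for n in nodes:
                    new[n] += d * pr[src] / len(nodes)
        pr = new
    return pr
```

On a toy list like `[("a", "c"), ("b", "c")]`, the node `c` that both others depend on ends up with the highest score, mirroring how MASS tops the CRAN ranking below.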
“head” shows results like the following:
```
            pkg pkgindex    pagerank
105        MASS      104 0.044617865
115     mvtnorm      114 0.014133438
168      Matrix      167 0.011896352
79      lattice       78 0.009421894
604       rJava      603 0.008692808
4      survival        3 0.008073552
595       Hmisc      594 0.006965379
490        nlme      489 0.006819586
256        lme4      255 0.006506075
339       e1071      338 0.006162781
457         XML      456 0.006084811
18          car       17 0.004911839
574       abind      573 0.004501252
809    sandwich      808 0.004372066
464       TSdbi      463 0.004189485
113    multcomp      112 0.004094053
22      cluster       21 0.003680852
350        mgcv      349 0.003574597
505      fields      504 0.003455358
463         zoo      462 0.003405279
196    maptools      195 0.002977273
397        Rcpp      396 0.002874943
721      digest      720 0.002849751
1097    esd4all     1096 0.002843424
847  pairwiseCI      846 0.002814689
1279    foreach     1278 0.002803063
997  KernSmooth      996 0.002791697
61        rgdal       60 0.002784822
748         pls      747 0.002768121
1154      akima     1153 0.002742152
```
In the R package world, MASS is the most valuable package: it means MASS is most frequently depended on by other valuable packages.
It would be nice to plot the SNA graph from this R package edge list, but I expect that so many vertices would fill all the space. It is better to plot the components of the graph instead.
“PageRank on R CRAN packages” by from __future__ import dream is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.