DBLP-Analysis

Publications per year

Over the years, the amount of publications grew fast. For this graph, the different kinds of publications are differenciated. The two most common ones are articles and inproceedings, that's why this diagram shows the three different kinds articles, inproceedings and others.

The diagram was created using the following R-script:

#read the tables
s <- read.table(file=file.path('all.tsv'),skip=2,sep='|')
t <- read.table(file=file.path('article.tsv'),skip=2,sep='|')
u <- read.table(file=file.path('inproceedings.tsv'),skip=2,sep='|')

#assign second columns (first column contains years and second column contains number of publications)
x = t[,2]
y = u[,2]
z = s[,2]-(x+y) #subtract articles and inproceedings from all publications

#create matrix
data=matrix(c(x,y,z), nrow=3, byrow=TRUE)
rownames(data)=c("articles", "inproceedings", "other")

#set output file
jpeg(filename="publications.jpeg", width=1000, height=1000, pointsize=27)

#plot
a <- barplot( data, xlab="Year", ylab="Publications", ylim=c(0,300000), col=c("deeppink4","mediumblue", "gray40"))
axis(1,at=seq(a[1],a[length(a)],6), labels=seq(min(unlist(t[,1])),max(unlist(t[,1])),5))
legend(x="topleft",legend=rownames(data), fill=c("deeppink4","mediumblue", "gray40"))
grid(nx=7,ny=NULL,col="lightgray")

invisible(dev.off())

This diagram was generated using the following SQL query over the db generated by our relational mapping using the dblp.xml file from the dblp Web page (downloaded at 17th Apr 2018).
Note that the queries given here are written for PostgreSQL and some of them contain PostgreSQL-specific functions, so if you want to use another database, you might have to make some adjustments for the query to work.

To use the R-Files, you need to write the output into a tsv file. For that you can use the following shell command:

psql -U $USER_NAME -d dblp -c "QUERY" > DIR/FILE.tsv

SELECT publYear, COUNT(publYear)
FROM   papers
WHERE  publYear BETWEEN 1980 AND 2017
GROUP  BY publYear
ORDER  BY publYear

SELECT publYear, COUNT(publYear)
FROM   papers
WHERE  etype = 'inproceedings'
AND    publYear BETWEEN 1980 AND 2017
GROUP  BY publYear
ORDER  BY publYear

SELECT publYear, COUNT(publYear)
FROM   papers
WHERE  etype = 'article'
AND    publYear BETWEEN 1980 AND 2017
GROUP  BY publYear
ORDER  BY publYear

Authors per publication

The largest amount of publications is written by exactly two authors. Additionally, 95% of publications are written by five or less authors.

The diagram was created using the following R-script:

#read tables
s <- read.csv(file=file.path('all.tsv'),skip=2,sep='|')
t <- read.csv(file=file.path('article.tsv'),skip=2,sep='|')
u <- read.csv(file=file.path('inproceedings.tsv'),skip=2,sep='|')

#set number of bars shown in diagram
numberofbars=10
#assign second columns (first column contains number of authors and second column contains number of publications)
x = head(t[,2],numberofbars)
y = head(u[,2],numberofbars)
z = head(s[,2],numberofbars)-(x+y) #subtract articles and inproceedings from all publications
#create matrix
data=matrix(c(x,y,z), nrow=3, byrow=TRUE)
rownames(data)=c("articles", "inproceedings", "other")

#set output file
jpeg(filename="authors_per_publication.jpeg", width=1000, height=1000, pointsize=27)

#plot
a <- barplot( data, xlab="Authors", ylab="Publications", ylim=c(0,max(unlist(t[,2]))), col=c("deeppink4","mediumblue", "gray40"))
axis(1, at=a, labels=c(1:numberofbars))
legend(x="topright",legend=rownames(data), fill=c("deeppink4","mediumblue", "gray40"))
grid(nx=0,ny=NULL,col="lightgray")

invisible(dev.off())

SELECT authors, COUNT(authors)
FROM    (SELECT  COUNT(aid) AS authors
         FROM    writtenby
         JOIN    papers
                 ON papers.pid = writtenby.pid
       	 GROUP   BY papers.pid)x
WHERE  etype = 'article'
GROUP  BY authors
ORDER  BY authors

SELECT authors, COUNT(authors)
FROM    (SELECT  COUNT(aid) AS authors
         FROM    writtenby
         JOIN    papers
                 ON papers.pid = writtenby.pid
         GROUP   BY papers.pid)x
WHERE  etype = 'inproceedings'
GROUP  BY authors
ORDER  BY authors

SELECT authors, COUNT(authors)
FROM    (SELECT  COUNT(aid) AS authors
         FROM    writtenby
         GROUP   BY pid)x
GROUP  BY authors
ORDER  BY authors

Average node degree per year

As you can see, the average degree of coauthors has changed over time. The query does not accumulate the coauthors from the prior years, the number of coauthors is the average number of authors an author has worked with in that specific year.

The diagram was created using the following R-script:

#read table
t <- read.table(file=file.path('average_degree.tsv'),skip=2,sep='|')

#assign first and second column (first column contains the years and second column contains the degree)
x=t[,1]
y=t[,2]

#set output file
jpeg(filename="average_degree.jpeg", width=1000, height=1000, pointsize=27)

#plot
plot(0,0,xlab="Year",ylab="Average Degree", col="darkblue", ylim=c(0,max(unlist(y))), xlim=c(1950,10*ceiling(max(unlist(x))/10)))
grid(NULL,NULL,col="lightgray")
lines(x,y, col="red", lwd=3)

invisible(dev.off())

SELECT publyear, AVG(coauthors)
FROM   (SELECT   DISTINCT a.aid, COUNT(a.aid) AS coauthors, publYear
        FROM     papers, writtenby AS a
        JOIN     writtenby AS b
                 ON a.pid = b.pid
        WHERE    NOT a.aid = b.aid AND papers.pid = a.pid
        GROUP    BY publYear, a.aid)x
WHERE  publYear <= 2017
GROUP  BY publYear
ORDER  BY publYear

Degree distribution

This experiment analyzes the number of coauthors an author has. The diagram shows the frequency of a specific number of coauthors (node degree). The amount of coauthors is the number of authors an author has ever worked with and is not paper-specific. This experiment does not include soloauthors, which are authors that have never worked with another author.

The diagram was created using the following R-script:

#read table
t <- read.table(file=file.path('degree_distribution.tsv'),skip=2,sep='|')

#assign first and second column (first column contains the node degrees and second column contains the frequency)
x=head(t[,1],100)
y=head(t[,2],100)

#assign output file
jpeg(filename="degree_distribution.jpeg", width=1000, height=1000, pointsize=27)

#plot
plot(x, y, xlim=c(0,100), ylim=c(0, max(y)), xlab="Node Degrees", ylab="Frequency")

invisible(dev.off())

SELECT coauthors, COUNT(coauthors)
FROM   (SELECT COUNT(DISTINCT coauthor) AS coauthors
        FROM   authorpairs
        GROUP  BY aid)y
GROUP  BY coauthors
ORDER  BY coauthors

Logarithmic degree distribution

When shown logarithmic, the results from the experiment above show that the degree distribution follows a power law.

The diagram was created using the following R-script:

#read table
t <- read.table(file=file.path('degree_distribution.tsv'),skip=2,sep='|')

#assign first and second column (first column contains the node degrees and second column contains frequency)
x=head(t[,1],100)
y=head(t[,2],100)

#set output file
jpeg(filename="degree_distribution_log.jpeg", width=1000, height=1000, pointsize=27)

#plot
plot(log(x), log(y), xlim=c(0,5), ylim=c(5,13), xlab="LOG(Node Degrees)", ylab="LOG(Frequency)")

invisible(dev.off())

Alphabetical publications

In Mathematics, the authors of papers are usually ordered alphabetically and since Computer Science originated from Mathematics, the question is whether the same applies for Computer Science. The diagram shows the result for this question. The percentage of alphabetically ordered papers started at almost 100 percent and decreased over time. For this experiment, you need to generate a new table called position, which is similar to writtenBy with the difference that the authorposition is ordered by the last name of the authors.

The diagram was created using the following R-script:

#read tables
s <- read.csv(file=file.path('all.tsv'),skip=2,sep='|')
t <- read.csv(file=file.path('article.tsv'),skip=2,sep='|')
u <- read.csv(file=file.path('inproceedings.tsv'),skip=2,sep='|')

#assign first columns (years)
x1=s[,1]
y1=t[,1]
z1=u[,1]
#assign second columns (percentage of alphabetical papers)
x2=s[,2]
y2=t[,2]
z2=u[,2]

#set output file
jpeg(filename="authorposition.jpeg", width=1000, height=1000, pointsize=27)

#set minimum of x axis
xmin=max(min(unlist(x1)), min(unlist(y1)), min(unlist(z1)))

#plot
plot(1, xlab="Year", ylab="Percent", col="darkblue", ylim=c(0,100), xlim=c(xmin,10*ceiling(max(unlist(x1))/10)))
lines(x1,x2, col="gray40", lwd=5)
lines(y1,y2, col="deeppink4", lwd=5)
lines(z1,z2, col="mediumblue", lwd=5)
grid(NULL, NULL, col="lightgray")
legend("bottomleft", fill=c("gray40","deeppink4","mediumblue"), legend=c("all", "articles", "inproceedings"))

invisible(dev.off())

CREATE TABLE position AS
  (SELECT pid, writtenBy.aid, Rank()
          OVER (partition BY pid ORDER BY lastname)
   FROM   authors
   JOIN   writtenBy
          ON writtenBy.aid = authors.aid
   ORDER  BY pid)

SELECT     y.publYear, 100 * (1 - (COALESCE(x.c,0)) / y.c)
FROM       (SELECT publYear, 1.0 * COUNT(DISTINCT position.pid) AS c
            FROM   (papers
                    JOIN   position
                           ON papers.pid = position.pid)
                    JOIN   writtenBy
                           ON (position.pid, position.aid) = (writtenBy.pid, writtenBy.aid)
            WHERE  NOT apos = rank
            GROUP  BY publYear)x
RIGHT JOIN (SELECT publYear, 1.0 * COUNT(DISTINCT position.pid) AS c
            FROM   papers
            JOIN   position
                   ON papers.pid = position.pid
            GROUP  BY publYear)y
           ON x.publYear = y.publYear
WHERE      x.publYear <= 2017
ORDER      BY publYear

Small World

When you take a few randomly selected authors and determine their number of coauthors and the number of authors these coauthors have worked with and continue this until there are no new authors left which are in any way connected to the first author, you will get the result that the most authors an author is connected to are at the distance of six (on average). We randomly selected 20 authors for this experiment. For every one of them, we determined the amount of coauthors this author has at the given distance. The average of coauthors at a given distance is then divided by the total number of authors, which results in a probability for two authors to be coauthors of a specific distance.

The diagram was created using the following R-script:

#read table
t <- read.table(file=file.path('smallworld.tsv'),skip=2,sep='|')

#assign first and second column (first column contains distance, second column contains probability)
x=t[,1]
y=t[,2]

#set output file
jpeg(filename="smallworld.jpeg", width=1000, height=1000, pointsize=27)

#plot
plot(x,y,log="y",type="o",pch=16, col="darkblue", lwd=3,xlab="Distance",ylab="Probability", xlim=c(min(x),max(x)), xaxt="n")
axis(1,at=x)
grid(NULL,NULL,col="lightgray")

invisible(dev.off())

SELECT distance, AVG(coauthors)/(SELECT COUNT(*) FROM authors)
FROM   (SELECT distance, COUNT(coauthor) AS coauthors
        FROM   Generate_series(1,10) AS i, lateral (
	       WITH RECURSIVE ta(n, coauthor, path, cycle) AS
               (SELECT 0 AS n, x.aid, array[x.aid], false
                FROM   (SELECT i, (i - i + random() * (SELECT MAX(aid) FROM authors))::integer AS aid)x
                UNION  ALL
                SELECT n+1, a.coauthor, path || a.coauthor, a.coauthor = ANY(path)
                FROM   authorpairs a INNER JOIN ta t ON a.aid = t.coauthor WHERE NOT cycle)
                SELECT DISTINCT MIN(n) AS distance, coauthor, i
                FROM   ta
                GROUP  BY coauthor)y
        WHERE  NOT distance=0
        GROUP  BY distance, y.i
        ORDER  BY y.i)z
GROUP  BY distance
ORDER  BY distance

Keyword searching

Publications with the given Keyword in the title: No Keyword!

Here, you can search for the number of titles with a given keyword in them. There are two ways of searching keywords. You can either search for a single trendword query (it can contain multiple words), enter it above. The shown graph will distinguish between articles, inproceedings and other publication types. To search for multiple trendword queries, please enter them seperated with semicolons, i.e. "graph;tree". This diagram will not distinguish between different publication types, but it will show all the searched trendwords in the same graph. The search is case insensitive.

R-script for a single trendword

The diagram for a single trendword was created using the following R-script:

#read argument, argument is the keyword
args <- commandArgs(TRUE)
b <- args[1]
#read tables
s <- read.csv(file=file.path(paste(b,'.tsv', sep="")),skip=2,sep='|')
t <- read.csv(file=file.path(paste(b,'article.tsv', sep="")),skip=2,sep='|')
u <- read.csv(file=file.path(paste(b,'inproceedings.tsv', sep="")),skip=2,sep='|')

#determine start of diagram
min=1
a=s[,2]
for(i in a){
  if(i>0){
    break
  }
  min=min+1
}
if(min>length(a)){
  min=1
}
if(min>9){
  min=min-(t[,1][min] %% 10)
}

#assign second columns (first column contains year and second column contains number of publications)
w=t[,2][min:length(a)]
x=u[,2][min:length(a)]
y=s[,2][min:length(a)]-(w+x) #subtract articles and inproceedings from all publications
#create matrix
data=matrix(c(w,x,y), nrow=3, byrow=TRUE)
rownames(data)=c("articles", "inproceedings", "other")

#set minimum of x axis
xmin=10*floor(s[,1][min]/10)

#set output file
jpeg(filename=paste(b,".jpeg", sep=""), width=1000, height=1000, pointsize=27)

#plot
c <- barplot( data, xlab="Year", ylab="Publications", col=c("deeppink4","mediumblue", "gray40"))
legend(x="topleft",legend=rownames(data), fill=c("deeppink4","mediumblue", "gray40"))
axis(1,at=seq(c[1],c[length(c)], 6), labels=seq(xmin, xmin+length(c)-1, 5))
grid(nx=7,ny=NULL,col="lightgray")

invisible(dev.off())

R-scrript for multiple trendwords

The diagram for multiple trendwords was created using the following R-script:

#read arguments, arguments are all the filenames followed by all the keywords to be written in the legend
args <- commandArgs(TRUE)

wordsx <- list()
wordsy <- list()
words <- list()
filename=""
mins <- list()

#read every table and determine start of diagram
for(i in 1:length(args)){
  b <- args[i]
  t <- read.table(file=file.path(paste(b,".tsv", sep="")), skip=2,sep='|')
  x=t[,1]
  y=t[,2]
  words[i]=b
  filename=paste(filename,b,sep="")
  min=1
  for(a in y){
    if(a>0){
      break
    }
    min=min+1
  }
  wordsx[[i]]=x
  wordsy[[i]]=y
  mins[i] = min
}

#assign minima and maxima of x and y axis
if(min(unlist(mins))>length(wordsx[[1]])){
  xmin=wordsx[[1]][[1]]
} else{
  xmin= 10*floor(wordsx[[1]][[min(unlist(mins))]]/10)
}
xmax= 10*ceiling(max(unlist(wordsx))/10)
ymax= max(unlist(wordsy))

#assign output file
jpeg(filename=paste(filename,".jpeg", sep=""), width=1000, height=1000, pointsize=27)

#plot
plot(1, xlab="Year", ylab="Publications", col="darkblue", xlim=c(xmin,xmax), ylim=c(0,ymax))
cl <- rainbow(length(wordsx))
for(i in 1:length(args)){
  lines(wordsx[[i]],wordsy[[i]], col=cl[i], lwd=3)
}
grid(nx=NULL,ny=NULL,col="lightgray")
legend("topleft", fill=cl, legend=words)

invisible(dev.off())

SELECT    x.publYear, COALESCE(COUNT,0)
FROM      (SELECT DISTINCT publYear
           FROM   papers
           WHERE  publYear IS NOT NULL)x
LEFT JOIN (SELECT publYear, COUNT(*) AS COUNT
           FROM   papers
           WHERE  title ILIKE '%keyword%' AND publYear IS NOT NULL
           GROUP  BY publYear)y
          ON x.publYear = y.publYear
WHERE     x.publYear <= 2017
ORDER     BY publYear

Coauthorgraph

1950

1955

1960

1965

1970

1975

1980

1985

1990

These graphs show the evolution of the connection of authors. Each node represents an author. If two authors have ever worked together up to the given year, they are connected by an edge. Given that the number of authors and the average number of couthors have increased over time, the coauthor graph grows very fast.

The graphs were created using Gephi. with the following instructions:
1. Open first graph in Gephi.
2. Click on "Data Laboratory".
3. Choose the button "Nodes" inside of "Data Table".
4. Select all nodes and right-click -> "Edit all nodes".
5. A sidebar should appear on the left side. Click the "..." Button at "Interval" and specify an interval in the new text field in the format "[n,m]" (note that m>n, e.g. [0,10]).
6. Do the same for all edges (click "Edges" on the top, select all and specify interval).
7. Open the next graph ("File -> Open..."). Select "append" instead of "new graph" in the dialog.
8. Select all Nodes/Edges in the Data Laboratory that don't have an interval and specify the next interval (e.g. [1,10]).
9. Repeat steps 7 and 8 for all graphs.
10. Click on "Overview" to show the graph. On the left, you can select different layout-algorithms.
11. Click on "Enable Timeline" on the bottom.
12. Click on the "Settings" symbol and "set Costum Bounds". Now select an interval so that it always contains just one step.
13. Now you can move the box in the timeline and build the graph.
14. To save the current image, click on "Preview" at the top. On the left, you can configure the format of the preview and in the drop-down box at the top, you can select some defaults.
15. With "Refresh", you can refresh the preview and export it with "export SVG/PNG/PDF".
16. Repeat that for every version of the graph and you'll get graphs that grow but keep the same layout.

SELECT DISTINCT a.aid, b.aid
FROM   papers, writtenby AS a
JOIN   writtenby AS b
       ON a.pid = b.pid
WHERE  NOT a.aid = b.aid AND papers.pid = a.pid AND publyear <= 1950