The conversion from the DBLP XML file to a database first creates multiple TSV files, which are then loaded into the corresponding tables.
The XML consists of a root dblp element containing one child element per record (article, inproceedings, proceedings, book, and so on), each of which in turn contains child elements such as author, title and year.
The table "papers" contains most of the information about each publication, including the year of publication, the title, the publication type, the key from the XML, and a newly generated numeric ID. The experiments all work on a view that excludes the homepages from the papers table.
The columns were adopted from the dblp.dtd file. Some of them don't have any values, e.g. the column 'reviewid' (see Errors).
The authors are listed in another table called "authors". Every author has a generated numeric ID, a full name, a first name, and a last name. The XML provided by DBLP only contains the full name, so the first and last name are derived from it: the last name is the word after the last space in the full name, and everything before it is the first name. If the full name does not contain a space, the first name is empty and the last name is the full name.
aid | fullname | firstname | lastname
----+-------------------+--------------+----------
0 | Sanjeev Saxena | Sanjeev | Saxena
1 | Hans Ulrich Simon | Hans Ulrich | Simon
2 | Nathan Goodman | Nathan | Goodman
3 | Oded Shmueli | Oded | Shmueli
4 | Norbert Blum | Norbert | Blum
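The splitting rule can be sketched as follows (a minimal illustration only, not the converter itself; the function name split_name is made up):

```python
def split_name(fullname):
    """Split a DBLP full name into (firstname, lastname).

    The last name is the word after the last space; everything
    before it is the first name. A single-word name yields an
    empty first name.
    """
    parts = fullname.rsplit(" ", 1)
    if len(parts) == 1:
        return "", parts[0]
    return parts[0], parts[1]

print(split_name("Hans Ulrich Simon"))  # ('Hans Ulrich', 'Simon')
print(split_name("Prince"))             # ('', 'Prince')
```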
writtenBy
Which author wrote which paper is saved in the table "writtenBy". The author ID is saved together with the paper ID, so if three authors wrote a paper, the paper appears in the table three times. Additionally, the position at which the author appears in the paper's author list is saved in this table.
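Building these rows from an ordered author list can be sketched like this (illustrative only; the helper written_by_rows and the sample IDs are invented):

```python
def written_by_rows(pid, author_ids):
    """Return one (pid, aid, apos) tuple per author, with the
    position apos starting at 1 in the order the authors are listed."""
    return [(pid, aid, pos) for pos, aid in enumerate(author_ids, start=1)]

# A paper with three authors appears three times:
print(written_by_rows(42, [7, 3, 11]))
# [(42, 7, 1), (42, 3, 2), (42, 11, 3)]
```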
Lastly, the table "editedBy" is created, which is analogous to writtenBy except that it contains the editors instead of the authors.
For some experiments you need an additional table, "authorPairs", which is created together with the database. It shows which authors are coauthors, so the number of times an author's ID appears in the column "author" tells you how many authors they have worked with.
create table authorPairs as select distinct a.aid as author, b.aid as coauthor from writtenBy as a join writtenBy as b on a.pid=b.pid where not(a.aid=b.aid);
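The same self-join can be tried out on a toy dataset with SQLite (a sketch for illustration only; the sample IDs are made up, while the column names follow the schema above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("create table writtenBy (pid int, aid int, apos int)")
# Paper 1 by authors 0 and 1, paper 2 by authors 1 and 2.
cur.executemany("insert into writtenBy values (?, ?, ?)",
                [(1, 0, 1), (1, 1, 2), (2, 1, 1), (2, 2, 2)])
cur.execute("""create table authorPairs as
               select distinct a.aid as author, b.aid as coauthor
               from writtenBy as a join writtenBy as b on a.pid = b.pid
               where not (a.aid = b.aid)""")
cur.execute("select author, count(*) from authorPairs group by author order by author")
print(cur.fetchall())  # author 1 has two coauthors: [(0, 1), (1, 2), (2, 1)]
```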
First, you have to install PostgreSQL (e.g. sudo apt-get install postgresql).
You can download the script for setting up your DBLP PostgreSQL database here. If you don't have a PostgreSQL user yet, execute createUser.sh first and specify a username. Then execute buildDatabase.sh. For the conversion to work, the dblp.xml and dblp.dtd files need to be in the Source folder.
With the command psql dblp username (replace username with the name of your PostgreSQL user), you can log into PostgreSQL and work on your database.
Generating the PostgreSQL database by hand:
Python script for converting the XML file to importable TSV files
import xml.sax
import codecs
# Record types (top-level elements), child elements, and XML attributes,
# taken over from dblp.dtd.
tl = ['article', 'inproceedings', 'proceedings', 'book', 'incollection', 'phdthesis', 'mastersthesis', 'www']
at = ['title', 'booktitle', 'pages', 'year', 'address', 'journal', 'volume', 'number', 'month', 'url', 'ee', 'cdrom', 'cite', 'publisher', 'note', 'crossref', 'isbn', 'series', 'school', 'chapter']
sl = ['mdate', 'publtype', 'reviewid', 'rating', 'key']
pages = ['begin', 'end']
csvfields = ['etype'] + at + sl + pages
csvlengths = {}
for f in csvfields + tl + ['author', 'editor']:
    csvlengths[f] = 0
writtenBy = codecs.open("writtenBy.tsv", "w", "utf-8")
papers = codecs.open("papers.tsv", "w", "utf-8")
authors = codecs.open("authors.tsv", "w", "utf-8")
editedBy = codecs.open("editedBy.tsv", "w", "utf-8")
lengths = codecs.open("lengths.tsv", "w", "utf-8")
authorFirstNameLength = 0
authorLastNameLength = 0
class DBLPXMLHANDLER(xml.sax.ContentHandler):
    cfields = {}
    distAuthors = {}
    cval = ""
    paperCounter = 0
    authorCounter = 0
    authorID = 0

    def startElement(self, name, attrs):
        if name in tl:
            # A new publication record: reset all fields to the
            # PostgreSQL NULL marker \N, then copy the XML attributes.
            self.cfields.clear()
            self.cval = ""
            self.cfields['anum'] = 1
            self.cfields['etype'] = name
            for s in tl + at + sl + pages:
                self.cfields[s] = '\\N'
            for (k, v) in attrs.items():
                self.cfields[k] = v
        if name in ['author', 'editor'] + csvfields:
            self.cval = ""
    def characters(self, content):
        self.cval = self.cval + content

    def endElement(self, name):
        # Track the maximum length of every column for the schema.
        if name in (tl + csvfields) and not self.cval.isdigit() and csvlengths[name] < len(self.cval):
            csvlengths[name] = len(self.cval)
        # Editors and authors share the same authors.tsv, but are linked
        # to papers through editedBy and writtenBy respectively.
        if name == 'author' or name == 'editor':
            global authorFirstNameLength
            global authorLastNameLength
            if self.cval in self.distAuthors:
                authorID = self.distAuthors[self.cval]
            else:
                self.distAuthors[self.cval] = self.authorCounter
                authorID = self.authorCounter
                self.authorCounter += 1
                authorName = self.cval.split()
                # The last word is the last name, everything before it
                # is the first name (NULL for single-word names).
                authorFirstName = " ".join(authorName[:-1])
                authorLastName = authorName[-1]
                if authorFirstName == "":
                    authorFirstName = "\\N"
                if len(authorFirstName) > authorFirstNameLength:
                    authorFirstNameLength = len(authorFirstName)
                if len(authorLastName) > authorLastNameLength:
                    authorLastNameLength = len(authorLastName)
                authors.write(str(authorID) + "\t" + self.cval + "\t" + authorFirstName + "\t" + authorLastName + "\n")
            if name == 'author':
                writtenBy.write(str(self.paperCounter) + "\t" + str(authorID) + "\t" + str(self.cfields['anum']) + "\n")
            else:  # name == 'editor'
                editedBy.write(str(self.paperCounter) + "\t" + str(authorID) + "\t" + str(self.cfields['anum']) + "\n")
            self.cfields['anum'] = self.cfields['anum'] + 1
        if name in at:
            if name == 'pages':
                # Split page ranges such as "123-456" into begin and end.
                pageArray = self.cval.split('-')
                if len(pageArray) == 2 and pageArray[0].isdigit() and pageArray[1].isdigit():
                    self.cfields['begin'] = pageArray[0]
                    self.cfields['end'] = pageArray[1]
            self.cfields[name] = self.cval
        if name in tl:
            # A record element closes: write one row to papers.tsv.
            line = []
            for f in csvfields:
                line.append(self.cfields.get(f, ''))
            papers.write('\t'.join(line))
            papers.write('\t' + str(self.paperCounter))
            self.paperCounter = self.paperCounter + 1
            papers.write('\n')
parser = xml.sax.make_parser()
parser.setContentHandler(DBLPXMLHANDLER())
parser.parse(open("../Source/dblp.xml", "rb"))
# The script also records the maximum length of each column; these
# values are stored in lengths.tsv.
for key in csvlengths:
    lengths.write(key + "\t" + str(csvlengths[key]) + "\n")
lengths.write("FirstName\t" + str(authorFirstNameLength) + "\n")
lengths.write("LastName\t" + str(authorLastNameLength) + "\n")
papers.close()
authors.close()
writtenBy.close()
editedBy.close()
lengths.close()
SQL script for creating the database schema
drop table if exists writtenBy;
drop table if exists editedBy;
drop view if exists papers;
drop table if exists papersOriginal;
drop table if exists authors;
-- The column lengths are taken from the data in lengths.tsv. This still has to be done by hand; we might work on an automatic approach later.
create table papersOriginal (
    etype varchar(20),
    title varchar(1642),
    booktitle varchar(210),
    pages varchar(32),
    publYear int,
    address varchar(9),
    journal varchar(76),
    volume varchar(50),
    volNumber varchar(50),
    month varchar(26),
    url varchar(272),
    ee varchar(254),
    cdrom varchar(51),
    cite varchar(48),
    publisher varchar(163),
    note varchar(350),
    crossref varchar(39),
    isbn varchar(19),
    series varchar(136),
    school varchar(150),
    chapter int,
    mdate varchar(100),
    publType varchar(100),
    reviewid varchar(100),
    rating varchar(100),
    paperKey varchar(100),
    pagesFrom int,
    pagesTo int,
    pid int primary key
);
-- The 'www' entries are unimportant to the data, as they hold no academic value, so all experiments are run on the papers view. We still include the full data in papersOriginal to enable unbiased experiments.
create view papers as (select * from papersOriginal where not etype='www');
create table authors (aid int primary key, fullName varchar(100), firstName varchar(100), lastName varchar(100));
create table writtenBy(pid int references papersOriginal, aid int references authors, apos int);
create table editedBy (pid int references papersOriginal, eid int references authors, epos int);
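The varchar widths above still have to be copied from lengths.tsv by hand. A sketch of how that step could be automated (the file format is the one the converter writes; the function name varchar_widths is invented):

```python
import csv

def varchar_widths(path):
    """Read lengths.tsv (one "column name <tab> max length" pair per
    line) and return a dict usable for generating varchar(n) columns."""
    widths = {}
    with open(path, newline="") as f:
        for name, length in csv.reader(f, delimiter="\t"):
            widths[name] = int(length)
    return widths

# e.g. "title varchar(%d)" % varchar_widths("lengths.tsv")["title"]
```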
Shell command to populate the Database
# Load papers.tsv into the papersOriginal table; the other TSV files are loaded analogously, each into the table of the same name.
psql --host=localhost --dbname=dblp --username=$USER_NAME -c "COPY papersOriginal FROM '$DIR/papers.tsv' delimiter E'\t'"
SQL script for speeding up the Database
drop table if exists authorPairs;
create index completeWrittenBy on writtenBy(pid, aid, apos);
create index author on writtenBy(aid);
create index id on writtenBy(pid);
create index paperTitle ON papersOriginal(title);
create table authorPairs as select distinct a.aid as author, b.aid as coauthor from writtenBy as a join writtenBy as b on a.pid=b.pid where not(a.aid=b.aid);