XML to Relational Mapping

The conversion from the DBLP XML file to a database first creates multiple tsv files which are then loaded into correspondant tables.
The xml is in the following format:

<article mdate="2017-05-28" key="journals/acta/Saxena96">
<author>Sanjeev Saxena</author>
<title>Parallel Integer Sorting and Simulation Amongst CRCW Models.</title>
<pages>607-619</pages>
<year>1996</year>
<volume>33</volume>
<journal>Acta Inf.</journal>
<number>7</number>
<url>db/journals/acta/acta33.html#Saxena96</url>
<ee>https://doi.org/10.1007/BF03036466</ee>
</article><article mdate="2017-05-28" key="journals/acta/Simon83">
<author>Hans Ulrich Simon</author>
<title>Pattern Matching in Trees and Nets.</title>
<pages>227-248</pages>
<year>1983</year>
<volume>20</volume>
<journal>Acta Inf.</journal>
<url>db/journals/acta/acta20.html#Simon83</url>
<ee>https://doi.org/10.1007/BF01257084</ee>
</article>

papers

The table "papers" contains most of the information about each publication including the year of publication, the title, the publication type, the key contained in the xml as well as a newly generated numeric ID. The expriments all work an a view that excludes the homepages from the papers table.
The columns were adopted from the dblp.dtd file. Some of them don't have any values, e.g. the column 'reviewid' (see Errors).

 etype  |                                  title                                   | booktitle |  pages  | publyear | address |  journal  | volume | volnumber |
--------+--------------------------------------------------------------------------+-----------+---------+----------+---------+-----------+--------+-----------+-
article | Parallel Integer Sorting and Simulation Amongst CRCW Models.             |           | 607-619 |     1996 |         | Acta Inf. | 33     | 7         |
article | Pattern Matching in Trees and Nets.                                      |           | 227-248 |     1983 |         | Acta Inf. | 20     |           |
article | NP-complete Problems Simplified on Tree Schemas.                         |           | 171-178 |     1983 |         | Acta Inf. | 20     |           |
article | On the Power of Chain Rules in Context Free Grammars.                    |           | 425-433 |     1982 |         | Acta Inf. | 17     |           |
article | Schnelle Multiplikation von Polynomen über Körpern der Charakteristik 2. |           | 395-398 |     1977 |         | Acta Inf. | 7      |           |

month |                   url                   |                 ee                 | cdrom | cite | publisher | note | crossref | isbn | series | school |
------+-----------------------------------------+------------------------------------+-------+------+-----------+------+----------+------+--------+--------+-
      | db/journals/acta/acta33.html#Saxena96   | https://doi.org/10.1007/BF03036466 |       |      |           |      |          |      |        |        |
      | db/journals/acta/acta20.html#Simon83    | https://doi.org/10.1007/BF01257084 |       |      |           |      |          |      |        |        |
      | db/journals/acta/acta20.html#GoodmanS83 | https://doi.org/10.1007/BF00289414 |       |      |           |      |          |      |        |        |
      | db/journals/acta/acta17.html#Blum82     | https://doi.org/10.1007/BF00264161 |       |      |           |      |          |      |        |        |
      | db/journals/acta/acta7.html#Schonhage77 | https://doi.org/10.1007/BF00289470 |       |      |           |      |          |      |        |        |

chapter |   mdate    | publtype | reviewid | rating |         paperkey          | pagesfrom | pagesto | pid
--------+------------+----------+----------+--------+---------------------------+-----------+---------+-----
        | 2017-05-28 |          |          |        | journals/acta/Saxena96    |       607 |     619 |   0
        | 2017-05-28 |          |          |        | journals/acta/Simon83     |       227 |     248 |   1
        | 2017-05-28 |          |          |        | journals/acta/GoodmanS83  |       171 |     178 |   2
        | 2017-05-28 |          |          |        | journals/acta/Blum82      |       425 |     433 |   3
        | 2017-05-28 |          |          |        | journals/acta/Schonhage77 |       395 |     398 |   4
The following table shows the percentage of usages of the columns (rounded to two decimal places):

column     | etype | title | booktitle | pages | publyear | address | journal | volume | volnumber | month | url   | ee    | cdrom | cite | publisher | note | crossref | isbn | series | school | chapter | mdate | publtype | reviewid | rating | paperkey | pagesfrom | pagesto | pid
-----------+-------+-------+-----------+-------+----------+---------+---------+--------+-----------+-------+-------+-------+-------+------+-----------+------+----------+------+--------+--------+---------+-------+----------+----------+--------+----------+-----------+---------+-----
percentage | 100   | 100   | 54.31     | 89.74 | 100      | 0       | 43.76   | 44.21  | 34.84     | 0.26  | 98.07 | 95.66 | 0.3   | 0.2  | 1.37      | 0.67 | 52.93    | 1.28 | 0.56   | 1.62   | 0       | 100   | 5.17     | 0        | 0      | 100      | 83.22     | 83.22   | 100

authors

The authors are listed in another table called "authors". Every author has a generated numeric ID, a full name, a first name and a last name. The xml provided by dblp does not contain the first and the last name, only the full name, so the first and last name are generated. The last name is defined by the word after the last space in their full name and everything else is the first name. So if the full name does not contain a space, tthe first name is empty and the last name is the full name.

aid |     fullname      |  firstname   | lastname
----+-------------------+--------------+----------
  0 | Sanjeev Saxena    | Sanjeev      | Saxena
  1 | Hans Ulrich Simon | Hans Ulrich  | Simon
  2 | Nathan Goodman    | Nathan       | Goodman
  3 | Oded Shmueli      | Oded         | Shmueli
  4 | Norbert Blum      | Norbert      | Blum

writtenBy

Which author wrote which paper is saved in the table "writtenBy". The author-ID is saved with the paper-ID. So if three authors wrote a paper, the paper will be in the table three times. Aditionally, the position in which the author is named in the list of authors is saved in this table.

pid | aid | apos
----+-----+------
  0 |   0 |    1
  1 |   1 |    1
  2 |   2 |    1
  2 |   3 |    2
  3 |   4 |    1

editedBy

Lastly, the table editedBy ist created which is based on the same context as writtenBy with the difference that it contains the editors and not the authors.

 pid  |  eid  | epos
------+-------+------
54199 | 14155 |    1
54240 | 14155 |    1
54255 | 14155 |    1
54255 |  7610 |    2
54255 | 14125 |    3

Additional tables

For some experiments you need an additional table, "authorPairs", which is created when the database ist created. It shows which authors are coauthors. So in this table, you can see with how many authors an author has worked with by the number of their appearences in the column "aid".

create table authorPairs as select distinct a.aid as author, b.aid as coauthor from writtenBy as a join writtenBy as b on a.pid=b.pid where not(a.aid=b.aid);

  aid   | coauthor
--------+----------
1073981 |  1848050
 760532 |   819062
 539639 |   159025
  15599 |   611378
  94052 |   130135

How to use the database with PostgreSQL

First, you have to install PostgreSQL (i.e. sudo apt-get install postgresql).
You can download the script for setting up your DBLP PostgreSQL database here. If you don't have a PostgreSQL user yet, execute createUser.sh first and specify a username. Then execute buildDatabase.sh. To make the script convert, you need the dblp.xml and the dblp.dtd files in the Source folder.
With the command psql dblp username (change username to the name of your PostgreSQL user), you can log into PostgreSQL and work on your database.

Generating the PostgreSQL database by hand:

Python script for converting the XML file to importable TSV files
SQL script for creating the database schema
Shell command to populate the Database
SQL script for speeding up the Database