Using plot.ly for Python to create interactive Gantt-Charts

As a little project, I worked on a visualization of the different regimes in the world. My brother was working on a university project with the same data set, and I was bored and wanted to freshen up my Python skills.

While searching online, I found that there is not much information about laying out these Gantt charts. Gantt charts are useful for visualizing projects, or anything else with a timeline. There may be other uses I haven’t thought of.

The data I used is from this source, a Stata data set. The good thing about it is that the leaders’ names are included, which makes for nice additional information in the graph. The bad thing is that some entries are badly formatted or simply wrong.

Getting the data

import urllib.request

import pandas as pd

# download the file online
file = urllib.request.urlopen("https://uofi.box.com/shared/static/bba3968d7c3397c024ec.dta")

# read the stata file to a pandas dataframe
df_dd = pd.read_stata(file)

The file is relatively big (78 columns and 9159 rows). The columns are quite detailed and, it seems, partly redundant, but we only focus on a few of them.

order	ctryname	year	aclpcode	cowcode	cowcode2	ccdcodelet	ccdcodenum	aclpyear	cowcode2year	...	regime	tt	ttd	tta	flagc	flagdem	flagreg	agedem	agereg	stra
0	1	Afghanistan	1946	142	700.0	700	AFG	1	1421946	7001946	...	5.0	0.0	0.0	0.0	1.0	1.0	1.0	18.0	18.0	0.0
1	2	Afghanistan	1947	142	700.0	700	AFG	1	1421947	7001947	...	5.0	0.0	0.0	0.0	0.0	0.0	0.0	19.0	19.0	0.0
2	3	Afghanistan	1948	142	700.0	700	AFG	1	1421948	7001948	...	5.0	0.0	0.0	0.0	0.0	0.0	0.0	20.0	20.0	0.0
3	4	Afghanistan	1949	142	700.0	700	AFG	1	1421949	7001949	...	5.0	0.0	0.0	0.0	0.0	0.0	0.0	21.0	21.0	0.0
4	5	Afghanistan	1950	142	700.0	700	AFG	1	1421950	7001950	...	5.0	0.0	0.0	0.0	0.0	0.0	0.0	22.0	22.0	0.0
5 rows × 78 columns
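Since only a handful of the 78 columns are used in the rest of this post, the frame could be trimmed right after loading. The following is a minimal sketch, not part of the original script: `keep_relevant` and `KEEP` are hypothetical names, and the toy frame only stands in for the real data.

```python
import pandas as pd

# Column names as they are referenced by the code in this post.
KEEP = ["ctryname", "year", "edate", "ehead", "regime", "exity"]

def keep_relevant(df, columns=KEEP):
    """Return a copy of df restricted to those listed columns that exist in it."""
    present = [c for c in columns if c in df.columns]
    return df[present].copy()

# Toy frame standing in for df_dd (all values are made up).
toy = pd.DataFrame({
    "ctryname": ["X"], "year": [1950], "edate": ["1.1.50"],
    "ehead": ["Someone"], "regime": [5.0], "exity": [2008.0],
    "cowcode": [700.0],
})
small = keep_relevant(toy)
print(list(small.columns))
```

`pd.read_stata` also takes a `columns` argument, so the same list could be passed at load time instead.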

I’m only interested in when a certain regime started and when it ended. As there is no data for the end of a regime, I just used the beginning of the next regime as the end date of the old one. The dataset seems to include two columns with similar information, but I’m happy to be corrected here. The codebook describes the columns “ndate” and “edate” as the date for the nominal head and the date for the effective head. I picked “edate” in my example and created “edate_end” as the end of the regime.

What matters to me is when they came to power, not the years in between, so all the rows without an “edate” are useless to me.

# add a column "edate_end"
df_dd["edate_end"] = ""

# delete all rows where edate is empty
df_dd = df_dd[df_dd.edate != ""]

# reset the index and discard the previous index
df_dd = df_dd.reset_index(drop=True)

The result looks like this:

order	ctryname	year	aclpcode	cowcode	cowcode2	ccdcodelet	ccdcodenum	aclpyear	cowcode2year	...	tt	ttd	tta	flagc	flagdem	flagreg	agedem	agereg	stra	edate_end
0	1	Afghanistan	1946	142	700.0	700	AFG	1	1421946	7001946	...	0.0	0.0	0.0	1.0	1.0	1.0	18.0	18.0	0.0	
1	8	Afghanistan	1953	142	700.0	700	AFG	1	1421953	7001953	...	0.0	0.0	0.0	0.0	0.0	1.0	25.0	1.0	0.0	
2	18	Afghanistan	1963	142	700.0	700	AFG	1	1421963	7001963	...	0.0	0.0	0.0	0.0	0.0	1.0	35.0	1.0	0.0	
3	28	Afghanistan	1973	142	700.0	700	AFG	1	1421973	7001973	...	0.0	0.0	0.0	0.0	0.0	1.0	45.0	1.0	0.0	
4	33	Afghanistan	1978	142	700.0	700	AFG	1	1421978	7001978	...	0.0	0.0	0.0	0.0	0.0	0.0	50.0	6.0	0.0	
5 rows × 79 columns

Formatting the dates properly

The edates are formatted as month.day.year, with the month having either one or two digits. The day is sometimes “00”; in those cases it should be “01”, just so the datetime format that we use later reads it correctly. In a few cases a dot is missing or there is extra text. My cleanup loop tries to find these and asks me to enter the right date manually. It also fills “edate_end”.

# fix typos in the edate column
for index, row in df_dd.iterrows():
    date = row["edate"].split(".")

    # check if the date has 3 parts [MM.DD.YY]
    if len(date) != 3:
        date = input("The date doesn't seem to be in this style: [MM.DD.YY] Please enter the right date format: "+row["edate"]).split(".")

    # format the month
    if len(date[0]) == 2:
        if date[0] == "00":
            date[0] = "01"
    elif len(date[0]) == 1:
        if date[0] == "0":
            date[0] = "1"
        date[0] = "0"+date[0]
    else:
        date[0] = input("Insert the right month: "+date[0])

    # format the day
    if len(date[1]) == 2:
        if date[1] == "00":
            date[1] = "01"
    elif len(date[1]) == 1:
        if date[1] == "0":
            date[1] = "1"
        date[1] = "0"+date[1]
    else:
        date[1] = input("Insert the right day: "+date[1])

    # format the year
    if len(date[2]) == 2:
        if row["year"] <= 1999:
            date[2] = "19"+date[2]
        else:
            date[2] = str(row["year"])
    else:
        # anything other than a two-digit year cannot be expanded automatically
        date[2] = input("Pick the right year: "+date[2])

    # Make a string from the list
    date = ".".join(date)
    # Exchange the old date with the new
    df_dd.at[index,"edate"] = date

    # Use the date of the next regime as "edate_end"
    if df_dd.iloc[index-1]["ctryname"] == row["ctryname"]:
        df_dd.at[index-1,"edate_end"] = date
    else:
        df_dd.at[index-1,"edate_end"] = "12.31."+str(int(df_dd.iloc[index-1]["exity"]))

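# fixing the last row: the first loop iteration writes to label -1, which
# appends a stray row, so the last real regime sits at position -2
df_dd.loc[df_dd.index[-2], "edate_end"] = "12.31.2008"
df_dd = df_dd.drop(df_dd.index[-1])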
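The per-part rules of the loop above can also be collected into one small function, which makes them easy to test without the interactive prompts. This is only a sketch: `normalize_edate` is a hypothetical helper that is not part of the original script, and it returns `None` wherever the loop would have asked for manual input.

```python
def normalize_edate(raw, row_year):
    """Return the date as 'MM.DD.YYYY', or None if it needs a manual fix.

    Hypothetical helper mirroring the loop's rules: a month or day of 0
    becomes 1, single digits are zero-padded, two-digit years are expanded.
    """
    parts = raw.strip().split(".")
    # Expect exactly three numeric parts: month, day, two-digit year.
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    month, day, year = parts
    if len(month) > 2 or len(day) > 2 or len(year) != 2:
        return None
    month_num = max(int(month), 1)
    day_num = max(int(day), 1)
    # Same century rule as the loop: prefix "19" up to 1999, else use the row's year.
    full_year = "19" + year if row_year <= 1999 else str(int(row_year))
    return "{:02d}.{:02d}.{}".format(month_num, day_num, full_year)


print(normalize_edate("7.0.73", 1973))    # day "0" is bumped to "01"
print(normalize_edate("4.27 78", 1978))   # missing dot, needs a manual fix
```

Rows for which the helper returns `None` could be collected and fixed in one go instead of answering prompts one by one.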

I convert the dates to the datetime format so they are easier to calculate with:

# change to DateTime format
df_dd["edate"] = pd.to_datetime(df_dd["edate"], format="%m.%d.%Y")
df_dd["edate_end"] = pd.to_datetime(df_dd["edate_end"], format="%m.%d.%Y")

# change float to integer
df_dd["regime"] = df_dd["regime"].astype(int)
df_dd["year"] = df_dd["year"].astype(int)
df_dd["exity"] = df_dd["exity"].astype(int)
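Once the dates are real datetimes, the "next regime's start is this regime's end" rule can also be written without index bookkeeping. The sketch below is an alternative to the loop, not what the original script does; the toy frame and its values are made up, and using December 31 of “exity” as the fallback is taken from the loop above.

```python
import pandas as pd

# Toy stand-in for df_dd with parsed dates (all values are made up).
toy = pd.DataFrame({
    "ctryname": ["A", "A", "A", "B", "B"],
    "edate": pd.to_datetime(
        ["1950-01-01", "1960-06-15", "1975-03-01", "1948-02-10", "1990-09-30"]
    ),
    "exity": [1991, 1991, 1991, 2008, 2008],
})

# Make sure regimes are in chronological order within each country.
toy = toy.sort_values(["ctryname", "edate"]).reset_index(drop=True)

# Start date of the following regime in the same country (NaT for the last one).
next_start = toy.groupby("ctryname")["edate"].shift(-1)

# Fallback for the last regime of a country: December 31 of the exit year.
fallback = pd.to_datetime(toy["exity"].astype(str) + "-12-31")

toy["edate_end"] = next_start.fillna(fallback)
print(toy)
```

Because `shift(-1)` is applied per country, the last regime of one country never picks up the first date of the next country.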

To work with smaller pieces of the dataframe and cut down the waiting time, I use a working variable “df_work”. In the example below, the slice I made is commented out. Afterwards I create a column “duration”, which is “edate_end” minus “edate”, i.e. the last day in power minus the first day. The result is a timedelta object.

df_work = df_dd#.loc[0:200]
df_work["duration"] = df_work["edate_end"]-df_work["edate"]
df_work["duration_str"] = ""

For the graph later on, the raw number of days a person was in power isn’t very readable. I don’t know if there’s a handier way, but I transformed the durations into strings like this:

# get a clear text instead of timedelta
for index, row in df_work.iterrows():
    total_days = row["duration"].days
    # approximation: a year is 365 days, a month is 1/12 of a year,
    # and the leftover fraction of a month is scaled to 30 days
    years = int(total_days / 365)
    months_frac = (total_days / 365 - years) * 12
    months = int(months_frac)
    days = int((months_frac - months) * 30)
    output = str(years)+" Years "+str(months)+" Months "+str(days)+" Days"
    df_work.at[index,"duration_str"] = output
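The conversion above approximates years and months from a day count. If calendar-accurate values are wanted, one option is `relativedelta` from the `dateutil` package, which pandas already depends on. This is a sketch of an alternative, not the original approach; `duration_text` is a hypothetical helper and the two dates are arbitrary examples.

```python
from datetime import date

from dateutil.relativedelta import relativedelta


def duration_text(start, end):
    """Format the calendar difference between two dates as years, months, days."""
    diff = relativedelta(end, start)
    return "{} Years {} Months {} Days".format(diff.years, diff.months, diff.days)


# Arbitrary example dates, not taken from the dataset.
print(duration_text(date(1973, 7, 17), date(1978, 4, 27)))
# prints "4 Years 9 Months 10 Days"
```

The helper also accepts pandas timestamps, so it could be applied to “edate” and “edate_end” row by row.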

Finally the plot.ly part

After formatting the data I could finally start using plotly. The Gantt chart needs four columns to be displayed properly:

  • Task = in this case the country name
  • Start = the start date (edate)
  • Finish = the finish date (edate_end)
  • Resource = the type of regime, there are 5 types in this dataset
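The post does not show how the `data` frame passed to `create_gantt` further down is built. One plausible way, given the four columns listed above, is to rename the matching columns of the working dataframe. This is an assumption rather than the original code, and the toy frame below only stands in for “df_work” (its values are made up).

```python
import pandas as pd

# Toy stand-in for df_work (all values are made up).
toy_work = pd.DataFrame({
    "ctryname": ["A", "A"],
    "edate": pd.to_datetime(["1950-01-01", "1960-06-15"]),
    "edate_end": pd.to_datetime(["1960-06-15", "1975-03-01"]),
    "regime": [5, 3],
    "ehead": ["Leader one", "Leader two"],
})

# Map the dataset's column names onto the names the Gantt chart expects.
gantt_columns = {
    "ctryname": "Task",
    "edate": "Start",
    "edate_end": "Finish",
    "regime": "Resource",
}
data = toy_work.rename(columns=gantt_columns)[["Task", "Start", "Finish", "Resource"]]
print(data)
```

With the real data, `toy_work` would be replaced by `df_work`. Keeping “Resource” as the integer regime code matches the integer keys of the colors dictionary.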

Time to import the plotly library (of course, this would be at the top of the file…). The little mess down there is due to the different layout approaches I tried; don’t hold me to the specific ones, but I guess it doesn’t matter too much.

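import plotly
import plotly.offline  # used for the offline export at the end
# import plotly.plotly as py  # online API; moved to the separate chart_studio package in plotly 4
import plotly.figure_factory as ff
from plotly.graph_objs import layout
import plotly.graph_objs as go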

The colors for the regimes are defined as a dictionary in “colors”. “fig” holds the figure, and “resource_type” is used later to display the names of the regime types instead of numbers.

# define the colors of the different regime types in the diagram
colors = {0:  "#003bb7", 1: '#0657ff', 2: '#8fb4ff', 3: '#ffda8f', 4: "#ffad06", 5: '#ff6f06'} # showing the groups individually
#colors = {0:"#003bb7",1:'#003bb7',2:'#003bb7',3:'#ffad06',4:"#ffad06",5:'#ffad06'} # distinguish only between dictatorships and democracies
#colors = {0:"#FFFFFF",1:'#FFFFFF',2:'#FFFFFF',3:'#000000',4:"#000000",5:'#000000'} # only dictatorships in black

fig = ff.create_gantt(data,
                      colors=colors,
                      index_col='Resource',
                      show_colorbar=False,
                      title='Regimes in history worldwide',
                      showgrid_x=True,
                      group_tasks=True)

resource_type = {0 : "Parliamentary democracy",
                 1 : "Mixed (semi-presidential) democracy",
                 2 : "Presidential democracy",
                 3 : "Civilian dictatorship",
                 4 : "Military dictatorship",
                 5 : "Royal dictatorship"}

Last but not least I wanted to improve the hover so there is more information.

# improved hover
for i in range(len(fig["data"]) - 2):
    text = "Country: {}<br>Leader: {}<br>Regime type: {}<br>Duration: {}".format(df_work["ctryname"].loc[i], df_work["ehead"].loc[i], resource_type[df_work["regime"].loc[i]], df_work["duration_str"].loc[i])
    fig["data"][i].update(text=text, hoverinfo="text")

And lastly I export it. Notice that I export it offline. First I tried it online via the API, but it’s much faster offline and you don’t have the limitations that the API has. Especially for testing, offline is the way to go. Also notice the extreme height (6000px) I used to display the full plot properly.

fig["layout"].update(autosize=True, margin=dict(l=200), height=6000, width=1800)

plotly.offline.plot(fig, filename='./regimes.html')
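The 6000 px height above is hard-coded. A small helper could derive it from the number of rows in the chart instead. This is only a sketch: `chart_height` is a hypothetical function, and the 30 px per row and 600 px minimum are guesses that would need tuning.

```python
def chart_height(n_rows, px_per_row=30, minimum=600):
    """Return a chart height in pixels that grows with the number of rows."""
    return max(minimum, n_rows * px_per_row)


# With grouped tasks there is one row per country, so the call could look like:
# fig["layout"].update(height=chart_height(df_work["ctryname"].nunique()))
print(chart_height(200))
```

This keeps the plot readable when working on a slice of the data, where a fixed 6000 px would leave a lot of empty space.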

The result of this effort can be seen by clicking HERE. The whole Python file can be found on GitHub, and you’re more than welcome to tell me about changes you’d make in the process 🙂

2 Comments

Victor
February 25, 2019 at 2:27 pm

Hello,
First of all, thank you for your work. But could you please provide the dataframe you get after the date-formatting step? Or the dataframe “data” passed to the create_gantt function? The date-formatting step seems pretty long because of the numerous inputs I have to fill in.
Thanks in advance.

hannes
March 16, 2019 at 9:32 am
– In reply to: Victor

Hej!
Sorry for the late reply, I had been working on something else.
As far as I remember there aren’t too many inputs, but I agree, the dataset is quite poor. I can’t really provide you with the dataframe without going through the whole process myself, as I didn’t save the dataframe permanently. I hope you were able to find a way. Are you using the same dataset about regimes worldwide?
Cheers!
