Jupyter Notebook: Python or R — Or Both?
I was analytically betwixt and between a few weeks ago. Most of my Jupyter Notebook work is done in either Python or R. Indeed, I like to demonstrate the power of each platform to myself by recoding R work in Python and vice versa.
I must have a dozen active notebooks, some developed in Python and some in R, that investigate the performance of the stock market using indexes like the S&P 500, Wilshire 5000, and Russell 3000. These index data can be readily downloaded from the web, using core functionality of either platform, for subsequent analytic processing and display. I’ve learned a lot about both platforms, and about the stock market, developing such notebooks. Alas, for the latest exercise I’d envisioned, I wished to look at Russell 3000 data, available in a Python notebook, through the lens of analytics I’d coded for the Wilshire 5000 in R. The mishmash of notebook kernel and stock market index combinations I possessed didn’t meet my needs.
[Related Article: Snakes in a Package: Combining Python and R with Reticulate]
I figured I had a couple of choices. I could either re-write the R graphics code in Python using the ever-improving Seaborn library, or I could adapt the R ggplot Wilshire code to interoperate with Russell 3000 data using the Python library rpy2 and RMagic — in effect engaging R within Python. I decided to do the latter.
R and Python are two of the top analytics platforms. Both are open source, both have large user bases, and both have incredibly productive ecosystems. In addition, they interoperate: Python developers can use the rpy2 library to include R code in their scripts, while R developers have access to Python via the reticulate package. There’s also the feather package, available in both, for sharing data across platforms. I fully expect even more seamless collaboration between them in the near future.
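As a taste of the feather route, here is a minimal sketch of the Python side of a cross-platform hand-off (it assumes the pyarrow dependency is installed; the file name and sample values are my own):

import pandas as pd

# write a small dataframe to a feather file from Python/Pandas
df = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=3),
                   "level": [100.0, 101.5, 99.8]})
df.to_feather("shared.feather")
# an R session can then read the same file with feather::read_feather("shared.feather")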
For the analysis that follows, I focus on the performance of the FTSE Russell 3000 index using Python. I first download two files, a year-to-date and a history, that together provide final daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data to build the final Pandas dataframe. From there, I build R dataframes showing the growth of $1 invested over time, suitable for ggplot2 charting.
[Related Article: Introduction to R Shiny]
The technology used below is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, R 3.6.0, and rpy2 2.9.4. This notebook’s kernel is Python 3 and uses the rpy2 library to enable R processing. In this instance, the initial data work is done in Python/Pandas, then handed off for graphics to the splendid R ggplot2 library.
Tone down the warnings issued by pandas filter operations.
In [1]:
import warnings
warnings.filterwarnings("ignore")
print("\n\n")
Add a local directory housing a personal library to the import path.
In [2]:
import sys

functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")
Load the personal library. The prarr and blanks functions are used throughout this article; a hypothetical sketch of the two helpers follows the cell below.
In [3]:
from myfuncs import *
blanks(2)
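myfuncs itself isn’t shown in the article, so here is a hypothetical sketch of what the two helpers might look like; the real implementations may differ:

def blanks(n):
    # print n blank lines to space out notebook output
    print("\n" * n, end="")

def prarr(df, rows=5):
    # show the head and tail of a dataframe, as in the output below
    print(df.head(rows), "\n")
    print(df.tail(rows))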
Import other relevant Python libraries.
In [4]:
import os
import pandas as pd
import numpy as np
blanks(2)
Set and migrate to the new working directory.
In [5]:
owd = os.getcwd()
nwd = "c:/data/russell/potus"
os.chdir(nwd)
print(os.getcwd())
blanks(2)

c:\data\russell\potus
Establish a list of URLs of the CSV files to be downloaded from the FTSE Russell website. The two files are year-to-date (ytd) and history (hist) versions of the daily Russell 3000 index levels. The subsequent cells read them directly; a more defensive download variant is sketched after the cell below.
In [6]:
urllst = [
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valuesytd_US3000.csv",
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valueshist_US3000.csv"
]
blanks(2)
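pd.read_csv can read straight from these URLs, as the next cell shows. If the endpoints are slow or flaky, it may help to download defensively first. A minimal sketch using the requests library, relying on the pandas import above (the retry count and timeout are my own choices, not from the article):

import io
import requests

def fetch_csv(url, tries=3, timeout=30):
    # attempt the download a few times before giving up, then hand
    # the bytes to pandas just as pd.read_csv(url) would
    for attempt in range(tries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return pd.read_csv(io.BytesIO(resp.content))
        except requests.RequestException:
            if attempt == tries - 1:
                raise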
Load the ytd file into a pandas dataframe. Do a little munging.
In [7]:
ytd = pd.read_csv(urllst[0])
ytd.columns = ["portfolio","date","levwodiv","levwdiv"]
ytd['portfolio'] = "r3000"
ytd['date'] = pd.DatetimeIndex(ytd.date)
ytd = ytd.sort_values(['date']).iloc[1:,]
ytd.reset_index(drop=True,inplace=True)
prarr(ytd)
blanks(2)

portfolio date levwodiv levwdiv
0 r3000 2019-01-01 2704.369326 7733.105730
1 r3000 2019-01-02 2707.170883 7741.158427
2 r3000 2019-01-03 2643.051723 7559.582582
3 r3000 2019-01-04 2734.493258 7821.127295
4 r3000 2019-01-07 2757.993050 7888.401675

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
Ditto for hist.
In [8]:
hist = pd.read_csv(urllst[1])
hist.columns = ["portfolio","date","levwodiv","levwdiv"]
hist['portfolio'] = "r3000"
hist['date'] = pd.DatetimeIndex(hist.date)
hist.sort_values(['date'],inplace=True)
hist.reset_index(drop=True,inplace=True)
prarr(hist)
blanks(2)

portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
3645 r3000 2018-12-25 2536.218945 7248.537164
3646 r3000 2018-12-26 2662.170289 7608.685862
3647 r3000 2018-12-27 2682.958961 7668.843455
3648 r3000 2018-12-28 2681.191917 7665.856863
3649 r3000 2018-12-31 2704.369326 7733.105730
Next, vertically “stack” hist and ytd into a “combine” dataframe.
In [9]:
combine = pd.concat([ytd,hist])
combine.sort_values(['date'],inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3773, 4)
portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
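Note that pd.concat preserves each source frame’s row labels, which is why the tail above still shows ytd’s indices 118 through 122. If a clean 0..n-1 index matters downstream, ignore_index handles it (a variant, not the article’s call):

combine = pd.concat([ytd,hist], ignore_index=True)   # renumber rows 0..n-1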
Delete duplicate level records that emerge on holidays, when markets are closed and the prior trading day’s levels simply repeat. A tiny illustration of the logic follows the output below.
In [10]:
combine.drop_duplicates(subset=['levwodiv','levwdiv'], inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3640, 4)
portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
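By default drop_duplicates keeps the first occurrence, so a holiday row that merely repeats the prior trading day’s levels is discarded. A tiny illustration with hypothetical values:

demo = pd.DataFrame({"levwodiv": [100.0, 100.0, 101.0],
                     "levwdiv": [200.0, 200.0, 202.0]})
print(demo.drop_duplicates(subset=["levwodiv","levwdiv"]))   # drops the middle row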
Compute percent change attributes for both level with dividends and level without dividends.
In [11]:
combine['pctwodiv'] = combine.levwodiv.pct_change()
combine['pctwdiv'] = combine.levwdiv.pct_change()
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3640, 6)
portfolio date levwodiv levwdiv pctwodiv pctwdiv
0 r3000 2005-01-03 1261.856142 2742.745480 NaN NaN
1 r3000 2005-01-04 1245.475031 2707.401151 -0.012982 -0.012886
2 r3000 2005-01-05 1238.467380 2692.398250 -0.005626 -0.005541
3 r3000 2005-01-06 1242.776350 2702.305567 0.003479 0.003680
4 r3000 2005-01-07 1240.174458 2696.663529 -0.002094 -0.002088

portfolio date levwodiv levwdiv pctwodiv pctwdiv
118 r3000 2019-06-14 3117.538895 8996.253148 -0.002559 -0.002440
119 r3000 2019-06-17 3122.143259 9009.636226 0.001477 0.001488
120 r3000 2019-06-18 3153.227920 9099.614008 0.009956 0.009987
121 r3000 2019-06-19 3163.499294 9129.317529 0.003257 0.003264
122 r3000 2019-06-20 3191.645509 9211.282316 0.008897 0.008978
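pct_change computes (today’s level / yesterday’s level) - 1, which is why the inception row is NaN. A quick check on hypothetical values:

s = pd.Series([100.0, 102.0, 99.96])
print(s.pct_change())   # NaN, 0.02, -0.02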
Write a csv file from the final dataframe. Calculate the growth of $1 from the inception of the data.
In [12]:
out = "r3000pd.csv"
combine.to_csv(out,index=None, sep=',')
print(round((1+combine.pctwodiv).prod(),2))
print(round((1+combine.pctwdiv).prod(),2))
blanks(2)

2.53
3.36
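The growth of $1 is the cumulative product of (1 + daily return); since each return is built from consecutive levels, the product telescopes to the final level divided by the initial level. So $1 invested at inception in 2005 grows to $2.53 without dividends reinvested and $3.36 with. A check of the arithmetic on hypothetical returns:

r = pd.Series([0.01, -0.02, 0.03])
print(round((1 + r).prod(), 4))   # 1.0195, i.e. $1 grows to about $1.02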
Load the rpy2 (R within Python) Jupyter extension, which enables the %R line magic and %%R cell magic used below.
In [13]:
%load_ext rpy2.ipython
blanks(2)
Import pertinent rpy2 libraries.
In [14]:
import rpy2
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri
import rpy2.robjects as robjects
robjects.pandas2ri.activate()
blanks(2)
Create a version of the Pandas combine dataframe suitable for R processing.
In [15]:
r3000 = robjects.pandas2ri.py2ri(combine[['date','pctwdiv']])
blanks(2)
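The py2ri call is specific to the rpy2 2.x series used here; rpy2 3.x removed it in favor of explicit conversion contexts. A sketch of the newer idiom, in case you are on rpy2 3.x:

from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
import rpy2.robjects as ro

# convert the Pandas dataframe inside a conversion context (rpy2 3.x)
with localconverter(ro.default_converter + pandas2ri.converter):
    r3000 = ro.conversion.py2rpy(combine[['date','pctwdiv']])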
Take a peek at the R data.frame. The -i flag pushes the r3000 object into the R session.
In [16]:
%R -i r3000 tail(r3000)
Out[16]:
                       date   pctwdiv
0 2019-06-13 00:00:00-05:00  0.004958
1 2019-06-14 00:00:00-05:00 -0.002440
2 2019-06-17 00:00:00-05:00  0.001488
3 2019-06-18 00:00:00-05:00  0.009987
4 2019-06-19 00:00:00-05:00  0.003264
5 2019-06-20 00:00:00-05:00  0.008978
Load relevant R libraries.
In [17]:
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils); require(lubridate)
Out[17]:
array([1], dtype=int32)
Create a cell of R processing. Push in the r3000 dataframe and commence R wrangling and graphics processing. Display the final chart.
In [18]:
%%R -w700 -h700 -i r3000

r3000 <- data.table(r3000)
mdte <- max(r3000[['date']])
dte <- substr(mdte,6,11)
tdates <- lubridate::date(paste(c("2019-","2011-","2015-"),dte,sep=""))
fdates <- lubridate::date(c("2017-01-20","2009-01-20","2013-01-21"))

#############################################
# function to build the data.table for ggplot.
#############################################
nmkreturn <- function(to,from,dt)
{
  rbind(
    data.table(potus='Trump',
               date=dt[date>=from[1] & date<=to[1]]$date,
               returnpct=dt[date>=from[1] & date<=to[1],cumprod(1+pctwdiv)]
    ),
    data.table(potus='Obama 1',
               date=dt[date>=from[2] & date<=to[2]]$date,
               returnpct=dt[date>=from[2] & date<=to[2],cumprod(1+pctwdiv)]
    ),
    data.table(potus='Obama 2',
               date=dt[date>=from[3] & date<=to[3]]$date,
               returnpct=dt[date>=from[3] & date<=to[3],cumprod(1+pctwdiv)]
    )
  )
}

#########################################
# nwork is the final graphics data.table.
#########################################
work <- nmkreturn(tdates,fdates,r3000)
nwork <- data.table(left_join(work,
           work[,.(legend=paste(potus,date[1],date[.N],"\n",sep="\n")),.(potus)],
           by=c("potus"="potus")))

X <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$V1
Y <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$returnpct

###############################
# save data to an rds data set.
###############################
ofile <- "r3000.rds"
save(r3000,X,work,nwork,file=ofile)
###########################################
# set parm vars and execute the ggplot code.
###########################################
titstr <- paste("Russell 3000 Returns"," thru ",mdte,sep="")
stitstr <- "Trump vs Obama\n"
xstr <- "\nYear"
ystr <- "Growth %\n"
cstr <- "Administration\n"

pal <- brewer.pal(9,"Blues")

g <- ggplot(nwork,aes(x=date,y=100*(returnpct-1), col=legend)) +
geom_line(size=.7) +
theme(legend.position = "right", plot.background = element_rect(fill = pal[2]),
panel.background = element_rect(fill = pal[2])) +
theme(axis.text.x = element_text(size=10, angle = 45)) +
labs(title=titstr,subtitle=stitstr,x=xstr,y=ystr,col=cstr) +
annotate("text", x = X+100*24*60*60, y = Y, label = Y, size=4)
g
The 28-month stock market performance is solid for 45 (Trump); comparable-period performance is even better for each of 44’s (Obama’s) administrations.