Jupyter Notebook: Python or R — Or Both?
I was analytically betwixt and between a few weeks ago. Most of my Jupyter Notebook work is done in either Python or R. Indeed, I like to demonstrate the power of each platform to myself by recoding R work in Python and vice versa.
I must have a dozen active notebooks, some developed in Python and some in R, that investigate the performance of the stock market using indexes like the S&P 500, Wilshire 5000, and Russell 3000. These index data can be readily downloaded from the web, using core functionality of either platform, for subsequent analytic processing and display. I’ve learned a lot about both platforms, and about the stock market, developing such notebooks. Alas, for the latest exercise I’d envisioned, I wished to look at Russell 3000 data, available in a Python notebook, through the lens of analytics I’d coded for the Wilshire 5000 in R. The mishmash of notebook kernel and stock market index combinations I possessed didn’t meet my needs.
[Related Article: Snakes in a Package: Combining Python and R with Reticulate]
I figured I had a couple of choices. I could either re-write the R graphics code in Python using the ever-improving Seaborn library, or I could adapt the R ggplot Wilshire code to interoperate with Russell 3000 data using the Python library rpy2 and RMagic — in effect engaging R within Python. I decided to do the latter.
R and Python are two of the top analytics platforms. Both are open source, both have large user bases, and both have incredibly productive ecosystems. In addition, they interoperate: Python developers can use the rpy2 library to include R code in their scripts, while R developers have access to Python via the reticulate package. There’s also the feather package, available in both, for sharing data across platforms. I fully expect even more seamless collaboration between them in the near future.
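As a taste of the feather route, here is a minimal sketch of the Python side of a cross-platform hand-off (it assumes the pyarrow dependency is installed; the file name and sample values are my own):

import pandas as pd

# write a small dataframe to a feather file from Python/Pandas
df = pd.DataFrame({"date": pd.date_range("2019-01-01", periods=3),
                   "level": [100.0, 101.5, 99.8]})
df.to_feather("shared.feather")
# an R session can then read the same file with feather::read_feather("shared.feather")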
For the analysis that follows, I focus on the performance of the FTSE Russell 3000 index using Python. I first download two files, a year-to-date and a history, that together provide final daily Russell 3000 index levels starting in 2005. Attributes include index name, date, level without dividends reinvested, and level with dividends reinvested. I then wrangle the data to build the final Pandas dataframe. From there, I build R dataframes showing the growth of $1 invested over time, suitable for ggplot2 charting.
[Related Article: Introduction to R Shiny]
The technology used below is JupyterLab 0.32.1, Anaconda Python 3.6.5, Pandas 0.23.0, R 3.6.0, and rpy2 2.9.4. This notebook’s kernel is Python 3 and uses the rpy2 library to enable R processing. In this instance, the initial data work is done in Python/Pandas, then handed off for graphics to the splendid R ggplot2 library.
Tone down the warnings issued by pandas filter operations.
In [1]:
import warnings
warnings.filterwarnings("ignore")
print("\n\n")
Add a local directory housing a personal library to the import path.
In [2]:
import sys

functdir = "c:/data/jupyter/notebooks/functions"
sys.path.append(functdir)
print("\n\n")
Load the personal library. The prarr and blanks functions are used throughout this article; a hypothetical sketch of the two helpers follows the cell below.
In [3]:
from myfuncs import *
blanks(2)
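myfuncs itself isn’t shown in the article, so here is a hypothetical sketch of what the two helpers might look like; the real implementations may differ:

def blanks(n):
    # print n blank lines to space out notebook output
    print("\n" * n, end="")

def prarr(df, rows=5):
    # show the head and tail of a dataframe, as in the output below
    print(df.head(rows), "\n")
    print(df.tail(rows))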
Import other relevant Python libraries.
In [4]:
import os
import pandas as pd
import numpy as np
blanks(2)
Set and migrate to the new working directory.
In [5]:
owd = os.getcwd()
nwd = "c:/data/russell/potus"
os.chdir(nwd)
print(os.getcwd())
blanks(2)

c:\data\russell\potus
Establish a list of URLs of the CSV files to be downloaded from the FTSE Russell website. The two files are year-to-date (ytd) and history (hist) versions of the daily Russell 3000 index levels. The subsequent cells read them directly; a more defensive download variant is sketched after the cell below.
In [6]:
urllst = [
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valuesytd_US3000.csv",
"http://www.ftse.com/products/russell-index-values/home/getfile?id=valueshist_US3000.csv"
]
blanks(2)
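pd.read_csv can read straight from these URLs, as the next cell shows. If the endpoints are slow or flaky, it may help to download defensively first. A minimal sketch using the requests library, relying on the pandas import above (the retry count and timeout are my own choices, not from the article):

import io
import requests

def fetch_csv(url, tries=3, timeout=30):
    # attempt the download a few times before giving up, then hand
    # the bytes to pandas just as pd.read_csv(url) would
    for attempt in range(tries):
        try:
            resp = requests.get(url, timeout=timeout)
            resp.raise_for_status()
            return pd.read_csv(io.BytesIO(resp.content))
        except requests.RequestException:
            if attempt == tries - 1:
                raise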
Load the ytd file into a pandas dataframe. Do a little munging.
In [7]:
ytd = pd.read_csv(urllst[0])
ytd.columns = ["portfolio","date","levwodiv","levwdiv"]
ytd['portfolio'] = "r3000"
ytd['date'] = pd.DatetimeIndex(ytd.date)
ytd = ytd.sort_values(['date']).iloc[1:,]
ytd.reset_index(drop=True,inplace=True)
prarr(ytd)
blanks(2)

portfolio date levwodiv levwdiv
0 r3000 2019-01-01 2704.369326 7733.105730
1 r3000 2019-01-02 2707.170883 7741.158427
2 r3000 2019-01-03 2643.051723 7559.582582
3 r3000 2019-01-04 2734.493258 7821.127295
4 r3000 2019-01-07 2757.993050 7888.401675

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
Ditto for hist.
In [8]:
hist = pd.read_csv(urllst[1])
hist.columns = ["portfolio","date","levwodiv","levwdiv"]
hist['portfolio'] = "r3000"
hist['date'] = pd.DatetimeIndex(hist.date)
hist.sort_values(['date'],inplace=True)
hist.reset_index(drop=True,inplace=True)
prarr(hist)
blanks(2)

portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
3645 r3000 2018-12-25 2536.218945 7248.537164
3646 r3000 2018-12-26 2662.170289 7608.685862
3647 r3000 2018-12-27 2682.958961 7668.843455
3648 r3000 2018-12-28 2681.191917 7665.856863
3649 r3000 2018-12-31 2704.369326 7733.105730
Next, vertically “stack” hist and ytd into a “combine” dataframe.
In [9]:
combine = pd.concat([ytd,hist])
combine.sort_values(['date'],inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3773, 4)
portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
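Note that pd.concat preserves each source frame’s row labels, which is why the tail above still shows ytd’s indices 118 through 122. If a clean 0..n-1 index matters downstream, ignore_index handles it (a variant, not the article’s call):

combine = pd.concat([ytd,hist], ignore_index=True)   # renumber rows 0..n-1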
Delete duplicate level records that emerge on holidays, when markets are closed and the prior trading day’s levels simply repeat. A tiny illustration of the logic follows the output below.
In [10]:
combine.drop_duplicates(subset=['levwodiv','levwdiv'], inplace=True)
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3640, 4)
portfolio date levwodiv levwdiv
0 r3000 2005-01-03 1261.856142 2742.745480
1 r3000 2005-01-04 1245.475031 2707.401151
2 r3000 2005-01-05 1238.467380 2692.398250
3 r3000 2005-01-06 1242.776350 2702.305567
4 r3000 2005-01-07 1240.174458 2696.663529

portfolio date levwodiv levwdiv
118 r3000 2019-06-14 3117.538895 8996.253148
119 r3000 2019-06-17 3122.143259 9009.636226
120 r3000 2019-06-18 3153.227920 9099.614008
121 r3000 2019-06-19 3163.499294 9129.317529
122 r3000 2019-06-20 3191.645509 9211.282316
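By default drop_duplicates keeps the first occurrence, so a holiday row that merely repeats the prior trading day’s levels is discarded. A tiny illustration with hypothetical values:

demo = pd.DataFrame({"levwodiv": [100.0, 100.0, 101.0],
                     "levwdiv": [200.0, 200.0, 202.0]})
print(demo.drop_duplicates(subset=["levwodiv","levwdiv"]))   # drops the middle row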
Compute percent change attributes for both level with dividends and level without dividends.
In [11]:
combine['pctwodiv'] = combine.levwodiv.pct_change()
combine['pctwdiv'] = combine.levwdiv.pct_change()
print(combine.shape,"\n\n")
prarr(combine)
blanks(2)

(3640, 6)
portfolio date levwodiv levwdiv pctwodiv pctwdiv
0 r3000 2005-01-03 1261.856142 2742.745480 NaN NaN
1 r3000 2005-01-04 1245.475031 2707.401151 -0.012982 -0.012886
2 r3000 2005-01-05 1238.467380 2692.398250 -0.005626 -0.005541
3 r3000 2005-01-06 1242.776350 2702.305567 0.003479 0.003680
4 r3000 2005-01-07 1240.174458 2696.663529 -0.002094 -0.002088

portfolio date levwodiv levwdiv pctwodiv pctwdiv
118 r3000 2019-06-14 3117.538895 8996.253148 -0.002559 -0.002440
119 r3000 2019-06-17 3122.143259 9009.636226 0.001477 0.001488
120 r3000 2019-06-18 3153.227920 9099.614008 0.009956 0.009987
121 r3000 2019-06-19 3163.499294 9129.317529 0.003257 0.003264
122 r3000 2019-06-20 3191.645509 9211.282316 0.008897 0.008978
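pct_change computes (today’s level / yesterday’s level) - 1, which is why the inception row is NaN. A quick check on hypothetical values:

s = pd.Series([100.0, 102.0, 99.96])
print(s.pct_change())   # NaN, 0.02, -0.02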
Write a csv file from the final dataframe. Calculate the growth of $1 from the inception of the data.
In [12]:
out = "r3000pd.csv"
combine.to_csv(out,index=None, sep=',')
print(round((1+combine.pctwodiv).prod(),2))
print(round((1+combine.pctwdiv).prod(),2))
blanks(2)

2.53
3.36
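The growth of $1 is the cumulative product of (1 + daily return); since each return is built from consecutive levels, the product telescopes to the final level divided by the initial level. So $1 invested at inception in 2005 grows to $2.53 without dividends reinvested and $3.36 with. A check of the arithmetic on hypothetical returns:

r = pd.Series([0.01, -0.02, 0.03])
print(round((1 + r).prod(), 4))   # 1.0195, i.e. $1 grows to about $1.02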
Load the rpy2 (R within Python) Jupyter extension, which enables the %R line magic and %%R cell magic used below.
In [13]:
%load_ext rpy2.ipython
blanks(2)
Import pertinent rpy2 libraries.
In [14]:
import rpy2
import rpy2.robjects.numpy2ri
import rpy2.robjects.pandas2ri
import rpy2.robjects as robjects
robjects.pandas2ri.activate()
blanks(2)
Create a version of the Pandas combine dataframe suitable for R processing.
In [15]:
r3000 = robjects.pandas2ri.py2ri(combine[['date','pctwdiv']])
blanks(2)
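The py2ri call is specific to the rpy2 2.x series used here; rpy2 3.x removed it in favor of explicit conversion contexts. A sketch of the newer idiom, in case you are on rpy2 3.x:

from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
import rpy2.robjects as ro

# convert the Pandas dataframe inside a conversion context (rpy2 3.x)
with localconverter(ro.default_converter + pandas2ri.converter):
    r3000 = ro.conversion.py2rpy(combine[['date','pctwdiv']])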
Take a peek at the R data.frame. The -i flag pushes the r3000 object into the R session.
In [16]:
%R -i r3000 tail(r3000)
Out[16]:
                       date   pctwdiv
0 2019-06-13 00:00:00-05:00  0.004958
1 2019-06-14 00:00:00-05:00 -0.002440
2 2019-06-17 00:00:00-05:00  0.001488
3 2019-06-18 00:00:00-05:00  0.009987
4 2019-06-19 00:00:00-05:00  0.003264
5 2019-06-20 00:00:00-05:00  0.008978
Load relevant R libraries.
In [17]:
%R require(tidyverse); require(data.table); require(RColorBrewer); require(R.utils); require(lubridate)
Out[17]:
array([1], dtype=int32)
Create a cell of R processing. Push in the r3000 dataframe and commence R wrangling and graphics processing. Display the final chart.
In [18]:
%%R -w700 -h700 -i r3000

r3000 <- data.table(r3000)
mdte <- max(r3000[['date']])
dte <- substr(mdte,6,11)
tdates <- lubridate::date(paste(c("2019-","2011-","2015-"),dte,sep=""))
fdates <- lubridate::date(c("2017-01-20","2009-01-20","2013-01-21"))

#############################################
# function to build the data.table for ggplot.
#############################################
nmkreturn <- function(to,from,dt)
{
  rbind(
    data.table(potus='Trump',
               date=dt[date>=from[1] & date<=to[1]]$date,
               returnpct=dt[date>=from[1] & date<=to[1],cumprod(1+pctwdiv)]
    ),
    data.table(potus='Obama 1',
               date=dt[date>=from[2] & date<=to[2]]$date,
               returnpct=dt[date>=from[2] & date<=to[2],cumprod(1+pctwdiv)]
    ),
    data.table(potus='Obama 2',
               date=dt[date>=from[3] & date<=to[3]]$date,
               returnpct=dt[date>=from[3] & date<=to[3],cumprod(1+pctwdiv)]
    )
  )
}

#########################################
# nwork is the final graphics data.table.
#########################################
work <- nmkreturn(tdates,fdates,r3000)
nwork <- data.table(left_join(work,
           work[,.(legend=paste(potus,date[1],date[.N],"\n",sep="\n")),.(potus)],
           by=c("potus"="potus")))

X <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$V1
Y <- nwork[,.(date[.N],returnpct=round(100*(returnpct[.N])-100)),.(potus)]$returnpct

###############################
# save data to an rds data set.
###############################
ofile <- "r3000.rds"
save(r3000,X,work,nwork,file=ofile)
###########################################
# set parm vars and execute the ggplot code.
###########################################
titstr <- paste("Russell 3000 Returns"," thru ",mdte,sep="")
stitstr <- "Trump vs Obama\n"
xstr <- "\nYear"
ystr <- "Growth %\n"
cstr <- "Administration\n"

pal <- brewer.pal(9,"Blues")

g <- ggplot(nwork,aes(x=date,y=100*(returnpct-1), col=legend)) +
geom_line(size=.7) +
theme(legend.position = "right", plot.background = element_rect(fill = pal[2]),
panel.background = element_rect(fill = pal[2])) +
theme(axis.text.x = element_text(size=10, angle = 45)) +
labs(title=titstr,subtitle=stitstr,x=xstr,y=ystr,col=cstr) +
annotate("text", x = X+100*24*60*60, y = Y, label = Y, size=4)
g
The 28-month stock market performance is solid for 45 (Trump); comparable-period performance is even better for each of 44’s (Obama’s) administrations.