Transcript
This transcript was generated automatically and may contain errors.
Hi, everyone, and good afternoon. Today I'll be presenting To Notebook or Not to Notebook: Multilingual Workflows for Data Analysis.
It's my first time in Canada, by the way, and I'm really enjoying the experience so far. So, thank you everyone from the organization, all the sponsors for organizing this amazing event.
And without further ado, I'll start by introducing myself a little bit. So, I'm originally from Brazil, from a city called Salvador. That's a picture of my city. I'm based in Pittsburgh, Pennsylvania, and I work as a geoengineer at Posit.
And my background is in chemistry, but the past few years I've been getting more into statistics, data, especially in the context of data analysis for issues of societal importance. That's what I'm really interested in because I believe that data and statistics can be really helpful in helping understand and improving the world.
Standing on the shoulders of giants
So, when I'm talking about something that I'm new to, I really like to stand on the shoulders of giants. That's how people refer to it in academia.
And I see a lot of this conference as a conversation between different talks that happened in different locations, and then after the conference, they all become available on YouTube. And that's what I really find exciting, to watch them later, and to see how they are interconnected, and how the speakers have established that those conferences are meeting places for dialogue, and for debates, and for conversations.
And so with that in mind, one of my favorite keynote talks that I attended recently was at PyData Boston, which was given by one of my colleagues, actually, Isabel Zimmerman, and she talked about the importance of tools, why we should care about tools, and about using them. And that talk really stuck with me, because it made me even more interested in the tools that we use for data science, but also, more in general, the tools that we use every day.
So some of the questions I had, after Isabel's keynote, were more general in terms of human history, like what was the first tool that was ever made? Who created it? Where, when, and how? And I think thinking about those questions can really guide us, as we think about tools that we use for programming, for data analysis, and so on.
So that's why I find it very interesting to think beyond just the technical aspects of what we do, but to also think in terms of the importance of these things for us as humans. So trying to look for an answer to these questions, I found out that the first tools ever made were stones, essentially, and the first stones that were found were from 3.4 million years ago, which was very shocking to me, because we, modern humans, have only been around for 300,000 years. So it was, essentially, a very, very long time ago.
And what I found very interesting about this is that it shows that we, as humans, are always trying to create, and innovate, and use tools across time.
But what do these ancient stones have to do with multilingual notebooks? Why am I bringing it up? The reason I'm bringing it up is, again, to reflect on how these tools inform what we do, and how we use them to achieve certain goals. So I really encourage everyone to check out the talk from PyData Boston, which is available on YouTube.
Anatomy of a Jupyter Notebook
And without further ado, let's talk about the basic anatomy of a Jupyter Notebook, which is one of the tools that we will be exploring today. So this is from a talk on the history of Jupyter Notebooks, that was presented. How many of you here are familiar with Jupyter Notebooks? Yeah, everyone. Cool.
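For readers following along without the slide, here is a minimal sketch of that anatomy. It assumes the nbformat version 4 JSON layout, in which a notebook file is a list of cells (Markdown cells and code cells) plus some metadata. The cell ids and cell contents are made up for illustration.

```python
import json

# A minimal notebook in the nbformat 4 JSON layout (illustrative content only).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {
        # The kernelspec records which kernel (and language) runs the code cells.
        "kernelspec": {"name": "python3", "display_name": "Python 3", "language": "python"},
    },
    "cells": [
        {
            "id": "intro",
            "cell_type": "markdown",  # narrative text
            "metadata": {},
            "source": "# My analysis\nSome narrative text.",
        },
        {
            "id": "first-code",
            "cell_type": "code",  # executable code plus its stored outputs
            "metadata": {},
            "execution_count": None,  # not run yet
            "outputs": [],
            "source": "print('hello')",
        },
    ],
}

# An .ipynb file is this structure serialized as JSON.
text = json.dumps(notebook, indent=1)
cell_types = [cell["cell_type"] for cell in json.loads(text)["cells"]]
print(cell_types)
```

Saving `text` to a file with an `.ipynb` extension should give something a notebook editor can open, although real notebooks usually carry more metadata than this sketch.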
So everyone knows what it looks like. And what I find very interesting about Jupyter Notebooks is the development of more and more extensions around it that are contributed by the community. So everyone knows how to prep the docs for the different extensions, and so on. And these extensions allow users to extend the capabilities of Jupyter Notebooks.
One extension that I find really exciting, for example, is to translate between different languages, human languages, that's what I'm referring to here. That allows it to translate the Markdown cells, what you have written on a notebook, into different languages.
So in terms of the current state of the Jupyter extension ecosystem, one website that I found out about recently is the Jupyter Extension Marketplace. And that website allows us to explore different types of extensions that are available, as well as see some telemetry data about each of the extensions. So we can see how many downloads have been made in the past month, in the past year, for each of the extensions, which extensions are the most popular, and so on.
That website is not maintained by Project Jupyter, it's an unofficial website, but I find it really informative when I'm looking for extensions that fit a particular purpose, and so on.
So there are some examples of extensions for different purposes. One of them, which I really like, is ipywidgets, which enables widget interactivity in a notebook, and so on.
Multilingual kernel extensions
So in the context of this talk, in which we're talking about multilingual workflows, some extensions can be really helpful when you want to run different languages within a notebook. And those are under that category of runtime and kernel extensions. And the reason I really like them is because they can enable multilingual execution, so you can have R execution within a notebook, Python, SQL, and so on.
And when I was at PyData Boston, I did a live stream with Ian DeBrieve, who works for MATLAB, and he was essentially showing how he can run a MATLAB engine within a Jupyter notebook. And that's one example of how multilingual kernel extensions can be helpful for data analysis.
History of Jupyter notebooks
So a lot of what I'm exploring here as the basis for this talk in terms of the history of Jupyter notebooks came from this previous talk at EuroPython, which I found very interesting because it gives a very solid overview of the history of Jupyter notebooks.
And what I found very exciting about them is that there has been a lot of progress in the past 25 years, I would say, with IPython being created in 2001, and then IPython notebook first being released in 2011. And then Project Jupyter being spun off from IPython in 2014. And then there was the first JupyterCon almost 10 years ago, like nine years ago. And then the stable release of JupyterLab. And by 2018, there were 2.7 million notebooks on GitHub. And a few years ago, about five years ago, Nature named Jupyter as one of the 10 projects that transformed science, and it reached almost 10 million notebooks on GitHub.
The notebook controversy
So this very exciting history in the rise of Jupyter notebooks is not without controversies, though, because some people really like notebooks and some people really dislike notebooks.
So I don't know if you're familiar with these two talks that were given before. The first one is I Don't Like Notebooks, which was given at JupyterCon. And the second one is a PyData talk that was given at one of those PyData conferences called The First Notebook War. And then there were these very interesting blog posts that came up as well, someone talked about contextualizing that first notebook war, and someone else responding to the I Don't Like Notebooks talk, talking about how they like notebooks.
So there's a lot of controversy around notebooks. And the reasons why I love notebooks, beyond or despite the technical aspects, are, first, the idea of memory and reproducibility in data science, as notebooks can remember the computation state, and you can pick up where you left off and continue your reproducible workflow.
Second, the idea of collaboration and community. Third, narratives and incremental thinking, how you can develop your analysis step by step, making it easy to share with people through code, visualizations, and explanations along the way. Fourth, the bridge between computation and storytelling, because you can tell your story with the data in a way that can be engaging to your audience.
And besides all these reasons that go beyond the technical aspects, as I was reflecting about PyCon US last year, I think there is one more reason that cannot be ignored, which is just the act of playing, the playful nature of exploratory data analysis. And the reason why I say that I was reflecting about this based on PyCon US last year is because there was this very cool keynote talk from Lynn Root, in which she talks about the importance of just playing in general.
Nowadays we have this pressure to always be doing things for a particular purpose, because of work, or demands, or different goals. I think it's really important as well to think about the idea of data exploration just out of curiosity, and out of the desire to better understand the world. So that's what I find really exciting about exploratory data analysis with notebooks in particular.
Data scientists are multilingual
So yeah, besides being playful, I've come to realize that data scientists are multilingual. Here's some data from the Kaggle survey from 2022, which shows that Python, SQL, and R are the three most used languages among data scientists.
And then here's the full data from 2018 to 2022. 2022 was the last year that they conducted that survey. And here we can see that there has been a rise in usage of Python among data scientists, and SQL is increasing, and R, although it's decreasing, seems to be stable. There's a very strong community that keeps using R, especially for statistical analysis purposes and so on.
So as I see those different trends in Python, SQL, and R adoption, they really make me wonder about what they mean for notebooks. For me, it means that data science is increasingly polyglot. It means that notebooks and IDEs should be too, as real-world projects and their communities often use multiple languages. So you can have a data scientist who prefers to use SQL because they are more into data engineering, or a statistician who prefers to use R and so on, as each of those languages is often used for different purposes.
Multilingual notebook workflows in practice
So, in terms of multilingual notebook workflows, I want to show an example of how to integrate multiple languages within a notebook. So a classic example would be using SQL to extract data from a database, Python to obtain and transform data, and R with a stronger focus on statistics. And there are two approaches to tackling such workflows. One of them is using notebooks.
As you know, there is also Quarto, which I just used before, which has a similar idea, but a different concept, in which you can essentially put some code within your Markdown document and execute it as you render the document. So that's another way of approaching it.
The idea behind that notebook is that it shows how you can run SQL, R and Python within the same notebook using Python and the rpy2 extension and so on in order to perform your analysis. So it's just one example. And I think the dependencies and the installations that are done in the beginning can be really helpful in case you have data science projects in the future in which you want to use different languages in the same workflow.
So that's why I encourage everyone to check it out. And the Quarto document follows a different idea, in which at the moment that you render the document, it executes all the chunks of Python and R code that you have.
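To make the SQL-then-Python part of that workflow concrete, here is a minimal sketch that uses only Python's standard library. It is not the notebook shown in the talk: the in-memory SQLite database, the `measurements` table, its columns, and the values are all made up for illustration, and the R stage is left out because it needs rpy2 and an R installation.

```python
import sqlite3
from statistics import mean

# Hypothetical toy data: table name, column names, and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (grp TEXT, value REAL)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [("a", 28.0), ("a", 30.0), ("b", 10.0), ("b", 14.0)],
)

# SQL step: extract the rows we need from the database.
rows = conn.execute("SELECT grp, value FROM measurements ORDER BY grp").fetchall()
conn.close()

# Python step: transform the extracted rows into per-group averages.
by_group = {}
for grp, value in rows:
    by_group.setdefault(grp, []).append(value)
averages = {grp: mean(values) for grp, values in by_group.items()}
print(averages)
```

In a Jupyter notebook with rpy2 installed, the statistics step could then be handed to R by loading the extension with `%load_ext rpy2.ipython` and writing the R code in a `%%R` cell.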
Challenges of multilingual workflows
In terms of challenges of those multilingual workflows for notebooks, I think it's worth mentioning that when you add those polyglot capabilities, there are certain complexities that emerge. One of them is that debugging across different languages can become very challenging because you can be dealing with Python, R, SQL, Julia and so on.
Managing dependencies can also bring additional challenges because you have multiple languages, multiple dependencies and so on. And often, if you want to run Python in R, for example, you might be using reticulate, or the other way around, the rpy2 extension, and so on. So those dependencies can also be challenging.
Third, we can have some performance overhead as a result of that language translation. And then we have the challenge of team knowledge: the knowledge requirement can increase because suddenly the person who knows Python needs to know some R, SQL and so on. And lastly, it can become harder to refactor and to maintain the code, as we'll have multiple languages being used.
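One small way to ease the dependency challenge mentioned above is to check up front which pieces of the multilingual stack are present. The following is a rough sketch under stated assumptions, not something shown in the talk: `check_runtimes` is a hypothetical helper, and the three things it looks for (the rpy2 Python package, the `Rscript` executable, and the `sqlite3` module) are just examples of what such a workflow might need.

```python
import importlib.util
import shutil


def check_runtimes() -> dict:
    """Report which pieces of a Python + R + SQL setup can be found."""
    return {
        # Python package that bridges Python and R.
        "rpy2": importlib.util.find_spec("rpy2") is not None,
        # R's command-line executable, looked up on the PATH.
        "Rscript": shutil.which("Rscript") is not None,
        # SQLite bindings from Python's standard library.
        "sqlite3": importlib.util.find_spec("sqlite3") is not None,
    }


status = check_runtimes()
for name, found in status.items():
    print(f"{name}: {'found' if found else 'missing'}")
```

Running a check like this in the first cell of a notebook can make a missing runtime show up right away, rather than halfway through an analysis.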
Notebook editors and IDEs
And to solve some of these challenges, what I have personally been doing is exploring different notebook editors and IDEs that can make that type of workflow easier. And I wanted to show here some examples. And I think everyone should try different tools. I really encourage that because certain tools can be better for certain audiences. Some people have a personal preference towards certain IDEs or certain notebook editors and so on.
So a few examples of notebook editors that are becoming more popular besides JupyterLab are Deepnote, Google Colab and marimo. All of them have their advantages and disadvantages. So that's why I encourage everyone to try different things. In terms of IDEs, everyone knows VS Code, there is also PyCharm, and there is Positron, which is the IDE that I've been building for the past year at work.
So just to chat a little bit more about Positron, there is integration for Jupyter Notebook with Positron. So I think it provides a batteries-included experience in which people can just download Positron, which by the way is free, it's open source. And you can just do your notebook exploration within the IDE. And that exploration is integrated with the Data Explorer, which is a tool that's available within the IDE, as well as a native Variables pane.
And second, in terms of AI integration, there's definitely integration with Positron Assistant, which enables AI to make changes to your notebook along the way. I don't know if you can see this. So essentially you can have central prompts or you can have a custom prompt as well. And as you put your things, it can make edits.