TransWikia.com

Is Python a viable language to do statistical analysis in?

Data Science Asked on August 31, 2021

I originally came from R, but Python seems to be the more common language these days. Ideally, I would do all my coding in Python as the syntax is easier and I’ve had more real life experience using it – and switching back and forth is a pain.

Outside of ML-type work, all of the statistical analysis I’ve done has been in R: regressions, time series, ANOVA, logistic regression, etc. I have never really done that kind of work in Python. However, I am trying to create a bunch of code templates for myself, and before I start, I would like to know if Python is deep enough to completely replace R as my language of choice. I eventually plan on moving more towards ML, and I know Python can do that, and I imagine I will eventually have to move to a lower-level language like C++.

Does anyone know the limitations of Python when it comes to statistical analysis, or have a link to the pros and cons of using R vs. Python as the main language for statistical analysis?

7 Answers

Python is more "general purpose" while R has a clear(er) focus on statistics. However, most (if not all) things you can do in R can be done in Python as well. The difference is that you need to use additional packages in Python for some things you can do in base R.

Some examples:

  • Data frames are base R while you need to use Pandas in Python.
  • Linear models (lm) are base R while you need to use statsmodels or scikit-learn in Python. There are important conceptual differences to be considered.
  • For some rather basic mathematical operations you would need to use numpy.
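To make the examples above concrete, here is a minimal sketch of what the extra packages look like in practice. The toy data and column names are made up for illustration; the fit uses plain NumPy least squares rather than statsmodels or scikit-learn, just to show that the intercept column R's lm() adds for you has to be built by hand here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; in R this would be data.frame(x = ..., y = ...).
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [3.0, 5.0, 7.0, 9.0, 11.0],  # y = 1 + 2x exactly
})

# Roughly comparable to summary(df) in R.
print(df.describe())

# Least-squares fit of y on x using NumPy only.
# The column of ones is the intercept term that R's lm(y ~ x) adds implicitly.
X = np.column_stack([np.ones(len(df)), df["x"].to_numpy()])
coef, *_ = np.linalg.lstsq(X, df["y"].to_numpy(), rcond=None)
intercept, slope = coef
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

For standard errors, p-values and a summary table like R's summary(lm(...)), you would reach for statsmodels instead of raw NumPy.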

Overall this leads to some additional effort (and knowledge) needed to work fluently in Python. I personally often feel more comfortable working with base R, since I feel "closer to the data" there.

However, in other cases, e.g. when I use boosting or neural nets, Python seems to have an advantage over R. Many algorithms are developed in C++ (e.g. TensorFlow, LightGBM) and adapted to Python and (often later) to R. At least when you work on Windows, this often works better with Python. You can use things like TensorFlow/Keras, LightGBM, and CatBoost in R, but it can sometimes be daunting to get the additional package running in R (especially with GPU support).

Many packages (or methods) are available for both R and Python, such as GLMnet (for R / for Python). The labs of "Introduction to Statistical Learning", which are available for R and for Python, also show that the two languages do not differ much in what you can do. The difference is more in how things are done.

Finally, since Python is more "general purpose" than R (at least in my view), there are interesting and fun things you can do with Python (beyond statistics) which you cannot do with R, or at least only with more effort.

Correct answer by Peter on August 31, 2021

Python being more widely used is an important consideration, especially when applying for a job. Python also has as many key statistical and ML/AI tools as R, if not more, and a larger open-source base to draw on. Python is designed for programmers; R is designed for statisticians. I was originally an R programmer, but most of my colleagues were using Python, so I eventually switched over.

Here are some of the basic differences:

Python:

  1. programmer friendly
  2. debugging easier
  3. More open-source support (stack web sites, etc)

R:

  1. Easier and simpler to write scripts
  2. Works better with other languages
  3. More built in functionality

Good reference to check out: datacamp.com/community/tutorials/r-or-python-for-data-analysis

I should also mention that I have used R code within Python, using rpy2. If you are using a notebook, load the extension with %load_ext rpy2.ipython and then start a cell with %%R, after installing the necessary R libraries.

Answered by Donald S on August 31, 2021

I'd like to add two points to the existing answers:

  • There is excellent interaction between R and python, with various possibilities for either direction.

    To me, it's not that much of a decision python vs. R. The decision is to choose the main language appropriately for the project at hand, and then do parts in the other language if that is better for some reason.

  • I find the facilities to generate reports much more convenient in R.
    Since lots of my work consists in producing reports about statistical analyses, I mainly use R.

    To the point that were I to encounter a data analysis + report today that I think is better done in python, I'd set up the report as "R"markdown and do the python in python chunks.

Answered by cbeleites unhappy with SX on August 31, 2021

For the love of the flying spaghetti monster, use anaconda to install the needed packages for data science. I have seen both Python and R being used in the data science setting and both needed additional packages to execute any data science capabilities. Conda made it way easier to install them.

From my point of view, Python has better support for all kinds of packages. There are simply more ports to Python than to R, but this may change in the future.

https://docs.conda.io/projects/conda/en/latest/user-guide/install/
conda install scikit-learn

Answered by stupidstudent on August 31, 2021

One thing that can be a gotcha coming from R to Python is that the Python stats ecosystem tends to be oriented more toward machine learning than toward inferential statistics.

This can create some hiccups: some R defaults exist because people who do inferential stats, for example in the social sciences, always use them, and those are not the defaults in the main Python libraries.

For example, Statsmodels, one of the standard libraries for inferential stats, doesn't include the intercept by default when you do linear regression, UNLESS you use the R-style formulas with Patsy, in which case it is included.

Another example: scikit-learn in Python uses the divide-by-n ("population") formula for standard deviation, as does NumPy's np.std by default, while R's sd() uses the divide-by-n-1 ("sample") formula.
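A small sketch of the two formulas, using made-up numbers. NumPy's `np.std` divides by n unless you pass `ddof=1`; the standard library's `statistics.pstdev` and `statistics.stdev` give the population and sample versions respectively.

```python
import statistics

import numpy as np

data = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative values only

pop_sd = np.std(data)             # default ddof=0: divide by n
sample_sd = np.std(data, ddof=1)  # divide by n - 1, the formula R's sd() uses

print(pop_sd, sample_sd)
print(statistics.pstdev(data), statistics.stdev(data))
```

Note that pandas goes the other way: `Series.std()` defaults to `ddof=1`, so the same column can give two different answers depending on which library computes it.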

Those sorts of things tend to be really confusing for people new to the ecosystem, and create totally unnecessary cognitive burden. So that's a tradeoff.

Answered by Paul Gowder on August 31, 2021

I eventually do plan on moving more towards ML

One aspect that I would like to add based on what I observed.

The field is shifting its focus towards Deep Learning (e.g. neural networks), and in this space most of the dominant libraries support Python as their first choice.

Companies maintain a separate open-source Python version just to keep the user base, even though they themselves use a compiled C++ version or something else. It is a self-reinforcing process: since Python has gained fame, companies release open-source frameworks and libraries in Python, and the easy availability of those frameworks and libraries attracts more users.

Stack Overflow 2019 Survey

  • Most Popular Technologies: Python 41.7%, R 5.8%
  • Other Frameworks, Libraries, and Tools: Pandas and TensorFlow are in the top 5-6
  • Most Wanted Languages: Python is at the top with 25.7%
  • Most Wanted Framework: TensorFlow is 2nd, after Node.js

The same logic applies to books, blogs, and tutorials.
I agree that concepts don't change with the programming language, but the examples and code provided in books and blogs definitely accelerate learning.
Almost everyone in the industry will recommend this book to a beginner, and I also found it the best:
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems, 2nd Edition, by Aurélien Géron

Answered by 10xAI on August 31, 2021

As others have pointed out, Python is more general and more programmer-oriented, with more libraries and better hardware support. I'm not an R user, but Python seems faster (it is C-based) and better suited to processing large files or extracting big data from SQL, which in my experience is usually the step that comes before applying statistics or AI to the data.

Of course, if you do all your processing with DataFrames and other R-like data structures, using pandas or other math libraries, you end up with the same poor performance as in R. But with Python you also have the option to process raw data files line by line and byte by byte, optimize processing time on big data sets, use multiprocessing to make full use of the machine, etc.
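As a rough illustration of the line-by-line option, here is a minimal sketch that computes a mean and sample standard deviation in one pass without loading the data into a DataFrame. The function name, the column layout and the in-memory "file" are assumptions made for the example; it uses Welford's online algorithm, which is one common choice and not necessarily what the answer had in mind.

```python
import io
import math


def streaming_mean_sd(lines, column=0, sep=","):
    """One-pass mean and sample SD (Welford's algorithm), constant memory."""
    n, mean, m2 = 0, 0.0, 0.0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = float(line.split(sep)[column])
        n += 1
        delta = value - mean
        mean += delta / n
        m2 += delta * (value - mean)  # running sum of squared deviations
    sd = math.sqrt(m2 / (n - 1)) if n > 1 else float("nan")
    return n, mean, sd


# In-memory stand-in for a large file. With a real file you would pass an
# open file handle instead (the path is up to you) and skip any header line.
fake_file = io.StringIO("2,a\n4,b\n4,c\n4,d\n5,e\n5,f\n7,g\n9,h\n")
n, mean, sd = streaming_mean_sd(fake_file)
print(n, mean, sd)
```

Because the function only needs an iterable of lines, the same code works on a file handle, a decompressed stream, or chunks handed out to worker processes.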

Answered by Rogelio Triviño on August 31, 2021
