Defining Data Type through Scalars

Updated: 2020-02-12 17:55:48
Data Science Essentials For Dummies
Data science programming begins with the language you choose. The most common languages for data science programming are Python and R. Every data form in Python and R begins with a scalar — a single item of a particular type. Precisely how you define a scalar depends on how you want to view objects within your code and the definitions of scalars for your language.

For example, R provides these native, simple data types:

  • Character
  • Numeric (real or decimal)
  • Integer
  • Logical
  • Complex
R has no standalone scalar type: every value is a vector, and what looks like a scalar is a vector of length one. The Character type holds text, so a single string such as "data" is a character vector with one element, and R has no separate type for an individual character. The difference is important when thinking about how R works with scalars. You can read more about character vectors and strings in R at gastonsanchez.com.

Python provides these native, simple data types:

  • Boolean
  • Integer
  • Float
  • Complex
  • String
Note that Python doesn’t include a character data type because it works with strings, not with characters. Yes, you can create a string containing a single character and you can interact with individual characters in a string, but there isn’t an actual character type. To see this fact for yourself, try this code:
# chr() converts a Unicode code point to text; 65 is the letter A
anA = chr(65)
print(type(anA))
The output is <class 'str'>; a language with a dedicated character type would report something like char instead. Consequently, a single string is a scalar in Python but a character vector in R. Keeping language differences in mind will help as you perform analysis on your data.
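
As a minimal sketch of the same point, the following checks suggest that every common way of pulling a single character out of a Python string still yields a str. The variable names are illustrative only and do not come from the book.

```python
# Illustrative names; none of these appear in the original article.
word = "scalar"

first_char = word[0]           # indexing returns a one-character string
print(type(first_char))        # <class 'str'>
print(len(first_char))         # 1

# Iterating over a string also yields one-character strings.
piece_types = {type(piece) for piece in word}
print(piece_types)             # {<class 'str'>}

# ord() and chr() convert between a one-character string and its code point.
code_point = ord("A")
print(code_point)              # 65
print(chr(code_point) == "A")  # True
```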

Most languages also support what you might term semi-native data types. For example, Python supports a Fraction data type that you create by using code like this:

from fractions import Fraction
x = Fraction(2, 3)   # numerator 2, denominator 3
print(x)             # 2/3
print(type(x))       # <class 'fractions.Fraction'>
The fact that you must import Fraction means that it isn’t available all the time, as a built-in type such as complex or int is. The tip-off that this is not a built-in class is the class output of <class 'fractions.Fraction'>, which names the fractions module. However, Fraction ships with your Python installation as part of the standard library, which means that it’s effectively a part of the language (hence, semi-native).
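
One reason to reach for a semi-native type like Fraction is exact arithmetic. The sketch below, which uses illustrative variable names, compares Fraction with float and checks that Fraction is not among the built-in names, whereas complex is.

```python
import builtins
from fractions import Fraction

third = Fraction(1, 3)
sixth = Fraction(1, 6)

# Fractions are stored as exact numerator/denominator pairs.
total = third + sixth
print(total)                    # 1/2
print(total == Fraction(1, 2))  # True

# The equivalent float arithmetic is subject to rounding.
print(0.1 + 0.2 == 0.3)         # False
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True

# Fraction must be imported; complex is always available.
print(hasattr(builtins, "Fraction"))  # False
print(hasattr(builtins, "complex"))   # True
```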

External libraries that define additional scalar data types are available for most languages. Access to these additional scalar types is important in some cases because core Python generally provides just one data type in any particular category.

For example, if you need to create a variable that represents a number without a decimal portion, you use the integer data type (int), whatever the size of the value. Using a generic designation like this is useful because it simplifies code and gives the developer a lot less to worry about.
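
To illustrate the single integer type, here is a short sketch (the variable names are illustrative) showing that Python reports the same type for a small whole number and for one far too large for a 64-bit machine integer.

```python
small = 7
huge = 2 ** 100    # far beyond the range of a 64-bit integer

print(type(small))        # <class 'int'>
print(type(huge))         # <class 'int'>
print(huge)               # 1267650600228229401496703205376
print(huge.bit_length())  # 101
```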

However, in scientific calculations, you often need better control over how data is stored in memory, which means having more data types, something that the NumPy library provides for you.

For example, you might need to define a particular scalar as a short (a value that is 16 bits long on most platforms). Using NumPy (conventionally imported as np), you could define it as myShort = np.short(15). You could define a variable that is guaranteed to be 16 bits long by using the np.int16 type instead. You can discover more about the scalars provided by the NumPy library for Python in the NumPy documentation. You also find that most languages provide means of extending the native types (see the articles at Python.org and greenteapress.com for additional details).
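
The following is a minimal sketch of those sized scalars, assuming that NumPy is installed and imported as np. The variable names are illustrative, and the size of np.short depends on the platform's C compiler, so only the np.int16 results are stated as fixed.

```python
import numpy as np

my_short = np.short(15)   # C short; 16 bits on most platforms (assumption)
my_int16 = np.int16(15)   # always exactly 16 bits

print(my_int16.dtype)     # int16
print(my_int16.itemsize)  # 2 (bytes)
print(my_short.itemsize)  # typically 2, but platform dependent

# iinfo reports the range that a fixed-size integer type can hold.
limits = np.iinfo(np.int16)
print(limits.min, limits.max)  # -32768 32767

# Sized types matter most in arrays, where each element uses a fixed width.
values = np.array([1, 2, 3], dtype=np.int16)
print(values.nbytes)      # 6 (3 elements x 2 bytes)
```

Choosing np.int16 over np.short makes the intended width explicit, which can help when code has to behave the same way on different machines.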

About This Article

This article is from the book: Data Science Essentials For Dummies

About the book authors:

John Paul Mueller is a freelance author and technical editor. He has writing in his blood, having produced 100 books and more than 600 articles to date. The topics range from networking to home security and from database management to heads-down programming. John has provided technical services to both Data Based Advisor and Coast Compute magazines.

Luca Massaron is a data scientist specialized in organizing and interpreting big data and transforming it into smart data by means of the simplest and most effective data mining and machine learning techniques. Because of his job as a quantitative marketing consultant and marketing researcher, he has been involved in quantitative data since 2000 with different clients and in various industries, and is one of the top 10 Kaggle data scientists.