Determining the dataset content for functional programming
Once you load or fetch existing datasets from specific sources, you can apply them to your functional programming goals. These datasets generally have specific characteristics that you can discover online, for example in the scikit-learn documentation for the Boston house-prices dataset. However, you can also use the dir() function to learn about dataset content. When you use dir(Boston) with the previously created Boston house-prices dataset, you discover that it contains DESCR, data, feature_names, and target properties. Here is a short description of each property:
DESCR: Text that describes the dataset content and some of the information you need to use it effectively
data: The content of the dataset in the form of values used for analysis purposes
feature_names: The names of the various attributes in the order in which they appear in data
target: An array of values used with data to perform various kinds of analysis
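As a minimal sketch of this kind of inspection, the code below calls dir() on a dataset object and checks for the four properties just listed. It assumes scikit-learn is installed and, because load_boston was removed in scikit-learn 1.2, it uses the bundled diabetes dataset (load_diabetes) as a stand-in for the Boston object; the variable names are illustrative only.

```python
from sklearn.datasets import load_diabetes

# Assumption: the bundled diabetes dataset stands in for the Boston object.
dataset = load_diabetes()

# dir() on the returned object lists the properties it exposes.
properties = dir(dataset)
print(properties)

# Check for the four properties described above, by name.
for name in ("DESCR", "data", "feature_names", "target"):
    print(name, name in properties)

# data holds one row per record and one column per feature name;
# target holds one value per record.
print(dataset.data.shape, len(dataset.feature_names), len(dataset.target))
```

The same calls should apply to any dataset object that scikit-learn returns in this form, although the exact list that dir() prints can differ between datasets and library versions.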
The print(Boston.DESCR) call displays a wealth of information about the Boston house-prices dataset, including the names of attributes that you can use to interact with the data. Check out the results of these queries.
The information that the datasets contain can have significant commonality. For example, if you use dir(data) for the Olivetti faces dataset example described earlier, you find that it provides access to DESCR, data, images, and target properties. As with the Boston house-prices dataset, DESCR gives you a description of the Olivetti faces dataset, which you can use for things like accessing particular attributes. By knowing the names of common properties and understanding how to use them, you can discover all you need to know about a common dataset in most cases without resorting to any online resource. In this case, you'd use print(data.DESCR) to obtain a description of the Olivetti faces dataset. Also, some of the description data contains links to sites where you can learn more.
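The commonality described above can be checked directly by comparing what dir() reports for two datasets. The following sketch is an illustration under assumptions: it swaps in two datasets bundled with scikit-learn (load_diabetes for a table-style dataset and load_digits for an image dataset) so that nothing needs to be downloaded, and the variable names are hypothetical.

```python
from sklearn.datasets import load_diabetes, load_digits

tabular = load_diabetes()   # assumed stand-in for a table-style dataset
pictures = load_digits()    # assumed stand-in for an image dataset

# Properties that both objects expose under the same name.
shared = sorted(set(dir(tabular)) & set(dir(pictures)))
print(shared)

# Properties that only the image dataset exposes.
image_only = sorted(set(dir(pictures)) - set(dir(tabular)))
print(image_only)

# DESCR is plain text, so the first few lines are easy to preview.
for line in pictures.DESCR.strip().splitlines()[:3]:
    print(line)
```

Comparing property names this way is a quick check of which access patterns carry over from one dataset to the next.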
Using the dataset sample code for functional programming
The online sources are important because they provide you with access to sample code, in addition to information about the dataset. For example, the Boston house-prices site provides access to six examples, one of which is the Gradient Boosting Regression example. Discovering how others access these datasets can help you build your own code. Of course, the dataset doesn’t limit you to the uses shown by these examples; the data is available for any use you might have for it.
Creating a DataFrame
The common datasets are in a form that allows various types of analysis, as shown by the examples provided on the sites that describe them. However, you might not want to work with the dataset in that manner; instead, you may want something that looks a bit more like a database table. Fortunately, you can use the pandas library to perform the conversion in a manner that makes using the datasets in other ways easy. Using the Boston house-prices dataset as an example, the following code performs the required conversion:
import pandas as pd
BostonTable = pd.DataFrame(Boston.data, columns=Boston.feature_names)
If you want to include the target values with the DataFrame, you must also execute: BostonTable['target'] = Boston.target. However, the examples here don’t use the target data.
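To see the conversion pattern in isolation, here is a small self-contained sketch. The array values are made up for illustration and are not real Boston records; only the column names CRIM, RM, and AGE come from the text, and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up stand-in values, NOT real Boston records.
feature_names = ["CRIM", "RM", "AGE"]
data = np.array([
    [0.010, 6.5, 65.2],
    [0.027, 6.4, 78.9],
    [0.015, 7.1, 45.8],
    [0.032, 6.9, 54.2],
])
target = np.array([24.0, 21.6, 34.7, 33.4])

# Same pattern as the conversion above: values plus column names.
table = pd.DataFrame(data, columns=feature_names)

# Optional step: attach the target values as one more column.
table["target"] = target

print(table)
print(table.shape)
```

Keeping the target in the same table is convenient for browsing, but it may be worth leaving it separate if later code expects only the feature columns.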
Accessing specific records for functional programming
If you were to run the dir() command against a DataFrame, you would find that it provides you with an overwhelming number of functions to try. The pandas documentation supplies a good overview of what's possible (which includes all the usual database-specific tasks specified by CRUD). The following example code shows how to perform a query against a pandas DataFrame. In this case, the code selects only those housing areas where the per capita crime rate is below 0.02.
CRIMTable = BostonTable.query('CRIM < 0.02')
print(CRIMTable.count()['CRIM'])
The output shows that only 17 records match the criteria. The count() function enables the application to count the records in the resulting CRIMTable. The index, ['CRIM'], selects just one of the available attributes (because every column is likely to report the same count).
You can display all these records with all of the attributes, but you may want to see only the number of rooms and the average house age for the affected areas. The following code shows how to display just the attributes you actually need:
print(CRIMTable[['RM', 'AGE']])
The image below shows the output from this code. As you can see, the houses vary between 5 and nearly 8 rooms in size. The age varies from almost 14 years to a little over 65 years.
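The filtering, counting, and column-selection steps above can be tried end to end on a tiny table. This is a sketch under assumptions: the values are made up and are not real Boston records, so the counts differ from the 17 records reported in the text, and the variable names are hypothetical. It also shows a boolean mask as an equivalent way to express the same filter.

```python
import pandas as pd

# Made-up stand-in values, NOT real Boston records.
table = pd.DataFrame({
    "CRIM": [0.010, 0.027, 0.015, 0.032, 0.018],
    "RM": [6.5, 6.4, 7.1, 6.9, 5.9],
    "AGE": [65.2, 78.9, 45.8, 54.2, 13.9],
})

# query() takes the filter as a string expression.
low_crime = table.query("CRIM < 0.02")

# A boolean mask expresses the same filter.
low_crime_mask = table[table["CRIM"] < 0.02]

# count() reports non-missing values per column; indexing picks one column.
print(low_crime.count()["CRIM"])

# len() counts rows regardless of missing values.
print(len(low_crime))

# A list of column names selects just those attributes.
print(low_crime[["RM", "AGE"]])
```

When a column can contain missing values, count() and len() can disagree, so len() may be the safer choice if the goal is simply the number of matching rows.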
You might find it a bit hard to work with the unsorted data you see above. Fortunately, you do have access to the full range of common database features. If you want to sort the values by number of rooms, you use:
print(CRIMTable[['RM', 'AGE']].sort_values('RM'))
As an alternative, you can always choose to sort by average home age:
print(CRIMTable[['RM', 'AGE']].sort_values('AGE'))
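Sorting can go a little further than a single ascending column. The sketch below, again using made-up values rather than real Boston records and hypothetical variable names, sorts in descending order and then uses a second column to break ties. Both options rely on the ascending parameter of sort_values().

```python
import pandas as pd

# Made-up stand-in values, NOT real Boston records.
subset = pd.DataFrame({
    "RM": [6.5, 7.1, 5.9, 7.1],
    "AGE": [65.2, 45.8, 13.9, 30.0],
})

# Largest room counts first.
by_rooms_desc = subset.sort_values("RM", ascending=False)
print(by_rooms_desc)

# Rooms descending, then age ascending to break ties.
by_rooms_then_age = subset.sort_values(["RM", "AGE"], ascending=[False, True])
print(by_rooms_then_age)
```

By default, sort_values() returns a new DataFrame and leaves the original untouched, which fits a functional style in which data isn't modified in place.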