Introduction to Computing for Engineers and Computer Scientists

Finishing OO and on to Machine Learning

Questions, Discussion

Piazza

From Class

HW5 Discussion

General Comments

  • (A): OO programming is complex and is very baffling in the beginning, but does enable code reuse and overall program simplification.
  • (B): Python's operator overloading and its ability to support "mathematical" and other functions for abstract types is one of the language's most powerful features. The concept is simple in scope but subtle to understand, however.
  • (A) + (B) $\Rightarrow$ HW5: The complexity and subtlety mean that HW5 is intellectually very hard, but the overall scope is very small.
  • We have touched a small subset of Python's OO capabilities and general support for extensibility, both of which are language strengths.
  • This assignment is hard. I understand and we will grade accordingly.
  • But, many of you are banging your heads against the wall in frustration.

Part 1: Abstract Algebra

  • Color is a simple example. We will represent color with symbols and a Cayley table.
  • $G = \{r, g, b, y, m, c\}$, where the symbols represent red, green, blue, yellow, magenta, cyan.
  • The (silly) Cayley table is (clearly I was bad at art in kindergarten):

[Image: Cayley table for the color group]

  • Representing this as a list of lists would be awkward.
    • color_cayley['r']['g'] does not make much sense for lists.
    • List indexing is of the form color_cayley[1][5]. There are ways I could make this work, but it would be awkward.
  • color_cayley['r']['g'] looks like a dictionary of dictionaries. I could have done the table this way, but chose a slightly different approach. In my dictionary,
    • A pair of symbols is the key, e.g. ('r', 'g')
    • The value is the symbol that represents "multiplying" the colors using the group operator.
In [7]:
# This simply keeps me from having to type quote marks over and over.
r,g,b,y,m,c = 'r','g','b','y','m','c'
cayley_colors = {}
cayley_colors.update({(r,r):r, (r,g):y, (r,b):m, (r,y):g, (r,c):b, (r,m):c})
cayley_colors.update({(g,r):y, (g,g):g, (g,b):c, (g,y):r, (g,c):b, (g,m):m})
cayley_colors.update({(b,r):m, (b,g):c, (b,b):b, (b,y):g, (b,c):r, (b,m):y})
cayley_colors.update({(y,r):m, (y,g):r, (y,b):c, (y,y):y, (y,c):g, (y,m):b})
cayley_colors.update({(c,r):g, (c,g):y, (c,b):m, (c,y):b, (c,c):c, (c,m):r})
cayley_colors.update({(m,r):c, (m,g):y, (m,b):r, (m,y):g, (m,c):b, (m,m):m})
print("If I got this right, with so little coffee, it is a miracle.")
cayley_colors
If I got this right, with so little coffee, it is a miracle.
Out[7]:
{('b', 'b'): 'b',
 ('b', 'c'): 'r',
 ('b', 'g'): 'c',
 ('b', 'm'): 'y',
 ('b', 'r'): 'm',
 ('b', 'y'): 'g',
 ('c', 'b'): 'm',
 ('c', 'c'): 'c',
 ('c', 'g'): 'y',
 ('c', 'm'): 'r',
 ('c', 'r'): 'g',
 ('c', 'y'): 'b',
 ('g', 'b'): 'c',
 ('g', 'c'): 'b',
 ('g', 'g'): 'g',
 ('g', 'm'): 'm',
 ('g', 'r'): 'y',
 ('g', 'y'): 'r',
 ('m', 'b'): 'r',
 ('m', 'c'): 'b',
 ('m', 'g'): 'y',
 ('m', 'm'): 'm',
 ('m', 'r'): 'c',
 ('m', 'y'): 'g',
 ('r', 'b'): 'm',
 ('r', 'c'): 'b',
 ('r', 'g'): 'y',
 ('r', 'm'): 'c',
 ('r', 'r'): 'r',
 ('r', 'y'): 'g',
 ('y', 'b'): 'c',
 ('y', 'c'): 'g',
 ('y', 'g'): 'r',
 ('y', 'm'): 'b',
 ('y', 'r'): 'm',
 ('y', 'y'): 'y'}
  • I can define a new subclass of BaseCayleyGroup
In [46]:
class BaseCayleyGroup(object):
    pass



class CayleyColorsGroup(BaseCayleyGroup):
    
    __symbols = {'r', 'g', 'b', 'y', 'c', 'm'}  # Attribute of class -- group symbols
    __table = None # Attribute of class -- Cayley Table for class.

    def __init__(self, v):

        # Not strictly the correct approach. Should probably raise an error for an invalid symbol, but OK.
        self.__value = v

        # This is a property of the class. I could have done this outside of the class
        # but that freaks me out for no good reason.
        if CayleyColorsGroup.__table is None:
            
            r,g,b,y,m,c = 'r','g','b','y','m','c'
            cayley_colors = {}
            cayley_colors.update({(r,r):r, (r,g):y, (r,b):m, (r,y):g, (r,c):b, (r,m):c})
            cayley_colors.update({(g,r):y, (g,g):g, (g,b):c, (g,y):r, (g,c):b, (g,m):m})
            cayley_colors.update({(b,r):m, (b,g):c, (b,b):b, (b,y):g, (b,c):r, (b,m):y})
            cayley_colors.update({(y,r):m, (y,g):r, (y,b):c, (y,y):y, (y,c):g, (y,m):b})
            cayley_colors.update({(c,r):g, (c,g):y, (c,b):m, (c,y):b, (c,c):c, (c,m):r})
            cayley_colors.update({(m,r):c, (m,g):y, (m,b):r, (m,y):g, (m,c):b, (m,m):m})
            CayleyColorsGroup.__table = cayley_colors
            

    @classmethod # Method on class. HINT: Base class calls.
    def get_cayley_table(cls):
        return CayleyColorsGroup.__table

    # Method on instance. HINT: base class calls.
    def get_value(self):
        return self.__value
In [48]:
c1 = CayleyColorsGroup('r')
c2 = CayleyColorsGroup('r')
c3 = CayleyColorsGroup('g')
In [49]:
print('(c1 == c2) =', c1 == c2)
print('(c1 == c3) =', c1 == c3)
(c1 == c2) = True
(c1 == c3) = False
  • What did that test tell me? The == operator works $\Rightarrow$
    • Some class implements __eq__()
    • CayleyColorsGroup does not implement __eq__()
    • So, the method MUST be in BaseCayleyGroup.
    • And I gave you two hints:
      1. Group elements are equal if their symbols are the same.
      2. BaseCayleyGroup calls CayleyColorsGroup.get_value()
In [50]:
c4 = c1*c2
print(c4)
CayleyColorsGroup('r')
  • What did that test tell me? The * operator works $\Rightarrow$
    • Some class implements __mul__()
    • CayleyColorsGroup does not implement __mul__()
    • So, the method MUST be in BaseCayleyGroup.
    • And I gave you three hints:
      1. You need the Cayley Table to look up the answer.
      2. I implemented the CayleyTable as a dictionary.
      3. BaseCayleyGroup calls CayleyColorsGroup.get_cayley_table()
In [53]:
c5 = c4.get_cayley_table()[('r','g')]
c5
Out[53]:
'y'
  • You know that getting the table returns a dictionary. You could have tried the method. What the heck, right?
  • The answer was not quite right, because we want a member of the group, not a bare symbol. So, __mul__() must be something like
In [56]:
c6 = CayleyColorsGroup('r')
c7 = CayleyColorsGroup('g')
print("If I have ", c6, 'and', c7, 'then', c6, '*', c7, '=')
c8 = CayleyColorsGroup(c6.get_cayley_table()[c6.get_value(), c7.get_value()])
print(c8)
If I have  CayleyColorsGroup('r') and CayleyColorsGroup('g') then CayleyColorsGroup('r') * CayleyColorsGroup('g') =
CayleyColorsGroup('y')
  • So what have we concluded?
    1. BaseCayleyGroup must implement __eq__() and __mul__()
    2. It must be calling get_value() and get_cayley_table() on itself, and getting the child implementation.
    3. Don did something weird with the __str__() in BaseCayleyGroup to do the printing, but he did not ask us to do that.
  • If you can do this homework, you will have mastered the two core concepts in OO:
    • Simplifying application development by putting common functions in base classes, and inheriting them in subclasses.
    • Extending and completing base classes and frameworks by implementing/overriding base methods.
  • There is one more trick. x = BaseCayleyGroup(anything) must raise an exception.
  • All told, you write about 8 statements in the base class, and you know you have to implement

    • __init__()
    • __eq__()
    • __str__() (Optional).
    • __mul__()
  • You have implemented simple classes and demonstrated Abstraction, Encapsulation and Inheritance.
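Putting the clues together, here is one sketch of what such a base class could look like. This is my guess at a shape, not the official solution: the attribute name _value, the TinyGroup demo subclass, and its two-element table are all made up for illustration (the real subclass is CayleyColorsGroup).

```python
class BaseCayleyGroup(object):
    """Sketch: the base class owns the generic group machinery and
    delegates symbol/table access to methods the subclass provides."""

    def __init__(self, v=None):
        # Direct instantiation of the base class must raise.
        if type(self) is BaseCayleyGroup:
            raise TypeError("BaseCayleyGroup is abstract; use a subclass.")
        self._value = v  # illustrative attribute name

    def get_value(self):
        return self._value

    def __eq__(self, other):
        # Group elements are equal if their symbols are the same.
        return self.get_value() == other.get_value()

    def __mul__(self, other):
        # Look up the product symbol in the subclass's Cayley table,
        # then wrap it in a new instance of the same subclass.
        symbol = self.get_cayley_table()[(self.get_value(), other.get_value())]
        return type(self)(symbol)

    def __str__(self):
        return "%s('%s')" % (type(self).__name__, self.get_value())


class TinyGroup(BaseCayleyGroup):
    # A made-up two-element group {e, a} just to exercise the base class.
    __table = {('e', 'e'): 'e', ('e', 'a'): 'a',
               ('a', 'e'): 'a', ('a', 'a'): 'e'}

    @classmethod
    def get_cayley_table(cls):
        return cls.__table


print(TinyGroup('e') * TinyGroup('a'))                    # TinyGroup('a')
print(TinyGroup('a') * TinyGroup('a') == TinyGroup('e'))  # True
```

The key OO move is that __eq__() and __mul__() live once in the base class but call get_value() and get_cayley_table(), which resolve to the subclass's implementations at run time.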

Part 2: Matrices

  • I showed you what a matrix looks like.
tt = MySparseMatrix(int, 2, 2)
tt.set(0,0,11)
tt.set(0,1,5)
tt.set(1,0,2)
print(tt.get(0,1))
print("tt = ", tt)
  • Produces
    5
    tt =  2 X 2 matrix of type <class 'int'>:
    11,5
    2,0
  • Well, an empty 2x2 matrix is something like
In [58]:
m = [[0,0],[0,0]]
print(m)
[[0, 0], [0, 0]]
  • So, I must have some kind of __str__() method to print slightly differently.
  • set(0, 0, 11) must map to something like m[0][0], but that only works if I have already set up the lists with default values. Otherwise, indexing raises an IndexError.
  • And I know that I must do something to create values even if they are not set, because the entry (1,1) prints as 0 even though it was never set.
  • But, the matrix is of any type. I pass in the type. How do I make default values of a type in a general way?
  • Hint:
In [59]:
x = complex
print(x())
0j
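To make the hint concrete: a type object can be called to produce its default value, so a list-of-lists matrix can be pre-filled generically. make_default_grid is a hypothetical helper name I made up, not part of the assignment's API.

```python
def make_default_grid(element_type, rows, cols):
    # Calling a type with no arguments produces its default value:
    # int() -> 0, float() -> 0.0, complex() -> 0j, str() -> ''.
    # The comprehension (rather than [[...] * cols] * rows) ensures each
    # row is a distinct list, so setting one cell cannot alias another row.
    return [[element_type() for _ in range(cols)] for _ in range(rows)]


print(make_default_grid(complex, 2, 2))  # [[0j, 0j], [0j, 0j]]
```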
  • That is enough hints for MyMatrix. Equality and addition are obvious once you have figured out the basic matrix initialization, get and set.
  • What about sparse matrices? You can only "store" values that are set.
  • Think about what that means. The caller will give you (i,j) and you have to return:
    • The value if it exists.
    • The default for the type, otherwise.
    • Well, the Cayley table implementation shows you how to map (i,j) to a value.
    • And you had to solve the default-value problem for MyMatrix.
  • Matrix addition and equality require get() and set(), and you know how to override a method.
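For concreteness, here is a minimal sketch of the sparse idea, combining the pair-key dictionary from the Cayley table with the type-default trick. SparseGrid and its attribute names are stand-ins I invented; the real assignment class is MySparseMatrix.

```python
class SparseGrid:
    """Illustrative sparse 2-D store: only explicitly set cells use memory."""

    def __init__(self, element_type, rows, cols):
        self._type = element_type
        self._rows, self._cols = rows, cols
        self._cells = {}  # (i, j) -> value, only for cells that were set

    def set(self, i, j, value):
        self._cells[(i, j)] = value

    def get(self, i, j):
        # Stored value if it exists, else the type's default (e.g. int() == 0).
        return self._cells.get((i, j), self._type())


s = SparseGrid(int, 2, 2)
s.set(0, 0, 11)
print(s.get(0, 0))  # 11
print(s.get(1, 1))  # 0  (never set, so the default for int)
```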
  • You have seen that the print statements can be helpful, but can also be overwhelming.
  • At different stages of development, I need to turn some on and some off. Just embedding print statements and commenting out is tedious.
  • I would also like to:
    • Selectively print some statements for comfort, debugging, etc.
    • Print messages when something critical happens.
    • Write a lot of information to a (CSV) file for later analysis.
  • Instead of printing explicitly, I will have the code emit log events when something happens. Configurable event listeners will determine
    • What messages go to the console for debugging, monitoring, etc.
    • What data goes to a file for subsequent analysis.
  • Most environments, including Python, have the concept of a logger.
  • Instead of adding, removing, commenting out, etc. "print()" statements, I should
    • instrument my code with log statements
    • And then use
      • Handlers
      • Filters
    • To determine which information goes where.
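The demo below shows handlers; a filter is the other knob. As an illustration, a filter attached to a handler can drop records by content rather than by level. KeywordFilter is a name I made up; only logging.Filter and its filter() hook are standard.

```python
import logging


class KeywordFilter(logging.Filter):
    # Hypothetical filter: only pass records whose message mentions a keyword.
    def __init__(self, keyword):
        super().__init__()
        self.keyword = keyword

    def filter(self, record):
        # Returning a falsy value drops the record at this handler.
        return self.keyword in record.getMessage()


filter_logger = logging.getLogger("filter_demo")
filter_logger.setLevel(logging.DEBUG)
h = logging.StreamHandler()
h.addFilter(KeywordFilter("matrix"))
filter_logger.addHandler(h)

filter_logger.debug("matrix initialized")  # passes the filter
filter_logger.debug("unrelated message")   # dropped by the filter
```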

import logging

file_level = logging.DEBUG
console_level = logging.ERROR

In [6]:
# Create a logger to show to the class.
# An application can have several loggers.
logger = logging.getLogger('e1006_application')

# The logger's level is DEBUG.
logger.setLevel(logging.DEBUG)

# create file handler which logs even debug messages
fh = logging.FileHandler('demo.log')
fh.setLevel(logging.DEBUG)

# create console handler with a higher log level
ch = logging.StreamHandler()
ch.setLevel(logging.ERROR)

# create formatter and add it to the handlers
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
fh.setFormatter(formatter)
ch.setFormatter(formatter)

# add the handlers to the logger
logger.addHandler(fh)
logger.addHandler(ch)
In [9]:
logger.debug("***** Starting a test run *******")
logger.info('Creating an info message.')
logger.debug("Creating a debug message")
logger.warning("Creating a warning message.")
logger.error("Creating an ERROR message.")
logger.error("Creating another ERROR message.!")
2018-04-11 16:01:48,561 - e1006_application - ERROR - Creating an ERROR message.
2018-04-11 16:01:48,562 - e1006_application - ERROR - Creating another ERROR message.!
  • What does the log file look like?
In [11]:
with open("demo.log", "r") as log_file:
    for line in log_file:
        print(line)
2018-04-11 16:00:44,532 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:00:44,532 - e1006_application - INFO - Creating an info message.

2018-04-11 16:00:44,532 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:00:44,533 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:00:44,533 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:03,534 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:01:03,534 - e1006_application - INFO - Creating an info message.

2018-04-11 16:01:03,535 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:01:03,535 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:01:03,535 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:48,560 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:01:48,560 - e1006_application - INFO - Creating an info message.

2018-04-11 16:01:48,561 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:01:48,561 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:01:48,561 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:48,562 - e1006_application - ERROR - Creating another ERROR message.!

In [12]:
# I can change the level of the logger.
ch.setLevel(logging.WARNING)
In [13]:
logger.debug("***** Starting a test run *******")
logger.info('Creating an info message.')
logger.debug("Creating a debug message")
logger.warning("Creating a warning message.")
logger.error("Creating an ERROR message.")
logger.error("Creating another ERROR message.!")
2018-04-11 16:05:14,867 - e1006_application - WARNING - Creating a warning message.
2018-04-11 16:05:14,868 - e1006_application - ERROR - Creating an ERROR message.
2018-04-11 16:05:14,869 - e1006_application - ERROR - Creating another ERROR message.!
  • An application can also define its own logging levels. The default levels are:
Level Numeric value
CRITICAL 50
ERROR 40
WARNING 30
INFO 20
DEBUG 10
NOTSET 0
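A sketch of registering a custom level: the name NOTICE and the value 25 are my choices, not anything predefined; logging.addLevelName is the standard hook.

```python
import logging

# Register a custom level between INFO (20) and WARNING (30).
NOTICE = 25
logging.addLevelName(NOTICE, "NOTICE")

custom_logger = logging.getLogger("custom_level_demo")
custom_logger.setLevel(logging.DEBUG)
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
custom_logger.addHandler(handler)

# Emitted at the custom level; a handler set above 25 would suppress it.
custom_logger.log(NOTICE, "Something noteworthy, but not a warning.")
```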
  • And the file again, ...
In [14]:
with open("demo.log", "r") as log_file:
    for line in log_file:
        print(line)
2018-04-11 16:00:44,532 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:00:44,532 - e1006_application - INFO - Creating an info message.

2018-04-11 16:00:44,532 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:00:44,533 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:00:44,533 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:03,534 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:01:03,534 - e1006_application - INFO - Creating an info message.

2018-04-11 16:01:03,535 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:01:03,535 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:01:03,535 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:48,560 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:01:48,560 - e1006_application - INFO - Creating an info message.

2018-04-11 16:01:48,561 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:01:48,561 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:01:48,561 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:01:48,562 - e1006_application - ERROR - Creating another ERROR message.!

2018-04-11 16:05:14,866 - e1006_application - DEBUG - ***** Starting a test run *******

2018-04-11 16:05:14,866 - e1006_application - INFO - Creating an info message.

2018-04-11 16:05:14,867 - e1006_application - DEBUG - Creating a debug message

2018-04-11 16:05:14,867 - e1006_application - WARNING - Creating a warning message.

2018-04-11 16:05:14,868 - e1006_application - ERROR - Creating an ERROR message.

2018-04-11 16:05:14,869 - e1006_application - ERROR - Creating another ERROR message.!

Logging Conceptual Model
  • In practice: In the product I have been building:
    • We write very detailed log messages for auditing, problem determination, etc.
    • We selectively send critical messages to a Slack channel to immediately notify someone on the team that there is a problem.
Slack Critical Message Channel
Full Log Stream

Summary:

  • There is no action item or to-do for you. This is FYI.
  • I put print statements in my example code to help with teaching and explanation.
  • I want to make sure that you understand this is not the approach "real" applications use.

The Strange Island

  • We have four lectures left, including this one. I need to move on from OO.
  • We implemented
    • Initialization
    • Animation
    • A class model for organisms, animals, plants, predators and prey.
    • The ability to move around and die of old age.
    • We need to implement starvation, reproduction and eating. This occurs when two things are at the same (i,j) location. All we need to do is implement a function that:
      • Looks at every (i,j) location.
      • If there is more than one thing
        • (moose, moose) $\rightarrow$ new moose if long enough since last birth.
        • (wolf, wolf) $\rightarrow$ new wolf if long enough since last birth
        • (wolf, moose) or (moose, wolf) $\rightarrow$ fed wolf and dead moose.
        • (plant, moose) or (moose, plant) $\rightarrow$ fed moose and dead plant.
    • We also need to support plants spreading.
  • These are relatively straightforward extensions to what we have done, but I want to move on to some cool things. So,
    • I will complete and publish the code.
    • Schedule an optional recitation where I walk through the code and we discuss the details for the few of you who are interested.
    • Extra-credit 3 will be examining my code and suggesting, in writing, 3 changes that would improve the solution that I hacked together.
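The encounter rules above can be sketched in code. Everything here is illustrative: the occupant dicts, field names like "kind" and "last_birth", and the cooldown value are stand-ins for the real organism classes.

```python
def resolve_encounters(grid, now, birth_cooldown=5):
    """grid maps (i, j) -> list of occupants at that location."""
    for occupants in grid.values():
        if len(occupants) < 2:
            continue  # nothing happens with fewer than two things
        a, b = occupants[0], occupants[1]
        kinds = {a["kind"], b["kind"]}
        if kinds == {"moose"} or kinds == {"wolf"}:
            # Same species: a birth, if long enough since the last one.
            if now - a["last_birth"] >= birth_cooldown:
                occupants.append({"kind": a["kind"], "last_birth": now})
                a["last_birth"] = now
        elif kinds == {"wolf", "moose"}:
            # Predation: fed wolf, dead moose.
            wolf = a if a["kind"] == "wolf" else b
            moose = b if wolf is a else a
            occupants.remove(moose)
            wolf["last_fed"] = now
        elif kinds == {"plant", "moose"}:
            # Grazing: fed moose, dead plant.
            moose = a if a["kind"] == "moose" else b
            plant = b if moose is a else a
            occupants.remove(plant)
            moose["last_fed"] = now


island = {(3, 4): [{"kind": "wolf"}, {"kind": "moose"}]}
resolve_encounters(island, now=7)
print(island[(3, 4)])  # the wolf remains, and it has just been fed
```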

Moneyball!

Lecture 1 $-$ Reminder: Why Python?

"The Case For Python in Scientific Computing"

  1. Interoperability with other languages: Complex science and engineering problems interact with preexisting systems and devices that use a diverse set of languages, interfaces, data models, etc.

  2. Reusable tools and libraries: All languages come with built-in tools (reusable sets of programs). Python by far has the best set of reusable tools and libraries. "If you have a problem to solve, you can most likely find a library to help, and it's probably on GitHub!"

  3. Open source: Enables independent extensions and evolution, although this is true of many languages.

Python and Scientific/Engineering
  • For engineers and applied scientists, learning the Python language is a means to an end.
  • The end is the rich and expanding portfolio of tools, frameworks, and utilities.

Introduction and Overview

  • "Moneyball: The Art of Winning an Unfair Game is a book by Michael Lewis, published in 2003, about the Oakland Athletics baseball team and its general manager Billy Beane. Its focus is the team's analytical, evidence-based, sabermetric approach to assembling a competitive baseball team, despite Oakland's disadvantaged revenue situation."
  • "The central premise of Moneyball is that the collective wisdom of baseball insiders (including players, managers, coaches, scouts, and the front office) over the past century is subjective and often flawed. Statistics such as stolen bases, runs batted in, and batting average, typically used to gauge players, are relics of a 19th-century view of the game and the statistics available at that time. Before sabermetrics was introduced to baseball, teams were dependent on the skills of their scouts to find and evaluate players. Scouts are those who are experienced in the sport, usually having been involved as players or coaches. The book argues that the Oakland A's' front office took advantage of more analytical gauges of player performance to field a team that could better compete against richer competitors in Major League Baseball (MLB)"
  • We will do very simple versions of the concept to get a feel for data analytics, numerical libraries and machine learning using Python.

Let's Do Some Simple Data Browsing/Visualization

  • We are engineers and scientists. We are not going to build a team using experience, conventional wisdom, the way we have always done it, etc.
  • This is a scientific and engineering problem. We need data and a model.

Data

  • What data do we have about teams?
Team Performance Data
  • What do we know about players?
    • Batting is offensive statistics, e.g. hits, home runs, etc.
    • Fielding is how they handle hit balls, e.g. attempts, errors.
    • Pitching is the opposite of batting, e.g. how often does the pitcher give up a hit?
    • What positions can/do players play?
    • Salaries.
  • Positions
Player Positions
  • Offensive (Batting)
Player Batting Info
  • Pitching
Player Pitching Info
  • Fielding
Player Fielding Info
  • Salaries
Player Salary Info

Read Data

In [2]:
import pandas as pd

df = pd.read_csv("../Data/teaminfo.csv")
In [61]:
df.tail(20)
Out[61]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
898 2016 HOU 3 162 84 78 N 724 5545 1367 ... 554 1452 102 44 31.0 701 663 4.06 77 0.987
899 2016 KCA 3 162 81 81 N 675 5552 1450 ... 382 1224 121 35 34.0 712 674 4.21 94 0.984
900 2016 LAA 4 162 74 88 N 717 5431 1410 ... 471 991 73 34 49.0 727 676 4.28 97 0.983
901 2016 LAN 1 162 91 71 N 725 5518 1376 ... 525 1321 45 26 32.0 638 598 3.70 80 0.986
902 2016 MIA 3 161 79 82 N 655 5547 1460 ... 447 1213 71 28 38.0 682 646 4.05 86 0.985
903 2016 MIL 4 162 73 89 N 671 5330 1299 ... 599 1543 181 56 39.0 733 650 4.08 136 0.978
904 2016 MIN 5 162 59 103 N 722 5618 1409 ... 513 1426 91 32 43.0 889 814 5.08 126 0.979
905 2016 NYA 4 162 84 78 N 680 5458 1378 ... 475 1188 72 22 49.0 702 660 4.16 86 0.985
906 2016 NYN 2 162 87 75 N 671 5459 1342 ... 517 1302 42 18 41.0 617 574 3.57 90 0.985
907 2016 OAK 5 162 69 93 N 653 5500 1352 ... 442 1145 50 23 34.0 761 718 4.51 97 0.984
908 2016 PHI 4 162 71 91 N 610 5434 1305 ... 424 1376 96 45 30.0 796 739 4.63 97 0.984
909 2016 PIT 3 162 78 83 N 729 5542 1426 ... 561 1334 110 45 36.0 758 679 4.21 111 0.982
910 2016 SDN 5 162 68 94 N 686 5419 1275 ... 449 1500 125 45 36.0 770 708 4.43 109 0.982
911 2016 SEA 2 162 86 76 N 768 5583 1446 ... 506 1288 56 28 41.0 707 647 4.00 89 0.985
912 2016 SFN 2 162 87 75 N 715 5565 1437 ... 572 1107 79 36 46.0 631 593 3.65 72 0.988
913 2016 SLN 2 162 86 76 N 779 5548 1415 ... 526 1318 35 26 41.0 712 656 4.08 107 0.983
914 2016 TBA 5 162 68 94 N 672 5481 1333 ... 449 1482 60 37 28.0 713 665 4.20 94 0.984
915 2016 TEX 1 162 95 67 N 765 5525 1446 ... 436 1220 99 36 40.0 757 700 4.37 97 0.984
916 2016 TOR 2 162 89 73 N 759 5479 1358 ... 632 1362 54 24 40.0 666 613 3.78 88 0.986
917 2016 WAS 1 162 95 67 N 763 5490 1403 ... 536 1252 121 39 63.0 612 570 3.51 73 0.988

20 rows × 23 columns

  • What did we just do?
    • Read a CSV into a Pandas DataFrame
    • Examined the last 20 rows (actually produced a new DataFrame containing the last 20 rows of the original DataFrame).
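The same pattern on a toy frame, to make the "new DataFrame" point concrete. The three rows below are invented; the real data comes from ../Data/teaminfo.csv.

```python
import pandas as pd

# A tiny stand-in for the teams DataFrame.
demo = pd.DataFrame({"teamid": ["HOU", "KCA", "TEX"], "wins": [84, 81, 95]})

# tail() does not modify demo; it returns a NEW DataFrame.
last_two = demo.tail(2)
print(last_two)
```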

Let's Visualize and Analyze

  • Cool thing 1: Histogram Wins
In [21]:
import matplotlib.pyplot as plt
plt.ioff()

# Make a figure/subplot
plt.figure(figsize=(10,8))

# Histogram wins into 20 buckets
n, bins, p = plt.hist(df['wins'], 20, label="Histogram of Wins")

# Compute the mean number of wins and plot as a line.
mu = df['wins'].mean()
w_min = df['wins'].min()
w_max = df['wins'].max()
l = plt.plot([w_min, w_max], [mu, mu], label="Average Wins")

plt.legend()

plt.show()
  • Cool thing 2: Scatter Plot wins as a function of homeruns.
In [33]:
# Make a new DataFrame with just the columns we want.
w_v_hr = pd.DataFrame(df[['home_runs','wins']])

# Make a new figure and set the size.
plt.figure(figsize=(10,8))

# Produce a scatter plot. The point color is green and the marker is a triangle.
plt.scatter(w_v_hr['home_runs'], w_v_hr['wins'], color="g", marker="^")

# Set some labels, font sizes and colors.
plt.xlabel("Homeruns", fontsize=20, color="r")
plt.ylabel("Wins", fontsize=20, color="b")

# Show the figure.
plt.show()
  • What is the correlation between HRs and wins?
In [34]:
w_v_hr.corr()
Out[34]:
home_runs wins
home_runs 1.000000 0.366849
wins 0.366849 1.000000

HOW TO INTERPRET A CORRELATION COEFFICIENT R

In statistics, the correlation coefficient r measures the strength and direction of a linear relationship between two variables on a scatterplot. The value of r is always between +1 and –1. To interpret its value, see which of the following values your correlation r is closest to:

  • Exactly $-1$: A perfect downhill (negative) linear relationship
  • $-0.70$: A strong downhill (negative) linear relationship
  • $-0.50$: A moderate downhill (negative) relationship
  • $-0.30$: A weak downhill (negative) linear relationship
  • $0$: No linear relationship
  • $+0.30$: A weak uphill (positive) linear relationship
  • $+0.50$: A moderate uphill (positive) relationship
  • $+0.70$: A strong uphill (positive) linear relationship
  • Exactly $+1$: A perfect uphill (positive) linear relationship
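The number .corr() reports is the Pearson correlation coefficient r. A from-scratch sketch, only to connect the value to its formula (pandas computes this for you):

```python
import math


def pearson_r(xs, ys):
    # r = cov(x, y) / (std(x) * std(y)), with the 1/n factors cancelling.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


print(round(pearson_r([1, 2, 3, 4], [1, 2, 3, 4]), 6))  # 1.0  (perfect uphill)
print(round(pearson_r([1, 2, 3, 4], [4, 3, 2, 1]), 6))  # -1.0 (perfect downhill)
```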
  • Correlations gone wild.
In [35]:
df.corr()
Out[35]:
year rank games wins loses runs at_basts hits doubles triples ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
year 1.000000 -0.143385 0.173925 0.064945 0.065103 0.036425 0.184300 0.084122 0.373114 -0.151842 ... -0.140116 0.693800 -0.311655 -0.544616 -0.278731 0.034589 0.103848 0.031735 -0.508298 0.541028
rank -0.143385 1.000000 0.048105 -0.773918 0.811173 -0.413253 0.011006 -0.200168 -0.205209 0.011821 ... -0.307768 0.004263 -0.038045 0.142643 -0.247109 0.444685 0.407737 0.414782 0.337042 -0.337473
games 0.173925 0.048105 1.000000 0.370266 0.363513 0.339765 0.967173 0.652376 0.401734 0.122280 ... 0.306388 0.413115 0.107512 0.085399 0.070013 0.317847 0.315043 -0.118703 0.221394 0.088089
wins 0.064945 -0.773918 0.370266 1.000000 -0.730609 0.558594 0.391991 0.478310 0.326008 0.028302 ... 0.462781 0.063844 0.129699 -0.022943 0.293405 -0.395921 -0.371180 -0.564773 -0.188008 0.325085
loses 0.065103 0.811173 0.363513 -0.730609 1.000000 -0.310345 0.317679 -0.000114 -0.031263 0.060619 ... -0.239241 0.241311 -0.052518 0.084934 -0.291676 0.630772 0.604148 0.479542 0.349095 -0.259240
runs 0.036425 -0.413253 0.339765 0.558594 -0.310345 1.000000 0.478526 0.802633 0.646704 0.057869 ... 0.628431 0.051717 -0.001748 -0.098627 0.495870 0.400948 0.420943 0.274112 -0.039085 0.140185
at_basts 0.184300 0.011006 0.967173 0.391991 0.317679 0.478526 1.000000 0.790044 0.516430 0.132280 ... 0.327488 0.379069 0.069276 0.024065 0.283317 0.395707 0.398010 -0.025720 0.189474 0.111354
hits 0.084122 -0.200168 0.652376 0.478310 -0.000114 0.802633 0.790044 1.000000 0.687157 0.173270 ... 0.394974 0.067709 0.039055 -0.008744 0.486705 0.453561 0.467737 0.184794 0.067685 0.131473
doubles 0.373114 -0.205209 0.401734 0.326008 -0.031263 0.646704 0.516430 0.687157 1.000000 0.048628 ... 0.307399 0.263448 -0.164075 -0.274183 0.359872 0.362446 0.393986 0.222935 -0.139458 0.256301
triples -0.151842 0.011821 0.122280 0.028302 0.060619 0.057869 0.132280 0.173270 0.048628 1.000000 ... -0.064815 -0.065748 0.270462 0.253111 0.159237 0.049912 0.036896 -0.015720 0.144544 -0.101339
home_runs 0.264342 -0.320328 0.204492 0.366849 -0.216931 0.722088 0.294434 0.437426 0.432573 -0.216900 ... 0.389083 0.341558 -0.250488 -0.320080 0.100513 0.336602 0.367850 0.281919 -0.163907 0.217315
walks -0.140116 -0.307768 0.306388 0.462781 -0.239241 0.628431 0.327488 0.394974 0.307399 -0.064815 ... 1.000000 0.071423 0.056529 0.018022 0.308625 0.166550 0.165855 0.024025 0.047547 0.056568
strikeouts 0.693800 0.004263 0.413115 0.063844 0.241311 0.051717 0.379069 0.067709 0.263448 -0.065748 ... 0.071423 1.000000 -0.149404 -0.293865 -0.358880 0.152151 0.180391 0.000559 -0.146959 0.261226
stolen_bases -0.311655 -0.038045 0.107512 0.129699 -0.052518 -0.001748 0.069276 0.039055 -0.164075 0.270462 ... 0.056529 -0.149404 1.000000 0.670498 0.148741 -0.098215 -0.123660 -0.182143 0.250288 -0.208147
caught_stealing -0.544616 0.142643 0.085399 -0.022943 0.084934 -0.098627 0.024065 -0.008744 -0.274183 0.253111 ... 0.018022 -0.293865 0.670498 1.000000 0.085237 -0.018639 -0.060760 -0.102907 0.397679 -0.350691
sacrifice_flies -0.278731 -0.247109 0.070013 0.293405 -0.291676 0.495870 0.283317 0.486705 0.359872 0.159237 ... 0.308625 -0.358880 0.148741 0.085237 1.000000 0.077198 0.084274 0.076054 -0.027902 0.020121
runs_against 0.034589 0.444685 0.317847 -0.395921 0.630772 0.400948 0.395707 0.453561 0.362446 0.049912 ... 0.166550 0.152151 -0.098215 -0.018639 0.077198 1.000000 0.987529 0.888444 0.272166 -0.205167
earned_runs 0.103848 0.407737 0.315043 -0.371180 0.604148 0.420943 0.398010 0.467737 0.393986 0.036896 ... 0.165855 0.180391 -0.123660 -0.060760 0.084274 0.987529 1.000000 0.901795 0.154658 -0.087989
earned_run_average 0.031735 0.414782 -0.118703 -0.564773 0.479542 0.274112 -0.025720 0.184794 0.222935 -0.015720 ... 0.024025 0.000559 -0.182143 -0.102907 0.076054 0.888444 0.901795 1.000000 0.064617 -0.136271
errors -0.508298 0.337042 0.221394 -0.188008 0.349095 -0.039085 0.189474 0.067685 -0.139458 0.144544 ... 0.047547 -0.146959 0.250288 0.397679 -0.027902 0.272166 0.154658 0.064617 1.000000 -0.940386
fielding_percentage 0.541028 -0.337473 0.088089 0.325085 -0.259240 0.140185 0.111354 0.131473 0.256301 -0.101339 ... 0.056568 0.261226 -0.208147 -0.350691 0.020121 -0.205167 -0.087989 -0.136271 -0.940386 1.000000

21 rows × 21 columns

  • Including year in a correlation between metrics does not make sense.
In [39]:
df2 = df.drop(['year'], axis=1)
df2.corr()
Out[39]:
rank games wins loses runs at_basts hits doubles triples home_runs walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
rank 1.000000 0.048105 -0.773918 0.811173 -0.413253 0.011006 -0.200168 -0.205209 0.011821 -0.320328 -0.307768 0.004263 -0.038045 0.142643 -0.247109 0.444685 0.407737 0.414782 0.337042 -0.337473
games 0.048105 1.000000 0.370266 0.363513 0.339765 0.967173 0.652376 0.401734 0.122280 0.204492 0.306388 0.413115 0.107512 0.085399 0.070013 0.317847 0.315043 -0.118703 0.221394 0.088089
wins -0.773918 0.370266 1.000000 -0.730609 0.558594 0.391991 0.478310 0.326008 0.028302 0.366849 0.462781 0.063844 0.129699 -0.022943 0.293405 -0.395921 -0.371180 -0.564773 -0.188008 0.325085
loses 0.811173 0.363513 -0.730609 1.000000 -0.310345 0.317679 -0.000114 -0.031263 0.060619 -0.216931 -0.239241 0.241311 -0.052518 0.084934 -0.291676 0.630772 0.604148 0.479542 0.349095 -0.259240
runs -0.413253 0.339765 0.558594 -0.310345 1.000000 0.478526 0.802633 0.646704 0.057869 0.722088 0.628431 0.051717 -0.001748 -0.098627 0.495870 0.400948 0.420943 0.274112 -0.039085 0.140185
at_basts 0.011006 0.967173 0.391991 0.317679 0.478526 1.000000 0.790044 0.516430 0.132280 0.294434 0.327488 0.379069 0.069276 0.024065 0.283317 0.395707 0.398010 -0.025720 0.189474 0.111354
hits -0.200168 0.652376 0.478310 -0.000114 0.802633 0.790044 1.000000 0.687157 0.173270 0.437426 0.394974 0.067709 0.039055 -0.008744 0.486705 0.453561 0.467737 0.184794 0.067685 0.131473
doubles -0.205209 0.401734 0.326008 -0.031263 0.646704 0.516430 0.687157 1.000000 0.048628 0.432573 0.307399 0.263448 -0.164075 -0.274183 0.359872 0.362446 0.393986 0.222935 -0.139458 0.256301
triples 0.011821 0.122280 0.028302 0.060619 0.057869 0.132280 0.173270 0.048628 1.000000 -0.216900 -0.064815 -0.065748 0.270462 0.253111 0.159237 0.049912 0.036896 -0.015720 0.144544 -0.101339
home_runs -0.320328 0.204492 0.366849 -0.216931 0.722088 0.294434 0.437426 0.432573 -0.216900 1.000000 0.389083 0.341558 -0.250488 -0.320080 0.100513 0.336602 0.367850 0.281919 -0.163907 0.217315
walks -0.307768 0.306388 0.462781 -0.239241 0.628431 0.327488 0.394974 0.307399 -0.064815 0.389083 1.000000 0.071423 0.056529 0.018022 0.308625 0.166550 0.165855 0.024025 0.047547 0.056568
strikeouts 0.004263 0.413115 0.063844 0.241311 0.051717 0.379069 0.067709 0.263448 -0.065748 0.341558 0.071423 1.000000 -0.149404 -0.293865 -0.358880 0.152151 0.180391 0.000559 -0.146959 0.261226
stolen_bases -0.038045 0.107512 0.129699 -0.052518 -0.001748 0.069276 0.039055 -0.164075 0.270462 -0.250488 0.056529 -0.149404 1.000000 0.670498 0.148741 -0.098215 -0.123660 -0.182143 0.250288 -0.208147
caught_stealing 0.142643 0.085399 -0.022943 0.084934 -0.098627 0.024065 -0.008744 -0.274183 0.253111 -0.320080 0.018022 -0.293865 0.670498 1.000000 0.085237 -0.018639 -0.060760 -0.102907 0.397679 -0.350691
sacrifice_flies -0.247109 0.070013 0.293405 -0.291676 0.495870 0.283317 0.486705 0.359872 0.159237 0.100513 0.308625 -0.358880 0.148741 0.085237 1.000000 0.077198 0.084274 0.076054 -0.027902 0.020121
runs_against 0.444685 0.317847 -0.395921 0.630772 0.400948 0.395707 0.453561 0.362446 0.049912 0.336602 0.166550 0.152151 -0.098215 -0.018639 0.077198 1.000000 0.987529 0.888444 0.272166 -0.205167
earned_runs 0.407737 0.315043 -0.371180 0.604148 0.420943 0.398010 0.467737 0.393986 0.036896 0.367850 0.165855 0.180391 -0.123660 -0.060760 0.084274 0.987529 1.000000 0.901795 0.154658 -0.087989
earned_run_average 0.414782 -0.118703 -0.564773 0.479542 0.274112 -0.025720 0.184794 0.222935 -0.015720 0.281919 0.024025 0.000559 -0.182143 -0.102907 0.076054 0.888444 0.901795 1.000000 0.064617 -0.136271
errors 0.337042 0.221394 -0.188008 0.349095 -0.039085 0.189474 0.067685 -0.139458 0.144544 -0.163907 0.047547 -0.146959 0.250288 0.397679 -0.027902 0.272166 0.154658 0.064617 1.000000 -0.940386
fielding_percentage -0.337473 0.088089 0.325085 -0.259240 0.140185 0.111354 0.131473 0.256301 -0.101339 0.217315 0.056568 0.261226 -0.208147 -0.350691 0.020121 -0.205167 -0.087989 -0.136271 -0.940386 1.000000

How Much Data do We Have?

In [42]:
print("(rows, columns) for teams is = ", df.shape)
(rows, columns) for teams is =  (918, 23)
In [44]:
batting_df = pd.read_csv("../Data/playerbatting.csv")
pitching_df = pd.read_csv("../Data/playerpitching.csv")
fielding_df = pd.read_csv("../Data/playerfielding.csv")
salary_df = pd.read_csv("../Data/playersalary.csv")
print("(rows, columns) for batting is = ", batting_df.shape)
print("(rows, columns) for pitching is = ", pitching_df.shape)
print("(rows, columns) for fielding is = ", fielding_df.shape)
print("(rows, columns) for salary is = ", salary_df.shape)
(rows, columns) for batting is =  (37402, 13)
(rows, columns) for pitching is =  (18363, 9)
(rows, columns) for fielding is =  (37099, 7)
(rows, columns) for salary is =  (26428, 5)
In [47]:
all_dfs = [df, batting_df, pitching_df, fielding_df, salary_df]
total_cells = sum([x.shape[0]*x.shape[1] for x in all_dfs])
print("total number of cells is = ", total_cells)
total number of cells is =  1064440
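Cell counts tell us scale but not memory footprint. Pandas can report actual bytes used; a minimal sketch on a small synthetic frame (the rows are made up for illustration, not from our CSVs):

```python
import pandas as pd

# Tiny synthetic frame standing in for one of our data files.
demo_df = pd.DataFrame({
    "playerid": ["aardsda01", "aasedo01", "zychto01"],
    "yearid": [2004, 1985, 2016],
    "hits": [0.0, 0.0, 0.0],
})

# deep=True counts the actual bytes of Python-string columns,
# not just the 8-byte object pointers.
total_bytes = demo_df.memory_usage(deep=True).sum()
print("approximate bytes used =", total_bytes)
```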

Let's Take a Step Back

We have at least 3 tasks:

  1. Figure out the relationships between team performance metrics (e.g., HRs, ERA) and a team's wins. What metrics do we want to optimize?
  2. Given a set of players, estimate the team's performance with those players.
  3. Find the cost-optimal team(s).

This will involve:

  1. Data Munging.
  2. Cleaning and shaping.
  3. Data Exploration/Visualization.
  4. Learning.
  5. Optimization.

But first, let's learn a little about Pandas and PyPlot, which we need for all five steps.

Pandas

Overview

  • This section will go through some of the material in the context of our data and business problem -- Moneyball.
  • (https://en.wikipedia.org/wiki/Pandas_(software))

    • "pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license.[2] The name is derived from the term "panel data", an econometrics term for data sets that include both time-series and cross-sectional data"

    • Library features:

      • DataFrame object for data manipulation with integrated indexing.
      • Tools for reading and writing data between in-memory data structures and different file formats.
      • Data alignment and integrated handling of missing data.
      • Reshaping and pivoting of data sets.
      • Label-based slicing, fancy indexing, and subsetting of large data sets.
      • Data structure column insertion and deletion.
      • Group by engine allowing split-apply-combine operations on data sets.
      • Data set merging and joining.
      • Hierarchical axis indexing to work with high-dimensional data in a lower-dimensional data structure.
      • Time series-functionality: Date range generation[4] and frequency conversion, moving window statistics, moving window linear regressions, date shifting and lagging.
      • The library is highly optimized for performance, ..."
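Several of the features above show up even in a two-line example; a minimal sketch with made-up win totals:

```python
import pandas as pd

# Integrated indexing: the 'team' column becomes the row labels.
wins_df = pd.DataFrame(
    {"team": ["BOS", "NYA", "SEA"], "wins": [93, 84, 86]}
).set_index("team")

# Label-based slicing and built-in aggregation come for free.
print(wins_df.loc["BOS", "wins"])   # label-based lookup
print(wins_df["wins"].mean())
```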

Core Data Structures

Data Structure | Dimensions | Description
Series | 1 | 1D labeled, homogeneous array; size-immutable.
DataFrame | 2 | General 2D labeled, size-mutable tabular structure with potentially heterogeneously typed columns.
Panel | 3 | General 3D labeled, size-mutable array.

Pandas Data Structures
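The first two structures can be sketched as follows (the values are made up):

```python
import pandas as pd

# Series: a 1D labeled array.
s = pd.Series([66, 83, 81], index=["ATL", "BAL", "BOS"], name="wins")

# DataFrame: 2D; columns may have different dtypes.
frame = pd.DataFrame({
    "wins": s,
    "league": pd.Series(["NL", "AL", "AL"], index=s.index),
})

print(s["BAL"])        # label lookup on a Series
print(frame.dtypes)    # heterogeneous column types
```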
  • We will focus on DataFrames.
DataFrame
  • Our data is tabular and naturally maps to DataFrames. For example,
In [48]:
batting_df.tail(30)
Out[48]:
playerid yearid games at_bats runs hits doubles triples home_runs rbi stolen_bases caught_steaming walks
37372 zoskyed01 1991 18.0 27.0 2.0 4.0 1.0 1.0 0.0 2.0 0.0 0.0 0.0
37373 zoskyed01 1992 8.0 7.0 1.0 2.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0
37374 zoskyed01 1995 6.0 5.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37375 zoskyed01 1999 8.0 7.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0
37376 zoskyed01 2000 4.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37377 zuberjo01 1996 30.0 91.0 7.0 23.0 4.0 0.0 1.0 10.0 1.0 0.0 6.0
37378 zuberjo01 1998 38.0 45.0 6.0 11.0 3.0 1.0 2.0 6.0 0.0 0.0 6.0
37379 zuletju01 2000 30.0 68.0 13.0 20.0 8.0 0.0 3.0 12.0 0.0 1.0 2.0
37380 zuletju01 2001 49.0 106.0 11.0 23.0 3.0 0.0 6.0 24.0 0.0 1.0 8.0
37381 zumayjo01 2006 62.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37382 zumayjo01 2007 28.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37383 zumayjo01 2008 21.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37384 zumayjo01 2009 29.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37385 zumayjo01 2010 31.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37386 zuninmi01 2013 52.0 173.0 22.0 37.0 5.0 0.0 5.0 14.0 1.0 0.0 16.0
37387 zuninmi01 2014 131.0 438.0 51.0 87.0 20.0 2.0 22.0 60.0 0.0 3.0 17.0
37388 zuninmi01 2015 112.0 350.0 28.0 61.0 11.0 0.0 11.0 28.0 0.0 1.0 21.0
37389 zuninmi01 2016 55.0 164.0 16.0 34.0 7.0 0.0 12.0 31.0 0.0 0.0 21.0
37390 zupcibo01 1991 18.0 25.0 3.0 4.0 0.0 0.0 1.0 3.0 0.0 0.0 1.0
37391 zupcibo01 1992 124.0 392.0 46.0 108.0 19.0 1.0 3.0 43.0 2.0 2.0 25.0
37392 zupcibo01 1993 141.0 286.0 40.0 69.0 24.0 2.0 2.0 26.0 5.0 2.0 27.0
37393 zupcibo01 1994 36.0 92.0 10.0 18.0 4.0 1.0 1.0 8.0 0.0 1.0 4.0
37394 zuvelpa01 1985 81.0 190.0 16.0 48.0 8.0 1.0 0.0 4.0 2.0 0.0 16.0
37395 zuvelpa01 1986 21.0 48.0 2.0 4.0 1.0 0.0 0.0 2.0 0.0 0.0 5.0
37396 zuvelpa01 1987 14.0 34.0 2.0 6.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37397 zuvelpa01 1988 51.0 130.0 9.0 30.0 5.0 1.0 0.0 7.0 0.0 0.0 8.0
37398 zuvelpa01 1989 24.0 58.0 10.0 16.0 2.0 0.0 2.0 6.0 0.0 0.0 1.0
37399 zuvelpa01 1991 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37400 zychto01 2015 13.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
37401 zychto01 2016 12.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [49]:
df.tail(20)
Out[49]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
898 2016 HOU 3 162 84 78 N 724 5545 1367 ... 554 1452 102 44 31.0 701 663 4.06 77 0.987
899 2016 KCA 3 162 81 81 N 675 5552 1450 ... 382 1224 121 35 34.0 712 674 4.21 94 0.984
900 2016 LAA 4 162 74 88 N 717 5431 1410 ... 471 991 73 34 49.0 727 676 4.28 97 0.983
901 2016 LAN 1 162 91 71 N 725 5518 1376 ... 525 1321 45 26 32.0 638 598 3.70 80 0.986
902 2016 MIA 3 161 79 82 N 655 5547 1460 ... 447 1213 71 28 38.0 682 646 4.05 86 0.985
903 2016 MIL 4 162 73 89 N 671 5330 1299 ... 599 1543 181 56 39.0 733 650 4.08 136 0.978
904 2016 MIN 5 162 59 103 N 722 5618 1409 ... 513 1426 91 32 43.0 889 814 5.08 126 0.979
905 2016 NYA 4 162 84 78 N 680 5458 1378 ... 475 1188 72 22 49.0 702 660 4.16 86 0.985
906 2016 NYN 2 162 87 75 N 671 5459 1342 ... 517 1302 42 18 41.0 617 574 3.57 90 0.985
907 2016 OAK 5 162 69 93 N 653 5500 1352 ... 442 1145 50 23 34.0 761 718 4.51 97 0.984
908 2016 PHI 4 162 71 91 N 610 5434 1305 ... 424 1376 96 45 30.0 796 739 4.63 97 0.984
909 2016 PIT 3 162 78 83 N 729 5542 1426 ... 561 1334 110 45 36.0 758 679 4.21 111 0.982
910 2016 SDN 5 162 68 94 N 686 5419 1275 ... 449 1500 125 45 36.0 770 708 4.43 109 0.982
911 2016 SEA 2 162 86 76 N 768 5583 1446 ... 506 1288 56 28 41.0 707 647 4.00 89 0.985
912 2016 SFN 2 162 87 75 N 715 5565 1437 ... 572 1107 79 36 46.0 631 593 3.65 72 0.988
913 2016 SLN 2 162 86 76 N 779 5548 1415 ... 526 1318 35 26 41.0 712 656 4.08 107 0.983
914 2016 TBA 5 162 68 94 N 672 5481 1333 ... 449 1482 60 37 28.0 713 665 4.20 94 0.984
915 2016 TEX 1 162 95 67 N 765 5525 1446 ... 436 1220 99 36 40.0 757 700 4.37 97 0.984
916 2016 TOR 2 162 89 73 N 759 5479 1358 ... 632 1362 54 24 40.0 666 613 3.78 88 0.986
917 2016 WAS 1 162 95 67 N 763 5490 1403 ... 536 1252 121 39 63.0 612 570 3.51 73 0.988

20 rows × 23 columns

Basic Operations

Sort and Filter

Sort
In [52]:
teams_sorted_by_wins = df.sort_values(by=['wins'])
teams_sorted_by_wins.head(10)
Out[52]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
508 2003 DET 5 162 43 119 N 591 5466 1312 ... 443 1099 98 63 49.0 928 847 5.30 138 0.978
239 1994 CAL 4 115 47 68 NaN 543 3943 1042 ... 402 715 65 54 NaN 660 618 5.42 76 0.983
258 1994 SDN 4 117 47 70 NaN 479 4068 1117 ... 319 762 79 37 NaN 531 474 4.08 111 0.975
259 1994 SEA 3 112 49 63 NaN 569 3883 1045 ... 372 652 48 21 NaN 616 546 4.99 95 0.977
241 1994 CHN 5 113 49 64 NaN 500 3918 1015 ... 364 750 69 53 NaN 549 508 4.47 81 0.982
808 2013 HOU 5 162 51 111 N 610 5457 1307 ... 426 1535 110 61 38.0 848 766 4.79 125 0.979
246 1994 FLO 5 115 51 64 NaN 468 3926 1043 ... 349 746 65 26 NaN 576 508 4.50 95 0.978
255 1994 OAK 2 114 51 63 NaN 549 3885 1009 ... 417 686 91 39 NaN 589 537 4.82 88 0.979
529 2004 ARI 5 162 51 111 N 615 5544 1401 ... 441 1022 53 32 37.0 899 794 4.98 139 0.977
262 1994 TEX 1 114 52 62 NaN 613 3983 1114 ... 437 730 82 35 NaN 697 619 5.45 105 0.976

10 rows × 23 columns

In [53]:
teams_sorted_by_wins = df.sort_values(by=['wins', 'teamid'], ascending=False)
teams_sorted_by_wins.head(10)
Out[53]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
462 2001 SEA 1 162 116 46 N 927 5680 1637 ... 614 989 174 42 70.0 627 576 3.54 83 0.986
366 1998 NYA 1 162 114 48 Y 965 5643 1625 ... 653 1025 153 63 NaN 656 618 3.82 98 0.984
42 1986 NYN 1 162 108 54 Y 783 5558 1462 ... 631 968 118 48 NaN 578 513 3.11 138 0.978
350 1998 ATL 1 162 106 56 N 826 5484 1489 ... 548 1062 98 43 NaN 581 519 3.25 91 0.985
554 2004 SLN 1 162 105 57 N 855 5555 1544 ... 548 1085 111 47 70.0 659 605 3.75 97 0.985
95 1988 OAK 1 162 104 58 N 800 5602 1474 ... 580 926 129 54 NaN 620 569 3.44 105 0.983
208 1993 ATL 1 162 104 58 N 767 5515 1444 ... 560 946 125 48 NaN 559 508 3.14 108 0.983
232 1993 SFN 2 162 103 59 N 808 5557 1534 ... 516 930 120 65 NaN 636 584 3.61 101 0.984
147 1990 OAK 1 162 103 59 N 733 5433 1379 ... 651 992 141 54 NaN 570 514 3.18 87 0.986
488 2002 OAK 1 162 103 59 N 800 5558 1450 ... 609 1008 46 20 36.0 654 593 3.68 102 0.984

10 rows × 23 columns
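`ascending` can also be a list with one flag per sort key, so different columns can sort in different directions; a minimal sketch with made-up rows:

```python
import pandas as pd

teams = pd.DataFrame({
    "teamid": ["NYA", "BOS", "SEA", "BOS"],
    "wins":   [103,   98,    116,   95],
})

# Most wins first; any ties on wins broken alphabetically by teamid.
ranked = teams.sort_values(by=["wins", "teamid"], ascending=[False, True])
print(ranked)
```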

Filter
  • teams_sorted_by_wins is already sorted. I want ONLY teams for years greater than or equal to 2000 where the team is either BOS or NYA.
In [76]:
f1 =(df['year'] >= 2000) 
f2 = (df['teamid'].isin(['BOS','NYA']))
interesting_teams = teams_sorted_by_wins[f1 & f2]
interesting_teams.head(20)
/Users/donaldferguson/anaconda3/lib/python3.6/site-packages/ipykernel_launcher.py:3: UserWarning: Boolean Series key will be reindexed to match DataFrame index.
  This is separate from the ipykernel package so we can avoid doing imports until
Out[76]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
486 2002 NYA 1 161 103 58 N 897 5601 1540 ... 640 1171 100 38 41.0 697 625 3.87 127 0.979
695 2009 NYA 1 162 103 59 Y 915 5660 1604 ... 663 1014 111 28 39.0 753 687 4.26 86 0.985
516 2003 NYA 1 163 101 61 N 877 5605 1518 ... 684 1042 98 33 35.0 716 653 4.02 114 0.981
546 2004 NYA 1 162 101 61 N 897 5527 1483 ... 670 982 84 33 50.0 808 752 4.69 99 0.984
532 2004 BOS 2 162 98 64 Y 949 5720 1613 ... 659 1189 68 30 55.0 768 674 4.18 118 0.981
605 2006 NYA 1 162 97 65 N 930 5651 1608 ... 649 1053 139 35 49.0 767 708 4.41 104 0.983
755 2011 NYA 1 162 97 65 N 867 5518 1452 ... 627 1138 147 46 51.0 657 605 3.73 102 0.983
801 2013 BOS 1 162 97 65 Y 853 5651 1566 ... 581 1308 123 19 50.0 656 613 3.79 80 0.987
621 2007 BOS 1 162 96 66 Y 867 5589 1561 ... 689 1042 96 24 54.0 657 618 3.87 81 0.986
456 2001 NYA 1 161 95 65 N 804 5577 1488 ... 519 1035 161 53 43.0 713 649 4.02 109 0.982
575 2005 NYA 1 162 95 67 N 886 5624 1552 ... 637 989 84 27 43.0 789 718 4.52 95 0.984
725 2010 NYA 2 162 95 67 N 859 5567 1485 ... 662 1136 103 30 44.0 693 651 4.06 69 0.988
785 2012 NYA 1 162 95 67 N 804 5524 1462 ... 565 1176 93 27 49.0 668 617 3.84 75 0.987
502 2003 BOS 2 162 95 67 N 961 5769 1667 ... 620 943 88 35 64.0 809 729 4.48 113 0.982
561 2005 BOS 2 162 95 67 N 910 5626 1579 ... 653 1044 45 12 63.0 805 752 4.74 109 0.982
651 2008 BOS 2 162 95 67 N 845 5596 1565 ... 646 1068 120 35 62.0 694 645 4.01 85 0.986
681 2009 BOS 2 162 95 67 N 872 5543 1495 ... 659 1120 126 39 51.0 736 695 4.35 82 0.986
635 2007 NYA 2 162 94 68 N 968 5717 1656 ... 637 991 123 40 54.0 777 724 4.49 88 0.985
472 2002 BOS 2 162 93 69 N 859 5640 1560 ... 545 944 80 28 53.0 665 603 3.75 104 0.983
891 2016 BOS 1 162 93 69 N 878 5670 1598 ... 558 1160 83 24 40.0 694 640 4.00 75 0.987

20 rows × 23 columns
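The UserWarning above appears because f1 and f2 were computed from df but applied to teams_sorted_by_wins, whose rows are in a different order; pandas realigns the boolean masks by index label, which happens to work here but is fragile. A cleaner pattern (sketched with a tiny stand-in frame, not our real data) is to build the masks from the frame being filtered:

```python
import pandas as pd

# Tiny stand-in for the real teams frame.
df = pd.DataFrame({
    "year":   [1998, 2004, 2009, 2016],
    "teamid": ["NYA", "BOS", "NYA", "SEA"],
    "wins":   [114,   98,    103,   86],
})
teams_sorted_by_wins = df.sort_values(by=["wins"], ascending=False)

# Masks derived from the SAME frame we index into:
# no warning, no reliance on index alignment.
f1 = teams_sorted_by_wins["year"] >= 2000
f2 = teams_sorted_by_wins["teamid"].isin(["BOS", "NYA"])
interesting_teams = teams_sorted_by_wins[f1 & f2]
print(interesting_teams)
```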

  • Teams whose wins exceeded their losses by more than 20 but that did not finish 1st.
In [80]:
unhappy_teams = df[(df['wins'] > (df['loses'] + 20)) & (df['rank'] > 1)]
unhappy_teams.sort_values(by=['wins'], ascending=False).head(10)
Out[80]:
year teamid rank games wins loses WSWin runs at_basts hits ... walks strikeouts stolen_bases caught_stealing sacrifice_flies runs_against earned_runs earned_run_average errors fielding_percentage
232 1993 SFN 2 162 103 59 N 808 5557 1534 ... 516 930 120 65 NaN 636 584 3.61 101 0.984
458 2001 OAK 2 162 102 60 N 884 5573 1469 ... 640 1021 68 29 59.0 645 583 3.59 125 0.980
468 2002 ANA 2 162 99 63 Y 851 5678 1603 ... 462 805 117 51 64.0 644 595 3.69 87 0.986
879 2015 PIT 2 162 98 64 N 697 5631 1462 ... 461 1322 98 45 41.0 596 532 3.21 122 0.981
532 2004 BOS 2 162 98 64 Y 949 5720 1613 ... 659 1189 68 30 55.0 768 674 4.18 118 0.981
16 1985 NYN 2 162 98 64 N 695 5549 1425 ... 546 872 117 53 NaN 568 514 3.11 115 0.982
863 2015 CHN 3 162 97 65 N 689 5491 1341 ... 567 1518 95 37 35.0 608 546 3.36 111 0.982
397 1999 NYN 2 163 97 66 N 853 5572 1553 ... 717 994 150 61 NaN 711 691 4.27 68 0.989
15 1985 NYA 2 161 97 64 N 839 5458 1458 ... 620 771 155 53 NaN 660 590 3.69 123 0.979
77 1987 TOR 2 162 96 66 N 845 5635 1514 ... 555 970 126 50 NaN 655 604 3.74 111 0.982

10 rows × 23 columns

Subset Columns (Projection)

In [84]:
just_team_year_and_wins = df[['teamid', 'year', 'wins']]
just_team_year_and_wins.head(10)
Out[84]:
teamid year wins
0 ATL 1985 66
1 BAL 1985 83
2 BOS 1985 81
3 CAL 1985 90
4 CHA 1985 85
5 CHN 1985 77
6 CIN 1985 89
7 CLE 1985 60
8 DET 1985 84
9 HOU 1985 83
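Note the double brackets above: passing a list of column names returns a DataFrame, while a single name returns a 1D Series. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"teamid": ["ATL", "BAL"], "year": [1985, 1985], "wins": [66, 83]})

wins_series = df["wins"]               # single name -> Series (1D)
wins_frame  = df[["teamid", "wins"]]   # list of names -> DataFrame (2D projection)

print(type(wins_series).__name__)  # Series
print(type(wins_frame).__name__)   # DataFrame
```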

Combine Tables on Matching Values (JOIN)

JOIN
  • First, I made a mistake on column names. Salary uses mixed-case names like 'yearID' and 'playerID', while batting uses all-lowercase 'yearid' and 'playerid'.
In [88]:
batting_df.head(10)
Out[88]:
playerid yearid games at_bats runs hits doubles triples home_runs rbi stolen_bases caught_steaming walks
0 aardsda01 2004 11.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 aardsda01 2006 45.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 aardsda01 2007 25.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 aardsda01 2008 47.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 aardsda01 2009 73.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 aardsda01 2010 53.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
6 aardsda01 2012 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
7 aardsda01 2013 43.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
8 aardsda01 2015 33.0 1.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
9 aasedo01 1985 54.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
In [89]:
salary_df.head(10)
Out[89]:
yearID teamID lgID playerID salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
6 1985 ATL NL dedmoje01 150000
7 1985 ATL NL forstte01 483333
8 1985 ATL NL garbege01 772000
9 1985 ATL NL harpete01 250000
  • The column names are different, so I need to make them consistent before joining.
In [98]:
salary_df2 = pd.read_csv('../Data/playersalary.csv')
In [100]:
salary_df2.columns = salary_df2.columns.str.lower()
salary_df2.head(10)
Out[100]:
yearid teamid lgid playerid salary
0 1985 ATL NL barkele01 870000
1 1985 ATL NL bedrost01 550000
2 1985 ATL NL benedbr01 545000
3 1985 ATL NL campri01 633333
4 1985 ATL NL ceronri01 625000
5 1985 ATL NL chambch01 800000
6 1985 ATL NL dedmoje01 150000
7 1985 ATL NL forstte01 483333
8 1985 ATL NL garbege01 772000
9 1985 ATL NL harpete01 250000
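Lower-casing every column works here because the names differ only in case; `DataFrame.rename` is the more targeted tool when they do not. A sketch with one made-up salary row:

```python
import pandas as pd

salary_demo = pd.DataFrame(
    {"yearID": [1985], "playerID": ["barkele01"], "salary": [870000]}
)

# Map only the columns that need new names; the rest pass through.
salary_demo = salary_demo.rename(columns={"yearID": "yearid", "playerID": "playerid"})
print(list(salary_demo.columns))
```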
In [104]:
batting_df2 = batting_df.set_index(['playerid', 'yearid'])
salary_df2 = salary_df2.set_index(['playerid', 'yearid'])
In [105]:
batting_and_salary = batting_df2.join(salary_df2)
In [112]:
batting_and_salary[['home_runs','salary']].head(10)
Out[112]:
home_runs salary
playerid yearid
aardsda01 2004 0.0 300000.0
2006 0.0 NaN
2007 0.0 387500.0
2008 0.0 403250.0
2009 0.0 419000.0
2010 0.0 2750000.0
2012 0.0 500000.0
2013 0.0 NaN
2015 0.0 NaN
aasedo01 1985 0.0 NaN
What are these JOIN and Index of which you Speak?

Index

"A (...) index is a data structure that improves the speed of data retrieval operations on a (...) table at the cost of additional (...) storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a (...) table every time a (...) table is accessed. Indexes can be created using one or more columns of a (...) table, providing the basis for both rapid random lookups and efficient access of ordered records." (https://en.wikipedia.org/wiki/Database_index)

Binary Tree Index
  • There are many types of indexes.
  • Simplistically, if I want to find all rows with a given (playerid, yearid) in a 1,000,000-row table:
    • Scanning each row is O(1,000,000) comparisons.
    • Using a binary search index takes about 20 comparisons, since $\log_2(1{,}000{,}000) \approx 20$: logarithmic rather than linear time.
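We can count the probes directly with a hand-rolled binary search over a sorted list of 1,000,000 keys; the count comes out near $\log_2(1{,}000{,}000) \approx 20$:

```python
def binary_search(sorted_keys, target):
    """Return (found, probes) for target in a sorted list."""
    lo, hi, probes = 0, len(sorted_keys) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1                      # one element examined per loop
        if sorted_keys[mid] == target:
            return True, probes
        if sorted_keys[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return False, probes

keys = list(range(1_000_000))            # stand-in for an indexed column
found, n = binary_search(keys, 765_432)
print(found, n)                          # n is at most 20 for a million keys
```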

JOIN

  • "An SQL join clause combines columns from one or more tables in a relational database. It creates a set that can be saved as a table or used as it is. A JOIN is a means for combining columns from one (self-join) or more tables by using values common to each." (https://en.wikipedia.org/wiki/Join_(SQL))
  • There are many, many kinds of JOIN. Oversimplifying:
    • Build a new table (DataFrame)
    • Each row contains all columns from DataFrame A and all columns from DataFrame B.
    • Only include rows where A.x = B.y
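The bullets above correspond to `pandas.merge`; a minimal sketch with made-up rows, contrasting an inner join (only matching `playerid` values survive) with the left join that `DataFrame.join` performed above (every left row survives, with NaN where there is no match):

```python
import pandas as pd

batting = pd.DataFrame({"playerid": ["a01", "b01"], "home_runs": [22, 0]})
salary  = pd.DataFrame({"playerid": ["a01", "c01"], "salary": [419000, 250000]})

# Inner join: keep rows where batting.playerid == salary.playerid.
inner = pd.merge(batting, salary, on="playerid", how="inner")

# Left join: keep every batting row, filling salary with NaN on no match.
left = pd.merge(batting, salary, on="playerid", how="left")

print(inner)
print(left)
```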
Natural JOIN Example
  • We are going to need JOIN because data from several tables characterizes a player's performance
    • Batting (offense)
    • Fielding (defense)
    • Pitching
    • Salary

Coming up Next Lecture

  • A little more Pandas to finalize what we need to analyze performance data.
  • A Multi-Layer Perceptron neural network for learning how a team's statistics determine wins.
  • This is based on a formal, scientific, engineering solution to a problem baseball historically solved with intuition.

Formal Definition:

  • A vector of metrics $\vec{x}$ summarizes a team's performance capabilities, e.g. batting average, HRs, etc.
  • A similar vector $\vec{p}$ summarizes the performance contribution of a player to the team $T.$
  • Let $i$ be the subscript notation for the i-th metric. If $f_i$ is some aggregation function, e.g. sum or average, then

\begin{equation} x_i = f_i(\{\, p_i \mid p \in T \,\}). \end{equation}

  • Let $S(p)$ denote player $p$'s salary. This yields two constraints:
    • C1: $ \#(T) \leq 25$ (approximating the maximum number of allowed players per team)
    • C2: $\sum_{p \in T}{S(p)} \leq B(T),$ team $T$ budget.
  • If
    • $W(\vec{x})$ is the number of wins per season as a function of $\vec{x} \in X.$
    • $P(\vec{x}) \rightarrow [0,1]$ is the probability of winning the World Series for $\vec{x} \in X.$
  • We want to pick the team of players $T$ such that we
    • $Max(W(\vec{x}))$ or $Max(P(\vec{x}))$
    • With minimal $\sum{S(p)}$
    • Subject to C1 and C2.
  • This boils down to two tasks:
    1. Learning the relationship between $\vec{x}$ and wins (or winning the World Series).
    2. Finding the price-optimal set of players. We are trying to build an artificial intelligence engine that replaces general managers for picking players.
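Task 2 is a constrained optimization. As a toy illustration only (the names, salaries, and the "value" metric are all invented, and this greedy heuristic is not the course's actual optimizer), a value-per-dollar sketch respecting C1 and C2 might look like:

```python
# Hypothetical (player, salary, value) triples; 'value' stands in for a
# learned contribution to W(x).
players = [
    ("p1", 5_000_000, 9.0),
    ("p2", 2_000_000, 6.0),
    ("p3", 8_000_000, 8.5),
    ("p4", 1_000_000, 4.0),
]

MAX_ROSTER = 25          # constraint C1: #(T) <= 25
BUDGET = 8_000_000       # constraint C2: total salary <= B(T)

# Greedy: take the best value-per-salary-dollar players that still fit.
team, spent = [], 0
for name, salary, value in sorted(players, key=lambda p: p[2] / p[1], reverse=True):
    if len(team) < MAX_ROSTER and spent + salary <= BUDGET:
        team.append(name)
        spent += salary

print(team, spent)
```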

Next Task

We will define and train a Multi-Layer Perceptron!

Neurons and Perceptrons