KAGGLE NFL COMPETITION: PREDICT PLAYER CONTACT WITH LGBM

SALOME SONYA LOMSADZE
Mar 2, 2023 · 8 min read

The goal of the Competition

The goal of this competition is to detect external contact experienced by players during an NFL football game. I will use player tracking data to identify moments with contact and help improve player safety.

  • This is an imbalanced dataset: only about 8% of the examples belong to class 1 (a quick check is sketched after this list).
  • To keep the calculations easy to follow, I keep the acceleration, speed, x, y, and distance features of the two players side by side, with the suffixes "_x" and "_y" (matching the format of sample_submission.csv).
  • I performed hyperparameter tuning and feature selection, and to get the best result I generated many features that signal whether two players are in contact or not.
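
As a quick check of that imbalance, something like the following can be used (a minimal sketch; the file path is the usual Kaggle input path for this competition and may differ in your environment):

import pandas as pd

# Assumed Kaggle input path; adjust if your directory layout differs
train_labels = pd.read_csv('/kaggle/input/nfl-player-contact-detection/train_labels.csv', parse_dates=['datetime'])

# Share of each class in the contact target (roughly 0.92 for class 0 and 0.08 for class 1)
print(train_labels['contact'].value_counts(normalize=True))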

The train labels dataset has 4,721,618 examples and 6 features plus the target variable (contact). 3 of the columns are integers, 1 is a datetime, and 3 are objects. The features, with a short description, are the following:

  • contact_id: A combination of the game_play, player_ids and step columns.
  • game_play: the unique ID for the game and play.
  • nfl_player_id_1: The lower-numbered player id in the contact pair. For contact with the ground, this is just the player id.
  • nfl_player_id_2: The higher-numbered player id in the contact pair. For contact with the ground, this contains an uppercase "G".
  • step: A number representing each timestep for each play, starting at 0 at the moment of the play starting, and incrementing by 1 every 0.1 seconds.
  • datetime: The timestamp of the contact, at 10Hz
  • contact: Whether contact occurred

The train player tracking dataset has 1,353,053 examples and 17 features. 8 of the features are floats, 5 are integers, 1 is a datetime, and 3 are objects. The features, with short descriptions, are the following (a loading sketch follows the list):

  • game_play: Unique game key and play id combination for the play.
  • game_key: the ID code for the game.
  • play_id: the ID code for the play.
  • nfl_player_id: the player’s ID code.
  • datetime: timestamp at 10 Hz.
  • step: timestep within play relative to the play start.
  • position: the football position of the player.
  • team: team of the player, either home or away.
  • jersey_number: Player jersey number
  • x_position: player position along the long axis of the field, in yards.
  • y_position: player position along the short axis of the field, in yards.
  • speed: speed in yards/second.
  • distance: distance traveled from prior time point, in yards.
  • orientation: orientation of player (deg).
  • direction: angle of player motion (deg).
  • acceleration: magnitude of the total acceleration, in yards/second².
  • sa: signed acceleration (yards/second²) in the direction the player is moving.
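
The tracking data and the sample submission can be loaded in the same way (a minimal sketch; the paths below are the assumed Kaggle input paths for this competition):

import pandas as pd
import numpy as np

# Assumed Kaggle input paths; adjust if your directory layout differs
DATA_DIR = '/kaggle/input/nfl-player-contact-detection'
train_tracking = pd.read_csv(f'{DATA_DIR}/train_player_tracking.csv', parse_dates=['datetime'])
test_tracking = pd.read_csv(f'{DATA_DIR}/test_player_tracking.csv', parse_dates=['datetime'])
sub = pd.read_csv(f'{DATA_DIR}/sample_submission.csv')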

FUNCTION 1:

This function processes the training data for the machine-learning model used in this analysis. It takes two pandas DataFrames, train_labels and train_tracking, and combines the train labels with the train player tracking data at both the player-wise and player-ground-wise levels.

def process_train_data(train_labels, train_tracking):
    games = train_labels
    games[['gameId', 'playId']] = games['game_play'].str.split('_', n=1, expand=True)
    games = games[['gameId', 'playId', 'step', 'nfl_player_id_1', 'nfl_player_id_2', 'contact']]
    train_tracking = train_tracking.rename(columns={'game_key': 'gameId', 'play_id': 'playId', 'nfl_player_id': 'nfl_player_id_1'})
    train_tracking['gameId'] = train_tracking['gameId'].astype(int)
    games['gameId'] = games['gameId'].astype(int)
    games['playId'] = games['playId'].astype(int)
    games['nfl_player_id_1'] = games['nfl_player_id_1'].astype(int)
    train_tracking = train_tracking[train_tracking.step >= 0]
    games = games.merge(train_tracking[['gameId', 'playId', 'step', 'nfl_player_id_1', 'x_position', 'y_position', 'speed', 'direction', 'orientation', 'acceleration', 'distance', 'sa']], on=['gameId', 'playId', 'step', 'nfl_player_id_1'], how='left')
    games = games.rename(columns={'x_position': 'x_position_player_id_1', 'y_position': 'y_position_player_id_1'})
    ground_games = games[games.nfl_player_id_2 == 'G']
    games = games[games.nfl_player_id_2 != 'G']
    train_tracking = train_tracking.rename(columns={'nfl_player_id_1': 'nfl_player_id_2'})
    games['nfl_player_id_2'] = games['nfl_player_id_2'].astype(int)
    games = games.merge(train_tracking[['gameId', 'playId', 'step', 'nfl_player_id_2', 'x_position', 'y_position', 'speed', 'direction', 'orientation', 'acceleration', 'distance', 'sa']], on=['gameId', 'playId', 'step', 'nfl_player_id_2'], how='left')
    games = games.rename(columns={'x_position': 'x_position_player_id_2', 'y_position': 'y_position_player_id_2'})
    ground_games['x_position_player_id_2'] = 0
    ground_games['y_position_player_id_2'] = 0
    return games, ground_games

train_games, train_ground_games = process_train_data(train_labels, train_tracking)

Outputs:

train_games:

train_ground_games:

Note: the train_ground_games data is for predicting whether or not a player is in contact with the ground. I predict ground-wise and player-wise contacts separately because the useful features are not the same.

FUNCTION 2:

Combines the test data (from the sample submission) & test player tracking at the player-wise & player-ground-wise levels accordingly.

def process_test_data(sub, test_tracking):
    sub[['gameId', 'playId', 'step', 'nfl_player_id_1', 'nfl_player_id_2']] = sub['contact_id'].str.split('_', n=5, expand=True)
    test_tracking = test_tracking.rename(columns={'game_key': 'gameId', 'play_id': 'playId', 'nfl_player_id': 'nfl_player_id_1'})
    test_tracking['gameId'] = test_tracking['gameId'].astype(int)
    sub['gameId'] = sub['gameId'].astype(int)
    sub['playId'] = sub['playId'].astype(int)
    sub['step'] = sub['step'].astype(int)
    sub['nfl_player_id_1'] = sub['nfl_player_id_1'].astype(int)
    test_tracking = test_tracking[test_tracking.step >= 0]
    games = sub.merge(test_tracking[['gameId', 'playId', 'step', 'nfl_player_id_1', 'x_position', 'y_position', 'speed', 'direction', 'orientation', 'acceleration', 'distance', 'sa']], on=['gameId', 'playId', 'step', 'nfl_player_id_1'], how='left')
    games = games.rename(columns={'x_position': 'x_position_player_id_1', 'y_position': 'y_position_player_id_1'})
    ground_games = games[games.nfl_player_id_2 == 'G']
    games = games[games.nfl_player_id_2 != 'G']
    test_tracking = test_tracking.rename(columns={'nfl_player_id_1': 'nfl_player_id_2'})
    games['nfl_player_id_2'] = games['nfl_player_id_2'].astype(int)
    games = games.merge(test_tracking[['gameId', 'playId', 'step', 'nfl_player_id_2', 'x_position', 'y_position', 'speed', 'direction', 'orientation', 'acceleration', 'distance', 'sa']], on=['gameId', 'playId', 'step', 'nfl_player_id_2'], how='left')
    games = games.rename(columns={'x_position': 'x_position_player_id_2', 'y_position': 'y_position_player_id_2'})
    ground_games['x_position_player_id_2'] = 0
    ground_games['y_position_player_id_2'] = 0
    games = pd.concat([games, ground_games])
    return games, ground_games

test_games, test_ground_games = process_test_data(sub, test_tracking)

FEATURE GENERATION ON TRAIN_GAMES & TEST_GAMES

# Convert directions and orientations to radians
theta1 = np.radians(train_games['orientation_x'])
theta2 = np.radians(train_games['orientation_y'])
phi1 = np.radians(train_games['direction_x'])
phi2 = np.radians(train_games['direction_y'])

# Calculate the velocities in x and y directions for each player
v1_x = train_games['speed_x'] * np.cos(phi1)
v1_y = train_games['speed_x'] * np.sin(phi1)
v2_x = train_games['speed_y'] * np.cos(phi2)
v2_y = train_games['speed_y'] * np.sin(phi2)

# Calculate the accelerations in x and y directions for each player
acc1_x = train_games['acceleration_x'] * np.cos(phi1)
acc1_y = train_games['acceleration_x'] * np.sin(phi1)
acc2_x = train_games['acceleration_y'] * np.cos(phi2)
acc2_y = train_games['acceleration_y'] * np.sin(phi2)

# Calculate the relative velocity vector components
v_rel_x = v2_x - v1_x
v_rel_y = v2_y - v1_y

# Calculate the relative acceleration vector components
acc_rel_x = acc2_x - acc1_x
acc_rel_y = acc2_y - acc1_y

# Calculate the magnitude of the relative velocity vector
v_rel_mag = np.sqrt(v_rel_x**2 + v_rel_y**2)

# Calculate the magnitude of the relative acceleration vector
acc_rel_mag = np.sqrt(acc_rel_x**2 + acc_rel_y**2)

# Calculate the distance between player 1 and player 2
train_games['distances'] = np.sqrt((train_games['x_position_player_id_1'] - train_games['x_position_player_id_2'])**2 + (train_games['y_position_player_id_1'] - train_games['y_position_player_id_2'])**2)

# Calculate mean distance between the player_1 and player 2 for each play
means=train_games.groupby(['gameId','playId', 'nfl_player_id_1', 'nfl_player_id_2'])['distances'].mean().reset_index().rename(columns={'distances':'meandistanceBTWPlayers'})
train_games=train_games.merge(means, on=['gameId','playId', 'nfl_player_id_1', 'nfl_player_id_2'], how='left')
min_distance=train_games.loc[train_games.groupby(['nfl_player_id_1'])["meandistanceBTWPlayers"].idxmin()][['gameId', 'playId', 'nfl_player_id_1', 'nfl_player_id_2', 'meandistanceBTWPlayers']]
min_distance.rename(columns={'meandistanceBTWPlayers':'minDistance'}, inplace=True)
# Calculate next expected distance btw players
train_games['newdistances'] = np.sqrt((train_games['x1_new'] - train_games['x2_new'])**2 + (train_games['y1_new'] - train_games['y2_new'])**2)
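
The x1_new, x2_new, y1_new, and y2_new columns are not defined in the snippet above. One plausible way to build them (my assumption, not necessarily the exact code behind the article) is to project each player's position one time step (0.1 s) ahead using the velocity components computed earlier; this would need to run before the newdistances line:

# Hypothetical construction of the projected positions used in 'newdistances'
# (assumes the v1_x, v1_y, v2_x, v2_y velocity components computed above)
dt = 0.1  # one step corresponds to 0.1 seconds
train_games['x1_new'] = train_games['x_position_player_id_1'] + v1_x * dt
train_games['y1_new'] = train_games['y_position_player_id_1'] + v1_y * dt
train_games['x2_new'] = train_games['x_position_player_id_2'] + v2_x * dt
train_games['y2_new'] = train_games['y_position_player_id_2'] + v2_y * dt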

Feature Selection with RFE

RFE (Recursive Feature Elimination) is a technique used for feature selection in machine learning. It is used to select the most important features from a given dataset. The basic idea behind RFE is to recursively eliminate the least important features until the optimal set of features is obtained.

The RFE algorithm works as follows:

  1. Choose a model and the number of desired features to select.
  2. Fit the model with all features and rank the importance of each feature.
  3. Eliminate the least important feature.
  4. Fit the model again with the remaining features and rank their importance.
  5. Continue to eliminate the least important feature and fit the model again until the desired number of features is reached.
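
As a rough illustration of how RFE can be applied here (a minimal sketch using scikit-learn's RFE with a LightGBM estimator; the feature matrix X, target y, and the number of features to select are assumptions on my part):

from sklearn.feature_selection import RFE
from lightgbm import LGBMClassifier

# Assumed setup: X is the engineered feature matrix and y the contact target
estimator = LGBMClassifier(n_estimators=200, random_state=42)
rfe = RFE(estimator=estimator, n_features_to_select=11, step=1)
rfe.fit(X, y)

# Columns kept by RFE
selected = X.columns[rfe.support_]
print(selected)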

According to RFE, the most important features (yielding a 98% accuracy score) are the following (a sketch of how the features not shown above might be computed follows this list):

  • y_position_player_1
  • x_position_player_2
  • y_position_player_2
  • distances: distance between the two players
  • meandistanceBTWPlayers: mean distance between the two players for each play
  • speed_xRM: rolling mean (window=8) of the speed of player 1
  • rel_orient: absolute difference between the orientations of players 1 & 2
  • newdistances: expected new distance between the two players
  • x_position_player_1_diff
  • x_position_player_2_diff
  • y_position_player_2_diff
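
Some of these features (the rolling mean, relative orientation, and positional differences) do not appear in the feature-generation snippet above; a hedged sketch of how they might be computed, following the article's naming, is:

# Hypothetical construction of the features not shown above (my reading of their descriptions)
grp = ['gameId', 'playId', 'nfl_player_id_1', 'nfl_player_id_2']

# Rolling mean (window=8) of player 1's speed within each game/play/pair
train_games['speed_xRM'] = train_games.groupby(grp)['speed_x'].transform(lambda s: s.rolling(8).mean())

# Absolute difference between the two players' orientations
train_games['rel_orient'] = (train_games['orientation_x'] - train_games['orientation_y']).abs()

# Step-to-step positional differences within each game/play/pair
train_games['x_position_player_1_diff'] = train_games.groupby(grp)['x_position_player_id_1'].diff()
train_games['x_position_player_2_diff'] = train_games.groupby(grp)['x_position_player_id_2'].diff()
train_games['y_position_player_2_diff'] = train_games.groupby(grp)['y_position_player_id_2'].diff()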

LGBM MODEL:

PREDICT IF TWO-PLAYER CONTACT OR NOT ON THE VALIDATION SET

import lightgbm as lgb
from sklearn.metrics import classification_report, confusion_matrix

# Separate the features and target variable
X = train.drop(columns=['contact', 'gameId', 'playId', 'step', 'nfl_player_id_1', 'nfl_player_id_2'])
y = train['contact']

# Split the data into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define the LightGBM classifier
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 150,
    'learning_rate': 0.1,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 2,
    'verbose': 0,
    'min_data_in_leaf': 100,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'scale_pos_weight': 4.0,
}

# Train the LightGBM classifier on the training data
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
model = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=[lgb_train, lgb_test], early_stopping_rounds=10, verbose_eval=50)

# Evaluate the model on the testing data
y_pred = model.predict(X_test)
y_pred_class = np.round(y_pred)
print(confusion_matrix(y_test, y_pred_class))
print(classification_report(y_test, y_pred_class))
[1000]  training's auc: 1    valid_1's auc: 0.999273

[[824196   1052]
 [   945   8377]]

              precision    recall  f1-score   support

           0       1.00      1.00      1.00    825248
           1       0.89      0.90      0.89      9322

    accuracy                           1.00    834570
   macro avg       0.94      0.95      0.95    834570
weighted avg       1.00      1.00      1.00    834570

Based on the confusion matrix and classification report, the model has very high overall accuracy, and for the positive class (label 1) it reaches a precision of 0.89, a recall of 0.90, and an F1-score of 0.89.

PREDICTING GROUND PLAYS

FUNCTION 3:

Creates features for train_ground_games & test_ground_games and prepares the data for the ML model.

def create_features_grounds(df):
    # FEATURES FOR train_ground_games and test_ground_games
    ground_games = df
    ground_games = ground_games.sort_values(['gameId', 'playId', 'nfl_player_id_1', 'nfl_player_id_2', 'step'])
    # Feature 1: SPEED CHANGE of player (percentage change between steps)
    ground_games['speedChange'] = ground_games.groupby(['gameId', 'playId', 'nfl_player_id_1'])['speed'].pct_change()
    # Feature 2: SPEED CHANGE DIRECTION of player (+1 speeding up, -1 slowing down, 0 unchanged)
    ground_games.loc[ground_games.speedChange > 0, 'speedDirection'] = 1
    ground_games.loc[ground_games.speedChange < 0, 'speedDirection'] = -1
    ground_games.loc[ground_games.speedChange == 0, 'speedDirection'] = 0
    # Feature 3: ACCELERATION CHANGE of player (percentage change between steps)
    ground_games['accelerationChange'] = ground_games.groupby(['gameId', 'playId', 'nfl_player_id_1'])['acceleration'].pct_change()
    # Feature 4: ACCELERATION CHANGE DIRECTION of player (derived from speedChange, as in the original code)
    ground_games.loc[ground_games.speedChange > 0, 'accelerationDirection'] = 1
    ground_games.loc[ground_games.speedChange < 0, 'accelerationDirection'] = -1
    ground_games.loc[ground_games.speedChange == 0, 'accelerationDirection'] = 0
    # Feature 5: rolling mean (window=8) of the speed change
    ground_games['speedChangeMA'] = ground_games.groupby(['gameId', 'playId', 'nfl_player_id_1'])['speedChange'].rolling(8).mean().reset_index().speedChange
    # Feature 6: rolling mean (window=8) of the speed
    ground_games['speedMA'] = ground_games.groupby(['gameId', 'playId', 'nfl_player_id_1'])['speed'].rolling(8).mean().reset_index().speed
    # Feature 7: rolling mean (window=8) of the acceleration change
    ground_games['accChangeMA'] = ground_games.groupby(['gameId', 'playId', 'nfl_player_id_1'])['accelerationChange'].rolling(8).mean().reset_index().accelerationChange
    # Feature 8: step-to-step difference in acceleration
    ground_games['accelerationDiff'] = ground_games['acceleration'].diff()
    return ground_games

train_grounds = create_features_grounds(train_ground_games)
# Separate the features and target variable
X = train_grounds.drop(columns=['contact', 'gameId', 'playId', 'step', 'nfl_player_id_1', 'nfl_player_id_2'])
y = train_grounds['contact']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the LightGBM classifier on the training data
params = {
    'objective': 'binary',
    'metric': 'auc',
    'boosting_type': 'gbdt',
    'num_leaves': 1500,
    'learning_rate': 0.1,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.9,
    'bagging_freq': 2,
    'verbose': 0,
    'min_data_in_leaf': 100,
    'lambda_l1': 0.1,
    'lambda_l2': 0.1,
    'scale_pos_weight': 9.0,
}
lgb_train = lgb.Dataset(X_train, y_train)
lgb_test = lgb.Dataset(X_test, y_test, reference=lgb_train)
model2 = lgb.train(params, lgb_train, num_boost_round=1000, valid_sets=[lgb_train, lgb_test], early_stopping_rounds=10, verbose_eval=50)

# Evaluate the model on the testing data
y_pred = model2.predict(X_test)
y_pred_class = np.round(y_pred)
print(confusion_matrix(y_test, y_pred_class))
print(classification_report(y_test, y_pred_class))
[[78443   327]
 [  499  2858]]

              precision    recall  f1-score   support

           0       0.99      1.00      0.99     78770
           1       0.90      0.85      0.87      3357

    accuracy                           0.99     82127
   macro avg       0.95      0.92      0.93     82127
weighted avg       0.99      0.99      0.99     82127

Based on the given confusion matrix and classification report, the model performs well. It has a high overall accuracy of 0.99 and a high precision score for both classes, indicating that it is good at identifying both positive and negative samples. The recall for the positive class is 0.85, so most of the ground-contact samples are identified. The F1-scores (0.99 for class 0 and 0.87 for class 1) indicate a good balance between precision and recall.
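
To produce a submission in the sample_submission.csv format, the two models' predictions on the processed test data can be combined roughly as follows (a minimal sketch under the assumption that the same feature-generation steps were applied to the test frames; player_features and ground_features are hypothetical names for the column lists used to train model and model2, and the 0.5 threshold mirrors the np.round used above):

# Hypothetical assembly of the submission file (a sketch, not the article's exact code)
test_players = test_games[test_games['nfl_player_id_2'] != 'G'].copy()
test_grounds = create_features_grounds(test_ground_games)

# Predicted probabilities thresholded at 0.5
test_players['contact'] = (model.predict(test_players[player_features]) > 0.5).astype(int)
test_grounds['contact'] = (model2.predict(test_grounds[ground_features]) > 0.5).astype(int)

submission = pd.concat([test_players, test_grounds])[['contact_id', 'contact']]
submission.to_csv('submission.csv', index=False)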
