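Linear regression modelling problem

Time: 2018-06-11 12:42:35

Tags: r machine-learning

When I run an R script that trains a linear regression model, I get the error below. The R script is given below, and the data file is at https://www.dropbox.com/s/pn5i75jomjbqsro/Jan_2015_OnTime.csv?dl=0. I can confirm that everything up to the last line, which trains the linear regression model, works fine. Any ideas what is causing this?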

*In predict.lm(object, newdata, se.fit, scale = 1, type = if (type ==  ... :
  prediction from a rank-deficient fit may be misleading
14: model fit failed for Resample14: parameter=none Error : cannot allocate vector of size 113.2 Mb*

Using R version 3.6.0
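For context on the first message: as far as I understand it, `predict.lm` raises the "rank-deficient fit" warning when the fitted model has fewer estimable coefficients than columns in its design matrix. The sketch below reproduces that condition with base R only. The data frame `d` and its values are made up for illustration and have nothing to do with the flight data.

```r
# Made-up data: x2 is an exact multiple of x1, so the two columns are collinear.
d <- data.frame(
  y  = c(1.1, 1.9, 3.2, 3.8, 5.1, 6.0),
  x1 = c(1, 2, 3, 4, 5, 6)
)
d$x2 <- 2 * d$x1

fit <- lm(y ~ x1 + x2, data = d)

# lm() cannot estimate x2 separately from x1, so its coefficient is NA.
coefs <- coef(fit)
print(coefs)

# The fit is rank-deficient when the rank is below the number of coefficients.
is_rank_deficient <- fit$rank < length(coefs)
print(is_rank_deficient)
```

Whether the warning text itself appears on `predict()` may depend on the R version, so the sketch checks the underlying condition rather than the message.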

origData<-read.csv('Jan_2015_OnTime.csv', header=TRUE, sep=',') # Import csv file into df
airports<-c('ATL','LAX','ORD','DFW','JFK','SFO','CLT', 'LAS','PHX') # Airports of interest
origData<-subset(origData,DEST %in% airports & ORIGIN %in% airports) # Filter flights between specific airports
origData$X <- NULL # Remove this field, seems to be junk
cor(origData[c("ORIGIN_AIRPORT_SEQ_ID","ORIGIN_AIRPORT_ID")]) # Check if these two fields are the same ie look for correlation.
cor(origData[c("DEST_AIRPORT_SEQ_ID","DEST_AIRPORT_ID")]) # Check if these two fields are the same ie look for correlation.
origData$ORIGIN_AIRPORT_SEQ_ID <- NULL # Remove this field
origData$DEST_AIRPORT_SEQ_ID <- NULL # Remove this field
mismatched <- origData[origData$CARRIER != origData$UNIQUE_CARRIER,] # Rows where CARRIER and UNIQUE_CARRIER differ
origData$UNIQUE_CARRIER <- NULL # Remove this field
onTimeData <- origData[!is.na(origData$ARR_DEL15) & origData$ARR_DEL15!="" & !is.na(origData$DEP_DEL15) & origData$DEP_DEL15!="",] # Removing NA and Blank values
onTimeData$DISTANCE <- as.integer(onTimeData$DISTANCE) # Convert to an integer
onTimeData$CANCELLED <- as.integer(onTimeData$CANCELLED) # Convert to an integer
onTimeData$DIVERTED <- as.integer(onTimeData$DIVERTED) # Convert to an integer
onTimeData$ARR_DEL15 <- as.factor(onTimeData$ARR_DEL15) # Convert to a factor
onTimeData$DEP_DEL15 <- as.factor(onTimeData$DEP_DEL15) # Convert to a factor
onTimeData$DEST_AIRPORT_ID <- as.factor(onTimeData$DEST_AIRPORT_ID) # Convert to a factor
onTimeData$ORIGIN_AIRPORT_ID <- as.factor(onTimeData$ORIGIN_AIRPORT_ID) # Convert to a factor
onTimeData$DAY_OF_WEEK <- as.factor(onTimeData$DAY_OF_WEEK) # Convert to a factor
onTimeData$DEST <- as.factor(onTimeData$DEST) # Convert to a factor
onTimeData$ORIGIN <- as.factor(onTimeData$ORIGIN) # Convert to a factor
onTimeData$DEP_TIME_BLK <- as.factor(onTimeData$DEP_TIME_BLK) # Convert to a factor
onTimeData$CARRIER <- as.factor(onTimeData$CARRIER) # Convert to a factor
tapply(onTimeData$ARR_DEL15, onTimeData$ARR_DEL15, length) # Frequency distribution of factors in ARR_DEL15
install.packages('caret')
install.packages('e1071')
library(caret)
library(e1071)
set.seed(122515)
featureCols <- c("ARR_DEL15","DAY_OF_WEEK","CARRIER","DEST","ORIGIN", "DEP_TIME_BLK") # List of col be considered as features
onTimeDataFiltered <- onTimeData[,featureCols] # Create df with only feature columns
inTrainRows <- createDataPartition(onTimeDataFiltered$ARR_DEL15, p=0.7, list = FALSE) # Create Test Train partition
head(inTrainRows,10) # Check training output
trainDataFiltered <- onTimeDataFiltered[inTrainRows,] # Create Train data
testDataFiltered <- onTimeDataFiltered[-inTrainRows,] # Create Test data
nrow(trainDataFiltered)/(nrow(trainDataFiltered)+nrow(testDataFiltered)) # Check Train %
nrow(testDataFiltered)/(nrow(trainDataFiltered)+nrow(testDataFiltered)) # Check Test %
logisticRegModel <- train(ARR_DEL15 ~ ., data = trainDataFiltered,method="glm",family="binomial")
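One thing that may be worth checking in the script above, offered as a guess rather than a confirmed diagnosis: if `read.csv` imported `ORIGIN`, `DEST` and similar columns as factors (the default in R 3.6), then `subset()` keeps every original level, and `as.factor()` on an existing factor does not drop the unused ones. Each unused level can become an all-zero dummy column, which would inflate the design matrix and could make the fit rank-deficient. The sketch below shows the effect with a small made-up data frame; `flights`, `kept` and `kept_clean` are hypothetical names and the values are not taken from the real file.

```r
# Synthetic stand-in for the flight data (values invented for illustration).
flights <- data.frame(
  ORIGIN    = factor(c("ATL", "LAX", "ORD", "BOS", "SEA", "ATL", "LAX", "ORD")),
  ARR_DEL15 = factor(c(0, 1, 0, 1, 0, 1, 0, 1))
)

# Same style of filter as in the script: keep only some airports.
kept <- subset(flights, ORIGIN %in% c("ATL", "LAX", "ORD"))

# as.factor() on a factor returns it unchanged, so BOS and SEA remain as levels.
levels_before <- nlevels(as.factor(kept$ORIGIN))

# model.matrix() builds one dummy column per non-reference level, used or not.
mm_before <- model.matrix(~ ORIGIN, data = kept)
unused <- colnames(mm_before)[colSums(mm_before) == 0]  # all-zero columns

# droplevels() removes unused levels from every factor column in the data frame.
kept_clean <- droplevels(kept)
mm_after <- model.matrix(~ ORIGIN, data = kept_clean)

print(levels_before)
print(unused)
print(c(before = ncol(mm_before), after = ncol(mm_after)))
```

If this applies to the real data, calling `droplevels()` on `onTimeData` before building `onTimeDataFiltered` would be one way to test the idea; whether it also resolves the memory error is not something this sketch can show.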

Best regards, Togy

0 answers:

No answers