Question

I have two .CSV files. One contains IP addresses:

76.83.179.64
76.83.179.64
187.42.62.209
89.142.219.5

The other contains IP ranges and country names, as shown below:

|ip_from |ip_to   |country_name|
|16777216|16777471|Australia   |

Here is what I have done so far.

Load the range data (ip_from, ip_to, and country name):

val rdd1 = sqlContext.read
  .format("csv")
  .option("inferSchema", "true")
  .load("/FileStore/tables/locations.CSV")
val df2 = rdd1.toDF()

Load the IP addresses and convert each one to a Long:

val rdd2 = sc.textFile("/FileStore/tables/ipaddress.csv")
def ipToLong(ipAddress: String): Long = {
  ipAddress.split("\\.").reverse.zipWithIndex
    .map(a => a._1.toInt * math.pow(256, a._2).toLong).sum
}
val df1 = rdd2.map(x => ipToLong(x)).toDF()
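
As a sanity check on the conversion, here is a minimal plain-Scala sketch that needs no Spark. The name `ipToLongShift` is hypothetical; it computes the same value as `ipToLong` above using bit shifts instead of powers of 256, and it assumes well-formed dotted-quad IPv4 input with no surrounding whitespace.

```scala
// Pack the four octets of a dotted-quad IPv4 address into one Long.
// Same result as the powers-of-256 sum, written with shifts.
// Assumption: the input is a well-formed IPv4 string such as "1.0.0.0".
def ipToLongShift(ipAddress: String): Long =
  ipAddress.split("\\.").foldLeft(0L)((acc, octet) => (acc << 8) | octet.toLong)

// 76*256^3 + 83*256^2 + 179*256 + 64
println(ipToLongShift("76.83.179.64")) // 1280553792
println(ipToLongShift("1.0.0.0"))      // 16777216, the ip_from of the sample row
println(ipToLongShift("1.0.0.255"))    // 16777471, the ip_to of the sample row
```

The last two lines suggest that the sample row 16777216 to 16777471 corresponds to 1.0.0.0 through 1.0.0.255.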

Now, what user-defined function should I write to join the two DataFrames (or do a lookup) and retrieve the country name for each IP address?

Answer 1

In your case, you can simply use the following logic:

df1.join(df2, df1("value") >= df2("ip_from") && df1("value") <= df2("ip_to"), "left")
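
A fuller sketch of this join, under some assumptions: calling `toDF()` on an RDD of Long produces a single column named `value`, and a CSV read without the `header` option gets default column names (`_c0`, `_c1`, ...), so the range columns are named explicitly here. The in-memory data, the `lookedUp` name, and the local session setup are illustrative stand-ins, not the asker's actual files.

```scala
import org.apache.spark.sql.SparkSession

// Local session for a standalone run; in a notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("ip-range-join-sketch")
  .getOrCreate()
import spark.implicits._

// Stand-ins for the question's data; only the Australia row comes from the question.
val df1 = Seq(16777300L, 1280553792L).toDF("value")
val df2 = Seq((16777216L, 16777471L, "Australia"))
  .toDF("ip_from", "ip_to", "country_name") // explicit names instead of _c0, _c1, _c2

// Left join keeps every IP; IPs that fall in no range get a null country_name.
val lookedUp = df1
  .join(df2, df1("value") >= df2("ip_from") && df1("value") <= df2("ip_to"), "left")
  .select("value", "country_name")

lookedUp.show()
```

With this data, 16777300 falls inside the Australia range, while 1280553792 matches no range and comes back with a null country.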

Answer 2

You can use a left_outer join together with a UDF that performs the IP-to-Long conversion, as in the following example:

val dfIP = Seq(
  ("76.83.179.64"),
  ("76.83.179.64"),
  ("187.42.62.209"),
  ("89.142.219.5")
).toDF("ip")

val dfRange = Seq(
  (1000000000L, 1500000000L, "Country A"),
  (1500000000L, 3000000000L, "Country B"),
  (3000000000L, 4000000000L, "Country C")
).toDF("ip_from", "ip_to", "country_name")

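import org.apache.spark.sql.functions.udf

// UDF that converts a dotted-quad IPv4 string to its numeric (Long) value
val ipToLong = udf(
  (ip: String) =>
    ip.split("\\.").reverse.zipWithIndex.map(
      a => a._1.toInt * math.pow(256, a._2).toLong
    ).sum
)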

val dfJoined = dfIP.join(
  dfRange,
  ipToLong($"ip") >= $"ip_from" && ipToLong($"ip") < $"ip_to",
  "left_outer"
)

dfJoined.show
+-------------+----------+----------+------------+
|           ip|   ip_from|     ip_to|country_name|
+-------------+----------+----------+------------+
| 76.83.179.64|1000000000|1500000000|   Country A|
| 76.83.179.64|1000000000|1500000000|   Country A|
|187.42.62.209|3000000000|4000000000|   Country C|
| 89.142.219.5|1500000000|3000000000|   Country B|
+-------------+----------+----------+------------+
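
One detail worth checking against your own data is the upper bound. The example above uses `< ip_to` because its toy ranges share boundaries (1500000000 ends Country A and starts Country B), whereas Answer 1 uses `<= ip_to`. The sketch below, in plain Scala with the hypothetical names `IpRange`, `matchesInclusive`, and `matchesHalfOpen`, shows how the two conditions differ on the question's sample row, assuming that `ip_to` there is the last address of the range.

```scala
final case class IpRange(ipFrom: Long, ipTo: Long, country: String)

// Sample row from the question; assumed to cover 1.0.0.0 through 1.0.0.255 inclusive.
val australia = IpRange(16777216L, 16777471L, "Australia")

// Condition from Answer 1: both bounds inclusive.
def matchesInclusive(ip: Long, r: IpRange): Boolean = ip >= r.ipFrom && ip <= r.ipTo

// Condition from Answer 2: upper bound exclusive.
def matchesHalfOpen(ip: Long, r: IpRange): Boolean = ip >= r.ipFrom && ip < r.ipTo

val lastAddress = 16777471L // 1.0.0.255
println(matchesInclusive(lastAddress, australia)) // true
println(matchesHalfOpen(lastAddress, australia))  // false
```

If the ranges in your file are inclusive and do not overlap, `<=` is probably the condition you want; if adjacent ranges share an endpoint, `<` avoids matching an IP twice.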
