pyspark.mllib.feature module

This article introduces two key feature extraction techniques, TF-IDF and Word2Vec, and two data transformation techniques, StandardScaler and Normalizer, which are widely used in natural language processing and related fields.
Feature Extraction
Feature extraction converts raw data features into numerical representations that can be used for further analysis. In this section, we introduce two feature extraction techniques: TF-IDF and Word2Vec.
TF-IDF
Term frequency-inverse document frequency (TF-IDF) reflects the importance of a term (word) to a document in a corpus. Denote a term by t, a document by d, and the corpus by D. Term frequency TF(t, d) is the number of times that term t appears in document d, while document frequency DF(t, D) is the number of documents in D that contain the term.
If we only use term frequency to measure importance, it is very easy to over-emphasize terms that appear very often but carry little information about the document, e.g., 'a', 'the', and 'of'. If a term appears very often across the corpus, it does not carry special information about a particular document. Inverse document frequency is a numerical measure of how much information a term provides:

IDF(t, D) = log((m + 1) / (DF(t, D) + 1))

where m is the total number of documents in the corpus. A smoothing term (the +1) is applied to avoid dividing by zero for terms outside the corpus.
The TF-IDF measure is simply the product of TF and IDF:

TFIDF(t, d, D) = TF(t, d) * IDF(t, D)

pyspark.mllib.feature module

class pyspark.mllib.feature.HashingTF

Bases: object

Maps a sequence of terms to their term frequencies using the hashing trick.
Method:

           indexOf(term)

Returns the index of the input term.

transform(document)

Transforms the input document (a list of terms) to a term frequency vector, or transforms an RDD of documents to an RDD of term frequency vectors.
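A minimal sketch of these two methods on a single in-memory document (the term values are illustrative; by default HashingTF uses 1048576 = 2^20 hash buckets, as seen in the sample output further below):

from pyspark.mllib.feature import HashingTF

htf = HashingTF()              # default: 1 << 20 hash buckets
doc = ["1", "2", "2", "2"]     # one document as a list of terms
tf_vector = htf.transform(doc) # sparse term-frequency vector for this document
bucket = htf.indexOf("2")      # index of the hash bucket that "2" is mapped to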

class pyspark.mllib.feature.IDFModel

Bases: pyspark.mllib.feature.JavaVectorTransformer

Represents an IDF model that can transform term frequency vectors.
Method:

           transform(dataset)

Transforms term frequency (TF) vectors to TF-IDF vectors.

If minDocFreq was set for the IDF calculation, the terms which occur in fewer than minDocFreq documents will have an entry of 0.

Parameters:
    

dataset an RDD of term frequency vectors

Returns:
    

an RDD of TF-IDF vectors

class pyspark.mllib.feature.IDF(minDocFreq=0)

Bases: object

Inverse document frequency (IDF).

The standard formulation is used: idf = log((m + 1) / (d(t) + 1)), where m is the total number of documents and d(t) is the number of documents that contain term t.

This implementation supports filtering out terms which do not appear in a minimum number of documents (controlled by the variable minDocFreq). For terms that are not in at least minDocFreq documents, the IDF is found as 0, resulting in TF-IDFs of 0.
Method:

fit(dataset)

Computes the inverse document frequency.

Parameters:
    

dataset an RDD of term frequency vectors
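As a minimal sketch of the minDocFreq filter described above (assuming tf is an RDD of term frequency vectors like the one built in the sample code below):

idfModel = IDF(minDocFreq=2).fit(tf)   # terms appearing in fewer than 2 documents get IDF 0
tfidf = idfModel.transform(tf)         # their TF-IDF entries are therefore 0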
Sample Code:

from pyspark import SparkContext

from pyspark.mllib.feature import HashingTF

from pyspark.mllib.feature import IDF

 

sc = SparkContext()

 

# Load documents (one per line).

documents = sc.textFile("data/mllib/document").map(lambda line: line.split(" "))

 

#Computes TF

hashingTF = HashingTF()

tf = hashingTF.transform(documents)

 

#Computes TF-IDF

tf.cache()

idf = IDF().fit(tf)

tfidf = idf.transform(tf)

 

for r in tfidf.collect(): print r

 

Data in document:

1 1 1 1

1 2 2 2

 

Output:

(1048576, [485808], [0.0])

# 1048576 is the total number of hash buckets and [485808] is the hash bucket index for this term

# 0.0 is the TFIDF for word '1' in document 1.

(1048576, [485808, 559923], [0.0, 1.21639532432])

# 0.0 and 1.21639532432 are the TF-IDF values for word '1' and word '2' in document 2.
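The non-zero value can be checked by hand from the formula above: word '2' occurs 3 times in document 2 and appears in 1 of the 2 documents, so its TF-IDF is 3 * log((2 + 1) / (1 + 1)) ≈ 1.2164, while word '1' appears in both documents and therefore gets IDF log(3/3) = 0. A small check in plain Python:

import math

m = 2                     # total number of documents
tf_word2, df_word2 = 3, 1 # word '2': term frequency in doc 2, document frequency
print(tf_word2 * math.log((m + 1.0) / (df_word2 + 1.0)))   # ~1.21639532432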
Word2Vec
Word2Vec converts each word in the documents into a vector. This technique is useful in many natural language processing applications such as named entity recognition, disambiguation, parsing, tagging, and machine translation.
MLlib uses the skip-gram model, which maps words that appear in similar contexts to vectors that are close in the vector space. Given a large dataset, the skip-gram model can find synonyms of a word with high accuracy.

pyspark.mllib.feature module

class pyspark.mllib.feature.Word2Vec

Bases: object

Word2Vec creates vector representation of words in a text corpus.

Word2Vec uses the skip-gram model to train the word vectors.
Method:

fit(data)

Computes the vector representation of each word in vocabulary.

Parameters:
    

data training data. RDD of list of string

Returns:
    

Word2VecModel instance

setLearningRate(learningRate)

Sets initial learning rate (default: 0.025).

setNumIterations(numIterations)

Sets number of iterations (default: 1), which should be smaller than or equal to number of partitions.

setNumPartitions(numPartitions)

Sets number of partitions (default: 1). Use a small number for accuracy.

setSeed(seed)

Sets random seed.

setVectorSize(vectorSize)

Sets vector size (default: 100).
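The setters above return the Word2Vec object itself, so they can be chained before fit. A minimal configuration sketch (the parameter values are arbitrary choices for illustration, and doc is an RDD of lists of strings as in the sample code below):

word2vec = Word2Vec() \
    .setVectorSize(10) \
    .setLearningRate(0.025) \
    .setNumPartitions(1) \
    .setNumIterations(1) \
    .setSeed(42)
model = word2vec.fit(doc)   # returns a Word2VecModel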

 

class pyspark.mllib.feature.Word2VecModel

Bases: pyspark.mllib.feature.JavaVectorTransformer

class for Word2Vec model
Method:

findSynonyms(word, num)

Find synonyms of a word

Note: local use only

Parameters:

word a word or a vector representation of a word

num number of synonyms to find

Returns:

array of (word, cosineSimilarity)

transform(word)

Transforms a word to its vector representation

Note: local use only

Parameters:
    

word a word

Returns:
    

vector representation of word(s)
Sample Code:

from pyspark import SparkContext

from pyspark.mllib.feature import Word2Vec

 

#Pippa Passes

sentence = "The year is at the spring \

        And the day is at the morn; \

        Morning is at seven;  \

        The hill-side is dew-pearled; \

            The lark is on the wing; \

            The snail is on the thorn; \

            God's in His heaven; \

        All's right with the world "

        

sc = SparkContext()

 

#Generate doc

localDoc = [sentence, sentence]

doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))

 

#Convert words in doc to vectors.

model = Word2Vec().fit(doc)

 

#Print the vector of "The"

vec = model.transform("The")

print vec

 

#Find the synonyms of "The"

syms = model.findSynonyms("The", 5)

print [s[0] for s in syms]

 
Output:

 

[-0.00352853513323,0.00335159664974,-0.00598029373214,0.00399478571489,-0.00198440207168,-0.00294396048412,-0.00279111019336,0.00574737275019,-0.00628866581246,-0.00110566907097,-0.00108648219611,-0.00195649731904,0.00195016933139,0.00108497566544,-0.00230407039635,0.00146713317372,0.00322529440746,-0.00460519595072,0.0029725972563,-0.0018835098017,-1.38119357871e-05,0.000757675385103,-0.00189483352005,-0.00201138551347,0.00030658338801,0.00328158447519,-0.00367985945195,0.003532753326,-0.0019905695226,0.00628945976496,-0.00582657754421,0.00338909355924,0.00336381071247,-0.00497342273593,0.000185315642739,0.00409715576097,0.00307129183784,-0.00160020322073,0.000823577167466,0.00359133118764,0.000429257488577,-0.00509830284864,0.00443912763149,0.00010487002146,0.00211782287806,0.00373624730855,0.00489703053609,-0.00397138809785,0.000249207223533,-0.00378827378154,-0.000930541602429,-0.00113072514068,-0.00480769388378,-0.00129892374389,-0.0016206469154,0.00158304872457,-0.00206038192846,-0.00416553160176,0.00646342104301,0.00531594920903,0.00196505431086,0.00229385774583,-0.00256532337517,1.66955578607e-05,-0.00372383627109,0.00685756560415,0.00612043589354,-0.000518668384757,0.000620941573288,0.00244942889549,-0.00180160428863,-0.00129932863638,-0.00452549103647,0.00417296867818,-0.000546502880752,-0.0016888830578,-0.000340467959177,-0.00224090646952,0.000401715224143,0.00230841850862,0.00308039737865,-0.00271077733487,-0.00409514643252,-0.000891392992344,0.00459721498191,0.00295961694792,0.00211095809937,0.00442661950365,-0.001312403474,0.00522524351254,0.00116976187564,0.00254187034443,0.00157006899826,-0.0026122755371,0.00510979117826,0.00422499561682,0.00410514092073,0.00415299832821,-0.00311993830837,-0.00247424701229]

[u'', u'the', u'\t', u'is', u'at'] #The synonyms of "The"

 
 
 
Data Transformation
Data transformation manipulates the values in each dimension of a vector according to a predefined rule; the transformed vectors can then be used for further processing.
In this section we introduce two kinds of data transformation: StandardScaler and Normalizer.
StandardScaler
StandardScaler standardizes each column of the dataset to zero mean (when withMean is enabled) and unit variance.

pyspark.mllib.feature module        

class pyspark.mllib.feature.StandardScalerModel

Bases: pyspark.mllib.feature.JavaVectorTransformer

Represents a StandardScaler model that can transform vectors.
Method:

transform(vector)

Applies standardization transformation on a vector.

Parameters:
    

vector Vector or RDD of Vector to be standardized.

Returns:
    

Standardized vector. If the variance of a column is zero, it will return default 0.0 for the column with zero variance.

class pyspark.mllib.feature.StandardScaler(withMean=False, withStd=True)

Bases: object

Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.

If withMean is true, the mean of each column (dimension) is subtracted from the corresponding entry of every vector.

If withStd is true, each entry is divided by the standard deviation of its column.
Method:

fit(dataset)

Computes the mean and variance and stores as a model to be used for later scaling.

Parameters:
    

data The data used to compute the mean and variance to build the transformation model.

Returns:
    

a StandardScalerModel
Sample Code:

from pyspark.mllib.feature import Normalizer

from pyspark.mllib.linalg import Vectors

from pyspark import SparkContext

from pyspark.mllib.feature import StandardScaler

 

sc = SparkContext()

 

vs = [Vectors.dense([-2.0, 2.3, 0]), Vectors.dense([3.8, 0.0, 1.9])]

 

dataset = sc.parallelize(vs)

 

#all false, do nothing.

standardizer = StandardScaler(False, False)

model = standardizer.fit(dataset)

result = model.transform(dataset)

for r in result.collect(): print r

 

print("\n")

 

#subtracts the mean

standardizer = StandardScaler(True, False)

model = standardizer.fit(dataset)

result = model.transform(dataset)

for r in result.collect(): print r

 

print("\n")

 

#divides by the standard deviation of each column

standardizer = StandardScaler(False, True)

model = standardizer.fit(dataset)

result = model.transform(dataset)

for r in result.collect(): print r

 

print("\n")

 

#Subtracts the mean first, then divides by the standard deviation

standardizer = StandardScaler(True, True)

model = standardizer.fit(dataset)

result = model.transform(dataset)

for r in result.collect(): print r

 

print("\n")
Output:

#all false, do nothing.

[-2.0,2.3,0.0]

[3.8,0.0,1.9]

 

#subtracts the mean

[-2.9,1.15,-0.95]

[2.9,-1.15,0.95]

 

#divides by the standard deviation of each column

[-0.487659849094,1.41421356237,0.0]

[0.926553713279,0.0,1.41421356237]

 

#Subtracts the mean first, then divides by the standard deviation

[-0.707106781187,0.707106781187,-0.707106781187]

[0.707106781187,-0.707106781187,0.707106781187]
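These numbers can be reproduced from the column statistics: the column means are [0.9, 1.15, 0.95], and StandardScaler divides by the sample (unbiased) standard deviation of each column. A small NumPy check of the last case (withMean=True, withStd=True):

import numpy as np

vs = np.array([[-2.0, 2.3, 0.0], [3.8, 0.0, 1.9]])
mean = vs.mean(axis=0)        # [0.9, 1.15, 0.95]
std = vs.std(axis=0, ddof=1)  # sample standard deviation of each column
print((vs - mean) / std)      # rows match the withMean=True, withStd=True output above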
Normalizer
Normalizer scales each vector by dividing every dimension of the vector by its Lp norm.
For 1 <= p < infinity, the Lp norm is calculated as sum(abs(vector)^p)^(1/p).
For p = infinity, the Lp norm is max(abs(vector)).
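For the vector [0.0, 1.0, 2.0] used in the sample code below, these norms are 3 (L1), sqrt(5) ≈ 2.2361 (L2), and 2 (L-infinity), which gives exactly the three outputs at the end of this section. A small NumPy check:

import numpy as np

v = np.array([0.0, 1.0, 2.0])
for p in (1, 2, np.inf):
    norm = np.linalg.norm(v, ord=p)   # L1 = 3.0, L2 = sqrt(5), L-inf = 2.0
    print(v / norm)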

pyspark.mllib.feature module        

class pyspark.mllib.feature.Normalizer(p=2.0)

Bases: pyspark.mllib.feature.VectorTransformer
Method:

transform(vector)

Applies unit length normalization on a vector.

Parameters:
    

vector vector or RDD of vector to be normalized.

Returns:
    

normalized vector. If the norm of the input is zero, it will return the input vector.
Sample Code:

from pyspark.mllib.feature import Normalizer

from pyspark.mllib.linalg import Vectors

from pyspark import SparkContext

 

sc = SparkContext()

 

# v = [0.0, 1.0, 2.0]

v = Vectors.dense(range(3))

 

# p = 1

nor = Normalizer(1)

print (nor.transform(v))

 

# p = 2

nor = Normalizer(2)

print (nor.transform(v))

 

# p = inf

nor = Normalizer(p=float("inf"))

print (nor.transform(v))

 
Output:

[0.0, 0.3333333333, 0.666666667]

[0.0, 0.4472135955, 0.894427191]

[0.0, 0.5, 1.0]
This article is reposted from the 张昺华-sky cnblogs blog. Original link: http://www.cnblogs.com/bonelee/p/7778000.html. Please contact the original author before reposting.

