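Trusted Intelligent Computing Service (TICS) - Preparing Data: Preparing Local Horizontal Federated Data Resources

Time: 2024-04-23 17:09:41

Preparing Local Horizontal Federated Data Resources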

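  1. Upload the dataset files (job participants)

    Upload the dataset files to the mount path of the compute node so that the scripts executed by the compute node can read them. For a host mount, upload the files to the mount path on the host machine. For an OBS mount, use the Object Storage Service (OBS) provided by Huawei Cloud and upload the files to the object bucket used by the current compute node.

    Figure 5 Object bucket name

    The following uses a host mount as an example:

    1. Create a host-mounted compute node named Agent1 with the mount path /tmp/tics1/.
    2. Use a file upload tool to upload the dataset folder, which contains the dataset iris1.csv, to the /tmp/tics1/ directory on the host machine.
      The content of iris1.csv is as follows: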
      sepal_length,sepal_width,petal_length,petal_width,class
      5.1,3.5,1.4,0.3,Iris-setosa
      5.7,3.8,1.7,0.3,Iris-setosa
      5.1,3.8,1.5,0.3,Iris-setosa
      5.4,3.4,1.7,0.2,Iris-setosa
      5.1,3.7,1.5,0.4,Iris-setosa
      4.6,3.6,1,0.2,Iris-setosa
      5.1,3.3,1.7,0.5,Iris-setosa
      4.8,3.4,1.9,0.2,Iris-setosa
      5,3,1.6,0.2,Iris-setosa
      5,3.4,1.6,0.4,Iris-setosa
      5.2,3.5,1.5,0.2,Iris-setosa
      5.2,3.4,1.4,0.2,Iris-setosa
      4.7,3.2,1.6,0.2,Iris-setosa
      4.8,3.1,1.6,0.2,Iris-setosa
      5.4,3.4,1.5,0.4,Iris-setosa
      5.2,4.1,1.5,0.1,Iris-setosa
      5.5,4.2,1.4,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      5,3.2,1.2,0.2,Iris-setosa
      5.5,3.5,1.3,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      4.4,3,1.3,0.2,Iris-setosa
      5.1,3.4,1.5,0.2,Iris-setosa
      5,3.5,1.3,0.3,Iris-setosa
      4.5,2.3,1.3,0.3,Iris-setosa
      4.4,3.2,1.3,0.2,Iris-setosa
      5,3.5,1.6,0.6,Iris-setosa
      5.1,3.8,1.9,0.4,Iris-setosa
      4.8,3,1.4,0.3,Iris-setosa
      5.1,3.8,1.6,0.2,Iris-setosa
      4.6,3.2,1.4,0.2,Iris-setosa
      5.3,3.7,1.5,0.2,Iris-setosa
      5,3.3,1.4,0.2,Iris-setosa
      6.8,2.8,4.8,1.4,Iris-versicolor
      6.7,3,5,1.7,Iris-versicolor
      6,2.9,4.5,1.5,Iris-versicolor
      5.7,2.6,3.5,1,Iris-versicolor
      5.5,2.4,3.8,1.1,Iris-versicolor
      5.5,2.4,3.7,1,Iris-versicolor
      5.8,2.7,3.9,1.2,Iris-versicolor
      6,2.7,5.1,1.6,Iris-versicolor
      5.4,3,4.5,1.5,Iris-versicolor
      6,3.4,4.5,1.6,Iris-versicolor
      6.7,3.1,4.7,1.5,Iris-versicolor
      6.3,2.3,4.4,1.3,Iris-versicolor
      5.6,3,4.1,1.3,Iris-versicolor
      5.5,2.5,4,1.3,Iris-versicolor
      5.5,2.6,4.4,1.2,Iris-versicolor
      6.1,3,4.6,1.4,Iris-versicolor
      5.8,2.6,4,1.2,Iris-versicolor
      5,2.3,3.3,1,Iris-versicolor
      5.6,2.7,4.2,1.3,Iris-versicolor
      5.7,3,4.2,1.2,Iris-versicolor
      5.7,2.9,4.2,1.3,Iris-versicolor
      6.2,2.9,4.3,1.3,Iris-versicolor
      5.1,2.5,3,1.1,Iris-versicolor
      5.7,2.8,4.1,1.3,Iris-versicolor
      6.3,3.3,6,2.5,Iris-virginica
      5.8,2.7,5.1,1.9,Iris-virginica
      7.1,3,5.9,2.1,Iris-virginica
      6.3,2.9,5.6,1.8,Iris-virginica
      6.5,3,5.8,2.2,Iris-virginica
      7.6,3,6.6,2.1,Iris-virginica
      4.9,2.5,4.5,1.7,Iris-virginica
      7.3,2.9,6.3,1.8,Iris-virginica
      6.7,2.5,5.8,1.8,Iris-virginica
      7.2,3.6,6.1,2.5,Iris-virginica
      6.5,3.2,5.1,2,Iris-virginica
      6.4,2.7,5.3,1.9,Iris-virginica
      6.8,3,5.5,2.1,Iris-virginica
      5.7,2.5,5,2,Iris-virginica
      5.8,2.8,5.1,2.4,Iris-virginica
      6.4,3.2,5.3,2.3,Iris-virginica
      6.5,3,5.5,1.8,Iris-virginica
      7.7,3.8,6.7,2.2,Iris-virginica
      7.7,2.6,6.9,2.3,Iris-virginica
      6,2.2,5,1.5,Iris-virginica
      6.9,3.2,5.7,2.3,Iris-virginica
      5.6,2.8,4.9,2,Iris-virginica
      7.7,2.8,6.7,2,Iris-virginica
      6.3,2.7,4.9,1.8,Iris-virginica
      6.7,3.3,5.7,2.1,Iris-virginica
      7.2,3.2,6,1.8,Iris-virginica
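    3. So that the compute node program inside the container has permission to read the files, run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files in the mount directory to 1000:1000.
    4. Create compute node Agent2 on a second host with the mount path /tmp/tics2/. Upload the dataset folder containing the dataset iris2.csv to the mount directory on that host, then change the owner and group as in step 3.
      The content of iris2.csv is as follows: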
      sepal_length,sepal_width,petal_length,petal_width,class
      5.1,3.5,1.4,0.2,Iris-setosa
      4.9,3,1.4,0.2,Iris-setosa
      4.7,3.2,1.3,0.2,Iris-setosa
      4.6,3.1,1.5,0.2,Iris-setosa
      5,3.6,1.4,0.2,Iris-setosa
      5.4,3.9,1.7,0.4,Iris-setosa
      4.6,3.4,1.4,0.3,Iris-setosa
      5,3.4,1.5,0.2,Iris-setosa
      4.4,2.9,1.4,0.2,Iris-setosa
      4.9,3.1,1.5,0.1,Iris-setosa
      5.4,3.7,1.5,0.2,Iris-setosa
      4.8,3.4,1.6,0.2,Iris-setosa
      4.8,3,1.4,0.1,Iris-setosa
      4.3,3,1.1,0.1,Iris-setosa
      5.8,4,1.2,0.2,Iris-setosa
      5.7,4.4,1.5,0.4,Iris-setosa
      5.4,3.9,1.3,0.4,Iris-setosa
      7,3.2,4.7,1.4,Iris-versicolor
      6.4,3.2,4.5,1.5,Iris-versicolor
      6.9,3.1,4.9,1.5,Iris-versicolor
      5.5,2.3,4,1.3,Iris-versicolor
      6.5,2.8,4.6,1.5,Iris-versicolor
      5.7,2.8,4.5,1.3,Iris-versicolor
      6.3,3.3,4.7,1.6,Iris-versicolor
      4.9,2.4,3.3,1,Iris-versicolor
      6.6,2.9,4.6,1.3,Iris-versicolor
      5.2,2.7,3.9,1.4,Iris-versicolor
      5,2,3.5,1,Iris-versicolor
      5.9,3,4.2,1.5,Iris-versicolor
      6,2.2,4,1,Iris-versicolor
      6.1,2.9,4.7,1.4,Iris-versicolor
      5.6,2.9,3.6,1.3,Iris-versicolor
      6.7,3.1,4.4,1.4,Iris-versicolor
      5.6,3,4.5,1.5,Iris-versicolor
      5.8,2.7,4.1,1,Iris-versicolor
      6.2,2.2,4.5,1.5,Iris-versicolor
      5.6,2.5,3.9,1.1,Iris-versicolor
      5.9,3.2,4.8,1.8,Iris-versicolor
      6.1,2.8,4,1.3,Iris-versicolor
      6.3,2.5,4.9,1.5,Iris-versicolor
      6.1,2.8,4.7,1.2,Iris-versicolor
      6.4,2.9,4.3,1.3,Iris-versicolor
      6.6,3,4.4,1.4,Iris-versicolor
      6.8,2.8,4.8,1.4,Iris-versicolor
      6.2,2.8,4.8,1.8,Iris-virginica
      6.1,3,4.9,1.8,Iris-virginica
      6.4,2.8,5.6,2.1,Iris-virginica
      7.2,3,5.8,1.6,Iris-virginica
      7.4,2.8,6.1,1.9,Iris-virginica
      7.9,3.8,6.4,2,Iris-virginica
      6.4,2.8,5.6,2.2,Iris-virginica
      6.3,2.8,5.1,1.5,Iris-virginica
      6.1,2.6,5.6,1.4,Iris-virginica
      7.7,3,6.1,2.3,Iris-virginica
      6.3,3.4,5.6,2.4,Iris-virginica
      6.4,3.1,5.5,1.8,Iris-virginica
      6,3,4.8,1.8,Iris-virginica
      6.9,3.1,5.4,2.1,Iris-virginica
      6.7,3.1,5.6,2.4,Iris-virginica
      6.9,3.1,5.1,2.3,Iris-virginica
      5.8,2.7,5.1,1.9,Iris-virginica
      6.8,3.2,5.9,2.3,Iris-virginica
      6.7,3.3,5.7,2.5,Iris-virginica
      6.7,3,5.2,2.3,Iris-virginica
      6.3,2.5,5,1.9,Iris-virginica
      6.5,3,5.2,2,Iris-virginica
      6.2,3.4,5.4,2.3,Iris-virginica
      5.9,3,5.1,1.8,Iris-virginica
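Before uploading, it can help to sanity-check each CSV file locally. The following is a minimal sketch using only the Python standard library. The helper name check_iris_csv is hypothetical and not part of TICS; the expected header is taken from the iris1.csv and iris2.csv samples above.

```python
import csv
import io

# Column names taken from the iris1.csv / iris2.csv samples above.
EXPECTED_HEADER = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]


def check_iris_csv(text):
    """Validate CSV text and return the number of rows per class label.

    Hypothetical helper for illustration. Raises ValueError on an unexpected
    header, a wrong field count, or a non-numeric feature value.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header}")
    counts = {}
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # skip blank lines
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError(f"line {line_no}: expected {len(EXPECTED_HEADER)} fields, got {len(row)}")
        for value in row[:-1]:
            float(value)  # raises ValueError if a feature is not numeric
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    return counts


# Three rows copied from the iris1.csv sample above.
sample = """sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.3,Iris-setosa
6.8,2.8,4.8,1.4,Iris-versicolor
6.3,3.3,6,2.5,Iris-virginica
"""
print(check_iris_csv(sample))
```

To check a real file, read it first, for example with open(path, encoding="utf-8").read(), and pass the text to the helper.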
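  2. Prepare the model file and initial weights (job initiator)

    The job initiator must provide the model and, optionally, the initial weights. Upload them to the mount directory of Agent1 and run chown -R 1000:1000 /tmp/tics1/ to change the owner and group of the files in the mount directory.

    Use Python code to create the model and save it as the binary file model.h5. Taking the iris dataset as an example, generate the following model: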

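    import tensorflow as tf
    import keras
    
    # Fully connected network: 4 input features -> 3 iris classes
    model = keras.Sequential([
        keras.layers.Dense(4, activation=tf.nn.relu, input_shape=(4,)),
        keras.layers.Dense(6, activation=tf.nn.relu),
        keras.layers.Dense(3, activation='softmax')
    ])
    
    # The labels are one-hot encoded, so use categorical cross-entropy
    model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Save in HDF5 format; adjust the output path for your environment
    model.save("d:/model.h5")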

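    The initial weights are an array of floating-point numbers that corresponds to the model. The result result_1 produced by a federated learning training run can be used as the initial weights. A sample is as follows: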

    -0.23300957679748535,0.7804553508758545,0.0064492723904550076,0.5866460800170898,0.676144003868103,-0.7883696556091309,0.5472091436386108,-0.20961782336235046,0.58524489402771,-0.5079598426818848,-0.47474920749664307,-0.3519996106624603,-0.10822880268096924,-0.5457949042320251,-0.28117161989212036,-0.7369481325149536,-0.04728877171874046,0.003856887575238943,0.051739662885665894,0.033792052417993546,-0.31878742575645447,0.7511205673217773,0.3158722519874573,-0.7290999293327332,0.7187696695327759,0.09846954792737961,-0.06735057383775711,0.7165604829788208,-0.730293869972229,0.4473201036453247,-0.27151209115982056,-0.6971480846405029,0.7360773086547852,0.819558322429657,0.4984433054924011,0.05300116539001465,-0.6597640514373779,0.7849202156066895,0.6896201372146606,0.11731931567192078,-0.5380218029022217,0.18895208835601807,-0.18693888187408447,0.357051283121109,0.05440644919872284,0.042556408792734146,-0.04341210797429085,0.0,-0.04367709159851074,-0.031455427408218384,0.24731603264808655,-0.062861368060112,-0.4265706539154053,0.32981523871421814,-0.021271884441375732,0.15228557586669922,0.1818728893995285,0.4162319302558899,-0.22432318329811096,0.7156463861465454,-0.13709741830825806,0.7237883806228638,-0.5489991903305054,0.47034209966659546,-0.04692812263965607,0.7690137028694153,0.40263476967811584,-0.4405142068862915,0.016018997877836227,-0.04845477640628815,0.037553105503320694
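Because the weights array must correspond to the model, its length can be checked against the model's parameter count. The sketch below assumes the weights are stored as one flat, comma-separated list, as in the sample above; the helper names are hypothetical. For the iris model (4 inputs, then Dense layers of 4, 6, and 3 units) the count is 20 + 30 + 21 = 71, which matches the number of values in the sample.

```python
def dense_param_count(layer_sizes):
    """Return the number of trainable parameters of a stack of Dense layers.

    layer_sizes lists the input width followed by each layer's unit count.
    Each Dense layer has inputs * units kernel weights plus units biases.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out + n_out
    return total


def parse_weights(text):
    """Parse a comma-separated string of floats into a list."""
    return [float(item) for item in text.split(",") if item.strip()]


# 4 input features, then Dense(4), Dense(6), Dense(3) as in the model above.
expected = dense_param_count([4, 4, 6, 3])
print(expected)

# Shortened stand-in string for illustration, not real trained weights.
weights = parse_weights("0.5,-0.25,0.0")
print(len(weights) == expected)
```

With Keras available, model.count_params() gives the same total for a loaded model.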
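  3. Write the training script (job initiator)

    The job initiator also needs to write the federated learning training script, in which the user implements the logic for reading data, training the model, evaluating the model, and obtaining evaluation metrics. The compute node passes the path attribute from the dataset configuration file to the training script as a parameter.

    The JobParam attributes are as follows: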

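    class JobParam:
        """Training script parameters
        """
        # Job ID
        job_id = ''
        # Current round
        round = 0
        # Number of iterations (epochs)
        epoch = 0
        # Model file path
        model_file = ''
        # Dataset path
        dataset_path = ''
        # Whether to run evaluation only
        eval_only = False
        # Weights file
        weights_file = ''
        # Output path
        output = ''
        # Other parameters as a JSON string
        param = ''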
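
The param attribute carries any other parameters as a JSON string. A minimal sketch of decoding it inside a training script is shown below. The helper name and the keys batch_size and learning_rate are hypothetical; this document does not specify which keys are passed.

```python
import json


def parse_extra_params(param, defaults=None):
    """Decode the JSON string held in JobParam.param into a dict.

    An empty string returns a copy of the defaults. The keys used in the
    example below are hypothetical.
    """
    result = dict(defaults or {})
    if param:
        decoded = json.loads(param)
        if not isinstance(decoded, dict):
            raise ValueError("param must be a JSON object")
        result.update(decoded)
    return result


extra = parse_extra_params('{"batch_size": 8}',
                           defaults={"batch_size": 1, "learning_rate": 0.001})
print(extra["batch_size"], extra["learning_rate"])
```

Merging over defaults keeps the script usable when param is left empty.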

    A sample training script for the iris dataset, iris_train.py, is as follows:

    # -*- coding: utf-8 -*-
    
    import getopt
    import glob
    import os
    import sys
    
    import keras
    
    import horizontal.horizontallearning as hl
    
    
    def train():
        # Parse the command-line arguments
        jobParam = JobParam()
        jobParam.parse_from_command_line()
        job_type = 'evaluation' if jobParam.eval_only else 'training'
        print(f"Starting round {jobParam.round} {job_type}")
    
        # Load the model and set the initial weights
        model = keras.models.load_model(jobParam.model_file)
        hl.set_model_weights(model, jobParam.weights_file)
    
        # Load data, train, and evaluate -- implemented by the user
        print(f"Load data {jobParam.dataset_path}")
        train_x, test_x, train_y, test_y, class_dict = load_data(jobParam.dataset_path)
    
        if not jobParam.eval_only:
            b_size = 1
            model.fit(train_x, train_y, batch_size=b_size, epochs=jobParam.epoch, shuffle=True, verbose=1)
            print(f"Training job [{jobParam.job_id}] finished")
        eval_result = model.evaluate(test_x, test_y, verbose=0)
        print("Evaluation on test data: loss = %0.6f accuracy = %0.2f%% \n" % (eval_result[0], eval_result[1] * 100))
    
        # Collect the evaluation metrics; the result is saved in JSON format
        result = {}
        result['loss'] = eval_result[0]
        result['accuracy'] = eval_result[1]
    
        # Generate the result file
        hl.save_train_result(jobParam, model, result)
    
    
    # Read the CSV dataset and split it into a training set and a test set.
    # Parameter CSV_FILE_PATH: path to a CSV file, or to a directory of CSV files
    def load_data(CSV_FILE_PATH):
        import pandas as pd
        from sklearn.model_selection import train_test_split
        from sklearn.preprocessing import LabelBinarizer
    
        # Directory dataset: read every CSV file in the directory
        if os.path.isdir(CSV_FILE_PATH):
            print(f'read file folder [{CSV_FILE_PATH}]')
            all_csv_path = glob.glob(os.path.join(CSV_FILE_PATH, '*.csv'))
            all_csv_path.sort()
            csv_list = []
            for csv_path in all_csv_path:
                csv_list.append(pd.read_csv(csv_path))
            IRIS = pd.concat(csv_list)
        # Single CSV file
        else:
            IRIS = pd.read_csv(CSV_FILE_PATH)
        target_var = 'class'  # target variable
        # Feature columns of the dataset
        features = list(IRIS.columns)
        features.remove(target_var)
        # Classes of the target variable
        Class = IRIS[target_var].unique()
        # Dictionary mapping each class to an integer index
        Class_dict = dict(zip(Class, range(len(Class))))
        # Add a target column that holds the encoded target variable
        IRIS['target'] = IRIS[target_var].apply(lambda x: Class_dict[x])
        # One-hot encode the target variable
        lb = LabelBinarizer()
        lb.fit(list(Class_dict.values()))
        transformed_labels = lb.transform(IRIS['target'])
        y_bin_labels = []  # names of the one-hot encoded label columns
        for i in range(transformed_labels.shape[1]):
            y_bin_labels.append('y' + str(i))
            IRIS['y' + str(i)] = transformed_labels[:, i]
        # Split the dataset into a training set and a test set
        train_x, test_x, train_y, test_y = train_test_split(IRIS[features], IRIS[y_bin_labels],
                                                            train_size=0.7, test_size=0.3, random_state=0)
        return train_x, test_x, train_y, test_y, Class_dict
    
    
    class JobParam:
        """Training script parameters
        """
        # required parameters
        job_id = ''
        round = 0
        epoch = 0
        model_file = ''
        dataset_path = ''
        eval_only = False
    
        # optional parameters
        weights_file = ''
        output = ''
        param = ''
    
        def parse_from_command_line(self):
            """Parse the job parameters from the command line
            """
            opts, args = getopt.getopt(sys.argv[1:], 'hn:w:',
                                       ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
                                        'weights_file=', 'output=', 'param=', 'job_id='])
            for key, value in opts:
                if key in ['--round']:
                    self.round = int(value)
                if key in ['--epoch']:
                    self.epoch = int(value)
                if key in ['--model_file']:
                    self.model_file = value
                if key in ['--eval_only']:
                    self.eval_only = True
                if key in ['--dataset_path']:
                    self.dataset_path = value
                if key in ['--weights_file']:
                    self.weights_file = value
                if key in ['--output']:
                    self.output = value
                if key in ['--param']:
                    self.param = value
                if key in ['--job_id']:
                    self.job_id = value
    
    
    if __name__ == '__main__':
        train()
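
To check the argument handling without a compute node, the option parsing can be exercised locally. The sketch below reuses the same getopt option list in a standalone function. The argument values (job ID, file paths) are made up for illustration, and the exact command line that the compute node builds is not shown in this document.

```python
import getopt

# Same long options as in parse_from_command_line above.
LONG_OPTS = ['round=', 'epoch=', 'model_file=', 'eval_only', 'dataset_path=',
             'weights_file=', 'output=', 'param=', 'job_id=']


def parse_args(argv):
    """Parse training-script style options into a plain dict."""
    opts, _ = getopt.getopt(argv, 'hn:w:', LONG_OPTS)
    parsed = {'eval_only': False}
    for key, value in opts:
        name = key.lstrip('-')
        if name == 'eval_only':
            parsed[name] = True  # flag without a value
        elif name in ('round', 'epoch'):
            parsed[name] = int(value)
        else:
            parsed[name] = value
    return parsed


# Hypothetical arguments; real values are supplied by the compute node.
argv = ['--job_id', 'demo-job', '--round', '1', '--epoch', '5',
        '--model_file', '/tmp/tics1/model.h5',
        '--dataset_path', '/tmp/tics1/dataset/iris1.csv', '--eval_only']
args = parse_args(argv)
print(args)
```

An unknown option makes getopt raise getopt.GetoptError, which surfaces typos in option names early.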
support.huaweicloud.com/usermanual-tics/tics_02_0024.html