DeepObfusCode: Source Code Obfuscation Through Sequence-to-Sequence Networks - 郑瀚Andrew

Code obfuscation techniques address the problem of adversarial reverse engineering of code.

In essence, the goal of code obfuscation is to keep a program's logical structure intact and fully preserved while making the code hard for an attacker to understand, thereby protecting the software's integrity and intellectual property.

Traditional protection strategies include (a minimal illustration follows this list):

  • inserting blank/redundant logic operations
  • adding unnecessary conditional operations, etc.
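
For intuition, here is a minimal example of this style of obfuscation (our own sketch, not from the paper): redundant arithmetic plus an "opaque predicate", a condition that always evaluates true, guarding the real logic, with dead code in the unreachable branch.

# Minimal sketch (ours) of traditional obfuscation: redundant arithmetic,
# an opaque predicate and dead code. x*x >= 0 always holds for integers,
# so the else branch is unreachable noise that exists only to confuse readers.
def compute(n: int) -> int:
    x = n * 7 - n * 6                           # redundant arithmetic: x == n
    if x * x >= 0:                              # opaque predicate: always true
        return x + 1                            # the real logic
    else:
        return sum(i ^ 0x5A for i in range(x))  # dead code, never executed

print(compute(41))  # 42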

The biggest problem with traditional obfuscation is that it can be reverse engineered: as long as an attacker can see the source code of the obfuscation routine, the hand-written logic, however complex, is in principle always invertible, and given enough time a working de-obfuscation function can always be written.

The paper proposes DeepObfusCode, which uses sequence-to-sequence neural networks to implement a symmetric encryption scheme with both encryption and decryption capabilities.

  • Encryption function generation: the weights of the sequence-to-sequence network are produced by a random function, which makes the network-backed encryption function itself non-reproducible. Under this scheme, even an attacker who can see the entire neural-network encryption routine cannot easily write an inverse function.
  • Decryption function generation: a sequence-to-sequence network can learn statistical patterns from an encoded-string/decoded-string dataset. By feeding in the "ciphertext-plaintext" sequence pair and fitting the network via backpropagation (a maximum-likelihood fit), the resulting nonlinear function is in effect a decryption function, persisted and fixed in the form of the network's weight file.
  • Ciphertext execution: once the network is trained, external users simply load the model, use its prediction function to decrypt the ciphertext, obtain the plaintext in memory, and pipe it into an execution engine. They never need to know the plaintext behind the ciphertext. (The demo.py at the end of this post walks through exactly this flow.)

From an academic standpoint, this methodology advances obfuscation from incremental obfuscation to complete obfuscation.

Beyond the sequence-to-sequence model itself, additional obfuscation models can be stacked on top of it, and other deep learning methods could be substituted for the sequence-to-sequence model as the obfuscation engine.

The architecture can also be folded into larger frameworks or infrastructure to enable homomorphic encryption and to guarantee anonymity during code execution.

Reference:

https://arxiv.org/pdf/1909.01837.pdf 

This section first reviews traditional code obfuscation methods and how they are evaluated, then discusses what deep learning changes for obfuscation.

0x1: Code Obfuscation Methods

The purpose of code obfuscation is to mask the computational process and the logic behind software code, in order to protect trade secrets, intellectual property, or confidential data.

Traditionally, there are eight general approaches to code obfuscation:

  1. Name obfuscation
  2. Data obfuscation
  3. Code flow obfuscation
  4. Incremental obfuscation
  5. Intermediate code optimization
  6. Debug information obfuscation
  7. Watermarking
  8. Source code obfuscation

Source code obfuscation, the focus of this paper, is the process of hiding the meaning behind source code, so that even a third party who obtains the code cannot understand its actual logic.

Each branch of the above obfuscation methods has sub-techniques, shared across methods, for reducing the comprehensibility of code, including (a renaming example follows this list):

  1. Control ordering: changing the execution order of program statements
  2. Control computation: changing the control flow of the program, such as inserting dead code to increase code complexity
  3. Data aggregation: changing the access to data structures after their conversion into other types
  4. Renamed identifiers: replacing identifiers with meaningless strings as new identifiers to reduce readability and comprehension of what certain functions or methods do
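
As a concrete example of the fourth sub-technique (our own sketch, not from the paper), the two functions below are behaviorally identical, but the renamed version gives a reader no semantic handles to grab onto:

# Minimal sketch (ours) of the "renamed identifiers" sub-technique.
def monthly_payment(principal, annual_rate, months):
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -months)

def IlI1(O0O, OO0, I11):                        # same logic, meaningless names
    l = OO0 / 12
    return O0O * l / (1 - (1 + l) ** -I11)

assert monthly_payment(100000, 0.06, 360) == IlI1(100000, 0.06, 360)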

Existing approaches usually require manually rewriting the source code with the methods above, whereas the method proposed in this paper performs complete obfuscation in an essentially random fashion. A malicious attacker or reader cannot reverse engineer the code from the ciphertext alone. Meanwhile, the model weight file used for decryption (in effect, the key file) can be kept on the execution server, and the weight file itself is very hard to invert.

To quantitatively compare the proposed source code obfuscation method against existing methods, four metrics are used:

  1. Code potency: focuses on computational-complexity measures, with particular attention to control-flow and data obfuscation; typical examples are word counts and the number of misleading statements.
  2. Resilience: measures how well the obfuscated text withstands attacks by automated tools.
  3. Stealth: tests how hard the obfuscation is for a human to undo manually; in essence it checks how quickly an adversarial deobfuscator can detect it.
  4. Execution cost: measures (i) the incremental time required to obfuscate, and (ii) the incremental time required to execute.

0x2: Applications of Neural Networks

Neural networks have recently found a number of applications in cryptography, reflecting growing interest in using neural networks to encrypt data.

The method proposed here differs from prior deep-learning-based data encryption: the architecture opens up another paradigm of neural-network encryption, namely encrypting source code with a deep neural network, saving the network's weight file, and then using the resulting model file to decrypt and execute the code.

The DeepObfusCode architecture first uses an elementary recurrent neural network (RNN) encoder-decoder model whose weights are set by a random function; it takes the original code text as input and produces the obfuscated text.

A second RNN encoder-decoder model is then trained over many iterations on the [generated ciphertext → original plaintext] pair as its [encode → decode] dataset, yielding a set of model weights that serve as the decryption key.

0x1: Ciphertext generation

To generate the obfuscated code, we first take the original plaintext (the source code) and a full character set (a string containing all characters: letters, digits, and punctuation).

Overview of ciphertext generation. We pass the source code and a character set as inputs to initialize both character sets for the encoder-decoder model, then randomly assign weights to generate the ciphertext, an obfuscated version of the source code. 

Model: "model"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_1 (InputLayer)           [(None, None, 23)]   0           []                               
                                                                                                  
 input_2 (InputLayer)           [(None, None, 72)]   0           []                               
                                                                                                  
 lstm (LSTM)                    [(None, 256),        286720      ['input_1[0][0]']                
                                 (None, 256),                                                     
                                 (None, 256)]                                                     
                                                                                                  
 lstm_1 (LSTM)                  [(None, None, 256),  336896      ['input_2[0][0]',                
                                 (None, 256),                     'lstm[0][1]',                   
                                 (None, 256)]                     'lstm[0][2]']                   
                                                                                                  
 dense (Dense)                  (None, None, 72)     18504       ['lstm_1[0][0]']                 
                                                                                                  
==================================================================================================
Total params: 642,120
Trainable params: 642,120
Non-trainable params: 0
__________________________________________________________________________________________________
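
The parameter counts in the summary can be verified by hand (a sanity check we added; the shapes follow from the summary itself: 23 source characters, 72 target characters, 256 LSTM units):

# Sanity check (ours) of the Param # column above. A Keras LSTM has 4 gates,
# each with an input kernel, a recurrent kernel and a bias:
#   params = 4 * (units * (input_dim + units) + units)
units = 256
print(4 * (units * (23 + units) + units))  # encoder lstm   -> 286720
print(4 * (units * (72 + units) + units))  # decoder lstm_1 -> 336896
print(units * 72 + 72)                     # dense          -> 18504
print(286720 + 336896 + 18504)             # total          -> 642120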

0x2: Key generation

Once the ciphertext has been generated, it is used together with the original source code to generate the key.

Overview of key generation. With the known ciphertext and original source code, the developer of the source code would pass them as inputs into another encoder-decoder model and train over a number of iterations such that the model weights obtained can translate the obfuscated code into executable code, with validation of executability at the end. 

After training, the encoder and decoder are exported as model files in HDF5 format (the key files), and the metadata (the index-to-character and character-to-index dictionaries) is exported as well (pickled per the paper's description; the demo code below serializes it as JSON).

In essence, the key generation function K(p, c) takes the ciphertext c and the plaintext p as arguments and computes the model weights by minimizing a loss function; informally, K(p, c) = argmin_θ Loss(f_θ(c), p), where f_θ is the seq2seq model parameterized by the weights θ.

0x3: Source code execution

During live execution we have three inputs:

  • the obfuscated code (the ciphertext)
  • the key (the model files)
  • the metadata file

In our experiments the model and metadata files are kept separate; in a live system they could be combined into a single file.

When all three are passed to the model container, the returned output value is executed immediately, i.e. Exec(K(c, k)).

Overview of live execution. To run the obfuscated code on any server or system, one would pass in the obfuscated code into an execution engine that takes the ciphertext and the lodged model files as inputs to execute the withheld code. 
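
A minimal sketch of such an execution engine (our reconstruction, not verbatim from the repository), assuming the artifact file names written by the decryption() function below and the decode_seq helper from the repository's util module:

# Execution-engine sketch (ours): load the lodged key/metadata files,
# decrypt the ciphertext in memory, and execute the recovered plaintext.
import json
import numpy as np
from keras.models import load_model
from util import decode_seq  # repository helper, sketched at the end of this post

def execute_obfuscated(ciphertext):
    encoder = load_model('encoder_decryption_key.h5')
    decoder = load_model('decoder_decryption_key.h5')
    with open('target_chars.txt') as f:
        target_chars = json.load(f)
    with open('target_index_to_char_dict.txt') as f:
        # JSON stores dict keys as strings; restore the integer keys
        idx2ch = {int(k): v for k, v in json.load(f).items()}
    with open('target_char_to_index_dict.txt') as f:
        ch2idx = json.load(f)
    with open('max_len_target_sent.txt') as f:
        max_len = json.load(f)

    # one-hot encode the ciphertext with its own character inventory,
    # mirroring the tokenization used in decryption() below
    chars = sorted(set(ciphertext))
    onehot = np.zeros((1, len(ciphertext), len(chars)), dtype='float32')
    for k, ch in enumerate(ciphertext):
        onehot[0, k, chars.index(ch)] = 1

    plaintext = decode_seq(onehot, encoder, decoder, target_chars,
                           ch2idx, idx2ch, max_len)
    # Exec(K(c, k)): hand the recovered source to an execution pipeline
    # (Python exec here, assuming a Python payload)
    exec(plaintext)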

Advantages

  • Obfuscation strength is no weaker than that of traditional methods
  • Simpler and more flexible than hand-building an obfuscation framework
  • Stronger resistance to key enumeration/brute-force attacks than traditional symmetric ciphers, essentially because the key search space is enormous (see the back-of-envelope estimate after this list)
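
A back-of-envelope estimate of that search space (ours, not from the paper): the key is the full weight file of the seq2seq model above, i.e. 642,120 float32 parameters. Even if an attacker quantized each weight down to just two candidate values, enumeration would face:

# Key-space estimate (ours). Even at 1 bit per weight the space dwarfs
# the 2^256 key space of AES-256; in reality each weight is a 32-bit float.
import math
n_params = 642_120
print(f"2^{n_params} ≈ 10^{int(n_params * math.log10(2))} candidate keys")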

Disadvantages

  • Key generation takes longer than in traditional cryptographic algorithms
  • The invertibility and symmetry of encryption and decryption are not guaranteed by cryptographic mathematics, only by the probabilistic machinery of maximum-likelihood estimation, so in some extreme cases 100% correctness of the algorithm cannot be guaranteed.

demo.py

# -*- coding: utf-8 -*-

import encryption as enc

if __name__ == '__main__':
    # plaintext to protect: a PHP one-liner used as the demo payload
    source_code = "<?php eval($_POST['1']); ?>"

    # phase 1: obfuscate the source with a randomly weighted seq2seq model
    enc_source_code = enc.encryption(source_code)
    print("enc_source_code: ", enc_source_code)

    # phase 2: train the decryption key, then translate the ciphertext back
    translated_code = enc.decryption(enc_source_code, source_code)
    print("translated_code: ", translated_code)

encryption.py

import json
import numpy as np
from collections import Counter
from util import *
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import tensorflow as tf
gpu_limit = 0.2
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_limit)
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=gpu_limit)
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))


def encryption(source_code, iterations_to_crack=2000, randomnessIndex = 10, lossThreshold = 0.3):
    # note: only randomnessIndex is used below; iterations_to_crack and
    # lossThreshold are accepted but unused in this implementation
    fullcharacterset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890.,;:/\?!"
    source_sentences = []
    target_sentences = []
    source_chars = set()
    target_chars = set()
    nb_samples = 1
        
    source_line = str(source_code).split('\t')[0]
    target_line = '\t' + str(fullcharacterset) + '\n'
    source_sentences.append(source_line)
    target_sentences.append(target_line)
    for ch in source_line:
        if (ch not in source_chars):
            source_chars.add(ch)
    for ch in target_line:
        if (ch not in target_chars):
            target_chars.add(ch)
    target_chars = sorted(list(target_chars))
    print("target_chars: ", target_chars)
    source_chars = sorted(list(source_chars))
    print("source_chars: ", source_chars)
    source_index_to_char_dict = {}
    source_char_to_index_dict = {}
    for k, v in enumerate(source_chars):
        source_index_to_char_dict[k] = v
        source_char_to_index_dict[v] = k
    target_index_to_char_dict = {}
    target_char_to_index_dict = {}
    for k, v in enumerate(target_chars):
        target_index_to_char_dict[k] = v
        target_char_to_index_dict[v] = k
    source_sent = source_sentences
    print("source_sent: ", source_sent)
    target_sent = target_sentences
    print("target_sent: ", target_sent)
    max_len_source_sent = max([len(line) for line in source_sent])
    max_len_target_sent = max([len(line) for line in target_sent])

    tokenized_source_sentences = np.zeros(shape = (nb_samples,max_len_source_sent,len(source_chars)), dtype='float32')
    print("tokenized_source_sentences: ", tokenized_source_sentences)
    tokenized_target_sentences = np.zeros(shape = (nb_samples,max_len_target_sent,len(target_chars)), dtype='float32')
    print("tokenized_target_sentences: ", tokenized_target_sentences)
    target_data = np.zeros((nb_samples, max_len_target_sent, len(target_chars)),dtype='float32')
    for i in range(nb_samples):
        for k,ch in enumerate(source_sent[i]):
            tokenized_source_sentences[i,k,source_char_to_index_dict[ch]] = 1
        for k,ch in enumerate(target_sent[i]):
            tokenized_target_sentences[i,k,target_char_to_index_dict[ch]] = 1
            # decoder_target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i,k-1,target_char_to_index_dict[ch]] = 1
    print("tokenized_source_sentences: ", tokenized_source_sentences)
    print("tokenized_target_sentences: ", tokenized_target_sentences)

    # Encoder model
    encoder_input = Input(shape=(None,len(source_chars)))
    encoder_LSTM = LSTM(256,return_state = True)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM (encoder_input)
    encoder_states = [encoder_h, encoder_c]
    # Decoder model
    decoder_input = Input(shape=(None,len(target_chars)))
    decoder_LSTM = LSTM(256,return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(target_chars),activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])

    model.summary()

    # randomize the freshly initialized weights: repeatedly multiplying by
    # uniform noise makes the encryption function itself non-reproducible
    weights = [ w * np.random.rand(*w.shape) for w in model.get_weights()]
    for i in range(randomnessIndex):
        weights = [ w * np.random.rand(*w.shape) for w in weights] 
    # install the randomized weights back into the model
    model.set_weights(weights)

    # Inference models for testing
    # Encoder inference model
    encoder_model_inf = Model(encoder_input, encoder_states)
    # Decoder inference model
    decoder_state_input_h = Input(shape=(256,))
    decoder_state_input_c = Input(shape=(256,))
    decoder_input_states = [decoder_state_input_h, decoder_state_input_c]
    decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                     initial_state=decoder_input_states)
    decoder_states = [decoder_h , decoder_c]
    decoder_out = decoder_dense(decoder_out)
    decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                              outputs=[decoder_out] + decoder_states )

    for seq_index in range(1):
        inp_seq = tokenized_source_sentences[seq_index:seq_index+1]
        obfuscated_code = decode_seq(inp_seq, encoder_model_inf, decoder_model_inf, target_chars, target_char_to_index_dict, target_index_to_char_dict, max_len_target_sent)
        print('-')
        print('Input sentence:', source_sent[seq_index])
        print('Decoded sentence:', obfuscated_code)

    return obfuscated_code


def decryption(obfuscated_code, source_code):
    source_sentences = []
    target_sentences = []
    source_chars = set()
    target_chars = set()
    nb_samples = 1

    source_line = str(obfuscated_code).split('\t')[0]
    target_line = '\t' + str(source_code) + '\n'
    source_sentences.append(source_line)
    target_sentences.append(target_line)
    for ch in source_line:
        if (ch not in source_chars):
            source_chars.add(ch)
    for ch in target_line:
        if (ch not in target_chars):
            target_chars.add(ch)
    target_chars = sorted(list(target_chars))
    source_chars = sorted(list(source_chars))
    source_index_to_char_dict = {}
    source_char_to_index_dict = {}
    for k, v in enumerate(source_chars):
        source_index_to_char_dict[k] = v
        source_char_to_index_dict[v] = k
    target_index_to_char_dict = {}
    target_char_to_index_dict = {}
    for k, v in enumerate(target_chars):
        target_index_to_char_dict[k] = v
        target_char_to_index_dict[v] = k
    source_sent = source_sentences
    target_sent = target_sentences
    max_len_source_sent = max([len(line) for line in source_sent])
    max_len_target_sent = max([len(line) for line in target_sent])

    tokenized_source_sentences = np.zeros(shape = (nb_samples,max_len_source_sent,len(source_chars)), dtype='float32')
    tokenized_target_sentences = np.zeros(shape = (nb_samples,max_len_target_sent,len(target_chars)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_target_sent, len(target_chars)),dtype='float32')
    for i in range(nb_samples):
        for k,ch in enumerate(source_sent[i]):
            tokenized_source_sentences[i,k,source_char_to_index_dict[ch]] = 1
        for k,ch in enumerate(target_sent[i]):
            tokenized_target_sentences[i,k,target_char_to_index_dict[ch]] = 1
            # decoder_target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i,k-1,target_char_to_index_dict[ch]] = 1

    # Encoder model
    encoder_input = Input(shape=(None,len(source_chars)))
    encoder_LSTM = LSTM(256,return_state = True)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM (encoder_input)
    encoder_states = [encoder_h, encoder_c]
    # Decoder model
    decoder_input = Input(shape=(None,len(target_chars)))
    decoder_LSTM = LSTM(256,return_sequences=True, return_state = True)
    decoder_out, _ , _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(target_chars),activation='softmax')
    decoder_out = decoder_dense (decoder_out)
    model = Model(inputs=[encoder_input, decoder_input],outputs=[decoder_out])
    # early stopping -- see if reach certain loss yet
    from keras.callbacks import EarlyStopping
    from keras.callbacks import ModelCheckpoint
    # note: 'acc' is never computed (compile() below passes no metrics),
    # so this EarlyStopping callback is effectively inert and training
    # runs for the full number of epochs
    es = EarlyStopping(monitor='acc', mode='max', min_delta=1)
    # model exporting: checkpoint the lowest-loss weights seen so far
    mc = ModelCheckpoint('decryption_key_ckpt.h5', monitor='loss', mode='min', save_best_only=True)

    # Run training
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(x=[tokenized_source_sentences,tokenized_target_sentences], 
              y=target_data,
              batch_size=1,
              epochs=2000,   #50 # 1000 is perfect replica # 150 seems to do well for short strings (0.500 loss threshold)
              shuffle=True,
              callbacks=[es, mc]
             )
    model.save('decryption_key.h5')

    # Inference models for testing
    # Encoder inference model
    encoder_model_inf = Model(encoder_input, encoder_states)
    # Decoder inference model
    decoder_state_input_h = Input(shape=(256,))
    decoder_state_input_c = Input(shape=(256,))
    decoder_input_states = [decoder_state_input_h, decoder_state_input_c]
    decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, 
                                                     initial_state=decoder_input_states)
    decoder_states = [decoder_h , decoder_c]
    decoder_out = decoder_dense(decoder_out)
    decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                              outputs=[decoder_out] + decoder_states )
    encoder_model_inf.save('encoder_decryption_key.h5')
    decoder_model_inf.save('decoder_decryption_key.h5')

    for seq_index in range(1):
        inp_seq = tokenized_source_sentences[seq_index:seq_index+1]
        translated_code = decode_seq(inp_seq, encoder_model_inf, decoder_model_inf, target_chars, target_char_to_index_dict, target_index_to_char_dict, max_len_target_sent)
        print('-')
        print('Input sentence:', source_sent[seq_index])
        print('Decoded sentence:', translated_code)
    print("Ground truth sentence: ", source_code)
    print("Successful obfuscation: ", assertFunctionEquals(translated_code, source_code))

    with open('decryption_text.txt', 'w') as outfile:  
        json.dump(source_sent[seq_index], outfile)
    with open('target_chars.txt', 'w') as outfile:  
        json.dump(target_chars, outfile)
    with open('target_index_to_char_dict.txt', 'w') as outfile:  
        json.dump(target_index_to_char_dict, outfile)
    with open('target_char_to_index_dict.txt', 'w') as outfile:  
        json.dump(target_char_to_index_dict, outfile)
    with open('max_len_target_sent.txt', 'w') as outfile:  
        json.dump(max_len_target_sent, outfile)

    return translated_code
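
util.py is not reproduced in this post. A minimal sketch of the two helpers it would need to provide, decode_seq (the standard Keras character-level seq2seq inference loop) and assertFunctionEquals, assuming the call signatures used above (reconstruction ours, not verbatim from the repository):

import numpy as np

def decode_seq(inp_seq, encoder_model_inf, decoder_model_inf, target_chars,
               target_char_to_index_dict, target_index_to_char_dict,
               max_len_target_sent):
    # encode the input sequence into the LSTM state vectors [h, c]
    states = encoder_model_inf.predict(inp_seq)
    # start decoding from the start-of-sequence character '\t'
    target_seq = np.zeros((1, 1, len(target_chars)))
    target_seq[0, 0, target_char_to_index_dict['\t']] = 1
    decoded = ''
    while True:
        out, h, c = decoder_model_inf.predict([target_seq] + states)
        idx = int(np.argmax(out[0, -1, :]))
        ch = target_index_to_char_dict[idx]
        # stop at the end-of-sequence character or the length limit
        if ch == '\n' or len(decoded) >= max_len_target_sent:
            break
        decoded += ch
        # feed the predicted character back in, carrying the states forward
        target_seq = np.zeros((1, 1, len(target_chars)))
        target_seq[0, 0, idx] = 1
        states = [h, c]
    return decoded

def assertFunctionEquals(translated_code, source_code):
    # loose equality check between the decoded output and the ground truth
    return translated_code.strip() == str(source_code).strip()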

Reference:

https://github.com/dattasiddhartha/DeepObfusCode/tree/main
