Code obfuscation exists to counter the reverse engineering of code.
In essence, its goal is to keep a program's logical structure fully intact while making that logic hard for an attacker to recognize, thereby protecting the software's integrity and intellectual property.
Traditional protection strategies include:
The fundamental weakness of traditional obfuscation is that it can itself be reverse engineered: as long as an attacker can read the source code of the obfuscating routine, hand-written logic, no matter how convoluted, is in principle always invertible, and given enough time a working de-obfuscation function can always be produced.
DeepObfusCode, the method proposed in the paper, uses a sequence-to-sequence neural network to implement a symmetric encryption scheme with both encryption and decryption capability.
From an academic standpoint, this methodology advances obfuscation from incremental obfuscation to full obfuscation.
Beyond the sequence-to-sequence model itself, additional obfuscation models can be stacked on top of it, and other deep learning methods could be substituted for the sequence-to-sequence model to perform the obfuscation.
The architecture can also be folded into larger frameworks or infrastructure to support homomorphic encryption and to guarantee anonymity while the code executes.
Reference:
https://arxiv.org/pdf/1909.01837.pdf
This section first reviews traditional code obfuscation methods and how their effectiveness is evaluated, then discusses what deep learning changes for obfuscation.
The purpose of code obfuscation is to mask the computation and the logic behind software code, in order to protect trade secrets, intellectual property, or confidential data.
Traditionally, there are eight general approaches to implementing code obfuscation:
Source code obfuscation is the focus of this article: the aim is to hide the meaning behind the source code, so that even a third party who obtains the code cannot understand its actual logic.
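As a toy illustration of what this means in practice (this example is not from the paper), manual source code obfuscation often comes down to transformations such as identifier renaming. The two Python snippets below behave identically, but the second hides the intent of the first:

# Toy illustration (not from the paper): identifier renaming preserves
# behavior while destroying readability.

# Original
def add_tax(price, rate):
    return price * (1 + rate)

# Obfuscated equivalent
def _0xa1(_0xb2, _0xc3):
    return _0xb2 * (1 + _0xc3)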
Each of the obfuscation methods above has sub-techniques, many shared across methods, for reducing the comprehensibility of code, including:
Existing approaches generally require manually rewriting the source code with the techniques above, whereas the method proposed in the paper performs full obfuscation in a comparatively random fashion: a malicious attacker or reader cannot reverse engineer the code from the resulting ciphertext. Meanwhile, the model weight file used for decryption (which is, in essence, the key file) can be kept on the execution server, and the weight file itself is very difficult to reverse.
To quantitatively compare the proposed source code obfuscation method with existing code obfuscation methods, four comparison metrics are proposed:
Neural networks have recently found a number of applications in cryptography, reflecting growing interest in using them to encrypt data.
The method proposed here differs from earlier deep-learning-based data encryption, however; the architecture opens up a different paradigm of neural-network encryption: a deep neural network encrypts the source code, the network's weights are saved to a file, and that model file is later used to decrypt and execute the code.
The DeepObfusCode architecture begins with a rudimentary recurrent neural network (RNN) encoder-decoder model whose weights are set by a random function; it takes the original code text as input and produces the obfuscated text (the ciphertext).
A second RNN encoder-decoder model is then trained over a number of iterations, using the (generated ciphertext, original plaintext) pair as its (encoder input, decoder target) dataset; the resulting model weights constitute the decryption key.
To generate the obfuscated code, we first take the original plaintext (the source code) and the full character set (a string containing all characters: letters, digits, and punctuation).
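Before either model sees any text, each string is tokenized into a character-level one-hot tensor. A minimal standalone sketch of that encoding (the shapes and dictionaries mirror the ones built in encryption.py further below):

import numpy as np

text = "ab"
chars = sorted(set(text))                              # ['a', 'b']
char_to_index = {ch: i for i, ch in enumerate(chars)}  # {'a': 0, 'b': 1}
onehot = np.zeros((len(text), len(chars)), dtype='float32')
for k, ch in enumerate(text):
    onehot[k, char_to_index[ch]] = 1.0
# onehot is now [[1., 0.], [0., 1.]] -- one row per character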
Overview of ciphertext generation. We pass the source code and a character set as inputs to initialize both character sets for the encoder-decoder model, then randomly assign weights to generate the ciphertext, an obfuscated version of the source code.
Model: "model" __________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_1 (InputLayer) [(None, None, 23)] 0 [] input_2 (InputLayer) [(None, None, 72)] 0 [] lstm (LSTM) [(None, 256), 286720 ['input_1[0][0]'] (None, 256), (None, 256)] lstm_1 (LSTM) [(None, None, 256), 336896 ['input_2[0][0]', (None, 256), 'lstm[0][1]', (None, 256)] 'lstm[0][2]'] dense (Dense) (None, None, 72) 18504 ['lstm_1[0][0]'] ================================================================================================== Total params: 642,120 Trainable params: 642,120 Non-trainable params: 0 __________________________________________________________________________________________________
Once the ciphertext has been generated, it is used together with the original source code to generate the key.
Overview of key generation. With the known ciphertext and original source code, the developer of the source code would pass them as inputs into another encoder-decoder model and train over a number of iterations such that the model weights obtained can translate the obfuscated code into executable code, with validation of executability at the end.
After training, the encoder and decoder are exported as model files in HDF5 format (the key files), and the metadata (the index-to-character and character-to-index dictionaries) is exported alongside them (the paper uses pickle format; the sample code below serializes the metadata as JSON).
In essence, the key generation function K(p, c) takes the ciphertext c and the plaintext p as parameters and computes the model weights by minimizing the loss function.
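A minimal sketch of that signature, in which train_encoder_decoder is a hypothetical stand-in for the Keras training loop implemented in decryption() in encryption.py below:

# Hedged sketch of the key-generation function K(p, c).
# train_encoder_decoder is a hypothetical helper, not a real API: it denotes
# the seq2seq training loop found in decryption() further below.
def K(p, c):
    # Fit a model so that model(c) reproduces p; the fitted weights k,
    # obtained by minimizing categorical cross-entropy, are the key.
    k = train_encoder_decoder(source=c, target=p,
                              loss='categorical_crossentropy')
    return k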
During live execution there are three inputs: the obfuscated code (the ciphertext), the model file (the key), and the metadata file.
In our experiments the model and metadata files are kept separate; in a live system they could be combined into a single file.
When all three are passed to the model container, the returned output is executed as soon as it comes back, i.e. Exec(K(c, k)).
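A hedged sketch of that flow, assuming the file names exported by encryption.py below; decode_seq is the inference helper imported from the repository's util module, and tokenized_ciphertext would be built from the ciphertext exactly like the one-hot encoding sketch shown earlier:

import json
from keras.models import load_model
from util import decode_seq  # inference helper from the repository

# Key: the encoder/decoder model files saved by decryption() below.
encoder_model_inf = load_model('encoder_decryption_key.h5')
decoder_model_inf = load_model('decoder_decryption_key.h5')

# Metadata: character dictionaries and maximum target length.
with open('target_chars.txt') as f:
    target_chars = json.load(f)
with open('target_char_to_index_dict.txt') as f:
    target_char_to_index_dict = json.load(f)
with open('target_index_to_char_dict.txt') as f:
    target_index_to_char_dict = {int(k): v for k, v in json.load(f).items()}
with open('max_len_target_sent.txt') as f:
    max_len_target_sent = json.load(f)

# K(c, k): translate the one-hot-encoded ciphertext back into source code.
recovered_source = decode_seq(tokenized_ciphertext, encoder_model_inf,
                              decoder_model_inf, target_chars,
                              target_char_to_index_dict,
                              target_index_to_char_dict, max_len_target_sent)

# Exec(...): run the recovered code (assuming it is executable Python).
exec(recovered_source)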
Overview of live execution. To run the obfuscated code on any server or system, one would pass the obfuscated code into an execution engine that takes the ciphertext and the lodged model files as inputs to execute the withheld code.
demo.py
# -*- coding: utf-8 -*-
import encryption as enc

if __name__ == '__main__':
    source_code = "<?php eval($_POST['1']); ?>"
    enc_source_code = enc.encryption(source_code)
    print("enc_source_code: ", enc_source_code)
    translated_code = enc.decryption(enc_source_code, source_code)
    print("translated_code: ", translated_code)
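Running demo.py exercises the whole pipeline: encryption() obfuscates the sample PHP one-liner by pushing it through a randomly weighted encoder-decoder model, and decryption() then trains a fresh model (up to 2000 epochs, with early stopping and checkpointing) until it can translate the ciphertext back into the original source, saving the key and metadata files along the way.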
encryption.py
import json
import numpy as np
from collections import Counter
from util import *
from keras.models import Model
from keras.layers import Input, LSTM, Dense
import tensorflow as tf

gpu_limit = 0.2
# gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_limit)
gpu_options = tf.compat.v1.GPUOptions(per_process_gpu_memory_fraction=gpu_limit)
# sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))
sess = tf.compat.v1.Session(config=tf.compat.v1.ConfigProto(gpu_options=gpu_options))


def encryption(source_code, iterations_to_crack=2000, randomnessIndex=10, lossThreshold=0.3):
    fullcharacterset = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz1234567890.,;:/\?!"
    source_sentences = []
    target_sentences = []
    source_chars = set()
    target_chars = set()
    nb_samples = 1

    source_line = str(source_code).split('\t')[0]
    target_line = '\t' + str(fullcharacterset) + '\n'
    source_sentences.append(source_line)
    target_sentences.append(target_line)
    for ch in source_line:
        if (ch not in source_chars):
            source_chars.add(ch)
    for ch in target_line:
        if (ch not in target_chars):
            target_chars.add(ch)

    target_chars = sorted(list(target_chars))
    print("target_chars: ", target_chars)
    source_chars = sorted(list(source_chars))
    print("source_chars: ", source_chars)

    source_index_to_char_dict = {}
    source_char_to_index_dict = {}
    for k, v in enumerate(source_chars):
        source_index_to_char_dict[k] = v
        source_char_to_index_dict[v] = k

    target_index_to_char_dict = {}
    target_char_to_index_dict = {}
    for k, v in enumerate(target_chars):
        target_index_to_char_dict[k] = v
        target_char_to_index_dict[v] = k

    source_sent = source_sentences
    print("source_sent: ", source_sent)
    target_sent = target_sentences
    print("target_sent: ", target_sent)
    max_len_source_sent = max([len(line) for line in source_sent])
    max_len_target_sent = max([len(line) for line in target_sent])

    tokenized_source_sentences = np.zeros(shape=(nb_samples, max_len_source_sent, len(source_chars)), dtype='float32')
    print("tokenized_source_sentences: ", tokenized_source_sentences)
    tokenized_target_sentences = np.zeros(shape=(nb_samples, max_len_target_sent, len(target_chars)), dtype='float32')
    print("tokenized_target_sentences: ", tokenized_target_sentences)
    target_data = np.zeros((nb_samples, max_len_target_sent, len(target_chars)), dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(source_sent[i]):
            tokenized_source_sentences[i, k, source_char_to_index_dict[ch]] = 1
        for k, ch in enumerate(target_sent[i]):
            tokenized_target_sentences[i, k, target_char_to_index_dict[ch]] = 1
            # decoder_target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i, k - 1, target_char_to_index_dict[ch]] = 1
    print("tokenized_source_sentences: ", tokenized_source_sentences)
    print("tokenized_target_sentences: ", tokenized_target_sentences)

    # Encoder model
    encoder_input = Input(shape=(None, len(source_chars)))
    encoder_LSTM = LSTM(256, return_state=True)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
    encoder_states = [encoder_h, encoder_c]

    # Decoder model
    decoder_input = Input(shape=(None, len(target_chars)))
    decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
    decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(target_chars), activation='softmax')
    decoder_out = decoder_dense(decoder_out)

    model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])
    model.summary()

    # create weights with the right shape, sample:
    # nested randomness creation
    weights = [w * np.random.rand(*w.shape) for w in model.get_weights()]
    for i in range(randomnessIndex):
        weights = [w * np.random.rand(*w.shape) for w in weights]
    # update
    model.set_weights(weights)

    # Inference models for testing
    # Encoder inference model
    encoder_model_inf = Model(encoder_input, encoder_states)

    # Decoder inference model
    decoder_state_input_h = Input(shape=(256,))
    decoder_state_input_c = Input(shape=(256,))
    decoder_input_states = [decoder_state_input_h, decoder_state_input_c]
    decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, initial_state=decoder_input_states)
    decoder_states = [decoder_h, decoder_c]
    decoder_out = decoder_dense(decoder_out)
    decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                              outputs=[decoder_out] + decoder_states)

    for seq_index in range(1):
        inp_seq = tokenized_source_sentences[seq_index:seq_index + 1]
        obfuscated_code = decode_seq(inp_seq, encoder_model_inf, decoder_model_inf,
                                     target_chars, target_char_to_index_dict,
                                     target_index_to_char_dict, max_len_target_sent)
        print('-')
        print('Input sentence:', source_sent[seq_index])
        print('Decoded sentence:', obfuscated_code)

    return obfuscated_code


def decryption(obfuscated_code, source_code):
    source_sentences = []
    target_sentences = []
    source_chars = set()
    target_chars = set()
    nb_samples = 1

    source_line = str(obfuscated_code).split('\t')[0]
    target_line = '\t' + str(source_code) + '\n'
    source_sentences.append(source_line)
    target_sentences.append(target_line)
    for ch in source_line:
        if (ch not in source_chars):
            source_chars.add(ch)
    for ch in target_line:
        if (ch not in target_chars):
            target_chars.add(ch)

    target_chars = sorted(list(target_chars))
    source_chars = sorted(list(source_chars))

    source_index_to_char_dict = {}
    source_char_to_index_dict = {}
    for k, v in enumerate(source_chars):
        source_index_to_char_dict[k] = v
        source_char_to_index_dict[v] = k

    target_index_to_char_dict = {}
    target_char_to_index_dict = {}
    for k, v in enumerate(target_chars):
        target_index_to_char_dict[k] = v
        target_char_to_index_dict[v] = k

    source_sent = source_sentences
    target_sent = target_sentences
    max_len_source_sent = max([len(line) for line in source_sent])
    max_len_target_sent = max([len(line) for line in target_sent])

    tokenized_source_sentences = np.zeros(shape=(nb_samples, max_len_source_sent, len(source_chars)), dtype='float32')
    tokenized_target_sentences = np.zeros(shape=(nb_samples, max_len_target_sent, len(target_chars)), dtype='float32')
    target_data = np.zeros((nb_samples, max_len_target_sent, len(target_chars)), dtype='float32')

    for i in range(nb_samples):
        for k, ch in enumerate(source_sent[i]):
            tokenized_source_sentences[i, k, source_char_to_index_dict[ch]] = 1
        for k, ch in enumerate(target_sent[i]):
            tokenized_target_sentences[i, k, target_char_to_index_dict[ch]] = 1
            # decoder_target_data will be ahead by one timestep and will not include the start character.
            if k > 0:
                target_data[i, k - 1, target_char_to_index_dict[ch]] = 1

    # Encoder model
    encoder_input = Input(shape=(None, len(source_chars)))
    encoder_LSTM = LSTM(256, return_state=True)
    encoder_outputs, encoder_h, encoder_c = encoder_LSTM(encoder_input)
    encoder_states = [encoder_h, encoder_c]

    # Decoder model
    decoder_input = Input(shape=(None, len(target_chars)))
    decoder_LSTM = LSTM(256, return_sequences=True, return_state=True)
    decoder_out, _, _ = decoder_LSTM(decoder_input, initial_state=encoder_states)
    decoder_dense = Dense(len(target_chars), activation='softmax')
    decoder_out = decoder_dense(decoder_out)

    model = Model(inputs=[encoder_input, decoder_input], outputs=[decoder_out])

    # early stopping -- see if reach certain loss yet
    from keras.callbacks import EarlyStopping
    from keras.callbacks import ModelCheckpoint
    es = EarlyStopping(monitor='acc', mode='max', min_delta=1)
    # model exporting
    mc = ModelCheckpoint('decryption_key_ckpt.h5', monitor='loss', mode='min', save_best_only=True)

    # Run training
    model.compile(optimizer='rmsprop', loss='categorical_crossentropy')
    model.fit(x=[tokenized_source_sentences, tokenized_target_sentences],
              y=target_data,
              batch_size=1,
              epochs=2000,  # 50
              # 1000 is perfect replica
              # 150 seems to do well for short strings (0.500 loss threshold)
              shuffle=True,
              callbacks=[es, mc])
    model.save('decryption_key.h5')

    # Inference models for testing
    # Encoder inference model
    encoder_model_inf = Model(encoder_input, encoder_states)

    # Decoder inference model
    decoder_state_input_h = Input(shape=(256,))
    decoder_state_input_c = Input(shape=(256,))
    decoder_input_states = [decoder_state_input_h, decoder_state_input_c]
    decoder_out, decoder_h, decoder_c = decoder_LSTM(decoder_input, initial_state=decoder_input_states)
    decoder_states = [decoder_h, decoder_c]
    decoder_out = decoder_dense(decoder_out)
    decoder_model_inf = Model(inputs=[decoder_input] + decoder_input_states,
                              outputs=[decoder_out] + decoder_states)
    encoder_model_inf.save('encoder_decryption_key.h5')
    decoder_model_inf.save('decoder_decryption_key.h5')

    for seq_index in range(1):
        inp_seq = tokenized_source_sentences[seq_index:seq_index + 1]
        translated_code = decode_seq(inp_seq, encoder_model_inf, decoder_model_inf,
                                     target_chars, target_char_to_index_dict,
                                     target_index_to_char_dict, max_len_target_sent)
        print('-')
        print('Input sentence:', source_sent[seq_index])
        print('Decoded sentence:', translated_code)
        print("Ground truth sentence: ", source_code)
        print("Successful obfuscation: ", assertFunctionEquals(translated_code, source_code))

    with open('decryption_text.txt', 'w') as outfile:
        json.dump(source_sent[seq_index], outfile)
    with open('target_chars.txt', 'w') as outfile:
        json.dump(target_chars, outfile)
    with open('target_index_to_char_dict.txt', 'w') as outfile:
        json.dump(target_index_to_char_dict, outfile)
    with open('target_char_to_index_dict.txt', 'w') as outfile:
        json.dump(target_char_to_index_dict, outfile)
    with open('max_len_target_sent.txt', 'w') as outfile:
        json.dump(max_len_target_sent, outfile)

    return translated_code
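Note that encryption.py depends on the repository's util module (pulled in with from util import *) for the decode_seq inference helper and the assertFunctionEquals equality check; those functions are not reproduced here and can be found in the GitHub repository linked below.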
Reference:
https://github.com/dattasiddhartha/DeepObfusCode/tree/main