I have a mapping data analysis assistant

Author: Knownsec 404 Team
Chinese version: https://paper.seebug.org/3026/

1. Abstract

In May 2023, the ZoomEye team [1] released ZoomEyeGPT, a GPT-based Chrome browser extension that aims to provide an AI-assisted search experience for ZoomEye users.

ZoomEyeGPT applies GPT on the input side of ZoomEye search. Meanwhile, this article applies GPT on the output side of ZoomEye search results, using ChatGPT [2] to assist users in interpreting ZoomEye data results, improving their efficiency in analyzing and understanding data in practical business scenarios.

ChatGPT's general knowledge far exceeds that of any individual, but its specialist knowledge, and its ability to apply it, fall short of a professional's. ChatGPT cannot judge whether an IP address in ZoomEye search results is vulnerable, but it can carry out routine analysis and identification of the result data, and it does so more efficiently than a human.

First, we use ChatGPT's strengths in two practical scenarios to assist users in interpreting ZoomEye data results:

Assist users in analyzing mapping data, identifying software and hardware vendors, and expanding the fingerprint rule library, which can effectively improve analysis efficiency. ChatGPT is a qualified fingerprint rule annotation assistant!
Assist users in analyzing SSL certificate data and judging the organization and industry type to which the IP belongs; the results are relatively accurate.

Finally, we apply these methods to a business scenario. First, we obtain and download asset mapping data for the target network segment from the ZoomEye platform. Then we use the ChatGPT API to submit the mapping data to ChatGPT for analysis and interpretation. From ChatGPT's analysis results, we learn that an IP address in the target network segment uses the "Metabase" component, which may be affected by the "Metabase Remote Code Execution (CVE-2023-38646)" vulnerability, and that this IP address belongs to "Shanghai ** Health Technology Co., Ltd."

The application of ChatGPT in mapping data analysis described in this article is just the tip of the iceberg. In the process of using the ZoomEye platform, actual business scenarios will be more complex and diverse. If we can find points where ChatGPT's advantages can be combined with business scenarios, utilizing ChatGPT's abilities, it will undoubtedly produce twice the result with half the effort.

2. Overview

In daily operations, a cybersecurity incident response team receives a batch of vulnerability intelligence information titled "KNOWNSEC SAFETY BRAIN | Metabase Remote Code Execution (CVE-2023-38646) and 137 other vulnerability intelligence". They want to assess the potential impact of these vulnerabilities on IP addresses within their managed IP network range and determine the corresponding organizational entities.

The standard workflow involves the following steps: firstly, obtaining the names of the vulnerable components and analyzing their banner data features (also known as fingerprint rules); secondly, using the component fingerprint rules to identify IP addresses within the IP network range that utilize these vulnerable components, i.e., the potentially affected IP addresses; finally, determining the organizational entities associated with these IP addresses. If the vulnerability intelligence contains a large number of vulnerabilities, identifying the fingerprint rules for these vulnerable components manually can be time-consuming. Similarly, manually determining the organizational entities for the IP addresses is also a time-consuming task.

Therefore, we propose applying GPT to the output side of ZoomEye results and utilizing ChatGPT to assist users in interpreting the ZoomEye data results, thus enhancing work efficiency. In this article, we will explore the practical application of ChatGPT by combining it with this business scenario case.

3. Identification of Software and Hardware Fingerprints

For those familiar with the field of cyberspace mapping, it is understood that once the banner data of a specific IP address and port has been mapped and acquired, certain string features within the data's header, body, SSL certificate, etc., allow us to identify the type of device or software that the IP address is using. These string features are referred to as fingerprint rules (the terminology may vary across cyberspace search engine platforms).

Once these fingerprint rules are pre-set in cyberspace search engine platforms, users can search for specific software or hardware directly by product or manufacturer name, without memorizing the underlying fingerprint rules. For example, on the ZoomEye platform, users can enter "app:wordpress" to search for IP addresses running WordPress, without needing to know the specific fingerprint rules.
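To make the idea of a fingerprint rule concrete, here is a minimal sketch of substring-based matching. The `FingerprintRule` structure, the field names "header", "body" and "ssl", and the two sample rules are assumptions made for illustration; they are not ZoomEye's actual rule format or rule set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FingerprintRule:
    """One hypothetical rule: a substring expected in one banner field."""
    app: str      # label a user would search for, e.g. app:wordpress
    field: str    # banner part to inspect: "header", "body" or "ssl"
    needle: str   # characteristic string, matched case-insensitively


# Illustrative rules only; a real platform maintains a far larger set.
RULES = [
    FingerprintRule("wordpress", "body", "/wp-content/"),
    FingerprintRule("fortigate", "ssl", "CN=FortiGate"),
]


def identify(banner: dict) -> list:
    """Return the sorted app labels whose rule matches the banner."""
    hits = set()
    for rule in RULES:
        value = str(banner.get(rule.field, ""))
        if rule.needle.lower() in value.lower():
            hits.add(rule.app)
    return sorted(hits)


sample = {
    "header": "HTTP/1.1 200 OK\r\nServer: nginx",
    "body": '<link rel="stylesheet" href="/wp-content/themes/x/style.css">',
    "ssl": "",
}
print(identify(sample))  # ['wordpress']
```

A query such as "app:wordpress" can then be thought of as a lookup on the label, with the string matching already done at indexing time.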

To enhance the user search experience, each cyberspace search engine platform has invested substantial human and financial resources in refining its proprietary fingerprint rules database. However, no platform's fingerprint rules database can achieve 100% coverage of all global software and hardware vendors.

Consider the following scenarios:

User Requirement: The user needs to perform network reconnaissance in a specific area and retrieve mapping data from the platform. For unidentified software and hardware in the obtained results, the expectation is to leverage manual analysis to identify the respective vendors as accurately as possible.
Data Analyst Task: The platform's in-house data analysts are faced with mapping results where software and hardware identification was unsuccessful. Their objective is to expand the fingerprint rules database systematically, enabling identification of additional software and hardware.

In both scenarios, can ChatGPT assist in interpreting ZoomEye data, thereby enhancing data analysis efficiency? This section explores the application through sample cases.

3.1 Identifying ASUS

Firstly, we employ a relatively simple set of banner data, along with pre-prepared prompts, to guide ChatGPT in analyzing the data and outputting the desired results. From the screenshot below, it is evident that the answer provided by ChatGPT meets the requirements, correctly identifying the hardware manufacturer as "ASUS," specifying the model as "RT-ACRH13," and even providing additional information: the device type is a "router."

Diagram 3-1: Identifying ASUS

3.2 Identifying Fortinet

Next, we utilize slightly more complex banner data. From the screenshot below, it can be observed that ChatGPT successfully identifies the hardware device as "FortiGate," with the manufacturer being "Fortinet." Additionally, it provides an extra piece of information: this is a "network security device."

Diagram 3-2: Identifying Fortinet

3.3 Identifying SonicWall

Moving forward, we will utilize more intricate banner data. As depicted in the screenshot below, ChatGPT successfully identifies the vendor as "SonicWall". Additionally, ChatGPT informs us that this recognition is based on distinctive strings present in the Server header and HTML Content. Essentially, apart from providing an identification outcome, ChatGPT furnishes explicit identification rules by specifying the characteristic strings and their expected appearance within the banner fields.

Diagram 3-3: Identifying SonicWall

3.4 Identifying WatchGuard

In the previous examples, the banner data already explicitly indicated the manufacturer's name. Next, we will test with more challenging banner data where the manufacturer's name is not directly mentioned. From the screenshot below, it can be observed that ChatGPT still successfully identifies this banner data, determining the manufacturer as "WatchGuard" based on the content of the header title.

Diagram 3-4: Identifying WatchGuard

3.5 Identifying Cisco

Next, we continue with even more complex banner data.

During the initial attempt, ChatGPT failed to produce any identification. However, with some encouragement, ChatGPT successfully recognized the device as "Cisco SSL VPN" in the subsequent attempt and provided supporting evidence for the identification. This information proves to be highly valuable in augmenting our fingerprint rules for this specific device.

Diagram 3-5: Identifying Cisco 1

Diagram 3-6: Identifying Cisco 2

3.6 Identifying WordPress

Moving forward, we will attempt to analyze banner data associated with software. As depicted in the screenshot below, ChatGPT effectively identifies the result as "WordPress (content management system)" and furnishes the characteristic strings crucial for identification.

Diagram 3-7: Identifying WordPress

3.7 Identifying Cobalt Strike

Finally, we will analyze rather peculiar banner data in which an IP address is associated with a Cobalt Strike service. Remarkably, ChatGPT recognizes this service and conducts an in-depth analysis of its configuration content, extracting crucial configuration information.

Diagram 3-8: Identifying Cobalt Strike 1

Diagram 3-9: Identifying Cobalt Strike 2

3.8 Summary

Through the experiments in this chapter, it is evident that ChatGPT not only possesses the capability to identify software and hardware manufacturers through banner data but can also recognize the types of hardware and software. What is even more surprising is that ChatGPT can provide identification criteria, making judgments based on specific characteristic strings in certain field values of the banner data. This is equivalent to providing concrete fingerprint recognition rules.

Therefore, we believe that utilizing ChatGPT to assist users in analyzing mapping data, identifying software and hardware manufacturers, and expanding the fingerprint rule library can effectively enhance the efficiency of analysis. ChatGPT serves as a capable assistant for fingerprint rule labeling!

4. Interpreting SSL Certificate Data

4.1 Frustrations Encountered

When examining search results on the ZoomEye platform, some field data can be readily understood by ordinary engineers. For instance, the IP address, port, protocol, country, city, and mapping timestamp highlighted in the image below can be easily comprehended by an average IT engineer.

Diagram 4-1: Search Result 1

However, some field data contains information that may not be readily accessible and comprehensible to ordinary engineers, such as the certificate fields highlighted in the image below. The data within the certificate fields can be extensive, and the specific areas of interest may vary depending on the user's requirements. With the data from the certificate fields, users may not always be able to directly obtain the results they need. Instead, they may need to perform a secondary transformation of the certificate data to obtain the desired outcome.

Diagram 4-2: Search Result 2

Using the above screenshots as an example, if a user's requirement is to determine the organization name associated with the certificate used by a specific IP address, they may not be interested in all the data within the certificate fields. Instead, they would only need to examine the Subject data within the certificate fields. If the user sees that the Subject data is "CN=www.miraculoussolutions.com," they may not be able to directly identify the organization holding the certificate. Instead, they would need to visit "www.miraculoussolutions.com" and look up the owner of the domain in order to determine the organization name associated with the certificate.

4.2 ChatGPT-Assisted Interpretation

To address the challenges outlined in the previous section, we employ ChatGPT to assist users in interpreting ZoomEye data and subsequently obtain the organization name associated with the SSL certificate used by a specific IP address. This approach aims to enhance the efficiency of data interpretation for users.

Firstly, we export the ZoomEye search results in "json" format, selecting only two fields in the "field configuration": "ip" and "ssl".

Diagram 4-3: Download Result

Next, within ChatGPT, we input a prompt that has been refined over many iterations to guide ChatGPT in receiving, extracting, analyzing, and outputting data according to our specific requirements.

Diagram 4-4: Analyzing SSL data 1

Once ChatGPT indicates understanding of the request, we provide it with two JSON data entries, as shown below:

{
  "ip": "173.194.51.233",
  "ssl": "\n\n\nSSL Certificate\n ...Subject: CN=*.c.docs.google.com\n ..."
}
{
  "ip": "41.63.166.101",
  "ssl": "\n\n\nSSL Certificate\n ...Subject: CN=FortiGate,O=Fortinet Ltd.\n..."
}

Diagram 4-5: Analyzing SSL data 2

As shown in the above image, the response from ChatGPT largely meets the requirements. ChatGPT extracts information from the input SSL certificate data, focusing on the Subject field. It further extracts information from the O field and CN field within the Subject, ultimately identifying the certificate holder's organization name. Moreover, ChatGPT also provides information about the industry type and Chinese name of the organization. These two pieces of information can be readily understood by ordinary technical engineers. In scenarios involving a large volume of ZoomEye result data, this can effectively enhance the efficiency of user comprehension and data analysis.

Additionally, note that in the first JSON data entry the Subject field contains only the CN field and no O field. ChatGPT therefore infers the organization name from the value of the CN field.
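The extraction that ChatGPT performs here (read the Subject line, prefer O, otherwise fall back to the domain in CN) can also be approximated locally. The sketch below is a minimal illustration under stated assumptions: the helper names `parse_subject` and `holder_hint` are hypothetical, the Subject line is assumed to use the comma-separated `KEY=value` layout shown in the samples, and the "main domain" is naively taken as the last two labels, which is wrong for suffixes such as "gov.au" (a public suffix list would be needed for that). Mapping a domain to an organization name and industry still requires outside knowledge, which is the part ChatGPT contributes.

```python
import re


def parse_subject(ssl_text: str) -> dict:
    """Parse the 'Subject:' line of a certificate dump into a dict."""
    match = re.search(r"Subject:\s*(.+)", ssl_text)
    if not match:
        return {}
    # Split only on commas that begin a new KEY= pair, so values such as
    # "NBA Media Ventures, LLC" stay in one piece.
    parts = re.split(r",\s*(?=[A-Za-z]+=)", match.group(1).strip())
    fields = {}
    for part in parts:
        key, _, value = part.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def holder_hint(fields: dict) -> str:
    """Return O if present, else a naive main domain derived from CN."""
    org = fields.get("O")
    if org:
        return org
    cn = fields.get("CN", "").lstrip("*.")  # drop a wildcard prefix
    labels = cn.split(".")
    if len(labels) >= 2:
        return ".".join(labels[-2:])  # naive: last two labels only
    return cn


google = parse_subject("SSL Certificate\n ...Subject: CN=*.c.docs.google.com\n ...")
forti = parse_subject("SSL Certificate\n ...Subject: CN=FortiGate,O=Fortinet Ltd.\n...")
print(holder_hint(google))  # google.com
print(holder_hint(forti))   # Fortinet Ltd.
```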

Next, we will input two more JSON data entries, as shown below:

{
  "ip": "104.90.119.209",
  "ssl": "\n\n\nSSL Certificate\n ... Subject: C=US,CN=store.nba.com,L=New York,O=NBA Media Ventures, LLC,ST=New York\n ..."
}
{
  "ip": "144.53.243.70",
  "ssl": "\n\n\nSSL Certificate\n ... Subject: C=AU,CN=*.abs.gov.au,L=Belconnen,O=Australian Bureau of Statistics,ST=Australian Capital Territory\n ..."
}

Diagram 4-6: Analyzing SSL data 3

As shown in the above image, the response from ChatGPT is quite accurate. In particular, the determination of the industry type of the organization is quite precise, and in practical applications, it can significantly assist users in interpreting the data.

5. Practical Application Case

In this chapter, we will apply the methods outlined in the previous sections in a real-world business scenario, using ChatGPT to assist users in analyzing and interpreting ZoomEye data results.

5.1 Application Scenario

A certain network security regulatory department has obtained a batch of vulnerability intelligence information titled "KNOWNSEC SAFETY BRAIN | Metabase Remote Code Execution (CVE-2023-38646) and 137 Other Vulnerability Intelligence" [3]. They want to assess whether any IP addresses within their jurisdictional IP range may be affected by these vulnerabilities, and to determine which organizations these potentially affected IP addresses belong to.

5.2 Application Example

We will select a Class C subnet "8.37../24" within the jurisdictional IP range as a practical example. From the vulnerability intelligence information, we will choose the top three vulnerabilities: "Ruijie EG Gateway File Upload," "Cloudpanel Remote Code Execution (CVE-2023-35885)," and "Metabase Remote Code Execution (CVE-2023-38646)" as our practical examples.

First, we will use the ZoomEye platform to obtain network asset mapping data for this Class C subnet. Next, we will use ChatGPT to analyze and interpret the data, determining if any IP addresses within this range are using components associated with these three vulnerabilities, and thus, potentially affected by them. Finally, based on the SSL certificate information associated with each IP address, we will determine which organization the IP address belongs to.

Obtaining Network Mapping Data for Class C Subnet

In the ZoomEye platform, enter the following keywords to search for mapping data for the specified Class C subnet, with mapping dates after August 1, 2023.

cidr:"212.129.*.*/24" +after:"2023-08-01" +before:"2024-01-01"

Diagram 5-1: result of search

Next, we will proceed to download the search results. In the dialog box, we choose the data format as "JSON format" and select the fields "IP Address," "Port Number," "Banner," and "SSL." Rename the downloaded JSON file as "zoomeye_data.json."

Diagram 5-2: Download result
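Before any record is sent to ChatGPT, the downloaded file can be sanity-checked locally. The sketch below assumes the export holds one JSON object per line and that the selected fields appear under the keys "ip", "port", "ssl" and "banner" (the names used in the prompt later in this chapter); `parse_records` is a hypothetical helper, not part of any ZoomEye tooling.

```python
import json

EXPECTED_FIELDS = ("ip", "port", "ssl", "banner")


def parse_records(lines):
    """Parse JSON Lines text and report records with missing fields.

    `lines` is any iterable of strings, for example an open file.
    Returns (records, problems), where problems is a list of
    (line_number, missing_field_names) tuples.
    """
    records, problems = [], []
    for line_no, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the export
        record = json.loads(line)
        missing = [name for name in EXPECTED_FIELDS if name not in record]
        if missing:
            problems.append((line_no, missing))
        records.append(record)
    return records, problems


# Stand-in for the contents of zoomeye_data.json (documentation addresses).
demo_lines = [
    '{"ip": "192.0.2.1", "port": 443, "ssl": "", "banner": "HTTP/1.1 200 OK"}\n',
    "\n",
    '{"ip": "192.0.2.2"}\n',
]
records, problems = parse_records(demo_lines)
print(len(records), problems)
```

With the real file, the same function can be called as `parse_records(open("zoomeye_data.json", encoding="utf-8"))`.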

Analyzing and Interpreting Data

For the selected top three vulnerabilities: "Ruijie EG Gateway File Upload," "Cloudpanel Remote Code Execution (CVE-2023-35885)," and "Metabase Remote Code Execution (CVE-2023-38646)," the corresponding components are: "Ruijie," "Cloudpanel," and "Metabase."

We write Python code (code example in the next section) to read the mapping data from the JSON file line by line. We use the ChatGPT API to submit the mapping data to ChatGPT and instruct it to analyze and interpret the data according to the following steps: determine whether the IP uses any of the three components "Ruijie," "Cloudpanel," or "Metabase"; then, based on the SSL certificate content, determine the name and industry of the organization to which the IP belongs.

Each json data contains 4 fields:
1. The field "ip" means "IP address";
2. The field "port" means "port";
3. The field "ssl" means "the content of the SSL certificate corresponding to the IP address";
4. The field "banner" means "mapping banner data". 
Please perform data extraction and data analysis according to my requirements for each piece of json data:
1. According to the value of the field "banner" and the field "ssl", analyze the header, title, etc., and identify what
system or tool it uses. I call it "component name".
2. If the value of "component name" is one of "Ruijie", "Cloudpanel" or "Metabase", please continue. If not, tell
me: "does not match", then stop the analysis.
3. Based on the SSL certificate content in the field "ssl" value, extract the "Subject" field used to identify the
holder or subject of the certificate.
4. The "O" field in "Subject" field is used to identify the organizational name of the "certificate holder" (usually
an individual, organization or entity). If the "O" field is empty, continue. 
5. The "CN" field in the "Subject" field is used to identify the common name of the certificate holder, usually the
host name (Hostname) or domain name (Domain Name). If it is a domain name, please extract its main domain
name, and think that the name of the organization corresponding to the main domain name is the organization
name of the "certificate holder". 
6. The data does not reflect the organization’s industry information. Please use the organization name of the
"certificate holder" obtained in steps 3 and 4, combined with your own profound knowledge, to infer the
organization industry of the "certificate holder". 
7. If the value of "ssl" field or the value of "Subject" field is empty, the organization name of the "certificate
holder" is empty.
8. Finally, please tell me the result: "IP address", "port", "component name", organization name of "certificate
holder", organization industry of "certificate holder".

If an IP address uses any one of the components "Ruijie," "Cloudpanel," or "Metabase," then output the IP address, its port, the component name, the organization name, and the industry of the organization to which the IP belongs.

Based on the diagram shown below, the response from ChatGPT indicates that the Class C subnet contains an IP address "8.37.." with port 443 open. This IP address is associated with the "Metabase" component and is attributed to "Institute for Immunology," an organization operating in the field of immunology research.

Diagram 5-3: Result of running the code

In this practical application case, we first obtained and downloaded the asset mapping data of the target subnet from the ZoomEye platform. Next, we used the ChatGPT API to submit the mapping data to ChatGPT for analysis and interpretation. Based on ChatGPT's analysis results, we learned that within the target subnet there is an IP address using the "Metabase" component, which may be affected by the "Metabase Remote Code Execution (CVE-2023-38646)" vulnerability. The organization to which this IP address belongs is "Institute for Immunology," which meets the requirements of the practical application scenario.

Of course, ChatGPT's analysis results are not 100% accurate and may have false positives. However, when used as a data analysis assistant, ChatGPT can significantly assist users in processing data and improve work efficiency, making it highly competent for the task.
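One inexpensive guard against such false positives is to check that a component named in ChatGPT's answer actually occurs in the raw data it was given. The sketch below is a heuristic under stated assumptions: `needs_review` is a hypothetical helper, the answer is treated as free text, and a component is considered supported only if its name appears literally in the "banner" or "ssl" value. A mismatch does not prove the answer wrong; it only marks the record for manual review.

```python
COMPONENTS = ("Ruijie", "Cloudpanel", "Metabase")


def needs_review(answer: str, record: dict) -> bool:
    """True when the answer names a component absent from the raw data."""
    if "does not match" in answer:
        return False  # nothing was claimed, so nothing to verify
    raw = (str(record.get("banner", "")) + str(record.get("ssl", ""))).lower()
    text = answer.lower()
    for name in COMPONENTS:
        key = name.lower()
        if key in text and key not in raw:
            return True
    return False


record = {"ip": "192.0.2.10", "port": 443,
          "banner": "<title>Metabase</title>", "ssl": ""}
print(needs_review("component name: Metabase", record))    # False
print(needs_review("component name: Cloudpanel", record))  # True
```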

5.3 Code Example

The Python code example used in this practical application chapter is as follows:

"""Work with ChatGPT together.

package version:
pip install openai==0.28.0
pip install tiktoken==0.4.0

"""

import os
import json

import openai
import tiktoken

# set apikey of openai
openai.api_key = os.getenv("OPENAI_API_KEY")


class WorkWithChatGPT:
    """The class of work with ChatGPT together."""

    MODEL = "gpt-3.5-turbo"
    # Token budget for one JSON record (the fixed prompt is not counted).
    MAXTOKENS = 4000

    def __init__(self, data: dict):
        """The init function."""
        self.data = data

        # "or" guards against explicit null values in the export.
        self.ip = self.data.get("ip") or ""
        self.port = self.data.get("port") or 0
        self.ssl = self.data.get("ssl") or ""
        self.banner = self.data.get("banner") or ""

        self.tiktoken_enc = tiktoken.encoding_for_model(WorkWithChatGPT.MODEL)

    def work(self):
        """The entry function."""
        self.prepare_data()
        self.chat()

    def prepare_data(self):
        """Prepare the data."""
        self.normalize_ssl()
        self.normalize_banner()

    def normalize_ssl(self):
        """Keep the certificate text only up to the end of the Subject line."""
        ssl = self.ssl.strip()
        if not ssl:
            return

        pos_subject = ssl.find("Subject:")
        if pos_subject == -1:
            # No Subject line found: leave the certificate text unchanged.
            return
        pos_end = ssl.find("\n", pos_subject)
        if pos_end != -1:
            ssl = ssl[:pos_end]
        self.ssl = ssl
        self.data["ssl"] = self.ssl

    def normalize_banner(self):
        """Truncate the banner so the whole record fits the token budget."""
        data_exist = {
            "ip": self.ip,
            "port": self.port,
            "ssl": self.ssl,
            "banner": "",
        }
        count_tokens_exist = self.calc_count_tokens(json.dumps(data_exist))
        max_tokens_banner = max(0, WorkWithChatGPT.MAXTOKENS - count_tokens_exist)
        # Truncate by tokens, not by characters.
        tokens = self.tiktoken_enc.encode(self.banner)[:max_tokens_banner]
        self.banner = self.tiktoken_enc.decode(tokens)
        self.data["banner"] = self.banner

    def calc_count_tokens(self, string: str) -> int:
        """Count the tokens in a string."""
        return len(self.tiktoken_enc.encode(string))

    def chat(self):
        """Analyze json data with ChatGPT."""
        msg_system = "You are an IT technologist and cyberspace mapping data analyst"
        msg_user = """
            Each json data contains 4 fields:
            1. The field "ip" means "IP address";
            2. The field "port" means "port";
            3. The field "ssl" means "the content of the SSL certificate corresponding to the IP address";
            4. The field "banner" means "mapping banner data".

            Please perform data extraction and data analysis according to my requirements for each piece of json data:
            1. According to the value of the field "banner" and the field "ssl", analyze the header, title, etc., and identify what system or tool it uses. I call it "component name".
            2. If the value of "component name" is one of "Ruijie", "Cloudpanel" or "Metabase", please continue. If not, tell me: "does not match", then stop the analysis.
            3. Based on the SSL certificate content in the field "ssl" value, extract the "Subject" field used to identify the holder or subject of the certificate.
            4. The "O" field in "Subject" field is used to identify the organizational name of the "certificate holder"(usually an individual, organization or entity). If the "O" field is empty, continue.
            5. The "CN" field in the "Subject" field is used to identify the common name of the certificate holder, usually the host name (Hostname) or domain name (Domain Name). If it is a domain name, please extract its main domain name, and think that the name of the organization corresponding to the main domain name is the organization name of the "certificate holder".
            6. The data does not reflect the organization’s industry information. Please use the organization name of the "certificate holder" obtained in steps 3 and 4, combined with your own profound knowledge, to infer the organization industry of the "certificate holder".
            7. If the value of "ssl" field or the value of "Subject" field is empty, the organization name of the "certificate holder" is empty.
            8. Finally, please tell me the result: "IP address", "port", "component name", organization name of "certificate holder", organization industry of "certificate holder".

            After you confirm that you understand what I mean, I will provide you with a piece of json data.
            """
        msg_assistant = "I understand your requirements. Please provide a piece of json data."
        messages = [
            {"role": "system", "content": msg_system},
            {"role": "user", "content": msg_user},
            {"role": "assistant", "content": msg_assistant},
            # Send the record as real JSON rather than a Python repr.
            {"role": "user", "content": json.dumps(self.data, ensure_ascii=False)},
        ]
        completion = openai.ChatCompletion.create(
            model=WorkWithChatGPT.MODEL,
            messages=messages
        )

        answer = completion.choices[0].message["content"]
        print("-" * 50)
        if "does not match" in answer:
            print(f"{self.ip}, {self.port}, Not affected by the vulnerability.")
        else:
            print(f"{self.ip}, {self.port}, May be affected by the vulnerability.")
            print(answer)


def work(filename):
    """Work with ChatGPT together."""
    # The export holds one JSON object per line; skip blank lines.
    with open(filename, encoding="utf-8") as f:
        datas = [json.loads(line) for line in f if line.strip()]

    for data in datas:
        wwc = WorkWithChatGPT(data)
        wwc.work()


if __name__ == "__main__":
    work("zoomeye_data.json")

6. Conclusion

In this article, we demonstrated the effective use of ChatGPT to assist users in interpreting ZoomEye data results, and the practical results were promising. It significantly enhances users' efficiency in analyzing and understanding the data.

We acknowledge that ChatGPT's general knowledge capacity far exceeds that of humans. When it comes to routine analysis and recognition of ZoomEye data results, ChatGPT can perform the task with higher efficiency than humans.

In real-world scenarios using the ZoomEye platform, complexities and variations are inevitable. We believe that astute users can identify points of synergy with ChatGPT's strengths in their specific business contexts, thereby leveraging its capabilities to boost data analysis efficiency.

7. References

[1] ZoomEye - Network Space Search Engine: https://www.zoomeye.org

[2] ChatGPT (GPT-3.5): https://chat.openai.com

[3] KNOWNSEC SAFETY BRAIN | Detection of 137 Vulnerabilities including Metabase Remote Code Execution (CVE-2023-38646): https://mp.weixin.qq.com/s/MPRqzwv9I8tOWr1Hdf9DCg

This article is published by Seebug Paper. Please credit the source when reposting. Article URL: https://paper.seebug.org/3028/