IPDB Format
Introduction
The IPDB database is an IP database format designed and utilized by IPIP.net, renowned for its compact size, high query efficiency, and multi-language support.
Format Analysis
+--------------------------------+--------------------------------+
| MetaData Length (4byte)        | MetaData (Json Format)         |
+--------------------------------+--------------------------------+
| Node Chunk (Prefix Tree / Trie)                                 |
+--------------------------------+--------------------------------+
| Data Chunk                                                      |
+--------------------------------+--------------------------------+
- The file is divided into three parts: metadata, node block, and data block.
MetaData Analysis
Metadata is a JSON object containing the database build information and metadata required for querying. Below is an example:
{
  "build": 1632971142,        // Build time
  "ip_version": 1,            // IP database version (IPv4: 0x1, IPv6: 0x2)
  "languages": {
    "CN": 0                   // Languages and field offsets
  },
  "node_count": 8705098,      // Number of nodes
  "total_size": 90028407,     // Total size of the node block + data block
  "fields": [                 // Fields for each data set in the data block
    "country_name",
    "region_name",
    "city_name",
    "owner_domain",
    "isp_domain",
    "latitude",
    "longitude",
    "timezone",
    "utc_offset",
    "china_admin_code",
    "idd_code",
    "country_code",
    "continent_code"
  ]
}
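The layout and metadata described above can be turned into a small parser. The following is a minimal sketch, not the official SDK: it assumes the 4-byte length is a big-endian unsigned integer and that the stored JSON carries no comments (the `//` notes in the example are annotations only). The function name `split_ipdb` and the synthetic file are hypothetical.

```python
import json
import struct

NODE_SIZE = 8  # bytes per trie node, as stated in the node block section


def split_ipdb(raw):
    """Split raw file bytes into (metadata dict, node block, data block)."""
    # Assumption: the metadata length is a big-endian unsigned 32-bit integer.
    (meta_len,) = struct.unpack(">I", raw[:4])
    meta = json.loads(raw[4:4 + meta_len])
    body = raw[4 + meta_len:]
    # total_size covers the node block plus the data block.
    if len(body) != meta["total_size"]:
        raise ValueError("body length does not match total_size")
    node_bytes = meta["node_count"] * NODE_SIZE
    return meta, body[:node_bytes], body[node_bytes:]


# Tiny synthetic file (not real IPIP.net data) to exercise the function.
meta = {"build": 0, "ip_version": 1, "languages": {"CN": 0},
        "node_count": 1, "total_size": 8 + 5, "fields": ["country_name"]}
meta_json = json.dumps(meta).encode("utf-8")
raw = struct.pack(">I", len(meta_json)) + meta_json + bytes(8) + b"\x00\x03abc"

parsed, node_block, data_block = split_ipdb(raw)
print(parsed["fields"], len(node_block), len(data_block))
```

Checking `total_size` against the actual body length is a cheap way to reject truncated or corrupted files before any lookup is attempted.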
NodeBlock Analysis
The node block consists of a prefix tree (trie). Each node is 8 bytes and stores the offset to the next node. If a stored offset exceeds the number of nodes (node_count in the metadata), it points into the data block instead of to another node, indicating that a result has been found.
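To illustrate the node layout, here is a hedged sketch of reading a node. The text above only states that a node is 8 bytes; the split into two big-endian 4-byte entries (one per bit value) and the formula mapping a leaf value to a byte position are assumptions modelled on publicly available reader SDKs. All function names are hypothetical.

```python
import struct

NODE_SIZE = 8  # bytes per node


def read_child(body, node, bit):
    """Return the entry stored in `node` for branch `bit` (0 or 1).

    Assumption: a node holds two big-endian 32-bit entries, one per bit value.
    """
    (value,) = struct.unpack_from(">I", body, node * NODE_SIZE + bit * 4)
    return value


def is_leaf(value, node_count):
    """An entry larger than node_count refers to the data block."""
    return value > node_count


def data_offset(value, node_count):
    """Assumed mapping from a leaf entry to a byte position in node+data bytes."""
    return value - node_count + node_count * NODE_SIZE


# Two hand-written nodes: node 0 -> (node 1, leaf 10), node 1 -> (leaf 10, leaf 10).
body = struct.pack(">IIII", 1, 10, 10, 10)
node_count = 2
print(read_child(body, 0, 0), read_child(body, 0, 1))
print(is_leaf(10, node_count), data_offset(10, node_count))
```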
DataBlock Analysis
The data block stores IP database data. Identical data is stored only once to reduce redundancy.
+---------------------+--------------------------------------------------------------------------+
| Data Length (2byte) | Data Fields(<country>\t<province>\t<city>\t<isp>\t<country>\t<province>) |
+---------------------+--------------------------------------------------------------------------+
| Data Length (2byte) | Data Fields(<country>\t<province>\t<city>\t<isp>\t<country>\t<province>) |
+---------------------+--------------------------------------------------------------------------+
- In the data block, data is split into two parts: length and data.
- The data part uses \t to separate fields. In multi-language versions of the database, each language's fields are selected using the field offset listed for that language in the metadata.
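The record layout can be decoded in a few lines. This is a sketch under stated assumptions: the 2-byte length is taken to be big-endian, the payload is taken to be UTF-8, and a language's offset is taken to be the index of its first field in the tab-separated list. The helper names and sample record are hypothetical.

```python
import struct


def read_record(data, offset):
    """Read one length-prefixed record and split it on tab characters."""
    # Assumption: big-endian 16-bit length, UTF-8 encoded payload.
    (length,) = struct.unpack_from(">H", data, offset)
    payload = data[offset + 2:offset + 2 + length]
    return payload.decode("utf-8").split("\t")


def fields_for_language(values, meta, language):
    """Pick the slice of values that belongs to `language` and name the fields."""
    start = meta["languages"][language]
    names = meta["fields"]
    return dict(zip(names, values[start:start + len(names)]))


# Hypothetical two-language record: three CN fields followed by three EN fields.
meta = {"languages": {"CN": 0, "EN": 3},
        "fields": ["country_name", "region_name", "city_name"]}
payload = "中国\t北京\t北京\tChina\tBeijing\tBeijing".encode("utf-8")
data = struct.pack(">H", len(payload)) + payload

values = read_record(data, 0)
print(fields_for_language(values, meta, "EN"))
```

Note that the length prefix counts bytes, not characters, which matters for multi-byte text such as the CN fields here.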
Query Operation
- A CIDR address describes a network segment by combining an IP address with a prefix length, e.g., 10.0.0.0/8 denotes an 8-bit subnet mask (255.0.0.0). All IPs within a CIDR segment share the same network prefix, i.e., the same leading masked bits.
- IP addresses are treated as 32-bit/128-bit binary strings. The lookup walks the prefix tree in the node block from the root, one bit at a time starting with the most significant bit; once a CIDR match is found, it jumps to the data block and returns the corresponding data.
- More CIDR groupings result in a larger number of nodes.
- CIDRs should not overlap. Because the lookup stops at the first match, a nested CIDR (such as 10.0.0.0/16 inside 10.0.0.0/8) may never be reached once the shorter prefix has matched.
- For a detailed explanation of the query process, refer to the paper IPv4 route lookup on Linux.
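The query steps above can be sketched as a bit-by-bit walk. This is an illustrative sketch rather than the official SDK: it assumes big-endian 4-byte child entries, a big-endian 2-byte record length, UTF-8 payloads, and the leaf-to-offset formula used by public reader SDKs. The toy trie is a pure IPv4 tree searched from the root, and all names are hypothetical.

```python
import ipaddress
import struct


def lookup(body, node_count, address, start_node=0):
    """Walk the trie over the address bits, then read the matching record."""
    ip = ipaddress.ip_address(address)
    bits = ip.max_prefixlen           # 32 for IPv4, 128 for IPv6
    value = int(ip)
    node = start_node
    for i in range(bits):
        if node > node_count:         # entry points into the data block
            break
        bit = (value >> (bits - 1 - i)) & 1   # most significant bit first
        (node,) = struct.unpack_from(">I", body, node * 8 + bit * 4)
    if node <= node_count:
        return None                   # ran out of bits without a match
    # Assumed mapping from leaf entry to byte position in `body`.
    offset = node - node_count + node_count * 8
    (length,) = struct.unpack_from(">H", body, offset)
    return body[offset + 2:offset + 2 + length].decode("utf-8").split("\t")


# Toy database: 0.0.0.0/1 -> low, 128.0.0.0/2 -> mid, 192.0.0.0/2 -> high.
node_count = 2
data = b"\x00"                        # padding so every leaf entry > node_count
leaves = []
for record in (b"low", b"mid", b"high"):
    leaves.append(node_count + len(data))
    data += struct.pack(">H", len(record)) + record
nodes = struct.pack(">IIII", leaves[0], 1, leaves[1], leaves[2])
body = nodes + data

for addr in ("10.0.0.1", "130.0.0.1", "200.0.0.1"):
    print(addr, lookup(body, node_count, addr))
```

Because the walk stops as soon as an entry exceeds `node_count`, a /1 prefix is resolved after one bit, which is also why a nested, more specific CIDR behind it would never be consulted.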
Packaging Process
- Build the prefix tree and place different datasets into the data block in load order.
- Prefix trees are built according to IPv6 specifications: IPv4 data is prefixed with 96 bits (80 zero bits followed by 16 one bits), which corresponds to the IPv4-mapped IPv6 address prefix (::FFFF:). An IPv4 query can then jump directly to the node at bit position 96.
- In theory, an ipdb format database can store both IPv4 and IPv6 data in the same file, but no other CIDR records may exist on the ::FFFF: path, so that the IPv4 query path stays clear (modifying the query SDK can better support simultaneous IPv4/IPv6 queries).
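The packaging steps can be sketched with an in-memory trie. This is a simplified illustration, not the actual packer: nodes are kept as Python lists instead of serialized bytes, records are referenced by a hypothetical integer id, and overlapping CIDRs are rejected outright. It shows how an IPv4 CIDR is moved into the 128-bit space and how the IPv4 start node is found by walking 80 zero bits and 16 one bits.

```python
import ipaddress


def to_128bit_prefix(cidr):
    """Return (value, prefix_len) in 128-bit space; IPv4 goes under ::FFFF:."""
    net = ipaddress.ip_network(cidr)
    value = int(net.network_address)
    if net.version == 4:
        return (0xFFFF << 32) | value, net.prefixlen + 96
    return value, net.prefixlen


def build_nodes(entries):
    """Build a list of [child0, child1] nodes from (cidr, record_id) pairs."""
    nodes = [[None, None]]            # node 0 is the root
    for cidr, record_id in entries:
        value, prefix_len = to_128bit_prefix(cidr)
        node = 0
        for i in range(prefix_len):
            bit = (value >> (127 - i)) & 1
            child = nodes[node][bit]
            if isinstance(child, tuple):
                raise ValueError(f"{cidr} overlaps an earlier CIDR")
            if i == prefix_len - 1:
                if child is not None:
                    raise ValueError(f"{cidr} overlaps an earlier CIDR")
                nodes[node][bit] = ("record", record_id)
            elif child is None:
                nodes.append([None, None])
                nodes[node][bit] = len(nodes) - 1
                node = len(nodes) - 1
            else:
                node = child
    return nodes


def ipv4_start_node(nodes):
    """Follow 80 zero bits then 16 one bits; return the node at bit 96."""
    node = 0
    for i in range(96):
        node = nodes[node][0 if i < 80 else 1]
        if not isinstance(node, int):
            return None               # no IPv4 data, or a record blocks the path
    return node


nodes = build_nodes([("10.0.0.0/8", 0), ("2001:db8::/32", 1)])
print(len(nodes), ipv4_start_node(nodes))
```

A reader can compute the start node once at load time and begin every IPv4 lookup there, so only the 32 address bits are walked per query.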