RMSP Highlights


Finding Duplicates

Jared Deutsch

Duplicates are problematic in geostatistical estimation. Duplicates should be identified and cleaned from the data during exploratory data analysis. Duplicates may be present due to twinning of drill holes, duplicate samples taken from the same blast hole cuttings pile, or for a variety of other reasons. The spatial coordinates of the duplicates may be identical, or they may be very close together requiring a search tolerance to identify duplicate points.

We generate a small synthetic data set containing two duplicate clusters. The first cluster contains three points that are very close together, while the second contains two. The duplicates do not precisely overlap, so a spatial search tolerance is required to locate the duplicate clusters automatically.

import rmsp
import numpy as np

test_points = rmsp.PointData(data=np.array(
    [[1.0, 2.0, 2.01, 1.96, 1.32, 2.19, 1.3],
     [1.0, 1.67, 1.72, 1.71, 1.56, 1.3, 1.53],
     [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
     [9.0, 4.0, 6.0, 5.0, 7.0, 4.0, 7.5]]).T,
    columns=["x", "y", "z", "value"],
    x="x", y="y", z="z",
)

test_points.sectionplot("value", cmap="plasma", cbar_label="Value",
                        title="Data Set with Duplicates")
Figure: Location map of the synthetic data set with duplicates.

The example points with duplicates:

print(test_points)
   x     y     z    value
0  1.00  1.00  0.0  9.0
1  2.00  1.67  0.0  4.0
2  2.01  1.72  0.0  6.0
3  1.96  1.71  0.0  5.0
4  1.32  1.56  0.0  7.0
5  2.19  1.30  0.0  4.0
6  1.30  1.53  0.0  7.5

Duplicate clusters are identified here using a spatial tolerance of 0.1 m. This identifies four unique locations out of the seven provided points:

dups = rmsp.Duplicates(spatial_tolerance=0.1).fit(test_points)
print(dups.info())

# Outputs:
"""
Summary of Duplicate Analysis:
 Total Locations: 7
 Number Unique Locations: 4
 Max Count at Duplicate Locations: 3
"""

A deduplicated data set and a list of all identified duplicates can then be generated. In this case we remove duplicates by arithmetically averaging the duplicated values.

deduplicated, duplicates = dups.transform(test_points, default_method="arithmetic")
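The arithmetic averaging itself can be sketched with plain NumPy, assuming the cluster grouping from the example data (the indices below are the point indices of the two duplicate clusters and two unique locations; this is illustrative, not RMSP internals):

```python
import numpy as np

# Cluster assignment for the example: two unique points
# and two duplicate clusters found within the 0.1 m tolerance
clusters = [[0], [1, 2, 3], [4, 6], [5]]
values = np.array([9.0, 4.0, 6.0, 5.0, 7.0, 4.0, 7.5])

# Arithmetic averaging: one value per unique location
dedup_values = [values[c].mean() for c in clusters]
# [9.0, 5.0, 7.25, 4.0]
```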

The deduplicated data set contains the average value of each duplicate and is ready for further analysis:

print(deduplicated)
   x     y     z    value
0  1.00  1.00  0.0  9.00
1  2.00  1.67  0.0  5.00
2  1.32  1.56  0.0  7.25
3  2.19  1.30  0.0  4.00

The duplicates that were identified can be separately queried:

print(duplicates)
   x     y     z    value  Index To DeDup
1  2.00  1.67  0.0  4.0    1
2  2.01  1.72  0.0  6.0    1
3  1.96  1.71  0.0  5.0    1
4  1.32  1.56  0.0  7.0    2
6  1.30  1.53  0.0  7.5    2

The locations of these duplicate clusters can be plotted:

fig, ax, cax = test_points.sectionplot(title="Duplicate Clusters", s=50,
                                       grid=True, xlim=(0.75, 2.5),
                                       ylim=(0.75, 2.0))
duplicates.sectionplot_draw(ax, "Index To DeDup", marker="X", s=400,
                            lw=0, alpha=0.5)
Figure: Location map of the synthetic data set with duplicates identified by cluster.

This small example demonstrates how to locate and remove spatial duplicates. There are many possible approaches to removing duplicates prior to geostatistical analysis and resource modeling; the appropriate approach depends on why the duplicates are present in the database.
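For example, instead of arithmetic averaging one might keep the first (original) sample at each location, or take the median for robustness against a single erroneous assay. A minimal sketch of these alternatives, assuming the cluster grouping from the example data (illustrative only; not RMSP method names):

```python
import numpy as np

# Cluster assignment for the example data within the 0.1 m tolerance
clusters = [[0], [1, 2, 3], [4, 6], [5]]
values = np.array([9.0, 4.0, 6.0, 5.0, 7.0, 4.0, 7.5])

# Keep the first sample in each cluster, e.g. when later
# duplicates are check assays rather than original samples
keep_first = [values[c[0]] for c in clusters]
# [9.0, 4.0, 7.0, 4.0]

# Median: robust to a single outlying duplicate value
medians = [float(np.median(values[c])) for c in clusters]
# [9.0, 5.0, 7.25, 4.0]
```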

Interested in using RMSP for your resource modeling challenges?

  contact@resmodsol.com