SECURE MULTI-PARTY COMPUTATION OF DIFFERENTIALLY PRIVATE HEAVY HITTERS
According to an aspect, a method may include receiving a candidate value; in response to a received candidate value matching one of the entries in the table, incrementing a corresponding count; in response to the received candidate value not matching one of the entries in the table and the table not exceeding a threshold size, adding an entry to the table; in response to the received candidate value not matching one of the entries in the table and the table exceeding the threshold size, decrementing the counts in the table and deleting entries having a count of zero; adding noise to the corresponding counts in the entries of the table and deleting any noisy corresponding counts less than a threshold value; and outputting at least a portion of the table as the top-k value result set.
Services for performing analytics (e.g., statistics, aggregate queries, or the like) on sensitive data may involve sharing data with a third party. In some instances, it may not be desirable or feasible for one or more parties sharing data to share plaintext data. For example, the data may be sensitive data that is not permitted to be shared. In some instances, the parties sharing the data may be mutually distrusting parties. In other instances, use of a trusted third party may not be feasible as the trusted third party may become compromised.
SUMMARY
Methods, systems, and articles of manufacture, including computer program products, are provided for secure multiparty computations.
According to an aspect, a system includes at least one data processor and at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: generating, for a top-k value determination across a plurality of clients, a table including entries to map candidate values to corresponding counts; receiving, from each of the plurality of clients, a candidate value; in response to a received candidate value matching one of the entries in the table, incrementing, for the matching candidate value, a corresponding count; in response to the received candidate value not matching one of the entries in the table and the table not exceeding a threshold size, adding an entry to the table by adding the received candidate value with a count value of 1; in response to the received candidate value not matching one of the entries in the table and the table exceeding the threshold size, decrementing all of the counts in the table by 1 and deleting from the table any entries having a count of zero; adding noise to the corresponding counts in the entries of the table; in response to a noisy corresponding count being less than a threshold value, deleting the corresponding entry in the table for the noisy corresponding count; and outputting at least a portion of the table as the top-k value result set.
In some variations, one or more of the features disclosed herein including the following features can optionally be included in any feasible combination. The table may be sorted based on the noisy corresponding count before the outputting. The system may comprise or be comprised in a trusted server. The top-k value result set may be determined based on a multi-party computation using a domain of data across the plurality of clients, wherein the top-k value result set is determined over the domain of data. The system may utilize, to perform the multi-party computation, at least one compute node at a cloud provider or at least one compute node at one or more of the plurality of clients. The receiving, from each of the plurality of clients, the candidate value may include receiving the candidate value in a secured message, the secured message further including a partial noise value. The adding noise to the corresponding counts in the entries of the table may further include adding the noise based on the partial noise value from each of the plurality of clients. The outputting at least a portion of the table as the top-k value result set may further include outputting the at least a portion of the table in a secured message. The top-k value result set may be output in accordance with differential privacy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive. Further features and/or variations may be provided in addition to those set forth herein. For example, the implementations described herein may be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed below in the detailed description.
The accompanying drawings, which are incorporated in and constitute a part of this specification, show certain aspects of the subject matter disclosed herein and, together with the description, help explain some of the principles associated with the disclosed implementations.
Like labels are used to refer to same or similar items in the drawings.
DETAILED DESCRIPTION
Data collection is a primary function of many entities around the globe. For example, some entities offer a free service, such as Internet searching or a social network, and then monetize the collection of end-user data from those free services. However, unrestricted, general data collection that allows unique identification of an end-user may cause ethical and/or legal concerns under the data protection regulations of certain jurisdictions, such as the General Data Protection Regulation (GDPR). Specialized, privacy-preserving data collection may alleviate some of these data collection-related privacy concerns. For this reason, differential privacy (DP) may be used to provide a strong privacy guarantee. Moreover, secure multi-party computation (MPC) may be used in combination with differential privacy. The additional use of secure multi-party computation may improve accuracy without reducing privacy. Secure multi-party computation is a cryptographic tool that allows multiple parties to evaluate a function on data distributed among the parties such that only the function's result is revealed or shared among the parties (in other words, the input data is not shared among the parties). However, secure computation of a differential privacy mechanism may be considered generally less efficient, with potentially high communication and computation overhead.
Disclosed herein is a solution to the problem of distributed private learning of the top-k values, such as the k most common values (which are also referred to as the top-k “heavy hitters”). The term “k” refers to how many of the top values, such as the top 1 (k=1), the top 2 values (k=2), and so forth, are in the result set. For example, multiple, distributed users may determine (e.g., compute) the top-k values with high accuracy, a strong privacy guarantee, and without resorting to trusted third parties for holding and sharing a user's private data for the calculation. To that end, in some embodiments, there is provided a secure multi-party computation (MPC) of differentially private (DP) top-k values.
In some embodiments, there is provided a first protocol FHH1 and a second protocol FHH2. These protocols securely compute the top-k values in a differentially private way, without disclosing a party's private information during the computation, while providing differential privacy protection for the computation output. Moreover, the HH1 and HH2 protocols may be considered highly accurate even for small data sets (of, for example, a few users), which is a challenging regime for differential privacy, and/or may be considered to have a practical run time (e.g., an efficient, optimized compute implementation).
In the following, FHH1 refers to a so-called “ideal” functionality as run by a trusted third party, for example, while HH1 refers to an example MPC implementation of FHH1 which replaces the trusted third party with cryptographic protocols. FHH1 combines heavy hitter detection (e.g., top-k value(s)) with a differentially private (DP) bounded count release (e.g., releasing values whose noisy count exceeds a threshold). Moreover, efficient, secure computation implementation examples for FHH1 are also provided in the form of the HH1 algorithm (see, e.g., Tables 5 and 7 below). The functionality for the second protocol, FHH2, combines distributed heavy hitter detection with a central differential privacy heavy hitter detection. Moreover, efficient, secure computation implementations for FHH2 are also provided in the form of the HH2 algorithm (see, e.g., Tables 8 and 9 below).
Differentially private top-k value discovery can be used in a variety of environments. For example, user behavior data mining may be performed across multiple parties, such as users, in a differentially private manner. The user behavior mining may include determining, for example, frequently typed words on a client device (which can be used to improve auto-complete suggestions), detecting user selections or settings of a given application, and the like. To illustrate further, differentially private telemetry data collection may be deployed to client devices to allow answering queries such as what are the top-k items among users at the client devices. End users can execute the secure, privacy-preserving analytics over their combined data, without revealing any of their own data to anybody else (due to the disclosed efficient secure computation). Moreover, the data need not be shared with a trusted third party to obtain a result to the query for the top-k values. To illustrate further, queries such as what are the top-k most commonly accessed applications or the top-k most searched-for products (or, e.g., most viewed, bought, or returned products) can be answered using the secure MPC DP protocols HH1 and HH2 disclosed herein without violating the privacy of any individual user's private data. The privacy-preserving protocols for the top-k values disclosed herein may also be used in computations to gather information not only from end users at a single entity, such as a company's end-users, but across different entities (e.g., different companies which do not normally share private data) and their corresponding end users, while providing strong privacy and security guarantees between the entities (and/or the end users). For example, information may be computed from different companies (while not sharing the private information of any of the companies) to provide holistic insights for an industry sector.
Before providing additional details regarding the FHH1 and FHH2 protocols (and the corresponding HH1 and HH2 implementations), the following provides a description of differential privacy and secure multi-party computation.
In some of the examples disclosed herein, a party or a user may refer to a client machine (or device), such as a computer, Internet of Things (IoT) device, and/or other processor-based machine. Given, for example, a set of n parties, this can be represented as follows:

P = {P1, . . . , Pn},

where each party Pi holds at least a single data value di, where i varies from 1 to n, and D denotes the combined data set of the parties P. The combined data set may be modelled as D = {d1, . . . , dn}, wherein d1, d2, . . . , dn are data values (or more simply “data”) of the data domain U (e.g., a data universe U representing a set of potential data values, such as the set of all integers, a set of integers, or the like).
Secure multi-party computation (which is also referred to as multi-party computation, or MPC) may enable the set of parties P = {P1, . . . , Pn} to jointly compute a function, such as a median, mode, top-k, or other type of function or operation, without each party sharing their data set with the other parties. In MPC, for example, each party may take part in the computation by providing, or exchanging, a secure input message (e.g., a secret share) with the other parties, such that the parties operate on those messages to jointly compute the function. The computation yields a final encrypted output, which can be decrypted (e.g., using secret shares) by each of the parties to reconstruct or reveal the result, without each of the parties revealing their private data. Secret sharing refers to distributing a secret among a group of parties, each of which is allocated a share of the secret such that the secret can be reconstructed only when a sufficient number of shares are combined together. To illustrate secret sharing, Shamir's Secret Sharing (SSS) can be used to secure a secret in a distributed way (e.g., the secret is split into multiple parts, or shares, which are used to reconstruct the original secret), as sketched below.
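As an illustration only, the following is a minimal sketch of Shamir's Secret Sharing over a prime field; the function names, the field prime, and the (n, t) parameters are illustrative assumptions and not part of the protocols disclosed herein.

```python
import random

PRIME = 2**61 - 1  # a Mersenne prime, large enough for the toy values here

def share(secret, n, t):
    """Split `secret` into n Shamir shares, any t of which reconstruct it."""
    # Random polynomial of degree t-1 with constant term equal to the secret.
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def reconstruct(shares):
    """Lagrange interpolation at x=0 over any t shares recovers the secret."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# Example: split the secret 42 into 5 shares; any 3 of them recover it.
s = share(42, n=5, t=3)
assert reconstruct(s[:3]) == 42
```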
In the case of the FHH1 and FHH2 protocols (and the corresponding HH1 and HH2 implementations), the function being computed is the top-k values among the data sets of the parties. Given, for example, that each party Pi holds sensitive input data di, the set of parties may jointly compute the top-k function y1, . . . , yk = ƒ(d1, . . . , dn), while maintaining the privacy of the inputs d1, . . . , dn. The output of the secure multi-party computation must be correct and secret; in other words, the correct top-k output values y1, . . . , yk must be computed while the secrecy of the input data d1, . . . , dn is preserved among the parties, so that only the output is revealed to the parties.
Secure multi-party computation may be implemented under different trust assumption models. In the semi-honest (or passive) model, the parties (also referred to as adversaries) do not deviate from the protocol but may gather everything created during the run of the protocol. In the malicious (or active) model, however, the parties can deviate from the protocol (e.g., alter messages).
Differential privacy (DP) provides, as noted, strong privacy guarantees by restricting what can be provided as an output. When a single data value of the input data set changes, for example, the effect on the output may be restricted or bounded, so that privacy is maintained. If an algorithm is differentially private, an observer seeing the algorithm's output would not be able to discern an input data value used to compute the output. Some form of randomization is an essential aspect of differential privacy, used to hide and maintain the privacy of a party's input data. In a mathematical or formal sense, differential privacy may be defined as shown in Table 1 below, although less formal definitions of differential privacy may satisfy the input data privacy required of differential privacy. The definition provided at Table 1 holds against an unbounded adversary; in the case of cryptography, however, the definition may also hold for a computationally bounded adversary. At Table 1, (ε, 0)-DP may be referred to as pure DP, whereas approximate DP allows an additional, additive privacy loss δ>0. Typically, δ is negligible in the size of the data. While pure DP mechanisms are presented, the protocols apply them in combination with δ-based thresholds and thus may satisfy approximate DP.
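For reference, the standard (ε, δ)-differential privacy condition requires, for every pair of neighboring data sets D and D′ (differing in a single party's value) and every set S of possible outputs of a mechanism M:

Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ.

Setting δ=0 yields pure (ε, 0)-DP as referenced above.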
Randomization may be provided by adding noise, which is an aspect used to achieve differential privacy, such that an individual's data may be hidden or obfuscated. For example, noise may be added to a function's output to provide a degree of differential privacy. One way of adding noise is the Laplace mechanism, in which the added noise is drawn from a Laplace distribution. In a mathematical or formal sense, the Laplace mechanism may be defined as shown in Table 2 below.
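For reference, the standard form of the Laplace mechanism for a function ƒ with sensitivity Δƒ adds noise scaled to Δƒ/ε:

ML(D) = ƒ(D) + Lap(Δƒ/ε),

where Lap(b) denotes the Laplace distribution with scale b and density (1/(2b))·e^(−|x|/b).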
An alternative to additive noise is probabilistic output selection via the Exponential mechanism. The Exponential mechanism (EM) computes selection probabilities according to exponentially weighted utility scores. The Exponential mechanism expands the application of differential privacy to functions with non-numerical output, or to cases where the output is not robust to additive noise, such as a median function. The Exponential mechanism is exponentially more likely to select “good” results, where “good” is quantified via a utility function u(D, r) which takes as input a database D ∈ Un and a potential output r ∈ R from a fixed set of arbitrary outputs R. Informally, the Exponential mechanism outputs elements with probability proportional to

exp(ε·u(D, r) / (2Δu)),

where Δu denotes the sensitivity of the utility function; higher utility means the output is more desirable and its selection probability is increased accordingly.
Taking the argmax over utility scores with additive noise from the Gumbel distribution is equivalent to the Exponential mechanism. The Gumbel mechanism, which has the same or similar output distribution as the Exponential mechanism, adds Gumbel-distributed noise to the utility scores and selects the output with the highest noisy score (the argmax of the noisy utility scores). In a formal or mathematical sense, the Gumbel mechanism MG may be defined as shown in Table 4 below.
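As an illustration of this equivalence (a sketch, not the Table 4 definition itself), the following samples from the Exponential mechanism directly and via the Gumbel-max trick; the function names and the use of NumPy are assumptions for illustration:

```python
import numpy as np

def exponential_mechanism(utilities, eps, delta_u, rng):
    # Select index i with probability proportional to exp(eps*u_i/(2*delta_u)).
    scores = eps * np.asarray(utilities, dtype=float) / (2 * delta_u)
    probs = np.exp(scores - scores.max())  # subtract max for numerical stability
    probs /= probs.sum()
    return rng.choice(len(utilities), p=probs)

def gumbel_mechanism(utilities, eps, delta_u, rng):
    # Equivalent selection: add Gumbel noise with scale 2*delta_u/eps to each
    # utility score and take the argmax of the noisy scores.
    noise = rng.gumbel(loc=0.0, scale=2 * delta_u / eps, size=len(utilities))
    return int(np.argmax(np.asarray(utilities, dtype=float) + noise))

rng = np.random.default_rng(0)
u = [3, 1, 4, 1, 5]
# Both mechanisms induce the same output distribution over the indices 0..4.
print(exponential_mechanism(u, eps=1.0, delta_u=1.0, rng=rng))
print(gumbel_mechanism(u, eps=1.0, delta_u=1.0, rng=rng))
```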
Noise, such as Laplace noise, exponential noise, and/or Gumbel noise may be used for the differential privacy mechanisms. These types of noise may be generated by multiple parties. For example, each party may provide a partial noise that, in combination with the noise from the other parties, becomes the noise required for differential privacy. The Laplace noise Laplace(b) may be expressed as the sum of n partial noise values as follows:

Laplace(b) = Σ_{j=1}^{n} (Y_{j1} − Y_{j2}),

where Y_{j1}, Y_{j2} are samples from the gamma distribution Gamma(1/n, b), and the gamma distribution with shape k=1/n and scale b has density as follows:

f(x) = x^{k−1} · e^{−x/b} / (Γ(k) · b^k), for x>0.

The Gumbel noise Gumbel(b) may be expressed as follows:

Gumbel(b) = −b · ln(n · min_{j∈[n]} Y_j),

where Y_j is sampled from the exponential distribution Expon(1), and the exponential distribution with scale b has density as follows:

f(x) = (1/b) · e^{−x/b},

for x>0, and 0 elsewhere.
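As a quick illustration of the Laplace decomposition above (a sketch only; the NumPy-based function and parameter names are assumptions), each party can sample its partial noise locally, and the sum of all partial noises is distributed as Laplace(b):

```python
import numpy as np

def partial_laplace_noise(n_parties, b, rng):
    # Each party j samples Y_j1, Y_j2 ~ Gamma(shape=1/n, scale=b) and
    # contributes the partial noise Y_j1 - Y_j2.
    y1 = rng.gamma(shape=1.0 / n_parties, scale=b, size=n_parties)
    y2 = rng.gamma(shape=1.0 / n_parties, scale=b, size=n_parties)
    return y1 - y2

rng = np.random.default_rng(42)
partials = partial_laplace_noise(n_parties=10, b=2.0, rng=rng)
total = partials.sum()  # distributed as Laplace(scale=2.0)
```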
In the local model 101B, each of the parties 120A-N (depicted as client devices “C1” . . . “CN”) locally applies the differential privacy algorithm and then sends anonymized values 121A-B to an untrusted server 122 for aggregation. In the case of the local model 101B, the accuracy of the output 125 may be limited as the randomization is applied multiple times. Hence, the local model 101B may require a relatively large number of users to achieve accuracy comparable to the central model.
In the case of the intermediate shuffle model 101C, a shuffler 130 is a trusted party added between the parties 120A-N and the server 122. The shuffler does not collude with any of the parties 120A-N. The shuffler permutes and forwards the randomized client values 132A-B. The permutation breaks the mapping between a client and her value, which reduces the randomization requirements. The accuracy of the shuffle model 101C may be between the accuracy of the local model 101B and the central model 101A; but in general, the shuffle model 101C is strictly weaker than the central model 101A. The centralized MPC model 101A may generally incur a high computation burden and communication overhead (which reduces efficiency and scalability to larger quantities of clients/parties). However, the centralized MPC model 101A may provide benefits over the other models, such as higher accuracy and stronger privacy (e.g., no disclosure of values to a third party).
The following provides a description of the first protocol FHH1, including an example MPC implementation HH1, and a description of the second protocol FHH2, including an example MPC implementation HH2. Although some of the examples refer to the use of a trusted third party server, the examples described herein may also be implemented with secure MPC.
In the example described next, each of a plurality of parties at client devices 202A-E holds a data value that serves as an input to the top-k computation.
The trusted server, which is represented by a compute server 210, creates a table 266A with entries that map the data values (labeled “values”) in the domain of data values to a corresponding count. For example, the compute server 210 creates table 266A, such that each entry includes a data value mapped to a count (see, e.g., Table 5 below at 1). In this example, the table 266A includes the single value 4 mapped to a count of 1, as only the single value 4 has been received, at 220A, from the first party P1 at the client device 202A. In other words, d is not yet an element of the table T (d∉T) and the table has capacity, so the value is added with a count of 1 (e.g., T[d]=1).
The compute server 210 processes the values received from the remaining parties in the same manner: when a received value matches a table entry, the corresponding count is incremented; when it does not match and the table has not exceeded its threshold size, the value is added with a count of 1; and when it does not match and the table has exceeded its threshold size, all counts are decremented by 1 and entries whose count reaches zero are deleted, yielding, in this example, table 266F.
In some embodiments, noise may be added to the counts, and values whose noisy count is below a threshold are removed. For example, the compute server may add noise to the counts in the table, such as table 266F, before providing an output of the top-k values. Moreover, noisy counts below a threshold may be removed. The thresholding may help ensure that only values contributed by a plurality of parties (rather than a single party) are released, which helps to protect privacy: enough participants are required so that a small change (a single value added or removed) does not alter the outcome with high probability (whereas multiple changes can alter the outcome). In other words, individual contributions are protected and not released; only aggregated contributions of many individuals (with additive noise) are released as an output (with high probability).
In the case of additive noise and a trusted third party in accordance with some example embodiments, FHH1 would further include adding noise to the counts in the table and removing any entries whose noisy count is below the threshold before releasing the output.
Table 5 provides an example implementation of FHH1 for the top-k values using additive noise and a trusted third party, in accordance with some example embodiments. In the example of Table 5, at line 3(a), noise is added to the counts as noted above. At line 3(b), a value i is removed from the table T unless its noisy count exceeds a threshold value, τ. And, at line 4, the remaining values in the table are sorted by their noisy counts and then released as an output.
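A minimal single-machine sketch of these steps (counting, noising, thresholding, sorting) is shown below; it assumes Laplace noise of scale Δ/ε and illustrative parameter names (table_size, tau), and is not the Table 5 implementation itself:

```python
import random

def laplace(b):
    # The difference of two exponentials with scale b is Laplace(b)-distributed.
    return random.expovariate(1 / b) - random.expovariate(1 / b)

def f_hh1(values, table_size, k, eps, tau, sensitivity=1):
    table = {}
    for d in values:                     # one candidate value per party
        if d in table:
            table[d] += 1                # match: increment the count
        elif len(table) < table_size:
            table[d] = 1                 # no match, room left: add with count 1
        else:
            for v in list(table):        # no match, table full: decrement all,
                table[v] -= 1            # evicting zero-count entries
                if table[v] == 0:
                    del table[v]
    # 3(a): add noise; 3(b): keep only entries whose noisy count exceeds tau.
    noisy = {v: c + laplace(sensitivity / eps) for v, c in table.items()}
    kept = {v: c for v, c in noisy.items() if c > tau}
    # 4: sort by noisy count and release (at most) the top-k values.
    return sorted(kept, key=kept.get, reverse=True)[:k]

print(f_hh1([4, 4, 2, 4, 7, 2], table_size=3, k=2, eps=1.0, tau=0.5))
```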
At Table 5 (as well as Table 7 below), the symbol Δ denotes the maximum number of counts an individual party can influence; for example, Δ=1 (when querying, e.g., countries of origin) or Δ≥1 (when querying, e.g., current and former employers).
In the case of an implementation where a trusted third party server is not used but instead MPC is used, the FHH1 algorithm described above may be implemented as the HH1 MPC algorithm depicted at Table 7 below. In other words, the HH1 MPC algorithm is similar to FHH1, but HH1 uses MPC rather than a trusted server, and the parties provide encrypted inputs (e.g., using secret shares or other cryptographic technology) including the value and a partial noise value, to enable the joint computation of the top-k values.
For example, the parties 202A-E may perform the joint computation of the top-k using MPC (also referred to as secure MPC) by exchanging messages securely (e.g., via secret sharing), where the secured messages represent the inputs to the MPC joint computation of the top-k. For example, the secure message from party 202A may include the value (e.g., “4” in the example of 220A) and a partial noise value. The parties operate on the secured input messages to yield a secured final output, such as the top-k. The secured final output is encrypted (secret shares) and can be decrypted by each of the parties to reconstruct or reveal the result, such as the top-k values. Although the parties may jointly compute the top-k function, the parties may also outsource this MPC processing to compute nodes (e.g., a plurality of cloud-service providers). The outsourcing allows a distribution of trust: there is no single fully trusted party but multiple semi-honest parties, and only if a majority of them were to collude, or were attacked/hacked, could the secrets be reconstructed. For this, the parties secret-share their inputs with the computation parties, and the computation parties execute the computation of HH1.
In the case of MPC for the top-k, the parties 202A-E may provide to each other an input message including the value and a partial noise value, wherein the input message is encrypted via a secret share. For example, party 202A may provide an encrypted message containing “4” and a partial noise value to the other parties 202B-E. The partial noise is added to the count value as noted above.
At Table 7, the MPC operations (or subprotocols) are listed in Table 6. The outputs of these subprotocols are encrypted (e.g., secret share), except for Rec(⋅), which reconstructs the output using the secret share (“decrypts”). Protected values are surrounded with angled brackets, such as <⋅>, which may be considered a form of encryption (via, e.g., secret share). The upper case letters in Table 7 denote arrays, where A[j] denotes the jth element in array A, for example. The array V holds the values, the array C holds the count, the array N holds the noise being added. The Boolean values (in the form of a bit) are indicated with bstate (e.g., bmatch=1 indicates a match).
In the example of the second protocol, a plurality of parties is divided into groups: a first group including a first party 302A and a second party 302B, and a second group including a third party 302C and a fourth party 302D. Each party holds a data value encoded as a 3-bit binary string: the first party 302A holds the encoded value 001, the second party 302B holds 100, and the third and fourth parties 302C-D each hold 001.
At 310, the compute server queries the first group and requests counts for an initial set of prefixes of length 2 (e.g., γ=1, η=1). For example, counts are requested for the initial set of prefixes (00, 01, 10, and 11). In response, the first party 302A responds, at 312A, with a count vector (1,0,0,0) representative of the prefix 00, which corresponds to the encoded data value 001 that it holds. And, the second party 302B responds, at 312B, with the count vector (0,0,1,0) representative of the prefix 10, which corresponds to the encoded data value 100 that it holds.
At 314, the compute server 210 adds the reported counts element-wise. For example, the count vectors (1,0,0,0) and (0,0,1,0) are added to yield (1,0,1,0), which indicates that the most frequently occurring prefixes are “00” and “10”. In some implementations, the compute server 210 may add noise to the counts and perform thresholding (as described below with respect to 325 and 330). If a noisy count does not exceed a threshold value at 330, the corresponding value and count are removed from the top-k candidates (see, e.g., Table 8 at 2D). In this example, both prefixes “00” and “10” survive and are extended.
Next, the compute server extends the 2-bit most frequently occurring prefixes by a given value, such as η=1 bit.
At 316, the compute server 210 queries the second group of parties 302C-D and requests counts for the prefix candidates (000, 001, 100, and 101). These prefix candidates correspond to the extended prefixes for 00 (which corresponds to the 3-bit prefixes 000 and 001) and the extended prefixes for 10 (which corresponds to the 3-bit prefixes 100 and 101). In response, the third party 302C responds, at 318A, with a count vector (0,1,0,0) representative of the prefix 001, which corresponds to the encoded data value 001 that it holds. And, the fourth party 302D responds, at 318B, with a count vector (0,1,0,0) representative of the prefix 001, which corresponds to the encoded data value 001 that it holds.
At 320, the compute server 210 adds the reported counts element-wise. For example, the count vectors (0,1,0,0) and (0,1,0,0) are added to yield (0,2,0,0), which indicates that the most frequently occurring prefix is 001.
At 325, the compute server 210 adds noise, such as Laplacian distributed noise and the like, to the counts. For example, the compute server may add noise to the aggregate counts, such as the counts in the aggregate vector (0,2,0,0). If a noisy count does not exceed a threshold value at 330, the corresponding value and count are removed from the top-k candidates (see, e.g., Table 8 at 2D). In this example, the prefix 001, whose noisy count exceeds the threshold, corresponds to a full-length value and may be output at 335 as the top value.
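A compact single-server sketch of this prefix-extension approach is shown below; the grouping strategy, the parameter names (init_len, eta, tau), and the Laplace noise are illustrative assumptions rather than the Table 8 implementation:

```python
import random

def laplace(b):
    # The difference of two exponentials with scale b is Laplace(b)-distributed.
    return random.expovariate(1 / b) - random.expovariate(1 / b)

def f_hh2(clients, bit_len, init_len, eta, eps, tau):
    """clients: one bit-string value per client, e.g. '001'."""
    rounds = 1 + -(-(bit_len - init_len) // eta)          # ceil: query rounds
    groups = [clients[i::rounds] for i in range(rounds)]  # disjoint client groups
    candidates = [format(i, f'0{init_len}b') for i in range(2 ** init_len)]
    length = init_len
    for group in groups:
        # Each client reports a one-hot count vector over the candidate
        # prefixes; the server adds the vectors element-wise.
        counts = {p: sum(v.startswith(p) for v in group) for p in candidates}
        # Add noise, then drop candidates whose noisy count is below tau.
        survivors = [p for p in candidates if counts[p] + laplace(1 / eps) >= tau]
        if length == bit_len:
            return survivors                              # full-length heavy hitters
        ext = min(eta, bit_len - length)                  # extend survivors by eta bits
        candidates = [p + format(i, f'0{ext}b')
                      for p in survivors for i in range(2 ** ext)]
        length += ext
    return candidates

random.seed(0)
print(f_hh2(['001', '100', '001', '001'], bit_len=3, init_len=2, eta=1,
            eps=5.0, tau=0.5))
```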
In the case of an implementation where a trusted third party server is not used but instead MPC is used, the FHH2 algorithm described above may be implemented as the HH2 MPC algorithm.
In the case of MPC, each party may respond with an answer plus a partial noise value. In the case of the first party at 312A, for example, it responds with (1+partialnoise1, 0+partialnoise2, 0+partialnoise3, 0+partialnoise4). At 314, the addition of the vectors across the first and second parties of the first group yields (1+fullnoise1, 0+fullnoise2, 1+fullnoise3, 0+fullnoise4), for example. And these responses (as well as other exchanges of information and the result output at 335) may be encrypted with a corresponding party's secret share.
Although a variety of noise mechanisms may be used, in the case of unrestricted sensitivity Δ, the Gumbel mechanism for noise may be used instead of Laplace noise at 2(d)(i) at Table 8.
In the case of an implementation where a trusted third party server is not used, the HH2 MPC algorithm described above may be implemented as depicted and described at Table 9 below. At Table 9, the MPC operations (or subprotocols) are listed in Table 6, which may be implemented as noted above. The sorting in line 9 can be implemented as a sorting network (based on conditional swaps), where the sorting result (a bit indicating that a value is smaller or larger) for C is re-used to sort V in the same way. The inputs and computation values are scaled integers (also known as a fixed-point representation), which allow for a more efficient secure implementation than floating point numbers.
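As an illustration of a conditional swap that re-uses one comparison bit for multiple arrays (a sketch with assumed names, not the Table 9 subprotocols), the swap can be written with only additions and multiplications, which maps directly onto secret-shared arithmetic:

```python
def cond_swap(b, x, y):
    # With b in {0, 1}: returns (x, y) if b == 0 and (y, x) if b == 1.
    # Only +, -, and * are used, so the operation maps onto secret-shared
    # arithmetic, and the (secret) comparison bit b can be re-used to swap
    # the entries of several arrays consistently.
    d = b * (y - x)
    return x + d, y - d

# One comparator of a descending sorting network applied in parallel to the
# count array C and the value array V (values encoded as integers):
C, V = [2, 5], [10, 42]
b = 1 if C[0] < C[1] else 0             # comparison bit (kept secret in MPC)
C[0], C[1] = cond_swap(b, C[0], C[1])
V[0], V[1] = cond_swap(b, V[0], V[1])   # same bit re-used for V
print(C, V)  # [5, 2] [42, 10]
```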
As shown in the figure, the computing system 500 can include a processor 510, a memory 520, a storage device 530, and input/output devices 540.
The processor 510, the memory 520, the storage device 530, and the input/output devices 540 can be interconnected via a system bus 550. The processor 510 is capable of processing instructions for execution within the computing system 500. Such executed instructions can implement one or more components of, for example, the trusted server, the client devices (parties), and/or the like. In some implementations of the current subject matter, the processor 510 can be a single-threaded processor. Alternatively, the processor 510 can be a multi-threaded processor. The processor 510 may be a multi-core processor having a plurality of processors or a single-core processor. The processor 510 is capable of processing instructions stored in the memory 520 and/or on the storage device 530 to display graphical information for a user interface provided via the input/output device 540.
The memory 520 is a computer readable medium, such as volatile or non-volatile memory, that stores information within the computing system 500. The memory 520 can store data structures representing configuration object databases, for example. The storage device 530 is capable of providing persistent storage for the computing system 500. The storage device 530 can be a floppy disk device, a hard disk device, an optical disk device, a tape device, or other suitable persistent storage means. The input/output device 540 provides input/output operations for the computing system 500. In some implementations of the current subject matter, the input/output device 540 includes a keyboard and/or pointing device. In various implementations, the input/output device 540 includes a display unit for displaying graphical user interfaces.
According to some implementations of the current subject matter, the input/output device 540 can provide input/output operations for a network device. For example, the input/output device 540 can include Ethernet ports or other networking ports to communicate with one or more wired and/or wireless networks (e.g., a local area network (LAN), a wide area network (WAN), the Internet).
In some implementations of the current subject matter, the computing system 500 can be used to execute various interactive computer software applications that can be used for organization, analysis and/or storage of data in various (e.g., tabular) format (e.g., Microsoft Excel®, and/or any other type of software). Alternatively, the computing system 500 can be used to execute any type of software applications. These applications can be used to perform various functionalities, e.g., planning functionalities (e.g., generating, managing, editing of spreadsheet documents, word processing documents, and/or any other objects, etc.), computing functionalities, communications functionalities, etc. The applications can include various add-in functionalities (e.g., SAP Integrated Business Planning add-in for Microsoft Excel as part of the SAP Business Suite, as provided by SAP SE, Walldorf, Germany) or can be standalone computing products and/or functionalities. Upon activation within the applications, the functionalities can be used to generate the user interface provided via the input/output device 540. The user interface can be generated and presented to a user by the computing system 500 (e.g., on a computer screen monitor, etc.).
One or more aspects or features of the subject matter described herein can be realized in digital electronic circuitry, integrated circuitry, specially designed ASICs, field programmable gate arrays (FPGAs), computer hardware, firmware, software, and/or combinations thereof. These various aspects or features can include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which can be special or general purpose, coupled to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device. The programmable system or computing system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
These computer programs, which can also be referred to as programs, software, software applications, applications, components, or code, include machine instructions for a programmable processor, and can be implemented in a high-level procedural and/or object-oriented programming language, and/or in assembly/machine language. As used herein, the term “machine-readable medium” refers to any computer program product, apparatus and/or device, such as for example magnetic discs, optical disks, memory, and Programmable Logic Devices (PLDs), used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term “machine-readable signal” refers to any signal used to provide machine instructions and/or data to a programmable processor. The machine-readable medium can store such machine instructions non-transitorily, such as for example as would a non-transient solid-state memory or a magnetic hard drive or any equivalent storage medium. The machine-readable medium can alternatively or additionally store such machine instructions in a transient manner, such as for example, as would a processor cache or other random access memory associated with one or more physical processor cores.
To provide for interaction with a user, one or more aspects or features of the subject matter described herein can be implemented on a computer having a display device, such as for example a cathode ray tube (CRT) or a liquid crystal display (LCD) or a light emitting diode (LED) monitor for displaying information to the user and a keyboard and a pointing device, such as for example a mouse or a trackball, by which the user may provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback, such as for example visual feedback, auditory feedback, or tactile feedback; and input from the user may be received in any form, including acoustic, speech, or tactile input. Other possible input devices include touch screens or other touch-sensitive devices such as single or multi-point resistive or capacitive track pads, voice recognition hardware and software, optical scanners, optical pointers, digital image capture devices and associated interpretation software, and the like.
In the descriptions above and in the claims, phrases such as “at least one of” or “one or more of” may occur followed by a conjunctive list of elements or features. The term “and/or” may also occur in a list of two or more elements or features. Unless otherwise implicitly or explicitly contradicted by the context in which it is used, such a phrase is intended to mean any of the listed elements or features individually or any of the recited elements or features in combination with any of the other recited elements or features. For example, the phrases “at least one of A and B;” “one or more of A and B;” and “A and/or B” are each intended to mean “A alone, B alone, or A and B together.” A similar interpretation is also intended for lists including three or more items. For example, the phrases “at least one of A, B, and C;” “one or more of A, B, and C;” and “A, B, and/or C” are each intended to mean “A alone, B alone, C alone, A and B together, A and C together, B and C together, or A and B and C together.” Use of the term “based on,” above and in the claims is intended to mean, “based at least in part on,” such that an unrecited feature or element is also permissible.
The subject matter described herein can be embodied in systems, apparatus, methods, and/or articles depending on the desired configuration. The implementations set forth in the foregoing description do not represent all implementations consistent with the subject matter described herein. Instead, they are merely some examples consistent with aspects related to the described subject matter. Although a few variations have been described in detail above, other modifications or additions are possible. In particular, further features and/or variations can be provided in addition to those set forth herein. For example, the implementations described above can be directed to various combinations and subcombinations of the disclosed features and/or combinations and subcombinations of several further features disclosed above. In addition, the logic flows depicted in the accompanying figures and/or described herein do not necessarily require the particular order shown, or sequential order, to achieve desirable results. For example, the logic flows may include different and/or additional operations than shown without departing from the scope of the present disclosure. One or more operations of the logic flows may be repeated and/or omitted without departing from the scope of the present disclosure. Other implementations may be within the scope of the following claims.
Claims
1. A system, comprising:
- at least one data processor; and
- at least one memory storing instructions which, when executed by the at least one data processor, result in operations comprising: generating, for a top-k value determination across a plurality of clients, a table including entries to map candidate values to corresponding counts; receiving, from each of the plurality of clients, a candidate value; in response to a received candidate value matching one of the entries in the table, incrementing, for the matching candidate value, a corresponding count; in response to the received candidate value not matching one of the entries in the table and the table not exceeding a threshold size, adding an entry to the table by adding the received candidate value with a count value of 1; in response to the received candidate value not matching one of the entries in the table and the table exceeding the threshold size, decrementing all of the counts in the table by 1 and deleting from the table any entries having a count of zero; adding noise to the corresponding counts in the entries of the table; in response to a noisy corresponding count being less than a threshold value, deleting the corresponding entry in the table for the noisy corresponding count; and outputting at least a portion of the table as the top-k value result set.
2. The system of claim 1, wherein the table is sorted based on the noisy corresponding count before the outputting.
3. The system of claim 1, wherein the system comprises or is comprised in a trusted server.
4. The system of claim 1, wherein the top-k value result set is determined based on a multi-party computation using a domain of data across the plurality of clients, wherein the top-k value result set is determined over the domain of data.
5. The system of claim 4, wherein the system utilizes, to perform the multi-party computation, at least one compute node at a cloud provider or at least one compute node at one or more of the plurality of clients.
6. The system of claim 4, wherein the receiving, from each of the plurality of clients, the candidate value comprises, receiving the candidate value in a secured message, the secured message further including a partial noise value.
7. The system of claim 6, wherein the adding noise to the corresponding counts in the entries of the table further comprises adding the noise based on the partial noise value from each of the plurality of clients.
8. The system of claim 4, wherein the outputting at least a portion of the table as the top-k value result set further comprises outputting the at least a portion of the table in a secured message.
9. The system of claim 1, wherein the top-k value result set is output in accordance with differential privacy.
10. A method comprising:
- generating, for a top-k value determination across a plurality of clients, a table including entries to map candidate values to corresponding counts;
- receiving, from each of the plurality of clients, a candidate value;
- in response to a received candidate value matching one of the entries in the table, incrementing, for the matching candidate value, a corresponding count;
- in response to the received candidate value not matching one of the entries in the table and the table not exceeding a threshold size, adding an entry to the table by adding the received candidate value with a count value of 1;
- in response to the received candidate value not matching one of the entries in the table and the table exceeding the threshold size, decrementing all of the counts in the table by 1 and deleting from the table any entries having a count of zero;
- adding noise to the corresponding counts in the entries of the table;
- in response to a noisy corresponding count being less than a threshold value, deleting the corresponding entry in the table for the noisy corresponding count; and
- outputting at least a portion of the table as the top-k value result set.
11. The method of claim 10, wherein the table is sorted based on the noisy corresponding count before the outputting.
12. The method of claim 10, wherein the method is performed by a system that comprises or is comprised in a trusted server.
13. The method of claim 10, wherein the top-k value result set is determined based on a multi-party computation using a domain of data across the plurality of clients, wherein the top-k value result set is determined over the domain of data.
14. The method of claim 13, wherein the multi-party computation utilizes at least one compute node at a cloud provider or at least one compute node at one or more of the plurality of clients.
15. The method of claim 13, wherein the receiving, from each of the plurality of clients, the candidate value comprises, receiving the candidate value in a secured message, the secured message further including a partial noise value.
16. The method of claim 15, wherein the adding noise to the corresponding counts in the entries of the table further comprises adding the noise based on the partial noise value from each of the plurality of clients.
17. The method of claim 13, wherein the outputting at least a portion of the table as the top-k value result set further comprises outputting the at least a portion of the table in a secured message.
18. The method of claim 13, wherein the top-k value result set is output in accordance with differential privacy.
19. (canceled)
20. A method comprising:
- requesting, from a first group of clients, counts for a first set of prefixes, the first set of prefixes representing an encoding of data domain for a plurality of clients grouped into the first group of clients and a second group of clients;
- receiving, from a first client of the first group of clients, a first count vector, the first count vector indicating the presence of each of the first set of prefixes at the first client;
- receiving, from a second client of the first group of clients, a second count vector, the second count vector indicating the presence of each of the first set of prefixes at the second client;
- adding the first count vector and the second count vector to yield a first aggregate count vector;
- adding noise to the first aggregate count vector;
- in response to a noisy count in the first aggregate count vector being less than a threshold value, removing, from the first aggregate count vector, the noisy count and the corresponding prefix, the first aggregate count vector identifying one or more prefixes frequently occurring in the first group of clients;
- requesting, from the second group of clients, counts for a second set of extended prefixes, the second set of extended prefixes corresponding to the one or more prefixes identified via the first aggregate count, the second set of extended prefixes extended by a predetermined number of bits;
- receiving, from a third client of the second group of clients, a third count vector, the third count vector indicating the presence of each of the second set of extended prefixes at the third client;
- receiving, from a fourth client of the second group of clients, a fourth count vector, the fourth count vector indicating the presence of each of the second set of extended prefixes at the fourth client;
- adding the third count vector and the fourth count vector to yield a second aggregate count vector;
- adding noise to the second aggregate count vector;
- in response to a noisy count in the second aggregate count vector being less than the threshold value, removing, from the second aggregate count vector, the noisy count and the corresponding prefix; and
- outputting a top-k result set based on the second aggregate count vector.
Type: Application
Filed: Jun 24, 2021
Publication Date: Jan 19, 2023
Inventor: Jonas Boehler (Karlsruhe)
Application Number: 17/357,096