[whispersystems] One on One Contact Discovery

Discussion:

Stefan Kamphausen

2017-02-20 12:37:21 UTC

Hi,

Like many people before, I came across the wonderful post about the
problems of contact discovery[1] which explains very well the problems of
safe contact discovery. According to [2] this list is the place to discuss
ideas for that problem.

I take it that for many people the issue here is, that they don't want to
send the phone numbers of their contacts to a foreign server, Signal's in
this case. At least, that's how I found that post.

Now, when I want my messenger app to discover possible contacts, what I am
actually asking is not "Signal, do you know this number?" but rather
"Signal, can I establish a connection with this number?". For that to be
answered positively, the Signal server must know both numbers. So, the
discovery might as well involve both numbers.

Based on that observation, here is an idea: I use my own number as a salt
for the hashing.

On my phone, the Signal app iterates over the numbers of my contacts,
concatenates my own number and the contact's, hashes the result and sends
the hash to the Signal server to store it (possibly stripping a byte or
two). On some other device when someone else wants to connect with me,
they'd create the same concatenation and hash and ask Signal for it.
Obviously for discovery, they'd have do this for every contact.

Would the space spanned by both numbers be large enough to protect against
pre-calculated hash tables?
Would the data to be stored on the Signal server be of acceptable size?
Would it be acceptable at all if the contact graph is stored on the server,
even if in hashed form?
And finally, is there an obvious mistake in this train of thoughts?

Looking forward to seeing someone crush this idea.

Kind regards,
Stefan

[1] https://whispersystems.org/blog/contact-discovery/
[2] https://github.com/WhisperSystems/Signal-Android/
issues/4726#issuecomment-159959146

Ralf Kohrt

2017-02-20 13:12:52 UTC

Permalink

Hi Stefan,

I had the same idea recently and in the process I have thought about it
some more. This has one obvious disadvantage: you can only find contacts
that have you in their contact list as well. I.e. this won't work for
people whom you just met and have only given you their number without
you giving your number to them.

As for hashtables: this should be just a space of 10^18 possible hashes
- so with 1 gigahash per second this should take 10^9 seconds. So if I
did not do a mistake this seems to be reasonably difficult ;-)

Best regards,
Ralf

Post by Stefan Kamphausen
Hi,
Like many people before, I came across the wonderful post about the
problems of contact discovery[1] which explains very well the problems
of safe contact discovery. According to [2] this list is the place to
discuss ideas for that problem.
I take it that for many people the issue here is, that they don't want
to send the phone numbers of their contacts to a foreign server,
Signal's in this case. At least, that's how I found that post.
Now, when I want my messenger app to discover possible contacts, what
I am actually asking is not "Signal, do you know this number?" but
rather "Signal, can I establish a connection with this number?". For
that to be answered positively, the Signal server must know both
numbers. So, the discovery might as well involve both numbers.
Based on that observation, here is an idea: I use my own number as a
salt for the hashing.
On my phone, the Signal app iterates over the numbers of my contacts,
concatenates my own number and the contact's, hashes the result and
sends the hash to the Signal server to store it (possibly stripping a
byte or two). On some other device when someone else wants to connect
with me, they'd create the same concatenation and hash and ask Signal
for it. Obviously for discovery, they'd have do this for every contact.
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?
Would the data to be stored on the Signal server be of acceptable size?
Would it be acceptable at all if the contact graph is stored on the
server, even if in hashed form?
And finally, is there an obvious mistake in this train of thoughts?
Looking forward to seeing someone crush this idea.
Kind regards,
Stefan
[1] https://whispersystems.org/blog/contact-discovery/
<https://whispersystems.org/blog/contact-discovery/>
[2]
https://github.com/WhisperSystems/Signal-Android/issues/4726#issuecomment-159959146
<https://github.com/WhisperSystems/Signal-Android/issues/4726#issuecomment-159959146>

Stefan Kamphausen

2017-02-20 13:26:49 UTC

Permalink

Hi Ralf,

Glad to hear that someone else had the same idea. Makes me feel less like
an idiot publicly embarrassing myself. :-)

Yes, the limitation is clear to me. That would be a trade-off. If it's the
only one, I'd personally prefer it over the current trade-off which is: "so
the only thing we can do is write the server such that it doesn't store the
transmitted contact information, inform the user, and give them the choice
of opting out." [https://whispersystems.org/blog/contact-discovery/]. Maybe
it is just me, though.

Best regards,
stefan

Post by Ralf Kohrt
Hi Stefan,
I had the same idea recently and in the process I have thought about it
some more. This has one obvious disadvantage: you can only find contacts
that have you in their contact list as well. I.e. this won't work for
people whom you just met and have only given you their number without you
giving your number to them.
As for hashtables: this should be just a space of 10^18 possible hashes -
so with 1 gigahash per second this should take 10^9 seconds. So if I did
not do a mistake this seems to be reasonably difficult ;-)
Best regards,
Ralf
Hi,
Like many people before, I came across the wonderful post about the
problems of contact discovery[1] which explains very well the problems of
safe contact discovery. According to [2] this list is the place to discuss
ideas for that problem.
I take it that for many people the issue here is, that they don't want to
send the phone numbers of their contacts to a foreign server, Signal's in
this case. At least, that's how I found that post.
Now, when I want my messenger app to discover possible contacts, what I am
actually asking is not "Signal, do you know this number?" but rather
"Signal, can I establish a connection with this number?". For that to be
answered positively, the Signal server must know both numbers. So, the
discovery might as well involve both numbers.
Based on that observation, here is an idea: I use my own number as a salt
for the hashing.
On my phone, the Signal app iterates over the numbers of my contacts,
concatenates my own number and the contact's, hashes the result and sends
the hash to the Signal server to store it (possibly stripping a byte or
two). On some other device when someone else wants to connect with me,
they'd create the same concatenation and hash and ask Signal for it.
Obviously for discovery, they'd have do this for every contact.
Would the space spanned by both numbers be large enough to protect against
pre-calculated hash tables?
Would the data to be stored on the Signal server be of acceptable size?
Would it be acceptable at all if the contact graph is stored on the
server, even if in hashed form?
And finally, is there an obvious mistake in this train of thoughts?
Looking forward to seeing someone crush this idea.
Kind regards,
Stefan
[1] https://whispersystems.org/blog/contact-discovery/
[2] https://github.com/WhisperSystems/Signal-Android/issues/
4726#issuecomment-159959146

Peter Holtwick

2017-02-20 15:26:27 UTC

Permalink

Hello Stefan,

I think the space needed on the Signal server's side is quite considerable,
since you have to store per user all contacts, so you multiply with factor
ca. 5000.

I was actually wondering how the contact is established _after_
discoverity. In the discussion with the Bloom filter, there is an intrinsic
chance for false positives. So when a contact is believed to use Signal,
does the client ask the server if he is really registered? Would the Bloom
filter just be an optimization of a loop over 5000 contacts and querying
each (privacy aside)?

Regards,
Peter

Post by Stefan Kamphausen
Hi Ralf,
Glad to hear that someone else had the same idea. Makes me feel less like
an idiot publicly embarrassing myself. :-)
Yes, the limitation is clear to me. That would be a trade-off. If it's
"so the only thing we can do is write the server such that it doesn't store
the transmitted contact information, inform the user, and give them the
choice of opting out." [https://whispersystems.org/blog/contact-discovery/].
Maybe it is just me, though.
Best regards,
stefan

Post by Ralf Kohrt
Hi Stefan,
I had the same idea recently and in the process I have thought about it
some more. This has one obvious disadvantage: you can only find contacts
that have you in their contact list as well. I.e. this won't work for
people whom you just met and have only given you their number without you
giving your number to them.
As for hashtables: this should be just a space of 10^18 possible hashes -
so with 1 gigahash per second this should take 10^9 seconds. So if I did
not do a mistake this seems to be reasonably difficult ;-)
Best regards,
Ralf
Hi,
Like many people before, I came across the wonderful post about the
problems of contact discovery[1] which explains very well the problems of
safe contact discovery. According to [2] this list is the place to discuss
ideas for that problem.
I take it that for many people the issue here is, that they don't want to
send the phone numbers of their contacts to a foreign server, Signal's in
this case. At least, that's how I found that post.
Now, when I want my messenger app to discover possible contacts, what I
am actually asking is not "Signal, do you know this number?" but rather
"Signal, can I establish a connection with this number?". For that to be
answered positively, the Signal server must know both numbers. So, the
discovery might as well involve both numbers.
Based on that observation, here is an idea: I use my own number as a salt
for the hashing.
On my phone, the Signal app iterates over the numbers of my contacts,
concatenates my own number and the contact's, hashes the result and sends
the hash to the Signal server to store it (possibly stripping a byte or
two). On some other device when someone else wants to connect with me,
they'd create the same concatenation and hash and ask Signal for it.
Obviously for discovery, they'd have do this for every contact.
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?
Would the data to be stored on the Signal server be of acceptable size?
Would it be acceptable at all if the contact graph is stored on the
server, even if in hashed form?
And finally, is there an obvious mistake in this train of thoughts?
Looking forward to seeing someone crush this idea.
Kind regards,
Stefan
[1] https://whispersystems.org/blog/contact-discovery/
[2] https://github.com/WhisperSystems/Signal-Android/issues/4726
#issuecomment-159959146

Moxie Marlinspike

2017-02-21 00:00:22 UTC

Permalink

Post by Stefan Kamphausen
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?

I don't think so. Seems like you're thinking about it in terms of
"Here's a hash token, what numbers does it correspond to?" But the
question you're actually worried about is "Here's Stefan's phone number,
who are his contacts?" At that point, it's 10^10 again. Just iterate
over the set of all possible phone numbers once, and you've got your
matches. Narrow your query down a region and it's even smaller.

- moxie

--
http://www.thoughtcrime.org

Stefan Kamphausen

2017-02-21 10:09:08 UTC

Permalink

Hi!

Post by Moxie Marlinspike

Post by Stefan Kamphausen
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?

If I understand that correctly, the approach may protect against
pre-calculated tables. But if it does or does not is irrelevant, because
you can just iterate over all possible contact numbers (and the combined
hashes) when needed. That means, calculating 10^10 hashes is considered
doable, even with bcrypt or other slower hashing algorithms, that can not
be made too slow, because they need to be fast enough to lookup possible
contacts.

So, I was starting off the wrong question.

I guess, that was the crushing I was waiting for. Thanks for taking the
time to consider this.

Kind regards,
stefan

Jani Monoses

2017-03-02 07:05:16 UTC

Permalink

What about the possibility of lazy contact discovery?

For those opting out of having their address book used to discover the
contacts intersection, allow picking contacts directly from the address
book when messaging. Unless the server returns a no such user error (which
happens now if you try messaging a number) the session can proceed.

This may complicate the UX, risk having users opt out uninitentionally
without knowing the implications and would not be in line with the 'as few
options as possible' policy.

Post by Stefan Kamphausen
Hi!

Post by Moxie Marlinspike

Post by Stefan Kamphausen
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?

If I understand that correctly, the approach may protect against
pre-calculated tables. But if it does or does not is irrelevant, because
you can just iterate over all possible contact numbers (and the combined
hashes) when needed. That means, calculating 10^10 hashes is considered
doable, even with bcrypt or other slower hashing algorithms, that can not
be made too slow, because they need to be fast enough to lookup possible
contacts.
So, I was starting off the wrong question.
I guess, that was the crushing I was waiting for. Thanks for taking the
time to consider this.
Kind regards,
stefan

Tufts

2017-03-02 07:49:51 UTC

Permalink

Post by Jani Monoses
What about the possibility of lazy contact discovery?
For those opting out of having their address book used to discover
the contacts intersection, allow picking contacts directly from the
address book when messaging. Unless the server returns a no such user
error (which happens now if you try messaging a number) the session
can proceed.
This may complicate the UX, risk having users opt out
uninitentionally without knowing the implications and would not be in
line with the 'as few options as possible' policy.

I think this could be done with minimal changes on Android and small
changes on iOS.

For Android, the flow goes like this:

- Hit the FAB (pencil button)
- Select a contact
- Type your message
- During this time, your phone queries Signal's servers about this
number. While this is happening, there is a spinner where the send icon
usually is, and you are unable to send.
- If the number is registered, spinner resolves into the blue,
encrypted send button. If not, it resolves into the gray button for sms.

For iOS, the difference is that now all of your contacts show when
trying to compose a message, but you're simply unable to send a message
to some contacts. I'm not familiar with iOS, but I assume there's a way
to link to the iMessage app? We could use that so that the screen "You
can't message this contact because they're not on SIgnal" can link to
iMessage with an invite link.

---

My analysis of this approach is:

- Potentially far more requests.
- Should be offset by each request being smaller and far fewer
numbers being transmitted.

- Signal servers could receive additional metadata about how often you
correspond with a given contact, even if you're just sending them sms.
- But, you already trust OWS with that metadata for messages you send
through Signal. And of course there's the huge gain that you don't need
to share your whole social graph.
- Clients could possibly obfuscate some of this data by bundling some
less frequent contacts or random numbers each time a request is made.

- Potentially worse UX, if you have to sit there waiting to know if you
can message someone.
- There is a purely technical side: How long does it take for a
request to return?
- There's also the element of "I thought you (Signal) said I could
message them, but now you're telling me I can't?!".
- While this is certainly a bit annoying, it's also not
substantially different from other ubiquitous messaging systems. If I
send you an sms, but I have the wrong number or your phone is out of
service, delivery may fail. Likewise with email.

- Can't give notifications that a contact has joined Signal without
messaging them, first.
- There's no way around this one. It is a bit anti-network effect.
Not to mention, it's fun seeing people join Signal for reasons other
than me bugging them to :P.

---

I conclude that it would be reasonable to offer this feature either as
an opt-in or opt-out. We can re-purpose the "New contacts notification"
setting to also control whether your full contacts are sent to the
server, so there's little additional complexity added.

If there's anything I'm overlooking, please step in and point it out!

~Stephen

Post by Jani Monoses
On Tue, Feb 21, 2017 at 12:09 PM, Stefan Kamphausen

Post by Stefan Kamphausen
Hi!

Post by Moxie Marlinspike

Post by Stefan Kamphausen
Would the space spanned by both numbers be large enough to protect
against pre-calculated hash tables?

If I understand that correctly, the approach may protect against
pre-calculated tables. But if it does or does not is irrelevant,
because you can just iterate over all possible contact numbers (and
the combined hashes) when needed. That means, calculating 10^10
hashes is considered doable, even with bcrypt or other slower
hashing algorithms, that can not be made too slow, because they need
to be fast enough to lookup possible contacts.
So, I was starting off the wrong question.
I guess, that was the crushing I was waiting for. Thanks for taking
the time to consider this.
Kind regards,
stefan

Jani Monoses

2017-03-02 08:00:13 UTC

Permalink

Post by Tufts
- Hit the FAB (pencil button)
- Select a contact
- Type your message
- During this time, your phone queries Signal's servers about this number.
While this is happening, there is a spinner where the send icon usually is,
and you are unable to send.

Just sending the message would result in either OK or No such user error,
so I don't think a separate query message is needed.
The situation for when there is no network connectivity would need a good
UX too.