Hao's Blog

Software Engineer

Email Predictor

| Comments

I wrote a email predictor recently which Given the following sample dataset:

1
2
3
4
5
6
7
8
{
  "John Ferguson" => "john.ferguson@alphasights.com",
  "Damon Aw" => "damon.aw@alphasights.com",
  "Linda Li" => "linda.li@alphasights.com",
  "Larry Page" => "larry.p@google.com",
  "Sergey Brin" => "s.brin@google.com",
  "Steve Jobs" => "s.j@apple.com"
}

It will Predict the email addresses for the following advisors:

1
2
3
4
"Peter Wong", "alphasights.com"
"Craig Silverstein", "google.com"
"Steve Wozniak", "apple.com"
"Barack Obama", "whitehouse.gov"

The solution is designed to be object-oriented, and it composes three parts: Pattern, Analyzer and Predictor.

A Pattern handles all business logic related with first_name and last_name in this case. Its responsibility is solely designed to fulfill the purpose of looking for/predictoring patterns based on first_name and last_name. In our case, there are four cases, which are first_name_dot_last_name, first_name_dot_last_initial, first_initial_dot_last_initial, first_initial_dot_last_name.

A Analyzer should be created with a raw dataset and a pattern, and based on instructions defined in patterns, analyzer generate dataset which can be used for further prediction process.

A Predictor should be created with a dataset and a pattern, and its ‘#formulate’ method will take a name and company as input, and based on exisiting pattern which is processed and stored in dataset, it will predict email addresses for the given name and company.

The philosiphy behind this can be interpreted in this way: Pattern can easily be replaced with other patterns or a pattern can easily to modifed to accept more business logic based on other attributes. Analyzer should only care about its given pattern and its given raw data, and generate a dataset. Predictor should only caret about its given pattern and its given dataset. Analyzer and Predictor should not know each other.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
require 'pry'
require 'pp'

class Pattern
  attr_reader :first_name, :last_name

  def split(name)
    @first_name, @last_name = name.split(/[.|\s{1}]/)
  end

  def find
    [first_part, last_part].join('_dot_').to_sym
  end

  def predict(pattern)
    %w(first_initial first_name last_initial last_name).inject([]) do |memo, params|
      memo << send(params) if pattern.to_s =~ /#{params}/
      memo
    end.join('.')
  end

  private

  def first_initial
    @first_name[0]
  end

  def last_initial
    @last_name[0]
  end

  def first_part
    @first_name.length == 1 ? 'first_initial' : 'first_name'
  end

  def last_part
    @last_name.length == 1 ? 'last_initial' : 'last_name'
  end
end

class Analyzer
  attr_reader :dataset, :pattern

  def initialize(attributes)
    @dataset = {}
    @pattern = attributes[:pattern]
  end

  def process(raw_data)
    raw_data.each do |_, email_address|
      name, email = email_address.split('@')
      @pattern.split(name)
      pattern = @pattern.find

      @dataset[email] ||= []
      @dataset[email] << pattern unless @dataset[email].include?(pattern)
    end
  end
end

class Predictor
  attr_reader :pattern, :dataset

  def initialize(attributes)
    @dataset = attributes[:dataset]
    @pattern = attributes[:pattern]
  end

  def formulate(attributes)
    name = attributes[:name].downcase
    email = attributes[:email]

    return "There is no matching in dataset for #{attributes[:name]} working for #{email}" if @dataset[email].nil?

    @dataset[email].map do |pattern|
      @pattern.split(name)
      @pattern.predict(pattern) + "@#{email}"
    end
  end
end

# Instructions to run:

raw =
{
  "John Ferguson" => "john.ferguson@alphasights.com",
  "Damon Aw" => "damon.aw@alphasights.com",
  "Linda Li" => "linda.li@alphasights.com",
  "Larry Page" => "larry.p@google.com",
  "Sergey Brin" => "s.brin@google.com",
  "Steve Jobs" => "s.j@apple.com"
}
# Create a analyzer object and give it a pattern
analyzer = Analyzer.new(pattern: Pattern.new)
# Analyze the given dataset
analyzer.process(raw)
# Create a predictor object and give it a dataset and a pattern
predictor = Predictor.new(dataset: analyzer.dataset, pattern: Pattern.new)

# Formulate potential patterns for given name and email
pp predictor.formulate(name: 'Criag Silverstein', email: 'google.com')
pp predictor.formulate(name: 'Peter Wong', email: 'alphasights.com')
pp predictor.formulate(name: 'Steve Wozniak', email: 'apple.com')
pp predictor.formulate(name: 'Barack Obama', email: 'whitehouse.gov')

Specs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
# Instruction to run test:
# rspec -fd email_predictor_spec.rb

require './email_predictor'

describe Pattern do
  let(:pattern) { Pattern.new }

  it 'should respond to first name' do
    expect(pattern).to respond_to(:first_name)
  end

  it 'should respond to last name' do
    expect(pattern).to respond_to(:last_name)
  end

  describe '#split' do
    it 'should split name into first_name and last_name if name is separted by white space' do
      pattern.split('John Ferguson')
      expect(pattern.first_name).to eql('John')
      expect(pattern.last_name).to eql('Ferguson')
    end

    it 'should split name into first_name and last_name if name is separted by .' do
      pattern.split('steve.jobs')
      expect(pattern.first_name).to eql('steve')
      expect(pattern.last_name).to eql('jobs')
    end
  end

  describe '#find' do
    it 'should generate pattern if first and last are full' do
      pattern.split('john.ferguson')
      expect(pattern.find).to eql(:first_name_dot_last_name)
    end

    it 'should generate pattern if first is initial and last is full' do
      pattern.split('j.ferguson')
      expect(pattern.find).to eql(:first_initial_dot_last_name)
    end

    it 'should generate pattern if first is full and last are initial' do
      pattern.split('john.f')
      expect(pattern.find).to eql(:first_name_dot_last_initial)
    end

    it 'should generate pattern if both are initial' do
      pattern.split('j.f')
      expect(pattern.find).to eql(:first_initial_dot_last_initial)
    end
  end

  describe '#predict' do
    it 'should be able to predict for first_name_dot_last_name' do
      pattern.split('john ferguson')
      expect(pattern.predict(:first_name_dot_last_name)).to eql('john.ferguson')
    end

    it 'should be able to predict for first_initial_dot_last_name' do
      pattern.split('john ferguson')
      expect(pattern.predict(:first_initial_dot_last_name)).to eql('j.ferguson')
    end

    it 'should be able to predict for first_name_dot_last_initial' do
      pattern.split('john ferguson')
      expect(pattern.predict(:first_name_dot_last_initial)).to eql('john.f')
    end

    it 'should be able to predict for first_initial_dot_last_initial' do
      pattern.split('john ferguson')
      expect(pattern.predict(:first_initial_dot_last_initial)).to eql('j.f')
    end
  end
end

describe Analyzer do

  let(:pattern) { Pattern.new }
  let(:analyzer) { Analyzer.new pattern: pattern }

  it 'should respond to dataset' do
    expect(analyzer).to respond_to(:dataset)
  end

  it 'should use hash as dataset' do
    expect(analyzer.dataset).to be_a(Hash)
  end

  it 'should respond to raw data' do
    expect(analyzer).to respond_to(:pattern)
  end

  describe '#process' do

    let(:raw) { { "John Ferguson" => "john.ferguson@alphasights.com" } }

    it 'should use pattern to split name and find pattern' do
      pattern.should_receive(:split)
      pattern.should_receive(:find).and_return(:first_name_dot_last_name)
      analyzer.process(raw)
    end

    it 'should be able to process data based on given rule' do
      analyzer.process(raw)
      expect(analyzer.dataset.size).to eql(1)
      expect(analyzer.dataset).to eql('alphasights.com' => [:first_name_dot_last_name])
    end

    let(:multiple) do
      {
        "John Ferguson" => "john.ferguson@alphasights.com",
        "Damon Aw" => "damon.aw@alphasights.com",
        "Linda Li" => "linda.li@alphasights.com",
        "Larry Page" => "larry.p@google.com"
      }
    end

    it 'should be albe to process data for multiple companies' do
      analyzer.process(multiple)
      expect(analyzer.dataset.size).to eql(2)
      expect(analyzer.dataset).to eql('alphasights.com' => [:first_name_dot_last_name],
                                      'google.com' => [:first_name_dot_last_initial])
    end
  end
end

describe Predictor do
  let(:pattern) { Pattern.new }
  let(:predictor) { Predictor.new pattern: pattern }

  it 'should respond to dataset' do
    expect(predictor).to respond_to(:dataset)
  end

  it 'should respond to pattern' do
    expect(predictor).to respond_to(:pattern)
  end

  describe '#formulate' do
    it 'should not formulate a email address if company is not given in the dataest' do
      predictor = Predictor.new pattern: pattern, dataset: {}
      attributes = { name: 'Barack Obama', email: 'whitehouse.gov'}
      response = predictor.formulate(attributes)
      expect(response).to eql("There is no matching in dataset for #{attributes[:name]} working for #{attributes[:email]}")
    end

    it 'should use pattern to split name and predict email' do
      predictor = Predictor.new pattern: pattern, dataset: { 'alphasights.com' => [:first_name_dot_last_name] }
      attributes = { name: 'Peter Wong', email: 'alphasights.com'}
      pattern.should_receive(:split).and_return(true)
      pattern.should_receive(:predict).and_return('peter.wong')
      predictor.formulate(attributes)
    end

    it 'should predict email address based if one pattern exist' do
      predictor = Predictor.new pattern: pattern, dataset: { 'alphasights.com' => [:first_name_dot_last_name] }
      attributes = { name: 'Peter Wong', email: 'alphasights.com'}
      response = predictor.formulate(attributes)
      expect(response).to eql(['peter.wong@alphasights.com'])
    end

    it 'should predict email address based if multiple patterns exist' do
      predictor = Predictor.new pattern: pattern, dataset: { 'google.com' => [:first_name_dot_last_initial, :first_initial_dot_last_name] }
      attributes = { name: 'Criag Silverstein', email: 'google.com' }
      response = predictor.formulate(attributes)
      expect(response).to match_array(['c.silverstein@google.com', 'criag.s@google.com'])
    end
  end
end

Extract Values From Javascript Nested Object

| Comments

Underscore doesn’t support search and find nested value in a nested javascript object, So I went ahead and built a version for that:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
# This helper is served to relieve the efforts of building up external formatted object through
# nested object which we retrieve from mongoDB or http params
# Usage:
#     SampleData =
#       token: '1234',
#       user:
#         token: '4321'
#       contact:
#         address:
#           city: 'Emeryville'
#           state: 'CA'
#           zip_code: '51421'

# All we need to do is
# Helper.DataPicker(SampleData, 'token', 'user.token', 'city')
# For nested key with the duplicated names, we need to nest the top parents nodes as well.
# For example 'user.token'

_ = require 'underscore'

DataPicker = (data, attributes...) ->

  result = {}

  # Loop through each pair of data and find the matching attributes
  _.each data, (value, key, list) ->
    if _.contains(attributes, key) and (!_.isObject(list[key]) or _.isArray(value))
      result[key] ||= value
    else if _.isObject(value) and !_.isArray(value)
      for subKey, subValue of value
        arguments.callee(subValue, subKey, value)

  # Loop through nested cases like user.token, merchant.token
  _.each attributes, (attribute) ->
    if attribute.match(/^[a-z]+[.]/)
      keys = attribute.split('.')
      element = keys.join('_')
      nestedElement = _.reduce(keys, (memo, num) ->
        if !_.isUndefined(memo)
          memo = memo[num]
      , data)

      result[element] = nestedElement if !_.isUndefined(nestedElement)

  return result

module.exports = DataPicker

DFS and BFS in Ruby Using Simple Tree Objects

| Comments

In Ruby, there is no existing pre-built tree structures available, but it is pretty easy to define tree objects. Here is the an exmaple which DFS and BFS are performed using ruby.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
class Node
  attr_accessor :value, :left, :right, :name

  def initialize(options={})
    @value = options[:value]
    @name = options[:name]
  end

  def children
    [@left, @right].compact
  end

  def children?
    @left && @right
  end

  def no_children?
    !children?
  end
end

root = Node.new({:value => 1, :name => 'root'})
child_1 = Node.new({:value => 2, :name => 'child_1'})
child_2 = Node.new({:value => 3, :name => 'child_2'})
grand_child_1 = Node.new({:value => 4, :name => 'grand_child_1'})
grand_grand_child_1 = Node.new({:value => 5, :name => 'grand_grand_child_1'})
grand_child_1.left = grand_grand_child_1
child_1.left = grand_child_1
root.left = child_1
root.right = child_2

def bfs(node)
  queue = []
  queue.push(node)

  while(queue.size != 0)
    n = queue.shift
    puts n.value
    n.children.each do |child|
      queue.push(child)
    end
  end
end

bfs(root)

puts '~~~~~~~~~~~~~~~~~~~~~~~~~~~'

def dfs(node)
  puts node.value
  node.children.each do |child|
    dfs(child)
  end
end

dfs(root)

Subset Sum Using Dynamic Programming

| Comments

Problem: Given a set [1,2,3,4], how many subsets there are in which each sum is equal to 4 ?

Solution: Solution is programmed using ruby in dynamic programming. The idea is set up a matrix in which each cell in x axis represents a number in set. And in y axis, each cell represent their corresponding sum.

For example, in this case, 1,2,3,4 will be numbered in x axis and 1,2,3,4 (incrementally) will be numbered in y axis.

Then we set up boudary condtions. The algorithm solves problems in multiple subproblems and each subproblem memoize the optimal solution from the previous iteration.

For furthur detailed algorithm, visit Dynamic Programming

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
require 'pry'
require 'pp'
require 'matrix'

def dynamic_programming(set, n, sum)
  matrix = []
  (sum+1).times { matrix << Array.new(n+1, 0) }

  for j in 0..n
    matrix[0][j] = 0
  end

  for i in 0..sum
    matrix[i][0] = 0
  end

  for i in 1..n
    for j in 1..sum
      if j < set[i-1]
        matrix[j][i] = matrix[j][i-1]
      else
        matrix[j][i] = [matrix[j][i-1], set[i-1] + matrix[j-set[i-1]][i-1]].max
      end
    end
  end
  pp matrix
end

set = [1,2,3,4]
sum = 4
n = 4

dynamic_programming(set, n, sum)

Rotx and Compromise

| Comments

Rotx exercise:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
def rotx(x, string, encrypt=true)
  letters = ('a'..'z').to_a

  string.split('').map! do |letter|
    if letter.match(/[a-zA-Z]/)
      # Looking for index
      index = letter == letter.downcase ? letters.index(letter) : letters.index(letter.downcase)

      # Building rotated index either by moving forward or backward
      rotated_index = encrypt ? (index + x)%26 : (index - x)%26

      # Building new letter based on rotated index
      letter == letter.downcase ? letters[rotated_index] : letters[rotated_index].upcase
    else
      letter
    end
  end.join
end

describe 'Test #rotx' do
  it 'should rotate the string and encrypt as default' do
    rotx(10, 'Hello, World').should eql 'Rovvy, Gybvn'
  end

  it 'should rotate back the string if encrypt is false' do
    rotx(10, 'Rovvy, Gybvn', false).should eql 'Hello, World'
  end

  it 'should return the same results if roration number is added 26' do
    rotx(36, 'Hello, World').should eql 'Rovvy, Gybvn'
  end
end

A simple compromise implementation, and this will build the foundation in removing callbacks in node.js app.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
Promise = function () {
  this.value;
  this.callbacks = [];
};

Promise.prototype.resolve = function(result) {
  if (this.value) {
    throw new Error('A promise should only be able to be resolved once !');
  } else {
    this.value = result;
  }
  if (this.callbacks.length > 0) {
    this.triggerCallbacks();
  }
};

Promise.prototype.success = function(fn) {
  this.callbacks.push(fn);

  if (this.value) {
    this.triggerCallbacks();
  }
};

Promise.prototype.triggerCallbacks = function() {
  var self = this;

  this.callbacks.forEach(function(callback) {
    callback.call(self,self.value);
  });
}

var foo = new Promise();
var bar = new Promise();

foo.resolve('hello');

setTimeout(function(){
  bar.resolve('world');
}, 500);

foo.success(function(result){
  console.log(result);
});

bar.success(function(result){
  console.log(result);
});

// Throw errors if one promise tries to resolve twice

var foobar = new Promise();
foobar.resolve('hello');
foobar.resolve('world');

Twitter Streaming Retweets

| Comments

Recently, I did an exercise using twitter streaming api to generate the most recent 10 retweets in the past n minutes.

The program is pretty simple. It sends http streaming requests to twitter api and doing sorting in the background thread, while the main thread is printing out the most recent 10 retweets.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
require 'twitter'
require 'pry'
require 'pp'

class Streamer
  attr_reader :client, :background_thread, :mins, :container

  def initialize(attributes)
    @client = attributes[:client]
    @mins = attributes[:mins]
    @container = attributes[:container]
  end

  def stream
    @background_thread = Thread.new do
      @client.sample do |object|
        if object.is_a?(Twitter::Tweet)
          @container[object.created_at] ||= []
          @container[object.created_at] << object.text
        end
        # Container will only keep tweets which are 5 mins up to date
        @container.keep_if {|time, tweet| @mins * 60 > Time.now - time}
      end
    end
  end

  def stop
    Thread.kill(@background_thread)
  end
end

class Massager
  attr_reader :raw_tweets, :top_tweets, :container

  def initialize(attributes)
    @container = attributes[:container]
  end

  def top_tweets
    build
    sort
    format
  end

  private

  def build
    @raw_tweets = Hash.new(0)
    @container.values.flatten.each do |tweet|
      @raw_tweets[tweet] += 1
    end
  end

  def sort
    @raw_tweets = @raw_tweets.sort{|a,b| a[1] <=> b[1]}.last(10)
  end

  def format
    @raw_tweets.inject({}) do |memo, value|
      memo[value[1]] ||= []
      memo[value[1]] << value[0]
      memo
    end
  end
end

class Manager
  attr_reader :container, :streamer, :massager

  def initialize(attributes)
    @container = {}
    @streamer = Streamer.new(container: @container, mins: attributes[:mins], client: attributes[:client])
    @massager = Massager.new(container: @container)
  end
end

client = Twitter::Streaming::Client.new do |config|
  config.consumer_key        = "#"
  config.consumer_secret     = "#"
  config.access_token        = "#"
  config.access_token_secret = "#"
end

mins = 1
manager = Manager.new(mins: mins, client: client)
manager.streamer.stream

start_time = Time.now
loop do
  window_start = Time.now - start_time > 60*mins ? Time.now - 60*mins : start_time
  sleep(1)
  puts "------------- Top Ten Tweets from #{window_start} To #{Time.now} ---------------"
  pp manager.massager.top_tweets
end

Min_stack

| Comments

This is a typical interview question which is going to ask you to implement a stack which can track the max/min value for the current stack. The solution posted here will only show how min-stack is implemented, and max-stack can be developed similarly.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
require 'pry'
require 'pp'

class Stack
  attr_reader :stack

  def initialize
    @stack = []
  end

  def pop
    @stack.pop
  end

  def push(element)
    @stack.push(element)
  end

  def peek
    @stack[-1]
  end
end

class MinStack
  attr_reader :min_stack, :stack

  def initialize
    @min_stack = Stack.new
    @stack = Stack.new
  end

  def push(element)
    if @stack.peek && @stack.peek < element
      @min_stack.push(@stack.peek)
    else
      @min_stack.push(element)
    end
    @stack.push(element)
  end

  def pop
    @stack.pop
    puts "The minimun is #{@min_stack.pop}"
  end
end

min_stack = MinStack.new
min_stack.push(2)
min_stack.push(5)

Create Multiple Varariables With Differentreferences in Ruby

| Comments

Today I had a bug in ruby which I do not quite understand, and it turned out to be reference issue in ruby.

First, I tried something like this and intended to create 4 variables for the same values, but different objects

1
a, b, c ,d = [Object.new] * 4

This turns out to be a, b, c, d refers to the same object

1
2
a.object_id == b.object_id
#=> true

So the correct way should be:

1
a, b, c, d = Object.new, Object.new, Object.new, Object.new