Monthly Readings #21: 2017 Augest

Dremel made simple with Parquet

https://www.evernote.com/l/AAQKTqv0eOdL9bdRDp8t4YULZa0jwozkmIc

Pseudo GTID and easy replication management

https://www.evernote.com/l/AAQ7t9l8Px9D7qd_WzyaMFffXvcVyEz0AKM

Understanding GC pauses in JVM, HotSpot’s minor GC.

https://www.evernote.com/l/AASyfwT65NZJMar2aDb_asYUigabLm1fnCA

Understanding GC pauses in JVM, HotSpot’s CMS collector.

https://www.evernote.com/l/AAQSmFgCqN1DHZNoKIcAtt_LOvz4OAAxcwM

Colum-Store vs Row-Stores: How Different Are They Really

https://docs.google.com/document/d/1vtxqDyREUfDOPubEcGC2PE-VYt7ZwDXGJCbnHEy8uMo/edit#

16 Sep 2017 by fleuria

Monthly Readings #20: 2017 July

Mining Massive Datasets 学习笔记: LSH

怎样在大规模数据集中发现相似的项对列表?

最糙的解法是嵌套两层 for 循环遍历所有的项对,计算彼此的相似性再按相似性排序,这一计算量显然是不可接受的。而 k-d tree 之类的多维搜索树,面向高维度的数据集性能也退化的厉害,几乎与嵌套两层 for 循环无异。

最近在看《大数据:互联网大规模数据挖掘与分布式处理》介绍了 LSH (Locality Sensitive Hashing, 距离敏感哈希) 算法,它的思路是对数据项计算哈希,想办法将相似的项对分到同一个哈希桶中,从而简化大大计算过程。过程很有意思,在这里记一下。

集合的 Jaccard 相似度

集合 S 和 T 的 Jaccard 相似度表示为 S 和 T 的交集与并集之间的差:

集合相似度可以用于衡量文本文档的相似性,比如论文抄袭、重复新闻的检测;可以应用于协同过滤的场景,如一位用户喜欢的书籍集合和我的书籍集合相似,那么可以认为我们喜好相似;也可以将 uber 的驾车行驶路线表示为集合,继而可以通过 LSH 算法找出相似路线。

将文本文档表示为集合:Shingling

将文本文档表示为集合的 Shingling 方法和搜索引擎的 N-gram 是同义词,如 k-shingle 等于长度为 k 的任意子串。

对于重复新闻检测,可以将 shingle 定义为停用词加上后续的两个词形成 shingle 集,这一做法基于假定:新闻正文中富含停用词,而正文内混插的广告语通常是简短的口号性的文本,停用词少。这样做有助于减少同一新闻内容受站点插入的广告造成的干扰。

集合的摘要表示: MinHash

shingle 集合非常大,远远大于原始文档的体积。幸运的是可以通过 SimHash 将集合压缩为摘要,而摘要仍能用于相似性的计算。

minHash 之所以有用,在于两个集合的在经过 minHash 转换之后相等的概率恰好等于两者的 Jaccard 相似度。

设 shingle 集中有两个 shingle 集合 S 和 T,分别求取 minHash 等价于在两个集合中随机抽取一个元素:

  • 总样本空间为
  • 两个元素相等的样本空间为

易知随机抽取的两个元素相等的概率等于 Jaccard 相似度。

类似 bloom filter,使用的哈希函数越多,Jaccard 相似度的估算也就越精确。生成 100 个哈希函数,分别应用于一个集合所得的 minHash 列表可作为一个集合的 minHash 签名,对两个集合的签名做概率计算,即可估算出彼此的 Jaccard 相似度。

LSH 的哈希桶

哈希表可以保证相同的键分到同一个哈希桶,而相似性哈希的精髓在于,将划分到同一哈希桶的对象视为相似,并通过增加样本数提高相似性判定的精确性。

对每个集合计算 minHash,并将 minHash 分别划分到哈希桶中,分到同一桶中的集合会被视为可能相似的候选集。

为了提高精确性,计算多个 minHash 并分桶,同一对集合可能存在于多个桶中,后续通过一次排重扫描,找出哈希桶重复度最高的一批集合匹配,最后在它们中间计算 Jaccard 距离,进一步确认出最相似的集合。

Reference

01 Jul 2017 by fleuria

Monthly Readings #19: 2017 Jun

Is Kanban a part of Scrum, and is it simply the task board?

https://www.evernote.com/l/AARzQeN-h8pLTYcYQXT4mSc3o-bKqoOL0Ao

What is scrum?

https://www.evernote.com/l/AATlTXZAYGFJppz4h3_UOXtUqnbtKSDAfqI

互联网广告拍卖机制设计

https://www.evernote.com/l/AAQBt8ARkWtPGK0m1N04To7vIz-fLVXbjmM

Monte Carlo Tree Search

https://www.evernote.com/l/AASOQIGeQ-xA4K7Zw0dI9ZOQPq8Ob1_awzc

What is the difference between “hill climbing” and “greedy” algorithms?

https://www.evernote.com/l/AAQZ4srFHh5AV4fX5uHNwa23dhaCYRwUg6M

Impossible Engineering Problems Often Aren’t

https://www.evernote.com/l/AAQKyUWqRNRMf65MM_O2_uzD460Y9H4hWhk

Locality Sensitive Hashing By Spark

https://www.evernote.com/l/AAQ9tPAfbd1NUphCmSAYeyFLumT80BbS8mw

15 Jun 2017 by fleuria

Monthly Readings #18: 2017 March

Spark GraphX源码分析: 分布式图计算

https://www.evernote.com/l/AATTemn2ag1BBYcxkyREZRGxDp481A4Uauo

Impala Concepts and Architecture

https://www.evernote.com/l/AAQENXLmPghDC5TIukKM8gExGhu7kGgEMiE

09 Apr 2017 by fleuria

Monthly Readings #17: 2017 Jan

Semi-Synchronous Replication at Facebook:

https://www.evernote.com/l/AATV7nQ9kf9OQbS0BNfY2IHjdu99iM4_GXM

The Architecture of Schemaless, Uber Engineering’s Trip Datastore Using MySQL

https://www.evernote.com/l/AASmU-mwfQxJ3YWeogHM2yzFwsbn_nWKWZc

Designing Schemaless, Uber Engineering’s Scalable Datastore Using MySQL

https://www.evernote.com/l/AAQlAIyDfWVOoqlnk6SCipe9lmJP3z6QElY

Why Uber Engineering Switched from Postgres to MySQL

https://www.evernote.com/l/AAT4duPitB1GA4gpjjx1DX_vYa3wezBXNbY

Sharding Pinterest: How we scaled our MySQL fleet

https://www.evernote.com/l/AATkU-5xjDJPT4WXSk-indgUaqdCwK8HFI4

MySQL Backup in Facebook

https://www.evernote.com/l/AAQEbbfOhr5ASYpl4Dkhdlz7mtD-qKrVjls

If You Must Deploy Multi-Master Replication, Read This First

https://www.evernote.com/l/AAQxpve8CxVL9bL69Bi5DSWJGj4arRJvD9w

Best Practices for Amazon RDS

https://www.evernote.com/l/AATzBLB_6AZN1b8BKEkMTbvgi0dPQmvwH04

gh-ost: Triggerless design

https://www.evernote.com/l/AARj67QKJqlBlIfudgyiAgRfoKGtQZrDxe4

gh-ost: Sub-second replication lag throttling

https://www.evernote.com/l/AARjhPwJHbNAEqQ6kUmV1NFlVioDqFJPrF8

31 Jan 2017 by fleuria

Monthly Readings #16: December

Apache Hadoop Goes Realtime at Facebook:

https://www.evernote.com/l/AAQKwpeL_xFHKaqhsBCed9Bhlz2f9mjnP_E

gh-ost:

https://www.evernote.com/l/AATAfa7ydpNB_or2vZ-jW9a8Ps-A2d5fm1Q

LVS: How virtual server works?:

https://www.evernote.com/l/AATi2DGeTs9Nc7tQY_pYMhElH7-RF6JvWug

Ten years of KVM:

https://www.evernote.com/l/AASTEQ5KkdtEZ6fnFIO8QF0RSrHHIuroEtw

01 Jan 2017 by fleuria

Monthly Readings #15: November

Druid: A Real-time Analytical Data Store

https://www.evernote.com/l/AASFinGZRzVKJ6I5orCqFQ8Uxu52rHbL8Ug

Error Handling in Node.js

https://www.evernote.com/l/AATLkkYbySBBnbejRGJCKkfhjjA9HqOv8yQ

Putting Apache Kafka To Use: A Practical Guide to Building a Stream Data Platform

https://www.evernote.com/l/AATPH3DPMjtJgYf4zGfDxOho7Qnm4tnSWmo

Making “Push on Green” a Reality

https://www.evernote.com/l/AATrdAOKFDdA9a7a9bE0EFk1pQnpj21oFd4

Big Data in Real-Time at Twitter

https://www.evernote.com/l/AAQfykd2fw1IOJZMZYFZK4RVVbLKmr-SDag

04 Dec 2016 by fleuria

Monthly Readings #14: October

FollowFeed: LinkedIn’s Feed Made Faster and Smarter

https://www.evernote.com/l/AARahELmw89Pw78UcnP_H5M0-dGdOlDxNCk

Design Decisions For Scaling Your High Traffic Feeds

https://www.evernote.com/l/AARsZ-atKkZHppDeskPAHfdR0kVo6NUKdFs

Etsy Activity Feeds Architecture

https://www.evernote.com/l/AARhu-dZlMFAR4yN0qevIrtyf62LDX6qJ8A

What I Wish I Had Known Before Scaling Uber to 1000 Services

https://www.evernote.com/l/AASoq1-krsdI2oKJhD8f5V-gd_N2yUCO9Ig

Wasting Time TDDing The Wrong Things

https://www.evernote.com/l/AATqMOkwCS5HB6ydLcL8UZXHX_AzsEAjW3E

There is not Fork: an Abstraction for Efficient, Concurrent, and Concise Data Access

https://www.evernote.com/l/AAQJF0wQWKxCLaLAoqVMhQXyJkneH1SY-0o

dataloader.js

https://www.evernote.com/l/AASdRH_2b0NGq7od1OxCMpP9Nc6yjks2YZU

Comparing Redux and Relay

https://www.evernote.com/l/AASmZ_MkGtxF1b7bmrwfYX4ddyAG80n0Cs8

29 Oct 2016 by fleuria

Monthly Readings #13: September

*Don’t Share Libraries among Microservices:

https://www.evernote.com/l/AATq1J4g-BhGj6t_PfqK2BZAbiP3Tqd7ljU

DDD_Aggregate

https://www.evernote.com/l/AAS53yW5zy9ErawkJiqT4RGoSYULchxyxes

Canary All the Things

https://www.evernote.com/l/AASxzJ7DsdlNbL6kz3DOlLzRIuOqKBGlV-s

How to deploy software

https://www.evernote.com/l/AAS-ol4x-wFJpoRR2hikG9lG-1P9msurdTQ

JSON API: 格式

https://www.evernote.com/l/AATHMVTAu-9Dkoa8A9Kq3qLz2EKemjQzd00

05 Oct 2016 by fleuria

Monthly Readings #12

Building Timeline: Scaling up to hold your life story

https://www.evernote.com/l/AATOGiFn7t1D2Lw3AAuWWF6VHr3XwW3_D0o

Facebook News Feed: Social data at scale

https://www.evernote.com/l/AAQg4AINa9FMU56HnatvYH1QfHgdnIOuio0

ZooKeeper Resilience at Pinterest

https://www.evernote.com/l/AARlxHPkXbRKLrHOh8pANZf4X833d_oKB9c

Top 5 Docker Logging Methods to Fit Your Container Deployment Strategy

https://www.evernote.com/l/AARwr8aTv1tBFrAd-KIuPo6FGb1WIoAESRw

Web Service Efficiency at Instagram with Python

https://www.evernote.com/l/AAS3LB2pClJNJZuFGtk8wCnahDI35lNcXlo

Kubernates: Pods

https://www.evernote.com/l/AATps-0CQG9L-qVI4n33dCLNsTOCa0oAMXA

20 Aug 2016 by fleuria

Monthly Readings #11

thrift Field Requiredness

https://www.evernote.com/l/AASh1z6M0mBPGq9lxkCxkeFxCnW_uJu9QAI

Open-source Service Discovery

https://www.evernote.com/l/AATTi5v95VxJ4YSj6HtWiULhP48r3OsZGFY

From Python to Go, and back again

https://www.evernote.com/l/AAQPcRZGmcNM9YRJdw2Swmo1dThym6i3fgc

SLA, SLO, SLI

https://www.evernote.com/l/AAQ4DceFGD9Frar7Ve5B7asJGjb3z6gkqbo

Continuous Deployment at Instagram

https://www.evernote.com/l/AASPAOiZAAxCZq6TEhl4OhG_91jj_7AKSLg

Why It Matters: CORS

https://www.evernote.com/l/AASPAOiZAAxCZq6TEhl4OhG_91jj_7AKSLg

Strategic Scala Style: Principle of Least Power

https://www.evernote.com/l/AARFYm4RFF1NQLfgyX7E476HKN0Cv1yE2SM

What is JBOD — And Why Should You Care?

https://www.evernote.com/l/AAQG6vkHrahGk4EDaEJZ8NAK0pkEhw2qPU4

微博数据库变迁

https://www.evernote.com/l/AARJFoBwrCRABpPXhMS6T_BBlzNx-i16u70

28 May 2016 by fleuria


... MORE