Deep Visual-Semantic Alignments for Generating Image Descriptions

Abstract

  • CNN over images, BiRNN over text, a multimodal embedding, and an alignment model

Introduction

  • Previous work: labeling images with a fixed set of given categories
  • Goal: generate dense descriptions of images
  • Requirement: the model must jointly infer the image content and its natural-language expression, and learn to do so from training data
  • Dataset challenge: image-captioning datasets do not include localizations of the entities mentioned in the captions
  • Core insight: sentences serve as weak labels; the correspondence between sentence segments and image regions is unknown -> the model must infer this latent alignment
Read more »

Definition of the ListNode linked-list class

import java.util.Arrays;

public class ListNode {
    int val;
    ListNode next;

    ListNode(int x) {
        val = x;
        next = null;
    }

    // Export the list to an int array by following next pointers from head.
    public static int[] toArray(ListNode head) {
        int[] arr = {};
        while (head != null) {
            arr = Arrays.copyOf(arr, arr.length + 1);
            arr[arr.length - 1] = head.val;
            head = head.next;
        }
        return arr;
    }

    // Build a list from an int array and return its head (null for an empty array).
    public static ListNode fromArray(int[] arr) {
        if (arr.length == 0) return null;

        ListNode ll = new ListNode(arr[0]);
        ListNode head = ll;
        for (int i = 1; i < arr.length; i++) {
            ll.next = new ListNode(arr[i]);
            ll = ll.next;
        }
        return head;
    }
}

Built-in static methods are provided for converting the list to and from an array.

Problem

Reverse a linked list

Read more »

Links

Binary Search Explained in Detail

Code

public class binarysearch {
    public static void main(String[] args) {
        int[] a = {1, 2, 2, 2, 3};
        System.out.println(right_bound(a, 2));
    }

    // Assumes the array is sorted in ascending order.
    static int binarySearch(int[] arr, int target) {
        int left = 0;
        int right = arr.length - 1;

        while (left <= right) {
            int mid = (left + right) >> 1;
            if (target == arr[mid]) {
                return mid;
            } else if (arr[mid] < target) {
                left = mid + 1;
            } else if (arr[mid] > target) {
                right = mid - 1;
            }
        }
        return -1;
    }

    // Find the left boundary: the first index whose value equals target, or -1.
    static int left_bound(int[] arr, int target) {
        int n = arr.length;
        int left = 0;
        int right = n;

        while (left < right) {
            int mid = (left + right) >> 1;
            if (arr[mid] == target) {
                right = mid;
            } else if (arr[mid] > target) {
                right = mid;
            } else if (arr[mid] < target) {
                left = mid + 1;
            }
        }
        if (left == arr.length) return -1;
        return (arr[left] == target) ? left : -1;
    }

    // Find the right boundary: the last index whose value equals target, or -1.
    static int right_bound(int[] arr, int target) {
        int n = arr.length;
        int left = 0;
        int right = n;

        while (left < right) {
            int mid = (left + right) >> 1;
            if (arr[mid] == target) {
                left = mid + 1;
            } else if (arr[mid] > target) {
                right = mid;
            } else if (arr[mid] < target) {
                left = mid + 1;
            }
        }
        if (left == 0) return -1;
        return (arr[left - 1] == target) ? left - 1 : -1;
    }
}

Notes

Read more »

[Python Learning] Stock Analysis Series (in Chinese, self-made) by 大力牛肉粉

Capital flow data from Eastmoney (东方财富网)

Convert curl syntax to Python, Ansible URI, MATLAB, Node.js, R, PHP, Strest, Go, Dart, JSON, Elixir, Rust

%matplotlib inline

import datetime as dt
import pandas as pd
import numpy as np
import pandas_datareader as pdr  # fetch online data
import matplotlib.pyplot as plt
from matplotlib import style  # plotting style

Fetching the data
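
A minimal sketch of pulling daily price data with pandas_datareader, assuming the 'yahoo' data source is reachable (its availability has varied over time); the ticker and date range below are made-up examples, not values from the original post:

import datetime as dt
import pandas_datareader as pdr

start = dt.datetime(2019, 1, 1)
end = dt.datetime(2019, 12, 31)

# Download daily OHLCV data for one ticker (example: Apple).
df = pdr.DataReader('AAPL', 'yahoo', start, end)
print(df.head())
df['Close'].plot()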

Read more »

Basic Info

Offices: BJ, SH, HK.

Skills:

  • Big data
  • data mining
  • machine learning
  • analytical problem solving
  • co-creation
  • design thinking
  • change enablement
  • people effectiveness

Career path: cross-industry rotation for the first three years

Read more »

Natural Language Generation

NLG

  • subcomponent of
    • Machine Translation
    • summarization
    • dialogue
    • Freeform question answering (not only from the context)
    • Image Captioning

Recap

  • Language modeling? the task of predicting the next word: \[P(y_t \mid y_1, \dots, y_{t-1})\]

  • Language model

  • RNN-LM

  • Conditional Language Modeling \[P(y_t \mid y_1, \dots, y_{t-1}, x)\]

    • what is x? condition.
    • Examples:
      • Machine Translation (x = source sentence, y = target sentence)
      • Summarization (x = input text, y = summary)
      • Dialogue (dialogue history and next utterance)
  • training an RNN-LM? \[J = \dfrac{1}{T}\sum\limits_{t=1}^T J_t\]

    • "Teacher Forcing": always use the gold to feed into the decoder
  • decoding algorithms

    • Greedy decoding: argmax each step
    • Beam search: aims to find a high-probability sequence
      • track the k most probable partial sequences (hypotheses)
      • k is the beam size (e.g. 2)
      • when reaching some stopping criterion, output
      • what's the effect of changing k?
        • k=1: greedy decoding
        • larger k: more hypotheses, computationally expensive
          • for NMT, increasing k too much decreases BLEU, reason: producing shorter translations
          • for chit-chat dialogue, producing too generic responses
    • sampling-based decoding
      • pure sampling: randomly sample from the full distribution, instead of taking the argmax as in greedy decoding
      • top-n sampling: randomly sample from only the top-n most probable words (truncate the rest); n is another hyperparameter
        • increasing n: more diverse and risky output
        • decreasing n: more generic and safe output
    • Softmax temperature -- not actually a decoding algorithm, but a technique applied at test time in conjunction with a decoding algorithm (a minimal sketch follows this list)
      • apply a temperature hyperparameter \(\tau\) to the softmax: \[P_t(w) = \dfrac{\exp(s_w/\tau)}{\sum_{w' \in V}\exp(s_{w'}/\tau)}\]
      • larger \(\tau\): \(P_t\) becomes more uniform, giving more diverse output (probability is spread across the vocab)
  • Decoding algorithms: summary

    • Greedy
    • Beam search
    • Sampling methods
    • Softmax temperature
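
A minimal sketch (not from the original notes) of how softmax temperature and top-n sampling combine at decoding time; the helper name and the toy logits are made up for illustration:

import numpy as np

def sample_next_token(logits, tau=1.0, top_n=None):
    # Softmax with temperature: larger tau flattens P_t, smaller tau sharpens it.
    scaled = np.asarray(logits, dtype=float) / tau
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    if top_n is not None:
        # Top-n sampling: zero out everything outside the n most probable tokens,
        # then renormalize before sampling.
        keep = np.argsort(probs)[-top_n:]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
        probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)

# Toy vocabulary of 5 tokens: higher tau spreads probability, top_n truncates the tail.
logits = [2.0, 1.0, 0.5, 0.1, -1.0]
print(sample_next_token(logits, tau=0.7, top_n=3))
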
Read more »

Summary

This is the final report for DROM8110 Business Analytics Strategy. Using frameworks such as Contributed Value Analysis and Sustainability Analysis, it analyzes the business model, profitability outlook, and viable strategies of Uber's core business, ride-hailing.

Co-authors:

  • RUOMING GU (rg3266@columbia.edu)
  • YUXIN ZHANG (yz3718@columbia.edu)

Main text:

Read more »

CS224N Assignment #5

(7.20) Added the Pylance extension as the language server and turned on type checking mode (basic).

Written questions

    1. We learned in class that recurrent neural architectures can operate over variable length input (i.e., the shape of the model parameters is independent of the length of the input sentence). Is the same true of convolutional architectures? Write one sentence to explain why or why not.

    Window positions t ∈ {1, ..., m_word − k + 1}: m_word (the length of the longest word) can vary, and x_conv ∈ R^{e_word × (m_word − k + 1)} varies with it, but the kernel parameters depend only on k and the embedding size, so convolutional architectures can also operate over variable-length input.

  • (b)...if we use the kernel size k = 5, what will be the size of the padding (i.e. the additional number of zeros on each side) we need for the 1-dimensional convolution, such that there exists at least one window for all possible values of mword in our dataset?

    Extreme case: m_word = 1; with the one start and one end token the word already has length 3, so padding = 1 on each side gives length 5 and at least one window for k = 5.

    1. In step 4, we introduce a Highway Network with x_highway = x_gate ⊙ x_proj + (1 − x_gate) ⊙ x_conv_out. Since x_gate is the result of the sigmoid function, it has range (0, 1). Consider the two extreme cases: if x_gate → 0, then x_highway → x_conv_out; when x_gate → 1, then x_highway → x_proj. This means the Highway layer smoothly varies its behavior between that of a normal linear layer (x_proj) and that of a layer which simply passes its input (x_conv_out) through. Use one or two sentences to explain why this behavior is useful in character embeddings. Based on the definition x_gate = σ(W_gate x_conv_out + b_gate), do you think it is better to initialize b_gate to be negative or positive? Explain your reason briefly.

    Reason: this is exactly the "highway": with x_gate = 0 the layer can use the value of x_conv_out directly.

    We want x_gate to be small by default so the highway (pass-through) path is used, so b_gate should be initialized to a negative value. (A minimal PyTorch sketch of such a highway layer follows the written questions.)

    1. In Lecture 10, we briefly introduced Transformers, a non-recurrent sequence (or sequence-to-sequence) model with a sequence of attention-based transformer blocks. Describe 2 advantages of a Transformer encoder over the LSTM-with-attention encoder in our NMT model

    See <>: "Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence." At every step, connections are built between all the words in the sentence. It mainly uses three matrices, Key, Query, and Value, with \(\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T/\sqrt{d_k})V\) (see the 机器之心 article on Zhihu explaining self-attention with animations and code: https://zhuanlan.zhihu.com/p/96492170).

    Advantages of attention-based Transformers (Part 4 on p. 6 of the paper, "Why Self-Attention"):

    By not using an RNN it avoids problems such as vanishing and exploding gradients, moves from sequential computation to parallelized computation, learns "long-range dependencies in the network" more easily, and is more interpretable.
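
A minimal PyTorch sketch of a highway layer as described above, assuming a ReLU on x_proj and a negative gate-bias initialization so the layer starts close to a pass-through; the module and argument names are my own, not the assignment's starter code:

import torch
import torch.nn as nn

class Highway(nn.Module):
    # x_highway = x_gate * x_proj + (1 - x_gate) * x_conv_out
    def __init__(self, dim, gate_bias_init=-2.0):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Linear(dim, dim)
        # Negative bias keeps x_gate near 0 at first, so the input is mostly
        # passed straight through (the "highway" path).
        nn.init.constant_(self.gate.bias, gate_bias_init)

    def forward(self, x_conv_out):
        x_proj = torch.relu(self.proj(x_conv_out))
        x_gate = torch.sigmoid(self.gate(x_conv_out))
        return x_gate * x_proj + (1 - x_gate) * x_conv_out

x = torch.randn(4, 8)        # (batch, e_word) toy input
print(Highway(8)(x).shape)   # torch.Size([4, 8])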

1. Character-based convolutional encoder for NMT (36 points)

Read more »

CS224N A4: NMT Assignment

Note: Heavily inspired by the https://github.com/pcyin/pytorch_nmt repository

The assignment has two parts: Part 1 implements NMT with RNNs in code, and Part 2 analyzes the NMT system through written questions.

1. NMT with RNN

  • Bidirectional LSTM Encoder & Unidirectional LSTM Decoder
  • Errata:
    • h should be the embedding size
    • The subscript 1 in Eqs. (3) and (4) should be m
    1. utils.py
    • Every sentence in a batch must have the same length; pad_sents (padding) is implemented in utils.py
    2. model_embeddings.py
    • First read vocab.py to understand the class definitions.
    • The VocabEntry class is initialized with word2id (dict: word -> index); id2word returns the word for a given idx; from_corpus builds a VocabEntry instance from a corpus; from collections import Counter — a Counter can directly count how often each character appears in a string
    • The Vocab class holds the src and tgt languages; its constructor takes one VocabEntry per language; @staticmethod marks its static methods
    • VocabEntry.from_corpus creates a vocab_entry object. Vocab.build uses src_sents and tgt_sents to create the src and tgt vocab_entry objects and returns a Vocab(src, tgt) containing both
    • Use nn.Embedding to initialize the word embeddings.
    3. nmt_model.py
    • Initialize each layer with the dimensions given in the PDF
    4. The encode method in nmt_model.py
    • self.encoder is a bidirectional LSTM
    • encode takes two arguments: source_padded and source_lengths. The former is the padded (src_len, b) tensor in which each column is a sentence; the latter is a list of integers giving the actual number of words in each sentence.
    • It must return two values: enc_hiddens = h_enc for every position 1 <= i <= m (sentence length), i.e. every word of every sentence in the batch, and dec_init_state = (h_dec_0, c_dec_0)
    • The LSTM expects inputs in a canonical shape, so pack_padded_sequence / pad_packed_sequence are used to convert back and forth
    • The first step uses self.model_embeddings to turn source_padded into word embeddings
    5. The decode method
    • self.decoder is an nn.LSTMCell that returns h and c, but this part is wrapped inside step; within decode we obtain dec_state, combined_output, e_t from self.step
    • Again, first use model_embeddings to turn target_padded into Y, the target word embeddings of shape (tgt_len, b, e)
    • torch.split splits Y along dimension 0 into chunks of size 1, which amounts to operating word by word (over t)
    • Eq. (5) describes an iterative process; the combined_outputs we ultimately care about is the collection of the o_t
    • 07/23 erratum: while doing A5 I found that decode forgot to update o_prev
    6. The step method
    • step implements Eqs. (5) through (12).
    • The first part, Eqs. (5)-(7), uses bmm and (un)squeeze. Note that bmm reserves dimension 0 for batch_size and multiplies dimensions 1 and 2 of the two 3-D tensors, so those must be compatible; typically you unsqueeze at dim = 1 or 2 and squeeze again after multiplying
    • Note that changing the multiplication order or reshaping the dimensions differently can cause small precision differences in the final result.
    7. Written question: generate_sent_masks() produces enc_masks (b, src_len) marking whether each word of each sentence in the batch is padding; explain how this affects the attention computation and why it is necessary.
    • In step, Eq. (8) applies a softmax to get α_t; so that the subsequent a_t computation is unaffected by padding, α_t must be 0 at padded positions, i.e. e_t is set to -∞ there (see the sketch after this list)
    • Git setup: git remote add origin https://github.com/hy2632/cs224n.git

    • git push origin master

    • ..

    • Corpus BLEU: 31.892219171042335

    8. Comparison of attention types:

       | Attention Type | Advantage | Disadvantage |
       | --- | --- | --- |
       | Dot Product | no self.att_projection needed | the two dimensions must already match |
       | Multiplicative | - | - |
       | Additive | the tanh normalizes the values | two parameter matrices, so more parameters and higher space complexity |
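
A minimal sketch of the attention score, masking, and weighted-sum steps discussed above; the shapes follow the notes' (b, src_len, h) conventions, but the function name, signature, and toy values are my own, not the assignment's starter code:

import torch

def attention_step(dec_hidden, enc_hiddens_proj, enc_hiddens, enc_masks=None):
    # dec_hidden: (b, h); enc_hiddens_proj: (b, src_len, h); enc_hiddens: (b, src_len, 2h)
    # Attention scores e_t via batched matrix multiply: (b, src_len, h) x (b, h, 1) -> (b, src_len).
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(2)).squeeze(2)
    if enc_masks is not None:
        # Padding positions get -inf so their softmax weight α_t is exactly 0.
        e_t = e_t.masked_fill(enc_masks.bool(), -float('inf'))
    alpha_t = torch.softmax(e_t, dim=1)                              # Eq. (8): α_t, (b, src_len)
    # Attention-weighted sum of the encoder hidden states: (b, 1, src_len) x (b, src_len, 2h) -> (b, 2h).
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
    return a_t, alpha_t

# Toy shapes: batch of 2, source length 4, hidden size 3; 1s in enc_masks mark padding.
b, src_len, h = 2, 4, 3
a_t, alpha_t = attention_step(torch.randn(b, h), torch.randn(b, src_len, h),
                              torch.randn(b, src_len, 2 * h),
                              enc_masks=torch.tensor([[0, 0, 1, 1], [0, 0, 0, 1]]))
print(a_t.shape, alpha_t.shape)   # torch.Size([2, 6]) torch.Size([2, 4])
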
Read more »