290. Word Pattern【E】【47】

A String Pattern-Matching Algorithm
This post describes an algorithm for deciding whether a string follows a given pattern. By building a bijection between letters and non-empty words, the algorithm solves the matching problem efficiently; a concrete implementation and examples are given below.

Given a pattern and a string str, find if str follows the same pattern.

Here follow means a full match, such that there is a bijection between a letter in pattern and a non-empty word in str.

Examples:

  1. pattern = "abba", str = "dog cat cat dog" should return true.
  2. pattern = "abba", str = "dog cat cat fish" should return false.
  3. pattern = "aaaa", str = "dog cat cat dog" should return false.
  4. pattern = "abba", str = "dog dog dog dog" should return false.

Notes:
You may assume pattern contains only lowercase letters, and str contains lowercase letters separated by a single space.

Credits:
Special thanks to @minglotus6 for adding this problem and creating all test cases.


class Solution(object):
    def wordPattern(self, pattern, str):
        words = str.split(' ')
        dic = {}

        # A full match requires one word per pattern letter.
        if len(words) != len(pattern):
            return False

        for i in range(len(words)):
            if pattern[i] not in dic:
                # Enforce the bijection: a word may be mapped by only one letter.
                if words[i] in dic.values():
                    return False
                dic[pattern[i]] = words[i]
            elif dic[pattern[i]] != words[i]:
                # The letter was seen before but mapped to a different word.
                return False

        return True
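The `dic.values()` membership test above scans all mapped words each time a new letter appears. As a sketch of a common alternative (not part of the original solution; the class name `Solution2` is chosen here for illustration), two dictionaries can enforce both directions of the bijection in constant time per pair:

```python
class Solution2(object):
    def wordPattern(self, pattern, s):
        words = s.split(' ')
        if len(words) != len(pattern):
            return False
        letter_to_word = {}  # letter -> word
        word_to_letter = {}  # word -> letter
        for letter, word in zip(pattern, words):
            # setdefault stores the mapping on first sight and
            # returns the existing one otherwise; a mismatch in
            # either direction breaks the bijection.
            if letter_to_word.setdefault(letter, word) != word:
                return False
            if word_to_letter.setdefault(word, letter) != letter:
                return False
        return True
```

On the four examples above this returns `True`, `False`, `False`, `False` respectively; example 4 fails because `dog` would need to map back to both `a` and `b`.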


```python
for idx, row in self.corpus.iterrows():
    text = str(row.get('text', ''))  # Get 'text' column, default to empty if missing
    words = self.cleaned_vocab(text)
    for word in words:
        self.word_freq[word] = self.word_freq.get(word, 0) + 1
```

### ✅ Explanation:

This code **iterates through each row** of a text corpus (a DataFrame), extracts and cleans the text, then **builds a frequency dictionary** (`self.word_freq`) that counts how many times each cleaned word appears across the entire dataset.

It's commonly used during text preprocessing in NLP for tasks like:

- Building vocabulary
- Feature selection (e.g., removing very rare words)
- Data analysis (e.g., finding most common words)

---

#### 🔍 Line-by-Line Breakdown:

```python
for idx, row in self.corpus.iterrows():
```

- `.iterrows()` is a Pandas method that allows iteration over the DataFrame rows as `(index, Series)` pairs.
  - `idx`: the index (row number or label)
  - `row`: a Pandas `Series` representing one row of the DataFrame

> ⚠️ Note: `.iterrows()` is convenient but not the fastest; for large datasets, consider vectorized operations instead.

---

```python
text = str(row.get('text', ''))  # Get 'text' column, default to empty if missing
```

- Uses `.get('text', '')` to safely retrieve the value from the `'text'` column.
  - If the column exists → returns its value
  - If it doesn't exist → returns the default, an empty string `''` (note that a `NaN` *value* in an existing column passes through and becomes the string `'nan'` after `str(...)`)
- Wraps with `str(...)` to ensure the result is a string (even if the original was numeric or `NaN`).

✅ This prevents errors when passing non-string types to cleaning functions.

---

```python
words = self.cleaned_vocab(text)
```

- Calls a previously defined method (likely called `cleaned_vocab`) that:
  - Cleans the raw text (removes punctuation, etc.)
  - Splits it into tokens
  - Filters out stopwords and short/invalid words
- Returns a list of meaningful words (tokens), e.g., `['great', 'movie', 'love']`

We assume this function was defined earlier (as in your previous question).
---

```python
for word in words:
    self.word_freq[word] = self.word_freq.get(word, 0) + 1
```

- Loops over each cleaned word in the current document.
- Updates a frequency counter stored in `self.word_freq`, which should be initialized earlier (e.g., as `self.word_freq = {}`).
- Uses `.get(word, 0)`:
  - If `word` is already in the dictionary → get its current count
  - If not → return `0` as the default
- Then increments by 1 and stores the result back.

💡 This pattern avoids needing to check `if word in self.word_freq`.

✅ Example:

```python
# Initial:                  self.word_freq = {}
# After seeing "good":      {'good': 1}
# Later see "good" again:   {'good': 2}
```

---

### 🧩 Full Context Example

Assume `self.corpus` contains:

| text | label |
|------|-------|
| "I love this movie!" | 1 |
| "Great acting, great cast!" | 1 |

And `self.cleaned_vocab(...)` returns:

```python
"I love this movie!"         → ['love', 'movie']
"Great acting, great cast!"  → ['great', 'acting', 'cast', 'great']
```

Then after processing both rows:

```python
self.word_freq = {
    'love': 1,
    'movie': 1,
    'great': 2,
    'acting': 1,
    'cast': 1
}
```

Now you can analyze word importance, filter rare words, or visualize top terms.

---

### ✅ Best Practices & Improvements

| Enhancement | Why It Helps |
|-------------|--------------|
| Initialize `self.word_freq = {}` in `__init__` | Avoids `AttributeError` |
| Use `collections.Counter` instead | More efficient and provides built-in tools |
| Use `apply()` + vectorization for speed | Faster than `iterrows()` on big data |

#### 💡 Improved Version Using `Counter`:

```python
from collections import Counter

# Instead of the loop:
all_words = []
for _, row in self.corpus.iterrows():
    text = str(row.get('text', ''))
    all_words.extend(self.cleaned_vocab(text))

self.word_freq = Counter(all_words)
```

Or even more concise:

```python
texts = self.corpus['text'].fillna('').astype(str)
all_words = texts.apply(lambda x: self.cleaned_vocab(x)).sum()
self.word_freq = Counter(all_words)
```
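To make the `Counter` approach concrete, here is a self-contained sketch with a plain list standing in for the DataFrame and a toy `cleaned_vocab` standing in for the real cleaning method (the real one presumably also filters stopwords; this stand-in only lowercases, strips punctuation, and splits):

```python
from collections import Counter

def cleaned_vocab(text):
    # Toy stand-in for the real cleaner: replace punctuation with
    # spaces, lowercase, and split on whitespace.
    stripped = ''.join(ch if ch.isalnum() or ch.isspace() else ' ' for ch in text)
    return stripped.lower().split()

corpus = ["I love this movie!", "Great acting, great cast!"]

all_words = []
for text in corpus:
    all_words.extend(cleaned_vocab(text))

word_freq = Counter(all_words)
print(word_freq['great'])  # 2
```

Because `Counter` is a `dict` subclass, lookups of unseen words return `0` instead of raising `KeyError`, and `word_freq.most_common(n)` gives the top terms directly.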