Repeated DNA Sequences
All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.
Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.
For example,
Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT", Return: ["AAAAACCCCC", "CCCCCAAAAA"].
Just use an array as hashtable to avoid Memory Limit Exceed.
The total number of combinations of “ACGT” string in the length of 10 is 4^10 which is 2^20. We allocate a sort data type to each combination storing the number of occurrence of it.
The memory use is 1MB.
O(n) in time, O(1) in space.
Warning, when we use “<<” operator, remember to use parentheses to ensure the result is exactly what you want.
//A C G T //0 1 2 3 class Solution { public: vector<string> findRepeatedDnaSequences(string s) { if(s.size() <= 10) return vector<string>(); short map[0x00100000]; memset(map, 0, sizeof(map)); vector<string> ans; unordered_map<char, int> idxTable; idxTable['A'] = 0; idxTable['C'] = 1; idxTable['G'] = 2; idxTable['T'] = 3; for(size_t i = 0; i < s.size() - 9; i++){ string subs = s.substr(i, 10); int idx = 0; for(size_t j = 0; j < subs.size(); j++){ idx = (idx << 2) + idxTable[subs[j]];//BUG HERE, the priority of operators. } map[idx]++; if(map[idx] > 3) map[idx] = 3; //avoid overflow if(map[idx] == 2){ ans.push_back(subs); } } return ans; } };
O(n)space,force solution, Memory Limit Exceed.
class Solution { public: vector<string> findRepeatedDnaSequences(string s) { unordered_map<string, int> map; if(s.size() <= 10) return vector<string>(); for(size_t i = 0; i < s.size() - 9; i++){ string subs = s.substr(i, 10); if(map.find(subs) != map.end()){ map[subs] += 1; } else{ map[subs] = 1; } } vector<string> ans; for(auto it = map.begin(); it != map.end(); it++){ if((*it).second > 1){ ans.push_back((*it).first); } } return ans; } };