[00:36] *** buckket has joined #urlteam
[00:42] *** Mayonaise has joined #urlteam
[01:12] somebody2 or chfoo azon.biz would be a good one to add to the tracker
[01:26] Also how does the custom code work? Does it not request urls it has already found or something?
[01:31] The custom code just deals with the nastiness of some of the shorteners, I think. And I believe it's also responsible for the "random" order in which the shortcodes are retrieved for those shorteners.
[01:32] Normal shorteners, i.e. ones without custom code, are treated fairly simply and are simply retrieved in order from e.g. aaaaa to ZZZZZ.
[01:32] Well, split into blocks, but you get the idea.
[01:33] Alright so there is no way to take out from re requesting urls we already know
[01:33] Ah no, we only scan each code once.
[01:34] But for those shorteners, the order in which the codes are attempted appears random. I assume it's because those shorteners block attempts to do it in order or something.
[01:34] Ok cause for example I am noticing 0rz-tw is getting 1 or 2 for batches of 50 so next time we try and scan 0rz-tw we will exclude ones we have already grabbed?
[01:35] Oh, I see what you mean. No, I don't think we can exclude those. That would mean that we'd have to have the list of discovered URLs available on the worker or something.
[01:36] So if we decide to recrawl a service, we'd start from scratch again.
[01:36] ok
[01:36] But that only applies to the shorteners that create random shortcodes.
[01:37] For shorteners which use incremental codes, we can obviously simply continue from the last ID where the previous scrape found a result.
[01:37] I think we should recrawl incremental ones once a year
[01:37] at least
[03:56] Flashfire: But from my understanding, the URLs from the incremental ones don't change.
[03:57] whoops, sorry I missed that buff.ly is a bit.ly alias
[03:57] we can stop it, yes
[03:57] VariXx: No need to worry. If you have any questions, you can just ask.
[03:58] turned off the auto-queue
[03:58] for buff.ly
[03:59] azon.biz looks feasible; let me set that up now
[03:59] VariXx: Also, if you want, you can just use the URLTeam script, so it can take less of your computing resources.
[04:01] azon-biz started
[04:01] someone other than me should update the wiki page
[04:01] pretty please... :-)
[04:01] Somebody2: Are all of the URLs recorded and uploaded back to the tracker?
[04:02] teej_: yes... as opposed to what?
[04:02] And because the URs are static, the tracker doesn't ask for the ones that have responses, right?
[04:02] URLs*
[04:03] teej_: I'm not following
[04:03] Individual Warrior instances request a set of shortcodes to try, from the tracker.
[04:03] Does the tracker send a list of shortened URLs to the client?
[04:04] They try each of them, noting down the ones that redirect (and where), and send that back to the tracker.
[04:04] Also, if trying a particular one gives back something weird, that also gets sent back to the tracker, and shows up in the Error Reports
[04:05] Okay. And those redirected shortened URLs are never queued again, right?
[04:05] once the tracker gets back data on a particular set, it marks that as complete
[04:06] As for deciding which set of shortcodes to send out -- each shortener has an alphabet that converts the shortcodes to/from sequence numbers
[04:06] there's a feature, called AutoQueue, that will keep creating new sets of shortcodes with increasing seq#s, maintaining a specified number of ones outstanding
[04:06] at once
[04:07] if an admin resets the starting seq number, it will happily re-try the same set of shortcodes
[04:07] but otherwise it won't
[04:07] It will re-try the same URLs even though we know it redirected before?
[04:08] the data is stored on the tracker until a given number of results are accumulated (generally every couple days) at which time the results are uploaded to IA, and removed from the tracker
[04:08] The tracker doesn't know anything about previous data -- it's not that smart
[04:09] it just creates new sets of shortcodes to try, and collects the results of the individual warriors trying them
[04:10] So couldn't many URLs be taken out of the queue if we already know where they redirect from the previous try?
[04:11] This way, when it's time to re-try, the queue becomes smaller... ?
[04:11] Manually, yes -- if you see on the wiki page that I noted down the range of seq# checked for various shorteners; that's what that is for.
[04:12] teej_: No, storing the list of which shortcodes previously redirected on the tracker would be pretty heavyweight. I suppose we could do it, but it's probably not worth it
[04:12] also, it's informative to know that a shortcode that *used* to redirect somewhere *still does*
[04:12] Oh. I see.
[04:12] some shorteners re-use shortcodes, and/or block some for spam, or other reasons. It's nice to have a record of that
[04:13] thanks for asking about this!
[04:13] Oh! Okay. I thought they are generally static.
[04:13] they are *generally* static, but not *universally* static
[04:14] Somebody2: Thanks for answering my questions! It makes more sense now.
[04:14] :-)
[04:16] *** odemg has quit IRC (Ping timeout: 246 seconds)
[04:30] *** odemg has joined #urlteam
[04:52] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[04:54] *** dashcloud has joined #urlteam
[05:06] somebody2 I have one more question. If a block returns 50/50 or 25/25 does that block get removed from circulation
[05:06] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[05:08] *** dashcloud has joined #urlteam
[05:13] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[05:15] *** dashcloud has joined #urlteam
[05:24] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[05:26] *** dashcloud has joined #urlteam
[09:39] *** teej_ has quit IRC (Ping timeout: 252 seconds)
[09:44] *** voltagex_ has quit IRC (Ping timeout: 260 seconds)
[09:46] *** HCross has quit IRC (Ping timeout: 260 seconds)
[09:47] *** voltagex_ has joined #urlteam
[09:49] *** HCross has joined #urlteam
[09:49] *** chr1sm has quit IRC (Ping timeout: 260 seconds)
[09:49] *** svchfoo3 sets mode: +o HCross
[09:50] *** svchfoo1 sets mode: +o HCross
[09:51] *** chr1sm has joined #urlteam
[09:55] *** teej_ has joined #urlteam
[12:34] *** kiska1 has joined #urlteam
[13:33] *** yuitimoth has quit IRC (Read error: Operation timed out)
[13:33] *** yuitimoth has joined #urlteam
[13:33] *** zerkalo has quit IRC (Read error: Operation timed out)
[13:34] *** zerkalo has joined #urlteam
[13:34] *** treora has quit IRC (Ping timeout: 268 seconds)
[13:34] *** BnAboyZ has quit IRC (Ping timeout: 268 seconds)
[13:35] *** treora has joined #urlteam
[13:35] *** Frogging has quit IRC (Ping timeout: 268 seconds)
[13:35] *** BnAboyZ has joined #urlteam
[13:38] *** Frogging has joined #urlteam
[13:40] *** Jusque has quit IRC (Read error: Operation timed out)
[13:42] *** Jusque has joined #urlteam
[13:44] *** moufu has joined #urlteam
[13:47] *** moufu_ has quit IRC (Read error: Connection reset by peer)
[13:48] *** kiska1 has quit IRC (Read error: Connection reset by peer)
[13:52] *** wmvhater has joined #urlteam
[13:52] *** kiska1 has joined #urlteam
[14:13] *** SilSte has quit IRC (Read error: Operation timed out)
[18:14] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[18:16] *** dashcloud has joined #urlteam
[18:20] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[18:21] *** dashcloud has joined #urlteam
[18:47] *** dashcloud has quit IRC (No Ping reply in 180 seconds.)
[18:48] *** dashcloud has joined #urlteam
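
Note on the 04:06 explanation: the log says each shortener has an alphabet that converts shortcodes to/from sequence numbers, and that AutoQueue keeps creating new sets with increasing seq#s while maintaining a specified number outstanding. A minimal sketch of that idea follows. It is not the tracker's actual code: the alphabet, the plain positional base-N encoding, the half-open block ranges, and the names `seq_to_shortcode`, `shortcode_to_seq`, and `next_blocks` are all assumptions made for illustration.

```python
# Hypothetical alphabet; real shorteners each have their own.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"


def seq_to_shortcode(seq, alphabet=ALPHABET):
    """Convert a non-negative sequence number to a shortcode (plain base-N, assumed)."""
    if seq < 0:
        raise ValueError("sequence number must be non-negative")
    base = len(alphabet)
    if seq == 0:
        return alphabet[0]
    digits = []
    while seq > 0:
        seq, rem = divmod(seq, base)
        digits.append(alphabet[rem])
    return "".join(reversed(digits))


def shortcode_to_seq(code, alphabet=ALPHABET):
    """Convert a shortcode back to its sequence number."""
    base = len(alphabet)
    seq = 0
    for ch in code:
        seq = seq * base + alphabet.index(ch)
    return seq


def next_blocks(next_seq, outstanding, target_outstanding, block_size):
    """Create new (start, end) blocks, end exclusive, until the target number
    of outstanding blocks is reached. Returns the new blocks and the updated
    next sequence number. Nothing here remembers earlier results, so resetting
    next_seq simply hands out the same ranges again."""
    blocks = []
    while outstanding + len(blocks) < target_outstanding:
        blocks.append((next_seq, next_seq + block_size))
        next_seq += block_size
    return blocks, next_seq


if __name__ == "__main__":
    blocks, upcoming = next_blocks(100, outstanding=1, target_outstanding=3, block_size=50)
    for start, end in blocks:
        print(start, end, seq_to_shortcode(start), seq_to_shortcode(end - 1))
```

Because the sketch keeps only a counter and no record of past results, it also shows why, as stated at 04:07 and 04:08, resetting the starting seq number re-tries the same shortcodes.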