#urlteam 2018-12-14,Fri


Time Nickname Message
00:36 🔗 buckket has joined #urlteam
00:42 🔗 Mayonaise has joined #urlteam
01:12 🔗 Flashfire somebody2 or chfoo azon.biz would be a good one to add to the tracker
01:26 🔗 Flashfire Also how does the custom code work? Does it not request urls it has already found or something?
01:31 🔗 JAA The custom code just deals with the nastiness of some of the shorteners, I think. And I believe it's also responsible for the "random" order in which the shortcodes are retrieved for those shorteners.
01:32 🔗 JAA Normal shorteners, i.e. ones without custom code, are handled straightforwardly: the shortcodes are simply retrieved in order, from e.g. aaaaa to ZZZZZ.
01:32 🔗 JAA Well, split into blocks, but you get the idea.
01:33 🔗 Flashfire Alright, so there is no way to avoid re-requesting URLs we already know?
01:33 🔗 JAA Ah no, we only scan each code once.
01:34 🔗 JAA But for those shorteners, the order in which the codes are attempted appears random. I assume it's because those shorteners block attempts to do it in order or something.
01:34 🔗 Flashfire OK, because for example I am noticing 0rz-tw is getting 1 or 2 hits per batch of 50, so next time we scan 0rz-tw, will we exclude ones we have already grabbed?
01:35 🔗 JAA Oh, I see what you mean. No, I don't think we can exclude those. That would mean that we'd have to have the list of discovered URLs available on the worker or something.
01:36 🔗 JAA So if we decide to recrawl a service, we'd start from scratch again.
01:36 🔗 Flashfire ok
01:36 🔗 JAA But that only applies to the shorteners that create random shortcodes.
01:37 🔗 JAA For shorteners which use incremental codes, we can obviously simply continue from the last ID where the previous scrape found a result.
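The in-order block scanning described above could be sketched roughly as follows. This is an illustrative reconstruction, not the actual tracker code; the block size and starting sequence number are hypothetical.

```python
def make_blocks(start, count, block_size=50):
    """Yield consecutive (first, last) ranges of sequence numbers.

    For incremental shorteners, a recrawl can resume by passing the
    last sequence number where the previous scrape found a result
    as `start`, instead of beginning from scratch.
    """
    for first in range(start, start + count, block_size):
        last = min(first + block_size - 1, start + count - 1)
        yield (first, last)
```

For example, `make_blocks(100, 120, 50)` splits sequence numbers 100-219 into three blocks that workers can claim independently.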
01:37 🔗 Flashfire I think we should recrawl incremental ones once a year
01:37 🔗 Flashfire at least
03:56 🔗 teej_ Flashfire: But from my understanding, the URLs from the incremental ones don't change.
03:57 🔗 Somebody2 whoops, sorry I missed that buff.ly is a bit.ly alias
03:57 🔗 Somebody2 we can stop it, yes
03:57 🔗 teej_ VariXx: No need to worry. If you have any questions, you can just ask.
03:58 🔗 Somebody2 turned off the auto-queue
03:58 🔗 Somebody2 for buff.ly
03:59 🔗 Somebody2 azon.biz looks feasible; let me set that up now
03:59 🔗 teej_ VariXx: Also, if you want, you can just use the URLTeam script, so it can take less of your computing resources.
04:01 🔗 Somebody2 azon-biz started
04:01 🔗 Somebody2 someone other than me should update the wiki page
04:01 🔗 Somebody2 pretty please... :-)
04:01 🔗 teej_ Somebody2: Are all of the URLs recorded and uploaded back to the tracker?
04:02 🔗 Somebody2 teej_: yes... as opposed to what?
04:02 🔗 teej_ And because the URLs are static, the tracker doesn't ask for the ones that have responses, right?
04:03 🔗 Somebody2 teej_: I'm not following
04:03 🔗 Somebody2 Individual Warrior instances request a set of shortcodes to try, from the tracker.
04:03 🔗 teej_ Does the tracker send a list of shortened URLs to the client?
04:04 🔗 Somebody2 They try each of them, noting down the ones that redirect (and where), and send that back to the tracker.
04:04 🔗 Somebody2 Also, if trying a particular one gives back something weird, that also gets sent back to the tracker, and shows up in the Error Reports
04:05 🔗 teej_ Okay. And those redirected shortened URLs are never queued again, right?
04:05 🔗 Somebody2 once the tracker gets back data on a particular set, it marks that as complete
04:06 🔗 Somebody2 As for deciding which set of shortcodes to send out -- each shortener has an alphabet that converts the shortcodes to/from sequence numbers
04:06 🔗 Somebody2 there's a feature, called AutoQueue, that will keep creating new sets of shortcodes with increasing seq#s, maintaining a specified number of ones outstanding
04:06 🔗 Somebody2 at once
04:07 🔗 Somebody2 if an admin resets the starting seq number, it will happily re-try the same set of shortcodes
04:07 🔗 Somebody2 but otherwise it won't
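The shortcode-to-sequence-number conversion described above is essentially base-N encoding over each shortener's alphabet. A minimal sketch, assuming a hypothetical 36-character alphabet and fixed code width (real shorteners each define their own):

```python
# Hypothetical alphabet; each shortener has its own.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"
BASE = len(ALPHABET)

def seq_to_code(seq, width=5):
    """Convert a sequence number to a fixed-width shortcode."""
    code = []
    for _ in range(width):
        seq, rem = divmod(seq, BASE)
        code.append(ALPHABET[rem])
    return "".join(reversed(code))

def code_to_seq(code):
    """Convert a shortcode back to its sequence number."""
    seq = 0
    for ch in code:
        seq = seq * BASE + ALPHABET.index(ch)
    return seq
```

With such a mapping, AutoQueue only needs to track a single increasing sequence number; converting it to and from shortcodes gives the sets of codes handed out to workers.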
04:07 🔗 teej_ It will re-try the same URLs even though we know it redirected before?
04:08 🔗 Somebody2 the data is stored on the tracker until a given number of results are accumulated (generally every couple days) at which time the results are uploaded to IA, and removed from the tracker
04:08 🔗 Somebody2 The tracker doesn't know anything about previous data -- it's not that smart
04:09 🔗 Somebody2 it just creates new sets of shortcodes to try, and collects the results of the individual warriors trying them
04:10 🔗 teej_ So couldn't many URLs be taken out of the queue if we already know where they redirect from the previous try?
04:11 🔗 teej_ This way, when it's time to re-try, the queue becomes smaller... ?
04:11 🔗 Somebody2 Manually, yes -- you may have seen on the wiki page that I noted down the range of seq#s checked for various shorteners; that's what that is for.
04:12 🔗 Somebody2 teej_: No, storing the list of which shortcodes previously redirected on the tracker would be pretty heavyweight. I suppose we could do it, but it's probably not worth it
04:12 🔗 Somebody2 also, it's informative to know that a shortcode that *used* to redirect somewhere *still does*
04:12 🔗 teej_ Oh. I see.
04:12 🔗 Somebody2 some shorteners re-use shortcodes, and/or block some for spam, or other reasons. It's nice to have a record of that
04:13 🔗 Somebody2 thanks for asking about this!
04:13 🔗 teej_ Oh! Okay. I thought they are generally static.
04:13 🔗 Somebody2 they are *generally* static, but not *universally* static
04:14 🔗 teej_ Somebody2: Thanks for answering my questions! It makes more sense now.
04:14 🔗 Somebody2 :-)
04:16 🔗 odemg has quit IRC (Ping timeout: 246 seconds)
04:30 🔗 odemg has joined #urlteam
04:52 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
04:54 🔗 dashcloud has joined #urlteam
05:06 🔗 Flashfire somebody2 I have one more question. If a block returns 50/50 or 25/25, does that block get removed from circulation?
05:06 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
05:08 🔗 dashcloud has joined #urlteam
05:13 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
05:15 🔗 dashcloud has joined #urlteam
05:24 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
05:26 🔗 dashcloud has joined #urlteam
09:39 🔗 teej_ has quit IRC (Ping timeout: 252 seconds)
09:44 🔗 voltagex_ has quit IRC (Ping timeout: 260 seconds)
09:46 🔗 HCross has quit IRC (Ping timeout: 260 seconds)
09:47 🔗 voltagex_ has joined #urlteam
09:49 🔗 HCross has joined #urlteam
09:49 🔗 chr1sm has quit IRC (Ping timeout: 260 seconds)
09:49 🔗 svchfoo3 sets mode: +o HCross
09:50 🔗 svchfoo1 sets mode: +o HCross
09:51 🔗 chr1sm has joined #urlteam
09:55 🔗 teej_ has joined #urlteam
12:34 🔗 kiska1 has joined #urlteam
13:33 🔗 yuitimoth has quit IRC (Read error: Operation timed out)
13:33 🔗 yuitimoth has joined #urlteam
13:33 🔗 zerkalo has quit IRC (Read error: Operation timed out)
13:34 🔗 zerkalo has joined #urlteam
13:34 🔗 treora has quit IRC (Ping timeout: 268 seconds)
13:34 🔗 BnAboyZ has quit IRC (Ping timeout: 268 seconds)
13:35 🔗 treora has joined #urlteam
13:35 🔗 Frogging has quit IRC (Ping timeout: 268 seconds)
13:35 🔗 BnAboyZ has joined #urlteam
13:38 🔗 Frogging has joined #urlteam
13:40 🔗 Jusque has quit IRC (Read error: Operation timed out)
13:42 🔗 Jusque has joined #urlteam
13:44 🔗 moufu has joined #urlteam
13:47 🔗 moufu_ has quit IRC (Read error: Connection reset by peer)
13:48 🔗 kiska1 has quit IRC (Read error: Connection reset by peer)
13:52 🔗 wmvhater has joined #urlteam
13:52 🔗 kiska1 has joined #urlteam
14:13 🔗 SilSte has quit IRC (Read error: Operation timed out)
18:14 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
18:16 🔗 dashcloud has joined #urlteam
18:20 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
18:21 🔗 dashcloud has joined #urlteam
18:47 🔗 dashcloud has quit IRC (No Ping reply in 180 seconds.)
18:48 🔗 dashcloud has joined #urlteam
