Time |
Nickname |
Message |
00:36
🔗
|
|
buckket has joined #urlteam |
00:42
🔗
|
|
Mayonaise has joined #urlteam |
01:12
🔗
|
Flashfire |
Somebody2 or chfoo: azon.biz would be a good one to add to the tracker |
01:26
🔗
|
Flashfire |
Also how does the custom code work? Does it not request urls it has already found or something? |
01:31
🔗
|
JAA |
The custom code just deals with the nastiness of some of the shorteners, I think. And I believe it's also responsible for the "random" order in which the shortcodes are retrieved for those shorteners. |
01:32
🔗
|
JAA |
Normal shorteners, i.e. ones without custom code, are handled fairly simply: they are retrieved in order from e.g. aaaaa to ZZZZZ. |
01:32
🔗
|
JAA |
Well, split into blocks, but you get the idea. |
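A minimal sketch of the in-order scan JAA describes, assuming fixed-length shortcodes and an alphabet of lowercase then uppercase letters. The alphabet, the block size of 50, and the function names are illustrative assumptions, not taken from the tracker's code.

```python
from itertools import islice, product
from string import ascii_lowercase, ascii_uppercase

# Assumed ordering: a..z then A..Z, so a 5-character scan runs aaaaa .. ZZZZZ.
ALPHABET = ascii_lowercase + ascii_uppercase


def shortcodes_in_order(length, alphabet=ALPHABET):
    """Yield every shortcode of the given length in alphabet order."""
    for chars in product(alphabet, repeat=length):
        yield "".join(chars)


def blocks(codes, block_size):
    """Group an iterable of shortcodes into lists of at most block_size items."""
    iterator = iter(codes)
    while True:
        block = list(islice(iterator, block_size))
        if not block:
            return
        yield block


if __name__ == "__main__":
    first_block = next(blocks(shortcodes_in_order(5), 50))
    print(first_block[0], "...", first_block[-1])
```

Each block would then be handed to one worker, which is the "split into blocks" part of the description.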
01:33
🔗
|
Flashfire |
Alright, so there is no way to avoid re-requesting URLs we already know? |
01:33
🔗
|
JAA |
Ah no, we only scan each code once. |
01:34
🔗
|
JAA |
But for those shorteners, the order in which the codes are attempted appears random. I assume it's because those shorteners block attempts to do it in order or something. |
01:34
🔗
|
Flashfire |
OK, because for example I am noticing 0rz-tw is getting 1 or 2 results for batches of 50. So next time we try to scan 0rz-tw, will we exclude the ones we have already grabbed? |
01:35
🔗
|
JAA |
Oh, I see what you mean. No, I don't think we can exclude those. That would mean that we'd have to have the list of discovered URLs available on the worker or something. |
01:36
🔗
|
JAA |
So if we decide to recrawl a service, we'd start from scratch again. |
01:36
🔗
|
Flashfire |
ok |
01:36
🔗
|
JAA |
But that only applies to the shorteners that create random shortcodes. |
01:37
🔗
|
JAA |
For shorteners which use incremental codes, we can obviously simply continue from the last ID where the previous scrape found a result. |
01:37
🔗
|
Flashfire |
I think we should recrawl incremental ones once a year |
01:37
🔗
|
Flashfire |
at least |
03:56
🔗
|
teej_ |
Flashfire: But from my understanding, the URLs from the incremental ones don't change. |
03:57
🔗
|
Somebody2 |
whoops, sorry I missed that buff.ly is a bit.ly alias |
03:57
🔗
|
Somebody2 |
we can stop it, yes |
03:57
🔗
|
teej_ |
VariXx: No need to worry. If you have any questions, you can just ask. |
03:58
🔗
|
Somebody2 |
turned off the auto-queue |
03:58
🔗
|
Somebody2 |
for buff.ly |
03:59
🔗
|
Somebody2 |
azon.biz looks feasible; let me set that up now |
03:59
🔗
|
teej_ |
VariXx: Also, if you want, you can just use the URLTeam script, so it can take less of your computing resources. |
04:01
🔗
|
Somebody2 |
azon-biz started |
04:01
🔗
|
Somebody2 |
someone other than me should update the wiki page |
04:01
🔗
|
Somebody2 |
pretty please... :-) |
04:01
🔗
|
teej_ |
Somebody2: Are all of the URLs recorded and uploaded back to the tracker? |
04:02
🔗
|
Somebody2 |
teej_: yes... as opposed to what? |
04:02
🔗
|
teej_ |
And because the URs are static, the tracker doesn't ask for the ones that have responses, right? |
04:02
🔗
|
teej_ |
URLs* |
04:03
🔗
|
Somebody2 |
teej_: I'm not following |
04:03
🔗
|
Somebody2 |
Individual Warrior instances request a set of shortcodes to try, from the tracker. |
04:03
🔗
|
teej_ |
Does the tracker send a list of shortened URLs to the client? |
04:04
🔗
|
Somebody2 |
They try each of them, noting down the ones that redirect (and where), and send that back to the tracker. |
04:04
🔗
|
Somebody2 |
Also, if trying a particular one gives back something weird, that also gets sent back to the tracker, and shows up in the Error Reports |
04:05
🔗
|
teej_ |
Okay. And those redirected shortened URLs are never queued again, right? |
04:05
🔗
|
Somebody2 |
once the tracker gets back data on a particular set, it marks that as complete |
04:06
🔗
|
Somebody2 |
As for deciding which set of shortcodes to send out -- each shortener has an alphabet that converts the shortcodes to/from sequence numbers |
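One way to picture the alphabet mapping Somebody2 mentions is ordinary positional notation, where the alphabet's characters act as digits. The sketch below assumes that interpretation; the alphabet string and function names are hypothetical, and the real tracker may handle details such as fixed-length codes differently.

```python
# Hypothetical alphabet; each shortener would configure its own.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"


def seq_to_shortcode(seq, alphabet=ALPHABET):
    """Convert a non-negative sequence number to a shortcode."""
    if seq < 0:
        raise ValueError("sequence number must be non-negative")
    base = len(alphabet)
    if seq == 0:
        return alphabet[0]
    chars = []
    while seq > 0:
        seq, remainder = divmod(seq, base)
        chars.append(alphabet[remainder])
    # Digits were produced least significant first, so reverse them.
    return "".join(reversed(chars))


def shortcode_to_seq(code, alphabet=ALPHABET):
    """Convert a shortcode back to its sequence number."""
    base = len(alphabet)
    seq = 0
    for char in code:
        # str.index raises ValueError for characters outside the alphabet.
        seq = seq * base + alphabet.index(char)
    return seq


if __name__ == "__main__":
    for number in (0, 35, 36, 1000):
        code = seq_to_shortcode(number)
        print(number, code, shortcode_to_seq(code))
```

In this plain base conversion the first alphabet character behaves like a leading zero, so "a" and "aa" both map to 0; a shortener that treats those as distinct codes would need a different scheme.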
04:06
🔗
|
Somebody2 |
there's a feature, called AutoQueue, that will keep creating new sets of shortcodes with increasing seq#s, maintaining a specified number of ones outstanding |
04:06
🔗
|
Somebody2 |
at once |
04:07
🔗
|
Somebody2 |
if an admin resets the starting seq number, it will happily re-try the same set of shortcodes |
04:07
🔗
|
Somebody2 |
but otherwise it won't |
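The AutoQueue behaviour described above can be sketched roughly as follows. This is an illustrative model only: the class name, method names, and block-size parameter are assumptions, not the tracker's actual implementation.

```python
class AutoQueue:
    """Toy model: keep a target number of shortcode blocks outstanding."""

    def __init__(self, start_seq, block_size, target_outstanding):
        self.next_seq = start_seq
        self.block_size = block_size
        self.target_outstanding = target_outstanding
        self.outstanding = {}  # block id -> (first seq#, last seq#)
        self._next_id = 0

    def refill(self):
        """Create new blocks with increasing seq#s until the target is met."""
        created = []
        while len(self.outstanding) < self.target_outstanding:
            first = self.next_seq
            last = first + self.block_size - 1
            self.outstanding[self._next_id] = (first, last)
            created.append(self._next_id)
            self._next_id += 1
            self.next_seq = last + 1
        return created

    def complete(self, block_id):
        """Mark a block as complete once its results have come back."""
        del self.outstanding[block_id]

    def reset(self, start_seq):
        """Admin reset: later blocks start again from start_seq."""
        self.next_seq = start_seq


if __name__ == "__main__":
    queue = AutoQueue(start_seq=0, block_size=50, target_outstanding=3)
    queue.refill()
    print(queue.outstanding)
```

Because the model keeps no record of earlier results, calling `reset(0)` makes the next `refill()` hand out the same seq# range again, which matches the "happily re-try" behaviour described in the log.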
04:07
🔗
|
teej_ |
It will re-try the same URLs even though we know it redirected before? |
04:08
🔗
|
Somebody2 |
the data is stored on the tracker until a given number of results have accumulated (generally every couple of days), at which point the results are uploaded to IA and removed from the tracker |
04:08
🔗
|
Somebody2 |
The tracker doesn't know anything about previous data -- it's not that smart |
04:09
🔗
|
Somebody2 |
it just creates new sets of shortcodes to try, and collects the results of the individual Warriors trying them |
04:10
🔗
|
teej_ |
So couldn't many URLs be taken out of the queue if we already know where they redirect from the previous try? |
04:11
🔗
|
teej_ |
This way, when it's time to re-try, the queue becomes smaller... ? |
04:11
🔗
|
Somebody2 |
Manually, yes -- you can see on the wiki page that I noted down the range of seq# checked for various shorteners; that's what that is for. |
04:12
🔗
|
Somebody2 |
teej_: No, storing the list of which shortcodes previously redirected on the tracker would be pretty heavyweight. I suppose we could do it, but it's probably not worth it |
04:12
🔗
|
Somebody2 |
also, it's informative to know that a shortcode that *used* to redirect somewhere *still does* |
04:12
🔗
|
teej_ |
Oh. I see. |
04:12
🔗
|
Somebody2 |
some shorteners re-use shortcodes, and/or block some for spam, or other reasons. It's nice to have a record of that |
04:13
🔗
|
Somebody2 |
thanks for asking about this! |
04:13
🔗
|
teej_ |
Oh! Okay. I thought they were generally static. |
04:13
🔗
|
Somebody2 |
they are *generally* static, but not *universally* static |
04:14
🔗
|
teej_ |
Somebody2: Thanks for answering my questions! It makes more sense now. |
04:14
🔗
|
Somebody2 |
:-) |
04:16
🔗
|
|
odemg has quit IRC (Ping timeout: 246 seconds) |
04:30
🔗
|
|
odemg has joined #urlteam |
04:52
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
04:54
🔗
|
|
dashcloud has joined #urlteam |
05:06
🔗
|
Flashfire |
Somebody2: I have one more question. If a block returns 50/50 or 25/25, does that block get removed from circulation? |
05:06
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
05:08
🔗
|
|
dashcloud has joined #urlteam |
05:13
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
05:15
🔗
|
|
dashcloud has joined #urlteam |
05:24
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
05:26
🔗
|
|
dashcloud has joined #urlteam |
09:39
🔗
|
|
teej_ has quit IRC (Ping timeout: 252 seconds) |
09:44
🔗
|
|
voltagex_ has quit IRC (Ping timeout: 260 seconds) |
09:46
🔗
|
|
HCross has quit IRC (Ping timeout: 260 seconds) |
09:47
🔗
|
|
voltagex_ has joined #urlteam |
09:49
🔗
|
|
HCross has joined #urlteam |
09:49
🔗
|
|
chr1sm has quit IRC (Ping timeout: 260 seconds) |
09:49
🔗
|
|
svchfoo3 sets mode: +o HCross |
09:50
🔗
|
|
svchfoo1 sets mode: +o HCross |
09:51
🔗
|
|
chr1sm has joined #urlteam |
09:55
🔗
|
|
teej_ has joined #urlteam |
12:34
🔗
|
|
kiska1 has joined #urlteam |
13:33
🔗
|
|
yuitimoth has quit IRC (Read error: Operation timed out) |
13:33
🔗
|
|
yuitimoth has joined #urlteam |
13:33
🔗
|
|
zerkalo has quit IRC (Read error: Operation timed out) |
13:34
🔗
|
|
zerkalo has joined #urlteam |
13:34
🔗
|
|
treora has quit IRC (Ping timeout: 268 seconds) |
13:34
🔗
|
|
BnAboyZ has quit IRC (Ping timeout: 268 seconds) |
13:35
🔗
|
|
treora has joined #urlteam |
13:35
🔗
|
|
Frogging has quit IRC (Ping timeout: 268 seconds) |
13:35
🔗
|
|
BnAboyZ has joined #urlteam |
13:38
🔗
|
|
Frogging has joined #urlteam |
13:40
🔗
|
|
Jusque has quit IRC (Read error: Operation timed out) |
13:42
🔗
|
|
Jusque has joined #urlteam |
13:44
🔗
|
|
moufu has joined #urlteam |
13:47
🔗
|
|
moufu_ has quit IRC (Read error: Connection reset by peer) |
13:48
🔗
|
|
kiska1 has quit IRC (Read error: Connection reset by peer) |
13:52
🔗
|
|
wmvhater has joined #urlteam |
13:52
🔗
|
|
kiska1 has joined #urlteam |
14:13
🔗
|
|
SilSte has quit IRC (Read error: Operation timed out) |
18:14
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
18:16
🔗
|
|
dashcloud has joined #urlteam |
18:20
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
18:21
🔗
|
|
dashcloud has joined #urlteam |
18:47
🔗
|
|
dashcloud has quit IRC (No Ping reply in 180 seconds.) |
18:48
🔗
|
|
dashcloud has joined #urlteam |