Time |
Nickname |
Message |
00:07
🔗
|
|
ivan` is now known as ivan_ |
01:23
🔗
|
|
trvz has quit IRC () |
02:33
🔗
|
|
Rotab has joined #urlteam |
03:41
🔗
|
Somebody2 |
tinyurl error reports piled up too high; trying to clear the queue now |
03:48
🔗
|
Flashfire |
Somebody2, did you not update the wiki? |
04:14
🔗
|
|
boutique has joined #urlteam |
04:15
🔗
|
|
odemg has quit IRC (Ping timeout: 265 seconds) |
04:27
🔗
|
|
odemg has joined #urlteam |
04:35
🔗
|
Somebody2 |
Flashfire: nope! :-) |
04:35
🔗
|
Somebody2 |
I'd love your help with that... (hint, hint) |
04:36
🔗
|
Flashfire |
Ahahahaha. I will have a look |
04:38
🔗
|
Somebody2 |
yay thank you! |
04:40
🔗
|
Flashfire |
I won't be updating the dates, I don't know enough to do that, but I can change the codes to show what is currently running and what isn't |
04:41
🔗
|
Flashfire |
Somebody2: That's something that still helps though |
04:41
🔗
|
Somebody2 |
absolutely |
04:41
🔗
|
Somebody2 |
And you can figure out the dates by searching on archive.org |
04:42
🔗
|
Somebody2 |
each big pile of data that is uploaded gets tagged with which projects it's in |
04:42
🔗
|
JAA |
One day, we'll automate this. |
04:43
🔗
|
Somebody2 |
AMEN. TELL IT, BROTHER... |
04:43
🔗
|
Flashfire |
I'm not confident messing with it, more than anything. The FTP/List page is one I am comfortable changing a lot. I'm not a huge part of URLTeam so I don't feel as confident editing that page |
04:43
🔗
|
Somebody2 |
nods |
04:43
🔗
|
Somebody2 |
(I'm subtly trying to *get* you to be...) |
04:44
🔗
|
Somebody2 |
there's notes on how the data is uploaded to IA at the bottom of the page |
04:45
🔗
|
Flashfire |
I found a few URL shorteners by scanning hundreds of QR codes and have added some to the wiki in dribs and drabs |
04:45
🔗
|
Somebody2 |
thanks |
04:45
🔗
|
Flashfire |
azon.biz was one of my findings which is one of the projects running now |
04:46
🔗
|
Flashfire |
I tend to come across a lot of scam and spam so |
05:21
🔗
|
|
boutique_ has joined #urlteam |
05:24
🔗
|
|
boutique has quit IRC (Ping timeout: 252 seconds) |
05:26
🔗
|
|
boutique has joined #urlteam |
05:28
🔗
|
|
boutique has quit IRC (Read error: Connection reset by peer) |
05:28
🔗
|
|
boutique has joined #urlteam |
05:29
🔗
|
|
boutique_ has quit IRC (Ping timeout: 252 seconds) |
05:41
🔗
|
|
boutique_ has joined #urlteam |
05:45
🔗
|
Flashfire |
Somebody2 any reason for the 512? |
05:45
🔗
|
|
boutique has quit IRC (Ping timeout: 252 seconds) |
05:47
🔗
|
|
boutique has joined #urlteam |
05:49
🔗
|
|
boutique_ has quit IRC (Ping timeout: 252 seconds) |
05:58
🔗
|
JAA |
We need to EXPORT OUR SHIT regularly. :-) |
05:59
🔗
|
JAA |
Means writing the results to files and preparing them for upload to IA. |
05:59
🔗
|
JAA |
Although I'm not really sure why it takes this long. |
05:59
🔗
|
JAA |
There's definitely some room for optimisation there. |
06:02
🔗
|
|
jodizzle has joined #urlteam |
06:13
🔗
|
|
boutique_ has joined #urlteam |
06:16
🔗
|
|
boutique has quit IRC (Ping timeout: 252 seconds) |
06:20
🔗
|
|
boutique has joined #urlteam |
06:20
🔗
|
|
boutique_ has quit IRC (Ping timeout: 252 seconds) |
06:30
🔗
|
|
JAA has quit IRC (leaving) |
06:34
🔗
|
|
JAA has joined #urlteam |
06:34
🔗
|
|
bakJAA sets mode: +o JAA |
06:59
🔗
|
Flashfire |
if x.co is incremental you may want to stop it |
08:43
🔗
|
Flashfire |
Somebody2 if Bitly requests stuff "Randomly" does that mean it won't re-request what it already knows is a result? |
08:45
🔗
|
JAA |
Flashfire: Think of it this way: you take all possible shortcodes and shuffle them. Then you start processing from the beginning. Each code only gets processed once this way. |
08:45
🔗
|
Flashfire |
ok |
08:45
🔗
|
Flashfire |
So it will request all the shortcodes, then take what didn't work, shuffle and try again? |
08:46
🔗
|
JAA |
No, nothing is tried again. |
08:46
🔗
|
JAA |
Each shortcode is attempted exactly once. |
08:46
🔗
|
JAA |
Well, aside from connection issues and similar. |
08:46
🔗
|
Flashfire |
But then URLteam would have lots of duplication issues |
08:47
🔗
|
JAA |
... no? |
08:47
🔗
|
Flashfire |
I think we are misunderstanding each other |
08:47
🔗
|
JAA |
Yeah |
08:48
🔗
|
JAA |
The tracker conceptually takes all possible shortcodes. In the case of bit.ly e.g. 0000000 to zzzzzzz. It shuffles these into random order. And that's the basic list it then operates on. |
08:49
🔗
|
JAA |
It cuts it into pieces of 50 codes and hands those out as items to the workers. |
08:49
🔗
|
JAA |
All items combined process exactly the entire possible shortcode range, and each individual shortcode is retrieved exactly once. |
08:50
🔗
|
JAA |
(The actual implementation is more efficient than the above, but the effect is the same.) |
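Editor's note: the conceptual model JAA describes above (enumerate every shortcode, shuffle once, cut into items of 50) can be sketched as follows. This is only an illustration of that mental model, not the tracker's code, which JAA says is more efficient. The function name `make_items`, the two-character code length, the lowercase-only alphabet, and the fixed seed are all assumptions chosen to keep the demo small.

```python
import itertools
import random
import string


def make_items(alphabet, length, item_size, seed):
    """Enumerate all codes, shuffle them once, and cut the list into items."""
    # Every possible shortcode of the given length (tiny here; huge for bit.ly).
    codes = ["".join(chars) for chars in itertools.product(alphabet, repeat=length)]
    # One shuffle fixes the random processing order for the whole pass.
    random.Random(seed).shuffle(codes)
    # Consecutive slices of the shuffled list become the items handed to workers.
    return [codes[i:i + item_size] for i in range(0, len(codes), item_size)]


# Hypothetical demo parameters: 36-character alphabet, 2-character codes.
demo_alphabet = string.digits + string.ascii_lowercase
items = make_items(demo_alphabet, length=2, item_size=50, seed=1)

# Each code lands in exactly one item, so nothing is requested twice in a pass.
all_codes = [code for item in items for code in item]
print(len(items), "items,", len(all_codes), "codes,", len(set(all_codes)), "unique")
```

Because the items are slices of a single shuffled list, together they cover the whole code space and no code appears in two items, which is why no duplication arises within a pass.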
08:59
🔗
|
|
JSharp has joined #urlteam |
09:24
🔗
|
psi |
previously-unfound links don't go back into the pool? |
09:24
🔗
|
JAA |
No, we do one pass over the whole possible space. |
09:25
🔗
|
JAA |
And when that completes, we start over. But including the information about which codes were already found before isn't really feasible there because that would be a *huge* list. |
09:26
🔗
|
psi |
I see |
09:27
🔗
|
JAA |
But in the case of bit.ly at least, we're still far away from reaching that point anyway. |
09:28
🔗
|
JAA |
~7 billion scanned, but there are ~56 billion 6-digit shortcodes. At ~100 codes per second, that'll take another 15 years. |
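Editor's note: the estimate above can be checked with a few lines of arithmetic. The 62-character alphabet (0-9, a-z, A-Z) is an assumption; it is the alphabet size that yields the quoted ~56 billion 6-character codes. The scanned count and rate are the rounded figures from the message.

```python
ALPHABET_SIZE = 62          # assumption: digits plus lower- and upper-case letters
CODE_LENGTH = 6
SCANNED = 7_000_000_000     # "~7 billion scanned"
RATE_PER_SECOND = 100       # "~100 codes per second"

total_codes = ALPHABET_SIZE ** CODE_LENGTH
remaining = total_codes - SCANNED
seconds = remaining / RATE_PER_SECOND
years = seconds / (365.25 * 24 * 3600)

print(f"{total_codes:,} codes in total")
print(f"about {years:.1f} years to scan the rest")
```

With these rounded inputs the result comes out between 15 and 16 years, in line with the "another 15 years" figure.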
09:29
🔗
|
psi |
Somewhat related, but can I see how much I've done without having to load 10,000 entries on the leaderboard? |
09:31
🔗
|
JAA |
psi: Don't think so. Looks like there isn't anything on the admin panel either. |
09:32
🔗
|
psi |
bah |
09:33
🔗
|
JAA |
psi: Well, you can search the HTML. It's all in there, just hidden from view. |
09:33
🔗
|
psi |
oh |
09:33
🔗
|
JAA |
Ah no, looks like it's not the complete table. |
09:38
🔗
|
JAA |
Yeah, only top 300 in there. |
09:39
🔗
|
JAA |
The API should return everything though. But that goes through a WebSocket, so not easily accessible. |
10:17
🔗
|
|
hook54321 has quit IRC (Quit: Connection closed for inactivity) |
10:27
🔗
|
psi |
oof |
10:38
🔗
|
jodizzle |
Does anyone have a sense of what level of concurrency I can get away with on a small VPS (like $5 digitalocean droplet)? |
10:39
🔗
|
jodizzle |
I'm testing it out and it doesn't seem like the jobs are that demanding |
10:56
🔗
|
|
SmileyG_ has quit IRC (Read error: Operation timed out) |
10:58
🔗
|
|
Smiley has joined #urlteam |
11:20
🔗
|
|
caff_ has quit IRC (Read error: Connection reset by peer) |
12:01
🔗
|
|
boutique has quit IRC (Quit: Leaving) |
12:18
🔗
|
psi |
How does it happen that nothing is available, by the way? |
12:18
🔗
|
psi |
more warriors requesting chunks faster than the tracker can assign them? |
12:49
🔗
|
|
hook54321 has joined #urlteam |
13:21
🔗
|
JAA |
jodizzle: Anything, more or less. URLTeam uses extremely few resources. We mostly just need a *lot* of IP addresses due to rate limits. |
13:22
🔗
|
JAA |
Also due to rate limits, you'll only be processing one item per active shortener at a time though, so going too high won't get you anywhere. I think there are about a dozen shorteners active at the moment. |
13:24
🔗
|
JAA |
psi: Yes, I think so. The tracker has a limit of how many items per shortener are available at any time (i.e. a global rate limit), and there are more workers than items for each shortener, so often enough the workers don't get any. In addition, you'll only get one item per shortener at a time, so if you run at a higher concurrency, you'll only get 404s on those additional threads. |
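Editor's note: a toy model of the behaviour JAA describes here, under stated assumptions. `ToyTracker`, its methods, and the limits in the demo are hypothetical and are not taken from the real tracker. The sketch only captures the two rules from the message: each shortener has a cap on items handed out at once, and a worker holds at most one item per shortener. Returning `None` stands in for the "nothing available" response.

```python
class ToyTracker:
    """Hypothetical model of per-shortener item limits (not the real tracker)."""

    def __init__(self, limits):
        # shortener name -> maximum number of items checked out at any time
        self.limits = dict(limits)
        # shortener name -> set of workers currently holding an item for it
        self.checked_out = {name: set() for name in self.limits}

    def request_item(self, worker):
        """Return a shortener name to work on, or None if nothing is available."""
        for name, limit in self.limits.items():
            holders = self.checked_out[name]
            if worker in holders:
                continue  # rule 2: one item per shortener per worker
            if len(holders) >= limit:
                continue  # rule 1: global cap for this shortener reached
            holders.add(worker)
            return name
        return None

    def finish_item(self, worker, name):
        """Release the worker's item so the slot becomes available again."""
        self.checked_out[name].discard(worker)


tracker = ToyTracker({"bitly": 2, "wp-me": 1})
print(tracker.request_item("a"))  # first shortener with a free slot
print(tracker.request_item("a"))  # a different shortener for the same worker
print(tracker.request_item("a"))  # nothing left for this worker
```

In this model, raising a worker's concurrency beyond the number of active shorteners only produces more empty responses, which matches the point made in the message.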
13:55
🔗
|
psi |
JAA: if you're still here, the tracker is 507ing (unless it's already known) (also cc Somebody2 ) |
13:59
🔗
|
JAA |
vbly-us is throwing errors due to unexpected 302 status replies. |
14:08
🔗
|
|
celso has joined #urlteam |
14:10
🔗
|
JAA |
So 302s go to the homepage apparently. Maybe deleted shortlinks or something? |
14:11
🔗
|
JAA |
Or maybe we reached the end already? |
14:12
🔗
|
psi |
The quick and dirty solution is to just then treat 302s as a failure, I assume |
14:14
🔗
|
psi |
Or turn off vbly for the time being and do manual testing |
14:20
🔗
|
|
celso has quit IRC (Read error: Connection reset by peer) |
14:21
🔗
|
JAA |
vbly-us disabled for now. |
14:21
🔗
|
JAA |
All resumed. |
14:21
🔗
|
psi |
Great, thanks |
14:22
🔗
|
JAA |
We were paused since about 11:30 UTC. |
14:23
🔗
|
JAA |
Some examples which caused 302s: http://vbly.us/34b0 http://vbly.us/34b2 http://vbly.us/2wgr http://vbly.us/2us4 http://vbly.us/334b |
14:38
🔗
|
JAA |
wp-me re-enabled starting from where last year's crawl stopped. 40 currently. |
14:48
🔗
|
JAA |
shar-es is throwing errors, reducing to 80. |
14:48
🔗
|
JAA |
504 errors* |
15:17
🔗
|
JAA |
Somebody2: Uhm, wtf are those entries in the errors with project "None"? |
16:25
🔗
|
|
ave_ has quit IRC (Quit: Connection closed for inactivity) |
17:39
🔗
|
|
chferfa has joined #urlteam |
17:56
🔗
|
|
celso has joined #urlteam |
19:17
🔗
|
|
klg has joined #urlteam |
19:32
🔗
|
|
t3 has quit IRC () |
19:36
🔗
|
|
teej_ has joined #urlteam |
20:06
🔗
|
JAA |
wp-me now at 70. |
21:25
🔗
|
|
maxadolla has joined #urlteam |
21:58
🔗
|
JAA |
wp-me boosted to 100. |
21:58
🔗
|
JAA |
Looks like we finally have more items available than warriors. (But only because Tumblr's the default project.) |
23:20
🔗
|
|
VariXx has quit IRC (Read error: Operation timed out) |