#urlteam 2018-12-16,Sun


Time Nickname Message
00:07 🔗 ivan` is now known as ivan_
01:23 🔗 trvz has quit IRC ()
02:33 🔗 Rotab has joined #urlteam
03:41 🔗 Somebody2 tinyurl error reports piled up too high; trying to clear the queue now
03:48 🔗 Flashfire somebody2 did you not update the wiki
04:14 🔗 boutique has joined #urlteam
04:15 🔗 odemg has quit IRC (Ping timeout: 265 seconds)
04:27 🔗 odemg has joined #urlteam
04:35 🔗 Somebody2 Flashfire: nope! :-)
04:35 🔗 Somebody2 I'd love your help with that... (hint, hint)
04:36 🔗 Flashfire Ahahahaha. I will have a look
04:38 🔗 Somebody2 yay thank you!
04:40 🔗 Flashfire It won't be updating the dates, I don't know enough to do that, but I can change the codes to show what is currently running and what isn't
04:41 🔗 Flashfire Somebody2 That's something that still helps though
04:41 🔗 Somebody2 absolutely
04:41 🔗 Somebody2 And you can figure out the dates by searching on archive.org
04:42 🔗 Somebody2 each big pile of data that is uploaded gets tagged with which projects it's in
04:42 🔗 JAA One day, we'll automate this.
04:43 🔗 Somebody2 AMEN. TELL IT, BROTHER...
04:43 🔗 Flashfire I'm not confident in messing with it, more than anything. The FTP/List page is one I am comfortable changing a lot. I'm not a huge part of URLTeam so I don't feel as confident editing that page
04:43 🔗 Somebody2 nods
04:43 🔗 Somebody2 (I'm subtly trying to *get* you to be...)
04:44 🔗 Somebody2 there's notes on how the data is uploaded to IA at the bottom of the page
04:45 🔗 Flashfire I found a few URL shorteners by scanning hundreds of QR codes and have added some to the wiki in dribs and drabs
04:45 🔗 Somebody2 thanks
04:45 🔗 Flashfire azon.biz was one of my findings which is one of the projects running now
04:46 🔗 Flashfire I tend to come across a lot of scam and spam so
05:21 🔗 boutique_ has joined #urlteam
05:24 🔗 boutique has quit IRC (Ping timeout: 252 seconds)
05:26 🔗 boutique has joined #urlteam
05:28 🔗 boutique has quit IRC (Read error: Connection reset by peer)
05:28 🔗 boutique has joined #urlteam
05:29 🔗 boutique_ has quit IRC (Ping timeout: 252 seconds)
05:41 🔗 boutique_ has joined #urlteam
05:45 🔗 Flashfire Somebody2 any reason for the 512?
05:45 🔗 boutique has quit IRC (Ping timeout: 252 seconds)
05:47 🔗 boutique has joined #urlteam
05:49 🔗 boutique_ has quit IRC (Ping timeout: 252 seconds)
05:58 🔗 JAA We need to EXPORT OUR SHIT regularly. :-)
05:59 🔗 JAA Means writing the results to files and preparing them for upload to IA.
05:59 🔗 JAA Although I'm not really sure why it takes this long.
05:59 🔗 JAA There's definitely some room for optimisation there.
06:02 🔗 jodizzle has joined #urlteam
06:13 🔗 boutique_ has joined #urlteam
06:16 🔗 boutique has quit IRC (Ping timeout: 252 seconds)
06:20 🔗 boutique has joined #urlteam
06:20 🔗 boutique_ has quit IRC (Ping timeout: 252 seconds)
06:30 🔗 JAA has quit IRC (leaving)
06:34 🔗 JAA has joined #urlteam
06:34 🔗 bakJAA sets mode: +o JAA
06:59 🔗 Flashfire if x.co is incremental you may want to stop it
08:43 🔗 Flashfire Somebody2 if Bitly requests stuff "Randomly" does that mean it won't re-request what it already knows is a result?
08:45 🔗 JAA Flashfire: Think of it this way: you take all possible shortcodes and shuffle them. Then you start processing from the beginning. Each code only gets processed once this way.
08:45 🔗 Flashfire ok
08:45 🔗 Flashfire So it will request all the shortcodes, then take what didn't work, shuffle, and try again?
08:46 🔗 JAA No, nothing is tried again.
08:46 🔗 JAA Each shortcode is attempted exactly once.
08:46 🔗 JAA Well, aside from connection issues and similar.
08:46 🔗 Flashfire But then URLTeam would have lots of duplication issues
08:47 🔗 JAA ... no?
08:47 🔗 Flashfire I think we are misunderstanding each other
08:47 🔗 JAA Yeah
08:48 🔗 JAA The tracker conceptually takes all possible shortcodes. In the case of bit.ly e.g. 0000000 to zzzzzzz. It shuffles these into random order. And that's the basic list it then operates on.
08:49 🔗 JAA It cuts it into pieces of 50 codes and hands those out as items to the workers.
08:49 🔗 JAA All items combined process exactly the entire possible shortcode range, and each individual shortcode is retrieved exactly once.
08:50 🔗 JAA (The actual implementation is more efficient than the above, but the effect is the same.)
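The conceptual model JAA describes (enumerate every shortcode, shuffle once, cut into items of 50) can be sketched as below. This is a toy illustration only, not the tracker's code, which JAA notes is more efficient. The alphabet, the 2-character code length, the fixed seed, and the name `make_items` are assumptions chosen to keep the example small.

```python
import itertools
import random
import string

# Assumed alphabet for illustration; real shorteners each have their own.
ALPHABET = string.digits + string.ascii_lowercase
ITEM_SIZE = 50  # item size mentioned in the log


def make_items(alphabet, length, item_size=ITEM_SIZE, seed=0):
    """Enumerate all codes of the given length, shuffle them once,
    and split the shuffled list into items of at most item_size codes."""
    codes = ["".join(chars) for chars in itertools.product(alphabet, repeat=length)]
    random.Random(seed).shuffle(codes)  # one shuffle; nothing is re-queued later
    return [codes[i:i + item_size] for i in range(0, len(codes), item_size)]


# Toy space: 2-character codes, so 36**2 = 1296 codes in total.
items = make_items(ALPHABET, 2)
handed_out = [code for item in items for code in item]

# Every code appears in exactly one item, so each is attempted exactly once.
print(len(items), len(handed_out), len(set(handed_out)))
```

Because the items partition the shuffled list, no shortcode is handed out twice and none is skipped, which is why the approach does not cause duplication.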
08:59 🔗 JSharp has joined #urlteam
09:24 🔗 psi previously-unfound links don't go back into the pool?
09:24 🔗 JAA No, we do one pass over the whole possible space.
09:25 🔗 JAA And when that completes, we start over. But including the information about which codes were already found before isn't really feasible there because that would be a *huge* list.
09:26 🔗 psi I see
09:27 🔗 JAA But in the case of bit.ly at least, we're still far away from reaching that point anyway.
09:28 🔗 JAA ~7 billion scanned, but there are ~56 billion 6-digit shortcodes. At ~100 codes per second, that'll take another 15 years.
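The estimate above can be checked with a few lines of arithmetic. The 62-character alphabet (0-9, a-z, A-Z) is an assumption inferred from the "~56 billion" figure; the scanned count and rate are the rounded numbers from the log.

```python
ALPHABET_SIZE = 62   # assumed: digits plus lower- and upper-case letters
CODE_LENGTH = 6
SCANNED = 7_000_000_000   # "~7 billion scanned"
RATE_PER_SECOND = 100     # "~100 codes per second"

total_codes = ALPHABET_SIZE ** CODE_LENGTH  # 56,800,235,584
remaining_seconds = (total_codes - SCANNED) / RATE_PER_SECOND
remaining_years = remaining_seconds / (365.25 * 24 * 3600)

print(f"{total_codes:,} codes, about {remaining_years:.1f} years remaining")
```

This comes out a little under 16 years, consistent with the rough "another 15 years" quoted in the channel.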
09:29 🔗 psi Somewhat related, but can I see how much I've done without having to load 10,000 entries on the leaderboard?
09:31 🔗 JAA psi: Don't think so. Looks like there isn't anything on the admin panel either.
09:32 🔗 psi bah
09:33 🔗 JAA psi: Well, you can search the HTML. It's all in there, just hidden from view.
09:33 🔗 psi oh
09:33 🔗 JAA Ah no, looks like it's not the complete table.
09:38 🔗 JAA Yeah, only top 300 in there.
09:39 🔗 JAA The API should return everything though. But that goes through a WebSocket, so not easily accessible.
10:17 🔗 hook54321 has quit IRC (Quit: Connection closed for inactivity)
10:27 🔗 psi oof
10:38 🔗 jodizzle Does anyone have a sense of what level of concurrency I can get away with on a small VPS (like $5 digitalocean droplet)?
10:39 🔗 jodizzle I'm testing it out and it doesn't seem like the jobs are that demanding
10:56 🔗 SmileyG_ has quit IRC (Read error: Operation timed out)
10:58 🔗 Smiley has joined #urlteam
11:20 🔗 caff_ has quit IRC (Read error: Connection reset by peer)
12:01 🔗 boutique has quit IRC (Quit: Leaving)
12:18 🔗 psi How does it happen that nothing is available, by the way
12:18 🔗 psi more warriors requesting chunks faster than the tracker can assign them?
12:49 🔗 hook54321 has joined #urlteam
13:21 🔗 JAA jodizzle: Anything, more or less. URLTeam uses extremely little resources. We mostly just need a *lot* of IP addresses due to rate limits.
13:22 🔗 JAA Also due to rate limits, you'll only be processing one item per active shortener at a time though, so going too high won't get you anywhere. I think there are about a dozen shorteners active at the moment.
13:24 🔗 JAA psi: Yes, I think so. The tracker has a limit of how many items per shortener are available at any time (i.e. a global rate limit), and there are more workers than items for each shortener, so often enough the workers don't get any. In addition, you'll only get one item per shortener at a time, so if you run at a higher concurrency, you'll only get 404s on those additional threads.
13:55 🔗 psi JAA: if you're still here, the tracker is 507ing (unless it's already known) (also cc Somebody2 )
13:59 🔗 JAA vbly-us is throwing errors due to unexpected 302 status replies.
14:08 🔗 celso has joined #urlteam
14:10 🔗 JAA So 302s go to the homepage apparently. Maybe deleted shortlinks or something?
14:11 🔗 JAA Or maybe we reached the end already?
14:12 🔗 psi The quick and dirty solution is to just then treat 302s as a failure, I assume
14:14 🔗 psi Or turn off vbly for the time being and do manual testing
14:20 🔗 celso has quit IRC (Read error: Connection reset by peer)
14:21 🔗 JAA vbly-us disabled for now.
14:21 🔗 JAA All resumed.
14:21 🔗 psi Great, thanks
14:22 🔗 JAA We were paused since about 11:30 UTC.
14:23 🔗 JAA Some examples which caused 302s: http://vbly.us/34b0 http://vbly.us/34b2 http://vbly.us/2wgr http://vbly.us/2us4 http://vbly.us/334b
14:38 🔗 JAA wp-me re-enabled starting from where last year's crawl stopped. 40 currently.
14:48 🔗 JAA shar-es is throwing errors, reducing to 80.
14:48 🔗 JAA 504 errors*
15:17 🔗 JAA Somebody2: Uhm, wtf are those entries in the errors with project "None"?
16:25 🔗 ave_ has quit IRC (Quit: Connection closed for inactivity)
17:39 🔗 chferfa has joined #urlteam
17:56 🔗 celso has joined #urlteam
19:17 🔗 klg has joined #urlteam
19:32 🔗 t3 has quit IRC ()
19:36 🔗 teej_ has joined #urlteam
20:06 🔗 JAA wp-me now at 70.
21:25 🔗 maxadolla has joined #urlteam
21:58 🔗 JAA wp-me boosted to 100.
21:58 🔗 JAA Looks like we finally have more items available than warriors. (But only because Tumblr's the default project.)
23:20 🔗 VariXx has quit IRC (Read error: Operation timed out)
