[01:21] *** Odd0002_ has joined #urlteam
[01:21] *** Odd0002 has quit IRC (Ping timeout: 600 seconds)
[01:21] *** Odd0002_ is now known as Odd0002
[08:49] *** figpucker has joined #urlteam
[09:07] *** figpucker has quit IRC (Quit: Leaving)
[09:10] *** figpucker has joined #urlteam
[11:59] *** dashcloud has quit IRC (Read error: Connection reset by peer)
[12:00] *** dashcloud has joined #urlteam
[15:02] *** dashcloud has quit IRC (Read error: Operation timed out)
[15:31] *** dashcloud has joined #urlteam
[17:51] JAA: grabbing the wordpress blog links sounds like a lovely idea
[17:52] I'll see about setting that up
[17:57] Somebody2: How about you teach me how to set it up? :-)
[18:01] JAA: even better!
[18:02] So, go to the toplevel admin page, https://tracker.archiveteam.org:1338/projects/overview
[18:02] and enter the name of the new project in the obvious box. I like to use the name of the shortener, with dots replaced by dashes
[18:03] (for, afaik, historical raisins)
[18:03] Check
[18:03] then you get to the shortener settings page
[18:03] and you need to set the alphabet, if it isn't default
[18:05] Looks like the default settings are fine for this one.
[18:05] It returns 301 on success and 404 on failure, but I guess leaving the other codes in doesn't hurt?
[18:06] Or would you remove those?
[18:08] I'd leave them in, at least to start with.
[18:09] You need to change the URL Template line
[18:09] JAA: what *is* the format, btw?
[18:11] So the shortcode in http://wp.me/code can be either the blog ID in base62 or one of the letters [sPpa] plus the encoded blog ID plus a dash plus the post ID in base62.
[18:12] s = "slug", in that case it isn't the post ID but a custom shortcode, e.g. http://wp.me/sf2B5-shorten
[18:12] P = page of a post, I think, but I'm not entirely sure about that one.
[18:12] p is a link directly to a specific post on the blog.
[18:12] And "a" is for attachments to a post.
[18:14] Hm.
[18:15] More examples, please?
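The shortcode format described at 18:11 and 18:12 can be sketched as a small parser. This is an illustration only, assuming the format is exactly as stated in the chat; `parse_shortcode`, `Shortcode`, and `KINDS` are hypothetical names, and the meaning of the `P` prefix is unconfirmed in the log. Because the base62 alphabet contains no dash, the presence of a dash is what separates a prefixed code from a bare blog ID.

```python
import re
from typing import NamedTuple, Optional

# Prefix letters as described in the chat; "P" is marked unconfirmed there.
KINDS = {"s": "slug", "P": "page (unconfirmed)", "p": "post", "a": "attachment"}

# Prefix letter + base62 blog ID + dash + post ID (or custom slug for "s").
_PREFIXED = re.compile(r"([sPpa])([0-9a-zA-Z]+)-(.+)")
# Bare base62 blog ID.
_BLOG_ONLY = re.compile(r"[0-9a-zA-Z]+")


class Shortcode(NamedTuple):
    kind: str            # "blog", or one of the KINDS values
    blog_code: str       # base62-encoded blog ID
    item: Optional[str]  # post ID, attachment ID, or slug; None for a bare blog ID


def parse_shortcode(code: str) -> Optional[Shortcode]:
    """Split a wp.me shortcode into its parts, or return None if it does not fit."""
    m = _PREFIXED.fullmatch(code)
    if m:
        return Shortcode(KINDS[m.group(1)], m.group(2), m.group(3))
    if _BLOG_ONLY.fullmatch(code):
        return Shortcode("blog", code, None)
    return None


if __name__ == "__main__":
    for example in ("sf2B5-shorten", "a92Te1-q", "1cu"):
        print(example, parse_shortcode(example))
```

Note that a bare blog ID may itself start with s, P, p, or a, so the prefix letter alone is not enough to classify a code.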
[18:16] https://wp.me/a92Te1-q is one which appeared in ArchiveBot yesterday and caused me to hunt all of this down.
[18:16] Also, does it work with HEAD requests, or does it require GET requests?
[18:16] HEAD works fine.
[18:17] Cool -- note that there's an option in the settings for that, which defaults to HEAD
[18:17] Yeah, saw that
[18:18] I don't have more examples right now, but you can find them on any blog hosted at wordpress.com in the HTML in a tag.
[18:18] So, to get the blog ID ones, can we just iterate through a-zA-Z0-9?
[18:18] 0-9a-zA-Z is the order how it's used in the Wordpress plugin, but yeah.
[18:19] Ah, better to change the order, then
[18:19] You can do that in the alphabet setting
[18:19] Default order is 0-9a-zA-Z though?
[18:19] Oh, is it?
[18:19] Good. :-)
[18:19] :-)
[18:19] It'll result in some duplicates because it's really just a base62-encoding. E.g. http://wp.me/02 == http://wp.me/2
[18:20] Hm, that's probably OK.
[18:20] OK, so update the "URL template" setting on the Shortener settings page.
[18:21] Guess so. Maybe we can skip 0xxxx later on because that would be quite large and unnecessary.
[18:21] Yup
[18:21] Yes, we can skip over ranges by adjusting where the auto-queue starts from
[18:21] Right
[18:21] Now on the Queue Settings page, change the Maximum number of items setting to something like 10, to start with.
[18:22] You can boost it back up gradually
[18:22] Make sure to check the AutoQueue checkbox.
[18:22] Then check the Enabled checkbox, and we're good to go!
[18:22] Sweet
[18:23] A nice semi-hidden feature is that the current time the page was generated is listed in the upper corner; this is helpful for comparing to timestamps on the page
[18:23] to see if things are still running as expected.
[18:23] A likely nice feature addition would be to enhance all the timestamp displays with relative dates, too
[18:24] and we've got results for wp-me!!
[18:24] Ah yeah, I saw that.
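The iteration order and the leading-zero duplicates mentioned at 18:18 to 18:21 can be illustrated with a short generator. This is a sketch under the assumption that the alphabet is 0-9a-zA-Z as stated in the chat; `candidates` and `skip_leading_zero` are hypothetical names, and the code does not describe how the tracker's auto-queue is actually implemented.

```python
from itertools import islice, product
from typing import Iterator

# Alphabet order given in the chat: digits, then lowercase, then uppercase.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"


def candidates(length: int, skip_leading_zero: bool = True) -> Iterator[str]:
    """Yield every shortcode of the given length in alphabet order.

    Codes with a leading "0" name the same blog as a shorter code
    (e.g. "02" and "2"), so they can be skipped for lengths above 1.
    """
    for chars in product(ALPHABET, repeat=length):
        if skip_leading_zero and length > 1 and chars[0] == "0":
            continue
        yield "".join(chars)


if __name__ == "__main__":
    # First few two-character candidates once leading zeros are skipped.
    for code in islice(candidates(2), 3):
        print(f"http://wp.me/{code}")
```

Skipping the leading-zero codes removes 1 in 62 candidates at each length, which is why the chat treats the duplicates as tolerable at first.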
[18:24] \o/
[18:24] I'm not a fan of relative dates. "What is 'about 1 minute ago'? Give me a timestamp!"
[18:26] *** figpucker has quit IRC (Read error: Connection reset by peer)
[18:26] Oh, I certainly don't want it to *replace* the timestamps; dear god no.
[18:26] Yeah, a setting would be neat.
[18:26] But it's nice to have as an addition.
[18:26] Also, I love that everything's in UTC here. :-)
[18:27] Especially if color coded -- "within a minute", "within an hour", "from a previous day"
[18:27] As that's usually what I'm looking for -- is this project stuck, and how badly?
[18:27] Hm yeah, makes sense.
[18:27] So now you'd slowly increase the queue size until it's either large enough or runs into trouble?
[18:28] yep!
[18:28] I tend to ramp up in units of at least 10, and usually 20
[18:28] You can check the Error Reports page to see "trouble"
[18:29] So we're now grabbing 3 character ones -- they don't *seem* to be base62 encodings of the names...
[18:29] e.g. 12L maps to aehso
[18:29] No, it's the blog ID.
[18:29] I'm not sure if that's exposed anywhere really.
[18:29] Ah, cool
[18:30] well, it's exposed here :-)
[18:30] Oh yeah, it's in the HTML somewhere.
[18:30] E.g. "siteID":"4618" on https://kidnicky2801.wordpress.com/ (1cu)
[18:35] Queue at 80 now.
[18:39] 100
[18:39] cool, seems fine
[18:40] Is there any way to see how much "capacity" we have?
[18:41] We have a huge list of shorteners to do, and I'd love to throw additional ones in.
[18:42] I still see "no items available currently" errors on my machines, so clearly there's still space for more, but I wonder if there's anything to estimate how *much* more.
[18:43] Please create projects for every single shortener you have the relevant information for
[18:43] Once we've got them in, we can adjust which ones are running at the same time if needed.
[18:43] Our current bottleneck is researching them.
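The blog ID example from 18:30 (siteID 4618, shortcode 1cu) can be checked with a plain base62 conversion. This is a minimal sketch assuming the 0-9a-zA-Z digit order stated in the chat; `encode` and `decode` are illustrative helper names, not part of any WordPress or tracker API.

```python
# Digit order given in the chat: 0-9, then a-z, then A-Z.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(n: int) -> str:
    """Convert a non-negative integer to its base62 string."""
    if n < 0:
        raise ValueError("n must be non-negative")
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n:
        n, r = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[r])
    return "".join(reversed(digits))


def decode(code: str) -> int:
    """Convert a base62 string back to an integer; leading zeros are ignored."""
    n = 0
    for ch in code:
        n = n * len(ALPHABET) + INDEX[ch]
    return n


if __name__ == "__main__":
    print(encode(4618))               # siteID from the chat
    print(decode("02"), decode("2"))  # both name the same blog ID
```

Since `decode` ignores leading zeros, `02` and `2` yield the same blog ID, which matches the duplicate noted at 18:19.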
[18:43] Guessed so
[18:44] Generally we have about 100 warriors, each of which can run about 3 different jobs at once.
[18:44] So that's a good ballpark figure for how much capacity we can do simultaneously.
[18:45] But if we can consistently hit that, we can likely recruit more warriors.
[18:45] And if/when there isn't another job running, suddenly our number of warriors jumps up to 200+
[18:45] another non-URLTeam job
[18:46] Makes sense
[18:46] I need to add the shorteners to the wiki page manually, correct?
[18:47] For now, until someone writes code to do it automatically, yeah.
[18:48] I'm going AFK for a bit.
[19:07] I'm cleaning up the wiki page now. Still lists various shorteners as active which were deactivated months ago, e.g. cmplx-it.
[19:11] JAA: thank you!!
[19:11] and we've found over 300,000 wp-me results
[19:12] By the way, what's the matter with go-usa-gov? Did anything happen since treyo was here?
[19:12] JAA: a couple of days later, it seemed to be blocking us, IIRC.
[19:13] have they posted a dump yet?
[19:14] * JAA shrugs
[21:02] *** dashcloud has quit IRC (Remote host closed the connection)
[21:03] *** dashcloud has joined #urlteam
[22:21] 2M wp-me scanned, 1.85M found. :-)
[22:36] Yay!
[23:45] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:48] *** dashcloud has joined #urlteam