[01:13] *** cechk01 has joined #urlteam
[01:32] *** tritutri has joined #urlteam
[01:32] *** tritutri is now known as cybersec
[01:58] so, are any stats of the project known? eg. "we can guess we've captured 'x' % of all bit.ly URLs" etc?
[01:58] just curious to know how far it's come
[02:02] oh and one other question
[02:02] johtso, what kind of distributed rig/computer/setup/etc are you running? haha
[02:02] *** dashcloud has quit IRC (Read error: Operation timed out)
[02:08] *** dashcloud has joined #urlteam
[02:09] *** svchfoo1 sets mode: +o dashcloud
[02:44] cybersec, iirc it's on the wiki
[02:44] http://archiveteam.org/index.php?title=URLTeam#URL_shorteners
[02:49] *** W1nterFox has joined #urlteam
[02:55] *** WinterFox has quit IRC (Read error: Operation timed out)
[02:56] ah cool, thanks WinterFox
[03:04] *** JesseW has joined #urlteam
[03:04] *** svchfoo1 sets mode: +o JesseW
[03:15] *** JesseW has left
[03:16] *** JesseW has joined #urlteam
[03:16] *** svchfoo1 sets mode: +o JesseW
[03:37] *** JesseW has quit IRC (Leaving.)
[03:49] *** JesseW has joined #urlteam
[03:50] *** svchfoo1 sets mode: +o JesseW
[03:54] WinterFox, cybersec: wait, what's on the wiki? A percentage complete isn't there -- it's tricky with many of the shorteners to *tell* how many exist (and are being added to the source). We have estimates for *some* of them (and some of the ones that don't allow new entries are easier to figure out) -- but as for an overall percentage -- not so much. Regarding johtso's setup -- I don't think there's any mention of it on the wiki...
[03:58] Ah, that makes sense to me. I figured it would be hard to get an overall percentage from some of the more uncooperative URL shorteners
[03:59] Yeah, if you can figure out ways to populate the total column for any of the ones where it's blank, please do fill it in!
[03:59] and darn, I'd be interested to find out what that user is running. Do many people run archiving software on a distributed cluster of computers, or something like that? I'm fairly new to all of this
[03:59] haha sounds good, I'll keep an eye out
[04:00] first of all -- Welcome!
[04:03] regarding what johtso is running -- all the software (that shows up on the urlteam2 tracker) is the same: terroroftinytown, written by chfoo. It can be wrapped up as one of the possible projects to be run by the ArchiveTeam Warrior, which is a virtual machine image that is otherwise used for larger archiving projects, using a framework called seesaw. But the terroroftinytown (and the other projects) can also be run outside the virtual machine; I'm sure that'
[04:04] I heard a rumor that the hardware it's running on is otherwise used for bitcoin mining, but I don't know any details.
[04:04] It'd be great if you want to run a Warrior instance yourself. :-)
[04:05] And/or help with checking over the <100 possible url shorteners on the wiki page and figuring out whether the current code can scrape them, or what else we'd need to do to make it work...
[04:06] (if you're interested in that work, I'm glad to explain more)
[04:26] *** Fletcher_ is now known as Fletcher
[05:06] *** bwn has joined #urlteam
[05:10] JesseW, Can't you just look at the total possible urls and the number that work to find out roughly how many urls there are in total?
[05:11] W1nterFox: we don't even know of all the url shorteners out there...
[05:11] but if you meant for a particular one...
[05:11] Yeah just per shortener
[05:12] we can certainly calculate the number of possible codes for a given length
[05:12] but it's not always clear what lengths are legal
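A minimal sketch of the per-shortener code-space arithmetic JesseW describes here. The 62-symbol alphabet (0-9, a-z, A-Z) and the length range are assumptions for illustration, not a claim about any particular shortener:

    import string

    # Assumed 62-symbol alphabet: digits + lowercase + uppercase.
    ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

    def codes_of_length(n):
        """Distinct shortcodes of exactly length n."""
        return len(ALPHABET) ** n

    def codes_up_to(n):
        """Distinct shortcodes of length 1 through n."""
        return sum(codes_of_length(k) for k in range(1, n + 1))

    for n in range(1, 6):
        print(f"length {n}: {codes_of_length(n):,} codes")
    # length 4 -> 14,776,336 and length 5 -> 916,132,832; both figures
    # come up later in this log when sizing da.gd's random space.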
[05:33] JesseW, I've actually been running my own Warrior instance for about half the day today :)
[05:33] I'm almost up to 100k scanned URLs now
[05:33] nice!
[05:34] is running ToTT faster than Warrior, though? eg. would it benefit me more to run, say, 10 concurrent instances of ToTT in comparison to Warrior's maximum of 6 items?
[05:35] I'm not sure. Probably not.
[05:35] oh okay
[05:36] It will only let you work on one item from each project at a time, and there are only 10 projects running right now.
[05:37] So running more than 10 concurrent items on the same IP address won't help. It will also likely tax your box pretty heavily.
[05:37] 10 urlteam projects, I mean.
[05:38] oh I see what you mean
[05:38] that makes a lot more sense to me
[05:38] thanks for the information :)
[05:39] yeah, it's limited to one item from each project so the shorteners don't get upset.
[05:40] there are also limits on how many items can be worked on at once, per project, which can be set by the admins (i.e. me, arkiver and chfoo) so we don't overload the shorteners.
[05:41] oh that makes sense, so even if someone tried spoofing 50 instances or something of that sort, they'd never reach a point where they're overloading the shorteners
[05:42] or is the one-item limit IP-based?
[05:42] but regardless, that's a nice protection you have built in
[05:49] I'm not sure, actually. You can read the code and figure it out, though: https://github.com/ArchiveTeam/terroroftinytown
[05:52] I'm at 6.1 million scans now :D
[05:53] last I saw it was one IP per shortener
[05:54] oh cool, thanks JesseW
[05:54] and damn W1nterFox nice
[05:54] and thanks for the info achip
[05:55] I have had the warrior docker container running for about a month now
[05:55] * JesseW goes to check my stats
[05:56] over 11 million scans.
[05:56] Nice
[05:56] (nick is "Somebody")
[05:57] Why am I getting 404 links from migre.me? Shouldn't every link work?
[05:57] Are all the available links scanned?
[05:57] W1nterFox: some have been removed for spamming, I think.
[05:58] for migre-me, I manually delete items if one of them is missing, because otherwise it will keep retrying the missing one, because migre-me returns "Location: " for missing ones (sometimes)
[05:58] W1nterFox: example of a 404 link from migre-me?
[05:59] http://www.migre.me/4VU35
[06:00] hm, I get a 404 also.
[06:00] Pausing the project.
[06:00] I'll look into it.
[06:01] I think we may have finally reached the present on migre-me.
[06:09] hm
[06:11] Could someone create a URL at http://migre.me (doesn't matter what), and post the short url returned?
[06:12] I want to check what happens from another IP than mine.
[06:13] http://migre.me/sorWa
[06:13] ok, cool it is consistent with mine
[06:15] **ah**
[06:16] It puts the digits *AFTER* the alphabet
[06:16] hm, bit of a problem then...
[06:18] order seems to be lowercase, uppercase, digits
[06:21] giving only 269,383,546 urls shortened, rather than the 420 million I had thought.
[06:22] hm, no that can't be right, because 4JXkM is valid
[06:22] and if it is only up to sos74
[06:24] hm, maybe lowercase, digits, uppercase
[06:25] eh, fuck it -- just scrape them all, God (or terroroftinytown) will recognize the valid ones...
[06:25] i just did another, sosfc
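A hedged sketch of the failure mode JesseW mentions at 05:58: a shortener that answers with a redirect status but an empty Location header for a dead code, which a naive client would retry forever. migre.me's behaviour is as reported in the log; the function name and the use of the requests library are illustrative assumptions:

    import requests

    def resolve(code):
        """Return the destination URL for a shortcode, or None if dead.

        An empty Location header on a redirect (as migre.me reportedly
        returns for some missing codes) is treated as a definitive miss
        rather than something to retry.
        """
        resp = requests.head("http://migre.me/" + code, allow_redirects=False)
        if resp.status_code in (301, 302):
            target = resp.headers.get("Location", "").strip()
            return target or None   # empty Location => dead code
        return None                 # 404 and friends => dead code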
[06:26] hm -- any guesses about the ordering?
[06:27] numbers 0-4, lowercase, uppercase, numbers 5-9? :P
[06:28] Ha
[06:28] it died somewhere after 4JXkM
[06:29] and it's currently handing out sosfc
[06:29] what the hell?
[06:35] sosjh i just made a few minutes ago,
[06:35] sosta works, sostz is 404
[06:36] sostA works, sostZ as well
[06:41] hm, apparently migre.me accepts (and terroroftinytown happily passes through) newlines *in* urls
[06:49] hmm
[06:50] just saw U before F
[06:55] yeah, the ordering does seem ... quite odd
[06:55] S7, Sr, Sz, Sc, SN,
[06:56] well, that's because there are other people creating them in between
[06:56] TB, TD, TF
[06:57] i was just getting a feel for it
[07:04] it does seem consistent with digits, lowercase, uppercase -- maybe we're just hitting a big patch where some spam urls were removed
[07:05] yep, 5aabc exists, for example.
[07:07] 50000 exists -- but 4ZZZZ is a 200
[07:08] apparently they decided to stop serving (or never hand out) the other half of 4
[07:14] i'm going to play with their api a little and see if that tells me anything
[07:16] have you guys tried contacting them at all? or are they one of the less helpful url shorteners?
[07:21] I haven't tried contacting them. We generally don't, although it's certainly worth a try.
[07:26] bwn: nice; let us know what you find
[07:27] It's also in Portuguese, which I don't speak, so that was another reason I didn't contact the owner.
[07:33] ah yeah, that makes sense
[07:40] interestingly, you can see when a URL was created, here: http://www.migre.me/contar-cliques/?url=http%3A%2F%2Fwww.migre.me%2F4JXkM
[07:40] we're only up to 2011...
[07:43] it's sort of wasteful to be going through all the disabled 4.... ones, but, eh, better to do so just in case there is one working one somewhere in there, I suppose
[07:52] it really does look like it's sequential
[07:52] yep, I'm pretty sure it is
[07:52] Interestingly, the 3,822,506 codes from 4JXAG to 50000 are all empty. 4JXAG was created at "07/06/2011 17:06:46" while 50000 was created at "07/06/2011 03:00:00"
[07:53] i'm seeing 0-9, a-z, A-Z like you said earlier
[07:53] good to get the confirmation
[07:54] he's got a blog post talking about statistics from July 5 2011: http://migreme.com.br/blog/iphone-gera-mais-cliques-que-linux/
[07:55] i wonder if he bumped to 50000?
[07:55] it looks like that, yeah
[07:56] http://www.migre.me/contar-cliques/?url=http%3A%2F%2Fwww.migre.me%2F50je0 was made at 07/06/2011 17:05:05
[07:57] it looks like it was handing out both 4... and 5... for a few hours, then switched just to 5...'s
[07:57] well, we're up to 4X -- should be done with the interruption soon
[08:10] * JesseW is adjusting bitly's max queue size up, to see what happens...
[08:23] and has doubled migre-me
[08:23] it seems to be taking it fine
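A minimal sketch of the base-62 scheme the channel converges on here: codes allocated sequentially over the alphabet 0-9, then a-z, then A-Z. That migre.me works this way internally is an inference from probing, not documentation, and the function names are illustrative:

    import string

    # Inferred migre.me ordering: digits, then lowercase, then uppercase.
    ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

    def encode(n):
        """Sequence number -> shortcode, most significant symbol first."""
        digits = []
        while True:
            n, r = divmod(n, len(ALPHABET))
            digits.append(ALPHABET[r])
            if n == 0:
                break
        return "".join(reversed(digits))

    def decode(code):
        """Shortcode -> sequence number (inverse of encode)."""
        n = 0
        for ch in code:
            n = n * len(ALPHABET) + ALPHABET.index(ch)
        return n

    # Sanity check against the gap JesseW measured in the log:
    print(decode("50000") - decode("4JXAG"))  # 3822506

The fact that this alphabet reproduces the 3,822,506-code gap exactly is good evidence for the 0-9, a-z, A-Z ordering.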
[08:37] *** JesseW has quit IRC (Leaving.)
[08:51] *** deathy___ has quit IRC (Ping timeout: 252 seconds)
[08:53] *** deathy___ has joined #urlteam
[08:59] *** bwn has quit IRC (Read error: Operation timed out)
[09:36] *** bwn has joined #urlteam
[11:03] *** Infreq has quit IRC (始めましょう! ["Let's begin!"])
[14:08] *** W1nterFox has quit IRC (Remote host closed the connection)
[16:34] *** Start has quit IRC (Quit: Disconnected.)
[16:52] *** JesseW has joined #urlteam
[16:52] *** svchfoo1 sets mode: +o JesseW
[17:10] *** Start has joined #urlteam
[17:11] *** JesseW has quit IRC (Leaving.)
[17:39] *** Start_ has joined #urlteam
[17:42] *** Start has quit IRC (Ping timeout: 252 seconds)
[18:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[18:04] *** dashcloud has joined #urlteam
[18:05] *** svchfoo1 sets mode: +o dashcloud
[18:39] *** Start_ has quit IRC (Read error: Connection reset by peer)
[18:42] *** Start has joined #urlteam
[18:50] *** bwn has quit IRC (Read error: Operation timed out)
[19:08] *** Start has quit IRC (Quit: Disconnected.)
[19:24] *** bwn has joined #urlteam
[19:32] *** Start has joined #urlteam
[20:25] Is it possible to rate limit the crawling of a single shortener? I know of a shortener that could use archiving, but probably can't handle an enormous amount of load.
[20:28] *** aaaaaaaaa has joined #urlteam
[20:28] *** swebb sets mode: +o aaaaaaaaa
[20:28] it certainly is
[20:29] please add it to the wiki page
[20:30] .
[20:30] phuzion: what's the name?
[20:32] JW_work: I'll get with you in a minute. I'm talking with the owner of the shortener right now
[20:33] nice! if you're in contact with them, a full dump would be even easier than scraping it
[20:33] It seems like they're not very likely to give away the db dump, unfortunately.
[20:34] but they don't object to us scraping it? strange.
[20:35] "phuzion: *shrug* if you want to attempt bruteforce thousands of random URLs that can include capital, lowercase, numbers, etc, I'm not going to stop you. I don't see the point, but *shrug*"
[20:36] hahaha
[20:36] I explained that we've archived url shorteners multiple orders of magnitude larger than theirs
[20:38] depending on the "etc", "capital, lowercase, numbers" is just 62 possibilities. And 62**4 is less than 15 million.
[20:39] Randomly generated URLs are 5x(A-Z a-z 0-9)
[20:40] ok, so 916,132,832 possibilities
[20:40] Vanity URLs are case-sensitive up to 10x A-Z a-z 0-9 _ -
[20:41] gee I wonder if the backend uses base-64 integers.
[20:42] so assuming one minute per item, 100 item queue, 50 urls per item, covering the random space would take about 4 months
[20:42] not particularly a problem
[20:42] the vanity urls would be more painful, but still doable
[20:43] Yeah
[20:43] 64^10 is over 1 quintillion possibilities
[20:43] the big issue is whether it returns distinct http status codes for existing and non-existing shortcodes.
[20:43] JW_work: what does that calculate out to in queries per second? ~1 qps per thread?
[20:44] assuming we do 2 queries per second per item, 100 items gives 200qps.
[20:44] *** Start has quit IRC (Quit: Disconnected.)
[20:45] Can that be slowly ramped up while we make sure we're not going to knock the site down?
[20:45] (from 100 different IP addresses)
[20:45] sure
[20:45] ok
[20:45] I can start with, say, a 5 item queue — which gives 10qps from 5 different IPs
[20:45] and gradually increase it
[20:45] and yeah, it returns 404s for nonexistent URLs and 302s for valid ones.
[20:46] ok, that should be easy then
[20:46] toss it on the wiki
[20:46] Will do
[20:46] (along with the <100 other ones...)
[20:46] JW_work: under the Warrior projects tab?
[20:47] no, under the Alive section
[20:47] I'll move it to the Warrior projects when I actually create the warrior project
[20:47] (or feel free to put it in the warrior projects table, and just add a comment that it doesn't actually exist yet)
[20:48] Added. The URL is da.gd
[20:48] and it's open source, too. https://github.com/relrod/dagd
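The back-of-envelope sizing above, worked through in Python. The rates (50 URLs per item, one minute per item, a 100-item queue, 2 queries per second per item) are the rough assumptions stated in the conversation, not measured values:

    # da.gd's random codes: 5 symbols drawn from A-Z a-z 0-9.
    space = 62 ** 5                        # 916,132,832 possibilities

    urls_per_item = 50                     # assumed URLs checked per item
    items_in_flight = 100                  # assumed queue size
    minutes = space / (urls_per_item * items_in_flight)  # at 1 min/item
    print(f"~{minutes / (60 * 24 * 30):.1f} months to cover the space")  # ~4.2

    qps_per_item = 2                       # assumed per-item request rate
    print(f"~{qps_per_item * items_in_flight} qps at the full queue")    # 200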
[20:48] ha
[20:48] ok, cool
[20:48] *** Start has joined #urlteam
[20:52] that reminds me, I bought a url shortener a while ago and need to get that sent in to urlteam... should that just be "shorturl,destination" format?
[20:53] JW_work: I'd prefer if we really didn't hammer the crap out of this one, because it's actually an IRL buddy of mine that runs this and I'd rather not piss him off, lol. But yeah, let me know when that's gonna get started, and I can give him a heads up that it's gonna happen.
[20:53] I see da.gd is random. How long has it been up? Given that it's random, I'd rather wait on scraping it until it gets as populated as possible.
[20:54] achip: you did what?
[20:54] phuzion: also, can you get an idea from your buddy about how many short URLs exist yet?
[20:55] JW_work: Yeah, let me see if he can do a select count real quick
[21:00] someone was selling their url shortening site, so I bought it for the domain and I have the list of short urls that were created. I'll just clean up the list and post an excerpt tonight, I was just curious what format was best to be integrated into the urlteam dataset.
[21:02] achip: ah, ok. The format is called "BEACON", defined here: https://gbv.github.io/beaconspec/beacon.html
[21:03] it's basically just a few header lines (that start with #) followed by shortcode vertical-bar destination
[21:03] ah perfect, thanks
[21:03] JW_work: 165K approximately.
[21:04] phuzion: and how long has it been up?
[21:04] JW_work: 3-3.5 years
[21:04] achip: and probably the best way to add it to the urlteam collection is just to make a new IA item, stuff it in there, tag it with urlteam, and mention it on the wiki page
[21:05] there isn't any particularly sensible way to either add it to the last non-incremental dump, or the daily incremental dumps
[21:06] also, to be fully consistent with the other dumps, please xz compress the BEACON file (or files, if you prefer).
[21:06] sounds good
[21:08] phuzion: ok, that seems large enough to bother iterating through the search space for. 165,000 out of 916,000,000 means we should hit one every 5,500 or so. Not a great ratio, but hey, it's better than we did on the dropbox shortener! (21 out of 44 million)
[21:08] 21... total? or 21 thousand?
[21:09] 21. Total.
[21:09] Jesus christ
[21:09] It was an 8-character random.
[21:09] check the logs
[21:09] I ran it for a few days; if we get bored, we can run it for a few more days sometime.
[21:10] Wow. That's a horrendous hitrate.
[21:11] the tracker will be down for a while because i need to add a new column to the database
[21:11] what are you adding?
[21:13] JW_work: So, quick question for you. da.gd does an interesting thing where you can append custom things to the end of the URL and that is preserved in the redirect. Do we need to make any accommodations for this? For example: http://da.gd/atw now redirects to http://archiveteam.org/index.php?title= so http://da.gd/atw/URLTeam redirects to the URLTeam wiki page
[21:13] Or, at least it would if http://archiveteam.org/index.php?title=/URLTeam were a valid URL
[21:14] I don't think so. We'll just capture the mapping between {shortcode} and {longurl}. What the particular server *does* with that mapping doesn't really matter to us. You might mention it on the wiki page, though.
[21:14] Gotcha.
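A sketch of what JW_work describes for achip's dump: a BEACON file (a few header lines starting with #, then one shortcode, a vertical bar, and the destination per line), xz-compressed for consistency with the other urlteam dumps. The header fields follow the spec linked above; the mappings, prefix, and filename here are made-up examples:

    import lzma

    # Made-up example mappings from a hypothetical purchased shortener.
    mappings = {
        "abc12": "http://example.com/some/long/path",
        "xyz9":  "http://example.org/",
    }

    # lzma.open in text mode writes an .xz file directly.
    with lzma.open("shortener.beacon.xz", "wt", encoding="utf-8") as f:
        f.write("#FORMAT: BEACON\n")
        f.write("#PREFIX: http://example-shortener.invalid/\n")
        f.write("\n")
        for code, target in sorted(mappings.items()):
            f.write(f"{code}|{target}\n")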
[22:16] *** Start has quit IRC (Quit: Disconnected.)
[23:02] *** aaaaaaaa_ has joined #urlteam
[23:02] *** aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
[23:02] *** swebb sets mode: +o aaaaaaaa_
[23:21] *** dashcloud has quit IRC (Read error: Operation timed out)
[23:24] *** dashcloud has joined #urlteam
[23:24] *** svchfoo1 sets mode: +o dashcloud
[23:27] *** aaaaaaaa_ is now known as aaaaaaaaa