#urlteam 2015-12-14,Mon


Time Nickname Message
01:13 πŸ”— cechk01 has joined #urlteam
01:32 πŸ”— tritutri has joined #urlteam
01:32 πŸ”— tritutri is now known as cybersec
01:58 πŸ”— cybersec so, are any stats of the project known? eg. "we can guess we've captured 'x' % of all bit.ly URLs" etc?
01:58 πŸ”— cybersec just curious to know how far it's come
02:02 πŸ”— cybersec oh and one other question
02:02 πŸ”— cybersec johtso, what kind of distributed rig/computer/setup/etc are you running? haha
02:02 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
02:08 πŸ”— dashcloud has joined #urlteam
02:09 πŸ”— svchfoo1 sets mode: +o dashcloud
02:44 πŸ”— WinterFox cybersec, iirc it's on the wiki
02:44 πŸ”— WinterFox http://archiveteam.org/index.php?title=URLTeam#URL_shorteners
02:49 πŸ”— W1nterFox has joined #urlteam
02:55 πŸ”— WinterFox has quit IRC (Read error: Operation timed out)
02:56 πŸ”— cybersec ah cool, thanks WinterFox
03:04 πŸ”— JesseW has joined #urlteam
03:04 πŸ”— svchfoo1 sets mode: +o JesseW
03:15 πŸ”— JesseW has left
03:16 πŸ”— JesseW has joined #urlteam
03:16 πŸ”— svchfoo1 sets mode: +o JesseW
03:37 πŸ”— JesseW has quit IRC (Leaving.)
03:49 πŸ”— JesseW has joined #urlteam
03:50 πŸ”— svchfoo1 sets mode: +o JesseW
03:54 πŸ”— JesseW WinterFox, cybersec: wait, what's on the wiki? A percentage complete isn't there -- it's tricky with many of the shorteners to *tell* how many exist (and are being added to the source). We have estimates for *some* of them (and some of the ones that don't allow new entries are easier to figure out) -- but as for an overall percentage -- not so much. Regarding johtso's setup -- I don't think there's any mention of it on the wiki...
03:58 πŸ”— cybersec Ah, that makes sense to me. I figured it would be hard to get an overall percentage from some of the more uncooperative URL shorteners
03:59 πŸ”— JesseW Yeah, if you can figure out ways to populate the total column for any of the ones it's blank about, please do fill it in!
03:59 πŸ”— cybersec and darn, I'd be interested to find out what that user is running. Do many people run archiving software on a distributed cluster of computers, or something like that? I'm fairly new to all of this
03:59 πŸ”— cybersec haha sounds good, I'll keep an eye out
04:00 πŸ”— JesseW first of all -- Welcome!
04:03 πŸ”— JesseW regarding what johtso is running -- all the software (that shows up on the urlteam2 tracker) is the same: terroroftinytown, written by chfoo. It can be wrapped up as one of the possible projects to be run by the ArchiveTeam Warrior, which is a virtual machine image that is otherwise used for larger archiving projects, using a framework called seesaw. But the terroroftinytown (and the other projects) can also be run outside the virtual machine; I'm sure that'
04:04 πŸ”— JesseW I heard a rumor that the hardware it's running on is otherwise used for bitcoin mining, but I don't know any details.
04:04 πŸ”— JesseW It'd be great if you want to run a Warrior instance yourself. :-)
04:05 πŸ”— JesseW And/or help with checking over the <100 possible url shorteners on the wiki page and figuring out whether the current code can scrape them, or what else we'd need to do to make it work...
04:06 πŸ”— JesseW (if you're interested in that work, I'm glad to explain more)
04:26 πŸ”— Fletcher_ is now known as Fletcher
05:06 πŸ”— bwn has joined #urlteam
05:10 πŸ”— W1nterFox JesseW, Can't you just look at the total possible urls and the number that work to find out roughly how many urls there are in total?
05:11 πŸ”— JesseW W1nterFox: we don't even know of all the url shorteners out there...
05:11 πŸ”— JesseW but if you meant for a particular one...
05:11 πŸ”— W1nterFox Yeah just per shortener
05:12 πŸ”— JesseW we can certainly calculate the number of possible codes for a given length
05:12 πŸ”— JesseW but it's not always clear what lengths are legal
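The per-length arithmetic JesseW describes can be sketched in a few lines of Python (a base-62 alphabet is assumed here for illustration; which lengths a given shortener actually hands out has to be discovered empirically, as he notes):

```python
# Count possible shortcodes for a base-62 alphabet (0-9, a-z, A-Z),
# both for a single code length and cumulatively across lengths.
ALPHABET_SIZE = 62

def codes_of_length(n):
    """Distinct codes of exactly n characters."""
    return ALPHABET_SIZE ** n

def codes_up_to_length(n):
    """Distinct codes of length 1 through n."""
    return sum(ALPHABET_SIZE ** k for k in range(1, n + 1))

print(codes_of_length(5))     # 916132832
print(codes_up_to_length(5))  # 931151402
```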
05:33 πŸ”— cybersec JesseW, I've actually been running my own Warrior instance for about half the day today :)
05:33 πŸ”— cybersec I'm almost up to 100k scanned URLs now
05:33 πŸ”— JesseW nice!
05:34 πŸ”— cybersec is running ToTT faster than Warrior, though? eg. would it benefit me more to run, say, 10 concurrent instances of ToTT in comparison to Warrior's maximum of 6 items?
05:35 πŸ”— JesseW I'm not sure. Probably not.
05:35 πŸ”— cybersec oh okay
05:36 πŸ”— JesseW It will only let you work on one item from each project at a time, and there are only 10 projects running right now.
05:37 πŸ”— JesseW So running more than 10 concurrent items on the same IP address won't help. It will also likely tax your box pretty heavily.
05:37 πŸ”— JesseW 10 urlteam projects, I mean.
05:38 πŸ”— cybersec oh I see what you mean
05:38 πŸ”— cybersec that makes a lot more sense to me
05:38 πŸ”— cybersec thanks for the information :)
05:39 πŸ”— JesseW yeah, it's limited to one item from each project so the shorteners don't get upset.
05:40 πŸ”— JesseW there are also limits of how many items can be worked on at once, per project, which can be set by the admins (i.e. me, arkiver and chfoo) so we don't overload the shorteners.
05:41 πŸ”— cybersec oh that makes sense, so even if someone tried spoofing 50 instances or something of that sort, they'd never reach a point where they're overloading the shorteners
05:42 πŸ”— cybersec or is the one-item limit IP-based?
05:42 πŸ”— cybersec but regardless, that's a nice protection you have built in
05:49 πŸ”— JesseW I'm not sure, actually. You can read the code and figure it out, though: https://github.com/ArchiveTeam/terroroftinytown
05:52 πŸ”— W1nterFox I'm at 6.1 mill scans now :D
05:53 πŸ”— achip last I saw it was one IP per shortener
05:54 πŸ”— cybersec oh cool, thanks JesseW
05:54 πŸ”— cybersec and damn W1nterFox nice
05:54 πŸ”— cybersec and thanks for the info achip
05:55 πŸ”— W1nterFox I have had the warrior docker container running for about a month now
05:55 πŸ”— * JesseW goes to check my stats
05:56 πŸ”— JesseW over 11 million scans.
05:56 πŸ”— W1nterFox Nice
05:56 πŸ”— JesseW (nick is "Somebody")
05:57 πŸ”— W1nterFox Why am I getting 404 links from migre.me? Shouldn't every link work?
05:57 πŸ”— W1nterFox Are all the available links scanned?
05:57 πŸ”— JesseW W1nterFox: some have been removed for spamming, I think.
05:58 πŸ”— JesseW for migre-me, I manually delete items if one of them is missing, because otherwise it will keep retrying the missing one, because migre-me returns "Location: " for missing ones (sometimes)
05:58 πŸ”— JesseW W1nterFox: example of a 404 link from migre-me?
05:59 πŸ”— W1nterFox http://www.migre.me/4VU35
06:00 πŸ”— JesseW hm, I get a 404 also.
06:00 πŸ”— JesseW Pausing the project.
06:00 πŸ”— JesseW I'll look into it.
06:01 πŸ”— JesseW I think we may have finally reached the present on migre-me.
06:09 πŸ”— JesseW hm
06:11 πŸ”— JesseW Could someone create a URL at http://migre.me (doesn't matter what), and post the short url returned?
06:12 πŸ”— JesseW I want to check what happens from another IP than mine.
06:13 πŸ”— bwn http://migre.me/sorWa
06:13 πŸ”— JesseW ok, cool it is consistent with mine
06:15 πŸ”— JesseW **ah**
06:16 πŸ”— JesseW It puts the digits *AFTER* the alphabet
06:16 πŸ”— JesseW hm, bit of problem then...
06:18 πŸ”— JesseW order seems to be lowercase, uppercase, digits
06:21 πŸ”— JesseW giving only 269,383,546 urls shortened, rather than the 420 million I had thought.
06:22 πŸ”— JesseW hm, no that can't be right, because 4JXkM is valid
06:22 πŸ”— JesseW and if it is only up to sos74
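The 269,383,546 figure is consistent with decoding "sos74" as a base-62 number under the ordering JesseW guessed at this point (lowercase, then uppercase, then digits); a short sketch of that decoding, for illustration:

```python
# Decode a migre.me-style shortcode to a sequential index, assuming the
# ordering hypothesized here: lowercase, uppercase, digits.
# (The discussion later settles on digits, lowercase, uppercase instead.)
ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def decode(code, alphabet=ALPHABET):
    value = 0
    for ch in code:
        value = value * len(alphabet) + alphabet.index(ch)
    return value

print(decode("sos74"))  # 269383546 -- matches the figure above
```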
06:24 πŸ”— JesseW hm, maybe lowercase, digits, uppercase
06:25 πŸ”— JesseW eh, fuck it -- just scrape them all, God (or terroroftinytown) will recognize the valid ones...
06:25 πŸ”— bwn i just did another, sosfc
06:26 πŸ”— JesseW hm -- any guesses about the ordering?
06:27 πŸ”— W1nterFox numbers 0-4, lowercase, uppercase, numbers 5-9? :P
06:28 πŸ”— JesseW Ha
06:28 πŸ”— JesseW it died somewhere after 4JXkM
06:29 πŸ”— JesseW and it's currently handing out sosfc
06:29 πŸ”— JesseW whatthehell?
06:35 πŸ”— bwn sosjh i just made a few minutes ago,
06:35 πŸ”— bwn sosta works, sostz is 404
06:36 πŸ”— bwn sostA works, sostZ as well
06:41 πŸ”— JesseW hm, apparently migre.me accepts (and terroroftinytown happily passes through) newlines *in* urls
06:49 πŸ”— bwn hmm
06:50 πŸ”— bwn just saw U before F
06:55 πŸ”— JesseW yeah, the ordering does seem ... quite odd
06:55 πŸ”— bwn S7, Sr, Sz, Sc, SN,
06:56 πŸ”— JesseW well, that's because there are other people creating them in between
06:56 πŸ”— bwn TB, TD, TF
06:57 πŸ”— bwn i was just getting a feel for it
07:04 πŸ”— JesseW it does seem consistent with digits, lowercase, uppercase -- maybe we're just hitting a big patch where some spam urls were removed
07:05 πŸ”— JesseW yep, 5aabc exists, for example.
07:07 πŸ”— JesseW 50000 exists -- but 4ZZZZ is a 200
07:08 πŸ”— JesseW apparently they decided to stop serving (or never hand out) the other half of 4
07:14 πŸ”— bwn i'm going to play with their api a little and see if that tells me anything
07:16 πŸ”— cybersec have you guys tried contacting them at all? or are they one of the less helpful url shorteners?
07:21 πŸ”— JesseW I haven't tried contacting them. We generally don't, although it's certainly worth a try.
07:26 πŸ”— JesseW bwn: nice; let us know what you find
07:27 πŸ”— JesseW It's also in Spanish, which I don't speak, so that was another reason I didn't contact the owner.
07:33 πŸ”— cybersec ah yeah, that makes sense
07:40 πŸ”— JesseW interestingly, you can see when a URL was created, here: http://www.migre.me/contar-cliques/?url=http%3A%2F%2Fwww.migre.me%2F4JXkM
07:40 πŸ”— JesseW we're only up to 2011...
07:43 πŸ”— JesseW it's sort of wasteful to be going through all the disabled 4.... ones, but, eh, better to do so just in case there is one working one somewhere in there, I suppose
07:52 πŸ”— bwn it really does look like it's sequential
07:52 πŸ”— JesseW yep, I'm pretty sure it is
07:52 πŸ”— JesseW Interestingly, the 3,822,506 codes from 4JXAG to 50000 are all empty. 4JXAG was created at "07/06/2011 17:06:46" while 50000 was created at "07/06/2011 03:00:00"
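The 3,822,506 figure can be verified by decoding both codes under the digits-then-lowercase-then-uppercase ordering (0-9, a-z, A-Z) and subtracting:

```python
# Verify the gap between codes 4JXAG and 50000, assuming the
# 0-9, a-z, A-Z ordering discussed in this exchange.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def decode(code):
    value = 0
    for ch in code:
        value = value * 62 + ALPHABET.index(ch)
    return value

print(decode("50000") - decode("4JXAG"))  # 3822506
```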
07:53 πŸ”— bwn i'm seeing 0-9, a-z, A-Z like you said earlier
07:53 πŸ”— JesseW good to get the confirmation
07:54 πŸ”— bwn he's got a blog post talking about statistics from july 5 2011: http://migreme.com.br/blog/iphone-gera-mais-cliques-que-linux/
07:55 πŸ”— bwn i wonder if he bumped to 50000?
07:55 πŸ”— JesseW it looks like that, yeah
07:56 πŸ”— JesseW http://www.migre.me/contar-cliques/?url=http%3A%2F%2Fwww.migre.me%2F50je0 was made on 07/06/2011 at 17:05:05
07:57 πŸ”— JesseW it looks like it was handing out both 4... and 5... for a few hours, then switched just to 5...'s
07:57 πŸ”— JesseW well, we're up to 4X -- should be done with the interruption soon
08:10 πŸ”— * JesseW is adjusting bitly's max queue size up, to see what happens...
08:23 πŸ”— JesseW and has doubled migre-me
08:23 πŸ”— JesseW it seems to be taking it fine
08:37 πŸ”— JesseW has quit IRC (Leaving.)
08:51 πŸ”— deathy___ has quit IRC (Ping timeout: 252 seconds)
08:53 πŸ”— deathy___ has joined #urlteam
08:59 πŸ”— bwn has quit IRC (Read error: Operation timed out)
09:36 πŸ”— bwn has joined #urlteam
11:03 πŸ”— Infreq has quit IRC (Let's begin!)
14:08 πŸ”— W1nterFox has quit IRC (Remote host closed the connection)
16:34 πŸ”— Start has quit IRC (Quit: Disconnected.)
16:52 πŸ”— JesseW has joined #urlteam
16:52 πŸ”— svchfoo1 sets mode: +o JesseW
17:10 πŸ”— Start has joined #urlteam
17:11 πŸ”— JesseW has quit IRC (Leaving.)
17:39 πŸ”— Start_ has joined #urlteam
17:42 πŸ”— Start has quit IRC (Ping timeout: 252 seconds)
18:01 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
18:04 πŸ”— dashcloud has joined #urlteam
18:05 πŸ”— svchfoo1 sets mode: +o dashcloud
18:39 πŸ”— Start_ has quit IRC (Read error: Connection reset by peer)
18:42 πŸ”— Start has joined #urlteam
18:50 πŸ”— bwn has quit IRC (Read error: Operation timed out)
19:08 πŸ”— Start has quit IRC (Quit: Disconnected.)
19:24 πŸ”— bwn has joined #urlteam
19:32 πŸ”— Start has joined #urlteam
20:25 πŸ”— phuzion Is it possible to rate limit the crawling of a single shortener? I know of a shortener that could use to be archived, but probably can't handle an enormous amount of load.
20:28 πŸ”— aaaaaaaaa has joined #urlteam
20:28 πŸ”— swebb sets mode: +o aaaaaaaaa
20:28 πŸ”— JW_work it certainly is
20:29 πŸ”— JW_work please add it to the wiki page
20:30 πŸ”— JW_work .
20:30 πŸ”— JW_work phuzion: what's the name?
20:32 πŸ”— phuzion JW_work: I'll get with you in a minute. I'm talking with the owner of the shortener right now
20:33 πŸ”— JW_work nice! if you're in contact with them, a full dump would be even easier than scraping it
20:33 πŸ”— phuzion It seems like they're not very likely to give away the db dump, unfortunately.
20:34 πŸ”— JW_work but they don't object to us scraping it? strange.
20:35 πŸ”— phuzion "phuzion: *shrug* if you want to attempt bruteforce thousands of random URLs that can include capital, lowercase, numbers, etc, I'm not going to stop you. I don't see the point, but *shrug*"
20:36 πŸ”— JW_work hahaha
20:36 πŸ”— phuzion I explained that we've archived url shorteners multiple orders of magnitude larger than theirs
20:38 πŸ”— JW_work depending on the "etc", "capital, lowercase, numbers" is just 62 possibilities. And 62**4 is less than 15 million.
20:39 πŸ”— phuzion Randomly generated URLs are 5x(A-Z a-z 0-9)
20:40 πŸ”— JW_work ok, so 916,132,832 possibilities
20:40 πŸ”— phuzion Vanity URLs are case-sensitive up to 10x A-Z a-z 0-9 _ -
20:41 πŸ”— aaaaaaaaa gee I wonder if the backend uses base-64 integers.
20:42 πŸ”— JW_work so assuming one minute per item, 100 item queue, 50 urls per item, covering the random space would take about 4 months
20:42 πŸ”— JW_work not particularly a problem
20:42 πŸ”— JW_work the vanity urls would be more painful, but still doable
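JW_work's four-month estimate follows from the assumptions he states; a back-of-envelope check:

```python
# Time to cover the 62^5 random space, using the assumptions above:
# a 100-item queue, 50 URLs per item, about one minute per item.
space = 62 ** 5             # 916,132,832 candidate codes
urls_per_minute = 100 * 50  # queue size * URLs per item
minutes = space / urls_per_minute
print(minutes / (60 * 24))  # ~127 days, i.e. roughly four months
```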
20:43 πŸ”— phuzion Yeah
20:43 πŸ”— aaaaaaaaa 64^10 is over 1 quintillion possibilities
20:43 πŸ”— JW_work the big issue is whether it returns distinct http status codes for existing and non-existing shortcodes.
20:43 πŸ”— phuzion JW_work: what's that calculate out to being in queries per second? ~1 qps per thread?
20:44 πŸ”— JW_work assuming we do 2 queries per second per item, 100 items gives 200qps.
20:44 πŸ”— Start has quit IRC (Quit: Disconnected.)
20:45 πŸ”— phuzion Can that be slowly ramped up while we make sure we're not going to knock the site down?
20:45 πŸ”— JW_work (from 100 different IP addresses)
20:45 πŸ”— JW_work sure
20:45 πŸ”— phuzion ok
20:45 πŸ”— JW_work I can start with, say, a 5 item queue β€” which gives 10qps from 5 different IPs
20:45 πŸ”— JW_work and gradually increase it
20:45 πŸ”— phuzion and yeah, it returns 404s for nonexistent URLs and 302s for valid ones.
20:46 πŸ”— JW_work ok, that should be easy then
20:46 πŸ”— JW_work toss it on the wiki
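The distinct-status-code check that makes this shortener easy to scrape can be sketched as follows; `classify` is a hypothetical helper for illustration, not terroroftinytown's actual API:

```python
# Classify a shortener's response: a 3xx with a Location header is a
# hit (shortcode exists), a 404 is a miss, anything else needs a look
# (e.g. rate limiting, or migre.me's empty "Location: " responses).
def classify(status, location=None):
    if status == 404:
        return "miss"
    if status in (301, 302) and location:
        return ("hit", location)
    return "unexpected"

print(classify(302, "http://example.com/long-url"))
print(classify(404))
```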
20:46 πŸ”— phuzion Will do
20:46 πŸ”— JW_work (along with the <100 other ones...)
20:46 πŸ”— phuzion JW_work: under the Warrior projects tab?
20:47 πŸ”— JW_work no, under the Alive section
20:47 πŸ”— JW_work I'll move it to the Warrior projects when I actually create the warrior project
20:47 πŸ”— JW_work (or feel free to put it in the warrior projects table, and just add a comment that it doesn't actually exist yet)
20:48 πŸ”— phuzion Added. The URL is da.gd
20:48 πŸ”— phuzion and it's open source, too. https://github.com/relrod/dagd
20:48 πŸ”— JW_work ha
20:48 πŸ”— JW_work ok, cool
20:48 πŸ”— Start has joined #urlteam
20:52 πŸ”— achip that reminds me, I bought a url shortener a while ago and need to get that set in to urlteam... should that just be "shorturl,destination" format?
20:53 πŸ”— phuzion JW_work: I'd prefer if we really didn't hammer the crap out of this one, because it's actually an IRL buddy of mine that runs this and I'd rather not piss him off, lol. But yeah, let me know when that's gonna get started, and I can give him heads up that it's gonna happen.
20:53 πŸ”— JW_work I see da.gd is random. How long has it been up? Given that it's random, I'd rather wait on scraping it until it gets as populated as possible.
20:54 πŸ”— JW_work achip: you did what?
20:54 πŸ”— JW_work phuzion: also, can you get an idea from your buddy about how many short URLs exist yet?
20:55 πŸ”— phuzion JW_work: Yeah, let me see if he can do a select count real quick
21:00 πŸ”— achip someone was selling their url shortening site, so I bought it for the domain and I have the list of short urls that were created. I'll just clean up the list and post an excerpt tonight, I was just curious what format was best to be integrated into the urlteam dataset.
21:02 πŸ”— JW_work achip: ah, ok. The format is called "BEACON", defined here: https://gbv.github.io/beaconspec/beacon.html
21:03 πŸ”— JW_work it's basically just a few header lines (that start with #) followed by shortcode vertical-bar destination
21:03 πŸ”— achip ah perfect, thanks
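A minimal BEACON file along the lines JW_work describes (header lines starting with #, then bar-separated mappings; all values here are hypothetical) might look like:

```
#FORMAT: BEACON
#PREFIX: http://example-shortener.com/
abc12|http://www.example.com/some/long/page
abc13|http://www.example.org/another/page
```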
21:03 πŸ”— phuzion JW_work: 165K approximately.
21:04 πŸ”— JW_work phuzion: and how long has it been up?
21:04 πŸ”— phuzion JW_work: 3-3.5 years
21:04 πŸ”— JW_work achip: and probably the best way to add it to the urlteam collection is just to make a new IA item, stuff it in there, tag it with urlteam, and mention it on the wiki page
21:05 πŸ”— JW_work there isn't any particularly sensible way to either add it to the last non-incremental dump, or the daily incremental dumps
21:06 πŸ”— JW_work also, to be fully consistent with the other dumps, please xz compress the BEACON file (or files, if you prefer).
21:06 πŸ”— achip sounds good
21:08 πŸ”— JW_work phuzion: ok, that seems large enough to bother iterating through the search space for. 165,000 out of 916,000,000 means we should hit one every 10,000 or so. Not a great ratio, but hey, it's better than we did on the dropbox shortener! (21 out of 44 million)
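The expected hit rate can be checked from the figures above:

```python
# Expected hit rate for da.gd's random 5-character space.
known_urls = 165_000
space = 62 ** 5  # 916,132,832
print(space // known_urls)  # 5552
```

That works out to roughly one hit per 5,500 codes tried, somewhat better than the one-in-10,000 guess but the same order of magnitude.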
21:08 πŸ”— phuzion 21... total? or 21 thousand?
21:09 πŸ”— JW_work 21. Total.
21:09 πŸ”— phuzion Jesus christ
21:09 πŸ”— JW_work It was a 8 character random.
21:09 πŸ”— JW_work check the logs
21:09 πŸ”— JW_work I ran it for a few days; if we get bored, we can run it for a few more days sometime.
21:10 πŸ”— phuzion Wow. That's a horrendous hitrate.
21:11 πŸ”— chfoo the tracker will be down for a while because i need to add a new column to the database
21:11 πŸ”— JW_work what are you adding?
21:13 πŸ”— phuzion JW_work: So, quick question for you. da.gd does an interesting thing where you can append custom things to the end of the URL and that is preserved in the redirect. Do we need to make any accommodations for this? For example: http://da.gd/atw now redirects to http://archiveteam.org/index.php?title= so http://da.gd/atw/URLTeam redirects to the URLTeam wiki page
21:13 πŸ”— phuzion Or, at least it would if http://archiveteam.org/index.php?title=/URLTeam were a valid URL
21:14 πŸ”— JW_work I don't think so. We'll just capture the mapping between {shortcode} and {longurl}. What the particular server *does* with that mapping doesn't really matter to us. You might mention it on the wiki page, though.
21:14 πŸ”— phuzion Gotcha.
22:16 πŸ”— Start has quit IRC (Quit: Disconnected.)
23:02 πŸ”— aaaaaaaa_ has joined #urlteam
23:02 πŸ”— aaaaaaaaa has quit IRC (Read error: Connection reset by peer)
23:02 πŸ”— swebb sets mode: +o aaaaaaaa_
23:21 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
23:24 πŸ”— dashcloud has joined #urlteam
23:24 πŸ”— svchfoo1 sets mode: +o dashcloud
23:27 πŸ”— aaaaaaaa_ is now known as aaaaaaaaa
