[00:03] underscor shows up and fixes things
[00:03] Although when he shows up, it usually means six weeks more winter
[00:15] DFJustin: I thought that was archived long ago?
[00:16] xmc: emirjp is still around, he's in charge of #wikiteam but is never on IRC
[00:16] he does respond to emails afaict
[00:22] what was
[00:27] DFJustin: opensolaris
[00:29] where is it then
[01:05] kyan: guessing this is you? https://archive.org/details/mail.google.com-saved-1Oct2014
[01:06] (just noticed author field)
[01:33] GChriss: you figured it out, but yes :)
[01:54] -----------------------------------------------------
[01:54] Archive Team Members Up For Being in SF on October 28th:
[01:54] https://ia601401.us.archive.org/34/items/LibraryBuildingInvitation/LibraryBuildingInvitation-nolink.html
[01:54] -----------------------------------------------------
[06:40] sup
[06:42] So are you guys like archive.org only better?
[06:42] More badass would be more appropriate
[06:43] I bet.
[06:44] Archive.org is indispensable for finding old software downloads.
[06:44] better is probably the wrong word
[06:44] IA hosts most of what we get
[06:44] I see. So they take submissions
[06:45] I wish they had competition, honestly.
[06:45] yeah, get an account there, and you can upload whatever
[06:45] I guess because I've seen so many "long running" sites suddenly disappear eventually
[06:45] Do they take MHTs?
[06:45] I have TONS of those
[06:45] (unfortunately)
[06:45] what's an MHT
[06:45] the Microsoft thing
[06:46] It's the "save as single file" in IE
[06:46] And Opera
[06:46] And Chrome if you enable it
[06:46] I mean, you can upload them, but Wayback won't read them as of yet
[06:46] Yeah, I figured
[06:46] the other problem with them is figuring out what the URL was
[06:46] I imagine google can help sometimes
[06:46] they're missing a lot of vital information
[06:46] like the URL
[06:46] yeah
[06:46] and all the request/response headers
[06:47] it makes me regret having so many but they ARE so convenient
[06:47] and I guess I'm usually too lazy to mess with a spider
[06:47] consider using instead https://webrecorder.io/
[06:47] we also have archivebot
[06:47] Thanks to IA, I just got H.264 videos playing on Windows NT 3.51
[06:48] One thing I hate: robots.txt
[06:49] Also, how would one archive dynamic Web 2.0/DHTML/HTML5/cloud crap
[06:50] if there's one thing I HATE about the web, it's that stuff
[06:51] https://webrecorder.io/
[06:51] we also have archivebot, which has a PhantomJS mode
[06:51] it is not as robust as webrecorder.io but does not require human interaction
[06:52] there are also WARC-generating proxy servers, which serve a similar purpose to webrecorder.io
[06:52] Also, is there anything that archives with multiple user agents? There is so much mobile only and wap stuff out there and it displays differently depending on your UA
[06:52] we have archivebot
[06:52] which has a --user-agent-alias option
[06:52] webrecorder.io also works with whatever browser you choose to use
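As an aside on what a WARC-writing fetch with a chosen user agent can look like outside of ArchiveBot, here is a minimal sketch assuming the warcio Python library; the URL and the mobile user-agent string are placeholders, not anything ArchiveBot itself runs:

    # requests must be imported after capture_http so warcio can hook the HTTP layer
    from warcio.capture_http import capture_http
    import requests

    MOBILE_UA = 'Mozilla/5.0 (Linux; Android 4.4; Mobile)'  # placeholder mobile UA string

    # every request made inside this block is written to the WARC, together with
    # the URL and the full request/response headers that an MHT file drops
    with capture_http('mobile-crawl.warc.gz'):
        requests.get('http://example.com/', headers={'User-Agent': MOBILE_UA})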
[06:52] Would it even be possible to archive something like this? pcjs.org/
[06:53] I don't see why not
[06:53] archiving *sessions* done in the emulated PCjr environment is tricky
[06:53] archiving the *code* is not that hard
[06:53] in fact
[06:54] well it's open source, sure
[06:54] you can just download it
[06:54] I added it to the queue in #archivebot, you can join that channel to check on it
[06:54] but how could it archive the state of the emulated pc
[06:54] it won't
[06:54] no tool does
[06:54] yeah, I imagine that is impossible
[06:55] however, retrieving individual disk images is not hard
[06:55] the tools have not kept up with the "advances" of the web
[06:55] so you can approximate state
[06:55] Do you guys do software too?
[06:56] to some degree, that's inevitable. if there's a webpage that chooses to represent state in a form that isn't capturable via a URL or some other parameter not present in a request, you aren't going to get it as a request/response pair
[06:56] there's no way around that one except not capturing the result as a request/response pair
[06:56] so you can save it as, say, an image, or a PDF, or a DOM snapshot that could perhaps be unfrozen at some later time
[06:56] Lotta talkin'
[06:56] and yes, there's a huge software archive on IA now
[06:56] #-bs etc
[06:57] Can you submit your own stuff too?
[06:57] in fact the curator of that software collection just showed up
[06:57] in the channel?
[06:59] He's going to bed sooner rather than later.
[06:59] is there some place I can submit IRC logs?
[14:19] tfgbd: https://archive.org/upload/
[15:59] ----------------------------------------
[15:59] Which Archive Team Members want an Ello Invite
[15:59] Condition: Gotta be a troublemaker
[15:59] ----------------------------------------
[16:01] I'll take one if I can sign up as index.html
[16:01] or about, terms-of-service, etc
[16:07] public-beta-profiles
[16:08] robots.txt
[16:09] Msg me with an e-mail, I'll send one. They might be closing the invites, but I have them.
[16:28] I will say, by the way, these tele2 and verizon jobs are rough - just moving the files into position has taken weeks, with all the tiny files on them.
[16:30] I'm still working on the verizon megawarc box
[16:34] When rsync is slow because of small files, tar is the solution, isn't it?
[16:34] megawarcing is slow
[16:34] Oh, right
[16:34] because of the ~10 or 20 million files
[16:35] Still, the sound of "10 or 20 million files" is delicious
[16:36] yeah but to create a 25 or 50GB warc file is a killer for your system
[16:44] Remember that a lot of those files were bruteforces - so there are lots of 404s underneath the 'millions of files' headline.
[16:44] Don't know if a little pre-processing of the WARCs just to extract the good stuff would help any
[16:48] Or if the megawarcing process could work in size order - e.g. largest files first [most likely to have content] etc.
[16:51] looking into the content of each file to see if it's a 404 isn't likely to be faster than just catting it
[16:52] I'd go by size order - that way you're getting the obvious cream off the top, then once you hit the small files that are all 404s, you can take the view on whether to continue or not.
[16:52] but I don't know how practical that is, admittedly.
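If anyone does want to try that kind of pre-processing, a rough sketch of it with the warcio Python library might look like this; the file names are made up, and it simply copies every record except 404 responses into a new WARC (the matching request records are left untouched here):

    from warcio.archiveiterator import ArchiveIterator
    from warcio.warcwriter import WARCWriter

    kept = dropped = 0
    with open('tele2-input.warc.gz', 'rb') as src, open('filtered.warc.gz', 'wb') as dst:
        writer = WARCWriter(dst, gzip=True)
        for record in ArchiveIterator(src):
            # skip only response records whose HTTP status line says 404
            if (record.rec_type == 'response' and record.http_headers
                    and record.http_headers.get_statuscode() == '404'):
                dropped += 1
                continue
            writer.write_record(record)
            kept += 1
    print(kept, 'records kept,', dropped, '404 responses dropped')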
[16:52] SketchCow: you can skip some files if you want
[16:53] We did kind of two stages
[16:53] 1. Bruteforce, we didn't have the sites list, so we just followed the patterns we could find
[16:53] 2. Downloading list of sites. We got a list of sites from Tele2 and downloaded everything from the list
[16:54] SketchCow: You can skip the warcs downloaded with the first stage, since that's just bruteforces and whatever has been downloaded there has also been downloaded during the second stage when we had the list of sites
[16:55] I'll take a look at how to find the warcs from the second stage. give me some time
[16:57] Not going to skip.
[16:57] Stop thinking of it.
[17:00] I thought your system might not be able to handle it or so due to the large amount of items
[17:00] It's always best to do everything
[17:00] good luck!
[17:01] No, it can handle it, it's just boring
[17:01] Windows in screen open for weeks.
[17:01] Ah, that's ok then
[17:02] Forced to blast Pink Floyd "Animals" album while considering next moves.
[17:03] I'm going to start testing qwiki-grab. Can you please tell me how much space is still left on FOS?
[17:04] So we won't run out of space during the grab.
[17:04] 1.2T
[17:05] 4.8T on another drive
[17:06] Ok, thank you, I'll let you know how big the average item of qwiki is
[17:08] DFJustin: heh, https://ello.co/robots.txt
[17:15] "username may only contain letters, numbers, dashes, and underscores"
[17:15] boo
[17:17] -- i'm also providing ello invites, just /msg me --
[17:17] -- no email address needed --
[17:17] yipdw: and somebody has already registered ck. :(
[17:18] ha
[17:19] nobody got ckfight
[17:19] cks?
[17:19] taken
[17:20] SketchCow: what sort of hell raising is being done at ello?
[17:22] xmc: didn't you have an amusing facebook custom url, or perhaps I'm thinking of bsparks?
[17:22] cksucker
[17:22] nice DFJustin
[17:23] :D
[17:27] https://ello.co/cksucker
[17:27] ok done
[17:27] heyooo
[17:28] now I get to see how long it takes for them to ban me
[17:29] in the meantime maybe I should glitchr the shit out of this account
[17:30] Do you guys know if there is any effort to archive descriptions/app versions from the various mobile app stores?
[17:31] That is one piece of content that is almost lost due to not even using the web sometimes
[17:31] ugh, stupid cloud shit
[17:31] At least the web is scrapable even if some efforts generally don't get everything
[17:32] tfgbd: most of that stuff is web-based
[17:32] steamcommunity, app store, etc are
[17:33] In the earlier days some of it wasn't
[17:33] it's only been SORT of recently that Google's Android Market has even been available on the website
[17:34] ok, but nowadays you can access Google Play, the Apple App Store, the Steam store listing, etc all via the Web
[17:34] I don't know of any particular effort to get that data, but if you want to start one feel free
[17:34] Though, even before Google had an official website, there were scrapers
[17:34] I'm wondering if there is something that downloads the apps/archives them along with the descriptions and comments
[17:35] I assume most of the early Apple iPhone stuff is lost, though
[17:35] I know there are a few mirrors of the Android Market/Play Store but they seem to have the same (current) stuff the official store had
[17:36] * yipdw shrugs
[17:36] no point in getting bent up about data that's gone
[17:36] https://archive.org/details/android_apps
[17:37] and yeah, there are lots of warezish sites full of ads that seem to archive some old .apk/.xap/etc for download
[17:37] but it's hardly a perfect effort
[17:37] and those can likely go down too
[17:37] lol
[17:37] it seems we need to archive the archives
[17:37] Why not cry? It's depressing to lose good information/tools/data
[17:38] that's what pisses me off the most about the damn internets
[17:38] 13:33 < tfgbd> Do you guys know if there is any effort to archive descriptions/app versions from the various mobile app stores?
[17:38] Neat, Archive.org already has an effort. Love those guys :P
[17:38] Yes, I am working with people who are downloading the entirety of Google Play
[17:39] do they have some kind of app that runs and downloads everything (free) it can?
[17:39] what about the paid stuff?
[17:39] oh, and did anyone get Handango and PocketGear?
[17:39] These are a lot of questions.
[17:39] sorry
[17:40] There was so much nice stuff there. And when they merged it also killed lots of older freeware
[17:40] tfgbd: if you ever come by big/small/whatever size websites that are going down inform us asap
[17:41] do I just paste it in channel?
[17:41] There is a lot I know of, probably
[17:41] I think you'll find on Archive Team that instead of crying over what is lost, we tend to focus on what needs to be saved or preserved now.
[17:42] I have a pretty nice collection of Windows CE software (including stuff I've compiled myself) but I unfortunately don't have most of the pages that hosted the stuff
[17:42] Let the many, many, many other people in the world who do nothing sit around going "oh, if only"
[17:42] Coffee is for Closers
[17:42] some stuff is from archive.org, anyway
[17:43] There is a lot I know of, probably
[17:43] Please post it all here
[17:43] is there some keyword I should use so you guys find it?
[17:43] i mean those are a lot of logs to sort through
[17:45] While they're not going down, I can think of quite a few fairly large WinCE/general mobile sites that have huge collections of useful stuff
[17:45] make a hitlist on the wiki
[17:45] do I just need to make an account?
[17:46] yep
[17:46] secret word is yahoosucks
[17:46] www.hpcfactor.com is pretty much the only source left for support of the Handheld PC platform and some other similar WinCE devices along with some things for Win9x/old NT
[17:46] BUT it's pretty much run by one guy
[17:47] And it has lost forum posts before
[17:47] wayback's latest snapshot of that site is from July 23, 2014
[17:47] how recent is that considering its update rate?
[17:47] Very
[17:48] but wayback generally doesn't get the whole thing.
[17:48] it seems to have trouble with forums and the like
[17:48] got most of it
[17:48] probably just need to grab the forums
[17:48] is there any way to do it with a login so the download links can be obtained?
[17:49] yeah that's possible
[17:49] but not likely to be done
[17:49] most tools have support for cookie jars and/or HTTP basic auth
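On the cookie-jar point, a logged-in grab can be as simple as handing wget a cookies.txt exported from a browser session; this is only a sketch with a hypothetical forum URL and output name (wget's --http-user/--http-password would cover the basic-auth case the same way):

    import subprocess

    # cookies.txt is a Netscape-format cookie jar exported from a logged-in browser;
    # --warc-file writes hpcfactor-forums.warc.gz alongside the normal mirror
    subprocess.run([
        'wget',
        '--mirror', '--page-requisites',
        '--load-cookies', 'cookies.txt',
        '--warc-file', 'hpcfactor-forums',
        'http://www.hpcfactor.com/forums/',  # hypothetical starting URL
    ], check=True)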
[17:49] I know the author. maybe I can ask him to contribute his own archive
[17:49] that'd be great
[17:49] The other biggie is xda-developers.com
[17:49] I mentioned last night that ArchiveBot doesn't, which is less a technical question and more of a self-imposed rule
[17:49] xda-developers is on my hitlist
[17:50] they are HUGE and hopefully aren't going anywhere but still, it would be a crime to lose all that info and software
[17:50] yes
[17:50] heh
[17:50] xda already loses data and self-censors at a hilarious rate
[17:50] Some stuff was already lost before due to MS requesting stuff be removed from the FTP
[17:51] not going to talk about it much, but i'm working on a tool to archive forums specifically
[17:51] though, I know of a few people who got the FTP before that was done
[17:51] but the other problem is that a lot of newer ROMs/tools use shit like rapidshare or mediafire for hosting
[17:51] tool/service i guess
[17:52] how does archive.org treat warez? Do they generally just look the other way if it ends up in their archive?
[17:53] it's the same as youtube or any other site, you're not supposed to upload it but in practice it probably won't go anywhere unless someone asks for it to be removed
[17:53] That sucks
[17:53] i assume they still take backups for posterity, though
[17:53] there needs to be an alternative to them for the illegal stuff, I guess
[17:53] it isn't removed
[17:53] it's made unavailable
[17:54] mmm
[17:55] This is another one: http://www.yetanotherhomepage.com/j7xx/
[17:55] It's been stable for years but you never know
[17:55] also, I imagine many of the links to authors' sites are dead
[17:55] do a quick check with the wayback machine and if there are pages or files missing, bring it up in #archivebot
[17:56] note that wayback will auto-grab images or file downloads but you can tell they didn't have it if the archive date is today's date
[17:56] does anyone archive cvs/svn/git stuff
[17:56] fortunately there is sourceforge but the self hosted stuff tends to randomly die
[17:56] somebody in here was downloading a large percentage of github
[17:57] does archive.org preserve the server time stamps too?
[17:57] I believe so, all of the http headers should go into the warc archive
[17:57] they do
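An easy way to see what actually gets preserved is to read a WARC back; this sketch assumes the warcio Python library and a placeholder file name, and prints the capture time (WARC-Date) next to a server-supplied header:

    from warcio.archiveiterator import ArchiveIterator

    with open('example.warc.gz', 'rb') as stream:  # placeholder file name
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            fetched = record.rec_headers.get_header('WARC-Date')      # when it was captured
            served = record.http_headers.get_header('Last-Modified')  # what the server claimed
            print(url, fetched, served)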
[17:58] i think it's time for us to have an faq
[17:58] ^
[17:58] is there anyone working on a modified jdownloader that could automate mirroring mediafire/rapidshare stuff?
[17:58] after work today i'll put a list of faq questions up on the wiki
[17:58] can the warc tools get something like a rapidshare if you manually do the wait?
[17:59] https://github.com/espes/jdget
[17:59] due to the reliance on javascript to generate download URLs, that's unlikely to work very well, even with something like webrecorder
[17:59] oh shit we do have an faq already http://archiveteam.org/index.php?title=FAQ
[17:59] it doesn't cover most of my questions, unfortunately
[18:00] DFJustin: ?!
[18:00] DFJustin: native?!
[18:00] tfgbd: yes. it needs some expansion.
[18:00] I dunno someone linked it in here the other day, think it's unfinished though
[18:00] very cool regardless...
[18:09] there i added section headings or whatever
[18:09] what do you guys think of offline explorer?
[18:10] what is that
[18:11] http://www.metaproducts.com/mp/Offline_Explorer.htm probably
[18:11] Wha? you've never heard of it?
[18:11] it's pretty famous
[18:11] supposedly it can even do dhtml and html5 now
[18:13] It also has a download while you browse function
[18:13] product page doesn't detail its file format
[18:13] i can send you one if you'd like
[18:14] it mostly just seems to be html in folders
[18:14] * yipdw shrugs
[18:14] with a mirror of the original url
[18:14] try it and add it to the FAQ?
[18:15] can archive.org accept that?
[18:15] yes, but WARC is preferred
[18:15] does httrack support warc now?
[18:15] I don't know
[18:16] I also have a number of 4chan/other imageboard thread archives I've done with the tool ChanThreadWatch
[18:16] does archive.org take those too?
[18:16] or someone else
[18:16] part of what we're trying to do here is also to archive stuff in file formats that are de jure/de facto standards
[18:16] if there's stuff in older formats that's fine too
[18:16] but going forward
[18:17] unfortunately, it's not always possible to use these standards
[18:17] you have probably seen WARC thrown around here, and no it's not perfect, but it is also the second serious attempt at formalizing a standard for archive access
[18:17] web archive access anyway
[18:17] the first one being IA's original ARC
[18:18] as for whether or not IA will accept it, the answer is usually yes
[18:18] whether or not the data can be used by automated processes to derive other products is a harder question
[18:18] one reason why WARC is attractive is that its format lends itself nicely to request replay
[18:19] also, wayback/pywb/webrecorder/archivebot/wget/wpull etc all speak it
[18:19] warcproxy, warcmitmproxy, warctools, ...
[18:19] is there any kind of browser plugin that will just automatically download everything you view and send it to archive.org without user intervention?
[18:20] the barrier to entry still seems too high for most people
[18:20] web.archive.org/save can be turned into a bookmarklet
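That bookmarklet is just a wrapper around the save endpoint, so the same request can be scripted; a minimal sketch with the Python requests library, reusing the yetanotherhomepage URL from above purely as an example:

    import requests

    # requesting https://web.archive.org/save/<url> asks the Wayback Machine to
    # capture that page now, same as the bookmarklet would
    page = 'http://www.yetanotherhomepage.com/j7xx/'
    resp = requests.get('https://web.archive.org/save/' + page, timeout=60)
    print(resp.status_code)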
[18:20] as for recording sessions, I'm not aware of one that isn't webrecorder.io, and that still requires a WARC download
[18:20] as for barrier of entry, feel free to work on things to lower it
[18:21] There is Offline Explorer but unfortunately it's not the same standard
[18:21] no reason it couldn't be, except for its proprietary nature and whether or not its developers care about standards
[18:21] When I seriously looked into the tools available a few years ago, I was kind of surprised there wasn't more
[18:21] things have changed a lot in a few years
[18:21] i guess people just don't care
[18:21] i've heard the same happens at libraries
[18:21] you're indirectly insulting a lot of people here, heh
[18:22] I don't mean you guys
[18:22] I just mean the world in general
[18:22] it's amazing you were able to get geocities
[18:22] oh ok
[18:22] tfgbd: I know someone is working on a tool that eats firefox history and outputs warc
[18:23] after time and bandwidth
[18:23] bookmarklet to webrecorder is probably one place to start
[18:24] actually, it has one
[18:28] webrecorder.io?
[18:29] yeah
[18:29] I tried it yesterday but it seems kind of counterintuitive to have to download everything on the server and then download the warc yourself
[18:29] doesn't archive.org have some kind of browser or something like that
[18:29] the WARC is generated as you go
[18:29] none that I've seen
[18:30] anyway, if you have UI suggestions, much of the code for webrecorder is open-source and the guy who maintains it is also receptive to ideas
[18:30] https://github.com/ikreymer
[18:31] oh, so I can host it myself?
[18:31] yes
[18:31] I haven't tried to set one up yet; it's on my list-of-things-to-try-for-ArchiveBot
[18:31] but the code is out there
[18:31] is it possible to get it to automatically send everything to archive.org without the end user needing to download a warc at all
[18:32] maybe
[18:32] if you have a set of archive.org S3 keys you can upload
[18:32] that option probably doesn't exist in the software yet
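The upload-with-S3-keys step can also be scripted; a sketch assuming the internetarchive Python library, with a made-up item identifier and file name (the keys are the S3-style ones from your own archive.org account):

    from internetarchive import upload

    upload(
        'my-warc-uploads-2014',                  # hypothetical item identifier
        files=['mobile-crawl.warc.gz'],
        metadata={'mediatype': 'web', 'title': 'Example WARC upload'},
        access_key='YOUR_IA_ACCESS_KEY',         # S3-style keys for your IA account
        secret_key='YOUR_IA_SECRET_KEY',
    )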
[18:35] you can use httrack etc. if you route them through a warc proxy
[18:36] ia will accept anything you upload but only ARC and WARC can be ingested into the wayback machine
[18:36] so that's where we've been focusing our efforts
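Roughly what the proxy route looks like, assuming a WARC-writing proxy such as warcprox is already listening on localhost:8000 (the port and URL are placeholders); the same trick applies to httrack or any other client that honors a proxy setting:

    import requests

    # everything fetched through the proxy ends up in the proxy's WARC output
    proxies = {'http': 'http://localhost:8000', 'https': 'http://localhost:8000'}

    # plain http shown here; for https the proxy has to man-in-the-middle TLS,
    # so its generated CA certificate would need to be trusted by the client
    requests.get('http://example.com/', proxies=proxies, timeout=30)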
[18:41] so if you used httrack you'd effectively have two copies of the site?
[18:41] do you guys offer any sort of VPS hosting or anything for people who don't have access to the storage space required?
[18:42] yes you would end up with two copies
[18:43] tfgbd: archivebot is generally used by people who can't or don't want to maintain their own wget+warc setup, for archiving stuff
[18:43] if you need more advanced stuff... *somebody* in here may be able to help out with a VM
[18:43] but there's no 'official' service of any kind afaik
[18:43] just random people going "oh yeah, sure, here, have a VM" occasionally
[18:43] :P
[18:47] cool
[18:47] who do I ask for a VM?
[18:48] word
[18:50] no idea, generally just post what you need in here with sufficient detail and see if anybody offers :P
[18:50] or, if you just need one temporarily, you can also consider trying digitalocean or so
[18:50] Okay, I made an account
[18:51] There is another completely free (for life supposedly) vps service but it kind of sucks and isn't a "real" vm
[18:51] and you don't get much disk space either
[18:51] which service is that?
[18:52] hold on
[18:52] host1free by any chance?
[18:52] they only offer ipv6 IPs, though
[18:52] ever heard of that one
[18:52] will check it out
[18:52] don't, host1free is awful
[18:52] lol
[18:52] and very dodgy
[18:53] they want way too much info
[18:53] it's not a "real" vps as much as a container
[18:53] openvz?
[18:53] you still get root but can't install any drivers or anything
[18:53] http://archiveteam.org/index.php?title=Clown_hosting
[18:53] yes, openvz
[18:53] I hate that shit
[18:53] openvz is fine for most cases :P
[18:53] Yeah, I'm one of those weird people who wants to run VMs in VMs
[18:53] also, no point in installing drivers in a VM anyway
[18:53] well
[18:53] you can, theoretically
[18:53] under openvz
[18:53] using qemu
[18:54] it's just going to be horrifically slow
[18:54] exactly
[18:54] anyway
[18:54] realistically
[18:54] on Hyper-V, KVM or VMWare you can usually run virtualbox, VMware, Virtual PC, etc
[18:54] well
[18:54] yes and no
[18:55] it works
[18:55] I've signed up for tons of VPS trials just to try it out
[18:55] usually, on consumer VM services, there's not really a difference between openvz and kvm in terms of what nested virt you can run
[18:55] because no virt extensions
[18:55] afaik even KVM/Hyper-V/VMWare don't do that
[18:55] that is
[18:55] emulating it
[18:55] virtualbox requires a kernel module, but that's *technically* possible on openvz
[18:56] just requires a cooperating host
[18:56] But they still virtualize a full PC
[18:56] sure, but that doesn't mean you can suddenly run all kinds of virt
[18:56] which is enough to run Virtual PC or VirtualBox
[18:56] try running KVM without virt extensions :)
[18:56] they might be a little slower but it's not QEMU slow :P
[18:56] sure
[18:56] but again, theoretically you can run virtualbox under openvz
[18:56] you just need some cooperation from the host
[18:56] Right
[18:56] (which most won't give for security reasons, but that aside)
[18:56] ain't gonna happen ;P
[18:57] that's why it's easier to just cut out the middle man
[18:57] plus, a full PC is better, anyway
[18:57] you can even dual boot
[18:57] tfgbd: the best option is to just get a cheap dedi with virt ext :)
[18:57] anything so cheap it's free
[18:57] and here is the host: http://www.vps.me/free-vps
[18:58] it's free but they don't give you much space and it is ipv6 only
[18:58] I doubt they would install vmware or virtualbox if I asked
[18:59] that's not bad
[18:59] but, indeed
[18:59] maybe add it to the FAQ?
[18:59] When I finally wade into this gamergateage, I see how people are going to just jump on me.
[18:59] And hey, it's cool to know I'm not the only one who loves to mess with nested virtualization
[18:59] It's designed to wreck anyone who touches it.
[18:59] tfgbd: I only know about it from a theoretical POV, I don't actually really run nested virt :) aside from an openvz testing setup on a KVM
[19:00] you mean openvz in KVM?
[19:00] yes
[19:00] I find it hard to consider openvz real virtualization
[19:00] that's what I said :P
[19:00] it is
[19:00] app virtualization, maybe.
[19:00] no
[19:00] it just virtualizes a different subset of things
[19:00] it's just an app
[19:01] it's really not
[19:01] not much different from BSD jails or Sandboxie
[19:01] they are effectively jails with virtualized kernel calls / devices
[19:01] still seems pretty neat that it even works on windows, though
[19:01] but it's absolutely virtualization
[19:01] also, this belongs in #archiveteam-bs
[19:01] perhaps we should move there
[19:01] :P
[20:16] SketchCow: each qwiki item is around 70 megabytes. The first batch has a bit less than 1500 items, so it's around 100 GB.
[20:17] Can I start the project and use fos? or is it too busy/full?
[20:23] No, go ahead
[20:57] Ok, started
[21:32] Following up on this http://badcheese.com/~steve/atlogs/?chan=archiveteam&day=2014-10-01 Emijrp wrote http://pastebin.com/vGmtaZe2
[21:33] (he wrote it to me in an email... i'm the one that pastebinned it. So as not to mislead. Not that it's probably a big deal. But I like to keep good records, anyway. Aaah, you get the idea. I'll shut up now.)
[21:54] I don't have a copy
[21:54] But Platonides won't go anywhere, at some point the files will reappear if he still has them
[21:54] well, unless they were on toolserver.org :o
[21:54] let me check
[21:56] but seriously, there was nothing in knol ;)
[22:05] Mm, that's too bad