[00:14] I'm typing from my Kindle to tell you that myloc.gov is closing on November 19th. This site by the Library of Congress has been open for a while, and people's reading lists, collections, etc. will be deleted after this time. [00:17] http://myloc.gov/Pages/notice.aspx [00:43] JRWR: if you work at a tech recycling place, let me pitch one of my favorite causes: donate items to iFixit (free shipping!) and repair manuals get made for them by students: http://www.ifixit.com/Info/Device_Donations [02:10] For the moment, everything is stream only [02:12] ok [02:12] just thought it was odd [02:14] People need to realize jsmess.textfiles.com is my big demo [02:15] And the Internet Archive Historical Software collection is suit and tie [02:29] the speech was really good, and the press about it is very good so far [03:14] dashcloud: i'll send over the 30+ versions of coby tablets we got in [03:16] anyone have a more automated twitch.tv video downloader? twitch seems to split up streams into 30 minute chunks, and you need to download each chunk separately [03:17] sounds like HLS [03:17] is it HLS? [03:22] hello, i just learned about archive team recently. i realized there's no backup of the 4chandata archive, which deleted all its images and went text-only months ago. i have a full backup of all the images with original filenames, is there anything i should do with it? [03:23] SketchCow, you're needed [03:26] also, semi-related, are there any plans to archive the foolz archive? for some boards like /a/, they massacred a shitload of images when they set the full-size image limit to 6 months [03:27] JRWR: no clue- is there an easy way to tell? [03:31] twitch.tv is the site- pick any stream there. http://www.twitchtools.com/video-download.php is the site I'm using right now to download the videos- it presents you with all the segments of the video. [03:33] dashcloud: If i remember right, Twitch is switching over to HLS, should be easy to extract [03:33] Has anyone considered using the google cached pages to grab some more of isohunt? [03:33] ryonaloli: hi, still there? [03:33] i am joepie92 [03:33] fgsfds: yes, but anything Google is generally a pain in the ass :( [03:33] ryonaloli: you could upload your backup to archive.org [03:33] how do i do that? [03:34] basically, go to http://archive.org/, create an account, log in, click the upload button in the top right corner [03:34] one note: the e-mail address you used to sign up will be available alongside your upload [03:34] (not that anyone ever looks at that, but it's technically there) [03:34] is there any way i can do it anonymously? [03:34] or is it enough just to use 10minutemail [03:35] a throw-away email would probably be as anonymous as it gets [03:36] ok, i just don't want it to be rejected if i use a temp email [03:36] fgsfds: the issue with Google is primarily that they throttle like crazy on all their services - including the cache [03:36] bing/yahoo cache would be more likely to turn up something useful than that of google [03:36] (if any of them even cache isohunt in the first place) [03:36] ryonaloli: I don't think it would be [03:36] I'm not sure, since I upload all my own stuff under my own e-mail address... but I can't see an immediate reason why temp emails would be rejected [03:37] oh, and the other thing, i also have a copy of all of gelbooru's deleted (but not purged) images along with gzipped html from the posts, is that worth uploading?
i scraped and archived it myself to capture images which me and others enjoy, but which were deleted because of ToS violations (guro, etc). is that worth uploading as well or should i just make my own torrent like i planned to previously? [03:39] ryonaloli: everything is worth uploading :) [03:39] can't say whether it will stay accessible (depending on content), but either way it'll be archived [03:40] content like, nsfw? [03:41] ryonaloli: see, the main purpose of archive.org is to store; not to distribute (that's the secondary purpose) [03:41] even if something were to be blacked out because of issues with the content, it would still be archived [03:42] so things like nsfw content really shouldn't be an issue [03:42] (as far as my understanding goes) [03:42] better to upload it now and have it potentially blacked out later, than to not upload it now and need it later when no copy exists anymore [03:43] if a large percentage of the content is garbage (aka deleted because obvious troll images, etc), would that still be acceptable? i can't sort all the content manually, it's 91k images [03:43] sure, just upload it [03:43] one person's garbage may be another person's treasure :) [03:44] * joepie92 has "archive first, ask questions later" as personal slogan by now [03:45] ok, and last question, both of the archives are completely uncensored, and some (such as the 4chandata archive) might have occasional highly illegal content (child porn, bestiality, etc). i obviously can't search through such a large amount of images, so if a few such images are accidentally uploaded, is that going to be ok? i'll use Tor to upload anyway, i just want to be on the safe side [03:45] * joepie92 CCs BlueMax [03:46] yo [03:46] i don't care about the morality, i believe everything period should be saved, but i care about my safety and the archive's safety [03:46] BlueMax: see above [03:46] thoughts? [03:46] I don't know much about the archive but if this data's going up either way it should be kept dark until the images are searched [03:48] SketchCow, you're needed [04:02] is there any issue with archives obtained illegally or semi-legally? like, will archive.org still accept that kind of thing? [04:04] like, from archiving sites with a no-archive clause in the ToS, or using compromized computers to archive sites in an emergency [04:07] are you talking about things you did personally, ie comprimized computers to archive, or something someone else did [04:07] hypothetically [04:08] so hypothetically would it be you using comprimized computers or someone else [04:08] also we are both spelling compromised wrong, lol [04:08] hypothetically if i had such a thing, would archive.org or archive team accept an archive created using such a tool [04:09] yeah well irssi spell checker sucks [04:10] in terms of emergency archives that's not the way everyone here goes about doing it. in terms of archive.org, they wouldn't know. [04:10] how many people are mobilized during emergency archiving? [04:10] if you're going to do that then don't tell anyone you're doing it [04:11] whoever steps up. some people here are affiliated with archive.org but many archiving projects are not run under archive.org or have any affiliation [04:11] aka archiveteam [04:11] >1k? [04:11] doubtful [04:12] oh [04:12] bw/resources is usually not a giant problem. the problem in emergency archiving is most of the time finding someone to write a script to do it all right [04:13] there's no c&c for scripts?
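To make "write a script to do it all right" concrete: a minimal sketch of what an ad-hoc emergency grab could look like in Python, assuming the requests and warcio libraries and a hypothetical seed list of URLs. Real Archive Team projects use wget-lua and the seesaw kit instead, so this only illustrates the shape of the job, not the actual tooling.

```python
# Minimal ad-hoc grab sketch (illustrative; real projects use wget-lua + seesaw).
# Assumes `warcio` and `requests` are installed; ITEM_URLS is a made-up seed list.
from warcio.capture_http import capture_http  # must be imported before requests
import requests

ITEM_URLS = [
    "http://example.com/page/1",  # hypothetical pages to save
    "http://example.com/page/2",
]

def grab(urls, warc_path="emergency-grab.warc.gz"):
    # Every request made inside this block is recorded into the WARC file.
    with capture_http(warc_path):
        for url in urls:
            try:
                resp = requests.get(url, timeout=60)
                print(url, resp.status_code, len(resp.content))
            except requests.RequestException as exc:
                print("failed:", url, exc)

if __name__ == "__main__":
    grab(ITEM_URLS)
```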
[04:13] https://github.com/archiveteam [04:14] http://archiveteam.org/index.php?title=Main_Page [04:14] that's a lot :/ [04:14] all the grab scripts are mostly based off of each other and have a similar framework [04:15] ryonaloli: curious; how'd you find out about archiveteam? [04:15] isohunt's fall [04:15] ahh :) [04:16] i never would have thought something like this existed... i always used whatever means i could to archive dying sites, had no idea there was a whole team lol [04:17] ryonaloli: I've mostly been there [04:17] though I wasn't as busy archiving sites [04:17] mostly just saving everything I ran across [04:17] "just in case" [04:18] (some of those just in cases have actually occurred) [04:18] yeah, same. i only archive sites when i notice a moralfaggy ToS change [04:18] but my shitty DSL and VPN slows things down :/ [04:19] so who on this irc is in charge of all this? [04:19] there's technically not really one person "in charge" of things [04:19] it's mostly just "go get shit done" [04:19] no hierarchy at all? [04:20] though a few bits of the 'infrastructure' (think the tracker, etc.) are centralized [04:20] no formal hierarchy [04:20] I think [04:20] although everybody listens to SketchCow [04:20] but that's not really a hierarchy thing I think :P [04:20] if i set up a script on my vps or idle computers that are low maintenance, will using the tracker be all that's needed to contribute? [04:21] or do the scripts used and api change all the time or something [04:21] ryonaloli: well, the current architecture is that there is A. a VM image that you can run in virtualbox etc., it will automatically pick up new projects and B. you can run scripts manually using the seesaw kit [04:22] the former is pretty maintenance-free but not really suited for use on VPSes [04:22] the latter requires some work [04:22] is there no low maintenance way with less overhead? [04:22] not yet :) [04:22] well the warrior vm is pretty low maintenance? [04:22] S[h]O[r]T: yes, but not low overhead [04:23] seesaw scripts are low overhead, but not low maintenance [04:23] (because you have to manually clone and run every one of them) [04:23] yeah, and manually update. the warrior is all automatic [04:23] this was one of the reasons I suggested docker a while ago, but that seems to rely on lxc so won't work on many VPSes [04:23] but there are people who have done warrior images on ec2 and whatnot [04:23] should probably just have a framework [04:23] yes, but ec2 is ec2 [04:24] that is not at all the same use case as "I have a VPS that's not really doing much..." [04:24] :P [04:24] yeah [04:24] the warrior is also just really a set of scripts so you could pull it apart and run it on a vps probably [04:24] anyway, ryonaloli; the takeaway is that for a VPS you'd be pretty much stuck running them manually for now (although it's now properly documented, since isohunt) [04:24] and help is absolutely needed to automate that more [04:24] a la warrior :) [04:25] S[h]O[r]T: I had a look at it and you can't just copypaste it into a VPS, basically [04:25] right [04:25] can't recall why exactly, but it needed more work than that [04:27] I may try to run warrior in a FreeBSD jail. Would need to use Debian with the BSD kernel, AFAIK. [04:27] does *BSD support openvz? [04:29] I looked that up in the past few weeks...
IIRC, my hardware is not new enough for that :P [04:29] ryonaloli: nope, openvz is based on a custom Linux kernel [04:29] though more and more is being merged into mainline [04:30] so perhaps over time you may see some stuff transferring over to BSD [04:30] but for now it's not possible to run ovz on BSD [04:30] you can run openvz without the vz kernel [04:30] it'll just have fewer features [04:30] we will have bhyve, we don't need no openvz :3 [04:31] what's bhyve? [04:32] their 250k container system [04:33] kinda like a more-native kvm or so [04:33] similar in idea, but doesn't support old hardware at all [04:34] but kvm works totally differently than openvz [04:34] ryonaloli: the main reason why there are so many grabbers is that website structures are freeform and there has not been much effort to date put into consolidating the common patterns (because there's not much payoff) [04:34] ArchiveBot is one of a few efforts to do that, but to date it will not handle multi-terabyte dumps [04:34] (it is also not designed to do that) [04:35] are there any plans for a centralized c&c that will at least give URLs to archive and instructions to individual grabbers? [04:35] that already exists [04:35] http://tracker.archiveteam.org/ [04:35] ryonaloli: the tracker hands out 'tasks' [04:35] but the code for actually downloading stuff is distributed separately [04:35] yipdw: doesn't that go via warrior hq? [04:35] that *could* be generalized to "here is a list of URLs, scrape them and report back to me about what you find" [04:35] projects.json and all that [04:36] warriorhq is a separate program [04:36] what information is contained in these "tasks"? [04:36] project-dependent [04:37] often they're a URL component, e.g. the MEMBERID in http://www.example.com/[MEMBERID] [04:37] in some cases they are a reference to a larger data packet [04:38] where can i get documentation on the tracker? [04:39] :P [04:39] https://github.com/ArchiveTeam/universal-tracker [04:39] i mean, for the format of tasks, etc [04:40] oh, that's not documented -- a task is just a string [04:40] example string? [04:40] more concretely, it's an element in a Redis set [04:41] ryonaloli: there is no fixed format [04:41] sometimes it's a username, sometimes it's some other unique identifier [04:41] is it fixed enough that a computer can understand it alone? [04:41] no [04:41] well, actually [04:41] yes [04:41] the meaning of the identifier is contained in the fetch pipeline [04:42] but if you're looking for a task schema, there is no such thing [04:42] there is in theory no reason why tasks could not be URLs, or groups of URLs [04:42] in practice, that isn't done [04:43] i'm just looking for a way to set up an archiver on a few idle windows computers that i won't be able to maintain all the time, but that i'd like to have archiving for archiveteam automatically [04:43] the best way to do that right now is to start up the Warrior VM and set them to "ArchiveTeam's Choice" [04:44] for Windows machines, that will probably remain the best way for the foreseeable future [04:44] i can't run vms [04:44] oh uh [04:44] too much overhead [04:44] cygwin?
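Since "a task is just a string", the only thing that gives it meaning is the project's fetch pipeline. A rough sketch of that idea, using the MEMBERID example above; the function and URL patterns here are hypothetical, not taken from any real pipeline:

```python
# Hypothetical illustration of a fetch pipeline interpreting an opaque task string.
# The tracker hands out plain strings; this (made-up) project treats them as member IDs.

def urls_for_task(task):
    # Here a task is the MEMBERID in http://www.example.com/[MEMBERID];
    # another project might treat the same kind of string as a username
    # or a range of shortcodes instead.
    return [
        "http://www.example.com/%s" % task,
        "http://www.example.com/%s/photos" % task,
    ]

if __name__ == "__main__":
    for url in urls_for_task("12345"):
        print(url)  # a real pipeline would fetch these into a WARC and report back
```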
[04:44] so, historically, we have not had good results from Windows systems [04:44] if my computers were all linux, i'd just take whatever scripts the warrior distro is using [04:44] or Cygwin [04:44] that was one of the main reasons for the Warrior VM [04:44] can't do cygwin i don't think [04:45] yeah windows is kind of shit [04:45] i agree Aranje [04:45] it fucks up filenames [04:45] as a result, running the Archive Team programs on Windows machines is unsupported and discouraged [04:45] well, programs save the Warrior VM [04:45] :P [04:45] As of Windows 8, you can now set the hardware clock to UTC :) [04:46] fancy! [04:46] welcome to what, 1993? [04:46] I'm not saying it's impossible to get good results, just that nobody in AT has put in the effort to make it work *and* maintain it [04:46] the majority of people writing and maintaining grabber code / tracker code don't run it on Windows, so [04:46] yeah, the usual reasons [04:47] it's not just a picking-on-Microsoft thing, either [04:47] HFS+'s case-preserving behavior has also caused problems [04:48] er, case-insensitive-and-preserving [05:25] It could be nicer, we could have a Python core that is run [05:25] some client that runs in your tray, set and forget [05:26] how difficult would it be to code something like that? [05:26] about a month's worth [05:27] for one person? [05:27] ya, I can see that [05:27] maybe more if a standard is made [05:27] like how a worker is meant to be made and how they should be run / throttled [05:28] and what libs are on a system, and a package system for the workers, but I guess that's kinda already in place [05:28] the tracker would have to begin using a standard format for that to be of much use [05:29] then a nice config menu, could still be web based, since python supports spinning one up very fast [05:29] if it's python it could much more easily be cross platform [05:30] Yep [05:30] and have a cli version for linux/windows cmd [05:30] and having a central tracker would be nice [05:31] an API with storage for the work done, and a format to be stored on what needs to be done [05:31] it'd make it a lot easier for volunteers to just set up a scraper client and forget about it, like BOINC [05:31] Yes [05:31] What format/API does the tracker currently use? [05:31] unknown [05:31] it currently runs as a VM [05:31] Yeah, but it communicates somehow [05:31] oh, for the tracker [05:32] what does each archiving task involve? a url, probably a deadline... does it communicate with other clients at all or post updates on its progress? [05:32] Well most are custom [05:32] and is there any way to detect fake results or bad clients trying to send garbage? [05:32] yeah that's the issue [05:33] there can't be a standard if it's custom and changes each time [05:33] https://github.com/ArchiveTeam/universal-tracker is the tracker [05:33] nope, but you could do cross checks [05:33] that'd slow down the process though [05:33] and no, items aren't verified [05:34] It uses a general HTTP JSON API. That would work in any language. [05:34] *and on any platform [05:34] what information would be required to fully automate it?
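A sketch of what talking to that HTTP JSON API might look like from Python. The /request and /done paths and the JSON fields below are assumptions based on how the seesaw kit appears to talk to the universal-tracker, not a documented interface; check the tracker source before relying on them.

```python
# Hedged sketch of a tracker client: claim an item, then report it done.
# Endpoint paths ("/request", "/done") and field names are assumptions, not a
# documented API; see https://github.com/ArchiveTeam/universal-tracker.
import json
import urllib.request

TRACKER = "http://tracker.example.org/exampleproject"  # hypothetical project URL
DOWNLOADER = "yournick"  # name reported to the tracker/leaderboard

def post_json(path, payload):
    req = urllib.request.Request(
        TRACKER + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def claim_item():
    # Ask the tracker for one task; what comes back is just an opaque string.
    return post_json("/request", {"downloader": DOWNLOADER, "api_version": "2"})

def mark_done(item_name, bytes_uploaded):
    return post_json("/done", {
        "downloader": DOWNLOADER,
        "item": item_name,
        "bytes": {"data": bytes_uploaded},
    })
```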
[05:35] maybe it could be as simple as sending new scripts each time, but then there's the security issue, and even in a MAC i wouldn't trust such code [05:35] that is why a VM works well [05:36] it's completely sandboxed so people know that what they run won't harm their computer [05:36] but it's very high overhead, requires admin privileges, etc [05:36] also, a VM isn't very good security at ALL [05:36] there are already rootkits in the wild specifically designed to break out of VMs [05:36] let me stop you right there [05:36] it's a good thing that the Warrior VM image doesn't contain any [05:36] yes, there are some vulnerabilities occasionally [05:36] but VMs are extremely secure [05:37] they aren't designed for security [05:37] so [05:37] ok [05:37] and if they're receiving self-updating code... [05:37] yes, it is theoretically possible for a Warrior VM to get data that has been modified to contain a rootkit that will break out of a hypervisor [05:37] There is less potential for a VM to be exploited and used to access the host than code running *directly* on the host [05:38] we need a sandboxed system.. a set of APIs to do the archiving [05:38] yipdw: Warrior VM would not be the major target for a researcher finding that kind of bug. [05:38] maybe some Lua? [05:38] in *practice*, this isn't the sort of thing that happens often enough for effort to be spent on it [05:38] which is why nobody has done it [05:38] maybe use a lua sandbox with a limited API [05:38] that should lock it down [05:39] what you will find is that there is no threat model for Archive Team software because the MO up to now has been "we save stuff" [05:39] and that has actually worked alright, at least measured by the metric of "how much did you save" [05:39] if we did use python, maybe just some basic code review would do the trick [05:39] a much bigger risk to the project is someone that sends junk data. [05:39] security improvements are welcome [05:40] a lua sandbox might work for this use case, limit the API and you should be good [05:40] further discussion should probably be in #warrior [06:32] Is there much effort being made to hold onto IRC logs? [06:34] i thought the IRC was already logged [06:44] fgsfds: whose logs? [06:45] I mean general/popular channels. [06:45] well, i can always give my logs, but i just came here today lol [06:46] fgsfds: public logging, on most servers, is strictly prohibited unless decided by the channel [06:46] i think i read in the wiki that this channel did public logging [06:46] EFNet seems not to be too paranoid about it, but FreeNode definitely is [06:46] sure, because we decided so :) [06:47] IMHO it should be stated in the topic but as I said it seems EFNet doesn't care and I don't know it well enough [06:48] anyway, you should convince channels one by one or just start archiving the already-public logs (there are plenty, from FLOSS support channels for instance) [06:49] Even if it's not public logging, a complete copy would probably be handy to someone in about 6-7 years if the service dies or changes. [07:01] Where's my hug [07:01] Again, I bought into insane travel [07:03] * joepie91 hugs SketchCow [07:03] also, SketchCow, there's a question a bit further up that needs your attention [07:03] approximately 3 hours and 15 minutes back [07:16] What sort? [07:21] a guy told us he has tons of 4chan images [07:22] I see we had another person go through the "oh, surely we can use windows" [07:23] Ah, with a dash of "but what about cygwin" [07:25] ryonaloli: How big is this collection?
I can give you an FTP for it. [07:26] collection of 4chandata and gelbooru? [07:26] i think it's over 100gb, maybe more [07:26] That's no problem. [07:26] I have a few extra terabytes [07:27] do you use sftp? [07:27] I have regular FTP. Will that work? [07:28] sure, but i'll be using Tor so the exit node will be able to view plaintext traffic (idk if that includes username and pass) [07:39] SketchCow: what is the header to create a test collection on IA? [07:40] er [07:40] test item, sorry [07:42] test can be the collection name I believe [07:43] aha [07:44] SketchCow: decided to try it out with the HTML5 uploader; https://ia801008.us.archive.org/30/items/TestingItemVwMin/TestingItemVwMin_meta.xml [07:44] also that's a very noisy description editor [07:44] * joepie91 blinks [07:44] oh never mind, it worked as intended [07:47] SketchCow: just want you to know i'm saving microsoft research presentations: https://archive.org/details/msrvideo.vo.msecnd.net-pdf-grab-103000-to-104000 [07:47] they're from a 3rd party CDN from what i can tell [10:04] I've been away for some days... Is there a project to do right now or just nothing? [10:04] I don't really mind if there isn't.. [10:12] url team is running again as of this morning [10:13] ah.. really? [10:13] i'll start my warrior then [10:14] i'm not at home, but i can run it [10:14] tracker is back up [10:14] good [10:14] I'll have a look at it [10:16] BiggieJon: does it show anywhere how much you downloaded? [10:17] http://urlteam.terrywri.st/ [10:17] yea, but it does show some nice stats [10:17] or do you mean your total bandwidth used ? [10:17] but not the amount of GB? [10:17] yea, that [10:17] you are running the virtualbox warrior in windows ? [10:18] yes [10:18] do you have the web client open (http://localhost:8001/) [10:18] yes [10:19] click on current project, at the bottom left it shows your bandwidth, current and total [10:19] totals on top line above the graph [10:19] ah [10:19] thanks [10:27] can someone mirror this for me: http://blip.tv/projectlore [10:27] since alex on diggnation made it [10:27] and the website is gone now too [11:43] well... my warrior seems to crash my windows every time it starts up... [11:43] (read: BSOD) [11:47] i only managed to upload 1 task [11:47] before it crashed all the time [12:15] ehm... I'm getting 500 error messages in the console on doing a task.. (urlteam) [12:15] does anyone else get this too? [12:21] Are there any tasks to do right now? [12:21] What crashes? Your windows installation? Or the warrior? [12:22] odie5533: Yes, there are tasks available @ the urlteam project [12:23] http://www.loopinsight.com/2013/10/28/lost-return-of-the-jedi-footage-discovered/?utm_source=loopinsight.com&utm_campaign=loopinsight.com&utm_medium=referral [12:24] There was a clip on reddit recently of a wampa attacking the rebel base in V [12:24] Maybe this is something for #archiveteam-bs? [12:36] ersi: it completely crashes my windows [12:36] causing it to give a BSOD [12:37] although this is annoying... i'm also getting 500 errors on doing the urlteam task [12:38] I can run the warrior for some minutes, but then the above happens [12:38] What version of Windows do you have? And what version of VirtualBox do you have installed? [12:38] ehm [12:38] 6.1 (windows 7) [12:39] and virtualbox 4.3 afais [12:39] this computer is using an AMD GPU and CPU [12:39] I heard from someone else this might cause this kind of thing to happen? [12:42] What? That AMD GPUs and CPUs aren't supported?
Sounds like either a miscommunication or a gross misconception [12:43] well, I run the same thing at home, with Intel stuff though, and there it just runs fine without crashing my windows [12:44] I'd give downgrading to VirtualBox 4.2.X a try on the machine that it continues to crash on. [12:47] ah ok, it just crashed again [12:48] I'll give it a try [14:13] ersi: I'm getting error 500 constantly.. Do you have any idea what could be causing it? [14:16] anyone else maybe? [14:25] Does it only say "HTTP 500" or does it actually say something more? [14:52] DDG: I have no idea why you'd be getting 500. It's running smoothl.. [14:52] smoothl? [14:52] smoothl! [14:53] mhm ok [14:53] it works fine now [14:54] but no tasks coming now... [14:55] 2013-10-28 14:55:20,440 tinyback.Tracker INFO: No tasks available <- :S [15:07] GLaDOS: I think the problem is that he has claimed tasks. Try doing a search for his IP in the tasks DB and clear his claims [15:08] DDG: IP? [15:08] I think they time out after a while (can't remember) - but it's probably a pretty long while [15:08] Also, time out takes 30 minutes [15:08] what do you mean with IP? [15:08] my address? [15:08] Yeah [15:08] I'll pm it, is that ok? [15:09] Yeah, that's fine [15:09] EVERYONE, HIS IP IS 127.183.59.34 [15:10] Heh [15:10] lol [15:10] but my warrior has done nothing [15:10] the last 30 mins [15:10] at least [15:14] DDG: try again [15:14] i'll reboot the thing then [15:14] No need to reboot it though, it'll do requests every so often [15:15] every so often = how long? [15:16] just curious [15:16] I think every 300 seconds? [15:16] or 60 maybe [15:16] It says [15:17] strange, it was inactive for me at least 30 mins.... [15:18] As in every few seconds (can't remember the detail but 10-300s) it'll contact the tracker to request new work. [15:20] but after a few requests I did it stopped for some strange reason (I didn't touch the thing at all) [15:20] well it works again now [15:38] so...i started cloning repos from bitbucket as well as github... does anyone have a list of bitbucket users? [15:39] i started with a seedlist of 20 users from google using site:bitbucket.org [15:39] and now have a list of 35k users [15:39] just spidering their followers and who they are following [15:49] Whoah :) [15:51] SketchCow, pinged you with this a couple days ago, but don't think you saw it — writing again: Here's a list of web archives I've uploaded that aren't in the right collections (WikiTeam and Archive Team) yet… I'd appreciate it if you could move them. Thanks. :D http://hastebin.com/raw/hejuvokoru [15:51] Wow, that is going to be… a lot of data XD [15:52] kyan__: try emailing him. It works better. [15:52] (jscott@archive.org) [15:52] GLaDOS, ah ok. Sounds like a plan. Thanks :D [16:12] kyan__: Done [16:12] But that was just luck, it's much better to mail me. [16:12] SketchCow, Ok, sounds good. :) Thanks! [17:03] it would be nice if IA had more anonymity for uploaders, to match what they're doing with reader privacy [17:34] wait 127.x.x.x is local loop-back (or was that the joke?) [18:36] phillipsj: to answer your question: that was the joke ;) [20:27] anyone here in .jp / has access to an IP there? [20:27] apparently http://www.emulation9.com/emulators/ (and the whole domain) is blocked elsewhere [20:27] balrog, rdns? [20:27] or ip blocks [20:27] because nobody said you had to own the domain you're rdnsing to :) [20:28] not sure...
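The bitbucket user spidering described above (seed a small list, walk followers/following, keep everything seen in a set) boils down to a breadth-first crawl. A sketch under those assumptions; the API URL pattern is a guess and pagination is ignored, so treat it as pseudocode rather than a working Bitbucket client:

```python
# Hedged sketch of the seed-and-spider approach described above.
# The endpoint pattern is an assumption about Bitbucket's API and ignores
# pagination and rate limits; verify the real URL scheme before using it.
import collections
import json
import urllib.request

API = "https://api.bitbucket.org/2.0/users/{user}/{relation}"  # assumed pattern

def related_users(user, relation):
    # relation is "followers" or "following"; returns usernames, [] on any error.
    try:
        with urllib.request.urlopen(API.format(user=user, relation=relation)) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        return [entry["username"] for entry in data.get("values", [])]
    except Exception:
        return []

def spider(seed_users, limit=35000):
    seen = set(seed_users)
    queue = collections.deque(seed_users)
    while queue and len(seen) < limit:
        user = queue.popleft()
        for relation in ("followers", "following"):
            for name in related_users(user, relation):
                if name not in seen:
                    seen.add(name)
                    queue.append(name)
    return seen

if __name__ == "__main__":
    print(len(spider(["someuser"])))  # "someuser" is a placeholder seed
```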
[20:32] joepie knew some Japanese people who visited this chan [20:40] another website like that is http://www.alicesoft.com/ [20:48] that one's blocked from IA [22:37] http://urlteam.terrywri.st/ <- and it's down again.. [22:38] oh well [22:38] I have to go for now