[00:01] first things first. bed. [01:03] does anyone know how to save all the comments/metadata from a youtube channel? dascottjr's channel is at risk of being suspended due to copyright strikes [01:20] midas, i think downloading petabytes of data (a lot(?) from adsl lines) is pretty hard. [02:42] comments no, but youtube-dl looks like it handles most of the other metadata- descriptions and thumbnails [03:16] ------------------------------------------ [03:17] Internet Archive is Rewriting Wayback Machine from Scratch in Python. [03:17] Invites testing, comments, archiveteam bastardry: https://github.com/ikreymer/pywb [03:17] ------------------------------------------ [03:19] whoa! [03:20] i also hope that wayback machine will be made easier for home usage [03:20] that is what they are going for. [03:20] The ability to start it up on a machine and start banging through downloaded WARCs [03:33] hi, i'm @ikreymer on github. this is a brand new project to try to make wayback machine easy to use, as well as extensible. there's a lot more to do, but welcome any comments you have so far [04:58] coolio [05:08] so i found xml data for cnet and cbsnews videos [05:20] so my xml dump of api.cnet.com is moving very fast [05:21] it's already 12k+ downloaded [05:30] i'm fucking awesome [05:30] i'm grabbing 2003 cbsnews videos [05:51] rad [06:34] SketchCow: are they also going to add more features to the wayback machine? [06:34] like [06:34] will there ever be some kind of search function in the wayback machine? [06:36] search would be nice [06:36] i want to be able to search stuff like media files [06:37] I think the issue with search is the huge amount of indexing infrastructure that would be required [06:37] I would love even like a basic filename search though [06:37] basic filename search would be great too [06:38] anyways i'm starting to upload more episodes of wilkow [06:39] I think that people need to get into their heads that the Internet Archive is not a business. [06:39] This isn't amazon and being able to go "fuck, man, make my purchases show up in pinterest, yo" [06:40] If people want a feature, they should ask for it by providing code or links to code [06:40] Otherwise the team moves at their pace [06:40] Keeping the ship together is a task in itself. I sit in the infrastructure channel. [06:40] Blowups every week [06:40] It's like the last third of the titanic movie [06:41] People falling and bouncing off decks, old people holding each other inside flooding cabins and crying [06:41] Leo falling in love with kate [06:43] Haha, ok, I see [09:55] SketchCow: Awesome rad re new dev Wayback [09:58] Ilya wants help and information [10:02] ikreymer: Any specific things? :-) I'm def willing to shake trees as well as help out. [10:06] thanks! well, right now, just basic testing.. especially if you have local warc files that you have crawled.. it would be interesting to see if you can get them to replay in a local install of the new wayback [10:08] there are instructions in the readme on how to generate cdx files for them.. the cdx_writer tool is maintained by another engineer at this point, but hopefully the steps are clear enough that it's possible to create cdx index files and test your warcs. hoping to improve that process [10:08] to start, see if the deployment steps in the readme work, to at least replay the sample data [10:12] I'm planning to add a lot more documentation of the project in the upcoming days/weeks, so far it supports basic replay of a typical page.
there will be additional customizations involving javascript, and certain domain-specific sites. stay tuned! [10:14] but at this point, any testing feedback would be really helpful! [10:20] my fiber! she is back! [10:27] ikreymer: Sure thing - that sounds completely reasonable. I'll give it a spin :-) [10:35] I *love* when I upload at 14 MiB/s to archive.org from Europe [10:35] 652333333 452 [10:35] CAT, SRY [16:53] red alert [16:53] http://wallbase.cc/ [16:53] in the process of dying [16:53] warriors activate [16:53] cc yipdw and chfoo [16:53] (yipdw: every time I read your nick I mentally think "yip yip yip", heh) [16:55] "owner abandoned it, forums and new uploads have been shut off until owner returns, methinks owner won't return whatsoever" [16:56] good news: incrementally ordered IDs [16:57] bad news: appears javascript-heavy [17:01] http://wallbase.cc/forum already down? [17:02] appears so [17:04] http://wallbase.cc/wallpaper/3033009 [17:04] this one is the highest number [17:04] they go up to 3033009 [17:05] looks like i should try to save a copy of finnish gaming site www.peliplaneetta.net/ [17:05] and they seem to start from 1000000 [17:05] http://wallpapers.wallbase.cc/rozne/wallpaper-2903212.jpg [17:05] www.peliplaneetta.net/tietokonepelit/uutiset/15112/ILMOITUS-Peliplaneetta-sulkeutuu/ (if someone is interested, look at it with google translate) [17:06] no!! [17:06] i think it's just http://wallpapers.wallbase.cc/rozne/wallpaper-${id}.jpg for images [17:06] they don't start from 1000000 [17:06] godane: depends on the kind of images [17:06] who asked about aol? [17:06] someone did [17:06] those are the numbers for rozne images [17:06] other images have other urls [17:07] example: [17:07] http://wallpapers.wallbase.cc/high-resolution/wallpaper-2119864.jpg [17:07] ok [17:09] maybe i will do a dump of rozne images [17:10] would be great [17:10] going to start multiple crawls [17:10] I'll see if I get banned or not [17:15] going to start 30 simultaneous crawls [17:15] 100.000 wallpapers per crawl [17:16] testing http://wallbase.cc/wallpaper/3033000 [17:16] done [17:17] will upload the warc of http://wallbase.cc/wallpaper/3033000 now [17:17] so people can test it [17:18] done. [17:18] https://www.filepicker.io/api/file/BhHu8SK4SM2XwZWS3dGn [17:19] can someone please download and open the warc in the wayback machine? [17:19] (not on IA server since robots.txt are still blocking access) [17:19] please tell me how it turned out [17:24] wat [17:24] done? [17:24] already? [17:24] oh [17:24] for one ID [17:25] Lol done already [17:27] yes [17:27] a test for one ID [17:27] joepie91 can you test it?
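A minimal sketch of one way to check what a test WARC like the one above actually captured, assuming it has been downloaded locally (the filename wallbase-test.warc.gz is hypothetical). It uses the warcio Python library rather than the tools the channel had at the time; printing a timestamp, status, and URL per response record is also roughly what a CDX indexer emits when building the index files mentioned earlier for pywb.

```python
# Sketch: enumerate the captures in a local WARC file and print
# CDX-style fields (timestamp, status, URL) for each response record.
# Requires: pip install warcio
from warcio.archiveiterator import ArchiveIterator

def list_captures(warc_path):
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            url = record.rec_headers.get_header('WARC-Target-URI')
            timestamp = record.rec_headers.get_header('WARC-Date')
            status = record.http_headers.get_statuscode() if record.http_headers else '-'
            print(timestamp, status, url)

if __name__ == '__main__':
    # Hypothetical local copy of the test WARC linked above.
    list_captures('wallbase-test.warc.gz')
```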
[17:27] nyu: no not done yet [17:27] just one ID [17:27] a test [17:28] to see how it turns out with the javascript and stuff [17:28] running 2 crawls for first 200.000 urls now [17:28] arkiver: just a moment [17:28] I forgot where I put warcviewer [17:28] joepie91: thanks [17:28] (I'm a terrible archivist, heh) [17:28] haha [17:28] lol [17:28] found it [17:30] cool [17:30] I'll hear the result from you [17:30] it is crawled with heritrix [17:30] same crawler IA uses [17:30] and it can even unpack and find urls in swf files [17:30] heritrix 3.3.0 version of 2014-01-28 [17:31] arkiver: it contains several wallpapers [17:31] what was the original URL [17:31] it seems odd for one ID to contain multiple wallpapers [17:31] http://wallbase.cc/wallpaper/3033000 [17:31] I know [17:31] heritrix also downloads the urls that are linked to from that page [17:31] so let's say [17:32] 3033000 contains a link to 1234567 [17:32] arkiver: where do the others come from? [17:32] then it also downloads 1234567 [17:32] also, I should point out that I don't have a full-fledged wayback machine here [17:32] with the wallpaper from that page [17:32] so I can't really test beyond "does it have these and these files" [17:32] ah [17:32] ah well [17:32] I can see that too [17:32] but why is it only a few then [17:32] however [17:32] it should infinitely recurse [17:32] because of "related wallpapers" [17:32] stopping other projects of mine [17:33] and giving all power to wallbase.cc [17:33] no [17:33] as I said it downloads the pages next to that page too [17:33] which are linked to from that page [17:33] and it doesn't go further [17:33] and well [17:33] the test crawl of that one ID proves it [17:33] otherwise it wouldn't have finished [17:34] also I should note that my internet is shitty and laggy [17:35] brb [17:39] arkhive: oh, only one deep? [17:40] hmm [18:03] arkiver* [18:05] so [18:05] going totally fine [18:05] not banned or anything [18:06] that's good [18:06] if someone here can actually view a warc in a small wayback machine, please test this warc file: [18:06] https://www.filepicker.io/api/file/BhHu8SK4SM2XwZWS3dGn [18:06] and tell me how it looks [18:07] since there is a lot of javascript [18:11] arkiver: https://github.com/ArchiveTeam/warc-proxy <-- you can test it yourself [18:11] highly encouraged to do so, as well [18:11] I have windows [18:11] not sure if it will work there? [18:11] VirtualBox [18:11] ... [18:11] it should work in Windows, though [18:11] well [18:12] I really have zero experience with virtualbox or anything [18:12] I should start learning it [18:12] if you have a Python installation with all required libraries [18:12] but maybe someone else can test it for me right now [18:12] I will learn it yipdw, promised! :) [18:12] If you know how to set up a computer, running virtualbox/vmware player is for the most part like clockwork [18:13] I'd recommend learning it now; there's no real point in continuing a grab if what you're grabbing is grossly incomplete and/or unreadable [18:13] no [18:13] and warc-proxy is a very good tool to figure that out [18:13] I know it is readable [18:13] how do you know it's readable if you haven't tested it? [18:13] the only thing I want to know is how it is turning out because of all the javascript [18:13] well [18:14] remember that thing with jason? when my files didn't seem to work?
[18:14] there is, I suppose, a trivial definition of "readability" which means "your shit isn't corrupt" [18:14] but there is a higher standard that is not only possible but indeed is now feasible thanks to alard etc. [18:14] and I'm just saying "here's a tool that makes it possible, please use it" [18:15] back then they didn't show in the wayback machine because, as I later found out, the torrent I uploaded them with had spaces " " [18:15] I later uploaded some by hand [18:15] and those worked actually [18:15] totally fine [18:15] but I'll use that warc-proxy... ;) [18:16] in any case, I have looked at that WARC in warc-proxy [18:16] the pages look okay [18:17] but the full-size wallpapers do not appear to be in there [18:17] the zoom feature on the wallpapers does not seem to work [18:17] hmm [18:17] I think that's the javascript... [18:17] :( [18:18] you're also not fetching wallpapers.wallbase.cc/wallpapers/, it seems [18:18] oh, wait [18:18] there it is [18:18] ok [18:19] it appears that each page references a file at URL http://static.wallbase.cc/js/jquery-1.10.2.min.map [18:29] so [18:30] they are working apart from the zoom not working? [18:30] wish we had a way to do javascript well.... [18:31] if the appropriate files are fetched, it will work [18:31] the only issue here is that a file is missing [18:35] well [18:35] I don't think I can change that [18:35] you mean the http://static.wallbase.cc/js/jquery-1.10.2.min.map url right? [18:35] I think that url is dynamic and created by javascript [18:36] but I'll see if I can download those manually [18:36] well [18:36] create the links and download those [18:38] that URL actually 404s out when you try to access it [18:38] I suspect there's something else going wrong [18:38] please do investigate [18:39] hmm [18:39] maybe if I go deeper into the external urls [18:58] arkhive: .map is just a dev tools thing [18:58] ignore any .map files [18:58] it's not actually referenced in the page [19:06] ok [19:06] so that's not the problem [19:06] joepie91: do you think the zoom issue is a javascript thing? [19:07] arkhive: probably, but no idea [19:08] ideally compare the list of URLs in the warc with those in your browser [19:08] (except for .map) [19:08] and see if anything is missing [19:08] it's possible that one .js imports another .js [19:08] in that case a browser would get it, but heritrix wouldn't [19:12] joepie91: it's ARKIVER [19:12] lol sorry [19:12] arkhive comes first in completion [19:14] is there any indication how long wallbase will still be online? [19:16] arkiver: none [19:16] according to their own claims it will stay online [19:16] but that seems unlikely [19:16] hmm [19:16] will also use another way to do the website [19:16] which is probably faster [19:16] as long as it doesn't crash
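The "compare the list of URLs in the warc with those in your browser" advice above can be mechanized with a short script. A sketch under assumed filenames (wallbase-test.warc.gz for the WARC, browser-urls.txt for a one-URL-per-line list copied from the browser's network panel), again using the warcio library rather than anything used in-channel:

```python
# Sketch: diff the URLs captured in a WARC against the URLs a real
# browser requested for the same page, ignoring .map files as noted.
# Requires: pip install warcio
from warcio.archiveiterator import ArchiveIterator

def warc_urls(warc_path):
    urls = set()
    with open(warc_path, 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == 'response':
                urls.add(record.rec_headers.get_header('WARC-Target-URI'))
    return urls

def browser_urls(list_path):
    # One URL per line, e.g. copied from the browser's network panel.
    with open(list_path) as f:
        return {line.strip() for line in f if line.strip()}

if __name__ == '__main__':
    # Hypothetical filenames.
    captured = warc_urls('wallbase-test.warc.gz')
    expected = browser_urls('browser-urls.txt')
    for url in sorted(expected - captured):
        if not url.endswith('.map'):  # .map is dev-tools-only, per above
            print('missing from WARC:', url)
```

Anything this prints (other than .map files) is a candidate for the kind of secondary .js import that a browser would fetch but heritrix would miss.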
[19:21] So http://don.na/ is going to be shut down by yahoo... [19:31] don.na will be done in a few minutes [19:37] is it possible to do a wide crawl with heritrix? [19:37] does someone know how to do that? [19:40] wide crawl as in? [19:41] just the internet [19:41] like alexa is doing [19:41] when I get faster internet [19:41] it would be cool to do that I think [19:41] Howdy :) [19:42] hey Konata_ [19:42] crawl going fine [19:42] but [19:42] for some reason they pause every few minutes for some minutes [19:42] but well [19:42] they are going [19:42] that's what counts [19:42] That's a good thing [19:42] Yeah, it's the fact that it's working that matters [19:42] doing around 1 percent every 1-2 hours [19:42] not extremely much [19:43] should be done in around 4-8 days [19:43] and the owners say it won't go offline [19:43] so we may have time enough [19:43] So they've been notified that we're doing this? [19:43] hehe [19:43] nope [19:43] they won't allow it [19:43] I think [19:44] Oh well lol [19:44] looking at their robots.txt [19:44] their robots.txt hides almost everything from web crawlers [19:44] just the internet [19:44] Wouldn't they notice it though? [19:44] lol [19:44] so why would they allow us then to crawl the website and ignore the robots.txt? [19:44] framing this one on my wall [19:44] "what are you archiving?" "just the internet" [19:44] haha [19:44] I mean like the alexa crawls [19:45] or the IA web wide crawls [19:45] They'll probably notice at one point or another [19:45] haha yep [19:45] then I just change my ip and start again [19:45] Konata_: yes, but that point is usually the point where there are so many warriors running that you can't just block a single IP [19:45] or maybe we have already saved the full site by then [19:45] but remember [19:45] wait, is this a warrior project right now? [19:45] the website won't be visible in the wayback machine till the real website is gone [19:45] Konata_: not yet [19:45] Ah okay [19:46] because of their robots.txt blocking it from being viewed in the wayback machine [19:47] Surely they must think that a sudden interest in just purely random wallpapers is suspicious [19:48] haha yes [19:48] think so [20:11] Does anybody know: If archive.org takes something down... can the uploader still access it? [20:11] or does the uploader get any warning? [20:12] well [20:12] getting a warning or not depends on the upload of course.... [20:14] erm, not "a warning": You shouldn't upload mp3s of that album that isn't out yet [20:15] but rather "warning" in the sense of: hey, we've got to take this thing down... you've got 24 hours [20:27] well, archive.org is not your personal storage you know [20:27] under DMCA you can oppose and bring the downtaker to court ;) [20:28] Wait it isn't? :( *goes and removes 2tb of files* [20:35] uploader doesn't get a warning and you can't access it anymore [20:38] That all makes sense... thanks guys [20:41] you just get a notice that it was taken down [20:46] For some definitions of "Taken Down" depending on the material. [20:47] they do retain all the files still, it just can't be accessed from the outside [21:20] going to bed now [21:27] before other people are going to do wallbase.cc [21:27] please let me do it [21:27] I just want to do that to have my first really big website done [21:32] ok [21:32] i'm still going to put up the first 100k ids [21:32] only cause i'm half way there [21:33] note it's only an image grab based on the fixed wallpapers.wallbase.cc/rozne/wallpaper-$id.jpg urls [21:35] you wanna ping me a list of the rest godane ? [21:36] are they just 100,001 - 200,000 ?
[21:47] i'm brute forcing the grab [21:48] i'm going through every number between 1 and 100000 [22:01] 4 notifications that my name has been mentioned on IRC! omg.. oh wait.. it's arkiver they wanted. not me [22:02] oh btw [22:03] http://arstechnica.com/gadgets/2014/01/intel-closes-appup-its-pc-app-store-intel-had-a-pc-app-store/ [22:03] don't know if it has been mentioned. but there. [22:05] http://software.intel.com/sites/landingpage/intelappup/ [22:05] March 11th 2014 [22:08] ha. appup gave me a free meego intel tablet [22:08] it was a piece of shit [23:34] balrog: I asked about AOL- ping me when you're available
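For reference, the brute-force grab described above reduces to walking the fixed rozne URL pattern one ID at a time and skipping the gaps. A rough sketch, assuming the requests library, an invented output directory, and a politeness delay; it saves bare JPEGs rather than writing a WARC, so it illustrates the ID walk, not a replacement for the heritrix crawls:

```python
# Sketch: fetch wallpapers by walking the fixed rozne URL pattern,
# one ID at a time, skipping IDs that don't return 200.
# Requires: pip install requests
import os
import time
import requests

URL = 'http://wallpapers.wallbase.cc/rozne/wallpaper-{id}.jpg'
OUT_DIR = 'rozne'   # hypothetical output directory
DELAY = 0.5         # assumed politeness delay, in seconds

def grab(start=1, end=100000):
    os.makedirs(OUT_DIR, exist_ok=True)
    for wid in range(start, end + 1):
        try:
            r = requests.get(URL.format(id=wid), timeout=30)
        except requests.RequestException:
            continue  # skip network errors; a real grab would retry/log
        if r.status_code == 200:
            with open(os.path.join(OUT_DIR, 'wallpaper-%d.jpg' % wid), 'wb') as f:
                f.write(r.content)
        time.sleep(DELAY)

if __name__ == '__main__':
    grab()
```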