[03:42] Once I have a WARC/megawarc, how best to extract outbound URLs from it?? [03:43] (and internal links) [03:52] is there anything in the .cdx file of use for you? [03:52] use hanzo warc-tools to get the response bodies, then parse them with html5lib/lxml/beautifulsoup/whatever the pythonistas are using now [03:54] or if you want something super-terrible to extract indeed joepie91 [13:16] "fuck this, im getting drunk @ 600 dollar" [15:14] http://scr.terrywri.st/1385996531.png i have no idea what im doing [15:17] wot wot GLaDOS [15:18] the faces, oh god the faces [15:19] glad that was tiny, cuz I'm guessing ti was NSFW :) [15:19] http://scr.terrywri.st/yolo.png here you go, full size at 10mb [15:20] they all look sooo happy :) [15:21] 10MB...wtf [15:21] MY LORD THE FACES! [15:21] and this is why you never let me down a pepsi when im sleep deprived [15:21] faceswap ALL the people. [15:22] i'm no person that believes in god, but hell, this is horrible, i would almost start to pray if i knew how [15:22] lets play this fun game called spot the original! [15:23] the girl 4th of the right? :P [15:23] nope [15:23] 3rd of the right second row? [15:24] was it a girl? :P [15:24] nope, and yep [15:25] ah yes! i got her [15:25] it was the 4th from the left [15:25] nope. [15:25] hahaha [15:25] the one on the right [15:25] last one [15:25] yeah, its her [15:26] i missed her, im watching this picture on a potato as screen [15:26] ah, that'd be why [15:27] it's a 19"screen with a res of 1024x768 [15:27] so yeah, i can see 2 faces when it's full size [15:29] thank god for 100mbit@home [15:29] and now, to fix the sleep deprivation, i sleep. [15:29] o7 [15:31] good night! [17:01] if a website is blocked from being archived in the robots.txt [17:01] is it then still downloaded by the Wayback machine, but not shown? [17:01] or is it not downloaded at all [17:04] it's not downloaded at all while it is blocked via robots.txt [17:05] old versions are retained but not shown [17:06] hmm oke [17:06] I'm going to search for robots.txt blocked pages as well then [17:07] for archival for the IA [17:07] oke = ok* [18:03] arkiver: it would be interesting to grab robots.txt for every domain on the 'net and search for those that block IA or block all unknown bots [18:05] someone here was downloading all robots.txt [18:07] IA *does* grab robots.txt [18:11] <@DFJustin> someone here was downloading all robots.txt [18:11] you rang [18:12] only top 10000 alexa sites [18:23] can you easily filter for ones that block ia_archiver or * [18:25] sure, let's see [18:25] err, well, i cant [18:25] only grep [18:25] i never found a good parser so i never did anything with them [18:30] i am running: grep -ER -A 1 "(ia_archiver|User-agent: \*)" * [18:31] for "some" hits [18:34] https://pastee.org/88nu6 [18:45] 95 hyves.nl [20:01] I'm submitting a few of those to archivebot [20:02] so [20:02] if you see anything remotely interesting please do the same [20:02] do we have a list of blocked websites? [20:02] I got a few terabytes free here [20:02] so I can still downlad quite some websites [20:02] and then upload them [20:02] :) [20:02] not all of https://pastee.org/88nu6 are blocked but there's a lot [20:02] arkiver: do you have upstream? [20:03] ? [20:03] nope, what is it upstream? [20:03] how fast can you upload? [20:03] well [20:03] let's see [20:03] download speed: 7 - 8 Megabyte per second [20:04] upload speed: 700 - 800 Kilobyte per second [20:04] so I think that should be ok [20:04] buuut [20:04] what's upstream? [20:05] "In computer networking, upstream refers to the direction in which data can be transferred from the client to the server (uploading)." [20:05] ah nope [20:05] upstream == upload [20:05] how much RAM do you have? wondering if you could run an archivebot pipeline to do archivebot jobs [20:06] I am downloading everything with heritrix 3.1.1, then uplaoding it to the archive and then sending an email to jason to move the files to the wayback machine [20:06] my ram? [20:06] 4 gb right now [20:06] but [20:06] soon I'm going to buy a new computer, which will have around 16GB SDRAM [20:06] (like 70% sure I'm going to buy it) [20:06] also [20:06] you know you want 32GB ;) [20:07] I don't have my computer on 24/7 [20:07] heh [20:07] you got 32?? [20:07] O.o [20:07] I have 96GB in a box but my upstream is 160KB/s [20:07] ah [20:07] * ersi pokes the VM host machine with 256GB RAM [20:07] my ram is lower but upstream faster [20:07] :P [20:07] -.- [20:07] ok ok ok [20:07] why do you turn off your computer? [20:07] I turn off most of my shit as well [20:08] I now know that my ram isn't high guys... -.- [20:08] yep [20:08] at night [20:08] it's in my room [20:08] that machine isn't mine [20:08] and making noice [20:08] it's a machine at work [20:08] and it's irritating then... [20:08] yeah [20:08] as ersi says [20:08] My laptop got 8GB and my workstation got 4GB [20:08] do you have a closet? perfect place for a computer [20:08] -.- [20:08] I do have a closet "server" machine though ^_^ [20:08] not gonna place my pc in there [20:09] in my closet... [20:09] so oke [20:09] I'm going through that list right now and looking at the robots.txt [20:09] and then selecting the websites to download [20:09] closet blocks like 30dB [20:10] yeah well [20:10] nah [20:10] I'm happy like this [20:10] maybe some other time [20:10] so [20:10] which sites from the list are already downloaded? [20:10] or downloading [20:13] http://www.insideview.com/robots.txt [20:14] # bad crawlers [20:14] Disallow: / [20:14] User-agent: * [20:14] "bad crawlers" O.o :'( [20:14] :D [20:15] is there any value in keeping the daily 1m top sites zip from alexa? i want to clean up [20:15] I don't know [20:15] but did you create that list of websites? [20:15] Schbirid: How large is the data? [20:15] arkiver: alexa.com provides a list of 1m top sites [20:16] yes [20:16] but can we automatically check the robots.txt? [20:16] also [20:16] this site is also blocked: [20:16] http://svs.gsfc.nasa.gov/ [20:16] "Free download" from http://www.alexa.com/topsites -> http://s3.amazonaws.com/alexa-static/top-1m.csv.zip [20:16] Well, sure.. [20:16] many GB's of create visualisation videos [20:16] just blocked... :( [20:16] if gone everything is gone [20:17] y [20:17] ~10M per da [20:17] i got 8G here [20:17] ~10MB/day? [20:18] 8GB of robots.txt's? [20:18] nah, 8G of the 1m file [20:18] 3G of robots files :D [20:19] ah [20:19] 365 7z files in one item sound idiotic or ok? i want to dump them to IA [20:19] and are they already looked at for if IA is blocked? [20:19] no [20:19] dumb daily downloading [20:20] https://github.com/ArchiveTeam/robots-relapse is some version, not sure what exactly that one does [20:21] already downloaded 2.5 GB of www.webmonkey.com/ [20:22] hmm [20:22] maybe it would be helpful to create a list of websites people from the Archiveteam are currently downloading at home? [20:22] maybe in the wiki? [20:22] and that we regularly update it? [20:26] better to just get everything to IA [20:27] I mean that we have more organised list of what is done and what still needs to be done? [20:28] and that we take some website, put our name behind them if we are working on them [20:28] know what I mean [20:28] ? [20:29] once you're grabbing many domains per day, I don't think you'll be motivated to keep it in sync [20:30] hmm oke then [20:30] but some domains will be grabbed twice or more maybe... [20:30] well [20:30] so? :) [20:30] :P [20:30] disk is cheap [20:30] and everything is going to IA [20:30] so not on our disk [20:46] Redundancy is good for archiving. [21:11] is anyone else here usin heritrix? [21:11] I'm having a problem right now... :( [21:17] I always recommend the following: Write about the problem instead of asking to ask about asking to ask [21:17] I'm not running heritrix. What's the problem? [21:18] i think my bluray player may hate me [21:18] *bluray burner [21:18] I'm keep getting this error: [21:18] 2013-12-02T21:15:01.002Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.UnsatisfiedLinkError: Error looking up function 'link': Kan opgegeven procedure niet vinden. [21:18] i see it to burn at speed 4x [21:18] when I try to swtart a job from the checkpoint [21:18] and now its trying to burn at 10x [21:18] What does the Dutch error message mean? [21:19] Can't find given procedure [21:20] Hm. Has it worked before? [21:20] well [21:20] It suddenly worked on eitme [21:20] time [21:20] but then not [21:20] and before that time also not [21:21] if I try it one time and I get the before error and then try it a second it gives me this error: [21:21] 2013-12-02T21:20:48.918Z SEVERE Failed to start bean 'bdb'; nested exception is java.lang.IllegalStateException: com.sleepycat.je.EnvironmentFailureException: (JE 4.1.6) Environment must be closed, caused by: com.sleepycat.je.EnvironmentFailureException: Environment invalid because of previous exception: (JE 4.1.6) K:\Internet-Archive\heritrix-3.1.1\bin\.\jobs\test\state fetchTarget of 0x0/0xbf parent IN=2 IN [21:21] class=com.sleepycat.je.tree.BIN lastFullVersion=0x1/0x540 parent.getDirty()=false state=0 LOG_FILE_NOT_FOUND: Log file missing, log is likely invalid. Environment is invalid and must be closed. (in thread 'test launchthread') [21:21] and yeah [21:21] the problem is that a log file is missing [21:21] so I checked it [21:21] it is missing a 00000000.jdb file [21:21] now [21:22] I opened that folder and right before I click to start the job again 00000000.jdb is still there [21:22] but right after I click it 00000000.jdb dissapears and then I get the error [21:22] as if it is first deleting it and then trying to open it... [21:23] instead of first opening and then deleting it [21:53] so it looks like this disc is doing better then the last one [21:54] not saying everything is ok yet [21:55] the video is still like last time [21:55] but filesystem can be viewed [21:56] and the video does play [21:56] just fastforwarding is slower then normal [22:01] oh neat, DigitalOcean has a second datacenter in Amsterdam [22:18] so good news [22:18] turns out i missing type my burning script [22:18] http://fortvv2.capitex.se/beskrivning.aspx?guid=46PQ44OP65VBJM2B&typ=CMFastighet [22:19] want [22:19] so bad [22:19] it had --speed=4 instead of -speed=4 [22:19] (Old Swedish Military fortification/base with tunnels and everything) [23:17] SketchCow: at some point i will be uploading all the pdfs i got from ftp.qmags.com to my godaneinbox [23:17] that way we can make tons of collections for them [23:19] there are like magazines about cleaning rooms [23:19] for making computer chips [23:20] this doesn't take of the table anyone want into put a full ftp tar of it up [23:33] i copied a html file of the ftp root index and made a list of pdf files to grab [23:33] this way i don't have download every exe and sea.hqx file