[00:40] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
[00:58] *** AlbardinG is now known as flashback
[01:03] *** flashback is now known as Flashback
[01:18] *** omglolbah has quit IRC (Read error: Operation timed out)
[01:23] *** omglolbah has joined #archiveteam-bs
[01:40] moufu - thanks so much! That worked! Also, no problem schbirid! I'm now running into an issue where wget isn't grabbing the photos and other content on the page, but just the HTML. I'm sure I'm missing something obvious, but I can't figure out what it is
[01:51] *** BartoCH has quit IRC (Ping timeout: 615 seconds)
[01:51] *** BartoCH has joined #archiveteam-bs
[01:52] *** BlueMaxim has joined #archiveteam-bs
[01:52] *** BlueMaxim has quit IRC (Read error: Connection reset by peer)
[01:53] *** BlueMaxim has joined #archiveteam-bs
[01:55] *** BlueMax has quit IRC (Ping timeout: 633 seconds)
[02:01] *** albardin has joined #archiveteam-bs
[02:10] godane: nice! how'd you do it? did you use a really fancy scanner? were you able to keep the magazines intact (i.e. without cutting the pages out)?
[02:11] last question: what's the source format? i want to download it in the best quality possible
[02:12] JPEG, definitely JPEG
[02:12] and .txt files
[02:14] W0rmhole: I know all of the file types
[02:15] -_-
[02:15] you mean .bmp?
[02:15] 16 colour
[02:15] Yes, that's what I meant
[02:16] and XBM files
[02:50] i use JPEG with 90% compression
[02:51] i do have a fancy scanner and the books are kept intact
[02:51] it's a Plustek OpticBook 4800
[02:52] what's funny is that i had the scanner given to me in early 2013 by Jason
[02:52] i think i only scanned 17 things before, cause i had to go to Windows to use it since there is no Linux driver
[02:54] best part is it was the best scanner of 2011 on a Linux site: linux.sys-con.com/node/2068241
[02:55] in case you can't view the website: https://web.archive.org/web/20131201213438/linux.sys-con.com/node/2068241
[02:57] a part of me always thought that was weird, cause if there are no Linux drivers then how is it the best scanner of the year on a Linux site :P
[03:20] SketchCow: btw i need those return labels so i can buy more tapes on eBay using my Patreon money
[03:20] send more than 12 labels also please
[03:21] cause i want to send ALL of the boxes and then some afterwards
[03:23] also, if possible, put return labels in future boxes so we don't have this problem again
[06:07] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[06:07] *** Mateon1 has joined #archiveteam-bs
[06:11] *** omglolbah has quit IRC (Ping timeout: 268 seconds)
[06:13] *** svchfoo1 has quit IRC (Ping timeout: 268 seconds)
[06:14] *** svchfoo1 has joined #archiveteam-bs
[06:15] *** svchfoo3 sets mode: +o svchfoo1
[06:15] *** omglolbah has joined #archiveteam-bs
[06:46] *** Stilett0 has joined #archiveteam-bs
[07:02] *** svchfoo1 has quit IRC (Ping timeout: 268 seconds)
[07:04] *** kiskabak has quit IRC (se.hub efnet.portlane.se)
[07:04] *** Kaz has quit IRC (se.hub efnet.portlane.se)
[07:09] *** svchfoo1 has joined #archiveteam-bs
[07:09] *** kiskabak has joined #archiveteam-bs
[07:09] *** Kaz has joined #archiveteam-bs
[07:09] *** svchfoo3 sets mode: +o svchfoo1
[07:10] *** t2t2 has quit IRC (Read error: Operation timed out)
[07:11] *** t2t2 has joined #archiveteam-bs
[08:48] *** t2t2 has quit IRC (Quit: No Ping reply in 210 seconds.)
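For the wget question at 01:40: if that was a plain wget run (rather than the wget-lua pipeline that gets debugged later in this log), the usual cause is that wget does not fetch images, stylesheets and other page requisites unless told to. The sketch below is not the asker's actual command: the URL is a placeholder, and the command is wrapped in Lua's os.execute only so the examples added to this log stay in one language; the flags themselves are standard wget options.

    -- Hedged sketch: fetch a page plus the images/CSS/JS it references.
    -- The URL is hypothetical; the flags are standard wget options.
    os.execute(table.concat({
      "wget",
      "--page-requisites",    -- also download images, CSS and JS referenced by the page
      "--span-hosts",         -- allow requisites hosted on other domains (e.g. a CDN)
      "--warc-file=example",  -- optionally record everything fetched into example.warc.gz
      "https://example.com/somepage.html",
    }, " "))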
[08:49] *** BartoCH has quit IRC (Quit: WeeChat 2.2)
[08:50] *** BartoCH has joined #archiveteam-bs
[08:50] *** t2t2 has joined #archiveteam-bs
[08:50] *** Stilett0 is now known as Stiletto
[09:25] *** antomatic has joined #archiveteam-bs
[09:25] *** swebb sets mode: +o antomatic
[09:27] *** antomati_ has quit IRC (Read error: Operation timed out)
[10:35] *** wp494 has quit IRC (Ping timeout: 492 seconds)
[10:37] *** wp494 has joined #archiveteam-bs
[10:38] *** icedice has joined #archiveteam-bs
[11:11] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:47] *** icedice has quit IRC (Quit: Leaving)
[13:10] *** bitBaron has joined #archiveteam-bs
[13:51] *** Nicu has joined #archiveteam-bs
[13:53] How about attempting to archive Tripod.Lycos?
[13:55] Nicu: We *do* accept donations: https://opencollective.com/archiveteam
[13:56] Nicu: How would we discover hosted services?
[13:58] *** eientei95 has joined #archiveteam-bs
[13:58] manually, in case it's not yet ready for automatic work?
[13:59] Anyway, eientei95 is here from your previous discussion, continue
[14:00] so i think you could expand
[14:00] it's a great cause
[14:01] but there's not much info
[14:01] about what has been done
[14:01] or it's not readily available
[14:02] also, we understand how valuable the stored information is, but what could one do with it in the future?
[14:04] Unfortunately I can't find any reference to any sites stored on tripod.lycos through their robots.txt and sitemap
[14:04] Any site crawl we do ends up in the Wayback Machine
[14:06] Tripod was the homologue of GeoCities, right?
[14:06] doesn't it make sense to cover that too?
[14:06] that's where i'm coming from
[14:07] kiska: There's http://members.tripod.com/robots.txt . I'm not really sure what the relation between tripod.com and tripod.lycos.com is.
[14:08] Nicu: We usually only go after sites when they are shutting down, removing content, or at immediate risk. I agree though that it would be nice to grab Tripod in its entirety.
[14:09] I guess we could chuck a few at a time into #archivebot
[14:09] And we might as well grab Angelfire as well, since they're the same company (I think)
[14:10] Also, I went through the IRC logs. While Tripod has been brought up several times, I don't see anything regarding a systematic archival.
[14:10] Angelfire is being worked on over at #angelonfire .
[14:10] that's good. If I help to copy, how do I know that my work will be safe and useful in the future?
[14:11] is it stored in a data center
[14:11] It is stored at the Internet Archive.
[14:11] that will "work for humankind" in the future?
[14:11] do you have an agreement or is it an open kind of thing
[14:11] IA has their own DC in San Francisco.
[14:11] There are several people from IA in ArchiveTeam.
[14:12] ok
[14:12] (Well, I know of two, so not sure if that counts as "several".)
[14:13] We also have an independent project to mirror the most important content from IA in a distributed manner. Check out IA.BAK on our wiki. (I have no idea how active that project is nowadays though.)
[14:13] However, IA stores around 45 PB of unique data currently, so mirroring it all is expensive.
[14:14] I know kiskaJDC has ~900GB of a shard
[14:16] can i get a pass for the wiki?
[14:16] also, Nicu, we have been around a good while. If you want to archive something and a project is stalled or not even started, learn how our pipelines work and submit it to be archived as a warrior project
[14:16] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[14:16] that's the resource we are the most limited on
[14:17] programmers and engineers
[14:17] Nicu: What do you want to do on the wiki?
[14:17] expand the mindset for archiving the Internet
[14:17] How so?
[14:18] Let's discuss it
[14:18] i'm nostalgic for how creative it used to be
[14:18] homepages that expressed individuality
[14:18] web design that was inexplicably beautiful
[14:18] perhaps there might be a way to preserve this
[14:18] How do you see us expanding the mindset for archiving the internet?
[14:19] also the names for IRC channels that were beautiful (esp. on Undernet)
[14:19] perhaps archive the IRC logs
[14:19] Our mindset is already "archive ALL the things". I don't think there's much to expand there. :-)
[14:19] lol
[14:19] lol
[14:19] but you don't do anything in this direction
[14:19] how to categorize
[14:19] web design under ALL?
[14:19] JAA: how much per month do we shove into archivebot?
[14:19] 1-2TB?
[14:20] IRC operators tend to not like channels being idled in for logs
[14:20] Way more, jrwr
[14:20] More like 20-30+ TB some months
[14:20] Ya
[14:20] jrwr: We uploaded 23.7 TiB in August.
[14:20] 485 TiB in total since ArchiveBot was started.
[14:20] Nicu: even now: Job status: 95499 completed, 2931 aborted, 567 failed, 78 in progress, 23 pending
[14:20] how to access it?
[14:21] it all gets uploaded to the Wayback Machine
[14:21] and the Internet Archive as WARCs
[14:21] so normal people can access it with the Wayback Machine
[14:21] sounds like a dump and not info that could be instructive
[14:21] https://archive.org/details/archivebot if you want the raw data. https://web.archive.org/ if you want to browse it.
[14:22] Aug '18: 3.09 TiB / 2.57 TiB / 5.66 TiB
[14:22] Well, that's the thing: how do you present the billions of webpages we have saved?
[14:22] e.g. https://web.archive.org/web/20180906155341/https://oldforums.eveonline.com/ - ArchiveBot did this about 2 weeks ago. And you can see it by clicking on "About this capture"
[14:22] An EVE nerd! :)
[14:22] Nicu: Feel free to download the CDXes from the ArchiveBot collection and build a nice interface for it. Hint: it'll be huge.
[14:22] Ya, the Wayback does allow for searching
[14:23] Yeah, but only page titles unfortunately.
[14:23] but browsing is a little harder, like the old webrings (that still work!)
[14:23] jrwr: xD And I think that upload was due to CCP being bought out by Pearl Abyss
[14:23] lol
[14:23] there are dozens of us, DOZENS
[14:23] So I chucked it into the bot, it landed on my SSD pipeline, and therefore it ran out of space
[14:24] Nicu: that's the hardest thing, there is a metric ton of data, stored in open formats, waiting for someone to do something with it
[14:24] I'm still on the tutorial
[14:24] We are here to make sure that data even exists
[14:24] Archive now, make pretty later
[14:24] eientei95: Play with others, join Pandemic Horde or any of the other newbie alliances
[14:24] you will love EVE then
[14:25] If we make pretty now, archiving will be held back
[14:25] Correct
[14:25] I played for just about 8 years I think
[14:25] jrwr: Will do, going to get back to actually playing games
[14:25] Take the EVE discussion to -ot please.
[14:25] and we are open to any project/website being archived; we have ArchiveBot for the smaller jobs
[14:25] But yes, making pretty now would make archival very slow
[14:25] and anyone can make a pipeline / wget-lua code to archive a site when required
[14:26] Especially sites that use a ton of JavaScript
[14:26] Nicu: any thoughts so far?
[14:26] i.e. modern websites
[14:26] i'm thinking of my 2TB of free space
[14:26] and that they can't be useful
[14:26] since it's stored in IA?
[14:26] "can't be useful"
[14:26] [02:25:56] and anyone can make a pipeline / wget-lua code to archive a site when required
[14:26] or maybe use that IA.BAK thing?
[14:27] Run a warrior, run a dozen of them; docker is great for this
[14:27] eientei95: Warrior instances don't use much disk space and normally don't keep all the data anyway.
[14:27] your part helps with archiving sites when the need arises
[14:27] i feel like i am losing the motivation since it's like grabbing whatever you can
[14:27] Can you program, Nicu?
[14:28] Yep, that. And you can join IA.BAK also if you want.
[14:28] JAA: Right, I've had a few instances where the Warrior failed due to lack of disk space
[14:28] IA.BAK is pretty good
[14:28] always need more disk space for that
[14:28] the urge is now, for all the good things social networks and Telegram destroy like hurricanes
[14:28] eientei95: Oh, that can certainly happen. But it doesn't need 2 TB of disk space.
[14:28] and ArchiveTeam "just shoves" :-D
[14:29] What would you suggest we do? Ignore the sites that are shutting down all the time, and let their data be lost forever?
[14:29] i am not suggesting, just saying this is too generalistic
[14:29] am i wrong?
[14:30] It's all we can do, there is so much to save
[14:30] WE are all but a drop in the ocean
[14:30] ^^
[14:30] doesn't it matter WHAT we save
[14:30] We save what we can. As much as we can
[14:30] i don't want to interrupt the good work though
[14:30] just putting it up for debate
[14:30] Start up a pipeline, shove sites you want archived into it
[14:31] Also, anyone can upload to the Internet Archive
[14:31] Not anymore IIRC
[14:31] ^
[14:31] I think only some stuff goes into the Wayback
[14:31] ya
[14:31] Anyone can upload to IA. But it only goes into the WBM if you're whitelisted.
[14:31] That doesn't mean that uploads that don't go into the WBM aren't useful.
[14:31] but if you make good WARCs (built into wget), save anything and everything you can, and add good metadata for people to find it
[14:32] breaking my dreams of seeing the Internet as not just a dump like it is now :-D
[14:32] the Wayback Machine is the best method to relive that old internet
[14:32] we are doing everything we can to give it as much data as we can
[14:34] If you want to archive websites we are not covering, go do it then; all the knowledge we have is public: our wiki, our GitHub
[14:35] i know the logic tells you it's correct to run this machine of endless copying of who knows what, but the heart and soul tell you I'm right, though you don't want to take this into consideration
[14:35] *** dxrt- has joined #archiveteam-bs
[14:35] *** dxrt has quit IRC (Ping timeout: 252 seconds)
[14:35] Ok, you are not making any sense, Nicu
[14:36] Right now, what do you want us to do?
[14:36] Dumb it down for me
[14:37] go to the dark side
[14:37] *** hook54321 has quit IRC (Ping timeout: 252 seconds)
[14:37] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[14:38] That's not even remotely helpful.
[14:38] I'm ignoring you now
[14:38] Completely off-topic from the discussion going on, but does anyone know why my WGetDownload isn't grabbing images?
[14:39] Content gets lost all the time because websites shut down. We try to save what we can. And you're saying this isn't useful?
[14:39] Also, it's not ours to decide which information should be preserved. We don't know what will be useful for future historians. So "archive ALL the things".
[14:40] https://github.com/adinbied/angelfire-grab/blob/master/pipeline.py#L162 is my WGetArgs & for some reason it's only grabbing the HTML & not all of the resources (gifs, pngs, jpgs, embedded content)
[14:40] https://www.archiveteam.org/images/c/ce/Archive-all-the-things-thumb.jpg
[14:41] *** Nicu has quit IRC ()
[14:41] adinbied: The problem is your download_child_p hook. It always returns false, meaning everything should be skipped.
[14:42] I think the initial URLs passed on the command line are exempt from this filtering. And so that's all it grabs.
[14:42] Ah, derp. I really need to learn more Lua - that makes sense. Thanks!
[14:48] Hmmm... getting rid of that still doesn't seem to work...
[14:49] *** pikhq has quit IRC (se.hub irc.underworld.no)
[14:49] *** kiska has quit IRC (se.hub irc.underworld.no)
[14:49] *** Flashfire has quit IRC (se.hub irc.underworld.no)
[14:49] *** w0rmhole has quit IRC (se.hub irc.underworld.no)
[14:50] *** pikhq_ has joined #archiveteam-bs
[14:52] *** kiska has joined #archiveteam-bs
[14:52] *** hook54321 has joined #archiveteam-bs
[14:52] *** w0rmhole has joined #archiveteam-bs
[14:54] *** i0npulse has joined #archiveteam-bs
[14:54] *** Flashfire has joined #archiveteam-bs
[15:08] JAA: so I had an interesting idea
[15:08] IPFS content mirroring to the IA: cache all the content we can (it's easy to discover random content) and upload it to the IA, or have an IA box pin content on IPFS to save it
[15:10] Yeah, the latter would probably be the easiest; just get IA to pin the content and it'll live forever.
[15:10] https://github.com/ipfs-search/ipfs-search
[15:10] looks like you can sniff the DHT traffic to find content
[15:11] since content that doesn't get pinned will get purged after some time
[15:11] or isn't accessed
[15:22] Is there any estimate of how big IPFS is?
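Regarding JAA's diagnosis at 14:41: in wget-lua grab scripts like the angelfire-grab pipeline linked above, download_child_p is called for every URL wget discovers and decides whether it gets fetched, so a hook that always returns false skips everything except the start URLs. The sketch below is not the actual angelfire-grab code: the hook signature follows the convention used in typical ArchiveTeam wget-lua scripts, and the angelfire.com scope check is a placeholder assumption.

    -- Hedged sketch of a download_child_p hook that stops rejecting everything.
    -- Signature as conventionally used in ArchiveTeam wget-lua grab scripts.
    wget.callbacks.download_child_p = function(urlpos, parent, depth,
                                               start_url_parsed, iri, verdict, reason)
      local url = urlpos["url"]["url"]

      -- Placeholder scope check: stay on the site being grabbed.
      if not string.match(url, "^https?://[^/]*angelfire%.com/") then
        return false
      end

      -- Always take obvious page requisites (images etc.).
      if string.match(url, "%.[jJ][pP][eE]?[gG]$")
        or string.match(url, "%.[pP][nN][gG]$")
        or string.match(url, "%.[gG][iI][fF]$") then
        return true
      end

      -- Otherwise defer to wget's own decision instead of a blanket `return false`.
      return verdict
    end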
[15:27] ~Lots~
[15:27] https://github.com/victorbjelkholm/ipfscrape
[15:27] JAA: interesting project
[15:27] saves webpages and stores them in IPFS
[15:32] Interesting indeed.
[15:38] like right now
[15:39] i'm shoving my entire dataset (about 20k files) into IPFS
[15:39] so my users can use it, since they like having the entire dataset at times
[15:47] *** Jon has quit IRC (Read error: Operation timed out)
[15:50] *** jmtd has joined #archiveteam-bs
[16:22] *** icedice has joined #archiveteam-bs
[16:25] wonder who we poke to do something like that :)
[16:25] arkiver, I'm sure you are insanely busy (I know life gets in the way) - whenever you can, would you be able to get the Angelfire project set up (i.e. getting the tracker set up and the GitHub repo initialized), look over the Quizlet target, and give the OK to proceed?
[16:45] that joker "hello_" /msg'd me and demanded pictures
[16:45] buddy, i grew up online and my girlfriend does porn. this isn't my first rodeo.
[16:52] OK, so it seems my issue is in the Lua - how do I specify "if the string matches *.jpg, *.png, or *.gif, then add it to the URL queue"?
[16:54] In the wget callbacks get_urls function I need to specify, in broad general terms, that if an image is found in the HTML of the page, it should be added to the grab
[16:56] *** Dimtree has joined #archiveteam-bs
[17:28] *** Dimtree has quit IRC (Peace)
[17:44] *** Dimtree has joined #archiveteam-bs
[17:52] *** Dimtree has quit IRC (Quit: Peace)
[17:54] *** icedice has quit IRC (Quit: Leaving)
[17:57] *** Dimtree has joined #archiveteam-bs
[18:15] *** jmtd has quit IRC (Ping timeout: 252 seconds)
[18:15] *** i0npulse has quit IRC (Ping timeout: 252 seconds)
[18:15] *** w0rmhole has quit IRC (Ping timeout: 252 seconds)
[18:16] *** Jon- has joined #archiveteam-bs
[18:16] *** Flashfire has quit IRC (Ping timeout: 252 seconds)
[18:16] *** hook54321 has quit IRC (Ping timeout: 252 seconds)
[18:18] *** kiska has quit IRC (Ping timeout: 252 seconds)
[18:43] *** i0npulse has joined #archiveteam-bs
[18:46] *** hook54321 has joined #archiveteam-bs
[20:39] *** Lord_Nigh has quit IRC (Quit: ZNC - http://znc.in)
[20:41] *** Lord_Nigh has joined #archiveteam-bs
[21:09] *** ColdIce has joined #archiveteam-bs
[21:13] *** bitBaron has quit IRC (Read error: Connection reset by peer)
[21:14] *** bitBaron has joined #archiveteam-bs
[21:26] *** bitBaron has quit IRC (My computer has gone to sleep. 😴😪ZZZzzz…)
[21:27] *** bitBaron has joined #archiveteam-bs
[21:35] *** Flashfire has joined #archiveteam-bs
[23:50] *** bitBaron has quit IRC (Quit: My computer has gone to sleep. 😴😪ZZZzzz…)
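On adinbied's question at 16:52-16:54: in wget-lua, the get_urls callback runs after each document is downloaded and can return extra URLs to queue. Below is a minimal sketch of matching image extensions in the saved HTML; it is not the project's actual code. The callback signature and the { url = ... } return format follow the convention in typical ArchiveTeam grab scripts, read_file is a small helper defined here for illustration, and relative URLs would still need to be resolved against the page URL before queueing.

    -- Hedged sketch of a get_urls hook that queues image URLs found in the page HTML.

    local function read_file(path)
      -- Read the locally saved document that wget just wrote.
      local f = io.open(path, "rb")
      if not f then return "" end
      local data = f:read("*all")
      f:close()
      return data
    end

    wget.callbacks.get_urls = function(file, url, is_css, iri)
      local urls = {}
      local html = read_file(file)

      -- Pull src="..." and href="..." values out of the HTML and queue the ones
      -- that end in an image extension. Only absolute URLs are queued here; a real
      -- script would resolve relative ones against `url` first.
      for _, pattern in ipairs({ '[sS][rR][cC]%s*=%s*"([^"]+)"',
                                 '[hH][rR][eE][fF]%s*=%s*"([^"]+)"' }) do
        for found in string.gmatch(html, pattern) do
          if string.match(found, "^https?://")
            and (string.match(found, "%.[jJ][pP][eE]?[gG]$")
              or string.match(found, "%.[pP][nN][gG]$")
              or string.match(found, "%.[gG][iI][fF]$")) then
            table.insert(urls, { url = found })
          end
        end
      end

      return urls
    end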