[00:00] *** fie_ has joined #archiveteam-bs
[00:02] *** fie has quit IRC (Read error: Operation timed out)
[00:08] *** tomwsmf-a has quit IRC (Read error: Operation timed out)
[00:16] *** w0rp has quit IRC (Read error: Operation timed out)
[00:20] *** fie_ has quit IRC (Quit: Leaving)
[00:21] *** fie has joined #archiveteam-bs
[00:21] *** w0rp has joined #archiveteam-bs
[00:31] *** Stiletto has joined #archiveteam-bs
[00:31] *** Fletcher_ has joined #archiveteam-bs
[00:50] http://fos.textfiles.com/ARCHIVETEAM/ has proper timestamps now
[00:52] *** FalconK has quit IRC (Ping timeout: 260 seconds)
[00:53] *** FalconK has joined #archiveteam-bs
[01:00] *** JesseW has joined #archiveteam-bs
[02:01] *** Fletcher_ has quit IRC (Ping timeout: 250 seconds)
[02:01] *** koon has quit IRC (Ping timeout: 250 seconds)
[02:14] SketchCow: the first 3 links on that page have tarballs that are 0 bytes
[02:23] *** Coderjoe_ has joined #archiveteam-bs
[02:25] *** Coderjoe has quit IRC (Read error: Operation timed out)
[02:28] Yes
[02:28] That's a thing I will fix soon.
[02:28] The system has always made zero-length tarballs.
[02:29] Now that I've centralized the upload script, I will put a check in.
[02:31] Line added. Thanks for the tip.
[02:39] i'm up to 2015-10-31 with kpfa
[02:40] so the weight of kpfa is mostly off my shoulders now
[02:40] and we don't have to make a bot to grab it either
[02:51] *** Fletcher_ has joined #archiveteam-bs
[02:51] *** koon has joined #archiveteam-bs
[02:55] *** fie_ has joined #archiveteam-bs
[02:56] *** bwn has quit IRC (Read error: Operation timed out)
[03:00] *** fie has quit IRC (Read error: Operation timed out)
[03:09] *** bsmith094 has quit IRC (Ping timeout: 190 seconds)
[03:10] *** Yoshimura has quit IRC (Ping timeout: 190 seconds)
[03:16] *** fie__ has joined #archiveteam-bs
[03:16] *** fie_ has quit IRC (Read error: Connection reset by peer)
[03:22] SketchCow: looks like Hard Knock Radio is encoded at 256kbps from at least 2015-10 on
[03:31] *** koon has quit IRC (Ping timeout: 250 seconds)
[03:32] *** Fletcher_ has quit IRC (Ping timeout: 250 seconds)
[03:43] Here's an interesting DOS game I enjoyed playing before: https://archive.org/details/SeaRogue (Underwater archeology & treasure hunting)
[04:04] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:13] *** Sk1d has joined #archiveteam-bs
[04:18] *** Yoshimura has joined #archiveteam-bs
[04:21] *** koon has joined #archiveteam-bs
[04:22] *** Fletcher_ has joined #archiveteam-bs
[04:34] *** dashcloud has quit IRC (Read error: Operation timed out)
[04:37] *** dashcloud has joined #archiveteam-bs
[05:00] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[05:06] *** Sk1d has joined #archiveteam-bs
[05:06] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[05:22] *** wyatt8740 has joined #archiveteam-bs
[05:30] *** decay_ has quit IRC (Read error: Operation timed out)
[05:33] *** SN4T14 has quit IRC (Read error: Operation timed out)
[05:39] *** decay has joined #archiveteam-bs
[05:43] *** SN4T14 has joined #archiveteam-bs
[05:47] *** metalcamp has joined #archiveteam-bs
[05:57] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[06:01] *** bwn has joined #archiveteam-bs
[06:03] SketchCow: so i'm uploading more DVDs to your FOS server
[06:04] anyone here who worked on this? (doing an incremental grab, starting where you guys left off. Had to write things that were missing, like the discovery stuff) https://github.com/ArchiveTeam/furaffinity-grab
[06:22] *** GLaDOS has quit IRC (Quit: Oh crap, I died.)
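The zero-length-tarball check SketchCow mentions at [02:29] isn't shown in the log. A minimal sketch of that kind of guard in an upload script, written in Python with hypothetical paths and filenames (the real FOS script is not public here), might look like:

    #!/usr/bin/env python3
    """Sketch of a pre-upload guard against zero-length tarballs.
    The staging directory and file pattern below are made up."""
    import sys
    from pathlib import Path

    PACK_DIR = Path("/archive/packs")  # hypothetical staging directory

    def uploadable_tarballs(pack_dir: Path):
        """Yield tarballs that are safe to upload, warning on empty ones."""
        for tarball in sorted(pack_dir.glob("*.tar")):
            if tarball.stat().st_size == 0:
                print(f"SKIPPING zero-length tarball: {tarball}", file=sys.stderr)
                continue
            yield tarball

    if __name__ == "__main__":
        for t in uploadable_tarballs(PACK_DIR):
            print(f"would upload {t}")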
[06:27] *** GLaDOS has joined #archiveteam-bs
[06:32] *** bsmith094 has joined #archiveteam-bs
[06:58] Great
[07:00] *** vitzli has joined #archiveteam-bs
[07:01] I saw that article about web scraping, I think I can do a better flowchart: http://i.imgur.com/jB7qlvr.png
[07:27] *** Honno has quit IRC (Read error: Operation timed out)
[07:32] *** schbirid has joined #archiveteam-bs
[07:33] *** metalcamp has joined #archiveteam-bs
[07:39] vitzli: nice
[07:39] missing some 'e's in the IRC channel names, though :P
[07:40] it's not a final version, I can change anything
[07:41] neat
[07:42] too much copy-paste :(
[07:49] vitzli: I thought your flowchart was going to be sarcastic -- but it's actually useful.
[07:51] It's a little bit sarcastic, and I could add more
[07:52] vitzli: "Check the Deathwatch page" -> "Check (and, if missing, add it to) the Deathwatch page"
[07:52] both? Both is good.
[07:53] BTW, I'm starting a new IA census, this time with an up-to-date list of about 19 million identifiers.
[07:53] Currently done about a million of them.
[07:54] Are you going to keep sha1s?
[07:54] Does the site refuse to provide a useful API? -> Yes -> Scrape it.
[07:54] yes, I'm keeping sha1s this time.
[07:54] eh, scrape it whether it provides an API or not.
[07:55] WARCs are *always* good to have, if at all possible.
[07:55] I can keep raw results
[07:55] JesseW: well, by "scrape" I think they're talking about extracting structured data from unstructured documents
[07:55] ah, fair point
[07:55] obviously archiving raw data is useful no matter what
[07:56] But I'd say, grab WARCs *then* think about how to extract structure from them. :-)
[07:56] agreed
[07:56] I'd say the only time *not* to make WARCs is if the site is too large/dynamic, and even in that case, grab WARCs of a representative sample.
[07:57] I hope to upload the hash archive to the IA this week
[07:57] I decided to go ahead and go the whole way, and there's now one directory for FOS pipelines, and therefore one script that runs that says "push out all the new FOS packs".
[07:57] Also, if the site is too aggressive with, say, CAPTCHAs or other "we insist that you have a human sitting in front of a browser before our server will talk to you" measures -- in that case, use something like webrecorder.io
[07:57] And that will add to the log, etc.
[07:58] great!
[07:58] vitzli: which hash archive?
[07:58] the one I did
[07:58] ah, cool
[07:58] (md5, sha1, sha256)
[07:58] plus all the other stuff it calculated
[08:00] do add links to it from http://archiveteam.org/index.php?title=Internet_Archive_Census once you upload it
[08:00] JesseW: actually, it would be cool if https://morph.io/ (spiritual successor to ScraperWiki) proxied everything through something that archived WARCs by default, rather than just throwing away the sources after extracting the data from them
[08:01] it's not census or IA-related stuff, just a bunch of hashes I could get my hands on
[08:01] some of it got to the IA or was downloaded from the IA
[08:01] davidar: yes, that would be much better
[08:02] vitzli: in that case, pass it along to Ben Trask https://github.com/btrask
[08:02] he likes hashes :-)
[08:02] cool, will do, thank you
[08:03] * davidar used to be involved in scraping stuff many years ago, but not so much anymore
[08:07] JesseW: do you want some help running through some of those?
[08:10] bwn: some of what?
[08:12] sorry, you mentioned you were starting a census
[08:12] Ah, those.
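As an illustration of the "grab WARCs first, extract structure later" advice above: one way to do this in Python is the warcio library (pip install warcio requests). This is only a sketch, not the tooling ArchiveTeam itself uses for production grabs (typically wget-lua/wpull), and the URL is a placeholder:

    """Capture HTTP traffic to a WARC, then parse the WARC offline."""
    from warcio.capture_http import capture_http
    from warcio.archiveiterator import ArchiveIterator
    import requests  # note: requests must be imported after capture_http

    # Everything fetched inside this block is written to the WARC file,
    # request and response records included.
    with capture_http("sample.warc.gz"):
        requests.get("http://example.com/")

    # Later: extract structure from the raw records instead of re-scraping.
    with open("sample.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type == "response":
                print(record.rec_headers.get_header("WARC-Target-URI"))

The point of the design is the one JesseW makes: the WARC preserves the raw responses, so any later extraction mistake can be corrected without hitting the (possibly dead) site again.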
[08:14] If you'd like to work on the 600,000 identifiers that were in the last list, but aren't in the current one, that'd be welcome.
[08:14] I'll need to walk you through getting set up, though.
[08:14] And I should probably head to sleep sooner rather than later...
[08:15] absolutely, whenever you get some time
[08:15] JesseW, is Ben Trask on IRC?
[08:16] vitzli: IDK -- I haven't seen him on here, no.
[08:16] ok, will email him
[08:17] bwn: Do you have a unix system available?
[08:18] Are there stairs in your house
[08:18] yes and yes :)
[08:19] You'll need to install GNU parallel, jq and iamine.
[08:19] the first two should be available from your distro (and jq is a standalone binary, so that's easy)
[08:21] iamine you can get from https://archive.org/download/iamine-pex (I used ia-mine-0.5-py3.4.pex because I'm on py3.4)
[08:32] oh, you'll also need pv for progress display
[08:32] technically optional, but neat
[08:37] It's made it to the C's!
[08:44] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[09:34] *** bwn has quit IRC (Ping timeout: 492 seconds)
[09:34] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[09:39] *** metalcamp has joined #archiveteam-bs
[09:47] *** bwn has joined #archiveteam-bs
[10:01] *** dashcloud has quit IRC (Read error: Operation timed out)
[10:05] *** dashcloud has joined #archiveteam-bs
[10:27] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[10:54] *** dashcloud has quit IRC (Read error: Operation timed out)
[10:58] *** dashcloud has joined #archiveteam-bs
[11:17] *** metalcamp has joined #archiveteam-bs
[12:03] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[12:33] *** Honno has joined #archiveteam-bs
[13:42] *** VADemon has joined #archiveteam-bs
[14:13] *** Start has quit IRC (Quit: Disconnected.)
[14:29] http://fos.textfiles.com/ARCHIVETEAM/ is now run automatically; all the uploads happen 24/7 without my intervention, so whenever the set amount of items is hit, they get packed up and uploaded into their collections. Automatically.
[14:34] Hey, I've got a secondary internal HDD now, but it takes like 3 minutes to load a page using wayback (with pywb). I'm sure that I'm using the necessary index files and have the directories sorted the correct way, since I already checked that out before regarding a different issue
[14:34] What are the identifiers? What are they for? Can I help with anything?
[14:37] Also a second question: how do I work (warrior) on the FTP project?
[14:54] *** Start has joined #archiveteam-bs
[15:06] *** metalcamp has joined #archiveteam-bs
[15:14] Yoshimura: As far as I know, the FTP project is NOT for warriors
[15:14] Oh. Why not? It feels like a waste of time; it merely does 5GB a day.
[15:15] Which is a lot with web pages, but not if half the connections stall.
[15:16] Q about Archive.org: How do I upload warcs or stuff onto it myself? "Please contribute books, audio, and video files that you have the right to share." ... Does that mean I cannot archive valuable stuff that is publicly on the internet and not used to make money?
[15:18] I tried to switch to Google Code, but without a complete shutdown it does not switch, and a single task has been running for 15 hours. I just do not get how regular people can help, if it's not with money, running an (often sluggish) warrior instance, or providing code to something they have no idea how it works, and which maybe does not even have a public repository.
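JesseW's census setup pipes the identifier list through iamine, jq, GNU parallel, and pv; the exact commands aren't in the log. A rough Python equivalent of the core step (fetch each item's metadata from the public archive.org metadata endpoint and record the sha1s) could look like the sketch below; the input and output filenames are made up, and a real run over millions of identifiers would want parallelism and retry handling:

    """Record (item, file, sha1) for each IA identifier in a list.
    Requires: pip install requests."""
    import json
    import requests

    METADATA_URL = "https://archive.org/metadata/{}"  # public IA endpoint

    def item_file_hashes(identifier: str):
        """Fetch an item's metadata and yield (filename, sha1) pairs."""
        resp = requests.get(METADATA_URL.format(identifier), timeout=60)
        resp.raise_for_status()
        for f in resp.json().get("files", []):
            if "sha1" in f:
                yield f["name"], f["sha1"]

    if __name__ == "__main__":
        with open("identifiers.txt") as ids, open("census.jsonl", "w") as out:
            for identifier in (line.strip() for line in ids):
                for name, sha1 in item_file_hashes(identifier):
                    out.write(json.dumps(
                        {"item": identifier, "file": name, "sha1": sha1}) + "\n")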
[15:19] http://www.gearthblog.com/blog/archives/2016/04/big-google-earth-database.html
[15:20] found that interesting
[15:21] tl;dr: 'bout 3TB
[15:22] *3PB
[15:26] Could we archive Google Earth/Maps?
[15:26] IDK if it would make sense.
[15:26] HCross: yes.
[15:27] HCross: I'm not sure Google would be happy with us ;) but yes, it would be a good idea
[15:27] If we want to, I'll create a project
[15:27] for personal use? There was a program that pulls the maps and stores them in local storage
[15:27] But I'm quite sure that's not going to happen anytime soon
[15:27] I was only asking "if" it was possible
[15:28] *** signius has quit IRC (Read error: Operation timed out)
[15:29] alfie, we already have our fair share of Google employees who don't like us :P
[15:29] HCross: it's only fair, i don't like them either :P
[15:29] A company has no problem defending its copyright; a person has to give up his personal information in order to defend himself. A company has no problem defending itself against copyright violations by blaming them on its users via the ToS. A person has no way out. ... I have no idea how I should proceed in finding out what I can or cannot do in terms of preserving knowledge that is currently publicly accessible for free, but might cease to be in the future. If anyone has any related links, that would be appreciated.
[15:30] DO NOT ARCHIVE GOOGLE MAPS
[15:30] DO NOT ARCHIVE GOOGLE MAPS
[15:31] Q: Is there a way to get into the FTP project or another, higher-bandwidth one?
[15:32] Btw, anything running on Parse Server might go out of business unless they migrate by January.
[15:33] technically it is possible; for example, SAS.Планета / SAS.Planet / SASPlanet downloads chunks from the kh.google servers - it can combine them into large .jpg maps; different resolutions are possible, from 1 (entire world) to 18 (human shadows are visible)
[15:34] vitzli: depends if by "archiving google maps" we mean the imagery or the maps data
[15:34] http://www.sasgis.org/ - can't find the English version of the website
[15:35] SketchCow: don't worry, we won't
[15:35] ^ was only being hypothetical
[15:35] :P
[15:35] I think he doesn't want you to archive google maps
[15:36] that's exactly what he said
[15:36] :p
[15:38] soo, archive bing maps then?
[15:38] * midas hides
[15:39] lol
[15:39] Yahoo maps - if they are a thing
[15:40] By the way, that music bootleg site CONTINUES to download.
[15:40] slow and steady wins the race, again
[15:41] I was going to say, as an alternative to archiving Google Maps or Bing Maps, maybe we could look at grabbing OSM's datasets, but they're already on IA, and relatively current at that.
[15:41] Granted, OSM doesn't have satellite imagery or anything like that
[15:42] *** signius has joined #archiveteam-bs
[15:44] Why is fotolog taking so long? Do you need more warriors, or is it just the rate limiting?
[15:45] They're website can't handle more load
[15:45] err
[15:45] their website*
[15:46] ah
[15:46] Rate limit, yes; it's nonstop service overload.
[15:47] arkiver: fotolog.com, right?
[15:47] yes
[15:49] the pictures we are grabbing, are we grabbing them via the cloudflare service or directly?
[15:49] just like they are on the page
[15:49] k
[15:55] *** jut has joined #archiveteam-bs
[16:06] *** Start has quit IRC (Quit: Disconnected.)
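On Yoshimura's earlier question about uploading WARCs to archive.org yourself: the official internetarchive Python library supports self-serve uploads (pip install internetarchive, then run `ia configure` once to store your S3-like keys). A sketch follows; the identifier, filename, and metadata values are placeholders, and which collection an item ends up in is governed by IA policy, not by the uploader:

    """Sketch of a self-serve upload to archive.org."""
    from internetarchive import upload

    responses = upload(
        "my-example-warc-20160418",       # hypothetical unique identifier
        files=["example-site.warc.gz"],   # hypothetical local file
        metadata={
            "title": "Example site grab (WARC)",
            "mediatype": "web",           # WARCs are typically 'web' items
            "description": "Self-made WARC of a publicly accessible site.",
        },
    )
    print([r.status_code for r in responses])  # expect 200s on success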
[16:08] *** VADemon has quit IRC (Quit: left4dead)
[16:23] *** JesseW has joined #archiveteam-bs
[16:25] *** jut has quit IRC (Ping timeout: 250 seconds)
[16:51] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[17:06] Yoshimura: URLTeam can always use people investigating new shorteners: see the URLTeam wiki page for details.
[17:07] *** dashcloud has quit IRC (Read error: Operation timed out)
[17:07] yes, urlteam is an endless sink of only semi-automatable labor
[17:11] *** dashcloud has joined #archiveteam-bs
[17:13] *** SimpBrain has joined #archiveteam-bs
[17:21] Yes, urlteam could use JesseW's "intense" attention.
[17:23] * Yoshimura will look into that
[17:24] Jesse should.
[17:33] SketchCow> DO NOT ARCHIVE GOOGLE MAPS <- i would LOVE it if someone did so for selected regions
[17:34] SketchCow: I already keep an eye on URLTeam. Were you talking about Yoshimura?
[17:34] i think it was a typo, yes
[17:35] hopefully.
[18:04] *** Start has joined #archiveteam-bs
[18:05] *** vitzli has quit IRC (Leaving)
[18:07] *** Start has quit IRC (Read error: Connection reset by peer)
[18:08] *** Start has joined #archiveteam-bs
[18:10] *** Start has quit IRC (Read error: Connection reset by peer)
[18:56] *** dashcloud has quit IRC (Read error: Operation timed out)
[19:00] *** dashcloud has joined #archiveteam-bs
[19:06] *** Start has joined #archiveteam-bs
[19:07] No, I mean you
[19:07] I want that thing singing
[19:10] :-P
[19:24] *** ohhdemgir has quit IRC (Remote host closed the connection)
[19:33] *** bwn has quit IRC (Ping timeout: 246 seconds)
[19:36] http://blog.dshr.org/2016/04/brewster-kahles-distributed-web-proposal.html
[19:38] If there is something more specific I can do in terms of urlteam, let me know.
[19:39] I tried finding an API for writing the settings. And my warrior for urlteam does not have enough work
[19:40] Yoshimura: come over to #urlteam and I'll try to clarify
[19:43] *** Start has quit IRC (Quit: Disconnected.)
[19:47] *** SimpBrai1 has joined #archiveteam-bs
[19:47] *** SimpBrai1 has quit IRC (Read error: Connection reset by peer)
[19:59] *** schbirid has quit IRC (Quit: Leaving)
[20:05] *** bwn has joined #archiveteam-bs
[20:05] *** dashcloud has quit IRC (Read error: Operation timed out)
[20:06] *** dashcloud has joined #archiveteam-bs
[20:15] i just wish we could have the warrior join more than one project at once without having to launch another VM
[20:15] like make the # of projects = the number of concurrent connections or whatever
[20:16] atrocity: I've wanted the same thing -- I just haven't gotten around to learning the warrior code (and setting up a test environment for it) enough to implement it.
[20:22] Btw, a better way to purge the data drive: set the regular disk to Writethrough and use a snapshot
[20:22] So you have a "Clean start" snapshot.
[20:23] atrocity: Make a VM and use multiple dockers :P
[20:24] I might work on the warrior code or something too, but everything takes time, learning how stuff works.
[20:50] *** metalcamp has quit IRC (Ping timeout: 244 seconds)
[20:50] *** toad2 has joined #archiveteam-bs
[20:51] *** toad1 has quit IRC (Read error: Operation timed out)
[20:58] yeah, i won't even spend the time, haha! just working on too many other projects atm that require a lot of time/effort
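For anyone taking up JesseW's suggestion at [17:06] to investigate new shorteners for URLTeam, the usual first step is to see how a service answers short codes: what status code it returns, where the Location header points, and what happens on unknown codes. A sketch against a deliberately unresolvable placeholder domain (not a real shortener) follows:

    """Probe a URL shortener's redirect behavior.
    Requires: pip install requests. The domain below is a placeholder."""
    import requests

    SHORTENER = "http://example-shortener.invalid/{}"  # hypothetical service

    def probe(code: str):
        """Request a short code without following redirects and report
        what the service does with it."""
        resp = requests.get(SHORTENER.format(code),
                            allow_redirects=False, timeout=30)
        return resp.status_code, resp.headers.get("Location")

    if __name__ == "__main__":
        for code in ["a", "aa", "zzzz", "doesnotexist42"]:
            status, location = probe(code)
            print(f"{code!r}: HTTP {status} -> {location}")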
[21:00] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[21:01] *** Yoshimura has joined #archiveteam-bs
[21:02] *** Honno has quit IRC (Read error: Operation timed out)
[21:13] so i got Mighty Morphin' Power Rangers Host TMNT 2 on Fox
[21:14] i also figured out that the date it aired was 1993-11-26
[21:14] Black Friday Night
[21:35] *** Yoshimura has quit IRC (http://www.kiwiirc.com/ - A hand crafted IRC client)
[21:39] lol
[22:07] *** Start has joined #archiveteam-bs
[22:41] *** BlueMaxim has joined #archiveteam-bs
[22:57] *** Yoshimura has joined #archiveteam-bs
[23:18] *** Yoshimura has quit IRC ()
[23:34] *** Jonimus has quit IRC (Read error: Operation timed out)
[23:37] *** Yoshimura has joined #archiveteam-bs
[23:46] *** Mayonaise has quit IRC (Read error: Operation timed out)
[23:48] *** SimpBrain has quit IRC (Ping timeout: 633 seconds)
[23:48] *** SimpBrain has joined #archiveteam-bs
[23:50] *** kvieta has quit IRC (Ping timeout: 633 seconds)
[23:50] *** kvieta has joined #archiveteam-bs
[23:56] *** Mayonaise has joined #archiveteam-bs