[00:10] *** Sokar has quit IRC (Ping timeout: 258 seconds)
[00:11] *** BlueMax has joined #archiveteam-bs
[00:26] *** Sokar has joined #archiveteam-bs
[00:39] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
[01:05] "9th Circuit holds that scraping a public website likely does not violate the CFAA, even after website owner prohibits with a cease-and-desist letter; language strongly suggests CFAA only applies to bypassing authentication."
[01:05] https://twitter.com/OrinKerr/status/1171116153948626944
[01:06] Yes, all the loot, all of it~
[01:25] FREE AARON SWARTZ
[01:35] JAA: wooooh awesome!
[01:36] we'll get everything now
[01:37] everyone's everything.
[02:30] *** Zebranky_ is now known as Zebranky
[03:31] *** qw3rty has joined #archiveteam-bs
[03:39] *** jognsmith has joined #archiveteam-bs
[03:40] *** qw3rty2 has quit IRC (Ping timeout: 745 seconds)
[03:44] *** odemgi_ has joined #archiveteam-bs
[03:45] *** odemg has quit IRC (Read error: Operation timed out)
[03:48] *** odemgi has quit IRC (Read error: Operation timed out)
[04:00] *** odemg has joined #archiveteam-bs
[04:39] *** Quirk8 has quit IRC (END OF LINE)
[04:41] *** Quirk8 has joined #archiveteam-bs
[04:53] *** tuluu has quit IRC (Quit: tuluu)
[04:56] *** tuluu has joined #archiveteam-bs
[05:03] *** larryv has quit IRC (Quit: larryv)
[06:04] *** killsushi has quit IRC (Ping timeout: 255 seconds)
[06:13] *** killsushi has joined #archiveteam-bs
[07:59] SketchCow: can you pull those items with warcs out of open source and put them into a separate collection + make sure they are indexed into wbm?
[07:59] https://archive.org/search.php?query=archiveteam_sonysketchimg_
[08:41] *** killsushi has quit IRC (Quit: Leaving)
[08:52] SketchCow: just to let you know the new SD Times are not in the SD Times Collection you made years ago : https://archive.org/details/sdtimes
[08:53] example : https://archive.org/details/sdtimes287
[08:57] *** deevious has quit IRC (Quit: deevious)
[09:06] *** godane has quit IRC (Leaving.)
[09:09] *** Raccoon has quit IRC (Remote host closed the connection)
[10:05] *** godane has joined #archiveteam-bs
[10:35] *** deevious has joined #archiveteam-bs
[11:18] *** VerifiedJ has joined #archiveteam-bs
[11:28] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:12] *** ave_ has joined #archiveteam-bs
[12:17] *** DogsRNice has joined #archiveteam-bs
[12:26] *** Dallas has quit IRC (Quit: The Lounge - https://thelounge.chat)
[12:26] *** Dallas has joined #archiveteam-bs
[12:28] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[12:50] *** qw3rty has joined #archiveteam-bs
[13:09] *** Raccoon has joined #archiveteam-bs
[14:21] *** kiska1 has quit IRC (Remote host closed the connection)
[14:21] *** Ryz has quit IRC (Remote host closed the connection)
[14:22] *** Ryz has joined #archiveteam-bs
[14:22] *** Fusl sets mode: +o Ryz
[14:22] *** kiska1 has joined #archiveteam-bs
[14:22] *** Fusl_ sets mode: +o Ryz
[14:22] *** Fusl__ sets mode: +o Ryz
[14:22] *** Fusl__ sets mode: +o kiska1
[14:22] *** svchfoo1 sets mode: +o kiska1
[14:22] *** Fusl sets mode: +o kiska1
[14:22] *** Fusl_ sets mode: +o kiska1
[14:27] *** ave_ has quit IRC (Quit: Connection closed for inactivity)
[14:55] *** Raccoon` has joined #archiveteam-bs
[14:58] *** Raccoon has quit IRC (Ping timeout: 360 seconds)
[14:59] *** Raccoon` is now known as Raccoon
[15:44] *** larryv has joined #archiveteam-bs
[15:46] Fusl_: Why are those not going into archiveteam_inbox
[15:48] I've gone ahead and moved them. They'll probably go into WBM but I don't know how they do things anymore.
[15:48] !a http://www.fyfz.cn/
[15:52] they should appoint you as president god emporer of WBM
[15:53] *** VADemon has joined #archiveteam-bs
[16:01] Oh I do not want that job.
[16:03] isn't that how you got into this
[16:07] I think SketchCow does computing history archiving, not 'ingest the whole web lol'
[16:07] I got pulled in to 'preserve software' and it turns out I did a bunch of other shit
[16:08] I thought his primary focus was soy sauce :)
[16:09] WBM should offer a proper search engine results, but under the guise of being a public record archive to bypass european law
[16:10] all search results are at least 30 seconds old, making it right and proper.
[16:11] Raccoon is going to donate the petabyte and the expertise to make it happen
[16:12] Awesome :)
[16:12] :P
[16:12] *** RichardG has quit IRC (Ping timeout: 246 seconds)
[16:13] *** K4k has joined #archiveteam-bs
[16:13] even just old school 2003 google or 1998 Altavista would be nice
[16:14] as long as I can get search results that aren't pre-filtered for my protection, de-ribbed for my pleasure.
[16:14] ivan_: he doesn't get it :P
[16:16] WBM is supposed to already have all the page content. just how large would the index be to make it searchable?
[16:17] and while bsing about this, is there any way to search WBM for all page titles beginning with "Index of" like I used to be able to do with Google up until the last few years
[16:18] I miss the glory days of 6 to 10 years ago when I was wget'ing every open directory for the sake of filling harddrives
[16:18] 377 billion pages * 10KB of actual text = 3.77PB
[16:19] can it be indexed?
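A quick check of the [16:18] estimate, as a rough sketch only. The page count and the 10 KB per page figure are the ones quoted in the chat; decimal (SI) units are an assumption here, and the function names are made up for illustration.

```python
# Back-of-envelope check of the "377 billion pages * 10KB" estimate.
# Assumption: decimal units (1 KB = 1,000 bytes, 1 PB = 10**15 bytes).

PAGES = 377_000_000_000        # 377 billion pages, figure quoted in the chat
TEXT_PER_PAGE_BYTES = 10_000   # 10 KB of actual text per page, also from the chat


def total_text_bytes(pages: int, bytes_per_page: int) -> int:
    """Total bytes of text if every page carries the same amount."""
    return pages * bytes_per_page


def to_petabytes(n_bytes: int) -> float:
    """Convert a byte count to decimal petabytes."""
    return n_bytes / 10**15


total = total_text_bytes(PAGES, TEXT_PER_PAGE_BYTES)
print(f"{to_petabytes(total):.2f} PB")  # prints "3.77 PB"
```

This only sizes the raw text; an index over it would be a separate structure with its own size.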
[16:19] what do you think indexes are made of, Raccoon
[16:20] * Raccoon searches for any ex-HOTBOT employees might still be alive in 2019
[16:20] unless you've got exotic compression schemes it's something like a giant KV of (normalized word) -> a list of pointers to every document that has word
[16:22] if 90% of words in a book are structural connector words, we can probably shave it down to just indexing 10% of a page's content. Those words that rank with low popularity
[16:22] words like 'bukake' or 'palin nudes'
[16:23] SketchCow: they were uploaded prior to the existence of the inbox
[16:24] What a lame excuse
[16:24] Anyway, all set
[16:24] also betting a good chunk of that 3.7PB is html tags, tables, scripts, and now css
[16:25] each page could probably be assigned with just 10 to 100 english index words.
[16:29] Speaking of BS
[16:30] So, years ago I made that cute thing that would look at a archivebot item, take out a nice pleasant set of screenshots of the pages, and then post them as .jpg files just so the things looked good.
[16:30] I'd love to do that again - my concern is I could get my mega-hacky thing working again, but it's probably stupid easy now.
[16:30] Maybe someone has something lying around - otherwise, I can go find my scripts and get them going.
[16:31] Example of what I mean: https://archive.org/details/archiveteam_archivebot_go_20150107190002
[16:33] thats really neat
[16:34] SketchCow: Something like: google-chrome-stable --headless --disable-gpu --screenshot --window-size=1920,1080
[16:34] * Raccoon thinks he just got Cow shatted #bs :)
[16:34] * phillipsj got the soy sauce reference.
[16:34] https://ia902302.us.archive.org/7/items/archiveteam_archivebot_go_082/www.nc911truth.org-inf-20140728-030309-p4pky-00000.warc.gz.png
[16:34] oh no...
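A minimal sketch of the structure described at [16:20] and [16:22]: a key-value map from normalized word to the documents containing it, skipping common connector words. The stop-word list, document IDs, and sample texts are all hypothetical, and nothing here reflects how WBM actually stores or indexes pages.

```python
import re
from collections import defaultdict

# Hypothetical, deliberately tiny list of "structural connector words".
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}


def normalize(text: str) -> list[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS]


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each normalized word to the set of document IDs containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in normalize(text):
            index[word].add(doc_id)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the documents that contain every non-stop word in the query."""
    words = normalize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Made-up sample documents for illustration only.
docs = {
    "doc1": "Index of /pub/music",
    "doc2": "The history of soy sauce",
    "doc3": "Index of /pub/software",
}
index = build_index(docs)
print(sorted(search(index, "index of")))   # ['doc1', 'doc3']
print(sorted(search(index, "soy sauce")))  # ['doc2']
```

Dropping stop words shrinks the map but also means a query like "Index of" matches on "index" alone, so a title-prefix search would need phrase or position data that this sketch does not keep.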
[16:35] That's why we save them
[16:36] yeah i get it
[16:40] > 1MB "preview" screenshot in .png
[16:43] i just found one of chipotles twitter with a swastica on it
[16:43] https://ia802603.us.archive.org/35/items/archiveteam_archivebot_go_20150209010002/twitter.com-inf-20150208-022652-a8aok-00000.warc.gz.png
[16:44] someone really dosnt like burretos
[16:46] VADemon, jpg would probably make the text hard to read.
[16:47] *** systwiALT has joined #archiveteam-bs
[16:49] I appreciate the lossless quality but its a preview. It's not supposed to be larger than the item. (+unoptimized png)
[16:51] *** systwiAL_ has quit IRC (Read error: Operation timed out)
[16:54] *** systwiALT has quit IRC (Read error: Operation timed out)
[17:18] *** VerifiedJ has quit IRC (Quit: Leaving)
[17:32] *** RichardG has joined #archiveteam-bs
[18:26] PurpleSym: Let me try it
[18:28] http://teamarchive1.fnf.archive.org/screenshot.png
[18:28] No fuckin' complaints
[18:41] *** jognsmith has quit IRC (Remote host closed the connection)
[18:45] I found my warc screenshotter (It's called WEBBERGRABBER) and will now do the work, and thanks to you it'll do screenshots REALLY fast.
[18:45] So that's appreciated.
[18:54] I'm excited for more screenshots.
[19:14] *** Jens has quit IRC (Remote host closed the connection)
[19:14] *** Jens has joined #archiveteam-bs
[19:27] Oops, wiped a script out
[19:27] Well, luckily it doesn't do much
[19:40] *** ndiddy has quit IRC (Quit: WeeChat 1.4)
[19:41] *** ndiddy has joined #archiveteam-bs
[19:58] OK, screenshotter's back in business.
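A hedged sketch of wrapping the headless Chrome command suggested at [16:34] in a script. This is not the WEBBERGRABBER script, which is not shown in the log. The trailing URL argument, the `--screenshot=<path>` form, the function names, and the 60 second timeout are assumptions added for illustration; the binary name and the other flags are the ones quoted in the chat.

```python
import subprocess


def screenshot_command(url: str, out_path: str,
                       width: int = 1920, height: int = 1080,
                       binary: str = "google-chrome-stable") -> list[str]:
    """Build the argument list for a headless Chrome screenshot of one URL."""
    return [
        binary,
        "--headless",
        "--disable-gpu",
        f"--screenshot={out_path}",        # assumption: write to an explicit path
        f"--window-size={width},{height}",
        url,                               # assumption: target page as the last argument
    ]


def take_screenshot(url: str, out_path: str) -> None:
    """Run Chrome once; raises if the binary is missing or exits non-zero."""
    subprocess.run(screenshot_command(url, out_path), check=True, timeout=60)


# Example (requires Chrome to be installed, so it is left commented out):
# take_screenshot("https://example.com/", "example.png")
```

Building the argument list separately from running it keeps the command easy to inspect and log before anything is launched.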
[19:58] http://teamarchive1.fnf.archive.org/WEBGRAB/
[20:27] *** Stiletto has quit IRC (Read error: Operation timed out)
[20:30] *** Stiletto has joined #archiveteam-bs
[20:58] *** katocala has quit IRC ()
[21:06] *** Raccoon has quit IRC (Read error: Connection reset by peer)
[21:09] *** katocala has joined #archiveteam-bs
[21:32] *** kiskabak has quit IRC (Remote host closed the connection)
[21:32] *** kiskabak has joined #archiveteam-bs
[21:32] *** Fusl sets mode: +o kiskabak
[21:32] *** Fusl__ sets mode: +o kiskabak
[21:32] *** Fusl_ sets mode: +o kiskabak
[22:06] *** killsushi has joined #archiveteam-bs
[22:37] *** jognsmith has joined #archiveteam-bs
[22:44] Hello arkiver :)
[22:44] you said fotolog?
[22:44] https://www.archiveteam.org/index.php?title=Fotolog this?
[22:44] sorry i meant live spaces, my bad
[22:44] https://www.archiveteam.org/index.php?title=Spaces_of_Windows_Live_Spaces_pending_for_download
[22:44] *** Smiley has quit IRC (Read error: Operation timed out)
[22:44] (Link in the main chan is broken)
[22:44] alright
[22:44] yeah I saw the page
[22:44] The IRC logs don't go back that far.
[22:44] i was confused since he said fotolog
[22:45] I'm not sure if it was saved, which one was yours?
[22:45] (I was not involved in this project)
[22:45] jognsmith: ^
[22:45] photosoffmycats.spaces.live.com
[22:46] (i'd like to ask later about fotolog as well)
[22:46] alright and which one for fotolog?
[22:47] for fotolog.com
[22:48] wolf_alex
[22:48] ok
[22:48] If we have http://photosoffmycats.spaces.live.com/, I'm not sure where it is
[22:48] perhaps chfoo knows something
[22:48] oh :c does that mean its lost?
[22:49] could be
[22:49] looking into the fotolog one now
[22:49] thank you
[22:49] So apparently that list is also part of these grabs: https://www.archiveteam.org/index.php?title=Talk:Windows_Live_Spaces#Phase_2:_Downloading_Hotlists
[22:49] Which are "Uploaded, awaiting verification", so at least they were grabbed at some point.
[22:50] underscor: According to the wiki page, you were running an FTP server for that project at the time. Do you know anything?
[22:52] oh! if they were grabbed it could mean good news i guess
[22:52] are you sure it was wolf_alex?
[22:52] so fotolog.com/wolf_alex ?
[22:53] because I can't find it, and it's also not in the list of account we archived from fotolog.
[22:53] yes, it was http://www.fotolog.com/wolf_alex
[22:54] probably it wouldnt be grabbed though, i didnt have that much followers
[22:55] I think we discovered users by checking followers, etc.
[22:55] yeah I don't see it in the lists of account the archived :/
[22:55] hopefully there will still be good news on spaces
[22:55] yeah i guessed so :/ it was a small site
[22:56] yes! the fact that the link is in the wiki gives me hope
[23:20] Happy to say the archivebot screenshotter works.
[23:20] (Just did a full-run.)
[23:21] Now I'm running it against archivebot in general.
[23:21] Nice
[23:26] so that gaming computer i wanted to get now back up to $700
[23:26] was on sale for $580
[23:31] *** coderobe has quit IRC (Remote host closed the connection)
[23:38] SketchCow: what computer pre-build would you get for $600?
[23:38] i was looking at this but it went back up to $700 : https://www.amazon.com/Dell-Inspiron-Desktop-Processor-Graphics/dp/B07Q3G3B67/