[00:00] kisspunch: 3x improvement in what? drive space needed?
[00:41] marked: right
[00:57] are you using http, warc, and/or git? previously I would sort through warc from a crawl and make a pie chart of byte count per URL
[00:58] I'll take a look if your code is somewhere I can run it, or have an output for DL I can comb through
[01:11] marked: I don't think I can reduce the byte count per repo further. I need to cut out repos now.
[01:12] I can PM some access stuff if you want to comb through things
[01:16] *** ndiddy has quit IRC ()
[01:42] JAA: Can you check if co8mqvzufi1dsn7jnfbcn1o3o is stuck for me (on ArchiveBot)? I want the job to finish so that more jobs can be added.
[01:43] t3: Wrong channel.
[01:44] JAA: I am sorry. I'll just message you personally next time.
[01:44] No, the correct place is #archivebot.
[02:13] *** ivan has quit IRC (Read error: Operation timed out)
[02:13] *** JAA has quit IRC (Read error: Operation timed out)
[02:13] *** cfarquhar has quit IRC (Read error: Operation timed out)
[02:14] *** ivan has joined #archiveteam-bs
[02:14] *** wabu has quit IRC (Read error: Operation timed out)
[02:14] *** svchfoo1 has quit IRC (Read error: Operation timed out)
[02:14] *** fuzy802 has joined #archiveteam-bs
[02:15] *** fuzzy8021 has quit IRC (Read error: Operation timed out)
[02:15] *** nightpoo- has quit IRC (Read error: Operation timed out)
[02:16] *** c4rc4s has quit IRC (Read error: Operation timed out)
[02:16] *** simon816 has quit IRC (Ping timeout: 246 seconds)
[02:24] *** nightpool has joined #archiveteam-bs
[02:24] *** fuzy802 is now known as fuzzy8021
[02:27] *** dumbass_ has quit IRC (Ping timeout: 260 seconds)
[02:53] *** Rome_Silv has quit IRC (Remote host closed the connection)
[02:53] *** Rome_Silv has joined #archiveteam-bs
[03:02] *** Mateon1 has quit IRC (Ping timeout: 265 seconds)
[03:03] *** Mateon1 has joined #archiveteam-bs
[03:14] *** cfarquhar has joined #archiveteam-bs
[03:14] *** c4rc4s has joined #archiveteam-bs
[03:15] *** simon816 has joined #archiveteam-bs
[03:18] *** svchfoo1 has joined #archiveteam-bs
[03:18] *** Fusl sets mode: +o svchfoo1
[03:18] *** JAA has joined #archiveteam-bs
[03:18] *** Fusl sets mode: +o JAA
[03:18] *** bakJAA sets mode: +o JAA
[03:18] *** wabu has joined #archiveteam-bs
[03:26] *** qw3rty119 has joined #archiveteam-bs
[03:30] *** Rome_Silv has quit IRC (Remote host closed the connection)
[03:30] *** Rome_Silv has joined #archiveteam-bs
[03:32] *** qw3rty118 has quit IRC (Read error: Operation timed out)
[03:32] *** odemgi_ has joined #archiveteam-bs
[03:35] *** odemgi has quit IRC (Read error: Operation timed out)
[03:35] *** Rome has joined #archiveteam-bs
[03:38] *** Rome_Silv has quit IRC (Ping timeout: 252 seconds)
[03:41] *** odemg has quit IRC (Ping timeout: 615 seconds)
[03:44] ftp sites are not going to archive themselves, and more and more sites from the ftp list are dead each day. something has to be done
[03:47] *** odemg has joined #archiveteam-bs
[03:49] I completely agree
[03:49] I cant do anything about it with no coding knowledge but last year I built up a substantial list of the FTP sites
[03:50] Lord_Nigh as far as I know the scripts are outdated
[03:52] Lord_Nigh there is #effteepee but its been dead for months
[03:55] nothing will get done until someone either learns how to script it, or has an alternative, better solution
[03:55] we can gripe about the dead effteepee project all day but it achieves nothing
[03:56] do FTP sites go into WBM somehow?
[03:56] Sketchcow was telling me once about a guy that was uploading FTP sites as zips but I dont remember what became of that
[03:56] I don't have the skill to do it
[03:56] FTP sites dont go to the wayback machine it cant handle the protocol
[03:56] beyond simple 'wget -m -np -p ftp://ftp.site.com/'
[03:57] I mean id happily donate time but dont have skill or money to do anything better
[03:57] We can check if the ftp site has a http site equivalent and archive that
[03:58] but that is a bandaid over a gunshot wound
[03:58] *** omarroth has quit IRC (Remote host closed the connection)
[04:02] i've been uploading ftp sites as zips for a while
[04:02] i'm not very active on it though
[04:03] ftp is easier than the http equivalent. what happened to the ftp grab that ran on the tracker?
[04:04] i didn't know there was such a thing on the tracker ... ?
[04:05] all I know is what's in the wiki https://www.archiveteam.org/index.php?title=FTP
[04:05] There was but that was before my time
[04:06] All i know is the scripts are broken and the project was all but abandoned. I spent a lot of 2018 working through and adding stuff to the FTP/List
[04:11] *** Despatche has quit IRC (Quit: Read error: Connection reset by deer)
[05:05] *** stapler11 has joined #archiveteam-bs
[05:32] *** enowaldo has joined #archiveteam-bs
[05:37] *** Zerote has joined #archiveteam-bs
[05:41] *** enowaldo has quit IRC (Read error: Operation timed out)
[07:06] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[07:20] *** Zerote has joined #archiveteam-bs
[09:09] Sola CDN grab is nearly done, just a few 10k URLs remaining.
[09:36] *** stapler11 has quit IRC (Read error: Connection reset by peer)
[09:37] *** stapler11 has joined #archiveteam-bs
[09:47] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[09:56] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[09:59] *** Verified_ has joined #archiveteam-bs
[10:19] *** Mateon1 has quit IRC (Quit: Mateon1)
[10:23] *** Mateon1 has joined #archiveteam-bs
[10:27] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[10:31] *** Mateon1 has quit IRC (Remote host closed the connection)
[10:39] *** Mateon1 has joined #archiveteam-bs
[10:44] *** Verified_ has joined #archiveteam-bs
[11:17] *** enowaldo has joined #archiveteam-bs
[11:32] Sola CDN complete. I don't have any numbers for the total size yet, but it's over a TiB.
[11:35] *** enowaldo has quit IRC (Read error: Operation timed out)
[11:54] HCross: I just saw that https://archiveteam.org/index.php?title=ZAM_Network still mentions Storm Shield One (stormshield.one) as "In progress... by User:HCross" since September. I can't find a grab from around that time in the WBM. Any idea what happened to that?
[12:05] *** enowaldo has joined #archiveteam-bs
[12:06] *** icedice has joined #archiveteam-bs
[12:23] *** enowaldo has quit IRC (Read error: Operation timed out)
[13:11] *** enowaldo has joined #archiveteam-bs
[13:19] *** chirlu has quit IRC (Ping timeout: 255 seconds)
[13:32] *** Despatche has joined #archiveteam-bs
[13:38] *** chirlu has joined #archiveteam-bs
[13:44] *** omarroth has joined #archiveteam-bs
[14:22] *** dumbass_ has joined #archiveteam-bs
[14:25] *** Zerote has quit IRC (Ping timeout: 260 seconds)
[14:34] *** enowaldo has quit IRC (Read error: Operation timed out)
[14:52] *** balrog has quit IRC (Quit: Bye)
[15:00] *** kiska1 has quit IRC (Read error: Operation timed out)
[15:01] *** kiska1 has joined #archiveteam-bs
[15:01] *** enowaldo has joined #archiveteam-bs
[15:01] *** svchfoo3 sets mode: +o kiska1
[15:03] *** enowaldo has quit IRC (Read error: Operation timed out)
[15:09] *** balrog has joined #archiveteam-bs
[15:10] *** dumbass_ has quit IRC (Ping timeout: 260 seconds)
[15:16] *** omarroth has quit IRC (Quit: Konversation terminated!)
[15:16] *** Zerote has joined #archiveteam-bs
[15:17] *** omarroth has joined #archiveteam-bs
[15:18] *** DogsRNice has joined #archiveteam-bs
[15:22] Hi. I was wondering about the feasibility of archiving the steam community (profiles and associated pages, groups and their decisions and everything else there). The coverage of it is almost nonexistent on the wayback machine beyond the surface level. While the site is likely stable in the long term there will likely be a UI update at some point in the future that may make archiving difficult
[15:23] It already is difficult. Pagination of comments works through JS, for example.
[15:23] Or at least it did last time I checked.
[15:24] The "view all comments" button leads to a normal list of pages that can be saved but not all types of pages have that
[15:31] *** DogsRNice has quit IRC (Ping timeout: 263 seconds)
[15:36] *** RomeSilva has joined #archiveteam-bs
[15:37] *** Rome has quit IRC (Ping timeout: 252 seconds)
[15:40] *** enowaldo has joined #archiveteam-bs
[15:47] Total size of my Sola CDN grab: 1.01 TiB. Technically "over a TiB". :-)
[15:48] *** enowaldo has quit IRC (Ping timeout: 265 seconds)
[15:48] The data is in these items: https://archive.org/details/@justanotherarchivist?and%5B%5D=sola.ai_cdn_201904_
[16:19] kiska: I've added details about my part of the Sola crawl to https://archiveteam.org/index.php?title=Sola.ai but I'm not sure how much the warrior project managed, where the data is, etc.
[16:20] JAA: The data that we managed to grab is here: https://archive.org/details/archiveteam_sola_20190412062205_780e9c17
[16:21] As far as I know there is about 11.7G of data there
[16:21] Any idea how many users we managed to cover?
[16:22] I'll check the json and see how much we got, but I am going to estimate <=1k
[16:22] *** enowaldo has joined #archiveteam-bs
[16:24] zcat sola_20190412062205_780e9c17.megawarc.json.gz | wc -l indicates 307 lines
[16:26] Thanks, adding that to the page.
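For anyone without zcat at hand, the `zcat ... | wc -l` count at [16:24] can be approximated in Python. This is only a sketch: the helper name `count_lines` is made up for illustration, and it assumes (as the count above implies, but the log does not confirm) that the megawarc JSON file holds one record per line.

```python
import gzip


def count_lines(path):
    """Count newline bytes in a gzip-compressed file, like `zcat path | wc -l`."""
    total = 0
    with gzip.open(path, "rb") as fh:
        # Read in 1 MiB chunks so a large index never has to fit in memory.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            total += chunk.count(b"\n")
    return total


# Hypothetical usage, with the file name taken from the log:
# print(count_lines("sola_20190412062205_780e9c17.megawarc.json.gz"))
```

Like `wc -l`, this counts newline characters, so a final record without a trailing newline is not counted.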
[16:30] I knew we wouldn't grab much since it was set up in a hurry and my tracker was the most unstable piece of software I have ever touched
[16:36] *** t3 has quit IRC (Quit: Connection closed for inactivity)
[16:36] *** bitBaron has quit IRC (Read error: Operation timed out)
[16:37] *** enowaldo has quit IRC (Read error: Operation timed out)
[16:43] *** jspiros__ has joined #archiveteam-bs
[16:43] *** enowaldo has joined #archiveteam-bs
[17:35] *** omarroth has quit IRC (Ping timeout: 506 seconds)
[17:35] *** omarroth has joined #archiveteam-bs
[17:56] *** tephra has quit IRC (Read error: Operation timed out)
[18:00] *** tephra has joined #archiveteam-bs
[18:11] *** Oddly has joined #archiveteam-bs
[18:24] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[18:32] *** enowaldo has joined #archiveteam-bs
[19:31] *** Mateon1 has quit IRC (Remote host closed the connection)
[19:32] *** Mateon1 has joined #archiveteam-bs
[19:46] *** Stiletto has joined #archiveteam-bs
[19:54] *** enowaldo has quit IRC (Read error: Operation timed out)
[19:58] *** omarroth has quit IRC (Read error: Connection reset by peer)
[19:59] *** omarroth has joined #archiveteam-bs
[20:23] *** Stilett0- has joined #archiveteam-bs
[20:23] *** VADemon_ has quit IRC (Read error: Connection reset by peer)
[20:25] *** Stiletto has quit IRC (Ping timeout: 268 seconds)
[20:25] *** enowaldo has joined #archiveteam-bs
[20:29] *** fredgido has joined #archiveteam-bs
[20:31] *** enowaldo has quit IRC (Read error: Operation timed out)
[20:37] *** Stilett0- is now known as Stiletto
[20:40] *** Stilett0- has joined #archiveteam-bs
[20:43] *** Stilettoo has joined #archiveteam-bs
[20:45] *** Stiletto has quit IRC (Read error: Operation timed out)
[20:47] *** Stilett0- has quit IRC (Ping timeout: 265 seconds)
[20:49] *** Stilettoo is now known as Stiletto
[20:52] *** alex_ has joined #archiveteam-bs
[21:06] *** schbirid has quit IRC (Remote host closed the connection)
[21:47] *** VerifiedJ has joined #archiveteam-bs
[21:51] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[21:52] *** enowaldo has joined #archiveteam-bs
[21:54] *** Verified_ has joined #archiveteam-bs
[21:56] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds)
[21:58] *** Stilettoo has joined #archiveteam-bs
[21:58] *** stapler11 has quit IRC (Read error: Connection reset by peer)
[21:59] *** stapler11 has joined #archiveteam-bs
[21:59] *** Stiletto has quit IRC (Ping timeout: 268 seconds)
[22:02] *** icedice has quit IRC (Read error: Connection reset by peer)
[22:02] *** icedice2 has joined #archiveteam-bs
[22:03] *** enowaldo has quit IRC (Ping timeout: 252 seconds)
[22:05] *** VerifiedJ has joined #archiveteam-bs
[22:05] *** Verified_ has quit IRC (Ping timeout: 252 seconds)
[22:18] *** VerifiedJ has quit IRC (Ping timeout: 252 seconds)
[22:25] *** Verified_ has joined #archiveteam-bs
[22:44] *** enowaldo has joined #archiveteam-bs
[22:56] *** Stilettoo is now known as Stiletto
[22:59] *** enowaldo has quit IRC (Read error: Operation timed out)
[23:06] *** icedice2 has quit IRC (Quit: Leaving)
[23:22] *** BlueMax has joined #archiveteam-bs
[23:33] *** alex_ has quit IRC (Quit: take care ye all. Have fun!)
[23:35] *** dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[23:36] *** dashcloud has joined #archiveteam-bs
[23:38] *** qw3rty119 has quit IRC (Read error: Connection reset by peer)
[23:45] *** qw3rty119 has joined #archiveteam-bs
[23:49] *** ndiddy has joined #archiveteam-bs
[23:58] *** Ravenloft has joined #archiveteam-bs