[00:00] *** closure has joined #archiveteam-bs
[00:01] *** hook54321 has joined #archiveteam-bs
[00:06] *** Frogging has joined #archiveteam-bs
[00:17] *** Jens has quit IRC (Read error: Operation timed out)
[00:17] *** phuzion has quit IRC (Read error: Operation timed out)
[00:22] *** i0npulse has joined #archiveteam-bs
[00:22] *** BlueMax has quit IRC (Read error: Operation timed out)
[00:24] *** phuzion has joined #archiveteam-bs
[00:24] *** Jens has joined #archiveteam-bs
[00:25] *** BlueMax has joined #archiveteam-bs
[00:33] *** closure has quit IRC (Read error: Connection reset by peer)
[00:35] *** closure has joined #archiveteam-bs
[00:48] *** ndiddy has quit IRC ()
[00:58] *** closure has quit IRC (Read error: Operation timed out)
[00:59] *** Swicher has joined #archiveteam-bs
[01:01] *** closure has joined #archiveteam-bs
[01:02] Swicher: So the way we typically download a site in a distributed way is to split it up into individual work items. These items normally only take a relatively short amount of time, e.g. a few minutes.
[01:03] So it's not a big issue if an individual machine needs to be shut down or has issues and crashes.
[01:26] so i got 1720 pdfs for ERIC archive so far with this audit
[01:33] *** closure has quit IRC (Read error: Connection reset by peer)
[01:33] *** closure_ has joined #archiveteam-bs
[01:43] JAA: What I was worried about when downloading the site only with the Warrior/ArchiveBot is that it does not cover it completely (or at least not the relevant parts). For example, if you check https://archive.org/download/archiveteam_archivebot_go_20180818100001/addons.mozilla.org-inf-20180729-181049-xew9s-00052.warc.os.cdx.gz (the latest index work that I found referring to the Mozilla site) and compare it with the list that I made you will see that many things are missing.
[01:43] That's why I started doing the scripts already mentioned. The only problem I haven't been able to solve is that it takes between 5 days and a week to go through all the pages with extensions, so any idea to optimize this is welcome. Just out of curiosity, are you saving the list on your own or did you add it as an ArchiveBot job?
[01:47] Swicher: Yeah, that grab is incomplete. The one from 2017-08-29 until early December 2017 should be more complete.
[01:48] I'll add your list to ArchiveBot. I won't grab it myself since I'm working on a different grab method.
[01:49] For any further discussion about AMO, please come to the dedicated channel: #outofammo
[01:50] Ok, and thanks for the tip.
[01:58] Swicher: ArchiveBot job submitted, ID akifc65k7kfhpdhfbveh79v1c. It won't start immediately though.
[01:59] *** closure_ has quit IRC (Read error: Operation timed out)
[02:05] so i just learned why people like CHD files for isos
[02:06] they save tons of space
[02:06] about 50% at least
[02:06] *** closure has joined #archiveteam-bs
[02:12] i'm learning this cause i'm downloading the playstation official magazine demo discs from archive.org
[02:29] so i got a ton of ps2 demo discs that may go up if they're not hosted on archive.org already
[02:30] i have ripped at least 3 discs anyways
[02:30] i have 2 psm dvds which are just video dvds
[02:30] and a pc tools cd
[02:31] *** closure has quit IRC (Read error: Operation timed out)
[02:32] luckily there is this but no playstation official magazine ps2 discs
[02:32] https://archive.org/download/PlayStation2-Demos
[02:34] *** closure has joined #archiveteam-bs
[02:42] JAA: The A Listing for tian.yam.com: https://transfer.sh/VHO5D/tian.yam.com-fdns-a-listing
[02:42] I am going to try the any dataset
[02:45] kiska: T minus 23 hours 15 minutes
[02:45] 25*
[02:46] btw using pigz, and it's halving the lookup time
[02:46] Based on their description, I doubt there'll be more in the ANY set.
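Editor's note: the subdomain lookup discussed above can be sketched roughly as follows. This is only an illustration, not the method actually used in the channel. It assumes the DNS dataset is a gzip-compressed file with one JSON object per line and a `name` field holding the hostname; the log does not confirm that layout, and the function names and the file path in the usage note are hypothetical.

```python
import gzip
import json
from typing import Iterable, Iterator


def matching_names(lines: Iterable[str], suffix: str) -> Iterator[str]:
    """Yield each unique hostname that equals `suffix` or is a subdomain of it.

    ASSUMPTION: every input line is a JSON object with a "name" field.
    Lines that do not fit that shape are skipped silently.
    """
    suffix = suffix.lower().rstrip(".")
    seen = set()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # malformed line
        if not isinstance(record, dict):
            continue
        name = str(record.get("name", "")).lower().rstrip(".")
        # Require a label boundary so "nottian.yam.com" does not match.
        if name == suffix or name.endswith("." + suffix):
            if name not in seen:
                seen.add(name)
                yield name


def scan_dump(path: str, suffix: str) -> Iterator[str]:
    """Stream a gzip-compressed dump from disk without loading it into memory."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        yield from matching_names(handle, suffix)
```

Hypothetical usage: `for host in scan_dump("dump.json.gz", "tian.yam.com"): print(host)`. Python's `gzip` module decompresses on a single thread, so to get the kind of speedup mentioned for pigz one could instead pipe `pigz -dc` output into a script that passes `sys.stdin` to `matching_names`.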
[02:48] It's been about a year, here's an update on storage prices: https://za3k.com/archive/storage-2018-10.sc.txt
[02:49] "thermal paper" lol
[02:52] JAA: Here is the any listing data: https://transfer.sh/vRZ7v/tian.yam.com-fdns-any-listing
[02:54] kiska: Yup, identical hostnames (except for those yamedia.tw ones).
[02:56] I feel like we won't be able to save pretty much anything from this site.
[02:56] Hrm, let's try CNAME resolves
[02:58] *** closure has quit IRC (Read error: Connection reset by peer)
[03:00] *** closure has joined #archiveteam-bs
[03:01] I've submitted a request for the latest dns dataset, but I doubt I'll get a response soon™
[03:13] kisspunch: 8TB has been $160 for a while assuming you're OK with stripping He8 warranties and have the know-how to pry them out of the My Book/easystore cases
[03:14] oh I see the second link to a Seagate
[03:14] I will never buy an SMR
[03:16] There's nothing wrong with SMR when used for the right purpose. (But Seagate's doing a terrible job at communicating those limitations properly.)
[03:16] But -ot
[03:20] JAA: Another scrape of dns data: https://transfer.sh/wXvEp/tian.yam.com-subdomains-securitytrails
[03:20] There are ~100 subdomains in that
[03:32] Elon Musk forced to resign
[03:32] *** closure has quit IRC (Read error: Connection reset by peer)
[03:34] *** closure has joined #archiveteam-bs
[03:35] Yep
[03:38] *** archodg_ has joined #archiveteam-bs
[03:40] *** archodg__ has quit IRC (Ping timeout: 252 seconds)
[03:40] *** odemg has quit IRC (Ping timeout: 260 seconds)
[03:48] JAA: So do we want to start archiving what we already have? I think I can get more subdomains to process, but it looks like a very manual process, since their search is using POST data
[03:53] *** odemg has joined #archiveteam-bs
[03:56] *** BlueMax has quit IRC (Remote host closed the connection)
[03:58] *** closure has quit IRC (Read error: Operation timed out)
[03:59] *** BlueMax has joined #archiveteam-bs
[03:59] *** closure has joined #archiveteam-bs
[04:07] *** Mateon1 has quit IRC (Ping timeout: 268 seconds)
[04:07] *** Mateon1 has joined #archiveteam-bs
[04:09] *** ReimuHaku has quit IRC (Ping timeout: 633 seconds)
[04:10] *** ReimuHaku has joined #archiveteam-bs
[04:35] *** closure has quit IRC (Read error: Connection reset by peer)
[04:40] *** closure has joined #archiveteam-bs
[04:55] *** fenn_ is now known as fenn
[04:59] *** closure has quit IRC (Read error: Connection reset by peer)
[05:00] *** closure has joined #archiveteam-bs
[05:32] *** closure has quit IRC (Read error: Connection reset by peer)
[07:20] *** ivan has quit IRC (Read error: Operation timed out)
[07:20] *** JAA has quit IRC (Read error: Operation timed out)
[07:20] *** Frogging has quit IRC (Read error: Operation timed out)
[07:20] *** Frogging has joined #archiveteam-bs
[07:20] *** Petri152 has quit IRC (Read error: Operation timed out)
[07:20] *** zyphlar has quit IRC (Read error: Operation timed out)
[07:21] *** Darkstar has quit IRC (Read error: Operation timed out)
[07:21] *** jspiros has quit IRC (Read error: Operation timed out)
[07:21] *** nightpool has quit IRC (Read error: Operation timed out)
[07:21] *** Swicher has quit IRC (hub.efnet.us irc.Prison.NET)
[07:21] *** achip has quit IRC (hub.efnet.us irc.Prison.NET)
[07:22] *** c4rc4s has quit IRC (Read error: Operation timed out)
[07:23] *** ivan has joined #archiveteam-bs
[07:25] *** Mayonaise has quit IRC (Read error: Operation timed out)
[07:25] *** Mayonaise has joined #archiveteam-bs
[07:30] *** Darkstar has joined #archiveteam-bs
[07:31] *** nightpool has joined #archiveteam-bs
[07:32] *** achip has joined #archiveteam-bs
[07:32] *** Swicher has joined #archiveteam-bs
[07:47] *** schbirid has joined #archiveteam-bs
[08:10] *** m007a83_ has joined #archiveteam-bs
[08:15] *** jrwr_ has joined #archiveteam-bs
[08:16] *** m007a83 has quit IRC (Read error: Operation timed out)
[08:16] *** thejsa_ has joined #archiveteam-bs
[08:18] I NEED THE BEST WAY TO SCAN MAGAZINES NOW
[08:18] my parents want to throw out a bunch of old magazines and won't let me ship them out
[08:19] Godane
[08:19] Sketchcow
[08:20] *** thejsa has quit IRC (Ping timeout: 633 seconds)
[08:20] *** jrwr has quit IRC (Ping timeout: 633 seconds)
[08:20] *** Jon- has quit IRC (Ping timeout: 633 seconds)
[08:20] *** jrwr_ is now known as jrwr
[08:20] *** jmtd has joined #archiveteam-bs
[08:20] ivan: https://www.ebuyer.com/771467-seagate-backup-plus-hub-8tb-external-hard-drive-stel8000200 this sort of thing?
[08:21] *** c4rc4s has joined #archiveteam-bs
[08:21] *** zyphlar has joined #archiveteam-bs
[08:21] *** Petri152 has joined #archiveteam-bs
[08:22] *** JAA has joined #archiveteam-bs
[08:22] *** swebb sets mode: +o JAA
[08:22] *** bakJAA_ sets mode: +o JAA
[08:23] HCross ivan be aware that those are SMR drives. at least in germany wd mybook pros are at a similar price every now and then
[08:23] Don't worry, you can get a Porsche HDD https://www.ebuyer.com/767035-lacie-porsche-design-4tb-usb-3-0-desktop-drive-stew4000400
[08:24] :D
[08:26] *** jspiros has joined #archiveteam-bs
[08:26] Godane do you want some collectors mags?
[08:26] flashfire: a scanner. maybe you could see if you could use a friend's or if a photocopy service will do it. if push comes to shove, you could lay them out flat or cut the pages out with an exacto knife and take photos of the pages
[08:27] Library?
[08:27] Library?
[08:27] No these are from a dvd collection from about 10-15 years ago
[08:27] Flashfire: would your local library have a scanner you could use?
[08:28] I don't think so
[08:30] Archive.org doesn't have them but auspost charges an arm and a leg
[09:01] *** RichardG has quit IRC (Read error: Connection reset by peer)
[09:02] *** RichardG has joined #archiveteam-bs
[09:11] *** Jusque has quit IRC (Ping timeout: 260 seconds)
[09:11] *** Jusque has joined #archiveteam-bs
[09:29] a PHP script to collect usernames with tian's search feature: https://transfer.sh/kNTuz/tian_fetch.php
[09:30] Yay!
[09:31] I have ~610 urls, are you able to run the script and give me deduplicated urls?
[09:32] Here is my list of urls: https://pastebin.com/raw/rP4taKiW
[11:03] kevinYang:
[11:03] PHP Warning: mysqli_connect(): (HY000/1049): Unknown database 'tian_username' in /tian_fetch.php on line 4
[11:03] PHP Warning: mysqli_query() expects parameter 1 to be mysqli, boolean given in /tian_fetch.php on line 5
[11:04] ...
[11:04] nvm
[11:05] It keeps the
next to the username tho
[11:07] Ah, here's a real bug
[11:07] PHP Notice: ob_flush(): failed to flush buffer. No buffer to flush in /tian_fetch.php on line 46
[11:11] kiska: +159 from 200 usernames scraped using kevinYang's script
[11:11] Pastebin?
[11:12] There should be ~10-20k blogs, since I believe yam is a pretty big host of them
[11:16] kiska: https://pastebin.com/raw/kL6A8qMF
[11:16] Also you got a couple dupes in yours, plus the CDN URL
[11:17] Hrm...
[11:18] It should skip them during the grab
[11:19] Also queued
[11:30] *** decay has joined #archiveteam-bs
[12:09] *** VerifiedJ has joined #archiveteam-bs
[12:19] *** BlueMax has quit IRC (Quit: Leaving)
[12:24] *** wp494 has quit IRC (Ping timeout: 506 seconds)
[12:25] *** wp494 has joined #archiveteam-bs
[12:30] Here's ~20k deduplicated usernames sorted by their blogger_id: https://pastebin.com/raw/UJyRYxdB . Maybe there's an API to query usernames with blogger_id?
[12:31] Hrm... That is... useful
[12:31] I think there would be ~200k blogs since post IDs are up to 200M and total population of Taiwan is 23M.
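Editor's note: merging two URL lists while dropping duplicates and unwanted hosts (such as the CDN URL mentioned above) could look something like the sketch below. It is not kevinYang's PHP script. The function names, the choice to treat `http` and `https` as the same URL, and the placeholder hostnames in the usage note are all assumptions made for illustration.

```python
from typing import Iterable, List, Optional
from urllib.parse import urlsplit


def dedupe_key(url: str) -> Optional[str]:
    """Return a comparison key (lowercased host plus path), or None if unusable.

    ASSUMPTION: scheme, query string and a trailing slash do not matter
    when deciding whether two blog URLs are duplicates.
    """
    url = url.strip()
    if not url:
        return None
    if "://" not in url:
        url = "http://" + url  # tolerate bare hostnames in the list
    try:
        parts = urlsplit(url)
    except ValueError:
        return None  # e.g. malformed IPv6 literal
    host = (parts.hostname or "").lower()
    if not host:
        return None
    return host + parts.path.rstrip("/")


def merge_unique(*lists: Iterable[str], skip_hosts: Iterable[str] = ()) -> List[str]:
    """Merge URL lists in order, keeping the first spelling of each unique URL."""
    skipped = {h.lower() for h in skip_hosts}
    seen = set()
    merged = []
    for urls in lists:
        for url in urls:
            key = dedupe_key(url)
            if key is None:
                continue
            host = key.split("/", 1)[0]
            if host in skipped or key in seen:
                continue
            seen.add(key)
            merged.append(url.strip())
    return merged
```

Hypothetical usage: `merge_unique(my_urls, scraped_urls, skip_hosts={"cdn.example.invalid"})`, where both lists are read line by line from the pastes. Keeping the first-seen spelling means the output preserves whatever form the original list used.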
[12:37] *** bakJAA_ is now known as bakJAA
[12:51] JAA: Some useful information here
[13:03] *** closure has joined #archiveteam-bs
[13:32] *** closure has quit IRC (Read error: Connection reset by peer)
[13:32] *** closure has joined #archiveteam-bs
[13:38] http://c.hawc.eu/tianyamusers.txt
[13:39] full URL list based on what kevinYang sent
[13:51] *** m007a83_ has quit IRC (Quit: Fuck you Comcast)
[13:55] *** zerkalo has joined #archiveteam-bs
[13:59] *** closure has quit IRC (Read error: Operation timed out)
[14:00] *** closure has joined #archiveteam-bs
[14:32] *** closure has quit IRC (Read error: Connection reset by peer)
[14:34] *** closure has joined #archiveteam-bs
[14:58] *** closure has quit IRC (Read error: Connection reset by peer)
[14:58] *** Pixi has quit IRC (Quit: Pixi)
[15:01] *** schbirid has quit IRC (Remote host closed the connection)
[15:06] *** closure has joined #archiveteam-bs
[15:08] *** Pixi has joined #archiveteam-bs
[15:22] kevinYang, you join #archivebot. I'll be queuing 200 urls per ~30 minutes. Hopefully we'll be able to get a significant amount of those
[15:32] *** closure has quit IRC (Read error: Connection reset by peer)
[15:33] *** closure_ has joined #archiveteam-bs
[15:58] *** closure_ has quit IRC (Read error: Connection reset by peer)
[15:59] *** closure has joined #archiveteam-bs
[16:32] *** closure has quit IRC (Read error: Connection reset by peer)
[16:36] *** closure has joined #archiveteam-bs
[17:00] *** closure has quit IRC (Read error: Connection reset by peer)
[17:00] *** closure_ has joined #archiveteam-bs
[17:33] *** closure has joined #archiveteam-bs
[17:33] *** closure_ has quit IRC (Read error: Connection reset by peer)
[17:51] *** icedice has joined #archiveteam-bs
[17:52] *** jut_ has quit IRC (Quit: WeeChat 2.2)
[18:00] *** closure has quit IRC (Read error: Connection reset by peer)
[18:00] *** closure_ has joined #archiveteam-bs
[18:07] *** jut has joined #archiveteam-bs
[18:33] *** closure_ has quit IRC (Read error: Connection reset by peer)
[18:33] *** closure has joined #archiveteam-bs
[18:38] *** closure has quit IRC (Read error: Connection reset by peer)
[18:40] *** closure has joined #archiveteam-bs
[19:03] *** closure has quit IRC (Read error: Operation timed out)
[19:06] *** closure has joined #archiveteam-bs
[19:34] *** closure has quit IRC (Read error: Connection reset by peer)
[19:34] *** closure_ has joined #archiveteam-bs
[20:00] *** closure_ has quit IRC (Ping timeout: 252 seconds)
[20:00] *** closure has joined #archiveteam-bs
[20:03] *** closure has quit IRC (Read error: Connection reset by peer)
[20:04] *** closure has joined #archiveteam-bs
[20:32] *** closure has quit IRC (Read error: Connection reset by peer)
[20:35] *** closure_ has joined #archiveteam-bs
[21:01] *** closure_ has quit IRC (Read error: Connection reset by peer)
[21:01] *** closure has joined #archiveteam-bs
[21:33] *** closure has quit IRC (Read error: Connection reset by peer)
[21:34] *** closure has joined #archiveteam-bs
[21:59] *** closure has quit IRC (Read error: Connection reset by peer)
[22:00] *** closure_ has joined #archiveteam-bs
[22:33] *** closure_ has quit IRC (Read error: Connection reset by peer)
[22:33] *** closure has joined #archiveteam-bs
[23:00] *** closure has quit IRC (Read error: Connection reset by peer)
[23:01] *** closure has joined #archiveteam-bs
[23:07] *** VerifiedJ has quit IRC (Quit: Leaving)
[23:34] *** closure has quit IRC (Read error: Connection reset by peer)
[23:34] *** closure_ has joined #archiveteam-bs
[23:57] *** BlueMax has joined #archiveteam-bs
[23:58] *** closure_ has quit IRC (Read error: Operation timed out)