[00:00] The keygenmusic.net job might want to ignore the ?vote page, it's just a captcha box to vote on a file and each ID is being saved multiple times with language codes that appear to make no difference e.g. http://keygenmusic.net/?page=vote&fileid=2888&lang=gr nl, es, de, etc. [00:01] atphoenix: ^ [00:02] ahh.. [00:03] Is that a useful thing to report? [00:04] yes, or at least to give a heads-up on [00:05] did you notice it in the status tracker or you saw it browsing the keygenmusic site? [00:05] On the tracker [00:06] do you know the exact correct syntax for the !ig command? basically we can skip URLs that contain page=vote [00:07] No, I'm completely new to this :) [00:10] I'm fairly new too, hence I find the report useful. from the examples it appears it might be !ig IDENTID obnoxious\?page=vote\d+ [00:10] https://archivebot.readthedocs.io/en/latest/commands.html#ignore - right? [00:10] Ryz, help^ [00:16] 'obnoxious' appears to be part of the example and there's no digits following the word vote, I believe you'd want something like: \?page=vote&fileid [00:20] yes, that's where I'm looking at examples [00:20] I'm not a regex expert [00:21] I know my way round them but still check on https://regex101.com/ to be sure [00:22] I don't think we need the fileid in the pattern match [00:22] assuming we don't care about any further vote URLs [00:23] do you know what d+ means? 
[00:23] Don't want to exclude http://keygenmusic.net/?page=vote [00:24] \d means any digit, + means match the \d 1 or more times [00:29] Here's an example, you can hover over the regex parts to see them explained: https://regex101.com/r/tmqy8t/1/ [00:41] *** VoltZero has joined #archiveteam-bs [00:41] *** kiska has quit IRC (Remote host closed the connection) [00:41] *** Flashfire has quit IRC (Remote host closed the connection) [00:42] I'm not much of a regex expert too, most of the ignores I do are based on what other people did as inspiration and/or guidance [00:42] @hook54321 i was aware of that post on reddit, but i'm not sure why the site can't be scraped for just the data? [00:42] not the actual ebooks? [00:42] *** kiska has joined #archiveteam-bs [00:42] *** Flashfire has joined #archiveteam-bs [00:45] *** superkuh_ is now known as superkuh [00:58] Unless someone wants to pay 50 bitcoins or $376,180.50 assuming the owner is cooperative :/ [01:01] The 50btc thing seems almost certainly a scammer [01:07] *** godane has quit IRC (Quit: Leaving.) [01:07] *** godane has joined #archiveteam-bs [01:13] *** Joseph_ has joined #archiveteam-bs [01:13] are these backed up? https://www.npr.org/2020/01/05/785672201/deceased-gop-strategists-daughter-makes-files-public-that-republicans-wanted-sea [01:13] files appear to be on g.drive [01:14] https://www.thehofellerfiles.com/ [01:19] *** VerifiedJ has quit IRC (Read error: Operation timed out) [01:22] that is for the actual books [01:23] @ryz, is there a way to just grab a screen grab of every page of the site? [01:23] whether that is html or mhtml format? [01:24] i'd do it myself but i'm not sure how to get it done. i've tried using webscraper chrome extension and i'm able to get some of the data but the pagination breaks the extension and it only grabs random pages [01:25] What website are you referring to, VoltZero ? 
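[Editor's note: the ignore pattern being discussed can be sanity-checked locally before issuing the `!ig` command. A minimal Python sketch follows; Python's `re` syntax covers everything used here, and the third URL is a made-up non-vote page for contrast (regex101.com, linked above, works just as well):]

```python
import re

# The pattern proposed above: skip any URL containing "?page=vote&fileid".
# The backslash makes "?" a literal question mark (unescaped, "?" is the
# regex "optional" operator).  The "\d+" from the earlier example would
# mean "one or more digits", e.g. r"fileid=\d+" matches "fileid=2888".
IGNORE = re.compile(r"\?page=vote&fileid")

urls = [
    "http://keygenmusic.net/?page=vote&fileid=2888&lang=gr",
    "http://keygenmusic.net/?page=vote&fileid=2888&lang=nl",
    "http://keygenmusic.net/?page=music",  # hypothetical non-vote URL
]

skipped = [u for u in urls if IGNORE.search(u)]
# Only the two vote URLs match; the third URL is kept.
```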
[01:25] ebookfarm the 50btc one [01:25] i just want a browseable backup of the site / sans actual books (which are locked down via another downloader) [01:27] *** nepeat has joined #archiveteam-bs [01:27] hell the data itself would be good too. basically the site is broken down by subjects > list of books in that subject (12 on a page) > actual book page to purchase book (this last part i don't care to store). Just the first two sections [01:29] https://mega.nz/#F!dYMC3SaC!YY5qDerO741uBQdki4cIzw what the pages look like [01:29] if you guys can at least guide me how to do it i'd appreciate it. [01:29] As I mentioned before, ArchiveBot is not capable of logging in through the account barrier of http://ebook.farm/ s: [01:29] so any other methods that i can use? [01:30] VoltZero, you may be able to pass cookies to a local web archiving tool. In the past I've used HTTrack for some stuff. [01:31] okay i'm downloading httrack now. [01:31] passing cookies can work like logging in [01:31] as you take the cookies from your already logged in browser [01:32] okay that seems doable. i'm not sure how to do that but i'll ask you once i get it loaded [01:33] anyway to set it that it only goes like one link deep? [01:35] yes, there is a spider depth control [01:36] the level of success varies by site. Javascript-heavy sites are trouble in many ways, regardless of your tool choice. [01:37] ah okay [01:39] alright i'm going through the settings now [01:39] for action is it just download web site? [01:40] use default if unsure [01:40] there is a page that has something like 'get related files'. Check that too. [01:42] there is a section that lets you put in authentication [01:42] and you can capture the url? [01:47] hey could someone add https://dtcimedia.disney.com to archivebot? downloads are absent from WBM [01:49] also, is there someone i can ask about getting permissions to use archivebot myself? 
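[Editor's note: the cookie-passing idea above can be sketched with Python's standard library. The cookie name and value below are hypothetical placeholders; the real ones would be copied from the logged-in browser's developer tools, and the request is only constructed here, not sent:]

```python
import urllib.request

# Hypothetical session cookie copied from an already-logged-in browser
# (e.g. the developer tools' Storage/Application tab).
cookies = {"PHPSESSID": "example-session-id"}

cookie_header = "; ".join(f"{k}={v}" for k, v in cookies.items())

req = urllib.request.Request(
    "http://ebook.farm/",  # the site from the discussion above
    headers={"Cookie": cookie_header},
)
# urllib.request.urlopen(req) would now fetch the page as the
# logged-in user, because the server sees the session cookie.
```

HTTrack and wget expose the same idea through their own cookie options; the principle (reuse the browser's session cookie) is identical.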
[01:55] *** godane has quit IRC (Ping timeout: 610 seconds) [02:06] damn it's not working. thanks for your help atphoenix but seems too much hassle [02:07] is membership there free? [02:10] ya but they closed signups [02:12] the closest i got has been using webscraper since it uses chrome to actually open the pages and extract the data [02:12] also i'd share my account info if you could back it up [02:13] if i could figure out the pagination script they use it'd work [02:16] you mean webscraper chrome extension? [02:16] *** godane has joined #archiveteam-bs [02:18] ya webscraper.io [02:18] i followed the video and it does it, but skips the individual pages [02:19] so for example, it only goes through pages 1, 3, 7 and then goes to next section [02:19] *** DogsRNice has quit IRC (Read error: Connection reset by peer) [02:19] https://youtu.be/x8bZmUrJBl0 [02:19] i think they mention that but then don't elaborate on what to do for it [02:31] is the site an endless-scroll site? [02:36] nope, has pages, its the second option in the video [02:36] i can share a screenshot of it [02:37] https://imgur.com/a/AoPuQIy [02:37] so for example: https://i.imgur.com/jjzhUFj.png -- it has those ... between certain pages. i think that causes it to not work [02:38] the mega link above is a copy of the .html [02:38] of a couple pages. someone on reddit wanted to know, but they didn't know how to program very well, but said it was probably possible to script it [02:51] *** qnicw has joined #archiveteam-bs [02:51] #/join #archiveteam-ot [02:51] *** qnicw has quit IRC (Quit: Leaving) [02:52] Raccoon`, have you seen the ASCII art error made on https://archivebot.readthedocs.io/%20%7C ?
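[Editor's note: one workaround for a pager that hides pages behind "..." gaps, as described above, is to generate every listing-page URL directly instead of following the visible pagination links. A sketch under the assumption (hypothetical) that the site numbers its pages with a simple query parameter:]

```python
def page_urls(template, last_page):
    """Build every listing-page URL up front, so pages hidden
    behind the pager's "..." gaps are not skipped."""
    return [template.format(n=n) for n in range(1, last_page + 1)]

# Hypothetical URL pattern; check the real site's pager links
# to find the actual parameter name and the last page number.
urls = page_urls("http://example.com/subject?page={n}", 5)
# urls covers page=1 through page=5 with no gaps
```

A scraper can then visit this list in order, rather than relying on whichever page links the pager happens to render.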
[03:09] *** fdstw has joined #archiveteam-bs [03:09] *** fdstw has quit IRC (Quit: Leaving) [03:10] *** cerca has quit IRC (Remote host closed the connection) [03:11] *** X-Scale` has joined #archiveteam-bs [03:11] *** fdstw has joined #archiveteam-bs [03:14] *** godane has quit IRC (Read error: Connection reset by peer) [03:17] *** X-Scale has quit IRC (Ping timeout: 610 seconds) [03:17] *** X-Scale` is now known as X-Scale [03:19] anyone know how to use wget? [03:19] *** qnisz has joined #archiveteam-bs [03:25] *** fdstw has quit IRC (Read error: Operation timed out) [03:29] *** X-Scale` has joined #archiveteam-bs [03:34] *** X-Scale has quit IRC (Read error: Operation timed out) [03:34] *** X-Scale` is now known as X-Scale [03:36] ye -- i recommend the manual for learning it https://www.gnu.org/software/wget/manual/wget.html [03:58] VoltZero, atphoenix: HTTrack doesn't have WARC support. I would use grabsite. [03:59] Thanks hook54321. I'll look into that. I'm not sure what WARC support means though. [04:00] WARC support is something that IA uses for playback, but in your case I'm not sure that is critical for a site behind a login [04:02] or at least that is my interpretation of WARC's value. easier to integrate into IA WBM. [04:02] It's a standard for web archiving. [04:02] essentially [04:03] i'll look into it more tomorrow. Thanks guys! this is all new to me so a lot of it is over my head [04:03] going to bed. night everyone [04:04] *** VoltZero has quit IRC (Quit: Going offline, see ya! 
(www.adiirc.com)) [04:07] *** mtntmnky has quit IRC (Remote host closed the connection) [04:07] *** mtntmnky has joined #archiveteam-bs [04:13] *** qw3rty__ has joined #archiveteam-bs [04:17] *** qw3rty_ has quit IRC (Ping timeout: 276 seconds) [04:17] *** m007a83 has joined #archiveteam-bs [04:17] *** m007a83 has quit IRC (Read error: Connection reset by peer) [04:18] *** m007a83 has joined #archiveteam-bs [04:19] *** qnisz has quit IRC (Quit: Leaving) [04:45] *** mtntmnky has quit IRC (Remote host closed the connection) [04:50] *** mtntmnky has joined #archiveteam-bs [04:52] *** odemgi_ has joined #archiveteam-bs [04:55] *** odemgi has quit IRC (Ping timeout: 276 seconds) [04:59] atphoenix: .warc.gz should generally be used for archiving even if not going into WBM [05:00] *** godane has joined #archiveteam-bs [05:05] marked1: even for personal use? I consider it a desirable characteristic that is preferred and nice-to-have, but not a hard requirement...at least where personal-use is the goal or where site conditions make it hard to use tools that generate WARCs. After all, PGO Explorer didn't generate WARCs, and some of the Yahoo Groups it was used to save may only continue to exist in PGO-generated output. [05:15] I'm finding it hard to explain from that example, because a few concepts are tangled. [05:19] One metaphor: most people don't do proper back-ups. That's a personal choice and only affects themselves. So that's like data for personal use. Certain checkboxes have to be hit before something can be considered archived well [05:20] One of those ideal checkboxes is using an archival format [06:24] Various semi-documented IA and WBM features are listed on https://www.archiveteam.org/index.php?title=Internet_Archive#Downloading_from_archive.org [06:24] feel free to add more there [06:25] note that page is not particularly secret, so if it's a feature you'd prefer remained less-well-known -- don't put it there [06:26] e.g.
in the past, there was a bug that allowed access to various stuff that IA discouraged accessing -- please don't mention stuff like that [06:49] marked1, fair enough. To extend what you said, along my lines of thought improper/incomplete backups are still better than no backups. Partial loss is better than complete loss. Both pale in comparison to a full backup. [06:59] Sure, but then if someone asks how to do a back-up- You would tell them to keep a copy off-site, and if they're keeping their backup on the same disk that they're headed for trouble. [07:02] *** fredgido_ has joined #archiveteam-bs [07:10] *** fredgido has quit IRC (Read error: Operation timed out) [07:38] *** oxguy3 has quit IRC (My MacBook has gone to sleep. ZZZzzz…) [07:38] *** oxguy3 has joined #archiveteam-bs [07:38] *** oxguy3 has quit IRC (Client Quit) [07:48] *** HP_Archiv has quit IRC (Read error: Operation
timed out) [08:10] *** bsmith093 has quit IRC (Ping timeout: 615 seconds) [08:11] *** HP_Archiv has joined #archiveteam-bs [08:21] *** bsmith093 has joined #archiveteam-bs [08:44] *** Kliment has joined #archiveteam-bs [08:44] astrid: further discussion initiated :) [08:45] oh sorry i don't have much else to say :P [08:46] Oh heh. I did want to ask about another 3d printing related site [08:46] yeah? [08:46] have you heard of lulzbot/aleph objects? [08:46] lulzbot yes [08:47] It's the same company. They have a large cache of super detailed documentation on everything they built [08:47] I'm worried it will fall off the internet [08:47] hmmm [08:47] have a link? [08:48] Kliment: most basic way to help is hang out in the IRC and learn the tools we use. Or if that doesn't work for you, look up the Archiveteam WarriorVM [08:48] and run a WarriorVM on one of your computers [08:48] astrid: I do, stand by [08:48] that VM helps with some of AT projects [08:49] Okay, so their main download site is still up, some friends have archived the forums (but can't host them anywhere) [08:50] https://download.lulzbot.com/ for the download site which has most of the goods [08:50] that looks like a textbook good candidate for archivebot [08:50] do you have any idea how big this is ? [08:50] gigabytes, terabytes, petabytes? [08:51] checking with my friends, hold on [08:51] spot checks + intuition suggests it's not much more than 100G [08:52] other subdomains http://devel.lulzbot.com/ https://forum.lulzbot.com/t/ftp-rsync/1362 [08:52] this is good stuff, i'm archivebotting it [08:52] a couple of my friends made partial backups, one of them has the forums [08:52] which are no longer online [08:53] http://download.lulzbot.com/README says This is the LulzBot archive. 
[08:53] Also available via rsync at: [08:53] rsync://rsync.alephobjects.com/lulzbot/ [08:53] See also: [08:53] http://www.alephobjects.com [08:53] http://www.lulzbot.com [08:53] kliment, please stay in touch with us, we may be interested in the partial backups too [08:53] the remnants of the company have been acquired [08:54] but no guarantee how long the servers will stay up [08:54] i put download.lulzbot.com in archivebot, see progress at http://dashboard.at.ninjawedding.org/3?showNicks=1 ; further updates issued by bot in #archivebot [08:54] and it's past my bedtime. goodnight all :) [08:55] I'll hang out here if you have any questions, and try to drag my friends in as well [08:55] thank you for doing this! [08:57] Kliment, check out https://www.archiveteam.org/index.php?title=ArchiveTeam_Warrior for how to run it [08:57] will do later today, thanks [09:30] hmmm ... maybe this download site would best be mirrored as an ftpsite [09:30] if the archivebot run is still going in a few days maybe i'll do that instead [09:31] astrid, we also got some old podcast mp3s brought to our attention that match up to gaps in the IA [09:32] mp3s are sitting here: http://rusrs.com/old.rusrs.com/bkrs/ They are related to the blog here, however many are missing from IA http://web.archive.org/web/20090302110050/http://www.revolution31.com/blog/ [09:34] DigDug told us about it in #archivebot [09:35] *** DigiDigi has quit IRC (Remote host closed the connection) [09:52] astrid: there's a working rsync [09:52] astrid: at least I hope it's still working [09:53] astrid: and apparently the new owners managed to import the forum data into an external forum service [09:53] astrid: This is much less crawlable than the old forums but at least it's up [09:54] astrid: There's a user on freenode that has a complete snapshot of the old forums if needed [10:31] *** josey9 has joined #archiveteam-bs [10:36] *** josey has quit IRC (Ping timeout: 745 seconds) [10:42] *** BlueMax has quit IRC (Read 
error: Connection reset by peer) [10:48] *** HP_Archiv has quit IRC (Ping timeout: 276 seconds) [11:16] SketchCow: i'm grabbing more RTE Morning Ireland from RTE Radio 1 [11:17] turns out the old method i used doesn't work anymore [11:18] i have to get a lot more creative to get the files now [11:18] i may be able to resume from fall 2012 [11:30] *** josey has joined #archiveteam-bs [11:37] *** josey9 has quit IRC (Ping timeout: 745 seconds) [12:05] (Continuing the discussion on the SPON forums from #archivebot) [12:07] I mentioned that there could be a separate job in regards to https://www.spiegel.de/forum/ on only the members section, since it was ignored for the main job to make it go faster and to grab the main content [12:09] the members profiles are pretty sparse in terms of content. To my eye, only added piece is profile create date and an index of their posts. [12:10] and the post index could be rebuilt by scanning/sorting through all the threads [12:10] leaving the create date as the only potentially lost piece of info [12:10] ...Oh, there's apparently a 13th section - being https://www.spiegel.de/forum/blog/ - which isn't linked from https://www.spiegel.de/forum/ at all [12:10] IA WBM has seen these forums and some user profiles in the past [12:11] https://www.spiegel.de/forum/blog/ has 351 paginations - which means 351 pages x 20 topics per page = an estimated 7020 topics [12:13] WBM has been there too http://web.archive.org/web/2019*/https://www.spiegel.de/forum/blog/ [12:13] http://web.archive.org/web/20180202062755/http://www.spiegel.de/forum/blog/ein-steuerberater-erzaehlt-finanzbeamte-zu-ueberzeugen-macht-spass-thread-705664-4.html [12:14] how did you find forum/blog? [12:16] Out of chance from watching the dashboard [12:17] I guess the spider found it.
Good for it :) [12:19] discussion about the forum shutdown: https://www.spiegel.de/forum/blog/in-eigener-sache-der-neue-kommentarbereich-fuer-unsere-nutzerinnen-und-nutzer-bed-thread-1002822-1.html [12:19] "It looks like there is no longer a central SPON forum" ... [12:19] "Every (10) years "book burning" again..... because of "forum conversion". [12:20] pretty sure they just don't consider the forum contents valuable [12:21] but this latest move is likely just to drive people to each article's page rather than a forum, in order to have more article-specific ad impressions [12:23] the point of forums like these is to get a 'pulse' of the reader audience (or maybe pulse of trolls?) in regard to various news topics... [12:24] it's not really for long ongoing discussions, which is what some of the Yahoo groups were about [12:24] that's my interpretation of them [12:24] which isn't authoritative in any way :) [12:25] More pushing for archiving old/abandoned/dead forums that is :c [12:26] and this one wasn't even dead [12:26] and 4 day notice (at best) [12:27] what's the goal...push all discussion to facebook via the facebook module I see on some sites? [12:27] Or Reddit~ [12:27] then sites get 'dirty user data' off their servers, and facebook gets broader tracking... [12:32] lol, their servers suck. [12:32] Getting disconnects within seconds of starting qwarc. [12:34] AB is on 12 connections. 5/sec [12:34] Yeah, AB is slow. [12:35] 100 connections, 70 req/s here. [12:35] AB is also seeing some 500 statuses [12:36] Well, clearly they can't handle my load. [12:37] do you know anything that can? Do you know what your limits actually are? [12:38] I've archived quite a few things with qwarc before, and often enough they did hold up pretty well. [12:57] 50 connections now, still getting errors. [12:57] The 5xx rate is much lower now though. [13:04] Ok, this seems stable enough and should still finish in time, hopefully. [13:19] Can I run Warrior with limited disk space?
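[Editor's note: dialing connections back when 5xx errors appear, as done above, is the manual version of retrying with exponential backoff. A generic sketch, not qwarc's actual behaviour; the `fetch` callable is a stand-in for whatever HTTP client the archiving tool uses, and the stub below fails twice before succeeding:]

```python
import time

def fetch_with_backoff(fetch, url, attempts=5, base_delay=1.0):
    """Retry a flaky fetch, doubling the delay after each 5xx.

    `fetch` is any callable returning (status_code, body); it stands
    in for a real HTTP client."""
    delay = base_delay
    for attempt in range(attempts):
        status, body = fetch(url)
        if status < 500:
            return status, body
        time.sleep(delay)  # back off before hitting the server again
        delay *= 2
    return status, body  # give up and surface the last 5xx

# Stub server that returns 503, then 500, then 200:
responses = iter([(503, b""), (500, b""), (200, b"ok")])
status, body = fetch_with_backoff(
    lambda u: next(responses),
    "https://www.spiegel.de/forum/",
    base_delay=0.01,
)
# status is 200 after two retries
```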
[13:21] *** Lord_Nigh has quit IRC (Ping timeout: 240 seconds) [13:24] I started it and told it to work on urlteam2 but it is using zero bw apparently [13:25] seems disk space is limited to 60GB [13:25] do I need to do anything to make it actually do things? [13:25] yes, warrior is fine with that [13:26] urlteam is a low bandwidth project [13:26] it says it's working on a project but appears to be idling [13:26] 0.00 is very low bw [13:26] it's okay urlteam also idles as rate limiting [13:26] prevents IP banning [13:27] What's a more active project so I can verify it's working correctly? [13:34] *** Lord_Nigh has joined #archiveteam-bs [13:46] that's the one for now [13:47] just leave it run in the background [13:47] alright [13:47] you can start it using the menu as detached screen/headless [13:48] in a day or so you should be able to find your warrior name in the tracker leaderboard list (if you show maybe 10000 results) [13:48] and search for your name [15:01] *** Mateon1 has quit IRC (Remote host closed the connection) [15:01] *** Mateon1 has joined #archiveteam-bs [16:31] *** systwi has quit IRC (Ping timeout: 622 seconds) [16:40] *** systwi has joined #archiveteam-bs [16:54] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat) [16:54] *** Craigle has joined #archiveteam-bs [17:02] *** DigiDigi has joined #archiveteam-bs [17:11] I have my first request for AB: http://consolidatedpower.co/~donald/zero/Main_Page fan wiki for the game Kentucky Route Zero (final act about to be released!) [17:11] WBM doesn't have media files e.g. 
recordings from a real phone number: https://consolidatedpower.co/~donald/zero/1-858-WHEN-KRZ [17:11] It should include outlinks to http://kentuckyroutezero.com/ to get things like these: https://consolidatedpower.co/~donald/zero/Wrongle [17:25] *** DigiDigi has quit IRC (Remote host closed the connection) [17:31] *** DigiDigi has joined #archiveteam-bs [18:06] *** jamiew has joined #archiveteam-bs [18:12] *** jamiew has quit IRC (zzz) [18:13] *** jamiew has joined #archiveteam-bs [19:09] *** jamiew has quit IRC (Textual IRC Client: www.textualapp.com) [20:52] *** odemgi has joined #archiveteam-bs [20:52] *** odemgi_ has quit IRC (Remote host closed the connection) [21:01] *** schbirid has joined #archiveteam-bs [21:10] is there a working wpull fork for python3.7? [21:15] ludios_wpull [21:18] thx [21:33] *** apache2_ has quit IRC (Remote host closed the connection) [21:33] *** jodizzle has quit IRC (Read error: Operation timed out) [21:33] *** jodizzle has joined #archiveteam-bs [21:35] *** Jens has quit IRC (Remote host closed the connection) [21:35] *** apache2 has joined #archiveteam-bs [21:36] *** Jens has joined #archiveteam-bs [21:45] *** apache2 has quit IRC (Remote host closed the connection) [21:47] *** apache2 has joined #archiveteam-bs [21:47] *** Jens has quit IRC (Remote host closed the connection) [21:48] *** Jens has joined #archiveteam-bs [21:55] *** ivan has quit IRC (Quit: Leaving) [21:56] *** jodizzle has quit IRC (Quit: ZNC 1.7.1 - https://znc.in) [21:56] *** jodizzle has joined #archiveteam-bs [22:11] *** ivan has joined #archiveteam-bs [22:12] *** svchfoo1 sets mode: +o ivan [22:12] *** svchfoo3 sets mode: +o ivan [22:25] *** schbirid has quit IRC (Quit: Leaving) [22:34] *** odemgi has quit IRC (Remote host closed the connection) [22:35] *** Joseph__ has joined #archiveteam-bs [22:35] *** Joseph_ has quit IRC (Read error: Connection reset by peer) [22:39] *** Mateon1 has quit IRC (Ping timeout: 258 seconds) [22:48] *** BlueMax has joined
#archiveteam-bs [22:58] *** Craigle has quit IRC (Quit: The Lounge - https://thelounge.chat) [23:05] *** odemgi has joined #archiveteam-bs [23:20] Welp, my SPON forums qwarc grab is still running, but it crashed several times it seems and has done very little progress in 10 hours. [23:20] ~7 % of thread IDs so far... [23:21] s/crashed/was OOM-killed/ [23:23] *** cppchrisc has quit IRC (Ping timeout: 496 seconds) [23:30] *** cppchrisc has joined #archiveteam-bs [23:30] *** cppchrisc has quit IRC (Connection closed) [23:31] *** cppchrisc has joined #archiveteam-bs [23:31] *** cppchrisc has quit IRC (Connection closed) [23:31] *** cppchrisc has joined #archiveteam-bs [23:51] My SPON grab is running into infinite loops. :-/ [23:51] I guess that's why it didn't get further.