Time |
Nickname |
Message |
00:36
🔗
|
Despatche |
wait archive.org has winamp skins now? holy shittttt |
01:23
🔗
|
Flashfire |
https://motherboard.vice.com/en_us/article/d3q45v/bittorrent-usage-increases-netflix-streaming-sites |
01:24
🔗
|
|
bithippo has joined #archiveteam-ot |
01:56
🔗
|
|
bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) |
02:46
🔗
|
|
Despatche has quit IRC (Ping timeout: 633 seconds) |
02:55
🔗
|
|
Despatche has joined #archiveteam-ot |
03:02
🔗
|
|
Despatche has quit IRC (Read error: Operation timed out) |
03:49
🔗
|
|
dashcloud has quit IRC (Remote host closed the connection) |
03:53
🔗
|
|
icedice has joined #archiveteam-ot |
04:57
🔗
|
|
Despatche has joined #archiveteam-ot |
04:58
🔗
|
|
Sanqui has quit IRC (Read error: Operation timed out) |
05:00
🔗
|
|
Sanqui has joined #archiveteam-ot |
05:01
🔗
|
|
Despatche has quit IRC (Remote host closed the connection) |
05:02
🔗
|
|
Despatche has joined #archiveteam-ot |
05:33
🔗
|
w0rmhole |
sorry ivan i lost it in the tsunami of messages here, what did i need to use to stop `grab-site' from falling into link loops like this?: http://atarimusic.exxoshost.co.uk/forum/topic?f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24&t=202&start=0?topic&f=24 ................ |
05:33
🔗
|
w0rmhole |
original command i ran: `grab-site --ua="Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.13; ) Gecko/20101203" "http://atarimusic.exxoshost.co.uk/"' |
05:34
🔗
|
|
Despatche has quit IRC (Remote host closed the connection) |
05:35
🔗
|
|
Despatche has joined #archiveteam-ot |
05:43
🔗
|
ivan |
w0rmhole: try an ignore that includes \?.+\? because well-formed URLs have at most one '?' |
05:43
🔗
|
ivan |
or even \?.*\? |
05:44
🔗
|
w0rmhole |
sorry im not sure how to do that |
05:44
🔗
|
w0rmhole |
do i literally type: |
05:44
🔗
|
w0rmhole |
\?.*\? |
05:44
🔗
|
ivan |
yep |
05:44
🔗
|
w0rmhole |
in ignores |
05:44
🔗
|
ivan |
into the ignores file |
05:44
🔗
|
ivan |
on a new line, hit save |
05:44
🔗
|
w0rmhole |
ok |
05:44
🔗
|
w0rmhole |
would you recommend one over the other? |
05:45
🔗
|
ivan |
\?.*\? |
05:45
🔗
|
w0rmhole |
okay, thank you for the help |
05:45
🔗
|
ivan |
.* matches zero or more characters |
05:45
🔗
|
ivan |
.+ matches one or more characters |
05:45
🔗
|
w0rmhole |
lets hope my comp. doesnt shit itself again |
05:46
🔗
|
w0rmhole |
i walked away for a day and came back to it choking on urls like that |
05:46
🔗
|
ivan |
\? is a literal '?' because an unescaped '?' means "make the last thing optional to match" |
05:46
🔗
|
w0rmhole |
and kept expanding |
05:46
🔗
|
w0rmhole |
okay i think i get it |
05:47
🔗
|
w0rmhole |
i thought your original suggestion included brackets |
05:47
🔗
|
w0rmhole |
[ |
05:47
🔗
|
w0rmhole |
] |
05:47
🔗
|
ivan |
<ivan> |
05:47
🔗
|
ivan |
w0rmhole: \?topic.+\?topic ignore? |
05:47
🔗
|
ivan |
<ivan> |
05:47
🔗
|
ivan |
and/or [\?&]p=.+[\?&]p= ignore |
05:48
🔗
|
w0rmhole |
yeah yeah that |
05:48
🔗
|
ivan |
maybe you were seeing a different loop then |
05:48
🔗
|
w0rmhole |
possibly |
05:48
🔗
|
w0rmhole |
let me check |
05:48
🔗
|
w0rmhole |
my terminal windows |
05:48
🔗
|
w0rmhole |
oh yep |
05:48
🔗
|
w0rmhole |
http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108?topic&p=108? |
05:48
🔗
|
w0rmhole |
....... |
05:49
🔗
|
ivan |
multiple ? in a URL will almost always be bad but I don't know if I want to make it a default ignore |
05:50
🔗
|
w0rmhole |
so would i apply both: |
05:50
🔗
|
w0rmhole |
\?.*\? |
05:50
🔗
|
w0rmhole |
and |
05:50
🔗
|
w0rmhole |
\?topic.+\?topic ignore? |
05:50
🔗
|
w0rmhole |
in the ignores file? sorry i am still learning |
05:50
🔗
|
ivan |
\?.*\? will ignore anything that \?topic.+\?topic would ignore |
05:51
🔗
|
w0rmhole |
ok so just using \?.*\? i'll be set for both? |
05:51
🔗
|
ivan |
sure |
05:51
🔗
|
w0rmhole |
thanks |
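Editor's note: the ignore patterns discussed above can be checked against the looping URLs quoted in the log. This is a minimal sketch, assuming ignore patterns are searched anywhere in the URL rather than anchored at the start. The helper `is_ignored` and the `adjacent` URL are hypothetical and only illustrate the difference between `.*` and `.+`.

```python
import re

# Broad pattern: two literal '?' with zero or more characters between them.
DOUBLE_QUERY = re.compile(r"\?.*\?")
# Stricter variant: at least one character between the two '?'.
DOUBLE_QUERY_PLUS = re.compile(r"\?.+\?")
# Narrower pattern aimed only at the repeated "?topic" loop.
TOPIC_LOOP = re.compile(r"\?topic.+\?topic")

# URLs taken from the log (the looping one is shortened).
looping = "http://atarimusic.exxoshost.co.uk/forum/topic?p=108?topic&p=108?topic&p=108"
normal = "https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241"
# Hypothetical URL with two adjacent '?', to show where .* and .+ differ.
adjacent = "http://example.com/page??x=1"


def is_ignored(pattern, url):
    """Return True if the pattern matches anywhere in the URL (assumed search semantics)."""
    return pattern.search(url) is not None


for url in (looping, normal, adjacent):
    print(url)
    print("  \\?.*\\?          ->", is_ignored(DOUBLE_QUERY, url))
    print("  \\?.+\\?          ->", is_ignored(DOUBLE_QUERY_PLUS, url))
    print("  \\?topic.+\\?topic ->", is_ignored(TOPIC_LOOP, url))
```

Any URL caught by the `?topic` pattern contains two `?` characters, so the broad pattern catches it too. That is why one line in the ignores file is enough for both loops.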
05:52
🔗
|
w0rmhole |
hopefully that will work |
05:53
🔗
|
w0rmhole |
so with the whole phpsessid shit (since it's in the url) i dont need to worry about that? |
05:54
🔗
|
ivan |
I don't know what you mean |
05:55
🔗
|
w0rmhole |
sorry, bad english, if i visit https://www.exxoshost.co.uk/forum/ and click on one of the links i get https://www.exxoshost.co.uk/forum/viewtopic.php?f=64&t=1241&sid=baa4aee6927b67f3d18a63e8f3e7edf8 note the sid=xxxxxxxxxxx...... in the url |
05:56
🔗
|
w0rmhole |
grab-site will prevent this automatically? |
05:56
🔗
|
ivan |
reload the page and you'll see it no longer has the ?sid= once the cookie is set |
05:56
🔗
|
ivan |
you can start the crawl with a cookie or pick some irrelevant page as the first url |
05:57
🔗
|
ivan |
grab-site takes multiple URLs if needed |
05:57
🔗
|
w0rmhole |
so something like `grab-site https://www.exxoshost.co.uk/forum?archiveteam --igset forums' ? |
05:57
🔗
|
ivan |
sure |
05:58
🔗
|
w0rmhole |
or do i need to do it like this: `grab-site https://www.exxoshost.co.uk/forum/?archiveteam --igset forums' or is no different? |
05:58
🔗
|
w0rmhole |
sorry for asking |
05:59
🔗
|
ivan |
you might need to quote the URL argument because shells expand ? to a matching character in a filename |
06:00
🔗
|
w0rmhole |
ok good idea |
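Editor's note: the quoting advice above can be illustrated without running a crawl. This sketch uses Python's `fnmatch` to show how an unquoted `?` behaves as a single-character wildcard, and `shlex` to build a safely quoted command line. The file name `forum/Xarchiveteam` is hypothetical, and the `--igsets=forums` spelling is an assumption based on the option form used later in the log.

```python
import fnmatch
import shlex

url = "https://www.exxoshost.co.uk/forum/?archiveteam"

# In a shell glob, '?' matches exactly one character, so an unquoted
# argument containing '?' can be replaced by a matching file name.
glob_part = "forum/?archiveteam"
hypothetical_file = "forum/Xarchiveteam"  # made-up name, for illustration only
wildcard_hit = fnmatch.fnmatchcase(hypothetical_file, glob_part)
print("glob would match a file:", wildcard_hit)

# shlex.quote wraps the URL in single quotes because '?' is not a shell-safe character.
quoted_url = shlex.quote(url)
print(quoted_url)

# shlex.join quotes each argument only where needed.
command = shlex.join(["grab-site", url, "--igsets=forums"])
print(command)
```

Whether the shell actually rewrites the argument depends on the shell and on which files exist in the current directory, so quoting the URL is the safe default.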
06:13
🔗
|
|
Despatche has quit IRC (Quit: Error: Connection reset by peer) |
06:15
🔗
|
w0rmhole |
ivan: is `--wpull-args=--tries=1000' a bit extreme? |
06:16
🔗
|
ivan |
yes |
06:16
🔗
|
w0rmhole |
15 more reasonable? |
06:16
🔗
|
ivan |
even 5 should be plenty |
06:17
🔗
|
w0rmhole |
just the website im trying keeps erroring out but after 700 or so attempts (time out) it finally gets what it was trying to get |
06:17
🔗
|
ivan |
was the site down for a while? |
06:18
🔗
|
w0rmhole |
dont believe so |
06:18
🔗
|
w0rmhole |
wasn't for that site i said earlier, but i had it happen to another site and, well, that one too, but not THAT many retries |
06:18
🔗
|
ivan |
if something takes 700 tries to succeed that's not something grab-site is designed to archive |
06:19
🔗
|
ivan |
you can do whatever you feel you need to do but trying 1000 times might get you banned from some sites |
06:19
🔗
|
w0rmhole |
a .zip file |
06:19
🔗
|
w0rmhole |
good point |
06:19
🔗
|
w0rmhole |
3.2mb zip file |
06:43
🔗
|
w0rmhole |
ivan: this look like an acceptable cookies.txt for this to you? https://share.dmca.gripe/dQYlzLvcwDF8eWUP.txt |
06:43
🔗
|
w0rmhole |
generated it by visiting https://www.exxoshost.co.uk/forum then refreshing; and http://atarimusic.exxoshost.co.uk/ and refreshing. |
06:45
🔗
|
ivan |
you probably need to remove the #HttpOnly_ |
06:45
🔗
|
w0rmhole |
the entire entry or just that |
06:46
🔗
|
ivan |
just that string |
06:46
🔗
|
ivan |
start with the . on the domain |
06:46
🔗
|
ivan |
you can also try changing that 0 in the first cookie to 1570084624 |
06:46
🔗
|
w0rmhole |
whats that do? |
06:46
🔗
|
w0rmhole |
idk what TRUE and FALSE mean either in this |
06:47
🔗
|
ivan |
https://unix.stackexchange.com/questions/36531/format-of-cookies-when-using-wget |
06:47
🔗
|
w0rmhole |
so i do not touch TRUE/FALSE |
06:48
🔗
|
ivan |
yeah those should be fine |
06:50
🔗
|
w0rmhole |
okay but i am still confused as to why i replace 0 with 157...... |
06:50
🔗
|
w0rmhole |
i am sorry |
06:50
🔗
|
w0rmhole |
why is 0 bad |
06:50
🔗
|
ivan |
it might expire when the crawl starts |
06:50
🔗
|
ivan |
I haven't tested it |
06:50
🔗
|
w0rmhole |
oh ok |
06:51
🔗
|
w0rmhole |
and the number you suggested? |
06:51
🔗
|
w0rmhole |
is that a yr from now? |
06:53
🔗
|
w0rmhole |
yes |
06:54
🔗
|
w0rmhole |
ivan: look good? https://share.dmca.gripe/CJ1Jm0dcCyCgmnAw.txt |
06:54
🔗
|
ivan |
yes |
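Editor's note: the cookies.txt edits discussed above (drop the `#HttpOnly_` prefix, replace an expiry of 0 with a future timestamp) can be sketched as follows. It assumes the usual Netscape cookie file layout of seven tab-separated fields. The sample line, including the cookie name and value, is invented for illustration and is not taken from the pasted files. `parse_cookie_line` is a hypothetical helper.

```python
from datetime import datetime, timezone

HTTPONLY_PREFIX = "#HttpOnly_"
# Field order in a Netscape-format cookies.txt line.
FIELDS = ("domain", "include_subdomains", "path", "secure", "expires", "name", "value")


def parse_cookie_line(line):
    """Parse one cookies.txt line into a dict; return None for comments and blank lines."""
    line = line.rstrip("\n")
    # Strip only the prefix string, so the line starts with the '.' of the domain.
    if line.startswith(HTTPONLY_PREFIX):
        line = line[len(HTTPONLY_PREFIX):]
    if not line.strip() or line.startswith("#"):
        return None
    parts = line.split("\t")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} tab-separated fields, got {len(parts)}")
    cookie = dict(zip(FIELDS, parts))
    cookie["expires"] = int(cookie["expires"])
    return cookie


# Invented sample line: TRUE = valid for subdomains, FALSE = not restricted to HTTPS.
sample = "#HttpOnly_.exxoshost.co.uk\tTRUE\t/\tFALSE\t0\texample_sid\tabc123"
cookie = parse_cookie_line(sample)

# An expiry of 0 marks a session cookie; swap in the timestamp suggested in the log.
if cookie["expires"] == 0:
    cookie["expires"] = 1570084624

expiry = datetime.fromtimestamp(cookie["expires"], tz=timezone.utc)
print(cookie["domain"], cookie["name"], expiry.isoformat())
```

The two TRUE/FALSE columns are the subdomain flag and the secure flag, which is why they can be left alone when only the expiry is being changed.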
06:55
🔗
|
w0rmhole |
thank you very much for the helkp |
06:55
🔗
|
w0rmhole |
*help |
06:55
🔗
|
w0rmhole |
one last thing, does grab-site auto copy cookies.txt to folder where warc is stored? |
06:57
🔗
|
ivan |
it tells wpull to --load-cookies from a file and --save-cookies to DIR/cookies.txt |
06:58
🔗
|
ivan |
a lot of stuff may happen inside wpull |
06:58
🔗
|
w0rmhole |
thanks |
06:59
🔗
|
ivan |
https://github.com/ludios/grab-site/blob/29b9825dc5f49c25f01d93746cfb0638c724c22a/libgrabsite/main.py#L240-L259 |
07:00
🔗
|
ivan |
and up to line 216 above |
07:00
🔗
|
w0rmhole |
ty |
07:15
🔗
|
|
djsundog has quit IRC (Read error: Operation timed out) |
07:15
🔗
|
|
mal_ has quit IRC (Read error: Operation timed out) |
07:16
🔗
|
|
ivan has quit IRC (Read error: Operation timed out) |
07:17
🔗
|
|
ivan has joined #archiveteam-ot |
07:17
🔗
|
|
svchfoo1 sets mode: +o ivan |
07:17
🔗
|
|
Albardin has quit IRC (Read error: Operation timed out) |
07:18
🔗
|
|
vectr0n_ has joined #archiveteam-ot |
07:19
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
07:21
🔗
|
|
dxrt_ has quit IRC (Read error: Operation timed out) |
07:22
🔗
|
|
Mateon1 has quit IRC (Read error: Operation timed out) |
07:22
🔗
|
|
Mateon1 has joined #archiveteam-ot |
07:23
🔗
|
|
Albardin has joined #archiveteam-ot |
07:24
🔗
|
|
kiska1 has quit IRC (Read error: Connection reset by peer) |
07:24
🔗
|
|
kiska1 has joined #archiveteam-ot |
07:25
🔗
|
|
vectr0n has quit IRC (Ping timeout: 600 seconds) |
07:25
🔗
|
|
vectr0n_ is now known as vectr0n |
07:26
🔗
|
|
godane has joined #archiveteam-ot |
07:27
🔗
|
|
svchfoo1 sets mode: +o godane |
07:27
🔗
|
|
dxrt_ has joined #archiveteam-ot |
07:27
🔗
|
|
dxrt sets mode: +o dxrt_ |
07:29
🔗
|
|
mal_ has joined #archiveteam-ot |
07:33
🔗
|
|
djsundog has joined #archiveteam-ot |
07:39
🔗
|
w0rmhole |
ivan: if i want to use both --tries and --load-cookies in grab-site do i do: `--wpull-args=--tries=5 --wpull-args=--load-cookies=/PATH\ TO/cookies.txt' |
07:39
🔗
|
w0rmhole |
? |
07:41
🔗
|
w0rmhole |
nvm the github page tells me to wrap in quotes |
07:44
🔗
|
w0rmhole |
one question, does running something like `grab-site https://www.reddit.com/r/oculus/ --igsets=reddit' grab all of that subreddit or only as much as reddits api allows? |
07:46
🔗
|
Flashfire |
Is everyone else having trouble with instagram |
07:51
🔗
|
|
m007a83_ has joined #archiveteam-ot |
07:54
🔗
|
|
m007a83 has quit IRC (Read error: Operation timed out) |
07:56
🔗
|
|
icedice has quit IRC (Ping timeout: 252 seconds) |
08:01
🔗
|
|
wp494 has quit IRC (west.us.hub irc.Prison.NET) |
08:07
🔗
|
|
wp494 has joined #archiveteam-ot |
08:48
🔗
|
w0rmhole |
does anybody here know who danooct1 is |
08:49
🔗
|
Flashfire |
the malware guy? |
08:49
🔗
|
w0rmhole |
yeah |
08:49
🔗
|
Flashfire |
I know of him why? |
08:50
🔗
|
w0rmhole |
i found a guilty-pleasure song he co-produced |
08:50
🔗
|
w0rmhole |
https://www.youtube.com/watch?v=UJRt41HNLJw |
08:50
🔗
|
Flashfire |
Tubeup it |
08:50
🔗
|
w0rmhole |
its so bad it' |
08:50
🔗
|
w0rmhole |
is godlike |
08:50
🔗
|
w0rmhole |
just did |
08:53
🔗
|
Flashfire |
I have programs designed for windows NT |
08:53
🔗
|
w0rmhole |
need link nao pls |
08:53
🔗
|
w0rmhole |
lol |
08:54
🔗
|
Flashfire |
Looks like I will need to rip it later |
08:58
🔗
|
ivan |
w0rmhole: reddit is terrible and doesn't link to everything |
08:58
🔗
|
ivan |
there's an issue on snscrape for reddit support |
08:58
🔗
|
w0rmhole |
oh that sucks -_- |
09:02
🔗
|
eientei95 |
Flashfire: https://puu.sh/BEWV1/0e44e80636.mp3 |
09:49
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
10:20
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
12:18
🔗
|
|
arkiver has quit IRC (Read error: Operation timed out) |
12:18
🔗
|
|
kiska1 has quit IRC (Read error: Operation timed out) |
12:19
🔗
|
|
mal_ has quit IRC (Write error: Broken pipe) |
12:19
🔗
|
|
djsundog has quit IRC (Read error: Operation timed out) |
12:19
🔗
|
|
dxrt_ has quit IRC (Read error: Operation timed out) |
12:19
🔗
|
|
Albardin has quit IRC (Write error: Broken pipe) |
12:20
🔗
|
|
wp494 has quit IRC (Ping timeout: 255 seconds) |
12:20
🔗
|
|
wp494 has joined #archiveteam-ot |
12:21
🔗
|
|
arkiver has joined #archiveteam-ot |
12:31
🔗
|
|
Albardin has joined #archiveteam-ot |
12:32
🔗
|
|
kiska1 has joined #archiveteam-ot |
12:36
🔗
|
|
dxrt_ has joined #archiveteam-ot |
12:36
🔗
|
|
dxrt sets mode: +o dxrt_ |
12:40
🔗
|
|
mal_ has joined #archiveteam-ot |
12:41
🔗
|
|
djsundog has joined #archiveteam-ot |
14:03
🔗
|
|
BlueMax has quit IRC (Read error: Connection reset by peer) |
14:26
🔗
|
|
odemg has joined #archiveteam-ot |
14:57
🔗
|
|
bithippo has joined #archiveteam-ot |
15:13
🔗
|
|
bithippo has quit IRC (Textual IRC Client: www.textualapp.com) |
15:36
🔗
|
|
VerifiedJ has joined #archiveteam-ot |
16:01
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
16:07
🔗
|
|
schbirid has joined #archiveteam-ot |
16:13
🔗
|
|
odemg has joined #archiveteam-ot |
16:14
🔗
|
ivan |
twitter search is down right now in case anyone is running a bunch of snscrape and wonders why no results |
16:20
🔗
|
|
astrid has joined #archiveteam-ot |
18:07
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
18:18
🔗
|
|
odemg has joined #archiveteam-ot |
20:08
🔗
|
|
Mateon1 has quit IRC (Ping timeout: 268 seconds) |
20:08
🔗
|
|
Mateon1 has joined #archiveteam-ot |
20:12
🔗
|
|
VerifiedJ has quit IRC (Quit: Leaving) |
21:01
🔗
|
|
schbirid has quit IRC (Read error: Connection reset by peer) |
21:42
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
21:54
🔗
|
|
odemg has joined #archiveteam-ot |
22:04
🔗
|
|
Jens has quit IRC (Remote host closed the connection) |
22:05
🔗
|
|
Jens has joined #archiveteam-ot |
22:39
🔗
|
|
Stiletto has joined #archiveteam-ot |
22:40
🔗
|
|
Stilett0 has quit IRC (Ping timeout: 252 seconds) |
23:06
🔗
|
|
m007a83_ is now known as m007a83 |
23:19
🔗
|
|
godane has quit IRC (Read error: Operation timed out) |
23:23
🔗
|
|
dashcloud has joined #archiveteam-ot |
23:36
🔗
|
|
BlueMax has joined #archiveteam-ot |
23:46
🔗
|
|
odemg has quit IRC (Ping timeout: 260 seconds) |
23:48
🔗
|
|
Stilett0 has joined #archiveteam-ot |
23:48
🔗
|
|
Stiletto has quit IRC (Read error: Operation timed out) |
23:57
🔗
|
|
odemg has joined #archiveteam-ot |
23:59
🔗
|
|
Stiletto has joined #archiveteam-ot |